* [rfc] io memory barriers, and getting rid of mmiowb
From: Nick Piggin @ 2007-11-20 16:02 UTC
To: linux-arch, Linus Torvalds, Benjamin Herrenschmidt
Hi,
The mmiowb() barrier isn't very well documented, and it is somewhat difficult
to understand exactly how it is supposed to work (both for users and
implementers).
I guess it came about because sn2 has MMIO which is highly out of order with
respect to regular cacheable operations, and apparently it was found too
expensive to implement the full barriers (eg. wmb()) completely, so they
quietly relaxed them and introduced mmiowb().
The argument that this semantic for wmb() is OK is that the entity reordering
the IOs is actually an agent other than the CPU issuing the barriers, so it is
exempt from having to be ordered. I don't really like this argument -- the
point of having the barriers is not because a driver writer wants some specific
bit of silicon to perform some specific ordering, but for the high level
load/store operations to be ordered.
A powerpc sync instruction on some pSeries apparently has to travel around the
interconnect ring before it can complete, so I don't see why the sn2 case is
more special (sn2 has to effectively flush a reorder buffer in its chipset,
not exactly a big surprise if one requires ordering of those IOs...).
sn2 is _somewhat_ able to get away with its breakage of wmb() -- it doesn't
order cacheable vs mmio even on a single CPU -- because wmb() is not so
commonly used for that (usually it is just used for io vs io). However that isn't
a good reason to change semantics of a lot of existing code in a subtle way,
while also introducing a new type of barrier. It's plainly already hard enough
to comprehend and implement memory ordering.
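To make that concrete, here is a sketch of the kind of pattern that is
affected (the descriptor and doorbell names are made up for illustration):

	/* prepare a DMA descriptor in cacheable RAM... */
	desc->addr = buf_dma;
	desc->flags = DESC_VALID;
	wmb();
	/* ...then kick the device with an MMIO store. With the relaxed
	 * sn2 wmb(), the cacheable stores above are not ordered against
	 * this MMIO store, so the device can see a stale descriptor. */
	writel(DOORBELL_KICK, mmio + DOORBELL);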
IMO, a better way to fix this is to recognise that many wmb()s don't need the
full semantics. So introduce a new variant of the existing, understood, memory
barrier set -- the io_*mb(). This just orders io vs io. These barriers can be
taken advantage of by more code and more architectures. wmb() semantics don't
need to be changed. No new memory ordering concepts for driver writers to
learn.
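As a sketch of the intended usage (register names made up; compare the
driver conversions below):

	/* select an indirect register, then write its data; the device
	 * requires the two MMIO stores to arrive in this order */
	writel(offset, mmio + REG_CONTROL);
	io_wmb();	/* orders io vs io, nothing more */
	writel(val, mmio + REG_DATA);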
The other thing I've tried to do is add a primitive for IO critical sections as
well. Don't know what people think about this, but I think it is better than
asking driver writers to explicitly think about the memory ordering required to
order their IO operations with those involved in the SMP lock and unlock
operations.
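In brief, it would look like the following (names made up; the
memory-barriers.txt hunk below has the full example):

	spin_lock(&dev->hw_lock);
	io_lock();	/* contain IO accesses within the section */
	writel(4, mmio + ADDR);
	io_wmb();	/* order the two MMIO stores themselves */
	writel(5, mmio + DATA);
	io_unlock();	/* IO completed before the SMP unlock */
	spin_unlock(&dev->hw_lock);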
This is an RFC. I've written some documentation, converted some architectures,
and converted a couple of drivers (many seem to be a bit confused about
mmiowb).
Index: linux-2.6/Documentation/memory-barriers.txt
===================================================================
--- linux-2.6.orig/Documentation/memory-barriers.txt
+++ linux-2.6/Documentation/memory-barriers.txt
@@ -986,30 +986,55 @@ CPU MEMORY BARRIERS
The Linux kernel has eight basic CPU memory barriers:
- TYPE MANDATORY SMP CONDITIONAL
- =============== ======================= ===========================
- GENERAL mb() smp_mb()
- WRITE wmb() smp_wmb()
- READ rmb() smp_rmb()
- DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
+ TYPE FULL CACHEABLE (SMP CONDITIONAL) IO
+ ========= ======================= =========================== ============
+ GENERAL mb() smp_mb() io_mb()
+ WRITE wmb() smp_wmb() io_wmb()
+ READ rmb() smp_rmb() io_rmb()
+ DATA DEP. read_barrier_depends() smp_read_barrier_depends()
+ LOCK io_lock()
+ UNLOCK io_unlock()
All CPU memory barriers unconditionally imply compiler barriers.
-SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
-systems because it is assumed that a CPU will appear to be self-consistent,
-and will order overlapping accesses correctly with respect to itself.
-
-[!] Note that SMP memory barriers _must_ be used to control the ordering of
-references to shared memory on SMP systems, though the use of locking instead
-is sufficient.
-
-Mandatory barriers should not be used to control SMP effects, since mandatory
-barriers unnecessarily impose overhead on UP systems. They may, however, be
-used to control MMIO effects on accesses through relaxed memory I/O windows.
-These are required even on non-SMP systems as they affect the order in which
-memory operations appear to a device by prohibiting both the compiler and the
-CPU from reordering them.
+CACHEABLE memory barriers control order of access to regular (cacheable) RAM
+by the CPU. They are reduced to compiler barriers on uniprocessor compiled
+systems because cacheable memory is operated upon by CPUs, and a CPU is
+self-consistent.
+
+[!] Note that CACHEABLE or FULL memory barriers must be used to control the
+ordering of references to shared memory on SMP systems, though the use of
+locking instead is sufficient.
+
+IO memory barriers should be used to control the order of memory operations
+from the CPU to IO memory (memory mapped devices; uncacheable, write combining,
+etc). Some of the architectures that Linux supports can reorder these
+accesses even on UP systems, so the barriers can not be eliminated on UP
+machines. They may also involve different instructions than CACHEABLE memory
+barriers.
+
+FULL barriers order all memory operations to both CACHEABLE and IO memory.
+These should be used specifically when ordering is required between loads
+and/or stores to CACHEABLE memory, and loads/stores to IO memory. FULL barriers
+should not be used to control only the ordering of operations to CACHEABLE
+memory, nor only operations to IO memory, since they unnecessarily impose
+overhead on UP systems.
+
+One of the most important special cases of FULL barriers is the lock and unlock
+barriers. These barriers must be combined with smp locking primitives (eg.
+spin_lock/unlock, mutex_lock/unlock), and act to contain IO memory operations
+within the critical section (normally, locking primitives only contain
+CACHEABLE accesses within the critical section). io_lock must be called
+after taking the SMP lock, and io_unlock before releasing it. It is strongly
+encouraged to make these calls *directly* after/before the SMP locks. Do not
+get smart. These are classified as FULL barriers, because they serve to order
+IO accesses with the CACHEABLE accesses involved in taking and releasing SMP
+locks. However they are prefixed with io_, which is more intuitive.
+
+All barriers are required even on non-SMP systems as they affect the order in
+which memory operations appear to a device by prohibiting both the compiler and
+(in the case of IO and FULL barriers) the CPU from reordering them.
There are some more advanced barrier functions:
@@ -1065,21 +1090,6 @@ There are some more advanced barrier fun
operations" subsection for information on where to use these.
-MMIO WRITE BARRIER
-------------------
-
-The Linux kernel also has a special barrier for use with memory-mapped I/O
-writes:
-
- mmiowb();
-
-This is a variation on the mandatory write barrier that causes writes to weakly
-ordered I/O regions to be partially ordered. Its effects may go beyond the
-CPU->Hardware interface and actually affect the hardware at some level.
-
-See the subsection "Locks vs I/O accesses" for more information.
-
-
===============================
IMPLICIT KERNEL MEMORY BARRIERS
===============================
@@ -1285,22 +1295,23 @@ But assuming CPU 1 gets the lock first,
LOCKS VS I/O ACCESSES
---------------------
-Under certain circumstances (especially involving NUMA), I/O accesses within
-two spinlocked sections on two different CPUs may be seen as interleaved by the
-PCI bridge, because the PCI bridge does not necessarily participate in the
-cache-coherence protocol, and is therefore incapable of issuing the required
-read memory barriers.
+SMP locking primitives are implemented with operations to CACHEABLE memory,
+and they are guaranteed only to provide LOCK and UNLOCK ordering for other
+CACHEABLE memory accesses inside the critical section. IO memory accesses,
+even if performed inside the lock section, can "leak out".
For example:
CPU 1 CPU 2
=============================== ===============================
spin_lock(Q)
- writel(0, ADDR)
+ writel(0, ADDR);
+ io_wmb();
writel(1, DATA);
spin_unlock(Q);
spin_lock(Q);
writel(4, ADDR);
+ io_wmb();
writel(5, DATA);
spin_unlock(Q);
@@ -1308,29 +1319,39 @@ may be seen by the PCI bridge as follows
STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
-which would probably cause the hardware to malfunction.
-
-
-What is necessary here is to intervene with an mmiowb() before dropping the
-spinlock, for example:
+which would probably cause the hardware to malfunction. The reason is that
+the stores to IO memory from CPU 1 is not ordered with respect to the
+spin_unlock, so it may not be visible to the PCI bridge before the spin_unlock
+is visible to CPU 2. CPU2 then acquires the lock and executes an IO store,
+which happens to pass the IO store from CPU 1.
+
+What is necessary here is to use some type of FULL memory barrier to prevent
+the reordering of CACHEABLE and IO operations. The best way to do this is with
+io_lock and io_unlock barriers, which directs the critical section to order
+IO access as well. For example:
CPU 1 CPU 2
=============================== ===============================
- spin_lock(Q)
- writel(0, ADDR)
+ spin_lock(Q);
+ io_lock();
+ writel(0, ADDR);
+ io_wmb();
writel(1, DATA);
- mmiowb();
+ io_unlock();
spin_unlock(Q);
spin_lock(Q);
+ io_lock();
writel(4, ADDR);
+ io_wmb();
writel(5, DATA);
- mmiowb();
+ io_unlock();
spin_unlock(Q);
-this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
-before either of the stores issued on CPU 2.
+this will ensure that the two IO stores issued on CPU 1 appear at the PCI
+bridge before either of the IO stores issued on CPU 2.
+XXX: get rid of this, and just prefer io_lock()/io_unlock() ?
Furthermore, following a store by a load from the same device obviates the need
for the mmiowb(), because the load forces the store to complete before the load
is performed:
@@ -1346,6 +1367,13 @@ is performed:
b = readl(DATA);
spin_unlock(Q);
+XXX: However, in this case, it may still be possible for the IO stores to pass
+the CACHEABLE operations involved in _taking_ the lock (the IO load only
+prevents the writel leaking out)... however: does this really work for all
+IO memory types? Does it prevent the spin_unlock store from being executed
+before the readl?
+
+
See Documentation/DocBook/deviceiobook.tmpl for more information.
@@ -1578,8 +1606,8 @@ use of memory barriers unnecessary, ther
might be needed:
(1) On some systems, I/O stores are not strongly ordered across all CPUs, and
- so for _all_ general drivers locks should be used and mmiowb() must be
- issued prior to unlocking the critical section.
+ so for _all_ general drivers locks should be used and io_lock/io_unlock
+ must be issued inside the critical section.
(2) If the accessor functions are used to refer to an I/O memory window with
relaxed memory access properties, then _mandatory_ memory barriers are
@@ -1628,8 +1656,9 @@ explicit barriers are used.
Normally this won't be a problem because the I/O accesses done inside such
sections will include synchronous load operations on strictly ordered I/O
-registers that form implicit I/O barriers. If this isn't sufficient then an
-mmiowb() may need to be used explicitly.
+registers that form implicit I/O barriers. If this isn't sufficient then
+interrupt-safe locks should be used, along with io_lock and io_unlock
+barriers.
A similar situation may occur between an interrupt routine and two routines
@@ -1686,9 +1715,6 @@ functions:
cause a malfunction - consider the 16550 Rx/Tx serial registers for
example.
- Used with prefetchable I/O memory, an mmiowb() barrier may be required to
- force stores to be ordered.
-
Please refer to the PCI specification for more information on interactions
between PCI transactions.
Index: linux-2.6/arch/ia64/sn/kernel/iomv.c
===================================================================
--- linux-2.6.orig/arch/ia64/sn/kernel/iomv.c
+++ linux-2.6/arch/ia64/sn/kernel/iomv.c
@@ -69,6 +69,7 @@ EXPORT_SYMBOL(sn_io_addr);
* On SN2, we wait for the PIO_WRITE_STATUS SHub register to clear.
* See PV 871084 for details about the WAR about zero value.
*
+ * XXX: do we also need an mf.a, or does the *adr load order for us?
*/
void __sn_mmiowb(void)
{
Index: linux-2.6/include/asm-ia64/system.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/system.h
+++ linux-2.6/include/asm-ia64/system.h
@@ -87,6 +87,9 @@ extern struct ia64_boot_param {
#define wmb() mb()
#define read_barrier_depends() do { } while(0)
+#define io_lock() do { } while (0)
+#define io_unlock() mmiowb()
+
#ifdef CONFIG_SMP
# define smp_mb() mb()
# define smp_rmb() rmb()
@@ -99,6 +102,10 @@ extern struct ia64_boot_param {
# define smp_read_barrier_depends() do { } while(0)
#endif
+#define io_mb() mb()
+#define io_rmb() rmb()
+#define io_wmb() wmb()
+
/*
* XXX check on this ---I suspect what Linus really wants here is
* acquire vs release semantics but we can't discuss this stuff with
Index: linux-2.6/include/asm-powerpc/system.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/system.h
+++ linux-2.6/include/asm-powerpc/system.h
@@ -33,8 +33,15 @@
* SMP since it is only used to order updates to system memory.
*/
#define mb() __asm__ __volatile__ ("sync" : : : "memory")
-#define rmb() __asm__ __volatile__ (__stringify(LWSYNC) : : : "memory")
+#define rmb() __asm__ __volatile__ ("sync" : : : "memory")
#define wmb() __asm__ __volatile__ ("sync" : : : "memory")
+
+#define io_lock() do { } while (0)
+#define io_unlock() \
+do { \
+ get_paca()->io_sync = 0; \
+ mb(); \
+} while (0)
#define read_barrier_depends() do { } while(0)
#define set_mb(var, value) do { var = value; mb(); } while (0)
@@ -43,7 +50,7 @@
#define AT_VECTOR_SIZE_ARCH 6 /* entries in ARCH_DLINFO */
#ifdef CONFIG_SMP
#define smp_mb() mb()
-#define smp_rmb() rmb()
+#define smp_rmb() __asm__ __volatile__ (__stringify(LWSYNC) : : :"memory")
#define smp_wmb() eieio()
#define smp_read_barrier_depends() read_barrier_depends()
#else
@@ -53,6 +60,10 @@
#define smp_read_barrier_depends() do { } while(0)
#endif /* CONFIG_SMP */
+#define io_mb() mb()
+#define io_rmb() mb()
+#define io_wmb() eieio()
+
/*
* This is a barrier which prevents following instructions from being
* started until the value of the argument x is known. For example, if
Index: linux-2.6/include/asm-x86/system_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/system_32.h
+++ linux-2.6/include/asm-x86/system_32.h
@@ -223,6 +223,9 @@ static inline unsigned long get_limit(un
#define rmb() alternative("lock; addl $0,0(%%esp)", "lfence", X86_FEATURE_XMM2)
#define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
+#define io_lock() do { } while (0)
+#define io_unlock() mb()
+
/**
* read_barrier_depends - Flush all pending reads that subsequents reads
* depend on.
@@ -299,6 +302,10 @@ static inline unsigned long get_limit(un
#define set_mb(var, value) do { var = value; barrier(); } while (0)
#endif
+#define io_mb() mb()
+#define io_rmb() rmb()
+#define io_wmb() wmb()
+
#include <linux/irqflags.h>
/*
Index: linux-2.6/include/asm-x86/system_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/system_64.h
+++ linux-2.6/include/asm-x86/system_64.h
@@ -153,6 +153,9 @@ static inline void clflush(volatile void
#define smp_read_barrier_depends() do {} while(0)
#endif
+#define io_mb() mb()
+#define io_rmb() rmb()
+#define io_wmb() wmb()
/*
* Force strict CPU ordering.
@@ -163,6 +166,10 @@ static inline void clflush(volatile void
#define rmb() asm volatile("lfence":::"memory")
#define wmb() asm volatile("sfence" ::: "memory")
+#define io_lock() do { } while (0)
+#define io_unlock() mb()
+
+
#define read_barrier_depends() do {} while(0)
#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
Index: linux-2.6/drivers/net/wireless/b43/main.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/b43/main.c
+++ linux-2.6/drivers/net/wireless/b43/main.c
@@ -265,7 +265,7 @@ static void b43_ram_write(struct b43_wld
val = swab32(val);
b43_write32(dev, B43_MMIO_RAM_CONTROL, offset);
- mmiowb();
+ io_wmb();
b43_write32(dev, B43_MMIO_RAM_DATA, val);
}
@@ -334,19 +334,19 @@ void b43_shm_write32(struct b43_wldev *d
if (offset & 0x0003) {
/* Unaligned access */
b43_shm_control_word(dev, routing, offset >> 2);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_SHM_DATA_UNALIGNED,
(value >> 16) & 0xffff);
- mmiowb();
+ io_wmb();
b43_shm_control_word(dev, routing, (offset >> 2) + 1);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_SHM_DATA, value & 0xffff);
return;
}
offset >>= 2;
}
b43_shm_control_word(dev, routing, offset);
- mmiowb();
+ io_wmb();
b43_write32(dev, B43_MMIO_SHM_DATA, value);
}
@@ -357,14 +357,14 @@ void b43_shm_write16(struct b43_wldev *d
if (offset & 0x0003) {
/* Unaligned access */
b43_shm_control_word(dev, routing, offset >> 2);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_SHM_DATA_UNALIGNED, value);
return;
}
offset >>= 2;
}
b43_shm_control_word(dev, routing, offset);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_SHM_DATA, value);
}
@@ -470,9 +470,9 @@ static void b43_tsf_write_locked(struct
u32 hi = (tsf & 0xFFFFFFFF00000000ULL) >> 32;
b43_write32(dev, B43_MMIO_REV3PLUS_TSF_LOW, 0);
- mmiowb();
+ io_wmb();
b43_write32(dev, B43_MMIO_REV3PLUS_TSF_HIGH, hi);
- mmiowb();
+ io_wmb();
b43_write32(dev, B43_MMIO_REV3PLUS_TSF_LOW, lo);
} else {
u16 v0 = (tsf & 0x000000000000FFFFULL);
@@ -481,13 +481,13 @@ static void b43_tsf_write_locked(struct
u16 v3 = (tsf & 0xFFFF000000000000ULL) >> 48;
b43_write16(dev, B43_MMIO_TSF_0, 0);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_TSF_3, v3);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_TSF_2, v2);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_TSF_1, v1);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_TSF_0, v0);
}
}
@@ -1390,6 +1390,7 @@ static void b43_interrupt_tasklet(struct
unsigned long flags;
spin_lock_irqsave(&dev->wl->irq_lock, flags);
+ io_lock();
B43_WARN_ON(b43_status(dev) != B43_STAT_STARTED);
@@ -1415,7 +1416,7 @@ static void b43_interrupt_tasklet(struct
dma_reason[2], dma_reason[3],
dma_reason[4], dma_reason[5]);
b43_controller_restart(dev, "DMA error");
- mmiowb();
+ io_unlock();
spin_unlock_irqrestore(&dev->wl->irq_lock, flags);
return;
}
@@ -1466,7 +1467,7 @@ static void b43_interrupt_tasklet(struct
handle_irq_transmit_status(dev);
b43_interrupt_enable(dev, dev->irq_savedstate);
- mmiowb();
+ io_unlock();
spin_unlock_irqrestore(&dev->wl->irq_lock, flags);
}
@@ -1514,6 +1515,7 @@ static irqreturn_t b43_interrupt_handler
return IRQ_NONE;
spin_lock(&dev->wl->irq_lock);
+ io_lock();
if (b43_status(dev) < B43_STAT_STARTED)
goto out;
@@ -1545,7 +1547,7 @@ static irqreturn_t b43_interrupt_handler
dev->irq_reason = reason;
tasklet_schedule(&dev->isr_tasklet);
out:
- mmiowb();
+ io_unlock();
spin_unlock(&dev->wl->irq_lock);
return ret;
@@ -2799,8 +2801,9 @@ static int b43_dev_config(struct ieee802
}
spin_lock_irqsave(&wl->irq_lock, flags);
+ io_lock();
b43_interrupt_enable(dev, savedirqs);
- mmiowb();
+ io_unlock();
spin_unlock_irqrestore(&wl->irq_lock, flags);
out_unlock_mutex:
mutex_unlock(&wl->mutex);
Index: linux-2.6/drivers/net/wireless/b43/phy.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/b43/phy.c
+++ linux-2.6/drivers/net/wireless/b43/phy.c
@@ -299,7 +299,7 @@ void b43_phy_write(struct b43_wldev *dev
offset = adjust_phyreg_for_phytype(phy, offset, dev);
b43_write16(dev, B43_MMIO_PHY_CONTROL, offset);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_PHY_DATA, val);
}
@@ -2472,7 +2472,7 @@ u16 b43_radio_read16(struct b43_wldev *d
void b43_radio_write16(struct b43_wldev *dev, u16 offset, u16 val)
{
b43_write16(dev, B43_MMIO_RADIO_CONTROL, offset);
- mmiowb();
+ io_wmb();
b43_write16(dev, B43_MMIO_RADIO_DATA_LOW, val);
}
@@ -2648,7 +2648,7 @@ u8 b43_radio_aci_scan(struct b43_wldev *
void b43_nrssi_hw_write(struct b43_wldev *dev, u16 offset, s16 val)
{
b43_phy_write(dev, B43_PHY_NRSSILT_CTRL, offset);
- mmiowb();
+ io_wmb();
b43_phy_write(dev, B43_PHY_NRSSILT_DATA, (u16) val);
}
Index: linux-2.6/drivers/net/wireless/b43/sysfs.c
===================================================================
--- linux-2.6.orig/drivers/net/wireless/b43/sysfs.c
+++ linux-2.6/drivers/net/wireless/b43/sysfs.c
@@ -139,13 +139,14 @@ static ssize_t b43_attr_interfmode_store
mutex_lock(&wldev->wl->mutex);
spin_lock_irqsave(&wldev->wl->irq_lock, flags);
+ io_lock();
err = b43_radio_set_interference_mitigation(wldev, mode);
if (err) {
b43err(wldev->wl, "Interference Mitigation not "
"supported by device\n");
}
- mmiowb();
+ io_unlock();
spin_unlock_irqrestore(&wldev->wl->irq_lock, flags);
mutex_unlock(&wldev->wl->mutex);
Index: linux-2.6/drivers/net/wireless/b43/pio.h
===================================================================
--- linux-2.6.orig/drivers/net/wireless/b43/pio.h
+++ linux-2.6/drivers/net/wireless/b43/pio.h
@@ -87,7 +87,7 @@ static inline
void b43_pio_write(struct b43_pioqueue *queue, u16 offset, u16 value)
{
b43_write16(queue->dev, queue->mmio_base + offset, value);
- mmiowb();
+ io_wmb();
}
int b43_pio_init(struct b43_wldev *dev);
Index: linux-2.6/drivers/net/s2io.c
===================================================================
--- linux-2.6.orig/drivers/net/s2io.c
+++ linux-2.6/drivers/net/s2io.c
@@ -2446,7 +2446,7 @@ static int fill_rx_buffers(struct s2io_n
DBG_PRINT(INFO_DBG, "%s: Out of ", dev->name);
DBG_PRINT(INFO_DBG, "memory to allocate SKBs\n");
if (first_rxdp) {
- wmb();
+ wmb(); /* io_wmb? */
first_rxdp->Control_1 |= RXD_OWN_XENA;
}
nic->mac_control.stats_info->sw_stat. \
@@ -2553,7 +2553,7 @@ static int fill_rx_buffers(struct s2io_n
rxdp->Control_2 |= SET_RXD_MARKER;
if (!(alloc_tab & ((1 << rxsync_frequency) - 1))) {
if (first_rxdp) {
- wmb();
+ io_wmb();
first_rxdp->Control_1 |= RXD_OWN_XENA;
}
first_rxdp = rxdp;
@@ -2568,7 +2568,7 @@ static int fill_rx_buffers(struct s2io_n
* and other fields are seen by adapter correctly.
*/
if (first_rxdp) {
- wmb();
+ io_wmb();
first_rxdp->Control_1 |= RXD_OWN_XENA;
}
@@ -4081,7 +4081,7 @@ static int s2io_xmit(struct sk_buff *skb
writeq(val64, &tx_fifo->List_Control);
- mmiowb();
+ wmb(); /* io_lock() / io_unlock() ? */
put_off++;
if (put_off == mac_control->fifos[queue].tx_curr_put_info.fifo_len + 1)
@@ -6665,7 +6665,7 @@ static int rxd_owner_bit_reset(struct s
}
set_rxd_buffer_size(sp, rxdp, size);
- wmb();
+ io_wmb();
/* flip the Ownership bit to Hardware */
rxdp->Control_1 |= RXD_OWN_XENA;
}
Index: linux-2.6/drivers/misc/ioc4.c
===================================================================
--- linux-2.6.orig/drivers/misc/ioc4.c
+++ linux-2.6/drivers/misc/ioc4.c
@@ -156,7 +156,7 @@ ioc4_clock_calibrate(struct ioc4_driver_
/* Reset to power-on state */
writel(0, &idd->idd_misc_regs->int_out.raw);
- mmiowb();
+ io_wmb();
/* Set up square wave */
int_out.raw = 0;
@@ -164,7 +164,7 @@ ioc4_clock_calibrate(struct ioc4_driver_
int_out.fields.mode = IOC4_INT_OUT_MODE_TOGGLE;
int_out.fields.diag = 0;
writel(int_out.raw, &idd->idd_misc_regs->int_out.raw);
- mmiowb();
+ io_wmb();
/* Check square wave period averaged over some number of cycles */
do {
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: David Howells @ 2007-11-20 16:46 UTC
To: Nick Piggin; +Cc: dhowells, linux-arch, Linus Torvalds, Benjamin Herrenschmidt
Hi Nick,
Some comments on the documentation changes in your patch:
> The Linux kernel has eight basic CPU memory barriers:
That should be 'eleven' at least with these changes.
Can I recommend that unless io_lock() and io_unlock() actually take locks, you
postfix their names with _mb or _barrier. If they are actual locking
primitives, then they don't belong in this table.
> + TYPE FULL CACHEABLE (SMP CONDITIONAL) IO
I think I'd make that 'SMP CONDITIONAL' with '(CACHEABLE)' on the next line or
vice versa. This means the table doesn't have to expand quite so much, and
for the data dep barrier lines, do:
TYPE FULL CACHEABLE I/O
(SMP CONDITIONAL)
=============== =============== ======================= ===============
GENERAL mb() smp_mb() io_mb()
WRITE wmb() smp_wmb() io_wmb()
READ rmb() smp_rmb() io_rmb()
DATA DEPENDENCY
read_barrier_depends()
smp_read_barrier_depends()
Does having 'CACHEABLE' imply that 'FULL' ones are or that they aren't?
I think I would alter this:
+CACHEABLE memory barriers control order of access to regular (cacheable) RAM
+by the CPU. They are reduced to compiler barriers on uniprocessor compiled
+systems because cacheable memory is operated upon by CPUs, and a CPU is
+self-consistent.
To say something like:
SMP conditional memory barriers control the order of access to regular,
cacheable RAM by the CPU. They are reduced to compiler barriers on
uniprocessor compiled systems because it is assumed that a CPU will appear to
be self-consistent, and will order overlapping accesses correctly with
respect to itself when writing to ordinary RAM.
And 'IO' should be 'I/O' for consistency.
> +[!] Note that CACHEABLE or FULL memory barriers must be used to control the
> +ordering of references to shared memory on SMP systems, though the use of
> +locking instead is sufficient.
So a spin_lock() or a down() will do the trick?
> +CACHEABLE accesses within the critical section). io_lock must be called
Sentence breaks should be two spaces or a newline after the full stop, btw.
> Do not get smart.
I'd very much recommend against saying that.
> +All barriers are required even on non-SMP systems as they affect the order in
> +which memory operations appear to a device by prohibiting both the compiler and
> +(in the case of IO and FULL barriers) the CPU from reordering them.
Add to that, perhaps, that if the barriers are unnecessary for a particular
arch then they'll be optimised away. That might help discourage people from
saying "but I don't need them on arch X or platform Y because they don't do
anything there".
> -Under certain circumstances (especially involving NUMA), I/O accesses within
> -two spinlocked sections on two different CPUs may be seen as interleaved by the
> -PCI bridge, because the PCI bridge does not necessarily participate in the
> -cache-coherence protocol, and is therefore incapable of issuing the required
> -read memory barriers.
Is this not true? If it is true, why remove it? Just stick in a blank line
and add the next paragraph.
> +SMP locking primitives are implemented with operations to CACHEABLE memory,
> +and they are guaranteed only to provide LOCK and UNLOCK ordering for other
> +CACHEABLE memory accesses inside the critical section. IO memory accesses,
> +even if performed inside the lock section, can "leak out".
Hmmm... I'm not sure 'CACHEABLE' needs capitalising quite so much. I can see
why you're doing this though.
> - writel(0, ADDR)
> + writel(0, ADDR);
Seems I wasn't very consistent on the usage of semicolons in these tables.
Adding them is fine.
> +which would probably cause the hardware to malfunction. The reason is that
> +the stores to IO memory from CPU 1 is not ordered with respect to the
'are not ordered'
> +spin_unlock, so it may not be visible to the PCI bridge before the spin_unlock
'so they may not be visible'
> +What is necessary here is to use some type of FULL memory barrier to prevent
> +the reordering of CACHEABLE and IO operations.
I'd insert a 'both' after the second 'of'.
> The best way to do this is with
> +io_lock and io_unlock barriers, which directs the critical section to order
> +IO access as well. For example:
Should there be a spin_lock_io() for example?
> +this will ensure that the two IO stores issued on CPU 1 appear at the PCI
> +bridge before either of the IO stores issued on CPU 2.
Hmmm... How about 'stores to I/O' rather than 'I/O stores'?
> +XXX: get rid of this, and just prefer io_lock()/io_unlock() ?
What do you mean?
David
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: Matthew Wilcox @ 2007-11-20 16:58 UTC
To: David Howells
Cc: Nick Piggin, linux-arch, Linus Torvalds, Benjamin Herrenschmidt
On Tue, Nov 20, 2007 at 04:46:22PM +0000, David Howells wrote:
> > The best way to do this is with
> > +io_lock and io_unlock barriers, which directs the critical section to order
> > +IO access as well. For example:
>
> Should there be a spin_lock_io() for example?
Then we need ...
spin_lock_irqsave_io()
spin_lock_irq_io()
spin_lock_bh_io()
spin_trylock_io()
spin_trylock_irqsave_io()
spin_trylock_irq_io()
spin_trylock_bh_io()
ditto with write_ and read_. Maybe double_spin_lock_io too. That's an
extra 25 primitives (oh, plus the unlocks; 28).
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: David Howells @ 2007-11-20 17:04 UTC
To: Matthew Wilcox
Cc: dhowells, Nick Piggin, linux-arch, Linus Torvalds,
Benjamin Herrenschmidt
Matthew Wilcox <matthew@wil.cx> wrote:
> > Should there be a spin_lock_io() for example?
>
> Then we need ...
Yeah.
David
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: Nick Piggin @ 2007-11-21 0:30 UTC
To: David Howells; +Cc: linux-arch, Linus Torvalds, Benjamin Herrenschmidt
Hi David, thanks for the feedback. Maybe this whole idea is still going to
be rejected, but most of your suggestions are good anyway ;)
Note that I think I've made one major mistake in the patch, too: that writel
is already supposed to be ordered WRT other I/O, so the io_wmb is not needed
in the drivers. However, we still have a writel_relaxed, so io_wmb is still
needed there.
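So, on that reading, the difference would be something like this (a
sketch only):

	writel(offset, reg_a);		/* ordered WRT other readl/writel */
	writel(val, reg_b);		/* no explicit barrier needed */

	writel_relaxed(offset, reg_a);	/* may be reordered... */
	io_wmb();			/* ...so an explicit io barrier */
	writel_relaxed(val, reg_b);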
On Tue, Nov 20, 2007 at 04:46:22PM +0000, David Howells wrote:
>
> Hi Nick,
>
> Some comments on the documentation changes in your patch:
>
> > The Linux kernel has eight basic CPU memory barriers:
>
> That should be 'eleven' at least with these changes.
>
> Can I recommend that unless io_lock() and io_unlock() actually take locks, you
> postfix their names with _mb or _barrier. If they are actual locking
> primitives, then they don't belong in this table.
Yeah ... the name isn't ideal. _mb is a nice one, but I don't want to use it
unless it is guaranteed to be a full barrier (rather than lock/unlock
barriers). barrier similarly has an existing meaning.
Anyway, I agree with you, and it won't be just io_lock / io_unlock ;)
> TYPE FULL CACHEABLE I/O
> (SMP CONDITIONAL)
> =============== =============== ======================= ===============
> GENERAL mb() smp_mb() io_mb()
> WRITE wmb() smp_wmb() io_wmb()
> READ rmb() smp_rmb() io_rmb()
> DATA DEPENDENCY
> read_barrier_depends()
> smp_read_barrier_depends()
>
> Does having 'CACHEABLE' imply that 'FULL' ones are or that they aren't?
Are or aren't what?
The problem with just calling them mandatory or SMP conditional is that it
doesn't suggest the important *what* is being ordered.
> I think I would alter this:
>
> +CACHEABLE memory barriers control order of access to regular (cacheable) RAM
> +by the CPU. They are reduced to compiler barriers on uniprocessor compiled
> +systems because cacheable memory is operated upon by CPUs, and a CPU is
> +self-consistent.
>
> To say something like:
>
> SMP conditional memory barriers control the order of access to regular,
> cacheable RAM by the CPU. They are reduced to compiler barriers on
> uniprocessor compiled systems because it is assumed that a CPU will appear to
> be self-consistent, and will order overlapping accesses correctly with
> respect to itself when writing to ordinary RAM.
Pretty verbose, isn't it ;) Not only does a CPU appear self consistent, it is.
And not only will overlapping accesses be ordered correctly, so will completely
out of order ones.
> > +[!] Note that CACHEABLE or FULL memory barriers must be used to control the
> > +ordering of references to shared memory on SMP systems, though the use of
> > +locking instead is sufficient.
>
> So a spin_lock() or a down() will do the trick?
I hope so. I didn't write this (just changed slightly).
> > Do not get smart.
>
> I'd very much recommend against saying that.
Really? What if you grep drivers/*, do you still recommend against? ;(
> > -Under certain circumstances (especially involving NUMA), I/O accesses within
> > -two spinlocked sections on two different CPUs may be seen as interleaved by the
> > -PCI bridge, because the PCI bridge does not necessarily participate in the
> > -cache-coherence protocol, and is therefore incapable of issuing the required
> > -read memory barriers.
>
> Is this not true? If it is true, why remove it? Just stick in a blank line
> and add the next paragraph.
It's not because it doesn't participate in the cache coherence protocol. It
is because cacheable access isn't ordered with IO access, so it can leak
out of the lock.
The PCI bridge does not perform speculative, or out of order reads, so a read
memory barrier would not do anything for it.
As far as I know, on the platforms that do this reordering (eg. Altix and
pseries) the PCI bridge does not reorder accesses, but they get reordered
before going over the interconnect.
I just figure it is vague, unnecessary, hardware specific. Can take it out.
> > +this will ensure that the two IO stores issued on CPU 1 appear at the PCI
> > +bridge before either of the IO stores issued on CPU 2.
>
> Hmmm... How about 'stores to I/O' rather than 'I/O stores'?
>
> > +XXX: get rid of this, and just prefer io_lock()/io_unlock() ?
>
> What do you mean?
I mean that although the readl can obviate the need for mmiowb on PCI,
would it be a problem just to encourage the use of the IO locks instead?
Just for simplicity and clarity.
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: David Howells @ 2007-11-21 1:53 UTC
To: Nick Piggin; +Cc: dhowells, linux-arch, Linus Torvalds, Benjamin Herrenschmidt
Nick Piggin <npiggin@suse.de> wrote:
> Yeah ... the name isn't ideal. _mb is a nice one, but I don't want to use it
> unless it is guaranteed to be a full barrier (rather than lock/unlock
> barriers). barrier similarly has an existing meaning.
How about _iobarrier?
Or how about calling them begin_io_section() and end_io_section()?
> > Does having 'CACHEABLE' imply that 'FULL' ones are or that they aren't?
>
> Are or aren't what?
Hmmm... Nevermind. I'm not sure what I was thinking of applies anyway.
Perhaps you should state that a FULL barrier implies a CACHEABLE barrier and
it implies an I/O barrier and it also orders ordinary memory accesses with
respect to I/O accesses.
Also, I object to 'CACHEABLE' because memory barriers may still be required
even if there aren't any caches.
> The problem with just calling them mandatory or SMP conditional is that it
> doesn't suggest the important *what* is being ordered.
Well, it is called 'memory-barriers.txt':-)
But I know what you mean, you've expanded the whole scope of the document in a
way.
> Pretty verbose, isn't it ;) Not only does a CPU appear self consistent, it
> is. And not only will overlapping accesses be ordered correctly, so will
> completely out of order ones.
Sometimes it's a good idea to explicitly state your assumptions, just so that
people know.
> > > +[!] Note that CACHEABLE or FULL memory barriers must be used to control the
> > > +ordering of references to shared memory on SMP systems, though the use of
> > > +locking instead is sufficient.
> >
> > So a spin_lock() or a down() will do the trick?
>
> I hope so. I didn't write this (just changed slightly).
I think this paragraph implies that use of a spin lock renders I/O barriers
redundant for that particular situation. However your io_lock() implies
otherwise.
> > > Do not get smart.
> >
> > I'd very much recommend against saying that.
>
> Really? What if you grep drivers/*, do you still recommend against? ;(
Bad practice elsewhere doesn't condone it here. In fact, you're even more
constrained in formal documentation.
> I just figure it is vague, unnecessary, hardware specific. Can take it out.
You still have to say why. You should check this with Paul McKenney. I think
he added the statement you've removed.
> > > +XXX: get rid of this, and just prefer io_lock()/io_unlock() ?
> >
> > What do you mean?
>
> I mean that although the readl can obviate the need for mmiowb on PCI,
> would it be a problem just to encourage the use of the IO locks instead?
> Just for simplicity and clarity.
So if I have a hardware device with a register that I want to make three reads
of, and I expect each read to have side effects, I can just do:
io_lock()
readl()
readl()
readl()
io_unlock()
One thing I'm not entirely clear on. Do you expect two I/O accesses from one
CPU to remain ordered wrt each other? Or is the following required:
io_lock()
readl()
io_rmb()
readl()
io_rmb()
readl()
io_unlock()
when talking to a device?
David
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: Ralf Baechle @ 2007-11-21 13:50 UTC
To: Nick Piggin, dhowells; +Cc: linux-arch, Linus Torvalds, Benjamin Herrenschmidt
On Tue, Nov 20, 2007 at 05:02:20PM +0100, Nick Piggin wrote:
> + TYPE FULL CACHEABLE (SMP CONDITIONAL) IO
> + ========= ======================= =========================== ============
> + GENERAL mb() smp_mb() io_mb()
> + WRITE wmb() smp_wmb() io_wmb()
> + READ rmb() smp_rmb() io_rmb()
> + DATA DEP. read_barrier_depends() smp_read_barrier_depends()
A while ago I went through the kernel looking at the uses of mb(), wmb() and
rmb() and found that every use was either fishy or should have been a
smp_mb(), smp_wmb() or smp_rmb(). Which leads me to the question if there
is any need for the non-smp_ variants left, or can we just bury them?
Ralf
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: Nick Piggin @ 2007-11-22 8:04 UTC
To: Ralf Baechle; +Cc: dhowells, linux-arch, Linus Torvalds, Benjamin Herrenschmidt
On Wed, Nov 21, 2007 at 01:50:16PM +0000, Ralf Baechle wrote:
> On Tue, Nov 20, 2007 at 05:02:20PM +0100, Nick Piggin wrote:
>
> > + TYPE FULL CACHEABLE (SMP CONDITIONAL) IO
> > + ========= ======================= =========================== ============
> > + GENERAL mb() smp_mb() io_mb()
> > + WRITE wmb() smp_wmb() io_wmb()
> > + READ rmb() smp_rmb() io_rmb()
> > + DATA DEP. read_barrier_depends() smp_read_barrier_depends()
>
> A while ago I went through the kernel looking at the uses of mb(), wmb() and
> rmb() and found that every use was either fishy or should have been a
> smp_mb(), smp_wmb() or smp_rmb(). Which leads me to the question if there
> is any need for the non-smp_ variants left, or can we just bury them?
I don't know driver code very well, but it seems like there is a
legitimate need for barriers there; _however_ I think a lot of those went
in when writel could be unordered on some architectures.
The sane way to do it is to make all drivers default to strongly ordered
IO (eg. make readl writel strongly ordered). And then yes indeed I believe
most of the mb/wmb/rmb can go away.
We still need full memory barriers in order to order cacheable access with
IO access (eg. in order to support SMP locking of IO routines). However in
that case I believe an easily understood lock-based primitive should be
provided, along with fully ordering implementations of the regular memory
barriers, rather than mmiowb (which clearly very few driver writers or anyone
else understands).
After that, if some architectures want to implement more relaxed variants
of the IO accessors, then OK. And we can use io_ memory barriers for that.
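To illustrate the division of labour I have in mind (a sketch with
made-up structures):

	/* RAM-only producer/consumer: the smp_ variant suffices */
	ring->slot[i] = item;
	smp_wmb();
	ring->head = i;

	/* cacheable vs IO ordering still needs a full barrier (or the
	 * proposed io_lock/io_unlock around the locked section) */
	desc->status = READY;
	mb();
	writel(KICK, doorbell);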
* Re: [rfc] io memory barriers, and getting rid of mmiowb
From: Nick Piggin @ 2007-11-22 8:23 UTC
To: David Howells; +Cc: linux-arch, Linus Torvalds, Benjamin Herrenschmidt
On Wed, Nov 21, 2007 at 01:53:09AM +0000, David Howells wrote:
> Nick Piggin <npiggin@suse.de> wrote:
>
> > Yeah ... the name isn't ideal. _mb is a nice one, but I don't want to use it
> > unless it is guaranteed to be a full barrier (rather than lock/unlock
> > barriers). barrier similarly has an existing meaning.
>
> How about _iobarrier?
>
> Or how about calling them begin_io_section() and end_io_section()?
Well it's lock semantics. If we decide to use any name without lock/unlock
in it, it will have acquire/release. begin_io_lock/end_io_lock, maybe...
> Also, I object to 'CACHEABLE' because memory barriers may still be required
> even if there aren't any caches.
OK. RAM?
> > The problem with just calling them mandatory or SMP conditional is that it
> > doesn't suggest the important *what* is being ordered.
>
> Well, it is called 'memory-barriers.txt':-)
>
> But I know what you mean, you've expanded the whole scope of the document in a
> way.
Yeah, I mean the type of memory being ordered.
> > > > +[!] Note that CACHEABLE or FULL memory barriers must be used to control the
> > > > +ordering of references to shared memory on SMP systems, though the use of
> > > > +locking instead is sufficient.
> > >
> > > So a spin_lock() or a down() will do the trick?
> >
> > I hope so. I didn't write this (just changed slightly).
>
> I think this paragraph implies that use of a spin lock renders I/O barriers
> redundant for that particular situation. However your io_lock() implies
> otherwise.
Oh, perhaps. Yes, locks won't work for IO. By shared memory, I was thinking
about regular shared ram, but it could be talking about shared access to IO
memory too. So it needs clarification.
> > > > Do not get smart.
> > >
> > > I'd very much recommend against saying that.
> >
> > Really? What if you grep drivers/*, do you still recommend against? ;(
>
> Bad practice elsewhere doesn't condone it here. In fact, you're even more
> constrained in formal documentation.
OK, just to please you ;)
> > > > +XXX: get rid of this, and just prefer io_lock()/io_unlock() ?
> > >
> > > What do you mean?
> >
> > I mean that although the readl can obviate the need for mmiowb on PCI,
> > will it be a problem just to encourage the use of the IO locks instead.
> > Just for simplicity and clarity.
>
> So if I have a hardware device with a register that I want to make three reads
> of, and I expect each read to have side effects, I can just do:
>
> io_lock()
> readl()
> readl()
> readl()
> io_unlock()
Well, presumably they will somehow be excluding other CPUs from accessing,
in which case I guess they would take a lock. And we would encourage them to
do the io_lock / io_unlock just after/before taking and releasing the lock.
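I.e. something like (sketch):

	spin_lock(&dev->lock);
	io_lock();
	a = readl(reg);
	b = readl(reg);
	c = readl(reg);
	io_unlock();
	spin_unlock(&dev->lock);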
> One thing I'm not entirely clear on. Do you expect two I/O accesses from one
> CPU to remain ordered wrt each other? Or is the following required:
>
> io_lock()
> readl()
> io_rmb()
> readl()
> io_rmb()
> readl()
> io_unlock()
>
> when talking to a device?
Good question. I'm not an expert on that.
sn2 implements readl_relaxed() differently to readl. powerpc implements their
readls with sync and isync, so presumably they can go out of order too. I do
think all architectures should make read*/write* completely ordered WRT
one another (if not ordered with RAM), and then we should consistently
implement _relaxed postfixes if performance matters.
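Roughly, the rule I would like is (a sketch):

	a = readl(reg_a);		/* ordered WRT the following readl */
	b = readl(reg_b);		/* no io_rmb() required between them */

	a = readl_relaxed(reg_a);	/* may be reordered... */
	io_rmb();
	b = readl_relaxed(reg_b);	/* ...unless an io_rmb() intervenes */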