public inbox for linux-arch@vger.kernel.org
 help / color / mirror / Atom feed
* RFC: being more anal about iospace accesses..
@ 2004-09-08 22:57 Linus Torvalds
  2004-09-08 23:07 ` David S. Miller
                   ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-08 22:57 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller


In my continual quest to make sparse more useful, and statically check
more stuff, I started looking at verifying correct usage of PCI
memory-mapped accesses. They are nasty because they generally "just work"  
on x86 (and some other architectures), and because we've historically had 
very permissive types for the macros that are supposed to be used. So 
there's been absolutely _tons_ of confusion whether a pointer is a pointer 
to regular kernel memory, or to memory-mapped IO space. Not helped by the 
fact that sometimes it's an integer (which can be 32-bit, and then you 
totally break on 64-bit architectures).

A quick trial patch shows that it doesn't seem to be _too_ painful to use 
the same sparse "address space" mechanism for this as we use for user 
pointers. Just do

	#define __iospace __attribute__((noderef,address_space(2)))

and force all "ioremap()" returns to have type "void __iospace *".  
Together with making "readb/w/l()" take this kind of pointer, it "Just
Makes Sense(tm)".

HOWEVER, some of that spills over even outside of "sparse". Because some 
people have successfully used "unsigned long" instead of a pointer, this 
will actually cause warnings even for normal gcc usage. And while I've 
tried out a compile on my normal x86 setup and could fix all of the 
gcc-visible things, this may be disruptive enough that people object to 
doing something like this.

Comments? I'm appending my "iospace-patch" which is enough to make gcc
happy on my (very limited) x86 driver configuration, and starts annotating
_some_ of the users with the sparse __iospace annotations. A full sparse 
run still will result in a _lot_ of warnings, but this does seem to be the 
right way to go. It not only makes it very obvious whether a pointer 
points to memory or IO space, but sparse will also warn about trying to 
dereference such a thing directly. 

Btw, doing this I was again amazed at the kind of crap we have. Drivers
casting "ioremap()" to "unsigned long" only to then cast it back to a
pointer when actually using it in any way etc etc. That's a sign of some
really bad historical interfaces. I'd blame somebody else, but dammit, I
can't find anybody but myself ;)

			Linus

----
===== arch/i386/mm/ioremap.c 1.21 vs edited =====
--- 1.21/arch/i386/mm/ioremap.c	2004-06-17 23:41:15 -07:00
+++ edited/arch/i386/mm/ioremap.c	2004-09-08 15:22:18 -07:00
@@ -110,9 +110,9 @@
  * have to convert them into an offset in a page-aligned mapping, but the
  * caller shouldn't need to know that small detail.
  */
-void * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flags)
+void __iospace * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flags)
 {
-	void * addr;
+	void __iospace * addr;
 	struct vm_struct * area;
 	unsigned long offset, last_addr;
 
@@ -125,7 +125,7 @@
 	 * Don't remap the low PCI/ISA area, it's always mapped..
 	 */
 	if (phys_addr >= 0xA0000 && last_addr < 0x100000)
-		return phys_to_virt(phys_addr);
+		return (void __iospace *) phys_to_virt(phys_addr);
 
 	/*
 	 * Don't allow anybody to remap normal RAM that we're using..
@@ -156,12 +156,12 @@
 	if (!area)
 		return NULL;
 	area->phys_addr = phys_addr;
-	addr = area->addr;
+	addr = (void __iospace *) area->addr;
 	if (remap_area_pages((unsigned long) addr, phys_addr, size, flags)) {
-		vunmap(addr);
+		vunmap((void __force *) addr);
 		return NULL;
 	}
-	return (void *) (offset + (char *)addr);
+	return (void __iospace *) (offset + (char __iospace *)addr);
 }
 
 
@@ -187,10 +187,10 @@
  * Must be freed with iounmap.
  */
 
-void *ioremap_nocache (unsigned long phys_addr, unsigned long size)
+void __iospace *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
 	unsigned long last_addr;
-	void *p = __ioremap(phys_addr, size, _PAGE_PCD);
+	void __iospace *p = __ioremap(phys_addr, size, _PAGE_PCD);
 	if (!p) 
 		return p; 
 
@@ -221,12 +221,12 @@
 	return p;					
 }
 
-void iounmap(void *addr)
+void iounmap(void __iospace *addr)
 {
 	struct vm_struct *p;
-	if (addr <= high_memory) 
+	if ((void __force *) addr <= high_memory) 
 		return; 
-	p = remove_vm_area((void *) (PAGE_MASK & (unsigned long) addr));
+	p = remove_vm_area((void *) (PAGE_MASK & (unsigned long __force) addr));
 	if (!p) { 
 		printk("__iounmap: bad address %p\n", addr);
 		return;
===== arch/i386/pci/mmconfig.c 1.1 vs edited =====
--- 1.1/arch/i386/pci/mmconfig.c	2004-02-29 22:20:00 -08:00
+++ edited/arch/i386/pci/mmconfig.c	2004-09-08 15:37:51 -07:00
@@ -9,7 +9,7 @@
 /* The physical address of the MMCONFIG aperture.  Set from ACPI tables. */
 u32 pci_mmcfg_base_addr;
 
-#define mmcfg_virt_addr (fix_to_virt(FIX_PCIE_MCFG))
+#define mmcfg_virt_addr ((void __iospace *) fix_to_virt(FIX_PCIE_MCFG))
 
 /* The base address of the last MMCONFIG device accessed */
 static u32 mmcfg_last_accessed_device;
===== arch/i386/kernel/pci-dma.c 1.16 vs edited =====
--- 1.16/arch/i386/kernel/pci-dma.c	2004-08-22 17:39:03 -07:00
+++ edited/arch/i386/kernel/pci-dma.c	2004-09-08 11:07:13 -07:00
@@ -72,7 +72,7 @@
 int dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr,
 				dma_addr_t device_addr, size_t size, int flags)
 {
-	void *mem_base;
+	void __iospace *mem_base;
 	int pages = size >> PAGE_SHIFT;
 	int bitmap_size = (pages + 31)/32;
 
===== arch/x86_64/kernel/early_printk.c 1.13 vs edited =====
--- 1.13/arch/x86_64/kernel/early_printk.c	2004-08-26 23:59:59 -07:00
+++ edited/arch/x86_64/kernel/early_printk.c	2004-09-08 11:11:32 -07:00
@@ -8,7 +8,7 @@
 /* Simple VGA output */
 
 #ifdef __i386__
-#define VGABASE		__pa((void *)(__PAGE_OFFSET + 0xb8000UL))
+#define VGABASE		(__ISA_IO_base + 0xb8000)
 #else
 #define VGABASE		0xffffffff800b8000UL
 #endif
===== sound/pci/intel8x0.c 1.109 vs edited =====
--- 1.109/sound/pci/intel8x0.c	2004-08-23 21:01:07 -07:00
+++ edited/sound/pci/intel8x0.c	2004-09-08 15:40:07 -07:00
@@ -406,10 +406,10 @@
 
 	unsigned int mmio;
 	unsigned long addr;
-	unsigned long remap_addr;
+	void __iospace * remap_addr;
 	unsigned int bm_mmio;
 	unsigned long bmaddr;
-	unsigned long remap_bmaddr;
+	void __iospace * remap_bmaddr;
 
 	struct pci_dev *pci;
 	snd_card_t *card;
@@ -2227,9 +2227,9 @@
 		snd_dma_free_pages(&chip->bdbars);
 	}
 	if (chip->remap_addr)
-		iounmap((void *) chip->remap_addr);
+		iounmap(chip->remap_addr);
 	if (chip->remap_bmaddr)
-		iounmap((void *) chip->remap_bmaddr);
+		iounmap(chip->remap_bmaddr);
 	pci_release_regions(chip->pci);
 	kfree(chip);
 	return 0;
@@ -2502,9 +2502,8 @@
 	if (pci_resource_flags(pci, 2) & IORESOURCE_MEM) {	/* ICH4 and Nforce */
 		chip->mmio = 1;
 		chip->addr = pci_resource_start(pci, 2);
-		chip->remap_addr = (unsigned long) ioremap_nocache(chip->addr,
-								   pci_resource_len(pci, 2));
-		if (chip->remap_addr == 0) {
+		chip->remap_addr = ioremap_nocache(chip->addr, pci_resource_len(pci, 2));
+		if (!chip->remap_addr) {
 			snd_printk("AC'97 space ioremap problem\n");
 			snd_intel8x0_free(chip);
 			return -EIO;
@@ -2515,9 +2514,8 @@
 	if (pci_resource_flags(pci, 3) & IORESOURCE_MEM) {	/* ICH4 */
 		chip->bm_mmio = 1;
 		chip->bmaddr = pci_resource_start(pci, 3);
-		chip->remap_bmaddr = (unsigned long) ioremap_nocache(chip->bmaddr,
-								     pci_resource_len(pci, 3));
-		if (chip->remap_bmaddr == 0) {
+		chip->remap_bmaddr = ioremap_nocache(chip->bmaddr, pci_resource_len(pci, 3));
+		if (!chip->remap_bmaddr) {
 			snd_printk("Controller space ioremap problem\n");
 			snd_intel8x0_free(chip);
 			return -EIO;
===== drivers/ide/ide-iops.c 1.26 vs edited =====
--- 1.26/drivers/ide/ide-iops.c	2004-05-29 11:30:32 -07:00
+++ edited/drivers/ide/ide-iops.c	2004-09-08 15:02:33 -07:00
@@ -112,57 +112,57 @@
 
 static u8 ide_mm_inb (unsigned long port)
 {
-	return (u8) readb(port);
+	return (u8) readb((void __iospace *) port);
 }
 
 static u16 ide_mm_inw (unsigned long port)
 {
-	return (u16) readw(port);
+	return (u16) readw((void __iospace *) port);
 }
 
 static void ide_mm_insw (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_insw(port, addr, count);
+	__ide_mm_insw((void __iospace *) port, addr, count);
 }
 
 static u32 ide_mm_inl (unsigned long port)
 {
-	return (u32) readl(port);
+	return (u32) readl((void __iospace *) port);
 }
 
 static void ide_mm_insl (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_insl(port, addr, count);
+	__ide_mm_insl((void __iospace *) port, addr, count);
 }
 
 static void ide_mm_outb (u8 value, unsigned long port)
 {
-	writeb(value, port);
+	writeb(value, (void __iospace *) port);
 }
 
 static void ide_mm_outbsync (ide_drive_t *drive, u8 value, unsigned long port)
 {
-	writeb(value, port);	
+	writeb(value, (void __iospace *) port);
 }
 
 static void ide_mm_outw (u16 value, unsigned long port)
 {
-	writew(value, port);
+	writew(value, (void __iospace *) port);
 }
 
 static void ide_mm_outsw (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_outsw(port, addr, count);
+	__ide_mm_outsw((void __iospace *) port, addr, count);
 }
 
 static void ide_mm_outl (u32 value, unsigned long port)
 {
-	writel(value, port);
+	writel(value, (void __iospace *) port);
 }
 
 static void ide_mm_outsl (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_outsl(port, addr, count);
+	__ide_mm_outsl((void __iospace *) port, addr, count);
 }
 
 void default_hwif_mmiops (ide_hwif_t *hwif)
===== drivers/acpi/osl.c 1.54 vs edited =====
--- 1.54/drivers/acpi/osl.c	2004-08-13 22:10:58 -07:00
+++ edited/drivers/acpi/osl.c	2004-09-08 11:24:41 -07:00
@@ -171,11 +171,11 @@
 }
 
 acpi_status
-acpi_os_map_memory(acpi_physical_address phys, acpi_size size, void **virt)
+acpi_os_map_memory(acpi_physical_address phys, acpi_size size, void __iospace **virt)
 {
 	if (efi_enabled) {
 		if (EFI_MEMORY_WB & efi_mem_attributes(phys)) {
-			*virt = phys_to_virt(phys);
+			*virt = (void __iospace *) phys_to_virt(phys);
 		} else {
 			*virt = ioremap(phys, size);
 		}
@@ -197,7 +197,7 @@
 }
 
 void
-acpi_os_unmap_memory(void *virt, acpi_size size)
+acpi_os_unmap_memory(void __iospace *virt, acpi_size size)
 {
 	iounmap(virt);
 }
@@ -376,30 +376,31 @@
 	u32			width)
 {
 	u32			dummy;
-	void			*virt_addr;
+	void __iospace		*virt_addr;
 	int			iomem = 0;
 
 	if (efi_enabled) {
 		if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) {
-			virt_addr = phys_to_virt(phys_addr);
+			/* HACK ALERT! We can use readb/w/l on real memory too.. */
+			virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 		} else {
 			iomem = 1;
 			virt_addr = ioremap(phys_addr, width);
 		}
 	} else
-		virt_addr = phys_to_virt(phys_addr);
+		virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 	if (!value)
 		value = &dummy;
 
 	switch (width) {
 	case 8:
-		*(u8*) value = *(u8*) virt_addr;
+		*(u8*) value = readb(virt_addr);
 		break;
 	case 16:
-		*(u16*) value = *(u16*) virt_addr;
+		*(u16*) value = readw(virt_addr);
 		break;
 	case 32:
-		*(u32*) value = *(u32*) virt_addr;
+		*(u32*) value = readl(virt_addr);
 		break;
 	default:
 		BUG();
@@ -419,28 +420,29 @@
 	u32			value,
 	u32			width)
 {
-	void			*virt_addr;
+	void __iospace		*virt_addr;
 	int			iomem = 0;
 
 	if (efi_enabled) {
 		if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) {
-			virt_addr = phys_to_virt(phys_addr);
+			/* HACK ALERT! We can use writeb/w/l on real memory too */
+			virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 		} else {
 			iomem = 1;
 			virt_addr = ioremap(phys_addr, width);
 		}
 	} else
-		virt_addr = phys_to_virt(phys_addr);
+		virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 
 	switch (width) {
 	case 8:
-		*(u8*) virt_addr = value;
+		writeb(value, virt_addr);
 		break;
 	case 16:
-		*(u16*) virt_addr = value;
+		writew(value, virt_addr);
 		break;
 	case 32:
-		*(u32*) virt_addr = value;
+		writel(value, virt_addr);
 		break;
 	default:
 		BUG();
@@ -962,7 +964,7 @@
 {
 #if defined(__i386__) || defined(__x86_64__) 
 	char tmp;
-	return !__get_user(tmp, (char *)ptr) && !__get_user(tmp, (char *)ptr + len - 1);
+	return !__get_user(tmp, (char __user *)ptr) && !__get_user(tmp, (char __user *)ptr + len - 1);
 #endif
 	return 1;
 }
===== drivers/serial/8250_pci.c 1.39 vs edited =====
--- 1.39/drivers/serial/8250_pci.c	2004-08-23 01:15:10 -07:00
+++ edited/drivers/serial/8250_pci.c	2004-09-08 15:35:56 -07:00
@@ -83,7 +83,7 @@
 
 struct serial_private {
 	unsigned int		nr;
-	void			*remapped_bar[PCI_NUM_BAR_RESOURCES];
+	void __iospace		*remapped_bar[PCI_NUM_BAR_RESOURCES];
 	struct pci_serial_quirk	*quirk;
 	int			line[0];
 };
@@ -243,7 +243,8 @@
  */
 static int __devinit pci_plx9050_init(struct pci_dev *dev)
 {
-	u8 *p, irq_config;
+	u8 irq_config;
+	void __iospace *p;
 
 	if ((pci_resource_flags(dev, 0) & IORESOURCE_MEM) == 0) {
 		moan_device("no memory in bar 0", dev);
@@ -272,12 +273,12 @@
 	p = ioremap(pci_resource_start(dev, 0), 0x80);
 	if (p == NULL)
 		return -ENOMEM;
-	writel(irq_config, (unsigned long)p + 0x4c);
+	writel(irq_config, p + 0x4c);
 
 	/*
 	 * Read the register back to ensure that it took effect.
 	 */
-	readl((unsigned long)p + 0x4c);
+	readl(p + 0x4c);
 	iounmap(p);
 
 	return 0;
@@ -397,7 +398,8 @@
 
 static int pci_siig10x_init(struct pci_dev *dev)
 {
-	u16 data, *p;
+	u16 data;
+	void __iospace *p;
 
 	switch (dev->device & 0xfff8) {
 	case PCI_DEVICE_ID_SIIG_1S_10x:	/* 1S */
@@ -415,8 +417,8 @@
 	if (p == NULL)
 		return -ENOMEM;
 
-	writew(readw((unsigned long) p + 0x28) & data, (unsigned long) p + 0x28);
-	readw((unsigned long)p + 0x28);
+	writew(readw(p + 0x28) & data, p + 0x28);
+	readw(p + 0x28);
 	iounmap(p);
 	return 0;
 }
===== include/asm/io.h 1.26 vs edited =====
--- 1.26/include/asm-i386/io.h	2004-03-12 01:30:22 -08:00
+++ edited/include/asm/io.h	2004-09-08 11:17:57 -07:00
@@ -2,6 +2,8 @@
 #define _ASM_IO_H
 
 #include <linux/config.h>
+#include <linux/string.h>
+#include <linux/compiler.h>
 
 /*
  * This file contains the definitions for the x86 IO instructions
@@ -86,7 +88,7 @@
  */
 #define page_to_phys(page)    ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
 
-extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
+extern void __iospace * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
 
 /**
  * ioremap     -   map bus memory into CPU space
@@ -100,13 +102,13 @@
  * address. 
  */
 
-static inline void * ioremap (unsigned long offset, unsigned long size)
+static inline void __iospace * ioremap (unsigned long offset, unsigned long size)
 {
 	return __ioremap(offset, size, 0);
 }
 
-extern void * ioremap_nocache (unsigned long offset, unsigned long size);
-extern void iounmap(void *addr);
+extern void __iospace * ioremap_nocache (unsigned long offset, unsigned long size);
+extern void iounmap(void __iospace *addr);
 
 /*
  * bt_ioremap() and bt_iounmap() are for temporary early boot-time
@@ -139,9 +141,18 @@
  * memory location directly.
  */
 
-#define readb(addr) (*(volatile unsigned char *) (addr))
-#define readw(addr) (*(volatile unsigned short *) (addr))
-#define readl(addr) (*(volatile unsigned int *) (addr))
+static inline unsigned char readb(volatile void __iospace *addr)
+{
+	return *(volatile unsigned char __force *) addr;
+}
+static inline unsigned short readw(volatile void __iospace *addr)
+{
+	return *(volatile unsigned short __force *) addr;
+}
+static inline unsigned int readl(volatile void __iospace *addr)
+{
+	return *(volatile unsigned int __force *) addr;
+}
 #define readb_relaxed(addr) readb(addr)
 #define readw_relaxed(addr) readw(addr)
 #define readl_relaxed(addr) readl(addr)
@@ -149,16 +160,34 @@
 #define __raw_readw readw
 #define __raw_readl readl
 
-#define writeb(b,addr) (*(volatile unsigned char *) (addr) = (b))
-#define writew(b,addr) (*(volatile unsigned short *) (addr) = (b))
-#define writel(b,addr) (*(volatile unsigned int *) (addr) = (b))
+static inline void writeb(unsigned char b, volatile void __iospace *addr)
+{
+	*(volatile unsigned char __force *) addr = b;
+}
+static inline void writew(unsigned short b, volatile void __iospace *addr)
+{
+	*(volatile unsigned short __force *) addr = b;
+}
+static inline void writel(unsigned int b, volatile void __iospace *addr)
+{
+	*(volatile unsigned int __force *) addr = b;
+}
 #define __raw_writeb writeb
 #define __raw_writew writew
 #define __raw_writel writel
 
-#define memset_io(a,b,c)	memset((void *)(a),(b),(c))
-#define memcpy_fromio(a,b,c)	__memcpy((a),(void *)(b),(c))
-#define memcpy_toio(a,b,c)	__memcpy((void *)(a),(b),(c))
+static inline void memset_io(volatile void __iospace *addr, unsigned char val, int count)
+{
+	memset((void __force *) addr, val, count);
+}
+static inline void memcpy_fromio(void *dst, volatile void __iospace *src, int count)
+{
+	__memcpy(dst, (void __force *) src, count);
+}
+static inline void memcpy_toio(volatile void __iospace *dst, void *src, int count)
+{
+	__memcpy((void __force *) dst, src, count);
+}
 
 /*
  * ISA space is 'always mapped' on a typical x86 system, no need to
@@ -168,7 +197,7 @@
  * used as the IO-area pointer (it can be iounmapped as well, so the
  * analogy with PCI is quite large):
  */
-#define __ISA_IO_base ((char *)(PAGE_OFFSET))
+#define __ISA_IO_base ((char __iospace *)(PAGE_OFFSET))
 
 #define isa_readb(a) readb(__ISA_IO_base + (a))
 #define isa_readw(a) readw(__ISA_IO_base + (a))
@@ -185,8 +214,8 @@
  * Again, i386 does not require mem IO specific function.
  */
 
-#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void *)(b),(c),(d))
-#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void *)(__ISA_IO_base + (b)),(c),(d))
+#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void __force *)(b),(c),(d))
+#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void __force *)(__ISA_IO_base + (b)),(c),(d))
 
 /**
  *	check_signature		-	find BIOS signatures
@@ -199,7 +228,7 @@
  *	Returns 1 on a match.
  */
  
-static inline int check_signature(unsigned long io_addr,
+static inline int check_signature(volatile void __iospace * io_addr,
 	const unsigned char *signature, int length)
 {
 	int retval = 0;
===== include/acpi/acpiosxf.h 1.34 vs edited =====
--- 1.34/include/acpi/acpiosxf.h	2004-07-16 22:37:53 -07:00
+++ edited/include/acpi/acpiosxf.h	2004-09-08 11:19:20 -07:00
@@ -169,11 +169,11 @@
 acpi_os_map_memory (
 	acpi_physical_address           physical_address,
 	acpi_size                       size,
-	void                            **logical_address);
+	void __iospace                  **logical_address);
 
 void
 acpi_os_unmap_memory (
-	void                            *logical_address,
+	void __iospace                  *logical_address,
 	acpi_size                       size);
 
 acpi_status
===== include/linux/compiler.h 1.29 vs edited =====
--- 1.29/include/linux/compiler.h	2004-08-23 01:14:53 -07:00
+++ edited/include/linux/compiler.h	2004-09-08 10:34:45 -07:00
@@ -6,13 +6,17 @@
 # define __kernel	/* default address space */
 # define __safe		__attribute__((safe))
 # define __force	__attribute__((force))
+# define __iospace	__attribute__((noderef, address_space(2)))
 extern void __chk_user_ptr(void __user *);
+extern void __chk_io_ptr(void __iospace *);
 #else
 # define __user
 # define __kernel
 # define __safe
 # define __force
+# define __iospace
 # define __chk_user_ptr(x) (void)0
+# define __chk_io_ptr(x) (void)0
 #endif
 
 #ifdef __KERNEL__
===== include/asm-i386/io.h 1.26 vs edited =====
--- 1.26/include/asm-i386/io.h	2004-03-12 01:30:22 -08:00
+++ edited/include/asm-i386/io.h	2004-09-08 11:17:57 -07:00
@@ -2,6 +2,8 @@
 #define _ASM_IO_H
 
 #include <linux/config.h>
+#include <linux/string.h>
+#include <linux/compiler.h>
 
 /*
  * This file contains the definitions for the x86 IO instructions
@@ -86,7 +88,7 @@
  */
 #define page_to_phys(page)    ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
 
-extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
+extern void __iospace * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
 
 /**
  * ioremap     -   map bus memory into CPU space
@@ -100,13 +102,13 @@
  * address. 
  */
 
-static inline void * ioremap (unsigned long offset, unsigned long size)
+static inline void __iospace * ioremap (unsigned long offset, unsigned long size)
 {
 	return __ioremap(offset, size, 0);
 }
 
-extern void * ioremap_nocache (unsigned long offset, unsigned long size);
-extern void iounmap(void *addr);
+extern void __iospace * ioremap_nocache (unsigned long offset, unsigned long size);
+extern void iounmap(void __iospace *addr);
 
 /*
  * bt_ioremap() and bt_iounmap() are for temporary early boot-time
@@ -139,9 +141,18 @@
  * memory location directly.
  */
 
-#define readb(addr) (*(volatile unsigned char *) (addr))
-#define readw(addr) (*(volatile unsigned short *) (addr))
-#define readl(addr) (*(volatile unsigned int *) (addr))
+static inline unsigned char readb(volatile void __iospace *addr)
+{
+	return *(volatile unsigned char __force *) addr;
+}
+static inline unsigned short readw(volatile void __iospace *addr)
+{
+	return *(volatile unsigned short __force *) addr;
+}
+static inline unsigned int readl(volatile void __iospace *addr)
+{
+	return *(volatile unsigned int __force *) addr;
+}
 #define readb_relaxed(addr) readb(addr)
 #define readw_relaxed(addr) readw(addr)
 #define readl_relaxed(addr) readl(addr)
@@ -149,16 +160,34 @@
 #define __raw_readw readw
 #define __raw_readl readl
 
-#define writeb(b,addr) (*(volatile unsigned char *) (addr) = (b))
-#define writew(b,addr) (*(volatile unsigned short *) (addr) = (b))
-#define writel(b,addr) (*(volatile unsigned int *) (addr) = (b))
+static inline void writeb(unsigned char b, volatile void __iospace *addr)
+{
+	*(volatile unsigned char __force *) addr = b;
+}
+static inline void writew(unsigned short b, volatile void __iospace *addr)
+{
+	*(volatile unsigned short __force *) addr = b;
+}
+static inline void writel(unsigned int b, volatile void __iospace *addr)
+{
+	*(volatile unsigned int __force *) addr = b;
+}
 #define __raw_writeb writeb
 #define __raw_writew writew
 #define __raw_writel writel
 
-#define memset_io(a,b,c)	memset((void *)(a),(b),(c))
-#define memcpy_fromio(a,b,c)	__memcpy((a),(void *)(b),(c))
-#define memcpy_toio(a,b,c)	__memcpy((void *)(a),(b),(c))
+static inline void memset_io(volatile void __iospace *addr, unsigned char val, int count)
+{
+	memset((void __force *) addr, val, count);
+}
+static inline void memcpy_fromio(void *dst, volatile void __iospace *src, int count)
+{
+	__memcpy(dst, (void __force *) src, count);
+}
+static inline void memcpy_toio(volatile void __iospace *dst, void *src, int count)
+{
+	__memcpy((void __force *) dst, src, count);
+}
 
 /*
  * ISA space is 'always mapped' on a typical x86 system, no need to
@@ -168,7 +197,7 @@
  * used as the IO-area pointer (it can be iounmapped as well, so the
  * analogy with PCI is quite large):
  */
-#define __ISA_IO_base ((char *)(PAGE_OFFSET))
+#define __ISA_IO_base ((char __iospace *)(PAGE_OFFSET))
 
 #define isa_readb(a) readb(__ISA_IO_base + (a))
 #define isa_readw(a) readw(__ISA_IO_base + (a))
@@ -185,8 +214,8 @@
  * Again, i386 does not require mem IO specific function.
  */
 
-#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void *)(b),(c),(d))
-#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void *)(__ISA_IO_base + (b)),(c),(d))
+#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void __force *)(b),(c),(d))
+#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void __force *)(__ISA_IO_base + (b)),(c),(d))
 
 /**
  *	check_signature		-	find BIOS signatures
@@ -199,7 +228,7 @@
  *	Returns 1 on a match.
  */
  
-static inline int check_signature(unsigned long io_addr,
+static inline int check_signature(volatile void __iospace * io_addr,
 	const unsigned char *signature, int length)
 {
 	int retval = 0;
===== include/asm-generic/ide_iops.h 1.1 vs edited =====
--- 1.1/include/asm-generic/ide_iops.h	2003-02-18 06:31:01 -08:00
+++ edited/include/asm-generic/ide_iops.h	2004-09-08 15:03:14 -07:00
@@ -5,7 +5,7 @@
 #define __ide_outsw	outsw
 #define __ide_outsl	outsl
 
-static __inline__ void __ide_mm_insw(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_insw(void __iospace *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u16 *)addr = readw(port);
@@ -13,7 +13,7 @@
 	}
 }
 
-static __inline__ void __ide_mm_insl(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_insl(void __iospace *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u32 *)addr = readl(port);
@@ -21,7 +21,7 @@
 	}
 }
 
-static __inline__ void __ide_mm_outsw(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_outsw(void __iospace *port, void *addr, u32 count)
 {
 	while (count--) {
 		writew(*(u16 *)addr, port);
@@ -29,7 +29,7 @@
 	}
 }
 
-static __inline__ void __ide_mm_outsl(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_outsl(void __iospace * port, void *addr, u32 count)
 {
 	while (count--) {
 		writel(*(u32 *)addr, port);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
@ 2004-09-08 23:07 ` David S. Miller
  2004-09-08 23:25   ` Linus Torvalds
                     ` (2 more replies)
  2004-09-09  6:23 ` David Woodhouse
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 63+ messages in thread
From: David S. Miller @ 2004-09-08 23:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, akpm, linux-arch, alan


I'm all for this except a possible naming suggestion that
the thing should be called "iomemspace" not plain "iospace".

We already use similar naming distinctions elsewhere in the tree
and even in user visible things such as /proc/ioport /proc/iomem

I wish we could make it fail with the compiler too somehow, that
way people other than us sparse geeks would see and fix this stuff
up.

Anyways, I like this a lot.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 23:07 ` David S. Miller
@ 2004-09-08 23:25   ` Linus Torvalds
  2004-09-09  1:19   ` Linus Torvalds
  2004-09-09  8:27   ` Geert Uytterhoeven
  2 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-08 23:25 UTC (permalink / raw)
  To: David S. Miller; +Cc: viro, akpm, linux-arch, alan



On Wed, 8 Sep 2004, David S. Miller wrote:
> 
> I'm all for this except a possible naming suggestion that
> the thing should be called "iomemspace" not plain "iospace".

I was initially going to call it just "__io", but decided that somebody
may well already be using that as a name somewhere (and a quick grep
indeed confirmed that). Thus "__iospace".

But "__iomem" would make sense. I don't think we need both "mem" and
"space", and I agree that we already have used the term "iomem" for this
thing. Will rename.

Anyway, I've already got two fairly enthusiastic votes for the concept,
but I'll give people some more time to think it over.

This would make our "warning statistics" look bad (but see earlier email
on why I think it's entirely safe even in a stable series otherwise),
which I'm a bit concerned about. But there's absolutely no way to avoid
that - anything in drivers will inevitably take a minimum of a few kernel
releases, and possibly _much_ longer before people get upset enough to fix
the warnings in random driver X.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 23:07 ` David S. Miller
  2004-09-08 23:25   ` Linus Torvalds
@ 2004-09-09  1:19   ` Linus Torvalds
  2004-09-09  4:36     ` David S. Miller
  2004-09-09  5:04     ` viro
  2004-09-09  8:27   ` Geert Uytterhoeven
  2 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-09  1:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: viro, akpm, linux-arch, alan



On Wed, 8 Sep 2004, David S. Miller wrote:
> 
> I wish we could make it fail with the compiler too somehow, that
> way people other than us sparse geeks would see and fix this stuff
> up.

I don't want to break a stable release too much. As-is, the _code_ 
generated should be 100% the same, even if gcc complains. So afaik, my 
patch is very safe indeed, which I like.

So it can cause gcc to complain for non-converted code ("xxx makes pointer
from integer without a cast"), but since any such code ended up converting
it to a pointer _with_ a cast anyway, I really believe code-generation is 
safe.

In contrast, to actually make gcc complain about the types of these things 
would not only be disruptive, it would require some _major_ change to how 
we do things. Right now we assume that "ioremap()" generates an 
offsettable "thing", and that really means that it must be of one of three 
types:
 - "unsigned long" (obviously type-unsafe, and would be totally pointless)
 - "void *" or "char *" (this is basically what we used to have).

Neither is in any way type-safe, nor can we make gcc complain without
breaking the "offsettable" requirement. The only way to make gcc complain
loudly would be to use something that can't be converted to anything else
(eg a "struct iomem_opaque { unsigned long cookie; }") but that would
break every single user out there, and require us to do the off-setting at
a much lower level (passing along both cookie and offset separately all
the way down the callchain - horribly bad in many ways).

In contrast, using sparse means that we can still generate exactly the
same code, have all old code still work (albeit with the occasional
potential warning from gcc) _and_ have a fairly easy way to check.

In fact, even modern external modules that work with 2.6.x kernels are 
quite checkable. I just verified that using "make C=1" works perfectly 
fine on an external module tree). So yes, it's just for "us sparse geeks", 
but sparse is to easy to set up and use that I don't think that's a huge 
downside:

	make ; make install ; "use 'C=1' when compiling the kernel"

it can't get much simpler than that.

(I think the biggest problem is _finding_ sparse, but just googling for 
"linux sparse tar-ball" shows at least two half-way relevant hits in the 
top five, so..)

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  1:19   ` Linus Torvalds
@ 2004-09-09  4:36     ` David S. Miller
  2004-09-09  5:56       ` Richard Henderson
  2004-09-09  5:04     ` viro
  1 sibling, 1 reply; 63+ messages in thread
From: David S. Miller @ 2004-09-09  4:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, akpm, linux-arch, alan

On Wed, 8 Sep 2004 18:19:08 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> On Wed, 8 Sep 2004, David S. Miller wrote:
> > 
> > I wish we could make it fail with the compiler too somehow, that
> > way people other than us sparse geeks would see and fix this stuff
> > up.
> 
> I don't want to break a stable release too much. As-is, the _code_ 
> generated should be 100% the same, even if gcc complains. So afaik, my 
> patch is very safe indeed, which I like.

Agreed.  I wish GCC had a way to define a type which is offsetable
yet not an "int".  Surprise surprise, that works out to be your
address space thing in sparse. :)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  1:19   ` Linus Torvalds
  2004-09-09  4:36     ` David S. Miller
@ 2004-09-09  5:04     ` viro
  2004-09-09  5:05       ` David S. Miller
  2004-09-09  5:13       ` Linus Torvalds
  1 sibling, 2 replies; 63+ messages in thread
From: viro @ 2004-09-09  5:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, akpm, linux-arch, alan

On Wed, Sep 08, 2004 at 06:19:08PM -0700, Linus Torvalds wrote:
> (I think the biggest problem is _finding_ sparse, but just googling for 
> "linux sparse tar-ball" shows at least two half-way relevant hits in the 
> top five, so..)

The obvious thing to do would be to put snapshots on kernel.org, though...

One note on patch: I would wrap the definition of __iospace into an
ifndef.  Rationale: that allows to separate the work on these annotations
from the rest.

IOW, I'd love to be able to say
make C=1 CHECKFLAGS="-Dno_check_iospace -Wbitwise"
and get endianness warnings alone (and be able to do _just_ iospace checks
without other experimental stuff).

Once the level of noise from that area is down to something sane, we can
make the thing unconditional and basically include it into the baseline
set of checks.  Comments?

BTW, I've got the beginning of endianness annotations carved and mergable;
patches in an hour or two...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  5:04     ` viro
@ 2004-09-09  5:05       ` David S. Miller
  2004-09-09  5:13       ` Linus Torvalds
  1 sibling, 0 replies; 63+ messages in thread
From: David S. Miller @ 2004-09-09  5:05 UTC (permalink / raw)
  To: viro; +Cc: torvalds, akpm, linux-arch, alan

On Thu, 9 Sep 2004 06:04:10 +0100
viro@parcelfarce.linux.theplanet.co.uk wrote:

> On Wed, Sep 08, 2004 at 06:19:08PM -0700, Linus Torvalds wrote:
> > (I think the biggest problem is _finding_ sparse, but just googling for 
> > "linux sparse tar-ball" shows at least two half-way relevant hits in the 
> > top five, so..)
> 
> The obvious thing to do would be to put snapshots on kernel.org, though...

It's not like the thing is huge.  I'd personally ship it in the
kernel tree itself, why not?  Do you not want to mix up the
licenses or the source history Linus?

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  5:04     ` viro
  2004-09-09  5:05       ` David S. Miller
@ 2004-09-09  5:13       ` Linus Torvalds
  2004-09-09  6:08         ` viro
  1 sibling, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2004-09-09  5:13 UTC (permalink / raw)
  To: viro; +Cc: David S. Miller, akpm, linux-arch, alan



On Thu, 9 Sep 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> The obvious thing to do would be to put snapshots on kernel.org, though...

I could do that, I guess. It's calm enough these days that I wouldn't feel 
like I have to update it all the time.

> One note on patch: I would wrap the definition of __iospace into an
> ifndef.  Rationale: that allows to separate the work on these annotations
> from the rest.

Yes, I was thinking of having some way to disable these things. I think
-Wbitwise was good, it would make sense to be able to do similar things
for address-spaces too. Either on a source level with changing __iospace
(aka __iomem), or having a sparse option to allow certain spaces to alias 
each other.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  4:36     ` David S. Miller
@ 2004-09-09  5:56       ` Richard Henderson
  0 siblings, 0 replies; 63+ messages in thread
From: Richard Henderson @ 2004-09-09  5:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linus Torvalds, viro, akpm, linux-arch, alan

On Wed, Sep 08, 2004 at 09:36:58PM -0700, David S. Miller wrote:
> Agreed.  I wish GCC had a way to define a type which is offsetable
> yet not an "int".

Some people might claim you want to be programming in Ada.  ;-)


r~

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  5:13       ` Linus Torvalds
@ 2004-09-09  6:08         ` viro
  0 siblings, 0 replies; 63+ messages in thread
From: viro @ 2004-09-09  6:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, akpm, linux-arch, alan

On Wed, Sep 08, 2004 at 10:13:45PM -0700, Linus Torvalds wrote:
> Yes, I was thinking of having some way to disable these things. I think
> -Wbitwise was good, it would make sense to be able to do similar things
> for address-spaces too. Either on a source level with changing __iospace
> (aka __iomem), or having a sparse option to allow certain spaces to alias 
> each other.

BTW, is there any reason to have cpu_to_le32p() and its ilk as macros instead
of inlined functions?  If there isn't, I'd gladly lose the perversions I'm
currently using to deal with type checks and let the normal logics take care
of that...

Unlike cpu_to_...() these guys are obviously never constant, so that objection
does not apply.  AFAICS, all callers of that bunch would be OK with such
change, so unless there are dark and nasty arch-specific reasons, I'd rather
turn them into inlines...

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
  2004-09-08 23:07 ` David S. Miller
@ 2004-09-09  6:23 ` David Woodhouse
  2004-09-09 13:14   ` Alan Cox
  2004-09-11  6:09 ` Linus Torvalds
  2004-09-15 15:03 ` Linus Torvalds
  3 siblings, 1 reply; 63+ messages in thread
From: David Woodhouse @ 2004-09-09  6:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller

On Wed, 2004-09-08 at 15:57 -0700, Linus Torvalds wrote:
>  And while I've tried out a compile on my normal x86 setup and could
> fix all of the gcc-visible things, this may be disruptive enough that
> people object to doing something like this.
> 
> Comments? 

Just do it. It's currently a complete mess -- and it's already in need
of fixing on those non-i386 platforms where write[bwl] had to be
functions anyway. Bringing i386 into line would be great.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 23:07 ` David S. Miller
  2004-09-08 23:25   ` Linus Torvalds
  2004-09-09  1:19   ` Linus Torvalds
@ 2004-09-09  8:27   ` Geert Uytterhoeven
  2 siblings, 0 replies; 63+ messages in thread
From: Geert Uytterhoeven @ 2004-09-09  8:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linus Torvalds, viro, Andrew Morton, linux-arch, Alan Cox

On Wed, 8 Sep 2004, David S. Miller wrote:
> Anyways, I like this a lot.

/me too!

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  6:23 ` David Woodhouse
@ 2004-09-09 13:14   ` Alan Cox
  0 siblings, 0 replies; 63+ messages in thread
From: Alan Cox @ 2004-09-09 13:14 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Linux Arch list,
	David S. Miller

On Iau, 2004-09-09 at 07:23, David Woodhouse wrote:
> Just do it. It's currently a complete mess -- and it's already in need
> of fixing on those non-i386 platforms where write[bwl] had to be
> functions anyway. Bringing i386 into line would be great.

I'd say "Just do it - but not in a -rc" 8)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
  2004-09-08 23:07 ` David S. Miller
  2004-09-09  6:23 ` David Woodhouse
@ 2004-09-11  6:09 ` Linus Torvalds
  2004-09-11  6:42   ` Anton Blanchard
                     ` (2 more replies)
  2004-09-15 15:03 ` Linus Torvalds
  3 siblings, 3 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-11  6:09 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik, Benjamin Herrenschmidt


Ok, I've committed the first batch of these things. It should all compile
to exactly the same thing, and I've tried to remove the warnings gcc
generates for the x86/ppc64 configurations I use.

It does generate more warnings than I'd like from gcc. As mentioned, they 
should all be "harmless" in the sense that it doesn't affect code 
generation, but clearly it would be good to try to fix up the worst cases. 
Hint hint.

The sparse ones are a different matter, though: I've removed some of them.
Quite often it's just a matter of declaring the "mmio_base" pointer with
the right type - a one-line change in ieee1394 made hundreds of warnings 
just go away.

But other times there are drivers that seem to take a perverse pleasure in 
mixing types for IOMEM accesses, and sometims mixes the data structure 
between containing IOMEM and IOPORT addresses (one is now "void __iomem 
*", the other one has always been an integer). libata springs to mind, 
where it really looks like it might be better to separate out the "mmio" 
addresses from the "pio" addresses.

Doing the conversion was usually trivial, but sometimes showed some bugs. 
For example, the ATI radeon fbcon driver is clearly horribly buggy:

	drivers/video/aty/radeon_base.c:1725:42: warning: incorrect type in argument 2 (different address spaces)
	drivers/video/aty/radeon_base.c:1725:42:    expected void const *from
	drivers/video/aty/radeon_base.c:1725:42:    got void [noderef] *[assigned] base_addr<asn:2>
	drivers/video/aty/radeon_base.c:1757:39: warning: incorrect type in argument 1 (different address spaces)
	drivers/video/aty/radeon_base.c:1757:39:    expected void *to
	drivers/video/aty/radeon_base.c:1757:39:    got void [noderef] *[assigned] base_addr<asn:2>

comes from it doing

	count -= copy_to_user(buf, base_addr+p, count);
	..
	count -= copy_from_user(base_addr+p, buf, count);

where we are copying memory-mapped PCI space to/from user space. THIS IS
REALLY REALLY WRONG! It's especially wrong as it can cause things like
cache control and prefetch instructions to be used on PCI space, which
migth well trigger serious bugs. That's ignoring the fact that on some 
architectures the PCI IO space isn't even accessible that way in the first 
place. At least not at the address that is the "PCI base address"..

Anyway, last I remember, AMD machines could do strange things if you did 
prefecthing on IO space. So this is not relegated to "weird 
architectures".

One such weird architecture is PPC64, which is why that one does this:

	drivers/video/aty/radeon_base.c:2249:17: warning: incorrect type in assignment (different base types)
	drivers/video/aty/radeon_base.c:2249:17:    expected void [noderef] *fb_base<asn:2>
	drivers/video/aty/radeon_base.c:2249:17:    got unsigned long 

from:

		/* Argh. Scary arch !!! */
	#ifdef CONFIG_PPC64
		rinfo->fb_base = IO_TOKEN_TO_ADDR(rinfo->fb_base);
	#endif

which just "pre-translates" the address. Never mind that this is 
conceptually wrong, it means that we later will "iounmap()" basically a 
random address. Whee.

I think the real fix is to write a "memcpy_fromio_touser()" and it's 
reverse brother, just using a simple readbwl/writebwl loop with the proper 
__get_user/__put_user. That would likely make the ppc64 horrible "Scary 
arch" thing also go away.

Benh: while doign the ppc64 thing, I also encountered something clearly
broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
the stupid thing (it was wrong regardless of which way the preceding
branch went: it would either call iounmap() twice on the same area, or it
would get a NULL pointer dereference). Can you check what that is 
_supposed_ to be doing?

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:09 ` Linus Torvalds
@ 2004-09-11  6:42   ` Anton Blanchard
  2004-09-11  7:26     ` Benjamin Herrenschmidt
  2004-09-11  7:29     ` Geert Uytterhoeven
  2004-09-11  7:23   ` Benjamin Herrenschmidt
  2004-09-11 14:42   ` Alan Cox
  2 siblings, 2 replies; 63+ messages in thread
From: Anton Blanchard @ 2004-09-11  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik, Benjamin Herrenschmidt

 
> Benh: while doign the ppc64 thing, I also encountered something clearly
> broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
> the stupid thing (it was wrong regardless of which way the preceding
> branch went: it would either call iounmap() twice on the same area, or it
> would get a NULL pointer dereference). Can you check what that is 
> _supposed_ to be doing?

Yuck, that change was made in preparation for hotplug removal of host
bridges and is obviously broken.

BTW One of our problems is how drivers can tap IO addresses without doing
something to map them first. We end up doing the ioremap of IO space
during PCI init.

The problem comes when we have to remove a host bridge from a partition
and must have all IO and memory resources unmapped. Memory resources
take care of themselves since drivers iounmap those resources. IO
resources do not.

Anton

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:09 ` Linus Torvalds
  2004-09-11  6:42   ` Anton Blanchard
@ 2004-09-11  7:23   ` Benjamin Herrenschmidt
  2004-09-11 14:42   ` Alan Cox
  2 siblings, 0 replies; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-11  7:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik

On Sat, 2004-09-11 at 16:09, Linus Torvalds wrote:

> 
> Doing the conversion was usually trivial, but sometimes showed some bugs. 
> For example, the ATI radeon fbcon driver is clearly horribly buggy:
> 
> 	drivers/video/aty/radeon_base.c:1725:42: warning: incorrect type in argument 2 (different address spaces)
> 	drivers/video/aty/radeon_base.c:1725:42:    expected void const *from
> 	drivers/video/aty/radeon_base.c:1725:42:    got void [noderef] *[assigned] base_addr<asn:2>
> 	drivers/video/aty/radeon_base.c:1757:39: warning: incorrect type in argument 1 (different address spaces)
> 	drivers/video/aty/radeon_base.c:1757:39:    expected void *to
> 	drivers/video/aty/radeon_base.c:1757:39:    got void [noderef] *[assigned] base_addr<asn:2>
> 
> comes from it doing
> 
> 	count -= copy_to_user(buf, base_addr+p, count);
> 	..
> 	count -= copy_from_user(base_addr+p, buf, count);
> 
> where we are copying memory-mapped PCI space to/from user space. THIS IS
> REALLY REALLY WRONG! It's especially wrong as it can cause things like
> cache control and prefetch instructions to be used on PCI space, which
> migth well trigger serious bugs. That's ignoring the fact that on some 
> architectures the PCI IO space isn't even accessible that way in the first 
> place. At least not at the address that is the "PCI base address"..
> 
I know... I want to get rid of this asap. It actually comes from the
generic fbdev code (though that one has been fixed since then), the
reason radeonfb had its own copies was because of the ioremap limit
that it has... I need to fix all that asap, though hopefully, almost
nobody uses fb read/write ops anyway

Anyway, last I remember, AMD machines could do strange things if you did 
> prefecthing on IO space. So this is not relegated to "weird 
> architectures".
> 
> One such weird architecture is PPC64, which is why that one does this:
> 
> 	drivers/video/aty/radeon_base.c:2249:17: warning: incorrect type in assignment (different base types)
> 	drivers/video/aty/radeon_base.c:2249:17:    expected void [noderef] *fb_base<asn:2>
> 	drivers/video/aty/radeon_base.c:2249:17:    got unsigned long 
> 
> from:
> 
> 		/* Argh. Scary arch !!! */
> 	#ifdef CONFIG_PPC64
> 		rinfo->fb_base = IO_TOKEN_TO_ADDR(rinfo->fb_base);
> 	#endif
> 
> which just "pre-translates" the address. Never mind that this is 
> conceptually wrong, it means that we later will "iounmap()" basically a 
> random address. Whee.

And it is sort-of bullshit too anyways. Unless we end up with eeh tokens
in the top bits of fb_base which I doubt will ever happen. It's that
sort of stuff taht should probably be removed by now. I think we added
it bcs fbdev can access the ioremap'ed memory bypassing readW/writeX,
but in the case of the fbdev, this is not neededd anyway.

> I think the real fix is to write a "memcpy_fromio_touser()" and it's 
> reverse brother, just using a simple readbwl/writebwl loop with the proper 
> __get_user/__put_user. That would likely make the ppc64 horrible "Scary 
> arch" thing also go away.
> 
> Benh: while doign the ppc64 thing, I also encountered something clearly
> broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
> the stupid thing (it was wrong regardless of which way the preceding
> branch went: it would either call iounmap() twice on the same area, or it
> would get a NULL pointer dereference). Can you check what that is 
> _supposed_ to be doing?
> 

It's supposed to unmap PHB IO space, I have to double check. I want
to rewrite this all anyway, we have this "speacial" allocator for
ioremap space separate from the real vmalloc one for reasons that
are no longer valid (if they ever were) imho and I want to remove that,
I'm just too busy at the moment to do it. At this point, I'm thinking
of trashing the _explicit variants of ioremap. They are used to create
a contiguous area containing all PCI IO spaces of PHBs appende to each
other. We still need to do this but can probably do it in a cleaner
way.

Ben. 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:42   ` Anton Blanchard
@ 2004-09-11  7:26     ` Benjamin Herrenschmidt
  2004-09-11  7:29     ` Geert Uytterhoeven
  1 sibling, 0 replies; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-11  7:26 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik


> 
> The problem comes when we have to remove a host bridge from a partition
> and must have all IO and memory resources unmapped. Memory resources
> take care of themselves since drivers iounmap those resources. IO
> resources do not.

I've been thinking about changing that to normal ioremap's like ppc32,
relying on normal pointer arithmetic to get us the right address, but
that would only work if all drivers properly stored their IO port "base"
in a long, not an int, which might not be the case...

As I wrote earlier, I also want to get rid of the imalloc stuff ....

Ben.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:42   ` Anton Blanchard
  2004-09-11  7:26     ` Benjamin Herrenschmidt
@ 2004-09-11  7:29     ` Geert Uytterhoeven
  1 sibling, 0 replies; 63+ messages in thread
From: Geert Uytterhoeven @ 2004-09-11  7:29 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik, Benjamin Herrenschmidt

On Sat, 11 Sep 2004, Anton Blanchard wrote:
> > Benh: while doign the ppc64 thing, I also encountered something clearly
> > broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
> > the stupid thing (it was wrong regardless of which way the preceding
> > branch went: it would either call iounmap() twice on the same area, or it
> > would get a NULL pointer dereference). Can you check what that is
> > _supposed_ to be doing?
>
> Yuck, that change was made in preparation for hotplug removal of host
> bridges and is obviously broken.
>
> BTW One of our problems is how drivers can tap IO addresses without doing
> something to map them first. We end up doing the ioremap of IO space
> during PCI init.
>
> The problem comes when we have to remove a host bridge from a partition
> and must have all IO and memory resources unmapped. Memory resources
> take care of themselves since drivers iounmap those resources. IO
> resources do not.

Add ioioremap()?

(or more painful for maintenance but conceptually cleaner:
s/ioremap/iomemremap/g and add ioremap()?)

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:09 ` Linus Torvalds
  2004-09-11  6:42   ` Anton Blanchard
  2004-09-11  7:23   ` Benjamin Herrenschmidt
@ 2004-09-11 14:42   ` Alan Cox
  2 siblings, 0 replies; 63+ messages in thread
From: Alan Cox @ 2004-09-11 14:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, David S. Miller,
	Jeff Garzik, Benjamin Herrenschmidt

On Sad, 2004-09-11 at 07:09, Linus Torvalds wrote:
> Anyway, last I remember, AMD machines could do strange things if you did 
> prefecthing on IO space. So this is not relegated to "weird 
> architectures".

Correct: 
Early Athlon cache corrupts for prefetch() of non prefetchable

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
                   ` (2 preceding siblings ...)
  2004-09-11  6:09 ` Linus Torvalds
@ 2004-09-15 15:03 ` Linus Torvalds
  2004-09-15 19:02   ` Geert Uytterhoeven
  3 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2004-09-15 15:03 UTC (permalink / raw)
  To: Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik


Some of you guys may have noticed that the anal typechecking turned out to 
cause another IO interface to be grown.

Some drivers - notably SATA and a couple of network drivers - mixed PIO 
adn MMIO addresses on purpose, because they drive hardware that literally 
uses one or the other. Sometimes it's a compile option that just #defines 
'writel()' to 'inl()', sometimes it's a runtime decision depending on the 
hardware or configuration.

The anal typechecking obviously ended up being very unhappy about this, 
since it wants "void __iomem *" for MMIO pointers, and a normal "unsigned 
long" for PIO accesses.

Rather than scrapping the typechecking, or requiring drivers to do strange
and nasty casts all over the place, there's a new interface in town. It's
called "iomap", because it extends the old "ioremap()" interface to work
on the PIO accesses too. That way, the drivers that really want to do both
can very naturally do it.

Nothing currently uses it, although Jeff has patches for SATA for testing 
(and they clean up the code quite noticeably, never mind getting rid of 
the warnings).  The interface has been implemented by yours truly for x86 
and ppc64, and David did a first-pass version for sparc64 too (missing the 
"xxxx_rep()" functions that were added a bit later, I believe).

So far experience seems to show that it's a very natural interface for
most non-x86 hardware - they all tend to map in both PIO and MMIO into one
address space _anyway_, so the two aren't really any different. It's
mainly just x86 and it's ilk that actually have two different interfaces
for the two kinds of PCI accesses, and at least in that case it's trivial
to encode the difference in the virtual ioremap pointer.

The best way to explain the interface is to just point you guys at the
<asm-generic/iomap.h> file, which isn't very big, has about as much
comments than code, and contains nothing but the necessary function
declarations. The actual meaning of the functions should be pretty
obvious even without the comments.

You can use the <asm-generic/iomap.h> file for declarations even if you
don't use the actual "generic" (x86-ilk) implementation itself - that's
what ppc64 does, for example. Or you could just add the stuff to your own
<asm/io.h>, like davem did for sparc64.

Nothing uses the interface yet, but it's likely that a couple of drivers
will be moved over to it immediately after 2.6.9.  They may not matter for
your architecture, but please check out the new interfaces anyway. It's
likely about ~100 lines of code, although I have to admit that the IO 
access functions tend to be some of the densest and nastiest code around.

For examples, look at lib/iomap.c (x86-like "dual address space") or
arch/ppc64/kernel/eeh.c (Look for "Here comes the EEH implementation of
the IOMAP interfaces"). Or the sparc64 ones in include/asm-sparc64/io.h.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 15:03 ` Linus Torvalds
@ 2004-09-15 19:02   ` Geert Uytterhoeven
  2004-09-15 19:16     ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Geert Uytterhoeven @ 2004-09-15 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Wed, 15 Sep 2004, Linus Torvalds wrote:
> Rather than scrapping the typechecking, or requiring drivers to do strange
> and nasty casts all over the place, there's a new interface in town. It's
> called "iomap", because it extends the old "ioremap()" interface to work
> on the PIO accesses too. That way, the drivers that really want to do both
> can very naturally do it.
>
> Nothing currently uses it, although Jeff has patches for SATA for testing

Before people really start using this: what's the advantage of this scheme
compared to e.g. dev->busops->in8(), which some of us have been waiting for
since a long time?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:02   ` Geert Uytterhoeven
@ 2004-09-15 19:16     ` Linus Torvalds
  2004-09-15 19:40       ` Matthew Wilcox
                         ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-15 19:16 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Geert Uytterhoeven wrote:
> 
> Before people really start using this: what's the advantage of this scheme
> compared to e.g. dev->busops->in8(), which some of us have been waiting for
> since a long time?

Ehh, exactly why would you wait for that? 

What this new interface is doing is to slowlt move "ioremap()" users over
to "pci_iomap()" (and by extension, maybe "sbus_ioremap()" etc if somebody
ever starts doing that too), which allows you to do the bus- and device-
specific translation just _once_.

Doing it at every IO op (aka "dev->busops->xxx()") is a complete waste of 
time, and extremely stupid, in my not-so-humble opinion. Let me count the 
ways:

 - it's more efficient to set up the mapping _once_. That's especailly 
   true since the mapping may need to do things like page table etc
   accesses, so you HAVE to prepare it anyway (ie doing it at IO time
   would be a bug)

 - doing it on a dev or bus-centric setup is totally idiotic, because it 
   makes it effectively impossible to share one chip-driver across X
   different buses: your scheme would have ten different "dev"s pointing 
   to their own ops. The pci_iomap() has _one_ mapping scheme per bus 
   type, and then you just pass in the cookie to "ioread()" and friends.

 - we've tried it in some circumstances (IDE driver and PCI config space 
   accesses if I recall correctly, which I obviously might not do), and it
   has been a total failure. 

In other words, I hereby proclaim that anybody who does "dev->busops" is a 
total idiot.

Of course, you're now free to try to change my mind. Some people that are
idiots today might well be the revolutionary thinkers of tomorrow. I'm
nothing if not flexible.

		Linus "Yoga of the Mind" Torvalds

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:16     ` Linus Torvalds
@ 2004-09-15 19:40       ` Matthew Wilcox
  2004-09-15 20:10         ` Linus Torvalds
  2004-09-15 20:20       ` Russell King
  2004-09-15 22:38       ` James Bottomley
  2 siblings, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2004-09-15 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 12:16:35PM -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Geert Uytterhoeven wrote:
> > Before people really start using this: what's the advantage of this scheme
> > compared to e.g. dev->busops->in8(), which some of us have been waiting for
> > since a long time?
> 
> Ehh, exactly why would you wait for that? 

I'm no fan of bus->ops->in8, but the ioread8() interface is basically
impossible for some architectures.  There's just not enough information
in a single void * cookie to make this work; not without screwing the
speed of readb() down the the speed of inb().

For example, on some of the older PA machines, we can have an EISA
controller and multiple PCI controllers.  readb() is great, it's all
mapped into a flat memory space, and there are no problems.  But inb()
is a royal PITA.  First, we have to decode the port number to figure
out which controller it's under, then (if it's EISA) swizzle the bits in
the port number to get to an address, but if it's PCI, we have to take a
spinlock, write the port number into a register of the right controller,
read the data back from a different register (which is what triggers
the io port cycle to be generated on the PCI bus) and unlock the spinlock.

ioport_map() looks great for the EISA controller because it lets us
pre-map the port number to an address.  But it'd actually be deadly
if the device has more than 8 consecutive IO ports [1].  For the PCI
controllers, it's a complete misery.  We could do something horrid like
assuming no IO is less than 2GB and returning a cookie from ioport_map()
that then gets decoded into somethign similar to the current dance,
but that's really not pleasant.

I'm told SH is in even worse state than PA, but I haven't looked at
that code.

Could we have a 'dev' parameter to ioread8()?  Please?


[1] The implementation is:
        if (port & 0x300) {
                return 0xfc000000 | ((port & 0xfc00) >> 6)
                        | ((port & 0x3f8) << 9) | (port & 7);
        } else {
                return 0xfc000000 | port;
        }

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:40       ` Matthew Wilcox
@ 2004-09-15 20:10         ` Linus Torvalds
  2004-09-15 20:17           ` Linus Torvalds
  2004-09-16 12:17           ` David Woodhouse
  0 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Matthew Wilcox wrote:

> Could we have a 'dev' parameter to ioread8()?  Please?

Nope. See earlier email on why. There is no "dev" any more at the point of
a generic chipset handler: the "dev" may have any of millions of possible 
types (well, ok, a few different bus architectures, but still).

I seriously doubt that you don't have space to encode the same information 
that was already there before in a pointer. David was kicking and 
screaming too, until Jeff pointed out that it was all in the same address 
space, and then he said "oh", and suddenly it was trivial.

> [1] The implementation is:
>         if (port & 0x300) {
>                 return 0xfc000000 | ((port & 0xfc00) >> 6)
>                         | ((port & 0x3f8) << 9) | (port & 7);
>         } else {
>                 return 0xfc000000 | port;
>         }

So? Without knowing the details of how stupid the PA-RISC IO mappings are, 
the above already makes me suspect that you have one memory space, and the 
0xfc000000 area is used for IO. Goodie. So now you make

	/* We've hard-mapped the IO region at 0xfc000000 */
	void __iomem *ioport_map(unsigned long port)
	{
		return (void __iomem *) (port | 0xfc000000);
	}

and then in ioread8() you do

	unsigned int ioread8(void __iomem *addr)
	{
		unsigned long bits = (unsigned long __force) addr;
		if ((bits & 0xfc000000) == 0xfc000000)
			addr = io_port_swizle_bits(bits & 0x3ffffff);
		/* MMIO has already been mapped */
		return __raw_readb(addr);
	}

with "io_mem_read()" doing the old bit swizzle, exactly like before.

The _only_ thing you need is an _virtual_ area of the cookie space (aka 
"void __iomem *") that you can just test. It clearly exists, because your 
PCI code does "ioremap()", so _any_ memory area that cannot be reached 
with ioremap() is by default something you can use as a flag for it being 
PIO. Then you do all the same things that you did before.

Is it uglier than sane architectures? Sure. Some architectures do 
_everything_ at pci_iomap() time, and then they just do a simple 
special-memory dereference that is common to both PIO and MMIO code in 
"ioreadXX()".

The fact that PARISC did their PCI translation wrong is painful for you.  
That doesn't mean that Linux should do the same mistakes the PARISC system
designers made.

Think of the positive side: It builds character.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:10         ` Linus Torvalds
@ 2004-09-15 20:17           ` Linus Torvalds
  2004-09-16 12:17           ` David Woodhouse
  1 sibling, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Linus Torvalds wrote:
> 
> Think of the positive side: It builds character.

Besides, a Canadian should know better than asking a Finn for favours
after what you did yesterday.

You'll all burn in hell.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:16     ` Linus Torvalds
  2004-09-15 19:40       ` Matthew Wilcox
@ 2004-09-15 20:20       ` Russell King
  2004-09-15 20:34         ` Linus Torvalds
  2004-09-15 22:38       ` James Bottomley
  2 siblings, 1 reply; 63+ messages in thread
From: Russell King @ 2004-09-15 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 12:16:35PM -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Geert Uytterhoeven wrote:
> > Before people really start using this: what's the advantage of this scheme
> > compared to e.g. dev->busops->in8(), which some of us have been waiting for
> > since a long time?
> 
> Ehh, exactly why would you wait for that? 
> 
> What this new interface is doing is to slowlt move "ioremap()" users over
> to "pci_iomap()" (and by extension, maybe "sbus_ioremap()" etc if somebody
> ever starts doing that too), which allows you to do the bus- and device-
> specific translation just _once_.

What about (PCI!) bus types where you need to write one register with
the address and one with the data.  (I'm glaring at a certain embedded
CPU produced by a well known large chip manufacturer, which I think we
need to do these games to get byte accesses to work correctly.)

Sometimes unified accessor macros/functions just don't work.

> In other words, I hereby proclaim that anybody who does "dev->busops" is a 
> total idiot.

Good, I'll pass that comment on to other people and maybe we can force
certain sillycon manufacturers to stop selling broken hardware.  Or
maybe the universe will implode.  I think the latter is a more likely
scenario than the former.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:20       ` Russell King
@ 2004-09-15 20:34         ` Linus Torvalds
  2004-09-15 20:51           ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:34 UTC (permalink / raw)
  To: Russell King
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Russell King wrote:
> 
> What about (PCI!) bus types where you need to write one register with
> the address and one with the data.  (I'm glaring at a certain embedded
> CPU produced by a well known large chip manufacturer, which I think we
> need to do these games to get byte accesses to work correctly.)

Well, that's really the exact same issue as with architectures that cram
memory and PCI into the same 32-bit address space: clearly that "can't
work", since PCI wants 32 bits, and memory wants at least a noticeable 
fraction of the same.

You have a "address register" (which is the 32 physical CPU address output
on the address bus), and you have a "data register" (which is what gets
output on the data bus).

Does it work? Hell yes. Every single PC out there today does exactly that.
It works by having the resource manager say "region X is reserved for RAM,
region Y is reserved for MMIO, and region Z is reserved for PIO" (well, 
the latter is not PC any more, it's ppc, alpha and others).

Same issue, exactly. If you see this thing in practice, just
force-relocate the resources away from each other. Say that MMIO is not
allowed in the 0x0000-0xffff region, and declare that that is reserved for
PIO. And then you just test at run-time. Easy. Change details to suit your
particular problem (maybe you want to use the high bit as a marker. Maybe
you use a water-mark or other more dynamic behaviour).

This is a _real_ solution, btw. It's not a cop-out. It's the solution that
has been used over and over again, and it just works. It's basically 
saying "theory doesn't matter - this is practice".

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:34         ` Linus Torvalds
@ 2004-09-15 20:51           ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:51 UTC (permalink / raw)
  To: Russell King
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Linus Torvalds wrote:
> 
> This is a _real_ solution, btw. It's not a cop-out. It's the solution that
> has been used over and over again, and it just works. It's basically 
> saying "theory doesn't matter - this is practice".

Btw, if your architecture supports DMA, you pretty much _have_ to have a
hole in the PCI MMIO space to map real RAM anyway. That hole would tend to 
be the optimal place to "hide" the IO ports in.

That's effectively what the generic lib/iomap.c does (except you may, of
course, have your RAM mapped at some non-zero address). 

			Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:16     ` Linus Torvalds
  2004-09-15 19:40       ` Matthew Wilcox
  2004-09-15 20:20       ` Russell King
@ 2004-09-15 22:38       ` James Bottomley
  2004-09-16  2:33         ` Matthew Wilcox
  2 siblings, 1 reply; 63+ messages in thread
From: James Bottomley @ 2004-09-15 22:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Wed, 2004-09-15 at 15:16, Linus Torvalds wrote:
> What this new interface is doing is to slowlt move "ioremap()" users over
> to "pci_iomap()" (and by extension, maybe "sbus_ioremap()" etc if somebody
> ever starts doing that too), which allows you to do the bus- and device-
> specific translation just _once_.

Please can we not do it this way, that was the old pci_map_single et al
which we've just spent quite a bit of time converting to
dma_map_single.  It might not matter to the PCI centric people, but the
generic DMA API was a wonderful simplification to the rest of us.

Can we start iomap off on the right foot and simply do dma_iomap (or
dev_iomap since dma_ may end up to be confusing), taking a struct device
as an argument and let the platform, which knows where the buses are,
map this to the correct operations?

James

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 22:38       ` James Bottomley
@ 2004-09-16  2:33         ` Matthew Wilcox
  2004-09-16  4:28           ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2004-09-16  2:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 06:38:50PM -0400, James Bottomley wrote:
> Please can we not do it this way, that was the old pci_map_single et al
> which we've just spent quite a bit of time converting to
> dma_map_single.  It might not matter to the PCI centric people, but the
> generic DMA API was a wonderful simplification to the rest of us.
> 
> Can we start iomap off on the right foot and simply do dma_iomap (or
> dev_iomap since dma_ may end up to be confusing), taking a struct device
> as an argument and let the platform, which knows where the buses are,
> map this to the correct operations?

Actually, I think we're OK with the current interface here.  The routines
we have that give us an iomem pointer are:

/* Create a virtual mapping cookie for an IO port range */
extern void __iomem *ioport_map(unsigned long port, unsigned int nr);
extern void ioport_unmap(void __iomem *);

/* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
struct pci_dev;
extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
extern void pci_iounmap(struct pci_dev *dev, void __iomem *);

So rather than pass the struct device in, we've got a special function
for each bus type.


What we haven't dealt with in this interface are:

 - relaxed reads
 - byte ordering

The relaxed attribute is, I think, not a property of the space we're
looking at.  It's a property of any given transaction.  So I think we
really do want a set of ioreadXX_relaxed() functions.

Byte ordering I'm less clear on.  Is this a property of the area "this
area is little endian on the bus, but the host cpu is big endian",
or is it a property of an individual transaction?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  2:33         ` Matthew Wilcox
@ 2004-09-16  4:28           ` Linus Torvalds
  2004-09-16  4:57             ` Benjamin Herrenschmidt
                               ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16  4:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> 
> Actually, I think we're OK with the current interface here.  The routines
> we have that give us an iomem pointer are:
> 
> /* Create a virtual mapping cookie for an IO port range */
> extern void __iomem *ioport_map(unsigned long port, unsigned int nr);
> extern void ioport_unmap(void __iomem *);
> 
> /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
> struct pci_dev;
> extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
> extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
> 
> So rather than pass the struct device in, we've got a special function
> for each bus type.

Indeed. Once created, the cookie is "bus-agnostic", which is very much the 
point. Unlike the original pci DMA interfaces, there are no real bus 
assumtions in the interfaces once set up. 

And set-up obviously _does_ need to be bus-specific, the same way device
discovery is. You don't dicsover non-specific devices. You very much
initialize one particular instance of one particular device/bus.

Now, other buses might want to choose to ignore the ioread/iowrite 
interfaces, and do their own bus-specific ones. I doubt it makes much 
sense, but it's not like we want to _force_ one particular IO model on 
people. 

> What we haven't dealt with in this interface are:
> 
>  - relaxed reads
>  - byte ordering
> 
> The relaxed attribute is, I think, not a property of the space we're
> looking at.  It's a property of any given transaction.  So I think we
> really do want a set of ioreadXX_relaxed() functions.

I haven't seen very many people actually use this. I've been assuming that 
what people want is the same "reasonably relaxed" behaviour that the 
regular MMIO functions have (as opposed to what traditional PIO is: it's 
slow as hell because it doesn't support posting). Are there actually users 
of any other model out there?

> Byte ordering I'm less clear on.  Is this a property of the area "this
> area is little endian on the bus, but the host cpu is big endian",
> or is it a property of an individual transaction?

I'd suggest individual transaction, although again, I doubt it is all that
much used. The "repeat" versions do IO-native byte order, which is really
the only case where we've ever really cared. David mentioned that some
network devices had tried to use "host-native" byte order (which some
hardware supports), but that it hadn't been worth the pain and apparently
got ripped out.

In other words, I'd like to keep the interfaces as simple as humanly
possible, and only introduce new concepts if absolutely required. For
example, one issue is 64-bit accesses: they will _not_ be atomic on most
architectures, and given that the question is whether they are worth
supporting at all, or whether drivers should just be expected to make the
non-atomicity explicit (by loading two dwords, the way they have to do
with the old interfaces anyway).

(Note that 64-bit accesses aren't necessarily atomic even on 64-bit
architectures. BenH says that the IO architecture on a G5 means that when
you do a 64-bit load, it will be done as two 32-bit bus cycles
regardless).

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:28           ` Linus Torvalds
@ 2004-09-16  4:57             ` Benjamin Herrenschmidt
  2004-09-16  4:58               ` Benjamin Herrenschmidt
  2004-09-16 13:41             ` Matthew Wilcox
  2004-09-16 22:30             ` Matthew Wilcox
  2 siblings, 1 reply; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-16  4:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thu, 2004-09-16 at 14:28, Linus Torvalds wrote:

> (Note that 64-bit accesses aren't necessarily atomic even on 64-bit
> architectures. BenH says that the IO architecture on a G5 means that when
> you do a 64-bit load, it will be done as two 32-bit bus cycles
> regardless).

Yup, at least with the current models, due to the fact that the U3
northbridge basically has a 32 bits path to IO devices. Note that this
isn't the case with IBM machines who do proper 64 bits PCI IOs.

Ben.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:57             ` Benjamin Herrenschmidt
@ 2004-09-16  4:58               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-16  4:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thu, 2004-09-16 at 14:57, Benjamin Herrenschmidt wrote:
> On Thu, 2004-09-16 at 14:28, Linus Torvalds wrote:
> 
> > (Note that 64-bit accesses aren't necessarily atomic even on 64-bit
> > architectures. BenH says that the IO architecture on a G5 means that when
> > you do a 64-bit load, it will be done as two 32-bit bus cycles
> > regardless).
> 
> Yup, at least with the current models, due to the fact that the U3
> northbridge basically has a 32 bits path to IO devices. Note that this
> isn't the case with IBM machines who do proper 64 bits PCI IOs.

Hrm... now I'm confused ... it _MIGHT_ be able to do 64 bits wide IOs
in fact ... but it's limited to 32 bits _addresses_ ... oh well...

Ben.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:10         ` Linus Torvalds
  2004-09-15 20:17           ` Linus Torvalds
@ 2004-09-16 12:17           ` David Woodhouse
  2004-09-16 13:52             ` Linus Torvalds
  1 sibling, 1 reply; 63+ messages in thread
From: David Woodhouse @ 2004-09-16 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik

On Wed, 2004-09-15 at 13:10 -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Matthew Wilcox wrote:
> 
> > Could we have a 'dev' parameter to ioread8()?  Please?
> 
> Nope. See earlier email on why. There is no "dev" any more at the point of
> a generic chipset handler: the "dev" may have any of millions of possible 
> types (well, ok, a few different bus architectures, but still).

It's not just about address spaces. Have you looked at some of the
implementations of readb() et al on some SH boards recently, with all
the if/elif/elif/ crap for different ranges of the address space? And
don't even get me started on the one where the PCI bridge allows you to
only map 48MiB of the PCI bus into the host address space at a time, so
MMIO accesses have to be done via an indirect method like config space
accesses are. You don't assign host addresses in ioremap() and you
_really_ get screwed over if broken drivers dereference __iomem
'pointers' instead of using the proper accessor functions :)

Yes, we _can_ continue do it the same way. But I was sort of hoping that
with a new API it'd get nicer.

> I seriously doubt that you don't have space to encode the same information 
> that was already there before in a pointer. David was kicking and 
> screaming too, until Jeff pointed out that it was all in the same address 
> space, and then he said "oh", and suddenly it was trivial.

Well actually I bitched a little more on IRC and not here, because this
is supposed to be a low-noise forum. 

I would have been a lot more coherent in my argument a couple of years
ago when I actually had platforms on my desk on which it matters.
Nowadays it's all ppc and ppc64 and they are relatively sane, at lst n
tht rspct. At least they don't think it's sane to drag a crack-smoking
hobo off the street and ask him to implement a PCI host bridge in an
FPGA, so my life is a lot nicer :)

I certainly don't recall just saying 'oh'; you may have inferred that
from my silence.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:28           ` Linus Torvalds
  2004-09-16  4:57             ` Benjamin Herrenschmidt
@ 2004-09-16 13:41             ` Matthew Wilcox
  2004-09-16 18:21               ` Linus Torvalds
  2004-09-16 22:30             ` Matthew Wilcox
  2 siblings, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2004-09-16 13:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 09:28:01PM -0700, Linus Torvalds wrote:
> > The relaxed attribute is, I think, not a property of the space we're
> > looking at.  It's a property of any given transaction.  So I think we
> > really do want a set of ioreadXX_relaxed() functions.
> 
> I haven't seen very many people actually use this. I've been assuming that 
> what people want is the same "reasonably relaxed" behaviour that the 
> regular MMIO functions have (as opposed to what traditional PIO is: it's 
> slow as hell because it doesn't support posting). Are there actually users 
> of any other model out there?

I think the reason you haven't seen much of it is that only SGI people
care about it ;-)  The only drivers in the tree that use it are Fusion,
qla1280 and qla2xxx, and the only architecture that defines readb_relaxed()
to be anything other than readb() is ia64.

However, they do see significant wins from using this.  According to
Jeremy Higdon this can reduce a 50us read to a 1us read.  So it's probably
worth doing for them.

> > Byte ordering I'm less clear on.  Is this a property of the area "this
> > area is little endian on the bus, but the host cpu is big endian",
> > or is it a property of an individual transaction?
> 
> I'd suggest individual transaction, although again, I doubt it is all that
> much used. The "repeat" versions do IO-native byte order, which is really
> the only case where we've ever really cared. David mentioned that some
> network devices had tried to use "host-native" byte order (which some
> hardware supports), but that it hadn't been worth the pain and apparently
> got ripped out.

After some further thought, I think it's a property of the remapping
transaction.  I'm thinking about the case of a chip that's connected
either via PCI or via a host-native bus on a big-endian platform.  The
foo_iomap() would mark the area as being big-endian and pci_iomap() as
little-endian.

> In other words, I'd like to keep the interfaces as simple as humanly
> possible, and only introduce new concepts if absolutely required. For
> example, one issue is 64-bit accesses: they will _not_ be atomic on most
> architectures, and given that the question is whether they are worth
> supporting at all, or whether drivers should just be expected to make the
> non-atomicity explicit (by loading two dwords, the way they have to do
> with the old interfaces anyway).

There's a readq() on all 64 bit architectures.  It's not used very much,
but (for example), the s2io 10GE driver uses it extensively.  I think
if we _don't_ provide an ioread64() we'll see these drivers either not
switch to this interface or, worse, reimplement readq themselves with
predictably horrible consequences.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 12:17           ` David Woodhouse
@ 2004-09-16 13:52             ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 13:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Matthew Wilcox, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, David Woodhouse wrote:
>
>   At least they don't think it's sane to drag a crack-smoking
> hobo off the street and ask him to implement a PCI host bridge in an
> FPGA, so my life is a lot nicer :)

LOL.

I hope some of the poor unsuspecting souls who have to use the dang thing
will educate the hw hobos in question. Preferably with a baseball bat.

The good news is that hardware _does_ tend to slowly become saner. I don't
know if it's the crack supply that eventually dwindels, or what, but the
hw designers to tend to eventually take note when they find death-threats
in their cubby-holes. So if you do see a broken chip, tell them.  

Preferably by getting your company to say "we'll use that other _sane_
chip instead", and letting the broken vendor know why.

> I certainly don't recall just saying 'oh'; you may have inferred that
> from my silence.

Heh. I was talking about the other David, ie the "Miller" kind. 

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 13:41             ` Matthew Wilcox
@ 2004-09-16 18:21               ` Linus Torvalds
  2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:01                 ` Linus Torvalds
  0 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 18:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> 
> I think the reason you haven't seen much of it is that only SGI people
> care about it ;-)  The only drivers in the tree that use it are Fusion,
> qla1280 and qla2xxx, and the only architecture that defines readb_relaxed()
> to be anything other than readb() is ia64.

Bzzt.

ia64 actually defines readb to be the same as readb_relaxed.

It's literally only the sn2 _platform_ that does something stupid. In 
other words, it's not ia64 that is broken (surprise surprise - you never 
expected me to have said anything good about ia64), it's literally the SN2 
platform that seems broken.

And quite frankly, even than I'm not sure that it's the hw platform itself 
that is broken, it might well just be the architecture code. For some 
reason the SN2 code does:

	/*
	 * The following routines are SN Platform specific, called when
	 * a reference is made to readX/writeX set macros.  SN Platform
	 * readX set of macros ensures that Posted DMA writes on the
	 * Bridge is flushed.
	 *
	 * The routines should be self explainatory.
	 */

	static inline unsigned char
	___sn_readb (void *addr)
	{
	        unsigned char val;

	        val = *(volatile unsigned char *)addr;
	        __sn_mf_a();
	        sn_dma_flush((unsigned long)addr);
	        return val;
	}

in other words, it looks like there's a broken PCI "bridge" involved, that 
just doesn't follow the specs, and does bad things.

> However, they do see significant wins from using this.  According to
> Jeremy Higdon this can reduce a 50us read to a 1us read.  So it's probably
> worth doing for them.

I'd suggest to them to unbreak their silly "readb()" implementation, and 
 (a) talk to their hw engineers
 (b) add explicit flush code to the 5 (or so) drivers that they really 
     care about. Let's face it, I doubt there are that many different 
     devices that people use on the SN2. Make a nice "synchronize"  
     interface that you could architect, and make conforming platforms
     just make it into a no-op.

I'd hate to add complex new interfaces just because some hardware is 
broken. Don't make the generic interfaces suffer for broken hardware.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:21               ` Linus Torvalds
@ 2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:09                   ` Linus Torvalds
                                     ` (2 more replies)
  2004-09-16 19:01                 ` Linus Torvalds
  1 sibling, 3 replies; 63+ messages in thread
From: Jesse Barnes @ 2004-09-16 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> It's literally only the sn2 _platform_ that does something stupid. In
> other words, it's not ia64 that is broken (surprise surprise - you never
> expected me to have said anything good about ia64), it's literally the SN2
> platform that seems broken.

No, it's just the only smart platform.  Apparently we're the only ones who 
realized that most PIO reads do not rely on prior DMA being complete (and 
that it's expensive to make sure that they do), so we optimized that out of 
our hardware.  Only interrupts will guarantee DMA completion on SGI 
platforms.

> And quite frankly, even than I'm not sure that it's the hw platform itself
> that is broken, it might well just be the architecture code. For some
> reason the SN2 code does:
>
>  /*
>   * The following routines are SN Platform specific, called when
>   * a reference is made to readX/writeX set macros.  SN Platform
>   * readX set of macros ensures that Posted DMA writes on the
>   * Bridge is flushed.
>   *
>   * The routines should be self explainatory.
>   */
>
>  static inline unsigned char
>  ___sn_readb (void *addr)
>  {
>          unsigned char val;
>
>          val = *(volatile unsigned char *)addr;
>          __sn_mf_a();
>          sn_dma_flush((unsigned long)addr);
>          return val;
>  }
>
> in other words, it looks like there's a broken PCI "bridge" involved, that
> just doesn't follow the specs, and does bad things.

Correct, it doesn't follow the specs.  See above.

> I'd suggest to them to unbreak their silly "readb()" implementation, and
>  (a) talk to their hw engineers
>  (b) add explicit flush code to the 5 (or so) drivers that they really
>      care about. Let's face it, I doubt there are that many different
>      devices that people use on the SN2. Make a nice "synchronize"
>      interface that you could architect, and make conforming platforms
>      just make it into a no-op.
>
> I'd hate to add complex new interfaces just because some hardware is
> broken. Don't make the generic interfaces suffer for broken hardware.

Sure, that's the way I'd like to go, but that's a dangerous route.  Devices 
that want to do a PIO read to see if their DMA is complete will work most of 
the time, but you'll get occasional data corruption.  IOW, specifying reads 
that you know *aren't* dependent is much easier to do, and safer.  It also 
allows SGI hardware to run all current drivers w/o modification, which is 
nice.

If you want to explicitly document that the new io* interface doesn't 
guarantee DMA coherence, great, we can start from scratch with a good 
implementation that has an explicit flush mechanism.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:21               ` Linus Torvalds
  2004-09-16 18:52                 ` Jesse Barnes
@ 2004-09-16 19:01                 ` Linus Torvalds
  2004-09-16 19:13                   ` Jeff Garzik
  2004-09-17  5:20                   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 19:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Linus Torvalds wrote:
> 
> I'd suggest to them to unbreak their silly "readb()" implementation, and 
>  (a) talk to their hw engineers
>  (b) add explicit flush code to the 5 (or so) drivers that they really 
>      care about. Let's face it, I doubt there are that many different 
>      devices that people use on the SN2. Make a nice "synchronize"  
>      interface that you could architect, and make conforming platforms
>      just make it into a no-op.

Actually, we could also take the "safe" approach, and just specify that 
the new "ioread()/iowrite()" interfaces are always very relaxed, and apart 
from endianness won't do much extra synchronization.

What are sane rules? Clearly it is _not_ a sane rule to allow memory 
re-ordering to the same device, because most driver authors have problems 
as-is, and trying to make them aware of speculative reads and writes 
passing writes is just not going to fly. So a minimal sane rule would be 
that each device seens programmatic IO accesses in "thread order".

But I think it would be ok to say that DMA needs explicit synchronization,
and obviously write posting (as long as it doesn't re-order IO accesses)  
is fine. The latter is already true for MMIO (and arguably acceptable even
for PIO, although x86's have historically had slightly stricter ordering
on PIO).

So how about trying to come up with a sane interface for "synchronize
DMA". I _suspect_ it will have to be device-specific, so we should aim for
a similar interface to the "dma-mapping" ones (ie based on "struct
device", not the bus it's on).

In most cases it would be a no-op, or we could hide some pointer a la
"dev->bus->sync(dev)" (which could do a status read from the bridge or
whatever..)

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:52                 ` Jesse Barnes
@ 2004-09-16 19:09                   ` Linus Torvalds
  2004-09-16 20:02                     ` Jesse Barnes
  2004-09-16 20:04                   ` David S. Miller
  2004-09-17  5:17                   ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 19:09 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Jesse Barnes wrote:
> 
> Sure, that's the way I'd like to go, but that's a dangerous route.

It might be slightly less dangerous if we only do this for the new 
interface. It would still be up to you guys to verify that drivers use 
whatever new "dma_synchronize(dev)" interface, but hey, that's the price 
you pay for not following the standard. 

I know hardware people love saying "we can do this faster doing XXX and 
then sw can take care of it". Make the powers-that-be realize that if 
software takes care of it, it means that sw needs resources to track it..

> If you want to explicitly document that the new io* interface doesn't 
> guarantee DMA coherence, great, we can start from scratch with a good 
> implementation that has an explicit flush mechanism.

I think that's the right thing to do at this point, but since nobody else
will really see this issue (since standard PCI does synchronize DMA),
there will inevitably be bugs that are SGI-specific. Just as everybody is
aware of the issue, and realizes that there's going to be some cost of
that to SGI in the form of testing and verification, this sounds like the
way to go..

Does that sound reasonable to everybody?

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:01                 ` Linus Torvalds
@ 2004-09-16 19:13                   ` Jeff Garzik
  2004-09-16 19:50                     ` Linus Torvalds
  2004-09-17  5:20                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 63+ messages in thread
From: Jeff Garzik @ 2004-09-16 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller

Linus Torvalds wrote:
> What are sane rules?

I'd err on the side of "simple and tough to screw up" :)

What rules are most beneficial to the likely users of this API?

	Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:13                   ` Jeff Garzik
@ 2004-09-16 19:50                     ` Linus Torvalds
  2004-09-16 20:07                       ` Alan Cox
  2004-09-17  5:44                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 19:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller



On Thu, 16 Sep 2004, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> > What are sane rules?
> 
> I'd err on the side of "simple and tough to screw up" :)

I definitely agree. As I said, I think strict memory ordering is a must
for that reason.

That said, if we make make things _too_ strict, and that then makes some 
people start looking at extended interfaces in order to get their work 
done, the rules then effectively become more complex again, because now we 
have a much bigger API - a "simple" set with strict rules and 
synchronization, mixed together with the "give them rope" set.

And quite frankly, we've already seen that many driver writers will then 
start using the "give them rope" set, if only because they really think 
they need to. And then they'll screw up again.

> What rules are most beneficial to the likely users of this API?

I think the "programmatic IO accesses are strictly in-order for one
particular device" rule is so fundamentally required that nobody will
really argue against it. There is very little hardware that doesn't really 
enforce that rule anyway, and most of the extensions to it ("allow 
bursting and other fancy stuff" is in practice done by the driver setup 
code anyway, and has nothing to do with the individual IO accesses).

So the only remaining source of issues would be synchronization between
devices (which we haven't really supported anyway, and which nobody really
tends to care about), and synchronization with "side-band signals". And
there are really only two side-band signals that I can think of: DMA and
IRQ's.

I would argue that we likely don't normally care about strict
synchronization with those side-band signals, because most of the
serialization that a driver would rely on would be strictly causal (and
thus happen regardless of what interface we make up:  I seriously doubt we
could come up with a non-causal interface, but some physicists might be
very interested indeed ;).

So for example, we can pretty much depend on a "command completed"
interrupt being asserted only after we've written the command to the chip. 
No interface issues there. Similarly, if we read a status register in an 
interrupt handler, we _will_ get the status of the interrupt, unless the 
chip itself does some buffering at which point it is a driver issue to 
handle that, rather than an interface issue.

So the remaining issues are really things we already see with DMA, for
example: if we read a mailbox from memory saying that DMA has completed,
we need to have a read-memory barrier before we actually read the contents
of the IO, otherwise the CPU might read buffer contents "prior" to the
completion.

The same would go for doing a "ioread()" that reads a status register in 
MMIO that says "DMA is done". Before we can _trust_ that, we'd need to 
synchronize with the DMA "out-of-band" stuff. I think that's an acceptable 
interface, since it's something we already support for memory-based 
interfaces. 

It would be different, of course - it wouldn't be a read barrier, it would
be a "ioread" barrier. But the _concept_ is the same. And it doesn't tend
to hit us all that often, and the biggest pain is literally that in-spec
PCI devices will never show this, so the real problem for hardware like
SGI's in this case is lack of test coverage out in the rest of the world. 
That's the price you pay if you do things differently.

Implementation shouldn't be that hard. On any conforming PCI platform, 
it's a no-op, since reads from MMIO space are supposed to already 
synchronize with any buffered DMA (it still leaves the question of CPU 
memory ordering wrt uncached accesses, and thus we might want to have a 
memory barrier in place depending on architecture). And SGI would have 
something that synchronizes with the bridge nearest to the device.

So I _think_ the SGI case would be perfectly happy with an interface like 

	void dma_sync(struct device *dev);

which on their SN2 machines would do "dev->bus->sync(dev)" (where they 
would add something that reads from the bridge), and the rest of the world 
might just do a memory barrier there, or leave it as a no-op.

And yes, a few drivers would have "dma_sync()" in a few places. Not that 
many, likely.

Would driver writers get this wrong, and forget? Absolutely. But that's 
true of any interface we could come up with. If there is something we can 
absolutely depend on in life, it's that driver writers _will_ do something 
wrong. 

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:09                   ` Linus Torvalds
@ 2004-09-16 20:02                     ` Jesse Barnes
  2004-09-16 20:37                       ` James Bottomley
  0 siblings, 1 reply; 63+ messages in thread
From: Jesse Barnes @ 2004-09-16 20:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thursday, September 16, 2004 12:09 pm, Linus Torvalds wrote:
> On Thu, 16 Sep 2004, Jesse Barnes wrote:
> > Sure, that's the way I'd like to go, but that's a dangerous route.
>
> It might be slightly less dangerous if we only do this for the new
> interface. It would still be up to you guys to verify that drivers use
> whatever new "dma_synchronize(dev)" interface, but hey, that's the price
> you pay for not following the standard.

Sure, sounds fair.  I just wanted to avoid changing the existing interfaces, 
that's all.

> I know hardware people love saying "we can do this faster doing XXX and
> then sw can take care of it". Make the powers-that-be realize that if
> software takes care of it, it means that sw needs resources to track it..

Yep.

> > If you want to explicitly document that the new io* interface doesn't
> > guarantee DMA coherence, great, we can start from scratch with a good
> > implementation that has an explicit flush mechanism.
>
> I think that's the right thing to do at this point, but since nobody else
> will really see this issue (since standard PCI does synchronize DMA),

Note that PCI-X and PCI Express allow a lack of synchronization via the 
relaxed ordering bit.  So platforms can either set that bit in their iomap 
functions and deal with synchronization via dma_sync or not use it I guess.

> there will inevitably be bugs that are SGI-specific. Just as everybody is
> aware of the issue, and realizes that there's going to be some cost of
> that to SGI in the form of testing and verification, this sounds like the
> way to go..
>
> Does that sound reasonable to everybody?

Sure, thanks.

Jesse

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:09                   ` Linus Torvalds
@ 2004-09-16 20:04                   ` David S. Miller
  2004-09-16 20:13                     ` Jeff Garzik
  2004-09-16 20:20                     ` Jesse Barnes
  2004-09-17  5:17                   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 63+ messages in thread
From: David S. Miller @ 2004-09-16 20:04 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: torvalds, willy, James.Bottomley, geert, linux-arch, viro, akpm,
	alan, davem, jgarzik

On Thu, 16 Sep 2004 11:52:54 -0700
Jesse Barnes <jbarnes@engr.sgi.com> wrote:

> On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> > It's literally only the sn2 _platform_ that does something stupid. In
> > other words, it's not ia64 that is broken (surprise surprise - you never
> > expected me to have said anything good about ia64), it's literally the SN2
> > platform that seems broken.
> 
> No, it's just the only smart platform.  Apparently we're the only ones who 
> realized that most PIO reads do not rely on prior DMA being complete (and 
> that it's expensive to make sure that they do), so we optimized that out of 
> our hardware.  Only interrupts will guarantee DMA completion on SGI 
> platforms.

Jesse, I think you might not need that hack in the I/O accessor
routines to handle this.  pci_dma_unmap(), pci_dma_sync_*() and
friends should be the place to perform this "prior DMA sync" thing.

Drivers aren't allowed to touch active DMA data areas until they
call one of those interfaces (see Documentation/DMA-mapping.txt)
so that should just work.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:50                     ` Linus Torvalds
@ 2004-09-16 20:07                       ` Alan Cox
  2004-09-17  5:44                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 63+ messages in thread
From: Alan Cox @ 2004-09-16 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, David S. Miller

Right now we have writeb and __writeb() the later of which isn't store
ordering but requires the driver author makes an effort to step on a
land mine. It does help performance having these in some cases so can't
we propogate the same "__" functions on

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:04                   ` David S. Miller
@ 2004-09-16 20:13                     ` Jeff Garzik
  2004-09-16 20:45                       ` David S. Miller
  2004-09-16 20:20                     ` Jesse Barnes
  1 sibling, 1 reply; 63+ messages in thread
From: Jeff Garzik @ 2004-09-16 20:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: Jesse Barnes, torvalds, willy, James.Bottomley, geert, linux-arch,
	viro, akpm, alan, davem

On Thu, Sep 16, 2004 at 01:04:50PM -0700, David S. Miller wrote:
> Drivers aren't allowed to touch active DMA data areas until they
> call one of those interfaces (see Documentation/DMA-mapping.txt)

Does that include DMA memory allocated with pci_alloc_consistent?

	Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:04                   ` David S. Miller
  2004-09-16 20:13                     ` Jeff Garzik
@ 2004-09-16 20:20                     ` Jesse Barnes
  1 sibling, 0 replies; 63+ messages in thread
From: Jesse Barnes @ 2004-09-16 20:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: torvalds, willy, James.Bottomley, geert, linux-arch, viro, akpm,
	alan, davem, jgarzik

On Thursday, September 16, 2004 1:04 pm, David S. Miller wrote:
> On Thu, 16 Sep 2004 11:52:54 -0700
>
> Jesse Barnes <jbarnes@engr.sgi.com> wrote:
> > On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> > > It's literally only the sn2 _platform_ that does something stupid. In
> > > other words, it's not ia64 that is broken (surprise surprise - you
> > > never expected me to have said anything good about ia64), it's
> > > literally the SN2 platform that seems broken.
> >
> > No, it's just the only smart platform.  Apparently we're the only ones
> > who realized that most PIO reads do not rely on prior DMA being complete
> > (and that it's expensive to make sure that they do), so we optimized that
> > out of our hardware.  Only interrupts will guarantee DMA completion on
> > SGI platforms.
>
> Jesse, I think you might not need that hack in the I/O accessor
> routines to handle this.  pci_dma_unmap(), pci_dma_sync_*() and
> friends should be the place to perform this "prior DMA sync" thing.

That would be ok, except for the fact that we wouldn't be able to provide 
'consistent' memory.  I think it's better to implement the expected PCI 
semantics but provide an out via the _relaxed interface (new interfaces 
aside).

Jesse

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:02                     ` Jesse Barnes
@ 2004-09-16 20:37                       ` James Bottomley
  2004-09-16 20:42                         ` Jesse Barnes
  0 siblings, 1 reply; 63+ messages in thread
From: James Bottomley @ 2004-09-16 20:37 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Matthew Wilcox, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik, Grant Grundler

On Thu, 2004-09-16 at 16:02, Jesse Barnes wrote:
> Note that PCI-X and PCI Express allow a lack of synchronization via the 
> relaxed ordering bit.  So platforms can either set that bit in their iomap 
> functions and deal with synchronization via dma_sync or not use it I guess.

Now you've confused me again.  I thought the SGI implementation of this
relaxed read thing was *not* the same as the PCI-X RO bit which was why
you needed a special readX_relaxed?

My understanding is that Relaxed Ordering is that it's two bit (i.e. two
separately allowable optimisations) that allow either PIO writes to pass
DMA reads or PIO reads to pass DMA writes (or both) in the stream (I may
have this wrong, I copied Grant because he's been working on this).

I thought your readX_relaxed was only the first half of this?

James

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:37                       ` James Bottomley
@ 2004-09-16 20:42                         ` Jesse Barnes
  2004-09-16 21:37                           ` Grant Grundler
  0 siblings, 1 reply; 63+ messages in thread
From: Jesse Barnes @ 2004-09-16 20:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Matthew Wilcox, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik, Grant Grundler

On Thursday, September 16, 2004 1:37 pm, James Bottomley wrote:
> On Thu, 2004-09-16 at 16:02, Jesse Barnes wrote:
> > Note that PCI-X and PCI Express allow a lack of synchronization via the
> > relaxed ordering bit.  So platforms can either set that bit in their
> > iomap functions and deal with synchronization via dma_sync or not use it
> > I guess.
>
> Now you've confused me again.  I thought the SGI implementation of this
> relaxed read thing was *not* the same as the PCI-X RO bit which was why
> you needed a special readX_relaxed?

Depends on how you read the specs.  I think they're compatible, but others 
think that PCI-X RO will require a different implementation.  We'll know when 
we see hardware that supports it.

> My understanding is that Relaxed Ordering is that it's two bit (i.e. two
> separately allowable optimisations) that allow either PIO writes to pass
> DMA reads or PIO reads to pass DMA writes (or both) in the stream (I may
> have this wrong, I copied Grant because he's been working on this).
>
> I thought your readX_relaxed was only the first half of this?

Only the second half, actually.  It may also apply to intra-DMA block 
ordering.

Jesse

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:13                     ` Jeff Garzik
@ 2004-09-16 20:45                       ` David S. Miller
  0 siblings, 0 replies; 63+ messages in thread
From: David S. Miller @ 2004-09-16 20:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: jbarnes, torvalds, willy, James.Bottomley, geert, linux-arch,
	viro, akpm, alan, davem

On Thu, 16 Sep 2004 16:13:08 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> On Thu, Sep 16, 2004 at 01:04:50PM -0700, David S. Miller wrote:
> > Drivers aren't allowed to touch active DMA data areas until they
> > call one of those interfaces (see Documentation/DMA-mapping.txt)
> 
> Does that include DMA memory allocated with pci_alloc_consistent?

Good point.  No, it does not include consistent DMA memory.
I guess they do need it after all.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:42                         ` Jesse Barnes
@ 2004-09-16 21:37                           ` Grant Grundler
  0 siblings, 0 replies; 63+ messages in thread
From: Grant Grundler @ 2004-09-16 21:37 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: James Bottomley, Linus Torvalds, Matthew Wilcox,
	Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik, Grant Grundler

On Thu, Sep 16, 2004 at 01:42:48PM -0700, Jesse Barnes wrote:
> Depends on how you read the specs.

I'd say it depends on if the PCI-X device implements RO hint
correctly and if the platform chipset doesn't ignore RO bit
in the PCI-X commands that go across the wire.

> I think they're compatible, but others 
> think that PCI-X RO will require a different implementation.

No. PCI-X RO is an attribute the OS can set on the PCI-X device
before putting it to work. The device then decides when to use RO
hint and hopefully only uses RO hint when it's really ok (eg. most bulk
data transfer and not for all control data).

readX_relaxed() is ha^H^Hoptimization so SGI platform doesn't have
to pay a huge penalty to enforce PCI ordering rules on every MMIO read.
IIRC, the SGI Altix chipset will allow MMIO Read returns to bypass DMA writes.

Original thread discusses this in more detail:
	http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/1678.html

> > My understanding is that Relaxed Ordering is that it's two bit (i.e. two
> > separately allowable optimisations)

PCI-X "Enable RO" is only one bit in the PCI-X Command Register.

> > that allow either PIO writes to pass DMA reads

Most parisc platforms allow this to happen. I can't see any harm in it
though it's possible some PCI device will break because of it.
Since the PCI-X device isn't sourcing transactions in this direction,
I don't see how PCI-X RO would be involved.

> >  or PIO reads to pass DMA writes (or both) in the stream (I may
> > have this wrong, I copied Grant because he's been working on this).
> >
> > I thought your readX_relaxed was only the first half of this?
> 
> Only the second half, actually.
> It may also apply to intra-DMA block ordering.

readX_relaxed() has nothing to do with DMA ordering AFAICT.
PCI-X RO hint would if I understand you correctly.

thanks,
grant

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:28           ` Linus Torvalds
  2004-09-16  4:57             ` Benjamin Herrenschmidt
  2004-09-16 13:41             ` Matthew Wilcox
@ 2004-09-16 22:30             ` Matthew Wilcox
  2004-09-16 22:42               ` Linus Torvalds
  2 siblings, 1 reply; 63+ messages in thread
From: Matthew Wilcox @ 2004-09-16 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 09:28:01PM -0700, Linus Torvalds wrote:
> Indeed. Once created, the cookie is "bus-agnostic", which is very much the 
> point. Unlike the original pci DMA interfaces, there are no real bus 
> assumtions in the interfaces once set up. 

In designing the new iomap interface for PA, I noticed a problem.

2004/09/14 torvalds   |  * They do _not_ update the port address. If you
2004/09/14 torvalds   |  * want MMIO that copies stuff laid out in MMIO
2004/09/14 torvalds   |  * memory across multiple ports, use "memcpy_toio()"
2004/09/14 torvalds   |  * and friends.

That implies you can pass an iomap cookie to memcpy_toio(), which makes
designing the iomap interface significantly harder.  Could I ask for
iocopy_to(), iocopy_from() and ioset() interfaces too?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:30             ` Matthew Wilcox
@ 2004-09-16 22:42               ` Linus Torvalds
  2004-09-16 22:46                 ` Jeff Garzik
  2004-09-17 12:44                 ` Matthew Wilcox
  0 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 22:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> 
> 2004/09/14 torvalds   |  * They do _not_ update the port address. If you
> 2004/09/14 torvalds   |  * want MMIO that copies stuff laid out in MMIO
> 2004/09/14 torvalds   |  * memory across multiple ports, use "memcpy_toio()"
> 2004/09/14 torvalds   |  * and friends.
> 
> That implies you can pass an iomap cookie to memcpy_toio()

Actually, no it does not. It very much says "if you want __MMIO__ that 
copies stuff".

You can pass cookies the other way:  if you create a cookie with the old
"ioremap()" interface (which means MMIO only, of course), I intended for
that to work well with the new interfaces (ioread/iowrite). But a
new-style IO-map cookie will _not_ work with the old interfaces. 

In other words, we don't magically make the old interfaces support PIO 
accesses. That includes "memcpy_toio()" and friends. They support MMIO 
only.

I don't see anybody ever wanting to do a "memcpy_to/from_pio()" anyway. 
The op just doesn't make any real sense.

And that is actually enforced by the x86 implementation: creating a
new-style PIO cookie and trying to use them with the old-style interfaces
will create an instant oops. So there's no real worry that some
enterprising driver writer would try to do this. He'd notice that it 
doesn't work very quickly indeed ;)

			Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:42               ` Linus Torvalds
@ 2004-09-16 22:46                 ` Jeff Garzik
  2004-09-16 23:15                   ` Linus Torvalds
  2004-09-17 12:44                 ` Matthew Wilcox
  1 sibling, 1 reply; 63+ messages in thread
From: Jeff Garzik @ 2004-09-16 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller

Linus Torvalds wrote:
> I don't see anybody ever wanting to do a "memcpy_to/from_pio()" anyway. 
> The op just doesn't make any real sense.

Actually, I do this in 8139too, to take a snapshot of the NIC's 
registers for the ethtool "get registers" command.  ;-)

If 8139too is compiled for MMIO, I use memcpy_fromio.  If 8139too is 
compiled for PIO...

	Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:46                 ` Jeff Garzik
@ 2004-09-16 23:15                   ` Linus Torvalds
  2004-09-16 23:30                     ` Jeff Garzik
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 23:15 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller



On Thu, 16 Sep 2004, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > I don't see anybody ever wanting to do a "memcpy_to/from_pio()" anyway. 
> > The op just doesn't make any real sense.
> 
> Actually, I do this in 8139too, to take a snapshot of the NIC's 
> registers for the ethtool "get registers" command.  ;-)

It doesn't have good semantics, though. For example, would your code work
if the copy was byte-by-byte? Word-for-word? Long-for-long? It might work
on some hardware, but it's not an operation that makes sense "in general".

And yes, there is likely hardware out there where "memcpy_fromio()" 
doesn't work even on MMIO (it might put restrictions on how the memory is 
accessed), but at least it still makes sense _conceptually_.

(I actually suspect your memcpy_fromio() doesn't even work on MMIO. For
example, the ppc64 iSeries memcpy_fromio() really _does_ do byte-at-a-time
copies. Slow and broken? Maybe. I don't have an iSeries. But arguably 
perfectly ok. Now, do you think your hardware does the right thing there? 
Or alternatively, do you think it's ok if a 64-bit architecture does 
quad-word loads?)

In other words, memcpy_fromio() really only makes sense on truly "flat" 
MMIO. RAM/ROM-like things, in other words. It can be dangerous to use on 
things like noncached control register areas..

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 23:15                   ` Linus Torvalds
@ 2004-09-16 23:30                     ` Jeff Garzik
  2004-09-16 23:43                       ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Jeff Garzik @ 2004-09-16 23:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller

Linus Torvalds wrote:
  It doesn't have good semantics, though. For example, would your code work
> if the copy was byte-by-byte? Word-for-word? Long-for-long? It might work
> on some hardware, but it's not an operation that makes sense "in general".


That's actually a good point.

Looking at the code,
/* TODO: we are too slack to do reg dumping for pio, for now */
#ifdef CONFIG_8139TOO_PIO
#define rtl8139_get_regs_len    NULL
#define rtl8139_get_regs        NULL


I seem to recall that I didn't implement it precisely because some 
registers wanted to read only at certain bit widths.

	Jeff

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 23:30                     ` Jeff Garzik
@ 2004-09-16 23:43                       ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-16 23:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller



On Thu, 16 Sep 2004, Jeff Garzik wrote:

> Linus Torvalds wrote:
> > [ rant deleted ]
> 
> That's actually a good point.

I hate it when people do that.

"That's actually a good point".

I can really imagine surprised expression - "Linus actually made a good
point. What is the world coming to?"

Grumble grumble.

Dammit. ALL my points are good. 

Except for the ones where I obviously toked up a bit too much before I
wrote.

Or the ones I make too early in the morning, before my coffee.

Oh, or the ones after a few too many beers.

Ok, ok. Never mind. I'll go sulk in a corner now.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:09                   ` Linus Torvalds
  2004-09-16 20:04                   ` David S. Miller
@ 2004-09-17  5:17                   ` Benjamin Herrenschmidt
  2004-09-17 15:30                     ` Jesse Barnes
  2 siblings, 1 reply; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-17  5:17 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Matthew Wilcox, James Bottomley,
	Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Fri, 2004-09-17 at 04:52, Jesse Barnes wrote:
> On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> > It's literally only the sn2 _platform_ that does something stupid. In
> > other words, it's not ia64 that is broken (surprise surprise - you never
> > expected me to have said anything good about ia64), it's literally the SN2
> > platform that seems broken.
> 
> No, it's just the only smart platform.  Apparently we're the only ones who 
> realized that most PIO reads do not rely on prior DMA being complete (and 
> that it's expensive to make sure that they do), so we optimized that out of 
> our hardware.  Only interrupts will guarantee DMA completion on SGI 
> platforms.

Hrm... that's weird... on sane DMA hardward, most PIO/MMIO reads are
actually here just to guarantee that completion, interrupts isn't enough
especially if we do things like polling (NAPI) etc... besides,
interrupts are usually totally asynchronous to anything else, so I
wouldn't say it's sane.

> Sure, that's the way I'd like to go, but that's a dangerous route.  Devices 
> that want to do a PIO read to see if their DMA is complete will work most of 
> the time, but you'll get occasional data corruption.  IOW, specifying reads 
> that you know *aren't* dependent is much easier to do, and safer.  It also 
> allows SGI hardware to run all current drivers w/o modification, which is 
> nice.

Agreed. Reads must NOT be relaxed by default

> If you want to explicitly document that the new io* interface doesn't 
> guarantee DMA coherence, great, we can start from scratch with a good 
> implementation that has an explicit flush mechanism.
> 
> Thanks,
> Jesse
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:01                 ` Linus Torvalds
  2004-09-16 19:13                   ` Jeff Garzik
@ 2004-09-17  5:20                   ` Benjamin Herrenschmidt
  2004-09-17 15:03                     ` Linus Torvalds
  1 sibling, 1 reply; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-17  5:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik


> Actually, we could also take the "safe" approach, and just specify that 
> the new "ioread()/iowrite()" interfaces are always very relaxed, and apart 
> from endianness won't do much extra synchronization.

That would be catastrophic with 99% of the drivers on PPC. The processor
may speculate reads from outside spinlocks and that sort of horror, or
defer them to until much later, crossing writes and such things...

And we don't always have nice way of providing barriers. For example,
for enforcing a read to happen "now", the only way is to actually create
a pseudo dependency on the data returned by the read, so an
function/macro like "read_barrier()" can't be implemented unless it
takes the result of the read as an argument.

> What are sane rules? Clearly it is _not_ a sane rule to allow memory 
> re-ordering to the same device, because most driver authors have problems 
> as-is, and trying to make them aware of speculative reads and writes 
> passing writes is just not going to fly. So a minimal sane rule would be 
> that each device seens programmatic IO accesses in "thread order".
> 
> But I think it would be ok to say that DMA needs explicit synchronization,
> and obviously write posting (as long as it doesn't re-order IO accesses)  
> is fine. The latter is already true for MMIO (and arguably acceptable even
> for PIO, although x86's have historically had slightly stricter ordering
> on PIO).

Write posting is always assumed since it's already fine per PCI spec. I
don't agree with read vs. DMA.

> So how about trying to come up with a sane interface for "synchronize
> DMA". I _suspect_ it will have to be device-specific, so we should aim for
> a similar interface to the "dma-mapping" ones (ie based on "struct
> device", not the bus it's on).
> 
> In most cases it would be a no-op, or we could hide some pointer a la
> "dev->bus->sync(dev)" (which could do a status read from the bridge or
> whatever..)

Ok, well, synchronize DMA would end up having to do a read on most
setups, so that mean knowing of a register on the device with no side
effects etc... difficult to acheive outside of the driver...

Ben.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:50                     ` Linus Torvalds
  2004-09-16 20:07                       ` Alan Cox
@ 2004-09-17  5:44                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 63+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-17  5:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller


> So the only remaining source of issues would be synchronization between
> devices (which we haven't really supported anyway, and which nobody really
> tends to care about), and synchronization with "side-band signals". And
> there are really only two side-band signals that I can think of: DMA and
> IRQ's.

DMA isn't really "side band", and reads only synchronize with DMA on the
same bus path anyways.

Interrupts are never properly synchronized (and I've seen many bugs
because of that, we can _NEVER_ assume we are properly synchronized with
an interrupt, it can be buffered a totally random amount of time in the
path to the CPU and even inside of the CPU before reaching the core,
whatever the driver did to mask it on the device, drivers writers must
be aware of that, they aren't most of the time, calling sychronize_irq()
after masking on-device will _NOT_ guarantee you won't get a stale one).

> I would argue that we likely don't normally care about strict
> synchronization with those side-band signals, because most of the
> serialization that a driver would rely on would be strictly causal (and
> thus happen regardless of what interface we make up:  I seriously doubt we
> could come up with a non-causal interface, but some physicists might be
> very interested indeed ;).

Nope,  we do care a hell lot about synchronization with DMA, and it's
been a source of countless horrible to track down bugs in the past when
debugging high IO stress benchmarks on big POWER machines.

> So for example, we can pretty much depend on a "command completed"
> interrupt being asserted only after we've written the command to the chip. 
> No interface issues there.

Yes. That one is fine, except if we get a stale irq from a previous
command where we didn't care about the interrupt, or that sort of thing
of course.

> Similarly, if we read a status register in an 
> interrupt handler, we _will_ get the status of the interrupt, unless the 
> chip itself does some buffering at which point it is a driver issue to 
> handle that, rather than an interface issue.

Right. What we cannot rely on with interrupt is not taking them after
masking them. We _can_ rely on disable_irq() because it will mask at the
controller level and the arch code will take care of not delivering if
stale, but masking on-chip is never guaranteed to have a synchronizable
effect. At least, for level interrupts, we know that if we take it
anyway, we can just return and drop it until it goes away

> So the remaining issues are really things we already see with DMA, for
> example: if we read a mailbox from memory saying that DMA has completed,
> we need to have a read-memory barrier before we actually read the contents
> of the IO, otherwise the CPU might read buffer contents "prior" to the
> completion.

We need to have those 2 reads done in order yes, an lwsync on ppc would
be ok which is what we do on rmb(), but that is irrelevant to the IO
accessors, this is memory vs. memory

> The same would go for doing a "ioread()" that reads a status register in 
> MMIO that says "DMA is done". Before we can _trust_ that, we'd need to 
> synchronize with the DMA "out-of-band" stuff. I think that's an acceptable 
> interface, since it's something we already support for memory-based 
> interfaces. 

This is the typical case affecting most drivers. Read status from chip,
then read memory. We need 2 things here:

  - We need DMA synchronization, that is we need to enforce that the
MMIO read flushes the DMA. This is where the PCI spec is clear and where
we are getting this new 'relaxed' stuff potentially coming in the
picture, but I'm pretty much against making the relaxed case default

  - We also need to make sure the memory read isn't speculated and done
by the CPU prior to the IO read, which is currently acheive on PPC via
some deep magic inside of the IO reads (creating an artificial data
dependency on the result of all IO reads with a conditional trap that is
never taken followed an isync).

> It would be different, of course - it wouldn't be a read barrier, it would
> be a "ioread" barrier. But the _concept_ is the same. And it doesn't tend
> to hit us all that often, and the biggest pain is literally that in-spec
> PCI devices will never show this, so the real problem for hardware like
> SGI's in this case is lack of test coverage out in the rest of the world. 
> That's the price you pay if you do things differently.
> 
> Implementation shouldn't be that hard. On any conforming PCI platform, 
> it's a no-op, since reads from MMIO space are supposed to already 
> synchronize with any buffered DMA (it still leaves the question of CPU 
> memory ordering wrt uncached accesses, and thus we might want to have a 
> memory barrier in place depending on architecture). And SGI would have 
> something that synchronizes with the bridge nearest to the device.
> 
> So I _think_ the SGI case would be perfectly happy with an interface like 
> 
> 	void dma_sync(struct device *dev);

Hrm... I don't like that... Also, don't they need to do a read (just a
different form of read) to synchronize ? If they do, they actually need
a full IO accessor ... What does the PCI-X* (X, express, etc...) about
relaxed ordering IOs ?

> which on their SN2 machines would do "dev->bus->sync(dev)" (where they 
> would add something that reads from the bridge), and the rest of the world 
> might just do a memory barrier there, or leave it as a no-op.
> 
> And yes, a few drivers would have "dma_sync()" in a few places. Not that 
> many, likely.
> 
> Would driver writers get this wrong, and forget? Absolutely. But that's 
> true of any interface we could come up with. If there is something we can 
> absolutely depend on in life, it's that driver writers _will_ do something 
> wrong. 
> 
> 		Linus
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:42               ` Linus Torvalds
  2004-09-16 22:46                 ` Jeff Garzik
@ 2004-09-17 12:44                 ` Matthew Wilcox
  1 sibling, 0 replies; 63+ messages in thread
From: Matthew Wilcox @ 2004-09-17 12:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thu, Sep 16, 2004 at 03:42:34PM -0700, Linus Torvalds wrote:
> On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> > 2004/09/14 torvalds   |  * They do _not_ update the port address. If you
> > 2004/09/14 torvalds   |  * want MMIO that copies stuff laid out in MMIO
> > 2004/09/14 torvalds   |  * memory across multiple ports, use "memcpy_toio()"
> > 2004/09/14 torvalds   |  * and friends.
> > 
> > That implies you can pass an iomap cookie to memcpy_toio()
> 
> Actually, no it does not. It very much says "if you want __MMIO__ that 
> copies stuff".
> 
> You can pass cookies the other way:  if you create a cookie with the old
> "ioremap()" interface (which means MMIO only, of course), I intended for
> that to work well with the new interfaces (ioread/iowrite). But a
> new-style IO-map cookie will _not_ work with the old interfaces. 

OK, you're thinking in x86-specific terms and I'm thinking in PA-specific
terms ;-)  I don't particularly want to be able to memcpy_fooio() on
ioport space.  But I do want to be able to memcpy_fooio() on big-endian
mmio space and I do want to design the iomap cookies such that you can't
pass the old ioremap cookies to the new interface.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-17  5:20                   ` Benjamin Herrenschmidt
@ 2004-09-17 15:03                     ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2004-09-17 15:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik



On Fri, 17 Sep 2004, Benjamin Herrenschmidt wrote:
> 
> > Actually, we could also take the "safe" approach, and just specify that 
> > the new "ioread()/iowrite()" interfaces are always very relaxed, and apart 
> > from endianness won't do much extra synchronization.
> 
> That would be catastrophic with 99% of the drivers on PPC. The processor
> may speculate reads from outside spinlocks and that sort of horror, or
> defer them to until much later, crossing writes and such things...

I didn't say "unordered". I said "relaxed", which in this case meant 
"unordered wrt DMA.

You still need to enforce memory ordering. Quite frankly, I do not believe 
a non-ordered IO access is EVER valid.

		Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-17  5:17                   ` Benjamin Herrenschmidt
@ 2004-09-17 15:30                     ` Jesse Barnes
  0 siblings, 0 replies; 63+ messages in thread
From: Jesse Barnes @ 2004-09-17 15:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Matthew Wilcox, James Bottomley,
	Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Thursday, September 16, 2004 10:17 pm, Benjamin Herrenschmidt wrote:
> Hrm... that's weird... on sane DMA hardward, most PIO/MMIO reads are
> actually here just to guarantee that completion, interrupts isn't enough
> especially if we do things like polling (NAPI) etc... besides,
> interrupts are usually totally asynchronous to anything else, so I
> wouldn't say it's sane.

Well, it can definitely be problematic.  Devices that want to complete more 
than one transaction per interrupt have always been troublesome on sn 
platforms.

>
> > Sure, that's the way I'd like to go, but that's a dangerous route. 
> > Devices that want to do a PIO read to see if their DMA is complete will
> > work most of the time, but you'll get occasional data corruption.  IOW,
> > specifying reads that you know *aren't* dependent is much easier to do,
> > and safer.  It also allows SGI hardware to run all current drivers w/o
> > modification, which is nice.
>
> Agreed. Reads must NOT be relaxed by default

At least not for the current interface, but maybe for the new one.  The 
proposed dma_sync interface could take both an iomem address *and* a device 
if that would make it easier (it would for us).

Jesse

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2004-09-17 15:31 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
2004-09-08 23:07 ` David S. Miller
2004-09-08 23:25   ` Linus Torvalds
2004-09-09  1:19   ` Linus Torvalds
2004-09-09  4:36     ` David S. Miller
2004-09-09  5:56       ` Richard Henderson
2004-09-09  5:04     ` viro
2004-09-09  5:05       ` David S. Miller
2004-09-09  5:13       ` Linus Torvalds
2004-09-09  6:08         ` viro
2004-09-09  8:27   ` Geert Uytterhoeven
2004-09-09  6:23 ` David Woodhouse
2004-09-09 13:14   ` Alan Cox
2004-09-11  6:09 ` Linus Torvalds
2004-09-11  6:42   ` Anton Blanchard
2004-09-11  7:26     ` Benjamin Herrenschmidt
2004-09-11  7:29     ` Geert Uytterhoeven
2004-09-11  7:23   ` Benjamin Herrenschmidt
2004-09-11 14:42   ` Alan Cox
2004-09-15 15:03 ` Linus Torvalds
2004-09-15 19:02   ` Geert Uytterhoeven
2004-09-15 19:16     ` Linus Torvalds
2004-09-15 19:40       ` Matthew Wilcox
2004-09-15 20:10         ` Linus Torvalds
2004-09-15 20:17           ` Linus Torvalds
2004-09-16 12:17           ` David Woodhouse
2004-09-16 13:52             ` Linus Torvalds
2004-09-15 20:20       ` Russell King
2004-09-15 20:34         ` Linus Torvalds
2004-09-15 20:51           ` Linus Torvalds
2004-09-15 22:38       ` James Bottomley
2004-09-16  2:33         ` Matthew Wilcox
2004-09-16  4:28           ` Linus Torvalds
2004-09-16  4:57             ` Benjamin Herrenschmidt
2004-09-16  4:58               ` Benjamin Herrenschmidt
2004-09-16 13:41             ` Matthew Wilcox
2004-09-16 18:21               ` Linus Torvalds
2004-09-16 18:52                 ` Jesse Barnes
2004-09-16 19:09                   ` Linus Torvalds
2004-09-16 20:02                     ` Jesse Barnes
2004-09-16 20:37                       ` James Bottomley
2004-09-16 20:42                         ` Jesse Barnes
2004-09-16 21:37                           ` Grant Grundler
2004-09-16 20:04                   ` David S. Miller
2004-09-16 20:13                     ` Jeff Garzik
2004-09-16 20:45                       ` David S. Miller
2004-09-16 20:20                     ` Jesse Barnes
2004-09-17  5:17                   ` Benjamin Herrenschmidt
2004-09-17 15:30                     ` Jesse Barnes
2004-09-16 19:01                 ` Linus Torvalds
2004-09-16 19:13                   ` Jeff Garzik
2004-09-16 19:50                     ` Linus Torvalds
2004-09-16 20:07                       ` Alan Cox
2004-09-17  5:44                       ` Benjamin Herrenschmidt
2004-09-17  5:20                   ` Benjamin Herrenschmidt
2004-09-17 15:03                     ` Linus Torvalds
2004-09-16 22:30             ` Matthew Wilcox
2004-09-16 22:42               ` Linus Torvalds
2004-09-16 22:46                 ` Jeff Garzik
2004-09-16 23:15                   ` Linus Torvalds
2004-09-16 23:30                     ` Jeff Garzik
2004-09-16 23:43                       ` Linus Torvalds
2004-09-17 12:44                 ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox