All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC: being more anal about iospace accesses..
@ 2004-09-08 22:57 Linus Torvalds
  2004-09-08 23:07 ` David S. Miller
                   ` (3 more replies)
  0 siblings, 4 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-08 22:57 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller


In my continual quest to make sparse more useful, and statically check
more stuff, I started looking at verifying correct usage of PCI
memory-mapped accesses. They are nasty because they generally "just work"  
on x86 (and some other architectures), and because we've historically had 
very permissive types for the macros that are supposed to be used. So 
there's been absolutely _tons_ of confusion whether a pointer is a pointer 
to regular kernel memory, or to memory-mapped IO space. Not helped by the 
fact that sometimes it's an integer (which can be 32-bit, and then you 
totally break on 64-bit architectures).

A quick trial patch shows that it doesn't seem to be _too_ painful to use 
the same sparse "address space" mechanism for this as we use for user 
pointers. Just do

	#define __iospace __attribute__((noderef,address_space(2)))

and force all "ioremap()" returns to have type "void __iospace *".  
Together with making "readb/w/l()" take this kind of pointer, it "Just
Makes Sense(tm)".

HOWEVER, some of that spills over even outside of "sparse". Because some 
people have successfully used "unsigned long" instead of a pointer, this 
will actually cause warnings even for normal gcc usage. And while I've 
tried out a compile on my normal x86 setup and could fix all of the 
gcc-visible things, this may be disruptive enough that people object to 
doing something like this.

Comments? I'm appending my "iospace-patch" which is enough to make gcc
happy on my (very limited) x86 driver configuration, and starts annotating
_some_ of the users with the sparse __iospace annotations. A full sparse 
run still will result in a _lot_ of warnings, but this does seem to be the 
right way to go. It not only makes it very obvious whether a pointer 
points to memory or IO space, but sparse will also warn about trying to 
dereference such a thing directly. 

Btw, doing this I was again amazed at the kind of crap we have. Drivers
casting "ioremap()" to "unsigned long" only to then cast it back to a
pointer when actually using it in any way etc etc. That's a sign of some
really bad historical interfaces. I'd blame somebody else, but dammit, I
can't find anybody but myself ;)

			Linus

----
===== arch/i386/mm/ioremap.c 1.21 vs edited =====
--- 1.21/arch/i386/mm/ioremap.c	2004-06-17 23:41:15 -07:00
+++ edited/arch/i386/mm/ioremap.c	2004-09-08 15:22:18 -07:00
@@ -110,9 +110,9 @@
  * have to convert them into an offset in a page-aligned mapping, but the
  * caller shouldn't need to know that small detail.
  */
-void * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flags)
+void __iospace * __ioremap(unsigned long phys_addr, unsigned long size, unsigned long flags)
 {
-	void * addr;
+	void __iospace * addr;
 	struct vm_struct * area;
 	unsigned long offset, last_addr;
 
@@ -125,7 +125,7 @@
 	 * Don't remap the low PCI/ISA area, it's always mapped..
 	 */
 	if (phys_addr >= 0xA0000 && last_addr < 0x100000)
-		return phys_to_virt(phys_addr);
+		return (void __iospace *) phys_to_virt(phys_addr);
 
 	/*
 	 * Don't allow anybody to remap normal RAM that we're using..
@@ -156,12 +156,12 @@
 	if (!area)
 		return NULL;
 	area->phys_addr = phys_addr;
-	addr = area->addr;
+	addr = (void __iospace *) area->addr;
 	if (remap_area_pages((unsigned long) addr, phys_addr, size, flags)) {
-		vunmap(addr);
+		vunmap((void __force *) addr);
 		return NULL;
 	}
-	return (void *) (offset + (char *)addr);
+	return (void __iospace *) (offset + (char __iospace *)addr);
 }
 
 
@@ -187,10 +187,10 @@
  * Must be freed with iounmap.
  */
 
-void *ioremap_nocache (unsigned long phys_addr, unsigned long size)
+void __iospace *ioremap_nocache (unsigned long phys_addr, unsigned long size)
 {
 	unsigned long last_addr;
-	void *p = __ioremap(phys_addr, size, _PAGE_PCD);
+	void __iospace *p = __ioremap(phys_addr, size, _PAGE_PCD);
 	if (!p) 
 		return p; 
 
@@ -221,12 +221,12 @@
 	return p;					
 }
 
-void iounmap(void *addr)
+void iounmap(void __iospace *addr)
 {
 	struct vm_struct *p;
-	if (addr <= high_memory) 
+	if ((void __force *) addr <= high_memory) 
 		return; 
-	p = remove_vm_area((void *) (PAGE_MASK & (unsigned long) addr));
+	p = remove_vm_area((void *) (PAGE_MASK & (unsigned long __force) addr));
 	if (!p) { 
 		printk("__iounmap: bad address %p\n", addr);
 		return;
===== arch/i386/pci/mmconfig.c 1.1 vs edited =====
--- 1.1/arch/i386/pci/mmconfig.c	2004-02-29 22:20:00 -08:00
+++ edited/arch/i386/pci/mmconfig.c	2004-09-08 15:37:51 -07:00
@@ -9,7 +9,7 @@
 /* The physical address of the MMCONFIG aperture.  Set from ACPI tables. */
 u32 pci_mmcfg_base_addr;
 
-#define mmcfg_virt_addr (fix_to_virt(FIX_PCIE_MCFG))
+#define mmcfg_virt_addr ((void __iospace *) fix_to_virt(FIX_PCIE_MCFG))
 
 /* The base address of the last MMCONFIG device accessed */
 static u32 mmcfg_last_accessed_device;
===== arch/i386/kernel/pci-dma.c 1.16 vs edited =====
--- 1.16/arch/i386/kernel/pci-dma.c	2004-08-22 17:39:03 -07:00
+++ edited/arch/i386/kernel/pci-dma.c	2004-09-08 11:07:13 -07:00
@@ -72,7 +72,7 @@
 int dma_declare_coherent_memory(struct device *dev, dma_addr_t bus_addr,
 				dma_addr_t device_addr, size_t size, int flags)
 {
-	void *mem_base;
+	void __iospace *mem_base;
 	int pages = size >> PAGE_SHIFT;
 	int bitmap_size = (pages + 31)/32;
 
===== arch/x86_64/kernel/early_printk.c 1.13 vs edited =====
--- 1.13/arch/x86_64/kernel/early_printk.c	2004-08-26 23:59:59 -07:00
+++ edited/arch/x86_64/kernel/early_printk.c	2004-09-08 11:11:32 -07:00
@@ -8,7 +8,7 @@
 /* Simple VGA output */
 
 #ifdef __i386__
-#define VGABASE		__pa((void *)(__PAGE_OFFSET + 0xb8000UL))
+#define VGABASE		(__ISA_IO_base + 0xb8000)
 #else
 #define VGABASE		0xffffffff800b8000UL
 #endif
===== sound/pci/intel8x0.c 1.109 vs edited =====
--- 1.109/sound/pci/intel8x0.c	2004-08-23 21:01:07 -07:00
+++ edited/sound/pci/intel8x0.c	2004-09-08 15:40:07 -07:00
@@ -406,10 +406,10 @@
 
 	unsigned int mmio;
 	unsigned long addr;
-	unsigned long remap_addr;
+	void __iospace * remap_addr;
 	unsigned int bm_mmio;
 	unsigned long bmaddr;
-	unsigned long remap_bmaddr;
+	void __iospace * remap_bmaddr;
 
 	struct pci_dev *pci;
 	snd_card_t *card;
@@ -2227,9 +2227,9 @@
 		snd_dma_free_pages(&chip->bdbars);
 	}
 	if (chip->remap_addr)
-		iounmap((void *) chip->remap_addr);
+		iounmap(chip->remap_addr);
 	if (chip->remap_bmaddr)
-		iounmap((void *) chip->remap_bmaddr);
+		iounmap(chip->remap_bmaddr);
 	pci_release_regions(chip->pci);
 	kfree(chip);
 	return 0;
@@ -2502,9 +2502,8 @@
 	if (pci_resource_flags(pci, 2) & IORESOURCE_MEM) {	/* ICH4 and Nforce */
 		chip->mmio = 1;
 		chip->addr = pci_resource_start(pci, 2);
-		chip->remap_addr = (unsigned long) ioremap_nocache(chip->addr,
-								   pci_resource_len(pci, 2));
-		if (chip->remap_addr == 0) {
+		chip->remap_addr = ioremap_nocache(chip->addr, pci_resource_len(pci, 2));
+		if (!chip->remap_addr) {
 			snd_printk("AC'97 space ioremap problem\n");
 			snd_intel8x0_free(chip);
 			return -EIO;
@@ -2515,9 +2514,8 @@
 	if (pci_resource_flags(pci, 3) & IORESOURCE_MEM) {	/* ICH4 */
 		chip->bm_mmio = 1;
 		chip->bmaddr = pci_resource_start(pci, 3);
-		chip->remap_bmaddr = (unsigned long) ioremap_nocache(chip->bmaddr,
-								     pci_resource_len(pci, 3));
-		if (chip->remap_bmaddr == 0) {
+		chip->remap_bmaddr = ioremap_nocache(chip->bmaddr, pci_resource_len(pci, 3));
+		if (!chip->remap_bmaddr) {
 			snd_printk("Controller space ioremap problem\n");
 			snd_intel8x0_free(chip);
 			return -EIO;
===== drivers/ide/ide-iops.c 1.26 vs edited =====
--- 1.26/drivers/ide/ide-iops.c	2004-05-29 11:30:32 -07:00
+++ edited/drivers/ide/ide-iops.c	2004-09-08 15:02:33 -07:00
@@ -112,57 +112,57 @@
 
 static u8 ide_mm_inb (unsigned long port)
 {
-	return (u8) readb(port);
+	return (u8) readb((void __iospace *) port);
 }
 
 static u16 ide_mm_inw (unsigned long port)
 {
-	return (u16) readw(port);
+	return (u16) readw((void __iospace *) port);
 }
 
 static void ide_mm_insw (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_insw(port, addr, count);
+	__ide_mm_insw((void __iospace *) port, addr, count);
 }
 
 static u32 ide_mm_inl (unsigned long port)
 {
-	return (u32) readl(port);
+	return (u32) readl((void __iospace *) port);
 }
 
 static void ide_mm_insl (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_insl(port, addr, count);
+	__ide_mm_insl((void __iospace *) port, addr, count);
 }
 
 static void ide_mm_outb (u8 value, unsigned long port)
 {
-	writeb(value, port);
+	writeb(value, (void __iospace *) port);
 }
 
 static void ide_mm_outbsync (ide_drive_t *drive, u8 value, unsigned long port)
 {
-	writeb(value, port);	
+	writeb(value, (void __iospace *) port);
 }
 
 static void ide_mm_outw (u16 value, unsigned long port)
 {
-	writew(value, port);
+	writew(value, (void __iospace *) port);
 }
 
 static void ide_mm_outsw (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_outsw(port, addr, count);
+	__ide_mm_outsw((void __iospace *) port, addr, count);
 }
 
 static void ide_mm_outl (u32 value, unsigned long port)
 {
-	writel(value, port);
+	writel(value, (void __iospace *) port);
 }
 
 static void ide_mm_outsl (unsigned long port, void *addr, u32 count)
 {
-	__ide_mm_outsl(port, addr, count);
+	__ide_mm_outsl((void __iospace *) port, addr, count);
 }
 
 void default_hwif_mmiops (ide_hwif_t *hwif)
===== drivers/acpi/osl.c 1.54 vs edited =====
--- 1.54/drivers/acpi/osl.c	2004-08-13 22:10:58 -07:00
+++ edited/drivers/acpi/osl.c	2004-09-08 11:24:41 -07:00
@@ -171,11 +171,11 @@
 }
 
 acpi_status
-acpi_os_map_memory(acpi_physical_address phys, acpi_size size, void **virt)
+acpi_os_map_memory(acpi_physical_address phys, acpi_size size, void __iospace **virt)
 {
 	if (efi_enabled) {
 		if (EFI_MEMORY_WB & efi_mem_attributes(phys)) {
-			*virt = phys_to_virt(phys);
+			*virt = (void __iospace *) phys_to_virt(phys);
 		} else {
 			*virt = ioremap(phys, size);
 		}
@@ -197,7 +197,7 @@
 }
 
 void
-acpi_os_unmap_memory(void *virt, acpi_size size)
+acpi_os_unmap_memory(void __iospace *virt, acpi_size size)
 {
 	iounmap(virt);
 }
@@ -376,30 +376,31 @@
 	u32			width)
 {
 	u32			dummy;
-	void			*virt_addr;
+	void __iospace		*virt_addr;
 	int			iomem = 0;
 
 	if (efi_enabled) {
 		if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) {
-			virt_addr = phys_to_virt(phys_addr);
+			/* HACK ALERT! We can use readb/w/l on real memory too.. */
+			virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 		} else {
 			iomem = 1;
 			virt_addr = ioremap(phys_addr, width);
 		}
 	} else
-		virt_addr = phys_to_virt(phys_addr);
+		virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 	if (!value)
 		value = &dummy;
 
 	switch (width) {
 	case 8:
-		*(u8*) value = *(u8*) virt_addr;
+		*(u8*) value = readb(virt_addr);
 		break;
 	case 16:
-		*(u16*) value = *(u16*) virt_addr;
+		*(u16*) value = readw(virt_addr);
 		break;
 	case 32:
-		*(u32*) value = *(u32*) virt_addr;
+		*(u32*) value = readl(virt_addr);
 		break;
 	default:
 		BUG();
@@ -419,28 +420,29 @@
 	u32			value,
 	u32			width)
 {
-	void			*virt_addr;
+	void __iospace		*virt_addr;
 	int			iomem = 0;
 
 	if (efi_enabled) {
 		if (EFI_MEMORY_WB & efi_mem_attributes(phys_addr)) {
-			virt_addr = phys_to_virt(phys_addr);
+			/* HACK ALERT! We can use writeb/w/l on real memory too */
+			virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 		} else {
 			iomem = 1;
 			virt_addr = ioremap(phys_addr, width);
 		}
 	} else
-		virt_addr = phys_to_virt(phys_addr);
+		virt_addr = (void __iospace *) phys_to_virt(phys_addr);
 
 	switch (width) {
 	case 8:
-		*(u8*) virt_addr = value;
+		writeb(value, virt_addr);
 		break;
 	case 16:
-		*(u16*) virt_addr = value;
+		writew(value, virt_addr);
 		break;
 	case 32:
-		*(u32*) virt_addr = value;
+		writel(value, virt_addr);
 		break;
 	default:
 		BUG();
@@ -962,7 +964,7 @@
 {
 #if defined(__i386__) || defined(__x86_64__) 
 	char tmp;
-	return !__get_user(tmp, (char *)ptr) && !__get_user(tmp, (char *)ptr + len - 1);
+	return !__get_user(tmp, (char __user *)ptr) && !__get_user(tmp, (char __user *)ptr + len - 1);
 #endif
 	return 1;
 }
===== drivers/serial/8250_pci.c 1.39 vs edited =====
--- 1.39/drivers/serial/8250_pci.c	2004-08-23 01:15:10 -07:00
+++ edited/drivers/serial/8250_pci.c	2004-09-08 15:35:56 -07:00
@@ -83,7 +83,7 @@
 
 struct serial_private {
 	unsigned int		nr;
-	void			*remapped_bar[PCI_NUM_BAR_RESOURCES];
+	void __iospace		*remapped_bar[PCI_NUM_BAR_RESOURCES];
 	struct pci_serial_quirk	*quirk;
 	int			line[0];
 };
@@ -243,7 +243,8 @@
  */
 static int __devinit pci_plx9050_init(struct pci_dev *dev)
 {
-	u8 *p, irq_config;
+	u8 irq_config;
+	void __iospace *p;
 
 	if ((pci_resource_flags(dev, 0) & IORESOURCE_MEM) == 0) {
 		moan_device("no memory in bar 0", dev);
@@ -272,12 +273,12 @@
 	p = ioremap(pci_resource_start(dev, 0), 0x80);
 	if (p == NULL)
 		return -ENOMEM;
-	writel(irq_config, (unsigned long)p + 0x4c);
+	writel(irq_config, p + 0x4c);
 
 	/*
 	 * Read the register back to ensure that it took effect.
 	 */
-	readl((unsigned long)p + 0x4c);
+	readl(p + 0x4c);
 	iounmap(p);
 
 	return 0;
@@ -397,7 +398,8 @@
 
 static int pci_siig10x_init(struct pci_dev *dev)
 {
-	u16 data, *p;
+	u16 data;
+	void __iospace *p;
 
 	switch (dev->device & 0xfff8) {
 	case PCI_DEVICE_ID_SIIG_1S_10x:	/* 1S */
@@ -415,8 +417,8 @@
 	if (p == NULL)
 		return -ENOMEM;
 
-	writew(readw((unsigned long) p + 0x28) & data, (unsigned long) p + 0x28);
-	readw((unsigned long)p + 0x28);
+	writew(readw(p + 0x28) & data, p + 0x28);
+	readw(p + 0x28);
 	iounmap(p);
 	return 0;
 }
===== include/asm/io.h 1.26 vs edited =====
--- 1.26/include/asm-i386/io.h	2004-03-12 01:30:22 -08:00
+++ edited/include/asm/io.h	2004-09-08 11:17:57 -07:00
@@ -2,6 +2,8 @@
 #define _ASM_IO_H
 
 #include <linux/config.h>
+#include <linux/string.h>
+#include <linux/compiler.h>
 
 /*
  * This file contains the definitions for the x86 IO instructions
@@ -86,7 +88,7 @@
  */
 #define page_to_phys(page)    ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
 
-extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
+extern void __iospace * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
 
 /**
  * ioremap     -   map bus memory into CPU space
@@ -100,13 +102,13 @@
  * address. 
  */
 
-static inline void * ioremap (unsigned long offset, unsigned long size)
+static inline void __iospace * ioremap (unsigned long offset, unsigned long size)
 {
 	return __ioremap(offset, size, 0);
 }
 
-extern void * ioremap_nocache (unsigned long offset, unsigned long size);
-extern void iounmap(void *addr);
+extern void __iospace * ioremap_nocache (unsigned long offset, unsigned long size);
+extern void iounmap(void __iospace *addr);
 
 /*
  * bt_ioremap() and bt_iounmap() are for temporary early boot-time
@@ -139,9 +141,18 @@
  * memory location directly.
  */
 
-#define readb(addr) (*(volatile unsigned char *) (addr))
-#define readw(addr) (*(volatile unsigned short *) (addr))
-#define readl(addr) (*(volatile unsigned int *) (addr))
+static inline unsigned char readb(volatile void __iospace *addr)
+{
+	return *(volatile unsigned char __force *) addr;
+}
+static inline unsigned short readw(volatile void __iospace *addr)
+{
+	return *(volatile unsigned short __force *) addr;
+}
+static inline unsigned int readl(volatile void __iospace *addr)
+{
+	return *(volatile unsigned int __force *) addr;
+}
 #define readb_relaxed(addr) readb(addr)
 #define readw_relaxed(addr) readw(addr)
 #define readl_relaxed(addr) readl(addr)
@@ -149,16 +160,34 @@
 #define __raw_readw readw
 #define __raw_readl readl
 
-#define writeb(b,addr) (*(volatile unsigned char *) (addr) = (b))
-#define writew(b,addr) (*(volatile unsigned short *) (addr) = (b))
-#define writel(b,addr) (*(volatile unsigned int *) (addr) = (b))
+static inline void writeb(unsigned char b, volatile void __iospace *addr)
+{
+	*(volatile unsigned char __force *) addr = b;
+}
+static inline void writew(unsigned short b, volatile void __iospace *addr)
+{
+	*(volatile unsigned short __force *) addr = b;
+}
+static inline void writel(unsigned int b, volatile void __iospace *addr)
+{
+	*(volatile unsigned int __force *) addr = b;
+}
 #define __raw_writeb writeb
 #define __raw_writew writew
 #define __raw_writel writel
 
-#define memset_io(a,b,c)	memset((void *)(a),(b),(c))
-#define memcpy_fromio(a,b,c)	__memcpy((a),(void *)(b),(c))
-#define memcpy_toio(a,b,c)	__memcpy((void *)(a),(b),(c))
+static inline void memset_io(volatile void __iospace *addr, unsigned char val, int count)
+{
+	memset((void __force *) addr, val, count);
+}
+static inline void memcpy_fromio(void *dst, volatile void __iospace *src, int count)
+{
+	__memcpy(dst, (void __force *) src, count);
+}
+static inline void memcpy_toio(volatile void __iospace *dst, void *src, int count)
+{
+	__memcpy((void __force *) dst, src, count);
+}
 
 /*
  * ISA space is 'always mapped' on a typical x86 system, no need to
@@ -168,7 +197,7 @@
  * used as the IO-area pointer (it can be iounmapped as well, so the
  * analogy with PCI is quite large):
  */
-#define __ISA_IO_base ((char *)(PAGE_OFFSET))
+#define __ISA_IO_base ((char __iospace *)(PAGE_OFFSET))
 
 #define isa_readb(a) readb(__ISA_IO_base + (a))
 #define isa_readw(a) readw(__ISA_IO_base + (a))
@@ -185,8 +214,8 @@
  * Again, i386 does not require mem IO specific function.
  */
 
-#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void *)(b),(c),(d))
-#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void *)(__ISA_IO_base + (b)),(c),(d))
+#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void __force *)(b),(c),(d))
+#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void __force *)(__ISA_IO_base + (b)),(c),(d))
 
 /**
  *	check_signature		-	find BIOS signatures
@@ -199,7 +228,7 @@
  *	Returns 1 on a match.
  */
  
-static inline int check_signature(unsigned long io_addr,
+static inline int check_signature(volatile void __iospace * io_addr,
 	const unsigned char *signature, int length)
 {
 	int retval = 0;
===== include/acpi/acpiosxf.h 1.34 vs edited =====
--- 1.34/include/acpi/acpiosxf.h	2004-07-16 22:37:53 -07:00
+++ edited/include/acpi/acpiosxf.h	2004-09-08 11:19:20 -07:00
@@ -169,11 +169,11 @@
 acpi_os_map_memory (
 	acpi_physical_address           physical_address,
 	acpi_size                       size,
-	void                            **logical_address);
+	void __iospace                  **logical_address);
 
 void
 acpi_os_unmap_memory (
-	void                            *logical_address,
+	void __iospace                  *logical_address,
 	acpi_size                       size);
 
 acpi_status
===== include/linux/compiler.h 1.29 vs edited =====
--- 1.29/include/linux/compiler.h	2004-08-23 01:14:53 -07:00
+++ edited/include/linux/compiler.h	2004-09-08 10:34:45 -07:00
@@ -6,13 +6,17 @@
 # define __kernel	/* default address space */
 # define __safe		__attribute__((safe))
 # define __force	__attribute__((force))
+# define __iospace	__attribute__((noderef, address_space(2)))
 extern void __chk_user_ptr(void __user *);
+extern void __chk_io_ptr(void __iospace *);
 #else
 # define __user
 # define __kernel
 # define __safe
 # define __force
+# define __iospace
 # define __chk_user_ptr(x) (void)0
+# define __chk_io_ptr(x) (void)0
 #endif
 
 #ifdef __KERNEL__
===== include/asm-i386/io.h 1.26 vs edited =====
--- 1.26/include/asm-i386/io.h	2004-03-12 01:30:22 -08:00
+++ edited/include/asm-i386/io.h	2004-09-08 11:17:57 -07:00
@@ -2,6 +2,8 @@
 #define _ASM_IO_H
 
 #include <linux/config.h>
+#include <linux/string.h>
+#include <linux/compiler.h>
 
 /*
  * This file contains the definitions for the x86 IO instructions
@@ -86,7 +88,7 @@
  */
 #define page_to_phys(page)    ((dma_addr_t)page_to_pfn(page) << PAGE_SHIFT)
 
-extern void * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
+extern void __iospace * __ioremap(unsigned long offset, unsigned long size, unsigned long flags);
 
 /**
  * ioremap     -   map bus memory into CPU space
@@ -100,13 +102,13 @@
  * address. 
  */
 
-static inline void * ioremap (unsigned long offset, unsigned long size)
+static inline void __iospace * ioremap (unsigned long offset, unsigned long size)
 {
 	return __ioremap(offset, size, 0);
 }
 
-extern void * ioremap_nocache (unsigned long offset, unsigned long size);
-extern void iounmap(void *addr);
+extern void __iospace * ioremap_nocache (unsigned long offset, unsigned long size);
+extern void iounmap(void __iospace *addr);
 
 /*
  * bt_ioremap() and bt_iounmap() are for temporary early boot-time
@@ -139,9 +141,18 @@
  * memory location directly.
  */
 
-#define readb(addr) (*(volatile unsigned char *) (addr))
-#define readw(addr) (*(volatile unsigned short *) (addr))
-#define readl(addr) (*(volatile unsigned int *) (addr))
+static inline unsigned char readb(volatile void __iospace *addr)
+{
+	return *(volatile unsigned char __force *) addr;
+}
+static inline unsigned short readw(volatile void __iospace *addr)
+{
+	return *(volatile unsigned short __force *) addr;
+}
+static inline unsigned int readl(volatile void __iospace *addr)
+{
+	return *(volatile unsigned int __force *) addr;
+}
 #define readb_relaxed(addr) readb(addr)
 #define readw_relaxed(addr) readw(addr)
 #define readl_relaxed(addr) readl(addr)
@@ -149,16 +160,34 @@
 #define __raw_readw readw
 #define __raw_readl readl
 
-#define writeb(b,addr) (*(volatile unsigned char *) (addr) = (b))
-#define writew(b,addr) (*(volatile unsigned short *) (addr) = (b))
-#define writel(b,addr) (*(volatile unsigned int *) (addr) = (b))
+static inline void writeb(unsigned char b, volatile void __iospace *addr)
+{
+	*(volatile unsigned char __force *) addr = b;
+}
+static inline void writew(unsigned short b, volatile void __iospace *addr)
+{
+	*(volatile unsigned short __force *) addr = b;
+}
+static inline void writel(unsigned int b, volatile void __iospace *addr)
+{
+	*(volatile unsigned int __force *) addr = b;
+}
 #define __raw_writeb writeb
 #define __raw_writew writew
 #define __raw_writel writel
 
-#define memset_io(a,b,c)	memset((void *)(a),(b),(c))
-#define memcpy_fromio(a,b,c)	__memcpy((a),(void *)(b),(c))
-#define memcpy_toio(a,b,c)	__memcpy((void *)(a),(b),(c))
+static inline void memset_io(volatile void __iospace *addr, unsigned char val, int count)
+{
+	memset((void __force *) addr, val, count);
+}
+static inline void memcpy_fromio(void *dst, volatile void __iospace *src, int count)
+{
+	__memcpy(dst, (void __force *) src, count);
+}
+static inline void memcpy_toio(volatile void __iospace *dst, void *src, int count)
+{
+	__memcpy((void __force *) dst, src, count);
+}
 
 /*
  * ISA space is 'always mapped' on a typical x86 system, no need to
@@ -168,7 +197,7 @@
  * used as the IO-area pointer (it can be iounmapped as well, so the
  * analogy with PCI is quite large):
  */
-#define __ISA_IO_base ((char *)(PAGE_OFFSET))
+#define __ISA_IO_base ((char __iospace *)(PAGE_OFFSET))
 
 #define isa_readb(a) readb(__ISA_IO_base + (a))
 #define isa_readw(a) readw(__ISA_IO_base + (a))
@@ -185,8 +214,8 @@
  * Again, i386 does not require mem IO specific function.
  */
 
-#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void *)(b),(c),(d))
-#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void *)(__ISA_IO_base + (b)),(c),(d))
+#define eth_io_copy_and_sum(a,b,c,d)		eth_copy_and_sum((a),(void __force *)(b),(c),(d))
+#define isa_eth_io_copy_and_sum(a,b,c,d)	eth_copy_and_sum((a),(void __force *)(__ISA_IO_base + (b)),(c),(d))
 
 /**
  *	check_signature		-	find BIOS signatures
@@ -199,7 +228,7 @@
  *	Returns 1 on a match.
  */
  
-static inline int check_signature(unsigned long io_addr,
+static inline int check_signature(volatile void __iospace * io_addr,
 	const unsigned char *signature, int length)
 {
 	int retval = 0;
===== include/asm-generic/ide_iops.h 1.1 vs edited =====
--- 1.1/include/asm-generic/ide_iops.h	2003-02-18 06:31:01 -08:00
+++ edited/include/asm-generic/ide_iops.h	2004-09-08 15:03:14 -07:00
@@ -5,7 +5,7 @@
 #define __ide_outsw	outsw
 #define __ide_outsl	outsl
 
-static __inline__ void __ide_mm_insw(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_insw(void __iospace *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u16 *)addr = readw(port);
@@ -13,7 +13,7 @@
 	}
 }
 
-static __inline__ void __ide_mm_insl(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_insl(void __iospace *port, void *addr, u32 count)
 {
 	while (count--) {
 		*(u32 *)addr = readl(port);
@@ -21,7 +21,7 @@
 	}
 }
 
-static __inline__ void __ide_mm_outsw(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_outsw(void __iospace *port, void *addr, u32 count)
 {
 	while (count--) {
 		writew(*(u16 *)addr, port);
@@ -29,7 +29,7 @@
 	}
 }
 
-static __inline__ void __ide_mm_outsl(unsigned long port, void *addr, u32 count)
+static __inline__ void __ide_mm_outsl(void __iospace * port, void *addr, u32 count)
 {
 	while (count--) {
 		writel(*(u32 *)addr, port);

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
@ 2004-09-08 23:07 ` David S. Miller
  2004-09-08 23:25   ` Linus Torvalds
                     ` (2 more replies)
  2004-09-09  6:23 ` David Woodhouse
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-08 23:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, akpm, linux-arch, alan


I'm all for this except a possible naming suggestion that
the thing should be called "iomemspace" not plain "iospace".

We already use similar naming distinctions elsewhere in the tree
and even in user visible things such as /proc/ioport /proc/iomem

I wish we could make it fail with the compiler too somehow, that
way people other than us sparse geeks would see and fix this stuff
up.

Anyways, I like this a lot.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 23:07 ` David S. Miller
@ 2004-09-08 23:25   ` Linus Torvalds
  2004-09-09  1:19   ` Linus Torvalds
  2004-09-09  8:27   ` Geert Uytterhoeven
  2 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-08 23:25 UTC (permalink / raw)
  To: David S. Miller; +Cc: viro, akpm, linux-arch, alan



On Wed, 8 Sep 2004, David S. Miller wrote:
> 
> I'm all for this except a possible naming suggestion that
> the thing should be called "iomemspace" not plain "iospace".

I was initially going to call it just "__io", but decided that somebody
may well already be using that as a name somewhere (and a quick grep
indeed confirmed that). Thus "__iospace".

But "__iomem" would make sense. I don't think we need both "mem" and
"space", and I agree that we already have used the term "iomem" for this
thing. Will rename.

Anyway, I've already got two fairly enthusiastic votes for the concept,
but I'll give people some more time to think it over.

This would make our "warning statistics" look bad (but see earlier email
on why I think it's entirely safe even in a stable series otherwise),
which I'm a bit concerned about. But there's absolutely no way to avoid
that - anything in drivers will inevitably take a minimum of a few kernel
releases, and possibly _much_ longer before people get upset enough to fix
the warnings in random driver X.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 23:07 ` David S. Miller
  2004-09-08 23:25   ` Linus Torvalds
@ 2004-09-09  1:19   ` Linus Torvalds
  2004-09-09  4:36     ` David S. Miller
  2004-09-09  5:04     ` viro
  2004-09-09  8:27   ` Geert Uytterhoeven
  2 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-09  1:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: viro, akpm, linux-arch, alan



On Wed, 8 Sep 2004, David S. Miller wrote:
> 
> I wish we could make it fail with the compiler too somehow, that
> way people other than us sparse geeks would see and fix this stuff
> up.

I don't want to break a stable release too much. As-is, the _code_ 
generated should be 100% the same, even if gcc complains. So afaik, my 
patch is very safe indeed, which I like.

So it can cause gcc to complain for non-converted code ("xxx makes pointer
from integer without a cast"), but since any such code ended up converting
it to a pointer _with_ a cast anyway, I really believe code-generation is 
safe.

In contrast, to actually make gcc complain about the types of these things 
would not only be disruptive, it would require some _major_ change to how 
we do things. Right now we assume that "ioremap()" generates an 
offsettable "thing", and that really means that it must be of one of three 
types:
 - "unsigned long" (obviously type-unsafe, and would be totally pointless)
 - "void *" or "char *" (this is basically what we used to have).

Neither is in any way type-safe, nor can we make gcc complain without
breaking the "offsettable" requirement. The only way to make gcc complain
loudly would be to use something that can't be converted to anything else
(eg a "struct iomem_opaque { unsigned long cookie; }") but that would
break every single user out there, and require us to do the off-setting at
a much lower level (passing along both cookie and offset separately all
the way down the callchain - horribly bad in many ways).

In contrast, using sparse means that we can still generate exactly the
same code, have all old code still work (albeit with the occasional
potential warning from gcc) _and_ have a fairly easy way to check.

In fact, even modern external modules that work with 2.6.x kernels are 
quite checkable. I just verified that using "make C=1" works perfectly 
fine on an external module tree). So yes, it's just for "us sparse geeks", 
but sparse is to easy to set up and use that I don't think that's a huge 
downside:

	make ; make install ; "use 'C=1' when compiling the kernel"

it can't get much simpler than that.

(I think the biggest problem is _finding_ sparse, but just googling for 
"linux sparse tar-ball" shows at least two half-way relevant hits in the 
top five, so..)

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  1:19   ` Linus Torvalds
@ 2004-09-09  4:36     ` David S. Miller
  2004-09-09  5:56       ` Richard Henderson
  2004-09-09  5:04     ` viro
  1 sibling, 1 reply; 97+ messages in thread
From: David S. Miller @ 2004-09-09  4:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: viro, akpm, linux-arch, alan

On Wed, 8 Sep 2004 18:19:08 -0700 (PDT)
Linus Torvalds <torvalds@osdl.org> wrote:

> On Wed, 8 Sep 2004, David S. Miller wrote:
> > 
> > I wish we could make it fail with the compiler too somehow, that
> > way people other than us sparse geeks would see and fix this stuff
> > up.
> 
> I don't want to break a stable release too much. As-is, the _code_ 
> generated should be 100% the same, even if gcc complains. So afaik, my 
> patch is very safe indeed, which I like.

Agreed.  I wish GCC had a way to define a type which is offsetable
yet not an "int".  Surprise surprise, that works out to be your
address space thing in sparse. :)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  1:19   ` Linus Torvalds
  2004-09-09  4:36     ` David S. Miller
@ 2004-09-09  5:04     ` viro
  2004-09-09  5:05       ` David S. Miller
  2004-09-09  5:13       ` Linus Torvalds
  1 sibling, 2 replies; 97+ messages in thread
From: viro @ 2004-09-09  5:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, akpm, linux-arch, alan

On Wed, Sep 08, 2004 at 06:19:08PM -0700, Linus Torvalds wrote:
> (I think the biggest problem is _finding_ sparse, but just googling for 
> "linux sparse tar-ball" shows at least two half-way relevant hits in the 
> top five, so..)

The obvious thing to do would be to put snapshots on kernel.org, though...

One note on patch: I would wrap the definition of __iospace into an
ifndef.  Rationale: that allows to separate the work on these annotations
from the rest.

IOW, I'd love to be able to say
make C=1 CHECKFLAGS="-Dno_check_iospace -Wbitwise"
and get endianness warnings alone (and be able to do _just_ iospace checks
without other experimental stuff).

Once the level of noise from that area is down to something sane, we can
make the thing unconditional and basically include it into the baseline
set of checks.  Comments?

BTW, I've got the beginning of endianness annotations carved and mergable;
patches in an hour or two...

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  5:04     ` viro
@ 2004-09-09  5:05       ` David S. Miller
  2004-09-09  5:13       ` Linus Torvalds
  1 sibling, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-09  5:05 UTC (permalink / raw)
  To: viro; +Cc: torvalds, akpm, linux-arch, alan

On Thu, 9 Sep 2004 06:04:10 +0100
viro@parcelfarce.linux.theplanet.co.uk wrote:

> On Wed, Sep 08, 2004 at 06:19:08PM -0700, Linus Torvalds wrote:
> > (I think the biggest problem is _finding_ sparse, but just googling for 
> > "linux sparse tar-ball" shows at least two half-way relevant hits in the 
> > top five, so..)
> 
> The obvious thing to do would be to put snapshots on kernel.org, though...

It's not like the thing is huge.  I'd personally ship it in the
kernel tree itself, why not?  Do you not want to mix up the
licenses or the source history Linus?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  5:04     ` viro
  2004-09-09  5:05       ` David S. Miller
@ 2004-09-09  5:13       ` Linus Torvalds
  2004-09-09  6:08         ` viro
  1 sibling, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-09  5:13 UTC (permalink / raw)
  To: viro; +Cc: David S. Miller, akpm, linux-arch, alan



On Thu, 9 Sep 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> 
> The obvious thing to do would be to put snapshots on kernel.org, though...

I could do that, I guess. It's calm enough these days that I wouldn't feel 
like I have to update it all the time.

> One note on patch: I would wrap the definition of __iospace into an
> ifndef.  Rationale: that allows to separate the work on these annotations
> from the rest.

Yes, I was thinking of having some way to disable these things. I think
-Wbitwise was good, it would make sense to be able to do similar things
for address-spaces too. Either on a source level with changing __iospace
(aka __iomem), or having a sparse option to allow certain spaces to alias 
each other.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  4:36     ` David S. Miller
@ 2004-09-09  5:56       ` Richard Henderson
  0 siblings, 0 replies; 97+ messages in thread
From: Richard Henderson @ 2004-09-09  5:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linus Torvalds, viro, akpm, linux-arch, alan

On Wed, Sep 08, 2004 at 09:36:58PM -0700, David S. Miller wrote:
> Agreed.  I wish GCC had a way to define a type which is offsetable
> yet not an "int".

Some people might claim you want to be programming in Ada.  ;-)


r~

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  5:13       ` Linus Torvalds
@ 2004-09-09  6:08         ` viro
  0 siblings, 0 replies; 97+ messages in thread
From: viro @ 2004-09-09  6:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David S. Miller, akpm, linux-arch, alan

On Wed, Sep 08, 2004 at 10:13:45PM -0700, Linus Torvalds wrote:
> Yes, I was thinking of having some way to disable these things. I think
> -Wbitwise was good, it would make sense to be able to do similar things
> for address-spaces too. Either on a source level with changing __iospace
> (aka __iomem), or having a sparse option to allow certain spaces to alias 
> each other.

BTW, is there any reason to have cpu_to_le32p() and its ilk as macros instead
of inlined functions?  If there isn't, I'd gladly lose the perversions I'm
currently using to deal with type checks and let the normal logics take care
of that...

Unlike cpu_to_...() these guys are obviously never constant, so that objection
does not apply.  AFAICS, all callers of that bunch would be OK with such
change, so unless there are dark and nasty arch-specific reasons, I'd rather
turn them into inlines...

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
  2004-09-08 23:07 ` David S. Miller
@ 2004-09-09  6:23 ` David Woodhouse
  2004-09-09 13:14   ` Alan Cox
  2004-09-11  6:09 ` Linus Torvalds
  2004-09-15 15:03 ` Linus Torvalds
  3 siblings, 1 reply; 97+ messages in thread
From: David Woodhouse @ 2004-09-09  6:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller

On Wed, 2004-09-08 at 15:57 -0700, Linus Torvalds wrote:
>  And while I've tried out a compile on my normal x86 setup and could
> fix all of the gcc-visible things, this may be disruptive enough that
> people object to doing something like this.
> 
> Comments? 

Just do it. It's currently a complete mess -- and it's already in need
of fixing on those non-i386 platforms where write[bwl] had to be
functions anyway. Bringing i386 into line would be great.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 23:07 ` David S. Miller
  2004-09-08 23:25   ` Linus Torvalds
  2004-09-09  1:19   ` Linus Torvalds
@ 2004-09-09  8:27   ` Geert Uytterhoeven
  2 siblings, 0 replies; 97+ messages in thread
From: Geert Uytterhoeven @ 2004-09-09  8:27 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linus Torvalds, viro, Andrew Morton, linux-arch, Alan Cox

On Wed, 8 Sep 2004, David S. Miller wrote:
> Anyways, I like this a lot.

/me too!

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-09  6:23 ` David Woodhouse
@ 2004-09-09 13:14   ` Alan Cox
  0 siblings, 0 replies; 97+ messages in thread
From: Alan Cox @ 2004-09-09 13:14 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Linux Arch list,
	David S. Miller

On Iau, 2004-09-09 at 07:23, David Woodhouse wrote:
> Just do it. It's currently a complete mess -- and it's already in need
> of fixing on those non-i386 platforms where write[bwl] had to be
> functions anyway. Bringing i386 into line would be great.

I'd say "Just do it - but not in a -rc" 8)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
  2004-09-08 23:07 ` David S. Miller
  2004-09-09  6:23 ` David Woodhouse
@ 2004-09-11  6:09 ` Linus Torvalds
  2004-09-11  6:42   ` Anton Blanchard
                     ` (2 more replies)
  2004-09-15 15:03 ` Linus Torvalds
  3 siblings, 3 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-11  6:09 UTC (permalink / raw)
  To: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik, Benjamin Herrenschmidt


Ok, I've committed the first batch of these things. It should all compile
to exactly the same thing, and I've tried to remove the warnings gcc
generates for the x86/ppc64 configurations I use.

It does generate more warnings than I'd like from gcc. As mentioned, they 
should all be "harmless" in the sense that it doesn't affect code 
generation, but clearly it would be good to try to fix up the worst cases. 
Hint hint.

The sparse ones are a different matter, though: I've removed some of them.
Quite often it's just a matter of declaring the "mmio_base" pointer with
the right type - a one-line change in ieee1394 made hundreds of warnings 
just go away.

But other times there are drivers that seem to take a perverse pleasure in 
mixing types for IOMEM accesses, and sometims mixes the data structure 
between containing IOMEM and IOPORT addresses (one is now "void __iomem 
*", the other one has always been an integer). libata springs to mind, 
where it really looks like it might be better to separate out the "mmio" 
addresses from the "pio" addresses.

Doing the conversion was usually trivial, but sometimes showed some bugs. 
For example, the ATI radeon fbcon driver is clearly horribly buggy:

	drivers/video/aty/radeon_base.c:1725:42: warning: incorrect type in argument 2 (different address spaces)
	drivers/video/aty/radeon_base.c:1725:42:    expected void const *from
	drivers/video/aty/radeon_base.c:1725:42:    got void [noderef] *[assigned] base_addr<asn:2>
	drivers/video/aty/radeon_base.c:1757:39: warning: incorrect type in argument 1 (different address spaces)
	drivers/video/aty/radeon_base.c:1757:39:    expected void *to
	drivers/video/aty/radeon_base.c:1757:39:    got void [noderef] *[assigned] base_addr<asn:2>

comes from it doing

	count -= copy_to_user(buf, base_addr+p, count);
	..
	count -= copy_from_user(base_addr+p, buf, count);

where we are copying memory-mapped PCI space to/from user space. THIS IS
REALLY REALLY WRONG! It's especially wrong as it can cause things like
cache control and prefetch instructions to be used on PCI space, which
migth well trigger serious bugs. That's ignoring the fact that on some 
architectures the PCI IO space isn't even accessible that way in the first 
place. At least not at the address that is the "PCI base address"..

Anyway, last I remember, AMD machines could do strange things if you did 
prefecthing on IO space. So this is not relegated to "weird 
architectures".

One such weird architecture is PPC64, which is why that one does this:

	drivers/video/aty/radeon_base.c:2249:17: warning: incorrect type in assignment (different base types)
	drivers/video/aty/radeon_base.c:2249:17:    expected void [noderef] *fb_base<asn:2>
	drivers/video/aty/radeon_base.c:2249:17:    got unsigned long 

from:

		/* Argh. Scary arch !!! */
	#ifdef CONFIG_PPC64
		rinfo->fb_base = IO_TOKEN_TO_ADDR(rinfo->fb_base);
	#endif

which just "pre-translates" the address. Never mind that this is 
conceptually wrong, it means that we later will "iounmap()" basically a 
random address. Whee.

I think the real fix is to write a "memcpy_fromio_touser()" and it's 
reverse brother, just using a simple readbwl/writebwl loop with the proper 
__get_user/__put_user. That would likely make the ppc64 horrible "Scary 
arch" thing also go away.

Benh: while doign the ppc64 thing, I also encountered something clearly
broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
the stupid thing (it was wrong regardless of which way the preceding
branch went: it would either call iounmap() twice on the same area, or it
would get a NULL pointer dereference). Can you check what that is 
_supposed_ to be doing?

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:09 ` Linus Torvalds
@ 2004-09-11  6:42   ` Anton Blanchard
  2004-09-11  7:26     ` Benjamin Herrenschmidt
  2004-09-11  7:29     ` Geert Uytterhoeven
  2004-09-11  7:23   ` Benjamin Herrenschmidt
  2004-09-11 14:42   ` Alan Cox
  2 siblings, 2 replies; 97+ messages in thread
From: Anton Blanchard @ 2004-09-11  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik, Benjamin Herrenschmidt

 
> Benh: while doign the ppc64 thing, I also encountered something clearly
> broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
> the stupid thing (it was wrong regardless of which way the preceding
> branch went: it would either call iounmap() twice on the same area, or it
> would get a NULL pointer dereference). Can you check what that is 
> _supposed_ to be doing?

Yuck, that change was made in preparation for hotplug removal of host
bridges and is obviously broken.

BTW One of our problems is how drivers can tap IO addresses without doing
something to map them first. We end up doing the ioremap of IO space
during PCI init.

The problem comes when we have to remove a host bridge from a partition
and must have all IO and memory resources unmapped. Memory resources
take care of themselves since drivers iounmap those resources. IO
resources do not.

Anton

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:09 ` Linus Torvalds
  2004-09-11  6:42   ` Anton Blanchard
@ 2004-09-11  7:23   ` Benjamin Herrenschmidt
  2004-09-11 14:42   ` Alan Cox
  2 siblings, 0 replies; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-11  7:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik

On Sat, 2004-09-11 at 16:09, Linus Torvalds wrote:

> 
> Doing the conversion was usually trivial, but sometimes showed some bugs. 
> For example, the ATI radeon fbcon driver is clearly horribly buggy:
> 
> 	drivers/video/aty/radeon_base.c:1725:42: warning: incorrect type in argument 2 (different address spaces)
> 	drivers/video/aty/radeon_base.c:1725:42:    expected void const *from
> 	drivers/video/aty/radeon_base.c:1725:42:    got void [noderef] *[assigned] base_addr<asn:2>
> 	drivers/video/aty/radeon_base.c:1757:39: warning: incorrect type in argument 1 (different address spaces)
> 	drivers/video/aty/radeon_base.c:1757:39:    expected void *to
> 	drivers/video/aty/radeon_base.c:1757:39:    got void [noderef] *[assigned] base_addr<asn:2>
> 
> comes from it doing
> 
> 	count -= copy_to_user(buf, base_addr+p, count);
> 	..
> 	count -= copy_from_user(base_addr+p, buf, count);
> 
> where we are copying memory-mapped PCI space to/from user space. THIS IS
> REALLY REALLY WRONG! It's especially wrong as it can cause things like
> cache control and prefetch instructions to be used on PCI space, which
> migth well trigger serious bugs. That's ignoring the fact that on some 
> architectures the PCI IO space isn't even accessible that way in the first 
> place. At least not at the address that is the "PCI base address"..
> 
I know... I want to get rid of this asap. It actually comes from the
generic fbdev code (though that one has been fixed since then), the
reason radeonfb had its own copies was because of the ioremap limit
that it has... I need to fix all that asap, though hopefully, almost
nobody uses fb read/write ops anyway

Anyway, last I remember, AMD machines could do strange things if you did 
> prefecthing on IO space. So this is not relegated to "weird 
> architectures".
> 
> One such weird architecture is PPC64, which is why that one does this:
> 
> 	drivers/video/aty/radeon_base.c:2249:17: warning: incorrect type in assignment (different base types)
> 	drivers/video/aty/radeon_base.c:2249:17:    expected void [noderef] *fb_base<asn:2>
> 	drivers/video/aty/radeon_base.c:2249:17:    got unsigned long 
> 
> from:
> 
> 		/* Argh. Scary arch !!! */
> 	#ifdef CONFIG_PPC64
> 		rinfo->fb_base = IO_TOKEN_TO_ADDR(rinfo->fb_base);
> 	#endif
> 
> which just "pre-translates" the address. Never mind that this is 
> conceptually wrong, it means that we later will "iounmap()" basically a 
> random address. Whee.

And it is sort-of bullshit too anyways. Unless we end up with eeh tokens
in the top bits of fb_base which I doubt will ever happen. It's that
sort of stuff taht should probably be removed by now. I think we added
it bcs fbdev can access the ioremap'ed memory bypassing readW/writeX,
but in the case of the fbdev, this is not neededd anyway.

> I think the real fix is to write a "memcpy_fromio_touser()" and it's 
> reverse brother, just using a simple readbwl/writebwl loop with the proper 
> __get_user/__put_user. That would likely make the ppc64 horrible "Scary 
> arch" thing also go away.
> 
> Benh: while doign the ppc64 thing, I also encountered something clearly
> broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
> the stupid thing (it was wrong regardless of which way the preceding
> branch went: it would either call iounmap() twice on the same area, or it
> would get a NULL pointer dereference). Can you check what that is 
> _supposed_ to be doing?
> 

It's supposed to unmap PHB IO space, I have to double check. I want
to rewrite this all anyway, we have this "speacial" allocator for
ioremap space separate from the real vmalloc one for reasons that
are no longer valid (if they ever were) imho and I want to remove that,
I'm just too busy at the moment to do it. At this point, I'm thinking
of trashing the _explicit variants of ioremap. They are used to create
a contiguous area containing all PCI IO spaces of PHBs appende to each
other. We still need to do this but can probably do it in a cleaner
way.

Ben. 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:42   ` Anton Blanchard
@ 2004-09-11  7:26     ` Benjamin Herrenschmidt
  2004-09-11  7:29     ` Geert Uytterhoeven
  1 sibling, 0 replies; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-11  7:26 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik


> 
> The problem comes when we have to remove a host bridge from a partition
> and must have all IO and memory resources unmapped. Memory resources
> take care of themselves since drivers iounmap those resources. IO
> resources do not.

I've been thinking about changing that to normal ioremap's like ppc32,
relying on normal pointer arithmetic to get us the right address, but
that would only work if all drivers properly stored their IO port "base"
in a long, not an int, which might not be the case...

As I wrote earlier, I also want to get rid of the imalloc stuff ....

Ben.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:42   ` Anton Blanchard
  2004-09-11  7:26     ` Benjamin Herrenschmidt
@ 2004-09-11  7:29     ` Geert Uytterhoeven
  1 sibling, 0 replies; 97+ messages in thread
From: Geert Uytterhoeven @ 2004-09-11  7:29 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Linus Torvalds, Al Viro, Andrew Morton, Linux Arch list, Alan Cox,
	David S. Miller, Jeff Garzik, Benjamin Herrenschmidt

On Sat, 11 Sep 2004, Anton Blanchard wrote:
> > Benh: while doign the ppc64 thing, I also encountered something clearly
> > broken in arch/ppc64/mm/init.c: iounmap_explicit(). I just commented out
> > the stupid thing (it was wrong regardless of which way the preceding
> > branch went: it would either call iounmap() twice on the same area, or it
> > would get a NULL pointer dereference). Can you check what that is
> > _supposed_ to be doing?
>
> Yuck, that change was made in preparation for hotplug removal of host
> bridges and is obviously broken.
>
> BTW One of our problems is how drivers can tap IO addresses without doing
> something to map them first. We end up doing the ioremap of IO space
> during PCI init.
>
> The problem comes when we have to remove a host bridge from a partition
> and must have all IO and memory resources unmapped. Memory resources
> take care of themselves since drivers iounmap those resources. IO
> resources do not.

Add ioioremap()?

(or more painful for maintenance but conceptually cleaner:
s/ioremap/iomemremap/g and add ioremap()?)

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-11  6:09 ` Linus Torvalds
  2004-09-11  6:42   ` Anton Blanchard
  2004-09-11  7:23   ` Benjamin Herrenschmidt
@ 2004-09-11 14:42   ` Alan Cox
  2 siblings, 0 replies; 97+ messages in thread
From: Alan Cox @ 2004-09-11 14:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Andrew Morton, Linux Arch list, David S. Miller,
	Jeff Garzik, Benjamin Herrenschmidt

On Sad, 2004-09-11 at 07:09, Linus Torvalds wrote:
> Anyway, last I remember, AMD machines could do strange things if you did 
> prefecthing on IO space. So this is not relegated to "weird 
> architectures".

Correct: 
Early Athlon cache corrupts for prefetch() of non prefetchable

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
                   ` (2 preceding siblings ...)
  2004-09-11  6:09 ` Linus Torvalds
@ 2004-09-15 15:03 ` Linus Torvalds
  2004-09-15 16:30   ` Being " Linus Torvalds
  2004-09-15 19:02   ` RFC: being more anal " Geert Uytterhoeven
  3 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 15:03 UTC (permalink / raw)
  To: Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik


Some of you guys may have noticed that the anal typechecking turned out to 
cause another IO interface to be grown.

Some drivers - notably SATA and a couple of network drivers - mixed PIO 
adn MMIO addresses on purpose, because they drive hardware that literally 
uses one or the other. Sometimes it's a compile option that just #defines 
'writel()' to 'inl()', sometimes it's a runtime decision depending on the 
hardware or configuration.

The anal typechecking obviously ended up being very unhappy about this, 
since it wants "void __iomem *" for MMIO pointers, and a normal "unsigned 
long" for PIO accesses.

Rather than scrapping the typechecking, or requiring drivers to do strange
and nasty casts all over the place, there's a new interface in town. It's
called "iomap", because it extends the old "ioremap()" interface to work
on the PIO accesses too. That way, the drivers that really want to do both
can very naturally do it.

Nothing currently uses it, although Jeff has patches for SATA for testing 
(and they clean up the code quite noticeably, never mind getting rid of 
the warnings).  The interface has been implemented by yours truly for x86 
and ppc64, and David did a first-pass version for sparc64 too (missing the 
"xxxx_rep()" functions that were added a bit later, I believe).

So far experience seems to show that it's a very natural interface for
most non-x86 hardware - they all tend to map in both PIO and MMIO into one
address space _anyway_, so the two aren't really any different. It's
mainly just x86 and it's ilk that actually have two different interfaces
for the two kinds of PCI accesses, and at least in that case it's trivial
to encode the difference in the virtual ioremap pointer.

The best way to explain the interface is to just point you guys at the
<asm-generic/iomap.h> file, which isn't very big, has about as much
comments than code, and contains nothing but the necessary function
declarations. The actual meaning of the functions should be pretty
obvious even without the comments.

You can use the <asm-generic/iomap.h> file for declarations even if you
don't use the actual "generic" (x86-ilk) implementation itself - that's
what ppc64 does, for example. Or you could just add the stuff to your own
<asm/io.h>, like davem did for sparc64.

Nothing uses the interface yet, but it's likely that a couple of drivers
will be moved over to it immediately after 2.6.9.  They may not matter for
your architecture, but please check out the new interfaces anyway. It's
likely about ~100 lines of code, although I have to admit that the IO 
access functions tend to be some of the densest and nastiest code around.

For examples, look at lib/iomap.c (x86-like "dual address space") or
arch/ppc64/kernel/eeh.c (Look for "Here comes the EEH implementation of
the IOMAP interfaces"). Or the sparc64 ones in include/asm-sparc64/io.h.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Being more anal about iospace accesses..
  2004-09-15 15:03 ` Linus Torvalds
@ 2004-09-15 16:30   ` Linus Torvalds
  2004-09-15 16:54     ` Jörn Engel
                       ` (5 more replies)
  2004-09-15 19:02   ` RFC: being more anal " Geert Uytterhoeven
  1 sibling, 6 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 16:30 UTC (permalink / raw)
  To: Kernel Mailing List


This is a background mail mainly for driver writers and/or architecture
people. Or others that are just interested in really low-level hw access
details. Others - please feel free to ignore.

[ This has been discussed to some degree already on the architecture 
  mailing lists and obviously among the people who actually worked on it, 
  but I thought I'd bounce it off linux-kernel too, in order to make
  people more aware of what the new type-checking does. Most people may
  have seen it as only generating a ton of new warnings for some crufty
  device drivers. ]

The background for this iospace type-checking change is that we've long
had some serious confusion about how to access PCI memory mapped IO
(MMIO), mainly because on a PC (and some non-PC's too) that IO really does 
look like regular memory, so you can have a driver that just accesses a 
pointer directly, and it will actually work on most machines.

At the same time, we've had the proper "accessor" functions (read[bwl](), 
write[bwl]() and friends) that on purpose dropped all type information 
from the MMIO pointer, mostly just because of historical reasons, and as a 
result some drivers didn't use a pointer at all, but some kind of integer. 
Sometimes even one that couldn't _fit_ a MMIO address in it on a 64-bit 
machine.

In short, the PCI MMIO access case was largely the same as the user
pointer case, except the access functions were different (readb vs
get_user) and they were even less lax about checking for sanity. At least
the user access code required a pointer with the right size.

We've been very successful in annotating user pointers, and that found a
couple of bugs, and more importantly it made the kernel code much more
"aware" of what kind of pointer was passed around. In general, a big
success, I think. And an obvious example for what MMIO pointers should do.

So lately, the kernel infrastructure for MMIO accesses has become a _lot_
more strict about what it accepts. Not only do the MMIO access functions
want a real pointer (which is already more type-checking than we did
before, and causes gcc to spew out lots of warnings for some drivers), but 
as with user pointers, sparse annotations mark them as being in a 
different address space, and building the kernel with checking on will 
warn about mixing up address spaces. So far so good.

So right now the current snapshots (and 2.6.9-rc2) have this enabled, and
some drivers will be _very_ noisy when compiled. Most of the regular ones
are fine, so maybe people haven't even noticed it that much, but some of
them were using things like "u32" to store MMIO pointers, and are
generally extremely broken on anything but an x86.  We'll hopefully get
around to fixing them up eventually, but in the meantime this should at 
least explain the background for some of the new noise people may see.

Perhaps even more interesting is _another_ case of driver, though: one
that started warning not because it was ugly and broken, but because it
did something fairly rare but something that does happen occasionally: it
mixed PIO and MMIO accesses on purpose, because it drove hardware that
literally uses one or the other.

Sometimes such a "mixed interface" driver does it based on a compile
option that just #defines 'writel()' to 'inl()', sometimes it's a runtime
decision depending on the hardware or configuration. 

The anal typechecking obviously ended up being very unhappy about this, 
since it wants "void __iomem *" for MMIO pointers, and a normal "unsigned 
long" for PIO accesses. The compile-time option could have been easily 
fixed up by adding the proper cast when re-defining the IO accessor, but 
that doesn't work for the dynamic case.

Also, the compile-time switchers often really _wanted_ to be dynamic, but
it was just too painful with the regular Linux IO interfaces to duplicate 
the code and do things conditionally one way or the other.

To make a long story even longer: rather than scrapping the typechecking,
or requiring drivers to do strange and nasty casts all over the place,
there's now a new interface in town. It's called "iomap", because it
extends the old "ioremap()" interface to work on the PIO accesses too.

That way, the drivers that really want to mix both PIO and MMIO accesses
can very naturally do it: they just need to remap the PIO space too, the
same way that we've required people to remap the MMIO space for a long
long time.

For example, if you don't know (or, more importantly - don't care) what 
kind of IO interface you use, you can now do something like

	void __iomem * map = pci_iomap(dev, bar, maxbytes);
	...
	status = ioread32(map + DRIVER_STATUS_OFFSET);

and it will do the proper IO mapping for the named PCI BAR for that
device. Regardless of whether the BAR was an IO or MEM mapping. Very
convenient for cases where the hardware migt expose its IO window in
either (or sometimes both).

Nothing in the current tree actually uses this new interface, although
Jeff has patches for SATA for testing (and they clean up the code quite
noticeably, never mind getting rid of the warnings).  The interface has
been implemented by yours truly for x86 and ppc64, and David did a
first-pass version for sparc64 too (missing the "xxxx_rep()" functions
that were added a bit later, I believe).

So far experience seems to show that it's a very natural interface for
most non-x86 hardware - they all tend to map in both PIO and MMIO into one
address space _anyway_, so the two aren't really any different. It's
mainly just x86 and it's ilk that actually have two different interfaces
for the two kinds of PCI accesses, and at least in that case it's trivial
to encode the difference in the virtual ioremap pointer.

The best way to explain the interface is to just point you guys at the
<asm-generic/iomap.h> file, which isn't very big, has about as much
comments than code, and contains nothing but the necessary function
declarations. The actual meaning of the functions should be pretty
obvious even without the comments.

Feel free to flame or discuss rationally,

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:30   ` Being " Linus Torvalds
@ 2004-09-15 16:54     ` Jörn Engel
  2004-09-15 17:07       ` Linus Torvalds
                         ` (4 more replies)
  2004-09-15 16:56     ` Dave Jones
                       ` (4 subsequent siblings)
  5 siblings, 5 replies; 97+ messages in thread
From: Jörn Engel @ 2004-09-15 16:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Wed, 15 September 2004 09:30:42 -0700, Linus Torvalds wrote:
> 
> For example, if you don't know (or, more importantly - don't care) what 
> kind of IO interface you use, you can now do something like
> 
> 	void __iomem * map = pci_iomap(dev, bar, maxbytes);
> 	...
> 	status = ioread32(map + DRIVER_STATUS_OFFSET);

C now supports pointer arithmetic with void*?  I hope the width of a
void is not architecture dependent, that would introduce more subtle
bugs.

Jörn

-- 
Sometimes, asking the right question is already the answer.
-- Unknown

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:30   ` Being " Linus Torvalds
  2004-09-15 16:54     ` Jörn Engel
@ 2004-09-15 16:56     ` Dave Jones
  2004-09-15 17:19     ` Roger Luethi
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 97+ messages in thread
From: Dave Jones @ 2004-09-15 16:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Wed, Sep 15, 2004 at 09:30:42AM -0700, Linus Torvalds wrote:

 > So right now the current snapshots (and 2.6.9-rc2) have this enabled, and
 > some drivers will be _very_ noisy when compiled. Most of the regular ones
 > are fine, so maybe people haven't even noticed it that much, but some of
 > them were using things like "u32" to store MMIO pointers, and are
 > generally extremely broken on anything but an x86.  We'll hopefully get
 > around to fixing them up eventually, but in the meantime this should at 
 > least explain the background for some of the new noise people may see.

For the curious, 6MB of sparse output is generated from a make allmodconfig
right now. (http://www.codemonkey.org.uk/junk/2.6.9-rc2-warnings.txt)

You can filter out just the __iomem warnings by grepping for asn:2

		Dave


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:54     ` Jörn Engel
@ 2004-09-15 17:07       ` Linus Torvalds
  2004-09-15 17:32         ` Jörn Engel
                           ` (2 more replies)
  2004-09-15 17:07       ` Jeff Garzik
                         ` (3 subsequent siblings)
  4 siblings, 3 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 17:07 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Kernel Mailing List



On Wed, 15 Sep 2004, Jörn Engel wrote:
> 
> C now supports pointer arithmetic with void*?

C doesn't. gcc does. It's a documented extension, and it acts like if it 
was done on a byte.

See gcc's user guide "Extension to the C Language Family".

It's a singularly good feature. 

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:54     ` Jörn Engel
  2004-09-15 17:07       ` Linus Torvalds
@ 2004-09-15 17:07       ` Jeff Garzik
  2004-09-15 17:16       ` Roland Dreier
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 97+ messages in thread
From: Jeff Garzik @ 2004-09-15 17:07 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linus Torvalds, Kernel Mailing List

On Wed, Sep 15, 2004 at 06:54:50PM +0200, Jörn Engel wrote:
> On Wed, 15 September 2004 09:30:42 -0700, Linus Torvalds wrote:
> > 
> > For example, if you don't know (or, more importantly - don't care) what 
> > kind of IO interface you use, you can now do something like
> > 
> > 	void __iomem * map = pci_iomap(dev, bar, maxbytes);
> > 	...
> > 	status = ioread32(map + DRIVER_STATUS_OFFSET);
> 
> C now supports pointer arithmetic with void*?  I hope the width of a
> void is not architecture dependent, that would introduce more subtle
> bugs.

gcc extension, which has been used in the kernel for ages.

	Jeff




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:54     ` Jörn Engel
  2004-09-15 17:07       ` Linus Torvalds
  2004-09-15 17:07       ` Jeff Garzik
@ 2004-09-15 17:16       ` Roland Dreier
  2004-09-15 17:39         ` Linus Torvalds
  2004-09-15 17:36       ` Horst von Brand
  2004-09-15 17:40       ` Brian Gerst
  4 siblings, 1 reply; 97+ messages in thread
From: Roland Dreier @ 2004-09-15 17:16 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linus Torvalds, Kernel Mailing List

    Jörn> C now supports pointer arithmetic with void*?  I hope the
    Jörn> width of a void is not architecture dependent, that would
    Jörn> introduce more subtle bugs.

Pointer arithmetic on void * has been a gcc extension since forever
(gcc acts as though sizeof (void) == 1).

However, I somewhat agree -- it's ugly for drivers rely on this and do
arithmetic on void *.  It should be OK for a driver to use 
char __iomem * for its IO base if it needs to add in offsets, right?

 - R.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:30   ` Being " Linus Torvalds
  2004-09-15 16:54     ` Jörn Engel
  2004-09-15 16:56     ` Dave Jones
@ 2004-09-15 17:19     ` Roger Luethi
  2004-09-15 17:23     ` Richard B. Johnson
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 97+ messages in thread
From: Roger Luethi @ 2004-09-15 17:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Wed, 15 Sep 2004 09:30:42 -0700, Linus Torvalds wrote:
> This is a background mail mainly for driver writers and/or architecture

Thanks, I appreciate the effort (speaking as the maintainer of an
"even more interesting" driver).

Roger

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:30   ` Being " Linus Torvalds
                       ` (2 preceding siblings ...)
  2004-09-15 17:19     ` Roger Luethi
@ 2004-09-15 17:23     ` Richard B. Johnson
  2004-09-15 22:21     ` Deepak Saxena
  2004-09-15 22:29     ` Roland Dreier
  5 siblings, 0 replies; 97+ messages in thread
From: Richard B. Johnson @ 2004-09-15 17:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Wed, 15 Sep 2004, Linus Torvalds wrote:

>
> This is a background mail mainly for driver writers and/or architecture
> people. Or others that are just interested in really low-level hw access
> details. Others - please feel free to ignore.
>
[SNIPPED mostly....]


> For example, if you don't know (or, more importantly - don't care) what
> kind of IO interface you use, you can now do something like
>
> 	void __iomem * map = pci_iomap(dev, bar, maxbytes);
> 	...
> 	status = ioread32(map + DRIVER_STATUS_OFFSET);
                          ^^^^^^^^^^^^^^^^

Doesn't this rely on the non-standard GNUism that you can
perform pointer-arithmetic on a void-pointer? Which is illegal,
immoral, and fattening. I'd much rather see char-pointers so
it's valid to perform the offset math. That way, in the future,
a new tool that follows (ANSI, IEEE, POSIX) rules doesn't barf.
I suggest a new pointer type like (BASE *) or (BAR *) that
hides the (unsigned char *) necessary to not barf, plus
minimize side-effects.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
            Note 96.31% of all statistics are fiction.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:07       ` Linus Torvalds
@ 2004-09-15 17:32         ` Jörn Engel
  2004-09-15 17:57           ` Linus Torvalds
  2004-09-15 17:45         ` Nikita Danilov
  2004-09-15 18:40         ` Chris Wedgwood
  2 siblings, 1 reply; 97+ messages in thread
From: Jörn Engel @ 2004-09-15 17:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Wed, 15 September 2004 10:07:29 -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Jörn Engel wrote:
> > 
> > C now supports pointer arithmetic with void*?
> 
> C doesn't. gcc does. It's a documented extension, and it acts like if it 
> was done on a byte.
> 
> See gcc's user guide "Extension to the C Language Family".
> 
> It's a singularly good feature. 

Nice.

But it still leaves me confused.  Before I had this code:

	struct regs {
		uint32_t status;
		...
	}

	...

	struct regs *regs = ioremap(...);
	uint32_t status = regs->status;
	...

So now I should do it like this:

	#define REG_STATUS 0

	...

	void __iomem *regs = ioremap(...);
	uint32_t status = readl(regs + REG_STATUS);
	...

But wait, that only works when long is 32bit wide.  Plus I could be
stupid enough and "#define REG_STATUS 64" while the register space is
just 64 bytes long.  It solves the confusion about address spaces,
agreed, but overall I'm more confused now.  Hope it's just temporary.

Jörn

-- 
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scarabouche

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:54     ` Jörn Engel
                         ` (2 preceding siblings ...)
  2004-09-15 17:16       ` Roland Dreier
@ 2004-09-15 17:36       ` Horst von Brand
  2004-09-15 17:40       ` Brian Gerst
  4 siblings, 0 replies; 97+ messages in thread
From: Horst von Brand @ 2004-09-15 17:36 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linus Torvalds, Kernel Mailing List

=?iso-8859-1?Q?J=F6rn?= Engel <joern@wohnheim.fh-wedel.de> said:
> On Wed, 15 September 2004 09:30:42 -0700, Linus Torvalds wrote:

[...]

> > For example, if you don't know (or, more importantly - don't care) what 
> > kind of IO interface you use, you can now do something like
> > 
> > 	void __iomem * map = pci_iomap(dev, bar, maxbytes);
> > 	...
> > 	status = ioread32(map + DRIVER_STATUS_OFFSET);

> C now supports pointer arithmetic with void*?

It doesn't. It's a gcc-ism.

>                                               I hope the width of a
> void is not architecture dependent, that would introduce more subtle
> bugs.

gcc takes it as a char pointer for such uses.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:16       ` Roland Dreier
@ 2004-09-15 17:39         ` Linus Torvalds
  2004-09-15 20:07           ` Russell King
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 17:39 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Jörn Engel, Kernel Mailing List



On Wed, 15 Sep 2004, Roland Dreier wrote:
> 
> However, I somewhat agree -- it's ugly for drivers rely on this and do
> arithmetic on void *.  It should be OK for a driver to use 
> char __iomem * for its IO base if it needs to add in offsets, right?

"char __iomem *" will certainly work - all the normal pointer conversions 
are ok. Some people in fact use pointers to structures in MMIO space, and 
this is quite reasonable when working with a chip that uses "mailboxes" 
for commands.

However, I disagree with "void *" arithmetic being ugly. It's a very nice
feature to have a pointer that can be validly cast to any other type, and
that is the whole _point_ of "void *". The fact that C++ got that wrong is
arguably the worst failing of the language, causing tons of unnecessary
casts that can silently hide real bugs (maybe the thing you cast wasn't a
"void *" in the first place, but you'll never know - the compiler will do
the cast for you).

For example, to go back to the mailbox example, let's say that your 
hardware has an IO area that is 8kB in size, with the last 4kB being 
mailboxes.

The _sane_ way to do that is to do

	void __iomem *base_io = ioremap(...);
	struct mailbox __iomem *mbox = base_io + MAILBOX_OFFSET;

and then just work on that.

In contrast, havign to cast to a "char *" in order to do arithmetic, and 
then casting back to the resultant structure type pointer is not only 
ugly and unreadable, it's a lot more prone to errors as a result.

In other words, think of "void *" as a pointer to storage. Not "char"  
(which is the C name for a signed byte), but really, it's the pointer to 
whatever underlying memory there is. And a _fundamental_ part of such 
memory is the fact that it is addressable. Thus "pointer to storage 
arithmetic" really does make sense on a very fundamental level. It has 
nothing to do with C types, and that also explains why "void *" silently 
converts to anything else. It's a very internally consistent world-view.

Now, I disagree with gcc when it comes to actually taking the "size" of 
void. Gcc will silently accept

	void *x;
	x = malloc(sizeof(*x));

which I consider to be an abomination (and the above _can_ happen, quite
easily, as part of macros for doing allocation etc - nobody would write 
it in that form, but if you have an "MEMALLOC(x)" macro that does the 
sizeof, you could end up trying to feed the compiler bogus code).

The fact that you can do arithmetic on typeless storage does _not_ imply
that typeless storage would have a "size" in my book.

So sparse will say:

	warning: cannot size expression

and refuse to look at broken code like the above. But hey, the fact that I 
have better taste than anybody else in the universe is just something I 
have to live with. It's not easy being me.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:54     ` Jörn Engel
                         ` (3 preceding siblings ...)
  2004-09-15 17:36       ` Horst von Brand
@ 2004-09-15 17:40       ` Brian Gerst
  4 siblings, 0 replies; 97+ messages in thread
From: Brian Gerst @ 2004-09-15 17:40 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Linus Torvalds, Kernel Mailing List

Jörn Engel wrote:
> On Wed, 15 September 2004 09:30:42 -0700, Linus Torvalds wrote:
> 
>>For example, if you don't know (or, more importantly - don't care) what 
>>kind of IO interface you use, you can now do something like
>>
>>	void __iomem * map = pci_iomap(dev, bar, maxbytes);
>>	...
>>	status = ioread32(map + DRIVER_STATUS_OFFSET);
> 
> 
> C now supports pointer arithmetic with void*?  I hope the width of a
> void is not architecture dependent, that would introduce more subtle
> bugs.
> 
> Jörn
> 

http://gcc.gnu.org/onlinedocs/gcc-3.4.2/gcc/Pointer-Arith.html#Pointer-Arith

5.18 Arithmetic on void- and Function-Pointers

In GNU C, addition and subtraction operations are supported on pointers 
to void and on pointers to functions. This is done by treating the size 
of a void or of a function as 1.

A consequence of this is that sizeof is also allowed on void and on 
function types, and returns 1.

The option -Wpointer-arith requests a warning if these extensions are used.

--
				Brian Gerst

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:07       ` Linus Torvalds
  2004-09-15 17:32         ` Jörn Engel
@ 2004-09-15 17:45         ` Nikita Danilov
  2004-09-15 18:09           ` Linus Torvalds
  2004-09-15 18:40         ` Chris Wedgwood
  2 siblings, 1 reply; 97+ messages in thread
From: Nikita Danilov @ 2004-09-15 17:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jörn Engel, Kernel Mailing List

Linus Torvalds writes:
 > 
 > 
 > On Wed, 15 Sep 2004, JЖrn Engel wrote:
 > > 
 > > C now supports pointer arithmetic with void*?
 > 
 > C doesn't. gcc does. It's a documented extension, and it acts like if it 
 > was done on a byte.
 > 
 > See gcc's user guide "Extension to the C Language Family".
 > 
 > It's a singularly good feature. 

Unfortunately it breaks even better identity

  foo *p;

  p + nr == (foo *)((char *)p + nr * sizeof *p)

unless one thinks that is --together with gcc-- that nothing occupies
exactly one byte.

 > 
 > 		Linus

Nikita.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:32         ` Jörn Engel
@ 2004-09-15 17:57           ` Linus Torvalds
  2004-09-15 18:06             ` Linus Torvalds
                               ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 17:57 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Kernel Mailing List



On Wed, 15 Sep 2004, Jörn Engel wrote:
> 
> But it still leaves me confused.  Before I had this code:
> 
> 	struct regs {
> 		uint32_t status;
> 		...
> 	}
> 
> 	...
> 
> 	struct regs *regs = ioremap(...);
> 	uint32_t status = regs->status;
> 	...
> 
> So now I should do it like this:
> 
> 	#define REG_STATUS 0
> 
> 	...
> 
> 	void __iomem *regs = ioremap(...);
> 	uint32_t status = readl(regs + REG_STATUS);

No, you can certainly continue to use non-void pointers. The "void __iomem
*" case is just the typeless one, exactly equivalent to regular void
pointer usage.

So let me clarify my original post with two points:

 - if your device only supports MMIO, you might as well just use the old 
   interfaces. The new interface will _also_ work, but there is no real 
   advantage, unless you count the "pci_iomap()" as a simpler interface.

   The new interface is really only meaningful for things that want to 
   support _both_ PIO and MMIO. It's also, perhaps, a bit syntactically 
   easier to work with, so some people might prefer that for that 
   reason. See my comments further down on the auto-sizing. BUT it doesn't 
   make the old interfaces go away by any means, and I'm not even
   suggesting that people should re-write drivers just for the hell of it.

   In short: if you don't go "ooh, that will simplify XXX", then you 
   should just ignore the new interfaces.

 - you can _absolutely_ use other pointers than "void *". You should 
   annotate them with "__iomem" if you want to be sparse-clean (and it 
   often helps visually to clarify the issue), but gcc won't care, the 
   "__iomem" annotation is purely a extended check.

So you can absolutely still continue with

	struct mydev_iolayout {
		__u32 status;
		__u32 irqmask;
		...

	struct mydev_iolayout __iomem *regs = pci_map(...);
	status = ioread32(&regs.status);

which is often a lot more readable, and thus in fact _preferred_. It also
adds another level of type-checking, so I applaud drivers that do this.

Now, I'm _contemplating_ also allowing the "get_user()" kind of "unsized" 
access function for the new interface. Right now all the old (and the new) 
access functions are all explicitly sized. But for the "struct layout" 
case, it's actually often nice to just say

	status = ioread(&regs.status);

and the compiler can certainly figure out the size of the register on its
own. This was impossible with the old interface, because the old 
interfaces didn't even take a _pointer_, much less one that could be sized 
up automatically.

(The auto-sizing is something that "get_user()" and "put_user()" have
always done, and it makes them very easy to use. It involved a few pretty
ugly macros, but hey, that's all hidden away, and is actually pretty
simple in the end).

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:57           ` Linus Torvalds
@ 2004-09-15 18:06             ` Linus Torvalds
  2004-09-15 19:34             ` Greg KH
  2004-09-16 14:58             ` Horst von Brand
  2 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 18:06 UTC (permalink / raw)
  To: Jörn Engel; +Cc: Kernel Mailing List



On Wed, 15 Sep 2004, Linus Torvalds wrote:
> 
>    In short: if you don't go "ooh, that will simplify XXX", then you 
>    should just ignore the new interfaces.

Btw, to get an example of what _is_ simplified, look at 
drivers/scsi/libata-core.c:

	void ata_tf_load(struct ata_port *ap, struct ata_taskfile *tf)
	{
	        if (ap->flags & ATA_FLAG_MMIO)
	                ata_tf_load_mmio(ap, tf);
	        else
	                ata_tf_load_pio(ap, tf);
	}

and realize that "ata_tf_load_mmio()" and "ata_tf_load_pio()" are exactly 
the SAME FUNCTION. Except one uses MMIO, the other uses PIO. With the new 
setup, it literally collapses into one function, and code size goes down 
pretty dramatically. Not to mention making the code more readable.

For another example of this (of the static kind), look at something like 
drivers/net/8139too.c:

	#ifdef USE_IO_OPS

	#define RTL_R8(reg)             inb (((unsigned long)ioaddr) + (reg))
	#define RTL_R16(reg)            inw (((unsigned long)ioaddr) + (reg))
	#define RTL_R32(reg)            ((unsigned long) inl (((unsigned long)ioaddr) + (reg)))
	...

	#else

	...
	/* read MMIO register */
	#define RTL_R8(reg)             readb (ioaddr + (reg))
	#define RTL_R16(reg)            readw (ioaddr + (reg))
	#define RTL_R32(reg)            ((unsigned long) readl (ioaddr + (reg)))

see? In this case, USE_IO_OPS depends on a static configuration variable, 
namely CONFIG_8139TOO_PIO. So the user at _compile_ time has to decide 
whether he wants to use MMIO or PIO. See the Kconfig help file:

          This instructs the driver to use programmed I/O ports (PIO) instead 
          of PCI shared memory (MMIO).  This can possibly solve some problems
          in case your mainboard has memory consistency issues.  If unsure,
          say N.

In other words, the new interface is not designed to replace the old ones,
it's designed to help drivers like these, that either go to a lot of extra
pain in order to support both methods, or then have a _static_ config
option that makes it really hard for system vendors to just release one
driver that knows when it needs to use PIO and when it needs MMIO.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:45         ` Nikita Danilov
@ 2004-09-15 18:09           ` Linus Torvalds
  0 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 18:09 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Jörn Engel, Kernel Mailing List



On Wed, 15 Sep 2004, Nikita Danilov wrote:
> 
> Unfortunately it breaks even better identity
> 
>   foo *p;
> 
>   p + nr == (foo *)((char *)p + nr * sizeof *p)

No, gcc allows the above, by making sizeof(void) be 1.

And sane compilers would just inform the user at compile-time with a nice
readable message that he's doing something stupid.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:07       ` Linus Torvalds
  2004-09-15 17:32         ` Jörn Engel
  2004-09-15 17:45         ` Nikita Danilov
@ 2004-09-15 18:40         ` Chris Wedgwood
  2 siblings, 0 replies; 97+ messages in thread
From: Chris Wedgwood @ 2004-09-15 18:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jörn Engel, Kernel Mailing List

On Wed, Sep 15, 2004 at 10:07:29AM -0700, Linus Torvalds wrote:

> C doesn't. gcc does. It's a documented extension, and it acts like
> if it was done on a byte.

[...]

> It's a singularly good feature.

I dunno about that.  Maybe it is, but it has some gotchas.

Recently when doing a sparsification of code I noticed there are
places which essentially do things like:

    void *foo;

    [...]

    foo += bar * n;

Part of the fix (cleanup) was to change the 'void *foo' to
'gratuitous_typedef_t __user *foo' --- which silently breaks the math
if you don't explicitly check for this.


  --cw

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 15:03 ` Linus Torvalds
  2004-09-15 16:30   ` Being " Linus Torvalds
@ 2004-09-15 19:02   ` Geert Uytterhoeven
  2004-09-15 19:16     ` Linus Torvalds
  1 sibling, 1 reply; 97+ messages in thread
From: Geert Uytterhoeven @ 2004-09-15 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Wed, 15 Sep 2004, Linus Torvalds wrote:
> Rather than scrapping the typechecking, or requiring drivers to do strange
> and nasty casts all over the place, there's a new interface in town. It's
> called "iomap", because it extends the old "ioremap()" interface to work
> on the PIO accesses too. That way, the drivers that really want to do both
> can very naturally do it.
>
> Nothing currently uses it, although Jeff has patches for SATA for testing

Before people really start using this: what's the advantage of this scheme
compared to e.g. dev->busops->in8(), which some of us have been waiting for
since a long time?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:02   ` RFC: being more anal " Geert Uytterhoeven
@ 2004-09-15 19:16     ` Linus Torvalds
  2004-09-15 19:40       ` Matthew Wilcox
                         ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 19:16 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Geert Uytterhoeven wrote:
> 
> Before people really start using this: what's the advantage of this scheme
> compared to e.g. dev->busops->in8(), which some of us have been waiting for
> since a long time?

Ehh, exactly why would you wait for that? 

What this new interface is doing is to slowlt move "ioremap()" users over
to "pci_iomap()" (and by extension, maybe "sbus_ioremap()" etc if somebody
ever starts doing that too), which allows you to do the bus- and device-
specific translation just _once_.

Doing it at every IO op (aka "dev->busops->xxx()") is a complete waste of 
time, and extremely stupid, in my not-so-humble opinion. Let me count the 
ways:

 - it's more efficient to set up the mapping _once_. That's especailly 
   true since the mapping may need to do things like page table etc
   accesses, so you HAVE to prepare it anyway (ie doing it at IO time
   would be a bug)

 - doing it on a dev or bus-centric setup is totally idiotic, because it 
   makes it effectively impossible to share one chip-driver across X
   different buses: your scheme would have ten different "dev"s pointing 
   to their own ops. The pci_iomap() has _one_ mapping scheme per bus 
   type, and then you just pass in the cookie to "ioread()" and friends.

 - we've tried it in some circumstances (IDE driver and PCI config space 
   accesses if I recall correctly, which I obviously might not do), and it
   has been a total failure. 

In other words, I hereby proclaim that anybody who does "dev->busops" is a 
total idiot.

Of course, you're now free to try to change my mind. Some people that are
idiots today might well be the revolutionary thinkers of tomorrow. I'm
nothing if not flexible.

		Linus "Yoga of the Mind" Torvalds

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:57           ` Linus Torvalds
  2004-09-15 18:06             ` Linus Torvalds
@ 2004-09-15 19:34             ` Greg KH
  2004-09-15 19:53               ` Linus Torvalds
  2004-09-16 14:58             ` Horst von Brand
  2 siblings, 1 reply; 97+ messages in thread
From: Greg KH @ 2004-09-15 19:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: J?rn Engel, Kernel Mailing List

On Wed, Sep 15, 2004 at 10:57:25AM -0700, Linus Torvalds wrote:
> 
> So you can absolutely still continue with
> 
> 	struct mydev_iolayout {
> 		__u32 status;
> 		__u32 irqmask;
> 		...
> 
> 	struct mydev_iolayout __iomem *regs = pci_map(...);
> 	status = ioread32(&regs.status);
> 
> which is often a lot more readable, and thus in fact _preferred_. It also
> adds another level of type-checking, so I applaud drivers that do this.

Currently a few drivers do:
	status = readl(&regs.status);
which causes sparse warnings.

How should that code be changed to prevent this?  Convert them all to
ioread32()?  Or figure out a way to supress the warning for readl()?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:16     ` Linus Torvalds
@ 2004-09-15 19:40       ` Matthew Wilcox
  2004-09-15 20:10         ` Linus Torvalds
  2004-09-15 20:20       ` Russell King
  2004-09-15 22:38       ` James Bottomley
  2 siblings, 1 reply; 97+ messages in thread
From: Matthew Wilcox @ 2004-09-15 19:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 12:16:35PM -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Geert Uytterhoeven wrote:
> > Before people really start using this: what's the advantage of this scheme
> > compared to e.g. dev->busops->in8(), which some of us have been waiting for
> > since a long time?
> 
> Ehh, exactly why would you wait for that? 

I'm no fan of bus->ops->in8, but the ioread8() interface is basically
impossible for some architectures.  There's just not enough information
in a single void * cookie to make this work; not without screwing the
speed of readb() down the the speed of inb().

For example, on some of the older PA machines, we can have an EISA
controller and multiple PCI controllers.  readb() is great, it's all
mapped into a flat memory space, and there are no problems.  But inb()
is a royal PITA.  First, we have to decode the port number to figure
out which controller it's under, then (if it's EISA) swizzle the bits in
the port number to get to an address, but if it's PCI, we have to take a
spinlock, write the port number into a register of the right controller,
read the data back from a different register (which is what triggers
the io port cycle to be generated on the PCI bus) and unlock the spinlock.

ioport_map() looks great for the EISA controller because it lets us
pre-map the port number to an address.  But it'd actually be deadly
if the device has more than 8 consecutive IO ports [1].  For the PCI
controllers, it's a complete misery.  We could do something horrid like
assuming no IO is less than 2GB and returning a cookie from ioport_map()
that then gets decoded into somethign similar to the current dance,
but that's really not pleasant.

I'm told SH is in even worse state than PA, but I haven't looked at
that code.

Could we have a 'dev' parameter to ioread8()?  Please?


[1] The implementation is:
        if (port & 0x300) {
                return 0xfc000000 | ((port & 0xfc00) >> 6)
                        | ((port & 0x3f8) << 9) | (port & 7);
        } else {
                return 0xfc000000 | port;
        }

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 19:34             ` Greg KH
@ 2004-09-15 19:53               ` Linus Torvalds
  2004-09-15 20:06                 ` Greg KH
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 19:53 UTC (permalink / raw)
  To: Greg KH; +Cc: J?rn Engel, Kernel Mailing List



On Wed, 15 Sep 2004, Greg KH wrote:
> 
> Currently a few drivers do:
> 	status = readl(&regs.status);

I assume that's "&regs->status", since regs had better be a pointer..

> which causes sparse warnings.
> 
> How should that code be changed to prevent this?  Convert them all to
> ioread32()?  Or figure out a way to supress the warning for readl()?

Just make sure that you annotate "regs" as a pointer to IO space.

Ie if you have

	struct xxxx __iomem *regs;

then everything will be fine - sparse knows that "regs" is a IO pointer,
and that obviously makes "&regs->member" _also_ an IO pointer.

You'll need that __iomem annotation anyway, since otherwise sparse would
complain when you do the assignment from the "ioremap()" call (and the
thing had better come from ioremap() some way, or it's not valid in the
first place to do "readl()" on it).

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 19:53               ` Linus Torvalds
@ 2004-09-15 20:06                 ` Greg KH
  0 siblings, 0 replies; 97+ messages in thread
From: Greg KH @ 2004-09-15 20:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: J?rn Engel, Kernel Mailing List

On Wed, Sep 15, 2004 at 12:53:51PM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 15 Sep 2004, Greg KH wrote:
> > 
> > Currently a few drivers do:
> > 	status = readl(&regs.status);
> 
> I assume that's "&regs->status", since regs had better be a pointer..

Yes, sorry.

> > which causes sparse warnings.
> > 
> > How should that code be changed to prevent this?  Convert them all to
> > ioread32()?  Or figure out a way to supress the warning for readl()?
> 
> Just make sure that you annotate "regs" as a pointer to IO space.

Ah, ok, that works.  Thanks for the clarification, I now realize that
__iomem can be a marker for any type of pointer, not just a void.
That's where I was confused.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:39         ` Linus Torvalds
@ 2004-09-15 20:07           ` Russell King
  0 siblings, 0 replies; 97+ messages in thread
From: Russell King @ 2004-09-15 20:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roland Dreier, Jörn Engel, Kernel Mailing List

On Wed, Sep 15, 2004 at 10:39:02AM -0700, Linus Torvalds wrote:
> In other words, think of "void *" as a pointer to storage. Not "char"  
> (which is the C name for a signed byte),

Common Programming Error #99: "char" is implementation whether it is
signed or may be unsigned.  Only a "char" type qualified by "signed"
or "unsigned" can be relied upon to have the requested property.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:40       ` Matthew Wilcox
@ 2004-09-15 20:10         ` Linus Torvalds
  2004-09-15 20:17           ` Linus Torvalds
  2004-09-16 12:17           ` David Woodhouse
  0 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Matthew Wilcox wrote:

> Could we have a 'dev' parameter to ioread8()?  Please?

Nope. See earlier email on why. There is no "dev" any more at the point of
a generic chipset handler: the "dev" may have any of millions of possible 
types (well, ok, a few different bus architectures, but still).

I seriously doubt that you don't have space to encode the same information 
that was already there before in a pointer. David was kicking and 
screaming too, until Jeff pointed out that it was all in the same address 
space, and then he said "oh", and suddenly it was trivial.

> [1] The implementation is:
>         if (port & 0x300) {
>                 return 0xfc000000 | ((port & 0xfc00) >> 6)
>                         | ((port & 0x3f8) << 9) | (port & 7);
>         } else {
>                 return 0xfc000000 | port;
>         }

So? Without knowing the details of how stupid the PA-RISC IO mappings are, 
the above already makes me suspect that you have one memory space, and the 
0xfc000000 area is used for IO. Goodie. So now you make

	/* We've hard-mapped the IO region at 0xfc000000 */
	void __iomem *ioport_map(unsigned long port)
	{
		return (void __iomem *) (port | 0xfc000000);
	}

and then in ioread8() you do

	unsigned int ioread8(void __iomem *addr)
	{
		unsigned long bits = (unsigned long __force) addr;
		if ((bits & 0xfc000000) == 0xfc000000)
			addr = io_port_swizle_bits(bits & 0x3ffffff);
		/* MMIO has already been mapped */
		return __raw_readb(addr);
	}

with "io_mem_read()" doing the old bit swizzle, exactly like before.

The _only_ thing you need is an _virtual_ area of the cookie space (aka 
"void __iomem *") that you can just test. It clearly exists, because your 
PCI code does "ioremap()", so _any_ memory area that cannot be reached 
with ioremap() is by default something you can use as a flag for it being 
PIO. Then you do all the same things that you did before.

Is it uglier than sane architectures? Sure. Some architectures do 
_everything_ at pci_iomap() time, and then they just do a simple 
special-memory dereference that is common to both PIO and MMIO code in 
"ioreadXX()".

The fact that PARISC did their PCI translation wrong is painful for you.  
That doesn't mean that Linux should do the same mistakes the PARISC system
designers made.

Think of the positive side: It builds character.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:10         ` Linus Torvalds
@ 2004-09-15 20:17           ` Linus Torvalds
  2004-09-16 12:17           ` David Woodhouse
  1 sibling, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Linus Torvalds wrote:
> 
> Think of the positive side: It builds character.

Besides, a Canadian should know better than asking a Finn for favours
after what you did yesterday.

You'll all burn in hell.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:16     ` Linus Torvalds
  2004-09-15 19:40       ` Matthew Wilcox
@ 2004-09-15 20:20       ` Russell King
  2004-09-15 20:34         ` Linus Torvalds
  2004-09-15 22:38       ` James Bottomley
  2 siblings, 1 reply; 97+ messages in thread
From: Russell King @ 2004-09-15 20:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 12:16:35PM -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Geert Uytterhoeven wrote:
> > Before people really start using this: what's the advantage of this scheme
> > compared to e.g. dev->busops->in8(), which some of us have been waiting for
> > since a long time?
> 
> Ehh, exactly why would you wait for that? 
> 
> What this new interface is doing is to slowlt move "ioremap()" users over
> to "pci_iomap()" (and by extension, maybe "sbus_ioremap()" etc if somebody
> ever starts doing that too), which allows you to do the bus- and device-
> specific translation just _once_.

What about (PCI!) bus types where you need to write one register with
the address and one with the data.  (I'm glaring at a certain embedded
CPU produced by a well known large chip manufacturer, which I think we
need to do these games to get byte accesses to work correctly.)

Sometimes unified accessor macros/functions just don't work.

> In other words, I hereby proclaim that anybody who does "dev->busops" is a 
> total idiot.

Good, I'll pass that comment on to other people and maybe we can force
certain sillycon manufacturers to stop selling broken hardware.  Or
maybe the universe will implode.  I think the latter is a more likely
scenario than the former.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:20       ` Russell King
@ 2004-09-15 20:34         ` Linus Torvalds
  2004-09-15 20:51           ` Linus Torvalds
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:34 UTC (permalink / raw)
  To: Russell King
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Russell King wrote:
> 
> What about (PCI!) bus types where you need to write one register with
> the address and one with the data.  (I'm glaring at a certain embedded
> CPU produced by a well known large chip manufacturer, which I think we
> need to do these games to get byte accesses to work correctly.)

Well, that's really the exact same issue as with architectures that cram
memory and PCI into the same 32-bit address space: clearly that "can't
work", since PCI wants 32 bits, and memory wants at least a noticeable 
fraction of the same.

You have a "address register" (which is the 32 physical CPU address output
on the address bus), and you have a "data register" (which is what gets
output on the data bus).

Does it work? Hell yes. Every single PC out there today does exactly that.
It works by having the resource manager say "region X is reserved for RAM,
region Y is reserved for MMIO, and region Z is reserved for PIO" (well, 
the latter is not PC any more, it's ppc, alpha and others).

Same issue, exactly. If you see this thing in practice, just
force-relocate the resources away from each other. Say that MMIO is not
allowed in the 0x0000-0xffff region, and declare that that is reserved for
PIO. And then you just test at run-time. Easy. Change details to suit your
particular problem (maybe you want to use the high bit as a marker. Maybe
you use a water-mark or other more dynamic behaviour).

This is a _real_ solution, btw. It's not a cop-out. It's the solution that
has been used over and over again, and it just works. It's basically 
saying "theory doesn't matter - this is practice".

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:34         ` Linus Torvalds
@ 2004-09-15 20:51           ` Linus Torvalds
  0 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 20:51 UTC (permalink / raw)
  To: Russell King
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik



On Wed, 15 Sep 2004, Linus Torvalds wrote:
> 
> This is a _real_ solution, btw. It's not a cop-out. It's the solution that
> has been used over and over again, and it just works. It's basically 
> saying "theory doesn't matter - this is practice".

Btw, if your architecture supports DMA, you pretty much _have_ to have a
hole in the PCI MMIO space to map real RAM anyway. That hole would tend to 
be the optimal place to "hide" the IO ports in.

That's effectively what the generic lib/iomap.c does (except you may, of
course, have your RAM mapped at some non-zero address). 

			Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:30   ` Being " Linus Torvalds
                       ` (3 preceding siblings ...)
  2004-09-15 17:23     ` Richard B. Johnson
@ 2004-09-15 22:21     ` Deepak Saxena
  2004-09-15 22:46       ` Linus Torvalds
  2004-09-15 22:29     ` Roland Dreier
  5 siblings, 1 reply; 97+ messages in thread
From: Deepak Saxena @ 2004-09-15 22:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

On Sep 15 2004, at 09:30, Linus Torvalds was caught saying:
> At the same time, we've had the proper "accessor" functions (read[bwl](), 
> write[bwl]() and friends) that on purpose dropped all type information 
> from the MMIO pointer, mostly just because of historical reasons, and as a 
> result some drivers didn't use a pointer at all, but some kind of integer. 
> Sometimes even one that couldn't _fit_ a MMIO address in it on a 64-bit 
> machine.

Linus,

Since we are on the subject of io-access, I would like a
clarification/opinion on the read*/write* & in*/out* accessors 
(and now the ioread/write equivalents). Are these functions only meant 
to be used for PCI memory-mapped devices or _any_ memory mapped devices? 
Same with ioremap(). I ask because there are bits of code in the
kernel that use these on non-PCI devices and this sometimes causes 
some complication in platform-level code. For example, b/c of
the way PCI access work on the IXP4xx (indirect access via register
read/writes), we have to do the following to differentiate b/w
PCI and non-PCI devices (include/asm-arm/arch-ixp4xx/io.h):

static inline void *
__ixp4xx_ioremap(unsigned long addr, size_t size, unsigned long flags, unsigned long align)
{
	extern void * __ioremap(unsigned long, size_t, unsigned long, unsigned long);
	if((addr < 0x48000000) || (addr > 0x4fffffff))
		return __ioremap(addr, size, flags, align);

	return (void *)addr;
}

static inline void 
__ixp4xx_writeb(u8 value, u32 addr)
{
	u32 n, byte_enables, data;

	if (addr >= VMALLOC_START) {
		__raw_writeb(value, addr);
		return;
	}

	n = addr % 4;
	byte_enables = (0xf & ~BIT(n)) << IXP4XX_PCI_NP_CBE_BESL;
	data = value << (8*n);
	ixp4xx_pci_write(addr, byte_enables | NP_CMD_MEMWRITE, data);
}

#define	writeb(p, v)			__ixp4xx_writeb(p, v)

(0x48000000 -> 0x4fffffff is the PCI memory window on this CPU).

While this is not a huge level of uglyness, I have systems where 
this is going to get much uglier b/c we have overlapping addresses
on different buses and we need to be able to differentiate accesses
It raises the question of whether we need a different interface 
for non-PCI devices, if we should be passing a struct device into all 
the I/O accessors functions to make it easier for platform code to 
determine what to do, or if we should make I/O access functions a 
property of devices. So instead of doing read*/write/in*/out*, we 
would do either:

a) pass device into io-access routines:

	cookie = iomap(dev, foo, len);
	bar = read32(dev, cookie + offset);

or

b) make access routines a function of the devices themselves

	cookie = dev->iomap(foo, len);
	bar = dev->read32(cookie + REG_OFFSET);

The former is nicer b/c it allows the dev to be ignored on systems where
we do not care about PCI vs non-PCI devices.

Comments?

~Deepak

-- 
Deepak Saxena - dsaxena at plexity dot net - http://www.plexity.net/

"Unlike me, many of you have accepted the situation of your imprisonment
and will die here like rotten cabbages." - Number 6

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 16:30   ` Being " Linus Torvalds
                       ` (4 preceding siblings ...)
  2004-09-15 22:21     ` Deepak Saxena
@ 2004-09-15 22:29     ` Roland Dreier
  2004-09-15 23:26       ` Being more careful " Linus Torvalds
  5 siblings, 1 reply; 97+ messages in thread
From: Roland Dreier @ 2004-09-15 22:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Kernel Mailing List

Linus, while we're on the subject of new sparse checks, could you give
a quick recap of the semantics of the new __leXX types (and what
__bitwise means to sparse)?  I don't think I've ever seen this stuff
described on LKML.

Thanks,
  Roland

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 19:16     ` Linus Torvalds
  2004-09-15 19:40       ` Matthew Wilcox
  2004-09-15 20:20       ` Russell King
@ 2004-09-15 22:38       ` James Bottomley
  2004-09-16  2:33         ` Matthew Wilcox
  2 siblings, 1 reply; 97+ messages in thread
From: James Bottomley @ 2004-09-15 22:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Wed, 2004-09-15 at 15:16, Linus Torvalds wrote:
> What this new interface is doing is to slowlt move "ioremap()" users over
> to "pci_iomap()" (and by extension, maybe "sbus_ioremap()" etc if somebody
> ever starts doing that too), which allows you to do the bus- and device-
> specific translation just _once_.

Please can we not do it this way, that was the old pci_map_single et al
which we've just spent quite a bit of time converting to
dma_map_single.  It might not matter to the PCI centric people, but the
generic DMA API was a wonderful simplification to the rest of us.

Can we start iomap off on the right foot and simply do dma_iomap (or
dev_iomap since dma_ may end up to be confusing), taking a struct device
as an argument and let the platform, which knows where the buses are,
map this to the correct operations?

James

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 22:21     ` Deepak Saxena
@ 2004-09-15 22:46       ` Linus Torvalds
  2004-09-15 23:09         ` Deepak Saxena
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 22:46 UTC (permalink / raw)
  To: Deepak Saxena; +Cc: Kernel Mailing List



On Wed, 15 Sep 2004, Deepak Saxena wrote:
> 
> Since we are on the subject of io-access, I would like a
> clarification/opinion on the read*/write* & in*/out* accessors 
> (and now the ioread/write equivalents). Are these functions only meant 
> to be used for PCI memory-mapped devices or _any_ memory mapped devices? 

It really depends on the bus architecture.

At some point, if the bus is different enough from a "normal" setup, you 
should just use your own accessor functions. Trying to overload 
"readl/writel" is just too painful.

However, at that point you should also realize that you can't re-use _any_ 
of the existing chip drivers, and you'll have to write your own. If the 
bus is exotic enough, that's not a problem, and you'd have to do that 
anyway. But there really aren't all that many "exotic" buses around any 
more.

Quite frankly, of your two suggested interfaces, I would select neither. 
I'd just say that if your bus is special enough, just write your own 
drivers, and use

	cookie = ixp4xx_iomap(dev, xx);
	...
	ixp4xx_iowrite(val, cookie + offset);

which is perfectly valid. You don't have to make these devices even _look_ 
like a PCI device. Why should you?

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 22:46       ` Linus Torvalds
@ 2004-09-15 23:09         ` Deepak Saxena
  2004-09-16 12:51           ` Geert Uytterhoeven
  0 siblings, 1 reply; 97+ messages in thread
From: Deepak Saxena @ 2004-09-15 23:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sep 15 2004, at 15:46, Linus Torvalds was caught saying:
> Quite frankly, of your two suggested interfaces, I would select neither. 
> I'd just say that if your bus is special enough, just write your own 
> drivers, and use
> 
> 	cookie = ixp4xx_iomap(dev, xx);
> 	...
> 	ixp4xx_iowrite(val, cookie + offset);
> 
> which is perfectly valid. You don't have to make these devices even _look_ 
> like a PCI device. Why should you?

Problem is that some of those devices are not that special. For example,
the on-board 16550 is accessed using readb/writeb in the 8250.c driver.
I don't think we want to add that level of low-level detail to that
driver and instead should just hide it in the platform code. I look
at it from the point of view that the driver should not care about how
the access actually occurs on the bus. It just says, write data foo at
location bar regardless of whether bar is ISA, PCI, on-chip, RapidIO,
etc and that writing of the data is hidden in the implementation of
the accessor API.

~Deepak


-- 
Deepak Saxena - dsaxena at plexity dot net - http://www.plexity.net/

"Unlike me, many of you have accepted the situation of your imprisonment
and will die here like rotten cabbages." - Number 6

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more careful about iospace accesses..
  2004-09-15 22:29     ` Roland Dreier
@ 2004-09-15 23:26       ` Linus Torvalds
  2004-09-16  0:10         ` viro
  2004-09-16 12:25         ` David Woodhouse
  0 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-15 23:26 UTC (permalink / raw)
  To: Roland Dreier, Al Viro; +Cc: Kernel Mailing List


[ Subject line changed to avoid getting caught as spam ;]

On Wed, 15 Sep 2004, Roland Dreier wrote:
>
> Linus, while we're on the subject of new sparse checks, could you give
> a quick recap of the semantics of the new __leXX types (and what
> __bitwise means to sparse)?  I don't think I've ever seen this stuff
> described on LKML.

[ The bitwise checks are actually by Al Viro, but I'll explain the basic
  idea. Al is Cc'd so that he can add any corrections or extensions. ]

Sparse allows a number of extra type qualifiers, including address spaces 
and various random extra restrictions on what you can do with them. There 
are "context" bits that allow you to use a symbol or type only in certain 
contexts, for example, and there are type qualifiers like "noderef" that 
just say that a pointer cannot be dereferenced (it looks _exactly_ like a 
pointer in all other respects, but trying to actually access anything 
through it will cause a sparse warning).

The "bitwise" attribute is very much like the "noderef" one, in that it
restricts how you can use an expression of that type. Unlike "noderef",
it's designed for integer types, though. In fact, sparse will refuse to
apply the bitwise attribute to non-integer types.

As the name suggests, a "bitwise" expression is one that is restricted to
only a certain "bitwise" operations that make sense within that class. In
particular, you can't mix a "bitwise" class with a normal integer
expression (the constant zero happens to be special, since it's "safe"  
for all bitwise ops), and in fact you can't even mix it with _another_
bitwise expression of a different type.

And when I say "different", I mean even _slightly_ different. Each typedef 
creates a type of it's own, and will thus create a bitwise type that is 
not compatible with anything else. So if you declare

	int __bitwise i;
	int __bitwise j;

the two variables "i" and "j" are _not_ compatible, simply because they
were declared separately, while in the case of

	int __bitwise i, j;

they _are_ compatible. The above is a horribly contrieved example, as it
shows an extreme case that doesn't make much sense, but it shows how
"bitwise" always creates its own new "class".

Normally you'd always use "__bitwise"  in a typedef, which effectively
makes that particular typedef one single "bitwise class". After that, you 
can obviously declare any number of variables in that class.

Now apart from the classes having to match, "bitwise" as it's name
suggests, also restricts all operations within that class to a subset of
"bit-safe" operations. For example, addition isn't "bit-safe", since
clearly the carry-chain moves bits around. But you can do normal bit-wise
operations, and you can compare the values against other values in the
same class, since those are all "bit-safe".

Oh, as an example of something that isn't obviously bit-safe: look out for
things like bit negation: doing a ~ is ok on an bitwise "int" type, but it
is _not_ ok on a bitwise "short" or "char". Why?  Because on a bitwise
"int" you actually stay within the type. But doing the same thing on a
short or char will move "outside" the type by virtue of setting the high
bits (normal C semantics: a short gets promoted to an "int", so doign a
bitwise negation on a short will actually set the high bits).

So as far as sparse is concerned, a "bitwise" type is not really so much 
about endianness as it is about making sure bits are never lost or moved 
around.

For example, you can use the bitwise operation to verify the __GFP_XXX 
mask bits. Right now they are just regular integers, which means that you 
can write

	kmalloc(GFP_KERNEL, size);

and the compiler will not notice anything wrong. But something is
_seriously_ wrong: the GFP_KERNEL should be the _second_ argument. If we
mark it to be a "bitwise" type (which it is), that bug would have been
noticed immediately, and you could still do all the operations that are
valid of GFP_xxx values.

See the usage?

In the byte-order case, what we have is:

	typedef __u16 __bitwise __le16;
	typedef __u16 __bitwise __be16;
	typedef __u32 __bitwise __le32;
	typedef __u32 __bitwise __be32;
	typedef __u64 __bitwise __le64;
	typedef __u64 __bitwise __be64;

and if you think about the above rules about what is acceptable for 
bitwise types, you'll likely immediately notivce that it automatically 
means

 - you can never assign a __le16 type to any other integer type or any 
   other bitwise type. You'd get a warnign about incompatible types. Makes 
   sense, no?
 - you can only do operations that are safe within that byte order. For 
   example, it is safe to do a bitwise "&" on two __le16 values. Clearly 
   the result is meaningful.
 - if you want to go outside that bitwise type, you have to convert it 
   properly first. For example, if you want to add a constant to a __le16 
   type, you can do so, but you have to use the proper sequence:

	__le16 sum, a, b;

	sum = a + b;	/* INVALID! "warning: incompatible types for operation (+)" */
	sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b));	/* Ok */

See? 

In short, "bitwise" is about more than just byte-order, but the semantics 
of bitwise-restricted ops happen to be the semantics that are valid for 
byte-order operations too.

Oh, btw, right now you only get the warnings from sparse if you use
"-Wbitwise" on the command line. Without that, sparse will ignore the
bitwise attribute.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more careful about iospace accesses..
  2004-09-15 23:26       ` Being more careful " Linus Torvalds
@ 2004-09-16  0:10         ` viro
  2004-09-16 11:40           ` David Woodhouse
  2004-09-16 12:25         ` David Woodhouse
  1 sibling, 1 reply; 97+ messages in thread
From: viro @ 2004-09-16  0:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roland Dreier, Kernel Mailing List

On Wed, Sep 15, 2004 at 04:26:12PM -0700, Linus Torvalds wrote:
 
>    other bitwise type. You'd get a warnign about incompatible types. Makes 
>    sense, no?
>  - you can only do operations that are safe within that byte order. For 
>    example, it is safe to do a bitwise "&" on two __le16 values. Clearly 
>    the result is meaningful.

BTW, so far the most frequent class of endianness bugs had been along the
lines of
	foo->le16_field = cpu_to_le32(12);
and vice versa.  On big-endian it's a guaranteed FUBAR - think carefully about
the value that will end up there.

> Oh, btw, right now you only get the warnings from sparse if you use
> "-Wbitwise" on the command line. Without that, sparse will ignore the
> bitwise attribute.

We probably want __attribute__((opaque)) in addition to bitwise - e.g. for
the handles of all sorts passed in network filesystem protocols.  I'll look
into that when we get the endianness warnings somewhat under control.

For now I'm going to #define __opaque __bitwise and use it for stuff like
	typedef __u32 __opaque cifs_fid;
etc.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 22:38       ` James Bottomley
@ 2004-09-16  2:33         ` Matthew Wilcox
  2004-09-16  4:28           ` Linus Torvalds
  0 siblings, 1 reply; 97+ messages in thread
From: Matthew Wilcox @ 2004-09-16  2:33 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 06:38:50PM -0400, James Bottomley wrote:
> Please can we not do it this way, that was the old pci_map_single et al
> which we've just spent quite a bit of time converting to
> dma_map_single.  It might not matter to the PCI centric people, but the
> generic DMA API was a wonderful simplification to the rest of us.
> 
> Can we start iomap off on the right foot and simply do dma_iomap (or
> dev_iomap since dma_ may end up to be confusing), taking a struct device
> as an argument and let the platform, which knows where the buses are,
> map this to the correct operations?

Actually, I think we're OK with the current interface here.  The routines
we have that give us an iomem pointer are:

/* Create a virtual mapping cookie for an IO port range */
extern void __iomem *ioport_map(unsigned long port, unsigned int nr);
extern void ioport_unmap(void __iomem *);

/* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
struct pci_dev;
extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
extern void pci_iounmap(struct pci_dev *dev, void __iomem *);

So rather than pass the struct device in, we've got a special function
for each bus type.


What we haven't dealt with in this interface are:

 - relaxed reads
 - byte ordering

The relaxed attribute is, I think, not a property of the space we're
looking at.  It's a property of any given transaction.  So I think we
really do want a set of ioreadXX_relaxed() functions.

Byte ordering I'm less clear on.  Is this a property of the area "this
area is little endian on the bus, but the host cpu is big endian",
or is it a property of an individual transaction?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  2:33         ` Matthew Wilcox
@ 2004-09-16  4:28           ` Linus Torvalds
  2004-09-16  4:57             ` Benjamin Herrenschmidt
                               ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16  4:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> 
> Actually, I think we're OK with the current interface here.  The routines
> we have that give us an iomem pointer are:
> 
> /* Create a virtual mapping cookie for an IO port range */
> extern void __iomem *ioport_map(unsigned long port, unsigned int nr);
> extern void ioport_unmap(void __iomem *);
> 
> /* Create a virtual mapping cookie for a PCI BAR (memory or IO) */
> struct pci_dev;
> extern void __iomem *pci_iomap(struct pci_dev *dev, int bar, unsigned long max);
> extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
> 
> So rather than pass the struct device in, we've got a special function
> for each bus type.

Indeed. Once created, the cookie is "bus-agnostic", which is very much the 
point. Unlike the original pci DMA interfaces, there are no real bus 
assumtions in the interfaces once set up. 

And set-up obviously _does_ need to be bus-specific, the same way device
discovery is. You don't dicsover non-specific devices. You very much
initialize one particular instance of one particular device/bus.

Now, other buses might want to choose to ignore the ioread/iowrite 
interfaces, and do their own bus-specific ones. I doubt it makes much 
sense, but it's not like we want to _force_ one particular IO model on 
people. 

> What we haven't dealt with in this interface are:
> 
>  - relaxed reads
>  - byte ordering
> 
> The relaxed attribute is, I think, not a property of the space we're
> looking at.  It's a property of any given transaction.  So I think we
> really do want a set of ioreadXX_relaxed() functions.

I haven't seen very many people actually use this. I've been assuming that 
what people want is the same "reasonably relaxed" behaviour that the 
regular MMIO functions have (as opposed to what traditional PIO is: it's 
slow as hell because it doesn't support posting). Are there actually users 
of any other model out there?

> Byte ordering I'm less clear on.  Is this a property of the area "this
> area is little endian on the bus, but the host cpu is big endian",
> or is it a property of an individual transaction?

I'd suggest individual transaction, although again, I doubt it is all that
much used. The "repeat" versions do IO-native byte order, which is really
the only case where we've ever really cared. David mentioned that some
network devices had tried to use "host-native" byte order (which some
hardware supports), but that it hadn't been worth the pain and apparently
got ripped out.

In other words, I'd like to keep the interfaces as simple as humanly
possible, and only introduce new concepts if absolutely required. For
example, one issue is 64-bit accesses: they will _not_ be atomic on most
architectures, and given that the question is whether they are worth
supporting at all, or whether drivers should just be expected to make the
non-atomicity explicit (by loading two dwords, the way they have to do
with the old interfaces anyway).

(Note that 64-bit accesses aren't necessarily atomic even on 64-bit
architectures. BenH says that the IO architecture on a G5 means that when
you do a 64-bit load, it will be done as two 32-bit bus cycles
regardless).

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:28           ` Linus Torvalds
@ 2004-09-16  4:57             ` Benjamin Herrenschmidt
  2004-09-16  4:58               ` Benjamin Herrenschmidt
  2004-09-16 13:41             ` Matthew Wilcox
  2004-09-16 22:30             ` Matthew Wilcox
  2 siblings, 1 reply; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-16  4:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thu, 2004-09-16 at 14:28, Linus Torvalds wrote:

> (Note that 64-bit accesses aren't necessarily atomic even on 64-bit
> architectures. BenH says that the IO architecture on a G5 means that when
> you do a 64-bit load, it will be done as two 32-bit bus cycles
> regardless).

Yup, at least with the current models, due to the fact that the U3
northbridge basically has a 32 bits path to IO devices. Note that this
isn't the case with IBM machines who do proper 64 bits PCI IOs.

Ben.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:57             ` Benjamin Herrenschmidt
@ 2004-09-16  4:58               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-16  4:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thu, 2004-09-16 at 14:57, Benjamin Herrenschmidt wrote:
> On Thu, 2004-09-16 at 14:28, Linus Torvalds wrote:
> 
> > (Note that 64-bit accesses aren't necessarily atomic even on 64-bit
> > architectures. BenH says that the IO architecture on a G5 means that when
> > you do a 64-bit load, it will be done as two 32-bit bus cycles
> > regardless).
> 
> Yup, at least with the current models, due to the fact that the U3
> northbridge basically has a 32 bits path to IO devices. Note that this
> isn't the case with IBM machines who do proper 64 bits PCI IOs.

Hrm... now I'm confused ... it _MIGHT_ be able to do 64 bits wide IOs
in fact ... but it's limited to 32 bits _addresses_ ... oh well...

Ben.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more careful about iospace accesses..
  2004-09-16  0:10         ` viro
@ 2004-09-16 11:40           ` David Woodhouse
  0 siblings, 0 replies; 97+ messages in thread
From: David Woodhouse @ 2004-09-16 11:40 UTC (permalink / raw)
  To: viro; +Cc: Linus Torvalds, Roland Dreier, Kernel Mailing List

On Thu, 2004-09-16 at 01:10 +0100,
viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Wed, Sep 15, 2004 at 04:26:12PM -0700, Linus Torvalds wrote:
>  
> >    other bitwise type. You'd get a warnign about incompatible types. Makes 
> >    sense, no?
> >  - you can only do operations that are safe within that byte order. For 
> >    example, it is safe to do a bitwise "&" on two __le16 values. Clearly 
> >    the result is meaningful.
> 
> BTW, so far the most frequent class of endianness bugs had been along the
> lines of
> 	foo->le16_field = cpu_to_le32(12);
> and vice versa.  On big-endian it's a guaranteed FUBAR - think carefully about
> the value that will end up there.

Is that really more frequent than just 'foo->le16_field = 12' ? 
I'm surprised. 

Certainly, it was the frequency of pure assignment without _any_ attempt
at byte-swapping which caused me to introduce the 'jint32_t' et al
structures in jffs2, which even gcc then bitches about if you use them
wrongly.

I suppose I can ditch those now -- I always intended to after a while
anyway.

-- 
dwmw2



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-15 20:10         ` Linus Torvalds
  2004-09-15 20:17           ` Linus Torvalds
@ 2004-09-16 12:17           ` David Woodhouse
  2004-09-16 13:52             ` Linus Torvalds
  1 sibling, 1 reply; 97+ messages in thread
From: David Woodhouse @ 2004-09-16 12:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik

On Wed, 2004-09-15 at 13:10 -0700, Linus Torvalds wrote:
> On Wed, 15 Sep 2004, Matthew Wilcox wrote:
> 
> > Could we have a 'dev' parameter to ioread8()?  Please?
> 
> Nope. See earlier email on why. There is no "dev" any more at the point of
> a generic chipset handler: the "dev" may have any of millions of possible 
> types (well, ok, a few different bus architectures, but still).

It's not just about address spaces. Have you looked at some of the
implementations of readb() et al on some SH boards recently, with all
the if/elif/elif/ crap for different ranges of the address space? And
don't even get me started on the one where the PCI bridge allows you to
only map 48MiB of the PCI bus into the host address space at a time, so
MMIO accesses have to be done via an indirect method like config space
accesses are. You don't assign host addresses in ioremap() and you
_really_ get screwed over if broken drivers dereference __iomem
'pointers' instead of using the proper accessor functions :)

Yes, we _can_ continue do it the same way. But I was sort of hoping that
with a new API it'd get nicer.

> I seriously doubt that you don't have space to encode the same information 
> that was already there before in a pointer. David was kicking and 
> screaming too, until Jeff pointed out that it was all in the same address 
> space, and then he said "oh", and suddenly it was trivial.

Well actually I bitched a little more on IRC and not here, because this
is supposed to be a low-noise forum. 

I would have been a lot more coherent in my argument a couple of years
ago when I actually had platforms on my desk on which it matters.
Nowadays it's all ppc and ppc64 and they are relatively sane, at lst n
tht rspct. At least they don't think it's sane to drag a crack-smoking
hobo off the street and ask him to implement a PCI host bridge in an
FPGA, so my life is a lot nicer :)

I certainly don't recall just saying 'oh'; you may have inferred that
from my silence.

-- 
dwmw2

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more careful about iospace accesses..
  2004-09-15 23:26       ` Being more careful " Linus Torvalds
  2004-09-16  0:10         ` viro
@ 2004-09-16 12:25         ` David Woodhouse
  2004-09-16 14:01           ` Linus Torvalds
  1 sibling, 1 reply; 97+ messages in thread
From: David Woodhouse @ 2004-09-16 12:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roland Dreier, Al Viro, Kernel Mailing List

On Wed, 2004-09-15 at 16:26 -0700, Linus Torvalds wrote:
>  - if you want to go outside that bitwise type, you have to convert it 
>    properly first. For example, if you want to add a constant to a __le16 
>    type, you can do so, but you have to use the proper sequence:
> 
> 	__le16 sum, a, b;
> 
> 	sum = a + b;	/* INVALID! "warning: incompatible types for operation (+)" */
> 	sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b));	/* Ok */
> 
> See? 

Yeah right, that latter case is _so_ much more readable, and makes it
_so_ easy for the compiler to optimise precisely when it wants to do the
byte-swapping, especially if the back end has load-and-swap or
store-and-swap instructions. :)

It's even nicer when it ends up as:

	sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b));	/* Ok */
	sum |= c;
	sum = cpu_to_le16(le16_to_cpu(sum) + le16_to_cpu(d));

I'd really quite like to see the real compiler know about endianness,
too. I dare say I _could_ optimise the above (admittedly contrived but
not _so_ unlikely) case, but I don't _want_ to hand-optimise my code --
that's what I keep a compiler _for_.

-- 
dwmw2



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 23:09         ` Deepak Saxena
@ 2004-09-16 12:51           ` Geert Uytterhoeven
  2004-09-16 22:10             ` Deepak Saxena
  0 siblings, 1 reply; 97+ messages in thread
From: Geert Uytterhoeven @ 2004-09-16 12:51 UTC (permalink / raw)
  To: Deepak Saxena; +Cc: Linus Torvalds, Linux Kernel Development

On Wed, 15 Sep 2004, Deepak Saxena wrote:
> On Sep 15 2004, at 15:46, Linus Torvalds was caught saying:
> > Quite frankly, of your two suggested interfaces, I would select neither.
> > I'd just say that if your bus is special enough, just write your own
> > drivers, and use
> >
> > 	cookie = ixp4xx_iomap(dev, xx);
> > 	...
> > 	ixp4xx_iowrite(val, cookie + offset);
> >
> > which is perfectly valid. You don't have to make these devices even _look_
> > like a PCI device. Why should you?
>
> Problem is that some of those devices are not that special. For example,
> the on-board 16550 is accessed using readb/writeb in the 8250.c driver.

Just what I was thinking...

> I don't think we want to add that level of low-level detail to that
> driver and instead should just hide it in the platform code. I look
> at it from the point of view that the driver should not care about how
> the access actually occurs on the bus. It just says, write data foo at
> location bar regardless of whether bar is ISA, PCI, on-chip, RapidIO,
> etc and that writing of the data is hidden in the implementation of
> the accessor API.

While 16550 serial is a bad example since it does byte accesses only (and thus
doesn't suffer from endianness), I have no problem to imagine there exist
platforms where you have multiple instances of a `standard' (cfr. 16550 serial)
device block, while each of them must be accessed differently:
  - one of them is in PCI I/O space (little endian)
  - one of them is in PCI MMIO space (little endian)
  - one of them is on-chip MMIO (big endian)
  - one of them is somewhere else, but sparsely addressed (some bytes of
    padding between each register)
and we can for sure come up with a few more examples of weird addressing.

How to solve this, without cluttering each ioread*() with many if clauses?

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:28           ` Linus Torvalds
  2004-09-16  4:57             ` Benjamin Herrenschmidt
@ 2004-09-16 13:41             ` Matthew Wilcox
  2004-09-16 18:21               ` Linus Torvalds
  2004-09-16 22:30             ` Matthew Wilcox
  2 siblings, 1 reply; 97+ messages in thread
From: Matthew Wilcox @ 2004-09-16 13:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 09:28:01PM -0700, Linus Torvalds wrote:
> > The relaxed attribute is, I think, not a property of the space we're
> > looking at.  It's a property of any given transaction.  So I think we
> > really do want a set of ioreadXX_relaxed() functions.
> 
> I haven't seen very many people actually use this. I've been assuming that 
> what people want is the same "reasonably relaxed" behaviour that the 
> regular MMIO functions have (as opposed to what traditional PIO is: it's 
> slow as hell because it doesn't support posting). Are there actually users 
> of any other model out there?

I think the reason you haven't seen much of it is that only SGI people
care about it ;-)  The only drivers in the tree that use it are Fusion,
qla1280 and qla2xxx, and the only architecture that defines readb_relaxed()
to be anything other than readb() is ia64.

However, they do see significant wins from using this.  According to
Jeremy Higdon this can reduce a 50us read to a 1us read.  So it's probably
worth doing for them.

> > Byte ordering I'm less clear on.  Is this a property of the area "this
> > area is little endian on the bus, but the host cpu is big endian",
> > or is it a property of an individual transaction?
> 
> I'd suggest individual transaction, although again, I doubt it is all that
> much used. The "repeat" versions do IO-native byte order, which is really
> the only case where we've ever really cared. David mentioned that some
> network devices had tried to use "host-native" byte order (which some
> hardware supports), but that it hadn't been worth the pain and apparently
> got ripped out.

After some further thought, I think it's a property of the remapping
transaction.  I'm thinking about the case of a chip that's connected
either via PCI or via a host-native bus on a big-endian platform.  The
foo_iomap() would mark the area as being big-endian and pci_iomap() as
little-endian.

> In other words, I'd like to keep the interfaces as simple as humanly
> possible, and only introduce new concepts if absolutely required. For
> example, one issue is 64-bit accesses: they will _not_ be atomic on most
> architectures, and given that the question is whether they are worth
> supporting at all, or whether drivers should just be expected to make the
> non-atomicity explicit (by loading two dwords, the way they have to do
> with the old interfaces anyway).

There's a readq() on all 64 bit architectures.  It's not used very much,
but (for example), the s2io 10GE driver uses it extensively.  I think
if we _don't_ provide an ioread64() we'll see these drivers either not
switch to this interface or, worse, reimplement readq themselves with
predictably horrible consequences.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 12:17           ` David Woodhouse
@ 2004-09-16 13:52             ` Linus Torvalds
  0 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 13:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Matthew Wilcox, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, David Woodhouse wrote:
>
>   At least they don't think it's sane to drag a crack-smoking
> hobo off the street and ask him to implement a PCI host bridge in an
> FPGA, so my life is a lot nicer :)

LOL.

I hope some of the poor unsuspecting souls who have to use the dang thing
will educate the hw hobos in question. Preferably with a baseball bat.

The good news is that hardware _does_ tend to slowly become saner. I don't
know if it's the crack supply that eventually dwindels, or what, but the
hw designers to tend to eventually take note when they find death-threats
in their cubby-holes. So if you do see a broken chip, tell them.  

Preferably by getting your company to say "we'll use that other _sane_
chip instead", and letting the broken vendor know why.

> I certainly don't recall just saying 'oh'; you may have inferred that
> from my silence.

Heh. I was talking about the other David, ie the "Miller" kind. 

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more careful about iospace accesses..
  2004-09-16 12:25         ` David Woodhouse
@ 2004-09-16 14:01           ` Linus Torvalds
  2004-09-18  9:46             ` Kai Henningsen
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 14:01 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Roland Dreier, Al Viro, Kernel Mailing List



On Thu, 16 Sep 2004, David Woodhouse wrote:

> On Wed, 2004-09-15 at 16:26 -0700, Linus Torvalds wrote:
> >  - if you want to go outside that bitwise type, you have to convert it 
> >    properly first. For example, if you want to add a constant to a __le16 
> >    type, you can do so, but you have to use the proper sequence:
> > 
> > 	__le16 sum, a, b;
> > 
> > 	sum = a + b;	/* INVALID! "warning: incompatible types for operation (+)" */
> > 	sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b));	/* Ok */
> > 
> > See? 
> 
> Yeah right, that latter case is _so_ much more readable

It's not about readability.

It's about the first case being WRONG!

You can't add two values in the wrong byte-order. It's not an operation 
that makes sense. You _have_ to convert them to CPU byte order first.

I certainly agree that the first version "looks nicer". 

> It's even nicer when it ends up as:
> 
> 	sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b));	/* Ok */
> 	sum |= c;
> 	sum = cpu_to_le16(le16_to_cpu(sum) + le16_to_cpu(d));

This is actually the strongest argument _against_ hiding endianness in the 
compiler, or hiding it behind macros. Make it very explicit, and just make 
sure there are tools (ie 'sparse') that can tell you when you do something 
wrong.

Any programmer who sees the above will go "well that's stupid", and 
rewrite it as something saner instead. You can certainly rewrite it as

	cpu_sum = le16_to_cpu(a) + le16_to_cpu(b);
	cpu_sum |= le16_to_cpu(c);
	cpu_sum += le16_to_cpu(d);
	sum = cpu_to_le16(d);

which gets rid of the double conversions. 

But if you hide the endianness in macro's, you'll never see the mess at 
all, and won't be able to fix it.

> I'd really quite like to see the real compiler know about endianness,
> too.

I would have agreed with you some time ago. Having been bitten by too damn 
many bompiler bugs I'e become convinced that the compiler doing things 
behind your back to "help" you just isn't worth it. Not in a kernel, at 
least. It's much better to build up good typechecking and the 
infrastructure to help you get the job done.

Expressions like the above might happen once or twice in a project with
several million lines of code. It's just not worth compiler infrastructure
for - that just makes people use it as if it is free, and impossible to
find the bugs when they _do_ happen. Much better to have a type system 
that can warn about the bad uses, but that doesn't actually change any of 
the code generated..

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-15 17:57           ` Linus Torvalds
  2004-09-15 18:06             ` Linus Torvalds
  2004-09-15 19:34             ` Greg KH
@ 2004-09-16 14:58             ` Horst von Brand
  2 siblings, 0 replies; 97+ messages in thread
From: Horst von Brand @ 2004-09-16 14:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jörn Engel, Kernel Mailing List

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 762 bytes --]

Linus Torvalds <torvalds@osdl.org> said:
> On Wed, 15 Sep 2004, Jörn Engel wrote:
> > 
> > But it still leaves me confused.  Before I had this code:

[...]

> No, you can certainly continue to use non-void pointers. The "void __iomem
> *" case is just the typeless one, exactly equivalent to regular void
> pointer usage.
> 
> So let me clarify my original post with two points:

[...]

Could you please put the clarified message into Documentation? It would be
a shame if it got lost.
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 13:41             ` Matthew Wilcox
@ 2004-09-16 18:21               ` Linus Torvalds
  2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:01                 ` Linus Torvalds
  0 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 18:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> 
> I think the reason you haven't seen much of it is that only SGI people
> care about it ;-)  The only drivers in the tree that use it are Fusion,
> qla1280 and qla2xxx, and the only architecture that defines readb_relaxed()
> to be anything other than readb() is ia64.

Bzzt.

ia64 actually defines readb to be the same as readb_relaxed.

It's literally only the sn2 _platform_ that does something stupid. In 
other words, it's not ia64 that is broken (surprise surprise - you never 
expected me to have said anything good about ia64), it's literally the SN2 
platform that seems broken.

And quite frankly, even than I'm not sure that it's the hw platform itself 
that is broken, it might well just be the architecture code. For some 
reason the SN2 code does:

	/*
	 * The following routines are SN Platform specific, called when
	 * a reference is made to readX/writeX set macros.  SN Platform
	 * readX set of macros ensures that Posted DMA writes on the
	 * Bridge is flushed.
	 *
	 * The routines should be self explainatory.
	 */

	static inline unsigned char
	___sn_readb (void *addr)
	{
	        unsigned char val;

	        val = *(volatile unsigned char *)addr;
	        __sn_mf_a();
	        sn_dma_flush((unsigned long)addr);
	        return val;
	}

in other words, it looks like there's a broken PCI "bridge" involved, that 
just doesn't follow the specs, and does bad things.

> However, they do see significant wins from using this.  According to
> Jeremy Higdon this can reduce a 50us read to a 1us read.  So it's probably
> worth doing for them.

I'd suggest to them to unbreak their silly "readb()" implementation, and 
 (a) talk to their hw engineers
 (b) add explicit flush code to the 5 (or so) drivers that they really 
     care about. Let's face it, I doubt there are that many different 
     devices that people use on the SN2. Make a nice "synchronize"  
     interface that you could architect, and make conforming platforms
     just make it into a no-op.

I'd hate to add complex new interfaces just because some hardware is 
broken. Don't make the generic interfaces suffer for broken hardware.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:21               ` Linus Torvalds
@ 2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:09                   ` Linus Torvalds
                                     ` (2 more replies)
  2004-09-16 19:01                 ` Linus Torvalds
  1 sibling, 3 replies; 97+ messages in thread
From: Jesse Barnes @ 2004-09-16 18:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> It's literally only the sn2 _platform_ that does something stupid. In
> other words, it's not ia64 that is broken (surprise surprise - you never
> expected me to have said anything good about ia64), it's literally the SN2
> platform that seems broken.

No, it's just the only smart platform.  Apparently we're the only ones who 
realized that most PIO reads do not rely on prior DMA being complete (and 
that it's expensive to make sure that they do), so we optimized that out of 
our hardware.  Only interrupts will guarantee DMA completion on SGI 
platforms.

> And quite frankly, even than I'm not sure that it's the hw platform itself
> that is broken, it might well just be the architecture code. For some
> reason the SN2 code does:
>
>  /*
>   * The following routines are SN Platform specific, called when
>   * a reference is made to readX/writeX set macros.  SN Platform
>   * readX set of macros ensures that Posted DMA writes on the
>   * Bridge is flushed.
>   *
>   * The routines should be self explainatory.
>   */
>
>  static inline unsigned char
>  ___sn_readb (void *addr)
>  {
>          unsigned char val;
>
>          val = *(volatile unsigned char *)addr;
>          __sn_mf_a();
>          sn_dma_flush((unsigned long)addr);
>          return val;
>  }
>
> in other words, it looks like there's a broken PCI "bridge" involved, that
> just doesn't follow the specs, and does bad things.

Correct, it doesn't follow the specs.  See above.

> I'd suggest to them to unbreak their silly "readb()" implementation, and
>  (a) talk to their hw engineers
>  (b) add explicit flush code to the 5 (or so) drivers that they really
>      care about. Let's face it, I doubt there are that many different
>      devices that people use on the SN2. Make a nice "synchronize"
>      interface that you could architect, and make conforming platforms
>      just make it into a no-op.
>
> I'd hate to add complex new interfaces just because some hardware is
> broken. Don't make the generic interfaces suffer for broken hardware.

Sure, that's the way I'd like to go, but that's a dangerous route.  Devices 
that want to do a PIO read to see if their DMA is complete will work most of 
the time, but you'll get occasional data corruption.  IOW, specifying reads 
that you know *aren't* dependent is much easier to do, and safer.  It also 
allows SGI hardware to run all current drivers w/o modification, which is 
nice.

If you want to explicitly document that the new io* interface doesn't 
guarantee DMA coherence, great, we can start from scratch with a good 
implementation that has an explicit flush mechanism.

Thanks,
Jesse

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:21               ` Linus Torvalds
  2004-09-16 18:52                 ` Jesse Barnes
@ 2004-09-16 19:01                 ` Linus Torvalds
  2004-09-16 19:13                   ` Jeff Garzik
  2004-09-17  5:20                   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 19:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Linus Torvalds wrote:
> 
> I'd suggest to them to unbreak their silly "readb()" implementation, and 
>  (a) talk to their hw engineers
>  (b) add explicit flush code to the 5 (or so) drivers that they really 
>      care about. Let's face it, I doubt there are that many different 
>      devices that people use on the SN2. Make a nice "synchronize"  
>      interface that you could architect, and make conforming platforms
>      just make it into a no-op.

Actually, we could also take the "safe" approach, and just specify that 
the new "ioread()/iowrite()" interfaces are always very relaxed, and apart 
from endianness won't do much extra synchronization.

What are sane rules? Clearly it is _not_ a sane rule to allow memory 
re-ordering to the same device, because most driver authors have problems 
as-is, and trying to make them aware of speculative reads and writes 
passing writes is just not going to fly. So a minimal sane rule would be 
that each device seens programmatic IO accesses in "thread order".

But I think it would be ok to say that DMA needs explicit synchronization,
and obviously write posting (as long as it doesn't re-order IO accesses)  
is fine. The latter is already true for MMIO (and arguably acceptable even
for PIO, although x86's have historically had slightly stricter ordering
on PIO).

So how about trying to come up with a sane interface for "synchronize
DMA". I _suspect_ it will have to be device-specific, so we should aim for
a similar interface to the "dma-mapping" ones (ie based on "struct
device", not the bus it's on).

In most cases it would be a no-op, or we could hide some pointer a la
"dev->bus->sync(dev)" (which could do a status read from the bridge or
whatever..)

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:52                 ` Jesse Barnes
@ 2004-09-16 19:09                   ` Linus Torvalds
  2004-09-16 20:02                     ` Jesse Barnes
  2004-09-16 20:04                   ` David S. Miller
  2004-09-17  5:17                   ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 19:09 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Jesse Barnes wrote:
> 
> Sure, that's the way I'd like to go, but that's a dangerous route.

It might be slightly less dangerous if we only do this for the new 
interface. It would still be up to you guys to verify that drivers use 
whatever new "dma_synchronize(dev)" interface, but hey, that's the price 
you pay for not following the standard. 

I know hardware people love saying "we can do this faster doing XXX and 
then sw can take care of it". Make the powers-that-be realize that if 
software takes care of it, it means that sw needs resources to track it..

> If you want to explicitly document that the new io* interface doesn't 
> guarantee DMA coherence, great, we can start from scratch with a good 
> implementation that has an explicit flush mechanism.

I think that's the right thing to do at this point, but since nobody else
will really see this issue (since standard PCI does synchronize DMA),
there will inevitably be bugs that are SGI-specific. Just as everybody is
aware of the issue, and realizes that there's going to be some cost of
that to SGI in the form of testing and verification, this sounds like the
way to go..

Does that sound reasonable to everybody?

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:01                 ` Linus Torvalds
@ 2004-09-16 19:13                   ` Jeff Garzik
  2004-09-16 19:50                     ` Linus Torvalds
  2004-09-17  5:20                   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 97+ messages in thread
From: Jeff Garzik @ 2004-09-16 19:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller

Linus Torvalds wrote:
> What are sane rules?

I'd err on the side of "simple and tough to screw up" :)

What rules are most beneficial to the likely users of this API?

	Jeff

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:13                   ` Jeff Garzik
@ 2004-09-16 19:50                     ` Linus Torvalds
  2004-09-16 20:07                       ` Alan Cox
  2004-09-17  5:44                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 19:50 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller



On Thu, 16 Sep 2004, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> > What are sane rules?
> 
> I'd err on the side of "simple and tough to screw up" :)

I definitely agree. As I said, I think strict memory ordering is a must
for that reason.

That said, if we make make things _too_ strict, and that then makes some 
people start looking at extended interfaces in order to get their work 
done, the rules then effectively become more complex again, because now we 
have a much bigger API - a "simple" set with strict rules and 
synchronization, mixed together with the "give them rope" set.

And quite frankly, we've already seen that many driver writers will then 
start using the "give them rope" set, if only because they really think 
they need to. And then they'll screw up again.

> What rules are most beneficial to the likely users of this API?

I think the "programmatic IO accesses are strictly in-order for one
particular device" rule is so fundamentally required that nobody will
really argue against it. There is very little hardware that doesn't really 
enforce that rule anyway, and most of the extensions to it ("allow 
bursting and other fancy stuff" is in practice done by the driver setup 
code anyway, and has nothing to do with the individual IO accesses).

So the only remaining source of issues would be synchronization between
devices (which we haven't really supported anyway, and which nobody really
tends to care about), and synchronization with "side-band signals". And
there are really only two side-band signals that I can think of: DMA and
IRQ's.

I would argue that we likely don't normally care about strict
synchronization with those side-band signals, because most of the
serialization that a driver would rely on would be strictly causal (and
thus happen regardless of what interface we make up:  I seriously doubt we
could come up with a non-causal interface, but some physicists might be
very interested indeed ;).

So for example, we can pretty much depend on a "command completed"
interrupt being asserted only after we've written the command to the chip. 
No interface issues there. Similarly, if we read a status register in an 
interrupt handler, we _will_ get the status of the interrupt, unless the 
chip itself does some buffering at which point it is a driver issue to 
handle that, rather than an interface issue.

So the remaining issues are really things we already see with DMA, for
example: if we read a mailbox from memory saying that DMA has completed,
we need to have a read-memory barrier before we actually read the contents
of the IO, otherwise the CPU might read buffer contents "prior" to the
completion.

The same would go for doing a "ioread()" that reads a status register in 
MMIO that says "DMA is done". Before we can _trust_ that, we'd need to 
synchronize with the DMA "out-of-band" stuff. I think that's an acceptable 
interface, since it's something we already support for memory-based 
interfaces. 

It would be different, of course - it wouldn't be a read barrier, it would
be a "ioread" barrier. But the _concept_ is the same. And it doesn't tend
to hit us all that often, and the biggest pain is literally that in-spec
PCI devices will never show this, so the real problem for hardware like
SGI's in this case is lack of test coverage out in the rest of the world. 
That's the price you pay if you do things differently.

Implementation shouldn't be that hard. On any conforming PCI platform, 
it's a no-op, since reads from MMIO space are supposed to already 
synchronize with any buffered DMA (it still leaves the question of CPU 
memory ordering wrt uncached accesses, and thus we might want to have a 
memory barrier in place depending on architecture). And SGI would have 
something that synchronizes with the bridge nearest to the device.

So I _think_ the SGI case would be perfectly happy with an interface like 

	void dma_sync(struct device *dev);

which on their SN2 machines would do "dev->bus->sync(dev)" (where they 
would add something that reads from the bridge), and the rest of the world 
might just do a memory barrier there, or leave it as a no-op.

And yes, a few drivers would have "dma_sync()" in a few places. Not that 
many, likely.

Would driver writers get this wrong, and forget? Absolutely. But that's 
true of any interface we could come up with. If there is something we can 
absolutely depend on in life, it's that driver writers _will_ do something 
wrong. 

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:09                   ` Linus Torvalds
@ 2004-09-16 20:02                     ` Jesse Barnes
  2004-09-16 20:37                       ` James Bottomley
  0 siblings, 1 reply; 97+ messages in thread
From: Jesse Barnes @ 2004-09-16 20:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thursday, September 16, 2004 12:09 pm, Linus Torvalds wrote:
> On Thu, 16 Sep 2004, Jesse Barnes wrote:
> > Sure, that's the way I'd like to go, but that's a dangerous route.
>
> It might be slightly less dangerous if we only do this for the new
> interface. It would still be up to you guys to verify that drivers use
> whatever new "dma_synchronize(dev)" interface, but hey, that's the price
> you pay for not following the standard.

Sure, sounds fair.  I just wanted to avoid changing the existing interfaces, 
that's all.

> I know hardware people love saying "we can do this faster doing XXX and
> then sw can take care of it". Make the powers-that-be realize that if
> software takes care of it, it means that sw needs resources to track it..

Yep.

> > If you want to explicitly document that the new io* interface doesn't
> > guarantee DMA coherence, great, we can start from scratch with a good
> > implementation that has an explicit flush mechanism.
>
> I think that's the right thing to do at this point, but since nobody else
> will really see this issue (since standard PCI does synchronize DMA),

Note that PCI-X and PCI Express allow a lack of synchronization via the 
relaxed ordering bit.  So platforms can either set that bit in their iomap 
functions and deal with synchronization via dma_sync or not use it I guess.

> there will inevitably be bugs that are SGI-specific. Just as everybody is
> aware of the issue, and realizes that there's going to be some cost of
> that to SGI in the form of testing and verification, this sounds like the
> way to go..
>
> Does that sound reasonable to everybody?

Sure, thanks.

Jesse

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:09                   ` Linus Torvalds
@ 2004-09-16 20:04                   ` David S. Miller
  2004-09-16 20:13                     ` Jeff Garzik
  2004-09-16 20:20                     ` Jesse Barnes
  2004-09-17  5:17                   ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-16 20:04 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: torvalds, willy, James.Bottomley, geert, linux-arch, viro, akpm,
	alan, davem, jgarzik

On Thu, 16 Sep 2004 11:52:54 -0700
Jesse Barnes <jbarnes@engr.sgi.com> wrote:

> On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> > It's literally only the sn2 _platform_ that does something stupid. In
> > other words, it's not ia64 that is broken (surprise surprise - you never
> > expected me to have said anything good about ia64), it's literally the SN2
> > platform that seems broken.
> 
> No, it's just the only smart platform.  Apparently we're the only ones who 
> realized that most PIO reads do not rely on prior DMA being complete (and 
> that it's expensive to make sure that they do), so we optimized that out of 
> our hardware.  Only interrupts will guarantee DMA completion on SGI 
> platforms.

Jesse, I think you might not need that hack in the I/O accessor
routines to handle this.  pci_dma_unmap(), pci_dma_sync_*() and
friends should be the place to perform this "prior DMA sync" thing.

Drivers aren't allowed to touch active DMA data areas until they
call one of those interfaces (see Documentation/DMA-mapping.txt)
so that should just work.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:50                     ` Linus Torvalds
@ 2004-09-16 20:07                       ` Alan Cox
  2004-09-17  5:44                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 97+ messages in thread
From: Alan Cox @ 2004-09-16 20:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, David S. Miller

Right now we have writeb and __writeb() the later of which isn't store
ordering but requires the driver author makes an effort to step on a
land mine. It does help performance having these in some cases so can't
we propogate the same "__" functions on

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:04                   ` David S. Miller
@ 2004-09-16 20:13                     ` Jeff Garzik
  2004-09-16 20:45                       ` David S. Miller
  2004-09-16 20:20                     ` Jesse Barnes
  1 sibling, 1 reply; 97+ messages in thread
From: Jeff Garzik @ 2004-09-16 20:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: Jesse Barnes, torvalds, willy, James.Bottomley, geert, linux-arch,
	viro, akpm, alan, davem

On Thu, Sep 16, 2004 at 01:04:50PM -0700, David S. Miller wrote:
> Drivers aren't allowed to touch active DMA data areas until they
> call one of those interfaces (see Documentation/DMA-mapping.txt)

Does that include DMA memory allocated with pci_alloc_consistent?

	Jeff

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:04                   ` David S. Miller
  2004-09-16 20:13                     ` Jeff Garzik
@ 2004-09-16 20:20                     ` Jesse Barnes
  1 sibling, 0 replies; 97+ messages in thread
From: Jesse Barnes @ 2004-09-16 20:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: torvalds, willy, James.Bottomley, geert, linux-arch, viro, akpm,
	alan, davem, jgarzik

On Thursday, September 16, 2004 1:04 pm, David S. Miller wrote:
> On Thu, 16 Sep 2004 11:52:54 -0700
>
> Jesse Barnes <jbarnes@engr.sgi.com> wrote:
> > On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> > > It's literally only the sn2 _platform_ that does something stupid. In
> > > other words, it's not ia64 that is broken (surprise surprise - you
> > > never expected me to have said anything good about ia64), it's
> > > literally the SN2 platform that seems broken.
> >
> > No, it's just the only smart platform.  Apparently we're the only ones
> > who realized that most PIO reads do not rely on prior DMA being complete
> > (and that it's expensive to make sure that they do), so we optimized that
> > out of our hardware.  Only interrupts will guarantee DMA completion on
> > SGI platforms.
>
> Jesse, I think you might not need that hack in the I/O accessor
> routines to handle this.  pci_dma_unmap(), pci_dma_sync_*() and
> friends should be the place to perform this "prior DMA sync" thing.

That would be ok, except for the fact that we wouldn't be able to provide 
'consistent' memory.  I think it's better to implement the expected PCI 
semantics but provide an out via the _relaxed interface (new interfaces 
aside).

Jesse

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:02                     ` Jesse Barnes
@ 2004-09-16 20:37                       ` James Bottomley
  2004-09-16 20:42                         ` Jesse Barnes
  0 siblings, 1 reply; 97+ messages in thread
From: James Bottomley @ 2004-09-16 20:37 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Matthew Wilcox, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik, Grant Grundler

On Thu, 2004-09-16 at 16:02, Jesse Barnes wrote:
> Note that PCI-X and PCI Express allow a lack of synchronization via the 
> relaxed ordering bit.  So platforms can either set that bit in their iomap 
> functions and deal with synchronization via dma_sync or not use it I guess.

Now you've confused me again.  I thought the SGI implementation of this
relaxed read thing was *not* the same as the PCI-X RO bit which was why
you needed a special readX_relaxed?

My understanding is that Relaxed Ordering is that it's two bit (i.e. two
separately allowable optimisations) that allow either PIO writes to pass
DMA reads or PIO reads to pass DMA writes (or both) in the stream (I may
have this wrong, I copied Grant because he's been working on this).

I thought your readX_relaxed was only the first half of this?

James

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:37                       ` James Bottomley
@ 2004-09-16 20:42                         ` Jesse Barnes
  2004-09-16 21:37                           ` Grant Grundler
  0 siblings, 1 reply; 97+ messages in thread
From: Jesse Barnes @ 2004-09-16 20:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Linus Torvalds, Matthew Wilcox, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik, Grant Grundler

On Thursday, September 16, 2004 1:37 pm, James Bottomley wrote:
> On Thu, 2004-09-16 at 16:02, Jesse Barnes wrote:
> > Note that PCI-X and PCI Express allow a lack of synchronization via the
> > relaxed ordering bit.  So platforms can either set that bit in their
> > iomap functions and deal with synchronization via dma_sync or not use it
> > I guess.
>
> Now you've confused me again.  I thought the SGI implementation of this
> relaxed read thing was *not* the same as the PCI-X RO bit which was why
> you needed a special readX_relaxed?

Depends on how you read the specs.  I think they're compatible, but others 
think that PCI-X RO will require a different implementation.  We'll know when 
we see hardware that supports it.

> My understanding is that Relaxed Ordering is that it's two bit (i.e. two
> separately allowable optimisations) that allow either PIO writes to pass
> DMA reads or PIO reads to pass DMA writes (or both) in the stream (I may
> have this wrong, I copied Grant because he's been working on this).
>
> I thought your readX_relaxed was only the first half of this?

Only the second half, actually.  It may also apply to intra-DMA block 
ordering.

Jesse

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:13                     ` Jeff Garzik
@ 2004-09-16 20:45                       ` David S. Miller
  0 siblings, 0 replies; 97+ messages in thread
From: David S. Miller @ 2004-09-16 20:45 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: jbarnes, torvalds, willy, James.Bottomley, geert, linux-arch,
	viro, akpm, alan, davem

On Thu, 16 Sep 2004 16:13:08 -0400
Jeff Garzik <jgarzik@pobox.com> wrote:

> On Thu, Sep 16, 2004 at 01:04:50PM -0700, David S. Miller wrote:
> > Drivers aren't allowed to touch active DMA data areas until they
> > call one of those interfaces (see Documentation/DMA-mapping.txt)
> 
> Does that include DMA memory allocated with pci_alloc_consistent?

Good point.  No, it does not include consistent DMA memory.
I guess they do need it after all.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 20:42                         ` Jesse Barnes
@ 2004-09-16 21:37                           ` Grant Grundler
  0 siblings, 0 replies; 97+ messages in thread
From: Grant Grundler @ 2004-09-16 21:37 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: James Bottomley, Linus Torvalds, Matthew Wilcox,
	Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik, Grant Grundler

On Thu, Sep 16, 2004 at 01:42:48PM -0700, Jesse Barnes wrote:
> Depends on how you read the specs.

I'd say it depends on if the PCI-X device implements RO hint
correctly and if the platform chipset doesn't ignore RO bit
in the PCI-X commands that go across the wire.

> I think they're compatible, but others 
> think that PCI-X RO will require a different implementation.

No. PCI-X RO is an attribute the OS can set on the PCI-X device
before putting it to work. The device then decides when to use RO
hint and hopefully only uses RO hint when it's really ok (eg. most bulk
data transfer and not for all control data).

readX_relaxed() is ha^H^Hoptimization so SGI platform doesn't have
to pay a huge penalty to enforce PCI ordering rules on every MMIO read.
IIRC, the SGI Altix chipset will allow MMIO Read returns to bypass DMA writes.

Original thread discusses this in more detail:
	http://www.ussg.iu.edu/hypermail/linux/kernel/0401.0/1678.html

> > My understanding is that Relaxed Ordering is that it's two bit (i.e. two
> > separately allowable optimisations)

PCI-X "Enable RO" is only one bit in the PCI-X Command Register.

> > that allow either PIO writes to pass DMA reads

Most parisc platforms allow this to happen. I can't see any harm in it
though it's possible some PCI device will break because of it.
Since the PCI-X device isn't sourcing transactions in this direction,
I don't see how PCI-X RO would be involved.

> >  or PIO reads to pass DMA writes (or both) in the stream (I may
> > have this wrong, I copied Grant because he's been working on this).
> >
> > I thought your readX_relaxed was only the first half of this?
> 
> Only the second half, actually.
> It may also apply to intra-DMA block ordering.

readX_relaxed() has nothing to do with DMA ordering AFAICT.
PCI-X RO hint would if I understand you correctly.

thanks,
grant

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more anal about iospace accesses..
  2004-09-16 12:51           ` Geert Uytterhoeven
@ 2004-09-16 22:10             ` Deepak Saxena
  0 siblings, 0 replies; 97+ messages in thread
From: Deepak Saxena @ 2004-09-16 22:10 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: Linus Torvalds, Linux Kernel Development

On Sep 16 2004, at 14:51, Geert Uytterhoeven was caught saying:
> While 16550 serial is a bad example since it does byte accesses only (and thus
> doesn't suffer from endianness), I have no problem to imagine there exist
> platforms where you have multiple instances of a `standard' (cfr. 16550 serial)

No longer true. We now have UPIO_IOMEM32 for some of the fscked up HW 
that large silicon manufacturer has released with 32-bit wide registers
that must be accessed as full 32-bits.

> device block, while each of them must be accessed differently:
>   - one of them is in PCI I/O space (little endian)
>   - one of them is in PCI MMIO space (little endian)
>   - one of them is on-chip MMIO (big endian)
>   - one of them is somewhere else, but sparsely addressed (some bytes of
>     padding between each register)
> and we can for sure come up with a few more examples of weird addressing.
> 
> How to solve this, without cluttering each ioread*() with many if clauses?

If clauses will still be needed, the only difference is that instead
of basing them on hardcoded addresses we now base them on
the devices coming in. 

~Deepak

-- 
Deepak Saxena - dsaxena at plexity dot net - http://www.plexity.net/

"Unlike me, many of you have accepted the situation of your imprisonment
and will die here like rotten cabbages." - Number 6

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16  4:28           ` Linus Torvalds
  2004-09-16  4:57             ` Benjamin Herrenschmidt
  2004-09-16 13:41             ` Matthew Wilcox
@ 2004-09-16 22:30             ` Matthew Wilcox
  2004-09-16 22:42               ` Linus Torvalds
  2 siblings, 1 reply; 97+ messages in thread
From: Matthew Wilcox @ 2004-09-16 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Wed, Sep 15, 2004 at 09:28:01PM -0700, Linus Torvalds wrote:
> Indeed. Once created, the cookie is "bus-agnostic", which is very much the 
> point. Unlike the original pci DMA interfaces, there are no real bus 
> assumtions in the interfaces once set up. 

In designing the new iomap interface for PA, I noticed a problem.

2004/09/14 torvalds   |  * They do _not_ update the port address. If you
2004/09/14 torvalds   |  * want MMIO that copies stuff laid out in MMIO
2004/09/14 torvalds   |  * memory across multiple ports, use "memcpy_toio()"
2004/09/14 torvalds   |  * and friends.

That implies you can pass an iomap cookie to memcpy_toio(), which makes
designing the iomap interface significantly harder.  Could I ask for
iocopy_to(), iocopy_from() and ioset() interfaces too?

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:30             ` Matthew Wilcox
@ 2004-09-16 22:42               ` Linus Torvalds
  2004-09-16 22:46                 ` Jeff Garzik
  2004-09-17 12:44                 ` Matthew Wilcox
  0 siblings, 2 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 22:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Bottomley, Geert Uytterhoeven, Linux Arch list, Al Viro,
	Andrew Morton, Alan Cox, David S. Miller, Jeff Garzik



On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> 
> 2004/09/14 torvalds   |  * They do _not_ update the port address. If you
> 2004/09/14 torvalds   |  * want MMIO that copies stuff laid out in MMIO
> 2004/09/14 torvalds   |  * memory across multiple ports, use "memcpy_toio()"
> 2004/09/14 torvalds   |  * and friends.
> 
> That implies you can pass an iomap cookie to memcpy_toio()

Actually, no it does not. It very much says "if you want __MMIO__ that 
copies stuff".

You can pass cookies the other way:  if you create a cookie with the old
"ioremap()" interface (which means MMIO only, of course), I intended for
that to work well with the new interfaces (ioread/iowrite). But a
new-style IO-map cookie will _not_ work with the old interfaces. 

In other words, we don't magically make the old interfaces support PIO 
accesses. That includes "memcpy_toio()" and friends. They support MMIO 
only.

I don't see anybody ever wanting to do a "memcpy_to/from_pio()" anyway. 
The op just doesn't make any real sense.

And that is actually enforced by the x86 implementation: creating a
new-style PIO cookie and trying to use them with the old-style interfaces
will create an instant oops. So there's no real worry that some
enterprising driver writer would try to do this. He'd notice that it 
doesn't work very quickly indeed ;)

			Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:42               ` Linus Torvalds
@ 2004-09-16 22:46                 ` Jeff Garzik
  2004-09-16 23:15                   ` Linus Torvalds
  2004-09-17 12:44                 ` Matthew Wilcox
  1 sibling, 1 reply; 97+ messages in thread
From: Jeff Garzik @ 2004-09-16 22:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller

Linus Torvalds wrote:
> I don't see anybody ever wanting to do a "memcpy_to/from_pio()" anyway. 
> The op just doesn't make any real sense.

Actually, I do this in 8139too, to take a snapshot of the NIC's 
registers for the ethtool "get registers" command.  ;-)

If 8139too is compiled for MMIO, I use memcpy_fromio.  If 8139too is 
compiled for PIO...

	Jeff

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:46                 ` Jeff Garzik
@ 2004-09-16 23:15                   ` Linus Torvalds
  2004-09-16 23:30                     ` Jeff Garzik
  0 siblings, 1 reply; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 23:15 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller



On Thu, 16 Sep 2004, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > I don't see anybody ever wanting to do a "memcpy_to/from_pio()" anyway. 
> > The op just doesn't make any real sense.
> 
> Actually, I do this in 8139too, to take a snapshot of the NIC's 
> registers for the ethtool "get registers" command.  ;-)

It doesn't have good semantics, though. For example, would your code work
if the copy was byte-by-byte? Word-for-word? Long-for-long? It might work
on some hardware, but it's not an operation that makes sense "in general".

And yes, there is likely hardware out there where "memcpy_fromio()" 
doesn't work even on MMIO (it might put restrictions on how the memory is 
accessed), but at least it still makes sense _conceptually_.

(I actually suspect your memcpy_fromio() doesn't even work on MMIO. For
example, the ppc64 iSeries memcpy_fromio() really _does_ do byte-at-a-time
copies. Slow and broken? Maybe. I don't have an iSeries. But arguably 
perfectly ok. Now, do you think your hardware does the right thing there? 
Or alternatively, do you think it's ok if a 64-bit architecture does 
quad-word loads?)

In other words, memcpy_fromio() really only makes sense on truly "flat" 
MMIO. RAM/ROM-like things, in other words. It can be dangerous to use on 
things like noncached control register areas..

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 23:15                   ` Linus Torvalds
@ 2004-09-16 23:30                     ` Jeff Garzik
  2004-09-16 23:43                       ` Linus Torvalds
  0 siblings, 1 reply; 97+ messages in thread
From: Jeff Garzik @ 2004-09-16 23:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller

Linus Torvalds wrote:
  It doesn't have good semantics, though. For example, would your code work
> if the copy was byte-by-byte? Word-for-word? Long-for-long? It might work
> on some hardware, but it's not an operation that makes sense "in general".


That's actually a good point.

Looking at the code,
/* TODO: we are too slack to do reg dumping for pio, for now */
#ifdef CONFIG_8139TOO_PIO
#define rtl8139_get_regs_len    NULL
#define rtl8139_get_regs        NULL


I seem to recall that I didn't implement it precisely because some 
registers wanted to read only at certain bit widths.

	Jeff

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 23:30                     ` Jeff Garzik
@ 2004-09-16 23:43                       ` Linus Torvalds
  0 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-16 23:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller



On Thu, 16 Sep 2004, Jeff Garzik wrote:

> Linus Torvalds wrote:
> > [ rant deleted ]
> 
> That's actually a good point.

I hate it when people do that.

"That's actually a good point".

I can really imagine surprised expression - "Linus actually made a good
point. What is the world coming to?"

Grumble grumble.

Dammit. ALL my points are good. 

Except for the ones where I obviously toked up a bit too much before I
wrote.

Or the ones I make too early in the morning, before my coffee.

Oh, or the ones after a few too many beers.

Ok, ok. Never mind. I'll go sulk in a corner now.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 18:52                 ` Jesse Barnes
  2004-09-16 19:09                   ` Linus Torvalds
  2004-09-16 20:04                   ` David S. Miller
@ 2004-09-17  5:17                   ` Benjamin Herrenschmidt
  2004-09-17 15:30                     ` Jesse Barnes
  2 siblings, 1 reply; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-17  5:17 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Matthew Wilcox, James Bottomley,
	Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Fri, 2004-09-17 at 04:52, Jesse Barnes wrote:
> On Thursday, September 16, 2004 11:21 am, Linus Torvalds wrote:
> > It's literally only the sn2 _platform_ that does something stupid. In
> > other words, it's not ia64 that is broken (surprise surprise - you never
> > expected me to have said anything good about ia64), it's literally the SN2
> > platform that seems broken.
> 
> No, it's just the only smart platform.  Apparently we're the only ones who 
> realized that most PIO reads do not rely on prior DMA being complete (and 
> that it's expensive to make sure that they do), so we optimized that out of 
> our hardware.  Only interrupts will guarantee DMA completion on SGI 
> platforms.

Hrm... that's weird... on sane DMA hardward, most PIO/MMIO reads are
actually here just to guarantee that completion, interrupts isn't enough
especially if we do things like polling (NAPI) etc... besides,
interrupts are usually totally asynchronous to anything else, so I
wouldn't say it's sane.

> Sure, that's the way I'd like to go, but that's a dangerous route.  Devices 
> that want to do a PIO read to see if their DMA is complete will work most of 
> the time, but you'll get occasional data corruption.  IOW, specifying reads 
> that you know *aren't* dependent is much easier to do, and safer.  It also 
> allows SGI hardware to run all current drivers w/o modification, which is 
> nice.

Agreed. Reads must NOT be relaxed by default

> If you want to explicitly document that the new io* interface doesn't 
> guarantee DMA coherence, great, we can start from scratch with a good 
> implementation that has an explicit flush mechanism.
> 
> Thanks,
> Jesse
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:01                 ` Linus Torvalds
  2004-09-16 19:13                   ` Jeff Garzik
@ 2004-09-17  5:20                   ` Benjamin Herrenschmidt
  2004-09-17 15:03                     ` Linus Torvalds
  1 sibling, 1 reply; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-17  5:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik


> Actually, we could also take the "safe" approach, and just specify that 
> the new "ioread()/iowrite()" interfaces are always very relaxed, and apart 
> from endianness won't do much extra synchronization.

That would be catastrophic with 99% of the drivers on PPC. The processor
may speculate reads from outside spinlocks and that sort of horror, or
defer them to until much later, crossing writes and such things...

And we don't always have nice way of providing barriers. For example,
for enforcing a read to happen "now", the only way is to actually create
a pseudo dependency on the data returned by the read, so an
function/macro like "read_barrier()" can't be implemented unless it
takes the result of the read as an argument.

> What are sane rules? Clearly it is _not_ a sane rule to allow memory 
> re-ordering to the same device, because most driver authors have problems 
> as-is, and trying to make them aware of speculative reads and writes 
> passing writes is just not going to fly. So a minimal sane rule would be 
> that each device seens programmatic IO accesses in "thread order".
> 
> But I think it would be ok to say that DMA needs explicit synchronization,
> and obviously write posting (as long as it doesn't re-order IO accesses)  
> is fine. The latter is already true for MMIO (and arguably acceptable even
> for PIO, although x86's have historically had slightly stricter ordering
> on PIO).

Write posting is always assumed since it's already fine per PCI spec. I
don't agree with read vs. DMA.

> So how about trying to come up with a sane interface for "synchronize
> DMA". I _suspect_ it will have to be device-specific, so we should aim for
> a similar interface to the "dma-mapping" ones (ie based on "struct
> device", not the bus it's on).
> 
> In most cases it would be a no-op, or we could hide some pointer a la
> "dev->bus->sync(dev)" (which could do a status read from the bridge or
> whatever..)

Ok, well, synchronize DMA would end up having to do a read on most
setups, so that mean knowing of a register on the device with no side
effects etc... difficult to acheive outside of the driver...

Ben.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 19:50                     ` Linus Torvalds
  2004-09-16 20:07                       ` Alan Cox
@ 2004-09-17  5:44                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 97+ messages in thread
From: Benjamin Herrenschmidt @ 2004-09-17  5:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Garzik, Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller


> So the only remaining source of issues would be synchronization between
> devices (which we haven't really supported anyway, and which nobody really
> tends to care about), and synchronization with "side-band signals". And
> there are really only two side-band signals that I can think of: DMA and
> IRQ's.

DMA isn't really "side band", and reads only synchronize with DMA on the
same bus path anyways.

Interrupts are never properly synchronized (and I've seen many bugs
because of that, we can _NEVER_ assume we are properly synchronized with
an interrupt, it can be buffered a totally random amount of time in the
path to the CPU and even inside of the CPU before reaching the core,
whatever the driver did to mask it on the device, drivers writers must
be aware of that, they aren't most of the time, calling sychronize_irq()
after masking on-device will _NOT_ guarantee you won't get a stale one).

> I would argue that we likely don't normally care about strict
> synchronization with those side-band signals, because most of the
> serialization that a driver would rely on would be strictly causal (and
> thus happen regardless of what interface we make up:  I seriously doubt we
> could come up with a non-causal interface, but some physicists might be
> very interested indeed ;).

Nope,  we do care a hell lot about synchronization with DMA, and it's
been a source of countless horrible to track down bugs in the past when
debugging high IO stress benchmarks on big POWER machines.

> So for example, we can pretty much depend on a "command completed"
> interrupt being asserted only after we've written the command to the chip. 
> No interface issues there.

Yes. That one is fine, except if we get a stale irq from a previous
command where we didn't care about the interrupt, or that sort of thing
of course.

> Similarly, if we read a status register in an 
> interrupt handler, we _will_ get the status of the interrupt, unless the 
> chip itself does some buffering at which point it is a driver issue to 
> handle that, rather than an interface issue.

Right. What we cannot rely on with interrupt is not taking them after
masking them. We _can_ rely on disable_irq() because it will mask at the
controller level and the arch code will take care of not delivering if
stale, but masking on-chip is never guaranteed to have a synchronizable
effect. At least, for level interrupts, we know that if we take it
anyway, we can just return and drop it until it goes away

> So the remaining issues are really things we already see with DMA, for
> example: if we read a mailbox from memory saying that DMA has completed,
> we need to have a read-memory barrier before we actually read the contents
> of the IO, otherwise the CPU might read buffer contents "prior" to the
> completion.

We need to have those 2 reads done in order yes, an lwsync on ppc would
be ok which is what we do on rmb(), but that is irrelevant to the IO
accessors, this is memory vs. memory

> The same would go for doing a "ioread()" that reads a status register in 
> MMIO that says "DMA is done". Before we can _trust_ that, we'd need to 
> synchronize with the DMA "out-of-band" stuff. I think that's an acceptable 
> interface, since it's something we already support for memory-based 
> interfaces. 

This is the typical case affecting most drivers. Read status from chip,
then read memory. We need 2 things here:

  - We need DMA synchronization, that is we need to enforce that the
MMIO read flushes the DMA. This is where the PCI spec is clear and where
we are getting this new 'relaxed' stuff potentially coming in the
picture, but I'm pretty much against making the relaxed case default

  - We also need to make sure the memory read isn't speculated and done
by the CPU prior to the IO read, which is currently acheive on PPC via
some deep magic inside of the IO reads (creating an artificial data
dependency on the result of all IO reads with a conditional trap that is
never taken followed an isync).

> It would be different, of course - it wouldn't be a read barrier, it would
> be a "ioread" barrier. But the _concept_ is the same. And it doesn't tend
> to hit us all that often, and the biggest pain is literally that in-spec
> PCI devices will never show this, so the real problem for hardware like
> SGI's in this case is lack of test coverage out in the rest of the world. 
> That's the price you pay if you do things differently.
> 
> Implementation shouldn't be that hard. On any conforming PCI platform, 
> it's a no-op, since reads from MMIO space are supposed to already 
> synchronize with any buffered DMA (it still leaves the question of CPU 
> memory ordering wrt uncached accesses, and thus we might want to have a 
> memory barrier in place depending on architecture). And SGI would have 
> something that synchronizes with the bridge nearest to the device.
> 
> So I _think_ the SGI case would be perfectly happy with an interface like 
> 
> 	void dma_sync(struct device *dev);

Hrm... I don't like that... Also, don't they need to do a read (just a
different form of read) to synchronize ? If they do, they actually need
a full IO accessor ... What does the PCI-X* (X, express, etc...) about
relaxed ordering IOs ?

> which on their SN2 machines would do "dev->bus->sync(dev)" (where they 
> would add something that reads from the bridge), and the rest of the world 
> might just do a memory barrier there, or leave it as a no-op.
> 
> And yes, a few drivers would have "dma_sync()" in a few places. Not that 
> many, likely.
> 
> Would driver writers get this wrong, and forget? Absolutely. But that's 
> true of any interface we could come up with. If there is something we can 
> absolutely depend on in life, it's that driver writers _will_ do something 
> wrong. 
> 
> 		Linus
-- 
Benjamin Herrenschmidt <benh@kernel.crashing.org>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-16 22:42               ` Linus Torvalds
  2004-09-16 22:46                 ` Jeff Garzik
@ 2004-09-17 12:44                 ` Matthew Wilcox
  1 sibling, 0 replies; 97+ messages in thread
From: Matthew Wilcox @ 2004-09-17 12:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik

On Thu, Sep 16, 2004 at 03:42:34PM -0700, Linus Torvalds wrote:
> On Thu, 16 Sep 2004, Matthew Wilcox wrote:
> > 2004/09/14 torvalds   |  * They do _not_ update the port address. If you
> > 2004/09/14 torvalds   |  * want MMIO that copies stuff laid out in MMIO
> > 2004/09/14 torvalds   |  * memory across multiple ports, use "memcpy_toio()"
> > 2004/09/14 torvalds   |  * and friends.
> > 
> > That implies you can pass an iomap cookie to memcpy_toio()
> 
> Actually, no it does not. It very much says "if you want __MMIO__ that 
> copies stuff".
> 
> You can pass cookies the other way:  if you create a cookie with the old
> "ioremap()" interface (which means MMIO only, of course), I intended for
> that to work well with the new interfaces (ioread/iowrite). But a
> new-style IO-map cookie will _not_ work with the old interfaces. 

OK, you're thinking in x86-specific terms and I'm thinking in PA-specific
terms ;-)  I don't particularly want to be able to memcpy_fooio() on
ioport space.  But I do want to be able to memcpy_fooio() on big-endian
mmio space and I do want to design the iomap cookies such that you can't
pass the old ioremap cookies to the new interface.

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-17  5:20                   ` Benjamin Herrenschmidt
@ 2004-09-17 15:03                     ` Linus Torvalds
  0 siblings, 0 replies; 97+ messages in thread
From: Linus Torvalds @ 2004-09-17 15:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Matthew Wilcox, James Bottomley, Geert Uytterhoeven,
	Linux Arch list, Al Viro, Andrew Morton, Alan Cox,
	David S. Miller, Jeff Garzik



On Fri, 17 Sep 2004, Benjamin Herrenschmidt wrote:
> 
> > Actually, we could also take the "safe" approach, and just specify that 
> > the new "ioread()/iowrite()" interfaces are always very relaxed, and apart 
> > from endianness won't do much extra synchronization.
> 
> That would be catastrophic with 99% of the drivers on PPC. The processor
> may speculate reads from outside spinlocks and that sort of horror, or
> defer them to until much later, crossing writes and such things...

I didn't say "unordered". I said "relaxed", which in this case meant 
"unordered wrt DMA.

You still need to enforce memory ordering. Quite frankly, I do not believe 
a non-ordered IO access is EVER valid.

		Linus

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: RFC: being more anal about iospace accesses..
  2004-09-17  5:17                   ` Benjamin Herrenschmidt
@ 2004-09-17 15:30                     ` Jesse Barnes
  0 siblings, 0 replies; 97+ messages in thread
From: Jesse Barnes @ 2004-09-17 15:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Matthew Wilcox, James Bottomley,
	Geert Uytterhoeven, Linux Arch list, Al Viro, Andrew Morton,
	Alan Cox, David S. Miller, Jeff Garzik

On Thursday, September 16, 2004 10:17 pm, Benjamin Herrenschmidt wrote:
> Hrm... that's weird... on sane DMA hardward, most PIO/MMIO reads are
> actually here just to guarantee that completion, interrupts isn't enough
> especially if we do things like polling (NAPI) etc... besides,
> interrupts are usually totally asynchronous to anything else, so I
> wouldn't say it's sane.

Well, it can definitely be problematic.  Devices that want to complete more 
than one transaction per interrupt have always been troublesome on sn 
platforms.

>
> > Sure, that's the way I'd like to go, but that's a dangerous route. 
> > Devices that want to do a PIO read to see if their DMA is complete will
> > work most of the time, but you'll get occasional data corruption.  IOW,
> > specifying reads that you know *aren't* dependent is much easier to do,
> > and safer.  It also allows SGI hardware to run all current drivers w/o
> > modification, which is nice.
>
> Agreed. Reads must NOT be relaxed by default

At least not for the current interface, but maybe for the new one.  The 
proposed dma_sync interface could take both an iomem address *and* a device 
if that would make it easier (it would for us).

Jesse

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: Being more careful about iospace accesses..
  2004-09-16 14:01           ` Linus Torvalds
@ 2004-09-18  9:46             ` Kai Henningsen
  0 siblings, 0 replies; 97+ messages in thread
From: Kai Henningsen @ 2004-09-18  9:46 UTC (permalink / raw)
  To: linux-kernel

torvalds@osdl.org (Linus Torvalds)  wrote on 16.09.04 in <Pine.LNX.4.58.0409160652460.2333@ppc970.osdl.org>:

> On Thu, 16 Sep 2004, David Woodhouse wrote:
>
> > On Wed, 2004-09-15 at 16:26 -0700, Linus Torvalds wrote:
> > >  - if you want to go outside that bitwise type, you have to convert it
> > >    properly first. For example, if you want to add a constant to a
> > >    __le16 type, you can do so, but you have to use the proper sequence:
> > >
> > > 	__le16 sum, a, b;
> > >
> > > 	sum = a + b;	/* INVALID! "warning: incompatible types for operation
> > > (+)" */ 	sum = cpu_to_le16(le16_to_cpu(a) + le16_to_cpu(b));	/* Ok */
> > >
> > > See?
> >
> > Yeah right, that latter case is _so_ much more readable
>
> It's not about readability.
>
> It's about the first case being WRONG!

... in general.

And on the machines where it works (le machines), I'd certainly expect  
those conversion functions to be trivially eliminated by the compiler (ie,  
they're either trivial macros or trivial inline functions).

> > I'd really quite like to see the real compiler know about endianness,
> > too.

Well, these day, optimizers often can recognize the usual endianness  
conversion idioms, so the compiler still gets a chance at inserting your  
load-with-swap or whatever.

MfG Kai

^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2004-09-18 13:06 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-09-08 22:57 RFC: being more anal about iospace accesses Linus Torvalds
2004-09-08 23:07 ` David S. Miller
2004-09-08 23:25   ` Linus Torvalds
2004-09-09  1:19   ` Linus Torvalds
2004-09-09  4:36     ` David S. Miller
2004-09-09  5:56       ` Richard Henderson
2004-09-09  5:04     ` viro
2004-09-09  5:05       ` David S. Miller
2004-09-09  5:13       ` Linus Torvalds
2004-09-09  6:08         ` viro
2004-09-09  8:27   ` Geert Uytterhoeven
2004-09-09  6:23 ` David Woodhouse
2004-09-09 13:14   ` Alan Cox
2004-09-11  6:09 ` Linus Torvalds
2004-09-11  6:42   ` Anton Blanchard
2004-09-11  7:26     ` Benjamin Herrenschmidt
2004-09-11  7:29     ` Geert Uytterhoeven
2004-09-11  7:23   ` Benjamin Herrenschmidt
2004-09-11 14:42   ` Alan Cox
2004-09-15 15:03 ` Linus Torvalds
2004-09-15 16:30   ` Being " Linus Torvalds
2004-09-15 16:54     ` Jörn Engel
2004-09-15 17:07       ` Linus Torvalds
2004-09-15 17:32         ` Jörn Engel
2004-09-15 17:57           ` Linus Torvalds
2004-09-15 18:06             ` Linus Torvalds
2004-09-15 19:34             ` Greg KH
2004-09-15 19:53               ` Linus Torvalds
2004-09-15 20:06                 ` Greg KH
2004-09-16 14:58             ` Horst von Brand
2004-09-15 17:45         ` Nikita Danilov
2004-09-15 18:09           ` Linus Torvalds
2004-09-15 18:40         ` Chris Wedgwood
2004-09-15 17:07       ` Jeff Garzik
2004-09-15 17:16       ` Roland Dreier
2004-09-15 17:39         ` Linus Torvalds
2004-09-15 20:07           ` Russell King
2004-09-15 17:36       ` Horst von Brand
2004-09-15 17:40       ` Brian Gerst
2004-09-15 16:56     ` Dave Jones
2004-09-15 17:19     ` Roger Luethi
2004-09-15 17:23     ` Richard B. Johnson
2004-09-15 22:21     ` Deepak Saxena
2004-09-15 22:46       ` Linus Torvalds
2004-09-15 23:09         ` Deepak Saxena
2004-09-16 12:51           ` Geert Uytterhoeven
2004-09-16 22:10             ` Deepak Saxena
2004-09-15 22:29     ` Roland Dreier
2004-09-15 23:26       ` Being more careful " Linus Torvalds
2004-09-16  0:10         ` viro
2004-09-16 11:40           ` David Woodhouse
2004-09-16 12:25         ` David Woodhouse
2004-09-16 14:01           ` Linus Torvalds
2004-09-18  9:46             ` Kai Henningsen
2004-09-15 19:02   ` RFC: being more anal " Geert Uytterhoeven
2004-09-15 19:16     ` Linus Torvalds
2004-09-15 19:40       ` Matthew Wilcox
2004-09-15 20:10         ` Linus Torvalds
2004-09-15 20:17           ` Linus Torvalds
2004-09-16 12:17           ` David Woodhouse
2004-09-16 13:52             ` Linus Torvalds
2004-09-15 20:20       ` Russell King
2004-09-15 20:34         ` Linus Torvalds
2004-09-15 20:51           ` Linus Torvalds
2004-09-15 22:38       ` James Bottomley
2004-09-16  2:33         ` Matthew Wilcox
2004-09-16  4:28           ` Linus Torvalds
2004-09-16  4:57             ` Benjamin Herrenschmidt
2004-09-16  4:58               ` Benjamin Herrenschmidt
2004-09-16 13:41             ` Matthew Wilcox
2004-09-16 18:21               ` Linus Torvalds
2004-09-16 18:52                 ` Jesse Barnes
2004-09-16 19:09                   ` Linus Torvalds
2004-09-16 20:02                     ` Jesse Barnes
2004-09-16 20:37                       ` James Bottomley
2004-09-16 20:42                         ` Jesse Barnes
2004-09-16 21:37                           ` Grant Grundler
2004-09-16 20:04                   ` David S. Miller
2004-09-16 20:13                     ` Jeff Garzik
2004-09-16 20:45                       ` David S. Miller
2004-09-16 20:20                     ` Jesse Barnes
2004-09-17  5:17                   ` Benjamin Herrenschmidt
2004-09-17 15:30                     ` Jesse Barnes
2004-09-16 19:01                 ` Linus Torvalds
2004-09-16 19:13                   ` Jeff Garzik
2004-09-16 19:50                     ` Linus Torvalds
2004-09-16 20:07                       ` Alan Cox
2004-09-17  5:44                       ` Benjamin Herrenschmidt
2004-09-17  5:20                   ` Benjamin Herrenschmidt
2004-09-17 15:03                     ` Linus Torvalds
2004-09-16 22:30             ` Matthew Wilcox
2004-09-16 22:42               ` Linus Torvalds
2004-09-16 22:46                 ` Jeff Garzik
2004-09-16 23:15                   ` Linus Torvalds
2004-09-16 23:30                     ` Jeff Garzik
2004-09-16 23:43                       ` Linus Torvalds
2004-09-17 12:44                 ` Matthew Wilcox

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.