* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
@ 2004-12-03 16:43 ` Jesse Barnes
2004-12-06 12:42 ` Hidetoshi Seto
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-03 16:43 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1559 bytes --]
On Friday, December 3, 2004 8:31 am, Jesse Barnes wrote:
> [This is a repost of an earlier patch now that I've had time to finish it
> and test it more thoroughly.]
>
> This patch adds support for sending a SIGBUS to a userspace application
> using /proc/bus/pci to drive a device if an I/O error occurs. We're using
> this in house for the X server's BIOS emulator and it seems to be working
> well.
>
> The idea is to track mmapped /proc/bus/pci regions so that the machine check
> handler is able to properly determine which process is responsible for any
> faults that occur (ia64 is interesting in that the error may not occur in
> the process context that actually generated the bad reference). If a match
> is found, a SIGBUS is sent to the process, along with the address that
> caused the fault. The machine check record is then cleared and recovery
> takes place (the assumption is that the signal to userspace is a sufficient
> record of the error).
>
> The patch also special-cases memory mapping of legacy space, which is the
> first 64k of I/O space and the first megabyte of memory space. Sub-platforms
> can optionally remap their bridge to the target bus and set up legacy
> handling in the callout.
>
> Comments? Given that this is working well for us, I'd like to get it
> upstream sometime soon. I also expect that it could be used to deal with
> more types of I/O errors, perhaps allowing the kernel to call a driver
> shutdown routine if an I/O error occurs in a kernel driver.
Doh! This is the updated one I meant to send.
Jesse
[-- Attachment #2: io-error-sigbus-11.patch --]
[-- Type: text/plain, Size: 13399 bytes --]
===== arch/ia64/kernel/mca.c 1.71 vs edited =====
--- 1.71/arch/ia64/kernel/mca.c 2004-11-11 10:04:30 -08:00
+++ edited/arch/ia64/kernel/mca.c 2004-12-02 14:30:33 -08:00
@@ -814,8 +814,10 @@
ia64_os_to_sal_handoff_state.imots_sal_check_ra =
ia64_sal_to_os_handoff_state.imsto_sal_check_ra;
- if (recover)
+ if (recover) {
ia64_os_to_sal_handoff_state.imots_os_status = IA64_MCA_CORRECTED;
+ ia64_sal_clear_state_info(SAL_INFO_TYPE_MCA);
+ }
else
ia64_os_to_sal_handoff_state.imots_os_status = IA64_MCA_COLD_BOOT;
@@ -871,21 +873,69 @@
void
ia64_mca_ucmc_handler(void)
{
+ struct io_range *range, *tmp;
+ unsigned long io_addr = 0;
pal_processor_state_info_t *psp = (pal_processor_state_info_t *)
&ia64_sal_to_os_handoff_state.proc_state_param;
- int recover;
+ int recover = 0;
+ ia64_err_rec_t *curr_record;
/* Get the MCA error record and log it */
ia64_mca_log_sal_error_record(SAL_INFO_TYPE_MCA);
- /* TLB error is only exist in this SAL error record */
- recover = (psp->tc && !(psp->cc || psp->bc || psp->rc || psp->uc))
- /* other error recovery */
- || (ia64_mca_ucmc_extension
- && ia64_mca_ucmc_extension(
- IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA),
- &ia64_sal_to_os_handoff_state,
- &ia64_os_to_sal_handoff_state));
+ /* TLB errors are fixed up before we get here, so recover */
+ if (psp->tc) {
+ recover = 1;
+ goto return_to_sal;
+ }
+
+ /*
+ * If it's not a bus check with a valid target identifier,
+ * we don't have a chance.
+ */
+ if (!psp->bc) {
+ recover = 0;
+ goto return_to_sal;
+ }
+
+ /*
+ * If we can't get this lock, we can't safely look at the list,
+ * so give up.
+ */
+ if (!spin_trylock(&io_range_list_lock)) {
+ recover = 0;
+ goto return_to_sal;
+ }
+
+ curr_record = IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA);
+ io_addr = curr_record->proc_err.info->target_identifier;
+
+ /*
+ * See if an I/O error occured in a previously registered range
+ */
+ list_for_each_entry_safe(range, tmp, &pci_io_ranges, range_list) {
+ if (range->start <= io_addr && io_addr <= range->end) {
+ struct siginfo siginfo;
+ struct task_struct *owner = NULL;
+ recover = 1;
+ siginfo.si_signo = SIGBUS;
+ siginfo.si_code = BUS_ADRERR;
+ siginfo.si_addr = (void *) io_addr;
+ owner = find_task_by_pid(range->owner);
+ if (owner)
+ force_sig_info(SIGBUS, &siginfo, owner);
+ else {
+ /*
+ * need to free memory too, is that safe
+ * here?
+ */
+ list_del(&range->range_list);
+ }
+ }
+ }
+ spin_unlock(&io_range_list_lock);
+
+ return_to_sal:
/*
* Wakeup all the processors which are spinning in the rendezvous
===== arch/ia64/pci/pci.c 1.59 vs edited =====
--- 1.59/arch/ia64/pci/pci.c 2004-11-05 11:55:25 -08:00
+++ edited/arch/ia64/pci/pci.c 2004-12-02 14:30:02 -08:00
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/smp_lock.h>
#include <linux/spinlock.h>
+#include <linux/slab.h>
#include <asm/machvec.h>
#include <asm/page.h>
@@ -36,7 +37,6 @@
#include <asm/irq.h>
#include <asm/hw_irq.h>
-
#undef DEBUG
#define DEBUG
@@ -48,6 +48,9 @@
static int pci_routeirq;
+LIST_HEAD(pci_io_ranges);
+spinlock_t io_range_list_lock = SPIN_LOCK_UNLOCKED;
+
/*
* Low-level SAL-based PCI configuration access functions. Note that SAL
* calls are already serialized (via sal_lock), so we don't need another
@@ -501,24 +504,35 @@
pci_mmap_page_range (struct pci_dev *dev, struct vm_area_struct *vma,
enum pci_mmap_state mmap_state, int write_combine)
{
- /*
- * I/O space cannot be accessed via normal processor loads and
- * stores on this platform.
- */
- if (mmap_state == pci_mmap_io)
- /*
- * XXX we could relax this for I/O spaces for which ACPI
- * indicates that the space is 1-to-1 mapped. But at the
- * moment, we don't support multiple PCI address spaces and
- * the legacy I/O space is not 1-to-1 mapped, so this is moot.
- */
- return -EINVAL;
+ struct io_range *new_range;
+ int ret = 0;
+ int iospace = (mmap_state == pci_mmap_io) ? 1 : 0;
+
+ /* Remap legacy I/O space for this bus if the offset is < 0xffff */
+ if (mmap_state == pci_mmap_io &&
+ (vma->vm_pgoff << PAGE_SHIFT) < 0xffff) {
+ unsigned long legacy_io;
+ if ((ret = pci_get_legacy_space(iospace, dev, &legacy_io)))
+ goto out;
+
+ vma->vm_pgoff += legacy_io >> PAGE_SHIFT;
+ }
+
+ /* Remap legacy mem space for this bus if the offset is < 1M */
+ if (mmap_state == pci_mmap_mem &&
+ (vma->vm_pgoff << PAGE_SHIFT) < (1024*1024)) {
+ unsigned long legacy_mem;
+ if ((ret = pci_get_legacy_space(iospace, dev, &legacy_mem)))
+ goto out;
+
+ vma->vm_pgoff += legacy_mem >> PAGE_SHIFT;
+ }
/*
* Leave vm_pgoff as-is, the PCI space address is the physical
* address on this platform.
*/
- vma->vm_flags |= (VM_SHM | VM_LOCKED | VM_IO);
+ vma->vm_flags |= (VM_SHM | VM_IO | VM_RESERVED);
if (write_combine)
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
@@ -526,9 +540,78 @@
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
- vma->vm_end - vma->vm_start, vma->vm_page_prot))
- return -EAGAIN;
+ vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ new_range = kmalloc(sizeof(struct io_range), GFP_KERNEL);
+ if (!new_range) {
+ printk(KERN_WARNING "%s: cannot allocate io_range, "
+ "I/O errors for 0x%016lx-0x%016lx will be fatal",
+ __FUNCTION__, vma->vm_start, vma->vm_end);
+ goto out;
+ }
+
+ /*
+ * Track this range and its associated process for use by the
+ * MCA handler.
+ */
+ new_range->start = __pa(vma->vm_pgoff << PAGE_SHIFT);
+ new_range->end = new_range->start + (vma->vm_end - vma->vm_start);
+ new_range->owner = current->pid;
+
+ spin_lock(&io_range_list_lock);
+ list_add(&new_range->range_list, &pci_io_ranges);
+ spin_unlock(&io_range_list_lock);
+
+ printk("I/O range 0x%016lx-0x%016lx registered\n",
+ new_range->start, new_range->end);
+ out:
+ return ret;
+}
+
+/**
+ * pci_unmap_page_range - release any resources associated with a previous mapping
+ * @dev: pci device involved
+ *
+ * On ia64, this routine removes and frees the range in question from the
+ * io_range_list.
+ */
+void
+pci_mmap_release_dev(struct pci_dev *dev)
+{
+ struct io_range *range, *tmp;
+
+ spin_lock(&io_range_list_lock);
+ list_for_each_entry_safe(range, tmp, &pci_io_ranges, range_list) {
+ if (range->owner == current->pid) {
+ list_del(&range->range_list);
+ printk("I/O range 0x%016lx-0x%016lx de-registered\n",
+ range->start, range->end);
+ kfree(range);
+ }
+ }
+ spin_unlock(&io_range_list_lock);
+}
+
+/**
+ * __ia64_pci_get_legacy_space - for machines w/o a machine vector
+ * @iospace: which space, I/O or memory
+ * @dev: pci dev
+ * @base: base address
+ *
+ * For most platforms, the legacy base address is 0, but platforms
+ * can override it by providing their own machine vector for this
+ * routine. Note that platforms may want to provide their own routine
+ * even if the base is 0 in order to remap legacy space to the bus that
+ * @dev sits on.
+ */
+int
+__ia64_pci_get_legacy_space(int iospace, struct pci_dev *dev,
+ unsigned long *base)
+{
+ *base = 0;
return 0;
}
===== arch/ia64/sn/pci/pci_dma.c 1.2 vs edited =====
--- 1.2/arch/ia64/sn/pci/pci_dma.c 2004-10-20 12:00:10 -07:00
+++ edited/arch/ia64/sn/pci/pci_dma.c 2004-12-01 10:44:54 -08:00
@@ -10,6 +10,7 @@
*/
#include <linux/module.h>
+#include <linux/pci.h>
#include <asm/sn/sn_sal.h>
#include "pci/pcibus_provider_defs.h"
#include "pci/pcidev.h"
@@ -475,3 +476,19 @@
EXPORT_SYMBOL(sn_pci_free_consistent);
EXPORT_SYMBOL(sn_pci_dma_supported);
EXPORT_SYMBOL(sn_dma_mapping_error);
+
+int sn_pci_get_legacy_space(int iospace, struct pci_dev *dev,
+ unsigned long *base)
+{
+ if (SN_PCIDEV_BUSSOFT(dev) == NULL)
+ return -ENODEV;
+
+ if (iospace) {
+ /* Put the phys addr in uncached space */
+ *base = SN_PCIDEV_BUSSOFT(dev)->bs_legacy_io | __IA64_UNCACHED_OFFSET;
+ } else {
+ /* Put the phys addr in uncached space */
+ *base = SN_PCIDEV_BUSSOFT(dev)->bs_legacy_mem | __IA64_UNCACHED_OFFSET;
+ }
+ return 0;
+}
===== drivers/pci/proc.c 1.41 vs edited =====
--- 1.41/drivers/pci/proc.c 2004-10-06 09:44:51 -07:00
+++ edited/drivers/pci/proc.c 2004-12-01 10:13:26 -08:00
@@ -279,6 +279,10 @@
static int proc_bus_pci_release(struct inode *inode, struct file *file)
{
+ const struct proc_dir_entry *dp = PDE(inode);
+ struct pci_dev *dev = dp->data;
+
+ pci_mmap_release_dev(dev);
kfree(file->private_data);
file->private_data = NULL;
===== include/asm-ia64/io.h 1.24 vs edited =====
--- 1.24/include/asm-ia64/io.h 2004-10-28 12:10:56 -07:00
+++ edited/include/asm-ia64/io.h 2004-12-01 09:51:26 -08:00
@@ -1,6 +1,8 @@
#ifndef _ASM_IA64_IO_H
#define _ASM_IA64_IO_H
+#include <linux/list.h>
+
/*
* This file contains the definitions for the emulated IO instructions
* inb/inw/inl/outb/outw/outl and the "string versions" of the same
@@ -51,6 +53,17 @@
extern struct io_space io_space[];
extern unsigned int num_io_spaces;
+/*
+ * Simple I/O range object with owner (if there is one)
+ */
+struct io_range {
+ unsigned long start, end;
+ struct list_head range_list;
+ pid_t owner;
+};
+
+extern struct list_head pci_io_ranges;
+
# ifdef __KERNEL__
/*
@@ -66,11 +79,14 @@
#define PIO_RESERVED __IA64_UNCACHED_OFFSET
#define HAVE_ARCH_PIO_SIZE
+#include <linux/spinlock.h>
#include <asm/intrinsics.h>
#include <asm/machvec.h>
#include <asm/page.h>
#include <asm/system.h>
#include <asm-generic/iomap.h>
+
+extern spinlock_t io_range_list_lock;
/*
* Change virtual addresses to physical addresses and vv.
===== include/asm-ia64/machvec.h 1.29 vs edited =====
--- 1.29/include/asm-ia64/machvec.h 2004-10-25 13:06:49 -07:00
+++ edited/include/asm-ia64/machvec.h 2004-12-01 10:46:49 -08:00
@@ -20,6 +20,7 @@
struct irq_desc;
struct page;
struct mm_struct;
+struct pci_dev;
typedef void ia64_mv_setup_t (char **);
typedef void ia64_mv_cpu_init_t (void);
@@ -31,6 +32,7 @@
typedef struct irq_desc *ia64_mv_irq_desc (unsigned int);
typedef u8 ia64_mv_irq_to_vector (unsigned int);
typedef unsigned int ia64_mv_local_vector_to_irq (u8);
+typedef int ia64_mv_pci_get_legacy_space_t (int, struct pci_dev *, unsigned long *);
/* DMA-mapping interface: */
typedef void ia64_mv_dma_init (void);
@@ -140,6 +142,7 @@
# define platform_readw_relaxed ia64_mv.readw_relaxed
# define platform_readl_relaxed ia64_mv.readl_relaxed
# define platform_readq_relaxed ia64_mv.readq_relaxed
+# define platform_pci_get_legacy_space ia64_mv.pci_get_legacy_space
# endif
/* __attribute__((__aligned__(16))) is required to make size of the
@@ -187,6 +190,7 @@
ia64_mv_readw_relaxed_t *readw_relaxed;
ia64_mv_readl_relaxed_t *readl_relaxed;
ia64_mv_readq_relaxed_t *readq_relaxed;
+ ia64_mv_pci_get_legacy_space_t *pci_get_legacy_space;
} __attribute__((__aligned__(16))); /* align attrib? see above comment */
#define MACHVEC_INIT(name) \
@@ -230,6 +234,7 @@
platform_readw_relaxed, \
platform_readl_relaxed, \
platform_readq_relaxed, \
+ platform_pci_get_legacy_space, \
}
extern struct ia64_machine_vector ia64_mv;
@@ -374,6 +379,9 @@
#endif
#ifndef platform_readq_relaxed
# define platform_readq_relaxed __ia64_readq_relaxed
+#endif
+#ifndef platform_pci_get_legacy_space
+# define platform_pci_get_legacy_space __ia64_pci_get_legacy_space
#endif
#endif /* _ASM_IA64_MACHVEC_H */
===== include/asm-ia64/machvec_init.h 1.8 vs edited =====
--- 1.8/include/asm-ia64/machvec_init.h 2004-10-25 13:06:49 -07:00
+++ edited/include/asm-ia64/machvec_init.h 2004-12-01 10:55:12 -08:00
@@ -5,6 +5,7 @@
extern ia64_mv_irq_desc __ia64_irq_desc;
extern ia64_mv_irq_to_vector __ia64_irq_to_vector;
extern ia64_mv_local_vector_to_irq __ia64_local_vector_to_irq;
+extern ia64_mv_pci_get_legacy_space_t __ia64_pci_get_legacy_space;
extern ia64_mv_inb_t __ia64_inb;
extern ia64_mv_inw_t __ia64_inw;
===== include/asm-ia64/machvec_sn2.h 1.16 vs edited =====
--- 1.16/include/asm-ia64/machvec_sn2.h 2004-10-25 13:06:49 -07:00
+++ edited/include/asm-ia64/machvec_sn2.h 2004-12-01 10:42:03 -08:00
@@ -70,6 +70,7 @@
extern ia64_mv_dma_sync_sg_for_device sn_dma_sync_sg_for_device;
extern ia64_mv_dma_mapping_error sn_dma_mapping_error;
extern ia64_mv_dma_supported sn_dma_supported;
+extern ia64_mv_pci_get_legacy_space_t sn_pci_get_legacy_space;
/*
* This stuff has dual use!
@@ -118,6 +119,7 @@
#define platform_dma_sync_sg_for_device sn_dma_sync_sg_for_device
#define platform_dma_mapping_error sn_dma_mapping_error
#define platform_dma_supported sn_dma_supported
+#define platform_pci_get_legacy_space sn_pci_get_legacy_space
#include <asm/sn/io.h>
===== include/asm-ia64/pci.h 1.27 vs edited =====
--- 1.27/include/asm-ia64/pci.h 2004-11-03 13:36:55 -08:00
+++ edited/include/asm-ia64/pci.h 2004-12-01 10:53:19 -08:00
@@ -85,6 +85,8 @@
#define HAVE_PCI_MMAP
extern int pci_mmap_page_range (struct pci_dev *dev, struct vm_area_struct *vma,
enum pci_mmap_state mmap_state, int write_combine);
+extern void pci_mmap_release_dev (struct pci_dev *dev);
+#define pci_get_legacy_space platform_pci_get_legacy_space
struct pci_window {
struct resource resource;
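For reference, the bookkeeping this patch introduces can be modeled outside the kernel: register a (start, end, owner) range at mmap time, do an inclusive interval lookup at MCA time, and drop the owner's ranges on release. This is a userspace stand-in (a singly linked list replaces list_head, an int replaces pid_t), not the kernel code itself:

```c
#include <stdlib.h>

struct io_range {
	unsigned long start, end;
	int owner;		/* a pid in the real patch */
	struct io_range *next;
};

static struct io_range *pci_io_ranges;

/* Mirror of the list_add() done at the end of pci_mmap_page_range() */
int register_range(unsigned long start, unsigned long end, int owner)
{
	struct io_range *r = malloc(sizeof(*r));

	if (!r)
		return -1;	/* the patch warns that errors become fatal here */
	r->start = start;
	r->end = end;
	r->owner = owner;
	r->next = pci_io_ranges;
	pci_io_ranges = r;
	return 0;
}

/* The interval test from ia64_mca_ucmc_handler(): inclusive on both ends */
int lookup_owner(unsigned long io_addr)
{
	struct io_range *r;

	for (r = pci_io_ranges; r; r = r->next)
		if (r->start <= io_addr && io_addr <= r->end)
			return r->owner;
	return -1;
}

/* Mirror of pci_mmap_release_dev(): drop every range owned by @owner */
void release_ranges(int owner)
{
	struct io_range **pp = &pci_io_ranges;

	while (*pp) {
		if ((*pp)->owner == owner) {
			struct io_range *dead = *pp;
			*pp = dead->next;
			free(dead);
		} else {
			pp = &(*pp)->next;
		}
	}
}
```

In the patch the lookup runs in MCA context, which is why it only trylocks the list lock rather than blocking.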
* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
2004-12-03 16:43 ` Jesse Barnes
@ 2004-12-06 12:42 ` Hidetoshi Seto
2004-12-06 16:13 ` Jesse Barnes
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hidetoshi Seto @ 2004-12-06 12:42 UTC (permalink / raw)
To: linux-ia64
Jesse Barnes wrote:
> On Friday, December 3, 2004 8:31 am, Jesse Barnes wrote:
>
>>This patch adds support for sending a SIGBUS to a userspace application
>>using /proc/bus/pci to drive a device if an I/O error occurs. We're using
>>this in house for the X server's BIOS emulator and it seems to be working
>>well.
>>
>>The idea is to track mmapped /proc/bus/pci regions so that the machine check
>>handler is able to properly determine which process is responsible for any
>>faults that occur (ia64 is interesting in that the error may not occur in
>>the process context that actually generated the bad reference). If a match
>>is found, a SIGBUS is sent to the process, along with the address that
>>caused the fault. The machine check record is then cleared and recovery
>>takes place (the assumption is that the signal to userspace is a sufficient
>>record of the error).
Cool!
BTW I have some short comments.
> ------------------------------------------------------------------------
>
> ===== arch/ia64/kernel/mca.c 1.71 vs edited =====
> --- 1.71/arch/ia64/kernel/mca.c 2004-11-11 10:04:30 -08:00
> +++ edited/arch/ia64/kernel/mca.c 2004-12-02 14:30:33 -08:00
> @@ -871,21 +873,69 @@
> void
> ia64_mca_ucmc_handler(void)
> {
> + struct io_range *range, *tmp;
> + unsigned long io_addr = 0;
> pal_processor_state_info_t *psp = (pal_processor_state_info_t *)
> &ia64_sal_to_os_handoff_state.proc_state_param;
> - int recover;
> + int recover = 0;
> + ia64_err_rec_t *curr_record;
>
> /* Get the MCA error record and log it */
> ia64_mca_log_sal_error_record(SAL_INFO_TYPE_MCA);
>
> - /* TLB error is only exist in this SAL error record */
> - recover = (psp->tc && !(psp->cc || psp->bc || psp->rc || psp->uc))
> - /* other error recovery */
> - || (ia64_mca_ucmc_extension
> - && ia64_mca_ucmc_extension(
> - IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA),
> - &ia64_sal_to_os_handoff_state,
> - &ia64_os_to_sal_handoff_state));
Where did the ia64_mca_ucmc_extension go? ... vanished? :-(
> + /* TLB errors are fixed up before we get here, so recover */
> + if (psp->tc) {
> + recover = 1;
> + goto return_to_sal;
> + }
> +
I'm not sure because I haven't seen logs from such a crazy error situation, but
isn't it possible that an error log has both tc and another (cc|bc|rc|uc)?
Why can we omit "!(psp->cc || psp->bc || psp->rc || psp->uc)"?
> + /*
> + * If it's not a bus check with a valid target identifier,
> + * we don't have a chance.
> + */
> + if (!psp->bc) {
> + recover = 0;
> + goto return_to_sal;
> + }
> +
> + /*
> + * If we can't get this lock, we can't safely look at the list,
> + * so give up.
> + */
> + if (!spin_trylock(&io_range_list_lock)) {
> + recover = 0;
> + goto return_to_sal;
> + }
> +
> + curr_record = IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA);
> + io_addr = curr_record->proc_err.info->target_identifier;
> +
> + /*
> + * See if an I/O error occured in a previously registered range
> + */
> + list_for_each_entry_safe(range, tmp, &pci_io_ranges, range_list) {
> + if (range->start <= io_addr && io_addr <= range->end) {
> + struct siginfo siginfo;
> + struct task_struct *owner = NULL;
> + recover = 1;
> + siginfo.si_signo = SIGBUS;
> + siginfo.si_code = BUS_ADRERR;
> + siginfo.si_addr = (void *) io_addr;
> + owner = find_task_by_pid(range->owner);
> + if (owner)
> + force_sig_info(SIGBUS, &siginfo, owner);
> + else {
> + /*
> + * need to free memory too, is that safe
> + * here?
> + */
> + list_del(&range->range_list);
> + }
> + }
> + }
> + spin_unlock(&io_range_list_lock);
> +
> + return_to_sal:
force_sig_info() takes a spinlock internally... I think calling it isn't safe during an MCA.
The rest of the patch (for legacy spaces) seems to be functionally independent.
I don't have enough insight into legacy maps, but it looks like good work.
Thanks,
H.Seto
* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
2004-12-03 16:43 ` Jesse Barnes
2004-12-06 12:42 ` Hidetoshi Seto
@ 2004-12-06 16:13 ` Jesse Barnes
2004-12-06 16:59 ` Jesse Barnes
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-06 16:13 UTC (permalink / raw)
To: linux-ia64
On Monday, December 6, 2004 4:42 am, Hidetoshi Seto wrote:
> > - /* TLB error is only exist in this SAL error record */
> > - recover = (psp->tc && !(psp->cc || psp->bc || psp->rc || psp->uc))
> > - /* other error recovery */
> > - || (ia64_mca_ucmc_extension
> > - && ia64_mca_ucmc_extension(
> > - IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA),
> > - &ia64_sal_to_os_handoff_state,
> > - &ia64_os_to_sal_handoff_state));
>
> Where did the ia64_mca_ucmc_extension go? ... vanished? :-(
Oops, that was accidental. I'll fix it up.
>
> > + /* TLB errors are fixed up before we get here, so recover */
> > + if (psp->tc) {
> > + recover = 1;
> > + goto return_to_sal;
> > + }
> > +
>
> I'm not sure because I haven't seen logs from such a crazy error situation, but
> isn't it possible that an error log has both tc and another (cc|bc|rc|uc)?
> Why can we omit "!(psp->cc || psp->bc || psp->rc || psp->uc)"?
I'm not sure either; I'll re-add the check to make sure we haven't taken
another error too.
> > + spin_unlock(&io_range_list_lock);
> > +
> > + return_to_sal:
>
> force_sig_info() takes a spinlock internally... I think calling it isn't safe
> during an MCA.
You're right, hmm, maybe I have to create a new kernel thread to wake up and
send the signals with? Any suggestions here? Using a thread would
invalidate the assumptions I'm making in userspace, since the thread that
caused the MCA might resume and get the SIGBUS too late to do anything with
it. Maybe I just need a version of force_sig_info that does a trylock
instead and returns a value saying whether the signal was delivered.
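The trylock variant suggested above can be sketched in miniature: a test-and-set stands in for the sighand spinlock, and the function reports whether delivery happened so the MCA handler can fall back to treating the error as unrecoverable. This is a userspace model under that assumption, not kernel/signal.c code, and try_force_sig() is a hypothetical name:

```c
#include <stdatomic.h>

/* Stand-in for the sighand spinlock that force_sig_info() takes */
atomic_flag sighand_lock = ATOMIC_FLAG_INIT;
int pending_sig;

/* Placeholder for the real signal queueing done under the lock */
static void queue_signal(int sig)
{
	pending_sig = sig;
}

/*
 * Model of a "try" force_sig_info(): returns 1 if the signal was
 * delivered, 0 if the lock was contended and the caller (the MCA
 * handler) should give up rather than spin on a lock the faulting
 * CPU might already hold.
 */
int try_force_sig(int sig)
{
	if (atomic_flag_test_and_set(&sighand_lock))
		return 0;	/* contended: report failure, don't spin */
	queue_signal(sig);
	atomic_flag_clear(&sighand_lock);
	return 1;
}
```

A 0 return would map to "recover = 0" in ia64_mca_ucmc_handler, the same fallback the patch already uses when the io_range_list_lock trylock fails.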
> The rest of the patch (for legacy spaces) seems to be functionally
> independent. I don't have enough insight into legacy maps, but it looks
> like good work.
Yeah, I could split that out; that might make more sense. Thanks for looking!
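The legacy-space handling being discussed reduces to offset arithmetic on vm_pgoff: if the mapping's byte offset falls inside the legacy window (64k for I/O, 1M for memory), the platform's legacy base is folded into the page offset before remapping. A small model of that calculation (PAGE_SHIFT of 14, i.e. 16k pages, is an assumption here; ia64 supports several page sizes):

```c
#define PAGE_SHIFT 14UL	/* 16k pages, a common ia64 configuration */

/*
 * Model of the vm_pgoff adjustment in pci_mmap_page_range():
 * returns the new page offset, folding in @legacy_base when the
 * mapping's byte offset starts inside the legacy window @limit.
 */
unsigned long adjust_pgoff(unsigned long pgoff, unsigned long limit,
			   unsigned long legacy_base)
{
	if ((pgoff << PAGE_SHIFT) < limit)
		pgoff += legacy_base >> PAGE_SHIFT;
	return pgoff;
}
```

With the default __ia64_pci_get_legacy_space the base is 0 and the offset is unchanged; on sn2 the base carries the bus's legacy aperture in uncached space.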
Jesse
* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (2 preceding siblings ...)
2004-12-06 16:13 ` Jesse Barnes
@ 2004-12-06 16:59 ` Jesse Barnes
2004-12-06 17:05 ` Jesse Barnes
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-06 16:59 UTC (permalink / raw)
To: linux-ia64
On Monday, December 6, 2004 4:42 am, Hidetoshi Seto wrote:
> Jesse Barnes wrote:
> > On Friday, December 3, 2004 8:31 am, Jesse Barnes wrote:
> >>This patch adds support for sending a SIGBUS to a userspace application
> >>using /proc/bus/pci to drive a device if an I/O error occurs. We're
> >> using this in house for the X server's BIOS emulator and it seems to be
> >> working well.
> >>
> >>The idea is to track mmapped /proc/bus/pci regions so that the machine
> >> check handler is able to properly determine which process is responsible
> >> for any faults that occur (ia64 is interesting in that the error may not
> >> occur in the process context that actually generated the bad reference).
> >> If a match is found, a SIGBUS is sent to the process, along with the
> >> address that caused the fault. The machine check record is then cleared
> >> and recovery takes place (the assumption is that the signal to userspace
> >> is a sufficient record of the error).
>
> Cool!
> BTW I have some short comments.
Does this look a little better? I've removed the clearing of the error
records too, in light of Keith's patch to clear them out quickly if they're
corrected (though I'll need more additions to set the recovered flag
correctly).
> force_sig_info() takes a spinlock internally... I think calling it isn't safe
> during an MCA.
This is the only bit I'm unsure about. I can't just add a spin_trylock
version, since the call path for send_sig_info calls the slab allocator,
which takes other locks.
Assuming that only the CPU that caused the MCA is in the MCA handler (i.e.
rendezvous doesn't occur), then the only time that one of the spinlocks could
hang is if the current CPU also owned it, right? Hmm, maybe the
ia64_spinlock_contention routine could check for a machine check condition
and promote the failure to an uncorrectable one in that case? That's pretty
ugly though...
Jesse
* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (3 preceding siblings ...)
2004-12-06 16:59 ` Jesse Barnes
@ 2004-12-06 17:05 ` Jesse Barnes
2004-12-06 22:56 ` Jesse Barnes
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-06 17:05 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 2079 bytes --]
On Monday, December 6, 2004 8:59 am, Jesse Barnes wrote:
> On Monday, December 6, 2004 4:42 am, Hidetoshi Seto wrote:
> > Jesse Barnes wrote:
> > > On Friday, December 3, 2004 8:31 am, Jesse Barnes wrote:
> > >>This patch adds support for sending a SIGBUS to a userspace application
> > >>using /proc/bus/pci to drive a device if an I/O error occurs. We're
> > >> using this in house for the X server's BIOS emulator and it seems to
> > >> be working well.
> > >>
> > >>The idea is to track mmapped /proc/bus/pci regions so that the machine
> > >> check handler is able to properly determine which process is
> > >> responsible for any faults that occur (ia64 is interesting in that the
> > >> error may not occur in the process context that actually generated the
> > >> bad reference). If a match is found, a SIGBUS is sent to the process,
> > >> along with the address that caused the fault. The machine check
> > >> record is then cleared and recovery takes place (the assumption is
> > >> that the signal to userspace is a sufficient record of the error).
> >
> > Cool!
> > BTW I have some short comments.
>
> Does this look a little better? I've removed the clearing of the error
> records too, in light of Keith's patch to clear them out quickly if they're
> corrected (though I'll need more additions to set the recovered flag
> correctly).
>
> > force_sig_info() takes a spinlock internally... I think calling it isn't
> > safe during an MCA.
>
> This is the only bit I'm unsure about. I can't just add a spin_trylock
> version, since the call path for send_sig_info calls the slab allocator,
> which takes other locks.
>
> Assuming that only the CPU that caused the MCA is in the MCA handler (i.e.
> rendezvous doesn't occur), then the only time that one of the spinlocks
> could hang is if the current CPU also owned it, right? Hmm, maybe the
> ia64_spinlock_contention routine could check for a machine check condition
> and promote the failure to an uncorrectable one in that case? That's
> pretty ugly though...
Ugh, I forgot to attach the patch again. Here it is.
Jesse
[-- Attachment #2: io-error-sigbus-12.patch --]
[-- Type: text/plain, Size: 13386 bytes --]
===== arch/ia64/kernel/mca.c 1.71 vs edited =====
--- 1.71/arch/ia64/kernel/mca.c 2004-11-11 10:04:30 -08:00
+++ edited/arch/ia64/kernel/mca.c 2004-12-06 08:52:04 -08:00
@@ -871,21 +871,78 @@
void
ia64_mca_ucmc_handler(void)
{
+ struct io_range *range, *tmp;
+ unsigned long io_addr = 0;
pal_processor_state_info_t *psp = (pal_processor_state_info_t *)
&ia64_sal_to_os_handoff_state.proc_state_param;
- int recover;
+ int recover = 0;
+ ia64_err_rec_t *curr_record;
/* Get the MCA error record and log it */
ia64_mca_log_sal_error_record(SAL_INFO_TYPE_MCA);
- /* TLB error is only exist in this SAL error record */
- recover = (psp->tc && !(psp->cc || psp->bc || psp->rc || psp->uc))
- /* other error recovery */
- || (ia64_mca_ucmc_extension
- && ia64_mca_ucmc_extension(
- IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA),
- &ia64_sal_to_os_handoff_state,
- &ia64_os_to_sal_handoff_state));
+ /* TLB errors are fixed up before we get here, so recover */
+ if (psp->tc && !(psp->cc || psp->bc || psp->rc || psp->uc)) {
+ recover = 1;
+ goto return_to_sal;
+ }
+
+ /* This error was handled by one of the MCA recovery extensions */
+ if (ia64_mca_ucmc_extension &&
+ ia64_mca_ucmc_extension(IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA),
+ &ia64_sal_to_os_handoff_state,
+ &ia64_os_to_sal_handoff_state)) {
+ recover = 1;
+ goto return_to_sal;
+ }
+
+ /*
+ * If it's not a bus check with a valid target identifier,
+ * we don't have a chance.
+ */
+ if (!psp->bc) {
+ recover = 0;
+ goto return_to_sal;
+ }
+
+ /*
+ * If we can't get this lock, we can't safely look at the list,
+ * so give up.
+ */
+ if (!spin_trylock(&io_range_list_lock)) {
+ recover = 0;
+ goto return_to_sal;
+ }
+
+ curr_record = IA64_LOG_CURR_BUFFER(SAL_INFO_TYPE_MCA);
+ io_addr = curr_record->proc_err.info->target_identifier;
+
+ /*
+ * See if an I/O error occured in a previously registered range
+ */
+ list_for_each_entry_safe(range, tmp, &pci_io_ranges, range_list) {
+ if (range->start <= io_addr && io_addr <= range->end) {
+ struct siginfo siginfo;
+ struct task_struct *owner = NULL;
+ recover = 1;
+ siginfo.si_signo = SIGBUS;
+ siginfo.si_code = BUS_ADRERR;
+ siginfo.si_addr = (void *) io_addr;
+ owner = find_task_by_pid(range->owner);
+ if (owner)
+ force_sig_info(SIGBUS, &siginfo, owner);
+ else {
+ /*
+ * need to free memory too, is that safe
+ * here?
+ */
+ list_del(&range->range_list);
+ }
+ }
+ }
+ spin_unlock(&io_range_list_lock);
+
+ return_to_sal:
/*
* Wakeup all the processors which are spinning in the rendezvous
===== arch/ia64/pci/pci.c 1.59 vs edited =====
--- 1.59/arch/ia64/pci/pci.c 2004-11-05 11:55:25 -08:00
+++ edited/arch/ia64/pci/pci.c 2004-12-02 14:30:02 -08:00
@@ -20,6 +20,7 @@
#include <linux/slab.h>
#include <linux/smp_lock.h>
#include <linux/spinlock.h>
+#include <linux/slab.h>
#include <asm/machvec.h>
#include <asm/page.h>
@@ -36,7 +37,6 @@
#include <asm/irq.h>
#include <asm/hw_irq.h>
-
#undef DEBUG
#define DEBUG
@@ -48,6 +48,9 @@
static int pci_routeirq;
+LIST_HEAD(pci_io_ranges);
+spinlock_t io_range_list_lock = SPIN_LOCK_UNLOCKED;
+
/*
* Low-level SAL-based PCI configuration access functions. Note that SAL
* calls are already serialized (via sal_lock), so we don't need another
@@ -501,24 +504,35 @@
pci_mmap_page_range (struct pci_dev *dev, struct vm_area_struct *vma,
enum pci_mmap_state mmap_state, int write_combine)
{
- /*
- * I/O space cannot be accessed via normal processor loads and
- * stores on this platform.
- */
- if (mmap_state == pci_mmap_io)
- /*
- * XXX we could relax this for I/O spaces for which ACPI
- * indicates that the space is 1-to-1 mapped. But at the
- * moment, we don't support multiple PCI address spaces and
- * the legacy I/O space is not 1-to-1 mapped, so this is moot.
- */
- return -EINVAL;
+ struct io_range *new_range;
+ int ret = 0;
+ int iospace = (mmap_state == pci_mmap_io) ? 1 : 0;
+
+ /* Remap legacy I/O space for this bus if the offset is < 0xffff */
+ if (mmap_state == pci_mmap_io &&
+ (vma->vm_pgoff << PAGE_SHIFT) < 0xffff) {
+ unsigned long legacy_io;
+ if ((ret = pci_get_legacy_space(iospace, dev, &legacy_io)))
+ goto out;
+
+ vma->vm_pgoff += legacy_io >> PAGE_SHIFT;
+ }
+
+ /* Remap legacy mem space for this bus if the offset is < 1M */
+ if (mmap_state == pci_mmap_mem &&
+ (vma->vm_pgoff << PAGE_SHIFT) < (1024*1024)) {
+ unsigned long legacy_mem;
+ if ((ret = pci_get_legacy_space(iospace, dev, &legacy_mem)))
+ goto out;
+
+ vma->vm_pgoff += legacy_mem >> PAGE_SHIFT;
+ }
/*
* Leave vm_pgoff as-is, the PCI space address is the physical
* address on this platform.
*/
- vma->vm_flags |= (VM_SHM | VM_LOCKED | VM_IO);
+ vma->vm_flags |= (VM_SHM | VM_IO | VM_RESERVED);
if (write_combine)
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
@@ -526,9 +540,78 @@
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
- vma->vm_end - vma->vm_start, vma->vm_page_prot))
- return -EAGAIN;
+ vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ new_range = kmalloc(sizeof(struct io_range), GFP_KERNEL);
+ if (!new_range) {
+ printk(KERN_WARNING "%s: cannot allocate io_range, "
+ "I/O errors for 0x%016lx-0x%016lx will be fatal",
+ __FUNCTION__, vma->vm_start, vma->vm_end);
+ goto out;
+ }
+
+ /*
+ * Track this range and its associated process for use by the
+ * MCA handler.
+ */
+ new_range->start = __pa(vma->vm_pgoff << PAGE_SHIFT);
+ new_range->end = new_range->start + (vma->vm_end - vma->vm_start);
+ new_range->owner = current->pid;
+
+ spin_lock(&io_range_list_lock);
+ list_add(&new_range->range_list, &pci_io_ranges);
+ spin_unlock(&io_range_list_lock);
+
+ printk("I/O range 0x%016lx-0x%016lx registered\n",
+ new_range->start, new_range->end);
+ out:
+ return ret;
+}
+
+/**
+ * pci_mmap_release_dev - release any resources associated with a previous mapping
+ * @dev: pci device involved
+ *
+ * On ia64, this routine removes and frees the range in question from the
+ * io_range_list.
+ */
+void
+pci_mmap_release_dev(struct pci_dev *dev)
+{
+ struct io_range *range, *tmp;
+
+ spin_lock(&io_range_list_lock);
+ list_for_each_entry_safe(range, tmp, &pci_io_ranges, range_list) {
+ if (range->owner == current->pid) {
+ list_del(&range->range_list);
+ printk("I/O range 0x%016lx-0x%016lx de-registered\n",
+ range->start, range->end);
+ kfree(range);
+ }
+ }
+ spin_unlock(&io_range_list_lock);
+}
+
+/**
+ * __ia64_pci_get_legacy_space - for machines w/o a machine vector
+ * @iospace: which space, I/O or memory
+ * @dev: pci dev
+ * @base: base address
+ *
+ * For most platforms, the legacy base address is 0, but platforms
+ * can override it by providing their own machine vector for this
+ * routine. Note that platforms may want to provide their own routine
+ * even if the base is 0 in order to remap legacy space to the bus that
+ * @dev sits on.
+ */
+int
+__ia64_pci_get_legacy_space(int iospace, struct pci_dev *dev,
+ unsigned long *base)
+{
+ *base = 0;
return 0;
}
===== arch/ia64/sn/pci/pci_dma.c 1.2 vs edited =====
--- 1.2/arch/ia64/sn/pci/pci_dma.c 2004-10-20 12:00:10 -07:00
+++ edited/arch/ia64/sn/pci/pci_dma.c 2004-12-01 10:44:54 -08:00
@@ -10,6 +10,7 @@
*/
#include <linux/module.h>
+#include <linux/pci.h>
#include <asm/sn/sn_sal.h>
#include "pci/pcibus_provider_defs.h"
#include "pci/pcidev.h"
@@ -475,3 +476,19 @@
EXPORT_SYMBOL(sn_pci_free_consistent);
EXPORT_SYMBOL(sn_pci_dma_supported);
EXPORT_SYMBOL(sn_dma_mapping_error);
+
+int sn_pci_get_legacy_space(int iospace, struct pci_dev *dev,
+ unsigned long *base)
+{
+ if (SN_PCIDEV_BUSSOFT(dev) == NULL)
+ return -ENODEV;
+
+ if (iospace) {
+ /* Put the phys addr in uncached space */
+ *base = SN_PCIDEV_BUSSOFT(dev)->bs_legacy_io | __IA64_UNCACHED_OFFSET;
+ } else {
+ /* Put the phys addr in uncached space */
+ *base = SN_PCIDEV_BUSSOFT(dev)->bs_legacy_mem | __IA64_UNCACHED_OFFSET;
+ }
+ return 0;
+}
===== drivers/pci/proc.c 1.41 vs edited =====
--- 1.41/drivers/pci/proc.c 2004-10-06 09:44:51 -07:00
+++ edited/drivers/pci/proc.c 2004-12-01 10:13:26 -08:00
@@ -279,6 +279,10 @@
static int proc_bus_pci_release(struct inode *inode, struct file *file)
{
+ const struct proc_dir_entry *dp = PDE(inode);
+ struct pci_dev *dev = dp->data;
+
+ pci_mmap_release_dev(dev);
kfree(file->private_data);
file->private_data = NULL;
===== include/asm-ia64/io.h 1.24 vs edited =====
--- 1.24/include/asm-ia64/io.h 2004-10-28 12:10:56 -07:00
+++ edited/include/asm-ia64/io.h 2004-12-01 09:51:26 -08:00
@@ -1,6 +1,8 @@
#ifndef _ASM_IA64_IO_H
#define _ASM_IA64_IO_H
+#include <linux/list.h>
+
/*
* This file contains the definitions for the emulated IO instructions
* inb/inw/inl/outb/outw/outl and the "string versions" of the same
@@ -51,6 +53,17 @@
extern struct io_space io_space[];
extern unsigned int num_io_spaces;
+/*
+ * Simple I/O range object with owner (if there is one)
+ */
+struct io_range {
+ unsigned long start, end;
+ struct list_head range_list;
+ pid_t owner;
+};
+
+extern struct list_head pci_io_ranges;
+
# ifdef __KERNEL__
/*
@@ -66,11 +79,14 @@
#define PIO_RESERVED __IA64_UNCACHED_OFFSET
#define HAVE_ARCH_PIO_SIZE
+#include <linux/spinlock.h>
#include <asm/intrinsics.h>
#include <asm/machvec.h>
#include <asm/page.h>
#include <asm/system.h>
#include <asm-generic/iomap.h>
+
+extern spinlock_t io_range_list_lock;
/*
* Change virtual addresses to physical addresses and vv.
===== include/asm-ia64/machvec.h 1.29 vs edited =====
--- 1.29/include/asm-ia64/machvec.h 2004-10-25 13:06:49 -07:00
+++ edited/include/asm-ia64/machvec.h 2004-12-01 10:46:49 -08:00
@@ -20,6 +20,7 @@
struct irq_desc;
struct page;
struct mm_struct;
+struct pci_dev;
typedef void ia64_mv_setup_t (char **);
typedef void ia64_mv_cpu_init_t (void);
@@ -31,6 +32,7 @@
typedef struct irq_desc *ia64_mv_irq_desc (unsigned int);
typedef u8 ia64_mv_irq_to_vector (unsigned int);
typedef unsigned int ia64_mv_local_vector_to_irq (u8);
+typedef int ia64_mv_pci_get_legacy_space_t (int, struct pci_dev *, unsigned long *);
/* DMA-mapping interface: */
typedef void ia64_mv_dma_init (void);
@@ -140,6 +142,7 @@
# define platform_readw_relaxed ia64_mv.readw_relaxed
# define platform_readl_relaxed ia64_mv.readl_relaxed
# define platform_readq_relaxed ia64_mv.readq_relaxed
+# define platform_pci_get_legacy_space ia64_mv.pci_get_legacy_space
# endif
/* __attribute__((__aligned__(16))) is required to make size of the
@@ -187,6 +190,7 @@
ia64_mv_readw_relaxed_t *readw_relaxed;
ia64_mv_readl_relaxed_t *readl_relaxed;
ia64_mv_readq_relaxed_t *readq_relaxed;
+ ia64_mv_pci_get_legacy_space_t *pci_get_legacy_space;
} __attribute__((__aligned__(16))); /* align attrib? see above comment */
#define MACHVEC_INIT(name) \
@@ -230,6 +234,7 @@
platform_readw_relaxed, \
platform_readl_relaxed, \
platform_readq_relaxed, \
+ platform_pci_get_legacy_space, \
}
extern struct ia64_machine_vector ia64_mv;
@@ -374,6 +379,9 @@
#endif
#ifndef platform_readq_relaxed
# define platform_readq_relaxed __ia64_readq_relaxed
+#endif
+#ifndef platform_pci_get_legacy_space
+# define platform_pci_get_legacy_space __ia64_pci_get_legacy_space
#endif
#endif /* _ASM_IA64_MACHVEC_H */
===== include/asm-ia64/machvec_init.h 1.8 vs edited =====
--- 1.8/include/asm-ia64/machvec_init.h 2004-10-25 13:06:49 -07:00
+++ edited/include/asm-ia64/machvec_init.h 2004-12-01 10:55:12 -08:00
@@ -5,6 +5,7 @@
extern ia64_mv_irq_desc __ia64_irq_desc;
extern ia64_mv_irq_to_vector __ia64_irq_to_vector;
extern ia64_mv_local_vector_to_irq __ia64_local_vector_to_irq;
+extern ia64_mv_pci_get_legacy_space_t __ia64_pci_get_legacy_space;
extern ia64_mv_inb_t __ia64_inb;
extern ia64_mv_inw_t __ia64_inw;
===== include/asm-ia64/machvec_sn2.h 1.16 vs edited =====
--- 1.16/include/asm-ia64/machvec_sn2.h 2004-10-25 13:06:49 -07:00
+++ edited/include/asm-ia64/machvec_sn2.h 2004-12-01 10:42:03 -08:00
@@ -70,6 +70,7 @@
extern ia64_mv_dma_sync_sg_for_device sn_dma_sync_sg_for_device;
extern ia64_mv_dma_mapping_error sn_dma_mapping_error;
extern ia64_mv_dma_supported sn_dma_supported;
+extern ia64_mv_pci_get_legacy_space_t sn_pci_get_legacy_space;
/*
* This stuff has dual use!
@@ -118,6 +119,7 @@
#define platform_dma_sync_sg_for_device sn_dma_sync_sg_for_device
#define platform_dma_mapping_error sn_dma_mapping_error
#define platform_dma_supported sn_dma_supported
+#define platform_pci_get_legacy_space sn_pci_get_legacy_space
#include <asm/sn/io.h>
===== include/asm-ia64/pci.h 1.27 vs edited =====
--- 1.27/include/asm-ia64/pci.h 2004-11-03 13:36:55 -08:00
+++ edited/include/asm-ia64/pci.h 2004-12-01 10:53:19 -08:00
@@ -85,6 +85,8 @@
#define HAVE_PCI_MMAP
extern int pci_mmap_page_range (struct pci_dev *dev, struct vm_area_struct *vma,
enum pci_mmap_state mmap_state, int write_combine);
+extern void pci_mmap_release_dev (struct pci_dev *dev);
+#define pci_get_legacy_space platform_pci_get_legacy_space
struct pci_window {
struct resource resource;
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (4 preceding siblings ...)
2004-12-06 17:05 ` Jesse Barnes
@ 2004-12-06 22:56 ` Jesse Barnes
2004-12-06 23:51 ` Keith Owens
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-06 22:56 UTC (permalink / raw)
To: linux-ia64
On Monday, December 6, 2004 9:05 am, Jesse Barnes wrote:
> This is the only bit I'm unsure about. I can't just add a spin_trylock
> version, since the call path for send_sig_info calls the slab allocator,
> which takes other locks.
>
> Assuming that only the CPU that caused the MCA is in the MCA handler
> (i.e. rendezvous doesn't occur), then the only time that one of the
> spinlocks could hang is if the current CPU also owned it, right? Hmm,
> maybe the ia64_spinlock_contention routine could check for a machine
> check condition and promote the failure to an uncorrectable one in that
> case? That's pretty ugly though...
This is tricky. If we want I/O error handling to be 100% reliable when I/O
errors are caused by userspace applications, we need to deal with the case
where the offending process' machine check is received in either user or
kernel mode, regardless of what context we're currently in. My code assumes
that we receive the machine check in user mode and so the force_sig_info is
safe, but obviously that won't always be the case.
We need to do a few things in order to ensure safety (this should apply to the
double bit memory error case too I think):
o make sure the process doesn't run until we've tried to recover from the
error
o don't take any locks while we're in machine check context
o don't destroy our current context since we may want to resume to it
eventually (esp. in the case where we received the machine check in kernel
context)
So, given the above, maybe we could put the process in a TASK_STOPPED state
and pend a scheduler tick on the CPU where we took the machine check? At
that point, we could also wake up an MCA worker thread or raise an MCA
interrupt (maybe using the NMI interrupt vector, it's high priority and isn't
used right now) to send the signal or do whatever cleanup was needed.
Jesse
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (5 preceding siblings ...)
2004-12-06 22:56 ` Jesse Barnes
@ 2004-12-06 23:51 ` Keith Owens
2004-12-07 0:38 ` Keith Owens
` (3 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Keith Owens @ 2004-12-06 23:51 UTC (permalink / raw)
To: linux-ia64
On Mon, 6 Dec 2004 08:59:45 -0800,
Jesse Barnes <jbarnes@engr.sgi.com> wrote:
>On Monday, December 6, 2004 4:42 am, Hidetoshi Seto wrote:
>> force_sig_info() takes spinlock in it... I think calling this isn't safe on
>> MCA.
>
>This is the only bit I'm unsure about. I can't just add a spin_trylock
>version, since the call path for send_sig_info calls the slab allocator,
>which takes other locks.
>
>Assuming that only the CPU that caused the MCA is in the MCA handler (i.e.
>rendezvous doesn't occur), then the only time that one of the spinlocks could
>hang is if the current CPU also owned it, right? Hmm, maybe the
>ia64_spinlock_contention routine could check for a machine check condition
>and promote the failure to an uncorrectable one in that case? That's pretty
>ugly though...
The best option I can come up with is to record the fact that we need
sigbus and save the failing io_addr and range->owner in per cpu
variables. In arch/ia64/kernel/entry.S work_pending, test for
a pending sigbus from MCA and send the signal from there, using the
saved data. IOW, send the signal on the next return from kernel to
user space on the current cpu, not from the MCA handler.
To check for a pending signal from MCA, define TIF_SIGNAL_MCA in
thread_info.h. Adding TIF_SIGNAL_MCA as a work flag will require some
adjustments to TIF_WORK_MASK and TIF_ALLWORK_MASK, no big deal.
Checking the extra flag has no impact on the existing ia64_leave_kernel
code path unless the flag is set, we check for all the flags in a
single mask test.
The offending task might be sleeping waiting for I/O completion which
will never occur now. OTOH the current task could be the one that was
interrupted by the MCA. To ensure that the task (a) gets woken up and
(b) cannot do any more useful work, set TIF_SIGNAL_MCA on both the
current task and the offending task. Whichever task goes through
ia64_leave_kernel next will send the signal, from the right context.
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (6 preceding siblings ...)
2004-12-06 23:51 ` Keith Owens
@ 2004-12-07 0:38 ` Keith Owens
2004-12-07 0:40 ` Jesse Barnes
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Keith Owens @ 2004-12-07 0:38 UTC (permalink / raw)
To: linux-ia64
On Mon, 6 Dec 2004 14:56:58 -0800,
Jesse Barnes <jbarnes@engr.sgi.com> wrote:
>On Monday, December 6, 2004 9:05 am, Jesse Barnes wrote:
>> This is the only bit I'm unsure about. I can't just add a spin_trylock
>> version, since the call path for send_sig_info calls the slab allocator,
>> which takes other locks.
>>
>> Assuming that only the CPU that caused the MCA is in the MCA handler
>> (i.e. rendezvous doesn't occur), then the only time that one of the
>> spinlocks could hang is if the current CPU also owned it, right? Hmm,
>> maybe the ia64_spinlock_contention routine could check for a machine
>> check condition and promote the failure to an uncorrectable one in that
>> case? That's pretty ugly though...
>
>This is tricky. If we want I/O error handling to be 100% reliable when I/O
>errors are caused by userspace applications, we need to deal with the case
>where the offending process' machine check is received in either user or
>kernel mode, regardless of what context we're currently in. My code assumes
>that we receive the machine check in user mode and so the force_sig_info is
>safe, but obviously that won't always be the case.
>
>We need to do a few things in order to ensure safety (this should apply to the
>double bit memory error case too I think):
> o make sure the process doesn't run until we've tried to recover from the
> error
> o don't take any locks while we're in machine check context
> o don't destroy our current context since we may want to resume to it
> eventually (esp. in the case where we received the machine check in kernel
> context)
>
>So, given the above, maybe we could put the process in a TASK_STOPPED state
>and pend a scheduler tick on the CPU where we took the machine check? At
>that point, we could also wake up an MCA worker thread or raise an MCA
>interrupt (maybe using the NMI interrupt vector, it's high priority and isn't
>used right now) to send the signal or do whatever cleanup was needed.
You seem to be assuming that the offending process is currently
running. I don't see how that is guaranteed, the task could start the
I/O then sleep waiting for completion. When the MCA arrives, any task
could be in control of the cpu, including the idle task.
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (7 preceding siblings ...)
2004-12-07 0:38 ` Keith Owens
@ 2004-12-07 0:40 ` Jesse Barnes
2004-12-07 1:29 ` Keith Owens
2004-12-07 1:36 ` Jesse Barnes
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-07 0:40 UTC (permalink / raw)
To: linux-ia64
On Monday, December 6, 2004 4:38 pm, Keith Owens wrote:
> >We need to do a few things in order to ensure safety (this should apply to
> > the double bit memory error case too I think):
> > o make sure the process doesn't run until we've tried to recover from
> > the error
> > o don't take any locks while we're in machine check context
> > o don't destroy our current context since we may want to resume to it
> > eventually (esp. in the case where we received the machine check in
> > kernel context)
> >
> >So, given the above, maybe we could put the process in a TASK_STOPPED
> > state and pend a scheduler tick on the CPU where we took the machine
> > check? At that point, we could also wake up an MCA worker thread or raise an
> > MCA interrupt (maybe using the NMI interrupt vector, it's high priority
> > and isn't used right now) to send the signal or do whatever cleanup was
> > needed.
>
> You seem to be assuming that the offending process is currently
> running. I don't see how that is guaranteed, the task could start the
> I/O then sleep waiting for completion. When the MCA arrives, any task
> could be in control of the cpu, including the idle task.
No, I just left that part out. For the case of I/O reads (and even memory
errors) we have a reverse mapping from the failing address to its owning
process, so we can figure out who to signal from any context. What I'd like
to avoid is destroying the current context, like we do in the double bit
error case now when we recover into the mca bh handler.
Jesse
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (8 preceding siblings ...)
2004-12-07 0:40 ` Jesse Barnes
@ 2004-12-07 1:29 ` Keith Owens
2004-12-07 1:36 ` Jesse Barnes
10 siblings, 0 replies; 12+ messages in thread
From: Keith Owens @ 2004-12-07 1:29 UTC (permalink / raw)
To: linux-ia64
On Mon, 6 Dec 2004 16:40:28 -0800,
Jesse Barnes <jbarnes@engr.sgi.com> wrote:
>On Monday, December 6, 2004 4:38 pm, Keith Owens wrote:
>> >We need to do a few things in order to ensure safety (this should apply to
>> > the double bit memory error case too I think):
>> > o make sure the process doesn't run until we've tried to recover from
>> > the error
>> > o don't take any locks while we're in machine check context
>> > o don't destroy our current context since we may want to resume to it
>> > eventually (esp. in the case where we received the machine check in
>> > kernel context)
>> >
>> >So, given the above, maybe we could put the process in a TASK_STOPPED
>> > state and pend a scheduler tick on the CPU where we took the machine
>> > check? At that point, we could also wake up an MCA worker thread or raise an
>> > MCA interrupt (maybe using the NMI interrupt vector, it's high priority
>> > and isn't used right now) to send the signal or do whatever cleanup was
>> > needed.
>>
>> You seem to be assuming that the offending process is currently
>> running. I don't see how that is guaranteed, the task could start the
>> I/O then sleep waiting for completion. When the MCA arrives, any task
>> could be in control of the cpu, including the idle task.
>
>No, I just left that part out. For the case of I/O reads (and even memory
>errors) we have a reverse mapping from the failing address to its owning
>process, so we can figure out who to signal from any context. What I'd like
>to avoid is destroying the current context, like we do in the double bit
>error case now when we recover into the mca bh handler.
I understand that you know the pid of the offending process (OP), that
is not my concern. Your comment "put the process in a TASK_STOPPED
state and pend a scheduler tick on the CPU where we took the machine
check" assumes that sending a scheduler tick will reschedule the OP.
IOW you are assuming that the OP is currently running on this cpu. If
the OP is already stopped, what sends a signal to the OP?
My earlier suggestion of setting TIF_SIGNAL_MCA on both the OP and the
current process handles both cases. Whether it is running or stopped,
the next return from kernel on this cpu will send the signal. And we
get that check for free, adding another TIF flag has no impact on the
fast path for ia64_leave_kernel.
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [RFC] I/O error handling for userspace
2004-12-03 16:31 [RFC] I/O error handling for userspace Jesse Barnes
` (9 preceding siblings ...)
2004-12-07 1:29 ` Keith Owens
@ 2004-12-07 1:36 ` Jesse Barnes
10 siblings, 0 replies; 12+ messages in thread
From: Jesse Barnes @ 2004-12-07 1:36 UTC (permalink / raw)
To: linux-ia64
On Monday, December 6, 2004 5:29 pm, Keith Owens wrote:
> I understand that you know the pid of the offending process (OP), that
> is not my concern. Your comment "put the process in a TASK_STOPPED
> state and pend a scheduler tick on the CPU where we took the machine
> check" assumes that sending a scheduler tick will reschedule the OP.
That's not what I was trying to say; I know that the OP may not be running. I
just want to make sure that it doesn't get re-run when we return to the
interrupted context, or get rescheduled if it was sleeping at the time we
took the machine check.
> IOW you are assuming that the OP is currently running on this cpu. If
> the OP is already stopped, what sends a signal to the OP?
The machine check interrupt handler or worker thread that I also mentioned (at
least that's what I was trying to get at).
> My earlier suggestion of setting TIF_SIGNAL_MCA on both the OP and the
> current process handles both cases. Whether it is running or stopped,
> the next return from kernel on this cpu will send the signal. And we
> get that check for free, adding another TIF flag has no impact on the
> fast path for ia64_leave_kernel.
Yeah, that sounds like it would work too, and would probably be simpler.
Thanks,
Jesse
^ permalink raw reply [flat|nested] 12+ messages in thread