[PATCH] [0/16] HWPOISON: Intro

* [PATCH] [0/16] HWPOISON: Intro
@ 2009-05-29 21:35 Andi Kleen
  2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
                   ` (16 more replies)
  0 siblings, 17 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu

Another version of the hwpoison patchkit. I addressed
all feedback, except:
- I didn't move the handlers into other files; I prefer
  to keep things together for now.
- I'm keeping a dedicated page poison bit because I think that's
  cleaner than any of the other hacks.

Andrew, please put it into -mm for the .31 track.

The patchkit is also available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison

Thanks,
-Andi

---

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned", 
kill the processes associated with it and avoid using it in the future. 

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use there was a need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.

This is not fully complete yet; in particular there are still several
ways to access poisoned pages (crash dump, /proc/kcore etc.)
that need to be plugged too.

Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome. 

The patch series requires the earlier x86 MCE feature series for the
x86-specific "action optional" part. The code can be tested without the
x86-specific part by using the injector; this only requires enabling the
Kconfig entry manually in some Kconfig file (by default it is implicitly
enabled by the architecture).

v2: Lots of smaller changes in the series based on review feedback.
Rename Poison to HWPoison after akpm's request.
A new pfn based injector based on feedback.
A lot of improvements mostly from Fengguang Wu
See comments in the individual patches.
v3: Various updates, see changelogs in individual patches.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:52   ` Alan Cox
  2009-05-29 21:35 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Hardware poisoned pages need special handling in the VM and shouldn't be 
touched again. This requires a new page flag. Define it here.

The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.

v2: Add TestSetHWPoison (suggested by Johannes Weiner)
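
For readers unfamiliar with the page flag accessors, here is a minimal
user-space sketch of what the three operations mean. It is a model built on
C11 atomics, not the kernel macros themselves: struct page_model, the bit
number 19 and the lower-case function names are assumptions made up for
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number; the real PG_hwpoison value depends on the config. */
#define PG_HWPOISON_BIT 19UL

static_assert(PG_HWPOISON_BIT < sizeof(unsigned long) * 8,
	      "flag bit must fit in the flags word");

struct page_model {
	atomic_ulong flags;	/* stand-in for the flags word of struct page */
};

/* Like PageHWPoison(): test the bit without modifying it. */
static inline bool page_hwpoison(struct page_model *p)
{
	return atomic_load(&p->flags) & (1UL << PG_HWPOISON_BIT);
}

/* Like SetPageHWPoison(): set the bit unconditionally. */
static inline void set_page_hwpoison(struct page_model *p)
{
	atomic_fetch_or(&p->flags, 1UL << PG_HWPOISON_BIT);
}

/*
 * Like TestSetPageHWPoison(): atomically set the bit and report whether it
 * was already set, so only one of several racing callers sees "false".
 */
static inline bool test_set_page_hwpoison(struct page_model *p)
{
	unsigned long old = atomic_fetch_or(&p->flags, 1UL << PG_HWPOISON_BIT);

	return old & (1UL << PG_HWPOISON_BIT);
}
```

The test-and-set form is the useful one when two errors are reported for the
same page: the caller that gets "false" back does the work, the other backs
off.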

Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/page-flags.h |   17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/page-flags.h	2009-05-29 23:32:10.000000000 +0200
@@ -51,6 +51,9 @@
  * PG_buddy is set to indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  *
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
  */
 
 /*
@@ -104,6 +107,9 @@
 #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
 	PG_uncached,		/* Page has been mapped as uncached */
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	PG_hwpoison,		/* hardware poisoned page. Don't touch */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -273,6 +279,15 @@
 PAGEFLAG_FALSE(Uncached)
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -403,7 +418,7 @@
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 __PG_UNEVICTABLE | __PG_MLOCKED)
+	 __PG_HWPOISON  | __PG_UNEVICTABLE | __PG_MLOCKED)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.


* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
  2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
@ 2009-05-29 21:52   ` Alan Cox
  0 siblings, 0 replies; 36+ messages in thread
From: Alan Cox @ 2009-05-29 21:52 UTC
  To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu

On Fri, 29 May 2009 23:35:26 +0200 (CEST)
Andi Kleen <andi@firstfloor.org> wrote:

> 
> Hardware poisoned pages need special handling in the VM and shouldn't be 
> touched again. This requires a new page flag. Define it here.
> 
> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one.
> 
> v2: Add TestSetHWPoison (suggested by Johannes Weiner)
> 
> Acked-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
>  include/linux/page-flags.h |   17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> Index: linux/include/linux/page-flags.h
> ===================================================================
> --- linux.orig/include/linux/page-flags.h	2009-05-29 23:32:10.000000000 +0200
> +++ linux/include/linux/page-flags.h	2009-05-29 23:32:10.000000000 +0200
> @@ -51,6 +51,9 @@
>   * PG_buddy is set to indicate that the page is free and in the buddy system
>   * (see mm/page_alloc.c).
>   *
> + * PG_hwpoison indicates that a page got corrupted in hardware and contains
> + * data with incorrect ECC bits that triggered a machine check. Accessing is
> + * not safe since it may cause another machine check. Don't touch!
>   */
>  
>  /*
> @@ -104,6 +107,9 @@
>  #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
>  	PG_uncached,		/* Page has been mapped as uncached */
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	PG_hwpoison,		/* hardware poisoned page. Don't touch */
> +#endif
>  	__NR_PAGEFLAGS,
>  
>  	/* Filesystems */
> @@ -273,6 +279,15 @@
>  PAGEFLAG_FALSE(Uncached)
>  #endif
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +PAGEFLAG(HWPoison, hwpoison)
> +TESTSETFLAG(HWPoison, hwpoison)
> +#define __PG_HWPOISON (1UL << PG_hwpoison)
> +#else
> +PAGEFLAG_FALSE(HWPoison)
> +#define __PG_HWPOISON 0
> +#endif
> +
>  static inline int PageUptodate(struct page *page)
>  {
>  	int ret = test_bit(PG_uptodate, &(page)->flags);
> @@ -403,7 +418,7 @@
>  	 1 << PG_private | 1 << PG_private_2 | \
>  	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
>  	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> -	 __PG_UNEVICTABLE | __PG_MLOCKED)
> +	 __PG_HWPOISON  | __PG_UNEVICTABLE | __PG_MLOCKED)
>  
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.


-- 
--
	"Alan, I'm getting a bit worried about you."
				-- Linus Torvalds


* [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
  2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: fengguang.wu, akpm, linux-kernel, linux-mm


From: Fengguang Wu <fengguang.wu@intel.com>

Export the new poison flag in /proc/kpageflags. Poisoned pages are moderately
interesting even for administrators, so export them here. Also useful
for debugging.
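
To illustrate the bit translation, here is a small user-space sketch of the
kpf_copy_bit() idea: take one bit from the kernel-internal flags word and
place it at the stable, user-visible position. KPF_HWPOISON (11) is taken
from the patch; the internal bit number 19 and the helper export_hwpoison()
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KPF_HWPOISON      11	/* user-visible bit, from the patch */
#define PG_HWPOISON_MODEL 19	/* hypothetical kernel-internal bit number */

static_assert(PG_HWPOISON_MODEL < 64 && KPF_HWPOISON < 64,
	      "bit positions must fit in a 64-bit word");

/* Function form of kpf_copy_bit(): move bit srcpos of flags to dstpos. */
static inline uint64_t kpf_copy_bit(uint64_t flags, unsigned int dstpos,
				    unsigned int srcpos)
{
	return ((flags >> srcpos) & 1) << dstpos;
}

/* Build the exported word for one page, hwpoison bit only. */
static inline uint64_t export_hwpoison(uint64_t kflags)
{
	return kpf_copy_bit(kflags, KPF_HWPOISON, PG_HWPOISON_MODEL);
}
```

The point of the indirection is that the user-visible position stays fixed
even if the internal flag numbering changes between kernel configurations.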

AK: I extracted this out of a larger patch from Fengguang Wu.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 fs/proc/page.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux/fs/proc/page.c
===================================================================
--- linux.orig/fs/proc/page.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/fs/proc/page.c	2009-05-29 23:32:10.000000000 +0200
@@ -79,6 +79,7 @@
 #define KPF_WRITEBACK  8
 #define KPF_RECLAIM    9
 #define KPF_BUDDY     10
+#define KPF_HWPOISON  11
 
 #define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
 
@@ -118,6 +119,9 @@
 			kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
 			kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
 			kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
+#ifdef CONFIG_MEMORY_FAILURE
+		uflags |= kpf_copy_bit(kflags, KPF_HWPOISON, PG_hwpoison);
+#endif
 
 		if (put_user(uflags, out++)) {
 			ret = -EFAULT;


* [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
  2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
  2009-05-29 21:35 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Needed for a later patch that walks rmap entries on its own.

This used to be very frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's ok now to export it.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    6 ++++++
 mm/rmap.c            |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/rmap.h	2009-05-29 23:33:30.000000000 +0200
@@ -118,6 +118,12 @@
 }
 #endif
 
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
 #else	/* !CONFIG_MMU */
 
 #define anon_vma_init()		do {} while (0)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c	2009-05-29 23:33:30.000000000 +0200
@@ -191,7 +191,7 @@
  * Getting a lock on a stable anon_vma from a page off the LRU is
  * tricky: page_lock_anon_vma rely on RCU to guard against the races.
  */
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
 {
 	struct anon_vma *anon_vma;
 	unsigned long anon_mapping;
@@ -211,7 +211,7 @@
 	return NULL;
 }
 
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
 {
 	spin_unlock(&anon_vma->lock);
 	rcu_read_unlock();


* [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (2 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Memory migration uses special swap entry types to trigger special actions on 
page faults. Extend this mechanism to also support poisoned swap entries, to 
trigger poison handling on page faults. This allows follow-on patches to 
prevent processes from faulting in poisoned pages again.

v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)
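
As a worked example of the numbering, the sketch below evaluates the macros
from this patch in user space, assuming both CONFIG_MIGRATION and
CONFIG_MEMORY_FAILURE are enabled. The helpers non_swap_type() and
is_hwpoison_type() are simplified stand-ins that operate on a bare type
number rather than on a swp_entry_t.

```c
#include <assert.h>

#define MAX_SWAPFILES_SHIFT	5

/* Assumption: CONFIG_MIGRATION=y and CONFIG_MEMORY_FAILURE=y. */
#define SWP_MIGRATION_NUM	2
#define SWP_HWPOISON_NUM	1

#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

#define SWP_HWPOISON		MAX_SWAPFILES
#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)

/* The highest special type must still fit in the 5-bit type field. */
static_assert(SWP_MIGRATION_WRITE == (1 << MAX_SWAPFILES_SHIFT) - 1,
	      "special swap types overflow the type field");

/* Model of non_swap_entry(): any type at or above MAX_SWAPFILES is special. */
static inline int non_swap_type(unsigned int type)
{
	return type >= (unsigned int)MAX_SWAPFILES;
}

/* Model of is_hwpoison_entry() on a bare type number. */
static inline int is_hwpoison_type(unsigned int type)
{
	return type == (unsigned int)SWP_HWPOISON;
}
```

With these settings the 32 possible types split into 29 real swap files
(types 0 to 28), the poison type (29) and the two migration types (30, 31).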

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/swap.h    |   34 ++++++++++++++++++++++++++++------
 include/linux/swapops.h |   38 ++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c           |    4 ++--
 3 files changed, 68 insertions(+), 8 deletions(-)

Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/swap.h	2009-05-29 23:32:10.000000000 +0200
@@ -34,16 +34,38 @@
  * the type/offset into the pte as 5/27 as well.
  */
 #define MAX_SWAPFILES_SHIFT	5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES		(1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ	(MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
 #else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES		((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ	MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE	(MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
 #endif
 
 /*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON		MAX_SWAPFILES
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
  * Magic header for a swap area. The first part of the union is
  * what the swap magic looks like for the old (limited to 128MB)
  * swap area format, the second part of the union adds - in the
Index: linux/include/linux/swapops.h
===================================================================
--- linux.orig/include/linux/swapops.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/swapops.h	2009-05-29 23:32:10.000000000 +0200
@@ -131,3 +131,41 @@
 
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+	return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+	return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return swp_type(entry) >= MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+	return 0;
+}
+#endif
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/swapfile.c	2009-05-29 23:32:10.000000000 +0200
@@ -579,7 +579,7 @@
 	struct swap_info_struct *p;
 	struct page *page = NULL;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	p = swap_info_get(entry);
@@ -1949,7 +1949,7 @@
 	unsigned long offset, type;
 	int result = 0;
 
-	if (is_migration_entry(entry))
+	if (non_swap_entry(entry))
 		return 1;
 
 	type = swp_type(entry);


* [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (3 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3 Andi Kleen
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Add new SIGBUS codes for reporting machine checks as signals. When 
the hardware detects an uncorrected ECC error it can trigger these
signals.

This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but it might also be useful for other
programs. I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS: BUS_MCEERR_AO and BUS_MCEERR_AR.
  * BUS_MCEERR_AO is for "Action Optional" machine checks, which means that
    some corruption has been detected in the background, but nothing has been
    consumed so far. The program can ignore those if it wants (but most
    programs would get killed anyway).
  * BUS_MCEERR_AR is for "Action Required" machine checks. This happens
    when corrupted data is consumed or the application ran into an area
    which is already known to be corrupted. These require immediate
    action; execution cannot simply resume at the faulting access. Most
    programs would kill themselves.
- They report the address of the corruption in the user address space
  in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
  to user space. That's currently always a (small) page. The user application
  cannot tell where in this page the corruption happened.
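
A rough sketch of how a user program might interpret these fields, assuming
si_addr_lsb holds the log2 of the affected range (a page in this series).
The MODEL_ constants carry the user-space values of the new si_codes (4 and
5, since __SI_FAULT is not part of the user-visible value); the enum and the
helper names are hypothetical, and a real handler would read the inputs from
the siginfo_t passed to an SA_SIGINFO handler.

```c
#include <assert.h>
#include <stdint.h>

/* User-space values of the new si_codes defined by this patch. */
#define MODEL_BUS_MCEERR_AR 4
#define MODEL_BUS_MCEERR_AO 5

enum mce_response { MCE_NOT_MCE, MCE_MAY_CONTINUE, MCE_MUST_ACT };

/* Decide how urgent a SIGBUS is from its si_code. */
static inline enum mce_response classify_sigbus(int si_code)
{
	switch (si_code) {
	case MODEL_BUS_MCEERR_AO:
		return MCE_MAY_CONTINUE;	/* detected, not yet consumed */
	case MODEL_BUS_MCEERR_AR:
		return MCE_MUST_ACT;		/* corrupted data was touched */
	default:
		return MCE_NOT_MCE;		/* some other kind of SIGBUS */
	}
}

/* Size in bytes of the corrupted range described by si_addr_lsb. */
static inline uintptr_t mce_range_size(short addr_lsb)
{
	assert(addr_lsb >= 0 && addr_lsb < (short)(sizeof(uintptr_t) * 8));
	return (uintptr_t)1 << addr_lsb;
}

/* First byte of the corrupted range that contains si_addr. */
static inline uintptr_t mce_range_start(uintptr_t si_addr, short addr_lsb)
{
	return si_addr & ~(mce_range_size(addr_lsb) - 1);
}
```

An application that keeps reconstructible data (a cache, for example) could
use the range to drop and rebuild just the affected page instead of exiting.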

AK: I plan to write a man page update before anyone asks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/siginfo.h |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux/include/asm-generic/siginfo.h
===================================================================
--- linux.orig/include/asm-generic/siginfo.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/asm-generic/siginfo.h	2009-05-29 23:32:10.000000000 +0200
@@ -82,6 +82,7 @@
 #ifdef __ARCH_SI_TRAPNO
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
+			short _addr_lsb; /* LSB of the reported address */
 		} _sigfault;
 
 		/* SIGPOLL */
@@ -112,6 +113,7 @@
 #ifdef __ARCH_SI_TRAPNO
 #define si_trapno	_sifields._sigfault._trapno
 #endif
+#define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 
@@ -192,7 +194,11 @@
 #define BUS_ADRALN	(__SI_FAULT|1)	/* invalid address alignment */
 #define BUS_ADRERR	(__SI_FAULT|2)	/* non-existant physical address */
 #define BUS_OBJERR	(__SI_FAULT|3)	/* object specific hardware error */
-#define NSIGBUS		3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR	(__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO	(__SI_FAULT|5)
+#define NSIGBUS		5
 
 /*
  * SIGTRAP si_codes


* [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (4 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
  architectures have to explicitly enable poison page support, so
  this is forward compatible with all architectures. They only need
  to add it when they enable poison page support.
- Add poison page handling to the swap-in fault code.

v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)
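
The effect on the get_user_pages error path can be modelled in user space as
follows. The VM_FAULT_* values are the ones used in this patch and in mm.h;
gup_error_result() is a hypothetical helper that mirrors only the branch
touched by the diff, with abort() standing in for BUG().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define VM_FAULT_OOM		0x0001
#define VM_FAULT_SIGBUS		0x0002
#define VM_FAULT_HWPOISON	0x0010
#define VM_FAULT_ERROR \
	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)

static_assert((VM_FAULT_HWPOISON & (VM_FAULT_OOM | VM_FAULT_SIGBUS)) == 0,
	      "the new code must use its own bit");

/*
 * Model of the error branch in the get_user_pages loop. `done` is the
 * number of pages already processed; a partial result wins over the
 * error code. Only call this when (ret & VM_FAULT_ERROR) is non-zero.
 */
static inline long gup_error_result(unsigned int ret, long done)
{
	if (ret & VM_FAULT_OOM)
		return done ? done : -ENOMEM;
	if (ret & (VM_FAULT_HWPOISON | VM_FAULT_SIGBUS))
		return done ? done : -EFAULT;
	abort();	/* stands in for BUG() */
}
```

So a poisoned page looks like any other bad address to get_user_pages
callers: they see -EFAULT, or a short count if some pages were already done.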

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/mm.h |    3 ++-
 mm/memory.c        |   18 +++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2009-05-29 23:32:09.000000000 +0200
+++ linux/mm/memory.c	2009-05-29 23:33:31.000000000 +0200
@@ -1315,7 +1315,8 @@
 				if (ret & VM_FAULT_ERROR) {
 					if (ret & VM_FAULT_OOM)
 						return i ? i : -ENOMEM;
-					else if (ret & VM_FAULT_SIGBUS)
+					if (ret &
+					    (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
 						return i ? i : -EFAULT;
 					BUG();
 				}
@@ -2459,8 +2460,15 @@
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
-	if (is_migration_entry(entry)) {
-		migration_entry_wait(mm, pmd, address);
+	if (unlikely(non_swap_entry(entry))) {
+		if (is_migration_entry(entry)) {
+			migration_entry_wait(mm, pmd, address);
+		} else if (is_hwpoison_entry(entry)) {
+			ret = VM_FAULT_HWPOISON;
+		} else {
+			print_bad_pte(vma, address, pte, NULL);
+			ret = VM_FAULT_OOM;
+		}
 		goto out;
 	}
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2484,6 +2492,10 @@
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
+	} else if (PageHWPoison(page)) {
+		ret = VM_FAULT_HWPOISON;
+		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+		goto out;
 	}
 
 	lock_page(page);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-05-29 23:32:09.000000000 +0200
+++ linux/include/linux/mm.h	2009-05-29 23:33:29.000000000 +0200
@@ -702,11 +702,12 @@
 #define VM_FAULT_SIGBUS	0x0002
 #define VM_FAULT_MAJOR	0x0004
 #define VM_FAULT_WRITE	0x0008	/* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned page */
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 
-#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
 /*
  * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.


* [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (5 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Bail out early when hardware poisoned pages are found in page fault handling.
Since they are poisoned they should not be freshly mapped into processes,
because that would cause another (potentially deadly) machine check.

This is generally handled in the same way as OOM, just a different
error code is returned to the architecture code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/memory.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/memory.c	2009-05-29 23:32:10.000000000 +0200
@@ -2659,6 +2659,9 @@
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
 		return ret;
 
+	if (unlikely(PageHWPoison(vmf.page)))
+		return VM_FAULT_HWPOISON;
+
 	/*
 	 * For consistency in subsequent calls, make the faulted page always
 	 * locked.


* [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (6 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is 
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.
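
The selection of the signal fields can be sketched in user space like this,
assuming CONFIG_MEMORY_FAILURE is enabled and 4 KiB pages. struct
sigbus_fields and sigbus_for_fault() are hypothetical names; the _MODEL
constants are the user-space values of the si_codes.

```c
#include <assert.h>

#define VM_FAULT_SIGBUS		0x0002
#define VM_FAULT_HWPOISON	0x0010
#define PAGE_SHIFT		12	/* 4 KiB pages, as on x86 */

/* User-space values of the si_codes involved. */
#define BUS_ADRERR_MODEL	2
#define BUS_MCEERR_AR_MODEL	4

struct sigbus_fields {
	int si_code;
	short si_addr_lsb;
};

/* Pick si_code and si_addr_lsb for a fault that ends in SIGBUS. */
static inline struct sigbus_fields sigbus_for_fault(unsigned int fault)
{
	struct sigbus_fields f = { BUS_ADRERR_MODEL, 0 };

	assert(fault & (VM_FAULT_SIGBUS | VM_FAULT_HWPOISON));

	if (fault & VM_FAULT_HWPOISON)
		f.si_code = BUS_MCEERR_AR_MODEL;

	/* Only the machine check case reports an extent: one page. */
	f.si_addr_lsb = f.si_code == BUS_MCEERR_AR_MODEL ? PAGE_SHIFT : 0;
	return f;
}
```

An ordinary SIGBUS keeps BUS_ADRERR with a zero si_addr_lsb, so existing
handlers that never look at the new field see no change.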

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 arch/x86/mm/fault.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c	2009-05-29 23:32:09.000000000 +0200
+++ linux/arch/x86/mm/fault.c	2009-05-29 23:32:10.000000000 +0200
@@ -189,6 +189,7 @@
 	info.si_errno	= 0;
 	info.si_code	= si_code;
 	info.si_addr	= (void __user *)address;
+	info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
 
 	force_sig_info(si_signo, &info, tsk);
 }
@@ -827,10 +828,12 @@
 }
 
 static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+	  unsigned int fault)
 {
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
+	int code = BUS_ADRERR;
 
 	up_read(&mm->mmap_sem);
 
@@ -846,7 +849,14 @@
 	tsk->thread.error_code	= error_code;
 	tsk->thread.trap_no	= 14;
 
-	force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+	if (fault & VM_FAULT_HWPOISON) {
+		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
+			tsk->comm, tsk->pid);
+		code = BUS_MCEERR_AR;
+	}
+#endif
+	force_sig_info_fault(SIGBUS, code, address, tsk);
 }
 
 static noinline void
@@ -856,8 +866,8 @@
 	if (fault & VM_FAULT_OOM) {
 		out_of_memory(regs, error_code, address);
 	} else {
-		if (fault & VM_FAULT_SIGBUS)
-			do_sigbus(regs, error_code, address);
+		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+			do_sigbus(regs, error_code, address, fault);
 		else
 			BUG();
 	}


* [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (7 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: Lee.Schermerhorn, npiggin, akpm, linux-kernel, linux-mm,
	fengguang.wu


try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours
(e.g. migration not only sets up migration ptes but also turns off aging).
Also the different flags interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, unmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.
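
The split between action code and modifier flags can be sketched in
plain userspace C. This is only a minimal model: the enum values are
copied from the rmap.h hunk below, while ttu_action_name(),
ttu_honours_mlock() and ttu_does_aging() are hypothetical helpers made up
for illustration and do not exist in the kernel.

```c
/* Values copied from the include/linux/rmap.h hunk in this patch. */
enum ttu_flags {
	TTU_UNMAP = 0,			/* unmap mode */
	TTU_MIGRATION = 1,		/* migration mode */
	TTU_MUNLOCK = 2,		/* munlock mode */
	TTU_ACTION_MASK = 0xff,

	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

/* Hypothetical helper: name of the action stored in the low byte. */
static const char *ttu_action_name(int flags)
{
	switch (TTU_ACTION(flags)) {
	case TTU_UNMAP:
		return "unmap";
	case TTU_MIGRATION:
		return "migration";
	case TTU_MUNLOCK:
		return "munlock";
	default:
		return "unknown";
	}
}

/* Hypothetical helpers mirroring the two checks in try_to_unmap_one(). */
static int ttu_honours_mlock(int flags)
{
	return !(flags & TTU_IGNORE_MLOCK);
}

static int ttu_does_aging(int flags)
{
	return !(flags & TTU_IGNORE_ACCESS);
}
```

With these definitions the migration caller's argument,
TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS, decodes to the
migration action with both the mlock check and aging switched off, while
plain TTU_UNMAP keeps both.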

This patch is supposed to cause no change in behaviour. If anyone can
show that it does, that is a bug.

Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |   14 +++++++++++++-
 mm/migrate.c         |    2 +-
 mm/rmap.c            |   40 ++++++++++++++++++++++------------------
 mm/vmscan.c          |    2 +-
 4 files changed, 37 insertions(+), 21 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/rmap.h	2009-05-29 23:33:30.000000000 +0200
@@ -84,7 +84,19 @@
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+	TTU_UNMAP = 0,			/* unmap mode */
+	TTU_MIGRATION = 1,		/* migration mode */
+	TTU_MUNLOCK = 2,		/* munlock mode */
+	TTU_ACTION_MASK = 0xff,
+
+	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
+	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /*
  * Called from mm/filemap_xip.c to unmap empty zero page
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c	2009-05-29 23:33:30.000000000 +0200
@@ -755,7 +755,7 @@
  * repeatedly from either try_to_unmap_anon or try_to_unmap_file.
  */
 static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
-				int migration)
+				enum ttu_flags flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
@@ -777,11 +777,13 @@
 	 * If it's recently referenced (perhaps page_referenced
 	 * skipped over this mm) then we should reactivate it.
 	 */
-	if (!migration) {
+	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED) {
 			ret = SWAP_MLOCK;
 			goto out_unmap;
 		}
+	}
+	if (!(flags & TTU_IGNORE_ACCESS)) {
 		if (ptep_clear_flush_young_notify(vma, address, pte)) {
 			ret = SWAP_FAIL;
 			goto out_unmap;
@@ -821,12 +823,12 @@
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			BUG_ON(!migration);
+			BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
 			entry = make_migration_entry(page, pte_write(pteval));
 		}
 		set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
 		BUG_ON(pte_file(*pte));
-	} else if (PAGE_MIGRATION && migration) {
+	} else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
 		/* Establish migration entry for a file page */
 		swp_entry_t entry;
 		entry = make_migration_entry(page, pte_write(pteval));
@@ -995,12 +997,13 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma;
 	struct vm_area_struct *vma;
 	unsigned int mlocked = 0;
 	int ret = SWAP_AGAIN;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1016,7 +1019,7 @@
 				continue;  /* must visit all unlocked vmas */
 			ret = SWAP_MLOCK;  /* saw at least one mlocked vma */
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				break;
 		}
@@ -1040,8 +1043,7 @@
 /**
  * try_to_unmap_file - unmap/unlock file page using the object-based rmap method
  * @page: the page to unmap/unlock
- * @unlock:  request for unlock rather than unmap [unlikely]
- * @migration:  unmapping for migration - ignored if @unlock
+ * @flags: action and flags
  *
  * Find all the mappings of a page using the mapping pointer and the vma chains
  * contained in the address_space struct it points to.
@@ -1053,7 +1055,7 @@
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * 'LOCKED.
  */
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
 {
 	struct address_space *mapping = page->mapping;
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1065,6 +1067,7 @@
 	unsigned long max_nl_size = 0;
 	unsigned int mapcount;
 	unsigned int mlocked = 0;
+	int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
 
 	if (MLOCK_PAGES && unlikely(unlock))
 		ret = SWAP_SUCCESS;	/* default for try_to_munlock() */
@@ -1077,7 +1080,7 @@
 				continue;	/* must visit all vmas */
 			ret = SWAP_MLOCK;
 		} else {
-			ret = try_to_unmap_one(page, vma, migration);
+			ret = try_to_unmap_one(page, vma, flags);
 			if (ret == SWAP_FAIL || !page_mapped(page))
 				goto out;
 		}
@@ -1102,7 +1105,8 @@
 			ret = SWAP_MLOCK;	/* leave mlocked == 0 */
 			goto out;		/* no need to look further */
 		}
-		if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+		if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+			(vma->vm_flags & VM_LOCKED))
 			continue;
 		cursor = (unsigned long) vma->vm_private_data;
 		if (cursor > max_nl_cursor)
@@ -1136,7 +1140,7 @@
 	do {
 		list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list) {
-			if (!MLOCK_PAGES && !migration &&
+			if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
 			    (vma->vm_flags & VM_LOCKED))
 				continue;
 			cursor = (unsigned long) vma->vm_private_data;
@@ -1176,7 +1180,7 @@
 /**
  * try_to_unmap - try to remove all page table mappings to a page
  * @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
  *
  * Tries to remove all the page table entries which are mapping this
  * page, used in the pageout path.  Caller must hold the page lock.
@@ -1187,16 +1191,16 @@
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
  */
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
 
 	BUG_ON(!PageLocked(page));
 
 	if (PageAnon(page))
-		ret = try_to_unmap_anon(page, 0, migration);
+		ret = try_to_unmap_anon(page, flags);
 	else
-		ret = try_to_unmap_file(page, 0, migration);
+		ret = try_to_unmap_file(page, flags);
 	if (ret != SWAP_MLOCK && !page_mapped(page))
 		ret = SWAP_SUCCESS;
 	return ret;
@@ -1222,8 +1226,8 @@
 	VM_BUG_ON(!PageLocked(page) || PageLRU(page));
 
 	if (PageAnon(page))
-		return try_to_unmap_anon(page, 1, 0);
+		return try_to_unmap_anon(page, TTU_MUNLOCK);
 	else
-		return try_to_unmap_file(page, 1, 0);
+		return try_to_unmap_file(page, TTU_MUNLOCK);
 }
 #endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c	2009-05-29 23:32:09.000000000 +0200
+++ linux/mm/vmscan.c	2009-05-29 23:32:10.000000000 +0200
@@ -666,7 +666,7 @@
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, 0)) {
+			switch (try_to_unmap(page, TTU_UNMAP)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c	2009-05-29 23:32:09.000000000 +0200
+++ linux/mm/migrate.c	2009-05-29 23:32:10.000000000 +0200
@@ -669,7 +669,7 @@
 	}
 
 	/* Establish migration ptes or remove ptes */
-	try_to_unmap(page, 1);
+	try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(page))
 		rc = move_to_new_page(newpage, page);


* [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (8 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


When a page has the poison bit set, replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.

Also add a new flag, TTU_IGNORE_HWPOISON, to suppress that replacement
(needed for the memory-failure handler later in this series).
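
The decision this adds to try_to_unmap_one() can be modelled in a few
lines of userspace C. This is a rough sketch only: the flag value is
taken from the rmap.h hunk below, but enum pte_replacement and
pick_replacement() are hypothetical names that stand in for the real PTE
handling.

```c
#include <stdbool.h>

enum {
	TTU_IGNORE_HWPOISON = (1 << 10),	/* value from the rmap.h hunk */
};

/* Hypothetical stand-ins for what try_to_unmap_one() installs. */
enum pte_replacement {
	REPLACE_WITH_HWPOISON_ENTRY,	/* later access gets poison handling */
	REPLACE_AS_USUAL,		/* swap, migration or plain unmap path */
};

/* The poison check comes first and is skipped only on explicit request. */
static enum pte_replacement pick_replacement(bool page_hwpoison, int flags)
{
	if (page_hwpoison && !(flags & TTU_IGNORE_HWPOISON))
		return REPLACE_WITH_HWPOISON_ENTRY;
	return REPLACE_AS_USUAL;
}
```

A poisoned page therefore gets a poison entry unless the caller passes
TTU_IGNORE_HWPOISON, in which case the existing paths run unchanged.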

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/linux/rmap.h |    1 +
 mm/rmap.c            |    9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c	2009-05-29 23:33:29.000000000 +0200
@@ -801,7 +801,14 @@
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
 
-	if (PageAnon(page)) {
+	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+		if (PageAnon(page))
+			dec_mm_counter(mm, anon_rss);
+		else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+			dec_mm_counter(mm, file_rss);
+		set_pte_at(mm, address, pte,
+				swp_entry_to_pte(make_hwpoison_entry(page)));
+	} else if (PageAnon(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
 
 		if (PageSwapCache(page)) {
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/rmap.h	2009-05-29 23:32:10.000000000 +0200
@@ -93,6 +93,7 @@
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 


* [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty()
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (9 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Bail out early in set_page_dirty for poisoned pages. We don't want any
of the dirty accounting done or file system write back started, because
the page will just be thrown away. Only the page's dirty flag is set.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page-writeback.c |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2009-05-29 23:32:08.000000000 +0200
+++ linux/mm/page-writeback.c	2009-05-29 23:32:10.000000000 +0200
@@ -1277,6 +1277,10 @@
 {
 	struct address_space *mapping = page_mapping(page);
 
+	if (unlikely(PageHWPoison(page))) {
+		SetPageDirty(page);
+		return 0;
+	}
 	if (likely(mapping)) {
 		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
 #ifdef CONFIG_BLOCK


* [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (10 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: fengguang.wu, akpm, linux-kernel, linux-mm


From: Wu Fengguang <fengguang.wu@intel.com>

If memory corruption hits free buddy pages, we can safely ignore them.
No one will access them until page allocation time, and then prep_new_page()
will automatically check and isolate the PG_hwpoison page for us (for
order-0 allocations).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case, allocating only a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages. Hopefully
that is not a big problem because the event should be rare enough.
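
The per-component check, and why neighbors get dropped, can be sketched
in userspace C. This is a simplified model, not the kernel code: struct
fake_page, first_bad_page() and prep_block() are hypothetical, and a
single bool stands in for all the conditions prep_new_page() tests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified page descriptor. */
struct fake_page {
	bool hwpoison;	/* stands in for PG_hwpoison in page->flags */
};

/*
 * Return the index of the first bad page in a block of (1 << order)
 * pages, or -1 if every component page is clean.
 */
static long first_bad_page(const struct fake_page *block, unsigned int order)
{
	size_t i, n = (size_t)1 << order;

	for (i = 0; i < n; i++)
		if (block[i].hwpoison)
			return (long)i;
	return -1;
}

/* Model of the new check: nonzero means the whole block is rejected. */
static int prep_block(const struct fake_page *block, unsigned int order)
{
	return first_bad_page(block, order) >= 0;
}
```

For a four page block whose third page is poisoned, an order-0 or
order-1 request starting at the first page passes, but an order-2
request is rejected as a whole, which also discards the three clean
pages.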

This patch adds some runtime costs to high order page users.

[AK: Improved description]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/page_alloc.c |   22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c	2009-05-29 23:32:08.000000000 +0200
+++ linux/mm/page_alloc.c	2009-05-29 23:32:11.000000000 +0200
@@ -633,12 +633,22 @@
  */
 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 {
-	if (unlikely(page_mapcount(page) |
-		(page->mapping != NULL)  |
-		(page_count(page) != 0)  |
-		(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
-		bad_page(page);
-		return 1;
+	int i;
+
+	for (i = 0; i < (1 << order); i++) {
+		struct page *p = page + i;
+
+		if (unlikely(page_mapcount(p) |
+			(p->mapping != NULL)  |
+			(page_count(p) != 0)  |
+			(p->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
+			/*
+			 * The whole array of pages will be dropped,
+			 * hopefully this is a rare and abnormal event.
+			 */
+			bad_page(p);
+			return 1;
+		}
 	}
 
 	set_page_private(page, 0);


* [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (11 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-06-01 11:16   ` Nick Piggin
  2009-05-29 21:35 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: hugh, npiggin, riel, chris.mason, akpm, linux-kernel, linux-mm,
	fengguang.wu


Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) at the Linux level. The goal is to prevent everyone
from accessing these pages in the future.

This is done at the VM level by marking a page hwpoisoned
and taking the appropriate action based on the type of page
it is.

The code that does this is portable and lives in mm/memory-failure.c

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
  killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.
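
From the application side the two strategies show up as two different
SIGBUS si_code values: BUS_MCEERR_AO for the advisory signal sent by
this handler and BUS_MCEERR_AR for the signal raised from the page fault
path. Below is a minimal userspace sketch of telling them apart.
classify_sigbus(), poison_range_bytes() and enum poison_response are
hypothetical names, and the fallback values 4 and 5 are an assumption
for C libraries whose headers do not define the codes yet.

```c
#include <signal.h>

/* Assumed fallbacks for headers that lack the new si_code values. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

enum poison_response {
	NOT_POISON,		/* ordinary SIGBUS such as BUS_ADRERR */
	MUST_STOP_USING,	/* action required: the access hit poison */
	MAY_RECOVER,		/* action optional: advisory notification */
};

/* Map the si_code of a SIGBUS to what the application should do. */
static enum poison_response classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return MUST_STOP_USING;
	case BUS_MCEERR_AO:
		return MAY_RECOVER;
	default:
		return NOT_POISON;
	}
}

/* si_addr_lsb gives the granularity: 1 << lsb bytes are affected. */
static unsigned long poison_range_bytes(unsigned int addr_lsb)
{
	return 1UL << addr_lsb;
}
```

A handler installed with sigaction() and SA_SIGINFO would pass its
si_code to classify_sigbus(). Since the kernel side sets si_addr_lsb to
PAGE_SHIFT, a system with 4K pages reports a 4096 byte range.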

The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.

v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu: 
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)

Cc: hugh@veritas.com
Cc: npiggin@suse.de
Cc: riel@redhat.com
Cc: chris.mason@oracle.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>

---
 Documentation/sysctl/vm.txt |   21 +
 fs/proc/meminfo.c           |    9 
 include/linux/mm.h          |    4 
 kernel/sysctl.c             |   14 
 mm/Kconfig                  |    3 
 mm/Makefile                 |    1 
 mm/filemap.c                |    4 
 mm/memory-failure.c         |  720 ++++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                   |    4 
 9 files changed, 778 insertions(+), 2 deletions(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/Makefile	2009-05-29 23:33:28.000000000 +0200
@@ -38,3 +38,4 @@
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c	2009-05-29 23:32:11.000000000 +0200
@@ -0,0 +1,720 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states.	The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * of mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ *   + left over references when process catches signal?
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully no one will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do.	We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+		       struct vm_area_struct *vma,
+		       struct list_head *to_kill,
+		       struct to_kill **tkc)
+{
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	if (tk->addr == -EFAULT) {
+		printk(KERN_INFO "MCE: Unable to determine user space address during error handling\n");
+		tk->addr = 0;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmapping,
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. Just kill it
+			 * with SIGKILL, which bypasses the signal handlers.
+			 */
+			if (fail) {
+				printk(KERN_ERR
+		"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+				force_sig(SIGKILL, tk->tsk);
+			}
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+					      pfn) < 0)
+				printk(KERN_ERR
+		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av = page_lock_anon_vma(page);
+
+	if (av == NULL)	/* Not actually mapped anymore */
+		return;
+
+	read_lock(&tasklist_lock);
+	for_each_process (tsk) {
+		if (!tsk->mm)
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page_mapping(page);
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!tsk->mm)
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff)
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+	/* memory allocation failure is implicitly handled */
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,
+	DELAYED,
+	IGNORED,
+	RECOVERED,
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",		/* Error handling failed */
+	[DELAYED] = "Delayed",		/* Will be handled later */
+	[IGNORED] = "Ignored",		/* Error safely ignored */
+	[RECOVERED] = "Recovered",	/* Successfully recovered */
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	if (page_has_private(p))
+		do_invalidatepage(p, 0);
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+			page_to_pfn(p));
+
+	/*
+	 * remove_from_page_cache assumes (mapping && !mapped)
+	 */
+	if (page_mapping(p) && !page_mapped(p)) {
+		remove_from_page_cache(p);
+		page_cache_release(p);
+	}
+
+	return RECOVERED;
+}
+
+/*
+ * Dirty page cache page
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
+			page_to_pfn(p));
+	if (mapping) {
+		/*
+		 * Truncate does the same, but we're not quite the same
+		 * as truncate. This doesn't try to unallocate blocks
+		 * on disk or make the file shorter. It's more like a
+		 * "temporary hole punch".
+		 * Needs more checking, but keep it for now.
+		 */
+		cancel_dirty_page(p, PAGE_CACHE_SIZE);
+
+		/*
+		 * IO error will be reported by write(), fsync(), etc.
+		 * who check the mapping.
+		 * This way the application knows that something went
+		 * wrong with its dirty file data.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	me_pagecache_clean(p);
+
+	/*
+	 * Did the earlier release work?
+	 */
+	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+		return FAILED;
+
+	return RECOVERED;
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache(ie. page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ *      - clear dirty bit to prevent IO
+ *      - remove from LRU
+ *      - but keep in the swap cache, so that when we return to it on
+ *        a later page fault, we know the application is accessing
+ *        corrupted data and shall be killed (we installed simple
+ *        interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+	ClearPageDirty(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p)
+{
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	delete_from_swap_cache(p);
+
+	return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define swapcache	(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlocked		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+/*
+ * The table is > 80 columns because all the alternatives were much worse.
+ */
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",		me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,			slab,		"kernel slab",		me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,			head,		"hugetlb",		me_huge_page },
+	{ tail,			tail,		"hugetlb",		me_huge_page },
+#else
+	{ compound,		compound,	"hugetlb",		me_huge_page },
+#endif
+
+	{ swapcache|dirty,	swapcache|dirty,"dirty swapcache",	me_swapcache_dirty },
+	{ swapcache|dirty,	swapcache,	"clean swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty,	unevict|dirty,	"unevictable dirty lru", me_pagecache_dirty },
+	{ unevict,		unevict,	"unevictable lru",	me_pagecache_clean },
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlocked|dirty,	mlocked|dirty,	"mlocked dirty lru",	me_pagecache_dirty },
+	{ mlocked,		mlocked,	"mlocked lru",		me_pagecache_clean },
+#endif
+
+	{ lru|dirty,		lru|dirty,	"dirty lru",		me_pagecache_dirty },
+	{ lru|dirty,		lru,		"clean lru",		me_pagecache_clean },
+	{ swapbacked,		swapbacked,	"anonymous",		me_pagecache_clean },
+
+	/*
+	 * Add more states here.
+	 */
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,			0,		"unknown page state",	me_unknown },
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+			unsigned long pfn)
+{
+	int ret;
+
+	printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
+	ret = action(p);
+	printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
+	       pfn, msg, action_name[ret]);
+	if (page_count(p) != 1)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, msg, page_count(p) - 1);
+
+	/* Could do more checks here if page looks ok */
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	int kill = sysctl_memory_failure_early_kill;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int ret;
+	int i;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain();
+
+	/*
+	 * This check implies we don't kill processes if their pages
+	 * are in the swap cache early. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Poisoned clean file pages are harmless, the
+	 * data can be restored by regular page faults.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && !PageWriteback(p) &&
+	    !PageAnon(p) && !PageSwapBacked(p) &&
+	    mapping && mapping_cap_account_dirty(mapping)) {
+		if (page_mkclean(p))
+			SetPageDirty(p);
+		else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped.  This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * This also has the side effect to propagate the dirty
+	 * bit from PTEs into the struct page. This is needed
+	 * to actually decide if something needs to be killed
+	 * or errored, or if it's ok to just drop the page.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 *
+	 * RED-PEN some cases in process exit seem to deadlock
+	 * on the page lock. drop it or add poison checks?
+	 */
+	if (kill)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn,  ret);
+	}
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not.  Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed.  When there was a problem unmapping earlier
+	 * use a more forceful uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ *
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	struct page_state *ps;
+	struct page *p;
+
+	if (!pfn_valid(pfn)) {
+		printk(KERN_ERR
+   "MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
+		       pfn);
+		return;
+	}
+
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
+		return;
+	}
+
+	/*
+	 * We need/can do nothing about count=0 pages.
+	 * 1) it's a free page, and therefore in safe hand:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    Implies some kernel user: cannot stop them from
+	 *    R/W the page; let's pray that the page has been
+	 *    used and will be freed some time later.
+	 * In fact it's dangerous to directly bump up page count from 0,
+	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		printk(KERN_ERR
+		       "MCE 0x%lx: ignoring free or high order page\n", pfn);
+		return;
+	}
+
+	/*
+	 * Lock the page and wait for writeback to finish.
+	 * It's very difficult to mess with pages currently under IO
+	 * and in many cases impossible, so we just avoid it here.
+	 */
+	lock_page_nosync(p);
+	wait_on_page_writeback(p);
+
+	/*
+	 * Now take care of user space mappings.
+	 */
+	hwpoison_user_mappings(p, pfn, trapno);
+
+	/* Torn down by someone else? */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		printk(KERN_ERR
+		       "MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
+		goto out;
+	}
+
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			page_action(ps->msg, p, ps->action, pfn);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h	2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/mm.h	2009-05-29 23:32:11.000000000 +0200
@@ -1322,6 +1322,10 @@
 
 extern void *alloc_locked_buffer(size_t size);
 extern void free_locked_buffer(void *buffer, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
 extern void release_locked_buffer(void *buffer, size_t size);
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2009-05-29 23:32:07.000000000 +0200
+++ linux/kernel/sysctl.c	2009-05-29 23:32:11.000000000 +0200
@@ -1282,6 +1282,20 @@
 		.proc_handler	= &scan_unevictable_handler,
 	},
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_early_kill",
+		.data		= &sysctl_memory_failure_early_kill,
+		.maxlen		= sizeof(sysctl_memory_failure_early_kill),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c	2009-05-29 23:32:07.000000000 +0200
+++ linux/fs/proc/meminfo.c	2009-05-29 23:32:11.000000000 +0200
@@ -97,7 +97,11 @@
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"BadPages:       %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -144,6 +148,9 @@
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/Kconfig	2009-05-29 23:33:29.000000000 +0200
@@ -226,6 +226,9 @@
 config MMU_NOTIFIER
 	bool
 
+config MEMORY_FAILURE
+	bool
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt	2009-05-29 23:32:07.000000000 +0200
+++ linux/Documentation/sysctl/vm.txt	2009-05-29 23:32:11.000000000 +0200
@@ -32,6 +32,7 @@
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
+- memory_failure_early_kill
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
@@ -53,7 +54,6 @@
 - vfs_cache_pressure
 - zone_reclaim_mode
 
-
 ==============================================================
 
 block_dump
@@ -275,6 +275,25 @@
 
 The default value is 65536.
 
+=============================================================
+
+memory_failure_early_kill:
+
+Control how to kill processes when an uncorrected memory error (typically
+a 2bit error in a memory module) is detected in the background by hardware.
+
+1: Kill all processes that have the corrupted page mapped as soon as the
+corruption is detected.
+
+0: Only unmap the page from all processes and kill a process only
+when it tries to access the page.
+
+The kill is done using a catchable SIGBUS, so processes can handle this
+if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
 ==============================================================
 
 min_free_kbytes:
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c	2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/filemap.c	2009-05-29 23:32:11.000000000 +0200
@@ -105,6 +105,10 @@
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
+ *
+ *  (code doesn't rely on that order, so you could switch it around)
+ *  ->tasklist_lock             (memory_failure, collect_procs_file)
+ *    ->i_mmap_lock
  */
 
 /*
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c	2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c	2009-05-29 23:32:11.000000000 +0200
@@ -36,6 +36,10 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ *   anon_vma->lock      (memory_failure, collect_procs_anon)
  */
 
 #include <linux/mm.h>
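
For readers following the error_states[] table in memory_failure(): the
lookup is a first-match scan where each entry tests (flags & mask) == res.
The block below is only a user-space sketch of that matching rule. The bit
numbers and the shortened table are stand-ins chosen for illustration, not
the kernel's PG_* values, and classify() is a hypothetical helper that
returns the message instead of calling a handler.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in bit numbers; the real PG_* values come from page-flags.h. */
enum { PG_dirty, PG_lru, PG_swapcache, PG_reserved };

#define BIT(n) (1UL << (n))

struct page_state {
	unsigned long mask;	/* which flag bits this entry looks at */
	unsigned long res;	/* the value those bits must have */
	const char *msg;
};

/* Order matters: the first matching entry wins. */
static const struct page_state error_states[] = {
	{ BIT(PG_reserved), BIT(PG_reserved), "reserved kernel" },
	{ BIT(PG_swapcache) | BIT(PG_dirty),
	  BIT(PG_swapcache) | BIT(PG_dirty), "dirty swapcache" },
	{ BIT(PG_swapcache) | BIT(PG_dirty),
	  BIT(PG_swapcache), "clean swapcache" },
	{ BIT(PG_lru) | BIT(PG_dirty),
	  BIT(PG_lru) | BIT(PG_dirty), "dirty lru" },
	{ BIT(PG_lru) | BIT(PG_dirty), BIT(PG_lru), "clean lru" },
	/* Catch-all: (flags & 0) == 0 is always true, so it must be last. */
	{ 0, 0, "unknown page state" },
};

/* Return the message of the first entry whose masked bits match. */
static const char *classify(unsigned long flags)
{
	const struct page_state *ps;

	for (ps = error_states; ; ps++)
		if ((flags & ps->mask) == ps->res)
			return ps->msg;
}
```

A "clean" entry is expressed by putting the dirty bit in the mask but not
in res, which is why the dirty variant of each pair has to come first only
for readability, not for correctness: the two entries of a pair can never
both match. The ordering does matter between unrelated entries, for example
a page that is both in the swap cache and on the LRU is reported as
swapcache because that entry comes earlier.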

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4
  2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
@ 2009-06-01 11:16   ` Nick Piggin
  2009-06-01 12:46     ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2009-06-01 11:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: hugh, riel, chris.mason, akpm, linux-kernel, linux-mm,
	fengguang.wu

On Fri, May 29, 2009 at 11:35:39PM +0200, Andi Kleen wrote:
> +	mapping = page_mapping(p);
> +	if (!PageDirty(p) && !PageWriteback(p) &&
> +	    !PageAnon(p) && !PageSwapBacked(p) &&
> +	    mapping && mapping_cap_account_dirty(mapping)) {

Haven't had another good look at this yet, but if you hold the
page locked, and have done a wait_on_page_writeback, then
PageWriteback == true is a kernel bug.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4
  2009-06-01 11:16   ` Nick Piggin
@ 2009-06-01 12:46     ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2009-06-01 12:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, hugh@veritas.com, riel@redhat.com,
	chris.mason@oracle.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, Jun 01, 2009 at 07:16:41PM +0800, Nick Piggin wrote:
> On Fri, May 29, 2009 at 11:35:39PM +0200, Andi Kleen wrote:
> > +	mapping = page_mapping(p);
> > +	if (!PageDirty(p) && !PageWriteback(p) &&
> > +	    !PageAnon(p) && !PageSwapBacked(p) &&
> > +	    mapping && mapping_cap_account_dirty(mapping)) {
> 
> Haven't had another good look at this yet, but if you hold the
> page locked, and have done a wait_on_page_writeback, then
> PageWriteback == true is a kernel bug.

Right, we can eliminate the PageWriteback() test when there is a
wait_on_page_writeback().

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (12 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Normally the memory-failure.c code is enabled by the architecture, but
for easier testing independent of architecture changes enable it unconditionally.

This should not be merged into mainline.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig |    2 ++
 1 file changed, 2 insertions(+)

Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-29 23:32:11.000000000 +0200
+++ linux/mm/Kconfig	2009-05-29 23:33:28.000000000 +0200
@@ -228,6 +228,8 @@
 
 config MEMORY_FAILURE
 	bool
+	default y
+	depends on MMU
 
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (13 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:35 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen
  2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Impact: optional, useful for debugging

Add a new madvise sub command to inject poison for some
pages in a process' address space.  This is useful for
testing the poison page handling.

Open issues:

- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory with a multi threaded
injector one could inject poison for a process foreign page this way.
Not a serious issue right now.

v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)
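
A test program could drive the injector roughly as sketched below. This is
an illustration under assumptions, not part of the patch: page_start() and
poison_page_at() are hypothetical helper names, and the fallback value 12
for MADV_HWPOISON is the one this patch adds to asm-generic/mman.h. If the
system headers already define MADV_HWPOISON, that definition is used
instead, since released kernels may use a different number.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 12	/* value from this patch; headers may differ */
#endif

/* Round an address down to the start of its page (page_size is a power of 2). */
uintptr_t page_start(uintptr_t addr, uintptr_t page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Ask the kernel to treat the page containing addr as hardware poisoned.
 * Needs CAP_SYS_ADMIN. Returns madvise()'s result: 0, or -1 with errno set.
 */
int poison_page_at(void *addr)
{
	uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = page_start((uintptr_t)addr, page_size);

	return madvise((void *)start, page_size, MADV_HWPOISON);
}
```

Calling poison_page_at() on a mapped and touched page is expected to run
memory_failure() on the backing page, so it should only be tried on a
throwaway test machine. Without the capability the call fails with EPERM.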

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 include/asm-generic/mman.h |    1 +
 mm/madvise.c               |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c	2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/madvise.c	2009-05-29 23:32:11.000000000 +0200
@@ -208,6 +208,38 @@
 	return error;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+	/*
+	 * RED-PEN
+	 * This allows root to tie up arbitrary amounts of memory.
+	 * Might be a good idea to disable it inside containers even for root.
+	 */
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	for (; start < end; start += PAGE_SIZE) {
+		struct page *p;
+		int ret = get_user_pages(current, current->mm, start, 1,
+						0, 0, &p, NULL);
+		if (ret != 1)
+			return ret;
+		put_page(p);
+		/*
+		 * RED-PEN page can be reused in a short window, but otherwise
+		 * we'll have to fight with the reference count.
+		 */
+		printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+		       page_to_pfn(p), start);
+		memory_failure(page_to_pfn(p), 0);
+	}
+	return 0;
+}
+#endif
+
 static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
@@ -290,6 +322,11 @@
 	int write;
 	size_t len;
 
+#ifdef CONFIG_MEMORY_FAILURE
+	if (behavior == MADV_HWPOISON)
+		return madvise_hwpoison(start, start+len_in);
+#endif
+
 	write = madvise_need_mmap_write(behavior);
 	if (write)
 		down_write(&current->mm->mmap_sem);
Index: linux/include/asm-generic/mman.h
===================================================================
--- linux.orig/include/asm-generic/mman.h	2009-05-29 23:32:07.000000000 +0200
+++ linux/include/asm-generic/mman.h	2009-05-29 23:32:11.000000000 +0200
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
+#define MADV_HWPOISON	12		/* hw poison the page (root only) */
 
 /* compatibility flags */
 #define MAP_FILE	0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (14 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
  2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
  16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu


Useful for some testing scenarios, although specific testing is often
done better through MADV_HWPOISON.

This can be done with the x86 level MCE injector too, but this interface
allows doing it independently of low level x86 changes.

Open issues: 

Should be disabled for cgroups.
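
From user space the interface is a plain write of a pfn to the debugfs
file. The sketch below shows one way a test tool might do that. It assumes
debugfs is mounted at /sys/kernel/debug (the mount point is not fixed by
this patch), and write_pfn() is a hypothetical helper name. The pfn is
written in decimal to stay independent of how the attribute parses
prefixes.

```c
#include <stdio.h>

/* Assumed location: debugfs mounted at /sys/kernel/debug. */
#define HWPOISON_PFN_FILE "/sys/kernel/debug/hwpoison/corrupt-pfn"

/* Write one pfn to the given file; returns 0 on success, -1 on error. */
static int write_pfn(const char *path, unsigned long long pfn)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%llu\n", pfn) > 0;
	/* The write may only reach the kernel on flush, so check fclose too. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A test tool would call write_pfn(HWPOISON_PFN_FILE, pfn) as root. Taking
the path as a parameter keeps the helper usable against an ordinary file
when trying it out without poisoning anything.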

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 mm/Kconfig           |    4 ++++
 mm/Makefile          |    1 +
 mm/hwpoison-inject.c |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/hwpoison-inject.c	2009-05-29 23:32:11.000000000 +0200
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on an arbitrary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+	memory_failure(val, 18);	/* 18: x86 machine check vector */
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+	if (hwpoison_dir)
+		debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+	hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+	if (hwpoison_dir == NULL)
+		return -ENOMEM;
+	corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+					  NULL, &hwpoison_fops);
+	if (corrupt_pfn == NULL) {
+		pfn_inject_exit();
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig	2009-05-29 23:32:11.000000000 +0200
+++ linux/mm/Kconfig	2009-05-29 23:32:11.000000000 +0200
@@ -231,6 +231,10 @@
 	default y
 	depends on MMU
 
+config HWPOISON_INJECT
+	tristate "Poison pages injector"
+	depends on MEMORY_FAILURE && DEBUG_KERNEL
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile	2009-05-29 23:32:11.000000000 +0200
+++ linux/mm/Makefile	2009-05-29 23:32:11.000000000 +0200
@@ -39,3 +39,4 @@
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
                   ` (15 preceding siblings ...)
  2009-05-29 21:35 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen
@ 2009-05-29 21:52 ` Alan Cox
  2009-05-29 22:24   ` Andi Kleen
  2009-05-30  6:37   ` More thoughts about hwpoison and pageflags compression Andi Kleen
  16 siblings, 2 replies; 36+ messages in thread
From: Alan Cox @ 2009-05-29 21:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu

On Fri, 29 May 2009 23:35:25 +0200 (CEST)
Andi Kleen <andi@firstfloor.org> wrote:

> 
> Another version of the hwpoison patchkit. I addressed 
> all feedback, except:
> I didn't move the handlers into other files for now, prefer
> to keep things together for now
> I'm keeping an own pagepoison bit because I think that's 
> cleaner than any other hacks.
> 
> Andrew, please put it into mm for .31 track.

Andrew please put it on the "Andi needs to justify his pageflags" non-path

I'm with Rik on this - we may have a few pageflags handy now but being
slack with them for an obscure feature that can be done other ways and
isn't performance critical is just lazy and bad planning for the long
term.

Andi - "I'm doing it my way so nyahh, put it into .31" doesn't fly. If
you want it in .31 convince Rik and me and others that its a good use of
a pageflag.

Alan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
@ 2009-05-29 22:24   ` Andi Kleen
  2009-05-30  6:37   ` More thoughts about hwpoison and pageflags compression Andi Kleen
  1 sibling, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 22:24 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu

On Fri, May 29, 2009 at 10:52:02PM +0100, Alan Cox wrote:
> On Fri, 29 May 2009 23:35:25 +0200 (CEST)
> Andi Kleen <andi@firstfloor.org> wrote:
> 
> > 
> > Another version of the hwpoison patchkit. I addressed 
> > all feedback, except:
> > I didn't move the handlers into other files for now, prefer
> > to keep things together for now
> > I'm keeping an own pagepoison bit because I think that's 
> > cleaner than any other hacks.
> > 
> > Andrew, please put it into mm for .31 track.
> 
> Andrew please put it on the "Andi needs to justify his pageflags" non-path
> 
> I'm with Rik on this - we may have a few pageflags handy now but being
> slack with them for an obscure feature that can be done other ways and
> isn't performance critical is just lazy and bad planning for the long
> term.

There's still plenty of space. Especially on 64bit it's an absolute
non problem.

On 32bit the shortage of page flags was really
artificial because there were some caches put into ->flags, but 
these are largely obsolete to my understanding:
- discontigmem is gone (which cached the node)
- non vmap sparsemem is used a few times, but not on large systems
where you have a lot of zones, so you are ok with only having a few bits
for that
- if we really run out of bits on the sparsemem mapping it's easy
enough to do another small hash table for this, similar to the discontig
hash tables.

Also Christoph L. redid the dynamic allocation, so the boundaries
are now dynamically growing/shrinking. This means that if an architecture
doesn't use poison it doesn't use the bit.

> Andi - "I'm doing it my way so nyahh, put it into .31" doesn't fly. If
> you want it in .31 convince Rik and me and others that its a good use of
> a pageflag.

Sorry, you guys also didn't do a very good job explaining why
it is that big a problem to take a page flag. Yes I know it's popular
folklore, but as far as I understand most of the reasons to be so
stingy on them have disappeared over time anyways (but the folklore
stayed for some reason).

Anyways here's my pitch:

It's a straightforward concept expressible as a page flag. Lots
of places need to check for it (we expect there will be more users
in the future). Also even crash dumps should check for it, so
it's important to have a clean interface.

Also it's an optional flag, if there's still an architecture
around which needs special caches in ->flags then it's unlikely
it will turn it on.

Also what's the alternative? Are you suggesting we should do Huffman
encoding on flags or something? That seems just too ugly, especially to solve
a non-problem.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* More thoughts about hwpoison and pageflags compression
  2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
  2009-05-29 22:24   ` Andi Kleen
@ 2009-05-30  6:37   ` Andi Kleen
  2009-05-30  6:53     ` Andrew Morton
  1 sibling, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-30  6:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu

I thought a bit more about Alan's proposal of page flags compression
for poisoned pages. I actually found more problems with it :-)
(in addition to the points I wrote up in my earlier email on the topic)

Just wanted to write them up:

First some basics about hwpoison. 

- HwPoisoning can come in at any time and at any state of the page.
- There can be multiple hwpoison events coming in for the same page in a short time window.
This can happen for example when the hardware detects errors on different cache lines of a page,
which can happen in some DIMM breakage scenarios.
The HwPoison bit serves as a synchronization point for this; it's essentially a lock
for the hwpoison code (although not a spinlock).
- HwPoison is high-level code and should only use portable primitives.

Alan proposed to use reserved|writeback to express hwpoisoning instead
of a separate bit.

- Now the first problem is that we don't have a portable primitive to set
multiple bits atomically. cmpxchg() can only be used in architecture specific
code. So setting the combination wouldn't be atomic, which defeats its locking function.

That means that all multiple bit variants are problematic, or at least
would need a new global atomic primitive.
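
A minimal user-space sketch of the one-bit protocol described above. It
assumes C11 atomics as a stand-in for the kernel's atomic bit operations;
the struct, the bit number and the function names are made up for
illustration and are not the patchkit's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number, for illustration only. */
#define TOY_PG_HWPOISON_BIT 5u

struct toy_page {
	atomic_ulong flags;
};

/*
 * Atomically set the poison bit and report whether it was already set.
 * This is a single-bit test-and-set, so no multi-bit primitive is needed.
 */
bool toy_test_and_set_poison(struct toy_page *p)
{
	unsigned long mask = 1UL << TOY_PG_HWPOISON_BIT;
	unsigned long old;

	assert(p != NULL);
	old = atomic_fetch_or(&p->flags, mask);
	return (old & mask) != 0;
}

/* Returns 1 if this caller owns the recovery, 0 if it must bail out. */
int toy_memory_failure_enter(struct toy_page *p)
{
	if (toy_test_and_set_poison(p))
		return 0;	/* another event already claimed this page */
	return 1;
}
```

Two events racing on the same page both call toy_memory_failure_enter();
exactly one of them gets 1 back, which is the "lock without a spinlock"
behaviour the separate bit provides.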

- Then you can actually have a page in writeback and poisoned. That is
we can't stop writeback (we might at some point in the future), so the order
the code works right now is:

set page poisoned
bail out if it was already poisoned
do some other stuff
lock the page
wait for page writeback
	(which just polls on the bit to clear)

Now the obvious problem is of course, if we used writeback|reserved, how
would it do the poison locking while the page is still in writeback?
The encoding would not be unique.

If we don't do that we would risk multiple memory_failures() on the same
page, which has various issues.

So at least writeback|reserved doesn't work.
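
To make the ambiguity concrete, here is a small user-space model of the
combined encoding. The bit positions and helper names are hypothetical, and
plain (non-atomic) flag words are used on purpose: the problem shown here
exists even before atomicity comes into play.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions, for illustration only. */
#define TOY_PG_RESERVED  (1UL << 0)
#define TOY_PG_WRITEBACK (1UL << 1)
#define TOY_POISON_COMBO (TOY_PG_RESERVED | TOY_PG_WRITEBACK)

/* With the combined encoding, "poisoned" means both bits are set. */
bool toy_combo_poisoned(unsigned long flags)
{
	return (flags & TOY_POISON_COMBO) == TOY_POISON_COMBO;
}

/* Mark the page poisoned by setting both bits. */
unsigned long toy_combo_set_poison(unsigned long flags)
{
	return flags | TOY_POISON_COMBO;
}

/* What I/O completion does: clear the writeback bit. */
unsigned long toy_end_writeback(unsigned long flags)
{
	return flags & ~TOY_PG_WRITEBACK;
}

/* What a waiter polls for: the writeback bit has gone away. */
bool toy_writeback_done(unsigned long flags)
{
	return (flags & TOY_PG_WRITEBACK) == 0;
}

/* Walk through the sequence: page in writeback, then poisoned. */
void toy_combo_demo(void)
{
	unsigned long f = TOY_PG_WRITEBACK;	/* page is under writeback */

	f = toy_combo_set_poison(f);
	assert(toy_combo_poisoned(f));
	/* The waiter cannot tell "I/O pending" from "poison marker". */
	assert(!toy_writeback_done(f));

	f = toy_end_writeback(f);
	/* I/O completion silently removed the poison marker. */
	assert(!toy_combo_poisoned(f));
}
```

After the I/O completes the page no longer looks poisoned, so a second
event on the same page would enter the handler again.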

- Could we in theory find another weird bit combination that's truly impossible today?
Probably, but it would be very hard to verify that this can truly never happen.

- Then I don't like it due to the fragility against other software bugs. Unless someone
blasts 0xffs over the struct page (in which case treating it as poisoned is probably a
good thing anyways), a separate bit is fairly robust against software bugs.
Right now "impossible combinations" are used as an indication that something is wrong
with the page, to catch broken software.

If we gave meaning to previously impossible combinations then this robustness
would be reduced. So a separate bit is generally more robust and doesn't take
this away from the other code.

So using a separate bit is a sensible choice imho.

Hope this helps,

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: More thoughts about hwpoison and pageflags compression
  2009-05-30  6:37   ` More thoughts about hwpoison and pageflags compression Andi Kleen
@ 2009-05-30  6:53     ` Andrew Morton
  2009-05-30  7:27       ` Andi Kleen
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2009-05-30  6:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan Cox, linux-kernel, linux-mm, fengguang.wu

On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> So using a separate bit is a sensible choice imho.

Could you make the feature 64-bit-only and use one of bits 32-63?

Did you consider making the poison tag external to the pageframe?  Some
hash(page*) into a bitmap or something?  If suitably designed, such
infrastructure could perhaps be reused to reclaim some existing page
flags.  Dave Hansen had such a patch a few years back.  Or maybe it
was Andy Whitcroft.
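
For illustration only, a user-space sketch of what such an external tag
might look like. It assumes a direct pfn-indexed bitmap rather than a hash,
a tiny fixed pfn range, and C11 atomics in place of kernel bitops; all names
are hypothetical and this is not the earlier patch mentioned above.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MAX_PFN       4096UL	/* assumed tiny machine */
#define TOY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS ((TOY_MAX_PFN + TOY_BITS_PER_WORD - 1) / TOY_BITS_PER_WORD)

/* One bit per page frame, kept outside struct page. Zero-initialized. */
static atomic_ulong toy_poison_map[TOY_WORDS];

/* Atomically tag a pfn as poisoned; returns the previous state. */
bool toy_test_and_set_poison_pfn(unsigned long pfn)
{
	unsigned long mask, old;

	assert(pfn < TOY_MAX_PFN);
	mask = 1UL << (pfn % TOY_BITS_PER_WORD);
	old = atomic_fetch_or(&toy_poison_map[pfn / TOY_BITS_PER_WORD], mask);
	return (old & mask) != 0;
}

/* Query the tag; every checker pays an extra memory lookup here. */
bool toy_pfn_poisoned(unsigned long pfn)
{
	unsigned long mask;

	assert(pfn < TOY_MAX_PFN);
	mask = 1UL << (pfn % TOY_BITS_PER_WORD);
	return (atomic_load(&toy_poison_map[pfn / TOY_BITS_PER_WORD]) & mask) != 0;
}
```

The trade-off is visible even in the toy: page->flags stays untouched, but
each check touches a second cache line instead of the one already in hand.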


* Re: More thoughts about hwpoison and pageflags compression
  2009-05-30  6:53     ` Andrew Morton
@ 2009-05-30  7:27       ` Andi Kleen
  2009-05-30  7:29         ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-30  7:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Alan Cox, linux-kernel, linux-mm, fengguang.wu

On Fri, May 29, 2009 at 11:53:02PM -0700, Andrew Morton wrote:
> On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > So using a separate bit is a sensible choice imho.
> 
> Could you make the feature 64-bit-only and use one of bits 32-63?

We could, but these systems can run 32bit kernels too (although
it's probably not a good idea). Ok it would be probably possible
to make it 64bit only, but I would prefer to not do that.

Also even 32bit has still flags free and even if we run out there's an easy 
path to free more (see my earlier writeup)

So I don't see the pressing need to conserve every bit on 32bit.

> Did you consider making the poison tag external to the pageframe?  Some
> hash(page*) into a bitmap or something?  If suitably designed, such
> infrastructure could perhaps be reused to reclaim some existing page
> flags.  Dave Hansen had such a patch a few years back.  Or maybe it
> was Andy Whitcroft.

I considered it at some point, but it would have complicated the code
and I preferred to keep it simple. The poison handler should be relatively
straightforward and do its work quickly, otherwise it might not isolate
the page before it's actually used.
 
-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: More thoughts about hwpoison and pageflags compression
  2009-05-30  7:27       ` Andi Kleen
@ 2009-05-30  7:29         ` Andrew Morton
  2009-05-30  7:55           ` Andi Kleen
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2009-05-30  7:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Alan Cox, linux-kernel, linux-mm, fengguang.wu

On Sat, 30 May 2009 09:27:58 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> On Fri, May 29, 2009 at 11:53:02PM -0700, Andrew Morton wrote:
> > On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> > 
> > > So using a separate bit is a sensible choice imho.
> > 
> > Could you make the feature 64-bit-only and use one of bits 32-63?
> 
> We could, but these systems can run 32bit kernels too (although
> it's probably not a good idea). Ok it would be probably possible
> to make it 64bit only, but I would prefer to not do that.
> 
> Also even 32bit has still flags free and even if we run out there's an easy 
> path to free more (see my earlier writeup)

hm.  Maybe that should be proven sooner rather than later.

> So I don't see the pressing need to conserve every bit on 32bit.
> 
> > Did you consider making the poison tag external to the pageframe?  Some
> > hash(page*) into a bitmap or something?  If suitably designed, such
> > infrastructure could perhaps be reused to reclaim some existing page
> > flags.  Dave Hansen had such a patch a few years back.  Or maybe it
> > was Andy Whitcroft.
> 
> I considered it at some point, but it would have complicated the code
> and I preferred to keep it simple. The poison handler should be relatively
> straightforward and do its work quickly, otherwise it might not isolate
> the page before it's actually used.

Well it's going to get complicated when we run out anyway.  And run out
we shall.

Plus we haven't looked into the complexity of the external flags yet.


* Re: More thoughts about hwpoison and pageflags compression
  2009-05-30  7:29         ` Andrew Morton
@ 2009-05-30  7:55           ` Andi Kleen
  0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-30  7:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Alan Cox, linux-kernel, linux-mm, fengguang.wu

On Sat, May 30, 2009 at 12:29:30AM -0700, Andrew Morton wrote:
> On Sat, 30 May 2009 09:27:58 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > On Fri, May 29, 2009 at 11:53:02PM -0700, Andrew Morton wrote:
> > > On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> > > 
> > > > So using a separate bit is a sensible choice imho.
> > > 
> > > Could you make the feature 64-bit-only and use one of bits 32-63?
> > 
> > We could, but these systems can run 32bit kernels too (although
> > it's probably not a good idea). Ok it would be probably possible
> > to make it 64bit only, but I would prefer to not do that.
> > 
> > Also even 32bit has still flags free and even if we run out there's an easy 
> > path to free more (see my earlier writeup)
> 
> hm.  Maybe that should be proven sooner rather than later.

The SPARSEMEM code already has some fallback. I don't know if it works, but 
at least the code looks to be there.

 * There are three possibilities for how page->flags get
 * laid out.  The first is for the normal case, without
 * sparsemem.  The second is for sparsemem when there is
 * plenty of space for node and section.  The last is when
 * we have run out of space and have to fall back to an
 * alternate (slower) way of determining the node.
 *
 * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE | ... | FLAGS |
 * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
 * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |

/*
 * If we did not store the node number in the page then we have to
 * do a lookup in the section_to_node_table in order to find which
 * node the page belongs to.
 */
#if MAX_NUMNODES <= 256
static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#else
static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#endif

The other part that could be added is to use a separate hash to go from
page to SECTION (that would be very similar to the old discontig perfect hash
I did to go from pfn to node), then the "SECTION" part would be free for reuse too.

Then you could use the full 32bits. On 32bit we're right now at 22,
hwpoison would be 23. There's still some room.
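
A rough sketch of the bit budget, using the "classic sparse no space for
node" layout quoted above. The 23 flag bits come from the numbers in this
mail; the 2 zone bits, the resulting 7 section bits and all toy_* names are
assumptions made for the example, not the kernel's actual configuration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FLAGS_BITS   32u
#define TOY_NR_PAGEFLAGS 23u	/* 22 in use today plus hwpoison */
#define TOY_ZONE_BITS     2u	/* assumed */
#define TOY_SECTION_BITS (TOY_FLAGS_BITS - TOY_NR_PAGEFLAGS - TOY_ZONE_BITS)

/* Layout, high to low: | SECTION | ZONE | FLAGS | */
#define TOY_SECTION_SHIFT (TOY_FLAGS_BITS - TOY_SECTION_BITS)
#define TOY_ZONE_SHIFT    (TOY_SECTION_SHIFT - TOY_ZONE_BITS)

/* Fallback table: the node is looked up by section, not stored in flags. */
uint8_t toy_section_to_node[1u << TOY_SECTION_BITS];

uint32_t toy_pack(uint32_t section, uint32_t zone, uint32_t flags)
{
	assert(section < (1u << TOY_SECTION_BITS));
	assert(zone < (1u << TOY_ZONE_BITS));
	return (section << TOY_SECTION_SHIFT) |
	       (zone << TOY_ZONE_SHIFT) |
	       (flags & ((1u << TOY_NR_PAGEFLAGS) - 1u));
}

unsigned int toy_page_section(uint32_t f)
{
	return f >> TOY_SECTION_SHIFT;
}

unsigned int toy_page_zone(uint32_t f)
{
	return (f >> TOY_ZONE_SHIFT) & ((1u << TOY_ZONE_BITS) - 1u);
}

/* The slower path: one extra table lookup to find the node. */
unsigned int toy_page_to_nid(uint32_t f)
{
	return toy_section_to_node[toy_page_section(f)];
}
```

Under these assumptions 32 - 23 - 2 leaves 7 bits for the section number;
moving the section lookup into a separate hash as well would hand those
bits back to the flags.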

> Plus we haven't looked into the complexity of the external flags yet.

It would be dumb to do external flags before you actually run out.
After all what good are free bits?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* [PATCH] [0/16] HWPOISON: Intro
@ 2009-06-03 18:46 Andi Kleen
  2009-06-09 10:20 ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-06-03 18:46 UTC (permalink / raw)
  To: akpm, npiggin, linux-kernel, linux-mm, fengguang.wu

Latest version of the hwpoison patchkit.

I rebased it on the mmotm of today to make it easier for Andrew to merge.

All comments have been addressed I believe (except I didn't move the
handlers out to other files)

For details please see the changelogs in the individual patches.

Andrew,

Please consider for merging.

Nick:

I addressed your comments on truncate. Nick can you review that
part? Is it ok for you now?
I also added a check for metadata buffer pages. Can you please review
that too.
And I integrated your truncate patch. The only thing missing 
is a Signed-off-by, can you provide that please?

Also I thought a bit about the fsync() error scenario. It's really
a problem that can already happen even without hwpoison, e.g.
when a page is dropped at the wrong time. The real fix would
be to make address space errors more sticky. Fengguang has been
looking at that, but it's probably not something for .31.
I wrote up a detailed comment in the code describing the issue.

Thanks,
-Andi

---

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned", 
kill the processes associated with it and avoid using it in the future. 

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use there was need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might.
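
As a hedged illustration of what such a specialized application could look
like from user space: in the kernels that later shipped this work, the
notification is, as far as I know, a SIGBUS whose si_code is BUS_MCEERR_AO
or BUS_MCEERR_AR, with the affected address in si_addr (see sigaction(2)).
The sketch below assumes that interface; the toy_* names are hypothetical
and the fallback #defines are only for headers that predate the constants.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fallbacks for old headers; values as used by Linux. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

enum toy_mce_kind {
	TOY_MCE_NONE,			/* ordinary SIGBUS, not a memory error */
	TOY_MCE_ACTION_OPTIONAL,	/* corruption found in the background */
	TOY_MCE_ACTION_REQUIRED		/* corruption was about to be consumed */
};

enum toy_mce_kind toy_classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:
		return TOY_MCE_ACTION_OPTIONAL;
	case BUS_MCEERR_AR:
		return TOY_MCE_ACTION_REQUIRED;
	default:
		return TOY_MCE_NONE;
	}
}

static volatile sig_atomic_t toy_last_kind;
static void *volatile toy_last_addr;

/* Only records what happened; real recovery belongs outside the handler. */
static void toy_sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	toy_last_kind = toy_classify_sigbus(info->si_code);
	toy_last_addr = info->si_addr;
}

int toy_install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = toy_sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}

int toy_last_mce_kind(void)
{
	return toy_last_kind;
}

void *toy_last_mce_addr(void)
{
	return toy_last_addr;
}
```

A hypervisor-style user would map the recorded address back to guest
memory and forward the error; an application that installs no handler
simply gets the default SIGBUS action.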

This is not fully complete yet; in particular there are still various
ways to access poison (crash dump, /proc/kcore etc.)
that need to be plugged too.

Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome. 

The patch series requires the earlier x86 MCE feature series for the x86
specific action optional part. The code can be tested without the x86 specific
part using the injector; this only requires enabling the Kconfig entry
manually in some Kconfig file (by default it is implicitly enabled
by the architecture).

v2: Lots of smaller changes in the series based on review feedback.
Rename Poison to HWPoison after akpm's request.
A new pfn based injector based on feedback.
A lot of improvements mostly from Fengguang Wu
See comments in the individual patches.
v3: Various updates, see changelogs in individual patches.

-Andi


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-03 18:46 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
@ 2009-06-09 10:20 ` Nick Piggin
  2009-06-10  9:07   ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2009-06-09 10:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu

On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> Also I thought a bit about the fsync() error scenario. It's really
> a problem that can already happen even without hwpoison, e.g.
> when a page is dropped at the wrong time.

No, the page will never be "dropped" like that except with
this hwpoison. Errors, sure, might get dropped sometimes
due to implementation bugs, but this is adding semantics that
basically break fsync by-design.

I really want to resolve the EIO issue because as I said, it
is a user-abi issue and too many of those just get shoved
through only for someone to care about fundamental breakage
after some years.

You say that SIGKILL is overkill for such pages, but in fact
this is exactly what you do with mapped pages anyway, so why
not with other pages as well? I think it is perfectly fine to
do so (and maybe a new error code can be introduced and that
can be delivered to processes that can handle it rather than
SIGKILL).

Last request: do you have a panic-on-memory-error option?
I think HA systems and ones with properly designed data
integrity at the application layer will much prefer to
halt the system than attempt ad-hoc recovery that does not
always work and might screw things up worse.


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-09 10:20 ` Nick Piggin
@ 2009-06-10  9:07   ` Wu Fengguang
  2009-06-10  9:18     ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2009-06-10  9:07 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > Also I thought a bit about the fsync() error scenario. It's really
> > a problem that can already happen even without hwpoison, e.g.
> > when a page is dropped at the wrong time.
> 
> No, the page will never be "dropped" like that except with
> this hwpoison. Errors, sure, might get dropped sometimes
> due to implementation bugs, but this is adding semantics that
> basically break fsync by-design.

You mean the non persistent EIO is undesirable?

On the other hand, sticky EIO that can only be explicitly cleared by
user can also be annoying. How about auto clearing the EIO bit when
the last active user closes the file?
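
To compare the two options, a toy user-space model (not kernel code). It
assumes "non persistent" means the first fsync() that looks consumes the
error, and "sticky until last close" means the flag survives until the
opener count drops to zero. The struct and function names are made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_mapping {
	bool eio;		/* an asynchronous write error is pending */
	unsigned int openers;	/* number of active users of the file */
};

/* Non-persistent: whoever calls fsync() first consumes the error. */
int toy_fsync_test_and_clear(struct toy_mapping *m)
{
	if (m->eio) {
		m->eio = false;
		return -EIO;
	}
	return 0;
}

/* Sticky: every fsync() reports the error while it is pending. */
int toy_fsync_sticky(const struct toy_mapping *m)
{
	return m->eio ? -EIO : 0;
}

void toy_open(struct toy_mapping *m)
{
	m->openers++;
}

/* Auto-clear the error when the last active user goes away. */
void toy_close(struct toy_mapping *m)
{
	assert(m->openers > 0);
	if (--m->openers == 0)
		m->eio = false;
}
```

With test-and-clear, a second process calling fsync() on the same file
sees success; with the sticky variant every opener sees -EIO until all of
them have closed the file.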

> I really want to resolve the EIO issue because as I said, it
> is a user-abi issue and too many of those just get shoved
> through only for someone to care about fundamental breakage
> after some years.

Yup.

> You say that SIGKILL is overkill for such pages, but in fact
> this is exactly what you do with mapped pages anyway, so why
> not with other pages as well? I think it is perfectly fine to
> do so (and maybe a new error code can be introduced and that
> can be delivered to processes that can handle it rather than
> SIGKILL).

We can make it a user selectable policy.

They are different in that, mapped dirty pages are normally more vital
(data structures etc.) for correct execution, while write() operates
more often on normal data.

> Last request: do you have a panic-on-memory-error option?
> I think HA systems and ones with properly designed data
> integrity at the application layer will much prefer to
> halt the system than attempt ad-hoc recovery that does not
> always work and might screw things up worse.

Good suggestion. We'll consider such an option. But panicking
unconditionally may be undesirable. For example, a corrupted free page or a
clean unmapped file page can be simply isolated - they won't impact
anything.

Thanks,
Fengguang


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-10  9:07   ` Wu Fengguang
@ 2009-06-10  9:18     ` Nick Piggin
  2009-06-10  9:45       ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2009-06-10  9:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > Also I thought a bit about the fsync() error scenario. It's really
> > > a problem that can already happen even without hwpoison, e.g.
> > > when a page is dropped at the wrong time.
> > 
> > No, the page will never be "dropped" like that except with
> > this hwpoison. Errors, sure, might get dropped sometimes
> > due to implementation bugs, but this is adding semantics that
> > basically break fsync by-design.
> 
> You mean the non persistent EIO is undesirable?
> 
> On the other hand, sticky EIO that can only be explicitly cleared by
> user can also be annoying. How about auto clearing the EIO bit when
> the last active user closes the file?

Well the existing EIO semantics IMO are not great, but that
does not have a big bearing on this new situation. What you
are doing is deliberately throwing away the dirty data, and
giving EIO back in some cases. (but perhaps not others, a
subsequent read or write syscall is not going to get EIO is
it? only fsync).

So even if we did change existing EIO semantics then the
memory corruption case of throwing away dirty data is still
going to be "different" (wrong, I would say).

 
> > I really want to resolve the EIO issue because as I said, it
> > is a user-abi issue and too many of those just get shoved
> > through only for someone to care about fundamental breakage
> > after some years.
> 
> Yup.
> 
> > You say that SIGKILL is overkill for such pages, but in fact
> > this is exactly what you do with mapped pages anyway, so why
> > not with other pages as well? I think it is perfectly fine to
> > do so (and maybe a new error code can be introduced and that
> > can be delivered to processes that can handle it rather than
> > SIGKILL).
> 
> We can make it a user selectable policy.
 
Really? Does it need to be? Can the admin sanely make that
choice?


> They are different in that, mapped dirty pages are normally more vital
> (data structures etc.) for correct execution, while write() operates
> more often on normal data.

read and write, remember. That might be somewhat true, but
definitely there are exceptions both ways. How do you
quantify that or justify it? Just handwaving? Why not make
it more consistent overall and just do SIGKILL for everyone?
 

> > Last request: do you have a panic-on-memory-error option?
> > I think HA systems and ones with properly designed data
> > integrity at the application layer will much prefer to
> > halt the system than attempt ad-hoc recovery that does not
> > always work and might screw things up worse.
> 
> > Good suggestion. We'll consider such an option. But panicking
> > unconditionally may be undesirable. For example, a corrupted free page or a
> clean unmapped file page can be simply isolated - they won't impact
> anything.

I thought you were worried about introducing races where the
data can be consumed when doing things such as lock_page and
wait_on_page_writeback. But if things can definitely be
discarded with no references or chances of being consumed, yes
you would not panic for that. But panic for dirty data or
corrupted kernel memory etc. makes a lot of sense.


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-10  9:18     ` Nick Piggin
@ 2009-06-10  9:45       ` Wu Fengguang
  2009-06-10 11:15         ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2009-06-10  9:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Jun 10, 2009 at 05:18:07PM +0800, Nick Piggin wrote:
> On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > > Also I thought a bit about the fsync() error scenario. It's really
> > > > a problem that can already happen even without hwpoison, e.g.
> > > > when a page is dropped at the wrong time.
> > > 
> > > No, the page will never be "dropped" like that except with
> > > this hwpoison. Errors, sure, might get dropped sometimes
> > > due to implementation bugs, but this is adding semantics that
> > > basically break fsync by-design.
> > 
> > You mean the non persistent EIO is undesirable?
> > 
> > On the other hand, sticky EIO that can only be explicitly cleared by
> > user can also be annoying. How about auto clearing the EIO bit when
> > the last active user closes the file?
> 
> Well the existing EIO semantics IMO are not great, but that
> does not have a big bearing on this new situation. What you

Nod.

> are doing is deliberately throwing away the dirty data, and
> giving EIO back in some cases. (but perhaps not others, a
> subsequent read or write syscall is not going to get EIO is
> it? only fsync).

Right, only fsync/msync and close on nfs will report the error.

write() is normally cached, so obviously it cannot report the later IO
error.

We can make read() IO succeed even if the relevant pages are corrupted
- they can be isolated transparently to user space readers :-)

> So even if we did change existing EIO semantics then the
> memory corruption case of throwing away dirty data is still
> going to be "different" (wrong, I would say).

Oh well.

> > > I really want to resolve the EIO issue because as I said, it
> > > is a user-abi issue and too many of those just get shoved
> > > through only for someone to care about fundamental breakage
> > > after some years.
> > 
> > Yup.
> > 
> > > You say that SIGKILL is overkill for such pages, but in fact
> > > this is exactly what you do with mapped pages anyway, so why
> > > not with other pages as well? I think it is perfectly fine to
> > > do so (and maybe a new error code can be introduced and that
> > > can be delivered to processes that can handle it rather than
> > > SIGKILL).
> > 
> > We can make it a user selectable policy.
>  
> Really? Does it need to be? Can the admin sanely make that
> choice?

I just recalled another fact. See below.

> > They are different in that, mapped dirty pages are normally more vital
> > (data structures etc.) for correct execution, while write() operates
> > more often on normal data.
> 
> read and write, remember. That might be somewhat true, but
> definitely there are exceptions both ways. How do you
> quantify that or justify it? Just handwaving? Why not make
> it more consistent overall and just do SIGKILL for everyone?

1) under read IO hwpoison pages can be hidden from user space
2) under write IO hwpoison pages are normally committed by pdflush,
   so we cannot find the impacted application to kill at all.
3) fsync() users can be caught though. But then the application
   has the option to check its return code. If it doesn't do so,
   it may well not care. So why kill it?
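
For point 3, this is roughly what "checking the return code" means on the
application side. A minimal sketch using only POSIX write() and fsync();
the helper name toy_write_durably() is made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write the whole buffer, then fsync(). Returns 0 on success or a
 * negative errno value. A successful write() only means the data reached
 * the page cache; fsync() is where a later I/O error can be reported.
 */
int toy_write_durably(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -errno;
		}
		p += n;
		len -= (size_t)n;
	}

	if (fsync(fd) < 0)
		return -errno;	/* e.g. -EIO: data may not be on disk */
	return 0;
}
```

An application that cares about its data treats a non-zero result as "the
file contents are unknown" and rewrites or aborts; one that ignores the
result has already chosen not to care.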
 
Think about a multimedia server. Shall we kill the daemon if some IO
page in the movie gets corrupted? And a mission critical server? 
Obviously the admin will want the right to choose.

> > > Last request: do you have a panic-on-memory-error option?
> > > I think HA systems and ones with properly designed data
> > > integrity at the application layer will much prefer to
> > > halt the system than attempt ad-hoc recovery that does not
> > > always work and might screw things up worse.
> > 
> > > Good suggestion. We'll consider such an option. But panicking
> > > unconditionally may be undesirable. For example, a corrupted free page or a
> > clean unmapped file page can be simply isolated - they won't impact
> > anything.
> 
> I thought you were worried about introducing races where the
> data can be consumed when doing things such as lock_page and
> wait_on_page_writeback. But if things can definitely be
> discarded with no references or chances of being consumed, yes
> you would not panic for that. But panic for dirty data or
> corrupted kernel memory etc. makes a lot of sense.

OK. We can panic on dirty/writeback pages, and do try_lock to check
for active users :)
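
A sketch of how such a policy could be expressed, as a plain decision
function over a simplified page state. This is an assumption-laden toy: the
state fields, the action names and the panic_on_dirty switch are all
hypothetical, and treating every mapped page as "kill the users" is a
simplification of what was discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_action {
	TOY_ISOLATE,		/* nothing lost, just take the page away */
	TOY_KILL_USERS,		/* signal the processes mapping the page */
	TOY_DROP_WITH_EIO,	/* drop dirty data, report EIO on fsync */
	TOY_PANIC		/* halt rather than attempt recovery */
};

struct toy_page_state {
	bool free;		/* page is not in use at all */
	bool mapped;		/* mapped into at least one process */
	bool dirty;
	bool writeback;
};

enum toy_action toy_poison_policy(const struct toy_page_state *s,
				  bool panic_on_dirty)
{
	assert(s != NULL);

	/* Free pages and clean unmapped pages can simply be isolated. */
	if (s->free)
		return TOY_ISOLATE;
	if (!s->dirty && !s->writeback && !s->mapped)
		return TOY_ISOLATE;

	/* Data would be lost: panic if the admin asked for that. */
	if ((s->dirty || s->writeback) && panic_on_dirty)
		return TOY_PANIC;

	if (s->mapped)
		return TOY_KILL_USERS;
	return TOY_DROP_WITH_EIO;
}
```

With panic_on_dirty set, only the cases where data is actually lost halt
the machine; free and clean unmapped pages are still isolated quietly.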

Thanks,
Fengguang


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-10  9:45       ` Wu Fengguang
@ 2009-06-10 11:15         ` Nick Piggin
  2009-06-10 12:36           ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2009-06-10 11:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Jun 10, 2009 at 05:45:26PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 05:18:07PM +0800, Nick Piggin wrote:
> > On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> > > On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > > > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > > > Also I thought a bit about the fsync() error scenario. It's really
> > > > > a problem that can already happen even without hwpoison, e.g.
> > > > > when a page is dropped at the wrong time.
> > > > 
> > > > No, the page will never be "dropped" like that except with
> > > > this hwpoison. Errors, sure, might get dropped sometimes
> > > > due to implementation bugs, but this is adding semantics that
> > > > basically break fsync by-design.
> > > 
> > > You mean the non persistent EIO is undesirable?
> > > 
> > > On the other hand, sticky EIO that can only be explicitly cleared by
> > > user can also be annoying. How about auto clearing the EIO bit when
> > > the last active user closes the file?
> > 
> > Well the existing EIO semantics IMO are not great, but that
> > does not have a big bearing on this new situation. What you
> 
> Nod.
> 
> > are doing is deliberately throwing away the dirty data, and
> > giving EIO back in some cases. (but perhaps not others, a
> > subsequent read or write syscall is not going to get EIO is
> > it? only fsync).
> 
> Right, only fsync/msync and close on nfs will report the error.
> 
> write() is normally cached, so obviously it cannot report the later IO
> error.
> 
> We can make read() IO succeed even if the relevant pages are corrupted
> - they can be isolated transparent to user space readers :-)

But if the page was dirty and you throw out the dirty data,
then next read will give inconsistent data.

 
> > So even if we did change existing EIO semantics then the
> > memory corruption case of throwing away dirty data is still
> > going to be "different" (wrong, I would say).
> 
> Oh well.

Well I just think SIGKILL is the much safer behaviour to
start with (and matches behaviour with mmapped pagecache
and anon), and does not introduce these different semantics.

 
> > > > I really want to resolve the EIO issue because as I said, it
> > > > is a user-abi issue and too many of those just get shoved
> > > > through only for someone to care about fundamental breakage
> > > > after some years.
> > > 
> > > Yup.
> > > 
> > > > You say that SIGKILL is overkill for such pages, but in fact
> > > > this is exactly what you do with mapped pages anyway, so why
> > > > not with other pages as well? I think it is perfectly fine to
> > > > do so (and maybe a new error code can be introduced and that
> > > > can be delivered to processes that can handle it rather than
> > > > SIGKILL).
> > > 
> > > We can make it a user selectable policy.
> >  
> > Really? Does it need to be? Can the admin sanely make that
> > choice?
> 
> I just recalled another fact. See below.
> 
> > > They are different in that, mapped dirty pages are normally more vital
> > > (data structures etc.) for correct execution, while write() operates
> > > more often on normal data.
> > 
> > read and write, remember. That might be somewhat true, but
> > definitely there are exceptions both ways. How do you
> > quantify that or justify it? Just handwaving? Why not make
> > it more consistent overall and just do SIGKILL for everyone?
> 
> 1) under read IO hwpoison pages can be hidden to user space

I mean for cases where the recovery cannot be transparent
(ie. error in dirty page).


> 2) under write IO hwpoison pages are normally committed by pdflush,
>    so cannot find the impacted application to kill at all.

Correct.

> 3) fsync() users can be caught though. But then the application
>    have the option to check its return code. If it doesn't do it,
>    it may well don't care. So why kill it?

Well if it does not check, then we cannot find it to kill
it anyway. If it does care (and hence check with fsync),
then we could kill it.
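
For reference, an application that "does care" in the sense above looks
roughly like the following userspace sketch. The helper name
write_durably() is made up for illustration. Only write() and fsync()
are real interfaces, and nothing here depends on the hwpoison patches.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a buffer and force it to stable storage.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return -errno;
        }
        p += n;
        len -= (size_t)n;
    }

    /*
     * write() normally only dirties the page cache, so a later
     * writeback failure can only be reported here.
     */
    if (fsync(fd) < 0)
        return -errno;          /* e.g. -EIO */

    return 0;
}
```

A caller that ignores the return value of such a helper never learns
that dirty data was lost, which is the case debated in this thread.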

 
> Think about a multimedia server. Shall we kill the daemon if some IO
> page in the movie get corrupted?

My multimedia server is using mmap for data...

> And a mission critical server? 

Mission critical server should be killed too because it
likely does not understand this semantic of throwing out
dirty data page. It should be detected and restarted and
should recover or fail over to another server.


> Obviously the admin will want the right to choose.

I don't know if they are equipped to really know. Do they
know that their application will correctly handle these
semantics of throwing out dirty data? It is potentially
much more dangerous to do this exactly because it can confuse
the case where it matters most (ie. ones that care about
data integrity).

It just seems like killing is far less controversial and
simpler. Start with that and it should do the right thing
for most people anyway. We could discuss possible ways
to recover in another patch if you want to do this
EIO thing.

 
> > > > Last request: do you have a panic-on-memory-error option?
> > > > I think HA systems and ones with properly designed data
> > > > integrity at the application layer will much prefer to
> > > > halt the system than attempt ad-hoc recovery that does not
> > > > always work and might screw things up worse.
> > > 
> > > Good suggestion. We'll consider such an option. But unconditionally
> > > panic may be undesirable. For example, a corrupted free page or a
> > > clean unmapped file page can be simply isolated - they won't impact
> > > anything.
> > 
> > I thought you were worried about introducing races where the
> > data can be consumed when doing things such as lock_page and
> > wait_on_page_writeback. But if things can definitely be
> > discarded with no references or chances of being consumed, yes
> > you would not panic for that. But panic for dirty data or
> > corrupted kernel memory etc. makes a lot of sense.
> 
> OK. We can panic on dirty/writeback pages, and do try_lock to check
> for active users :)

That would be good. IMO panic should be the safest and sanest
option (admin knows exactly what it is and has very simple and
clear semantics).
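
The rule discussed above can be written down as a small decision
function. This is only a sketch of the policy as stated in this thread,
not code from the patchkit: every name in it (struct page_state,
panic_mode_action(), the HWP_* values) is hypothetical, and mapped clean
pages are left to the normal handler because the thread does not settle
that case.

```c
#include <stdbool.h>

/* Hypothetical names throughout; none of these come from the patchkit. */
enum hwp_action {
    HWP_ISOLATE,    /* drop the page quietly, nothing was lost */
    HWP_PANIC,      /* halt: data was lost or kernel state is suspect */
    HWP_DEFAULT     /* not settled in the discussion; normal handling */
};

struct page_state {
    bool is_kernel;     /* kernel memory, not page cache or anon */
    bool is_free;       /* sitting in the free lists */
    bool is_mapped;     /* mapped into some process */
    bool is_dirty;      /* holds data not yet on disk */
    bool in_writeback;  /* currently being written out */
};

enum hwp_action panic_mode_action(const struct page_state *p)
{
    if (p->is_kernel)
        return HWP_PANIC;
    if (p->is_free)
        return HWP_ISOLATE;
    if (p->is_dirty || p->in_writeback)
        return HWP_PANIC;
    if (!p->is_mapped)
        return HWP_ISOLATE;     /* clean unmapped file page */
    return HWP_DEFAULT;
}
```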


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-10 11:15         ` Nick Piggin
@ 2009-06-10 12:36           ` Wu Fengguang
  2009-06-10 12:47             ` Nick Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2009-06-10 12:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andi Kleen, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Jun 10, 2009 at 07:15:41PM +0800, Nick Piggin wrote:
> On Wed, Jun 10, 2009 at 05:45:26PM +0800, Wu Fengguang wrote:
> > On Wed, Jun 10, 2009 at 05:18:07PM +0800, Nick Piggin wrote:
> > > On Wed, Jun 10, 2009 at 05:07:03PM +0800, Wu Fengguang wrote:
> > > > On Tue, Jun 09, 2009 at 06:20:14PM +0800, Nick Piggin wrote:
> > > > > On Wed, Jun 03, 2009 at 08:46:31PM +0200, Andi Kleen wrote:
> > > > > > Also I thought a bit about the fsync() error scenario. It's really
> > > > > > a problem that can already happen even without hwpoison, e.g.
> > > > > > when a page is dropped at the wrong time.
> > > > > 
> > > > > No, the page will never be "dropped" like that except with
> > > > > this hwpoison. Errors, sure, might get dropped sometimes
> > > > > due to implementation bugs, but this is adding semantics that
> > > > > basically break fsync by-design.
> > > > 
> > > > You mean the non persistent EIO is undesirable?
> > > > 
> > > > In the other hand, sticky EIO that can only be explicitly cleared by
> > > > user can also be annoying. How about auto clearing the EIO bit when
> > > > the last active user closes the file?
> > > 
> > > Well the existing EIO semantics IMO are not great, but that
> > > does not have a big bearing on this new situation. What you
> > 
> > Nod.
> > 
> > > are doing is deliberately throwing away the dirty data, and
> > > giving EIO back in some cases. (but perhaps not others, a
> > > subsequent read or write syscall is not going to get EIO is
> > > it? only fsync).
> > 
> > Right, only fsync/msync and close on nfs will report the error.
> > 
> > write() is normally cached, so obviously it cannot report the later IO
> > error.
> > 
> > We can make read() IO succeed even if the relevant pages are corrupted
> > - they can be isolated transparent to user space readers :-)
> 
> But if the page was dirty and you throw out the dirty data,
> then next read will give inconsistent data.

Yup. That's a big problem - the application won't get any error
feedback here if it doesn't call fsync() to commit IO.

>  
> > > So even if we did change existing EIO semantics then the
> > > memory corruption case of throwing away dirty data is still
> > > going to be "different" (wrong, I would say).
> > 
> > Oh well.
> 
> Well I just think SIGKILL is the much safer behaviour to
> start with (and matches behaviour with mmapped pagecache
> and anon), and does not introduce these different semantics.

So what?  SIGKILL any future processes visiting the corrupted file?
Or better to return EIO to them? Either way we'll be maintaining
a consistent AS_EIO_HWPOISON bit.

>  
> > > > > I really want to resolve the EIO issue because as I said, it
> > > > > is a user-abi issue and too many of those just get shoved
> > > > > through only for someone to care about fundamental breakage
> > > > > after some years.
> > > > 
> > > > Yup.
> > > > 
> > > > > You say that SIGKILL is overkill for such pages, but in fact
> > > > > this is exactly what you do with mapped pages anyway, so why
> > > > > not with other pages as well? I think it is perfectly fine to
> > > > > do so (and maybe a new error code can be introduced and that
> > > > > can be delivered to processes that can handle it rather than
> > > > > SIGKILL).
> > > > 
> > > > We can make it a user selectable policy.
> > >  
> > > Really? Does it need to be? Can the admin sanely make that
> > > choice?
> > 
> > I just recalled another fact. See below.
> > 
> > > > They are different in that, mapped dirty pages are normally more vital
> > > > (data structures etc.) for correct execution, while write() operates
> > > > more often on normal data.
> > > 
> > > read and write, remember. That might be somewhat true, but
> > > definitely there are exceptions both ways. How do you
> > > quantify that or justify it? Just handwaving? Why not make
> > > it more consistent overall and just do SIGKILL for everyone?
> > 
> > 1) under read IO hwpoison pages can be hidden to user space
> 
> I mean for cases where the recovery cannot be transparent
> (ie. error in dirty page).

OK. That's a good point.

> > 2) under write IO hwpoison pages are normally committed by pdflush,
> >    so cannot find the impacted application to kill at all.
> 
> Correct.
> 
> > 3) fsync() users can be caught though. But then the application
> >    have the option to check its return code. If it doesn't do it,
> >    it may well don't care. So why kill it?
> 
> Well if it does not check, then we cannot find it to kill
> it anyway. If it does care (and hence check with fsync),
> then we could kill it.

If it really cares, it will check for EIO after fsync ;)
But yes, if it only moderately cares, it may ignore the return value.

So SIGKILL on fsync() seems to be a good option.

> > Think about a multimedia server. Shall we kill the daemon if some IO
> > page in the movie get corrupted?
> 
> My multimedia server is using mmap for data...
> 
> > And a mission critical server? 
> 
> Mission critical server should be killed too because it
> likely does not understand this semantic of throwing out
> dirty data page. It should be detected and restarted and
> should recover or fail over to another server.

Sorry for the confusion. I meant one server may want to survive,
while another may want to be killed (and have its service restarted).

> > Obviously the admin will want the right to choose.
> 
> I don't know if they are equipped to really know. Do they
> know that their application will correctly handle these
> semantics of throwing out dirty data? It is potentially
> much more dangerous to do this exactly because it can confuse
> the case where it matters most (ie. ones that care about
> data integrity).
> 
> It just seems like killing is far less controversial and
> simpler. Start with that and it should do the right thing
> for most people anyway. We could discuss possible ways
> to recover in another patch if you want to do this
> EIO thing.

OK, we can
        - kill fsync() users
        - and then return EIO for later read()/write()s
        - forget about the EIO condition on last file close()
Do you agree?
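
To make the three steps concrete, here is a toy userspace model of the
proposal. It is a sketch under stated assumptions, not kernel code: the
struct and function names are invented for illustration, and the
eio_hwpoison field merely stands in for the AS_EIO_HWPOISON bit
mentioned earlier in this mail.

```c
#include <stdbool.h>

/* Hypothetical model types; nothing here exists in the kernel. */
enum io_outcome { IO_OK, IO_EIO, IO_SIGKILL };

struct file_model {
    int  open_count;    /* number of active users of the file */
    bool eio_hwpoison;  /* stands in for the proposed AS_EIO_HWPOISON */
};

void model_open(struct file_model *f)
{
    f->open_count++;
}

/* A dirty page of this file was hit by a memory error and dropped. */
void model_poison_dirty_page(struct file_model *f)
{
    f->eio_hwpoison = true;
}

/* Step 1: fsync() users are killed while the condition is set. */
enum io_outcome model_fsync(const struct file_model *f)
{
    return f->eio_hwpoison ? IO_SIGKILL : IO_OK;
}

/* Step 2: later read()/write() calls see EIO. */
enum io_outcome model_read_write(const struct file_model *f)
{
    return f->eio_hwpoison ? IO_EIO : IO_OK;
}

/* Step 3: the condition is forgotten on the last close(). */
void model_close(struct file_model *f)
{
    if (f->open_count > 0 && --f->open_count == 0)
        f->eio_hwpoison = false;
}
```

In this model the error stays sticky for as long as anyone holds the
file open, and a process that opens it afterwards sees no error at all.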

> > > > > Last request: do you have a panic-on-memory-error option?
> > > > > I think HA systems and ones with properly designed data
> > > > > integrity at the application layer will much prefer to
> > > > > halt the system than attempt ad-hoc recovery that does not
> > > > > always work and might screw things up worse.
> > > > 
> > > > Good suggestion. We'll consider such an option. But unconditionally
> > > > panic may be undesirable. For example, a corrupted free page or a
> > > > clean unmapped file page can be simply isolated - they won't impact
> > > > anything.
> > > 
> > > I thought you were worried about introducing races where the
> > > data can be consumed when doing things such as lock_page and
> > > wait_on_page_writeback. But if things can definitely be
> > > discarded with no references or chances of being consumed, yes
> > > you would not panic for that. But panic for dirty data or
> > > corrupted kernel memory etc. makes a lot of sense.
> > 
> > OK. We can panic on dirty/writeback pages, and do try_lock to check
> > for active users :)
> 
> That would be good. IMO panic should be the safest and sanest
> option (admin knows exactly what it is and has very simple and
> clear semantics).

Thanks,
Fengguang


* Re: [PATCH] [0/16] HWPOISON: Intro
  2009-06-10 12:36           ` Wu Fengguang
@ 2009-06-10 12:47             ` Nick Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nick Piggin @ 2009-06-10 12:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andi Kleen, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Wed, Jun 10, 2009 at 08:36:00PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 07:15:41PM +0800, Nick Piggin wrote:
> > > We can make read() IO succeed even if the relevant pages are corrupted
> > > - they can be isolated transparent to user space readers :-)
> > 
> > But if the page was dirty and you throw out the dirty data,
> > then next read will give inconsistent data.
> 
> Yup. That's a big problem - the application won't get any error
> feedback here if it doesn't call fsync() to commit IO.

Right.


> > > > So even if we did change existing EIO semantics then the
> > > > memory corruption case of throwing away dirty data is still
> > > > going to be "different" (wrong, I would say).
> > > 
> > > Oh well.
> > 
> > Well I just think SIGKILL is the much safer behaviour to
> > start with (and matches behaviour with mmapped pagecache
> > and anon), and does not introduce these different semantics.
> 
> So what?  SIGKILL any future processes visiting the corrupted file?
> Or better to return EIO to them? Either way we'll be maintaining
> a consistent AS_EIO_HWPOISON bit.

If you don't throw the page out of the pagecache, it could
be left in there as a marker to SIGKILL anybody who tries to
access that page. OTOH this might present some other
difficulties regarding suppression of writeback etc. Not
quite sure.

Of course the safest mode, IMO, is to panic the kernel in
situations like this (eg. corruption in dirty pagecache). I
would almost like to see that made as the default mode. That
avoids all questions of how exactly to handle these things.
Then if you can subsequently justify what kind of application
or case would work better with a particular behaviour (such
as throw away the data) then we can discuss and merge that.


> > > 1) under read IO hwpoison pages can be hidden to user space
> > 
> > I mean for cases where the recovery cannot be transparent
> > (ie. error in dirty page).
> 
> OK. That's a good point.
> 
> > > 2) under write IO hwpoison pages are normally committed by pdflush,
> > >    so cannot find the impacted application to kill at all.
> > 
> > Correct.
> > 
> > > 3) fsync() users can be caught though. But then the application
> > >    have the option to check its return code. If it doesn't do it,
> > >    it may well don't care. So why kill it?
> > 
> > Well if it does not check, then we cannot find it to kill
> > it anyway. If it does care (and hence check with fsync),
> > then we could kill it.
> 
> If it really cares, it will check for EIO after fsync ;)
> But yes, if it only moderately cares, it may ignore the return value.
> 
> So SIGKILL on fsync() seems to be a good option.
> 
> > > Think about a multimedia server. Shall we kill the daemon if some IO
> > > page in the movie get corrupted?
> > 
> > My multimedia server is using mmap for data...
> > 
> > > And a mission critical server? 
> > 
> > Mission critical server should be killed too because it
> > likely does not understand this semantic of throwing out
> > dirty data page. It should be detected and restarted and
> > should recover or fail over to another server.
> 
> Sorry for the confusion. I meant one server may want to survive,
> while another may want to be killed (and have its service restarted).

Yes I just don't think even a really good admin will know
what to choose. At which point might as well remove the option
and just try to implement something sane...

But maybe you can write some good documentation for it, I will
stand corrected ;) 

> > > Obviously the admin will want the right to choose.
> > 
> > I don't know if they are equipped to really know. Do they
> > know that their application will correctly handle these
> > semantics of throwing out dirty data? It is potentially
> > much more dangerous to do this exactly because it can confuse
> > the case where it matters most (ie. ones that care about
> > data integrity).
> > 
> > It just seems like killing is far less controversial and
> > simpler. Start with that and it should do the right thing
> > for most people anyway. We could discuss possible ways
> > to recover in another patch if you want to do this
> > EIO thing.
> 
> OK, we can
>         - kill fsync() users
>         - and then return EIO for later read()/write()s
>         - forget about the EIO condition on last file close()
> Do you agree?

I really don't know ;) Anything I can think of could be wrong
for a given situation. panic seems like the best default
option to me.

I don't want to sound like I'm quibbling. I don't actually
care too much what options are implemented so long as each
is justified and documented, and so long as the default is a
sane one.

Thanks,
Nick


* [PATCH] [0/16] HWPOISON: Intro
@ 2009-05-27 20:12 Andi Kleen
  0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-27 20:12 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, fengguang.wu

This is the latest version of the hwpoison patch. It has had
a lot of fixes, improvements, and review/testing since the last
version.

Many thanks to Fengguang Wu for a lot of great
improvements, including fixing quite a few problems
and implementing free page handling.

It's also standalone now, not relying on any
other patchkits. Standalone it's only usable through
the debugging injection interfaces, but architectures
can (and do) make use of it.

It's also fairly unintrusive, as you can see.
It doesn't really change any existing code paths 
significantly.

I believe this version is now ready for merging.

Any additional review/comments/etc of course welcome.

Andrew, can you please consider it for merging into -mm
for the 2.6.31 track?

The patchkit is also available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison

-Andi

---

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned", 
kill the processes associated with it and avoid using it in the future. 

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high level handler in mm/memory-failure.c,
a new page poison bit and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications.

For the KVM use there was need for a new signal type so that
KVM can inject the machine check into the guest with the proper
address. This in theory allows other applications to handle
memory failures too. The expectation is that nearly all applications
won't do that, but some very specialized ones might. 

This is not fully complete yet; in particular there are still ways
to access poisoned memory (crash dump, /proc/kcore etc.)
that need to be plugged too.

Also undoubtedly the high level handler still has bugs and cases
it cannot recover from. For example nonlinear mappings deadlock right now
and a few other cases lose references. Huge pages are not supported
yet. Any additional testing, reviewing etc. welcome. 

The patch series requires the earlier x86 MCE feature series for the
optional x86-specific action-optional part. The code can be tested without
the x86-specific part using the injector; this only requires enabling the
Kconfig entry manually in some Kconfig file (by default it is implicitly
enabled by the architecture).

v2: Lots of smaller changes in the series based on review feedback.
Rename Poison to HWPoison after akpm's request.
A new pfn based injector based on feedback.
A lot of improvements mostly from Fengguang Wu
See comments in the individual patches.

-Andi


end of thread, other threads:[~2009-06-10 12:46 UTC]

Thread overview: 36+ messages
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
2009-05-29 21:52   ` Alan Cox
2009-05-29 21:35 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
2009-05-29 21:35 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
2009-05-29 21:35 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
2009-05-29 21:35 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
2009-05-29 21:35 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3 Andi Kleen
2009-05-29 21:35 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
2009-05-29 21:35 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
2009-05-29 21:35 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
2009-05-29 21:35 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
2009-05-29 21:35 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
2009-05-29 21:35 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
2009-06-01 11:16   ` Nick Piggin
2009-06-01 12:46     ` Wu Fengguang
2009-05-29 21:35 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
2009-05-29 21:35 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
2009-05-29 21:35 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen
2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
2009-05-29 22:24   ` Andi Kleen
2009-05-30  6:37   ` More thoughts about hwpoison and pageflags compression Andi Kleen
2009-05-30  6:53     ` Andrew Morton
2009-05-30  7:27       ` Andi Kleen
2009-05-30  7:29         ` Andrew Morton
2009-05-30  7:55           ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2009-06-03 18:46 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
2009-06-09 10:20 ` Nick Piggin
2009-06-10  9:07   ` Wu Fengguang
2009-06-10  9:18     ` Nick Piggin
2009-06-10  9:45       ` Wu Fengguang
2009-06-10 11:15         ` Nick Piggin
2009-06-10 12:36           ` Wu Fengguang
2009-06-10 12:47             ` Nick Piggin
2009-05-27 20:12 Andi Kleen
