* [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:52 ` Alan Cox
2009-05-29 21:35 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
` (15 subsequent siblings)
16 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Hardware poisoned pages need special handling in the VM and shouldn't be
touched again. This requires a new page flag. Define it here.
The page flags wars seem to be over, so it shouldn't be a problem
to get a new one.
v2: Add TestSetHWPoison (suggested by Johannes Weiner)
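For readers unfamiliar with the page flag macros, here is a minimal user-space sketch of the operations that PAGEFLAG(HWPoison, hwpoison) and TESTSETFLAG(HWPoison, hwpoison) are meant to provide: test, set, and an atomic test-and-set. The names model_page, MODEL_PG_hwpoison and the model_* helpers are hypothetical, and C11 atomics stand in for the kernel's bitops, so treat this as an illustration rather than the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions; the real ones come from enum pageflags. */
enum model_pageflags { MODEL_PG_locked, MODEL_PG_hwpoison };

/* Stand-in for struct page: only the flags word matters here. */
struct model_page {
	atomic_ulong flags;
};

#define MODEL_HWPOISON_MASK (1UL << MODEL_PG_hwpoison)

/* Like PageHWPoison(): report whether the poison bit is set. */
static inline bool model_page_hwpoison(struct model_page *page)
{
	return (atomic_load(&page->flags) & MODEL_HWPOISON_MASK) != 0;
}

/* Like SetPageHWPoison(): mark the page as poisoned. */
static inline void model_set_page_hwpoison(struct model_page *page)
{
	atomic_fetch_or(&page->flags, MODEL_HWPOISON_MASK);
}

/*
 * Like TestSetPageHWPoison(): set the bit and return its previous value
 * in one atomic step, so only one caller observes "was not set before".
 */
static inline bool model_test_set_page_hwpoison(struct model_page *page)
{
	unsigned long old = atomic_fetch_or(&page->flags, MODEL_HWPOISON_MASK);

	return (old & MODEL_HWPOISON_MASK) != 0;
}
```

A caller that gets true back from model_test_set_page_hwpoison() knows that some other path has already marked the page and can skip repeating the work.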
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/page-flags.h | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
Index: linux/include/linux/page-flags.h
===================================================================
--- linux.orig/include/linux/page-flags.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/page-flags.h 2009-05-29 23:32:10.000000000 +0200
@@ -51,6 +51,9 @@
* PG_buddy is set to indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
*
+ * PG_hwpoison indicates that a page got corrupted in hardware and contains
+ * data with incorrect ECC bits that triggered a machine check. Accessing is
+ * not safe since it may cause another machine check. Don't touch!
*/
/*
@@ -104,6 +107,9 @@
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+ PG_hwpoison, /* hardware poisoned page. Don't touch */
+#endif
__NR_PAGEFLAGS,
/* Filesystems */
@@ -273,6 +279,15 @@
PAGEFLAG_FALSE(Uncached)
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+PAGEFLAG(HWPoison, hwpoison)
+TESTSETFLAG(HWPoison, hwpoison)
+#define __PG_HWPOISON (1UL << PG_hwpoison)
+#else
+PAGEFLAG_FALSE(HWPoison)
+#define __PG_HWPOISON 0
+#endif
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
@@ -403,7 +418,7 @@
1 << PG_private | 1 << PG_private_2 | \
1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- __PG_UNEVICTABLE | __PG_MLOCKED)
+ __PG_HWPOISON | __PG_UNEVICTABLE | __PG_MLOCKED)
/*
* Flags checked when a page is prepped for return by the page allocator.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages
2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
@ 2009-05-29 21:52 ` Alan Cox
0 siblings, 0 replies; 36+ messages in thread
From: Alan Cox @ 2009-05-29 21:52 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu
On Fri, 29 May 2009 23:35:26 +0200 (CEST)
Andi Kleen <andi@firstfloor.org> wrote:
>
> Hardware poisoned pages need special handling in the VM and shouldn't be
> touched again. This requires a new page flag. Define it here.
>
> The page flags wars seem to be over, so it shouldn't be a problem
> to get a new one.
>
> v2: Add TestSetHWPoison (suggested by Johannes Weiner)
>
> Acked-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
>
> ---
> include/linux/page-flags.h | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> Index: linux/include/linux/page-flags.h
> ===================================================================
> --- linux.orig/include/linux/page-flags.h 2009-05-29 23:32:10.000000000 +0200
> +++ linux/include/linux/page-flags.h 2009-05-29 23:32:10.000000000 +0200
> @@ -51,6 +51,9 @@
> * PG_buddy is set to indicate that the page is free and in the buddy system
> * (see mm/page_alloc.c).
> *
> + * PG_hwpoison indicates that a page got corrupted in hardware and contains
> + * data with incorrect ECC bits that triggered a machine check. Accessing is
> + * not safe since it may cause another machine check. Don't touch!
> */
>
> /*
> @@ -104,6 +107,9 @@
> #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
> PG_uncached, /* Page has been mapped as uncached */
> #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> + PG_hwpoison, /* hardware poisoned page. Don't touch */
> +#endif
> __NR_PAGEFLAGS,
>
> /* Filesystems */
> @@ -273,6 +279,15 @@
> PAGEFLAG_FALSE(Uncached)
> #endif
>
> +#ifdef CONFIG_MEMORY_FAILURE
> +PAGEFLAG(HWPoison, hwpoison)
> +TESTSETFLAG(HWPoison, hwpoison)
> +#define __PG_HWPOISON (1UL << PG_hwpoison)
> +#else
> +PAGEFLAG_FALSE(HWPoison)
> +#define __PG_HWPOISON 0
> +#endif
> +
> static inline int PageUptodate(struct page *page)
> {
> int ret = test_bit(PG_uptodate, &(page)->flags);
> @@ -403,7 +418,7 @@
> 1 << PG_private | 1 << PG_private_2 | \
> 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
> 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> - __PG_UNEVICTABLE | __PG_MLOCKED)
> + __PG_HWPOISON | __PG_UNEVICTABLE | __PG_MLOCKED)
>
> /*
> * Flags checked when a page is prepped for return by the page allocator.
--
"Alan, I'm getting a bit worried about you."
-- Linus Torvalds
* [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
` (14 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: fengguang.wu, akpm, linux-kernel, linux-mm
From: Fengguang Wu <fengguang.wu@intel.com>
Export the new poison flag in /proc/kpageflags. Poisoned pages are moderately
interesting even for administrators, so export them here. Also useful
for debugging.
AK: I extracted this out of a larger patch from Fengguang Wu.
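As an illustration of what the kpf_copy_bit() line in this patch does, the sketch below moves one bit from a kernel-internal position to the exported KPF_HWPOISON position and shows how a reader of /proc/kpageflags could test it. The value 11 is the one used in this patch; the source bit position 25 in the usage note is made up, and kpf_word_is_hwpoison() is a hypothetical helper.

```c
#include <stdint.h>

/* Exported bit numbers as defined in this patch. */
#define KPF_BUDDY     10
#define KPF_HWPOISON  11

/* Function form of the kpf_copy_bit() macro in fs/proc/page.c. */
static inline uint64_t kpf_copy_bit(uint64_t flags, unsigned int dstpos,
				    unsigned int srcpos)
{
	return ((flags >> srcpos) & 1) << dstpos;
}

/* Hypothetical reader-side helper: test one 64-bit kpageflags word. */
static inline int kpf_word_is_hwpoison(uint64_t uflags)
{
	return (int)((uflags >> KPF_HWPOISON) & 1);
}
```

For example, with a made-up kernel bit position of 25, kpf_copy_bit(1ULL << 25, KPF_HWPOISON, 25) yields a word for which kpf_word_is_hwpoison() returns 1.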
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
fs/proc/page.c | 4 ++++
1 file changed, 4 insertions(+)
Index: linux/fs/proc/page.c
===================================================================
--- linux.orig/fs/proc/page.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/fs/proc/page.c 2009-05-29 23:32:10.000000000 +0200
@@ -79,6 +79,7 @@
#define KPF_WRITEBACK 8
#define KPF_RECLAIM 9
#define KPF_BUDDY 10
+#define KPF_HWPOISON 11
#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
@@ -118,6 +119,9 @@
kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
+#ifdef CONFIG_MEMORY_FAILURE
+ uflags |= kpf_copy_bit(kflags, KPF_HWPOISON, PG_hwpoison);
+#endif
if (put_user(uflags, out++)) {
ret = -EFAULT;
* [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
2009-05-29 21:35 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
2009-05-29 21:35 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
` (13 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Needed for a later patch that walks rmap entries on its own.
This used to be strongly frowned upon, but memory-failure.c does
some rather specialized rmap walking and rmap has been stable
for quite some time, so I think it's OK to export it now.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/rmap.h | 6 ++++++
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 2 deletions(-)
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/rmap.h 2009-05-29 23:33:30.000000000 +0200
@@ -118,6 +118,12 @@
}
#endif
+/*
+ * Called by memory-failure.c to kill processes.
+ */
+struct anon_vma *page_lock_anon_vma(struct page *page);
+void page_unlock_anon_vma(struct anon_vma *anon_vma);
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c 2009-05-29 23:33:30.000000000 +0200
@@ -191,7 +191,7 @@
* Getting a lock on a stable anon_vma from a page off the LRU is
* tricky: page_lock_anon_vma rely on RCU to guard against the races.
*/
-static struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page)
{
struct anon_vma *anon_vma;
unsigned long anon_mapping;
@@ -211,7 +211,7 @@
return NULL;
}
-static void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
spin_unlock(&anon_vma->lock);
rcu_read_unlock();
* [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (2 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
` (12 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Memory migration uses special swap entry types to trigger special actions on
page faults. Extend this mechanism to also support poisoned swap entries, to
trigger poison handling on page faults. This allows follow-on patches to
prevent processes from faulting in poisoned pages again.
v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu)
v3: Better overflow fix (Hidehiro Kawai)
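To make the numbering concrete, the following sketch works through the swap type arithmetic from this patch with both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE assumed enabled. The macros are copied from the patch; model_swp_entry_t and the model_* helpers are hypothetical user-space stand-ins that pack the type into the top MAX_SWAPFILES_SHIFT bits of a 64-bit word, which is an assumption about the layout rather than the kernel's exact encoding.

```c
#include <stdint.h>

#define MAX_SWAPFILES_SHIFT 5

/* Both features assumed enabled for this illustration. */
#define SWP_MIGRATION_NUM 2
#define SWP_HWPOISON_NUM  1

#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

#define SWP_HWPOISON        MAX_SWAPFILES
#define SWP_MIGRATION_READ  (MAX_SWAPFILES + SWP_HWPOISON_NUM)
#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)

/* Hypothetical layout: type in the top 5 bits, offset below it. */
#define MODEL_TYPE_SHIFT  (64 - MAX_SWAPFILES_SHIFT)
#define MODEL_OFFSET_MASK ((1ULL << MODEL_TYPE_SHIFT) - 1)

typedef struct { uint64_t val; } model_swp_entry_t;

static inline model_swp_entry_t model_swp_entry(unsigned int type,
						uint64_t offset)
{
	model_swp_entry_t e;

	e.val = ((uint64_t)type << MODEL_TYPE_SHIFT) |
		(offset & MODEL_OFFSET_MASK);
	return e;
}

static inline unsigned int model_swp_type(model_swp_entry_t e)
{
	return (unsigned int)(e.val >> MODEL_TYPE_SHIFT);
}

static inline uint64_t model_swp_offset(model_swp_entry_t e)
{
	return e.val & MODEL_OFFSET_MASK;
}

/* Mirrors is_hwpoison_entry(): the type field selects the meaning. */
static inline int model_is_hwpoison_entry(model_swp_entry_t e)
{
	return model_swp_type(e) == (unsigned int)SWP_HWPOISON;
}

/* Mirrors non_swap_entry(): any type at or above MAX_SWAPFILES is special. */
static inline int model_non_swap_entry(model_swp_entry_t e)
{
	return model_swp_type(e) >= (unsigned int)MAX_SWAPFILES;
}
```

Under these assumptions 29 real swap files remain (types 0 to 28), type 29 is the poison entry, and types 30 and 31 are the migration read and write entries.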
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/swap.h | 34 ++++++++++++++++++++++++++++------
include/linux/swapops.h | 38 ++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 4 ++--
3 files changed, 68 insertions(+), 8 deletions(-)
Index: linux/include/linux/swap.h
===================================================================
--- linux.orig/include/linux/swap.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/swap.h 2009-05-29 23:32:10.000000000 +0200
@@ -34,16 +34,38 @@
* the type/offset into the pte as 5/27 as well.
*/
#define MAX_SWAPFILES_SHIFT 5
-#ifndef CONFIG_MIGRATION
-#define MAX_SWAPFILES (1 << MAX_SWAPFILES_SHIFT)
+
+/*
+ * Use some of the swap files numbers for other purposes. This
+ * is a convenient way to hook into the VM to trigger special
+ * actions on faults.
+ */
+
+/*
+ * NUMA node memory migration support
+ */
+#ifdef CONFIG_MIGRATION
+#define SWP_MIGRATION_NUM 2
+#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM)
+#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)
#else
-/* Use last two entries for page migration swap entries */
-#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)-2)
-#define SWP_MIGRATION_READ MAX_SWAPFILES
-#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1)
+#define SWP_MIGRATION_NUM 0
#endif
/*
+ * Handling of hardware poisoned pages with memory corruption.
+ */
+#ifdef CONFIG_MEMORY_FAILURE
+#define SWP_HWPOISON_NUM 1
+#define SWP_HWPOISON MAX_SWAPFILES
+#else
+#define SWP_HWPOISON_NUM 0
+#endif
+
+#define MAX_SWAPFILES \
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+
+/*
* Magic header for a swap area. The first part of the union is
* what the swap magic looks like for the old (limited to 128MB)
* swap area format, the second part of the union adds - in the
Index: linux/include/linux/swapops.h
===================================================================
--- linux.orig/include/linux/swapops.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/swapops.h 2009-05-29 23:32:10.000000000 +0200
@@ -131,3 +131,41 @@
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Support for hardware poisoned pages
+ */
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ BUG_ON(!PageLocked(page));
+ return swp_entry(SWP_HWPOISON, page_to_pfn(page));
+}
+
+static inline int is_hwpoison_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_HWPOISON;
+}
+#else
+
+static inline swp_entry_t make_hwpoison_entry(struct page *page)
+{
+ return swp_entry(0, 0);
+}
+
+static inline int is_hwpoison_entry(swp_entry_t swp)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION)
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return swp_type(entry) >= MAX_SWAPFILES;
+}
+#else
+static inline int non_swap_entry(swp_entry_t entry)
+{
+ return 0;
+}
+#endif
Index: linux/mm/swapfile.c
===================================================================
--- linux.orig/mm/swapfile.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/swapfile.c 2009-05-29 23:32:10.000000000 +0200
@@ -579,7 +579,7 @@
struct swap_info_struct *p;
struct page *page = NULL;
- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return 1;
p = swap_info_get(entry);
@@ -1949,7 +1949,7 @@
unsigned long offset, type;
int result = 0;
- if (is_migration_entry(entry))
+ if (non_swap_entry(entry))
return 1;
type = swp_type(entry);
* [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (3 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3 Andi Kleen
` (11 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Add new SIGBUS codes for reporting machine checks as signals. When
the hardware detects an uncorrected ECC error it can trigger these
signals.
This is needed for telling KVM's qemu about machine checks that happen to
guests, so that it can inject them, but might also be useful for other programs.
I find it useful in my test programs.
This patch merely defines the new types.
- Define two new si_codes for SIGBUS. BUS_MCEERR_AO and BUS_MCEERR_AR
* BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some
corruption has been detected in the background, but nothing has been consumed
so far. The program can ignore those if it wants (but most programs would
already get killed)
* BUS_MCEERR_AR is for "Action Required" machine checks. This happens
when corrupted data is consumed or the application runs into an area
which is already known to be corrupted. These require immediate
action; the program cannot simply resume where it was interrupted.
Most programs would kill themselves.
- They report the address of the corruption in the user address space
in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
to user space. That's currently always a (small) page. The user application
cannot tell where in this page the corruption happened.
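The points above can be sketched from the receiving side. The helpers below show how a SIGBUS handler installed with SA_SIGINFO might interpret si_code and si_addr_lsb once it has copied them out of its siginfo_t. The numeric values 4 and 5 are the low bits of the codes defined in this patch; the MODEL_* constants, enum model_mce_action and the model_* functions are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* User-visible si_code values corresponding to this patch. */
enum { MODEL_BUS_MCEERR_AR = 4, MODEL_BUS_MCEERR_AO = 5 };

/* Hypothetical classification a handler might act on. */
enum model_mce_action {
	MODEL_NOT_MCE,       /* ordinary SIGBUS, not a machine check */
	MODEL_MAY_CONTINUE,  /* action optional: nothing consumed yet */
	MODEL_MUST_RECOVER   /* action required: cannot just resume */
};

static inline enum model_mce_action model_classify_sigbus(int si_code)
{
	switch (si_code) {
	case MODEL_BUS_MCEERR_AO:
		return MODEL_MAY_CONTINUE;
	case MODEL_BUS_MCEERR_AR:
		return MODEL_MUST_RECOVER;
	default:
		return MODEL_NOT_MCE;
	}
}

/* si_addr_lsb is a log2 value: the corrupted range is 2^lsb bytes. */
static inline size_t model_corruption_extent(int addr_lsb)
{
	return (size_t)1 << addr_lsb;
}

/* Start of the corrupted range containing the reported address. */
static inline uintptr_t model_corruption_base(uintptr_t addr, int addr_lsb)
{
	return addr & ~(((uintptr_t)1 << addr_lsb) - 1);
}
```

With si_addr_lsb equal to 12 (a 4 KiB page), the extent is 4096 bytes and an address such as 0x12345678 maps to the range starting at 0x12345000.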
AK: I plan to write a man page update before anyone asks.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/asm-generic/siginfo.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
Index: linux/include/asm-generic/siginfo.h
===================================================================
--- linux.orig/include/asm-generic/siginfo.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/asm-generic/siginfo.h 2009-05-29 23:32:10.000000000 +0200
@@ -82,6 +82,7 @@
#ifdef __ARCH_SI_TRAPNO
int _trapno; /* TRAP # which caused the signal */
#endif
+ short _addr_lsb; /* LSB of the reported address */
} _sigfault;
/* SIGPOLL */
@@ -112,6 +113,7 @@
#ifdef __ARCH_SI_TRAPNO
#define si_trapno _sifields._sigfault._trapno
#endif
+#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
@@ -192,7 +194,11 @@
#define BUS_ADRALN (__SI_FAULT|1) /* invalid address alignment */
#define BUS_ADRERR (__SI_FAULT|2) /* non-existant physical address */
#define BUS_OBJERR (__SI_FAULT|3) /* object specific hardware error */
-#define NSIGBUS 3
+/* hardware memory error consumed on a machine check: action required */
+#define BUS_MCEERR_AR (__SI_FAULT|4)
+/* hardware memory error detected in process but not consumed: action optional*/
+#define BUS_MCEERR_AO (__SI_FAULT|5)
+#define NSIGBUS 5
/*
* SIGTRAP si_codes
* [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (4 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
` (10 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
- Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now
architectures have to explicitly enable poison page support, so
this is forward compatible with all architectures. They only need
to add it when they enable poison page support.
- Add poison page handling in the swap-in fault code
v2: Add missing delayacct_clear_flag (Hidehiro Kawai)
v3: Really use delayacct_clear_flag (Hidehiro Kawai)
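As a rough illustration of how the new bit joins the existing error mask, the sketch below maps a fault result to an errno the way the get_user_pages hunk in this patch does when no pages have been pinned yet. The HWPOISON, SIGBUS and MAJOR values are taken from the mm.h hunk; the value 0x0001 for the OOM bit is an assumption, since that line is not part of the diff context, and model_fault_to_errno() is a hypothetical helper.

```c
#include <errno.h>

#define MODEL_VM_FAULT_OOM      0x0001u /* assumed, not shown in the diff */
#define MODEL_VM_FAULT_SIGBUS   0x0002u
#define MODEL_VM_FAULT_MAJOR    0x0004u
#define MODEL_VM_FAULT_HWPOISON 0x0010u

#define MODEL_VM_FAULT_ERROR \
	(MODEL_VM_FAULT_OOM | MODEL_VM_FAULT_SIGBUS | MODEL_VM_FAULT_HWPOISON)

/*
 * Returns 0 when the fault result carries no error bit, -ENOMEM for
 * out-of-memory, and -EFAULT for both SIGBUS and poisoned-page faults.
 */
static inline int model_fault_to_errno(unsigned int ret)
{
	if (!(ret & MODEL_VM_FAULT_ERROR))
		return 0;
	if (ret & MODEL_VM_FAULT_OOM)
		return -ENOMEM;
	/* Only SIGBUS or HWPOISON can remain set in the error mask. */
	return -EFAULT;
}
```

The design point is that a poisoned page looks like an ordinary bad address to callers that only see an errno; the distinct bit matters to the architecture fault handler, which picks the signal code.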
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/mm.h | 3 ++-
mm/memory.c | 18 +++++++++++++++---
2 files changed, 17 insertions(+), 4 deletions(-)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2009-05-29 23:32:09.000000000 +0200
+++ linux/mm/memory.c 2009-05-29 23:33:31.000000000 +0200
@@ -1315,7 +1315,8 @@
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
return i ? i : -ENOMEM;
- else if (ret & VM_FAULT_SIGBUS)
+ if (ret &
+ (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS))
return i ? i : -EFAULT;
BUG();
}
@@ -2459,8 +2460,15 @@
goto out;
entry = pte_to_swp_entry(orig_pte);
- if (is_migration_entry(entry)) {
- migration_entry_wait(mm, pmd, address);
+ if (unlikely(non_swap_entry(entry))) {
+ if (is_migration_entry(entry)) {
+ migration_entry_wait(mm, pmd, address);
+ } else if (is_hwpoison_entry(entry)) {
+ ret = VM_FAULT_HWPOISON;
+ } else {
+ print_bad_pte(vma, address, pte, NULL);
+ ret = VM_FAULT_OOM;
+ }
goto out;
}
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
@@ -2484,6 +2492,10 @@
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
+ } else if (PageHWPoison(page)) {
+ ret = VM_FAULT_HWPOISON;
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ goto out;
}
lock_page(page);
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-05-29 23:32:09.000000000 +0200
+++ linux/include/linux/mm.h 2009-05-29 23:33:29.000000000 +0200
@@ -702,11 +702,12 @@
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
+#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
-#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS)
+#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
/*
* Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
* [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (5 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v3 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
` (9 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Bail out early when hardware poisoned pages are found in page fault handling.
Since they are poisoned they should not be mapped freshly into processes,
because that would cause another (potentially deadly) machine check.
This is generally handled in the same way as OOM, just a different
error code is returned to the architecture code.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
mm/memory.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/memory.c 2009-05-29 23:32:10.000000000 +0200
@@ -2659,6 +2659,9 @@
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;
+ if (unlikely(PageHWPoison(vmf.page)))
+ return VM_FAULT_HWPOISON;
+
/*
* For consistency in subsequent calls, make the faulted page always
* locked.
* [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (6 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
` (8 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_SIGBUS and shares the do_sigbus() path; the only
difference is that a different si_code is passed to user space and the new
si_addr_lsb field is initialized.
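The selection logic in this patch can be summarized in a few lines. The sketch below is a user-space model, not the x86 code: struct model_sigbus_info and model_sigbus_for_fault() are hypothetical, the si_code numbers are the user-visible low bits (BUS_ADRERR is 2, BUS_MCEERR_AR is 4), and a PAGE_SHIFT of 12 assumes 4 KiB pages.

```c
#define MODEL_VM_FAULT_SIGBUS   0x0002u
#define MODEL_VM_FAULT_HWPOISON 0x0010u

#define MODEL_BUS_ADRERR     2
#define MODEL_BUS_MCEERR_AR  4
#define MODEL_PAGE_SHIFT     12 /* assumes 4 KiB pages */

/* The two siginfo fields this patch chooses per fault type. */
struct model_sigbus_info {
	int si_code;
	short si_addr_lsb;
};

static inline struct model_sigbus_info
model_sigbus_for_fault(unsigned int fault)
{
	struct model_sigbus_info info;

	/* Poisoned page: report an "action required" machine check. */
	info.si_code = (fault & MODEL_VM_FAULT_HWPOISON) ?
			MODEL_BUS_MCEERR_AR : MODEL_BUS_ADRERR;
	/* Only the machine check code carries a corruption extent. */
	info.si_addr_lsb = (info.si_code == MODEL_BUS_MCEERR_AR) ?
			MODEL_PAGE_SHIFT : 0;
	return info;
}
```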
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
arch/x86/mm/fault.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c 2009-05-29 23:32:09.000000000 +0200
+++ linux/arch/x86/mm/fault.c 2009-05-29 23:32:10.000000000 +0200
@@ -189,6 +189,7 @@
info.si_errno = 0;
info.si_code = si_code;
info.si_addr = (void __user *)address;
+ info.si_addr_lsb = si_code == BUS_MCEERR_AR ? PAGE_SHIFT : 0;
force_sig_info(si_signo, &info, tsk);
}
@@ -827,10 +828,12 @@
}
static void
-do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
+do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
+ unsigned int fault)
{
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ int code = BUS_ADRERR;
up_read(&mm->mmap_sem);
@@ -846,7 +849,14 @@
tsk->thread.error_code = error_code;
tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+#ifdef CONFIG_MEMORY_FAILURE
+ if (fault & VM_FAULT_HWPOISON) {
+ printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
+ tsk->comm, tsk->pid);
+ code = BUS_MCEERR_AR;
+ }
+#endif
+ force_sig_info_fault(SIGBUS, code, address, tsk);
}
static noinline void
@@ -856,8 +866,8 @@
if (fault & VM_FAULT_OOM) {
out_of_memory(regs, error_code, address);
} else {
- if (fault & VM_FAULT_SIGBUS)
- do_sigbus(regs, error_code, address);
+ if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON))
+ do_sigbus(regs, error_code, address, fault);
else
BUG();
}
* [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (7 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
` (7 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: Lee.Schermerhorn, npiggin, akpm, linux-kernel, linux-mm,
fengguang.wu
try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours (e.g.
migration not only sets up migration ptes but also turns off aging).
Also the different flags interact in magic ways.
A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.
Replace the different flags with an action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straightforward and allows easier extension
to new behaviours. Change all the callers to declare what they want to
do.
This patch is supposed to be a nop in behaviour. If anyone can prove
it is not that would be a bug.
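To show how the action code and the modifier bits combine, here is a small stand-alone sketch. The enum and TTU_ACTION() are copied from the rmap.h hunk in this patch; the two model_* helpers are hypothetical and only restate the checks that try_to_unmap_one() performs. The combination used for migration in the usage note is an inference from the old "!migration" tests, not something quoted from the callers.

```c
enum ttu_flags {
	TTU_UNMAP = 0,			/* unmap mode */
	TTU_MIGRATION = 1,		/* migration mode */
	TTU_MUNLOCK = 2,		/* munlock mode */
	TTU_ACTION_MASK = 0xff,

	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

/* Hypothetical helper: should VM_LOCKED vmas stop the unmap? */
static inline int model_honours_mlock(enum ttu_flags flags)
{
	return !(flags & TTU_IGNORE_MLOCK);
}

/* Hypothetical helper: should a recently referenced pte stop the unmap? */
static inline int model_honours_young(enum ttu_flags flags)
{
	return !(flags & TTU_IGNORE_ACCESS);
}
```

A migration caller that wants the old behaviour would presumably pass TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS, while plain reclaim passes TTU_UNMAP and keeps both checks. Because the action lives in the low eight bits, TTU_ACTION() recovers it regardless of which modifiers are set.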
Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/rmap.h | 14 +++++++++++++-
mm/migrate.c | 2 +-
mm/rmap.c | 40 ++++++++++++++++++++++------------------
mm/vmscan.c | 2 +-
4 files changed, 37 insertions(+), 21 deletions(-)
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/rmap.h 2009-05-29 23:33:30.000000000 +0200
@@ -84,7 +84,19 @@
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt);
-int try_to_unmap(struct page *, int ignore_refs);
+
+enum ttu_flags {
+ TTU_UNMAP = 0, /* unmap mode */
+ TTU_MIGRATION = 1, /* migration mode */
+ TTU_MUNLOCK = 2, /* munlock mode */
+ TTU_ACTION_MASK = 0xff,
+
+ TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
+ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+};
+#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
+
+int try_to_unmap(struct page *, enum ttu_flags flags);
/*
* Called from mm/filemap_xip.c to unmap empty zero page
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c 2009-05-29 23:33:30.000000000 +0200
@@ -755,7 +755,7 @@
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- int migration)
+ enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -777,11 +777,13 @@
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
*/
- if (!migration) {
+ if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}
+ }
+ if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
@@ -821,12 +823,12 @@
* pte. do_swap_page() will wait until the migration
* pte is removed and then restart fault handling.
*/
- BUG_ON(!migration);
+ BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION);
entry = make_migration_entry(page, pte_write(pteval));
}
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
- } else if (PAGE_MIGRATION && migration) {
+ } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) {
/* Establish migration entry for a file page */
swp_entry_t entry;
entry = make_migration_entry(page, pte_write(pteval));
@@ -995,12 +997,13 @@
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_anon(struct page *page, int unlock, int migration)
+static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1016,7 +1019,7 @@
continue; /* must visit all unlocked vmas */
ret = SWAP_MLOCK; /* saw at least one mlocked vma */
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
break;
}
@@ -1040,8 +1043,7 @@
/**
* try_to_unmap_file - unmap/unlock file page using the object-based rmap method
* @page: the page to unmap/unlock
- * @unlock: request for unlock rather than unmap [unlikely]
- * @migration: unmapping for migration - ignored if @unlock
+ * @flags: action and flags
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
@@ -1053,7 +1055,7 @@
* vm_flags for that VMA. That should be OK, because that vma shouldn't be
* 'LOCKED.
*/
-static int try_to_unmap_file(struct page *page, int unlock, int migration)
+static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
{
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -1065,6 +1067,7 @@
unsigned long max_nl_size = 0;
unsigned int mapcount;
unsigned int mlocked = 0;
+ int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
if (MLOCK_PAGES && unlikely(unlock))
ret = SWAP_SUCCESS; /* default for try_to_munlock() */
@@ -1077,7 +1080,7 @@
continue; /* must visit all vmas */
ret = SWAP_MLOCK;
} else {
- ret = try_to_unmap_one(page, vma, migration);
+ ret = try_to_unmap_one(page, vma, flags);
if (ret == SWAP_FAIL || !page_mapped(page))
goto out;
}
@@ -1102,7 +1105,8 @@
ret = SWAP_MLOCK; /* leave mlocked == 0 */
goto out; /* no need to look further */
}
- if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED))
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
+ (vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -1136,7 +1140,7 @@
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (!MLOCK_PAGES && !migration &&
+ if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED))
continue;
cursor = (unsigned long) vma->vm_private_data;
@@ -1176,7 +1180,7 @@
/**
* try_to_unmap - try to remove all page table mappings to a page
* @page: the page to get unmapped
- * @migration: migration flag
+ * @flags: action and flags
*
* Tries to remove all the page table entries which are mapping this
* page, used in the pageout path. Caller must hold the page lock.
@@ -1187,16 +1191,16 @@
* SWAP_FAIL - the page is unswappable
* SWAP_MLOCK - page is mlocked.
*/
-int try_to_unmap(struct page *page, int migration)
+int try_to_unmap(struct page *page, enum ttu_flags flags)
{
int ret;
BUG_ON(!PageLocked(page));
if (PageAnon(page))
- ret = try_to_unmap_anon(page, 0, migration);
+ ret = try_to_unmap_anon(page, flags);
else
- ret = try_to_unmap_file(page, 0, migration);
+ ret = try_to_unmap_file(page, flags);
if (ret != SWAP_MLOCK && !page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
@@ -1222,8 +1226,8 @@
VM_BUG_ON(!PageLocked(page) || PageLRU(page));
if (PageAnon(page))
- return try_to_unmap_anon(page, 1, 0);
+ return try_to_unmap_anon(page, TTU_MUNLOCK);
else
- return try_to_unmap_file(page, 1, 0);
+ return try_to_unmap_file(page, TTU_MUNLOCK);
}
#endif
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2009-05-29 23:32:09.000000000 +0200
+++ linux/mm/vmscan.c 2009-05-29 23:32:10.000000000 +0200
@@ -666,7 +666,7 @@
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, 0)) {
+ switch (try_to_unmap(page, TTU_UNMAP)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c 2009-05-29 23:32:09.000000000 +0200
+++ linux/mm/migrate.c 2009-05-29 23:32:10.000000000 +0200
@@ -669,7 +669,7 @@
}
/* Establish migration ptes or remove ptes */
- try_to_unmap(page, 1);
+ try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
if (!page_mapped(page))
rc = move_to_new_page(newpage, page);
--
* [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (8 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
` (6 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
When a page has the poison bit set, replace the PTE with a poison entry.
This causes the right error handling to be done later when a process runs
into it.
Also add a new flag, TTU_IGNORE_HWPOISON, to suppress that replacement (needed
for the memory-failure handler later).
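The decision made in try_to_unmap_one() can be modelled in user space
roughly as below. This is only a sketch, not kernel code: the three
TTU_IGNORE_* bit values are taken from the rmap.h hunk in this patch, while
the action codes, the 0xff action mask, and the names choose_pte() and
enum pte_result are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the try_to_unmap() flag word. Not kernel code. */
enum ttu_flags {
	TTU_UNMAP = 0,			/* assumed action code */
	TTU_MIGRATION = 1,		/* assumed action code */
	TTU_MUNLOCK = 2,		/* assumed action code */
	TTU_ACTION_MASK = 0xff,		/* assumed: low byte holds the action */

	TTU_IGNORE_MLOCK = (1 << 8),	/* values as in the rmap.h hunk */
	TTU_IGNORE_ACCESS = (1 << 9),
	TTU_IGNORE_HWPOISON = (1 << 10),
};

#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

/* Hypothetical result type: which kind of PTE the unmap would install. */
enum pte_result { PTE_POISON_ENTRY, PTE_NORMAL_PATH };

/*
 * A poisoned page gets a poison entry unless the caller explicitly
 * asked to ignore the poison bit.
 */
enum pte_result choose_pte(bool page_hwpoison, int flags)
{
	if (page_hwpoison && !(flags & TTU_IGNORE_HWPOISON))
		return PTE_POISON_ENTRY;
	return PTE_NORMAL_PATH;
}
```

The modifier bits sit above the action byte, so a caller can OR them onto
any action without disturbing what TTU_ACTION() extracts.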
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/linux/rmap.h | 1 +
mm/rmap.c | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c 2009-05-29 23:33:29.000000000 +0200
@@ -801,7 +801,14 @@
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageAnon(page)) {
+ if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+ if (PageAnon(page))
+ dec_mm_counter(mm, anon_rss);
+ else if (!is_migration_entry(pte_to_swp_entry(*pte)))
+ dec_mm_counter(mm, file_rss);
+ set_pte_at(mm, address, pte,
+ swp_entry_to_pte(make_hwpoison_entry(page)));
+ } else if (PageAnon(page)) {
swp_entry_t entry = { .val = page_private(page) };
if (PageSwapCache(page)) {
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/rmap.h 2009-05-29 23:32:10.000000000 +0200
@@ -93,6 +93,7 @@
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
+ TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
--
* [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty()
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (9 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
` (5 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Bail out early in set_page_dirty for poisoned pages. We don't want any
of the dirty accounting done or file system write back started, because
the page will just be thrown away.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
mm/page-writeback.c | 4 ++++
1 file changed, 4 insertions(+)
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2009-05-29 23:32:08.000000000 +0200
+++ linux/mm/page-writeback.c 2009-05-29 23:32:10.000000000 +0200
@@ -1277,6 +1277,10 @@
{
struct address_space *mapping = page_mapping(page);
+ if (unlikely(PageHWPoison(page))) {
+ SetPageDirty(page);
+ return 0;
+ }
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
#ifdef CONFIG_BLOCK
--
* [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (10 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
` (4 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: fengguang.wu, akpm, linux-kernel, linux-mm
From: Wu Fengguang <fengguang.wu@intel.com>
If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, when prep_new_page()
will automatically check and isolate the PG_hwpoison page for us (for order-0
allocations).
This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.
Note that the common case, allocating only a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.
This simple implementation may drop some innocent neighbor pages. Hopefully
that is not a big problem, because the event should be rare enough.
This patch adds some runtime cost for high order page users.
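The per-component check can be sketched in user space as follows. This is a
toy model, not the kernel code: struct toy_page, TOY_PG_HWPOISON (including
its bit position) and first_bad_page() are hypothetical names that only mirror
the fields tested in prep_new_page().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; only the fields the check looks at. */
struct toy_page {
	int mapcount;
	void *mapping;
	int count;
	unsigned long flags;
};

#define TOY_PG_HWPOISON		(1UL << 0)	/* hypothetical bit position */
#define TOY_FLAGS_CHECK_AT_PREP	TOY_PG_HWPOISON

/*
 * Walk all (1 << order) component pages of a block and return the index
 * of the first one that fails the sanity check, or -1 if all are clean.
 * For order 0 this examines exactly one page, as before the patch.
 */
int first_bad_page(const struct toy_page *page, int order)
{
	for (int i = 0; i < (1 << order); i++) {
		const struct toy_page *p = page + i;

		if (p->mapcount || p->mapping != NULL || p->count != 0 ||
		    (p->flags & TOY_FLAGS_CHECK_AT_PREP))
			return i;
	}
	return -1;
}
```

With a poisoned page at index 3 of an order-2 block, an order-0 check of the
first page still passes, while the order-2 check rejects the whole block.
That is the "innocent neighbor pages" cost mentioned above.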
[AK: Improved description]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
mm/page_alloc.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c 2009-05-29 23:32:08.000000000 +0200
+++ linux/mm/page_alloc.c 2009-05-29 23:32:11.000000000 +0200
@@ -633,12 +633,22 @@
*/
static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
{
- if (unlikely(page_mapcount(page) |
- (page->mapping != NULL) |
- (page_count(page) != 0) |
- (page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
- bad_page(page);
- return 1;
+ int i;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page *p = page + i;
+
+ if (unlikely(page_mapcount(p) |
+ (p->mapping != NULL) |
+ (page_count(p) != 0) |
+ (p->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
+ /*
+ * The whole array of pages will be dropped,
+ * hopefully this is a rare and abnormal event.
+ */
+ bad_page(p);
+ return 1;
+ }
}
set_page_private(page, 0);
--
* [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (11 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-06-01 11:16 ` Nick Piggin
2009-05-29 21:35 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
` (3 subsequent siblings)
16 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: hugh, npiggin, riel, chris.mason, akpm, linux-kernel, linux-mm,
fengguang.wu
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.
This is done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.
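The type-based dispatch can be pictured as a first-match table of
(mask, result) pairs over the page flags, in the style of the error_states
table in this patch. The sketch below is a user-space illustration only: the
bit positions, the three sample rows, and classify() are hypothetical, and it
assumes an entry matches when (flags & mask) == res.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define T_DIRTY	(1UL << 0)	/* hypothetical bit positions */
#define T_LRU	(1UL << 1)
#define T_SLAB	(1UL << 2)

struct toy_state {
	unsigned long mask;	/* which flag bits this row looks at */
	unsigned long res;	/* required value of those bits */
	const char *msg;
};

/* Order matters: the first matching row wins. */
static const struct toy_state toy_states[] = {
	{ T_SLAB,		T_SLAB,			"kernel slab" },
	{ T_LRU | T_DIRTY,	T_LRU | T_DIRTY,	"dirty lru" },
	{ T_LRU | T_DIRTY,	T_LRU,			"clean lru" },
	{ 0,			0,			"unknown page state" }, /* catch-all */
};

const char *classify(unsigned long flags)
{
	const struct toy_state *ps;

	/* Terminates because the catch-all row matches every flag word. */
	for (ps = toy_states; ; ps++)
		if ((flags & ps->mask) == ps->res)
			return ps->msg;
}
```

Putting the more specific rows first (slab before lru, dirty before clean) is
what lets a single linear scan pick the right handler.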
The code that does this is portable and lives in mm/memory-failure.c.
To quote the overview comment:
* High level machine check handler. Handles pages reported by the
* hardware as being corrupted usually due to a 2bit ECC memory or cache
* failure.
*
* This focuses on pages detected as corrupted in the background.
* When the current CPU tries to consume corruption the currently
* running process can just be killed directly instead. This implies
* that if the error cannot be handled for some reason it's safe to
* just ignore it because no corruption has been consumed yet. Instead
* when that happens another machine check will happen.
*
* Handles page cache pages in various states. The tricky part
* here is that we can access any page asynchronous to other VM
* users, because memory failures could happen anytime and anywhere,
* possibly violating some of their assumptions. This is why this code
* has to be extremely careful. Generally it tries to use normal locking
* rules, as in get the standard locks, even if that means the
* error handling takes potentially a long time.
*
* Some of the operations here are somewhat inefficient and have non
* linear algorithmic complexity, because the data structures have not
* been optimized for this case. This is in particular the case
* for the mapping from a vma to a process. Since this case is expected
* to be rare we hope we can get away with this.
There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl, vm.memory_failure_early_kill.
The default is early kill.
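From the application side, early kill shows up as a blockable SIGBUS with
si_code BUS_MCEERR_AO and si_addr set to the affected address (see
kill_proc_ao() in the patch). A process that wants to survive could catch it
along these lines. This is a sketch only: it assumes the libc headers expose
BUS_MCEERR_AO (with a fallback to the Linux value 5), and classify_sigbus(),
on_sigbus() and install_sigbus_handler() are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5		/* assumed fallback: Linux "action optional" code */
#endif

enum poison_kind { NOT_POISON, POISON_ACTION_OPTIONAL };

/* Decide whether a signal is the advisory memory-failure notification. */
enum poison_kind classify_sigbus(int signo, int code)
{
	if (signo == SIGBUS && code == BUS_MCEERR_AO)
		return POISON_ACTION_OPTIONAL;
	return NOT_POISON;
}

static volatile sig_atomic_t saw_poison;
static void *volatile poison_addr;

/* Only record the event; real recovery should happen outside the handler. */
static void on_sigbus(int signo, siginfo_t *si, void *ctx)
{
	(void)ctx;
	if (classify_sigbus(signo, si->si_code) == POISON_ACTION_OPTIONAL) {
		poison_addr = si->si_addr;
		saw_poison = 1;
	}
}

/* Returns 0 on success, -1 on failure, like sigaction(). */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With late kill no such advisory signal arrives; the process only notices
when it actually touches the unmapped page.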
The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together, and rmap knowledge has been seeping out anyway.
v2: Fix anon vma unlock crash (noticed by Johannes Weiner <hannes@cmpxchg.org>)
Handle pages on free list correctly (also noticed by Johannes)
Fix inverted try_to_release_page check (found by Chris Mason)
Add documentation for the new sysctl.
Various other cleanups/comment fixes.
v3: Use blockable signal for AO SIGBUS for better qemu handling.
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
Cc: hugh@veritas.com
Cc: npiggin@suse.de
Cc: riel@redhat.com
Cc: chris.mason@oracle.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
Documentation/sysctl/vm.txt | 21 +
fs/proc/meminfo.c | 9
include/linux/mm.h | 4
kernel/sysctl.c | 14
mm/Kconfig | 3
mm/Makefile | 1
mm/filemap.c | 4
mm/memory-failure.c | 720 ++++++++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 4
9 files changed, 778 insertions(+), 2 deletions(-)
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile 2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/Makefile 2009-05-29 23:33:28.000000000 +0200
@@ -38,3 +38,4 @@
endif
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/memory-failure.c 2009-05-29 23:32:11.000000000 +0200
@@ -0,0 +1,720 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focuses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * of mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - nonlinear
+ * - remap races
+ * - anonymous (tinject):
+ * + left over references when process catches signal?
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/page-flags.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+#define Dprintk(x...) printk(x)
+
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+ unsigned long pfn)
+{
+ struct siginfo si;
+ int ret;
+
+ printk(KERN_ERR
+ "MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
+ pfn, t->comm, t->pid);
+ si.si_signo = SIGBUS;
+ si.si_errno = 0;
+ si.si_code = BUS_MCEERR_AO;
+ si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+ si.si_trapno = trapno;
+#endif
+ si.si_addr_lsb = PAGE_SHIFT;
+ /*
+ * Don't use force here, it's convenient if the signal
+ * can be temporarily blocked.
+ * This could cause a loop when the user sets SIGBUS
+ * to SIG_IGN, but hopefully no one will do that?
+ */
+ ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */
+ if (ret < 0)
+ printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+ t->comm, t->pid, ret);
+ return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+ struct list_head nd;
+ struct task_struct *tsk;
+ unsigned long addr;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do. We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+ struct vm_area_struct *vma,
+ struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct to_kill *tk;
+
+ if (*tkc) {
+ tk = *tkc;
+ *tkc = NULL;
+ } else {
+ tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+ if (!tk) {
+ printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
+ return;
+ }
+ }
+ tk->addr = page_address_in_vma(p, vma);
+ if (tk->addr == -EFAULT) {
+ printk(KERN_INFO "MCE: Unable to determine user space address during error handling\n");
+ tk->addr = 0;
+ }
+ get_task_struct(tsk);
+ tk->tsk = tsk;
+ list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+ int fail, unsigned long pfn)
+{
+ struct to_kill *tk, *next;
+
+ list_for_each_entry_safe (tk, next, to_kill, nd) {
+ if (doit) {
+ /*
+ * In case something went wrong with unmapping,
+ * make sure the process doesn't catch the
+ * signal and then access the memory. Just kill it,
+ * bypassing the signal handlers.
+ */
+ if (fail) {
+ printk(KERN_ERR
+ "MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+ pfn, tk->tsk->comm, tk->tsk->pid);
+ force_sig(SIGKILL, tk->tsk);
+ }
+
+ /*
+ * In theory the process could have mapped
+ * something else on the address in-between. We could
+ * check for that, but we need to tell the
+ * process anyways.
+ */
+ else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+ pfn) < 0)
+ printk(KERN_ERR
+ "MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+ pfn, tk->tsk->comm, tk->tsk->pid);
+ }
+ put_task_struct(tk->tsk);
+ kfree(tk);
+ }
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct anon_vma *av = page_lock_anon_vma(page);
+
+ if (av == NULL) /* Not actually mapped anymore */
+ return;
+
+ read_lock(&tasklist_lock);
+ for_each_process (tsk) {
+ if (!tsk->mm)
+ continue;
+ list_for_each_entry (vma, &av->head, anon_vma_node) {
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ }
+ page_unlock_anon_vma(av);
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+ struct to_kill **tkc)
+{
+ struct vm_area_struct *vma;
+ struct task_struct *tsk;
+ struct prio_tree_iter iter;
+ struct address_space *mapping = page_mapping(page);
+
+ read_lock(&tasklist_lock);
+ spin_lock(&mapping->i_mmap_lock);
+ for_each_process(tsk) {
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+ if (!tsk->mm)
+ continue;
+
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+ pgoff)
+ if (vma->vm_mm == tsk->mm)
+ add_to_kill(tsk, page, vma, to_kill, tkc);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+ struct to_kill *tk;
+
+ tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
+ /* memory allocation failure is implicitly handled */
+ if (PageAnon(page))
+ collect_procs_anon(page, tokill, &tk);
+ else
+ collect_procs_file(page, tokill, &tk);
+ kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+ FAILED,
+ DELAYED,
+ IGNORED,
+ RECOVERED,
+};
+
+static const char *action_name[] = {
+ [FAILED] = "Failed", /* Error handling failed */
+ [DELAYED] = "Delayed", /* Will be handled later */
+ [IGNORED] = "Ignored", /* Error safely ignored */
+ [RECOVERED] = "Recovered", /* Successfully recovered */
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p)
+{
+ return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p)
+{
+ return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p)
+{
+ printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
+ return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p)
+{
+ return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p)
+{
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ if (page_has_private(p))
+ do_invalidatepage(p, 0);
+ if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+ Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
+ page_to_pfn(p));
+
+ /*
+ * remove_from_page_cache assumes (mapping && !mapped)
+ */
+ if (page_mapping(p) && !page_mapped(p)) {
+ remove_from_page_cache(p);
+ page_cache_release(p);
+ }
+
+ return RECOVERED;
+}
+
+/*
+ * Dirty page cache page
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p)
+{
+ struct address_space *mapping = page_mapping(p);
+
+ SetPageError(p);
+ /* TBD: print more information about the file. */
+ printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
+ page_to_pfn(p));
+ if (mapping) {
+ /*
+ * Truncate does the same, but we're not quite the same
+ * as truncate. This doesn't try to unallocate blocks
+ * on disk or make the file shorter. It's more like a
+ * "temporary hole punch".
+ * Needs more checking, but keep it for now.
+ */
+ cancel_dirty_page(p, PAGE_CACHE_SIZE);
+
+ /*
+ * IO error will be reported by write(), fsync(), etc.
+ * who check the mapping.
+ * This way the application knows that something went
+ * wrong with its dirty file data.
+ */
+ mapping_set_error(mapping, EIO);
+ }
+
+ me_pagecache_clean(p);
+
+ /*
+ * Did the earlier release work?
+ */
+ if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
+ return FAILED;
+
+ return RECOVERED;
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache(ie. page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ * - clear dirty bit to prevent IO
+ * - remove from LRU
+ * - but keep in the swap cache, so that when we return to it on
+ * a later page fault, we know the application is accessing
+ * corrupted data and shall be killed (we installed simple
+ * interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated. A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p)
+{
+ ClearPageDirty(p);
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ return DELAYED;
+}
+
+static int me_swapcache_clean(struct page *p)
+{
+ ClearPageUptodate(p);
+
+ if (!isolate_lru_page(p))
+ page_cache_release(p);
+
+ delete_from_swap_cache(p);
+
+ return RECOVERED;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p)
+{
+ return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access a page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty (1UL << PG_dirty)
+#define swapcache (1UL << PG_swapcache)
+#define unevict (1UL << PG_unevictable)
+#define mlocked (1UL << PG_mlocked)
+#define writeback (1UL << PG_writeback)
+#define lru (1UL << PG_lru)
+#define swapbacked (1UL << PG_swapbacked)
+#define head (1UL << PG_head)
+#define tail (1UL << PG_tail)
+#define compound (1UL << PG_compound)
+#define slab (1UL << PG_slab)
+#define buddy (1UL << PG_buddy)
+#define reserved (1UL << PG_reserved)
+
+/*
+ * The table is > 80 columns because all the alternatives were much worse.
+ */
+
+static struct page_state {
+ unsigned long mask;
+ unsigned long res;
+ char *msg;
+ int (*action)(struct page *p);
+} error_states[] = {
+ { reserved, reserved, "reserved kernel", me_ignore },
+ { buddy, buddy, "free kernel", me_free },
+
+ /*
+ * Could in theory check if slab page is free or if we can drop
+ * currently unused objects without touching them. But just
+ * treat it as standard kernel for now.
+ */
+ { slab, slab, "kernel slab", me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+ { head, head, "hugetlb", me_huge_page },
+ { tail, tail, "hugetlb", me_huge_page },
+#else
+ { compound, compound, "hugetlb", me_huge_page },
+#endif
+
+ { swapcache|dirty, swapcache|dirty,"dirty swapcache", me_swapcache_dirty },
+ { swapcache|dirty, swapcache, "clean swapcache", me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+ { unevict|dirty, unevict|dirty, "unevictable dirty lru", me_pagecache_dirty },
+ { unevict, unevict, "unevictable lru", me_pagecache_clean },
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+ { mlocked|dirty, mlocked|dirty, "mlocked dirty lru", me_pagecache_dirty },
+ { mlocked, mlocked, "mlocked lru", me_pagecache_clean },
+#endif
+
+ { lru|dirty, lru|dirty, "dirty lru", me_pagecache_dirty },
+ { lru|dirty, lru, "clean lru", me_pagecache_clean },
+ { swapbacked, swapbacked, "anonymous", me_pagecache_clean },
+
+ /*
+ * Add more states here.
+ */
+
+ /*
+ * Catchall entry: must be at end.
+ */
+ { 0, 0, "unknown page state", me_unknown },
+};
+
+static void page_action(char *msg, struct page *p, int (*action)(struct page *),
+ unsigned long pfn)
+{
+ int ret;
+
+ printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
+ ret = action(p);
+ printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
+ pfn, msg, action_name[ret]);
+ if (page_count(p) != 1)
+ printk(KERN_ERR
+ "MCE %#lx: %s page still referenced by %d users\n",
+ pfn, msg, page_count(p) - 1);
+
+ /* Could do more checks here if page looks ok */
+ atomic_long_add(1, &mce_bad_pages);
+
+ /*
+ * Could adjust zone counters here to correct for the missing page.
+ */
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+ int trapno)
+{
+ enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+ int kill = sysctl_memory_failure_early_kill;
+ struct address_space *mapping;
+ LIST_HEAD(tokill);
+ int ret;
+ int i;
+
+ if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+ return;
+
+ if (!PageLRU(p))
+ lru_add_drain();
+
+ /*
+ * This check implies we don't kill processes if their pages
+ * are in the swap cache early. Those are always late kills.
+ */
+ if (!page_mapped(p))
+ return;
+
+ if (PageSwapCache(p)) {
+ printk(KERN_ERR
+ "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+ ttu |= TTU_IGNORE_HWPOISON;
+ }
+
+ /*
+ * Poisoned clean file pages are harmless, the
+ * data can be restored by regular page faults.
+ */
+ mapping = page_mapping(p);
+ if (!PageDirty(p) && !PageWriteback(p) &&
+ !PageAnon(p) && !PageSwapBacked(p) &&
+ mapping && mapping_cap_account_dirty(mapping)) {
+ if (page_mkclean(p))
+ SetPageDirty(p);
+ else {
+ kill = 0;
+ ttu |= TTU_IGNORE_HWPOISON;
+ }
+ }
+
+ /*
+ * First collect all the processes that have the page
+ * mapped. This has to be done before try_to_unmap,
+ * because ttu takes the rmap data structures down.
+ *
+ * This also has the side effect to propagate the dirty
+ * bit from PTEs into the struct page. This is needed
+ * to actually decide if something needs to be killed
+ * or errored, or if it's ok to just drop the page.
+ *
+ * Error handling: We ignore errors here because
+ * there's nothing that can be done.
+ *
+ * RED-PEN some cases in process exit seem to deadlock
+ * on the page lock. drop it or add poison checks?
+ */
+ if (kill)
+ collect_procs(p, &tokill);
+
+ /*
+ * try_to_unmap can fail temporarily due to races.
+ * Try a few times (RED-PEN better strategy?)
+ */
+ for (i = 0; i < N_UNMAP_TRIES; i++) {
+ ret = try_to_unmap(p, ttu);
+ if (ret == SWAP_SUCCESS)
+ break;
+ Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
+ }
+
+ /*
+ * Now that the dirty bit has been propagated to the
+ * struct page and all unmaps done we can decide if
+ * killing is needed or not. Only kill when the page
+ * was dirty, otherwise the tokill list is merely
+ * freed. When there was a problem unmapping earlier
+ * use a more forceful uncatchable kill to prevent
+ * any accesses to the poisoned memory.
+ */
+ kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+ ret != SWAP_SUCCESS, pfn);
+}
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ *
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+ struct page_state *ps;
+ struct page *p;
+
+ if (!pfn_valid(pfn)) {
+ printk(KERN_ERR
+ "MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
+ pfn);
+ return;
+ }
+
+
+ p = pfn_to_page(pfn);
+ if (TestSetPageHWPoison(p)) {
+ printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
+ return;
+ }
+
+ /*
+ * We need/can do nothing about count=0 pages.
+ * 1) it's a free page, and therefore in safe hand:
+ * prep_new_page() will be the gate keeper.
+ * 2) it's part of a non-compound high order page.
+ * Implies some kernel user: cannot stop them from
+ * R/W the page; let's pray that the page has been
+ * used and will be freed some time later.
+ * In fact it's dangerous to directly bump up page count from 0,
+ * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+ */
+ if (!get_page_unless_zero(compound_head(p))) {
+ printk(KERN_ERR
+ "MCE 0x%lx: ignoring free or high order page\n", pfn);
+ return;
+ }
+
+ /*
+ * Lock the page and wait for writeback to finish.
+ * It's very difficult to mess with pages currently under IO
+ * and in many cases impossible, so we just avoid it here.
+ */
+ lock_page_nosync(p);
+ wait_on_page_writeback(p);
+
+ /*
+ * Now take care of user space mappings.
+ */
+ hwpoison_user_mappings(p, pfn, trapno);
+
+ /* Torn down by someone else? */
+ if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+ printk(KERN_ERR
+ "MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
+ goto out;
+ }
+
+ for (ps = error_states;; ps++) {
+ if ((p->flags & ps->mask) == ps->res) {
+ page_action(ps->msg, p, ps->action, pfn);
+ break;
+ }
+ }
+out:
+ unlock_page(p);
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h 2009-05-29 23:32:10.000000000 +0200
+++ linux/include/linux/mm.h 2009-05-29 23:32:11.000000000 +0200
@@ -1322,6 +1322,10 @@
extern void *alloc_locked_buffer(size_t size);
extern void free_locked_buffer(void *buffer, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int sysctl_memory_failure_early_kill;
+extern atomic_long_t mce_bad_pages;
extern void release_locked_buffer(void *buffer, size_t size);
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c 2009-05-29 23:32:07.000000000 +0200
+++ linux/kernel/sysctl.c 2009-05-29 23:32:11.000000000 +0200
@@ -1282,6 +1282,20 @@
.proc_handler = &scan_unevictable_handler,
},
#endif
+#ifdef CONFIG_MEMORY_FAILURE
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "memory_failure_early_kill",
+ .data = &sysctl_memory_failure_early_kill,
+ .maxlen = sizeof(sysctl_memory_failure_early_kill),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
+
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c 2009-05-29 23:32:07.000000000 +0200
+++ linux/fs/proc/meminfo.c 2009-05-29 23:32:11.000000000 +0200
@@ -97,7 +97,11 @@
"Committed_AS: %8lu kB\n"
"VmallocTotal: %8lu kB\n"
"VmallocUsed: %8lu kB\n"
- "VmallocChunk: %8lu kB\n",
+ "VmallocChunk: %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+ "BadPages: %8lu kB\n"
+#endif
+ ,
K(i.totalram),
K(i.freeram),
K(i.bufferram),
@@ -144,6 +148,9 @@
(unsigned long)VMALLOC_TOTAL >> 10,
vmi.used >> 10,
vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+ ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
);
hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/Kconfig 2009-05-29 23:33:29.000000000 +0200
@@ -226,6 +226,9 @@
config MMU_NOTIFIER
bool
+config MEMORY_FAILURE
+ bool
+
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt 2009-05-29 23:32:07.000000000 +0200
+++ linux/Documentation/sysctl/vm.txt 2009-05-29 23:32:11.000000000 +0200
@@ -32,6 +32,7 @@
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- memory_failure_early_kill
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
@@ -53,7 +54,6 @@
- vfs_cache_pressure
- zone_reclaim_mode
-
==============================================================
block_dump
@@ -275,6 +275,25 @@
The default value is 65536.
+=============================================================
+
+memory_failure_early_kill:
+
+Control how to kill processes when an uncorrected memory error (typically
+a 2-bit error in a memory module) is detected in the background by hardware.
+
+1: Kill all processes that have the corrupted page mapped as soon as the
+corruption is detected.
+
+0: Only unmap the page from all processes and kill only a process
+that tries to access it.
+
+The kill is done using a catchable SIGBUS, so processes can handle this
+if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
==============================================================
min_free_kbytes:
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c 2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/filemap.c 2009-05-29 23:32:11.000000000 +0200
@@ -105,6 +105,10 @@
*
* ->task->proc_lock
* ->dcache_lock (proc_pid_lookup)
+ *
+ * (code doesn't rely on that order, so you could switch it around)
+ * ->tasklist_lock (memory_failure, collect_procs_ao)
+ * ->i_mmap_lock
*/
/*
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c 2009-05-29 23:32:10.000000000 +0200
+++ linux/mm/rmap.c 2009-05-29 23:32:11.000000000 +0200
@@ -36,6 +36,10 @@
* mapping->tree_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
* within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ * anon_vma->lock (memory_failure, collect_procs_anon)
*/
#include <linux/mm.h>
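The error_states lookup in memory_failure() is a first-match scan over (mask, res) pairs, ending in a catchall row. As a rough illustration only, here is a minimal user-space sketch of that idea. The flag values, the struct name and classify() are assumptions made for the example and are not the kernel's definitions; only the three "lru"/"anonymous" rows and the catchall are mirrored from the table above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits for this sketch; the real PG_* values differ. */
#define F_LRU        (1UL << 0)
#define F_DIRTY      (1UL << 1)
#define F_SWAPBACKED (1UL << 2)

struct state {
	unsigned long mask;	/* which bits to look at */
	unsigned long res;	/* required value of those bits */
	const char *msg;
};

/* Order matters: the first matching row wins, the catchall is last. */
static const struct state states[] = {
	{ F_LRU | F_DIRTY, F_LRU | F_DIRTY, "dirty lru" },
	{ F_LRU | F_DIRTY, F_LRU,           "clean lru" },
	{ F_SWAPBACKED,    F_SWAPBACKED,    "anonymous" },
	{ 0,               0,               "unknown page state" },
};

static const char *classify(unsigned long flags)
{
	const struct state *ps;

	/* Terminates because (flags & 0) == 0 always matches the last row. */
	for (ps = states; ; ps++)
		if ((flags & ps->mask) == ps->res)
			return ps->msg;
}
```

Because rows are tested in order, a more specific state has to be listed before a more general one, which is why the catchall must stay at the end.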
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4
2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
@ 2009-06-01 11:16 ` Nick Piggin
2009-06-01 12:46 ` Wu Fengguang
0 siblings, 1 reply; 36+ messages in thread
From: Nick Piggin @ 2009-06-01 11:16 UTC (permalink / raw)
To: Andi Kleen
Cc: hugh, riel, chris.mason, akpm, linux-kernel, linux-mm,
fengguang.wu
On Fri, May 29, 2009 at 11:35:39PM +0200, Andi Kleen wrote:
> + mapping = page_mapping(p);
> + if (!PageDirty(p) && !PageWriteback(p) &&
> + !PageAnon(p) && !PageSwapBacked(p) &&
> + mapping && mapping_cap_account_dirty(mapping)) {
Haven't had another good look at this yet, but if you hold the
page locked, and have done a wait_on_page_writeback, then
PageWriteback == true is a kernel bug.
* Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4
2009-06-01 11:16 ` Nick Piggin
@ 2009-06-01 12:46 ` Wu Fengguang
0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2009-06-01 12:46 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, hugh@veritas.com, riel@redhat.com,
chris.mason@oracle.com, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
On Mon, Jun 01, 2009 at 07:16:41PM +0800, Nick Piggin wrote:
> On Fri, May 29, 2009 at 11:35:39PM +0200, Andi Kleen wrote:
> > + mapping = page_mapping(p);
> > + if (!PageDirty(p) && !PageWriteback(p) &&
> > + !PageAnon(p) && !PageSwapBacked(p) &&
> > + mapping && mapping_cap_account_dirty(mapping)) {
>
> Haven't had another good look at this yet, but if you hold the
> page locked, and have done a wait_on_page_writeback, then
> PageWriteback == true is a kernel bug.
Right, we can eliminate the PageWriteback() test when there is a
wait_on_page_writeback().
* [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (12 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v4 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
` (2 subsequent siblings)
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Normally the memory-failure.c code is enabled by the architecture, but
for easier testing independent of architecture changes enable it unconditionally.
This should not be merged into mainline.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
mm/Kconfig | 2 ++
1 file changed, 2 insertions(+)
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-05-29 23:32:11.000000000 +0200
+++ linux/mm/Kconfig 2009-05-29 23:33:28.000000000 +0200
@@ -228,6 +228,8 @@
config MEMORY_FAILURE
bool
+ default y
+ depends on MMU
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
* [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (13 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:35 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen
2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Impact: optional, useful for debugging
Add a new madvise() subcommand to inject poison into some
pages in a process's address space. This is useful for
testing the poisoned page handling.
Open issues:
- This patch allows root to tie up arbitrary amounts of memory.
Should this be disabled inside containers?
- There's a small race window between getting the page and injecting.
The patch drops the ref count because otherwise memory_failure
complains about dangling references. In theory, with a multithreaded
injector one could inject poison into a page belonging to another process this way.
Not a serious issue right now.
v2: Use write flag for get_user_pages to make sure to always get
a fresh page
v3: Don't request write mapping (Fengguang Wu)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
include/asm-generic/mman.h | 1 +
mm/madvise.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c 2009-05-29 23:32:07.000000000 +0200
+++ linux/mm/madvise.c 2009-05-29 23:32:11.000000000 +0200
@@ -208,6 +208,38 @@
return error;
}
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Error injection support for memory error handling.
+ */
+static int madvise_hwpoison(unsigned long start, unsigned long end)
+{
+ /*
+ * RED-PEN
+ * This allows tying up arbitrary amounts of memory.
+ * Might be a good idea to disable it inside containers even for root.
+ */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ for (; start < end; start += PAGE_SIZE) {
+ struct page *p;
+ int ret = get_user_pages(current, current->mm, start, 1,
+ 0, 0, &p, NULL);
+ if (ret != 1)
+ return ret;
+ put_page(p);
+ /*
+ * RED-PEN page can be reused in a short window, but otherwise
+ * we'll have to fight with the reference count.
+ */
+ printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n",
+ page_to_pfn(p), start);
+ memory_failure(page_to_pfn(p), 0);
+ }
+ return 0;
+}
+#endif
+
static long
madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
@@ -290,6 +322,11 @@
int write;
size_t len;
+#ifdef CONFIG_MEMORY_FAILURE
+ if (behavior == MADV_HWPOISON)
+ return madvise_hwpoison(start, start+len_in);
+#endif
+
write = madvise_need_mmap_write(behavior);
if (write)
down_write(&current->mm->mmap_sem);
Index: linux/include/asm-generic/mman.h
===================================================================
--- linux.orig/include/asm-generic/mman.h 2009-05-29 23:32:07.000000000 +0200
+++ linux/include/asm-generic/mman.h 2009-05-29 23:32:11.000000000 +0200
@@ -34,6 +34,7 @@
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
+#define MADV_HWPOISON 12 /* hw poison the page (root only) */
/* compatibility flags */
#define MAP_FILE 0
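For testing, a user-space caller of the new subcommand might look roughly like the sketch below. The helper names page_floor() and poison_page_at() are made up for this example. The value 12 is the one this patch assigns; released kernels may use a different number, so the definition from the system header takes precedence when it exists. Aligning the address to a page boundary is a conservative assumption, not something this patch enforces.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* 12 is the value from this patch; prefer the system header if it has one. */
#ifndef MADV_HWPOISON
#define MADV_HWPOISON 12
#endif

/* Round addr down to the start of its page; page_size must be a power of two. */
static uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Hypothetical test helper: poison the single page containing addr.
 * Returns 0 on success, or -1 with errno set (EPERM without CAP_SYS_ADMIN).
 * The page is taken out of use for good, so only call this on a scratch
 * mapping on a test machine.
 */
int poison_page_at(void *addr)
{
	uintptr_t page_size = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = page_floor((uintptr_t)addr, page_size);

	return madvise((void *)start, page_size, MADV_HWPOISON);
}
```

A test would typically mmap() a scratch page, touch it so that it is actually backed by memory, call poison_page_at() on it, and then check how the process is signalled on the next access.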
* [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (14 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
@ 2009-05-29 21:35 ` Andi Kleen
2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
16 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 21:35 UTC (permalink / raw)
To: akpm, linux-kernel, linux-mm, fengguang.wu
Useful for some testing scenarios, although specific testing is often
done better through MADV_HWPOISON.
This can be done with the x86-level MCE injector too, but this interface
allows doing it independently of low-level x86 changes.
Open issues:
Should be disabled for cgroups.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
mm/Kconfig | 4 ++++
mm/Makefile | 1 +
mm/hwpoison-inject.c | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)
Index: linux/mm/hwpoison-inject.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/mm/hwpoison-inject.c 2009-05-29 23:32:11.000000000 +0200
@@ -0,0 +1,41 @@
+/* Inject a hwpoison memory failure on an arbitrary pfn */
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+ memory_failure(val, 18); /* 18: x86 machine check vector */
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+ if (hwpoison_dir)
+ debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+ hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+ if (hwpoison_dir == NULL)
+ return -ENOMEM;
+ corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+ NULL, &hwpoison_fops);
+ if (corrupt_pfn == NULL) {
+ pfn_inject_exit();
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig 2009-05-29 23:32:11.000000000 +0200
+++ linux/mm/Kconfig 2009-05-29 23:32:11.000000000 +0200
@@ -231,6 +231,10 @@
default y
depends on MMU
+config HWPOISON_INJECT
+ tristate "Poison pages injector"
+ depends on MEMORY_FAILURE && DEBUG_KERNEL
+
config NOMMU_INITIAL_TRIM_EXCESS
int "Turn on mmap() excess space trimming before booting"
depends on !MMU
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile 2009-05-29 23:32:11.000000000 +0200
+++ linux/mm/Makefile 2009-05-29 23:32:11.000000000 +0200
@@ -39,3 +39,4 @@
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
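The debugfs file created above takes a page frame number written as text. A minimal sketch of a user-space writer follows. inject_pfn() is a hypothetical helper, and the path in the usage note assumes debugfs is mounted at /sys/kernel/debug, which is conventional but not guaranteed.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a pfn in decimal to the injector file.
 * Returns 0 on success, -1 on failure.
 */
int inject_pfn(const char *path, unsigned long long pfn)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%llu\n", pfn) > 0;
	/* fclose() flushes the buffer; a failed write shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

With the module loaded, a call such as inject_pfn("/sys/kernel/debug/hwpoison/corrupt-pfn", pfn) would request a memory failure for that pfn. The caller needs CAP_SYS_ADMIN, and the poisoned page stays unusable afterwards.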
* Re: [PATCH] [0/16] HWPOISON: Intro
2009-05-29 21:35 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
` (15 preceding siblings ...)
2009-05-29 21:35 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen
@ 2009-05-29 21:52 ` Alan Cox
2009-05-29 22:24 ` Andi Kleen
2009-05-30 6:37 ` More thoughts about hwpoison and pageflags compression Andi Kleen
16 siblings, 2 replies; 36+ messages in thread
From: Alan Cox @ 2009-05-29 21:52 UTC (permalink / raw)
To: Andi Kleen; +Cc: akpm, linux-kernel, linux-mm, fengguang.wu
On Fri, 29 May 2009 23:35:25 +0200 (CEST)
Andi Kleen <andi@firstfloor.org> wrote:
>
> Another version of the hwpoison patchkit. I addressed
> all feedback, except:
> I didn't move the handlers into other files for now, prefer
> to keep things together for now
> I'm keeping an own pagepoison bit because I think that's
> cleaner than any other hacks.
>
> Andrew, please put it into mm for .31 track.
Andrew please put it on the "Andi needs to justify his pageflags" non-path
I'm with Rik on this - we may have a few pageflags handy now but being
slack with them for an obscure feature that can be done other ways and
isn't performance critical is just lazy and bad planning for the long
term.
Andi - "I'm doing it my way so nyahh, put it into .31" doesn't fly. If
you want it in .31 convince Rik and me and others that its a good use of
a pageflag.
Alan
* Re: [PATCH] [0/16] HWPOISON: Intro
2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
@ 2009-05-29 22:24 ` Andi Kleen
2009-05-30 6:37 ` More thoughts about hwpoison and pageflags compression Andi Kleen
1 sibling, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-29 22:24 UTC (permalink / raw)
To: Alan Cox; +Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu
On Fri, May 29, 2009 at 10:52:02PM +0100, Alan Cox wrote:
> On Fri, 29 May 2009 23:35:25 +0200 (CEST)
> Andi Kleen <andi@firstfloor.org> wrote:
>
> >
> > Another version of the hwpoison patchkit. I addressed
> > all feedback, except:
> > I didn't move the handlers into other files for now, prefer
> > to keep things together for now
> > I'm keeping an own pagepoison bit because I think that's
> > cleaner than any other hacks.
> >
> > Andrew, please put it into mm for .31 track.
>
> Andrew please put it on the "Andi needs to justify his pageflags" non-path
>
> I'm with Rik on this - we may have a few pageflags handy now but being
> slack with them for an obscure feature that can be done other ways and
> isn't performance critical is just lazy and bad planning for the long
> term.
There's still plenty of space. Especially on 64bit it's an absolute
non problem.
On 32bit the shortage of page flags was really
artificial because there were some caches put into ->flags, but
these are largely obsolete to my understanding:
- discontigmem is gone (which cached the node)
- non vmap sparsemem is used a few times, but not on large systems
where you have a lot of zones, so you are ok with only having a few bits
for that
- if we really run out of bits on the sparsemem mapping it's easy
enough to do another small hash table for this, similar to the discontig
hash tables.
Also Christoph L. redid the dynamic allocation, so the boundaries
are now dynamically growing/shrinking. This means that if an architecture
doesn't use poison it doesn't use the bit.
> Andi - "I'm doing it my way so nyahh, put it into .31" doesn't fly. If
> you want it in .31 convince Rik and me and others that its a good use of
> a pageflag.
Sorry, you guys also didn't do a very good job of explaining why
taking a page flag is that big a problem. Yes, I know it's popular
folklore, but as far as I understand, most of the reasons to be so
stingy with them have disappeared over time anyway (but the folklore
stayed for some reason).
Anyway, here's my pitch:
It's a straightforward concept expressible as a page flag. Lots
of places need to check for it (we expect there will be more users
in the future). Even crash dumps should check for it, so
it's important to have a clean interface.
It's also an optional flag: if there's still an architecture
around that needs special caches in ->flags, it's unlikely
to turn it on.
And what's the alternative? Are you suggesting we should do Huffman
encoding on flags or something? That seems just too ugly, especially to solve
a non-problem.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* More thoughts about hwpoison and pageflags compression
2009-05-29 21:52 ` [PATCH] [0/16] HWPOISON: Intro Alan Cox
2009-05-29 22:24 ` Andi Kleen
@ 2009-05-30 6:37 ` Andi Kleen
2009-05-30 6:53 ` Andrew Morton
1 sibling, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-30 6:37 UTC (permalink / raw)
To: Alan Cox; +Cc: Andi Kleen, akpm, linux-kernel, linux-mm, fengguang.wu
I thought a bit more about Alan's proposal of page flags compression
for poisoned pages. I actually found more problems with it :-)
(in addition to the points I wrote up in my earlier email on the topic)
Just wanted to write them up:
First some basics about hwpoison.
- HwPoisoning can come in at any time and in any state of the page.
- There can be multiple hwpoison events coming in for the same page in a short time window.
This can happen, for example, when the hardware detects errors on different cache lines of a page,
as in some DIMM breakage scenarios.
The HwPoison bit serves as a synchronization point for this; it's essentially a lock
for the hwpoison code (although not a spinlock).
- HwPoison is high-level code and should only use portable primitives.
Alan proposed using reserved|writeback to express hwpoisoning instead
of a bit of its own.
- Now the first problem is that we don't have a portable primitive to set
multiple bits atomically. cmpxchg() can only be used in architecture-specific
code. So the flag wouldn't be atomic in its locking function.
That means that all multiple-bit variants are problematic, or at least
would need a new global atomic primitive.
- Then you can actually have a page that is both in writeback and poisoned. That is,
we can't stop writeback (we might at some point in the future), so the order
in which the code works right now is:
set page poisoned
bail out if it was already poisoned
do some other stuff
lock the page
wait for page writeback
(which just polls on the bit to clear)
Now the obvious problem is of course: if we used writeback|reserved, how
would it do the poison locking while the page is still in writeback?
The encoding would not be unique.
If we don't do that, we risk multiple memory_failure() calls on the same
page, which has various issues.
So at least writeback|reserved doesn't work.
- Could we in theory find another weird bit combination that's truly impossible today?
Probably, but it would be very hard to verify that it can truly never happen.
- Then I don't like it because of the fragility against other software bugs. Unless someone
blasts 0xffs over the struct page (in which case treating it as poisoned is probably a
good thing anyway), a separate bit is fairly robust against software bugs.
Right now "impossible combinations" are used as an indication that something is wrong
with the page, to catch broken software.
If we gave meaning to previously impossible combinations, this robustness
would be reduced. So a separate bit is generally more robust and doesn't take
this away from the other code.
So using a separate bit is a sensible choice imho.
Hope this helps,
-Andi
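The "synchronization point" argument above can be illustrated with a small user-space sketch. It uses C11 atomics as a stand-in for the kernel's TestSetPageHWPoison(); the bit positions and the helper name test_set_poison() are assumptions made for the example and do not match the kernel's layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define F_WRITEBACK (1UL << 0)	/* hypothetical bit positions */
#define F_HWPOISON  (1UL << 1)

/*
 * Set the poison bit and report whether it was already set.
 * Exactly one caller per page sees "false" and goes on to handle the error;
 * every later caller sees "true" and bails out.
 */
static bool test_set_poison(atomic_ulong *flags)
{
	return atomic_fetch_or(flags, F_HWPOISON) & F_HWPOISON;
}
```

Since the poison state lives in its own bit, it can be set while the page is still under writeback, and it stays set after writeback clears. An encoding that reuses the writeback bit could not make that distinction.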
--
ak@linux.intel.com -- Speaking for myself only.
* Re: More thoughts about hwpoison and pageflags compression
2009-05-30 6:37 ` More thoughts about hwpoison and pageflags compression Andi Kleen
@ 2009-05-30 6:53 ` Andrew Morton
2009-05-30 7:27 ` Andi Kleen
0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2009-05-30 6:53 UTC (permalink / raw)
To: Andi Kleen; +Cc: Alan Cox, linux-kernel, linux-mm, fengguang.wu
On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> So using a separate bit is a sensible choice imho.
Could you make the feature 64-bit-only and use one of bits 32-63?
Did you consider making the poison tag external to the pageframe? Some
hash(page*) into a bitmap or something? If suitably designed, such
infrastructure could perhaps be reused to reclaim some existing page
flags. Dave Hansen had such a patch a few years back. Or maybe it
was Andy Whitcroft.
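To make the "external tag" idea concrete, here is a minimal user-space sketch of a poison bitmap kept outside struct page. It indexes directly by pfn rather than hashing the page pointer, which is a simplification. MAX_PFN, poison_map and both helper names are hypothetical, and nothing here is taken from the patches mentioned above.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_PFN       (1UL << 16)	/* hypothetical: 64K page frames */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* One bit per page frame, kept outside struct page. */
static atomic_ulong poison_map[MAX_PFN / BITS_PER_WORD];

/* Atomically mark pfn as poisoned; returns the previous state. pfn < MAX_PFN. */
static bool ext_test_set_poison(unsigned long pfn)
{
	unsigned long bit = 1UL << (pfn % BITS_PER_WORD);

	return atomic_fetch_or(&poison_map[pfn / BITS_PER_WORD], bit) & bit;
}

/* Query without modifying. pfn < MAX_PFN. */
static bool ext_is_poisoned(unsigned long pfn)
{
	unsigned long bit = 1UL << (pfn % BITS_PER_WORD);

	return atomic_load(&poison_map[pfn / BITS_PER_WORD]) & bit;
}
```

The trade-off is an extra memory lookup on every check and a table that has to cover all of physical memory, in exchange for not consuming a bit in page->flags.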
* Re: More thoughts about hwpoison and pageflags compression
2009-05-30 6:53 ` Andrew Morton
@ 2009-05-30 7:27 ` Andi Kleen
2009-05-30 7:29 ` Andrew Morton
0 siblings, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2009-05-30 7:27 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, Alan Cox, linux-kernel, linux-mm, fengguang.wu
On Fri, May 29, 2009 at 11:53:02PM -0700, Andrew Morton wrote:
> On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
>
> > So using a separate bit is a sensible choice imho.
>
> Could you make the feature 64-bit-only and use one of bits 32-63?
We could, but these systems can run 32bit kernels too (although
it's probably not a good idea). Ok it would be probably possible
to make it 64bit only, but I would prefer to not do that.
Also even 32bit has still flags free and even if we run out there's an easy
path to free more (see my earlier writeup)
So I don't see the pressing need to conserve every bit on 32bit.
> Did you consider making the poison tag external to the pageframe? Some
> hash(page*) into a bitmap or something? If suitably designed, such
> infrastructure could perhaps be reused to reclaim some existing page
> flags. Dave Hansen had such a patch a few years back. Or maybe it
> was Andy Whitcroft.
I considered it at some point, but it would have complicated the code
and I preferred to keep it simple. The poison handler should be relatively
straight forward and do its work quickly otherwise it might not isolate
the page before it's actually used.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: More thoughts about hwpoison and pageflags compression
2009-05-30 7:27 ` Andi Kleen
@ 2009-05-30 7:29 ` Andrew Morton
2009-05-30 7:55 ` Andi Kleen
0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2009-05-30 7:29 UTC (permalink / raw)
To: Andi Kleen; +Cc: Alan Cox, linux-kernel, linux-mm, fengguang.wu
On Sat, 30 May 2009 09:27:58 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, May 29, 2009 at 11:53:02PM -0700, Andrew Morton wrote:
> > On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> >
> > > So using a separate bit is a sensible choice imho.
> >
> > Could you make the feature 64-bit-only and use one of bits 32-63?
>
> We could, but these systems can run 32bit kernels too (although
> it's probably not a good idea). OK, it would probably be possible
> to make it 64bit only, but I would prefer not to do that.
>
> Also, even 32bit still has flags free, and even if we run out there's an easy
> path to free more (see my earlier writeup).
hm. Maybe that should be proven sooner rather than later.
> So I don't see the pressing need to conserve every bit on 32bit.
>
> > Did you consider making the poison tag external to the pageframe? Some
> > hash(page*) into a bitmap or something? If suitably designed, such
> > infrastructure could perhaps be reused to reclaim some existing page
> > flags. Dave Hansen had such a patch a few years back. Or maybe it
> > was Andy Whitcroft.
>
> I considered it at some point, but it would have complicated the code
> and I preferred to keep it simple. The poison handler should be relatively
> straightforward and do its work quickly; otherwise it might not isolate
> the page before it's actually used.
Well it's going to get complicated when we run out anyway. And run out
we shall.
Plus we haven't looked into the complexity of the external flags yet.
* Re: More thoughts about hwpoison and pageflags compression
2009-05-30 7:29 ` Andrew Morton
@ 2009-05-30 7:55 ` Andi Kleen
0 siblings, 0 replies; 36+ messages in thread
From: Andi Kleen @ 2009-05-30 7:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, Alan Cox, linux-kernel, linux-mm, fengguang.wu
On Sat, May 30, 2009 at 12:29:30AM -0700, Andrew Morton wrote:
> On Sat, 30 May 2009 09:27:58 +0200 Andi Kleen <andi@firstfloor.org> wrote:
>
> > On Fri, May 29, 2009 at 11:53:02PM -0700, Andrew Morton wrote:
> > > On Sat, 30 May 2009 08:37:10 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> > >
> > > > So using a separate bit is a sensible choice imho.
> > >
> > > Could you make the feature 64-bit-only and use one of bits 32-63?
> >
> > We could, but these systems can run 32bit kernels too (although
> > it's probably not a good idea). OK, it would probably be possible
> > to make it 64bit only, but I would prefer not to do that.
> >
> > Also, even 32bit still has flags free, and even if we run out there's an easy
> > path to free more (see my earlier writeup).
>
> hm. Maybe that should be proven sooner rather than later.
The SPARSEMEM code already has some fallback. I don't know if it works, but
at least the code looks to be there.

/*
 * There are three possibilities for how page->flags get
 * laid out.  The first is for the normal case, without
 * sparsemem.  The second is for sparsemem when there is
 * plenty of space for node and section.  The last is when
 * we have run out of space and have to fall back to an
 * alternate (slower) way of determining the node.
 *
 * No sparsemem or sparsemem vmemmap: |       NODE     | ZONE | ... | FLAGS |
 * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
 * classic sparse no space for node:  | SECTION |     ZONE    | ... | FLAGS |
 */

/*
 * If we did not store the node number in the page then we have to
 * do a lookup in the section_to_node_table in order to find which
 * node the page belongs to.
 */
#if MAX_NUMNODES <= 256
static u8 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#else
static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
#endif

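
The fallback quoted above can be sketched in userspace roughly as follows.
This is only an illustration under assumptions: the demo_* names and DEMO_*
constants are hypothetical, and the section is derived from a pfn here,
whereas the kernel reads it from page->flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical demo parameters, not the kernel's values. */
#define DEMO_SECTION_SHIFT   4	/* 16 page frames per section */
#define DEMO_NR_MEM_SECTIONS 64

/* Side table: one node id per memory section (u8 is enough for <= 256 nodes). */
static uint8_t demo_section_to_node[DEMO_NR_MEM_SECTIONS];

/* Map a page frame number to the section that contains it. */
static unsigned long demo_pfn_to_section(unsigned long pfn)
{
	return pfn >> DEMO_SECTION_SHIFT;
}

/* Record which node a section belongs to (done once at setup time). */
static void demo_set_section_node(unsigned long section, uint8_t node)
{
	assert(section < DEMO_NR_MEM_SECTIONS);
	demo_section_to_node[section] = node;
}

/* The slower path: find the node through the table, not the flags word. */
static unsigned int demo_pfn_to_node(unsigned long pfn)
{
	unsigned long section = demo_pfn_to_section(pfn);

	assert(section < DEMO_NR_MEM_SECTIONS);
	return demo_section_to_node[section];
}
```

The cost is one extra memory access per lookup, in exchange for not
spending flag-word bits on the node number.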
The other part that could be added is to use a separate hash to go from
page to SECTION (that would be very similar to the old discontig perfect hash
I did to go from pfn to node); then the "SECTION" part would be free for
reuse too.

Then you could use the full 32 bits. On 32bit we're right now at 22;
hwpoison would be 23. There's still some room.
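
To make the bit budget concrete, here is a rough sketch. It reads the 22 and
23 above as the number of flag bits taken from the low end of a 32-bit
page->flags, which is an interpretation. The demo_* helpers and the field
widths used to exercise them are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_WORD_BITS 32u	/* width of page->flags on a 32-bit kernel */

/* Bits left for SECTION/NODE/ZONE once nr_flags flag bits are in use. */
static unsigned int demo_bits_left(unsigned int nr_flags)
{
	assert(nr_flags <= DEMO_WORD_BITS);
	return DEMO_WORD_BITS - nr_flags;
}

/* Does a layout with these field widths fit into one 32-bit word? */
static bool demo_layout_fits(unsigned int nr_flags, unsigned int section_bits,
			     unsigned int node_bits, unsigned int zone_bits)
{
	return nr_flags + section_bits + node_bits + zone_bits <= DEMO_WORD_BITS;
}
```

With 23 flag bits, 9 bits remain for the upper fields; moving SECTION out to
a separate lookup would hand its share back to the flags.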
> Plus we haven't looked into the complexity of the external flags yet.
It would be dumb to do external flags before you actually run out.
After all, what good are free bits?
-Andi
--
ak@linux.intel.com -- Speaking for myself only.