From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35])
	by kanga.kvack.org (Postfix) with ESMTP id 8EDB26B004F
	for ; Thu, 28 May 2009 04:25:58 -0400 (EDT)
Date: Thu, 28 May 2009 10:26:16 +0200
From: Nick Piggin
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3
Message-ID: <20090528082616.GG6920@wotan.suse.de>
References: <200905271012.668777061@firstfloor.org> <20090527201239.C2C9C1D0294@basil.firstfloor.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090527201239.C2C9C1D0294@basil.firstfloor.org>
Sender: owner-linux-mm@kvack.org
To: Andi Kleen
Cc: hugh@veritas.com, riel@redhat.com, akpm@linux-foundation.org, chris.mason@oracle.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com
List-ID: 

On Wed, May 27, 2009 at 10:12:39PM +0200, Andi Kleen wrote:
> 
> This patch adds the high level memory handler that poisons pages
> that got corrupted by hardware (typically by a bit flip in a DIMM
> or a cache) on the Linux level. Linux tries to access these
> pages in the future then.

Quick review.

> Index: linux/mm/Makefile
> ===================================================================
> --- linux.orig/mm/Makefile	2009-05-27 21:23:18.000000000 +0200
> +++ linux/mm/Makefile	2009-05-27 21:24:39.000000000 +0200
> @@ -38,3 +38,4 @@
>  endif
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> +obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
> Index: linux/mm/memory-failure.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux/mm/memory-failure.c	2009-05-27 21:28:19.000000000 +0200
> @@ -0,0 +1,677 @@
> +/*
> + * Copyright (C) 2008, 2009 Intel Corporation
> + * Author: Andi Kleen
> + *
> + * This software may be redistributed and/or modified under the terms of
> + * the GNU General Public License ("GPL") version 2 only as published by the
> + * Free Software Foundation.
> + *
> + * High level machine check handler. Handles pages reported by the
> + * hardware as being corrupted usually due to a 2bit ECC memory or cache
> + * failure.
> + *
> + * This focuses on pages detected as corrupted in the background.
> + * When the current CPU tries to consume corruption the currently
> + * running process can just be killed directly instead. This implies
> + * that if the error cannot be handled for some reason it's safe to
> + * just ignore it because no corruption has been consumed yet. Instead
> + * when that happens another machine check will happen.
> + *
> + * Handles page cache pages in various states. The tricky part
> + * here is that we can access any page asynchronous to other VM
> + * users, because memory failures could happen anytime and anywhere,
> + * possibly violating some of their assumptions. This is why this code
> + * has to be extremely careful. Generally it tries to use normal locking
> + * rules, as in get the standard locks, even if that means the
> + * error handling takes potentially a long time.
> + *
> + * The operation to map back from RMAP chains to processes has to walk
> + * the complete process list and has non linear complexity with the number
> + * mappings. In short it can be quite slow. But since memory corruptions
> + * are rare we hope to get away with this.
> + */
> +
> +/*
> + * Notebook:
> + * - hugetlb needs more code
> + * - nonlinear
> + * - remap races
> + * - anonymous (tinject):
> + *   + left over references when process catches signal?
> + * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
> + * - pass bad pages to kdump next kernel
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "internal.h"
> +
> +#define Dprintk(x...) printk(x)
> +
> +int sysctl_memory_failure_early_kill __read_mostly = 1;
> +
> +atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
> +
> +/*
> + * Send all the processes who have the page mapped an ``action optional''
> + * signal.
> + */
> +static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
> +			unsigned long pfn)
> +{
> +	struct siginfo si;
> +	int ret;
> +
> +	printk(KERN_ERR
> +		"MCE %#lx: Killing %s:%d due to hardware memory corruption\n",
> +		pfn, t->comm, t->pid);
> +	si.si_signo = SIGBUS;
> +	si.si_errno = 0;
> +	si.si_code = BUS_MCEERR_AO;
> +	si.si_addr = (void *)addr;
> +#ifdef __ARCH_SI_TRAPNO
> +	si.si_trapno = trapno;
> +#endif
> +	si.si_addr_lsb = PAGE_SHIFT;
> +	/*
> +	 * Don't use force here, it's convenient if the signal
> +	 * can be temporarily blocked.
> +	 * This could cause a loop when the user sets SIGBUS
> +	 * to SIG_IGN, but hopefully noone will do that?
> +	 */
> +	ret = send_sig_info(SIGBUS, &si, t);	/* synchronous? */
> +	if (ret < 0)
> +		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
> +			t->comm, t->pid, ret);
> +	return ret;
> +}
> +
> +/*
> + * Kill all processes that have a poisoned page mapped and then isolate
> + * the page.
> + *
> + * General strategy:
> + * Find all processes having the page mapped and kill them.
> + * But we keep a page reference around so that the page is not
> + * actually freed yet.
> + * Then stash the page away
> + *
> + * There's no convenient way to get back to mapped processes
> + * from the VMAs. So do a brute-force search over all
> + * running processes.
> + *
> + * Remember that machine checks are not common (or rather
> + * if they are common you have other problems), so this shouldn't
> + * be a performance issue.
> + *
> + * Also there are some races possible while we get from the
> + * error detection to actually handle it.
> + */
> +
> +struct to_kill {
> +	struct list_head nd;
> +	struct task_struct *tsk;
> +	unsigned long addr;
> +};

It would be kinda nice to have a field in task_struct that is usable
say for anyone holding the tasklist lock for write. Then you could
make a list with them. But I guess it isn't worthwhile unless there
are other users.

> +
> +/*
> + * Failure handling: if we can't find or can't kill a process there's
> + * not much we can do. We just print a message and ignore otherwise.
> + */
> +
> +/*
> + * Schedule a process for later kill.
> + * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
> + * TBD would GFP_NOIO be enough?
> + */
> +static void add_to_kill(struct task_struct *tsk, struct page *p,
> +			struct vm_area_struct *vma,
> +			struct list_head *to_kill,
> +			struct to_kill **tkc)
> +{
> +	int fail = 0;
> +	struct to_kill *tk;
> +
> +	if (*tkc) {
> +		tk = *tkc;
> +		*tkc = NULL;
> +	} else {
> +		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
> +		if (!tk) {
> +			printk(KERN_ERR "MCE: Out of memory while machine check handling\n");
> +			return;
> +		}
> +	}
> +	tk->addr = page_address_in_vma(p, vma);
> +	if (tk->addr == -EFAULT) {
> +		printk(KERN_INFO "MCE: Failed to get address in VMA\n");

I don't know if this is a very helpful message. It could legitimately
happen and there's nothing anybody can do about it...

> +		tk->addr = 0;
> +		fail = 1;

Fail doesn't seem to be used anywhere.

> +	}
> +	get_task_struct(tsk);
> +	tk->tsk = tsk;
> +	list_add_tail(&tk->nd, to_kill);
> +}
> +
> +/*
> + * Kill the processes that have been collected earlier.
> + */
> +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
> +			  int fail, unsigned long pfn)

I guess "doit" etc. is obvious once reading the code and caller, but
maybe a quick comment in the header to describe them?

> +{
> +	struct to_kill *tk, *next;
> +
> +	list_for_each_entry_safe (tk, next, to_kill, nd) {
> +		if (doit) {
> +			/*
> +			 * In case something went wrong with munmaping
> +			 * make sure the process doesn't catch the
> +			 * signal and then access the memory. So reset
> +			 * the signal handlers
> +			 */
> +			if (fail)
> +				flush_signal_handlers(tk->tsk, 1);

Is this a legitimate thing to do? Is it racy? Why would you not send
a sigkill or something if you want them to die right now?

> +
> +			/*
> +			 * In theory the process could have mapped
> +			 * something else on the address in-between. We could
> +			 * check for that, but we need to tell the
> +			 * process anyways.
> +			 */
> +			if (kill_proc_ao(tk->tsk, tk->addr, trapno, pfn) < 0)
> +				printk(KERN_ERR
> +	"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
> +					pfn, tk->tsk->comm, tk->tsk->pid);
> +		}
> +		put_task_struct(tk->tsk);
> +		kfree(tk);
> +	}
> +}
> +
> +/*
> + * Collect processes when the error hit an anonymous page.
> + */
> +static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> +			       struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct anon_vma *av = page_lock_anon_vma(page);
> +
> +	if (av == NULL)	/* Not actually mapped anymore */
> +		return;
> +
> +	read_lock(&tasklist_lock);
> +	for_each_process (tsk) {
> +		if (!tsk->mm)
> +			continue;
> +		list_for_each_entry (vma, &av->head, anon_vma_node) {
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +		}
> +	}
> +	page_unlock_anon_vma(av);
> +	read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect processes when the error hit a file mapped page.
> + */
> +static void collect_procs_file(struct page *page, struct list_head *to_kill,
> +			       struct to_kill **tkc)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +	struct prio_tree_iter iter;
> +	struct address_space *mapping = page_mapping(page);
> +
> +	read_lock(&tasklist_lock);
> +	spin_lock(&mapping->i_mmap_lock);

You have tasklist_lock(R) nesting outside i_mmap_lock, and inside
anon_vma lock. And anon_vma lock nests inside i_mmap_lock. This seems
fragile.
If rwlocks ever become FIFO or tasklist_lock changes type (maybe -rt
kernels do it), then you could have a task holding anon_vma lock and
waiting for tasklist_lock, and another holding tasklist lock and
waiting for i_mmap_lock, and another holding i_mmap_lock and waiting
for anon_vma lock. I think nesting either inside or outside these
locks consistently is less fragile.

Do we already have a dependency?... I don't know of one, but you
should document this in mm/rmap.c and mm/filemap.c.

> +	for_each_process(tsk) {
> +		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +
> +		if (!tsk->mm)
> +			continue;
> +
> +		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
> +				      pgoff)
> +			if (vma->vm_mm == tsk->mm)
> +				add_to_kill(tsk, page, vma, to_kill, tkc);
> +	}
> +	spin_unlock(&mapping->i_mmap_lock);
> +	read_unlock(&tasklist_lock);
> +}
> +
> +/*
> + * Collect the processes who have the corrupted page mapped to kill.
> + * This is done in two steps for locking reasons.
> + * First preallocate one tokill structure outside the spin locks,
> + * so that we can kill at least one process reasonably reliable.
> + */
> +static void collect_procs(struct page *page, struct list_head *tokill)
> +{
> +	struct to_kill *tk;
> +
> +	tk = kmalloc(sizeof(struct to_kill), GFP_KERNEL);
> +	/* memory allocation failure is implicitly handled */

Well... it's explicitly handled... in the callee ;)

> +	if (PageAnon(page))
> +		collect_procs_anon(page, tokill, &tk);
> +	else
> +		collect_procs_file(page, tokill, &tk);
> +	kfree(tk);
> +}
> +
> +/*
> + * Error handlers for various types of pages.
> + */
> +
> +enum outcome {
> +	FAILED,
> +	DELAYED,
> +	IGNORED,
> +	RECOVERED,
> +};
> +
> +static const char *action_name[] = {
> +	[FAILED] = "Failed",
> +	[DELAYED] = "Delayed",
> +	[IGNORED] = "Ignored",

How is delayed different to ignored (or failed, for that matter)?

> +	[RECOVERED] = "Recovered",

And what does recovered mean? The processes were killed and the page
taken out of circulation, but the machine is still in some unknown
state of corruption henceforth, right?

> +};
> +
> +/*
> + * Error hit kernel page.
> + * Do nothing, try to be lucky and not touch this instead. For a few cases we
> + * could be more sophisticated.
> + */
> +static int me_kernel(struct page *p)
> +{
> +	return DELAYED;
> +}
> +
> +/*
> + * Already poisoned page.
> + */
> +static int me_ignore(struct page *p)
> +{
> +	return IGNORED;
> +}
> +
> +/*
> + * Page in unknown state. Do nothing.
> + */
> +static int me_unknown(struct page *p)
> +{
> +	printk(KERN_ERR "MCE %#lx: Unknown page state\n", page_to_pfn(p));
> +	return FAILED;
> +}
> +
> +/*
> + * Free memory
> + */
> +static int me_free(struct page *p)
> +{
> +	return DELAYED;
> +}
> +
> +/*
> + * Clean (or cleaned) page cache page.
> + */
> +static int me_pagecache_clean(struct page *p)
> +{
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	if (page_has_private(p))
> +		do_invalidatepage(p, 0);
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		Dprintk(KERN_ERR "MCE %#lx: failed to release buffers\n",
> +			page_to_pfn(p));
> +
> +	/*
> +	 * remove_from_page_cache assumes (mapping && !mapped)
> +	 */
> +	if (page_mapping(p) && !page_mapped(p)) {
> +		remove_from_page_cache(p);
> +		page_cache_release(p);
> +	}

remove_mapping would probably be a better idea. Otherwise you can
probably introduce pagecache removal vs page fault races which will
make the kernel bug.
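Something along these lines, maybe (untested sketch only, just to
illustrate; it assumes remove_mapping() from mm/vmscan.c is usable
here, and that failing the recovery is acceptable when the removal
doesn't go through):

	/*
	 * Sketch: let remove_mapping() do the removal. It freezes the
	 * refcount, so a concurrent find_get_page() or page fault either
	 * still sees the page in the radix tree or doesn't see it at all,
	 * instead of racing against a half-removed page.
	 */
	if (page_mapping(p) && !page_mapped(p)) {
		if (!remove_mapping(page_mapping(p), p))
			return FAILED;
	}

	return RECOVERED;

The page is locked here and, if nobody else holds a reference after
the LRU isolation above, remove_mapping() should succeed; when it
doesn't, refusing to claim recovery seems safer than yanking the page
out regardless.
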
> +
> +	return RECOVERED;
> +}
> +
> +/*
> + * Dirty cache page page
> + * Issues: when the error hit a hole page the error is not properly
> + * propagated.
> + */
> +static int me_pagecache_dirty(struct page *p)
> +{
> +	struct address_space *mapping = page_mapping(p);
> +
> +	SetPageError(p);
> +	/* TBD: print more information about the file. */
> +	printk(KERN_ERR "MCE %#lx: Hardware memory corruption on dirty file page: write error\n",
> +		page_to_pfn(p));
> +	if (mapping) {
> +		/*
> +		 * Truncate does the same, but we're not quite the same
> +		 * as truncate. Needs more checking, but keep it for now.
> +		 */

What's different about truncate? It would be good to reuse as much as
possible.

> +		cancel_dirty_page(p, PAGE_CACHE_SIZE);
> +
> +		/*
> +		 * IO error will be reported by write(), fsync(), etc.
> +		 * who check the mapping.
> +		 */
> +		mapping_set_error(mapping, EIO);

Interesting. It's not *exactly* an IO error (well, not like one we're
usually used to).

> +	}
> +
> +	me_pagecache_clean(p);
> +
> +	/*
> +	 * Did the earlier release work?
> +	 */
> +	if (page_has_private(p) && !try_to_release_page(p, GFP_NOIO))
> +		return FAILED;
> +
> +	return RECOVERED;
> +}
> +
> +/*
> + * Clean and dirty swap cache.
> + */
> +static int me_swapcache_dirty(struct page *p)
> +{
> +	ClearPageDirty(p);
> +
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	return DELAYED;
> +}
> +
> +static int me_swapcache_clean(struct page *p)
> +{
> +	ClearPageUptodate(p);
> +
> +	if (!isolate_lru_page(p))
> +		page_cache_release(p);
> +
> +	delete_from_swap_cache(p);
> +
> +	return RECOVERED;
> +}

All these handlers are quite interesting in that they need to know
about most of the mm. Saying what you are trying to do in each of them
would be a good idea, and probably they should rather go into their
appropriate files instead of all here (eg. swapcache stuff should go
in mm/swap_state for example).

You haven't waited on writeback here AFAIKS, and have you *really*
verified it is safe to call delete_from_swap_cache?

> +/*
> + * Huge pages. Needs work.
> + * Issues:
> + * No rmap support so we cannot find the original mapper. In theory could walk
> + * all MMs and look for the mappings, but that would be non atomic and racy.
> + * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
> + * like just walking the current process and hoping it has it mapped (that
> + * should be usually true for the common "shared database cache" case)
> + * Should handle free huge pages and dequeue them too, but this needs to
> + * handle huge page accounting correctly.
> + */
> +static int me_huge_page(struct page *p)
> +{
> +	return FAILED;
> +}
> +
> +/*
> + * Various page states we can handle.
> + *
> + * A page state is defined by its current page->flags bits.
> + * The table matches them in order and calls the right handler.
> + *
> + * This is quite tricky because we can access page at any time
> + * in its live cycle, so all accesses have to be extremly careful.
> + *
> + * This is not complete. More states could be added.
> + * For any missing state don't attempt recovery.
> + */
> +
> +#define dirty		(1UL << PG_dirty)
> +#define swapcache	(1UL << PG_swapcache)
> +#define unevict		(1UL << PG_unevictable)
> +#define mlocked		(1UL << PG_mlocked)
> +#define writeback	(1UL << PG_writeback)
> +#define lru		(1UL << PG_lru)
> +#define swapbacked	(1UL << PG_swapbacked)
> +#define head		(1UL << PG_head)
> +#define tail		(1UL << PG_tail)
> +#define compound	(1UL << PG_compound)
> +#define slab		(1UL << PG_slab)
> +#define buddy		(1UL << PG_buddy)
> +#define reserved	(1UL << PG_reserved)

This looks like more work than just putting 1UL << (...) in each entry
in your table. Hmm, does this whole table thing even buy you much
(versus a much simpler switch statement)? And you are doing a lot of
checking for various page flags anyway (eg. in your prepare function).
It just seems like needless complexity.

> +
> +/*
> + * The table is > 80 columns because all the alternatvies were much worse.
> + */
> +
> +static struct page_state {
> +	unsigned long mask;
> +	unsigned long res;
> +	char *msg;
> +	int (*action)(struct page *p);
> +} error_states[] = {
> +	{ reserved, reserved, "reserved kernel", me_ignore },
> +	{ buddy, buddy, "free kernel", me_free },
> +
> +	/*
> +	 * Could in theory check if slab page is free or if we can drop
> +	 * currently unused objects without touching them. But just
> +	 * treat it as standard kernel for now.
> +	 */
> +	{ slab, slab, "kernel slab", me_kernel },
> +
> +#ifdef CONFIG_PAGEFLAGS_EXTENDED
> +	{ head, head, "hugetlb", me_huge_page },
> +	{ tail, tail, "hugetlb", me_huge_page },
> +#else
> +	{ compound, compound, "hugetlb", me_huge_page },
> +#endif
> +
> +	{ swapcache|dirty, swapcache|dirty,"dirty swapcache", me_swapcache_dirty },
> +	{ swapcache|dirty, swapcache, "clean swapcache", me_swapcache_clean },
> +
> +#ifdef CONFIG_UNEVICTABLE_LRU
> +	{ unevict|dirty, unevict|dirty, "unevictable dirty lru", me_pagecache_dirty },
> +	{ unevict, unevict, "unevictable lru", me_pagecache_clean },
> +#endif
> +
> +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
> +	{ mlocked|dirty, mlocked|dirty, "mlocked dirty lru", me_pagecache_dirty },
> +	{ mlocked, mlocked, "mlocked lru", me_pagecache_clean },
> +#endif
> +
> +	{ lru|dirty, lru|dirty, "dirty lru", me_pagecache_dirty },
> +	{ lru|dirty, lru, "clean lru", me_pagecache_clean },
> +	{ swapbacked, swapbacked, "anonymous", me_pagecache_clean },
> +
> +	/*
> +	 * Add more states here.
> +	 */
> +
> +	/*
> +	 * Catchall entry: must be at end.
> +	 */
> +	{ 0, 0, "unknown page state", me_unknown },
> +};
> +
> +static void page_action(char *msg, struct page *p, int (*action)(struct page *),
> +			unsigned long pfn)
> +{
> +	int ret;
> +
> +	printk(KERN_ERR "MCE %#lx: %s page recovery: starting\n", pfn, msg);
> +	ret = action(p);
> +	printk(KERN_ERR "MCE %#lx: %s page recovery: %s\n",
> +		pfn, msg, action_name[ret]);
> +	if (page_count(p) != 1)
> +		printk(KERN_ERR
> +			"MCE %#lx: %s page still referenced by %d users\n",
> +			pfn, msg, page_count(p) - 1);
> +
> +	/* Could do more checks here if page looks ok */
> +	atomic_long_add(1, &mce_bad_pages);
> +
> +	/*
> +	 * Could adjust zone counters here to correct for the missing page.
> + */
> +}
> +
> +#define N_UNMAP_TRIES 5
> +
> +static void hwpoison_page_prepare(struct page *p, unsigned long pfn,
> +				  int trapno)
> +{
> +	enum ttu_flags ttu = TTU_UNMAP| TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	int kill = sysctl_memory_failure_early_kill;
> +	struct address_space *mapping;
> +	LIST_HEAD(tokill);
> +	int ret;
> +	int i;
> +
> +	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
> +		return;
> +
> +	if (!PageLRU(p))
> +		lru_add_drain();
> +
> +	/*
> +	 * This check implies we don't kill processes if their pages
> +	 * are in the swap cache early. Those are always late kills.
> +	 */
> +	if (!page_mapped(p))
> +		return;
> +
> +	if (PageSwapCache(p)) {
> +		printk(KERN_ERR
> +			"MCE %#lx: keeping poisoned page in swap cache\n", pfn);
> +		ttu |= TTU_IGNORE_HWPOISON;
> +	}
> +
> +	/*
> +	 * Poisoned clean file pages are harmless, the
> +	 * data can be restored by regular page faults.
> +	 */
> +	mapping = page_mapping(p);
> +	if (!PageDirty(p) && !PageWriteback(p) &&
> +	    !PageAnon(p) && !PageSwapBacked(p) &&
> +	    mapping && mapping_cap_account_dirty(mapping)) {
> +		if (page_mkclean(p))
> +			SetPageDirty(p);
> +		else {
> +			kill = 0;
> +			ttu |= TTU_IGNORE_HWPOISON;
> +		}
> +	}
> +
> +	/*
> +	 * First collect all the processes that have the page
> +	 * mapped. This has to be done before try_to_unmap,
> +	 * because ttu takes the rmap data structures down.
> +	 *
> +	 * This also has the side effect to propagate the dirty
> +	 * bit from PTEs into the struct page. This is needed
> +	 * to actually decide if something needs to be killed
> +	 * or errored, or if it's ok to just drop the page.
> +	 *
> +	 * Error handling: We ignore errors here because
> +	 * there's nothing that can be done.
> +	 *
> +	 * RED-PEN some cases in process exit seem to deadlock
> +	 * on the page lock. drop it or add poison checks?
> +	 */
> +	if (kill)
> +		collect_procs(p, &tokill);
> +
> +	/*
> +	 * try_to_unmap can fail temporarily due to races.
> +	 * Try a few times (RED-PEN better strategy?)
> +	 */
> +	for (i = 0; i < N_UNMAP_TRIES; i++) {
> +		ret = try_to_unmap(p, ttu);
> +		if (ret == SWAP_SUCCESS)
> +			break;
> +		Dprintk("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
> +	}
> +
> +	/*
> +	 * Now that the dirty bit has been propagated to the
> +	 * struct page and all unmaps done we can decide if
> +	 * killing is needed or not. Only kill when the page
> +	 * was dirty, otherwise the tokill list is merely
> +	 * freed. When there was a problem unmapping earlier
> +	 * use a more force-full uncatchable kill to prevent
> +	 * any accesses to the poisoned memory.
> +	 */
> +	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
> +		      ret != SWAP_SUCCESS, pfn);
> +}
> +
> +/**
> + * memory_failure - Handle memory failure of a page.
> + *
> + */
> +void memory_failure(unsigned long pfn, int trapno)
> +{
> +	struct page_state *ps;
> +	struct page *p;
> +
> +	if (!pfn_valid(pfn)) {
> +		printk(KERN_ERR
> +		"MCE %#lx: Hardware memory corruption in memory outside kernel control\n",
> +			pfn);
> +		return;
> +	}
> +
> +
> +	p = pfn_to_page(pfn);
> +	if (TestSetPageHWPoison(p)) {
> +		printk(KERN_ERR "MCE %#lx: Error for already hardware poisoned page\n", pfn);
> +		return;
> +	}
> +
> +	/*
> +	 * We need/can do nothing about count=0 pages.
> +	 * 1) it's a free page, and therefore in safe hand:
> +	 *    prep_new_page() will be the gate keeper.
> +	 * 2) it's part of a non-compound high order page.
> + *    Implies some kernel user: cannot stop them from
> + *    R/W the page; let's pray that the page has been
> + *    used and will be freed some time later.
> + * In fact it's dangerous to directly bump up page count from 0,
> + * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
> + */
> +	if (!get_page_unless_zero(compound_head(p))) {
> +		printk(KERN_ERR
> +			"MCE 0x%lx: ignoring free or high order page\n", pfn);
> +		return;
> +	}
> +
> +	lock_page_nosync(p);
> +	hwpoison_page_prepare(p, pfn, trapno);
> +
> +	/* Tored down by someone else? */
> +	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
> +		printk(KERN_ERR
> +			"MCE %#lx: ignoring NULL mapping LRU page\n", pfn);
> +		goto out;
> +	}
> +
> +	for (ps = error_states;; ps++) {
> +		if ((p->flags & ps->mask) == ps->res) {
> +			page_action(ps->msg, p, ps->action, pfn);
> +			break;
> +		}
> +	}
> +out:
> +	unlock_page(p);
> +}
> Index: linux/include/linux/mm.h
> ===================================================================
> --- linux.orig/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
> +++ linux/include/linux/mm.h	2009-05-27 21:24:39.000000000 +0200
> @@ -1322,6 +1322,10 @@
> 
>  extern void *alloc_locked_buffer(size_t size);
>  extern void free_locked_buffer(void *buffer, size_t size);
> +
> +extern void memory_failure(unsigned long pfn, int trapno);
> +extern int sysctl_memory_failure_early_kill;
> +extern atomic_long_t mce_bad_pages;
>  extern void release_locked_buffer(void *buffer, size_t size);
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> Index: linux/kernel/sysctl.c
> ===================================================================
> --- linux.orig/kernel/sysctl.c	2009-05-27 21:23:18.000000000 +0200
> +++ linux/kernel/sysctl.c	2009-05-27 21:24:39.000000000 +0200
> @@ -1282,6 +1282,20 @@
>  		.proc_handler	= &scan_unevictable_handler,
>  	},
>  #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	{
> +		.ctl_name	= CTL_UNNUMBERED,
> +		.procname	= "memory_failure_early_kill",
> +		.data		= &sysctl_memory_failure_early_kill,
> +		.maxlen		= sizeof(vm_highmem_is_dirtyable),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec_minmax,
> +		.strategy	= &sysctl_intvec,
> +		.extra1		= &zero,
> +		.extra2		= &one,
> +	},
> +#endif
> +
>  	/*
>  	 * NOTE: do not add new entries to this table unless you have read
>  	 * Documentation/sysctl/ctl_unnumbered.txt
> Index: linux/fs/proc/meminfo.c
> ===================================================================
> --- linux.orig/fs/proc/meminfo.c	2009-05-27 21:23:18.000000000 +0200
> +++ linux/fs/proc/meminfo.c	2009-05-27 21:24:39.000000000 +0200
> @@ -97,7 +97,11 @@
>  		"Committed_AS: %8lu kB\n"
>  		"VmallocTotal: %8lu kB\n"
>  		"VmallocUsed: %8lu kB\n"
> -		"VmallocChunk: %8lu kB\n",
> +		"VmallocChunk: %8lu kB\n"
> +#ifdef CONFIG_MEMORY_FAILURE
> +		"BadPages: %8lu kB\n"
> +#endif
> +		,
>  		K(i.totalram),
>  		K(i.freeram),
>  		K(i.bufferram),
> @@ -144,6 +148,9 @@
>  		(unsigned long)VMALLOC_TOTAL >> 10,
>  		vmi.used >> 10,
>  		vmi.largest_chunk >> 10
> +#ifdef CONFIG_MEMORY_FAILURE
> +		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
> +#endif
>  		);
> 
>  	hugetlb_report_meminfo(m);
> Index: linux/mm/Kconfig
> ===================================================================
> --- linux.orig/mm/Kconfig	2009-05-27 21:23:18.000000000 +0200
> +++ linux/mm/Kconfig	2009-05-27 21:24:39.000000000 +0200
> @@ -226,6 +226,9 @@
>  config MMU_NOTIFIER
>  	bool
> 
> +config MEMORY_FAILURE
> +	bool
> +
>  config NOMMU_INITIAL_TRIM_EXCESS
>  	int "Turn on mmap() excess space trimming before booting"
>  	depends on !MMU
> 
> Index: linux/Documentation/sysctl/vm.txt
> ===================================================================
> --- linux.orig/Documentation/sysctl/vm.txt	2009-05-27 21:23:18.000000000 +0200
> +++ linux/Documentation/sysctl/vm.txt	2009-05-27 21:24:39.000000000 +0200
> @@ -32,6 +32,7 @@
>  - legacy_va_layout
>  - lowmem_reserve_ratio
>  - max_map_count
> +- memory_failure_early_kill
>  - min_free_kbytes
>  - min_slab_ratio
>  - min_unmapped_ratio
> @@ -53,7 +54,6 @@
>  - vfs_cache_pressure
>  - zone_reclaim_mode
> 
> -
>  ==============================================================
> 
>  block_dump
> @@ -275,6 +275,25 @@
> 
>  The default value is 65536.
> 
> +=============================================================
> +
> +memory_failure_early_kill:
> +
> +Control how to kill processes when uncorrected memory error (typically
> +a 2bit error in a memory module) is detected in the background by hardware.
> +
> +1: Kill all processes that have the corrupted page mapped as soon as the
> +corruption is detected.
> +
> +0: Only unmap the page from all processes and only kill a process
> +who tries to access it.
> +
> +The kill is done using a catchable SIGBUS, so processes can handle this
> +if they want to.
> +
> +This is only active on architectures/platforms with advanced machine
> +check handling and depends on the hardware capabilities.
> +
>  ==============================================================
> 
>  min_free_kbytes:
> Index: linux/arch/x86/mm/fault.c
> ===================================================================
> --- linux.orig/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
> +++ linux/arch/x86/mm/fault.c	2009-05-27 21:24:39.000000000 +0200
> @@ -851,8 +851,9 @@
> 
>  #ifdef CONFIG_MEMORY_FAILURE
>  	if (fault & VM_FAULT_HWPOISON) {
> -		printk(KERN_ERR "MCE: Killing %s:%d due to hardware memory corruption\n",
> -			tsk->comm, tsk->pid);
> +		printk(KERN_ERR
> +	"MCE: Killing %s:%d for accessing hardware corrupted memory at %#lx\n",
> +			tsk->comm, tsk->pid, address);
>  		code = BUS_MCEERR_AR;
>  	}
>  #endif

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org