From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Date: Wed, 5 Aug 2009 16:16:42 +0200 Message-ID: <20090805141642.GB23992@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <20090805141001.GJ11385@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org List-Id: linux-arch.vger.kernel.org On Wed, Aug 05, 2009 at 04:10:01PM +0200, Andi Kleen wrote: > > I haven't brought up the caller at this point, but IIRC you had > > the page locked and mapping confirmed at this point anyway so > > it would never be an error for your code. > > > > Probably it would be nice to just force callers to verify the page. > > Normally IMO it is much nicer and clearer to do it at the time the > > page gets locked, unless there is good reason otherwise. > > Ok. I think I'll just keep it as it is for now. > > The only reason I added the error code was to make truncate_inode_page > fit into .error_remove_page, but then latter I did another wrapper > so it could be removed again. But it won't hurt to have it either. OK, it's more of a cleanup/nit. One question I had for the others (Andrew? other mm guys?) what is the feelings of merging this feature? Leaving aside exact implementation and just considering the high level design and cost/benefit. Last time there were some people objecting, so I wonder the situation now? So does anybody need more convincing? :) Also I will just cc linux-arch. 
It would be interesting to know whether powerpc, ia64, or s390 or others would be interested to use this feature? Thanks, Nick From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Date: Wed, 5 Aug 2009 16:41:12 +0200 Message-ID: <20090805144112.GM11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from one.firstfloor.org ([213.235.205.2]:44177 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933694AbZHEOlN (ORCPT ); Wed, 5 Aug 2009 10:41:13 -0400 Content-Disposition: inline In-Reply-To: <20090805141642.GB23992@wotan.suse.de> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org > One question I had for the others (Andrew? other mm guys?) what is the > feelings of merging this feature? Leaving aside exact implementation > and just considering the high level design and cost/benefit. Last time > there were some people objecting, so I wonder the situation now? So > does anybody need more convincing? :) The main objection last time was that it was a bit too late in the release schedule. I can't remember anyone really questioning the basic feature itself. > Also I will just cc linux-arch. 
It would be interesting to know whether > powerpc, ia64, or s390 or others would be interested to use this feature? ia64 is interested (but no code so far) I talked to DaveM and he seems to be interested for sparc too. I would expect other server architectures to eventually use it as they get around to writing the necessary architecture specific glue. -Andi -- ak@linux.intel.com -- Speaking for myself only. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Date: Wed, 5 Aug 2009 16:44:02 +0200 Message-ID: <20090805144402.GD23992@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> <20090805144112.GM11385@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from cantor.suse.de ([195.135.220.2]:40873 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933707AbZHEOoD (ORCPT ); Wed, 5 Aug 2009 10:44:03 -0400 Content-Disposition: inline In-Reply-To: <20090805144112.GM11385@basil.fritz.box> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org On Wed, Aug 05, 2009 at 04:41:12PM +0200, Andi Kleen wrote: > > One question I had for the others (Andrew? other mm guys?) what is the > > feelings of merging this feature? Leaving aside exact implementation > > and just considering the high level design and cost/benefit. Last time > > there were some people objecting, so I wonder the situation now? So > > does anybody need more convincing? 
:) > > The main objection last time was that it was a bit too late in the > release schedule. > > I can't remember anyone really questioning the basic feature itself. I can't exactly remember. Maybe it was in a thread with Alan and/or Arjan ;) I don't think the feature itself was questioned as much as cost/benefit. Maybe I was wrong... I just want to see everyone is happy with the basic idea ;) > > Also I will just cc linux-arch. It would be interesting to know whether > > powerpc, ia64, or s390 or others would be interested to use this feature? > > ia64 is interested (but no code so far) I talked to DaveM and he seems to be > interested for sparc too. I would expect other server architectures to > eventually use it as they get around to writing the necessary architecture > specific glue. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Wilcox Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Date: Wed, 5 Aug 2009 09:00:39 -0600 Message-ID: <20090805150038.GL3711@parisc-linux.org> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> <20090805144112.GM11385@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from palinux.external.hp.com ([192.25.206.14]:49790 "EHLO mail.parisc-linux.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933075AbZHEPAj (ORCPT ); Wed, 5 Aug 2009 11:00:39 -0400 Content-Disposition: inline In-Reply-To: <20090805144112.GM11385@basil.fritz.box> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Andi Kleen Cc: Nick Piggin , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org, 
linux-parisc@vger.kernel.org On Wed, Aug 05, 2009 at 04:41:12PM +0200, Andi Kleen wrote: > > One question I had for the others (Andrew? other mm guys?) what is the > > feelings of merging this feature? Leaving aside exact implementation > > and just considering the high level design and cost/benefit. Last time > > there were some people objecting, so I wonder the situation now? So > > does anybody need more convincing? :) > > The main objection last time was that it was a bit too late in the > release schedule. > > I can't remember anyone really questioning the basic feature itself. > > > Also I will just cc linux-arch. It would be interesting to know whether > > powerpc, ia64, or s390 or others would be interested to use this feature? > > ia64 is interested (but no code so far) I talked to DaveM and he seems to be > interested for sparc too. I would expect other server architectures to > eventually use it as they get around to writing the necessary architecture > specific glue. parisc could certainly implement this. Don't know if there's interest. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." 
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Martin Schwidefsky Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Date: Thu, 6 Aug 2009 13:48:30 +0200 Message-ID: <20090806134830.4f3931d2@skybase> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Return-path: Received: from mtagate8.de.ibm.com ([195.212.29.157]:57166 "EHLO mtagate8.de.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752911AbZHFLsi (ORCPT ); Thu, 6 Aug 2009 07:48:38 -0400 In-Reply-To: <20090805141642.GB23992@wotan.suse.de> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org On Wed, 5 Aug 2009 16:16:42 +0200 Nick Piggin wrote: > On Wed, Aug 05, 2009 at 04:10:01PM +0200, Andi Kleen wrote: > > > I haven't brought up the caller at this point, but IIRC you had > > > the page locked and mapping confirmed at this point anyway so > > > it would never be an error for your code. > > > > > > Probably it would be nice to just force callers to verify the page. > > > Normally IMO it is much nicer and clearer to do it at the time the > > > page gets locked, unless there is good reason otherwise. > > > > Ok. I think I'll just keep it as it is for now. > > > > The only reason I added the error code was to make truncate_inode_page > > fit into .error_remove_page, but then latter I did another wrapper > > so it could be removed again. But it won't hurt to have it either. > > OK, it's more of a cleanup/nit. 
> > One question I had for the others (Andrew? other mm guys?) what is the > feelings of merging this feature? Leaving aside exact implementation > and just considering the high level design and cost/benefit. Last time > there were some people objecting, so I wonder the situation now? So > does anybody need more convincing? :) > > Also I will just cc linux-arch. It would be interesting to know whether > powerpc, ia64, or s390 or others would be interested to use this feature? This is not relevant for s390, as current machines do transparent memory sparing if a memory module goes bad. Really old machines reported bad memory to the OS by means of a machine check (storage error uncorrected and storage error corrected). I have never seen this happen, the level below the OS deals with these errors for us. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andi Kleen Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Date: Thu, 6 Aug 2009 14:04:17 +0200 Message-ID: <20090806120417.GC22124@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> <20090806134830.4f3931d2@skybase> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from one.firstfloor.org ([213.235.205.2]:57803 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752006AbZHFMES (ORCPT ); Thu, 6 Aug 2009 08:04:18 -0400 Content-Disposition: inline In-Reply-To: <20090806134830.4f3931d2@skybase> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Martin Schwidefsky Cc: Nick Piggin , Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org > This is not relevant for s390, as current machines do transparent memory > sparing if a memory module goes bad. Really old machines reported bad > memory to the OS by means of a machine check (storage error uncorrected > and storage error corrected). I have never seen this happen, the level > below the OS deals with these errors for us. Ok fine. It's for the poorer cousins then who can't afford memory mirroring. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933960AbZHEJg3 (ORCPT ); Wed, 5 Aug 2009 05:36:29 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933817AbZHEJg3 (ORCPT ); Wed, 5 Aug 2009 05:36:29 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49045 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933009AbZHEJg2 (ORCPT ); Wed, 5 Aug 2009 05:36:28 -0400 From: Andi Kleen Message-Id: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [0/19] HWPOISON: Intro Date: Wed, 5 Aug 2009 11:36:27 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [AK: This is a version with the correct file list and some fixes. 
Please disregard the version I posted yesterday]

New version of the hwpoison patchkit. Various changes. Believed to
address all earlier review comments.

Active error truncate is now enabled per file system, so the patchkit
adds a new VFS operation "error_remove_page" for this. This prevents any
truncation of metadata pages; those are only invalidated.

Also various bug fixes, most of them from Fengguang. Please see the
individual patches for the changelog.

Should be good to go now.

Passes the hwpoison-specific parts of the mce-test test suite
(git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git)

Also available as a git tree from
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison

Andrew,

Please consider for merging.

Thanks,
-Andi

Signed-off-by: Andi Kleen

---

Upcoming Intel CPUs have support for recovering from some memory errors
(``MCA recovery''). This requires the OS to declare a page "poisoned",
kill the processes associated with it and avoid using it in the future.

This patchkit implements the necessary infrastructure in the VM.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focusses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

The code consists of the high-level handler in mm/memory-failure.c,
a new page poison bit, and various checks in the VM to handle poisoned
pages.

The main target right now is KVM guests, but it works for all kinds
of applications. For the KVM use case a new signal type was needed so
that KVM can inject the machine check into the guest with the proper
address. In theory this allows other applications to handle memory
failures too. The expectation is that nearly all applications won't do
that, but some very specialized ones might.

This is not fully complete yet; in particular, there are still several
ways to access poisoned memory (crash dump, /proc/kcore, etc.) that
need to be plugged too.
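For the specialized applications mentioned above, a handler could look
roughly like the following. This is a minimal userspace sketch, not code
from the patchkit. It assumes the failure is delivered as SIGBUS with
si_code BUS_MCEERR_AR (the code the x86 fault handler patch in this
series passes) or BUS_MCEERR_AO (defined alongside it in the Linux
siginfo headers and assumed here to cover errors reported in the
background). The names poison_kind, classify_sigbus, sigbus_handler and
install_sigbus_handler are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Fallbacks for older libc headers; values match the Linux siginfo ABI. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical classification used only by this sketch. */
enum poison_kind {
	POISON_NONE,            /* ordinary SIGBUS, not a memory failure */
	POISON_ACTION_REQUIRED, /* poisoned data was about to be consumed */
	POISON_ACTION_OPTIONAL  /* corruption found in the background */
};

static enum poison_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return POISON_ACTION_REQUIRED;
	case BUS_MCEERR_AO:
		return POISON_ACTION_OPTIONAL;
	default:
		return POISON_NONE;
	}
}

/* State recorded by the handler for the main program to inspect. */
static volatile sig_atomic_t last_kind = POISON_NONE;
static void *volatile last_addr;

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	last_kind = classify_sigbus(info->si_code);
	last_addr = info->si_addr; /* address of the affected memory */
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO; /* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

The handler only records what happened. What to do next (drop a cache
entry, re-read the data, or exit) is left to the application, and for
the action-required case simply returning would normally re-execute the
faulting access.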
-Andi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933978AbZHEJgd (ORCPT ); Wed, 5 Aug 2009 05:36:33 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933817AbZHEJgb (ORCPT ); Wed, 5 Aug 2009 05:36:31 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49053 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933794AbZHEJg3 (ORCPT ); Wed, 5 Aug 2009 05:36:29 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [1/19] HWPOISON: Add page flag for poisoned pages Message-Id: <20090805093628.B90C3B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:28 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hardware poisoned pages need special handling in the VM and shouldn't be touched again. This requires a new page flag. Define it here. The page flags wars seem to be over, so it shouldn't be a problem to get a new one. v2: Add TestSetHWPoison (suggested by Johannes Weiner) Acked-by: Christoph Lameter Signed-off-by: Andi Kleen --- include/linux/page-flags.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) Index: linux/include/linux/page-flags.h =================================================================== --- linux.orig/include/linux/page-flags.h +++ linux/include/linux/page-flags.h @@ -51,6 +51,9 @@ * PG_buddy is set to indicate that the page is free and in the buddy system * (see mm/page_alloc.c). * + * PG_hwpoison indicates that a page got corrupted in hardware and contains + * data with incorrect ECC bits that triggered a machine check. Accessing is + * not safe since it may cause another machine check. 
Don't touch! */ /* @@ -102,6 +105,9 @@ enum pageflags { #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif +#ifdef CONFIG_MEMORY_FAILURE + PG_hwpoison, /* hardware poisoned page. Don't touch */ +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -263,6 +269,15 @@ PAGEFLAG(Uncached, uncached) PAGEFLAG_FALSE(Uncached) #endif +#ifdef CONFIG_MEMORY_FAILURE +PAGEFLAG(HWPoison, hwpoison) +TESTSETFLAG(HWPoison, hwpoison) +#define __PG_HWPOISON (1UL << PG_hwpoison) +#else +PAGEFLAG_FALSE(HWPoison) +#define __PG_HWPOISON 0 +#endif + static inline int PageUptodate(struct page *page) { int ret = test_bit(PG_uptodate, &(page)->flags); @@ -387,7 +402,7 @@ static inline void __ClearPageTail(struc 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED) + 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON) /* * Flags checked when a page is prepped for return by the page allocator. 
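The semantics of the accessors this patch generates can be modelled in
userspace roughly as follows. This is a sketch under stated assumptions,
not the kernel implementation: it uses C11 atomics in place of the
kernel's bitops, the bit index DEMO_PG_HWPOISON is made up, and
struct demo_page and the demo_* helpers are hypothetical stand-ins for
struct page, PageHWPoison(), SetPageHWPoison() and
TestSetPageHWPoison().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Made-up bit index; the real one comes from enum pageflags. */
#define DEMO_PG_HWPOISON 3
#define DEMO_HWPOISON_MASK (1UL << DEMO_PG_HWPOISON)

/* Stand-in for struct page: only the flags word matters here. */
struct demo_page {
	atomic_ulong flags;
};

/* Model of PageHWPoison(): is the page marked as poisoned? */
static bool demo_page_hwpoison(struct demo_page *page)
{
	return (atomic_load(&page->flags) & DEMO_HWPOISON_MASK) != 0;
}

/* Model of SetPageHWPoison(): mark the page unconditionally. */
static void demo_set_page_hwpoison(struct demo_page *page)
{
	atomic_fetch_or(&page->flags, DEMO_HWPOISON_MASK);
}

/*
 * Model of TestSetPageHWPoison(): set the bit atomically and report
 * whether it was already set, so that only one of several racing
 * reporters goes on to handle the page.
 */
static bool demo_test_set_page_hwpoison(struct demo_page *page)
{
	unsigned long old = atomic_fetch_or(&page->flags, DEMO_HWPOISON_MASK);

	return (old & DEMO_HWPOISON_MASK) != 0;
}
```

The test-and-set form is what makes a second report for the same page
cheap to detect: the first caller sees false and proceeds, later callers
see true and can back off.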
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933993AbZHEJgi (ORCPT ); Wed, 5 Aug 2009 05:36:38 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933983AbZHEJgf (ORCPT ); Wed, 5 Aug 2009 05:36:35 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49063 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933980AbZHEJge (ORCPT ); Wed, 5 Aug 2009 05:36:34 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [7/19] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 Message-Id: <20090805093634.C8C1DB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:34 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is very similar to VM_FAULT_OOM, the only difference is that a different si_code is passed to user space and the new addr_lsb field is initialized. v2: Make the printk more verbose/unique Signed-off-by: Andi Kleen --- arch/x86/mm/fault.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) Index: linux/arch/x86/mm/fault.c =================================================================== --- linux.orig/arch/x86/mm/fault.c +++ linux/arch/x86/mm/fault.c @@ -167,6 +167,7 @@ force_sig_info_fault(int si_signo, int s info.si_errno = 0; info.si_code = si_code; info.si_addr = (void __user *)address; + info.si_addr_lsb = si_code == BUS_MCEERR_AR ? 
PAGE_SHIFT : 0; force_sig_info(si_signo, &info, tsk); } @@ -799,10 +800,12 @@ out_of_memory(struct pt_regs *regs, unsi } static void -do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address) +do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, + unsigned int fault) { struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + int code = BUS_ADRERR; up_read(&mm->mmap_sem); @@ -818,7 +821,15 @@ do_sigbus(struct pt_regs *regs, unsigned tsk->thread.error_code = error_code; tsk->thread.trap_no = 14; - force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); +#ifdef CONFIG_MEMORY_FAILURE + if (fault & VM_FAULT_HWPOISON) { + printk(KERN_ERR + "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n", + tsk->comm, tsk->pid, address); + code = BUS_MCEERR_AR; + } +#endif + force_sig_info_fault(SIGBUS, code, address, tsk); } static noinline void @@ -828,8 +839,8 @@ mm_fault_error(struct pt_regs *regs, uns if (fault & VM_FAULT_OOM) { out_of_memory(regs, error_code, address); } else { - if (fault & VM_FAULT_SIGBUS) - do_sigbus(regs, error_code, address); + if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON)) + do_sigbus(regs, error_code, address, fault); else BUG(); } From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934009AbZHEJgp (ORCPT ); Wed, 5 Aug 2009 05:36:45 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933997AbZHEJgm (ORCPT ); Wed, 5 Aug 2009 05:36:42 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49078 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933983AbZHEJgj (ORCPT ); Wed, 5 Aug 2009 05:36:39 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: fengguang.wu@intel.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [12/19] HWPOISON: Add invalidate_inode_page Message-Id: <20090805093639.D5FBEB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:39 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Wu Fengguang Add a simple way to invalidate a single page This is just a refactoring of the truncate.c code. Originally from Fengguang, modified by Andi Kleen. Signed-off-by: Andi Kleen --- include/linux/mm.h | 2 ++ mm/truncate.c | 26 ++++++++++++++++++++------ 2 files changed, 22 insertions(+), 6 deletions(-) Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -811,6 +811,8 @@ extern int vmtruncate_range(struct inode int truncate_inode_page(struct address_space *mapping, struct page *page); +int invalidate_inode_page(struct page *page); + #ifdef CONFIG_MMU extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); Index: linux/mm/truncate.c =================================================================== --- linux.orig/mm/truncate.c +++ linux/mm/truncate.c @@ -146,6 +146,24 @@ int truncate_inode_page(struct address_s return truncate_complete_page(mapping, page); } +/* + * Safely invalidate one page from its pagecache mapping. + * It only drops clean, unused pages. The page must be locked. + * + * Returns 1 if the page is successfully invalidated, otherwise 0. 
+ */ +int invalidate_inode_page(struct page *page) +{ + struct address_space *mapping = page_mapping(page); + if (!mapping) + return 0; + if (PageDirty(page) || PageWriteback(page)) + return 0; + if (page_mapped(page)) + return 0; + return invalidate_complete_page(mapping, page); +} + /** * truncate_inode_pages - truncate range of pages specified by start & end byte offsets * @mapping: mapping to truncate @@ -312,12 +330,8 @@ unsigned long invalidate_mapping_pages(s if (lock_failed) continue; - if (PageDirty(page) || PageWriteback(page)) - goto unlock; - if (page_mapped(page)) - goto unlock; - ret += invalidate_complete_page(mapping, page); -unlock: + ret += invalidate_inode_page(page); + unlock_page(page); if (next > end) break; From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934049AbZHEJhR (ORCPT ); Wed, 5 Aug 2009 05:37:17 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934015AbZHEJgq (ORCPT ); Wed, 5 Aug 2009 05:36:46 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49101 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933995AbZHEJgk (ORCPT ); Wed, 5 Aug 2009 05:36:40 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [13/19] HWPOISON: Define a new error_remove_page address space op for async truncation Message-Id: <20090805093640.D8856B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:40 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Truncating metadata pages is not safe right now before we haven't audited all file systems. 
To enable truncation only for data address spaces, define a new
address_space callback, error_remove_page.

This is used by the memory error handling in memory-failure.c.
It can then be set to generic_error_remove_page(), which wraps
truncate_inode_page().

This patch just defines the new operation and adds documentation.

Callers and users come in follow-on patches.

Signed-off-by: Andi Kleen

---
 Documentation/filesystems/vfs.txt |    7 +++++++
 include/linux/fs.h                |    1 +
 include/linux/mm.h                |    1 +
 mm/truncate.c                     |   17 +++++++++++++++++
 4 files changed, 26 insertions(+)

Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h
+++ linux/include/linux/fs.h
@@ -595,6 +595,7 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
+	int (*error_remove_page)(struct address_space *, struct page *);
 };
 
 /*
Index: linux/Documentation/filesystems/vfs.txt
===================================================================
--- linux.orig/Documentation/filesystems/vfs.txt
+++ linux/Documentation/filesystems/vfs.txt
@@ -536,6 +536,7 @@ struct address_space_operations {
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
 	int (*launder_page) (struct page *);
+	int (*error_remove_page) (struct address_space *mapping, struct page *page);
 };
 
   writepage: called by the VM to write a dirty page to backing store.
@@ -694,6 +695,12 @@ struct address_space_operations {
   	prevent redirtying the page, it is kept locked during the whole
 	operation.
 
+  error_remove_page: normally set to generic_error_remove_page if truncation
+	is ok for this address space. Used for memory failure handling.
+	Setting this implies you deal with pages going away under you,
+	unless you have them locked or reference counts increased.
+ + The File Object =============== Index: linux/mm/truncate.c =================================================================== --- linux.orig/mm/truncate.c +++ linux/mm/truncate.c @@ -147,6 +147,23 @@ int truncate_inode_page(struct address_s } /* + * Used to get rid of pages on hardware memory corruption. + */ +int generic_error_remove_page(struct address_space *mapping, struct page *page) +{ + if (!mapping) + return -EINVAL; + /* + * Only punch for normal data pages for now. + * Handling other types like directories would need more auditing. + */ + if (!S_ISREG(mapping->host->i_mode)) + return -EIO; + return truncate_inode_page(mapping, page); +} +EXPORT_SYMBOL(generic_error_remove_page); + +/* * Safely invalidate one page from its pagecache mapping. * It only drops clean, unused pages. The page must be locked. * Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -810,6 +810,7 @@ extern int vmtruncate(struct inode * ino extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); +int generic_error_remove_page(struct address_space *mapping, struct page *page); int invalidate_inode_page(struct page *page); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934044AbZHEJhB (ORCPT ); Wed, 5 Aug 2009 05:37:01 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934035AbZHEJg7 (ORCPT ); Wed, 5 Aug 2009 05:36:59 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49105 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934018AbZHEJgq (ORCPT ); Wed, 5 Aug 2009 05:36:46 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, 
npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [19/19] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Message-Id: <20090805093646.EE6F0B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:46 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Useful for some testing scenarios, although specific testing is often done better through MADV_HWPOISON. This can be done with the x86 level MCE injector too, but this interface allows it to be done independently of low level x86 changes. Open issues: Should be disabled for cgroups. Signed-off-by: Andi Kleen --- mm/Kconfig | 6 +++++- mm/Makefile | 1 + mm/hwpoison-inject.c | 40 ++++++++++++++++++++++++++++++++++++++++ mm/madvise.c | 2 +- 4 files changed, 47 insertions(+), 2 deletions(-) Index: linux/mm/hwpoison-inject.c =================================================================== --- /dev/null +++ linux/mm/hwpoison-inject.c @@ -0,0 +1,40 @@ +/* Inject a hwpoison memory failure on an arbitrary pfn */ +#include +#include +#include +#include + +static struct dentry *hwpoison_dir, *corrupt_pfn; + +static int hwpoison_inject(void *data, u64 val) +{ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val); + return __memory_failure(val, 18, 0); +} + +DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n"); + +static void pfn_inject_exit(void) +{ + if (hwpoison_dir) + debugfs_remove_recursive(hwpoison_dir); +} + +static int pfn_inject_init(void) +{ + hwpoison_dir = debugfs_create_dir("hwpoison", NULL); + if (hwpoison_dir == NULL) + return -ENOMEM; + corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir, + NULL, &hwpoison_fops); + if (corrupt_pfn == NULL) { + pfn_inject_exit(); + return -ENOMEM; + } + return 0; +} + +module_init(pfn_inject_init);
+module_exit(pfn_inject_exit); Index: linux/mm/Kconfig =================================================================== --- linux.orig/mm/Kconfig +++ linux/mm/Kconfig @@ -236,12 +236,16 @@ config DEFAULT_MMAP_MIN_ADDR config MEMORY_FAILURE depends on MMU depends on X86_MCE - bool "Enable memory failure recovery" + bool "Enable recovery from hardware memory errors" help Enables code to recover from some memory failures on systems with MCA recovery. This allows a system to continue running even when some of its memory has uncorrected errors. +config HWPOISON_INJECT + tristate "Poison pages injector" + depends on MEMORY_FAILURE && DEBUG_KERNEL + config NOMMU_INITIAL_TRIM_EXCESS int "Turn on mmap() excess space trimming before booting" depends on !MMU Index: linux/mm/Makefile =================================================================== --- linux.orig/mm/Makefile +++ linux/mm/Makefile @@ -41,5 +41,6 @@ endif obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o +obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o Index: linux/mm/madvise.c =================================================================== --- linux.orig/mm/madvise.c +++ linux/mm/madvise.c @@ -213,7 +213,7 @@ static long madvise_remove(struct vm_are */ static int madvise_hwpoison(unsigned long start, unsigned long end) { - int ret = -EIO; + int ret = 0; /* * RED-PEN * This allows to tie up arbitary amounts of memory. 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S934032AbZHEJgu (ORCPT ); Wed, 5 Aug 2009 05:36:50 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S933983AbZHEJgs (ORCPT ); Wed, 5 Aug 2009 05:36:48 -0400
Received: from one.firstfloor.org ([213.235.205.2]:49105 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S933998AbZHEJgl (ORCPT ); Wed, 5 Aug 2009 05:36:41 -0400
From: Andi Kleen
References: <200908051136.682859934@firstfloor.org>
In-Reply-To: <200908051136.682859934@firstfloor.org>
To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
Subject: [PATCH] [14/19] HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
Message-Id: <20090805093641.DB176B15D8@basil.firstfloor.org>
Date: Wed, 5 Aug 2009 11:36:41 +0200 (CEST)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This allows processes to override their early/late kill behaviour on
hardware memory errors.

Typically, applications which are memory error aware are better off with
early kill (they see the error as soon as possible); all others are
better off with late kill (they only see the error when it actually
impacts execution).

There's a global sysctl, but this way an application can set its own
specific policy.

We're using two bits: one to signify that the process stated its
intention, and one for the policy itself (early or late kill). I also
made the prctl future proof by enforcing that the unused arguments are 0.

The state is inherited by children for now. I've been considering
resetting it on exec, but that is not done for now (TBD).

Note this makes us officially run out of process flags on 32bit, but the
next patch can easily add another field.

Manpage patch will be supplied separately.
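
For readers who want to check the argument handling without building a
kernel, the PR_MCE_KILL case can be modelled in user space roughly as
below. This is only a sketch: mce_kill_model() is a made-up name, "flags"
stands in for current->flags, and the two flag values are copied from the
sched.h hunk of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Flag values as defined in the sched.h hunk of this patch. */
#define PF_MCE_PROCESS 0x00000080UL /* process stated a policy */
#define PF_MCE_EARLY   0x08000000UL /* the policy is early kill */

/*
 * User-space model of the PR_MCE_KILL case in sys_prctl().
 * Hypothetical helper, not kernel code: "flags" plays the role of
 * current->flags. Returns 0 or a negative errno like the kernel code.
 */
int mce_kill_model(unsigned long *flags, unsigned long arg2,
		   unsigned long arg3, unsigned long arg4,
		   unsigned long arg5)
{
	assert(flags != NULL);

	if (arg4 | arg5)	/* unused arguments must be 0 */
		return -EINVAL;

	switch (arg2) {
	case 0:			/* drop the per-process policy */
		if (arg3 != 0)
			return -EINVAL;
		*flags &= ~PF_MCE_PROCESS;
		return 0;
	case 1:			/* state a per-process policy */
		*flags |= PF_MCE_PROCESS;
		if (arg3 != 0)
			*flags |= PF_MCE_EARLY;
		else
			*flags &= ~PF_MCE_EARLY;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Going by the switch statement in this patch, an application would
presumably request early kill with prctl(PR_MCE_KILL, 1, 1, 0, 0), late
kill with prctl(PR_MCE_KILL, 1, 0, 0, 0), and drop its own policy again
with prctl(PR_MCE_KILL, 0, 0, 0, 0).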
Signed-off-by: Andi Kleen --- include/linux/prctl.h | 2 ++ include/linux/sched.h | 2 ++ kernel/sys.c | 22 ++++++++++++++++++++++ 3 files changed, 26 insertions(+) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1674,6 +1674,7 @@ extern cputime_t task_gtime(struct task_ #define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ +#define PF_MCE_PROCESS 0x00000080 /* process policy on mce errors */ #define PF_SUPERPRIV 0x00000100 /* used super-user privileges */ #define PF_DUMPCORE 0x00000200 /* dumped core */ #define PF_SIGNALED 0x00000400 /* killed by a signal */ @@ -1693,6 +1694,7 @@ extern cputime_t task_gtime(struct task_ #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */ #define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */ +#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ Index: linux/kernel/sys.c =================================================================== --- linux.orig/kernel/sys.c +++ linux/kernel/sys.c @@ -1528,6 +1528,28 @@ SYSCALL_DEFINE5(prctl, int, option, unsi current->timer_slack_ns = arg2; error = 0; break; + case PR_MCE_KILL: + if (arg4 | arg5) + return -EINVAL; + switch (arg2) { + case 0: + if (arg3 != 0) + return -EINVAL; + current->flags &= ~PF_MCE_PROCESS; + break; + case 1: + current->flags |= PF_MCE_PROCESS; + if (arg3 != 0) + current->flags |= PF_MCE_EARLY; + else + current->flags &= ~PF_MCE_EARLY; + break; + default: + return -EINVAL; + 
} + error = 0; + break; + default: error = -EINVAL; break; Index: linux/include/linux/prctl.h =================================================================== --- linux.orig/include/linux/prctl.h +++ linux/include/linux/prctl.h @@ -88,4 +88,6 @@ #define PR_TASK_PERF_COUNTERS_DISABLE 31 #define PR_TASK_PERF_COUNTERS_ENABLE 32 +#define PR_MCE_KILL 33 + #endif /* _LINUX_PRCTL_H */ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934041AbZHEJhc (ORCPT ); Wed, 5 Aug 2009 05:37:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933973AbZHEJh3 (ORCPT ); Wed, 5 Aug 2009 05:37:29 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49151 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933997AbZHEJgp (ORCPT ); Wed, 5 Aug 2009 05:36:45 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [18/19] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Message-Id: <20090805093645.E5F95B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:45 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Impact: optional, useful for debugging Add a new madvise subcommand to inject poison for some pages in a process' address space. This is useful for testing the poison page handling. Open issues: - This patch allows root to tie up arbitrary amounts of memory. Should this be disabled inside containers? - There's a small race window between getting the page and injecting. The patch drops the ref count because otherwise memory_failure complains about dangling references.
In theory with a multi threaded injector one could inject poison for a process foreign page this way. Not a serious issue right now. v2: Use write flag for get_user_pages to make sure to always get a fresh page v3: Don't request write mapping (Fengguang Wu) Signed-off-by: Andi Kleen --- include/asm-generic/mman-common.h | 1 + mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) Index: linux/mm/madvise.c =================================================================== --- linux.orig/mm/madvise.c +++ linux/mm/madvise.c @@ -207,6 +207,36 @@ static long madvise_remove(struct vm_are return error; } +#ifdef CONFIG_MEMORY_FAILURE +/* + * Error injection support for memory error handling. + */ +static int madvise_hwpoison(unsigned long start, unsigned long end) +{ + int ret = -EIO; + /* + * RED-PEN + * This allows to tie up arbitary amounts of memory. + * Might be a good idea to disable it inside containers even for root. + */ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + for (; start < end; start += PAGE_SIZE) { + struct page *p; + int ret = get_user_pages(current, current->mm, start, 1, + 0, 0, &p, NULL); + if (ret != 1) + return ret; + printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n", + page_to_pfn(p), start); + /* Ignore return value for now */ + __memory_failure(page_to_pfn(p), 0, 1); + put_page(p); + } + return ret; +} +#endif + static long madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, int behavior) @@ -307,6 +337,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, int write; size_t len; +#ifdef CONFIG_MEMORY_FAILURE + if (behavior == MADV_HWPOISON) + return madvise_hwpoison(start, start+len_in); +#endif if (!madvise_behavior_valid(behavior)) return error; Index: linux/include/asm-generic/mman-common.h =================================================================== --- linux.orig/include/asm-generic/mman-common.h +++ 
linux/include/asm-generic/mman-common.h @@ -34,6 +34,7 @@ #define MADV_REMOVE 9 /* remove these pages & resources */ #define MADV_DONTFORK 10 /* don't inherit across fork */ #define MADV_DOFORK 11 /* do inherit across fork */ +#define MADV_HWPOISON 12 /* poison a page for testing */ /* compatibility flags */ #define MAP_FILE 0 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934054AbZHEJhf (ORCPT ); Wed, 5 Aug 2009 05:37:35 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933973AbZHEJhc (ORCPT ); Wed, 5 Aug 2009 05:37:32 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49142 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934008AbZHEJgp (ORCPT ); Wed, 5 Aug 2009 05:36:45 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: Trond.Myklebust@netapp.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [17/19] HWPOISON: Enable error_remove_page for NFS Message-Id: <20090805093644.E3607B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:44 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Enable hardware memory error handling for NFS Truncation of data pages at runtime should be safe in NFS, even when it doesn't support migration so far. 
Cc: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen --- fs/nfs/file.c | 1 + 1 file changed, 1 insertion(+) Index: linux/fs/nfs/file.c =================================================================== --- linux.orig/fs/nfs/file.c +++ linux/fs/nfs/file.c @@ -480,6 +480,7 @@ const struct address_space_operations nf .releasepage = nfs_release_page, .direct_IO = nfs_direct_IO, .launder_page = nfs_launder_page, + .error_remove_page = generic_error_remove_page, }; /* From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934022AbZHEJgq (ORCPT ); Wed, 5 Aug 2009 05:36:46 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933983AbZHEJgm (ORCPT ); Wed, 5 Aug 2009 05:36:42 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49105 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933996AbZHEJgi (ORCPT ); Wed, 5 Aug 2009 05:36:38 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: npiggin@suse.de, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-Id: <20090805093638.D3754B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:38 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Nick Piggin Extract out truncate_inode_page() out of the truncate path so that it can be used by memory-failure.c [AK: description, headers, fix typos] v2: Some white space changes from Fengguang Wu Signed-off-by: Andi Kleen --- include/linux/mm.h | 2 ++ mm/truncate.c | 29 +++++++++++++++-------------- 2 files changed, 17 insertions(+), 14 deletions(-) Index: linux/mm/truncate.c 
=================================================================== --- linux.orig/mm/truncate.c +++ linux/mm/truncate.c @@ -93,11 +93,11 @@ EXPORT_SYMBOL(cancel_dirty_page); * its lock, b) when a concurrent invalidate_mapping_pages got there first and * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. */ -static void +static int truncate_complete_page(struct address_space *mapping, struct page *page) { if (page->mapping != mapping) - return; + return -EIO; if (page_has_private(page)) do_invalidatepage(page, 0); @@ -108,6 +108,7 @@ truncate_complete_page(struct address_sp remove_from_page_cache(page); ClearPageMappedToDisk(page); page_cache_release(page); /* pagecache ref */ + return 0; } /* @@ -135,6 +136,16 @@ invalidate_complete_page(struct address_ return ret; } +int truncate_inode_page(struct address_space *mapping, struct page *page) +{ + if (page_mapped(page)) { + unmap_mapping_range(mapping, + (loff_t)page->index << PAGE_CACHE_SHIFT, + PAGE_CACHE_SIZE, 0); + } + return truncate_complete_page(mapping, page); +} + /** * truncate_inode_pages - truncate range of pages specified by start & end byte offsets * @mapping: mapping to truncate @@ -196,12 +207,7 @@ void truncate_inode_pages_range(struct a unlock_page(page); continue; } - if (page_mapped(page)) { - unmap_mapping_range(mapping, - (loff_t)page_index<index<index > next) next = page->index; next++; - truncate_complete_page(mapping, page); unlock_page(page); } pagevec_release(&pvec); Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -809,6 +809,8 @@ static inline void unmap_shared_mapping_ extern int vmtruncate(struct inode * inode, loff_t offset); extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); +int truncate_inode_page(struct address_space *mapping, struct page *page); + #ifdef CONFIG_MMU extern int handle_mm_fault(struct mm_struct *mm, 
struct vm_area_struct *vma, unsigned long address, unsigned int flags); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934016AbZHEJiQ (ORCPT ); Wed, 5 Aug 2009 05:38:16 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934003AbZHEJgo (ORCPT ); Wed, 5 Aug 2009 05:36:44 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49101 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933987AbZHEJgh (ORCPT ); Wed, 5 Aug 2009 05:36:37 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: fengguang.wu@intel.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [10/19] HWPOISON: check and isolate corrupted free pages v2 Message-Id: <20090805093637.D0A8AB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:37 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Wu Fengguang If memory corruption hits the free buddy pages, we can safely ignore them. No one will access them until page allocation time, then prep_new_page() will automatically check and isolate PG_hwpoison page for us (for 0-order allocation). This patch expands prep_new_page() to check every component page in a high order page allocation, in order to completely stop PG_hwpoison pages from being recirculated. Note that the common case -- only allocating a single page, doesn't do any more work than before. Allocating > order 0 does a bit more work, but that's relatively uncommon. This simple implementation may drop some innocent neighbor pages, hopefully it is not a big problem because the event should be rare enough. This patch adds some runtime costs to high order page users. 
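
The per-component check described above can be modelled in user space
roughly as follows. This is a sketch under simplifying assumptions:
struct toy_page carries only a poison bit, and toy_check_new_page() is a
stand-in for check_new_page(), which in the kernel also looks at the map
count, mapping, reference count and page flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the poison bit is modelled. */
struct toy_page {
	bool hwpoison;
};

/* Stand-in for check_new_page(): nonzero means "do not hand out". */
int toy_check_new_page(const struct toy_page *p)
{
	return p->hwpoison ? 1 : 0;
}

/*
 * Model of the new loop in prep_new_page(): an order-N block covers
 * 1 << N contiguous pages, and one bad page rejects the whole block.
 */
int toy_prep_new_page(const struct toy_page *page, int order)
{
	size_t i;

	assert(page != NULL && order >= 0);

	for (i = 0; i < ((size_t)1 << order); i++) {
		if (toy_check_new_page(page + i))
			return 1;
	}
	return 0;
}
```

With a four-page block whose third page is poisoned, order 0 and order 1
requests starting at the first page pass, while the order 2 request is
rejected as a whole. That is the "innocent neighbor pages" cost mentioned
above.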
[AK: Improved description] v2: Andi Kleen: Port to -mm code Move check into separate function. Don't dump stack in bad_pages for hwpoisoned pages. Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- mm/page_alloc.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) Index: linux/mm/page_alloc.c =================================================================== --- linux.orig/mm/page_alloc.c +++ linux/mm/page_alloc.c @@ -234,6 +234,12 @@ static void bad_page(struct page *page) static unsigned long nr_shown; static unsigned long nr_unshown; + /* Don't complain about poisoned pages */ + if (PageHWPoison(page)) { + __ClearPageBuddy(page); + return; + } + /* * Allow a burst of 60 reports, then keep quiet for that minute; * or allow a steady drip of one report per second. @@ -646,7 +652,7 @@ static inline void expand(struct zone *z /* * This page is about to be returned from the page allocator */ -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) +static inline int check_new_page(struct page *page) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | @@ -655,6 +661,18 @@ static int prep_new_page(struct page *pa bad_page(page); return 1; } + return 0; +} + +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) +{ + int i; + + for (i = 0; i < (1 << order); i++) { + struct page *p = page + i; + if (unlikely(check_new_page(p))) + return 1; + } set_page_private(page, 0); set_page_refcounted(page); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934063AbZHEJim (ORCPT ); Wed, 5 Aug 2009 05:38:42 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934070AbZHEJiR (ORCPT ); Wed, 5 Aug 2009 05:38:17 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49105 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934002AbZHEJgo (ORCPT ); Wed, 5 Aug 2009 05:36:44 -0400 From: Andi 
Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-Id: <20090805093643.E0C00B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:43 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Enable removing of corrupted pages through truncation for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs These should cover most server needs. I chose the set of migration aware file systems for this for now, assuming they have been especially audited. But in general it should be safe for all file systems on the data area that support read/write and truncate. Caveat: the hardware error handler does not take i_mutex for now before calling the truncate function. Is that ok? 
Cc: tytso@mit.edu Cc: hch@infradead.org Cc: mfasheh@suse.com Cc: aia21@cantab.net Cc: hugh.dickins@tiscali.co.uk Cc: swhiteho@redhat.com Signed-off-by: Andi Kleen --- fs/ext2/inode.c | 2 ++ fs/ext3/inode.c | 3 +++ fs/ext4/inode.c | 4 ++++ fs/gfs2/aops.c | 3 +++ fs/ntfs/aops.c | 2 ++ fs/ocfs2/aops.c | 1 + fs/xfs/linux-2.6/xfs_aops.c | 1 + mm/shmem.c | 1 + 8 files changed, 17 insertions(+) Index: linux/fs/gfs2/aops.c =================================================================== --- linux.orig/fs/gfs2/aops.c +++ linux/fs/gfs2/aops.c @@ -1135,6 +1135,7 @@ static const struct address_space_operat .direct_IO = gfs2_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations gfs2_ordered_aops = { @@ -1151,6 +1152,7 @@ static const struct address_space_operat .direct_IO = gfs2_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations gfs2_jdata_aops = { @@ -1166,6 +1168,7 @@ static const struct address_space_operat .invalidatepage = gfs2_invalidatepage, .releasepage = gfs2_releasepage, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; void gfs2_set_aops(struct inode *inode) Index: linux/fs/ntfs/aops.c =================================================================== --- linux.orig/fs/ntfs/aops.c +++ linux/fs/ntfs/aops.c @@ -1550,6 +1550,7 @@ const struct address_space_operations nt .migratepage = buffer_migrate_page, /* Move a page cache page from one physical page to an other. */ + .error_remove_page = generic_error_remove_page, }; /** @@ -1569,6 +1570,7 @@ const struct address_space_operations nt .migratepage = buffer_migrate_page, /* Move a page cache page from one physical page to an other. 
*/ + .error_remove_page = generic_error_remove_page, }; #ifdef NTFS_RW Index: linux/fs/ocfs2/aops.c =================================================================== --- linux.orig/fs/ocfs2/aops.c +++ linux/fs/ocfs2/aops.c @@ -1968,4 +1968,5 @@ const struct address_space_operations oc .releasepage = ocfs2_releasepage, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; Index: linux/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- linux.orig/fs/xfs/linux-2.6/xfs_aops.c +++ linux/fs/xfs/linux-2.6/xfs_aops.c @@ -1636,4 +1636,5 @@ const struct address_space_operations xf .direct_IO = xfs_vm_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; Index: linux/mm/shmem.c =================================================================== --- linux.orig/mm/shmem.c +++ linux/mm/shmem.c @@ -2421,6 +2421,7 @@ static const struct address_space_operat .write_end = shmem_write_end, #endif .migratepage = migrate_page, + .error_remove_page = generic_error_remove_page, }; static const struct file_operations shmem_file_operations = { Index: linux/fs/ext2/inode.c =================================================================== --- linux.orig/fs/ext2/inode.c +++ linux/fs/ext2/inode.c @@ -819,6 +819,7 @@ const struct address_space_operations ex .writepages = ext2_writepages, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; const struct address_space_operations ext2_aops_xip = { @@ -837,6 +838,7 @@ const struct address_space_operations ex .direct_IO = ext2_direct_IO, .writepages = ext2_writepages, .migratepage = buffer_migrate_page, + .error_remove_page = generic_error_remove_page, }; /* Index: linux/fs/ext3/inode.c 
=================================================================== --- linux.orig/fs/ext3/inode.c +++ linux/fs/ext3/inode.c @@ -1819,6 +1819,7 @@ static const struct address_space_operat .direct_IO = ext3_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext3_writeback_aops = { @@ -1834,6 +1835,7 @@ static const struct address_space_operat .direct_IO = ext3_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext3_journalled_aops = { @@ -1848,6 +1850,7 @@ static const struct address_space_operat .invalidatepage = ext3_invalidatepage, .releasepage = ext3_releasepage, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; void ext3_set_aops(struct inode *inode) Index: linux/fs/ext4/inode.c =================================================================== --- linux.orig/fs/ext4/inode.c +++ linux/fs/ext4/inode.c @@ -3373,6 +3373,7 @@ static const struct address_space_operat .direct_IO = ext4_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext4_writeback_aops = { @@ -3388,6 +3389,7 @@ static const struct address_space_operat .direct_IO = ext4_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext4_journalled_aops = { @@ -3402,6 +3404,7 @@ static const struct address_space_operat .invalidatepage = ext4_invalidatepage, .releasepage = ext4_releasepage, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = 
generic_error_remove_page, }; static const struct address_space_operations ext4_da_aops = { @@ -3418,6 +3421,7 @@ static const struct address_space_operat .direct_IO = ext4_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; void ext4_set_aops(struct inode *inode) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934104AbZHEJkB (ORCPT ); Wed, 5 Aug 2009 05:40:01 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933963AbZHEJj6 (ORCPT ); Wed, 5 Aug 2009 05:39:58 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49063 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933962AbZHEJgb (ORCPT ); Wed, 5 Aug 2009 05:36:31 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [3/19] HWPOISON: Add support for poison swap entries v2 Message-Id: <20090805093630.BE26EB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:30 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Memory migration uses special swap entry types to trigger special actions on page faults. Extend this mechanism to also support poisoned swap entries, to trigger poison handling on page faults. This allows follow-on patches to prevent processes from faulting in poisoned pages again. 
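
To make the resulting numbering concrete, here is the arithmetic from the
swap.h hunk of this patch for a kernel with both CONFIG_MIGRATION and
CONFIG_MEMORY_FAILURE enabled. The macros are copied from the patch;
type_is_non_swap() is a made-up helper that models non_swap_entry() on a
bare swap type instead of a swp_entry_t.

```c
#include <assert.h>

#define MAX_SWAPFILES_SHIFT 5

/* Both features enabled: two migration types plus one poison type. */
#define SWP_MIGRATION_NUM 2
#define SWP_HWPOISON_NUM  1

#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

#define SWP_HWPOISON        MAX_SWAPFILES
#define SWP_MIGRATION_READ  (MAX_SWAPFILES + SWP_HWPOISON_NUM)
#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)

/* The highest special type must still fit into the 5 type bits. */
static_assert(SWP_MIGRATION_WRITE == (1 << MAX_SWAPFILES_SHIFT) - 1,
	      "special swap types overflow the type field");

/* Hypothetical model of non_swap_entry(), taking the type directly. */
int type_is_non_swap(unsigned int type)
{
	return type >= (unsigned int)MAX_SWAPFILES;
}
```

With these settings MAX_SWAPFILES is 29, so real swap files use types 0
to 28, SWP_HWPOISON is 29, SWP_MIGRATION_READ is 30 and
SWP_MIGRATION_WRITE is 31.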
v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu) v3: Better overflow fix (Hidehiro Kawai) Signed-off-by: Andi Kleen --- include/linux/swap.h | 34 ++++++++++++++++++++++++++++------ include/linux/swapops.h | 38 ++++++++++++++++++++++++++++++++++++++ mm/swapfile.c | 4 ++-- 3 files changed, 68 insertions(+), 8 deletions(-) Index: linux/include/linux/swap.h =================================================================== --- linux.orig/include/linux/swap.h +++ linux/include/linux/swap.h @@ -34,16 +34,38 @@ static inline int current_is_kswapd(void * the type/offset into the pte as 5/27 as well. */ #define MAX_SWAPFILES_SHIFT 5 -#ifndef CONFIG_MIGRATION -#define MAX_SWAPFILES (1 << MAX_SWAPFILES_SHIFT) + +/* + * Use some of the swap files numbers for other purposes. This + * is a convenient way to hook into the VM to trigger special + * actions on faults. + */ + +/* + * NUMA node memory migration support + */ +#ifdef CONFIG_MIGRATION +#define SWP_MIGRATION_NUM 2 +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) #else -/* Use last two entries for page migration swap entries */ -#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)-2) -#define SWP_MIGRATION_READ MAX_SWAPFILES -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1) +#define SWP_MIGRATION_NUM 0 #endif /* + * Handling of hardware poisoned pages with memory corruption. + */ +#ifdef CONFIG_MEMORY_FAILURE +#define SWP_HWPOISON_NUM 1 +#define SWP_HWPOISON MAX_SWAPFILES +#else +#define SWP_HWPOISON_NUM 0 +#endif + +#define MAX_SWAPFILES \ + ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) + +/* * Magic header for a swap area. 
The first part of the union is * what the swap magic looks like for the old (limited to 128MB) * swap area format, the second part of the union adds - in the Index: linux/include/linux/swapops.h =================================================================== --- linux.orig/include/linux/swapops.h +++ linux/include/linux/swapops.h @@ -131,3 +131,41 @@ static inline int is_write_migration_ent #endif +#ifdef CONFIG_MEMORY_FAILURE +/* + * Support for hardware poisoned pages + */ +static inline swp_entry_t make_hwpoison_entry(struct page *page) +{ + BUG_ON(!PageLocked(page)); + return swp_entry(SWP_HWPOISON, page_to_pfn(page)); +} + +static inline int is_hwpoison_entry(swp_entry_t entry) +{ + return swp_type(entry) == SWP_HWPOISON; +} +#else + +static inline swp_entry_t make_hwpoison_entry(struct page *page) +{ + return swp_entry(0, 0); +} + +static inline int is_hwpoison_entry(swp_entry_t swp) +{ + return 0; +} +#endif + +#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) +static inline int non_swap_entry(swp_entry_t entry) +{ + return swp_type(entry) >= MAX_SWAPFILES; +} +#else +static inline int non_swap_entry(swp_entry_t entry) +{ + return 0; +} +#endif Index: linux/mm/swapfile.c =================================================================== --- linux.orig/mm/swapfile.c +++ linux/mm/swapfile.c @@ -697,7 +697,7 @@ int free_swap_and_cache(swp_entry_t entr struct swap_info_struct *p; struct page *page = NULL; - if (is_migration_entry(entry)) + if (non_swap_entry(entry)) return 1; p = swap_info_get(entry); @@ -2083,7 +2083,7 @@ static int __swap_duplicate(swp_entry_t int count; bool has_cache; - if (is_migration_entry(entry)) + if (non_swap_entry(entry)) return -EINVAL; type = swp_type(entry); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934089AbZHEJjR (ORCPT ); Wed, 5 Aug 2009 05:39:17 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933982AbZHEJjN 
(ORCPT ); Wed, 5 Aug 2009 05:39:13 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49078 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933974AbZHEJgc (ORCPT ); Wed, 5 Aug 2009 05:36:32 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [5/19] HWPOISON: Add basic support for poisoned pages in fault handler v3 Message-Id: <20090805093632.C3860B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:32 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org - Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now architectures have to explicitly enable poison page support, so this is forward compatible with all architectures. They only need to add it when they enable poison page support. - Add poison page handling in the swap-in fault code v2: Add missing delayacct_clear_flag (Hidehiro Kawai) v3: Really use delayacct_clear_flag (Hidehiro Kawai) Signed-off-by: Andi Kleen --- include/linux/mm.h | 3 ++- mm/memory.c | 18 +++++++++++++++--- 2 files changed, 17 insertions(+), 4 deletions(-) Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c +++ linux/mm/memory.c @@ -1319,7 +1319,8 @@ int __get_user_pages(struct task_struct if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) return i ? i : -ENOMEM; - else if (ret & VM_FAULT_SIGBUS) + if (ret & + (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS)) return i ?
i : -EFAULT; BUG(); } @@ -2511,8 +2512,15 @@ static int do_swap_page(struct mm_struct goto out; entry = pte_to_swp_entry(orig_pte); - if (is_migration_entry(entry)) { - migration_entry_wait(mm, pmd, address); + if (unlikely(non_swap_entry(entry))) { + if (is_migration_entry(entry)) { + migration_entry_wait(mm, pmd, address); + } else if (is_hwpoison_entry(entry)) { + ret = VM_FAULT_HWPOISON; + } else { + print_bad_pte(vma, address, pte, NULL); + ret = VM_FAULT_OOM; + } goto out; } delayacct_set_flag(DELAYACCT_PF_SWAPIN); @@ -2536,6 +2544,10 @@ static int do_swap_page(struct mm_struct /* Had to read the page from swap area: Major fault */ ret = VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); + } else if (PageHWPoison(page)) { + ret = VM_FAULT_HWPOISON; + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + goto out; } lock_page(page); Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -700,11 +700,12 @@ static inline int page_mapped(struct pag #define VM_FAULT_SIGBUS 0x0002 #define VM_FAULT_MAJOR 0x0004 #define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */ +#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */ #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ -#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS) +#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON) /* * Can be called by the pagefault handler when it gets a VM_FAULT_OOM. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934095AbZHEJji (ORCPT ); Wed, 5 Aug 2009 05:39:38 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934071AbZHEJjI (ORCPT ); Wed, 5 Aug 2009 05:39:08 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49084 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933884AbZHEJgg (ORCPT ); Wed, 5 Aug 2009 05:36:36 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: ak@linux.intel.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [9/19] HWPOISON: Handle hardware poisoned pages in try_to_unmap Message-Id: <20090805093636.CDE6FB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:36 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Andi Kleen When a page has the poison bit set replace the PTE with a poison entry. This causes the right error handling to be done later when a process runs into it. 
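The order of checks this patch adds to try_to_unmap_one() can be sketched as a small user-space model. This is an assumption-laden illustration only: struct page_model, struct mm_model, enum pte_kind and unmap_one_model() are hypothetical names, the rss counters are plain longs, and the existing anon/file paths are collapsed into one "other" result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as in the rmap.h hunks of this series. */
enum ttu_flags {
	TTU_UNMAP = 0,
	TTU_MIGRATION = 1,
	TTU_MUNLOCK = 2,
	TTU_ACTION_MASK = 0xff,
	TTU_IGNORE_MLOCK = (1 << 8),
	TTU_IGNORE_ACCESS = (1 << 9),
	TTU_IGNORE_HWPOISON = (1 << 10),
};

/* Hypothetical stand-ins for struct page and struct mm_struct. */
struct page_model {
	bool hwpoison;	/* models PageHWPoison(page) */
	bool anon;	/* models PageAnon(page) */
};

struct mm_model {
	long anon_rss;
	long file_rss;
};

enum pte_kind {
	PTE_HWPOISON_ENTRY,	/* pte replaced by a poison swap entry */
	PTE_OTHER,		/* existing swap/migration/file paths */
};

static inline enum pte_kind unmap_one_model(const struct page_model *page,
					    struct mm_model *mm, int flags)
{
	assert(page != NULL && mm != NULL);

	/* The poison check comes first, unless the caller opted out. */
	if (page->hwpoison && !(flags & TTU_IGNORE_HWPOISON)) {
		if (page->anon)
			mm->anon_rss--;
		else
			mm->file_rss--;
		return PTE_HWPOISON_ENTRY;
	}
	/* Not modelled: the pre-existing anon and file handling. */
	return PTE_OTHER;
}
```

The point of the ordering is that a poisoned page never gets a normal swap or migration entry unless the caller passes TTU_IGNORE_HWPOISON explicitly.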
v2: add a new flag to not do that (needed for the memory-failure handler later) (Fengguang) v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan) Reviewed-by: Minchan Kim Reviewed-by: Wu Fengguang Signed-off-by: Andi Kleen --- include/linux/rmap.h | 1 + mm/rmap.c | 9 ++++++++- 2 files changed, 9 insertions(+), 1 deletion(-) Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -819,7 +819,14 @@ static int try_to_unmap_one(struct page /* Update high watermark before we lower rss */ update_hiwater_rss(mm); - if (PageAnon(page)) { + if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) { + if (PageAnon(page)) + dec_mm_counter(mm, anon_rss); + else + dec_mm_counter(mm, file_rss); + set_pte_at(mm, address, pte, + swp_entry_to_pte(make_hwpoison_entry(page))); + } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(page) }; if (PageSwapCache(page)) { Index: linux/include/linux/rmap.h =================================================================== --- linux.orig/include/linux/rmap.h +++ linux/include/linux/rmap.h @@ -93,6 +93,7 @@ enum ttu_flags { TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ + TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */ }; #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934039AbZHEJjP (ORCPT ); Wed, 5 Aug 2009 05:39:15 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933980AbZHEJjK (ORCPT ); Wed, 5 Aug 2009 05:39:10 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49078 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933982AbZHEJgf (ORCPT ); Wed, 5 Aug 2009 05:36:35 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: 
<200908051136.682859934@firstfloor.org>
To: Lee.Schermerhorn@hp.com, npiggin@suse.de, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
Subject: [PATCH] [8/19] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
Message-Id: <20090805093635.CB595B15D8@basil.firstfloor.org>
Date: Wed, 5 Aug 2009 11:36:35 +0200 (CEST)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

try_to_unmap currently has multiple modes (migration, munlock, normal unmap) which are selected by magic flag variables. The logic is not very straightforward, because each of these flags changes multiple behaviours (e.g. migration turns off aging in addition to setting up migration ptes). The different flags also interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, munmap) and some additional flags as modifiers (ignore mlock, ignore aging). This makes the logic more straightforward and allows easier extension to new behaviours.

Change all the callers to declare what they want to do.

This patch is supposed to be a nop in behaviour. If anyone can prove it is not, that would be a bug.
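The action-code-plus-modifiers split can be tried out in user space. The enum and TTU_ACTION() below are copied from the rmap.h hunk of this patch. ttu_from_legacy(), ttu_checks_mlock() and ttu_checks_young() are hypothetical helpers written only for illustration, derived from how the call sites in the diff are converted. They are not part of the patch.

```c
#include <assert.h>

enum ttu_flags {
	TTU_UNMAP = 0,			/* unmap mode */
	TTU_MIGRATION = 1,		/* migration mode */
	TTU_MUNLOCK = 2,		/* munlock mode */
	TTU_ACTION_MASK = 0xff,

	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

/*
 * Hypothetical helper: map the old (unlock, migration) argument pair
 * to the new flags, following the converted call sites in the diff.
 */
static inline int ttu_from_legacy(int unlock, int migration)
{
	if (unlock)		/* old code ignored @migration if @unlock */
		return TTU_MUNLOCK;
	if (migration)
		return TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
	return TTU_UNMAP;
}

/* Hypothetical predicates mirroring the two tests in try_to_unmap_one(). */
static inline int ttu_checks_mlock(int flags)
{
	assert((flags & ~(TTU_ACTION_MASK | TTU_IGNORE_MLOCK |
			  TTU_IGNORE_ACCESS)) == 0);
	return !(flags & TTU_IGNORE_MLOCK);
}

static inline int ttu_checks_young(int flags)
{
	return !(flags & TTU_IGNORE_ACCESS);
}
```

In this model the old "!migration" condition splits into two independent predicates, which is what lets a later caller ignore mlock without also disabling aging, or the other way round.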
Cc: Lee.Schermerhorn@hp.com Cc: npiggin@suse.de Signed-off-by: Andi Kleen --- include/linux/rmap.h | 13 ++++++++++++- mm/migrate.c | 2 +- mm/rmap.c | 40 ++++++++++++++++++++++------------------ mm/vmscan.c | 2 +- 4 files changed, 36 insertions(+), 21 deletions(-) Index: linux/include/linux/rmap.h =================================================================== --- linux.orig/include/linux/rmap.h +++ linux/include/linux/rmap.h @@ -85,7 +85,18 @@ static inline void page_dup_rmap(struct */ int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt, unsigned long *vm_flags); -int try_to_unmap(struct page *, int ignore_refs); +enum ttu_flags { + TTU_UNMAP = 0, /* unmap mode */ + TTU_MIGRATION = 1, /* migration mode */ + TTU_MUNLOCK = 2, /* munlock mode */ + TTU_ACTION_MASK = 0xff, + + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ +}; +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) + +int try_to_unmap(struct page *, enum ttu_flags flags); /* * Called from mm/filemap_xip.c to unmap empty zero page Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -773,7 +773,7 @@ void page_remove_rmap(struct page *page) * repeatedly from either try_to_unmap_anon or try_to_unmap_file. */ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, - int migration) + enum ttu_flags flags) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -795,11 +795,13 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. 
*/ - if (!migration) { + if (!(flags & TTU_IGNORE_MLOCK)) { if (vma->vm_flags & VM_LOCKED) { ret = SWAP_MLOCK; goto out_unmap; } + } + if (!(flags & TTU_IGNORE_ACCESS)) { if (ptep_clear_flush_young_notify(vma, address, pte)) { ret = SWAP_FAIL; goto out_unmap; @@ -839,12 +841,12 @@ static int try_to_unmap_one(struct page * pte. do_swap_page() will wait until the migration * pte is removed and then restart fault handling. */ - BUG_ON(!migration); + BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION); entry = make_migration_entry(page, pte_write(pteval)); } set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); - } else if (PAGE_MIGRATION && migration) { + } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) { /* Establish migration entry for a file page */ swp_entry_t entry; entry = make_migration_entry(page, pte_write(pteval)); @@ -1013,12 +1015,13 @@ static int try_to_mlock_page(struct page * vm_flags for that VMA. That should be OK, because that vma shouldn't be * 'LOCKED. 
*/ -static int try_to_unmap_anon(struct page *page, int unlock, int migration) +static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) { struct anon_vma *anon_vma; struct vm_area_struct *vma; unsigned int mlocked = 0; int ret = SWAP_AGAIN; + int unlock = TTU_ACTION(flags) == TTU_MUNLOCK; if (MLOCK_PAGES && unlikely(unlock)) ret = SWAP_SUCCESS; /* default for try_to_munlock() */ @@ -1034,7 +1037,7 @@ static int try_to_unmap_anon(struct page continue; /* must visit all unlocked vmas */ ret = SWAP_MLOCK; /* saw at least one mlocked vma */ } else { - ret = try_to_unmap_one(page, vma, migration); + ret = try_to_unmap_one(page, vma, flags); if (ret == SWAP_FAIL || !page_mapped(page)) break; } @@ -1058,8 +1061,7 @@ static int try_to_unmap_anon(struct page /** * try_to_unmap_file - unmap/unlock file page using the object-based rmap method * @page: the page to unmap/unlock - * @unlock: request for unlock rather than unmap [unlikely] - * @migration: unmapping for migration - ignored if @unlock + * @flags: action and flags * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. @@ -1071,7 +1073,7 @@ static int try_to_unmap_anon(struct page * vm_flags for that VMA. That should be OK, because that vma shouldn't be * 'LOCKED. 
*/ -static int try_to_unmap_file(struct page *page, int unlock, int migration) +static int try_to_unmap_file(struct page *page, enum ttu_flags flags) { struct address_space *mapping = page->mapping; pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -1083,6 +1085,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; unsigned int mlocked = 0; + int unlock = TTU_ACTION(flags) == TTU_MUNLOCK; if (MLOCK_PAGES && unlikely(unlock)) ret = SWAP_SUCCESS; /* default for try_to_munlock() */ @@ -1095,7 +1098,7 @@ static int try_to_unmap_file(struct page continue; /* must visit all vmas */ ret = SWAP_MLOCK; } else { - ret = try_to_unmap_one(page, vma, migration); + ret = try_to_unmap_one(page, vma, flags); if (ret == SWAP_FAIL || !page_mapped(page)) goto out; } @@ -1120,7 +1123,8 @@ static int try_to_unmap_file(struct page ret = SWAP_MLOCK; /* leave mlocked == 0 */ goto out; /* no need to look further */ } - if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED)) + if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) && + (vma->vm_flags & VM_LOCKED)) continue; cursor = (unsigned long) vma->vm_private_data; if (cursor > max_nl_cursor) @@ -1154,7 +1158,7 @@ static int try_to_unmap_file(struct page do { list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) { - if (!MLOCK_PAGES && !migration && + if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) continue; cursor = (unsigned long) vma->vm_private_data; @@ -1194,7 +1198,7 @@ out: /** * try_to_unmap - try to remove all page table mappings to a page * @page: the page to get unmapped - * @migration: migration flag + * @flags: action and flags * * Tries to remove all the page table entries which are mapping this * page, used in the pageout path. Caller must hold the page lock. @@ -1205,16 +1209,16 @@ out: * SWAP_FAIL - the page is unswappable * SWAP_MLOCK - page is mlocked. 
*/ -int try_to_unmap(struct page *page, int migration) +int try_to_unmap(struct page *page, enum ttu_flags flags) { int ret; BUG_ON(!PageLocked(page)); if (PageAnon(page)) - ret = try_to_unmap_anon(page, 0, migration); + ret = try_to_unmap_anon(page, flags); else - ret = try_to_unmap_file(page, 0, migration); + ret = try_to_unmap_file(page, flags); if (ret != SWAP_MLOCK && !page_mapped(page)) ret = SWAP_SUCCESS; return ret; @@ -1239,8 +1243,8 @@ int try_to_munlock(struct page *page) VM_BUG_ON(!PageLocked(page) || PageLRU(page)); if (PageAnon(page)) - return try_to_unmap_anon(page, 1, 0); + return try_to_unmap_anon(page, TTU_MUNLOCK); else - return try_to_unmap_file(page, 1, 0); + return try_to_unmap_file(page, TTU_MUNLOCK); } Index: linux/mm/vmscan.c =================================================================== --- linux.orig/mm/vmscan.c +++ linux/mm/vmscan.c @@ -654,7 +654,7 @@ static unsigned long shrink_page_list(st * processes. Try to unmap it here. */ if (page_mapped(page) && mapping) { - switch (try_to_unmap(page, 0)) { + switch (try_to_unmap(page, TTU_UNMAP)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: Index: linux/mm/migrate.c =================================================================== --- linux.orig/mm/migrate.c +++ linux/mm/migrate.c @@ -669,7 +669,7 @@ static int unmap_and_move(new_page_t get } /* Establish migration ptes or remove ptes */ - try_to_unmap(page, 1); + try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(page)) rc = move_to_new_page(newpage, page); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934081AbZHEJjI (ORCPT ); Wed, 5 Aug 2009 05:39:08 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933980AbZHEJgg (ORCPT ); Wed, 5 Aug 2009 05:36:36 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49084 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id 
S933977AbZHEJgd (ORCPT ); Wed, 5 Aug 2009 05:36:33 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [6/19] HWPOISON: Add various poison checks in mm/memory.c v2 Message-Id: <20090805093633.C6124B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:33 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Bail out early when hardware poisoned pages are found in page fault handling. Since they are poisoned they should not be mapped freshly into processes, because that would cause another (potentially deadly) machine check. This is generally handled in the same way as OOM, just a different error code is returned to the architecture code. v2: Do a page unlock if needed (Fengguang Wu) Signed-off-by: Andi Kleen --- mm/memory.c | 6 ++++++ 1 file changed, 6 insertions(+) Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c +++ linux/mm/memory.c @@ -2711,6 +2711,12 @@ static int __do_fault(struct mm_struct * if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) return ret; + if (unlikely(PageHWPoison(vmf.page))) { + if (ret & VM_FAULT_LOCKED) + unlock_page(vmf.page); + return VM_FAULT_HWPOISON; + } + + /* + * For consistency in subsequent calls, make the faulted page always + * locked.
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934087AbZHEJj5 (ORCPT ); Wed, 5 Aug 2009 05:39:57 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933963AbZHEJgc (ORCPT ); Wed, 5 Aug 2009 05:36:32 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49059 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933950AbZHEJg3 (ORCPT ); Wed, 5 Aug 2009 05:36:29 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [2/19] HWPOISON: Export some rmap vma locking to outside world Message-Id: <20090805093629.BBAAEB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:29 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Needed for later patch that walks rmap entries on its own. This used to be very frowned upon, but memory-failure.c does some rather specialized rmap walking and rmap has been stable for quite some time, so I think it's ok now to export it. Signed-off-by: Andi Kleen --- include/linux/rmap.h | 6 ++++++ mm/rmap.c | 4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) Index: linux/include/linux/rmap.h =================================================================== --- linux.orig/include/linux/rmap.h +++ linux/include/linux/rmap.h @@ -112,6 +112,12 @@ int page_mkclean(struct page *); */ int try_to_munlock(struct page *); +/* + * Called by memory-failure.c to kill processes. 
+ */ +struct anon_vma *page_lock_anon_vma(struct page *page); +void page_unlock_anon_vma(struct anon_vma *anon_vma); + #else /* !CONFIG_MMU */ #define anon_vma_init() do {} while (0) Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -191,7 +191,7 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. */ -static struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *page_lock_anon_vma(struct page *page) { struct anon_vma *anon_vma; unsigned long anon_mapping; @@ -211,7 +211,7 @@ out: return NULL; } -static void page_unlock_anon_vma(struct anon_vma *anon_vma) +void page_unlock_anon_vma(struct anon_vma *anon_vma) { spin_unlock(&anon_vma->lock); rcu_read_unlock(); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934074AbZHEJiV (ORCPT ); Wed, 5 Aug 2009 05:38:21 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934046AbZHEJiP (ORCPT ); Wed, 5 Aug 2009 05:38:15 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49101 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933926AbZHEJgo (ORCPT ); Wed, 5 Aug 2009 05:36:44 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: npiggin@suse.de, riel@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [15/19] HWPOISON: The high level memory error handler in the VM v7 Message-Id: <20090805093642.DE156B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:42 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add the high 
level memory handler that poisons pages that got corrupted by hardware (typically by a two bit flip in a DIMM or a cache) on the Linux level. The goal is to prevent everyone from accessing these pages in the future.

This is done at the VM level by marking a page hwpoisoned and doing the appropriate action based on the type of page it is.

The code that does this is portable and lives in mm/memory-failure.c.

To quote the overview comment:

 * High level machine check handler. Handles pages reported by the
 * hardware as being corrupted usually due to a 2bit ECC memory or cache
 * failure.
 *
 * This focuses on pages detected as corrupted in the background.
 * When the current CPU tries to consume corruption the currently
 * running process can just be killed directly instead. This implies
 * that if the error cannot be handled for some reason it's safe to
 * just ignore it because no corruption has been consumed yet. Instead
 * when that happens another machine check will happen.
 *
 * Handles page cache pages in various states. The tricky part
 * here is that we can access any page asynchronous to other VM
 * users, because memory failures could happen anytime and anywhere,
 * possibly violating some of their assumptions. This is why this code
 * has to be extremely careful. Generally it tries to use normal locking
 * rules, as in get the standard locks, even if that means the
 * error handling takes potentially a long time.
 *
 * Some of the operations here are somewhat inefficient and have non
 * linear algorithmic complexity, because the data structures have not
 * been optimized for this case. This is in particular the case
 * for the mapping from a vma to a process. Since this case is expected
 * to be rare we hope we can get away with this.

There are in principle two strategies to kill processes on poison:

- just unmap the data and wait for an actual reference before killing
- kill as soon as corruption is detected.
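How one of the two strategies gets picked per task can be sketched from the task_early_kill() helper further down in this patch. The model below is a user-space illustration under stated assumptions: struct task_model is a hypothetical stand-in for task_struct, the PF_MCE_* bit values are placeholders chosen for the example rather than taken from the kernel headers, and the global default is passed in as a parameter instead of being read from sysctl_memory_failure_early_kill.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder bit values for this sketch only. */
#define PF_MCE_PROCESS	0x1u	/* task has a per-process policy */
#define PF_MCE_EARLY	0x2u	/* that policy is "kill early" */

/* Hypothetical stand-in for the task_struct fields the helper reads. */
struct task_model {
	int has_mm;		/* models tsk->mm != NULL */
	unsigned int flags;	/* models tsk->flags */
};

/*
 * Returns nonzero if the task should be signalled as soon as the
 * corruption is detected, zero if it is left alone until it actually
 * touches the poisoned page.
 */
static inline int task_early_kill_model(const struct task_model *tsk,
					int global_early_kill)
{
	assert(tsk != NULL);

	if (!tsk->has_mm)	/* no user address space: nothing to signal */
		return 0;
	if (tsk->flags & PF_MCE_PROCESS)	/* per-process policy wins */
		return !!(tsk->flags & PF_MCE_EARLY);
	return global_early_kill;	/* otherwise use the global default */
}
```

In this model a per-process policy overrides the global default in both directions: it can opt a task out of early kill when the default is on, and opt it in when the default is off.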
Both have advantages and disadvantages and should be used in different situations. Right now both are implemented and can be switched with a new sysctl vm.memory_failure_early_kill The default is early kill. The patch does some rmap data structure walking on its own to collect processes to kill. This is unusual because normally all rmap data structure knowledge is in rmap.c only. I put it here for now to keep everything together and rmap knowledge has been seeping out anyways v2: Fix anon vma unlock crash (noticed by Johannes Weiner ) Handle pages on free list correctly (also noticed by Johannes) Fix inverted try_to_release_page check (found by Chris Mason) Add documentation for the new sysctl. Various other cleanups/comment fixes. v3: Use blockable signal for AO SIGBUS for better qemu handling. Numerous fixes from Fengguang Wu: New code layout for the table (redone by AK) Move the hwpoison bit setting before the lock (Fengguang Wu) Some code cleanups (Fengguang Wu, AK) Add missing lru_drain (Fengguang Wu) Do more checks for valid mappings (inspired by patch from Fengguang) Handle free pages and fixes for clean pages (Fengguang) Removed swap cache handling for now, needs more work Better mapping checks to avoid races (Fengguang) Fix swapcache (Fengguang) Handle private2 pages too (Fengguang) v4: Various fixes based on review comments from Nick Piggin Document locking order. Improved comments. Slightly improved description Remove bogus hunk. Wait properly for writeback pages (Nick Piggin) v5: Improve various comments Handle page_address_in_vma() failure better by SIGKILL and also make message debugging only Clean up printks Remove redundant PageWriteback check (Nick Piggin) Add missing clear_page_mlock Reformat state table to be <80 columns again Use truncate helper instead of manual truncate in me_pagecache_* Check for metadata buffer pages and reject them. A few cleanups. v6: Fix a printk broken in the last round of cleanups. 
More minor cleanups and fixes based on comments from Fengguang Wu. Rename /proc/meminfo Header to "HardwareCorrupted" Add a printk for the failed mapping case (Fengguang Wu) Better clean page check (Fengguang Wu) v7: Use lru_add_drain_all Use isolate_lru result in final status [Merge with Fengguang's version. Only cherry picked changes.] Fix bad pfn handling in action_result (Fengguang) Reverse tasklist lock locking order (Nick Piggin) Check if VMA has really page mapped (Fengguang) Add a missing page->mapping test (Fengguang) Account all bad pages (Fengguang) Misc comment fixes (Fengguang, Andi) Add sysctl to force panic on memory failure Only truncate if address space has it enabled, otherwise invalidate Now selectable in Kconfig by the user. Allow elevated reference count and return status Misc changes Cc: npiggin@suse.de Cc: riel@redhat.com Signed-off-by: Andi Kleen Acked-by: Rik van Riel Reviewed-by: Hidehiro Kawai --- Documentation/sysctl/vm.txt | 41 ++ fs/proc/meminfo.c | 9 include/linux/mm.h | 7 include/linux/rmap.h | 1 kernel/sysctl.c | 25 + mm/Kconfig | 9 mm/Makefile | 1 mm/filemap.c | 4 mm/memory-failure.c | 828 ++++++++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 7 10 files changed, 929 insertions(+), 3 deletions(-) Index: linux/mm/Makefile =================================================================== --- linux.orig/mm/Makefile +++ linux/mm/Makefile @@ -40,5 +40,6 @@ obj-$(CONFIG_SMP) += allocpercpu.o endif obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o +obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o Index: linux/mm/memory-failure.c =================================================================== --- /dev/null +++ linux/mm/memory-failure.c @@ -0,0 +1,828 @@ +/* + * Copyright (C) 2008, 2009 Intel Corporation + * Authors: Andi Kleen, Fengguang Wu + * + * This software may be redistributed 
and/or modified under the terms of + * the GNU General Public License ("GPL") version 2 only as published by the + * Free Software Foundation. + * + * High level machine check handler. Handles pages reported by the + * hardware as being corrupted usually due to a 2bit ECC memory or cache + * failure. + * + * Handles page cache pages in various states. The tricky part + * here is that we can access any page asynchronous to other VM + * users, because memory failures could happen anytime and anywhere, + * possibly violating some of their assumptions. This is why this code + * has to be extremely careful. Generally it tries to use normal locking + * rules, as in get the standard locks, even if that means the + * error handling takes potentially a long time. + * + * The operation to map back from RMAP chains to processes has to walk + * the complete process list and has non linear complexity with the number + * mappings. In short it can be quite slow. But since memory corruptions + * are rare we hope to get away with this. + */ + +/* + * Notebook: + * - hugetlb needs more code + * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages + * - pass bad pages to kdump next kernel + */ +#define DEBUG 1 +#include +#include +#include +#include +#include +#include +#include +#include +#include "internal.h" + +/* For better testing 1; Change default later to 0 */ +int sysctl_memory_failure_early_kill __read_mostly = 1; + +int sysctl_memory_failure_recovery __read_mostly = 1; + +atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0); + +/* + * Send all the processes who have the page mapped an ``action optional'' + * signal. 
+ */ +static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno, + unsigned long pfn) +{ + struct siginfo si; + int ret; + + printk(KERN_ERR + "MCE %#lx: Killing %s:%d early due to hardware memory corruption\n", + pfn, t->comm, t->pid); + si.si_signo = SIGBUS; + si.si_errno = 0; + si.si_code = BUS_MCEERR_AO; + si.si_addr = (void *)addr; +#ifdef __ARCH_SI_TRAPNO + si.si_trapno = trapno; +#endif + si.si_addr_lsb = PAGE_SHIFT; + /* + * Don't use force here, it's convenient if the signal + * can be temporarily blocked. + * This could cause a loop when the user sets SIGBUS + * to SIG_IGN, but hopefully noone will do that? + */ + ret = send_sig_info(SIGBUS, &si, t); /* synchronous? */ + if (ret < 0) + printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n", + t->comm, t->pid, ret); + return ret; +} + +/* + * Kill all processes that have a poisoned page mapped and then isolate + * the page. + * + * General strategy: + * Find all processes having the page mapped and kill them. + * But we keep a page reference around so that the page is not + * actually freed yet. + * Then stash the page away + * + * There's no convenient way to get back to mapped processes + * from the VMAs. So do a brute-force search over all + * running processes. + * + * Remember that machine checks are not common (or rather + * if they are common you have other problems), so this shouldn't + * be a performance issue. + * + * Also there are some races possible while we get from the + * error detection to actually handle it. + */ + +struct to_kill { + struct list_head nd; + struct task_struct *tsk; + unsigned long addr; + unsigned addr_valid:1; +}; + +/* + * Failure handling: if we can't find or can't kill a process there's + * not much we can do. We just print a message and ignore otherwise. + */ + +/* + * Schedule a process for later kill. + * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM. + * TBD would GFP_NOIO be enough? 
+ */ +static void add_to_kill(struct task_struct *tsk, struct page *p, + struct vm_area_struct *vma, + struct list_head *to_kill, + struct to_kill **tkc) +{ + struct to_kill *tk; + + if (*tkc) { + tk = *tkc; + *tkc = NULL; + } else { + tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC); + if (!tk) { + printk(KERN_ERR + "MCE: Out of memory while machine check handling\n"); + return; + } + } + tk->addr = page_address_in_vma(p, vma); + tk->addr_valid = 1; + + /* + * In theory we don't have to kill when the page was + * munmaped. But it could be also a mremap. Since that's + * likely very rare kill anyways just out of paranoia, but use + * a SIGKILL because the error is not contained anymore. + */ + if (tk->addr == -EFAULT) { + pr_debug("MCE: Unable to find user space address %lx in %s\n", + page_to_pfn(p), tsk->comm); + tk->addr_valid = 0; + } + get_task_struct(tsk); + tk->tsk = tsk; + list_add_tail(&tk->nd, to_kill); +} + +/* + * Kill the processes that have been collected earlier. + * + * Only do anything when DOIT is set, otherwise just free the list + * (this is used for clean pages which do not need killing) + * Also when FAIL is set do a force kill because something went + * wrong earlier. + */ +static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno, + int fail, unsigned long pfn) +{ + struct to_kill *tk, *next; + + list_for_each_entry_safe (tk, next, to_kill, nd) { + if (doit) { + /* + * In case something went wrong with munmaping + * make sure the process doesn't catch the + * signal and then access the memory. Just kill it. + * the signal handlers + */ + if (fail || tk->addr_valid == 0) { + printk(KERN_ERR + "MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n", + pfn, tk->tsk->comm, tk->tsk->pid); + force_sig(SIGKILL, tk->tsk); + } + + /* + * In theory the process could have mapped + * something else on the address in-between. We could + * check for that, but we need to tell the + * process anyways. 
+ */ + else if (kill_proc_ao(tk->tsk, tk->addr, trapno, + pfn) < 0) + printk(KERN_ERR + "MCE %#lx: Cannot send advisory machine check signal to %s:%d\n", + pfn, tk->tsk->comm, tk->tsk->pid); + } + put_task_struct(tk->tsk); + kfree(tk); + } +} + +static int task_early_kill(struct task_struct *tsk) +{ + if (!tsk->mm) + return 0; + if (tsk->flags & PF_MCE_PROCESS) + return !!(tsk->flags & PF_MCE_EARLY); + return sysctl_memory_failure_early_kill; +} + +/* + * Collect processes when the error hit an anonymous page. + */ +static void collect_procs_anon(struct page *page, struct list_head *to_kill, + struct to_kill **tkc) +{ + struct vm_area_struct *vma; + struct task_struct *tsk; + struct anon_vma *av; + + read_lock(&tasklist_lock); + av = page_lock_anon_vma(page); + if (av == NULL) /* Not actually mapped anymore */ + goto out; + for_each_process (tsk) { + if (!task_early_kill(tsk)) + continue; + list_for_each_entry (vma, &av->head, anon_vma_node) { + if (!page_mapped_in_vma(page, vma)) + continue; + if (vma->vm_mm == tsk->mm) + add_to_kill(tsk, page, vma, to_kill, tkc); + } + } + page_unlock_anon_vma(av); +out: + read_unlock(&tasklist_lock); +} + +/* + * Collect processes when the error hit a file mapped page. + */ +static void collect_procs_file(struct page *page, struct list_head *to_kill, + struct to_kill **tkc) +{ + struct vm_area_struct *vma; + struct task_struct *tsk; + struct prio_tree_iter iter; + struct address_space *mapping = page->mapping; + + /* + * A note on the locking order between the two locks. + * We don't rely on this particular order. + * If you have some other code that needs a different order + * feel free to switch them around. Or add a reverse link + * from mm_struct to task_struct, then this could be all + * done without taking tasklist_lock and looping over all tasks. 
+ */ + + read_lock(&tasklist_lock); + spin_lock(&mapping->i_mmap_lock); + for_each_process(tsk) { + pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); + + if (!task_early_kill(tsk)) + continue; + + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, + pgoff) { + /* + * Send early kill signal to tasks where a vma covers + * the page but the corrupted page is not necessarily + * mapped it in its pte. + * Assume applications who requested early kill want + * to be informed of all such data corruptions. + */ + if (vma->vm_mm == tsk->mm) + add_to_kill(tsk, page, vma, to_kill, tkc); + } + } + spin_unlock(&mapping->i_mmap_lock); + read_unlock(&tasklist_lock); +} + +/* + * Collect the processes who have the corrupted page mapped to kill. + * This is done in two steps for locking reasons. + * First preallocate one tokill structure outside the spin locks, + * so that we can kill at least one process reasonably reliable. + */ +static void collect_procs(struct page *page, struct list_head *tokill) +{ + struct to_kill *tk; + + if (!page->mapping) + return; + + tk = kmalloc(sizeof(struct to_kill), GFP_NOIO); + if (!tk) + return; + if (PageAnon(page)) + collect_procs_anon(page, tokill, &tk); + else + collect_procs_file(page, tokill, &tk); + kfree(tk); +} + +/* + * Error handlers for various types of pages. + */ + +enum outcome { + FAILED, /* Error handling failed */ + DELAYED, /* Will be handled later */ + IGNORED, /* Error safely ignored */ + RECOVERED, /* Successfully recovered */ +}; + +static const char *action_name[] = { + [FAILED] = "Failed", + [DELAYED] = "Delayed", + [IGNORED] = "Ignored", + [RECOVERED] = "Recovered", +}; + +/* + * Error hit kernel page. + * Do nothing, try to be lucky and not touch this instead. For a few cases we + * could be more sophisticated. + */ +static int me_kernel(struct page *p, unsigned long pfn) +{ + return DELAYED; +} + +/* + * Already poisoned page. 
+ */ +static int me_ignore(struct page *p, unsigned long pfn) +{ + return IGNORED; +} + +/* + * Page in unknown state. Do nothing. + */ +static int me_unknown(struct page *p, unsigned long pfn) +{ + printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn); + return FAILED; +} + +/* + * Free memory + */ +static int me_free(struct page *p, unsigned long pfn) +{ + return DELAYED; +} + +/* + * Clean (or cleaned) page cache page. + */ +static int me_pagecache_clean(struct page *p, unsigned long pfn) +{ + int err; + int ret = FAILED; + struct address_space *mapping; + + if (!isolate_lru_page(p)) + page_cache_release(p); + + /* + * Now truncate the page in the page cache. This is really + * more like a "temporary hole punch" + * Don't do this for block devices when someone else + * has a reference, because it could be file system metadata + * and that's not safe to truncate. + */ + mapping = page_mapping(p); + if (!mapping) { + /* + * Page has been teared down in the meanwhile + */ + return FAILED; + } + + /* + * Truncation is a bit tricky. Enable it per file system for now. + * + * Open: to take i_mutex or not for this? Right now we don't. + */ + if (mapping->a_ops->error_remove_page) { + err = mapping->a_ops->error_remove_page(mapping, p); + if (err != 0) { + printk(KERN_INFO "MCE %#lx: Failed to punch page: %d\n", + pfn, err); + } else if (page_has_private(p) && + !try_to_release_page(p, GFP_NOIO)) { + pr_debug("MCE %#lx: failed to release buffers\n", pfn); + } else { + ret = RECOVERED; + } + } else { + /* + * If the file system doesn't support it just invalidate + * This fails on dirty or anything with private pages + */ + if (invalidate_inode_page(p)) + ret = RECOVERED; + else + printk(KERN_INFO "MCE %#lx: Failed to invalidate\n", + pfn); + } + return ret; +} + +/* + * Dirty cache page page + * Issues: when the error hit a hole page the error is not properly + * propagated. 
+ */ +static int me_pagecache_dirty(struct page *p, unsigned long pfn) +{ + struct address_space *mapping = page_mapping(p); + + SetPageError(p); + /* TBD: print more information about the file. */ + if (mapping) { + /* + * IO error will be reported by write(), fsync(), etc. + * who check the mapping. + * This way the application knows that something went + * wrong with its dirty file data. + * + * There's one open issue: + * + * The EIO will be only reported on the next IO + * operation and then cleared through the IO map. + * Normally Linux has two mechanisms to pass IO errors: + * first through the AS_EIO flag in the address space + * and then through the PageError flag in the page. + * Since we drop pages on memory failure handling the + * only mechanism open to use is through AS_EIO. + * + * This has the disadvantage that it gets cleared on + * the first operation that returns an error, while + * the PageError bit is more sticky and only cleared + * when the page is reread or dropped. If an + * application assumes it will always get an error on + * fsync, but does other operations on the fd before + * and the page is dropped in between then the error + * will not be properly reported. + * + * This can already happen even without hwpoisoned + * pages: first on metadata IO errors (which only + * report through AS_EIO) or when the page is dropped + * at the wrong time. + * + * So right now we assume that the application DTRT on + * the first EIO, but we're not worse than other parts + * of the kernel. + */ + mapping_set_error(mapping, EIO); + } + + return me_pagecache_clean(p, pfn); +} + +/* + * Clean and dirty swap cache. + * + * Dirty swap cache page is tricky to handle. The page could live both in page + * cache and swap cache (i.e. page is freshly swapped in). So it could be + * referenced concurrently by 2 types of PTEs: + * normal PTEs and swap PTEs.
We try to handle them consistently by calling + * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs, + * and then + * - clear dirty bit to prevent IO + * - remove from LRU + * - but keep in the swap cache, so that when we return to it on + * a later page fault, we know the application is accessing + * corrupted data and shall be killed (we installed simple + * interception code in do_swap_page to catch it). + * + * Clean swap cache pages can be directly isolated. A later page fault will + * bring in the known good data from disk. + */ +static int me_swapcache_dirty(struct page *p, unsigned long pfn) +{ + int ret = FAILED; + + ClearPageDirty(p); + /* Trigger EIO in shmem: */ + ClearPageUptodate(p); + + if (!isolate_lru_page(p)) { + page_cache_release(p); + ret = DELAYED; + } + + return ret; +} + +static int me_swapcache_clean(struct page *p, unsigned long pfn) +{ + int ret = FAILED; + + if (!isolate_lru_page(p)) { + page_cache_release(p); + ret = RECOVERED; + } + delete_from_swap_cache(p); + return ret; +} + +/* + * Huge pages. Needs work. + * Issues: + * No rmap support so we cannot find the original mapper. In theory could walk + * all MMs and look for the mappings, but that would be non atomic and racy. + * Need rmap for hugepages for this. Alternatively we could employ a heuristic, + * like just walking the current process and hoping it has it mapped (that + * should be usually true for the common "shared database cache" case) + * Should handle free huge pages and dequeue them too, but this needs to + * handle huge page accounting correctly. + */ +static int me_huge_page(struct page *p, unsigned long pfn) +{ + return FAILED; +} + +/* + * Various page states we can handle. + * + * A page state is defined by its current page->flags bits. + * The table matches them in order and calls the right handler. + * + * This is quite tricky because we can access page at any time + * in its live cycle, so all accesses have to be extremly careful. 
+ * + * This is not complete. More states could be added. + * For any missing state don't attempt recovery. + */ + +#define dirty (1UL << PG_dirty) +#define sc (1UL << PG_swapcache) +#define unevict (1UL << PG_unevictable) +#define mlock (1UL << PG_mlocked) +#define writeback (1UL << PG_writeback) +#define lru (1UL << PG_lru) +#define swapbacked (1UL << PG_swapbacked) +#define head (1UL << PG_head) +#define tail (1UL << PG_tail) +#define compound (1UL << PG_compound) +#define slab (1UL << PG_slab) +#define buddy (1UL << PG_buddy) +#define reserved (1UL << PG_reserved) + +static struct page_state { + unsigned long mask; + unsigned long res; + char *msg; + int (*action)(struct page *p, unsigned long pfn); +} error_states[] = { + { reserved, reserved, "reserved kernel", me_ignore }, + { buddy, buddy, "free kernel", me_free }, + + /* + * Could in theory check if slab page is free or if we can drop + * currently unused objects without touching them. But just + * treat it as standard kernel for now. + */ + { slab, slab, "kernel slab", me_kernel }, + +#ifdef CONFIG_PAGEFLAGS_EXTENDED + { head, head, "huge", me_huge_page }, + { tail, tail, "huge", me_huge_page }, +#else + { compound, compound, "huge", me_huge_page }, +#endif + + { sc|dirty, sc|dirty, "swapcache", me_swapcache_dirty }, + { sc|dirty, sc, "swapcache", me_swapcache_clean }, + +#ifdef CONFIG_UNEVICTABLE_LRU + { unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty}, + { unevict, unevict, "unevictable LRU", me_pagecache_clean}, +#endif + +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT + { mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty }, + { mlock, mlock, "mlocked LRU", me_pagecache_clean }, +#endif + + { lru|dirty, lru|dirty, "LRU", me_pagecache_dirty }, + { lru|dirty, lru, "clean LRU", me_pagecache_clean }, + { swapbacked, swapbacked, "anonymous", me_pagecache_clean }, + + /* + * Catchall entry: must be at end. 
+ */ + { 0, 0, "unknown page state", me_unknown }, +}; + +#undef lru + +static void action_result(unsigned long pfn, char *msg, int result) +{ + struct page *page = NULL; + if (pfn_valid(pfn)) + page = pfn_to_page(pfn); + + printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n", + pfn, + page && PageDirty(page) ? "dirty " : "", + msg, action_name[result]); +} + +static int page_action(struct page_state *ps, struct page *p, + unsigned long pfn, int ref) +{ + int result; + + result = ps->action(p, pfn); + action_result(pfn, ps->msg, result); + if (page_count(p) != 1 + ref) + printk(KERN_ERR + "MCE %#lx: %s page still referenced by %d users\n", + pfn, ps->msg, page_count(p) - 1); + + /* Could do more checks here if page looks ok */ + /* + * Could adjust zone counters here to correct for the missing page. + */ + + return result == RECOVERED ? 0 : -EBUSY; +} + +#define N_UNMAP_TRIES 5 + +/* + * Do all that is necessary to remove user space mappings. Unmap + * the pages and send SIGBUS to the processes if the data was dirty. + */ +static void hwpoison_user_mappings(struct page *p, unsigned long pfn, + int trapno) +{ + enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS; + struct address_space *mapping; + LIST_HEAD(tokill); + int ret; + int i; + int kill = 1; + + if (PageReserved(p) || PageCompound(p) || PageSlab(p)) + return; + + if (!PageLRU(p)) + lru_add_drain_all(); + + /* + * This check implies we don't kill processes if their pages + * are in the swap cache early. Those are always late kills. + */ + if (!page_mapped(p)) + return; + + if (PageSwapCache(p)) { + printk(KERN_ERR + "MCE %#lx: keeping poisoned page in swap cache\n", pfn); + ttu |= TTU_IGNORE_HWPOISON; + } + + /* + * Propagate the dirty bit from PTEs to struct page first, because we + * need this to decide if we should kill or just drop the page. 
+ */ + mapping = page_mapping(p); + if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) { + if (page_mkclean(p)) { + SetPageDirty(p); + } else { + kill = 0; + ttu |= TTU_IGNORE_HWPOISON; + printk(KERN_INFO + "MCE %#lx: corrupted page was clean: dropped without side effects\n", + pfn); + } + } + + /* + * First collect all the processes that have the page + * mapped in dirty form. This has to be done before try_to_unmap, + * because ttu takes the rmap data structures down. + * + * Error handling: We ignore errors here because + * there's nothing that can be done. + */ + if (kill) + collect_procs(p, &tokill); + + /* + * try_to_unmap can fail temporarily due to races. + * Try a few times (RED-PEN better strategy?) + */ + for (i = 0; i < N_UNMAP_TRIES; i++) { + ret = try_to_unmap(p, ttu); + if (ret == SWAP_SUCCESS) + break; + pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret); + } + + if (ret != SWAP_SUCCESS) + printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n", + pfn, page_mapcount(p)); + + /* + * Now that the dirty bit has been propagated to the + * struct page and all unmaps done we can decide if + * killing is needed or not. Only kill when the page + * was dirty, otherwise the tokill list is merely + * freed. When there was a problem unmapping earlier + * use a more force-full uncatchable kill to prevent + * any accesses to the poisoned memory. 
+ */ + kill_procs_ao(&tokill, !!PageDirty(p), trapno, + ret != SWAP_SUCCESS, pfn); +} + +int __memory_failure(unsigned long pfn, int trapno, int ref) +{ + struct page_state *ps; + struct page *p; + int res; + + if (!sysctl_memory_failure_recovery) + panic("Memory failure from trap %d on page %lx", trapno, pfn); + + if (!pfn_valid(pfn)) { + action_result(pfn, "memory outside kernel control", IGNORED); + return -EIO; + } + + p = pfn_to_page(pfn); + if (TestSetPageHWPoison(p)) { + action_result(pfn, "already hardware poisoned", IGNORED); + return 0; + } + + atomic_long_add(1, &mce_bad_pages); + + /* + * We need/can do nothing about count=0 pages. + * 1) it's a free page, and therefore in safe hand: + * prep_new_page() will be the gate keeper. + * 2) it's part of a non-compound high order page. + * Implies some kernel user: cannot stop them from + * R/W the page; let's pray that the page has been + * used and will be freed some time later. + * In fact it's dangerous to directly bump up page count from 0, + * that may make page_freeze_refs()/page_unfreeze_refs() mismatch. + */ + if (!get_page_unless_zero(compound_head(p))) { + action_result(pfn, "free or high order kernel", IGNORED); + return PageBuddy(compound_head(p)) ? 0 : -EBUSY; + } + + /* + * Lock the page and wait for writeback to finish. + * It's very difficult to mess with pages currently under IO + * and in many cases impossible, so we just avoid it here. + */ + lock_page_nosync(p); + wait_on_page_writeback(p); + + /* + * Now take care of user space mappings. + */ + hwpoison_user_mappings(p, pfn, trapno); + + /* + * Torn down by someone else? 
+ */ + if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) { + action_result(pfn, "already truncated LRU", IGNORED); + res = 0; + goto out; + } + + res = -EBUSY; + for (ps = error_states;; ps++) { + if ((p->flags & ps->mask) == ps->res) { + res = page_action(ps, p, pfn, ref); + break; + } + } +out: + unlock_page(p); + return res; +} +EXPORT_SYMBOL_GPL(__memory_failure); + +/** + * memory_failure - Handle memory failure of a page. + * @pfn: Page Number of the corrupted page + * @trapno: Trap number reported in the signal to user space. + * + * This function is called by the low level machine check code + * of an architecture when it detects hardware memory corruption + * of a page. It tries its best to recover, which includes + * dropping pages, killing processes etc. + * + * The function is primarily of use for corruptions that + * happen outside the current execution context (e.g. when + * detected by a background scrubber) + * + * Must run in process context (e.g. a work queue) with interrupts + * enabled and no spinlocks held.
+ */ +void memory_failure(unsigned long pfn, int trapno) +{ + __memory_failure(pfn, trapno, 0); +} Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -1324,5 +1324,12 @@ void vmemmap_populate_print_last(void); extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim, size_t size); extern void refund_locked_memory(struct mm_struct *mm, size_t size); + +extern void memory_failure(unsigned long pfn, int trapno); +extern int __memory_failure(unsigned long pfn, int trapno, int ref); +extern int sysctl_memory_failure_early_kill; +extern int sysctl_memory_failure_recovery; +extern atomic_long_t mce_bad_pages; + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ Index: linux/kernel/sysctl.c =================================================================== --- linux.orig/kernel/sysctl.c +++ linux/kernel/sysctl.c @@ -1356,6 +1356,31 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = &scan_unevictable_handler, }, +#ifdef CONFIG_MEMORY_FAILURE + { + .ctl_name = CTL_UNNUMBERED, + .procname = "memory_failure_early_kill", + .data = &sysctl_memory_failure_early_kill, + .maxlen = sizeof(sysctl_memory_failure_early_kill), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero, + .extra2 = &one, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "memory_failure_recovery", + .data = &sysctl_memory_failure_recovery, + .maxlen = sizeof(sysctl_memory_failure_recovery), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero, + .extra2 = &one, + }, +#endif + /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt Index: linux/fs/proc/meminfo.c =================================================================== --- linux.orig/fs/proc/meminfo.c +++ linux/fs/proc/meminfo.c @@ -95,7 
+95,11 @@ static int meminfo_proc_show(struct seq_ "Committed_AS: %8lu kB\n" "VmallocTotal: %8lu kB\n" "VmallocUsed: %8lu kB\n" - "VmallocChunk: %8lu kB\n", + "VmallocChunk: %8lu kB\n" +#ifdef CONFIG_MEMORY_FAILURE + "HardwareCorrupted: %8lu kB\n" +#endif + , K(i.totalram), K(i.freeram), K(i.bufferram), @@ -140,6 +144,9 @@ static int meminfo_proc_show(struct seq_ (unsigned long)VMALLOC_TOTAL >> 10, vmi.used >> 10, vmi.largest_chunk >> 10 +#ifdef CONFIG_MEMORY_FAILURE + ,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10) +#endif ); hugetlb_report_meminfo(m); Index: linux/mm/Kconfig =================================================================== --- linux.orig/mm/Kconfig +++ linux/mm/Kconfig @@ -233,6 +233,15 @@ config DEFAULT_MMAP_MIN_ADDR /proc/sys/vm/mmap_min_addr tunable. +config MEMORY_FAILURE + depends on MMU + depends on X86_MCE + bool "Enable memory failure recovery" + help + Enables code to recover from some memory failures on systems + with MCA recovery. This allows a system to continue running + even when some of its memory has uncorrected errors. + config NOMMU_INITIAL_TRIM_EXCESS int "Turn on mmap() excess space trimming before booting" depends on !MMU Index: linux/Documentation/sysctl/vm.txt =================================================================== --- linux.orig/Documentation/sysctl/vm.txt +++ linux/Documentation/sysctl/vm.txt @@ -32,6 +32,8 @@ Currently, these files are in /proc/sys/ - legacy_va_layout - lowmem_reserve_ratio - max_map_count +- memory_failure_early_kill +- memory_failure_recovery - min_free_kbytes - min_slab_ratio - min_unmapped_ratio @@ -53,7 +55,6 @@ Currently, these files are in /proc/sys/ - vfs_cache_pressure - zone_reclaim_mode - ============================================================== block_dump @@ -275,6 +276,44 @@ e.g., up to one or two maps per allocati The default value is 65536. 
+============================================================= + +memory_failure_early_kill: + +Control how to kill processes when uncorrected memory error (typically +a 2bit error in a memory module) is detected in the background by hardware +that cannot be handled by the kernel. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other uptodate copy of the data it will kill to prevent any data +corruptions from propagating. + +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected. Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. + +0: Only unmap the corrupted page from all processes and only kill a process +who tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl + +============================================================== + +memory_failure_recovery + +Enable memory failure recovery (when supported by the platform) + +1: Attempt recovery. + +0: Always panic on a memory failure. 
+ ============================================================== min_free_kbytes: Index: linux/mm/filemap.c =================================================================== --- linux.orig/mm/filemap.c +++ linux/mm/filemap.c @@ -105,6 +105,10 @@ * * ->task->proc_lock * ->dcache_lock (proc_pid_lookup) + * + * (code doesn't rely on that order, so you could switch it around) + * ->tasklist_lock (memory_failure, collect_procs_ao) + * ->i_mmap_lock */ /* Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -36,6 +36,11 @@ * mapping->tree_lock (widely used, in set_page_dirty, * in arch-dependent flush_dcache_mmap_lock, * within inode_lock in __sync_single_inode) + * + * (code doesn't rely on that order so it could be switched around) + * ->tasklist_lock + * anon_vma->lock (memory_failure, collect_procs_anon) + * pte map lock */ #include @@ -311,7 +316,7 @@ pte_t *page_check_address(struct page *p * if the page is not mapped into the page tables of this VMA. Only * valid for normal file or anonymous VMAs. 
*/ -static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) +int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) { unsigned long address; pte_t *pte; Index: linux/include/linux/rmap.h =================================================================== --- linux.orig/include/linux/rmap.h +++ linux/include/linux/rmap.h @@ -129,6 +129,7 @@ int try_to_munlock(struct page *); */ struct anon_vma *page_lock_anon_vma(struct page *page); void page_unlock_anon_vma(struct anon_vma *anon_vma); +int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); #else /* !CONFIG_MMU */ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934112AbZHEJkF (ORCPT ); Wed, 5 Aug 2009 05:40:05 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934069AbZHEJj4 (ORCPT ); Wed, 5 Aug 2009 05:39:56 -0400 Received: from one.firstfloor.org ([213.235.205.2]:49070 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933794AbZHEJgb (ORCPT ); Wed, 5 Aug 2009 05:36:31 -0400 From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: [PATCH] [4/19] HWPOISON: Add new SIGBUS error codes for hardware poison signals Message-Id: <20090805093631.C0BD6B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:31 +0200 (CEST) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add new SIGBUS codes for reporting machine checks as signals. When the hardware detects an uncorrected ECC error it can trigger these signals. This is needed for telling KVM's qemu about machine checks that happen to guests, so that it can inject them, but might be also useful for other programs. 
I find it useful in my test programs.

This patch merely defines the new types.

- Define two new si_codes for SIGBUS: BUS_MCEERR_AO and BUS_MCEERR_AR
  * BUS_MCEERR_AO is for "Action Optional" machine checks, which means
    that some corruption has been detected in the background, but nothing
    has been consumed so far. The program can ignore those if it wants
    (but most programs would already get killed)
  * BUS_MCEERR_AR is for "Action Required" machine checks. This happens
    when corrupted data is consumed or the application ran into an area
    which has been known to be corrupted earlier. These require immediate
    action and cannot just be returned to. Most programs would kill
    themselves.
- They report the address of the corruption in the user address space
  in si_addr.
- Define a new si_addr_lsb field that reports the extent of the corruption
  to user space. That's currently always a (small) page. The user
  application cannot tell where in this page the corruption happened.

AK: I plan to write a man page update before anyone asks.
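For illustration only (this is not part of the patch): a user-space program that wants to react to these codes could install a SIGBUS handler roughly like the sketch below. The names mce_kind, classify_sigbus, corrupted_base, sigbus_handler, install_sigbus_handler, mce_pending and mce_base are hypothetical. The sketch assumes a Linux libc whose headers expose si_addr_lsb, and it falls back to the numeric si_code values from this patch when the headers do not define BUS_MCEERR_AR/BUS_MCEERR_AO.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Fallbacks for older headers; values match the definitions in this patch. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

static_assert(BUS_MCEERR_AR == 4 && BUS_MCEERR_AO == 5,
	      "si_code values differ from the ones in this patch");

/* Hypothetical classification used only by this sketch. */
enum mce_kind { MCE_NONE, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

/* Map a SIGBUS si_code to the kind of machine check it reports. */
enum mce_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:
		return MCE_ACTION_OPTIONAL;
	case BUS_MCEERR_AR:
		return MCE_ACTION_REQUIRED;
	default:
		return MCE_NONE;
	}
}

/* Round addr down to the start of the 2^lsb byte region reported as bad. */
uintptr_t corrupted_base(uintptr_t addr, unsigned lsb)
{
	if (lsb >= sizeof(uintptr_t) * 8)
		return 0;
	return addr & ~(((uintptr_t)1 << lsb) - 1);
}

/* Written by the handler; the main loop is expected to poll these. */
volatile sig_atomic_t mce_pending;
volatile uintptr_t mce_base;

void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;

	/* Only "action optional" is survivable in this sketch. */
	if (classify_sigbus(si->si_code) != MCE_ACTION_OPTIONAL)
		_exit(1);

	mce_base = corrupted_base((uintptr_t)si->si_addr,
				  (unsigned)si->si_addr_lsb);
	mce_pending = 1;
}

/* Install the handler with SA_SIGINFO so that siginfo_t is delivered. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call install_sigbus_handler() once at startup, then check mce_pending in its main loop and stop using (or reload from disk) the region starting at mce_base. Exiting on BUS_MCEERR_AR is just the simplest choice for a sketch; a real application might instead unwind to a safe point.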
Signed-off-by: Andi Kleen --- include/asm-generic/siginfo.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) Index: linux/include/asm-generic/siginfo.h =================================================================== --- linux.orig/include/asm-generic/siginfo.h +++ linux/include/asm-generic/siginfo.h @@ -82,6 +82,7 @@ typedef struct siginfo { #ifdef __ARCH_SI_TRAPNO int _trapno; /* TRAP # which caused the signal */ #endif + short _addr_lsb; /* LSB of the reported address */ } _sigfault; /* SIGPOLL */ @@ -112,6 +113,7 @@ typedef struct siginfo { #ifdef __ARCH_SI_TRAPNO #define si_trapno _sifields._sigfault._trapno #endif +#define si_addr_lsb _sifields._sigfault._addr_lsb #define si_band _sifields._sigpoll._band #define si_fd _sifields._sigpoll._fd @@ -192,7 +194,11 @@ typedef struct siginfo { #define BUS_ADRALN (__SI_FAULT|1) /* invalid address alignment */ #define BUS_ADRERR (__SI_FAULT|2) /* non-existant physical address */ #define BUS_OBJERR (__SI_FAULT|3) /* object specific hardware error */ -#define NSIGBUS 3 +/* hardware memory error consumed on a machine check: action required */ +#define BUS_MCEERR_AR (__SI_FAULT|4) +/* hardware memory error detected in process but not consumed: action optional*/ +#define BUS_MCEERR_AO (__SI_FAULT|5) +#define NSIGBUS 5 /* * SIGTRAP si_codes From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934126AbZHEKUL (ORCPT ); Wed, 5 Aug 2009 06:20:11 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933950AbZHEKUK (ORCPT ); Wed, 5 Aug 2009 06:20:10 -0400 Received: from cantor2.suse.de ([195.135.220.15]:40433 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933794AbZHEKUJ (ORCPT ); Wed, 5 Aug 2009 06:20:09 -0400 Date: Wed, 5 Aug 2009 12:20:08 +0200 From: Nick Piggin To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, 
hidehiro.kawai.ez@hitachi.com Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805102008.GB17190@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805093638.D3754B15D8@basil.firstfloor.org> User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 11:36:38AM +0200, Andi Kleen wrote: > > From: Nick Piggin > > Extract out truncate_inode_page() out of the truncate path so that > it can be used by memory-failure.c > > [AK: description, headers, fix typos] > v2: Some white space changes from Fengguang Wu > > Signed-off-by: Andi Kleen > > --- > include/linux/mm.h | 2 ++ > mm/truncate.c | 29 +++++++++++++++-------------- > 2 files changed, 17 insertions(+), 14 deletions(-) > > Index: linux/mm/truncate.c > =================================================================== > --- linux.orig/mm/truncate.c > +++ linux/mm/truncate.c > @@ -93,11 +93,11 @@ EXPORT_SYMBOL(cancel_dirty_page); > * its lock, b) when a concurrent invalidate_mapping_pages got there first and > * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. > */ > -static void > +static int > truncate_complete_page(struct address_space *mapping, struct page *page) > { > if (page->mapping != mapping) > - return; > + return -EIO; Hmm, at this point, the page must have been removed from pagecache, so I don't know if you need to pass an error back? 
> if (page_has_private(page)) > do_invalidatepage(page, 0); > @@ -108,6 +108,7 @@ truncate_complete_page(struct address_sp > remove_from_page_cache(page); > ClearPageMappedToDisk(page); > page_cache_release(page); /* pagecache ref */ > + return 0; > } > > /* > @@ -135,6 +136,16 @@ invalidate_complete_page(struct address_ > return ret; > } > > +int truncate_inode_page(struct address_space *mapping, struct page *page) > +{ > + if (page_mapped(page)) { > + unmap_mapping_range(mapping, > + (loff_t)page->index << PAGE_CACHE_SHIFT, > + PAGE_CACHE_SIZE, 0); > + } > + return truncate_complete_page(mapping, page); > +} > + > /** > * truncate_inode_pages - truncate range of pages specified by start & end byte offsets > * @mapping: mapping to truncate > @@ -196,12 +207,7 @@ void truncate_inode_pages_range(struct a > unlock_page(page); > continue; > } > - if (page_mapped(page)) { > - unmap_mapping_range(mapping, > - (loff_t)page_index<<PAGE_CACHE_SHIFT, > - PAGE_CACHE_SIZE, 0); > - } > - truncate_complete_page(mapping, page); > + truncate_inode_page(mapping, page); > unlock_page(page); > } > pagevec_release(&pvec); > @@ -238,15 +244,10 @@ void truncate_inode_pages_range(struct a > break; > lock_page(page); > wait_on_page_writeback(page); > - if (page_mapped(page)) { > - unmap_mapping_range(mapping, > - (loff_t)page->index<<PAGE_CACHE_SHIFT, > - PAGE_CACHE_SIZE, 0); > - } > + truncate_inode_page(mapping, page); > if (page->index > next) > next = page->index; > next++; > - truncate_complete_page(mapping, page); > unlock_page(page); > } > pagevec_release(&pvec); > Index: linux/include/linux/mm.h > =================================================================== > --- linux.orig/include/linux/mm.h > +++ linux/include/linux/mm.h > @@ -809,6 +809,8 @@ static inline void unmap_shared_mapping_ > extern int vmtruncate(struct inode * inode, loff_t offset); > extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); > > +int truncate_inode_page(struct address_space *mapping, struct page *page); > + > 
#ifdef CONFIG_MMU > extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, unsigned int flags); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934132AbZHELMs (ORCPT ); Wed, 5 Aug 2009 07:12:48 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752239AbZHELMs (ORCPT ); Wed, 5 Aug 2009 07:12:48 -0400 Received: from bombadil.infradead.org ([18.85.46.34]:52804 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751282AbZHELMr (ORCPT ); Wed, 5 Aug 2009 07:12:47 -0400 Date: Wed, 5 Aug 2009 07:12:31 -0400 From: Christoph Hellwig To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090805111231.GA19532@infradead.org> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805093643.E0C00B15D8@basil.firstfloor.org> User-Agent: Mutt/1.5.18 (2008-05-17) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 11:36:43AM +0200, Andi Kleen wrote: > > Enable removing of corrupted pages through truncation > for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs > These should cover most server needs. > > I chose the set of migration aware file systems for this > for now, assuming they have been especially audited. 
> But in general it should be safe for all file systems > on the data area that support read/write and truncate. > > Caveat: the hardware error handler does not take i_mutex > for now before calling the truncate function. Is that ok? It will probably need locking, e.g. the iolock in XFS. I'll need to take a look at the actual implementation of generic_error_remove_page to make sense of this. Is there any way for us to test this functionality without introducing real hardware problems? From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933918AbZHEMRG (ORCPT ); Wed, 5 Aug 2009 08:17:06 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933634AbZHEMRF (ORCPT ); Wed, 5 Aug 2009 08:17:05 -0400 Received: from mga03.intel.com ([143.182.124.21]:31208 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933701AbZHEMRE (ORCPT ); Wed, 5 Aug 2009 08:17:04 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.43,328,1246863600"; d="scan'208";a="172524717" Date: Wed, 5 Aug 2009 19:52:42 +0800 From: Wu Fengguang To: Christoph Hellwig Cc: Andi Kleen , "tytso@mit.edu" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hidehiro.kawai.ez@hitachi.com" Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090805115242.GB6737@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <20090805111231.GA19532@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805111231.GA19532@infradead.org> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 
2009 at 07:12:31PM +0800, Christoph Hellwig wrote: > On Wed, Aug 05, 2009 at 11:36:43AM +0200, Andi Kleen wrote: > > > > Enable removing of corrupted pages through truncation > > for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs > > These should cover most server needs. > > > > I chose the set of migration aware file systems for this > > for now, assuming they have been especially audited. > > But in general it should be safe for all file systems > > on the data area that support read/write and truncate. > > > > Caveat: the hardware error handler does not take i_mutex > > for now before calling the truncate function. Is that ok? > > It will probably need locking, e.g. the iolock in XFS. I'll > need to take a look at the actual implementation of > generic_error_remove_page to make sense of this. In patch 13, it simply calls truncate_inode_page() for S_ISREG inodes. Nick suggests calling truncate_inode_page() with i_mutex held. Sure, we can do mutex_trylock(i_mutex), but we'd appreciate it if some fs gurus could demonstrate the bad consequences of not doing so, thanks! > Is there any way for us to test this functionality without introducing > real hardware problems? We have some additional patches (ugly, but they work for now) that export interfaces for injecting hwpoison into selected types of pages. They can guarantee that only data/metadata pages of the selected fs will be poisoned. Based on that, we can do all kinds of stress testing in user space. 
Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934262AbZHEMiD (ORCPT ); Wed, 5 Aug 2009 08:38:03 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933610AbZHEMiC (ORCPT ); Wed, 5 Aug 2009 08:38:02 -0400 Received: from mga14.intel.com ([143.182.124.37]:50581 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933564AbZHEMiB (ORCPT ); Wed, 5 Aug 2009 08:38:01 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.43,328,1246863600"; d="scan'208";a="172530828" Date: Wed, 5 Aug 2009 20:37:49 +0800 From: Wu Fengguang To: Nick Piggin Cc: Andi Kleen , "akpm@linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hidehiro.kawai.ez@hitachi.com" Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805123749.GA9443@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805102008.GB17190@wotan.suse.de> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 06:20:08PM +0800, Nick Piggin wrote: > On Wed, Aug 05, 2009 at 11:36:38AM +0200, Andi Kleen wrote: > > > > From: Nick Piggin > > > > Extract out truncate_inode_page() out of the truncate path so that > > it can be used by memory-failure.c > > > > [AK: description, headers, fix typos] > > v2: Some white space changes from Fengguang Wu > > > > Signed-off-by: Andi Kleen > > > > --- > > include/linux/mm.h | 2 ++ > > mm/truncate.c | 29 +++++++++++++++-------------- > > 2 files changed, 17 insertions(+), 14 deletions(-) > > > > Index: linux/mm/truncate.c > > 
=================================================================== > > --- linux.orig/mm/truncate.c > > +++ linux/mm/truncate.c > > @@ -93,11 +93,11 @@ EXPORT_SYMBOL(cancel_dirty_page); > > * its lock, b) when a concurrent invalidate_mapping_pages got there first and > > * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. > > */ > > -static void > > +static int > > truncate_complete_page(struct address_space *mapping, struct page *page) > > { > > if (page->mapping != mapping) > > - return; > > + return -EIO; > > Hmm, at this point, the page must have been removed from pagecache, > so I don't know if you need to pass an error back? I think so too. When called from hwpoison, the page count and lock have both been taken, so at least _in this case_, (page->mapping != mapping) can equally be written as (page->mapping == NULL). But anyway, the return value is now ignored in the upper layer :) Thanks, Fengguang > > > if (page_has_private(page)) > > do_invalidatepage(page, 0); > > @@ -108,6 +108,7 @@ truncate_complete_page(struct address_sp > > remove_from_page_cache(page); > > ClearPageMappedToDisk(page); > > page_cache_release(page); /* pagecache ref */ > > + return 0; > > } > > > > /* > > @@ -135,6 +136,16 @@ invalidate_complete_page(struct address_ > > return ret; > > } > > > > +int truncate_inode_page(struct address_space *mapping, struct page *page) > > +{ > > + if (page_mapped(page)) { > > + unmap_mapping_range(mapping, > > + (loff_t)page->index << PAGE_CACHE_SHIFT, > > + PAGE_CACHE_SIZE, 0); > > + } > > + return truncate_complete_page(mapping, page); > > +} > > + > > /** > > * truncate_inode_pages - truncate range of pages specified by start & end byte offsets > > * @mapping: mapping to truncate > > @@ -196,12 +207,7 @@ void truncate_inode_pages_range(struct a > > unlock_page(page); > > continue; > > } > > - if (page_mapped(page)) { > > - unmap_mapping_range(mapping, > > - (loff_t)page_index<<PAGE_CACHE_SHIFT, > > - PAGE_CACHE_SIZE, 0); > > - } > > - 
truncate_complete_page(mapping, page); > > + truncate_inode_page(mapping, page); > > unlock_page(page); > > } > > pagevec_release(&pvec); > > @@ -238,15 +244,10 @@ void truncate_inode_pages_range(struct a > > break; > > lock_page(page); > > wait_on_page_writeback(page); > > - if (page_mapped(page)) { > > - unmap_mapping_range(mapping, > > - (loff_t)page->index<<PAGE_CACHE_SHIFT, > > - PAGE_CACHE_SIZE, 0); > > - } > > + truncate_inode_page(mapping, page); > > if (page->index > next) > > next = page->index; > > next++; > > - truncate_complete_page(mapping, page); > > unlock_page(page); > > } > > pagevec_release(&pvec); > > Index: linux/include/linux/mm.h > > =================================================================== > > --- linux.orig/include/linux/mm.h > > +++ linux/include/linux/mm.h > > @@ -809,6 +809,8 @@ static inline void unmap_shared_mapping_ > > extern int vmtruncate(struct inode * inode, loff_t offset); > > extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); > > > > +int truncate_inode_page(struct address_space *mapping, struct page *page); > > + > > #ifdef CONFIG_MMU > > extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, > > unsigned long address, unsigned int flags); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934447AbZHENqM (ORCPT ); Wed, 5 Aug 2009 09:46:12 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933803AbZHENqL (ORCPT ); Wed, 5 Aug 2009 09:46:11 -0400 Received: from one.firstfloor.org ([213.235.205.2]:48460 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933662AbZHENqL (ORCPT ); Wed, 5 Aug 2009 09:46:11 -0400 Date: Wed, 5 Aug 2009 15:46:08 +0200 From: Andi Kleen To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: Re: [PATCH] [11/19] HWPOISON: Refactor 
truncate to allow direct truncating of page v2 Message-ID: <20090805134607.GH11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805102008.GB17190@wotan.suse.de> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 12:20:08PM +0200, Nick Piggin wrote: > > truncate_complete_page(struct address_space *mapping, struct page *page) > > { > > if (page->mapping != mapping) > > - return; > > + return -EIO; > > Hmm, at this point, the page must have been removed from pagecache, > so I don't know if you need to pass an error back? It could be reused, which would be bad for us? The final check is the page error count in the end anyways. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
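The disagreement above is easier to follow with a small user-space model of the check. The following is only a sketch, not kernel code: the two structs are simplified stand-ins, `nrpages` is used purely as a counter for illustration, and `remove_locked_page()` is a hypothetical caller that samples page->mapping while (notionally) holding the page lock. Only the name truncate_complete_page() and the -EIO return come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types (assumed for illustration). */
struct address_space { int nrpages; };
struct page { struct address_space *mapping; };

/* Models the patched check: refuse to touch a page that no longer
 * belongs to the mapping the caller expects. */
static int truncate_complete_page(struct address_space *mapping,
				  struct page *page)
{
	assert(mapping != NULL && page != NULL);
	if (page->mapping != mapping)
		return -EIO;		/* page was truncated or reused */
	page->mapping = NULL;		/* stands in for remove_from_page_cache() */
	mapping->nrpages--;
	return 0;
}

/* Hypothetical caller: it reads page->mapping under the (notional) page
 * lock and passes that same value down, so the check above cannot fail. */
static int remove_locked_page(struct page *page)
{
	struct address_space *mapping = page->mapping;

	if (mapping == NULL)
		return -EIO;		/* already gone from the page cache */
	return truncate_complete_page(mapping, page);
}
```

In this model the -EIO path is reachable only when a caller passes a mapping it did not verify, which is the case Nick suggests pushing onto callers.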
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934463AbZHENu7 (ORCPT ); Wed, 5 Aug 2009 09:50:59 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934323AbZHENu7 (ORCPT ); Wed, 5 Aug 2009 09:50:59 -0400 Received: from one.firstfloor.org ([213.235.205.2]:56756 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934456AbZHENu6 (ORCPT ); Wed, 5 Aug 2009 09:50:58 -0400 Date: Wed, 5 Aug 2009 15:50:57 +0200 From: Andi Kleen To: Christoph Hellwig Cc: Andi Kleen , tytso@mit.edu, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090805135056.GI11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <20090805111231.GA19532@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805111231.GA19532@infradead.org> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 07:12:31AM -0400, Christoph Hellwig wrote: > On Wed, Aug 05, 2009 at 11:36:43AM +0200, Andi Kleen wrote: > > > > Enable removing of corrupted pages through truncation > > for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs > > These should cover most server needs. > > > > I chose the set of migration aware file systems for this > > for now, assuming they have been especially audited. > > But in general it should be safe for all file systems > > on the data area that support read/write and truncate. 
> > > > Caveat: the hardware error handler does not take i_mutex > > for now before calling the truncate function. Is that ok? > > It will probably need locking, e.g. the iolock in XFS. I'll > need to take a look at the actual implementation of > generic_error_remove_page to make sense of this. Thanks for the review. It just calls the lower end of the truncate path. http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commitdiff;h=b5d3322870b0f6e369809674d2c73857fd6ef480 and http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commitdiff;h=4c2df4fa9e4687bde815883fabc602ccc5961272 Is that ok? It should be roughly the same as what normal truncate does, except it doesn't take i_mutex currently. We're careful to not do it for non-regular inodes. > > Is there any way for us to test this functionality without introducing > real hardware problems? Yes, there are three different injectors to choose from :) The easiest one is usually the madvise(MADV_POISON) injector. Just map a suitable page and There's a test program in http://git.kernel.org/?p=utils/cpu/mce/mce-test.git;a=blob;f=tsrc/tinjpage.c;h=954e1edab765d1c141f693ae9767ba9d5491c1aa;hb=HEAD that can be extended for new tests. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
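For readers who want to try the madvise injector, a rough user-space sketch follows. It is not taken from tinjpage.c. It assumes the advice value is exposed as MADV_HWPOISON (100), which is the name used in mainline headers rather than the MADV_POISON spelling in the mail above, and that user space sees the new SIGBUS codes from the siginfo patch as 4 (BUS_MCEERR_AR) and 5 (BUS_MCEERR_AO). The helper names page_floor(), sigbus_kind() and inject_poison() are made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Fallbacks for older headers; the values are assumptions stated above. */
#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100
#endif
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Round an address down to the start of its page.
 * page_size must be a power of two. */
static uintptr_t page_floor(uintptr_t addr, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~((uintptr_t)page_size - 1);
}

/* Classify the si_code a SIGBUS handler would receive. */
static const char *sigbus_kind(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		return "action required";
	case BUS_MCEERR_AO:
		return "action optional";
	default:
		return "not a memory error";
	}
}

/* Ask the kernel to treat the page containing addr as poisoned.
 * Destructive: needs CAP_SYS_ADMIN and a kernel built with memory
 * failure support. Returns 0 on success or a negative errno. */
static int inject_poison(void *addr, size_t page_size)
{
	void *start = (void *)page_floor((uintptr_t)addr, page_size);

	if (madvise(start, page_size, MADV_HWPOISON) != 0)
		return -errno;
	return 0;
}
```

A test would typically mmap() a file page, dirty it or not depending on the case under test, call inject_poison() on it, and then check how the next access or the SIGBUS handler behaves. Only run this on a disposable machine, since the poisoned page is taken out of service.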
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755613AbZHEOBr (ORCPT ); Wed, 5 Aug 2009 10:01:47 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S933270AbZHEOBr (ORCPT ); Wed, 5 Aug 2009 10:01:47 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49020 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752239AbZHEOBq (ORCPT ); Wed, 5 Aug 2009 10:01:46 -0400 Date: Wed, 5 Aug 2009 16:01:45 +0200 From: Nick Piggin To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805140145.GB28563@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805134607.GH11385@basil.fritz.box> User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 03:46:08PM +0200, Andi Kleen wrote: > On Wed, Aug 05, 2009 at 12:20:08PM +0200, Nick Piggin wrote: > > > truncate_complete_page(struct address_space *mapping, struct page *page) > > > { > > > if (page->mapping != mapping) > > > - return; > > > + return -EIO; > > > > Hmm, at this point, the page must have been removed from pagecache, > > so I don't know if you need to pass an error back? > > It could be reused, which would be bad for us? I haven't brought up the caller at this point, but IIRC you had the page locked and mapping confirmed at this point anyway so it would never be an error for your code. Probably it would be nice to just force callers to verify the page. 
Normally IMO it is much nicer and clearer to do it at the time the page gets locked, unless there is good reason otherwise. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934435AbZHEOKE (ORCPT ); Wed, 5 Aug 2009 10:10:04 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S934240AbZHEOKD (ORCPT ); Wed, 5 Aug 2009 10:10:03 -0400 Received: from one.firstfloor.org ([213.235.205.2]:35499 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933568AbZHEOKB (ORCPT ); Wed, 5 Aug 2009 10:10:01 -0400 Date: Wed, 5 Aug 2009 16:10:01 +0200 From: Andi Kleen To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805141001.GJ11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805140145.GB28563@wotan.suse.de> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > I haven't brought up the caller at this point, but IIRC you had > the page locked and mapping confirmed at this point anyway so > it would never be an error for your code. > > Probably it would be nice to just force callers to verify the page. > Normally IMO it is much nicer and clearer to do it at the time the > page gets locked, unless there is good reason otherwise. Ok. I think I'll just keep it as it is for now. 
The only reason I added the error code was to make truncate_inode_page fit into .error_remove_page, but then later I did another wrapper so it could be removed again. But it won't hurt to have it either. -Andi -- ak@linux.intel.com -- Speaking for myself only. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933787AbZHEPMb (ORCPT ); Wed, 5 Aug 2009 11:12:31 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753970AbZHEPMb (ORCPT ); Wed, 5 Aug 2009 11:12:31 -0400 Received: from mga14.intel.com ([143.182.124.37]:31199 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753898AbZHEPMa (ORCPT ); Wed, 5 Aug 2009 11:12:30 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.43,329,1246863600"; d="scan'208";a="172581941" Date: Wed, 5 Aug 2009 23:12:20 +0800 From: Wu Fengguang To: Nick Piggin Cc: Andi Kleen , "akpm@linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hidehiro.kawai.ez@hitachi.com" Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805151220.GB6210@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805140145.GB28563@wotan.suse.de> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 05, 2009 at 10:01:45PM +0800, Nick Piggin wrote: > On Wed, Aug 05, 2009 at 03:46:08PM +0200, Andi Kleen wrote: > > On Wed, Aug 05, 2009 at 12:20:08PM +0200, Nick Piggin wrote: > > > > truncate_complete_page(struct address_space *mapping, struct page *page) > > > > { > > > > if (page->mapping != 
mapping) > > > > - return; > > > > + return -EIO; > > > > > > Hmm, at this point, the page must have been removed from pagecache, > > > so I don't know if you need to pass an error back? > > > > It could be reused, which would be bad for us? > > I haven't brought up the caller at this point, but IIRC you had > the page locked and mapping confirmed at this point anyway so > it would never be an error for your code. Right, that 'if' will always evaluate to false for the hwpoison case, because that 'mapping' was taken from 'page->mapping' under the page lock and the two will remain the same. > Probably it would be nice to just force callers to verify the page. > Normally IMO it is much nicer and clearer to do it at the time the > page gets locked, unless there is good reason otherwise. Yes, we do check page->mapping after taking the page lock. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751554AbZHJGge (ORCPT ); Mon, 10 Aug 2009 02:36:34 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1750921AbZHJGgd (ORCPT ); Mon, 10 Aug 2009 02:36:33 -0400 Received: from mail7.hitachi.co.jp ([133.145.228.42]:40603 "EHLO mail7.hitachi.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750878AbZHJGgc (ORCPT ); Mon, 10 Aug 2009 02:36:32 -0400 X-AuditID: b753bd60-ac247ba000006009-df-4a7fbfee76b4 Message-ID: <4A7FBFD1.2010208@hitachi.com> Date: Mon, 10 Aug 2009 15:36:01 +0900 From: Hidehiro Kawai User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja-JP; rv:1.4) Gecko/20030624 Netscape/7.1 (ax) X-Accept-Language: ja MIME-Version: 1.0 To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware 
file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> In-Reply-To: <20090805093643.E0C00B15D8@basil.firstfloor.org> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: AAAAAA== X-FMFTCR: RANGEA Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Andi Kleen wrote: > Index: linux/fs/ext3/inode.c > =================================================================== > --- linux.orig/fs/ext3/inode.c > +++ linux/fs/ext3/inode.c > @@ -1819,6 +1819,7 @@ static const struct address_space_operat > .direct_IO = ext3_direct_IO, > .migratepage = buffer_migrate_page, > .is_partially_uptodate = block_is_partially_uptodate, > + .error_remove_page = generic_error_remove_page, > }; (I'm sorry if I'm missing the point.) If my understanding is correct, the following scenario can happen: 1. An uncorrected error on a dirty page cache page is detected by memory scrubbing 2. The kernel unmaps and truncates the page to recover from the error 3. An application reads data from the file location corresponding to the truncated page ==> Old or garbage data will be read into a new page cache page 4. The application modifies the data and writes it back to the disk 5. The file will be corrupted! (Yes, the application is wrong to not do the right thing, i.e. fsync, but it's not the user's fault!) A similar data corruption can be caused by a write I/O error, because the dirty flag is cleared even if the page couldn't be written to the disk. However, we have a way to avoid this kind of data corruption at least for ext3. If we mount an ext3 filesystem with data=ordered and data_err=abort, all I/O errors on file data blocks belonging to the committing transaction are checked. When an I/O error is found, journaling is aborted and the filesystem is remounted read-only to prevent further updates. This kind of feature is very important for mission critical systems. 
If we merge this patch, we would face the data corruption problem again. I think there are three options, (1) drop this patch (2) merge this patch with new panic_on_dirty_page_cache_corruption sysctl (3) implement a more sophisticated error_remove_page function > static const struct address_space_operations ext3_writeback_aops = { > @@ -1834,6 +1835,7 @@ static const struct address_space_operat > .direct_IO = ext3_direct_IO, > .migratepage = buffer_migrate_page, > .is_partially_uptodate = block_is_partially_uptodate, > + .error_remove_page = generic_error_remove_page, > }; The writeback case would be OK. It's not much different from the I/O error case. > static const struct address_space_operations ext3_journalled_aops = { > @@ -1848,6 +1850,7 @@ static const struct address_space_operat > .invalidatepage = ext3_invalidatepage, > .releasepage = ext3_releasepage, > .is_partially_uptodate = block_is_partially_uptodate, > + .error_remove_page = generic_error_remove_page, > }; > > void ext3_set_aops(struct inode *inode) I'm not sure about the journalled case. I'm going to take a look at it later. 
Best regards, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752074AbZHJHHt (ORCPT ); Mon, 10 Aug 2009 03:07:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751883AbZHJHHs (ORCPT ); Mon, 10 Aug 2009 03:07:48 -0400 Received: from mga14.intel.com ([143.182.124.37]:44252 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751871AbZHJHHs (ORCPT ); Mon, 10 Aug 2009 03:07:48 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.43,352,1246863600"; d="scan'208";a="174024835" Date: Mon, 10 Aug 2009 15:07:45 +0800 From: Wu Fengguang To: Hidehiro Kawai Cc: Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090810070745.GA26533@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A7FBFD1.2010208@hitachi.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Hidehiro, On Mon, Aug 10, 2009 at 02:36:01PM +0800, Hidehiro Kawai wrote: > Hi, > > Andi Kleen wrote: > > > Index: linux/fs/ext3/inode.c > > =================================================================== > > --- linux.orig/fs/ext3/inode.c > > +++ linux/fs/ext3/inode.c > > @@ -1819,6 +1819,7 @@ static const struct address_space_operat > > .direct_IO = ext3_direct_IO, > > .migratepage = buffer_migrate_page, > > 
.is_partially_uptodate = block_is_partially_uptodate, > > + .error_remove_page = generic_error_remove_page, > > }; > > (I'm sorry if I'm missing the point.) > > If my understanding is correct, the following scenario can happen: > > 1. An uncorrected error on a dirty page cache page is detected by > memory scrubbing > 2. Kernel unmaps and truncates the page to recover from the error > 3. An application reads data from the file location corresponding > to the truncated page > ==> Old or garbage data will be read into a new page cache page > 4. The application modifies the data and write back it to the disk > 5. The file will corrurpt! > > (Yes, the application is wrong to not do the right thing, i.e. fsync, > but it's not user's fault!) Right. Note that the data has already been corrupted and the above scenario can be called as re-corruption. We set AS_EIO to trigger some IO reporting mechanism so that it won't corrupt *silently*. > A similar data corruption can be caused by a write I/O error, > because dirty flag is cleared even if the page couldn't be written > to the disk. Yes. > However, we have a way to avoid this kind of data corruption at > least for ext3. If we mount an ext3 filesystem with data=ordered > and data_err=abort, all I/O errors on file data block belonging to > the committing transaction are checked. When I/O error is found, > abort journaling and remount the filesystem with read-only to > prevent further updates. This kind of feature is very important > for mission critical systems. Agreed. We also set PG_error, which should be enough to trigger such remount? > If we merge this patch, we would face the data corruption problem > again. 
> > I think there are three options, > > (1) drop this patch > (2) merge this patch with new panic_on_dirty_page_cache_corruption > sysctl > (3) implement a more sophisticated error_remove_page function In fact we proposed a patch for preventing the re-corruption case, see http://lkml.org/lkml/2009/6/11/294 However it is hard to answer the (policy) question "How sticky should the EIO bit remain?". > > static const struct address_space_operations ext3_writeback_aops = { > > @@ -1834,6 +1835,7 @@ static const struct address_space_operat > > .direct_IO = ext3_direct_IO, > > .migratepage = buffer_migrate_page, > > .is_partially_uptodate = block_is_partially_uptodate, > > + .error_remove_page = generic_error_remove_page, > > }; > > The writeback case would be OK. It's not much different from the I/O > error case. > > > static const struct address_space_operations ext3_journalled_aops = { > > @@ -1848,6 +1850,7 @@ static const struct address_space_operat > > .invalidatepage = ext3_invalidatepage, > > .releasepage = ext3_releasepage, > > .is_partially_uptodate = block_is_partially_uptodate, > > + .error_remove_page = generic_error_remove_page, > > }; > > > > void ext3_set_aops(struct inode *inode) > > I'm not sure about the journalled case. I'm going to take a look at > it later. 
Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752375AbZHJHoY (ORCPT ); Mon, 10 Aug 2009 03:44:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751105AbZHJHoX (ORCPT ); Mon, 10 Aug 2009 03:44:23 -0400 Received: from one.firstfloor.org ([213.235.205.2]:47116 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751032AbZHJHoX (ORCPT ); Mon, 10 Aug 2009 03:44:23 -0400 Date: Mon, 10 Aug 2009 09:44:21 +0200 From: Andi Kleen To: Hidehiro Kawai Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090810074421.GA6838@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A7FBFD1.2010208@hitachi.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, > If my understanding is correct, the following scenario can happen: Yes it can happen. > > 1. An uncorrected error on a dirty page cache page is detected by > memory scrubbing > 2. Kernel unmaps and truncates the page to recover from the error > 3. An application reads data from the file location corresponding > to the truncated page > ==> Old or garbage data will be read into a new page cache page The problem currently is that the error is not sticky enough and doesn't stay around long enough. It gets reported once, but not in later IO operations. 
However it's a generic problem not unique to hwpoison. Fengguang and I went through the error propagation, as our test program triggered the problem, and it looked like it was really a generic problem, not unique to hardware poison (e.g. the IO error handling on metadata has exactly the same problem). And redesigning VFS IO error reporting was a bit out of scope for hwpoison. So we decided to not be better than a normal IO error here for now. An application that handles current IO errors correctly will also handle hwpoison IO errors correctly. And an application that doesn't handle current IO errors correctly will also not necessarily handle hwpoison correctly (it's not better and not worse). So the hwpoison errors are pretty much the same as the normal IO errors. The normal error path probably needs some improvements, in particular the address space EIO error likely needs to be more sticky than it is today. An application has to handle the error on the first strike. That is something that could be improved in the VFS -- although I fear any improvements here could also break compatibility. I don't think it's a blocker on hwpoison for now. It needs more design effort and thinking (e.g. likely the address space IO error bit should be separated into multiple bits). Perhaps you're interested in working on this? > 4. The application modifies the data and writes it back to the disk > 5. The file will be corrupted! > > (Yes, the application is wrong to not do the right thing, i.e. fsync, > but it's not the user's fault!) > > A similar data corruption can be caused by a write I/O error, > because dirty flag is cleared even if the page couldn't be written > to the disk. > > However, we have a way to avoid this kind of data corruption at > least for ext3. If we mount an ext3 filesystem with data=ordered > and data_err=abort, all I/O errors on file data block belonging to > the committing transaction are checked. 
When I/O error is found, > abort journaling and remount the filesystem with read-only to > prevent further updates. This kind of feature is very important > for mission critical systems. Well it sounds like a potentially useful enhancement to ext3 (or ext4). One issue is that the default is not ordered anymore since Linus changed the default. I'm sure other enhancements for IO errors could be done too. Some of the file systems also handle them still quite poorly (e.g. btrfs) But again I don't think it's a blocker for hwpoison. > I think there are three options, > > (1) drop this patch > (2) merge this patch with new panic_on_dirty_page_cache_corruption > sysctl > (3) implement a more sophisticated error_remove_page function (4) accept that hwpoison error handling is not better and not worse than normal IO error handling. We opted for (4). -Andi -- ak@linux.intel.com -- Speaking for myself only. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752763AbZHKMAc (ORCPT ); Tue, 11 Aug 2009 08:00:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752470AbZHKMAa (ORCPT ); Tue, 11 Aug 2009 08:00:30 -0400 Received: from mailx.hitachi.co.jp ([133.145.228.49]:33207 "EHLO mailx.hitachi.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752738AbZHKMAZ (ORCPT ); Tue, 11 Aug 2009 08:00:25 -0400 X-AuditID: b753bd60-a8c6fba000004725-e0-4a80ea23bfc2 Message-ID: <4A80EA14.4030300@hitachi.com> Date: Tue, 11 Aug 2009 12:48:36 +0900 From: Hidehiro Kawai User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja-JP; rv:1.4) Gecko/20030624 Netscape/7.1 (ax) X-Accept-Language: ja MIME-Version: 1.0 To: Wu Fengguang Cc: Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi 
OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810070745.GA26533@localhost> In-Reply-To: <20090810070745.GA26533@localhost> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: AAAAAA== X-FMFTCR: RANGEA Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Wu Fengguang wrote: >>However, we have a way to avoid this kind of data corruption at >>least for ext3. If we mount an ext3 filesystem with data=ordered >>and data_err=abort, all I/O errors on file data block belonging to >>the committing transaction are checked. When I/O error is found, >>abort journaling and remount the filesystem with read-only to >>prevent further updates. This kind of feature is very important >>for mission critical systems. > > Agreed. We also set PG_error, which should be enough to trigger such > remount? ext3 doesn't check PG_error. Maybe we need to do: 1. trylock_buffer() 2. if step 1. succeeds, then clear_buffer_dirty(), clear_buffer_uptodate(), and set_buffer_write_io_error() Note that we can't truncate the corrupted page until the error check is done by kjournald. >>If we merge this patch, we would face the data corruption problem >>again. >> >>I think there are three options, >> >>(1) drop this patch >>(2) merge this patch with new panic_on_dirty_page_cache_corruption >> sysctl >>(3) implement a more sophisticated error_remove_page function > > In fact we proposed a patch for preventing the re-corruption case, see > > http://lkml.org/lkml/2009/6/11/294 > > However it is hard to answer the (policy) question "How sticky should > the EIO bit remain?". It's a good approach! This approach may also solve my concern, the re-corruption issue caused by transient IO errors. 
But I also think it needs a bit more consideration. For example, if the application has the valid data in the user space buffer, it would try to re-write it after detecting an IO error from the previous write. In this case, we should clear the sticky error flag. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752793AbZHKMAh (ORCPT ); Tue, 11 Aug 2009 08:00:37 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752610AbZHKMAe (ORCPT ); Tue, 11 Aug 2009 08:00:34 -0400 Received: from mailx.hitachi.co.jp ([133.145.228.49]:33209 "EHLO mailx.hitachi.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752734AbZHKMAZ (ORCPT ); Tue, 11 Aug 2009 08:00:25 -0400 X-AuditID: b753bd60-aae45ba000006009-b1-4a80eaa734c3 Message-ID: <4A80EAA3.7040107@hitachi.com> Date: Tue, 11 Aug 2009 12:50:59 +0900 From: Hidehiro Kawai User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja-JP; rv:1.4) Gecko/20030624 Netscape/7.1 (ax) X-Accept-Language: ja MIME-Version: 1.0 To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> In-Reply-To: <20090810074421.GA6838@basil.fritz.box> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: AAAAAA== X-FMFTCR: RANGEA Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andi 
Kleen wrote: >>1. An uncorrected error on a dirty page cache page is detected by >> memory scrubbing >>2. Kernel unmaps and truncates the page to recover from the error >>3. An application reads data from the file location corresponding >> to the truncated page >> ==> Old or garbage data will be read into a new page cache page > > The problem currently is that the error is not sticky enough and > doesn't stay around long enough. It gets reported once, > but not in later IO operations. > > However it's a generic problem not unique to hwpoison. Me Yes, it's a generic problem, and introducing a sticky error flag is one of the approaches to solving it. I think it is a good approach because it doesn't depend on individual filesystems. > And an application > that doesn't handle current IO errors correctly will also > not necessarily handle hwpoison correctly (it's not better and not worse) This is my main concern. I'd like to prevent re-corruption even if applications don't have good manners. As for usual I/O errors, ext3/4 can now do this with the data=ordered and data_err=abort mount options. Moreover, if you mount the ext3/4 filesystem with the additional errors=panic option, the kernel panics on a write error instead of remounting read-only. Customers who regard data integrity as very important require these features. But this patch (PATCH 16/19) introduces this problem again, because it doesn't provide a way to shut out further writes to the fs. Of course, we can do it by setting the tolerant level to 0 or memory_failure_recovery to 0. But that would be overkill. That is why I suggested this: >>(2) merge this patch with new panic_on_dirty_page_cache_corruption >> sysctl > That is something that could be improved in the VFS -- although I fear > any improvements here could also break compatibility. I don't think > it's a blocker on hwpoison for now. It needs more design > effort and thinking (e.g. 
likely the address space IO error > bit should be separated into multiple bits) > > Perhaps you're interested in working on this? Yes. Transient IO errors have the potential to cause the re-corruption problem. ext3/4 now provide ways to prevent it, but the other filesystems do not. We would need a generic way. >>4. The application modifies the data and writes it back to the disk >>5. The file will be corrupted! >> >>(Yes, the application is wrong to not do the right thing, i.e. fsync, >> but it's not the user's fault!) >> >>A similar data corruption can be caused by a write I/O error, >>because dirty flag is cleared even if the page couldn't be written >>to the disk. >> >>However, we have a way to avoid this kind of data corruption at >>least for ext3. If we mount an ext3 filesystem with data=ordered >>and data_err=abort, all I/O errors on file data block belonging to >>the committing transaction are checked. When I/O error is found, >>abort journaling and remount the filesystem with read-only to >>prevent further updates. This kind of feature is very important >>for mission critical systems. > > Well it sounds like a potentially useful enhancement to ext3 (or ext4). > > One issue is that the default is not ordered anymore since > Linus changed the default. Yes, but what is important is whether the system provides such a feature or not. > I'm sure other enhancements for IO errors could be done too. > Some of the file systems also handle them still quite poorly (e.g. btrfs) > > But again I don't think it's a blocker for hwpoison. Unfortunately, it can be a blocker. As I stated, we can block the possible re-corruption caused by transient IO errors on ext3/4 filesystems. But with this patch (PATCH 16/19) applied, re-corruption can happen even if we use the data=ordered, data_err=abort and errors=panic mount options. So... 
>>I think there are three options, >> >>(1) drop this patch >>(2) merge this patch with new panic_on_dirty_page_cache_corruption >> sysctl >>(3) implement a more sophisticated error_remove_page function > > (4) accept that hwpoison error handling is not better and not worse than normal > IO error handling. > > We opted for (4). Could you consider adopting (2) or (3)? Fengguang's sticky EIO approach (http://lkml.org/lkml/2009/6/11/294) is also OK. I hope HWPOISON patches are merged into 2.6.32. So (2) is the best answer for me, because it's simple and less intrusive. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752860AbZHKMD6 (ORCPT ); Tue, 11 Aug 2009 08:03:58 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752472AbZHKMD5 (ORCPT ); Tue, 11 Aug 2009 08:03:57 -0400 Received: from one.firstfloor.org ([213.235.205.2]:54988 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751289AbZHKMD4 (ORCPT ); Tue, 11 Aug 2009 08:03:56 -0400 Date: Tue, 11 Aug 2009 09:17:56 +0200 From: Andi Kleen To: Hidehiro Kawai Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090811071756.GC14368@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
<4A80EAA3.7040107@hitachi.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote: > > And an application > > that doesn't handle current IO errors correctly will also > > not necessarily handle hwpoison correctly (it's not better and not worse) > > This is my main concern. I'd like to prevent re-corruption even if > applications don't have good manners. I don't think there's much we can do if the application doesn't check for IO errors properly. What would you do if it doesn't check for IO errors at all? If it checks for IO errors it simply has to check for them on all IO operations -- if they do they will detect hwpoison errors correctly too. > As for usual I/O errors, ext3/4 can now do this with the data=ordered and > data_err=abort mount options. Moreover, if you mount the ext3/4 > filesystem with the additional errors=panic option, the kernel > panics on a write error instead of remounting read-only. Customers > who regard data integrity as very important require these features. Well, they can also set vm.memory_failure_recovery = 0 then, if they don't care about their uptime. > That is why I suggested this: > >>(2) merge this patch with new panic_on_dirty_page_cache_corruption You probably mean panic_on_non_anonymous_dirty_page_cache. Normally anonymous memory is dirty. > >> sysctl It's unclear to me this special mode is really desirable. Does it bring enough value to the user to justify the complexity of another exotic option? The case is relatively exotic, as in dirty write cache that is mapped to a file. Try to explain it in documentation and you see how ridiculous it sounds; it simply doesn't have clean semantics ("In case you have applications with broken error IO handling on your mission critical system ...") > > I'm sure other enhancements for IO errors could be done too. 
> > Some of the file systems also handle them still quite poorly (e.g. btrfs) > > > > But again I don't think it's a blocker for hwpoison. > > Unfortunately, it can be a blocker. As I stated, we can block the > possible re-corruption caused by transient IO errors on ext3/4 > filesystems. But applying this patch (PATCH 16/19), re-corruption > can happen even if we use data=ordered, data_err=abort and > errors=panic mount options. We don't corrupt data on disk. Applications that don't check for IO errors correctly may see stale data from the same file on disk though. This can happen in all the cases you listed above except for panic-on-error. If you want panic-on-error behaviour simply set vm.memory_failure_recovery = 0 > > (4) accept that hwpoison error handling is not better and not worse than normal > > IO error handling. > > > > We opted for (4). > > Could you consider adopting (2) or (3)? Fengguang's sticky EIO > approach (http://lkml.org/lkml/2009/6/11/294) is also OK. I believe redesigned IO error handling does not belong in the core hwpoison patchkit. It's big enough as it is and I consider it frozen unless fatal bugs are found -- and frankly this is not a fatal error in my estimation. If you want to have improved IO error handling feel free to submit it separately. I agree this area could use some work. But it probably needs more design work first. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
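[Editorial illustration, not part of the original thread.] The point made above, that an application has to check for errors on every IO operation, can be sketched in user space roughly as follows. This is a minimal sketch assuming a POSIX system; write_and_sync() is a hypothetical helper name chosen for illustration and does not come from the patch set. Since the thread notes that an error is currently reported only once, the sketch treats a failure from write(), fsync() or close() as final and returns the first one seen.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and force them to disk.
 * Returns 0 on success or a negative errno value for the first failure.
 * Every call that can report an I/O error is checked, because a later
 * call is not guaranteed to report the same error again.
 */
int write_and_sync(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd;
    int err = 0;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing, retry */
            err = -errno;
            break;
        }
        if (n == 0) {           /* no progress: treat as an I/O error */
            err = -EIO;
            break;
        }
        p += n;                 /* short write: continue with the rest */
        len -= (size_t)n;
    }

    /* Writeback errors for buffered data surface here, not at write(). */
    if (err == 0 && fsync(fd) < 0)
        err = -errno;

    /* close() can fail too; keep the first error we saw. */
    if (close(fd) < 0 && err == 0)
        err = -errno;

    return err;
}
```

An application that exits after write() without the fsync() step never reaches the point where such an error would be reported, which is the case Hidehiro is worried about.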
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754612AbZHKNAY (ORCPT ); Tue, 11 Aug 2009 09:00:24 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754552AbZHKNAU (ORCPT ); Tue, 11 Aug 2009 09:00:20 -0400 Received: from mga03.intel.com ([143.182.124.21]:40969 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752990AbZHKNAT (ORCPT ); Tue, 11 Aug 2009 09:00:19 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.43,360,1246863600"; d="scan'208";a="174539683" Date: Tue, 11 Aug 2009 20:38:19 +0800 From: Wu Fengguang To: Hidehiro Kawai Cc: Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090811123819.GB18881@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810070745.GA26533@localhost> <4A80EA14.4030300@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A80EA14.4030300@hitachi.com> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 11, 2009 at 11:48:36AM +0800, Hidehiro Kawai wrote: > Wu Fengguang wrote: > > In fact we proposed a patch for preventing the re-corruption case, see > > > > http://lkml.org/lkml/2009/6/11/294 > > > > However it is hard to answer the (policy) question "How sticky should > > the EIO bit remain?". > > It's a good approach! 
This approach may also solve my concern, > the re-corruption issue caused by transient IO errors. > > But I also think it needs a bit more consideration. For example, > if the application has the valid data in the user space buffer, > it would try to re-write it after detecting an IO error from the > previous write. In this case, we should clear the sticky error flag. Yes, and maybe more than that. The IO error issue really deserves an independent work, which will inevitably involve lots of discussions with lots of people. For the data re-corruption problem, "vm.memory_failure_recovery = 0" should be the most clean workaround for now. Can we settle with that? Our goal for this initial hwpoison implementation is to achieve good coverage (not necessarily every possible case :). Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752462AbZHKNKi (ORCPT ); Tue, 11 Aug 2009 09:10:38 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752234AbZHKNKh (ORCPT ); Tue, 11 Aug 2009 09:10:37 -0400 Received: from one.firstfloor.org ([213.235.205.2]:50995 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752233AbZHKNKg (ORCPT ); Tue, 11 Aug 2009 09:10:36 -0400 Date: Tue, 11 Aug 2009 08:59:29 +0200 From: Andi Kleen To: Hidehiro Kawai Cc: Wu Fengguang , Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090811065929.GB14368@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> 
<20090810070745.GA26533@localhost> <4A80EA14.4030300@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A80EA14.4030300@hitachi.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 11, 2009 at 12:48:36PM +0900, Hidehiro Kawai wrote: > Wu Fengguang wrote: > > >>However, we have a way to avoid this kind of data corruption at > >>least for ext3. If we mount an ext3 filesystem with data=ordered > >>and data_err=abort, all I/O errors on file data block belonging to > >>the committing transaction are checked. When I/O error is found, > >>abort journaling and remount the filesystem with read-only to > >>prevent further updates. This kind of feature is very important > >>for mission critical systems. > > > > Agreed. We also set PG_error, which should be enough to trigger such > > remount? > > ext3 doesn't check PG_error. Maybe we need to do: When we truncate the page it's gone so there's no page to set PG_error on. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755756AbZHLCuG (ORCPT ); Tue, 11 Aug 2009 22:50:06 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755720AbZHLCuF (ORCPT ); Tue, 11 Aug 2009 22:50:05 -0400 Received: from mail7.hitachi.co.jp ([133.145.228.42]:47416 "EHLO mail7.hitachi.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755541AbZHLCuE (ORCPT ); Tue, 11 Aug 2009 22:50:04 -0400 X-AuditID: b753bd60-abe88ba000001a2f-c8-4a822dda0b3c Message-ID: <4A822DD4.1050202@hitachi.com> Date: Wed, 12 Aug 2009 11:49:56 +0900 From: Hidehiro Kawai User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja-JP; rv:1.4) Gecko/20030624 Netscape/7.1 (ax) X-Accept-Language: ja MIME-Version: 1.0 To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> In-Reply-To: <20090811071756.GC14368@basil.fritz.box> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: AAAAAA== X-FMFTCR: RANGEA Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andi Kleen wrote: > On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote: > >>>And application >>>that doesn't handle current IO errors correctly will also >>>not necessarily handle hwpoison correctly (it's not better and not worse) >> >>This is my main concern. 
I'd like to prevent re-corruption even if >>applications don't have good manners. > > I don't think there's much we can do if the application doesn't > check for IO errors properly. What would you do if it doesn't > check for IO errors at all? If it checks for IO errors it simply > has to check for them on all IO operations -- if they do > they will detect hwpoison errors correctly too. I believe it's not uncommon for applications to do buffered writes and then exit without fsync(). And I think it's difficult to completely preclude such applications and commands from the system. >>That is why I suggested this: >> >>>>(2) merge this patch with new panic_on_dirty_page_cache_corruption >>>> sysctl > > You probably mean panic_on_non_anonymous_dirty_page_cache > Normally anonymous memory is dirty. Yes, and sorry for my ambiguous description. I used the word "cache" intending to exclude anonymous pages. > It's unclear to me this special mode is really desirable. > Does it bring enough value to the user to justify the complexity > of another exotic option? The case is relatively exotic, > as in dirty write cache that is mapped to a file. > > Try to explain it in documentation and you see how ridiculous it sounds; > it simply doesn't have clean semantics > > ("In case you have applications with broken error IO handling on > your mission critical system ...") Generally, dropping unwritten dirty page caches is considered to be risky. So the "panic on IO error" policy has been used as usual practice on some systems. I just suggested that we adopt this policy for machine check errors. vm.memory_failure_recovery satisfies my minimal requirement. It's OK at this stage, but I'd like to improve this in the future. Another option is to introduce an "ignore all" policy instead of panicking at the beginning of memory_failure(). Perhaps it eventually causes an SRAR machine check, and then the kernel will panic or a process will be killed. Anyway, this is a topic for the next stage. 
>>>(4) accept that hwpoison error handling is not better and not worse than normal >>>IO error handling. >>> >>>We opted for (4). >> >>Could you consider adopting (2) or (3)? Fengguang's sticky EIO >>approach (http://lkml.org/lkml/2009/6/11/294) is also OK. > > I believe redesigned IO error handling does not belong in the > core hwpoison patchkit. I agree. > If you want to have improved IO error handling feel free to > submit it separately. I agree this area could use some work. > But it probably needs more design work first. Well, this patch set itself looks good to me. I also looked into the other patches, I couldn't find any problems (although I'm not good judge of reviewing). Reviewed-by: Hidehiro Kawai Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932148AbZHLHqN (ORCPT ); Wed, 12 Aug 2009 03:46:13 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932122AbZHLHqM (ORCPT ); Wed, 12 Aug 2009 03:46:12 -0400 Received: from one.firstfloor.org ([213.235.205.2]:34284 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932119AbZHLHqL (ORCPT ); Wed, 12 Aug 2009 03:46:11 -0400 Date: Wed, 12 Aug 2009 09:46:11 +0200 From: Andi Kleen To: Hidehiro Kawai Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812074611.GC28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> 
<4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <4A822DD4.1050202@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A822DD4.1050202@hitachi.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 12, 2009 at 11:49:56AM +0900, Hidehiro Kawai wrote: > > I don't think there's much we can do if the application doesn't > > check for IO errors properly. What would you do if it doesn't > > check for IO errors at all? If it checks for IO errors it simply > > has to check for them on all IO operations -- if they do > > they will detect hwpoison errors correctly too. > > I believe it's not uncommon for applications to do buffered write > and then exit without fsync(). And I think it's difficult to > preclude such applications and commands from the system perfectly. That's true, but for anything mission critical you would expect them to use some transactional mechanism, either with O_SYNC or fsync(). Otherwise they always risk data loss anyways. > > It's unclear to me this special mode is really desirable. > > Does it bring enough value to the user to justify the complexity > > of another exotic option? The case is relatively exotic, > > as in dirty write cache that is mapped to a file. > > > > Try to explain it in documentation and you see how ridiculous it sounds; u > > it simply doesn't have clean semantics > > > > ("In case you have applications with broken error IO handling on > > your mission critical system ...") > > Generally, dropping unwritten dirty page caches is considered to be > risky. So the "panic on IO error" policy has been used as usual > practice for some systems. I just suggested that we adopted > this policy into machine check errors. 
Hmm, what we could possibly do -- as followon patches -- would be to let error_remove_page check the per file system panic-on-io-error super block setting for dirty pages and panic in this case too. Unfortunately this setting is currently per file system, not generic, so it would need to be a fs specific check (or the flag would need to be moved into a generic fs superblock field first). I think that would be relatively clean semantics wise. Would you be interested in working on patches for that? > Another option is to introduce "ignore all" policy instead of > panicking at the beginning of memory_failure(). Perhaps it finally > causes SRAR machine check, and then kernel will panic or a process > will be killed. Anyway, this is a topic for the next stage. The problem is memory_failure() would then need to start distinguishing between AR=1 and AR=0 which it doesn't today. It could be done, but would need some more work. > > If you want to have improved IO error handling feel free to > > submit it separately. I agree this area could use some work. > > But it probably needs more design work first. > > Well, this patch set itself looks good to me. > I also looked into the other patches, I couldn't find any > problems (although I'm not good judge of reviewing). > > Reviewed-by: Hidehiro Kawai Thanks for your review and your comments. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754746AbZHLIFt (ORCPT ); Wed, 12 Aug 2009 04:05:49 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754303AbZHLIFq (ORCPT ); Wed, 12 Aug 2009 04:05:46 -0400 Received: from cantor2.suse.de ([195.135.220.15]:47422 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752651AbZHLIFl (ORCPT ); Wed, 12 Aug 2009 04:05:41 -0400 Date: Wed, 12 Aug 2009 10:05:40 +0200 From: Nick Piggin To: Andi Kleen Cc: Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812080540.GA32342@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090811071756.GC14368@basil.fritz.box> User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 11, 2009 at 09:17:56AM +0200, Andi Kleen wrote: > On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote: > > > And application > > > that doesn't handle current IO errors correctly will also > > > not necessarily handle hwpoison correctly (it's not better and not worse) > > > > This is my main concern. I'd like to prevent re-corruption even if > > applications don't have good manners. 
> > I don't think there's much we can do if the application doesn't > check for IO errors properly. What would you do if it doesn't > check for IO errors at all? If it checks for IO errors it simply > has to check for them on all IO operations -- if they do > they will detect hwpoison errors correctly too. But will quite possibly do the wrong thing: ie. try to re-sync the same page again, or try to write the page to a new location, etc. This is the whole problem with -EIO semantics I brought up. > > That is why I suggested this: > > >>(2) merge this patch with new panic_on_dirty_page_cache_corruption > > You probably mean panic_on_non_anonymous_dirty_page_cache > Normally anonymous memory is dirty. > > > >> sysctl > > It's unclear to me this special mode is really desirable. > Does it bring enough value to the user to justify the complexity > of another exotic option? The case is relatively exotic, > as in dirty write cache that is mapped to a file. > > Try to explain it in documentation and you see how ridiculous it sounds; u > it simply doesn't have clean semantics > > ("In case you have applications with broken error IO handling on > your mission critical system ...") Not broken error handling. It is very simple: if the application is assuming EIO is an error with dirty data being sent to disk, rather than an error with the data itself (which I think may be a common assumption). Then it could have a problem. If a database for example tries to write the data to another location in response to EIO and then record it in a list of failed IOs before halting the database. Then if it restarts it might try to again try writing out these failed IOs (eg. give the administrator a chance to fix IO devices). Completely made up scenario but it is not outlandish and it would cause bad data corruption. A mission critical server will *definitely* want to panic on dirty page corruption, IMO, because by definition they should be able to tolerate panic. 
But if they do not know about this change to -EIO semantics, then it is quite possible to cause problems. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932179AbZHLIXg (ORCPT ); Wed, 12 Aug 2009 04:23:36 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754712AbZHLIXf (ORCPT ); Wed, 12 Aug 2009 04:23:35 -0400 Received: from one.firstfloor.org ([213.235.205.2]:40200 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754322AbZHLIXe (ORCPT ); Wed, 12 Aug 2009 04:23:34 -0400 Date: Wed, 12 Aug 2009 10:23:31 +0200 From: Andi Kleen To: Nick Piggin Cc: Andi Kleen , Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812082331.GD28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812080540.GA32342@wotan.suse.de> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > page corruption, IMO, because by definition they should be able to > tolerate panic. But if they do not know about this change to -EIO > semantics, then it is quite possible to cause problems. There's no change really. You already have this problem with any metadata error, which can cause similar trouble. 
If the application handles those correctly it will also handle hwpoison correctly. -Andi -- ak@linux.intel.com -- Speaking for myself only. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932216AbZHLIqR (ORCPT ); Wed, 12 Aug 2009 04:46:17 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932130AbZHLIqQ (ORCPT ); Wed, 12 Aug 2009 04:46:16 -0400 Received: from cantor.suse.de ([195.135.220.2]:44775 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932084AbZHLIqQ (ORCPT ); Wed, 12 Aug 2009 04:46:16 -0400 Date: Wed, 12 Aug 2009 10:46:13 +0200 From: Nick Piggin To: Andi Kleen Cc: Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812084613.GB32342@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812082331.GD28848@basil.fritz.box> User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > page corruption, IMO, because by definition they should be able to > > tolerate panic. But if they do not know about this change to -EIO > > semantics, then it is quite possible to cause problems. 
> > There's no change really. You already have this problem with > any metadata error, which can cause similar trouble. > If the application handles those correctly it will also > handle hwpoison correctly. What do you mean metadata error? To be clear, we're talking about safety given by a panic on any memory error, versus safety of this hwpoison code (specifically which removes dirty pagecache pages as -EIO). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932228AbZHLI52 (ORCPT ); Wed, 12 Aug 2009 04:57:28 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932202AbZHLI51 (ORCPT ); Wed, 12 Aug 2009 04:57:27 -0400 Received: from one.firstfloor.org ([213.235.205.2]:52054 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932199AbZHLI51 (ORCPT ); Wed, 12 Aug 2009 04:57:27 -0400 Date: Wed, 12 Aug 2009 10:57:27 +0200 From: Andi Kleen To: Nick Piggin Cc: Andi Kleen , Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812085727.GE28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> <20090812084613.GB32342@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812084613.GB32342@wotan.suse.de> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 12, 2009 at 10:46:13AM +0200, Nick Piggin wrote: > On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > > page corruption, IMO, because by definition they should be able to > > > tolerate panic. But if they do not know about this change to -EIO > > > semantics, then it is quite possible to cause problems. > > > > There's no change really. You already have this problem with > > any metadata error, which can cause similar trouble. > > If the application handles those correctly it will also > > handle hwpoison correctly. > > What do you mean metadata error? e.g. when there's a write error on the indirect block or any other fs metadata. This can also cause you to lose data. The error reporting also works through the address space like with hwpoison, so it only gets reported once. I'm not really against fixing that (make the error more sticky as Fengguang puts it), but I don't think it needs to be mixed with hwpoison. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932241AbZHLJFU (ORCPT ); Wed, 12 Aug 2009 05:05:20 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754470AbZHLJFT (ORCPT ); Wed, 12 Aug 2009 05:05:19 -0400 Received: from cantor2.suse.de ([195.135.220.15]:49943 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753930AbZHLJFS (ORCPT ); Wed, 12 Aug 2009 05:05:18 -0400 Date: Wed, 12 Aug 2009 11:05:18 +0200 From: Nick Piggin To: Andi Kleen Cc: Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812090518.GC32342@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> <20090812084613.GB32342@wotan.suse.de> <20090812085727.GE28848@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812085727.GE28848@basil.fritz.box> User-Agent: Mutt/1.5.9i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 12, 2009 at 10:57:27AM +0200, Andi Kleen wrote: > On Wed, Aug 12, 2009 at 10:46:13AM +0200, Nick Piggin wrote: > > On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > > > page corruption, IMO, because by definition they should be able to > > > > tolerate panic. 
But if they do not know about this change to -EIO > > > > semantics, then it is quite possible to cause problems. > > > > > > There's no change really. You already have this problem with > > > any metadata error, which can cause similar trouble. > > > If the application handles those correctly it will also > > > handle hwpoison correctly. > > > > What do you mean metadata error? > > e.g. when there's a write error on the indirect block or any > other fs metadata. This can also cause you to lose data. The error > reporting also works through the address space like with hwpoison, > so it only gets reported once. Well, this is also a filesystem issue, but anyway the data typically does not get thrown out. So a subsequent fsync should be able to retry. But if the filesystem can't handle such errors and loses the original data when there is an IO error in newly dirty metadata, then it's a problem in the filesystem really isn't it? > I'm not really against fixing that (make the error more sticky > as Fengguang puts it), but I don't think it needs to be mixed > with hwpoison. I don't know if making it sticky really "fixes" it. The problem is different semantics of what EIO means. My example illustrates this. 
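An illustrative sketch, not part of the original thread: an application that "checks for IO errors on all IO operations" in the sense discussed above might look roughly like the following. The helper name write_durably is hypothetical; only standard POSIX calls (open, write, fsync, close) are used.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and force them to stable
 * storage. Returns 0 on success or a negative errno value on failure.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	int fd, err = 0;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -errno;

	/* write() may be short or interrupted, so loop until done. */
	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			err = -errno;
			break;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Buffered data is only known to be on disk once fsync() succeeds. */
	if (err == 0 && fsync(fd) < 0)
		err = -errno;

	/* close() can report errors too; do not mask an earlier one. */
	if (close(fd) < 0 && err == 0)
		err = -errno;

	return err;
}
```

If this returns -EIO, the caller cannot tell from the error code alone whether the device write failed or the dirty page itself was lost, which is the ambiguity debated above. A cautious caller would rewrite from its own copy of the data rather than assume the page cache still holds it.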
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932347AbZHLJjx (ORCPT ); Wed, 12 Aug 2009 05:39:53 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932322AbZHLJjw (ORCPT ); Wed, 12 Aug 2009 05:39:52 -0400 Received: from mga14.intel.com ([143.182.124.37]:43941 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932315AbZHLJjw (ORCPT ); Wed, 12 Aug 2009 05:39:52 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.43,365,1246863600"; d="scan'208";a="174916377" Date: Wed, 12 Aug 2009 17:39:35 +0800 From: Wu Fengguang To: Nick Piggin Cc: Andi Kleen , Hidehiro Kawai , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812093935.GA2724@localhost> References: <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> <20090812084613.GB32342@wotan.suse.de> <20090812085727.GE28848@basil.fritz.box> <20090812090518.GC32342@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812090518.GC32342@wotan.suse.de> User-Agent: Mutt/1.5.18 (2008-05-17) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 12, 2009 at 05:05:18PM +0800, Nick Piggin wrote: > On Wed, Aug 12, 2009 at 10:57:27AM +0200, Andi Kleen wrote: > > On Wed, Aug 12, 2009 at 10:46:13AM +0200, Nick Piggin wrote: > > > On Wed, Aug 12, 2009 at 
10:23:31AM +0200, Andi Kleen wrote: > > > > > page corruption, IMO, because by definition they should be able to > > > > > tolerate panic. But if they do not know about this change to -EIO > > > > > semantics, then it is quite possible to cause problems. > > > > > > > > There's no change really. You already have this problem with > > > > any metadata error, which can cause similar trouble. > > > > If the application handles those correctly it will also > > > > handle hwpoison correctly. > > > > > > What do you mean metadata error? > > > > e.g. when there's a write error on the indirect block or any > > other fs metadata. This can also cause you to lose data. The error > > reporting also works through the address space like with hwpoison, > > so it only gets reported once. > > Well, this is also a filesystem issue, but anyway the data typically > does not get thrown out. So a subsequent fsync should be able to > retry. Right. With a normal EIO, the data in the page cache is still good and accessible. > But if the filesystem can't handle such errors and loses the original > data when there is an IO error in newly dirty metadata, then it's > a problem in the filesystem really isn't it? Right, and the fs should report EIO on future sync attempts as long as the problem persists. > > I'm not really against fixing that (make the error more sticky > > as Fengguang puts it), but I don't think it needs to be mixed > > with hwpoison. > > I don't know if making it sticky really "fixes" it. The problem is > different semantics of what EIO means. My example illustrates this. Case 1: (re)sync on EIO: sticky EIO will help. Case 2: read out the data from page cache and rewrite it somewhere. Sticky EIO is not enough, because here the application assumes the dirty page is still accessible. In this case, patch http://lkml.org/lkml/2009/6/11/294 will help. It effectively freezes the radix tree, so that no new pages will be loaded to replace the corrupted data and fake a 'good' one. 
Thanks, Fengguang From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932355AbZHLJwU (ORCPT ); Wed, 12 Aug 2009 05:52:20 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932272AbZHLJwU (ORCPT ); Wed, 12 Aug 2009 05:52:20 -0400 Received: from mail4.hitachi.co.jp ([133.145.228.5]:58291 "EHLO mail4.hitachi.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932153AbZHLJwT (ORCPT ); Wed, 12 Aug 2009 05:52:19 -0400 X-AuditID: b753bd60-a9670ba000004725-2e-4a8290d1498a Message-ID: <4A8290CE.7000904@hitachi.com> Date: Wed, 12 Aug 2009 18:52:14 +0900 From: Hidehiro Kawai User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja-JP; rv:1.4) Gecko/20030624 Netscape/7.1 (ax) X-Accept-Language: ja MIME-Version: 1.0 To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <4A822DD4.1050202@hitachi.com> <20090812074611.GC28848@basil.fritz.box> In-Reply-To: <20090812074611.GC28848@basil.fritz.box> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: AAAAAA== X-FMFTCR: RANGEA Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andi Kleen wrote: >>Generally, dropping unwritten dirty page caches is considered to be >>risky. So the "panic on IO error" policy has been used as usual >>practice for some systems. 
I just suggested that we adopted >>this policy into machine check errors. > > Hmm, what we could possibly do -- as followon patches -- would be to > let error_remove_page check the per file system panic-on-io-error > super block setting for dirty pages and panic in this case too. > Unfortunately this setting is currently per file system, not generic, > so it would need to be a fs specific check (or the flag would need > to be moved into a generic fs superblock field first) A generic setting would be better, so I suggested a panic_on_dirty_page_cache_corruption flag which would be checked before invoking error_remove_page(). If we check per-filesystem settings, we might want to notify EIO to the filesystem. > I think that would be relatively clean semantics wise. Would you be > interested in working on patches for that? Yes. :-) I will work on this as soon as I come back from summer vacation. >>Another option is to introduce "ignore all" policy instead of >>panicking at the beginning of memory_failure(). Perhaps it finally >>causes SRAR machine check, and then kernel will panic or a process >>will be killed. Anyway, this is a topic for the next stage. > > The problem is memory_failure() would then need to start distinguishing > between AR=1 and AR=0 which it doesn't today. > > It could be done, but would need some more work. It's my understanding that memory_failure() is never called in the AR=1 case. Is that wrong? 
Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932437AbZHLKRB (ORCPT ); Wed, 12 Aug 2009 06:17:01 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1753706AbZHLKRB (ORCPT ); Wed, 12 Aug 2009 06:17:01 -0400 Received: from one.firstfloor.org ([213.235.205.2]:52643 "EHLO one.firstfloor.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752005AbZHLKRA (ORCPT ); Wed, 12 Aug 2009 06:17:00 -0400 Date: Wed, 12 Aug 2009 12:16:58 +0200 From: Andi Kleen To: Hidehiro Kawai Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812101658.GF28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <4A822DD4.1050202@hitachi.com> <20090812074611.GC28848@basil.fritz.box> <4A8290CE.7000904@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A8290CE.7000904@hitachi.com> User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 12, 2009 at 06:52:14PM +0900, Hidehiro Kawai wrote: > Andi Kleen wrote: > > >>Generally, dropping unwritten dirty page caches is considered to be > >>risky. So the "panic on IO error" policy has been used as usual > >>practice for some systems. 
I just suggested that we adopted > >>this policy into machine check errors. > > > > Hmm, what we could possibly do -- as followon patches -- would be to > > let error_remove_page check the per file system panic-on-io-error > > super block setting for dirty pages and panic in this case too. > > Unfortunately this setting is currently per file system, not generic, > > so it would need to be a fs specific check (or the flag would need > > to be moved into a generic fs superblock field first) > > A generic setting would be better, so I suggested > panic_on_dirty_page_cache_corruption flag which would be checked > before invoking error_remove_page(). If we check per-filesystem > settings, we might want to notify EIO to the filesystem. You mean remounting ro if that is set? That makes sense, but I'm not sure how complicated it would be. I still would prefer to unify it with the file system settings. > > The problem is memory_failure() would then need to start distinguishing > > between AR=1 and AR=0 which it doesn't today. > > > > It could be done, but would need some more work. > > It's my understanding that memory_failure() is never called in the > AR=1 case. Is that wrong? Today yes, but we don't want to hardcode that assumption. e.g. for IA64 they will definitely need the equivalent of AR=1 handling. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 1CE526B005D for ; Wed, 5 Aug 2009 05:36:30 -0400 (EDT) From: Andi Kleen Message-Id: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [0/19] HWPOISON: Intro Date: Wed, 5 Aug 2009 11:36:27 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: [AK: This is a version with the correct file list and some fixes. Please disregard the version I posted yesterday] New version of the hwpoison patchkit. Various changes. Believed to address all earlier review comments. Active error truncate is enabled per file system now, so it adds a new VFS operation "error_remove_page" for this. This prevents any truncation on metadata pages, on those it just does invalidate. Also various bug fixes, most of them from Fengguang. Please see the individual patches for changelog. Should be good to go now. Passes the hwpoison specific parts of the mce-test test suite (git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git) Also available as git tree from git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git hwpoison Andrew, Please consider for merging. Thanks, -Andi Signed-off-by: Andi Kleen --- Upcoming Intel CPUs have support for recovering from some memory errors (``MCA recovery''). This requires the OS to declare a page "poisoned", kill the processes associated with it and avoid using it in the future. This patchkit implements the necessary infrastructure in the VM. To quote the overview comment: * High level machine check handler. Handles pages reported by the * hardware as being corrupted usually due to a 2bit ECC memory or cache * failure. * * This focusses on pages detected as corrupted in the background. 
* When the current CPU tries to consume corruption the currently * running process can just be killed directly instead. This implies * that if the error cannot be handled for some reason it's safe to * just ignore it because no corruption has been consumed yet. Instead * when that happens another machine check will happen. * * Handles page cache pages in various states. The tricky part * here is that we can access any page asynchronous to other VM * users, because memory failures could happen anytime and anywhere, * possibly violating some of their assumptions. This is why this code * has to be extremely careful. Generally it tries to use normal locking * rules, as in get the standard locks, even if that means the * error handling takes potentially a long time. * * Some of the operations here are somewhat inefficient and have non * linear algorithmic complexity, because the data structures have not * been optimized for this case. This is in particular the case * for the mapping from a vma to a process. Since this case is expected * to be rare we hope we can get away with this. The code consists of the high level handler in mm/memory-failure.c, a new page poison bit and various checks in the VM to handle poisoned pages. The main target right now is KVM guests, but it works for all kinds of applications. For the KVM use there was need for a new signal type so that KVM can inject the machine check into the guest with the proper address. This in theory allows other applications to handle memory failures too. The expectation is that nearly all applications won't do that, but some very specialized ones might. This is not fully complete yet, in particular there are still various ways to access poison (crash dump, /proc/kcore etc.) that need to be plugged too. -Andi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id E4BC56B007E for ; Wed, 5 Aug 2009 05:36:30 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [1/19] HWPOISON: Add page flag for poisoned pages Message-Id: <20090805093628.B90C3B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:28 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Hardware poisoned pages need special handling in the VM and shouldn't be touched again. This requires a new page flag. Define it here. The page flags wars seem to be over, so it shouldn't be a problem to get a new one. v2: Add TestSetHWPoison (suggested by Johannes Weiner) Acked-by: Christoph Lameter Signed-off-by: Andi Kleen --- include/linux/page-flags.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) Index: linux/include/linux/page-flags.h =================================================================== --- linux.orig/include/linux/page-flags.h +++ linux/include/linux/page-flags.h @@ -51,6 +51,9 @@ * PG_buddy is set to indicate that the page is free and in the buddy system * (see mm/page_alloc.c). * + * PG_hwpoison indicates that a page got corrupted in hardware and contains + * data with incorrect ECC bits that triggered a machine check. Accessing is + * not safe since it may cause another machine check. Don't touch! */ /* @@ -102,6 +105,9 @@ enum pageflags { #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif +#ifdef CONFIG_MEMORY_FAILURE + PG_hwpoison, /* hardware poisoned page. 
Don't touch */ +#endif __NR_PAGEFLAGS, /* Filesystems */ @@ -263,6 +269,15 @@ PAGEFLAG(Uncached, uncached) PAGEFLAG_FALSE(Uncached) #endif +#ifdef CONFIG_MEMORY_FAILURE +PAGEFLAG(HWPoison, hwpoison) +TESTSETFLAG(HWPoison, hwpoison) +#define __PG_HWPOISON (1UL << PG_hwpoison) +#else +PAGEFLAG_FALSE(HWPoison) +#define __PG_HWPOISON 0 +#endif + static inline int PageUptodate(struct page *page) { int ret = test_bit(PG_uptodate, &(page)->flags); @@ -387,7 +402,7 @@ static inline void __ClearPageTail(struc 1 << PG_private | 1 << PG_private_2 | \ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \ - 1 << PG_unevictable | __PG_MLOCKED) + 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON) /* * Flags checked when a page is prepped for return by the page allocator. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 8A2BE6B0062 for ; Wed, 5 Aug 2009 05:36:31 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [2/19] HWPOISON: Export some rmap vma locking to outside world Message-Id: <20090805093629.BBAAEB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:29 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Needed for later patch that walks rmap entries on its own. This used to be very frowned upon, but memory-failure.c does some rather specialized rmap walking and rmap has been stable for quite some time, so I think it's ok now to export it. 
Signed-off-by: Andi Kleen --- include/linux/rmap.h | 6 ++++++ mm/rmap.c | 4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) Index: linux/include/linux/rmap.h =================================================================== --- linux.orig/include/linux/rmap.h +++ linux/include/linux/rmap.h @@ -112,6 +112,12 @@ int page_mkclean(struct page *); */ int try_to_munlock(struct page *); +/* + * Called by memory-failure.c to kill processes. + */ +struct anon_vma *page_lock_anon_vma(struct page *page); +void page_unlock_anon_vma(struct anon_vma *anon_vma); + #else /* !CONFIG_MMU */ #define anon_vma_init() do {} while (0) Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -191,7 +191,7 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. */ -static struct anon_vma *page_lock_anon_vma(struct page *page) +struct anon_vma *page_lock_anon_vma(struct page *page) { struct anon_vma *anon_vma; unsigned long anon_mapping; @@ -211,7 +211,7 @@ out: return NULL; } -static void page_unlock_anon_vma(struct anon_vma *anon_vma) +void page_unlock_anon_vma(struct anon_vma *anon_vma) { spin_unlock(&anon_vma->lock); rcu_read_unlock(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id BE7916B007E for ; Wed, 5 Aug 2009 05:36:32 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [3/19] HWPOISON: Add support for poison swap entries v2 Message-Id: <20090805093630.BE26EB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:30 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Memory migration uses special swap entry types to trigger special actions on page faults. Extend this mechanism to also support poisoned swap entries, to trigger poison handling on page faults. This allows follow-on patches to prevent processes from faulting in poisoned pages again. v2: Fix overflow in MAX_SWAPFILES (Fengguang Wu) v3: Better overflow fix (Hidehiro Kawai) Signed-off-by: Andi Kleen --- include/linux/swap.h | 34 ++++++++++++++++++++++++++++------ include/linux/swapops.h | 38 ++++++++++++++++++++++++++++++++++++++ mm/swapfile.c | 4 ++-- 3 files changed, 68 insertions(+), 8 deletions(-) Index: linux/include/linux/swap.h =================================================================== --- linux.orig/include/linux/swap.h +++ linux/include/linux/swap.h @@ -34,16 +34,38 @@ static inline int current_is_kswapd(void * the type/offset into the pte as 5/27 as well. */ #define MAX_SWAPFILES_SHIFT 5 -#ifndef CONFIG_MIGRATION -#define MAX_SWAPFILES (1 << MAX_SWAPFILES_SHIFT) + +/* + * Use some of the swap files numbers for other purposes. This + * is a convenient way to hook into the VM to trigger special + * actions on faults. 
+ */ + +/* + * NUMA node memory migration support + */ +#ifdef CONFIG_MIGRATION +#define SWP_MIGRATION_NUM 2 +#define SWP_MIGRATION_READ (MAX_SWAPFILES + SWP_HWPOISON_NUM) +#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1) #else -/* Use last two entries for page migration swap entries */ -#define MAX_SWAPFILES ((1 << MAX_SWAPFILES_SHIFT)-2) -#define SWP_MIGRATION_READ MAX_SWAPFILES -#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + 1) +#define SWP_MIGRATION_NUM 0 #endif /* + * Handling of hardware poisoned pages with memory corruption. + */ +#ifdef CONFIG_MEMORY_FAILURE +#define SWP_HWPOISON_NUM 1 +#define SWP_HWPOISON MAX_SWAPFILES +#else +#define SWP_HWPOISON_NUM 0 +#endif + +#define MAX_SWAPFILES \ + ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM) + +/* * Magic header for a swap area. The first part of the union is * what the swap magic looks like for the old (limited to 128MB) * swap area format, the second part of the union adds - in the Index: linux/include/linux/swapops.h =================================================================== --- linux.orig/include/linux/swapops.h +++ linux/include/linux/swapops.h @@ -131,3 +131,41 @@ static inline int is_write_migration_ent #endif +#ifdef CONFIG_MEMORY_FAILURE +/* + * Support for hardware poisoned pages + */ +static inline swp_entry_t make_hwpoison_entry(struct page *page) +{ + BUG_ON(!PageLocked(page)); + return swp_entry(SWP_HWPOISON, page_to_pfn(page)); +} + +static inline int is_hwpoison_entry(swp_entry_t entry) +{ + return swp_type(entry) == SWP_HWPOISON; +} +#else + +static inline swp_entry_t make_hwpoison_entry(struct page *page) +{ + return swp_entry(0, 0); +} + +static inline int is_hwpoison_entry(swp_entry_t swp) +{ + return 0; +} +#endif + +#if defined(CONFIG_MEMORY_FAILURE) || defined(CONFIG_MIGRATION) +static inline int non_swap_entry(swp_entry_t entry) +{ + return swp_type(entry) >= MAX_SWAPFILES; +} +#else +static inline int non_swap_entry(swp_entry_t entry) +{ + 
return 0; +} +#endif Index: linux/mm/swapfile.c =================================================================== --- linux.orig/mm/swapfile.c +++ linux/mm/swapfile.c @@ -697,7 +697,7 @@ int free_swap_and_cache(swp_entry_t entr struct swap_info_struct *p; struct page *page = NULL; - if (is_migration_entry(entry)) + if (non_swap_entry(entry)) return 1; p = swap_info_get(entry); @@ -2083,7 +2083,7 @@ static int __swap_duplicate(swp_entry_t int count; bool has_cache; - if (is_migration_entry(entry)) + if (non_swap_entry(entry)) return -EINVAL; type = swp_type(entry); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id C57376B0083 for ; Wed, 5 Aug 2009 05:36:32 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [4/19] HWPOISON: Add new SIGBUS error codes for hardware poison signals Message-Id: <20090805093631.C0BD6B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:31 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Add new SIGBUS codes for reporting machine checks as signals. When the hardware detects an uncorrected ECC error it can trigger these signals. This is needed for telling KVM's qemu about machine checks that happen to guests, so that it can inject them, but might be also useful for other programs. I find it useful in my test programs. This patch merely defines the new types. - Define two new si_codes for SIGBUS. 
BUS_MCEERR_AO and BUS_MCEERR_AR * BUS_MCEERR_AO is for "Action Optional" machine checks, which means that some corruption has been detected in the background, but nothing has been consumed so far. The program can ignore those if it wants (but most programs would already get killed) * BUS_MCEERR_AR is for "Action Required" machine checks. This happens when corrupted data is consumed or the application ran into an area which has been known to be corrupted earlier. These require immediate action and cannot just returned to. Most programs would kill themselves. - They report the address of the corruption in the user address space in si_addr. - Define a new si_addr_lsb field that reports the extent of the corruption to user space. That's currently always a (small) page. The user application cannot tell where in this page the corruption happened. AK: I plan to write a man page update before anyone asks. Signed-off-by: Andi Kleen --- include/asm-generic/siginfo.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) Index: linux/include/asm-generic/siginfo.h =================================================================== --- linux.orig/include/asm-generic/siginfo.h +++ linux/include/asm-generic/siginfo.h @@ -82,6 +82,7 @@ typedef struct siginfo { #ifdef __ARCH_SI_TRAPNO int _trapno; /* TRAP # which caused the signal */ #endif + short _addr_lsb; /* LSB of the reported address */ } _sigfault; /* SIGPOLL */ @@ -112,6 +113,7 @@ typedef struct siginfo { #ifdef __ARCH_SI_TRAPNO #define si_trapno _sifields._sigfault._trapno #endif +#define si_addr_lsb _sifields._sigfault._addr_lsb #define si_band _sifields._sigpoll._band #define si_fd _sifields._sigpoll._fd @@ -192,7 +194,11 @@ typedef struct siginfo { #define BUS_ADRALN (__SI_FAULT|1) /* invalid address alignment */ #define BUS_ADRERR (__SI_FAULT|2) /* non-existant physical address */ #define BUS_OBJERR (__SI_FAULT|3) /* object specific hardware error */ -#define NSIGBUS 3 +/* hardware memory error consumed on a 
machine check: action required */ +#define BUS_MCEERR_AR (__SI_FAULT|4) +/* hardware memory error detected in process but not consumed: action optional*/ +#define BUS_MCEERR_AO (__SI_FAULT|5) +#define NSIGBUS 5 /* * SIGTRAP si_codes -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 733236B0082 for ; Wed, 5 Aug 2009 05:36:34 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [5/19] HWPOISON: Add basic support for poisoned pages in fault handler v3 Message-Id: <20090805093632.C3860B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:32 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: - Add a new VM_FAULT_HWPOISON error code to handle_mm_fault. Right now architectures have to explicitely enable poison page support, so this is forward compatible to all architectures. They only need to add it when they enable poison page support. - Add poison page handling in swap in fault code v2: Add missing delayacct_clear_flag (Hidehiro Kawai) v3: Really use delayacct_clear_flag (Hidehiro Kawai) Signed-off-by: Andi Kleen --- include/linux/mm.h | 3 ++- mm/memory.c | 18 +++++++++++++++--- 2 files changed, 17 insertions(+), 4 deletions(-) Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c +++ linux/mm/memory.c @@ -1319,7 +1319,8 @@ int __get_user_pages(struct task_struct if (ret & VM_FAULT_ERROR) { if (ret & VM_FAULT_OOM) return i ? 
i : -ENOMEM; - else if (ret & VM_FAULT_SIGBUS) + if (ret & + (VM_FAULT_HWPOISON|VM_FAULT_SIGBUS)) return i ? i : -EFAULT; BUG(); } @@ -2511,8 +2512,15 @@ static int do_swap_page(struct mm_struct goto out; entry = pte_to_swp_entry(orig_pte); - if (is_migration_entry(entry)) { - migration_entry_wait(mm, pmd, address); + if (unlikely(non_swap_entry(entry))) { + if (is_migration_entry(entry)) { + migration_entry_wait(mm, pmd, address); + } else if (is_hwpoison_entry(entry)) { + ret = VM_FAULT_HWPOISON; + } else { + print_bad_pte(vma, address, pte, NULL); + ret = VM_FAULT_OOM; + } goto out; } delayacct_set_flag(DELAYACCT_PF_SWAPIN); @@ -2536,6 +2544,10 @@ static int do_swap_page(struct mm_struct /* Had to read the page from swap area: Major fault */ ret = VM_FAULT_MAJOR; count_vm_event(PGMAJFAULT); + } else if (PageHWPoison(page)) { + ret = VM_FAULT_HWPOISON; + delayacct_clear_flag(DELAYACCT_PF_SWAPIN); + goto out; } lock_page(page); Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -700,11 +700,12 @@ static inline int page_mapped(struct pag #define VM_FAULT_SIGBUS 0x0002 #define VM_FAULT_MAJOR 0x0004 #define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */ +#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned page */ #define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */ #define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */ -#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS) +#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON) /* * Can be called by the pagefault handler when it gets a VM_FAULT_OOM. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
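As a rough illustration of how patches 3 and 5 fit together, the standalone sketch below reproduces the swap-type arithmetic from include/linux/swap.h and mimics the order of checks added to do_swap_page(). It assumes both CONFIG_MIGRATION and CONFIG_MEMORY_FAILURE are enabled. The enum fault_action and the classify() helper are invented for the example; the real code works on swp_entry_t and returns VM_FAULT_* codes.

```c
#include <assert.h>

/* Values copied from patch 3; both config options are assumed enabled. */
#define MAX_SWAPFILES_SHIFT 5
#define SWP_MIGRATION_NUM   2
#define SWP_HWPOISON_NUM    1
#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
#define SWP_HWPOISON        MAX_SWAPFILES
#define SWP_MIGRATION_READ  (MAX_SWAPFILES + SWP_HWPOISON_NUM)
#define SWP_MIGRATION_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + 1)

/* Hypothetical stand-in for what the fault path decides to do. */
enum fault_action { DO_SWAPIN, WAIT_MIGRATION, FAULT_HWPOISON, BAD_PTE };

/* Same test as non_swap_entry(), applied to a bare swap type. */
static int non_swap_type(unsigned int type)
{
	return type >= MAX_SWAPFILES;
}

/* Mirrors the order of checks in the patched do_swap_page(). */
static enum fault_action classify(unsigned int type)
{
	assert(type < (1u << MAX_SWAPFILES_SHIFT));

	if (!non_swap_type(type))
		return DO_SWAPIN;		/* a real swap file */
	if (type == SWP_MIGRATION_READ || type == SWP_MIGRATION_WRITE)
		return WAIT_MIGRATION;		/* migration_entry_wait() */
	if (type == SWP_HWPOISON)
		return FAULT_HWPOISON;		/* VM_FAULT_HWPOISON */
	return BAD_PTE;				/* print_bad_pte() case */
}
```

With these settings the 32 possible swap types split into 29 usable swap files followed by one poison type and two migration types, so the BAD_PTE branch is only reachable when one of the options is compiled out.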
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 48E6A6B0083 for ; Wed, 5 Aug 2009 05:36:35 -0400 (EDT)
From: Andi Kleen
References: <200908051136.682859934@firstfloor.org>
In-Reply-To: <200908051136.682859934@firstfloor.org>
Subject: [PATCH] [6/19] HWPOISON: Add various poison checks in mm/memory.c v2
Message-Id: <20090805093633.C6124B15D8@basil.firstfloor.org>
Date: Wed, 5 Aug 2009 11:36:33 +0200 (CEST)
Sender: owner-linux-mm@kvack.org
To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
List-ID:

Bail out early when hardware poisoned pages are found in page fault handling.
Since they are poisoned they should not be mapped freshly into processes,
because that would cause another (potentially deadly) machine check.

This is generally handled in the same way as OOM, just a different
error code is returned to the architecture code.

v2: Do a page unlock if needed (Fengguang Wu)

Signed-off-by: Andi Kleen

---
 mm/memory.c | 6 ++++++
 1 file changed, 6 insertions(+)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -2711,6 +2711,12 @@ static int __do_fault(struct mm_struct *
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
 		return ret;
 
+	if (unlikely(PageHWPoison(vmf.page))) {
+		if (ret & VM_FAULT_LOCKED)
+			unlock_page(vmf.page);
+		return VM_FAULT_HWPOISON;
+	}
+
 	/*
 	 * For consistency in subsequent calls, make the faulted page always
 	 * locked.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 1B9206B0088 for ; Wed, 5 Aug 2009 05:36:36 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [7/19] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2 Message-Id: <20090805093634.C8C1DB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:34 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is very similar to VM_FAULT_OOM, the only difference is that a different si_code is passed to user space and the new addr_lsb field is initialized. v2: Make the printk more verbose/unique Signed-off-by: Andi Kleen --- arch/x86/mm/fault.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) Index: linux/arch/x86/mm/fault.c =================================================================== --- linux.orig/arch/x86/mm/fault.c +++ linux/arch/x86/mm/fault.c @@ -167,6 +167,7 @@ force_sig_info_fault(int si_signo, int s info.si_errno = 0; info.si_code = si_code; info.si_addr = (void __user *)address; + info.si_addr_lsb = si_code == BUS_MCEERR_AR ? 
PAGE_SHIFT : 0; force_sig_info(si_signo, &info, tsk); } @@ -799,10 +800,12 @@ out_of_memory(struct pt_regs *regs, unsi } static void -do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address) +do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address, + unsigned int fault) { struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; + int code = BUS_ADRERR; up_read(&mm->mmap_sem); @@ -818,7 +821,15 @@ do_sigbus(struct pt_regs *regs, unsigned tsk->thread.error_code = error_code; tsk->thread.trap_no = 14; - force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); +#ifdef CONFIG_MEMORY_FAILURE + if (fault & VM_FAULT_HWPOISON) { + printk(KERN_ERR + "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n", + tsk->comm, tsk->pid, address); + code = BUS_MCEERR_AR; + } +#endif + force_sig_info_fault(SIGBUS, code, address, tsk); } static noinline void @@ -828,8 +839,8 @@ mm_fault_error(struct pt_regs *regs, uns if (fault & VM_FAULT_OOM) { out_of_memory(regs, error_code, address); } else { - if (fault & VM_FAULT_SIGBUS) - do_sigbus(regs, error_code, address); + if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON)) + do_sigbus(regs, error_code, address, fault); else BUG(); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
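From user space, the si_code and si_addr_lsb filled in by this patch could be consumed roughly as follows. This is only a sketch under stated assumptions: the fallback defines use the user-visible values implied by patch 4 (4 and 5), and struct poison_range and poisoned_range() are names invented for the example, not part of the series.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Fallbacks for headers that predate this series (values from patch 4). */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper type: the address range reported as corrupted. */
struct poison_range {
	uintptr_t start;	/* si_addr rounded down to the reported granule */
	uintptr_t len;		/* 1 << si_addr_lsb bytes */
	int must_act;		/* 1 for "action required", 0 for "optional" */
};

/*
 * Decode the fields of a SIGBUS siginfo. Returns 1 and fills *out for the
 * two machine check codes, 0 for any other SIGBUS si_code.
 */
static int poisoned_range(int si_code, const void *si_addr, short addr_lsb,
			  struct poison_range *out)
{
	if (si_code != BUS_MCEERR_AR && si_code != BUS_MCEERR_AO)
		return 0;

	assert(addr_lsb >= 0 && addr_lsb < (short)(8 * sizeof(uintptr_t)));

	out->len = (uintptr_t)1 << addr_lsb;
	out->start = (uintptr_t)si_addr & ~(out->len - 1);
	out->must_act = (si_code == BUS_MCEERR_AR);
	return 1;
}
```

An SA_SIGINFO handler for SIGBUS would pass info->si_code, info->si_addr and info->si_addr_lsb to the helper. With this patch the x86 fault path reports PAGE_SHIFT as the lsb for BUS_MCEERR_AR, so the range is the whole page containing the faulting address.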
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id A97AD6B0087 for ; Wed, 5 Aug 2009 05:36:37 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [9/19] HWPOISON: Handle hardware poisoned pages in try_to_unmap Message-Id: <20090805093636.CDE6FB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:36 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: ak@linux.intel.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: From: Andi Kleen When a page has the poison bit set replace the PTE with a poison entry. This causes the right error handling to be done later when a process runs into it. v2: add a new flag to not do that (needed for the memory-failure handler later) (Fengguang) v3: remove unnecessary is_migration_entry() test (Fengguang, Minchan) Reviewed-by: Minchan Kim Reviewed-by: Wu Fengguang Signed-off-by: Andi Kleen --- include/linux/rmap.h | 1 + mm/rmap.c | 9 ++++++++- 2 files changed, 9 insertions(+), 1 deletion(-) Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -819,7 +819,14 @@ static int try_to_unmap_one(struct page /* Update high watermark before we lower rss */ update_hiwater_rss(mm); - if (PageAnon(page)) { + if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) { + if (PageAnon(page)) + dec_mm_counter(mm, anon_rss); + else + dec_mm_counter(mm, file_rss); + set_pte_at(mm, address, pte, + swp_entry_to_pte(make_hwpoison_entry(page))); + } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(page) }; if (PageSwapCache(page)) { Index: linux/include/linux/rmap.h 
===================================================================
--- linux.orig/include/linux/rmap.h
+++ linux/include/linux/rmap.h
@@ -93,6 +93,7 @@ enum ttu_flags {
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
+	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
 };
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 4BC066B0089 for ; Wed, 5 Aug 2009 05:36:38 -0400 (EDT)
From: Andi Kleen
References: <200908051136.682859934@firstfloor.org>
In-Reply-To: <200908051136.682859934@firstfloor.org>
Subject: [PATCH] [8/19] HWPOISON: Use bitmask/action code for try_to_unmap behaviour
Message-Id: <20090805093635.CB595B15D8@basil.firstfloor.org>
Date: Wed, 5 Aug 2009 11:36:35 +0200 (CEST)
Sender: owner-linux-mm@kvack.org
To: Lee.Schermerhorn@hp.com, npiggin@suse.de, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
List-ID:

try_to_unmap currently has multiple modes (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very
straightforward, because each of these flags changes multiple behaviours
(e.g. migration not only sets up migration ptes but also turns off aging).
The different flags also interact in magic ways.

A later patch in this series adds another mode to try_to_unmap, so
this quickly becomes unmanageable.

Replace the different flags with an action code (migration, munlock, unmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straight forward and allows easier extension to new behaviours. Change all the caller to declare what they want to do. This patch is supposed to be a nop in behaviour. If anyone can prove it is not that would be a bug. Cc: Lee.Schermerhorn@hp.com Cc: npiggin@suse.de Signed-off-by: Andi Kleen --- include/linux/rmap.h | 13 ++++++++++++- mm/migrate.c | 2 +- mm/rmap.c | 40 ++++++++++++++++++++++------------------ mm/vmscan.c | 2 +- 4 files changed, 36 insertions(+), 21 deletions(-) Index: linux/include/linux/rmap.h =================================================================== --- linux.orig/include/linux/rmap.h +++ linux/include/linux/rmap.h @@ -85,7 +85,18 @@ static inline void page_dup_rmap(struct */ int page_referenced(struct page *, int is_locked, struct mem_cgroup *cnt, unsigned long *vm_flags); -int try_to_unmap(struct page *, int ignore_refs); +enum ttu_flags { + TTU_UNMAP = 0, /* unmap mode */ + TTU_MIGRATION = 1, /* migration mode */ + TTU_MUNLOCK = 2, /* munlock mode */ + TTU_ACTION_MASK = 0xff, + + TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */ + TTU_IGNORE_ACCESS = (1 << 9), /* don't age */ +}; +#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) + +int try_to_unmap(struct page *, enum ttu_flags flags); /* * Called from mm/filemap_xip.c to unmap empty zero page Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -773,7 +773,7 @@ void page_remove_rmap(struct page *page) * repeatedly from either try_to_unmap_anon or try_to_unmap_file. */ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, - int migration) + enum ttu_flags flags) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -795,11 +795,13 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. 
*/ - if (!migration) { + if (!(flags & TTU_IGNORE_MLOCK)) { if (vma->vm_flags & VM_LOCKED) { ret = SWAP_MLOCK; goto out_unmap; } + } + if (!(flags & TTU_IGNORE_ACCESS)) { if (ptep_clear_flush_young_notify(vma, address, pte)) { ret = SWAP_FAIL; goto out_unmap; @@ -839,12 +841,12 @@ static int try_to_unmap_one(struct page * pte. do_swap_page() will wait until the migration * pte is removed and then restart fault handling. */ - BUG_ON(!migration); + BUG_ON(TTU_ACTION(flags) != TTU_MIGRATION); entry = make_migration_entry(page, pte_write(pteval)); } set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); - } else if (PAGE_MIGRATION && migration) { + } else if (PAGE_MIGRATION && (TTU_ACTION(flags) == TTU_MIGRATION)) { /* Establish migration entry for a file page */ swp_entry_t entry; entry = make_migration_entry(page, pte_write(pteval)); @@ -1013,12 +1015,13 @@ static int try_to_mlock_page(struct page * vm_flags for that VMA. That should be OK, because that vma shouldn't be * 'LOCKED. 
*/ -static int try_to_unmap_anon(struct page *page, int unlock, int migration) +static int try_to_unmap_anon(struct page *page, enum ttu_flags flags) { struct anon_vma *anon_vma; struct vm_area_struct *vma; unsigned int mlocked = 0; int ret = SWAP_AGAIN; + int unlock = TTU_ACTION(flags) == TTU_MUNLOCK; if (MLOCK_PAGES && unlikely(unlock)) ret = SWAP_SUCCESS; /* default for try_to_munlock() */ @@ -1034,7 +1037,7 @@ static int try_to_unmap_anon(struct page continue; /* must visit all unlocked vmas */ ret = SWAP_MLOCK; /* saw at least one mlocked vma */ } else { - ret = try_to_unmap_one(page, vma, migration); + ret = try_to_unmap_one(page, vma, flags); if (ret == SWAP_FAIL || !page_mapped(page)) break; } @@ -1058,8 +1061,7 @@ static int try_to_unmap_anon(struct page /** * try_to_unmap_file - unmap/unlock file page using the object-based rmap method * @page: the page to unmap/unlock - * @unlock: request for unlock rather than unmap [unlikely] - * @migration: unmapping for migration - ignored if @unlock + * @flags: action and flags * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. @@ -1071,7 +1073,7 @@ static int try_to_unmap_anon(struct page * vm_flags for that VMA. That should be OK, because that vma shouldn't be * 'LOCKED. 
*/ -static int try_to_unmap_file(struct page *page, int unlock, int migration) +static int try_to_unmap_file(struct page *page, enum ttu_flags flags) { struct address_space *mapping = page->mapping; pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -1083,6 +1085,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; unsigned int mlocked = 0; + int unlock = TTU_ACTION(flags) == TTU_MUNLOCK; if (MLOCK_PAGES && unlikely(unlock)) ret = SWAP_SUCCESS; /* default for try_to_munlock() */ @@ -1095,7 +1098,7 @@ static int try_to_unmap_file(struct page continue; /* must visit all vmas */ ret = SWAP_MLOCK; } else { - ret = try_to_unmap_one(page, vma, migration); + ret = try_to_unmap_one(page, vma, flags); if (ret == SWAP_FAIL || !page_mapped(page)) goto out; } @@ -1120,7 +1123,8 @@ static int try_to_unmap_file(struct page ret = SWAP_MLOCK; /* leave mlocked == 0 */ goto out; /* no need to look further */ } - if (!MLOCK_PAGES && !migration && (vma->vm_flags & VM_LOCKED)) + if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) && + (vma->vm_flags & VM_LOCKED)) continue; cursor = (unsigned long) vma->vm_private_data; if (cursor > max_nl_cursor) @@ -1154,7 +1158,7 @@ static int try_to_unmap_file(struct page do { list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) { - if (!MLOCK_PAGES && !migration && + if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) && (vma->vm_flags & VM_LOCKED)) continue; cursor = (unsigned long) vma->vm_private_data; @@ -1194,7 +1198,7 @@ out: /** * try_to_unmap - try to remove all page table mappings to a page * @page: the page to get unmapped - * @migration: migration flag + * @flags: action and flags * * Tries to remove all the page table entries which are mapping this * page, used in the pageout path. Caller must hold the page lock. @@ -1205,16 +1209,16 @@ out: * SWAP_FAIL - the page is unswappable * SWAP_MLOCK - page is mlocked. 
*/ -int try_to_unmap(struct page *page, int migration) +int try_to_unmap(struct page *page, enum ttu_flags flags) { int ret; BUG_ON(!PageLocked(page)); if (PageAnon(page)) - ret = try_to_unmap_anon(page, 0, migration); + ret = try_to_unmap_anon(page, flags); else - ret = try_to_unmap_file(page, 0, migration); + ret = try_to_unmap_file(page, flags); if (ret != SWAP_MLOCK && !page_mapped(page)) ret = SWAP_SUCCESS; return ret; @@ -1239,8 +1243,8 @@ int try_to_munlock(struct page *page) VM_BUG_ON(!PageLocked(page) || PageLRU(page)); if (PageAnon(page)) - return try_to_unmap_anon(page, 1, 0); + return try_to_unmap_anon(page, TTU_MUNLOCK); else - return try_to_unmap_file(page, 1, 0); + return try_to_unmap_file(page, TTU_MUNLOCK); } Index: linux/mm/vmscan.c =================================================================== --- linux.orig/mm/vmscan.c +++ linux/mm/vmscan.c @@ -654,7 +654,7 @@ static unsigned long shrink_page_list(st * processes. Try to unmap it here. */ if (page_mapped(page) && mapping) { - switch (try_to_unmap(page, 0)) { + switch (try_to_unmap(page, TTU_UNMAP)) { case SWAP_FAIL: goto activate_locked; case SWAP_AGAIN: Index: linux/mm/migrate.c =================================================================== --- linux.orig/mm/migrate.c +++ linux/mm/migrate.c @@ -669,7 +669,7 @@ static int unmap_and_move(new_page_t get } /* Establish migration ptes or remove ptes */ - try_to_unmap(page, 1); + try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(page)) rc = move_to_new_page(newpage, page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
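The split between an action code in the low byte and modifier bits above it can be tried out in isolation. The sketch below copies the enum and TTU_ACTION() from the patch. The three helper functions and legacy_to_ttu() are hypothetical names added for illustration; legacy_to_ttu() only restates how the call sites in this patch translate the old (unlock, migration) arguments.

```c
#include <assert.h>

/* Copied from the patch (include/linux/rmap.h). */
enum ttu_flags {
	TTU_UNMAP = 0,			/* unmap mode */
	TTU_MIGRATION = 1,		/* migration mode */
	TTU_MUNLOCK = 2,		/* munlock mode */
	TTU_ACTION_MASK = 0xff,

	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
};
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

/* Hypothetical helpers naming the tests try_to_unmap_one() performs. */
static int ttu_checks_mlock(int flags)
{
	return !(flags & TTU_IGNORE_MLOCK);
}

static int ttu_checks_young(int flags)
{
	return !(flags & TTU_IGNORE_ACCESS);
}

static int ttu_is_migration(int flags)
{
	return TTU_ACTION(flags) == TTU_MIGRATION;
}

/*
 * Hypothetical mapping from the old argument pair to the new flags,
 * following the converted callers; @migration was ignored if @unlock.
 */
static int legacy_to_ttu(int unlock, int migration)
{
	assert((unlock == 0 || unlock == 1) &&
	       (migration == 0 || migration == 1));

	if (unlock)
		return TTU_MUNLOCK;
	if (migration)
		return TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
	return TTU_UNMAP;
}
```

Because the old migration flag both selected the migration path and suppressed the mlock and young checks, the converted migrate.c caller has to pass all three bits to keep the behaviour unchanged.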
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id B5CF36B0087 for ; Wed, 5 Aug 2009 05:36:38 -0400 (EDT)
From: Andi Kleen
References: <200908051136.682859934@firstfloor.org>
In-Reply-To: <200908051136.682859934@firstfloor.org>
Subject: [PATCH] [10/19] HWPOISON: check and isolate corrupted free pages v2
Message-Id: <20090805093637.D0A8AB15D8@basil.firstfloor.org>
Date: Wed, 5 Aug 2009 11:36:37 +0200 (CEST)
Sender: owner-linux-mm@kvack.org
To: fengguang.wu@intel.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
List-ID:

From: Wu Fengguang

If memory corruption hits the free buddy pages, we can safely ignore them.
No one will access them until page allocation time, when prep_new_page()
will automatically check and isolate the PG_hwpoison page for us (for
order-0 allocations).

This patch expands prep_new_page() to check every component page in a high
order page allocation, in order to completely stop PG_hwpoison pages from
being recirculated.

Note that the common case, allocating a single page, doesn't
do any more work than before. Allocating > order 0 does a bit more work,
but that's relatively uncommon.

This simple implementation may drop some innocent neighbor pages; hopefully
that is not a big problem because the event should be rare enough.

This patch adds some runtime costs to high order page users.

[AK: Improved description]

v2: Andi Kleen:
    Port to -mm code
    Move check into separate function.
    Don't dump stack in bad_page() for hwpoisoned pages.
Signed-off-by: Wu Fengguang Signed-off-by: Andi Kleen --- mm/page_alloc.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-) Index: linux/mm/page_alloc.c =================================================================== --- linux.orig/mm/page_alloc.c +++ linux/mm/page_alloc.c @@ -234,6 +234,12 @@ static void bad_page(struct page *page) static unsigned long nr_shown; static unsigned long nr_unshown; + /* Don't complain about poisoned pages */ + if (PageHWPoison(page)) { + __ClearPageBuddy(page); + return; + } + /* * Allow a burst of 60 reports, then keep quiet for that minute; * or allow a steady drip of one report per second. @@ -646,7 +652,7 @@ static inline void expand(struct zone *z /* * This page is about to be returned from the page allocator */ -static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) +static inline int check_new_page(struct page *page) { if (unlikely(page_mapcount(page) | (page->mapping != NULL) | @@ -655,6 +661,18 @@ static int prep_new_page(struct page *pa bad_page(page); return 1; } + return 0; +} + +static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) +{ + int i; + + for (i = 0; i < (1 << order); i++) { + struct page *p = page + i; + if (unlikely(check_new_page(p))) + return 1; + } set_page_private(page, 0); set_page_refcounted(page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 3DE636B0098 for ; Wed, 5 Aug 2009 05:36:40 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-Id: <20090805093638.D3754B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:38 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: npiggin@suse.de, akpm@linux-foundation.orgnpiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: From: Nick Piggin Extract out truncate_inode_page() out of the truncate path so that it can be used by memory-failure.c [AK: description, headers, fix typos] v2: Some white space changes from Fengguang Wu Signed-off-by: Andi Kleen --- include/linux/mm.h | 2 ++ mm/truncate.c | 29 +++++++++++++++-------------- 2 files changed, 17 insertions(+), 14 deletions(-) Index: linux/mm/truncate.c =================================================================== --- linux.orig/mm/truncate.c +++ linux/mm/truncate.c @@ -93,11 +93,11 @@ EXPORT_SYMBOL(cancel_dirty_page); * its lock, b) when a concurrent invalidate_mapping_pages got there first and * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. 
*/ -static void +static int truncate_complete_page(struct address_space *mapping, struct page *page) { if (page->mapping != mapping) - return; + return -EIO; if (page_has_private(page)) do_invalidatepage(page, 0); @@ -108,6 +108,7 @@ truncate_complete_page(struct address_sp remove_from_page_cache(page); ClearPageMappedToDisk(page); page_cache_release(page); /* pagecache ref */ + return 0; } /* @@ -135,6 +136,16 @@ invalidate_complete_page(struct address_ return ret; } +int truncate_inode_page(struct address_space *mapping, struct page *page) +{ + if (page_mapped(page)) { + unmap_mapping_range(mapping, + (loff_t)page->index << PAGE_CACHE_SHIFT, + PAGE_CACHE_SIZE, 0); + } + return truncate_complete_page(mapping, page); +} + /** * truncate_inode_pages - truncate range of pages specified by start & end byte offsets * @mapping: mapping to truncate @@ -196,12 +207,7 @@ void truncate_inode_pages_range(struct a unlock_page(page); continue; } - if (page_mapped(page)) { - unmap_mapping_range(mapping, - (loff_t)page_index<index<index > next) next = page->index; next++; - truncate_complete_page(mapping, page); unlock_page(page); } pagevec_release(&pvec); Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -809,6 +809,8 @@ static inline void unmap_shared_mapping_ extern int vmtruncate(struct inode * inode, loff_t offset); extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); +int truncate_inode_page(struct address_space *mapping, struct page *page); + #ifdef CONFIG_MMU extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id ECFEF6B008C for ; Wed, 5 Aug 2009 05:36:41 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [12/19] HWPOISON: Add invalidate_inode_page Message-Id: <20090805093639.D5FBEB15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:39 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: fengguang.wu@intel.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.orgfengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: From: Wu Fengguang Add a simple way to invalidate a single page This is just a refactoring of the truncate.c code. Originally from Fengguang, modified by Andi Kleen. Signed-off-by: Andi Kleen --- include/linux/mm.h | 2 ++ mm/truncate.c | 26 ++++++++++++++++++++------ 2 files changed, 22 insertions(+), 6 deletions(-) Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -811,6 +811,8 @@ extern int vmtruncate_range(struct inode int truncate_inode_page(struct address_space *mapping, struct page *page); +int invalidate_inode_page(struct page *page); + #ifdef CONFIG_MMU extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, unsigned int flags); Index: linux/mm/truncate.c =================================================================== --- linux.orig/mm/truncate.c +++ linux/mm/truncate.c @@ -146,6 +146,24 @@ int truncate_inode_page(struct address_s return truncate_complete_page(mapping, page); } +/* + * Safely invalidate one page from its pagecache mapping. + * It only drops clean, unused pages. The page must be locked. 
+ * + * Returns 1 if the page is successfully invalidated, otherwise 0. + */ +int invalidate_inode_page(struct page *page) +{ + struct address_space *mapping = page_mapping(page); + if (!mapping) + return 0; + if (PageDirty(page) || PageWriteback(page)) + return 0; + if (page_mapped(page)) + return 0; + return invalidate_complete_page(mapping, page); +} + /** * truncate_inode_pages - truncate range of pages specified by start & end byte offsets * @mapping: mapping to truncate @@ -312,12 +330,8 @@ unsigned long invalidate_mapping_pages(s if (lock_failed) continue; - if (PageDirty(page) || PageWriteback(page)) - goto unlock; - if (page_mapped(page)) - goto unlock; - ret += invalidate_complete_page(mapping, page); -unlock: + ret += invalidate_inode_page(page); + unlock_page(page); if (next > end) break; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id F36926B0095 for ; Wed, 5 Aug 2009 05:36:41 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [13/19] HWPOISON: Define a new error_remove_page address space op for async truncation Message-Id: <20090805093640.D8856B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:40 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Truncating metadata pages is not safe right now before we haven't audited all file systems. To enable truncation only for data address space define a new address_space callback error_remove_page. 
This is used for memory_failure.c memory error handling. This can be then set to truncate_inode_page() This patch just defines the new operation and adds documentation. Callers and users come in followon patches. Signed-off-by: Andi Kleen --- Documentation/filesystems/vfs.txt | 7 +++++++ include/linux/fs.h | 1 + include/linux/mm.h | 1 + mm/truncate.c | 17 +++++++++++++++++ 4 files changed, 26 insertions(+) Index: linux/include/linux/fs.h =================================================================== --- linux.orig/include/linux/fs.h +++ linux/include/linux/fs.h @@ -595,6 +595,7 @@ struct address_space_operations { int (*launder_page) (struct page *); int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); + int (*error_remove_page)(struct address_space *, struct page *); }; /* Index: linux/Documentation/filesystems/vfs.txt =================================================================== --- linux.orig/Documentation/filesystems/vfs.txt +++ linux/Documentation/filesystems/vfs.txt @@ -536,6 +536,7 @@ struct address_space_operations { /* migrate the contents of a page to the specified target */ int (*migratepage) (struct page *, struct page *); int (*launder_page) (struct page *); + int (*error_remove_page) (struct mapping *mapping, struct page *page); }; writepage: called by the VM to write a dirty page to backing store. @@ -694,6 +695,12 @@ struct address_space_operations { prevent redirtying the page, it is kept locked during the whole operation. + error_remove_page: normally set to generic_error_remove_page if truncation + is ok for this address space. Used for memory failure handling. + Setting this implies you deal with pages going away under you, + unless you have them locked or reference counts increased. 
+ + The File Object =============== Index: linux/mm/truncate.c =================================================================== --- linux.orig/mm/truncate.c +++ linux/mm/truncate.c @@ -147,6 +147,23 @@ int truncate_inode_page(struct address_s } /* + * Used to get rid of pages on hardware memory corruption. + */ +int generic_error_remove_page(struct address_space *mapping, struct page *page) +{ + if (!mapping) + return -EINVAL; + /* + * Only punch for normal data pages for now. + * Handling other types like directories would need more auditing. + */ + if (!S_ISREG(mapping->host->i_mode)) + return -EIO; + return truncate_inode_page(mapping, page); +} +EXPORT_SYMBOL(generic_error_remove_page); + +/* * Safely invalidate one page from its pagecache mapping. * It only drops clean, unused pages. The page must be locked. * Index: linux/include/linux/mm.h =================================================================== --- linux.orig/include/linux/mm.h +++ linux/include/linux/mm.h @@ -810,6 +810,7 @@ extern int vmtruncate(struct inode * ino extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); +int generic_error_remove_page(struct address_space *mapping, struct page *page); int invalidate_inode_page(struct page *page); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 24AF76B0098 for ; Wed, 5 Aug 2009 05:36:43 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [14/19] HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process Message-Id: <20090805093641.DB176B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:41 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: This allows processes to override their early/late kill behaviour on hardware memory errors. Typically, applications that are memory error aware are better off with early kill (see the error as soon as possible); all others are better off with late kill (only see the error when it really impacts execution). There's a global sysctl, but this way an application can set its own policy. We're using two bits: one to signify that the process stated its intention (PF_MCE_PROCESS) and one for the policy it chose (PF_MCE_EARLY). I also made the prctl future proof by enforcing that the unused arguments are 0. The state is inherited by children for now. I've been considering resetting it on exec, but that is not done for now (TBD). Note this makes us officially run out of process flags on 32bit, but the next patch can easily add another field. Manpage patch will be supplied separately.
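To make the argument handling easier to review, here is a small userspace model of the flag logic. It is a sketch that mirrors the kernel/sys.c hunk of this patch: `mce_kill_model()` is a hypothetical name, it operates on a plain `unsigned long` instead of `current->flags`, and the flag values are copied from the sched.h hunk.

```c
#include <errno.h>

/* Flag values as defined in the sched.h hunk of this patch. */
#define PF_MCE_PROCESS 0x00000080UL	/* process set its own policy */
#define PF_MCE_EARLY   0x08000000UL	/* that policy is early kill */

/*
 * Hypothetical model of the PR_MCE_KILL case:
 *   arg2 == 0: drop the per process policy (arg3 must be 0)
 *   arg2 == 1: set a per process policy, arg3 != 0 means early kill
 * arg4 and arg5 are reserved and must be 0.
 */
static int mce_kill_model(unsigned long *flags, unsigned long arg2,
			  unsigned long arg3, unsigned long arg4,
			  unsigned long arg5)
{
	if (arg4 | arg5)
		return -EINVAL;
	switch (arg2) {
	case 0:
		if (arg3 != 0)
			return -EINVAL;
		*flags &= ~PF_MCE_PROCESS;
		break;
	case 1:
		*flags |= PF_MCE_PROCESS;
		if (arg3 != 0)
			*flags |= PF_MCE_EARLY;
		else
			*flags &= ~PF_MCE_EARLY;
		break;
	default:
		return -EINVAL;
	}
	return 0;
}
```

With this model, a memory error aware application would pass arg2 = 1, arg3 = 1 for early kill, and arg2 = 0 to fall back to the global sysctl.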
Signed-off-by: Andi Kleen --- include/linux/prctl.h | 2 ++ include/linux/sched.h | 2 ++ kernel/sys.c | 22 ++++++++++++++++++++++ 3 files changed, 26 insertions(+) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1674,6 +1674,7 @@ extern cputime_t task_gtime(struct task_ #define PF_EXITPIDONE 0x00000008 /* pi exit done on shut down */ #define PF_VCPU 0x00000010 /* I'm a virtual CPU */ #define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */ +#define PF_MCE_PROCESS 0x00000080 /* process policy on mce errors */ #define PF_SUPERPRIV 0x00000100 /* used super-user privileges */ #define PF_DUMPCORE 0x00000200 /* dumped core */ #define PF_SIGNALED 0x00000400 /* killed by a signal */ @@ -1693,6 +1694,7 @@ extern cputime_t task_gtime(struct task_ #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */ #define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */ +#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ Index: linux/kernel/sys.c =================================================================== --- linux.orig/kernel/sys.c +++ linux/kernel/sys.c @@ -1528,6 +1528,28 @@ SYSCALL_DEFINE5(prctl, int, option, unsi current->timer_slack_ns = arg2; error = 0; break; + case PR_MCE_KILL: + if (arg4 | arg5) + return -EINVAL; + switch (arg2) { + case 0: + if (arg3 != 0) + return -EINVAL; + current->flags &= ~PF_MCE_PROCESS; + break; + case 1: + current->flags |= PF_MCE_PROCESS; + if (arg3 != 0) + current->flags |= PF_MCE_EARLY; + else + current->flags &= ~PF_MCE_EARLY; + break; + default: + return -EINVAL; + 
} + error = 0; + break; + default: error = -EINVAL; break; Index: linux/include/linux/prctl.h =================================================================== --- linux.orig/include/linux/prctl.h +++ linux/include/linux/prctl.h @@ -88,4 +88,6 @@ #define PR_TASK_PERF_COUNTERS_DISABLE 31 #define PR_TASK_PERF_COUNTERS_ENABLE 32 +#define PR_MCE_KILL 33 + #endif /* _LINUX_PRCTL_H */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 574F26B0096 for ; Wed, 5 Aug 2009 05:36:45 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-Id: <20090805093643.E0C00B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:43 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Enable removing of corrupted pages through truncation for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs These should cover most server needs. I chose the set of migration aware file systems for this for now, assuming they have been especially audited. But in general it should be safe for all file systems on the data area that support read/write and truncate. Caveat: the hardware error handler does not take i_mutex for now before calling the truncate function. Is that ok? 
Cc: tytso@mit.edu Cc: hch@infradead.org Cc: mfasheh@suse.com Cc: aia21@cantab.net Cc: hugh.dickins@tiscali.co.uk Cc: swhiteho@redhat.com Signed-off-by: Andi Kleen --- fs/ext2/inode.c | 2 ++ fs/ext3/inode.c | 3 +++ fs/ext4/inode.c | 4 ++++ fs/gfs2/aops.c | 3 +++ fs/ntfs/aops.c | 2 ++ fs/ocfs2/aops.c | 1 + fs/xfs/linux-2.6/xfs_aops.c | 1 + mm/shmem.c | 1 + 8 files changed, 17 insertions(+) Index: linux/fs/gfs2/aops.c =================================================================== --- linux.orig/fs/gfs2/aops.c +++ linux/fs/gfs2/aops.c @@ -1135,6 +1135,7 @@ static const struct address_space_operat .direct_IO = gfs2_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations gfs2_ordered_aops = { @@ -1151,6 +1152,7 @@ static const struct address_space_operat .direct_IO = gfs2_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations gfs2_jdata_aops = { @@ -1166,6 +1168,7 @@ static const struct address_space_operat .invalidatepage = gfs2_invalidatepage, .releasepage = gfs2_releasepage, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; void gfs2_set_aops(struct inode *inode) Index: linux/fs/ntfs/aops.c =================================================================== --- linux.orig/fs/ntfs/aops.c +++ linux/fs/ntfs/aops.c @@ -1550,6 +1550,7 @@ const struct address_space_operations nt .migratepage = buffer_migrate_page, /* Move a page cache page from one physical page to an other. */ + .error_remove_page = generic_error_remove_page, }; /** @@ -1569,6 +1570,7 @@ const struct address_space_operations nt .migratepage = buffer_migrate_page, /* Move a page cache page from one physical page to an other. 
*/ + .error_remove_page = generic_error_remove_page, }; #ifdef NTFS_RW Index: linux/fs/ocfs2/aops.c =================================================================== --- linux.orig/fs/ocfs2/aops.c +++ linux/fs/ocfs2/aops.c @@ -1968,4 +1968,5 @@ const struct address_space_operations oc .releasepage = ocfs2_releasepage, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; Index: linux/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- linux.orig/fs/xfs/linux-2.6/xfs_aops.c +++ linux/fs/xfs/linux-2.6/xfs_aops.c @@ -1636,4 +1636,5 @@ const struct address_space_operations xf .direct_IO = xfs_vm_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; Index: linux/mm/shmem.c =================================================================== --- linux.orig/mm/shmem.c +++ linux/mm/shmem.c @@ -2421,6 +2421,7 @@ static const struct address_space_operat .write_end = shmem_write_end, #endif .migratepage = migrate_page, + .error_remove_page = generic_error_remove_page, }; static const struct file_operations shmem_file_operations = { Index: linux/fs/ext2/inode.c =================================================================== --- linux.orig/fs/ext2/inode.c +++ linux/fs/ext2/inode.c @@ -819,6 +819,7 @@ const struct address_space_operations ex .writepages = ext2_writepages, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; const struct address_space_operations ext2_aops_xip = { @@ -837,6 +838,7 @@ const struct address_space_operations ex .direct_IO = ext2_direct_IO, .writepages = ext2_writepages, .migratepage = buffer_migrate_page, + .error_remove_page = generic_error_remove_page, }; /* Index: linux/fs/ext3/inode.c 
=================================================================== --- linux.orig/fs/ext3/inode.c +++ linux/fs/ext3/inode.c @@ -1819,6 +1819,7 @@ static const struct address_space_operat .direct_IO = ext3_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext3_writeback_aops = { @@ -1834,6 +1835,7 @@ static const struct address_space_operat .direct_IO = ext3_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext3_journalled_aops = { @@ -1848,6 +1850,7 @@ static const struct address_space_operat .invalidatepage = ext3_invalidatepage, .releasepage = ext3_releasepage, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; void ext3_set_aops(struct inode *inode) Index: linux/fs/ext4/inode.c =================================================================== --- linux.orig/fs/ext4/inode.c +++ linux/fs/ext4/inode.c @@ -3373,6 +3373,7 @@ static const struct address_space_operat .direct_IO = ext4_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext4_writeback_aops = { @@ -3388,6 +3389,7 @@ static const struct address_space_operat .direct_IO = ext4_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; static const struct address_space_operations ext4_journalled_aops = { @@ -3402,6 +3404,7 @@ static const struct address_space_operat .invalidatepage = ext4_invalidatepage, .releasepage = ext4_releasepage, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = 
generic_error_remove_page, }; static const struct address_space_operations ext4_da_aops = { @@ -3418,6 +3421,7 @@ static const struct address_space_operat .direct_IO = ext4_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .error_remove_page = generic_error_remove_page, }; void ext4_set_aops(struct inode *inode) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 876226B0098 for ; Wed, 5 Aug 2009 05:36:45 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [17/19] HWPOISON: Enable error_remove_page for NFS Message-Id: <20090805093644.E3607B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:44 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: Trond.Myklebust@netapp.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Enable hardware memory error handling for NFS Truncation of data pages at runtime should be safe in NFS, even when it doesn't support migration so far. 
Cc: Trond.Myklebust@netapp.com Signed-off-by: Andi Kleen --- fs/nfs/file.c | 1 + 1 file changed, 1 insertion(+) Index: linux/fs/nfs/file.c =================================================================== --- linux.orig/fs/nfs/file.c +++ linux/fs/nfs/file.c @@ -480,6 +480,7 @@ const struct address_space_operations nf .releasepage = nfs_release_page, .direct_IO = nfs_direct_IO, .launder_page = nfs_launder_page, + .error_remove_page = generic_error_remove_page, }; /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id D70996B0098 for ; Wed, 5 Aug 2009 05:36:46 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [18/19] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Message-Id: <20090805093645.E5F95B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:45 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Impact: optional, useful for debugging Add a new madvise sub command to inject poison for some pages in a process' address space. This is useful for testing the poison page handling. Open issues: - This patch allows root to tie up arbitrary amounts of memory. Should this be disabled inside containers? - There's a small race window between getting the page and injecting. The patch drops the ref count because otherwise memory_failure complains about dangling references. In theory, with a multithreaded injector one could inject poison into a page belonging to another process this way.
Not a serious issue right now. v2: Use write flag for get_user_pages to make sure to always get a fresh page v3: Don't request write mapping (Fengguang Wu) Signed-off-by: Andi Kleen --- include/asm-generic/mman-common.h | 1 + mm/madvise.c | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) Index: linux/mm/madvise.c =================================================================== --- linux.orig/mm/madvise.c +++ linux/mm/madvise.c @@ -207,6 +207,36 @@ static long madvise_remove(struct vm_are return error; } +#ifdef CONFIG_MEMORY_FAILURE +/* + * Error injection support for memory error handling. + */ +static int madvise_hwpoison(unsigned long start, unsigned long end) +{ + int ret = -EIO; + /* + * RED-PEN + * This allows to tie up arbitary amounts of memory. + * Might be a good idea to disable it inside containers even for root. + */ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + for (; start < end; start += PAGE_SIZE) { + struct page *p; + int ret = get_user_pages(current, current->mm, start, 1, + 0, 0, &p, NULL); + if (ret != 1) + return ret; + printk(KERN_INFO "Injecting memory failure for page %lx at %lx\n", + page_to_pfn(p), start); + /* Ignore return value for now */ + __memory_failure(page_to_pfn(p), 0, 1); + put_page(p); + } + return ret; +} +#endif + static long madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, int behavior) @@ -307,6 +337,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, int write; size_t len; +#ifdef CONFIG_MEMORY_FAILURE + if (behavior == MADV_HWPOISON) + return madvise_hwpoison(start, start+len_in); +#endif if (!madvise_behavior_valid(behavior)) return error; Index: linux/include/asm-generic/mman-common.h =================================================================== --- linux.orig/include/asm-generic/mman-common.h +++ linux/include/asm-generic/mman-common.h @@ -34,6 +34,7 @@ #define MADV_REMOVE 9 /* remove these pages & resources */ #define 
MADV_DONTFORK 10 /* don't inherit across fork */ #define MADV_DOFORK 11 /* do inherit across fork */ +#define MADV_HWPOISON 12 /* poison a page for testing */ /* compatibility flags */ #define MAP_FILE 0 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 3E9BF6B0096 for ; Wed, 5 Aug 2009 05:36:47 -0400 (EDT) From: Andi Kleen References: <200908051136.682859934@firstfloor.org> In-Reply-To: <200908051136.682859934@firstfloor.org> Subject: [PATCH] [15/19] HWPOISON: The high level memory error handler in the VM v7 Message-Id: <20090805093642.DE156B15D8@basil.firstfloor.org> Date: Wed, 5 Aug 2009 11:36:42 +0200 (CEST) Sender: owner-linux-mm@kvack.org To: npiggin@suse.de, riel@redhat.com, akpm@linux-foundation.orgnpiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: Add the high level memory handler that poisons pages that got corrupted by hardware (typically by a two bit flip in a DIMM or a cache) on the Linux level. The goal is to prevent everyone from accessing these pages in the future. This done at the VM level by marking a page hwpoisoned and doing the appropriate action based on the type of page it is. The code that does this is portable and lives in mm/memory-failure.c To quote the overview comment: * High level machine check handler. Handles pages reported by the * hardware as being corrupted usually due to a 2bit ECC memory or cache * failure. * * This focuses on pages detected as corrupted in the background. * When the current CPU tries to consume corruption the currently * running process can just be killed directly instead. 
This implies * that if the error cannot be handled for some reason it's safe to * just ignore it because no corruption has been consumed yet. Instead * when that happens another machine check will happen. * * Handles page cache pages in various states. The tricky part * here is that we can access any page asynchronous to other VM * users, because memory failures could happen anytime and anywhere, * possibly violating some of their assumptions. This is why this code * has to be extremely careful. Generally it tries to use normal locking * rules, as in get the standard locks, even if that means the * error handling takes potentially a long time. * * Some of the operations here are somewhat inefficient and have non * linear algorithmic complexity, because the data structures have not * been optimized for this case. This is in particular the case * for the mapping from a vma to a process. Since this case is expected * to be rare we hope we can get away with this. There are in principle two strategies to kill processes on poison: - just unmap the data and wait for an actual reference before killing - kill as soon as corruption is detected. Both have advantages and disadvantages and should be used in different situations. Right now both are implemented and can be switched with a new sysctl vm.memory_failure_early_kill The default is early kill. The patch does some rmap data structure walking on its own to collect processes to kill. This is unusual because normally all rmap data structure knowledge is in rmap.c only. I put it here for now to keep everything together and rmap knowledge has been seeping out anyways v2: Fix anon vma unlock crash (noticed by Johannes Weiner ) Handle pages on free list correctly (also noticed by Johannes) Fix inverted try_to_release_page check (found by Chris Mason) Add documentation for the new sysctl. Various other cleanups/comment fixes. v3: Use blockable signal for AO SIGBUS for better qemu handling. 
Numerous fixes from Fengguang Wu:
New code layout for the table (redone by AK)
Move the hwpoison bit setting before the lock (Fengguang Wu)
Some code cleanups (Fengguang Wu, AK)
Add missing lru_drain (Fengguang Wu)
Do more checks for valid mappings (inspired by patch from Fengguang)
Handle free pages and fixes for clean pages (Fengguang)
Removed swap cache handling for now, needs more work
Better mapping checks to avoid races (Fengguang)
Fix swapcache (Fengguang)
Handle private2 pages too (Fengguang)
v4: Various fixes based on review comments from Nick Piggin
Document locking order.
Improved comments.
Slightly improved description
Remove bogus hunk.
Wait properly for writeback pages (Nick Piggin)
v5: Improve various comments
Handle page_address_in_vma() failure better by SIGKILL and also
make message debugging only
Clean up printks
Remove redundant PageWriteback check (Nick Piggin)
Add missing clear_page_mlock
Reformat state table to be <80 columns again
Use truncate helper instead of manual truncate in me_pagecache_*
Check for metadata buffer pages and reject them.
A few cleanups.
v6:
Fix a printk broken in the last round of cleanups.
More minor cleanups and fixes based on comments from Fengguang Wu.
Rename /proc/meminfo Header to "HardwareCorrupted"
Add a printk for the failed mapping case (Fengguang Wu)
Better clean page check (Fengguang Wu)
v7:
Use lru_add_drain_all
Use isolate_lru result in final status
[Merge with Fengguang's version. Only cherry picked changes.]
Fix bad pfn handling in action_result (Fengguang)
Reverse tasklist lock locking order (Nick Piggin)
Check if VMA has really page mapped (Fengguang)
Add a missing page->mapping test (Fengguang)
Account all bad pages (Fengguang)
Misc comment fixes (Fengguang, Andi)
Add sysctl to force panic on memory failure
Only truncate if address space has it enabled, otherwise invalidate
Now selectable in Kconfig by the user.
Allow elevated reference count and return status
Misc changes

Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen
Acked-by: Rik van Riel
Reviewed-by: Hidehiro Kawai

---
 Documentation/sysctl/vm.txt |   41 ++
 fs/proc/meminfo.c           |    9
 include/linux/mm.h          |    7
 include/linux/rmap.h        |    1
 kernel/sysctl.c             |   25 +
 mm/Kconfig                  |    9
 mm/Makefile                 |    1
 mm/filemap.c                |    4
 mm/memory-failure.c         |  828 ++++++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c                   |    7
 10 files changed, 929 insertions(+), 3 deletions(-)

Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile
+++ linux/mm/Makefile
@@ -40,5 +40,6 @@ obj-$(CONFIG_SMP) += allocpercpu.o
 endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux/mm/memory-failure.c
===================================================================
--- /dev/null
+++ linux/mm/memory-failure.c
@@ -0,0 +1,828 @@
+/*
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Authors: Andi Kleen, Fengguang Wu
+ *
+ * This software may be redistributed and/or modified under the terms of
+ * the GNU General Public License ("GPL") version 2 only as published by the
+ * Free Software Foundation.
+ *
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful.
+ * Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * The operation to map back from RMAP chains to processes has to walk
+ * the complete process list and has non linear complexity with the number
+ * of mappings. In short it can be quite slow. But since memory corruptions
+ * are rare we hope to get away with this.
+ */
+
+/*
+ * Notebook:
+ * - hugetlb needs more code
+ * - kcore/oldmem/vmcore/mem/kmem check for hwpoison pages
+ * - pass bad pages to kdump next kernel
+ */
+#define DEBUG 1
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "internal.h"
+
+/* For better testing 1; Change default later to 0 */
+int sysctl_memory_failure_early_kill __read_mostly = 1;
+
+int sysctl_memory_failure_recovery __read_mostly = 1;
+
+atomic_long_t mce_bad_pages __read_mostly = ATOMIC_LONG_INIT(0);
+
+/*
+ * Send all the processes who have the page mapped an ``action optional''
+ * signal.
+ */
+static int kill_proc_ao(struct task_struct *t, unsigned long addr, int trapno,
+			unsigned long pfn)
+{
+	struct siginfo si;
+	int ret;
+
+	printk(KERN_ERR
+		"MCE %#lx: Killing %s:%d early due to hardware memory corruption\n",
+		pfn, t->comm, t->pid);
+	si.si_signo = SIGBUS;
+	si.si_errno = 0;
+	si.si_code = BUS_MCEERR_AO;
+	si.si_addr = (void *)addr;
+#ifdef __ARCH_SI_TRAPNO
+	si.si_trapno = trapno;
+#endif
+	si.si_addr_lsb = PAGE_SHIFT;
+	/*
+	 * Don't use force here, it's convenient if the signal
+	 * can be temporarily blocked.
+	 * This could cause a loop when the user sets SIGBUS
+	 * to SIG_IGN, but hopefully no one will do that?
+	 */
+	ret = send_sig_info(SIGBUS, &si, t);  /* synchronous? */
+	if (ret < 0)
+		printk(KERN_INFO "MCE: Error sending signal to %s:%d: %d\n",
+		       t->comm, t->pid, ret);
+	return ret;
+}
+
+/*
+ * Kill all processes that have a poisoned page mapped and then isolate
+ * the page.
+ *
+ * General strategy:
+ * Find all processes having the page mapped and kill them.
+ * But we keep a page reference around so that the page is not
+ * actually freed yet.
+ * Then stash the page away
+ *
+ * There's no convenient way to get back to mapped processes
+ * from the VMAs. So do a brute-force search over all
+ * running processes.
+ *
+ * Remember that machine checks are not common (or rather
+ * if they are common you have other problems), so this shouldn't
+ * be a performance issue.
+ *
+ * Also there are some races possible while we get from the
+ * error detection to actually handle it.
+ */
+
+struct to_kill {
+	struct list_head nd;
+	struct task_struct *tsk;
+	unsigned long addr;
+	unsigned addr_valid:1;
+};
+
+/*
+ * Failure handling: if we can't find or can't kill a process there's
+ * not much we can do. We just print a message and ignore otherwise.
+ */
+
+/*
+ * Schedule a process for later kill.
+ * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ * TBD would GFP_NOIO be enough?
+ */
+static void add_to_kill(struct task_struct *tsk, struct page *p,
+		       struct vm_area_struct *vma,
+		       struct list_head *to_kill,
+		       struct to_kill **tkc)
+{
+	struct to_kill *tk;
+
+	if (*tkc) {
+		tk = *tkc;
+		*tkc = NULL;
+	} else {
+		tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
+		if (!tk) {
+			printk(KERN_ERR
+		"MCE: Out of memory while machine check handling\n");
+			return;
+		}
+	}
+	tk->addr = page_address_in_vma(p, vma);
+	tk->addr_valid = 1;
+
+	/*
+	 * In theory we don't have to kill when the page was
+	 * munmapped. But it could be also a mremap. Since that's
+	 * likely very rare kill anyways just out of paranoia, but use
+	 * a SIGKILL because the error is not contained anymore.
+	 */
+	if (tk->addr == -EFAULT) {
+		pr_debug("MCE: Unable to find user space address %lx in %s\n",
+			page_to_pfn(p), tsk->comm);
+		tk->addr_valid = 0;
+	}
+	get_task_struct(tsk);
+	tk->tsk = tsk;
+	list_add_tail(&tk->nd, to_kill);
+}
+
+/*
+ * Kill the processes that have been collected earlier.
+ *
+ * Only do anything when DOIT is set, otherwise just free the list
+ * (this is used for clean pages which do not need killing)
+ * Also when FAIL is set do a force kill because something went
+ * wrong earlier.
+ */
+static void kill_procs_ao(struct list_head *to_kill, int doit, int trapno,
+			  int fail, unsigned long pfn)
+{
+	struct to_kill *tk, *next;
+
+	list_for_each_entry_safe (tk, next, to_kill, nd) {
+		if (doit) {
+			/*
+			 * In case something went wrong with munmapping
+			 * make sure the process doesn't catch the
+			 * signal and then access the memory. Just kill it.
+			 * the signal handlers
+			 */
+			if (fail || tk->addr_valid == 0) {
+				printk(KERN_ERR
+		"MCE %#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+				force_sig(SIGKILL, tk->tsk);
+			}
+
+			/*
+			 * In theory the process could have mapped
+			 * something else on the address in-between. We could
+			 * check for that, but we need to tell the
+			 * process anyways.
+			 */
+			else if (kill_proc_ao(tk->tsk, tk->addr, trapno,
+					      pfn) < 0)
+				printk(KERN_ERR
+		"MCE %#lx: Cannot send advisory machine check signal to %s:%d\n",
+					pfn, tk->tsk->comm, tk->tsk->pid);
+		}
+		put_task_struct(tk->tsk);
+		kfree(tk);
+	}
+}
+
+static int task_early_kill(struct task_struct *tsk)
+{
+	if (!tsk->mm)
+		return 0;
+	if (tsk->flags & PF_MCE_PROCESS)
+		return !!(tsk->flags & PF_MCE_EARLY);
+	return sysctl_memory_failure_early_kill;
+}
+
+/*
+ * Collect processes when the error hit an anonymous page.
+ */
+static void collect_procs_anon(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct anon_vma *av;
+
+	read_lock(&tasklist_lock);
+	av = page_lock_anon_vma(page);
+	if (av == NULL)	/* Not actually mapped anymore */
+		goto out;
+	for_each_process (tsk) {
+		if (!task_early_kill(tsk))
+			continue;
+		list_for_each_entry (vma, &av->head, anon_vma_node) {
+			if (!page_mapped_in_vma(page, vma))
+				continue;
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	page_unlock_anon_vma(av);
+out:
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+			      struct to_kill **tkc)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+	struct prio_tree_iter iter;
+	struct address_space *mapping = page->mapping;
+
+	/*
+	 * A note on the locking order between the two locks.
+	 * We don't rely on this particular order.
+	 * If you have some other code that needs a different order
+	 * feel free to switch them around. Or add a reverse link
+	 * from mm_struct to task_struct, then this could be all
+	 * done without taking tasklist_lock and looping over all tasks.
+	 */
+
+	read_lock(&tasklist_lock);
+	spin_lock(&mapping->i_mmap_lock);
+	for_each_process(tsk) {
+		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+		if (!task_early_kill(tsk))
+			continue;
+
+		vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff,
+				      pgoff) {
+			/*
+			 * Send early kill signal to tasks where a vma covers
+			 * the page but the corrupted page is not necessarily
+			 * mapped in its pte.
+			 * Assume applications who requested early kill want
+			 * to be informed of all such data corruptions.
+			 */
+			if (vma->vm_mm == tsk->mm)
+				add_to_kill(tsk, page, vma, to_kill, tkc);
+		}
+	}
+	spin_unlock(&mapping->i_mmap_lock);
+	read_unlock(&tasklist_lock);
+}
+
+/*
+ * Collect the processes who have the corrupted page mapped to kill.
+ * This is done in two steps for locking reasons.
+ * First preallocate one tokill structure outside the spin locks,
+ * so that we can kill at least one process reasonably reliably.
+ */
+static void collect_procs(struct page *page, struct list_head *tokill)
+{
+	struct to_kill *tk;
+
+	if (!page->mapping)
+		return;
+
+	tk = kmalloc(sizeof(struct to_kill), GFP_NOIO);
+	if (!tk)
+		return;
+	if (PageAnon(page))
+		collect_procs_anon(page, tokill, &tk);
+	else
+		collect_procs_file(page, tokill, &tk);
+	kfree(tk);
+}
+
+/*
+ * Error handlers for various types of pages.
+ */
+
+enum outcome {
+	FAILED,		/* Error handling failed */
+	DELAYED,	/* Will be handled later */
+	IGNORED,	/* Error safely ignored */
+	RECOVERED,	/* Successfully recovered */
+};
+
+static const char *action_name[] = {
+	[FAILED] = "Failed",
+	[DELAYED] = "Delayed",
+	[IGNORED] = "Ignored",
+	[RECOVERED] = "Recovered",
+};
+
+/*
+ * Error hit kernel page.
+ * Do nothing, try to be lucky and not touch this instead. For a few cases we
+ * could be more sophisticated.
+ */
+static int me_kernel(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Already poisoned page.
+ */
+static int me_ignore(struct page *p, unsigned long pfn)
+{
+	return IGNORED;
+}
+
+/*
+ * Page in unknown state. Do nothing.
+ */
+static int me_unknown(struct page *p, unsigned long pfn)
+{
+	printk(KERN_ERR "MCE %#lx: Unknown page state\n", pfn);
+	return FAILED;
+}
+
+/*
+ * Free memory
+ */
+static int me_free(struct page *p, unsigned long pfn)
+{
+	return DELAYED;
+}
+
+/*
+ * Clean (or cleaned) page cache page.
+ */
+static int me_pagecache_clean(struct page *p, unsigned long pfn)
+{
+	int err;
+	int ret = FAILED;
+	struct address_space *mapping;
+
+	if (!isolate_lru_page(p))
+		page_cache_release(p);
+
+	/*
+	 * Now truncate the page in the page cache. This is really
+	 * more like a "temporary hole punch"
+	 * Don't do this for block devices when someone else
+	 * has a reference, because it could be file system metadata
+	 * and that's not safe to truncate.
+	 */
+	mapping = page_mapping(p);
+	if (!mapping) {
+		/*
+		 * Page has been torn down in the meanwhile
+		 */
+		return FAILED;
+	}
+
+	/*
+	 * Truncation is a bit tricky. Enable it per file system for now.
+	 *
+	 * Open: to take i_mutex or not for this? Right now we don't.
+	 */
+	if (mapping->a_ops->error_remove_page) {
+		err = mapping->a_ops->error_remove_page(mapping, p);
+		if (err != 0) {
+			printk(KERN_INFO "MCE %#lx: Failed to punch page: %d\n",
+					pfn, err);
+		} else if (page_has_private(p) &&
+				!try_to_release_page(p, GFP_NOIO)) {
+			pr_debug("MCE %#lx: failed to release buffers\n", pfn);
+		} else {
+			ret = RECOVERED;
+		}
+	} else {
+		/*
+		 * If the file system doesn't support it just invalidate
+		 * This fails on dirty or anything with private pages
+		 */
+		if (invalidate_inode_page(p))
+			ret = RECOVERED;
+		else
+			printk(KERN_INFO "MCE %#lx: Failed to invalidate\n",
+				pfn);
+	}
+	return ret;
+}
+
+/*
+ * Dirty cache page
+ * Issues: when the error hit a hole page the error is not properly
+ * propagated.
+ */
+static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+{
+	struct address_space *mapping = page_mapping(p);
+
+	SetPageError(p);
+	/* TBD: print more information about the file. */
+	if (mapping) {
+		/*
+		 * IO error will be reported by write(), fsync(), etc.
+		 * who check the mapping.
+		 * This way the application knows that something went
+		 * wrong with its dirty file data.
+		 *
+		 * There's one open issue:
+		 *
+		 * The EIO will be only reported on the next IO
+		 * operation and then cleared through the IO map.
+		 * Normally Linux has two mechanisms to pass IO error
+		 * first through the AS_EIO flag in the address space
+		 * and then through the PageError flag in the page.
+		 * Since we drop pages on memory failure handling the
+		 * only mechanism open to use is through AS_EIO.
+		 *
+		 * This has the disadvantage that it gets cleared on
+		 * the first operation that returns an error, while
+		 * the PageError bit is more sticky and only cleared
+		 * when the page is reread or dropped. If an
+		 * application assumes it will always get error on
+		 * fsync, but does other operations on the fd before
+		 * and the page is dropped in between then the error
+		 * will not be properly reported.
+		 *
+		 * This can already happen even without hwpoisoned
+		 * pages: first on metadata IO errors (which only
+		 * report through AS_EIO) or when the page is dropped
+		 * at the wrong time.
+		 *
+		 * So right now we assume that the application DTRT on
+		 * the first EIO, but we're not worse than other parts
+		 * of the kernel.
+		 */
+		mapping_set_error(mapping, EIO);
+	}
+
+	return me_pagecache_clean(p, pfn);
+}
+
+/*
+ * Clean and dirty swap cache.
+ *
+ * Dirty swap cache page is tricky to handle. The page could live both in page
+ * cache and swap cache(ie. page is freshly swapped in). So it could be
+ * referenced concurrently by 2 types of PTEs:
+ * normal PTEs and swap PTEs. We try to handle them consistently by calling
+ * try_to_unmap(TTU_IGNORE_HWPOISON) to convert the normal PTEs to swap PTEs,
+ * and then
+ *      - clear dirty bit to prevent IO
+ *      - remove from LRU
+ *      - but keep in the swap cache, so that when we return to it on
+ *        a later page fault, we know the application is accessing
+ *        corrupted data and shall be killed (we installed simple
+ *        interception code in do_swap_page to catch it).
+ *
+ * Clean swap cache pages can be directly isolated.
+ * A later page fault will
+ * bring in the known good data from disk.
+ */
+static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+{
+	int ret = FAILED;
+
+	ClearPageDirty(p);
+	/* Trigger EIO in shmem: */
+	ClearPageUptodate(p);
+
+	if (!isolate_lru_page(p)) {
+		page_cache_release(p);
+		ret = DELAYED;
+	}
+
+	return ret;
+}
+
+static int me_swapcache_clean(struct page *p, unsigned long pfn)
+{
+	int ret = FAILED;
+
+	if (!isolate_lru_page(p)) {
+		page_cache_release(p);
+		ret = RECOVERED;
+	}
+	delete_from_swap_cache(p);
+	return ret;
+}
+
+/*
+ * Huge pages. Needs work.
+ * Issues:
+ * No rmap support so we cannot find the original mapper. In theory could walk
+ * all MMs and look for the mappings, but that would be non atomic and racy.
+ * Need rmap for hugepages for this. Alternatively we could employ a heuristic,
+ * like just walking the current process and hoping it has it mapped (that
+ * should be usually true for the common "shared database cache" case)
+ * Should handle free huge pages and dequeue them too, but this needs to
+ * handle huge page accounting correctly.
+ */
+static int me_huge_page(struct page *p, unsigned long pfn)
+{
+	return FAILED;
+}
+
+/*
+ * Various page states we can handle.
+ *
+ * A page state is defined by its current page->flags bits.
+ * The table matches them in order and calls the right handler.
+ *
+ * This is quite tricky because we can access page at any time
+ * in its life cycle, so all accesses have to be extremely careful.
+ *
+ * This is not complete. More states could be added.
+ * For any missing state don't attempt recovery.
+ */
+
+#define dirty		(1UL << PG_dirty)
+#define sc		(1UL << PG_swapcache)
+#define unevict		(1UL << PG_unevictable)
+#define mlock		(1UL << PG_mlocked)
+#define writeback	(1UL << PG_writeback)
+#define lru		(1UL << PG_lru)
+#define swapbacked	(1UL << PG_swapbacked)
+#define head		(1UL << PG_head)
+#define tail		(1UL << PG_tail)
+#define compound	(1UL << PG_compound)
+#define slab		(1UL << PG_slab)
+#define buddy		(1UL << PG_buddy)
+#define reserved	(1UL << PG_reserved)
+
+static struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	char *msg;
+	int (*action)(struct page *p, unsigned long pfn);
+} error_states[] = {
+	{ reserved,	reserved,	"reserved kernel",	me_ignore },
+	{ buddy,	buddy,		"free kernel",	me_free },
+
+	/*
+	 * Could in theory check if slab page is free or if we can drop
+	 * currently unused objects without touching them. But just
+	 * treat it as standard kernel for now.
+	 */
+	{ slab,		slab,		"kernel slab",	me_kernel },
+
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{ head,		head,		"huge",		me_huge_page },
+	{ tail,		tail,		"huge",		me_huge_page },
+#else
+	{ compound,	compound,	"huge",		me_huge_page },
+#endif
+
+	{ sc|dirty,	sc|dirty,	"swapcache",	me_swapcache_dirty },
+	{ sc|dirty,	sc,		"swapcache",	me_swapcache_clean },
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+	{ unevict|dirty, unevict|dirty,	"unevictable LRU", me_pagecache_dirty},
+	{ unevict,	unevict,	"unevictable LRU", me_pagecache_clean},
+#endif
+
+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+	{ mlock|dirty,	mlock|dirty,	"mlocked LRU",	me_pagecache_dirty },
+	{ mlock,	mlock,		"mlocked LRU",	me_pagecache_clean },
+#endif
+
+	{ lru|dirty,	lru|dirty,	"LRU",		me_pagecache_dirty },
+	{ lru|dirty,	lru,		"clean LRU",	me_pagecache_clean },
+	{ swapbacked,	swapbacked,	"anonymous",	me_pagecache_clean },
+
+	/*
+	 * Catchall entry: must be at end.
+	 */
+	{ 0,		0,		"unknown page state",	me_unknown },
+};
+
+#undef lru
+
+static void action_result(unsigned long pfn, char *msg, int result)
+{
+	struct page *page = NULL;
+	if (pfn_valid(pfn))
+		page = pfn_to_page(pfn);
+
+	printk(KERN_ERR "MCE %#lx: %s%s page recovery: %s\n",
+		pfn,
+		page && PageDirty(page) ? "dirty " : "",
+		msg, action_name[result]);
+}
+
+static int page_action(struct page_state *ps, struct page *p,
+			unsigned long pfn, int ref)
+{
+	int result;
+
+	result = ps->action(p, pfn);
+	action_result(pfn, ps->msg, result);
+	if (page_count(p) != 1 + ref)
+		printk(KERN_ERR
+		       "MCE %#lx: %s page still referenced by %d users\n",
+		       pfn, ps->msg, page_count(p) - 1);
+
+	/* Could do more checks here if page looks ok */
+	/*
+	 * Could adjust zone counters here to correct for the missing page.
+	 */
+
+	return result == RECOVERED ? 0 : -EBUSY;
+}
+
+#define N_UNMAP_TRIES 5
+
+/*
+ * Do all that is necessary to remove user space mappings. Unmap
+ * the pages and send SIGBUS to the processes if the data was dirty.
+ */
+static void hwpoison_user_mappings(struct page *p, unsigned long pfn,
+				  int trapno)
+{
+	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	struct address_space *mapping;
+	LIST_HEAD(tokill);
+	int ret;
+	int i;
+	int kill = 1;
+
+	if (PageReserved(p) || PageCompound(p) || PageSlab(p))
+		return;
+
+	if (!PageLRU(p))
+		lru_add_drain_all();
+
+	/*
+	 * This check implies we don't kill processes if their pages
+	 * are in the swap cache early. Those are always late kills.
+	 */
+	if (!page_mapped(p))
+		return;
+
+	if (PageSwapCache(p)) {
+		printk(KERN_ERR
+		       "MCE %#lx: keeping poisoned page in swap cache\n", pfn);
+		ttu |= TTU_IGNORE_HWPOISON;
+	}
+
+	/*
+	 * Propagate the dirty bit from PTEs to struct page first, because we
+	 * need this to decide if we should kill or just drop the page.
+	 */
+	mapping = page_mapping(p);
+	if (!PageDirty(p) && mapping && mapping_cap_writeback_dirty(mapping)) {
+		if (page_mkclean(p)) {
+			SetPageDirty(p);
+		} else {
+			kill = 0;
+			ttu |= TTU_IGNORE_HWPOISON;
+			printk(KERN_INFO
+	"MCE %#lx: corrupted page was clean: dropped without side effects\n",
+				pfn);
+		}
+	}
+
+	/*
+	 * First collect all the processes that have the page
+	 * mapped in dirty form. This has to be done before try_to_unmap,
+	 * because ttu takes the rmap data structures down.
+	 *
+	 * Error handling: We ignore errors here because
+	 * there's nothing that can be done.
+	 */
+	if (kill)
+		collect_procs(p, &tokill);
+
+	/*
+	 * try_to_unmap can fail temporarily due to races.
+	 * Try a few times (RED-PEN better strategy?)
+	 */
+	for (i = 0; i < N_UNMAP_TRIES; i++) {
+		ret = try_to_unmap(p, ttu);
+		if (ret == SWAP_SUCCESS)
+			break;
+		pr_debug("MCE %#lx: try_to_unmap retry needed %d\n", pfn, ret);
+	}
+
+	if (ret != SWAP_SUCCESS)
+		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
+				pfn, page_mapcount(p));
+
+	/*
+	 * Now that the dirty bit has been propagated to the
+	 * struct page and all unmaps done we can decide if
+	 * killing is needed or not. Only kill when the page
+	 * was dirty, otherwise the tokill list is merely
+	 * freed. When there was a problem unmapping earlier
+	 * use a more force-full uncatchable kill to prevent
+	 * any accesses to the poisoned memory.
+	 */
+	kill_procs_ao(&tokill, !!PageDirty(p), trapno,
+		      ret != SWAP_SUCCESS, pfn);
+}
+
+int __memory_failure(unsigned long pfn, int trapno, int ref)
+{
+	struct page_state *ps;
+	struct page *p;
+	int res;
+
+	if (!sysctl_memory_failure_recovery)
+		panic("Memory failure from trap %d on page %lx", trapno, pfn);
+
+	if (!pfn_valid(pfn)) {
+		action_result(pfn, "memory outside kernel control", IGNORED);
+		return -EIO;
+	}
+
+	p = pfn_to_page(pfn);
+	if (TestSetPageHWPoison(p)) {
+		action_result(pfn, "already hardware poisoned", IGNORED);
+		return 0;
+	}
+
+	atomic_long_add(1, &mce_bad_pages);
+
+	/*
+	 * We need/can do nothing about count=0 pages.
+	 * 1) it's a free page, and therefore in safe hand:
+	 *    prep_new_page() will be the gate keeper.
+	 * 2) it's part of a non-compound high order page.
+	 *    Implies some kernel user: cannot stop them from
+	 *    R/W the page; let's pray that the page has been
+	 *    used and will be freed some time later.
+	 * In fact it's dangerous to directly bump up page count from 0,
+	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
+	 */
+	if (!get_page_unless_zero(compound_head(p))) {
+		action_result(pfn, "free or high order kernel", IGNORED);
+		return PageBuddy(compound_head(p)) ? 0 : -EBUSY;
+	}
+
+	/*
+	 * Lock the page and wait for writeback to finish.
+	 * It's very difficult to mess with pages currently under IO
+	 * and in many cases impossible, so we just avoid it here.
+	 */
+	lock_page_nosync(p);
+	wait_on_page_writeback(p);
+
+	/*
+	 * Now take care of user space mappings.
+	 */
+	hwpoison_user_mappings(p, pfn, trapno);
+
+	/*
+	 * Torn down by someone else?
+	 */
+	if (PageLRU(p) && !PageSwapCache(p) && p->mapping == NULL) {
+		action_result(pfn, "already truncated LRU", IGNORED);
+		res = 0;
+		goto out;
+	}
+
+	res = -EBUSY;
+	for (ps = error_states;; ps++) {
+		if ((p->flags & ps->mask) == ps->res) {
+			res = page_action(ps, p, pfn, ref);
+			break;
+		}
+	}
+out:
+	unlock_page(p);
+	return res;
+}
+EXPORT_SYMBOL_GPL(__memory_failure);
+
+/**
+ * memory_failure - Handle memory failure of a page.
+ * @pfn: Page Number of the corrupted page
+ * @trapno: Trap number reported in the signal to user space.
+ *
+ * This function is called by the low level machine check code
+ * of an architecture when it detects hardware memory corruption
+ * of a page. It tries its best to recover, which includes
+ * dropping pages, killing processes etc.
+ *
+ * The function is primarily of use for corruptions that
+ * happen outside the current execution context (e.g. when
+ * detected by a background scrubber)
+ *
+ * Must run in process context (e.g. a work queue) with interrupts
+ * enabled and no spinlocks held.
+ */
+void memory_failure(unsigned long pfn, int trapno)
+{
+	__memory_failure(pfn, trapno, 0);
+}
Index: linux/include/linux/mm.h
===================================================================
--- linux.orig/include/linux/mm.h
+++ linux/include/linux/mm.h
@@ -1324,5 +1324,12 @@ void vmemmap_populate_print_last(void);
 extern int account_locked_memory(struct mm_struct *mm, struct rlimit *rlim,
 				 size_t size);
 extern void refund_locked_memory(struct mm_struct *mm, size_t size);
+
+extern void memory_failure(unsigned long pfn, int trapno);
+extern int __memory_failure(unsigned long pfn, int trapno, int ref);
+extern int sysctl_memory_failure_early_kill;
+extern int sysctl_memory_failure_recovery;
+extern atomic_long_t mce_bad_pages;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -1356,6 +1356,31 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= &scan_unevictable_handler,
 	},
+#ifdef CONFIG_MEMORY_FAILURE
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_early_kill",
+		.data		= &sysctl_memory_failure_early_kill,
+		.maxlen		= sizeof(sysctl_memory_failure_early_kill),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "memory_failure_recovery",
+		.data		= &sysctl_memory_failure_recovery,
+		.maxlen		= sizeof(sysctl_memory_failure_recovery),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec_minmax,
+		.strategy	= &sysctl_intvec,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
+
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c
+++ linux/fs/proc/meminfo.c
@@ -95,7 +95,11 @@ static int meminfo_proc_show(struct seq_
 		"Committed_AS:   %8lu kB\n"
 		"VmallocTotal:   %8lu kB\n"
 		"VmallocUsed:    %8lu kB\n"
-		"VmallocChunk:   %8lu kB\n",
+		"VmallocChunk:   %8lu kB\n"
+#ifdef CONFIG_MEMORY_FAILURE
+		"HardwareCorrupted: %8lu kB\n"
+#endif
+		,
 		K(i.totalram),
 		K(i.freeram),
 		K(i.bufferram),
@@ -140,6 +144,9 @@ static int meminfo_proc_show(struct seq_
 		(unsigned long)VMALLOC_TOTAL >> 10,
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
+#ifdef CONFIG_MEMORY_FAILURE
+		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig
+++ linux/mm/Kconfig
@@ -233,6 +233,15 @@ config DEFAULT_MMAP_MIN_ADDR
 	  /proc/sys/vm/mmap_min_addr tunable.
 
 
+config MEMORY_FAILURE
+	depends on MMU
+	depends on X86_MCE
+	bool "Enable memory failure recovery"
+	help
+	  Enables code to recover from some memory failures on systems
+	  with MCA recovery. This allows a system to continue running
+	  even when some of its memory has uncorrected errors.
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/Documentation/sysctl/vm.txt
===================================================================
--- linux.orig/Documentation/sysctl/vm.txt
+++ linux/Documentation/sysctl/vm.txt
@@ -32,6 +32,8 @@ Currently, these files are in /proc/sys/
 - legacy_va_layout
 - lowmem_reserve_ratio
 - max_map_count
+- memory_failure_early_kill
+- memory_failure_recovery
 - min_free_kbytes
 - min_slab_ratio
 - min_unmapped_ratio
@@ -53,7 +55,6 @@ Currently, these files are in /proc/sys/
 - vfs_cache_pressure
 - zone_reclaim_mode
 
-
 ==============================================================
 
 block_dump
@@ -275,6 +276,44 @@ e.g., up to one or two maps per allocati
 
 The default value is 65536.
 
+=============================================================
+
+memory_failure_early_kill:
+
+Control how to kill processes when uncorrected memory error (typically
+a 2bit error in a memory module) is detected in the background by hardware
+that cannot be handled by the kernel. In some cases (like the page
+still having a valid copy on disk) the kernel will handle the failure
+transparently without affecting any applications. But if there is
+no other uptodate copy of the data it will kill to prevent any data
+corruptions from propagating.
+
+1: Kill all processes that have the corrupted and not reloadable page mapped
+as soon as the corruption is detected. Note this is not supported
+for a few types of pages, like kernel internally allocated data or
+the swap cache, but works for the majority of user pages.
+
+0: Only unmap the corrupted page from all processes and only kill a process
+who tries to access it.
+
+The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
+handle this if they want to.
+
+This is only active on architectures/platforms with advanced machine
+check handling and depends on the hardware capabilities.
+
+Applications can override this setting individually with the PR_MCE_KILL prctl
+
+==============================================================
+
+memory_failure_recovery
+
+Enable memory failure recovery (when supported by the platform)
+
+1: Attempt recovery.
+
+0: Always panic on a memory failure.
+
 ==============================================================
 
 min_free_kbytes:
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -105,6 +105,10 @@
  *
  *  ->task->proc_lock
  *    ->dcache_lock		(proc_pid_lookup)
+ *
+ *  (code doesn't rely on that order, so you could switch it around)
+ *  ->tasklist_lock		(memory_failure, collect_procs_ao)
+ *    ->i_mmap_lock
  */
 
 /*
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -36,6 +36,11 @@
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within inode_lock in __sync_single_inode)
+ *
+ * (code doesn't rely on that order so it could be switched around)
+ * ->tasklist_lock
+ *   anon_vma->lock      (memory_failure, collect_procs_anon)
+ *     pte map lock
  */
 
 #include
@@ -311,7 +316,7 @@ pte_t *page_check_address(struct page *p
  * if the page is not mapped into the page tables of this VMA. Only
  * valid for normal file or anonymous VMAs.
  */
-static int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 {
 	unsigned long address;
 	pte_t *pte;
Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h
+++ linux/include/linux/rmap.h
@@ -129,6 +129,7 @@ int try_to_munlock(struct page *);
  */
 struct anon_vma *page_lock_anon_vma(struct page *page);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 #else	/* !CONFIG_MMU */
 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id B9F3C6B0096 for ; Wed, 5 Aug 2009 05:36:48 -0400 (EDT)
From: Andi Kleen
References: <200908051136.682859934@firstfloor.org>
In-Reply-To: <200908051136.682859934@firstfloor.org>
Subject: [PATCH] [19/19] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
Message-Id: <20090805093646.EE6F0B15D8@basil.firstfloor.org>
Date: Wed, 5 Aug 2009 11:36:46 +0200 (CEST)
Sender: owner-linux-mm@kvack.org
To: akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
List-ID:

Useful for some testing scenarios, although specific testing is often
done better through MADV_POISON.

This can be done with the x86 level MCE injector too, but this interface
allows doing it independently of low level x86 changes.

Open issues:
Should be disabled for cgroups.
Signed-off-by: Andi Kleen

---
 mm/Kconfig           |    6 +++++-
 mm/Makefile          |    1 +
 mm/hwpoison-inject.c |   40 ++++++++++++++++++++++++++++++++++++++++
 mm/madvise.c         |    2 +-
 4 files changed, 47 insertions(+), 2 deletions(-)

Index: linux/mm/hwpoison-inject.c
===================================================================
--- /dev/null
+++ linux/mm/hwpoison-inject.c
@@ -0,0 +1,40 @@
+/* Inject a hwpoison memory failure on a arbitary pfn */
+#include
+#include
+#include
+#include
+
+static struct dentry *hwpoison_dir, *corrupt_pfn;
+
+static int hwpoison_inject(void *data, u64 val)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	printk(KERN_INFO "Injecting memory failure at pfn %Lx\n", val);
+	return __memory_failure(val, 18, 0);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(hwpoison_fops, NULL, hwpoison_inject, "%lli\n");
+
+static void pfn_inject_exit(void)
+{
+	if (hwpoison_dir)
+		debugfs_remove_recursive(hwpoison_dir);
+}
+
+static int pfn_inject_init(void)
+{
+	hwpoison_dir = debugfs_create_dir("hwpoison", NULL);
+	if (hwpoison_dir == NULL)
+		return -ENOMEM;
+	corrupt_pfn = debugfs_create_file("corrupt-pfn", 0600, hwpoison_dir,
+					  NULL, &hwpoison_fops);
+	if (corrupt_pfn == NULL) {
+		pfn_inject_exit();
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+module_init(pfn_inject_init);
+module_exit(pfn_inject_exit);
Index: linux/mm/Kconfig
===================================================================
--- linux.orig/mm/Kconfig
+++ linux/mm/Kconfig
@@ -236,12 +236,16 @@ config DEFAULT_MMAP_MIN_ADDR
 config MEMORY_FAILURE
 	depends on MMU
 	depends on X86_MCE
-	bool "Enable memory failure recovery"
+	bool "Enable recovery from hardware memory errors"
 	help
 	  Enables code to recover from some memory failures on systems
 	  with MCA recovery. This allows a system to continue running
 	  even when some of its memory has uncorrected errors.
+config HWPOISON_INJECT
+	tristate "Poison pages injector"
+	depends on MEMORY_FAILURE && DEBUG_KERNEL
+
 config NOMMU_INITIAL_TRIM_EXCESS
 	int "Turn on mmap() excess space trimming before booting"
 	depends on !MMU
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile
+++ linux/mm/Makefile
@@ -41,5 +41,6 @@ endif
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
+obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
Index: linux/mm/madvise.c
===================================================================
--- linux.orig/mm/madvise.c
+++ linux/mm/madvise.c
@@ -213,7 +213,7 @@ static long madvise_remove(struct vm_are
  */
 static int madvise_hwpoison(unsigned long start, unsigned long end)
 {
-	int ret = -EIO;
+	int ret = 0;
 	/*
 	 * RED-PEN
 	 * This allows to tie up arbitary amounts of memory.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 7656B6B0089 for ; Wed, 5 Aug 2009 06:20:06 -0400 (EDT) Date: Wed, 5 Aug 2009 12:20:08 +0200 From: Nick Piggin Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805102008.GB17190@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805093638.D3754B15D8@basil.firstfloor.org> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: On Wed, Aug 05, 2009 at 11:36:38AM +0200, Andi Kleen wrote: > > From: Nick Piggin > > Extract out truncate_inode_page() out of the truncate path so that > it can be used by memory-failure.c > > [AK: description, headers, fix typos] > v2: Some white space changes from Fengguang Wu > > Signed-off-by: Andi Kleen > > --- > include/linux/mm.h | 2 ++ > mm/truncate.c | 29 +++++++++++++++-------------- > 2 files changed, 17 insertions(+), 14 deletions(-) > > Index: linux/mm/truncate.c > =================================================================== > --- linux.orig/mm/truncate.c > +++ linux/mm/truncate.c > @@ -93,11 +93,11 @@ EXPORT_SYMBOL(cancel_dirty_page); > * its lock, b) when a concurrent invalidate_mapping_pages got there first and > * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. 
>   */
> -static void
> +static int
>  truncate_complete_page(struct address_space *mapping, struct page *page)
>  {
>  	if (page->mapping != mapping)
> -		return;
> +		return -EIO;

Hmm, at this point, the page must have been removed from pagecache,
so I don't know if you need to pass an error back?

>  	if (page_has_private(page))
>  		do_invalidatepage(page, 0);
> @@ -108,6 +108,7 @@ truncate_complete_page(struct address_sp
>  	remove_from_page_cache(page);
>  	ClearPageMappedToDisk(page);
>  	page_cache_release(page);	/* pagecache ref */
> +	return 0;
>  }
>
>  /*
> @@ -135,6 +136,16 @@ invalidate_complete_page(struct address_
>  	return ret;
>  }
>
> +int truncate_inode_page(struct address_space *mapping, struct page *page)
> +{
> +	if (page_mapped(page)) {
> +		unmap_mapping_range(mapping,
> +				   (loff_t)page->index << PAGE_CACHE_SHIFT,
> +				   PAGE_CACHE_SIZE, 0);
> +	}
> +	return truncate_complete_page(mapping, page);
> +}
> +
>  /**
>   * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
>   * @mapping: mapping to truncate
> @@ -196,12 +207,7 @@ void truncate_inode_pages_range(struct a
>  				unlock_page(page);
>  				continue;
>  			}
> -			if (page_mapped(page)) {
> -				unmap_mapping_range(mapping,
> -				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
> -				  PAGE_CACHE_SIZE, 0);
> -			}
> -			truncate_complete_page(mapping, page);
> +			truncate_inode_page(mapping, page);
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
> @@ -238,15 +244,10 @@ void truncate_inode_pages_range(struct a
>  				break;
>  			lock_page(page);
>  			wait_on_page_writeback(page);
> -			if (page_mapped(page)) {
> -				unmap_mapping_range(mapping,
> -				  (loff_t)page->index<<PAGE_CACHE_SHIFT,
> -				  PAGE_CACHE_SIZE, 0);
> -			}
> +			truncate_inode_page(mapping, page);
>  			if (page->index > next)
>  				next = page->index;
>  			next++;
> -			truncate_complete_page(mapping, page);
>  			unlock_page(page);
>  		}
>  		pagevec_release(&pvec);
> Index: linux/include/linux/mm.h
> ===================================================================
> --- linux.orig/include/linux/mm.h
> +++ linux/include/linux/mm.h
>
@@ -809,6 +809,8 @@ static inline void unmap_shared_mapping_ > extern int vmtruncate(struct inode * inode, loff_t offset); > extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); > > +int truncate_inode_page(struct address_space *mapping, struct page *page); > + > #ifdef CONFIG_MMU > extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, > unsigned long address, unsigned int flags); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id A11276B004F for ; Wed, 5 Aug 2009 07:12:46 -0400 (EDT) Date: Wed, 5 Aug 2009 07:12:31 -0400 From: Christoph Hellwig Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090805111231.GA19532@infradead.org> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805093643.E0C00B15D8@basil.firstfloor.org> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: On Wed, Aug 05, 2009 at 11:36:43AM +0200, Andi Kleen wrote: > > Enable removing of corrupted pages through truncation > for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs > These should cover most server needs. > > I chose the set of migration aware file systems for this > for now, assuming they have been especially audited. 
> But in general it should be safe for all file systems > on the data area that support read/write and truncate. > > Caveat: the hardware error handler does not take i_mutex > for now before calling the truncate function. Is that ok? It will probably need locking, e.g. the iolock in XFS. I'll need to take a look at the actual implementation of generic_error_remove_page to make sense of this. Is there any way for us to test this functionality without introducing real hardware problems? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id E1F946B004F for ; Wed, 5 Aug 2009 08:17:00 -0400 (EDT) Date: Wed, 5 Aug 2009 19:52:42 +0800 From: Wu Fengguang Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090805115242.GB6737@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <20090805111231.GA19532@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805111231.GA19532@infradead.org> Sender: owner-linux-mm@kvack.org To: Christoph Hellwig Cc: Andi Kleen , "tytso@mit.edu" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hidehiro.kawai.ez@hitachi.com" List-ID: On Wed, Aug 05, 2009 at 07:12:31PM +0800, Christoph Hellwig wrote: > On Wed, Aug 05, 2009 at 11:36:43AM +0200, Andi Kleen wrote: > > > > Enable removing of corrupted pages through truncation > > for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs > > These should cover 
most server needs.
> >
> > I chose the set of migration aware file systems for this
> > for now, assuming they have been especially audited.
> > But in general it should be safe for all file systems
> > on the data area that support read/write and truncate.
> >
> > Caveat: the hardware error handler does not take i_mutex
> > for now before calling the truncate function. Is that ok?
>
> It will probably need locking, e.g. the iolock in XFS. I'll
> need to take a look at the actual implementation of
> generic_error_remove_page to make sense of this.

In patch 13, it simply calls truncate_inode_page() for S_ISREG inodes.

Nick suggests calling truncate_inode_page() with i_mutex held. Sure, we can
do mutex_trylock(i_mutex), but we'd appreciate it if some fs gurus could
demonstrate the bad consequences of not doing so, thanks!

> Is there any way for us to test this functionality without introducing
> real hardware problems?

We have some additional patches (ugly, but they work for now) that export
interfaces for injecting hwpoison into selected types of pages. They can
guarantee that only data/metadata pages of the selected fs will be poisoned.
Based on that, we can do all kinds of stress testing in user space.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 240986B005D for ; Wed, 5 Aug 2009 08:38:00 -0400 (EDT) Date: Wed, 5 Aug 2009 20:37:49 +0800 From: Wu Fengguang Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805123749.GA9443@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805102008.GB17190@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , "akpm@linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hidehiro.kawai.ez@hitachi.com" List-ID: On Wed, Aug 05, 2009 at 06:20:08PM +0800, Nick Piggin wrote: > On Wed, Aug 05, 2009 at 11:36:38AM +0200, Andi Kleen wrote: > > > > From: Nick Piggin > > > > Extract out truncate_inode_page() out of the truncate path so that > > it can be used by memory-failure.c > > > > [AK: description, headers, fix typos] > > v2: Some white space changes from Fengguang Wu > > > > Signed-off-by: Andi Kleen > > > > --- > > include/linux/mm.h | 2 ++ > > mm/truncate.c | 29 +++++++++++++++-------------- > > 2 files changed, 17 insertions(+), 14 deletions(-) > > > > Index: linux/mm/truncate.c > > =================================================================== > > --- linux.orig/mm/truncate.c > > +++ linux/mm/truncate.c > > @@ -93,11 +93,11 @@ EXPORT_SYMBOL(cancel_dirty_page); > > * its lock, b) when a concurrent invalidate_mapping_pages got there first and > > * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. 
> >   */
> > -static void
> > +static int
> >  truncate_complete_page(struct address_space *mapping, struct page *page)
> >  {
> >  	if (page->mapping != mapping)
> > -		return;
> > +		return -EIO;
>
> Hmm, at this point, the page must have been removed from pagecache,
> so I don't know if you need to pass an error back?

I think so too. When called from hwpoison, the page count and lock have
both been taken, so at least _in this case_,
        (page->mapping != mapping)
can equally be written as
        (page->mapping == NULL)

But anyway, the return value is now ignored in the upper layer :)

Thanks,
Fengguang

>
> >  	if (page_has_private(page))
> >  		do_invalidatepage(page, 0);
> > @@ -108,6 +108,7 @@ truncate_complete_page(struct address_sp
> >  	remove_from_page_cache(page);
> >  	ClearPageMappedToDisk(page);
> >  	page_cache_release(page);	/* pagecache ref */
> > +	return 0;
> >  }
> >
> >  /*
> > @@ -135,6 +136,16 @@ invalidate_complete_page(struct address_
> >  	return ret;
> >  }
> >
> > +int truncate_inode_page(struct address_space *mapping, struct page *page)
> > +{
> > +	if (page_mapped(page)) {
> > +		unmap_mapping_range(mapping,
> > +				   (loff_t)page->index << PAGE_CACHE_SHIFT,
> > +				   PAGE_CACHE_SIZE, 0);
> > +	}
> > +	return truncate_complete_page(mapping, page);
> > +}
> > +
> >  /**
> >   * truncate_inode_pages - truncate range of pages specified by start & end byte offsets
> >   * @mapping: mapping to truncate
> > @@ -196,12 +207,7 @@ void truncate_inode_pages_range(struct a
> >  				unlock_page(page);
> >  				continue;
> >  			}
> > -			if (page_mapped(page)) {
> > -				unmap_mapping_range(mapping,
> > -				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
> > -				  PAGE_CACHE_SIZE, 0);
> > -			}
> > -			truncate_complete_page(mapping, page);
> > +			truncate_inode_page(mapping, page);
> >  			unlock_page(page);
> >  		}
> >  		pagevec_release(&pvec);
> > @@ -238,15 +244,10 @@ void truncate_inode_pages_range(struct a
> >  				break;
> >  			lock_page(page);
> >  			wait_on_page_writeback(page);
> > -			if (page_mapped(page)) {
> > -				unmap_mapping_range(mapping,
> > -				  (loff_t)page->index<<PAGE_CACHE_SHIFT,
> > -				  PAGE_CACHE_SIZE, 0);
> > -			}
> > +			truncate_inode_page(mapping, page);
> >  			if (page->index > next)
> >  				next = page->index;
> >  			next++;
> > -			truncate_complete_page(mapping, page);
> >  			unlock_page(page);
> >  		}
> >  		pagevec_release(&pvec);
> > Index: linux/include/linux/mm.h
> > ===================================================================
> > --- linux.orig/include/linux/mm.h
> > +++ linux/include/linux/mm.h
> > @@ -809,6 +809,8 @@ static inline void unmap_shared_mapping_
> >  extern int vmtruncate(struct inode * inode, loff_t offset);
> >  extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
> >
> > +int truncate_inode_page(struct address_space *mapping, struct page *page);
> > +
> >  #ifdef CONFIG_MMU
> >  extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			unsigned long address, unsigned int flags);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id E75116B004F for ; Wed, 5 Aug 2009 09:46:05 -0400 (EDT) Date: Wed, 5 Aug 2009 15:46:08 +0200 From: Andi Kleen Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805134607.GH11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805102008.GB17190@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: On Wed, Aug 05, 2009 at 12:20:08PM +0200, Nick Piggin wrote: > > truncate_complete_page(struct address_space *mapping, struct page *page) > > { > > if (page->mapping != mapping) > > - return; > > + return -EIO; > > Hmm, at this point, the page must have been removed from pagecache, > so I don't know if you need to pass an error back? It could be reused, which would be bad for us? The final check is the page error count in the end anyways. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 269666B004F for ; Wed, 5 Aug 2009 09:50:53 -0400 (EDT) Date: Wed, 5 Aug 2009 15:50:57 +0200 From: Andi Kleen Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090805135056.GI11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <20090805111231.GA19532@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805111231.GA19532@infradead.org> Sender: owner-linux-mm@kvack.org To: Christoph Hellwig Cc: Andi Kleen , tytso@mit.edu, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: On Wed, Aug 05, 2009 at 07:12:31AM -0400, Christoph Hellwig wrote: > On Wed, Aug 05, 2009 at 11:36:43AM +0200, Andi Kleen wrote: > > > > Enable removing of corrupted pages through truncation > > for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs > > These should cover most server needs. > > > > I chose the set of migration aware file systems for this > > for now, assuming they have been especially audited. > > But in general it should be safe for all file systems > > on the data area that support read/write and truncate. > > > > Caveat: the hardware error handler does not take i_mutex > > for now before calling the truncate function. Is that ok? > > It will probably need locking, e.g. the iolock in XFS. I'll > need to take a look at the actual implementation of > generic_error_remove_page to make sense of this. Thanks for the review. It just calls the lower end of the truncate path. 
http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commitdiff;h=b5d3322870b0f6e369809674d2c73857fd6ef480

and

http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commitdiff;h=4c2df4fa9e4687bde815883fabc602ccc5961272

Is that ok?

It should be roughly the same as what a normal truncate does, except that
it doesn't take i_mutex currently. We're careful not to do it for
non-regular inodes.

>
> Is there any way for us to test this functionality without introducing
> real hardware problems?

Yes, there are three different injectors to choose from :)

The easiest one is usually the madvise(MADV_POISON) injector.
Just map a suitable page and

There's a test program in

http://git.kernel.org/?p=utils/cpu/mce/mce-test.git;a=blob;f=tsrc/tinjpage.c;h=954e1edab765d1c141f693ae9767ba9d5491c1aa;hb=HEAD

that can be extended for new tests.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 110C76B004F for ; Wed, 5 Aug 2009 10:01:48 -0400 (EDT) Date: Wed, 5 Aug 2009 16:01:45 +0200 From: Nick Piggin Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805140145.GB28563@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805134607.GH11385@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com List-ID: On Wed, Aug 05, 2009 at 03:46:08PM +0200, Andi Kleen wrote: > On Wed, Aug 05, 2009 at 12:20:08PM +0200, Nick Piggin wrote: > > > truncate_complete_page(struct address_space *mapping, struct page *page) > > > { > > > if (page->mapping != mapping) > > > - return; > > > + return -EIO; > > > > Hmm, at this point, the page must have been removed from pagecache, > > so I don't know if you need to pass an error back? > > It could be reused, which would be bad for us? I haven't brought up the caller at this point, but IIRC you had the page locked and mapping confirmed at this point anyway so it would never be an error for your code. Probably it would be nice to just force callers to verify the page. Normally IMO it is much nicer and clearer to do it at the time the page gets locked, unless there is good reason otherwise. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 08BEC6B004F for ; Wed, 5 Aug 2009 10:10:01 -0400 (EDT)
Date: Wed, 5 Aug 2009 16:10:01 +0200
From: Andi Kleen
Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2
Message-ID: <20090805141001.GJ11385@basil.fritz.box>
References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090805140145.GB28563@wotan.suse.de>
Sender: owner-linux-mm@kvack.org
To: Nick Piggin
Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com
List-ID:

> I haven't brought up the caller at this point, but IIRC you had
> the page locked and mapping confirmed at this point anyway so
> it would never be an error for your code.
>
> Probably it would be nice to just force callers to verify the page.
> Normally IMO it is much nicer and clearer to do it at the time the
> page gets locked, unless there is good reason otherwise.

Ok. I think I'll just keep it as it is for now.

The only reason I added the error code was to make truncate_inode_page
fit into .error_remove_page, but then later I added another wrapper,
so it could be removed again. But it won't hurt to have it either.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 49AF26B004F for ; Wed, 5 Aug 2009 10:41:15 -0400 (EDT) Date: Wed, 5 Aug 2009 16:41:12 +0200 From: Andi Kleen Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805144112.GM11385@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805141642.GB23992@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org List-ID: > One question I had for the others (Andrew? other mm guys?) what is the > feelings of merging this feature? Leaving aside exact implementation > and just considering the high level design and cost/benefit. Last time > there were some people objecting, so I wonder the situation now? So > does anybody need more convincing? :) The main objection last time was that it was a bit too late in the release schedule. I can't remember anyone really questioning the basic feature itself. > Also I will just cc linux-arch. It would be interesting to know whether > powerpc, ia64, or s390 or others would be interested to use this feature? ia64 is interested (but no code so far) I talked to DaveM and he seems to be interested for sparc too. 
I would expect other server architectures to eventually use it as they get around to writing the necessary architecture specific glue. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 908EE6B004F for ; Wed, 5 Aug 2009 10:44:02 -0400 (EDT) Date: Wed, 5 Aug 2009 16:44:02 +0200 From: Nick Piggin Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805144402.GD23992@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> <20090805144112.GM11385@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805144112.GM11385@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org List-ID: On Wed, Aug 05, 2009 at 04:41:12PM +0200, Andi Kleen wrote: > > One question I had for the others (Andrew? other mm guys?) what is the > > feelings of merging this feature? Leaving aside exact implementation > > and just considering the high level design and cost/benefit. Last time > > there were some people objecting, so I wonder the situation now? So > > does anybody need more convincing? 
:) > > The main objection last time was that it was a bit too late in the > release schedule. > > I can't remember anyone really questioning the basic feature itself. I can't exactly remember. Maybe it was in a thread with Alan and/or Arjan ;) I don't think the feature itself was questioned as much as cost/benefit. Maybe I was wrong... I just want to see everyone is happy with the basic idea ;) > > Also I will just cc linux-arch. It would be interesting to know whether > > powerpc, ia64, or s390 or others would be interested to use this feature? > > ia64 is interested (but no code so far) I talked to DaveM and he seems to be > interested for sparc too. I would expect other server architectures to > eventually use it as they get around to writing the necessary architecture > specific glue. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 1A1B26B004F for ; Wed, 5 Aug 2009 11:00:47 -0400 (EDT) Date: Wed, 5 Aug 2009 09:00:39 -0600 From: Matthew Wilcox Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805150038.GL3711@parisc-linux.org> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> <20090805144112.GM11385@basil.fritz.box> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805144112.GM11385@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: Nick Piggin , akpm@linux-foundation.org, 
linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org, linux-parisc@vger.kernel.org List-ID: On Wed, Aug 05, 2009 at 04:41:12PM +0200, Andi Kleen wrote: > > One question I had for the others (Andrew? other mm guys?) what is the > > feelings of merging this feature? Leaving aside exact implementation > > and just considering the high level design and cost/benefit. Last time > > there were some people objecting, so I wonder the situation now? So > > does anybody need more convincing? :) > > The main objection last time was that it was a bit too late in the > release schedule. > > I can't remember anyone really questioning the basic feature itself. > > > Also I will just cc linux-arch. It would be interesting to know whether > > powerpc, ia64, or s390 or others would be interested to use this feature? > > ia64 is interested (but no code so far) I talked to DaveM and he seems to be > interested for sparc too. I would expect other server architectures to > eventually use it as they get around to writing the necessary architecture > specific glue. parisc could certainly implement this. Don't know if there's interest. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id D4B256B0085 for ; Wed, 5 Aug 2009 11:12:27 -0400 (EDT) Date: Wed, 5 Aug 2009 23:12:20 +0800 From: Wu Fengguang Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090805151220.GB6210@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090805140145.GB28563@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , "akpm@linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "hidehiro.kawai.ez@hitachi.com" List-ID: On Wed, Aug 05, 2009 at 10:01:45PM +0800, Nick Piggin wrote: > On Wed, Aug 05, 2009 at 03:46:08PM +0200, Andi Kleen wrote: > > On Wed, Aug 05, 2009 at 12:20:08PM +0200, Nick Piggin wrote: > > > > truncate_complete_page(struct address_space *mapping, struct page *page) > > > > { > > > > if (page->mapping != mapping) > > > > - return; > > > > + return -EIO; > > > > > > Hmm, at this point, the page must have been removed from pagecache, > > > so I don't know if you need to pass an error back? > > > > It could be reused, which would be bad for us? > > I haven't brought up the caller at this point, but IIRC you had > the page locked and mapping confirmed at this point anyway so > it would never be an error for your code. Right, that 'if' will always evaluate to false for the hwpoison case. Because that 'mapping' was taken from 'page->mapping' inside page lock and they will just remain the same values. > Probably it would be nice to just force callers to verify the page. 
> Normally IMO it is much nicer and clearer to do it at the time the > page gets locked, unless there is good reason otherwise. Yes we do checked page->mapping after taking page lock. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 2AE886B005A for ; Thu, 6 Aug 2009 07:48:45 -0400 (EDT) Received: from d12nrmr1607.megacenter.de.ibm.com (d12nrmr1607.megacenter.de.ibm.com [9.149.167.49]) by mtagate8.de.ibm.com (8.14.3/8.13.8) with ESMTP id n76BmWIE105686 for ; Thu, 6 Aug 2009 11:48:32 GMT Received: from d12av02.megacenter.de.ibm.com (d12av02.megacenter.de.ibm.com [9.149.165.228]) by d12nrmr1607.megacenter.de.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id n76BmWjc2437156 for ; Thu, 6 Aug 2009 13:48:32 +0200 Received: from d12av02.megacenter.de.ibm.com (loopback [127.0.0.1]) by d12av02.megacenter.de.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n76BmWPL017950 for ; Thu, 6 Aug 2009 13:48:32 +0200 Date: Thu, 6 Aug 2009 13:48:30 +0200 From: Martin Schwidefsky Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090806134830.4f3931d2@skybase> In-Reply-To: <20090805141642.GB23992@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, 
linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org List-ID: On Wed, 5 Aug 2009 16:16:42 +0200 Nick Piggin wrote: > On Wed, Aug 05, 2009 at 04:10:01PM +0200, Andi Kleen wrote: > > > I haven't brought up the caller at this point, but IIRC you had > > > the page locked and mapping confirmed at this point anyway so > > > it would never be an error for your code. > > > > > > Probably it would be nice to just force callers to verify the page. > > > Normally IMO it is much nicer and clearer to do it at the time the > > > page gets locked, unless there is good reason otherwise. > > > > Ok. I think I'll just keep it as it is for now. > > > > The only reason I added the error code was to make truncate_inode_page > > fit into .error_remove_page, but then latter I did another wrapper > > so it could be removed again. But it won't hurt to have it either. > > OK, it's more of a cleanup/nit. > > One question I had for the others (Andrew? other mm guys?) what is the > feelings of merging this feature? Leaving aside exact implementation > and just considering the high level design and cost/benefit. Last time > there were some people objecting, so I wonder the situation now? So > does anybody need more convincing? :) > > Also I will just cc linux-arch. It would be interesting to know whether > powerpc, ia64, or s390 or others would be interested to use this feature? This is not relevant for s390, as current machines do transparent memory sparing if a memory module goes bad. Really old machines reported bad memory to the OS by means of a machine check (storage error uncorrected and storage error corrected). I have never seen this happen, the level below the OS deals with these errors for us. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 361946B005A for ; Thu, 6 Aug 2009 08:04:20 -0400 (EDT) Date: Thu, 6 Aug 2009 14:04:17 +0200 From: Andi Kleen Subject: Re: [PATCH] [11/19] HWPOISON: Refactor truncate to allow direct truncating of page v2 Message-ID: <20090806120417.GC22124@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093638.D3754B15D8@basil.firstfloor.org> <20090805102008.GB17190@wotan.suse.de> <20090805134607.GH11385@basil.fritz.box> <20090805140145.GB28563@wotan.suse.de> <20090805141001.GJ11385@basil.fritz.box> <20090805141642.GB23992@wotan.suse.de> <20090806134830.4f3931d2@skybase> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090806134830.4f3931d2@skybase> Sender: owner-linux-mm@kvack.org To: Martin Schwidefsky Cc: Nick Piggin , Andi Kleen , akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, hidehiro.kawai.ez@hitachi.com, linux-arch@vger.kernel.org List-ID: > This is not relevant for s390, as current machines do transparent memory > sparing if a memory module goes bad. Really old machines reported bad > memory to the OS by means of a machine check (storage error uncorrected > and storage error corrected). I have never seen this happen, the level > below the OS deals with these errors for us. Ok fine. It's for the poorer cousins then who can't afford memory mirroring. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 485F56B004D for ; Mon, 10 Aug 2009 02:36:33 -0400 (EDT) Received: from mlsv8.hitachi.co.jp (unknown [133.144.234.166]) by mail9.hitachi.co.jp (Postfix) with ESMTP id 1152437C87 for ; Mon, 10 Aug 2009 15:36:33 +0900 (JST) Message-ID: <4A7FBFD1.2010208@hitachi.com> Date: Mon, 10 Aug 2009 15:36:01 +0900 From: Hidehiro Kawai MIME-Version: 1.0 Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> In-Reply-To: <20090805093643.E0C00B15D8@basil.firstfloor.org> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com List-ID: Hi, Andi Kleen wrote: > Index: linux/fs/ext3/inode.c > =================================================================== > --- linux.orig/fs/ext3/inode.c > +++ linux/fs/ext3/inode.c > @@ -1819,6 +1819,7 @@ static const struct address_space_operat > .direct_IO = ext3_direct_IO, > .migratepage = buffer_migrate_page, > .is_partially_uptodate = block_is_partially_uptodate, > + .error_remove_page = generic_error_remove_page, > }; (I'm sorry if I'm missing the point.) If my understanding is correct, the following scenario can happen: 1. An uncorrected error on a dirty page cache page is detected by memory scrubbing 2. Kernel unmaps and truncates the page to recover from the error 3. 
An application reads data from the file location corresponding
   to the truncated page
   ==> Old or garbage data will be read into a new page cache page
4. The application modifies the data and writes it back to the disk
5. The file will be corrupted!

(Yes, the application is wrong to not do the right thing, i.e. fsync,
 but it's not the user's fault!)

A similar data corruption can be caused by a write I/O error,
because the dirty flag is cleared even if the page couldn't be written
to the disk.

However, we have a way to avoid this kind of data corruption at
least for ext3. If we mount an ext3 filesystem with data=ordered
and data_err=abort, all I/O errors on file data blocks belonging to
the committing transaction are checked. When an I/O error is found,
journaling is aborted and the filesystem is remounted read-only to
prevent further updates. This kind of feature is very important
for mission critical systems.

If we merge this patch, we would face the data corruption problem
again.

I think there are three options,

(1) drop this patch
(2) merge this patch with new panic_on_dirty_page_cache_corruption
    sysctl
(3) implement a more sophisticated error_remove_page function

> static const struct address_space_operations ext3_writeback_aops = {
> @@ -1834,6 +1835,7 @@ static const struct address_space_operat
>   .direct_IO = ext3_direct_IO,
>   .migratepage = buffer_migrate_page,
>   .is_partially_uptodate = block_is_partially_uptodate,
> + .error_remove_page = generic_error_remove_page,
> };

The writeback case would be OK. It's not much different from the I/O
error case.

> static const struct address_space_operations ext3_journalled_aops = {
> @@ -1848,6 +1850,7 @@ static const struct address_space_operat
>   .invalidatepage = ext3_invalidatepage,
>   .releasepage = ext3_releasepage,
>   .is_partially_uptodate = block_is_partially_uptodate,
> + .error_remove_page = generic_error_remove_page,
> };
>
> void ext3_set_aops(struct inode *inode)

I'm not sure about the journalled case.
I'm going to take a look at it later. Best regards, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id BFEA16B004D for ; Mon, 10 Aug 2009 03:07:47 -0400 (EDT) Date: Mon, 10 Aug 2009 15:07:45 +0800 From: Wu Fengguang Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090810070745.GA26533@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A7FBFD1.2010208@hitachi.com> Sender: owner-linux-mm@kvack.org To: Hidehiro Kawai Cc: Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" List-ID: Hi Hidehiro, On Mon, Aug 10, 2009 at 02:36:01PM +0800, Hidehiro Kawai wrote: > Hi, > > Andi Kleen wrote: > > > Index: linux/fs/ext3/inode.c > > =================================================================== > > --- linux.orig/fs/ext3/inode.c > > +++ linux/fs/ext3/inode.c > > @@ -1819,6 +1819,7 @@ static const struct address_space_operat > > .direct_IO = ext3_direct_IO, > > .migratepage = buffer_migrate_page, > > .is_partially_uptodate = block_is_partially_uptodate, > > + .error_remove_page = generic_error_remove_page, > > }; > > (I'm sorry if I'm missing the point.) > > If my understanding is correct, the following scenario can happen: > > 1. 
An uncorrected error on a dirty page cache page is detected by > memory scrubbing > 2. Kernel unmaps and truncates the page to recover from the error > 3. An application reads data from the file location corresponding > to the truncated page > ==> Old or garbage data will be read into a new page cache page > 4. The application modifies the data and write back it to the disk > 5. The file will corrurpt! > > (Yes, the application is wrong to not do the right thing, i.e. fsync, > but it's not user's fault!) Right. Note that the data has already been corrupted and the above scenario can be called as re-corruption. We set AS_EIO to trigger some IO reporting mechanism so that it won't corrupt *silently*. > A similar data corruption can be caused by a write I/O error, > because dirty flag is cleared even if the page couldn't be written > to the disk. Yes. > However, we have a way to avoid this kind of data corruption at > least for ext3. If we mount an ext3 filesystem with data=ordered > and data_err=abort, all I/O errors on file data block belonging to > the committing transaction are checked. When I/O error is found, > abort journaling and remount the filesystem with read-only to > prevent further updates. This kind of feature is very important > for mission critical systems. Agreed. We also set PG_error, which should be enough to trigger such remount? > If we merge this patch, we would face the data corruption problem > again. > > I think there are three options, > > (1) drop this patch > (2) merge this patch with new panic_on_dirty_page_cache_corruption > sysctl > (3) implement a more sophisticated error_remove_page function In fact we proposed a patch for preventing the re-corruption case, see http://lkml.org/lkml/2009/6/11/294 However it is hard to answer the (policy) question "How sticky should the EIO bit remain?". 
> > static const struct address_space_operations ext3_writeback_aops = { > > @@ -1834,6 +1835,7 @@ static const struct address_space_operat > > .direct_IO = ext3_direct_IO, > > .migratepage = buffer_migrate_page, > > .is_partially_uptodate = block_is_partially_uptodate, > > + .error_remove_page = generic_error_remove_page, > > }; > > The writeback case would be OK. It's not much different from the I/O > error case. > > > static const struct address_space_operations ext3_journalled_aops = { > > @@ -1848,6 +1850,7 @@ static const struct address_space_operat > > .invalidatepage = ext3_invalidatepage, > > .releasepage = ext3_releasepage, > > .is_partially_uptodate = block_is_partially_uptodate, > > + .error_remove_page = generic_error_remove_page, > > }; > > > > void ext3_set_aops(struct inode *inode) > > I'm not sure about the journalled case. I'm going to take a look at > it later. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id C8F706B005A for ; Mon, 10 Aug 2009 03:44:23 -0400 (EDT)
Date: Mon, 10 Aug 2009 09:44:21 +0200
From: Andi Kleen
Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems
Message-ID: <20090810074421.GA6838@basil.fritz.box>
References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4A7FBFD1.2010208@hitachi.com>
Sender: owner-linux-mm@kvack.org
To: Hidehiro Kawai
Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com
List-ID:

Hi,

> If my understanding is correct, the following scenario can happen:

Yes it can happen.

>
> 1. An uncorrected error on a dirty page cache page is detected by
>    memory scrubbing
> 2. Kernel unmaps and truncates the page to recover from the error
> 3. An application reads data from the file location corresponding
>    to the truncated page
>    ==> Old or garbage data will be read into a new page cache page

The problem currently is that the error is not sticky enough and
doesn't stay around long enough. It gets reported once,
but not in later IO operations.

However it's a generic problem not unique to hwpoison. Fengguang and I
went through the error propagation as our test program triggered the
problem, and it looked like it was really a generic problem, not unique
to hardware poison (e.g. the IO error handling on metadata has exactly
the same problem).

And redesigning VFS IO error reporting was a bit out of scope for hwpoison.
So we decided to not be better than a normal IO error here for now.

An application that handles current IO errors correctly will also
handle hwpoison IO errors correctly. And an application
that doesn't handle current IO errors correctly will also
not necessarily handle hwpoison correctly (it's not better and not worse).

So the hwpoison errors are pretty much the same as the normal IO errors.

The normal error path probably needs some improvements; in particular,
the address space EIO error likely needs to be more sticky
than it is today. An application has to handle the error on the first
strike.

That is something that could be improved in the VFS -- although I fear
any improvements here could also break compatibility. I don't think
it's a blocker on hwpoison for now. It needs more design
effort and thinking (e.g. likely the address space IO error
bit should be separated into multiple bits).

Perhaps you're interested in working on this?

> 4. The application modifies the data and writes it back to the disk
> 5. The file will be corrupted!
>
> (Yes, the application is wrong to not do the right thing, i.e. fsync,
>  but it's not the user's fault!)
>
> A similar data corruption can be caused by a write I/O error,
> because the dirty flag is cleared even if the page couldn't be written
> to the disk.
>
> However, we have a way to avoid this kind of data corruption at
> least for ext3. If we mount an ext3 filesystem with data=ordered
> and data_err=abort, all I/O errors on file data blocks belonging to
> the committing transaction are checked. When an I/O error is found,
> journaling is aborted and the filesystem is remounted read-only to
> prevent further updates. This kind of feature is very important
> for mission critical systems.

Well it sounds like a potentially useful enhancement to ext3 (or ext4).

One issue is that the default is not ordered anymore since
Linus changed the default.

I'm sure other enhancements for IO errors could be done too.
Some of the file systems also handle them still quite poorly (e.g. btrfs) But again I don't think it's a blocker for hwpoison. > I think there are three options, > > (1) drop this patch > (2) merge this patch with new panic_on_dirty_page_cache_corruption > sysctl > (3) implement a more sophisticated error_remove_page function (4) accept that hwpoison error handling is not better and not worse than normal IO error handling. We opted for (4). -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id BC93E6B004F for ; Mon, 10 Aug 2009 23:48:50 -0400 (EDT) Received: from mlsv7.hitachi.co.jp (unknown [133.144.234.166]) by mail9.hitachi.co.jp (Postfix) with ESMTP id E379037C88 for ; Tue, 11 Aug 2009 12:48:53 +0900 (JST) Message-ID: <4A80EA14.4030300@hitachi.com> Date: Tue, 11 Aug 2009 12:48:36 +0900 From: Hidehiro Kawai MIME-Version: 1.0 Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810070745.GA26533@localhost> In-Reply-To: <20090810070745.GA26533@localhost> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Wu Fengguang Cc: Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba List-ID: Wu Fengguang wrote: >>However, we have a way to 
avoid this kind of data corruption at >>least for ext3. If we mount an ext3 filesystem with data=ordered >>and data_err=abort, all I/O errors on file data block belonging to >>the committing transaction are checked. When I/O error is found, >>abort journaling and remount the filesystem with read-only to >>prevent further updates. This kind of feature is very important >>for mission critical systems. > > Agreed. We also set PG_error, which should be enough to trigger such > remount? ext3 doesn't check PG_error. Maybe we need to do: 1. trylock_buffer() 2. if step 1. succeeds, then clear_buffer_dirty(), clear_buffer_uptodate(), and set_buffer_write_io_error() Note that we can't truncate the corrupted page until the error check is done by kjournald. >>If we merge this patch, we would face the data corruption problem >>again. >> >>I think there are three options, >> >>(1) drop this patch >>(2) merge this patch with new panic_on_dirty_page_cache_corruption >> sysctl >>(3) implement a more sophisticated error_remove_page function > > In fact we proposed a patch for preventing the re-corruption case, see > > http://lkml.org/lkml/2009/6/11/294 > > However it is hard to answer the (policy) question "How sticky should > the EIO bit remain?". It's a good approach! This approach may also solve my concern, the re-corruption issue caused by transient IO errors. But I also think it needs a bit more consideration. For example, if the application has the valid data in the user space buffer, it would try to re-write it after detecting an IO error from the previous write. In this case, we should clear the sticky error flag. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
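The "how sticky should the EIO bit remain?" question in the message above can be made concrete with a small userspace model. The C sketch below is not kernel code and is not the patch linked in the thread; the names err_flag, record_error, check_error and clear_error are hypothetical. It assumes only what the thread states: today the error is reported once, a sticky flag would keep being reported, and a successful rewrite of valid data may be a reason to clear it.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a per-file error flag (illustration only). */
struct err_flag {
    bool pending; /* an error was recorded against the cached data */
    bool sticky;  /* policy: keep reporting until explicitly cleared */
};

/* Record an error, as a poisoned page or failed write-back would. */
static void record_error(struct err_flag *f)
{
    f->pending = true;
}

/* fsync-like check: returns -EIO while an error is pending. */
static int check_error(struct err_flag *f)
{
    if (!f->pending)
        return 0;
    if (!f->sticky)
        f->pending = false; /* report-once: the first caller consumes it */
    return -EIO;
}

/* Explicit acknowledgement, e.g. after the data was rewritten. */
static void clear_error(struct err_flag *f)
{
    f->pending = false;
}
```

With sticky set to false, a second check_error() call returns 0 even though nothing was repaired, which is the "reported once" behaviour described in the thread. With sticky set to true, the error persists until clear_error() is called, and deciding when that call is legitimate is the open policy question.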
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 3D4C66B004F for ; Mon, 10 Aug 2009 23:51:03 -0400 (EDT) Received: from mlsv1.hitachi.co.jp (unknown [133.144.234.166]) by mail4.hitachi.co.jp (Postfix) with ESMTP id 2BCB733CC9 for ; Tue, 11 Aug 2009 12:51:05 +0900 (JST) Message-ID: <4A80EAA3.7040107@hitachi.com> Date: Tue, 11 Aug 2009 12:50:59 +0900 From: Hidehiro Kawai MIME-Version: 1.0 Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> In-Reply-To: <20090810074421.GA6838@basil.fritz.box> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: Andi Kleen wrote: >>1. An uncorrected error on a dirty page cache page is detected by >> memory scrubbing >>2. Kernel unmaps and truncates the page to recover from the error >>3. An application reads data from the file location corresponding >> to the truncated page >> ==> Old or garbage data will be read into a new page cache page > > The problem currently is that the error is not sticky enough and > doesn't stay around long enough. It gets reported once, > but not in later IO operations. > > However it's a generic problem not unique to hwpoison. Me Yes, it's a generic problem, and introducing a sticky error flag is one of the approach to solve the problem. 
I think it is a good approach because it doesn't depend on individual filesystems. > And application > that doesn't handle current IO errors correctly will also > not necessarily handle hwpoison correctly (it's not better and not worse) This is my main concern. I'd like to prevent re-corruption even if applications don't have good manners. As for usual I/O error, ext3/4 can now do it by using data=ordered and data_err=abort mount options. Moreover, if you mount the ext3/4 filesystem with the additional errors=panic option, kernel gets panic on write error instead of read-only remount. Customers who regard data integrity is very important require these features. But this patch (PATCH 16/19) introduce this problem again, because it doesn't provide a way to shut out further writes to the fs. Of course, we can do it by setting tolerant level to 0 or memory_failure_recovery to 0. But it would be overkill. That is why I suggested this: >>(2) merge this patch with new panic_on_dirty_page_cache_corruption >> sysctl > That is something that could be improved in the VFS -- although I fear > any improvements here could also break compatibility. I don't think > it's a blocker on hwpoison for now. It needs more design > effort and thinking (e.g. likely the address space IO error > bit should be separated into multiple bits) > > Perhaps you're interested in working on this? Yes. Transient IO errors have a potential for causing re-corruption problem. Now ext3/4 provide ways to prevent it, but not the other filesystems. We would need a generic way. >>4. The application modifies the data and write back it to the disk >>5. The file will corrurpt! >> >>(Yes, the application is wrong to not do the right thing, i.e. fsync, >> but it's not user's fault!) >> >>A similar data corruption can be caused by a write I/O error, >>because dirty flag is cleared even if the page couldn't be written >>to the disk. 
>> >>However, we have a way to avoid this kind of data corruption at >>least for ext3. If we mount an ext3 filesystem with data=ordered >>and data_err=abort, all I/O errors on file data block belonging to >>the committing transaction are checked. When I/O error is found, >>abort journaling and remount the filesystem with read-only to >>prevent further updates. This kind of feature is very important >>for mission critical systems. > > Well it sounds like a potentially useful enhancement to ext3 (or ext4). > > One issue is that the default is not ordered anymore since > Linus changed the default. Yes, but what is important is whether the system provides such feature or not. > I'm sure other enhancements for IO errors could be done too. > Some of the file systems also handle them still quite poorly (e.g. btrfs) > > But again I don't think it's a blocker for hwpoison. Unfortunately, it can be a blocker. As I stated, we can block the possible re-corruption caused by transient IO errors on ext3/4 filesystems. But applying this patch (PATCH 16/19), re-corruption can happen even if we use data=ordered, data_err=abort and errors=panic mount options. So... >>I think there are three options, >> >>(1) drop this patch >>(2) merge this patch with new panic_on_dirty_page_cache_corruption >> sysctl >>(3) implement a more sophisticated error_remove_page function > > (4) accept that hwpoison error handling is not better and not worse than normal > IO error handling. > > We opted for (4). Could you consider adopting (2) or (3)? Fengguang's sticky EIO approach (http://lkml.org/lkml/2009/6/11/294) is also OK. I hope HWPOISON patches are merged into 2.6.32. So (2) is the best answer for me, because it's simple and less intrusive. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 76CAB6B004F for ; Tue, 11 Aug 2009 02:59:35 -0400 (EDT) Date: Tue, 11 Aug 2009 08:59:29 +0200 From: Andi Kleen Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090811065929.GB14368@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810070745.GA26533@localhost> <4A80EA14.4030300@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A80EA14.4030300@hitachi.com> Sender: owner-linux-mm@kvack.org To: Hidehiro Kawai Cc: Wu Fengguang , Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Tue, Aug 11, 2009 at 12:48:36PM +0900, Hidehiro Kawai wrote: > Wu Fengguang wrote: > > >>However, we have a way to avoid this kind of data corruption at > >>least for ext3. If we mount an ext3 filesystem with data=ordered > >>and data_err=abort, all I/O errors on file data block belonging to > >>the committing transaction are checked. When I/O error is found, > >>abort journaling and remount the filesystem with read-only to > >>prevent further updates. This kind of feature is very important > >>for mission critical systems. > > > > Agreed. We also set PG_error, which should be enough to trigger such > > remount? > > ext3 doesn't check PG_error. Maybe we need to do: When we truncate the page it's gone so there's no page to set PG_error on. -Andi -- ak@linux.intel.com -- Speaking for myself only. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id E64C66B004F for ; Tue, 11 Aug 2009 03:17:53 -0400 (EDT) Date: Tue, 11 Aug 2009 09:17:56 +0200 From: Andi Kleen Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090811071756.GC14368@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A80EAA3.7040107@hitachi.com> Sender: owner-linux-mm@kvack.org To: Hidehiro Kawai Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote: > > And application > > that doesn't handle current IO errors correctly will also > > not necessarily handle hwpoison correctly (it's not better and not worse) > > This is my main concern. I'd like to prevent re-corruption even if > applications don't have good manners. I don't think there's much we can do if the application doesn't check for IO errors properly. What would you do if it doesn't check for IO errors at all? If it checks for IO errors it simply has to check for them on all IO operations -- if they do they will detect hwpoison errors correctly too. 
> As for usual I/O error, ext3/4 can now do it by using data=ordered and > data_err=abort mount options. Moreover, if you mount the ext3/4 > filesystem with the additional errors=panic option, kernel gets > panic on write error instead of read-only remount. Customers > who regard data integrity is very important require these features. Well they can also set vm.memory_failure_recovery = 0 then if they don't care about their uptime. > That is why I suggested this: > >>(2) merge this patch with new panic_on_dirty_page_cache_corruption You probably mean panic_on_non_anonymous_dirty_page_cache Normally anonymous memory is dirty. > >> sysctl It's unclear to me this special mode is really desirable. Does it bring enough value to the user to justify the complexity of another exotic option? The case is relatively exotic, as in dirty write cache that is mapped to a file. Try to explain it in documentation and you see how ridiculous it sounds; it simply doesn't have clean semantics ("In case you have applications with broken error IO handling on your mission critical system ...") > > I'm sure other enhancements for IO errors could be done too. > > Some of the file systems also handle them still quite poorly (e.g. btrfs) > > > > But again I don't think it's a blocker for hwpoison. > > Unfortunately, it can be a blocker. As I stated, we can block the > possible re-corruption caused by transient IO errors on ext3/4 > filesystems. But applying this patch (PATCH 16/19), re-corruption > can happen even if we use data=ordered, data_err=abort and > errors=panic mount options. We don't corrupt data on disk. Applications that don't check for IO errors correctly may see stale data from the same file on disk though. This can happen in all the cases you listed above except for panic-on-error. If you want panic-on-error behaviour simply set vm.memory_failure_recovery = 0 > > (4) accept that hwpoison error handling is not better and not worse than normal > > IO error handling.
> > > > We opted for (4). > > Could you consider adopting (2) or (3)? Fengguang's sticky EIO > approach (http://lkml.org/lkml/2009/6/11/294) is also OK. I believe redesigned IO error handling does not belong in the core hwpoison patchkit. It's big enough as it is and I consider it frozen unless fatal bugs are found -- and frankly this is not a fatal error in my estimation. If you want to have improved IO error handling feel free to submit it separately. I agree this area could use some work. But it probably needs more design work first. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 161046B004F for ; Tue, 11 Aug 2009 09:00:20 -0400 (EDT) Date: Tue, 11 Aug 2009 20:38:19 +0800 From: Wu Fengguang Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090811123819.GB18881@localhost> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810070745.GA26533@localhost> <4A80EA14.4030300@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A80EA14.4030300@hitachi.com> Sender: owner-linux-mm@kvack.org To: Hidehiro Kawai Cc: Andi Kleen , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "npiggin@suse.de" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Tue, Aug 11, 2009 at 11:48:36AM +0800, Hidehiro Kawai wrote: > Wu Fengguang wrote: > > In 
fact we proposed a patch for preventing the re-corruption case, see > > > > http://lkml.org/lkml/2009/6/11/294 > > > > However it is hard to answer the (policy) question "How sticky should > > the EIO bit remain?". > > It's a good approach! This approach may also solve my concern, > the re-corruption issue caused by transient IO errors. > > But I also think it needs a bit more consideration. For example, > if the application has the valid data in the user space buffer, > it would try to re-write it after detecting an IO error from the > previous write. In this case, we should clear the sticky error flag. Yes, and maybe more than that. The IO error issue really deserves an independent work, which will inevitably involve lots of discussions with lots of people. For the data re-corruption problem, "vm.memory_failure_recovery = 0" should be the most clean workaround for now. Can we settle with that? Our goal for this initial hwpoison implementation is to achieve good coverage (not necessarily every possible case :). Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 818D76B004F for ; Tue, 11 Aug 2009 22:49:59 -0400 (EDT) Received: from mlsv1.hitachi.co.jp (unknown [133.144.234.166]) by mail4.hitachi.co.jp (Postfix) with ESMTP id 6D40933CC8 for ; Wed, 12 Aug 2009 11:50:04 +0900 (JST) Message-ID: <4A822DD4.1050202@hitachi.com> Date: Wed, 12 Aug 2009 11:49:56 +0900 From: Hidehiro Kawai MIME-Version: 1.0 Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> In-Reply-To: <20090811071756.GC14368@basil.fritz.box> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: Andi Kleen wrote: > On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote: > >>>And application >>>that doesn't handle current IO errors correctly will also >>>not necessarily handle hwpoison correctly (it's not better and not worse) >> >>This is my main concern. I'd like to prevent re-corruption even if >>applications don't have good manners. > > I don't think there's much we can do if the application doesn't > check for IO errors properly. What would you do if it doesn't > check for IO errors at all? 
If it checks for IO errors it simply > has to check for them on all IO operations -- if they do > they will detect hwpoison errors correctly too. I believe it's not uncommon for applications to do buffered write and then exit without fsync(). And I think it's difficult to preclude such applications and commands from the system perfectly. >>That is why I suggested this: >> >>>>(2) merge this patch with new panic_on_dirty_page_cache_corruption >>>> sysctl > > You probably mean panic_on_non_anonymous_dirty_page_cache > Normally anonymous memory is dirty. Yes, and sorry for my ambiguous description. I used the word "cache" to intend to not include anonymous pages. > It's unclear to me this special mode is really desirable. > Does it bring enough value to the user to justify the complexity > of another exotic option? The case is relatively exotic, > as in dirty write cache that is mapped to a file. > > Try to explain it in documentation and you see how ridiculous it sounds; > it simply doesn't have clean semantics > > ("In case you have applications with broken error IO handling on > your mission critical system ...") Generally, dropping unwritten dirty page caches is considered to be risky. So the "panic on IO error" policy has been used as usual practice for some systems. I just suggested that we adopted this policy into machine check errors. vm.memory_failure_recovery satisfies my minimal requirement. It's OK at this stage, but I'd like to improve this in the future. Another option is to introduce "ignore all" policy instead of panicking at the beginning of memory_failure(). Perhaps it finally causes SRAR machine check, and then kernel will panic or a process will be killed. Anyway, this is a topic for the next stage. >>>(4) accept that hwpoison error handling is not better and not worse than normal >>>IO error handling. >>> >>>We opted for (4). >> >>Could you consider adopting (2) or (3)?
Fengguang's sticky EIO >>approach (http://lkml.org/lkml/2009/6/11/294) is also OK. > > I believe redesigned IO error handling does not belong in the > core hwpoison patchkit. I agree. > If you want to have improved IO error handling feel free to > submit it separately. I agree this area could use some work. > But it probably needs more design work first. Well, this patch set itself looks good to me. I also looked into the other patches, I couldn't find any problems (although I'm not good judge of reviewing). Reviewed-by: Hidehiro Kawai Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 174AC6B004F for ; Wed, 12 Aug 2009 03:46:12 -0400 (EDT) Date: Wed, 12 Aug 2009 09:46:11 +0200 From: Andi Kleen Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812074611.GC28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <4A822DD4.1050202@hitachi.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4A822DD4.1050202@hitachi.com> Sender: owner-linux-mm@kvack.org To: Hidehiro Kawai Cc: Andi Kleen , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, npiggin@suse.de, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi 
Sakuraba List-ID: On Wed, Aug 12, 2009 at 11:49:56AM +0900, Hidehiro Kawai wrote: > > I don't think there's much we can do if the application doesn't > > check for IO errors properly. What would you do if it doesn't > > check for IO errors at all? If it checks for IO errors it simply > > has to check for them on all IO operations -- if they do > > they will detect hwpoison errors correctly too. > > I believe it's not uncommon for applications to do buffered write > and then exit without fsync(). And I think it's difficult to > preclude such applications and commands from the system perfectly. That's true, but for anything mission critical you would expect them to use some transactional mechanism, either with O_SYNC or fsync(). Otherwise they always risk data loss anyways. > > It's unclear to me this special mode is really desirable. > > Does it bring enough value to the user to justify the complexity > > of another exotic option? The case is relatively exotic, > > as in dirty write cache that is mapped to a file. > > > > Try to explain it in documentation and you see how ridiculous it sounds; > > it simply doesn't have clean semantics > > > > ("In case you have applications with broken error IO handling on > > your mission critical system ...") > > Generally, dropping unwritten dirty page caches is considered to be > risky. So the "panic on IO error" policy has been used as usual > practice for some systems. I just suggested that we adopted > this policy into machine check errors. Hmm, what we could possibly do -- as followon patches -- would be to let error_remove_page check the per file system panic-on-io-error super block setting for dirty pages and panic in this case too. Unfortunately this setting is currently per file system, not generic, so it would need to be a fs specific check (or the flag would need to be moved into a generic fs superblock field first). I think that would be relatively clean semantics wise.
Would you be interested in working on patches for that? > Another option is to introduce "ignore all" policy instead of > panicking at the beginning of memory_failure(). Perhaps it finally > causes SRAR machine check, and then kernel will panic or a process > will be killed. Anyway, this is a topic for the next stage. The problem is memory_failure() would then need to start distinguishing between AR=1 and AR=0 which it doesn't today. It could be done, but would need some more work. > > If you want to have improved IO error handling feel free to > > submit it separately. I agree this area could use some work. > > But it probably needs more design work first. > > Well, this patch set itself looks good to me. > I also looked into the other patches, I couldn't find any > problems (although I'm not good judge of reviewing). > > Reviewed-by: Hidehiro Kawai Thanks for your review and your comments. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id C2E256B005A for ; Wed, 12 Aug 2009 04:05:41 -0400 (EDT) Date: Wed, 12 Aug 2009 10:05:40 +0200 From: Nick Piggin Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812080540.GA32342@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090811071756.GC14368@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Tue, Aug 11, 2009 at 09:17:56AM +0200, Andi Kleen wrote: > On Tue, Aug 11, 2009 at 12:50:59PM +0900, Hidehiro Kawai wrote: > > > And application > > > that doesn't handle current IO errors correctly will also > > > not necessarily handle hwpoison correctly (it's not better and not worse) > > > > This is my main concern. I'd like to prevent re-corruption even if > > applications don't have good manners. > > I don't think there's much we can do if the application doesn't > check for IO errors properly. What would you do if it doesn't > check for IO errors at all? If it checks for IO errors it simply > has to check for them on all IO operations -- if they do > they will detect hwpoison errors correctly too. But will quite possibly do the wrong thing: ie. 
try to re-sync the same page again, or try to write the page to a new location, etc. This is the whole problem with -EIO semantics I brought up. > > That is why I suggested this: > > >>(2) merge this patch with new panic_on_dirty_page_cache_corruption > > You probably mean panic_on_non_anonymous_dirty_page_cache > Normally anonymous memory is dirty. > > > >> sysctl > > It's unclear to me this special mode is really desirable. > Does it bring enough value to the user to justify the complexity > of another exotic option? The case is relatively exotic, > as in dirty write cache that is mapped to a file. > > Try to explain it in documentation and you see how ridiculous it sounds; > it simply doesn't have clean semantics > > ("In case you have applications with broken error IO handling on > your mission critical system ...") Not broken error handling. It is very simple: if the application is assuming EIO is an error with dirty data being sent to disk, rather than an error with the data itself (which I think may be a common assumption), then it could have a problem. Suppose a database, for example, tries to write the data to another location in response to EIO and then records it in a list of failed IOs before halting. If it restarts, it might try again to write out these failed IOs (e.g. to give the administrator a chance to fix IO devices). It is a completely made-up scenario, but it is not outlandish and it would cause bad data corruption. A mission critical server will *definitely* want to panic on dirty page corruption, IMO, because by definition they should be able to tolerate panic. But if they do not know about this change to -EIO semantics, then it is quite possible to cause problems. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 9B6EE6B0055 for ; Wed, 12 Aug 2009 04:23:27 -0400 (EDT) Date: Wed, 12 Aug 2009 10:23:31 +0200 From: Andi Kleen Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812082331.GD28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812080540.GA32342@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: > page corruption, IMO, because by definition they should be able to > tolerate panic. But if they do not know about this change to -EIO > semantics, then it is quite possible to cause problems. There's no change really. You already have this problem with any metadata error, which can cause similar trouble. If the application handles those correctly it will also handle hwpoison correctly. -Andi -- ak@linux.intel.com -- Speaking for myself only. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id DAE036B004F for ; Wed, 12 Aug 2009 04:46:13 -0400 (EDT) Date: Wed, 12 Aug 2009 10:46:13 +0200 From: Nick Piggin Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812084613.GB32342@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812082331.GD28848@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > page corruption, IMO, because by definition they should be able to > > tolerate panic. But if they do not know about this change to -EIO > > semantics, then it is quite possible to cause problems. > > There's no change really. You already have this problem with > any metadata error, which can cause similar trouble. > If the application handles those correctly it will also > handle hwpoison correctly. What do you mean metadata error? To be clear, we're talking about safety given by a panic on any memory error, versus safety of this hwpoison code (specifically which removes dirty pagecache pages as -EIO). 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 1B1656B004F for ; Wed, 12 Aug 2009 04:57:22 -0400 (EDT) Date: Wed, 12 Aug 2009 10:57:27 +0200 From: Andi Kleen Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812085727.GE28848@basil.fritz.box> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> <20090812084613.GB32342@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812084613.GB32342@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Wed, Aug 12, 2009 at 10:46:13AM +0200, Nick Piggin wrote: > On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > > page corruption, IMO, because by definition they should be able to > > > tolerate panic. But if they do not know about this change to -EIO > > > semantics, then it is quite possible to cause problems. > > > > There's no change really. You already have this problem with > > any metadata error, which can cause similar trouble. 
> > If the application handles those correctly it will also > > handle hwpoison correctly. > > What do you mean metadata error? e.g. when there's a write error on the indirect block or any other fs metadata. This can also cause you to lose data. The error reporting also works through the address space like with hwpoison, so it only gets reported once. I'm not really against fixing that (make the error more sticky as Fengguang puts it), but I don't think it needs to be mixed with hwpoison. -Andi -- ak@linux.intel.com -- Speaking for myself only. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id D640C6B0055 for ; Wed, 12 Aug 2009 05:05:18 -0400 (EDT) Date: Wed, 12 Aug 2009 11:05:18 +0200 From: Nick Piggin Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812090518.GC32342@wotan.suse.de> References: <200908051136.682859934@firstfloor.org> <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> <20090812084613.GB32342@wotan.suse.de> <20090812085727.GE28848@basil.fritz.box> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812085727.GE28848@basil.fritz.box> Sender: owner-linux-mm@kvack.org To: Andi Kleen Cc: Hidehiro Kawai , tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
fengguang.wu@intel.com, Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Wed, Aug 12, 2009 at 10:57:27AM +0200, Andi Kleen wrote: > On Wed, Aug 12, 2009 at 10:46:13AM +0200, Nick Piggin wrote: > > On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > > > page corruption, IMO, because by definition they should be able to > > > > tolerate panic. But if they do not know about this change to -EIO > > > > semantics, then it is quite possible to cause problems. > > > > > > There's no change really. You already have this problem with > > > any metadata error, which can cause similar trouble. > > > If the application handles those correctly it will also > > > handle hwpoison correctly. > > > > What do you mean metadata error? > > e.g. when there's a write error on the indirect block or any > other fs metadata. This can also cause you to lose data. The error > reporting also works through the address space like with hwpoison, > so it only gets reported once. Well, this is also a filesystem issue, but anyway the data typically does not get thrown out. So a subsequent fsync should be able to retry. But if the filesystem can't handle such errors and loses the original data when there is an IO error in newly dirty metadata, then it's a problem in the filesystem really, isn't it? > I'm not really against fixing that (make the error more sticky > as Fengguang puts it), but I don't think it needs to be mixed > with hwpoison. I don't know if making it sticky really "fixes" it. The problem is different semantics of what EIO means. My example illustrates this. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id F11786B0055 for ; Wed, 12 Aug 2009 05:39:50 -0400 (EDT) Date: Wed, 12 Aug 2009 17:39:35 +0800 From: Wu Fengguang Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems Message-ID: <20090812093935.GA2724@localhost> References: <20090805093643.E0C00B15D8@basil.firstfloor.org> <4A7FBFD1.2010208@hitachi.com> <20090810074421.GA6838@basil.fritz.box> <4A80EAA3.7040107@hitachi.com> <20090811071756.GC14368@basil.fritz.box> <20090812080540.GA32342@wotan.suse.de> <20090812082331.GD28848@basil.fritz.box> <20090812084613.GB32342@wotan.suse.de> <20090812085727.GE28848@basil.fritz.box> <20090812090518.GC32342@wotan.suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20090812090518.GC32342@wotan.suse.de> Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Andi Kleen , Hidehiro Kawai , "tytso@mit.edu" , "hch@infradead.org" , "mfasheh@suse.com" , "aia21@cantab.net" , "hugh.dickins@tiscali.co.uk" , "swhiteho@redhat.com" , "akpm@linux-foundation.org" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , Satoshi OSHIMA , Taketoshi Sakuraba List-ID: On Wed, Aug 12, 2009 at 05:05:18PM +0800, Nick Piggin wrote: > On Wed, Aug 12, 2009 at 10:57:27AM +0200, Andi Kleen wrote: > > On Wed, Aug 12, 2009 at 10:46:13AM +0200, Nick Piggin wrote: > > > On Wed, Aug 12, 2009 at 10:23:31AM +0200, Andi Kleen wrote: > > > > > page corruption, IMO, because by definition they should be able to > > > > > tolerate panic. But if they do not know about this change to -EIO > > > > > semantics, then it is quite possible to cause problems. > > > > > > > > There's no change really. You already have this problem with > > > > any metadata error, which can cause similar trouble. 
> > > > If the application handles those correctly it will also > > > > handle hwpoison correctly. > > > > > > What do you mean metadata error? > > > > e.g. when there's a write error on the indirect block or any > > other fs metadata. This can also cause you to lose data. The error > > reporting also works through the address space like with hwpoison, > > so it only gets reported once. > > Well, this is also a filesystem issue, but anyway the data typically > does not get thrown out. So a subsequent fsync should be able to > retry. Right. In normal EIO, the data in page cache is still good and accessible. > But if the filesystem can't handle such errors and loses the original > data when there is an IO error in newly dirty metadata, then it's > a problem in the filesystem really, isn't it? Right, and the fs should report EIO on future sync attempts as long as the problem sticks. > > I'm not really against fixing that (make the error more sticky > > as Fengguang puts it), but I don't think it needs to be mixed > > with hwpoison. > > I don't know if making it sticky really "fixes" it. The problem is > different semantics of what EIO means. My example illustrates this. Case 1: (re)sync on EIO: sticky EIO will help. Case 2: read out the data from page cache and rewrite it somewhere. Sticky EIO is not enough, because here the application assumes the dirty page is still accessible. In this case, patch http://lkml.org/lkml/2009/6/11/294 will help. It effectively freezes the radix tree, so that no new pages will be loaded to replace the corrupted data and fake a 'good' one. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51])
 by kanga.kvack.org (Postfix) with SMTP id DE6686B004F
 for ; Wed, 12 Aug 2009 05:52:14 -0400 (EDT)
Received: from mlsv1.hitachi.co.jp (unknown [133.144.234.166])
 by mail4.hitachi.co.jp (Postfix) with ESMTP id A301833CC6
 for ; Wed, 12 Aug 2009 18:52:19 +0900 (JST)
Message-ID: <4A8290CE.7000904@hitachi.com>
Date: Wed, 12 Aug 2009 18:52:14 +0900
From: Hidehiro Kawai
MIME-Version: 1.0
Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems
References: <200908051136.682859934@firstfloor.org>
 <20090805093643.E0C00B15D8@basil.firstfloor.org>
 <4A7FBFD1.2010208@hitachi.com>
 <20090810074421.GA6838@basil.fritz.box>
 <4A80EAA3.7040107@hitachi.com>
 <20090811071756.GC14368@basil.fritz.box>
 <4A822DD4.1050202@hitachi.com>
 <20090812074611.GC28848@basil.fritz.box>
In-Reply-To: <20090812074611.GC28848@basil.fritz.box>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Andi Kleen
Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net,
 hugh.dickins@tiscali.co.uk, swhiteho@redhat.com,
 akpm@linux-foundation.org, npiggin@suse.de,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 fengguang.wu@intel.com, Satoshi OSHIMA, Taketoshi Sakuraba
List-ID:

Andi Kleen wrote:

>>Generally, dropping unwritten dirty page caches is considered to be
>>risky. So the "panic on IO error" policy has been used as usual
>>practice for some systems. I just suggested that we adopted
>>this policy into machine check errors.
>
> Hmm, what we could possibly do -- as follow-on patches -- would be to
> let error_remove_page check the per file system panic-on-io-error
> super block setting for dirty pages and panic in this case too.
> Unfortunately this setting is currently per file system, not generic,
> so it would need to be a fs specific check (or the flag would need
> to be moved into a generic fs superblock field first)

A generic setting would be better, so I suggested a
panic_on_dirty_page_cache_corruption flag which would be checked
before invoking error_remove_page(). If we check per-filesystem
settings, we might want to notify EIO to the filesystem.

> I think that would be relatively clean semantics wise. Would you be
> interested in working on patches for that?

Yes. :-)
I will work on this as soon as I come back from summer vacation.

>>Another option is to introduce "ignore all" policy instead of
>>panicking at the beginning of memory_failure(). Perhaps it finally
>>causes SRAR machine check, and then kernel will panic or a process
>>will be killed. Anyway, this is a topic for the next stage.
>
> The problem is memory_failure() would then need to start distinguishing
> between AR=1 and AR=0 which it doesn't today.
>
> It could be done, but would need some more work.

It's my understanding that memory_failure() is never called in
the AR=1 case. Is that wrong?

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3])
 by kanga.kvack.org (Postfix) with ESMTP id E62A76B004F
 for ; Wed, 12 Aug 2009 06:16:55 -0400 (EDT)
Date: Wed, 12 Aug 2009 12:16:58 +0200
From: Andi Kleen
Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for migration aware file systems
Message-ID: <20090812101658.GF28848@basil.fritz.box>
References: <200908051136.682859934@firstfloor.org>
 <20090805093643.E0C00B15D8@basil.firstfloor.org>
 <4A7FBFD1.2010208@hitachi.com>
 <20090810074421.GA6838@basil.fritz.box>
 <4A80EAA3.7040107@hitachi.com>
 <20090811071756.GC14368@basil.fritz.box>
 <4A822DD4.1050202@hitachi.com>
 <20090812074611.GC28848@basil.fritz.box>
 <4A8290CE.7000904@hitachi.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4A8290CE.7000904@hitachi.com>
Sender: owner-linux-mm@kvack.org
To: Hidehiro Kawai
Cc: Andi Kleen, tytso@mit.edu, hch@infradead.org, mfasheh@suse.com,
 aia21@cantab.net, hugh.dickins@tiscali.co.uk, swhiteho@redhat.com,
 akpm@linux-foundation.org, npiggin@suse.de,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 fengguang.wu@intel.com, Satoshi OSHIMA, Taketoshi Sakuraba
List-ID:

On Wed, Aug 12, 2009 at 06:52:14PM +0900, Hidehiro Kawai wrote:
> Andi Kleen wrote:
>
> >>Generally, dropping unwritten dirty page caches is considered to be
> >>risky. So the "panic on IO error" policy has been used as usual
> >>practice for some systems. I just suggested that we adopted
> >>this policy into machine check errors.
> >
> > Hmm, what we could possibly do -- as follow-on patches -- would be to
> > let error_remove_page check the per file system panic-on-io-error
> > super block setting for dirty pages and panic in this case too.
> > Unfortunately this setting is currently per file system, not generic,
> > so it would need to be a fs specific check (or the flag would need
> > to be moved into a generic fs superblock field first)
>
> A generic setting would be better, so I suggested a
> panic_on_dirty_page_cache_corruption flag which would be checked
> before invoking error_remove_page(). If we check per-filesystem
> settings, we might want to notify EIO to the filesystem.

You mean remounting ro if that is set? That makes sense, but I'm not
sure how complicated it would be.

I still would prefer to unify it with the file system settings.

> > The problem is memory_failure() would then need to start distinguishing
> > between AR=1 and AR=0 which it doesn't today.
> >
> > It could be done, but would need some more work.
>
> It's my understanding that memory_failure() is never called in
> the AR=1 case. Is that wrong?

Today yes, but we don't want to hardcode that assumption.
e.g. for IA64 they will definitely need the equivalent of AR=1
handling.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
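The policy being discussed in this exchange can be written down as a tiny decision function. This is a hedged sketch, not an implementation from the patch series: Action, poisoned_page_action and both parameter names are hypothetical, and treating the generic flag and the per-filesystem setting as interchangeable inputs is an assumption made only so both proposals fit into one function.

```python
from enum import Enum


class Action(Enum):
    PANIC = "panic"
    REMOVE_PAGE = "remove page"


def poisoned_page_action(dirty, generic_panic=False, fs_panic_on_io_error=False):
    """Decide what to do with a page cache page hit by a memory error.

    dirty:                the page holds data not yet written back
    generic_panic:        stands in for a system-wide flag such as the
                          proposed panic_on_dirty_page_cache_corruption
    fs_panic_on_io_error: stands in for a per-filesystem
                          panic-on-IO-error setting
    """
    # Only a dirty page carries data that exists nowhere else, so only
    # a dirty page can trigger the panic policy in this model.
    if dirty and (generic_panic or fs_panic_on_io_error):
        return Action.PANIC
    return Action.REMOVE_PAGE
```

For example, poisoned_page_action(True, generic_panic=True) yields Action.PANIC, while a clean page yields Action.REMOVE_PAGE regardless of either setting. The sketch deliberately leaves out the AR=1 versus AR=0 distinction, which the thread treats as later work.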