From: Hidehiro Kawai
Date: Tue, 11 Aug 2009 12:50:59 +0900
To: Andi Kleen
Cc: tytso@mit.edu, hch@infradead.org, mfasheh@suse.com, aia21@cantab.net,
    hugh.dickins@tiscali.co.uk, swhiteho@redhat.com,
    akpm@linux-foundation.org, npiggin@suse.de,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    fengguang.wu@intel.com, Satoshi OSHIMA, Taketoshi Sakuraba
Subject: Re: [PATCH] [16/19] HWPOISON: Enable .remove_error_page for
    migration aware file systems
Message-ID: <4A80EAA3.7040107@hitachi.com>
In-Reply-To: <20090810074421.GA6838@basil.fritz.box>
References: <200908051136.682859934@firstfloor.org>
    <20090805093643.E0C00B15D8@basil.firstfloor.org>
    <4A7FBFD1.2010208@hitachi.com>
    <20090810074421.GA6838@basil.fritz.box>

Andi Kleen wrote:

>>1. An uncorrected error on a dirty page cache page is detected by
>>   memory scrubbing
>>2. Kernel unmaps and truncates the page to recover from the error
>>3. An application reads data from the file location corresponding
>>   to the truncated page
>>   ==> Old or garbage data will be read into a new page cache page
>
> The problem currently is that the error is not sticky enough and
> doesn't stay around long enough. It gets reported once,
> but not in later IO operations.
>
> However it's a generic problem not unique to hwpoison. Me

Yes, it's a generic problem, and introducing a sticky error flag is
one of the approaches to solving it. I think it is a good approach
because it doesn't depend on individual filesystems.

> And application
> that doesn't handle current IO errors correctly will also
> not necessarily handle hwpoison correctly (it's not better and not worse)

This is my main concern. I'd like to prevent re-corruption even if
applications don't have good manners. For ordinary I/O errors, ext3/4
can now do this with the data=ordered and data_err=abort mount
options. Moreover, if you mount the ext3/4 filesystem with the
additional errors=panic option, the kernel panics on a write error
instead of remounting the filesystem read-only. Customers who regard
data integrity as very important require these features.

But this patch (PATCH 16/19) introduces the problem again, because it
doesn't provide a way to shut out further writes to the fs. Of course,
we can do that by setting the tolerant level to 0 or
memory_failure_recovery to 0, but that would be overkill.
That is why I suggested this:

>>(2) merge this patch with new panic_on_dirty_page_cache_corruption
>>    sysctl

> That is something that could be improved in the VFS -- although I fear
> any improvements here could also break compatibility. I don't think
> it's a blocker on hwpoison for now. It needs more design
> effort and thinking (e.g. likely the address space IO error
> bit should be separated into multiple bits)
>
> Perhaps you're interested in working on this?

Yes. Transient IO errors can cause the same re-corruption problem.
ext3/4 now provide ways to prevent it, but the other filesystems do
not. We need a generic way.

>>4. The application modifies the data and writes it back to the disk
>>5. The file will be corrupted!
>>
>>(Yes, the application is wrong to not do the right thing, i.e. fsync,
>> but it's not user's fault!)
>>
>>A similar data corruption can be caused by a write I/O error,
>>because dirty flag is cleared even if the page couldn't be written
>>to the disk.
>>
>>However, we have a way to avoid this kind of data corruption at
>>least for ext3. If we mount an ext3 filesystem with data=ordered
>>and data_err=abort, all I/O errors on file data block belonging to
>>the committing transaction are checked. When I/O error is found,
>>abort journaling and remount the filesystem with read-only to
>>prevent further updates. This kind of feature is very important
>>for mission critical systems.
>
> Well it sounds like a potentially useful enhancement to ext3 (or ext4).
>
> One issue is that the default is not ordered anymore since
> Linus changed the default.

Yes, but what is important is whether the system provides such a
feature or not.

> I'm sure other enhancements for IO errors could be done too.
> Some of the file systems also handle them still quite poorly (e.g. btrfs)
>
> But again I don't think it's a blocker for hwpoison.

Unfortunately, it can be a blocker. As I stated, we can block the
possible re-corruption caused by transient IO errors on ext3/4
filesystems. But with this patch (PATCH 16/19) applied, re-corruption
can happen even if we use the data=ordered, data_err=abort and
errors=panic mount options. So...

>>I think there are three options,
>>
>>(1) drop this patch
>>(2) merge this patch with new panic_on_dirty_page_cache_corruption
>>    sysctl
>>(3) implement a more sophisticated error_remove_page function
>
> (4) accept that hwpoison error handling is not better and not worse than normal
> IO error handling.
>
> We opted for (4).

Could you consider adopting (2) or (3)? Fengguang's sticky EIO
approach (http://lkml.org/lkml/2009/6/11/294) is also OK.

I hope the HWPOISON patches are merged into 2.6.32. So (2) is the best
answer for me, because it's simple and less intrusive.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center