From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: [patch 9/9] mm: do_sync_mapping_range integrity fix Date: Fri, 31 Oct 2008 10:16:16 +0100 Message-ID: <20081031091616.GF19268@wotan.suse.de> References: <20081028144715.683011000@suse.de> <20081028145734.706927000@nick.local0.net> <20081030161344.0ed5ca52.akpm@linux-foundation.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: linux-fsdevel@vger.kernel.org, david@fromorbit.com, chris.mason@oracle.com To: Andrew Morton Return-path: Received: from mx1.suse.de ([195.135.220.2]:41338 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753022AbYJaJQS (ORCPT ); Fri, 31 Oct 2008 05:16:18 -0400 Content-Disposition: inline In-Reply-To: <20081030161344.0ed5ca52.akpm@linux-foundation.org> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Thu, Oct 30, 2008 at 04:13:44PM -0700, Andrew Morton wrote: > On Wed, 29 Oct 2008 01:47:24 +1100 > npiggin@suse.de wrote: > > > Chris Mason notices do_sync_mapping_range didn't actually ask for data > > integrity writeout. Unfortunately, it is advertised as being usable for > > data integrity operations. > > > > This is a data interity bug. > > > > Signed-off-by: Nick Piggin > > --- > > Index: linux-2.6/fs/sync.c > > =================================================================== > > --- linux-2.6.orig/fs/sync.c > > +++ linux-2.6/fs/sync.c > > @@ -269,7 +269,7 @@ int do_sync_mapping_range(struct address > > > > if (flags & SYNC_FILE_RANGE_WRITE) { > > ret = __filemap_fdatawrite_range(mapping, offset, endbyte, > > - WB_SYNC_NONE); > > + WB_SYNC_ALL); > > if (ret < 0) > > goto out; > > } > > > > Really? > > Some thought did go into the code which you're "fixing". Yes, I even remember something of a flamewar involving me and you :) > If the caller > is using sync_file_range() for integrity then the caller has done a > SYNC_FILE_RANGE_WAIT_BEFORE. No disputes about whether the API works "by design". But I think the implementation has a bug. I'll explain: > So all we need to guarantee here is that > __filemap_fdatawrite_range(WB_SYNC_NONE) will start writeout on all > dirty pages in the range. Probably that gets broken lower down as part > of various hacks^woptimisations have gone in, but which ones, and > where? Perhaps _this_ (if it's there) is what should be fixed. WB_SYNC_NONE has never (until this was introduced) been used for data integrity AFAIKS. There is code littered throughout fs/ which assumes WB_SYNC_NONE ~= "efficiency / background writeback". At least definitely the buffer.c "trylock" will cause dirty pages to be skipped. There is also a fair amount of filesystem code I haven't looked at. The fs-writeback.c code might not affect this path, but it also definitely makes the same assumption about WB_SYNC_NONE, so it would be ugly to mandate WB_SYNC_NONE is for data integrity from mapping downard, but not from inode upward... I didn't check, but I suspect this has been broken since it got merged. > And if we _do_ make the above change, we don't need to run the > wait_on_page_writeback_range() if userspace asked for > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE, yes? Now you're asking the hard questions... I think we still have to wait, because SYNC_FILE_RANGE_WRITE itself doesn't necessarily wait for writeback. After the optimisation to skip waiting for clean and writeback pages in write_cache_pages, actually I think this change (to use WB_SYNC_ALL) should not hurt very much... > > > > IOW, I don't think enough thought (or at least description of that > thought) has gone into this one.