From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jamie Lokier
Subject: Re: [rfc] fsync_range?
Date: Wed, 21 Jan 2009 12:02:57 +0000
Message-ID: <20090121120257.GB8609@shareable.org>
References: <20090120164726.GA24891@wotan.suse.de> <20090120183120.GD27464@shareable.org> <20090121012900.GD24891@wotan.suse.de> <20090121031500.GA2354@shareable.org> <20090121041604.GI24891@wotan.suse.de> <20090121045921.GA3944@shareable.org> <20090121062306.GK24891@wotan.suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org
To: Nick Piggin
Return-path:
Received: from mail2.shareable.org ([80.68.89.115]:51234 "EHLO mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756180AbZAUMC7 (ORCPT ); Wed, 21 Jan 2009 07:02:59 -0500
Content-Disposition: inline
In-Reply-To: <20090121062306.GK24891@wotan.suse.de>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

Nick Piggin wrote:
> Again it comes back to the whole writeout thing, which makes it more
> constraining on the kernel to optimise.

Cute :-)
It was intended to make it easier to optimise, but maybe it failed.

> For example, my fsync "livelock" avoidance patches did the following:
>
> 1. find all pages which are dirty or under writeout first.
> 2. write out the dirty pages.
> 3. wait for our set of pages.
>
> Simple, obvious, and the kernel can optimise this well because the
> userspace has asked for a high level request "make this data safe"
> rather than low level directives. We can't do this same nice simple
> sequence with sync_file_range because SYNC_FILE_RANGE_WAIT_AFTER
> means we have to wait for all writeout pages in the range, including
> unrelated ones, after the dirty writeout. SYNC_FILE_RANGE_WAIT_BEFORE
> means we have to wait for clean writeout pages before we even start
> doing real work.

As noted in my other mail just now, although sync_file_range() is
described as though it does the three bulk operations consecutively, I
think it wouldn't be too shocking to think the intended semantics
_could_ be:

    "wait and initiate writeouts _as if_ we did, for each page _in parallel_ {
         if (SYNC_FILE_RANGE_WAIT_BEFORE && page->writeout)
             wait(page)
         if (SYNC_FILE_RANGE_WRITE)
             start_writeout(page)
         if (SYNC_FILE_RANGE_WAIT_AFTER && page->writeout)
             wait(page)
     }"

That permits many strategies, and I think one of them is the nice
livelock-avoiding fsync you describe up above.

You might be able to squeeze the sync_file_range() flags into that by
chopping it up like this. Btw, you omitted step 1.5 "wait for dirty
pages which are already under writeout", but it's made explicit here:

    1. find all pages which are dirty or under writeout first,
       and remember which of them are dirty _and_ under writeout (DW).
    2. if (SYNC_FILE_RANGE_WRITE)
           write out the dirty pages not in DW.
    3. if (SYNC_FILE_RANGE_WAIT_BEFORE) {
           wait for the set of pages in DW.
           write out the pages in DW.
       }
    4. if (SYNC_FILE_RANGE_WAIT_BEFORE || SYNC_FILE_RANGE_WAIT_AFTER)
           wait for our set of pages.

However, maybe the flags aren't all that useful really, and maybe
sync_file_range() could be replaced by a stub which ignores the flags
and calls fsync_range().

-- Jamie
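
For illustration only, the four-step ordering above can be modelled in user space. This is a rough sketch, not kernel code: `struct page`, `start_writeout()`, `wait_writeout()`, `sync_range_model()` and the `SFR_*` constants are all hypothetical names invented here (the `SFR_*` values only mirror the roles of the real SYNC_FILE_RANGE_* flags), and "waiting" is reduced to clearing a bit. It assumes the set of pages is snapshotted once in step 1 and that later steps touch only that set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SYNC_FILE_RANGE_* flags (toy values). */
enum {
    SFR_WAIT_BEFORE = 1,
    SFR_WRITE       = 2,
    SFR_WAIT_AFTER  = 4,
};

#define MAX_PAGES 64

/* Toy page state: not the kernel's struct page. */
struct page {
    bool dirty;     /* holds data not yet submitted for writeout */
    bool writeout;  /* an earlier write is still in flight */
};

/* Submitting a page cleans it and puts it under writeout. */
static void start_writeout(struct page *p)
{
    p->dirty = false;
    p->writeout = true;
}

/* "Waiting" in this model simply completes the in-flight write. */
static void wait_writeout(struct page *p)
{
    p->writeout = false;
}

static void sync_range_model(struct page *pages, size_t n, unsigned flags)
{
    bool in_set[MAX_PAGES];  /* pages dirty or under writeout at entry */
    bool in_dw[MAX_PAGES];   /* pages dirty _and_ under writeout (DW) */
    size_t i;

    assert(n <= MAX_PAGES);

    /* 1. Snapshot the set once, so pages dirtied later are ignored. */
    for (i = 0; i < n; i++) {
        in_set[i] = pages[i].dirty || pages[i].writeout;
        in_dw[i]  = pages[i].dirty && pages[i].writeout;
    }

    /* 2. Write out the dirty pages that are not in DW. */
    if (flags & SFR_WRITE)
        for (i = 0; i < n; i++)
            if (in_set[i] && pages[i].dirty && !in_dw[i])
                start_writeout(&pages[i]);

    /* 3. DW pages must finish their old write before being resubmitted.
     *    This follows the list literally: gated on WAIT_BEFORE only. */
    if (flags & SFR_WAIT_BEFORE)
        for (i = 0; i < n; i++)
            if (in_dw[i]) {
                wait_writeout(&pages[i]);
                start_writeout(&pages[i]);
            }

    /* 4. Wait for our set of pages, and only for those. */
    if (flags & (SFR_WAIT_BEFORE | SFR_WAIT_AFTER))
        for (i = 0; i < n; i++)
            if (in_set[i] && pages[i].writeout)
                wait_writeout(&pages[i]);
}
```

In this model, calling `sync_range_model()` with all three flags leaves every page that was in the initial set neither dirty nor under writeout, while `SFR_WRITE` alone submits only the dirty pages outside DW and waits for nothing.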