From mboxrd@z Thu Jan 1 00:00:00 1970 Subject: Re: [RFC][PATCH 1/3] tracking dirty pages in shared mappings -V4 From: Peter Zijlstra In-Reply-To: <20060511213045.32b41aa6.akpm@osdl.org> References: <1146861313.3561.13.camel@lappy> <445CA22B.8030807@cyberone.com.au> <1146922446.3561.20.camel@lappy> <445CA907.9060002@cyberone.com.au> <1146929357.3561.28.camel@lappy> <1147116034.16600.2.camel@lappy> <1147207458.27680.19.camel@lappy> <20060511080220.48688b40.akpm@osdl.org> <4463EA16.5090208@cyberone.com.au> <20060511213045.32b41aa6.akpm@osdl.org> Content-Type: text/plain Date: Fri, 12 May 2006 09:06:01 +0200 Message-Id: <1147417561.8951.17.camel@twins> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Return-Path: To: Andrew Morton Cc: Nick Piggin , clameter@sgi.com, torvalds@osdl.org, ak@suse.de, rohitseth@google.com, mbligh@google.com, hugh@veritas.com, riel@redhat.com, andrea@suse.de, arjan@infradead.org, apw@shadowen.org, mel@csn.ul.ie, marcelo@kvack.org, anton@samba.org, paulmck@us.ibm.com, linux-mm@kvack.org List-ID: On Thu, 2006-05-11 at 21:30 -0700, Andrew Morton wrote: > Nick Piggin wrote: > > > > >So let's see. We take a write fault, we mark the page dirty then we return > > >to userspace which will proceed with the write and will mark the pte dirty. > > > > > >Later, the VM will write the page out. > > > > > >Later still, the pte will get cleaned by reclaim or by munmap or whatever > > >and the page will be marked dirty and the page will again be written out. > > >Potentially needlessly. > > > > > > > page_wrprotect also marks the page clean, > > Oh. I missed that when reading the comment which describes > page_wrprotect() (I do go on). Yes, this name is not the best of names :-( I was aware of this, but since in my mind the counting through protection faults was the prime idea, I stuck to page_wrprotect(). But I'm hard pressed to come up with a better one. Nick proposes: page_mkclean() But that also doesn't cover the whole of it from my perspective. > > so this window is very small. > > The window is that the fault path might set_page_dirty, then throttle > > on writeout, and the page gets written out before it really gets dirtied > > by the application (which then has to fault again). > > : int test_clear_page_dirty(struct page *page) > : { > : struct address_space *mapping = page_mapping(page); > : unsigned long flags; > : > : if (mapping) { > : write_lock_irqsave(&mapping->tree_lock, flags); > : if (TestClearPageDirty(page)) { > : radix_tree_tag_clear(&mapping->page_tree, > : page_index(page), > : PAGECACHE_TAG_DIRTY); > : write_unlock_irqrestore(&mapping->tree_lock, flags); > : /* > : * We can continue to use `mapping' here because the > : * page is locked, which pins the address_space > : */ > > So if userspace modifies the page right here, and marks the pte dirty. > > : if (mapping_cap_account_dirty(mapping)) { > : page_wrprotect(page); > > We just lost that pte dirty bit, and hence the user's data. I thought that at the time we clean PAGECACHE_TAG_DIRTY the page is in flight to disk. Now that I look at it again, perhaps the page_wrprotect() call in clear_page_dirty_for_io() should be in test_set_page_writeback(). > : dec_page_state(nr_dirty); > : } > : return 1; > : } > : write_unlock_irqrestore(&mapping->tree_lock, flags); > : return 0; > : } > : return TestClearPageDirty(page); > : } > : > > Which is just the sort of subtle and nasty problem I was referring to... > > If that's correct then I guess we need the > > if (ptep_clear_flush_dirty(vma, addr, pte) || > page_test_and_clear_dirty(page)) > ret += set_page_dirty(page); > > treatment in page_wrprotect(). I would make that a BUG_ON(); the only way for a pte of a shared mapping to become dirty is through the fault handler, and that should already call set_page_dirty() on it. > Now I suppose it's not really a dataloss race, because in practice the > kernel is about to write this page to backing store anwyay. I guess. I > cannot immediately think of any clear_page_dirty() callers for whom that > won't be true. > > Someone please convince me that this has all been thought about and is solid > as a rock. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org