From mboxrd@z Thu Jan  1 00:00:00 1970
From: Matthew Wilcox <matthew@wil.cx>
Subject: Re: hole-punch vs fault
Date: Wed, 27 Nov 2013 21:44:54 -0700
Message-ID: <20131128044454.GL24288@parisc-linux.org>
References: <20131127134831.GI24288@parisc-linux.org> <20131127221932.GA27330@quack.suse.cz> <20131128023343.GJ24288@parisc-linux.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org
To: Jan Kara <jack@suse.cz>
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from palinux.external.hp.com ([192.25.206.14]:39495 "EHLO
	mail.parisc-linux.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754434Ab3K1Eo4 (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 27 Nov 2013 23:44:56 -0500
Content-Disposition: inline
In-Reply-To: <20131128023343.GJ24288@parisc-linux.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Wed, Nov 27, 2013 at 07:33:43PM -0700, Matthew Wilcox wrote:
> On Wed, Nov 27, 2013 at 11:19:32PM +0100, Jan Kara wrote:
> > > The second is to put some kind of generation counter in the inode or
> > > address_space that is incremented on every deallocation.  The fault path
> > > would read the generation counter before calling find_get_page() and then
> > > check it hasn't changed after getting the page lock (goto retry_find).  I
> > > like that one a little more since we can fix it all in common code, and I
> > > think it'll be lower overhead in the fault path.
> >   I'm not sure I understand how this should work. After pagecache is
> > truncated, fault can come and happily instantiate the page again. After
> > that fault is done, hole punching awakes and removes blocks from under that
> > page. So checking the generation counter after you get the page lock seems
> > useless to me.
> 
> Yeah, I don't think the page lock is enough (maybe someone can convince
> me otherwise).  How does this look?  Checking the generation number
> after we get i_mutex ensures that the truncate has finished running.

Ohh.  We can't take i_mutex because that's an inversion with mmap_sem.
So option 1 is to convince ourselves that page lock is enough (and that
doesn't solve the problem for ext2-xip).

Option 2 is to do something similar to seqcount_t.  We can't use
seqcount_t as-is because it spins in read_seqcount_begin, and it could
be spinning for a very long time while we read a page in from disc!

How about something like this:

typedef struct seqtex {
	unsigned		sequence;
	spinlock_t		wait_lock;
	struct list_head	wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map	dep_map;
#endif
}

Take seqtex_read_lock() at the beginning of filemap_fault(), release
it after we have the page locked (since truncate_inode_pages_range will
block on the pagelock before tearing down the data structures).

Take seqtex_write_lock() at the start of truncate & holepunch operations,
drop it at the end of truncate & holepunch operations.

seqtex_read_lock blocks if we're inside a write lock.  seqtex_read_unlock
tells us whether to try again.  write_unlock wakes anybody waiting.
write_lock doesn't block because we're assuming the i_mutex is held above
us, so we can only have one writer at a time (similar to seqcount_t).


I keep hoping for an option 3.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."