From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suparna Bhattacharya <suparna@in.ibm.com>
Subject: Re: [RFC PATCH 1/3] Introduce a place holder page for the pagecache.
Date: Fri, 3 Nov 2006 12:42:57 +0530
Message-ID: <20061103071257.GA13999@in.ibm.com>
References: <20061027182215.GJ19895@think.oraclecorp.com> <20061027182358.GK19895@think.oraclecorp.com> <20061101102350.GA25882@in.ibm.com> <20061101124153.GF5620@think.oraclecorp.com>
Reply-To: suparna@in.ibm.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: linux-fsdevel@vger.kernel.org, akpm@osdl.org, zach.brown@oracle.com
Return-path: <linux-fsdevel-owner@vger.kernel.org>
Received: from e36.co.us.ibm.com ([32.97.110.154]:8607 "EHLO e36.co.us.ibm.com")
	by vger.kernel.org with ESMTP id S1753161AbWKCHKI (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Fri, 3 Nov 2006 02:10:08 -0500
Received: from d03relay04.boulder.ibm.com (d03relay04.boulder.ibm.com [9.17.195.106])
	by e36.co.us.ibm.com (8.13.8/8.12.11) with ESMTP id kA37A7oD005275
	for <linux-fsdevel@vger.kernel.org>; Fri, 3 Nov 2006 02:10:07 -0500
Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169])
	by d03relay04.boulder.ibm.com (8.13.6/8.13.6/NCO v8.1.1) with ESMTP id kA37A7p3331310
	for <linux-fsdevel@vger.kernel.org>; Fri, 3 Nov 2006 00:10:07 -0700
Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1])
	by d03av03.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id kA37A638010511
	for <linux-fsdevel@vger.kernel.org>; Fri, 3 Nov 2006 00:10:07 -0700
To: Chris Mason <chris.mason@oracle.com>
Content-Disposition: inline
In-Reply-To: <20061101124153.GF5620@think.oraclecorp.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Wed, Nov 01, 2006 at 07:41:53AM -0500, Chris Mason wrote:
> On Wed, Nov 01, 2006 at 03:53:50PM +0530, Suparna Bhattacharya wrote:
> > 
> > Really appreciate your biting the bullet on this one. I have a few
> > minor comments/questions as I'm trying to compare this to what we
> > have discussed in the past and the last attempt approach outlined in
> > http://www.kernel.org/pub/linux/kernel/people/suparna/DIO-simplify.txt
> > back in Feb this year.
> 
> The main difference from DIO-simplify.txt is that I'm making truncate
> wait on the placeholder pages instead of using i_alloc_sem.  I haven't

Ah OK, that is true.

> coded the TAG_LOCKED part, but more on that below.  As you
> noticed, I'm using expanding truncates to update i_size for extending
> the file, which has pluses and minuses.
> 
> > 
> > I recall that the first time you experimented with dummy struct pages
> > (placeholder pages as they are now called), you ended up putting it on
> > hold due to regressions in performance which we attributed to radix
> > tree insertion overheads. I think you mention in an earlier note that
> > you've managed to optimize some of that -- I'm just wondering what helped
> > there - radix_tree_preload or something more than that ?
> 
> My original code was fairly dumb and didn't use gang lookups.  It also
> dropped the radix tree lock after every insert/delete.  The code posted
> here is about 2x as fast (or 1/2 the regression, depending on how you
> look at it).   For smaller read ios, the performance hit is about zero.
> The cost of radix tree insertion is offset by less mutex and semaphore
> locking.

OK, so you think the main overheads had to do with radix-tree locking,
rather than with the additional node insertions ?

> 
> > 
> > This is sort of why we had discussed the alternative of tagged locking
> > in the radix tree.
> > 	  Lookup and lock pages in the range by tagging the radix tree
> > 	  slots as "TAG_LOCKED", and locking pages that were present.
> > 	  Modify find_get_page et al to return a dummy locked cache page
> > 	  if TAG_LOCKED. (Alternatively put the check in add_to_page_cache,
> > 	  instead)
> > Putting the check in add_to_page_cache is more consistent with what you
> > do now.
> > 
> > Where I think we were stuck at that point was that even this helps only upto
> > a point in saving on leaf nodes, but did not avoid intermediate levels of
> > nodes from being created. With Nick's changes to the radix tree .. I have 
> > been thinking that some kind of sliding height scheme could become possible.
> > 
> > Anyway, the other reason for bringing this up is because I am wondering if
> > some of the complexity in find_and_insert_placeholder_pages and a few other
> > places could be reduced with a tagged locking approach where we do not
> > actually stuff placeholder pages explicitly in the page cache and hence
> > possibly can avoid some of the checks you have below. I need to take a closer
> > look to be completely sure, but just a thought ...
> > 
> > I guess the important thing at this point is to get the interfaces right
> > in a way that allows for future optimization.
> 
> Yes, my goal was to get the code fast enough for most machines, so that
> we could test the result and see what impact it would have on the rest
> of the page cache.  From here we can decide if additional radix tuning
> is a good idea, or if we just want to stick with the mutex/semaphore
> combo in use now.

Yup, that makes sense.

The complexity in find_and_insert_placeholder_pages does worry me though,
wouldn't want to end up debugging tricky problems in that code after all
this. Do you have any ideas on that ?

> 
> I suspect that we won't be able to tune the radix tree enough to make it
> suitable for the huge SGI 512 cpu test cases.  So we might want to stop
> tuning now (letting XFS keep their current code) or switch away from
> radix to something that can lock down an extent.
> 
> Radix tree gang lookups, insertions and deletions all have room for
> optimization though (even without the LOCKED tag).  A new tag in the
> radix tree isn't free, Andrew might prefer not to increase the space
> usage across the board just for O_DIRECT.

Yes that was a consideration on my mind as well. There are costs both
ways.

Regards
Suparna
> 
> -chris

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India