Re: [patch 3/9] mm: speculative get page

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>, Hugh Dickins <hugh@veritas.com>,
	Andrew Morton <akpm@osdl.org>,
	Linux Memory Management <linux-mm@kvack.org>
Subject: Re: [patch 3/9] mm: speculative get page
Date: Tue, 26 Sep 2006 08:41:55 +1000	[thread overview]
Message-ID: <45185B33.6060109@yahoo.com.au> (raw)
In-Reply-To: <1159189453.5018.25.camel@lappy>

Peter Zijlstra wrote:

>On Mon, 2006-09-25 at 13:47 +0200, Nick Piggin wrote:
>
>
>>+/*
>>+ * speculatively take a reference to a page.
>>+ * If the page is free (_count == 0), then _count is untouched, and 0
>>+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
>>+ *
>>+ * This function must be run in the same rcu_read_lock() section as has
>>+ * been used to lookup the page in the pagecache radix-tree: this allows
>>+ * allocators to use a synchronize_rcu() to stabilize _count.
>>+ *
>>+ * Unless an RCU grace period has passed, the count of all pages coming out
>>+ * of the allocator must be considered unstable. page_count may return higher
>>+ * than expected, and put_page must be able to do the right thing when the
>>+ * page has been finished with (because put_page is what is used to drop an
>>+ * invalid speculative reference).
>>+ *
>>+ * This forms the core of the lockless pagecache locking protocol, where
>>+ * the lookup-side (eg. find_get_page) has the following pattern:
>>+ * 1. find page in radix tree
>>+ * 2. conditionally increment refcount
>>+ * 3. check the page is still in pagecache (if no, goto 1)
>>+ *
>>+ * Remove-side that cares about stability of _count (eg. reclaim) has the
>>
>                                   ^^^^^^^^^^^^^^^^^^^
>is that the reason that the following two code paths are good without
>change:
>
>  truncate_inode_page_range()
>   truncate_complete_page()
>     remove_from_page_cache()
>       radix_tree_delete()
>

^^^ Yes.

>and
>
>  zap_pte_range()
>    free_swap_and_cache()  <-- does check page_count()
>      delete_from_swap_cache()
>        __delete_from_swap_cache()
>          radix_tree_delete()
>
>>From the comments around the truncate bit it seems to be ok with keeping
>the page as anonymous, however the zap_pte_range() thing does seem to
>want to have a stable page_count().
>

However when I last looked at it, it count can be elevated there for other
reasons (I think it was swap IO or get_user_pages or something). Anyway,
those pages will remain on the LRU and eventually get reclaimed.

I did initially change that code around a little bit, but I remember
working through it with Hugh and we decided that it would be OK as it was.
It should indeed be commented though.

>>+ * following (with tree_lock held for write):
>>+ * A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
>>+ * B. remove page from pagecache
>>+ * C. free the page
>>+ *
>>+ * There are 2 critical interleavings that matter:
>>+ * - 2 runs before A: in this case, A sees elevated refcount and bails out
>>+ * - A runs before 2: in this case, 2 sees zero refcount and retries;
>>+ *   subsequently, B will complete and 1 will find no page, causing the
>>+ *   lookup to return NULL.
>>+ *
>>+ * It is possible that between 1 and 2, the page is removed then the exact same
>>+ * page is inserted into the same position in pagecache. That's OK: the
>>+ * old find_get_page using tree_lock could equally have run before or after
>>+ * such a re-insertion, depending on order that locks are granted.
>>+ *
>>+ * Lookups racing against pagecache insertion isn't a big problem: either 1
>>+ * will find the page or it will not. Likewise, the old find_get_page could run
>>+ * either before the insertion or afterwards, depending on timing.
>>+ */
>>
>
>Awesome code ;-)
>

Thanks :) Well, thank Hugh.

--

Send instant messages to your online friends http://au.messenger.yahoo.com 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2006-09-25 22:41 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-09-22 19:22 [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin
2006-09-22 19:22 ` [patch 2/4] radix-tree: use indirect bit Nick Piggin
2006-09-22 19:22 ` [patch 2/9] radix-tree: gang_lookup_slot Nick Piggin
2006-09-22 19:22 ` [patch 3/9] mm: speculative get page Nick Piggin
2006-09-23 10:01   ` Peter Zijlstra
2006-09-24 18:01   ` Hugh Dickins
2006-09-25  2:00     ` Nick Piggin
2006-09-25 11:47       ` Nick Piggin
2006-09-25 13:04         ` Peter Zijlstra
2006-09-25 22:41           ` Nick Piggin [this message]
2006-09-22 19:22 ` [patch 4/9] mm: lockless pagecache lookups Nick Piggin
2006-09-22 20:01   ` Lee Schermerhorn
2006-09-23  2:35     ` Nick Piggin
2006-09-22 19:24 ` [patch 0/4] lockless pagecache for 2.6.18-rc7-mm1 Nick Piggin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=45185B33.6060109@yahoo.com.au \
    --to=nickpiggin@yahoo.com.au \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@osdl.org \
    --cc=hugh@veritas.com \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox