All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Avi Kivity <avi@redhat.com>, Thomas Gleixner <tglx@linutronix.de>,
	Rik van Riel <riel@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	David Miller <davem@davemloft.net>,
	Hugh Dickins <hugh.dickins@tiscali.co.uk>,
	Mel Gorman <mel@csn.ul.ie>, Nick Piggin <npiggin@suse.de>
Subject: Re: [PATCH 13/13] mm: Optimize page_lock_anon_vma
Date: Fri, 9 Apr 2010 12:22:49 -0700	[thread overview]
Message-ID: <20100409192249.GD2421@linux.vnet.ibm.com> (raw)
In-Reply-To: <1270802129.20295.3269.camel@laptop>

On Fri, Apr 09, 2010 at 10:35:29AM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-08 at 15:18 -0700, Paul E. McKenney wrote:
> > On Thu, Apr 08, 2010 at 09:17:50PM +0200, Peter Zijlstra wrote:
> > > Optimize page_lock_anon_vma() by removing the atomic ref count
> > > ops from the fast path.
> > > 
> > > Rather complicates the code a lot, but might be worth it.
> > 
> > Some questions and a disclaimer below.
> > 
> > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > ---
> > >  mm/rmap.c |   71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 67 insertions(+), 4 deletions(-)
> > > 
> > > Index: linux-2.6/mm/rmap.c
> > > ===================================================================
> > > --- linux-2.6.orig/mm/rmap.c
> > > +++ linux-2.6/mm/rmap.c
> > > @@ -78,6 +78,12 @@ static inline struct anon_vma *anon_vma_
> > >  void anon_vma_free(struct anon_vma *anon_vma)
> > >  {
> > >  	VM_BUG_ON(atomic_read(&anon_vma->ref));
> > > +	/*
> > > +	 * Sync against the anon_vma->lock, so that we can hold the
> > > +	 * lock without requiring a reference. See page_lock_anon_vma().
> > > +	 */
> > > +	mutex_lock(&anon_vma->lock);
> > 
> > On some systems, the CPU is permitted to pull references into the critical
> > section from either side.  So, do we also need an smp_mb() here?
> > 
> > > +	mutex_unlock(&anon_vma->lock);
> > 
> > So, a question...
> > 
> > Can the above mutex be contended?  If yes, what happens when the
> > competing mutex_lock() acquires the lock at this point?  Or, worse yet,
> > after the kmem_cache_free()?
> > 
> > If no, what do we accomplish by acquiring the lock?
> 
> The thing we gain is that when the holder of the lock finds a !0
> refcount it knows it can't go away because any free will first wait to
> acquire the lock.

OK.  Here is the sequence of events that I am concerned about:

1.	CPU 0 invokes page_lock_anon_vma() [13/13], and executes the
	assignment to anon_vma.  It has not yet attempted to
	acquire the anon_vma->lock mutex.

2.	CPU 1 invokes page_unlock_anon_vma() [13/13], which in turn
	calls anon_vma_put() [5/13], which atomically decrements
	->ref, finds it zero, invokes anon_vma_free() [13/13], which 
	finds ->ref still zero, so acquires ->lock and immediately
	releases it, and then calls kmem_cache_free().

3.	This kmem_cache does have SLAB_DESTROY_BY_RCU, so this
	anon_vma structure will remain an anon_vma for as long as
	CPU 0 remains in its RCU read-side critical section.

4.	CPU 2 allocates an anon_vma, and gets the one that
	CPU 0 just freed.  It initializes it and makes ->ref
	non-zero.

5.	CPU 0 continues executing page_lock_anon_vma(), and therefore
	invokes mutex_trylock() on a now-reused struct anon_vma.
	It finds ->ref nonzero, so increments it and continues using
	it, despite its having been reallocated, possibly to some
	other process.

Or am I missing a step?  (Extremely possible, as I am not as familiar
with this code as I might be.)

> > If the above mutex can be contended, can we fix by substituting
> > synchronize_rcu_expedited()?  Which will soon require some scalability
> > attention if it gets used here, but what else is new?  ;-)
> 
> No, synchronize_rcu_expedited() will not work here, there is no RCU read
> side that covers the full usage of the anon_vma (there can't be, it
> needs to sleep).

Got it, apologies for my confusion.

> > >  	kmem_cache_free(anon_vma_cachep, anon_vma);
> > >  }
> > > 
> > > @@ -291,7 +297,7 @@ void __init anon_vma_init(void)
> > > 
> > >  /*
> > >   * Getting a lock on a stable anon_vma from a page off the LRU is
> > > - * tricky: page_lock_anon_vma relies on RCU to guard against the races.
> > > + * tricky: anon_vma_get relies on RCU to guard against the races.
> > >   */
> > >  struct anon_vma *anon_vma_get(struct page *page)
> > >  {
> > > @@ -320,12 +326,70 @@ out:
> > >  	return anon_vma;
> > >  }
> > > 
> > > +/*
> > > + * Similar to anon_vma_get(), however it relies on the anon_vma->lock
> > > + * to pin the object. However since we cannot wait for the mutex
> > > + * acquisition inside the RCU read lock, we use the ref count
> > > + * in the slow path.
> > > + */
> > >  struct anon_vma *page_lock_anon_vma(struct page *page)
> > >  {
> > > -	struct anon_vma *anon_vma = anon_vma_get(page);
> > > +	struct anon_vma *anon_vma = NULL;
> > > +	unsigned long anon_mapping;
> > > +
> > > +again:
> > > +	rcu_read_lock();
> > 
> > This is interesting.  You have an RCU read-side critical section with
> > no rcu_dereference().
> > 
> > This strange state of affairs is actually legal (assuming that
> > anon_mapping is the RCU-protected structure) because all dereferences
> > of the anon_vma variable are atomic operations that guarantee ordering
> > (the mutex_trylock() and the atomic_inc_not_zero().
> > 
> > The other dereferences (the atomic_read()s) are under the lock, so
> > are also OK assuming that the lock is held when initializing and
> > updating these fields, and even more OK due to the smp_rmb() below.
> > 
> > But see below.
> 
> Right so the only thing rcu_read_lock() does here is create the
> guarantee that anon_vma is safe to dereference (it lives on a
> SLAB_DESTROY_BY_RCU slab).
> 
> But yes, I suppose that page->mapping read that now uses ACCESS_ONCE()
> would actually want to be an rcu_dereference(), since that both provides
> the ACCESS_ONCE() as the read-depend barrier that I thing would be
> needed.

Ah, I was getting the wrong access.  Now that I see it, yes, this is
tied to the access of page->mapping that is assigned to anon_mapping.

> > > +	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > > +	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > > +		goto unlock;
> > > +	if (!page_mapped(page))
> > > +		goto unlock;
> > > +
> > > +	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > > +	if (!mutex_trylock(&anon_vma->lock)) {
> > > +		/*
> > > +		 * We failed to acquire the lock, take a ref so we can
> > > +		 * drop the RCU read lock and sleep on it.
> > > +		 */
> > > +		if (!atomic_inc_not_zero(&anon_vma->ref)) {
> > > +			/*
> > > +			 * Failed to get a ref, we're dead, bail.
> > > +			 */
> > > +			anon_vma = NULL;
> > > +			goto unlock;
> > > +		}
> > > +		rcu_read_unlock();
> > > 
> > > -	if (anon_vma)
> > >  		mutex_lock(&anon_vma->lock);
> > > +		/*
> > > +		 * We got the lock, drop the temp. ref, if it was the last
> > > +		 * one free it and bail.
> > > +		 */
> > > +		if (atomic_dec_and_test(&anon_vma->ref)) {
> > > +			mutex_unlock(&anon_vma->lock);
> > > +			anon_vma_free(anon_vma);
> > > +			anon_vma = NULL;
> > > +		}
> > > +		goto out;
> > > +	}
> > > +	/*
> > > +	 * Got the lock, check we're still alive. Seeing a ref
> > > +	 * here guarantees the object will stay alive due to
> > > +	 * anon_vma_free() syncing against the lock we now hold.
> > > +	 */
> > > +	smp_rmb(); /* Order against anon_vma_put() */
> > 
> > This is ordering the fetch into anon_vma against the atomic_read() below?
> > If so, smp_read_barrier_depends() will cover it more cheaply.  Alternatively,
> > use rcu_dereference() when fetching into anon_vma.
> > 
> > Or am I misunderstanding the purpose of this barrier?
> 
> Yes, it is:
> 
>   atomic_dec_and_test(&anon_vma->ref) /* implies mb */
> 
> 				smp_rmb();
> 				atomic_read(&anon_vma->ref);
> 
> > (Disclaimer: I have not yet found anon_vma_put(), so I am assuming that
> > anon_vma_free() plays the role of a grace period.)
> 
> Yes, that lives in one of the other patches (does not exist in
> mainline).

Thank you -- and yes, I should have thought to search the patch set.

							Thanx, Paul

  reply	other threads:[~2010-04-09 19:22 UTC|newest]

Thread overview: 113+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-04-08 19:17 [PATCH 00/13] mm: preemptibility -v2 Peter Zijlstra
2010-04-08 19:17 ` Peter Zijlstra
2010-04-08 19:17 ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 01/13] powerpc: Add rcu_read_lock() to gup_fast() implementation Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 20:31   ` Rik van Riel
2010-04-09  3:11   ` Nick Piggin
2010-04-13  1:05   ` Benjamin Herrenschmidt
2010-04-13  3:43     ` Paul E. McKenney
2010-04-14 13:51       ` Peter Zijlstra
2010-04-15 14:28         ` Paul E. McKenney
2010-04-16  6:54           ` Benjamin Herrenschmidt
2010-04-16 13:43             ` Paul E. McKenney
2010-04-16 13:43               ` Paul E. McKenney
2010-04-16 23:25               ` Benjamin Herrenschmidt
2010-04-16 13:51           ` Peter Zijlstra
2010-04-16 14:17             ` Paul E. McKenney
2010-04-16 14:23               ` Peter Zijlstra
2010-04-16 14:32                 ` Paul E. McKenney
2010-04-16 14:56                   ` Peter Zijlstra
2010-04-16 15:09                     ` Paul E. McKenney
2010-04-16 15:14                       ` Peter Zijlstra
2010-04-16 16:45                         ` Paul E. McKenney
2010-04-16 19:37                           ` Peter Zijlstra
2010-04-16 20:28                             ` Paul E. McKenney
2010-04-18  3:06                           ` James Bottomley
2010-04-18 13:55                             ` Paul E. McKenney
2010-04-18 18:55                               ` James Bottomley
2010-04-16  6:51       ` Benjamin Herrenschmidt
2010-04-16  8:18         ` Nick Piggin
2010-04-16  8:29           ` Benjamin Herrenschmidt
2010-04-16  9:22             ` Nick Piggin
2010-04-08 19:17 ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 20:50   ` Rik van Riel
2010-04-08 21:20   ` Andrew Morton
2010-04-08 21:54     ` Peter Zijlstra
2010-04-08 21:54       ` Peter Zijlstra
2010-04-09  2:19       ` KOSAKI Motohiro
2010-04-09  2:19   ` Minchan Kim
2010-04-09  3:16   ` Nick Piggin
2010-04-09  4:56     ` KAMEZAWA Hiroyuki
2010-04-09  6:34       ` KOSAKI Motohiro
2010-04-09  6:47         ` KAMEZAWA Hiroyuki
2010-04-09  7:29           ` KOSAKI Motohiro
2010-04-09  7:57             ` KAMEZAWA Hiroyuki
2010-04-09  8:03               ` KAMEZAWA Hiroyuki
2010-04-09  8:24                 ` KAMEZAWA Hiroyuki
2010-04-09  8:01             ` Minchan Kim
2010-04-09  8:17               ` KOSAKI Motohiro
2010-04-09 14:41                 ` mlock and pageout race? Minchan Kim
2010-04-09  8:44             ` [PATCH 02/13] mm: Revalidate anon_vma in page_lock_anon_vma() Peter Zijlstra
2010-05-24 19:32               ` Andrew Morton
2010-05-25  9:01                 ` Peter Zijlstra
2010-04-09 12:57   ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 03/13] x86: Remove last traces of quicklist usage Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 20:51   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 04/13] mm: Move anon_vma ref out from under CONFIG_KSM Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09 12:35   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 05/13] mm: Make use of the anon_vma ref count Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  7:04   ` Christian Ehrhardt
2010-04-09  9:57     ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 06/13] mm: Preemptible mmu_gather Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  3:25   ` Nick Piggin
2010-04-09  8:18     ` Peter Zijlstra
2010-04-09 20:36     ` Peter Zijlstra
2010-04-19 19:16       ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 07/13] powerpc: " Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  4:07   ` Nick Piggin
2010-04-09  8:14     ` Peter Zijlstra
2010-04-09  8:46       ` Nick Piggin
2010-04-09  9:22         ` Peter Zijlstra
2010-04-13  2:06       ` Benjamin Herrenschmidt
2010-04-13  1:56     ` Benjamin Herrenschmidt
2010-04-13  1:23   ` Benjamin Herrenschmidt
2010-04-13 10:22     ` Peter Zijlstra
2010-04-14 13:34     ` Peter Zijlstra
2010-04-14 13:51     ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 08/13] sparc: " Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 09/13] mm, powerpc: Move the RCU page-table freeing into generic code Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09  3:35   ` Nick Piggin
2010-04-09  8:08     ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 10/13] lockdep, mutex: Provide mutex_lock_nest_lock Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09 15:36   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 11/13] mutex: Provide mutex_is_contended Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-09 15:37   ` Rik van Riel
2010-04-08 19:17 ` [PATCH 12/13] mm: Convert i_mmap_lock and anon_vma->lock to mutexes Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 19:17 ` [PATCH 13/13] mm: Optimize page_lock_anon_vma Peter Zijlstra
2010-04-08 19:17   ` Peter Zijlstra
2010-04-08 22:18   ` Paul E. McKenney
2010-04-09  8:35     ` Peter Zijlstra
2010-04-09 19:22       ` Paul E. McKenney [this message]
2010-04-08 20:29 ` [PATCH 00/13] mm: preemptibility -v2 David Miller
2010-04-08 20:35   ` Peter Zijlstra
2010-04-09  1:00   ` David Miller
2010-04-09  4:14 ` Nick Piggin
2010-04-09  8:35   ` Peter Zijlstra
2010-04-09  8:50     ` Nick Piggin
2010-04-09  8:58       ` Peter Zijlstra
2010-04-09  8:58 ` Martin Schwidefsky
2010-04-09  9:53   ` Peter Zijlstra
2010-04-09  9:03 ` David Howells
2010-04-09  9:22   ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100409192249.GD2421@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=avi@redhat.com \
    --cc=benh@kernel.crashing.org \
    --cc=davem@davemloft.net \
    --cc=hugh.dickins@tiscali.co.uk \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mel@csn.ul.ie \
    --cc=mingo@elte.hu \
    --cc=npiggin@suse.de \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.