From: Andrea Arcangeli <aarcange@redhat.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Waiman Long <Waiman.Long@hp.com>, Ingo Molnar <mingo@elte.hu>,
Andrew Morton <akpm@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
Rik van Riel <riel@redhat.com>,
Peter Hurley <peter@hurleysoftware.com>,
Davidlohr Bueso <davidlohr.bueso@hp.com>,
Alex Shi <alex.shi@intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Matthew R Wilcox <matthew.r.wilcox@intel.com>,
Dave Hansen <dave.hansen@intel.com>,
Michel Lespinasse <walken@google.com>,
Andi Kleen <andi@firstfloor.org>,
"Chandramouleeswaran, Aswin" <aswin@hp.com>,
"Norton, Scott J" <scott.norton@hp.com>,
Haggai Eran <haggaie@mellanox.com>,
Sagi Grimberg <sagig@mellanox.com>,
Or Gerlitz <ogerlitz@mellanox.com>,
Jerome Glisse <jglisse@redhat.com>
Subject: Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t
Date: Mon, 30 Sep 2013 10:52:43 +0200 [thread overview]
Message-ID: <20130930085243.GA25685@redhat.com> (raw)
In-Reply-To: <20130928193739.GA8642@gmail.com>
Hi everyone,
On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
> > If we do that then I suspect the next step will be queued rwlocks :-/
> > The current rwlock_t implementation is rather primitive by modern
> > standards. (We'd probably have killed rwlock_t long ago if not for the
> > tasklist_lock.)
> >
> > But yeah, it would work and conceptually a hard spinlock fits something
> > as lowlevel as the anon-vma lock.
> >
> > I did a quick review pass and it appears nothing obvious is scheduling
> > with the anon-vma lock held. If it did in a non-obvious way it's likely
> > a bug anyway. The hugepage code grew a lot of logic running under the
> > anon-vma lock, but it all seems atomic.
> >
> > So a conversion to rwlock_t could be attempted. (It should be relatively
> > easy patch as well, because the locking operation is now nicely
> > abstracted out.)
>
> Here's a totally untested patch to convert the anon vma lock to an
> rwlock_t.
Sorry having to break the party but the sleepable locks for anon_vma
and i_mmap_mutex are now requirement for the "pageable RDMA" effort
recently achieved upstream by mellanox with the MMU notifier.
And as far as I can tell that's the only single good reason for why
those locks shouldn't be spinlocks (otherwise I would have also
complianed at the time of that conversion, the original regression was
known, ask Andi). After the lock conversion it took a while to fix all
other minor bits to make mmu notifier methods fully sleepable.
The problem with the spinlocks is that in the rmap code (like
try_to_unmap) we need to call mmu_notifier_invalidate_page with an
"mm" as parameter, and the callee assumes the "mm" won't go away under
it. The other second requirement is that the page cannot be freed
until we call the mmu_notifier_invalidate_page (secondary MMU is ok to
still access the page after the linux pte has been dropped and the TLB
flushed).
In the rmap code the only things that keep things afloat is either the
anon_vma lock or the i_mmap_mutex so it is quite tricky to drop that
lock while keeping "mm" and "page" both afloat for the invalidate
post-anon-vma-unlock.
Maybe there are ways to makes that safe? (reference counting,
trylocking the mmap_sem). But there isn't a very strightforward way to
do that.
It isn't just mellanox drivers: originally SGI XPMEM driver also
needed to schedule in those methods (then they figured how to get away
in only scheduling in the mmu_notifier range calls but I suppose they
prefer to be able to schedule in all invalidate methods including
mmu_notifier_invalidate_page).
Nvidia is now also going to use mmu notifier to allow the GPU (acting
as a secondary MMU with pagetables) to access main memory without
requiring RAM pinning and without disabling all VM features on the
graphics memory (and without GART physical). Probably they don't need
to schedule but I CC'ed Jerome just in case they need (as that would
add one more relevant user of the feature).
As far as KVM is concerned, there is no benefit in scheduling in the
methods. The KVM mmu notifier invalidate consists in zeroing out one
or more sptes and sending an IPI to flush the guest TLBs if others
vcpus are running in other cpus. It's a mechanism pretty much
identical to the one used by the primary MMU and it only requires irqs
enabled to avoid deadlocks in case of cross-IPIs.
next prev parent reply other threads:[~2013-09-30 8:53 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-09-27 19:00 [PATCH] rwsem: reduce spinlock contention in wakeup code path Waiman Long
2013-09-27 19:28 ` Linus Torvalds
2013-09-27 19:39 ` Davidlohr Bueso
2013-09-27 21:49 ` Tim Chen
2013-09-28 6:45 ` Ingo Molnar
2013-09-28 7:41 ` Ingo Molnar
2013-09-28 18:55 ` Linus Torvalds
2013-09-28 19:13 ` Andi Kleen
2013-09-28 19:22 ` Linus Torvalds
2013-09-28 19:21 ` Ingo Molnar
2013-09-28 19:33 ` Linus Torvalds
2013-09-28 19:39 ` Ingo Molnar
2013-09-30 10:44 ` Peter Zijlstra
2013-09-30 16:13 ` Linus Torvalds
2013-09-30 16:41 ` Peter Zijlstra
2013-10-01 7:28 ` Ingo Molnar
2013-10-01 8:09 ` Peter Zijlstra
2013-10-01 8:25 ` Ingo Molnar
2013-09-30 15:58 ` Waiman Long
2013-10-01 7:33 ` Ingo Molnar
2013-10-01 20:03 ` Waiman Long
2013-09-28 19:37 ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar
2013-09-28 19:43 ` Linus Torvalds
2013-09-28 19:46 ` Ingo Molnar
2013-09-28 19:52 ` [PATCH, v2] " Ingo Molnar
2013-09-30 11:00 ` Peter Zijlstra
2013-09-30 11:44 ` Peter Zijlstra
2013-09-30 17:03 ` Andrew Morton
2013-09-30 17:25 ` Linus Torvalds
2013-09-30 17:10 ` Tim Chen
2013-09-30 18:14 ` Peter Zijlstra
2013-09-30 19:23 ` Tim Chen
2013-09-30 19:35 ` Waiman Long
2013-09-30 19:47 ` Tim Chen
2013-09-30 22:03 ` Tim Chen
2013-10-01 2:41 ` Waiman Long
2013-09-30 8:52 ` Andrea Arcangeli [this message]
2013-09-30 14:40 ` [PATCH] " Jerome Glisse
2013-09-30 16:26 ` Linus Torvalds
2013-09-30 19:16 ` Andrea Arcangeli
2013-09-30 18:21 ` Andi Kleen
2013-09-30 19:04 ` Jerome Glisse
2013-09-29 23:06 ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso
2013-09-29 23:26 ` Linus Torvalds
2013-09-30 0:40 ` Davidlohr Bueso
2013-09-30 0:51 ` Linus Torvalds
2013-09-30 6:57 ` Ingo Molnar
2013-09-30 1:11 ` Michel Lespinasse
2013-09-30 7:05 ` Ingo Molnar
2013-09-30 16:03 ` Waiman Long
2013-09-30 10:46 ` Peter Zijlstra
2013-10-01 7:48 ` Ingo Molnar
2013-10-01 8:10 ` Peter Zijlstra
2013-09-27 19:32 ` Peter Hurley
2013-09-28 0:46 ` Waiman Long
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130930085243.GA25685@redhat.com \
--to=aarcange@redhat.com \
--cc=Waiman.Long@hp.com \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=alex.shi@intel.com \
--cc=andi@firstfloor.org \
--cc=aswin@hp.com \
--cc=dave.hansen@intel.com \
--cc=davidlohr.bueso@hp.com \
--cc=haggaie@mellanox.com \
--cc=jglisse@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=matthew.r.wilcox@intel.com \
--cc=mingo@elte.hu \
--cc=mingo@kernel.org \
--cc=ogerlitz@mellanox.com \
--cc=peter@hurleysoftware.com \
--cc=riel@redhat.com \
--cc=sagig@mellanox.com \
--cc=scott.norton@hp.com \
--cc=tim.c.chen@linux.intel.com \
--cc=torvalds@linux-foundation.org \
--cc=walken@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).