All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jerome Glisse <jglisse@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Waiman Long <Waiman.Long@hp.com>, Ingo Molnar <mingo@elte.hu>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Rik van Riel <riel@redhat.com>,
	Peter Hurley <peter@hurleysoftware.com>,
	Davidlohr Bueso <davidlohr.bueso@hp.com>,
	Alex Shi <alex.shi@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Matthew R Wilcox <matthew.r.wilcox@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Michel Lespinasse <walken@google.com>,
	Andi Kleen <andi@firstfloor.org>,
	"Chandramouleeswaran, Aswin" <aswin@hp.com>,
	"Norton, Scott J" <scott.norton@hp.com>,
	Haggai Eran <haggaie@mellanox.com>,
	Sagi Grimberg <sagig@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>
Subject: Re: [PATCH] anon_vmas: Convert the rwsem to an rwlock_t
Date: Mon, 30 Sep 2013 10:40:17 -0400	[thread overview]
Message-ID: <20130930144016.GA2880@redhat.com> (raw)
In-Reply-To: <20130930085243.GA25685@redhat.com>

On Mon, Sep 30, 2013 at 10:52:43AM +0200, Andrea Arcangeli wrote:
> Hi everyone,
> 
> On Sat, Sep 28, 2013 at 09:37:39PM +0200, Ingo Molnar wrote:
> > 
> > * Ingo Molnar <mingo@kernel.org> wrote:
> > 
> > > If we do that then I suspect the next step will be queued rwlocks :-/ 
> > > The current rwlock_t implementation is rather primitive by modern 
> > > standards. (We'd probably have killed rwlock_t long ago if not for the 
> > > tasklist_lock.)
> > > 
> > > But yeah, it would work and conceptually a hard spinlock fits something 
> > > as lowlevel as the anon-vma lock.
> > > 
> > > I did a quick review pass and it appears nothing obvious is scheduling 
> > > with the anon-vma lock held. If it did in a non-obvious way it's likely 
> > > a bug anyway. The hugepage code grew a lot of logic running under the 
> > > anon-vma lock, but it all seems atomic.
> > > 
> > > So a conversion to rwlock_t could be attempted. (It should be relatively 
> > > easy patch as well, because the locking operation is now nicely 
> > > abstracted out.)
> > 
> > Here's a totally untested patch to convert the anon vma lock to an 
> > rwlock_t.
> 
> Sorry having to break the party but the sleepable locks for anon_vma
> and i_mmap_mutex are now requirement for the "pageable RDMA" effort
> recently achieved upstream by mellanox with the MMU notifier.
> 
> And as far as I can tell that's the only single good reason for why
> those locks shouldn't be spinlocks (otherwise I would have also
> complianed at the time of that conversion, the original regression was
> known, ask Andi). After the lock conversion it took a while to fix all
> other minor bits to make mmu notifier methods fully sleepable.
> 
> The problem with the spinlocks is that in the rmap code (like
> try_to_unmap) we need to call mmu_notifier_invalidate_page with an
> "mm" as parameter, and the callee assumes the "mm" won't go away under
> it. The other second requirement is that the page cannot be freed
> until we call the mmu_notifier_invalidate_page (secondary MMU is ok to
> still access the page after the linux pte has been dropped and the TLB
> flushed).
> 
> In the rmap code the only things that keep things afloat is either the
> anon_vma lock or the i_mmap_mutex so it is quite tricky to drop that
> lock while keeping "mm" and "page" both afloat for the invalidate
> post-anon-vma-unlock.
> 
> Maybe there are ways to makes that safe? (reference counting,
> trylocking the mmap_sem). But there isn't a very strightforward way to
> do that.
> 
> It isn't just mellanox drivers: originally SGI XPMEM driver also
> needed to schedule in those methods (then they figured how to get away
> in only scheduling in the mmu_notifier range calls but I suppose they
> prefer to be able to schedule in all invalidate methods including
> mmu_notifier_invalidate_page).
> 
> Nvidia is now also going to use mmu notifier to allow the GPU (acting
> as a secondary MMU with pagetables) to access main memory without
> requiring RAM pinning and without disabling all VM features on the
> graphics memory (and without GART physical). Probably they don't need
> to schedule but I CC'ed Jerome just in case they need (as that would
> add one more relevant user of the feature).

Thanks for the cc if it was on mm mailing list i would have seen it
but i am way behind on my lkml folder. Yes right now we have to sleep
in mmu_notifier_invalidate_page but it's somewhat solvable.

The mkclean path is solvable more or less easily but the try_to_unmap
do have to sleep and that's bad. Especialy the shrinker path. My plan
is to provide informations down the mmu notifier API and fails quickly
on shrinker path.

For shrinking i am not sure if i should just rely on shrinker API to
force device or not. It's not something i want to tackle in first
patchset anyway.

I am hoping to be able to send patchset in next few weeks. But the
sleeping inside invalidate page is definitly on my WHATTHEHECKLIST.

Cheers,
Jerome

> 
> As far as KVM is concerned, there is no benefit in scheduling in the
> methods. The KVM mmu notifier invalidate consists in zeroing out one
> or more sptes and sending an IPI to flush the guest TLBs if others
> vcpus are running in other cpus. It's a mechanism pretty much
> identical to the one used by the primary MMU and it only requires irqs
> enabled to avoid deadlocks in case of cross-IPIs.

  reply	other threads:[~2013-09-30 14:41 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-09-27 19:00 [PATCH] rwsem: reduce spinlock contention in wakeup code path Waiman Long
2013-09-27 19:28 ` Linus Torvalds
2013-09-27 19:39   ` Davidlohr Bueso
2013-09-27 21:49     ` Tim Chen
2013-09-28  6:45       ` Ingo Molnar
2013-09-28  7:41   ` Ingo Molnar
2013-09-28 18:55     ` Linus Torvalds
2013-09-28 19:13       ` Andi Kleen
2013-09-28 19:22         ` Linus Torvalds
2013-09-28 19:21       ` Ingo Molnar
2013-09-28 19:33         ` Linus Torvalds
2013-09-28 19:39           ` Ingo Molnar
2013-09-30 10:44           ` Peter Zijlstra
2013-09-30 16:13             ` Linus Torvalds
2013-09-30 16:41               ` Peter Zijlstra
2013-10-01  7:28                 ` Ingo Molnar
2013-10-01  8:09                   ` Peter Zijlstra
2013-10-01  8:25                     ` Ingo Molnar
2013-09-30 15:58           ` Waiman Long
2013-10-01  7:33             ` Ingo Molnar
2013-10-01 20:03               ` Waiman Long
2013-09-28 19:37         ` [PATCH] anon_vmas: Convert the rwsem to an rwlock_t Ingo Molnar
2013-09-28 19:43           ` Linus Torvalds
2013-09-28 19:46             ` Ingo Molnar
2013-09-28 19:52             ` [PATCH, v2] " Ingo Molnar
2013-09-30 11:00               ` Peter Zijlstra
2013-09-30 11:44               ` Peter Zijlstra
2013-09-30 17:03               ` Andrew Morton
2013-09-30 17:25                 ` Linus Torvalds
2013-09-30 17:10               ` Tim Chen
2013-09-30 18:14                 ` Peter Zijlstra
2013-09-30 19:23                   ` Tim Chen
2013-09-30 19:35                     ` Waiman Long
2013-09-30 19:47                       ` Tim Chen
2013-09-30 22:03                         ` Tim Chen
2013-10-01  2:41                         ` Waiman Long
2013-09-30  8:52           ` [PATCH] " Andrea Arcangeli
2013-09-30 14:40             ` Jerome Glisse [this message]
2013-09-30 16:26             ` Linus Torvalds
2013-09-30 19:16               ` Andrea Arcangeli
2013-09-30 18:21             ` Andi Kleen
2013-09-30 19:04               ` Jerome Glisse
2013-09-29 23:06       ` [PATCH] rwsem: reduce spinlock contention in wakeup code path Davidlohr Bueso
2013-09-29 23:26         ` Linus Torvalds
2013-09-30  0:40           ` Davidlohr Bueso
2013-09-30  0:51             ` Linus Torvalds
2013-09-30  6:57           ` Ingo Molnar
2013-09-30  1:11       ` Michel Lespinasse
2013-09-30  7:05         ` Ingo Molnar
2013-09-30 16:03           ` Waiman Long
2013-09-30 10:46       ` Peter Zijlstra
2013-10-01  7:48         ` Ingo Molnar
2013-10-01  8:10           ` Peter Zijlstra
2013-09-27 19:32 ` Peter Hurley
2013-09-28  0:46   ` Waiman Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130930144016.GA2880@redhat.com \
    --to=jglisse@redhat.com \
    --cc=Waiman.Long@hp.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex.shi@intel.com \
    --cc=andi@firstfloor.org \
    --cc=aswin@hp.com \
    --cc=dave.hansen@intel.com \
    --cc=davidlohr.bueso@hp.com \
    --cc=haggaie@mellanox.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=matthew.r.wilcox@intel.com \
    --cc=mingo@elte.hu \
    --cc=mingo@kernel.org \
    --cc=ogerlitz@mellanox.com \
    --cc=peter@hurleysoftware.com \
    --cc=riel@redhat.com \
    --cc=sagig@mellanox.com \
    --cc=scott.norton@hp.com \
    --cc=tim.c.chen@linux.intel.com \
    --cc=torvalds@linux-foundation.org \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.