Re: kvm splat in mmu_spte_clear_track_bits

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jerome Glisse <jglisse@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
	"Adam Borowski" <kilobyte@angband.pl>,
	"Takashi Iwai" <tiwai@suse.de>, "Bernhard Held" <berny156@gmx.de>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Joerg Roedel" <jroedel@suse.de>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Michal Hocko" <mhocko@kernel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Date: Tue, 29 Aug 2017 15:13:51 -0400	[thread overview]
Message-ID: <20170829191351.GD7546@redhat.com> (raw)
In-Reply-To: <CA+55aFxrzKt2NpgERtg+QwxFPHZ1vMFNxyWEtrQR2-Hj3-xxLg@mail.gmail.com>

On Tue, Aug 29, 2017 at 12:06:42PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > Kirill did regress invalidate_page as it use to be call outside the
> > spinlock and now it is call inside the spinlock thus reverting will
> > introduce back a regression.
> 
> Honestly, this MMU notifier thing has been nothing but a badly
> designed mistake from beginning to end, and bad rules for what can
> sleep and what can not are one fundamental problem.
> 
> There are fundamentally two levels of VM locking, and those two levels
> are not going to go away, and we're not budging on them:
> 
>  - there's the "virtual address" level, which can block. We have a
> nice mmap_semaphore, and we guarantee that it's held for writing for
> all changes to the virtual memory layout
> 
>    This is the "mmap/munmap" kind of granularity. The mmu callbacks at
> *this* level are fine to block.
> 
>  - then there is the "page level" VM handling, and honestly, that
> *fundamentally* uses a spinlock. If we look at a particular page, that
> page is meaningless without the lock. Really.
> 
>    I honestly believe that any MMU callback at this level needs to be
> atomic. Some of the absolutely *have* to be (that "change_pte", for
> example).
> 
> In that second case, we might have a "begin/end" surrounding the
> actual page table walk. And that might sleep, but then it
> *fundamentally*  cannot actually be able some particular single page
> or stable range. Because without the page table spinlock, no such
> stability exists. It's purely a "we are not going to start looking at
> this range" kind of thing.
> 
> I really don't understand why the nVidia crap cannot follow those
> simple rules. Because either
> 
>  (a) you're working with virtual addresses, and you should be able to
> work on that virtual layer
> 
>  (b) you're actually working with physical pages, and you can just
> hold on to those physical pages yourself.
> 
> I really detest our MMU callbacks. I shouldn't have allowed them to be
> merged. And I definitely shoul.dn't allow them to screw up our VM
> layer.
> 
> But we have them, and we should work at making sure people do sane things.
> 
> And yes, those sane things may include
> 
>  (a) guaranteeing that the start/end range calls are always done
> around the actual locked region.
> 
>  (b) adding a ton of validation so that people *see* then they break
> the rules. Even when they don't use some random notifier crud.
> 
> That (b) may involve adding a number of "might_sleep()" calls (not
> deep in the notifiers themselves, but in the actual wrapper functions
> even when notifiers are compiled out entirely!), but also adding calls
> to validate those kinds of "you can't call
> mmu_notifier_invalidate_page() without having first called
> mmu_notifier_invalidate_range_start() in a sleepable context".
> 
> But (b) definitely should also be a very real onus on the mmu
> notifiers themselves. No way can we sleep when we're traversing page
> tables. We hold a page table lock. We can sleep before and after, but
> not during actual page traversal.

Yes and i am fine with page traversal being under spinlock and not
being able to sleep during that. I agree doing otherwise would be
insane. It is just that the existing behavior of try_to_unmap_one()
and page_mkclean_one() have been broken and that no mmu_notifier
calls were added around the lock section.

I sent a patch that properly compute the range to invalidate and move
to invalidate_range() but is lacking the invalidate_range_start()/
end() so i am gonna respin that with range_start/end bracketing and
assume the worse for the range of address.

Cheers,
Jérôme

next prev parent reply	other threads:[~2017-08-29 19:13 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-20 23:13 kvm splat in mmu_spte_clear_track_bits Adam Borowski
2017-08-21  1:26 ` Wanpeng Li
2017-08-21 19:12   ` Adam Borowski
2017-08-21 19:58     ` Radim Krčmář
2017-08-21 22:32       ` Adam Borowski
2017-08-23 12:22         ` Paolo Bonzini
2017-08-24  7:43           ` Wanpeng Li
2017-08-25 13:14             ` Adam Borowski
2017-08-25 13:40               ` Paolo Bonzini
2017-08-27 12:35                 ` Adam Borowski
2017-08-28 15:26                   ` Bernhard Held
2017-08-28 16:01                     ` Takashi Iwai
2017-08-28 16:07                       ` Bernhard Held
2017-08-28 16:17                         ` Takashi Iwai
2017-08-28 16:56                     ` Nadav Amit
2017-08-29  9:19                       ` Bernhard Held
     [not found]                         ` <s5hh8wq8ruy.wl-tiwai@suse.de>
2017-08-29 12:59                           ` Adam Borowski
2017-08-29 14:09                             ` Andrea Arcangeli
2017-08-29 16:10                               ` Linus Torvalds
2017-08-29 18:28                                 ` Jerome Glisse
2017-08-29 18:34                               ` Jerome Glisse
2017-08-29 19:06                                 ` Linus Torvalds
2017-08-29 19:13                                   ` Jerome Glisse [this message]
2017-08-29 19:38                                     ` Linus Torvalds
2017-08-29 20:49                                       ` Andrea Arcangeli
2017-08-29 20:59                                         ` Linus Torvalds
2017-08-30  8:19                               ` Michal Hocko
2017-08-29 15:53                         ` Nadav Amit
2017-08-29 12:57                       ` Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170829191351.GD7546@redhat.com \
    --to=jglisse@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=berny156@gmx.de \
    --cc=jroedel@suse.de \
    --cc=kernellwp@gmail.com \
    --cc=kilobyte@angband.pl \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=tiwai@suse.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.