Re: kvm splat in mmu_spte_clear_track_bits

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

From: Jerome Glisse <jglisse@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Andrea Arcangeli" <aarcange@redhat.com>,
	"Adam Borowski" <kilobyte@angband.pl>,
	"Takashi Iwai" <tiwai@suse.de>, "Bernhard Held" <berny156@gmx.de>,
	"Nadav Amit" <nadav.amit@gmail.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Wanpeng Li" <kernellwp@gmail.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	"Joerg Roedel" <jroedel@suse.de>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Michal Hocko" <mhocko@kernel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits
Date: Tue, 29 Aug 2017 15:13:51 -0400	[thread overview]
Message-ID: <20170829191351.GD7546@redhat.com> (raw)
In-Reply-To: <CA+55aFxrzKt2NpgERtg+QwxFPHZ1vMFNxyWEtrQR2-Hj3-xxLg@mail.gmail.com>

On Tue, Aug 29, 2017 at 12:06:42PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > Kirill did regress invalidate_page as it use to be call outside the
> > spinlock and now it is call inside the spinlock thus reverting will
> > introduce back a regression.
> 
> Honestly, this MMU notifier thing has been nothing but a badly
> designed mistake from beginning to end, and bad rules for what can
> sleep and what can not are one fundamental problem.
> 
> There are fundamentally two levels of VM locking, and those two levels
> are not going to go away, and we're not budging on them:
> 
>  - there's the "virtual address" level, which can block. We have a
> nice mmap_semaphore, and we guarantee that it's held for writing for
> all changes to the virtual memory layout
> 
>    This is the "mmap/munmap" kind of granularity. The mmu callbacks at
> *this* level are fine to block.
> 
>  - then there is the "page level" VM handling, and honestly, that
> *fundamentally* uses a spinlock. If we look at a particular page, that
> page is meaningless without the lock. Really.
> 
>    I honestly believe that any MMU callback at this level needs to be
> atomic. Some of the absolutely *have* to be (that "change_pte", for
> example).
> 
> In that second case, we might have a "begin/end" surrounding the
> actual page table walk. And that might sleep, but then it
> *fundamentally*  cannot actually be able some particular single page
> or stable range. Because without the page table spinlock, no such
> stability exists. It's purely a "we are not going to start looking at
> this range" kind of thing.
> 
> I really don't understand why the nVidia crap cannot follow those
> simple rules. Because either
> 
>  (a) you're working with virtual addresses, and you should be able to
> work on that virtual layer
> 
>  (b) you're actually working with physical pages, and you can just
> hold on to those physical pages yourself.
> 
> I really detest our MMU callbacks. I shouldn't have allowed them to be
> merged. And I definitely shoul.dn't allow them to screw up our VM
> layer.
> 
> But we have them, and we should work at making sure people do sane things.
> 
> And yes, those sane things may include
> 
>  (a) guaranteeing that the start/end range calls are always done
> around the actual locked region.
> 
>  (b) adding a ton of validation so that people *see* then they break
> the rules. Even when they don't use some random notifier crud.
> 
> That (b) may involve adding a number of "might_sleep()" calls (not
> deep in the notifiers themselves, but in the actual wrapper functions
> even when notifiers are compiled out entirely!), but also adding calls
> to validate those kinds of "you can't call
> mmu_notifier_invalidate_page() without having first called
> mmu_notifier_invalidate_range_start() in a sleepable context".
> 
> But (b) definitely should also be a very real onus on the mmu
> notifiers themselves. No way can we sleep when we're traversing page
> tables. We hold a page table lock. We can sleep before and after, but
> not during actual page traversal.

Yes and i am fine with page traversal being under spinlock and not
being able to sleep during that. I agree doing otherwise would be
insane. It is just that the existing behavior of try_to_unmap_one()
and page_mkclean_one() have been broken and that no mmu_notifier
calls were added around the lock section.

I sent a patch that properly compute the range to invalidate and move
to invalidate_range() but is lacking the invalidate_range_start()/
end() so i am gonna respin that with range_start/end bracketing and
assume the worse for the range of address.

Cheers,
Jérôme

next prev parent reply	other threads:[~2017-08-29 19:13 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-20 23:13 kvm splat in mmu_spte_clear_track_bits Adam Borowski
2017-08-21  1:26 ` Wanpeng Li
2017-08-21 19:12   ` Adam Borowski
2017-08-21 19:58     ` Radim Krčmář
2017-08-21 22:32       ` Adam Borowski
2017-08-23 12:22         ` Paolo Bonzini
2017-08-24  7:43           ` Wanpeng Li
2017-08-25 13:14             ` Adam Borowski
2017-08-25 13:40               ` Paolo Bonzini
2017-08-27 12:35                 ` Adam Borowski
2017-08-28 15:26                   ` Bernhard Held
2017-08-28 16:01                     ` Takashi Iwai
2017-08-28 16:07                       ` Bernhard Held
2017-08-28 16:17                         ` Takashi Iwai
2017-08-28 16:56                     ` Nadav Amit
2017-08-29  9:19                       ` Bernhard Held
     [not found]                         ` <s5hh8wq8ruy.wl-tiwai@suse.de>
2017-08-29 12:59                           ` Adam Borowski
2017-08-29 14:09                             ` Andrea Arcangeli
2017-08-29 16:10                               ` Linus Torvalds
2017-08-29 18:28                                 ` Jerome Glisse
2017-08-29 18:34                               ` Jerome Glisse
2017-08-29 19:06                                 ` Linus Torvalds
2017-08-29 19:13                                   ` Jerome Glisse [this message]
2017-08-29 19:38                                     ` Linus Torvalds
2017-08-29 20:49                                       ` Andrea Arcangeli
2017-08-29 20:59                                         ` Linus Torvalds
2017-08-30  8:19                               ` Michal Hocko
2017-08-29 15:53                         ` Nadav Amit
2017-08-29 12:57                       ` Mike Galbraith

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170829191351.GD7546@redhat.com \
    --to=jglisse@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=berny156@gmx.de \
    --cc=jroedel@suse.de \
    --cc=kernellwp@gmail.com \
    --cc=kilobyte@angband.pl \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=pbonzini@redhat.com \
    --cc=rkrcmar@redhat.com \
    --cc=tiwai@suse.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox