Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode

From: "Radim Krčmář" <rkrcmar@redhat.com>
To: Roman Kagan <rkagan@virtuozzo.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, Denis Lunev <den@virtuozzo.com>
Subject: Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
Date: Mon, 19 Dec 2016 16:31:12 +0100
Message-ID: <20161219153110.GA16665@potion>
In-Reply-To: <20161219071811.GA3103@rkaganb.sw.ru>

2016-12-19 10:18+0300, Roman Kagan:
> On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
>> 2016-12-15 09:55+0300, Roman Kagan:
>> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>> >> 2016-12-12 17:32+0300, Roman Kagan:
>> >> > Async pagefault machinery assumes communication with L1 guests only: all
>> >> > the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
>> >> > currently doesn't check if the vCPU is running L1 or L2, and may inject
>> >> > a #PF into whatever context is currently executing.
>> >> > 
>> >> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
>> >> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
>> >> > host with swap enabled, run a VM on it, run a nested VM on top, and set
>> >> > RSS limit for L1 on the host via
>> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
>> >> > to swap it out (you may need to tighten and loosen it once or twice, or
>> >> > create some memory load inside L1).  Very quickly L2 guest starts
>> >> > receiving pagefaults with bogus %cr2 (apf tokens from the host
>> >> > actually), and L1 guest starts accumulating tasks stuck in D state in
>> >> > kvm_async_pf_task_wait.
>> >> > 
>> >> > In svm such #PFs are converted into vmexit from L2 to L1 on #PF which is
>> >> > then handled by L1 similar to ordinary async_pf.  However this only
>> >> > works with KVM running in L1; another hypervisor may not expect this
>> >> > (e.g.  VirtualBox asserts on #PF vmexit when NPT is on).
>> >> 
>> >> async_pf is an optional paravirtual device.  It is L1's fault if it
>> >> enabled something that it doesn't support ...
>> > 
>> > async_pf in L1 is enabled by the core Linux; the hypervisor may be
>> > third-party and have no control over it.
>> 
>> Admin can pass no-kvmapf to Linux when planning to use a hypervisor that
>> doesn't support paravirtualized async_pf.  Linux allows only in-kernel
>> hypervisors that do have full control over it.
> 
> Imagine you are a hoster providing VPSes to your customers.  You have
> basically no control over what they run there.  Now if you are brave
> enough to enable nested, you most certainly won't want async_pf to
> create problems for your customers only because they have a kernel with
> async_pf support and a hypervisor without (which at the moment means a
> significant fraction of VPS owners).

In that situation, you already told your customers to disable kvm-apf,
because it is broken (on VMX).  After updating L0, you announce that
kvm-apf can be enabled and, depending on the fix that KVM uses, it is
either enabled only for sufficiently new L1s, or even for older ones.
Not a big difference from a VPS provider's point of view, IMO.

(Hm, and VPS providers could use a toggle to disable kvm-apf on L0,
 because it adds overhead in scenarios with CPU overcommit.)

>> >> AMD's behavior makes sense and already works, therefore I'd like to see
>> >> the same on Intel as well.  (I thought that SVM was broken as well,
>> >> sorry for my misleading first review.)
>> >> 
>> >> > To avoid that, only do async_pf stuff when executing L1 guest.
>> >> 
>> >> The good thing is that we are already killing VMX L1 with async_pf, so
>> >> regressions don't prevent us from making Intel KVM do the same as AMD:
>> >> force a nested VM exit from nested_vmx_check_exception() if the injected
>> >> #PF is async_pf and handle the #PF VM exit in L1.
>> > 
>> > I'm not getting your point: the wealth of existing hypervisors running
>> > in L1 which don't take #PF vmexits can be made not to hang or crash
>> > their guests with a not so complex fix in L0 hypervisor.  Why do the
>> > users need to update *both* their L0 and L1 hypervisors instead?
>> 
>> L1 enables paravirtual async_pf to get notified about L0 page faults,
>> which would allow L1 to reschedule the blocked process and get better
>> performance.  Running a guest is just another process in L1, hence we
>> can assume that L1 is interested in being notified.
> 
> That's a nice theory but in practice there is a fair amount of installed
> VMs with a kernel that requests async_pf and a hypervisor that can't
> live with it.

Yes, and we don't have to care -- they live now, when kvm-apf is broken.

We can fix them in a way that is backward compatible with known
hypervisors, but the solution is worse because of that.
kvm-apf is just for L1 performance, so it should waste as few cycles
as possible, and because users can't depend on a working kvm-apf, I'd
rather not shackle ourselves to past mistakes.

>> If you want a fix without changing L1 hypervisors, then you need to
>> regress KVM on SVM.
> 
> I don't buy this argument.  I don't see any significant difference from
> L0's viewpoint between emulating a #PF vmexit and emulating an external
> interrupt vmexit combined with #PF injection into L1.  The latter,
> however, will keep L1 getting along just fine with the existing kernels
> and hypervisors.

Yes, the delivery method is not crucial; I'd accept another delivery
method as long as L1 on KVM+SVM doesn't regress in performance.
The main regression is not forwarding L0 page faults to L1 while nested,
because of this condition:

  if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {
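
To illustrate what that condition changes, here is a minimal stand-alone
sketch.  It is not kernel code: struct toy_vcpu, its fields and both
helper functions are hypothetical stand-ins for is_guest_mode() and
can_do_async_pf(), invented only for this example.  It shows that with
the extra term a fault taken while L2 runs always goes down the
synchronous path, so L1 is never told that L0 is fetching the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the vCPU state the condition inspects. */
struct toy_vcpu {
	bool guest_mode;       /* models is_guest_mode(): true while L2 runs */
	bool can_do_async_pf;  /* models can_do_async_pf(): L1 can take apf  */
};

enum fault_path {
	FAULT_SYNC,   /* vCPU waits until L0 has brought the page in    */
	FAULT_ASYNC,  /* L0 queues the work and notifies L1 via async_pf */
};

/* Condition as quoted above: async_pf is skipped whenever L2 is running. */
static enum fault_path fault_path_with_check(const struct toy_vcpu *v,
					     bool prefault)
{
	assert(v != NULL);
	if (!prefault && !v->guest_mode && v->can_do_async_pf)
		return FAULT_ASYNC;
	return FAULT_SYNC;
}

/* Same condition without the guest-mode term, for comparison. */
static enum fault_path fault_path_without_check(const struct toy_vcpu *v,
						bool prefault)
{
	assert(v != NULL);
	if (!prefault && v->can_do_async_pf)
		return FAULT_ASYNC;
	return FAULT_SYNC;
}
```

For a vCPU with guest_mode and can_do_async_pf both set, the first helper
returns FAULT_SYNC and the second FAULT_ASYNC; for a vCPU running L1 the
two agree.  The sketch says nothing about how the notification is then
delivered, which is the other half of the discussion.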

>> This series regresses needlessly, though -- it forces L1 to wait in L2
>> until the page for L2 is fetched by L0.
> 
> Indeed, it's half-baked.  I also just realized that it incorrectly does
> nested vmexit before L1 vmentry but #PF injection is attempted on the
> next round which defeats the whole purpose.

I also see separating the nested VM exit from the kvm-apf event delivery
as a regression -- doesn't delivering interrupt vector 14 in the nested
VM exit work without losing backward compatibility?
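
The delivery options mentioned in this thread can be compared with a
small stand-alone model.  This is only a sketch of the discussion, not
KVM code: every enum and function name below is hypothetical, and the
mapping simply restates what the messages above claim about each option
(direct injection hits L2 with a bogus #PF, a #PF VM exit needs an L1
hypervisor that expects it, and an external-interrupt VM exit followed
by #PF injection reaches the L1 kernel's ordinary async_pf handler).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the delivery methods discussed in the thread. */
enum apf_delivery {
	APF_INJECT_CURRENT,       /* inject #PF into whatever is running    */
	APF_NESTED_PF_EXIT,       /* L2->L1 VM exit reporting #PF (vector 14) */
	APF_EXTINT_EXIT_THEN_PF,  /* L2->L1 external-interrupt exit, then
				   * #PF injected into L1                    */
};

/* Who ends up handling the async_pf notification. */
enum apf_receiver {
	RECV_L1_KERNEL_PF,        /* L1 kernel's async_pf #PF handler       */
	RECV_L1_HYPERVISOR_EXIT,  /* L1 hypervisor's #PF VM-exit handler    */
	RECV_L2_BOGUS_PF,         /* L2 sees a #PF with an apf token in %cr2 */
};

static enum apf_receiver who_sees_apf(enum apf_delivery d, bool l2_running)
{
	if (!l2_running)
		return RECV_L1_KERNEL_PF;  /* plain L1 case, all methods agree */

	switch (d) {
	case APF_INJECT_CURRENT:
		return RECV_L2_BOGUS_PF;
	case APF_NESTED_PF_EXIT:
		return RECV_L1_HYPERVISOR_EXIT;
	case APF_EXTINT_EXIT_THEN_PF:
	default:
		return RECV_L1_KERNEL_PF;
	}
}

/* True if the L1 hypervisor must understand async_pf #PF VM exits. */
static bool needs_l1_hypervisor_support(enum apf_delivery d)
{
	return d == APF_NESTED_PF_EXIT;
}
```

In this model only APF_NESTED_PF_EXIT requires cooperation from the L1
hypervisor, which is the backward-compatibility concern; whether the
actual #PF VM exit can be made compatible is exactly the open question
above.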

Thanks.

Thread overview: 12+ messages
2016-12-12 14:32 [PATCH v2 0/5] kvm: deliver async_pf to L1 only Roman Kagan
2016-12-12 14:32 ` [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode Roman Kagan
2016-12-14 21:21   ` Radim Krčmář
2016-12-15  6:55     ` Roman Kagan
2016-12-15 15:09       ` Radim Krčmář
2016-12-19  7:18         ` Roman Kagan
2016-12-19  9:53           ` Paolo Bonzini
2016-12-19 15:31           ` Radim Krčmář [this message]
2016-12-12 14:32 ` [PATCH v2 2/5] kvm: add helper for testing ready async_pf's Roman Kagan
2016-12-12 14:32 ` [PATCH v2 3/5] kvm: kick vcpu when async_pf is resolved Roman Kagan
2016-12-12 14:32 ` [PATCH v2 4/5] kvm/vmx: kick L2 guest to L1 by ready async_pf Roman Kagan
2016-12-12 14:32 ` [PATCH v2 5/5] kvm/svm: " Roman Kagan
