From: "Radim Krčmář" <rkrcmar@redhat.com>
To: Roman Kagan <rkagan@virtuozzo.com>,
Paolo Bonzini <pbonzini@redhat.com>,
kvm@vger.kernel.org, Denis Lunev <den@virtuozzo.com>
Subject: Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
Date: Mon, 19 Dec 2016 16:31:12 +0100
Message-ID: <20161219153110.GA16665@potion>
In-Reply-To: <20161219071811.GA3103@rkaganb.sw.ru>
2016-12-19 10:18+0300, Roman Kagan:
> On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
>> 2016-12-15 09:55+0300, Roman Kagan:
>> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
>> >> 2016-12-12 17:32+0300, Roman Kagan:
>> >> > The async pagefault machinery assumes communication with L1 guests only: all
>> >> > the state -- MSRs, apf area addresses, etc. -- is for L1. However, it
>> >> > currently doesn't check whether the vCPU is running L1 or L2, and may inject
>> >> > a #PF into whatever context is currently executing.
>> >> >
>> >> > In VMX this just results in crashing L2 on bogus #PFs and hanging
>> >> > tasks in L1 due to missing PAGE_READY async_pfs. To reproduce it, use a
>> >> > host with swap enabled, run a VM on it, run a nested VM on top, and set
>> >> > the RSS limit for L1 on the host via
>> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
>> >> > to swap it out (you may need to tighten and loosen it once or twice, or
>> >> > create some memory load inside L1). Very quickly the L2 guest starts
>> >> > receiving page faults with bogus %cr2 (actually apf tokens from the host),
>> >> > and the L1 guest starts accumulating tasks stuck in D state in
>> >> > kvm_async_pf_task_wait.
>> >> >
>> >> > In SVM such #PFs are converted into a #PF VM exit from L2 to L1, which is
>> >> > then handled by L1 similarly to an ordinary async_pf. However, this only
>> >> > works with KVM running in L1; another hypervisor may not expect it
>> >> > (e.g. VirtualBox asserts on a #PF VM exit when NPT is on).
>> >>
>> >> async_pf is an optional paravirtual device. It is L1's fault if it
>> >> enabled something that it doesn't support ...
>> >
>> > async_pf in L1 is enabled by the core Linux kernel; the hypervisor may be
>> > third-party and have no control over it.
>>
>> The admin can pass no-kvmapf to Linux when planning to use a hypervisor that
>> doesn't support paravirtualized async_pf.  Linux allows only in-kernel
>> hypervisors, which do have full control over it.
>
> Imagine you are a hosting provider offering VPSes to your customers.  You have
> basically no control over what they run there.  Now if you are brave
> enough to enable nested, you most certainly won't want async_pf to
> create problems for your customers just because they have a kernel with
> async_pf support and a hypervisor without it (which at the moment means a
> significant fraction of VPS owners).
In that situation, you have already told your customers to disable kvm-apf,
because it is broken (on VMX).  After updating L0, you announce that
kvm-apf can be enabled and, depending on the fix that KVM uses, it is
either enabled only for sufficiently new L1s or even for older ones.
Not a big difference from the VPS provider's point of view, IMO.
(Hm, VPS providers could also use a toggle to disable kvm-apf on L0,
because it adds overhead in scenarios with CPU overcommit.)
>> >> AMD's behavior makes sense and already works, so I'd like to see
>> >> the same on Intel as well.  (I thought that SVM was broken as well;
>> >> sorry for my misleading first review.)
>> >>
>> >> > To avoid that, only do async_pf processing when executing the L1 guest.
>> >>
>> >> The good thing is that async_pf is already killing VMX L1, so
>> >> regressions don't prevent us from making Intel KVM do the same as AMD:
>> >> force a nested VM exit from nested_vmx_check_exception() if the injected
>> >> #PF is async_pf, and handle the #PF VM exit in L1.
>> >
>> > I'm not getting your point: the wealth of existing hypervisors running
>> > in L1 that don't take #PF VM exits can be kept from hanging or crashing
>> > their guests with a not-so-complex fix in the L0 hypervisor.  Why should
>> > users need to update *both* their L0 and L1 hypervisors instead?
>>
>> L1 enables paravirtual async_pf to get notified about L0 page faults,
>> which would allow L1 to reschedule the blocked process and get better
>> performance. Running a guest is just another process in L1, hence we
>> can assume that L1 is interested in being notified.
>
> That's a nice theory, but in practice there is a fair number of installed
> VMs with a kernel that requests async_pf and a hypervisor that can't
> live with it.
Yes, and we don't have to care -- they survive now, while kvm-apf is broken.
We could fix them in a way that is backward compatible with known
hypervisors, but the solution would be worse because of that.
kvm-apf exists only for L1 performance, so it should waste as few cycles
as possible, and because users can't depend on working kvm-apf, I'd rather
not shackle ourselves to past mistakes.
>> If you want a fix without changing L1 hypervisors, then you need to
>> regress KVM on SVM.
>
> I don't buy this argument. I don't see any significant difference from
> L0's viewpoint between emulating a #PF vmexit and emulating an external
> interrupt vmexit combined with #PF injection into L1. The latter,
> however, will keep L1 getting along just fine with the existing kernels
> and hypervisors.
Yes, the delivery method is not crucial; I'd accept another delivery
method as long as L1 on KVM+SVM doesn't regress in performance.
The main regression is that L0 page faults are no longer forwarded to L1
while nested, because of this condition:
if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {
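(That condition lives in try_async_pf() in arch/x86/kvm/mmu.c; a rough
sketch of the surrounding code with this series applied, from memory, so
helper names may be slightly off:)

    if (!prefault && !is_guest_mode(vcpu) && can_do_async_pf(vcpu)) {
            trace_kvm_try_async_get_page(gva, gfn);
            if (kvm_find_async_pf_gfn(vcpu, gfn)) {
                    /* repeated fault on a gfn we already asked L0 for */
                    trace_kvm_async_pf_doublefault(gva, gfn);
                    kvm_make_request(KVM_REQ_APF_HALT, vcpu);
                    return true;
            } else if (kvm_arch_setup_async_pf(vcpu, gva, gfn))
                    return true;  /* PAGE_NOT_PRESENT #PF will be injected */
    }

    /* while in guest mode we now fall through to a blocking
     * __gfn_to_pfn_memslot(), i.e. the vCPU waits in L2 until L0
     * brings the page in -- the regression mentioned above */
    *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
    return false;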
>> This series regresses needlessly, though -- it forces L1 to wait in L2
>> until the page for L2 is fetched by L0.
>
> Indeed, it's half-baked.  I also just realized that it incorrectly does the
> nested VM exit before L1 vmentry, but #PF injection is attempted on the
> next round, which defeats the whole purpose.
I also see separating the nested VM exit from the kvm-apf event delivery
as a regression -- doesn't delivering interrupt vector 14 in the nested
VM exit work without losing backward compatibility?
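(Roughly what I have in mind on the VMX side -- only a sketch, not a
tested patch; kvm_async_pf_injected() and kvm_async_pf_token() are
made-up placeholders for whatever interface ends up carrying that
information to the injection site:)

    static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned nr)
    {
            struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

            /* force the exit for an async_pf #PF even when L1 does not
             * intercept #PF; the apf token takes the place of CR2 */
            if (nr == PF_VECTOR && kvm_async_pf_injected(vcpu)) {
                    nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
                                      PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
                                      INTR_INFO_DELIVER_CODE_MASK |
                                      INTR_INFO_VALID_MASK,
                                      kvm_async_pf_token(vcpu));
                    return 1;
            }

            if (!(vmcs12->exception_bitmap & (1u << nr)))
                    return 0;

            nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
                              vmcs_read32(VM_EXIT_INTR_INFO),
                              vmcs_readl(EXIT_QUALIFICATION));
            return 1;
    }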
Thanks.
Thread overview:
2016-12-12 14:32 [PATCH v2 0/5] kvm: deliver async_pf to L1 only Roman Kagan
2016-12-12 14:32 ` [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode Roman Kagan
2016-12-14 21:21 ` Radim Krčmář
2016-12-15 6:55 ` Roman Kagan
2016-12-15 15:09 ` Radim Krčmář
2016-12-19 7:18 ` Roman Kagan
2016-12-19 9:53 ` Paolo Bonzini
2016-12-19 15:31 ` Radim Krčmář [this message]
2016-12-12 14:32 ` [PATCH v2 2/5] kvm: add helper for testing ready async_pf's Roman Kagan
2016-12-12 14:32 ` [PATCH v2 3/5] kvm: kick vcpu when async_pf is resolved Roman Kagan
2016-12-12 14:32 ` [PATCH v2 4/5] kvm/vmx: kick L2 guest to L1 by ready async_pf Roman Kagan
2016-12-12 14:32 ` [PATCH v2 5/5] kvm/svm: " Roman Kagan