public inbox for kvm@vger.kernel.org
From: Roman Kagan <rkagan@virtuozzo.com>
To: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, <kvm@vger.kernel.org>,
	Denis Lunev <den@virtuozzo.com>
Subject: Re: [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode
Date: Mon, 19 Dec 2016 10:18:11 +0300	[thread overview]
Message-ID: <20161219071811.GA3103@rkaganb.sw.ru> (raw)
In-Reply-To: <20161215150938.GE9488@potion>

On Thu, Dec 15, 2016 at 04:09:39PM +0100, Radim Krčmář wrote:
> 2016-12-15 09:55+0300, Roman Kagan:
> > On Wed, Dec 14, 2016 at 10:21:11PM +0100, Radim Krčmář wrote:
> >> 2016-12-12 17:32+0300, Roman Kagan:
> >> > Async pagefault machinery assumes communication with L1 guests only: all
> >> > the state -- MSRs, apf area addresses, etc, -- are for L1.  However, it
> >> > currently doesn't check if the vCPU is running L1 or L2, and may inject
> >> > a #PF into whatever context is currently executing.
> >> > 
> >> > In vmx this just results in crashing the L2 on bogus #PFs and hanging
> >> > tasks in L1 due to missing PAGE_READY async_pfs.  To reproduce it, use a
> >> > host with swap enabled, run a VM on it, run a nested VM on top, and set
> >> > an RSS limit for L1 on the host via
> >> > /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> >> > to swap it out (you may need to tighten and loosen it once or twice, or
> >> > create some memory load inside L1).  Very quickly the L2 guest starts
> >> > receiving pagefaults with a bogus %cr2 (actually apf tokens from the
> >> > host), and the L1 guest starts accumulating tasks stuck in D state in
> >> > kvm_async_pf_task_wait.
> >> > 
> >> > In svm such #PFs are converted into a #PF vmexit from L2 to L1, which
> >> > is then handled by L1 similarly to an ordinary async_pf.  However, this
> >> > only works with KVM running in L1; another hypervisor may not expect
> >> > this (e.g. VirtualBox asserts on #PF vmexit when NPT is on).
> >> 
> >> async_pf is an optional paravirtual device.  It is L1's fault if it
> >> enabled something that it doesn't support ...
> > 
> > async_pf in L1 is enabled by the core Linux kernel; the hypervisor may be
> > third-party and have no control over it.
> 
> An admin can pass no-kvmapf to Linux when planning to use a hypervisor
> that doesn't support paravirtualized async_pf.  Linux allows only
> in-kernel hypervisors that do have full control over it.
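The reproduction recipe quoted above boils down to writing a byte count into the cgroup-v1 memory.limit_in_bytes file of the L1 VM's scope on the host. A minimal C sketch, assuming cgroup v1 with the memory controller mounted as in the quoted path; set_mem_limit is a hypothetical helper, and the caller has to substitute the concrete machine-*.scope directory of the VM:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of any existing tool): write a byte
 * limit into a cgroup-v1 memory.limit_in_bytes file.
 * Returns 0 on success, -1 on failure.
 */
int set_mem_limit(const char *path, unsigned long long bytes)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    /* The value only reaches the kernel when the stdio buffer is
     * flushed, so a rejected value may surface in fclose(). */
    int write_failed = fprintf(f, "%llu\n", bytes) < 0;
    int close_failed = fclose(f) != 0;
    return (write_failed || close_failed) ? -1 : 0;
}
```

Tightening and loosening the limit, as the recipe suggests, is then just two calls with a smaller and a larger value. Checking the result of fclose() matters because writes to cgroup files are passed to the kernel only when the buffer is flushed.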

Imagine you are a hosting provider offering VPSes to your customers.  You
have basically no control over what they run there.  Now if you are brave
enough to enable nested virtualization, you certainly won't want async_pf
to create problems for your customers only because they have a kernel
with async_pf support and a hypervisor without it (which at the moment
means a significant fraction of VPS owners).
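The L0-only approach argued for here (the patch's "only do async_pf stuff when executing L1 guest") can be modelled in a few lines. This is a self-contained sketch, not KVM code: struct vcpu_model, its fields and can_do_async_pf() are hypothetical stand-ins (in KVM the nested state is tested with is_guest_mode()).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the relevant vCPU state. */
struct vcpu_model {
    bool guest_mode;   /* true while the vCPU runs a nested (L2) guest */
    bool apf_enabled;  /* L1 enabled async_pf via its MSR */
};

/* May L0 start an async page fault and notify the guest? */
bool can_do_async_pf(const struct vcpu_model *v)
{
    assert(v);
    if (!v->apf_enabled)
        return false;
    /* All async_pf state (MSR, shared area, tokens) belongs to L1,
     * so never inject an apf token as a #PF while L2 is running. */
    if (v->guest_mode)
        return false;
    return true;
}
```

When the check fails, L0 falls back to resolving the fault synchronously, so the vCPU stalls until the page is in; that is the cost Radim objects to further down the thread.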

> >> AMD's behavior makes sense and already works, therefore I'd like to see
> >> the same on Intel as well.  (I thought that SVM was broken as well,
> >> sorry for my misleading first review.)
> >> 
> >> > To avoid that, only do async_pf stuff when executing L1 guest.
> >> 
> >> The good thing is that we are already killing VMX L1 with async_pf, so
> >> regressions don't prevent us from making Intel KVM do the same as AMD:
> >> force a nested VM exit from nested_vmx_check_exception() if the injected
> >> #PF is async_pf and handle the #PF VM exit in L1.
> > 
> > I'm not getting your point: the wealth of existing hypervisors running
> > in L1 which don't take #PF vmexits can be kept from hanging or crashing
> > their guests with a not-so-complex fix in the L0 hypervisor.  Why should
> > users have to update *both* their L0 and L1 hypervisors instead?
> 
> L1 enables paravirtual async_pf to get notified about L0 page faults,
> which would allow L1 to reschedule the blocked process and get better
> performance.  Running a guest is just another process in L1, hence we
> can assume that L1 is interested in being notified.

That's a nice theory, but in practice there are a fair number of
installed VMs with a kernel that requests async_pf and a hypervisor that
can't live with it.
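Radim's point relies on how the L1 kernel consumes async_pf notifications: on "page not present" it parks the faulting task under the token that L0 passes in %cr2, and on "page ready" with the same token it wakes that task. That is also why a PAGE_READY that never reaches L1 leaves tasks stuck in kvm_async_pf_task_wait, as in the quoted report. A simplified single-threaded C model; the fixed table, names and return conventions are hypothetical, and the real Linux code uses hashed wait queues and also copes with a "ready" arriving before its "not present":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8  /* hypothetical fixed capacity */

struct apf_waiter {
    unsigned int token;  /* token delivered by L0 in %cr2 */
    int task_id;         /* task parked until the page is ready */
    bool in_use;
};

static struct apf_waiter waiters[MAX_WAITERS];

/* "Page not present": park the faulting task under the token.
 * Returns false if the table is full. */
bool apf_page_not_present(unsigned int token, int task_id)
{
    assert(task_id >= 0);  /* -1 is reserved as "no waiter" */
    for (size_t i = 0; i < MAX_WAITERS; i++) {
        if (!waiters[i].in_use) {
            waiters[i] = (struct apf_waiter){ token, task_id, true };
            return true;
        }
    }
    return false;
}

/* "Page ready": wake the task parked under the token.
 * Returns its id, or -1 if nobody waits for this token. */
int apf_page_ready(unsigned int token)
{
    for (size_t i = 0; i < MAX_WAITERS; i++) {
        if (waiters[i].in_use && waiters[i].token == token) {
            waiters[i].in_use = false;
            return waiters[i].task_id;
        }
    }
    return -1;
}
```

A task whose "page ready" is never delivered simply stays in the table, which in a real kernel means a task sleeping indefinitely.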

> If you want a fix without changing L1 hypervisors, then you need to
> regress KVM on SVM.

I don't buy this argument.  From L0's viewpoint I don't see any
significant difference between emulating a #PF vmexit and emulating an
external interrupt vmexit combined with #PF injection into L1.  The
latter, however, lets L1 get along just fine with existing kernels and
hypervisors.
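The argument can be made concrete with a toy decision model. Assuming L1's kernel has async_pf enabled (otherwise L0 would not deliver it at all), reflecting the event as a #PF vmexit only helps if L1's hypervisor understands that vmexit (KVM does; per the quoted report, VirtualBox with NPT asserts), whereas an external-interrupt vmexit followed by #PF injection hands the token to the L1 kernel's own async_pf handler whatever hypervisor runs in L1. The enum and function names are hypothetical; this models the reasoning, not KVM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ways for L0 to deliver an async_pf that arrives in L2. */
enum apf_delivery {
    APF_PF_VMEXIT,         /* reflect as a #PF vmexit to L1 */
    APF_EXTINT_VMEXIT_PF,  /* external-interrupt vmexit + #PF into L1 */
};

/* Does L1 end up handling the notification correctly? */
bool l1_copes(enum apf_delivery d, bool hv_handles_pf_vmexit)
{
    switch (d) {
    case APF_PF_VMEXIT:
        /* Depends on the L1 hypervisor forwarding the token to the
         * L1 kernel; third-party hypervisors may not. */
        return hv_handles_pf_vmexit;
    case APF_EXTINT_VMEXIT_PF:
        /* Per the argument above, an external-interrupt vmexit is
         * routine for any L1 hypervisor; the #PF then lands in the
         * L1 kernel's async_pf handler. */
        return true;
    }
    assert(!"unknown delivery strategy");
    return false;
}
```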

> This series regresses needlessly, though -- it forces L1 to wait in L2
> until the page for L2 is fetched by L0.

Indeed, it's half-baked.  I also just realized that it incorrectly
performs the nested vmexit before L1 vmentry, but attempts the #PF
injection on the next round, which defeats the whole purpose.  I'll
rework the series once I have the time (hopefully before x-mas).
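The ordering problem admits a small model under one possible reading: if L0 performs the synthetic nested vmexit but leaves the #PF for the next entry round, L1 sees a bare vmexit, resumes L2 right away, and the next round finds the vCPU in guest mode again, where async_pf is skipped. The sketch contrasts that with injecting the #PF into L1 in the same round as the vmexit; the state fields and functions are hypothetical, and the assumption that L1 immediately resumes L2 belongs to the model, not to the thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical vCPU state for the ordering model. */
struct vcpu_state {
    bool in_l2;          /* vCPU currently runs the nested guest */
    bool apf_pending;    /* an async_pf notification awaits delivery */
    bool apf_delivered;  /* L1 has received it as a #PF */
};

/* One entry round; inject_same_round selects the intended ordering. */
static void entry_round(struct vcpu_state *s, bool inject_same_round)
{
    if (!s->apf_pending)
        return;
    if (s->in_l2) {
        s->in_l2 = false;  /* synthetic nested vmexit to L1 */
        if (!inject_same_round) {
            /* #PF deferred: L1 handles a bare vmexit and (by the
             * model's assumption) resumes L2 immediately. */
            s->in_l2 = true;
            return;
        }
    }
    s->apf_pending = false;  /* #PF injected into L1 */
    s->apf_delivered = true;
}

/* Rounds until L1 sees the #PF, or -1 if not within max_rounds. */
int rounds_until_delivery(bool inject_same_round, int max_rounds)
{
    assert(max_rounds > 0);
    struct vcpu_state s = { .in_l2 = true, .apf_pending = true };
    for (int r = 1; r <= max_rounds; r++) {
        entry_round(&s, inject_same_round);
        if (s.apf_delivered)
            return r;
    }
    return -1;
}
```

In this model the same-round ordering delivers the notification in a single round, while the deferred ordering never does.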

Thanks,
Roman.

Thread overview: 12+ messages
2016-12-12 14:32 [PATCH v2 0/5] kvm: deliver async_pf to L1 only Roman Kagan
2016-12-12 14:32 ` [PATCH v2 1/5] kvm/x86: skip async_pf when in guest mode Roman Kagan
2016-12-14 21:21   ` Radim Krčmář
2016-12-15  6:55     ` Roman Kagan
2016-12-15 15:09       ` Radim Krčmář
2016-12-19  7:18         ` Roman Kagan [this message]
2016-12-19  9:53           ` Paolo Bonzini
2016-12-19 15:31           ` Radim Krčmář
2016-12-12 14:32 ` [PATCH v2 2/5] kvm: add helper for testing ready async_pf's Roman Kagan
2016-12-12 14:32 ` [PATCH v2 3/5] kvm: kick vcpu when async_pf is resolved Roman Kagan
2016-12-12 14:32 ` [PATCH v2 4/5] kvm/vmx: kick L2 guest to L1 by ready async_pf Roman Kagan
2016-12-12 14:32 ` [PATCH v2 5/5] kvm/svm: " Roman Kagan
