From: Radim Krčmář
Subject: Re: [PATCH] kvm/x86: skip async_pf when in guest mode
Date: Thu, 24 Nov 2016 21:49:59 +0100
Message-ID: <20161124204958.GA16218@potion>
To: Roman Kagan
Cc: Paolo Bonzini, kvm@vger.kernel.org, Denis Lunev
In-Reply-To: <20161124163039.6847-1-rkagan@virtuozzo.com>

2016-11-24 19:30+0300, Roman Kagan:
> Async pagefault machinery assumes communication with L1 guests only: all
> the state -- MSRs, apf area addresses, etc. -- is for L1.  However, it
> currently doesn't check whether the vCPU is running L1 or L2, and may
> inject the async #PF into L2 although it was meant for L1.
>
> To reproduce the problem, use a host with swap enabled, run a VM on it,
> run a nested VM on top, and set the RSS limit for L1 on the host via
> /sys/fs/cgroup/memory/machine.slice/machine-*.scope/memory.limit_in_bytes
> to swap it out (you may need to tighten and release it once or twice, or
> create some memory load inside L1).  Very quickly the L2 guest starts
> receiving page faults with bogus %cr2 (apf tokens from the host,
> actually), and the L1 guest starts accumulating tasks stuck in D state
> in kvm_async_pf_task_wait.
>
> To avoid that, only do async_pf stuff when executing an L1 guest.
>
> Note: this patch only fixes x86; other async_pf-capable arches may also
> need something similar.
>
> Signed-off-by: Roman Kagan
> ---

Applied to kvm/queue, thanks.

The VM task in L1 could be scheduled out instead of hogging the VCPU for
a long time, so L1 might want to handle async_pf, especially if L1 set
KVM_ASYNC_PF_SEND_ALWAYS.  Another case happens if L1 scheduled out a
high-priority task on async_pf and executed the low-priority VM task in
the spare time, expecting another #PF when the page is ready, which might
happen long before the next nested VM exit.

Have you considered doing a nested VM exit and delivering the async_pf
to L1 immediately?
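
(For readers following along, a minimal sketch of what "only do async_pf
stuff when executing an L1 guest" amounts to on x86: bail out of the
can-we-do-async-pf helper in arch/x86/kvm/mmu.c when the vCPU is in guest
mode, so the fault is handled synchronously instead.  The helper name and
the surrounding checks below are an approximation of the code of that era,
not the hunk that was applied:

  static bool can_do_async_pf(struct kvm_vcpu *vcpu)
  {
  	if (unlikely(!lapic_in_kernel(vcpu) ||
  		     kvm_event_needs_reinjection(vcpu)))
  		return false;

  	/*
  	 * The async_pf protocol (MSRs, apf area) is negotiated with L1
  	 * only; while L2 runs, the injected #PF would hand L2 a host apf
  	 * token in %cr2.  Skip async_pf entirely in guest mode.
  	 */
  	if (is_guest_mode(vcpu))
  		return false;

  	return kvm_x86_ops->interrupt_allowed(vcpu);
  }
)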