From: Sean Christopherson <seanjc@google.com>
To: Eric Wheeler <kvm@lists.ewheeler.net>
Cc: Amaan Cheval <amaan.cheval@gmail.com>,
brak@gameservers.com, kvm@vger.kernel.org
Subject: Re: Deadlock due to EPT_VIOLATION
Date: Wed, 23 Aug 2023 17:52:04 -0700 [thread overview]
Message-ID: <ZOaptK09RbJtFbmk@google.com> (raw)
In-Reply-To: <5da12792-1e5d-b89d-ea0-e1159c645568@ewheeler.net>
On Wed, Aug 23, 2023, Eric Wheeler wrote:
> On Wed, 23 Aug 2023, Sean Christopherson wrote:
> > Not a coincidence, at all. The bug is that, in v6.1, is_page_fault_stale() takes
> > the local @mmu_seq snapshot as an int, whereas as the per-VM count is stored as an
> > unsigned long.
>
> I'm surprised that there were no compiler warnings about signedness or
> type precision. What would have prevented such a compiler warning?
-Wconversion can detect this, but it detects freaking *everything*, i.e. its
signal to noise ratio is straight up awful. It's so noisy in fact that it's not
even in the kernel's W=1 build, it's pushed down all the way to W=3. W=1 is
basically "you'll get some noise, but it may find useful stuff. W=3 is essentially
"don't bother wading through the warnings unless you're masochistic".
E.g. turning it on leads to:
linux/include/linux/kvm_host.h:891:60: error:
conversion to ‘long unsigned int’ from ‘int’ may change the sign of the result [-Werror=sign-conversion]
891 | (atomic_read(&kvm->online_vcpus) - 1))
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
which is completely asinine (suppressing the warning would require declaring the
above literal as 1u).
FWIW, I would love to be able to prevent these types of bugs as this isn't the
first implicit conversion bug that has hit KVM x86[*], but the signal to noise
ratio is just so, so bad.
[*] commit d5aaad6f8342 ("KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds")
> > When the sequence sets bit 31, the local @mmu_seq value becomes
> > a signed *negative* value, and then when that gets passed to mmu_invalidate_retry_hva(),
> > which correctly takes an unsigned long, the negative value gets sign-extended and
> > so the comparison ends up being
> >
> > if (0x8002dc25 != 0xffffffff8002dc25)
> >
> > and KVM thinks the sequence count is stale. I missed it for so long because I
> > was stupidly looking mostly at upstream code (see below), and because of the subtle
> > sign-extension behavior (I was mostly on the lookout for a straight truncation
> > bug where bits[63:32] got dropped).
> >
> > I suspect others haven't hit this issues because no one else is generating anywhere
> > near the same number of mmu_notifier invalidations, and/or live migrates VMs more
> > regularly (which effectively resets the sequence count).
> >
> > The real kicker to all this is that the bug was accidentally fixed in v6.3 by
> > commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq in kvm_faultin_pfn()"),
> > as that refactoring correctly stored the "local" mmu_seq as an unsigned long.
> >
> > I'll post the below as a proper patch for inclusion in stable kernels.
>
> Awesome, and well done. Can you think of a "simple" patch for the
> 6.1-series that would be live-patch safe?
This is what I'm going to post for 6.1, it's as safe and simple a patch as can
be. The only potential hiccup for live-patching is that it's all but guaranteed
to be inlined, but the scope creep should be limited to one-level up, e.g. to
direct_page_fault().
Author: Sean Christopherson <seanjc@google.com>
Date: Wed Aug 23 16:28:12 2023 -0700
KVM: x86/mmu: Fix an sign-extension bug with mmu_seq that hangs vCPUs
Take the vCPU's mmu_seq snapshot as an "unsigned long" instead of an "int"
when checking to see if a page fault is stale, as the sequence count is
stored as an "unsigned long" everywhere else in KVM. This fixes a bug
where KVM will effectively hang vCPUs due to always thinking page faults
are stale, which result in KVM refusing to "fix" faults.
mmu_invalidate_seq (née mmu_notifier_seq) is sequence counter used when
KVM is handling page faults to detect if userspace mapping relevant to the
guest was invalidated snapshotting the counter and acquiring mmu_lock, to
ensure that the host pfn that KVM retrieved is still fresh. If KVM sees
that the counter has change, KVM resumes the guest without fixing the
fault.
What _should_ happen is that the source of the mmu_notifier invalidations
eventually goes away, mmu_invalidate_seq will become stable, and KVM can
once again fix guest page fault(s).
But for a long-lived VM and/or a VM that the host just doesn't particularly
like, it's possible for a VM to be on the receiving end of 2 billion (with
a B) mmu_notifier invalidations. When that happens, bit 31 will be set in
mmu_invalidate_seq. This causes the value to be turned into a 32-bit
negative value when implicitly cast to an "int" by is_page_fault_stale(),
and then sign-extended into a 64-bit unsigned when the signed "int" is
implicitly cast back to an "unsigned long" on the call to
mmu_invalidate_retry_hva().
As a result of the casting and sign-extension, given a sequence counter of
e.g. 0x8002dc25, mmu_invalidate_retry_hva() ends up doing
if (0x8002dc25 != 0xffffffff8002dc25)
and signals that the page fault is stale and needs to be retried even
though the sequence counter is stable, and KVM effectively hangs any vCPU
that takes a page fault (EPT violation or #NPF when TDP is enabled).
Note, upstream commit ba6e3fe25543 ("KVM: x86/mmu: Grab mmu_invalidate_seq
in kvm_faultin_pfn()") unknowingly fixed the bug in v6.3 when refactoring
how KVM tracks the sequence counter snapshot.
Reported-by: Brian Rak <brak@vultr.com>
Reported-by: Amaan Cheval <amaan.cheval@gmail.com>
Reported-by: Eric Wheeler <kvm@lists.ewheeler.net>
Closes: https://lore.kernel.org/all/f023d927-52aa-7e08-2ee5-59a2fbc65953@gameservers.com
Fixes: a955cad84cda ("KVM: x86/mmu: Retry page fault if root is invalidated by memslot update")
Signed-off-by: Sean Christopherson <seanjc@google.com>
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 230108a90cf3..beca03556379 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4212,7 +4212,8 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
* root was invalidated by a memslot update or a relevant mmu_notifier fired.
*/
static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
- struct kvm_page_fault *fault, int mmu_seq)
+ struct kvm_page_fault *fault,
+ unsigned long mmu_seq)
{
struct kvm_mmu_page *sp = to_shadow_page(vcpu->arch.mmu->root.hpa);
next prev parent reply other threads:[~2023-08-24 0:53 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-23 14:02 Deadlock due to EPT_VIOLATION Brian Rak
2023-05-23 16:22 ` Sean Christopherson
2023-05-24 13:39 ` Brian Rak
2023-05-26 16:59 ` Brian Rak
2023-05-26 21:02 ` Sean Christopherson
2023-05-30 17:35 ` Brian Rak
2023-05-30 18:36 ` Sean Christopherson
2023-05-31 17:40 ` Brian Rak
2023-07-21 14:34 ` Amaan Cheval
2023-07-21 17:37 ` Sean Christopherson
2023-07-24 12:08 ` Amaan Cheval
2023-07-25 17:30 ` Sean Christopherson
2023-08-02 14:21 ` Amaan Cheval
2023-08-02 15:34 ` Sean Christopherson
2023-08-02 16:45 ` Amaan Cheval
2023-08-02 17:52 ` Sean Christopherson
2023-08-08 15:34 ` Amaan Cheval
2023-08-08 17:07 ` Sean Christopherson
2023-08-10 0:48 ` Eric Wheeler
2023-08-10 1:27 ` Eric Wheeler
2023-08-10 23:58 ` Sean Christopherson
2023-08-11 12:37 ` Amaan Cheval
2023-08-11 18:02 ` Sean Christopherson
2023-08-12 0:50 ` Eric Wheeler
2023-08-14 17:29 ` Sean Christopherson
2023-08-15 0:30 ` Eric Wheeler
2023-08-15 16:10 ` Sean Christopherson
2023-08-16 23:54 ` Eric Wheeler
2023-08-17 18:21 ` Sean Christopherson
2023-08-18 0:55 ` Eric Wheeler
2023-08-18 14:33 ` Sean Christopherson
2023-08-18 23:06 ` Eric Wheeler
2023-08-21 20:27 ` Eric Wheeler
2023-08-21 23:51 ` Sean Christopherson
2023-08-22 0:11 ` Sean Christopherson
2023-08-22 1:10 ` Eric Wheeler
2023-08-22 15:11 ` Sean Christopherson
2023-08-22 21:23 ` Eric Wheeler
2023-08-22 21:32 ` Sean Christopherson
2023-08-23 0:39 ` Eric Wheeler
2023-08-23 17:54 ` Sean Christopherson
2023-08-23 19:44 ` Eric Wheeler
2023-08-23 22:12 ` Eric Wheeler
2023-08-23 22:32 ` Eric Wheeler
2023-08-23 23:21 ` Sean Christopherson
2023-08-24 0:30 ` Eric Wheeler
2023-08-24 0:52 ` Sean Christopherson [this message]
2023-08-24 23:51 ` Eric Wheeler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZOaptK09RbJtFbmk@google.com \
--to=seanjc@google.com \
--cc=amaan.cheval@gmail.com \
--cc=brak@gameservers.com \
--cc=kvm@lists.ewheeler.net \
--cc=kvm@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.