Date: Thu, 17 Aug 2023 11:21:03 -0700
In-Reply-To: <5e678d57-66b-a18d-f97e-b41357fdb7f@ewheeler.net>
References: <5e678d57-66b-a18d-f97e-b41357fdb7f@ewheeler.net>
Subject: Re: Deadlock due to EPT_VIOLATION
From: Sean Christopherson
To: Eric Wheeler
Cc: Amaan Cheval, brak@gameservers.com, kvm@vger.kernel.org
X-Mailing-List: kvm@vger.kernel.org

On Wed, Aug 16, 2023, Eric Wheeler wrote:
> On Tue, 15 Aug 2023, Sean Christopherson wrote:
> > On Mon, Aug 14, 2023, Eric Wheeler wrote:
> > > On Tue, 8 Aug 2023, Sean Christopherson wrote:
> > > > > If you have any suggestions on how modifying the host kernel (and then migrating
> > > > > a locked up guest to it) or eBPF programs that might help illuminate the issue
> > > > > further, let me know!
> > > > >
> > > > > Thanks for all your help so far!
> > > >
> > > > Since it sounds like you can test with a custom kernel, try running with this
> > > > patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck.
> > > > The below expands said tracepoint to capture information about mmu_notifiers
> > > > and memslots generation. With luck, it will reveal a smoking gun.
> > >
> > > Getting this patch into production systems is challenging, perhaps live
> > > patching is an option:
> >
> > Ah, I take it that when you gathered information after a live migration you were
> > migrating VMs into a sidecar environment.
> >
> > > Questions:
> > >
> > > 1. Do you know if this would be safe to insert as a live kernel patch?
> >
> > Hmm, probably not safe.
> >
> > > For example, does adding to TRACE_EVENT modify a struct (which is not
> > > live-patch-safe) or is it something that should plug in with simple
> > > function redirection?
> >
> > Yes, the tracepoint defines a struct, e.g. in this case trace_event_raw_kvm_page_fault.
> >
> > Looking back, I think I misinterpreted an earlier response regarding bpftrace and
> > unnecessarily abandoned that tactic. *sigh*
> >
> > If your environment provides BTF info, then this bpftrace program should provide
> > the mmu_notifier half of the tracepoint hack-a-patch. If this yields nothing
> > interesting then we can try diving into whether or not the mmu_root is stale, but
> > let's cross that bridge when we have to.
> >
> > I recommend loading this only when you have a stuck vCPU, it'll be quite noisy.
> >
> > kprobe:handle_ept_violation
> > {
> >     printf("vcpu = %lx pid = %u MMU seq = %lx, in-prog = %lx, start = %lx, end = %lx\n",
> >            arg0, ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
> >            ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
> >            ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
> >            ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
> >            ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
> > }
> >
> > If you don't have BTF info, we can still use a bpf program, but to get at the
> > fields of interest, I think we'd have to resort to pointer arithmetic with struct
> > offsets grabbed from your build.
>
> We have BTF, so hurray for not needing struct offsets!
>
> I am testing this on a host that is not (yet) known to be stuck. Please do
> a quick sanity check for me and make sure this looks like the kind of
> output that you want to see:
>
> I had to shrink the printf line because it was longer than 64 bytes. I put
> the process ID as the first item and changed %lx to %08lx for visual
> alignment. Aside from that, it is the same as what you provided.
>
> We're piping it through `uniq -c` to only see interesting changes (and
> show counts) because it is extremely noisy. If this looks good to you then
> please confirm and I will run it on a production system after a lock-up:
>
>     kprobe:handle_ept_violation
>     {
>         printf("ept[%u] vcpu=%08lx seq=%08lx inprog=%lx start=%08lx end=%08lx\n",
>                ((struct kvm_vcpu *)arg0)->pid->numbers[0].nr,
>                arg0,
>                ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_seq,
>                ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_in_progress,
>                ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_start,
>                ((struct kvm_vcpu *)arg0)->kvm->mmu_invalidate_range_end);
>     }
>
> Questions:
> - Should pid be zero? (Note this is not yet running on a host with a
>   locked-up guest, in case that is the reason.)

No. I'm not at all familiar with PID management, I just copy+pasted from pid_nr(),
which is what KVM uses when displaying the pid in debugfs. I printed the PID purely
to be able to unambiguously correlate prints to vCPUs without needing to cross
reference kernel addresses. I.e. having the PID makes life easier, but it shouldn't
be strictly necessary.

> - Can you think of any reason that this would be unsafe? (Forgive my
>   paranoia, but of course this will be running on a production
>   hypervisor.)

Printing the raw address of the vCPU structure will effectively neuter KASLR, but
KASLR isn't all that much of a barrier, and whoever has permission to load a BPF
program on the system can do far, far more damage.

> - Can you think of any adjustments to the bpf script above before
>   running this for real?

You could try to make it less noisy or more precise, e.g. by tailoring it to print
only information on the vCPU that is stuck. If the noise isn't a problem though, I
would keep it as-is, the more information the better.

> Here is an example trace on a test host that isn't locked up:
>
>  ~]# bpftrace handle_ept_violation.bt | grep ^ept --line-buffered | uniq -c
>     1926 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>   215722 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
>    66280 ept[0] vcpu=ffff969569468000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000
> 18609437 ept[0] vcpu=ffff9695684b8000 seq=8009c7eb inprog=0 start=7f5993200000 end=7f5993400000

Woah. That's over 2 *billion* invalidations for a single VM. Even if that's a
long-lived VM, that still seems rather insane. E.g. if the uptime of that VM *on
that host* is 6 months, my back of the napkin math says that that's well over 100
invalidations every second for 6 months straight.

Bit 31 being set in relative isolation almost makes me wonder if mmu_invalidate_seq
got corrupted somehow. Either that or you are thrashing that VM with a vengeance.
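
For anyone wanting to redo the napkin math, here is a rough sketch in Python. It
assumes the seq= field in the trace is mmu_invalidate_seq printed in hex (it comes
from %08lx) and takes "6 months" as 182.5 days. The function name and the uptime
figure are illustrative assumptions, not anything measured on that host.

```python
# seq value copied from the trace above; printed with %08lx, so hexadecimal.
SEQ_HEX = "8009c7eb"

# Assumption: "6 months" of uptime approximated as 182.5 days.
UPTIME_SECONDS = 182.5 * 24 * 60 * 60


def invalidation_stats(seq_hex, uptime_seconds):
    """Return (seq, rate per second, bit 31 set?, seq with bit 31 cleared)."""
    seq = int(seq_hex, 16)
    rate = seq / uptime_seconds
    bit31_set = bool(seq & (1 << 31))
    low_bits = seq & ((1 << 31) - 1)
    return seq, rate, bit31_set, low_bits


seq, rate, bit31_set, low_bits = invalidation_stats(SEQ_HEX, UPTIME_SECONDS)
print(f"seq = {seq} ({seq / 1e9:.2f} billion)")
print(f"rate over assumed uptime = {rate:.1f} invalidations/sec")
print(f"bit 31 set = {bit31_set}, value without bit 31 = {low_bits}")
```

With bit 31 masked off, the remaining count is several orders of magnitude
smaller, which is what makes the lone high bit look suspicious. That is only a
hint, not proof of corruption.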