From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Junaid Shahid <junaids@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Sasha Levin <sashal@kernel.org>,
seanjc@google.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
kvm@vger.kernel.org
Subject: [PATCH AUTOSEL 5.19 06/38] kvm: x86: mmu: Always flush TLBs when enabling dirty logging
Date: Sat, 10 Sep 2022 17:15:51 -0400 [thread overview]
Message-ID: <20220910211623.69825-6-sashal@kernel.org> (raw)
In-Reply-To: <20220910211623.69825-1-sashal@kernel.org>
From: Junaid Shahid <junaids@google.com>
[ Upstream commit b64d740ea7ddc929d97b28de4c0665f7d5db9e2a ]
When A/D bits are not available, KVM uses a software access tracking
mechanism, which involves making the SPTEs inaccessible. However,
the clear_young() MMU notifier does not flush TLBs. So it is possible
that there may still be stale, potentially writable, TLB entries.
This is usually fine, but can be problematic when enabling dirty
logging, because it currently only does a TLB flush if any SPTEs were
modified. But if all SPTEs are in access-tracked state, then there
won't be a TLB flush, which means that the guest could still possibly
write to memory and not have it reflected in the dirty bitmap.
So just unconditionally flush the TLBs when enabling dirty logging.
As an alternative, KVM could explicitly check the MMU-Writable bit when
write-protecting SPTEs to decide if a flush is needed (instead of
checking the Writable bit), but given that a flush almost always happens
anyway, so just making it unconditional seems simpler.
Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20220810224939.2611160-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
arch/x86/kvm/mmu/mmu.c | 45 +++++++----------------------------------
arch/x86/kvm/mmu/spte.h | 14 +++++++++----
arch/x86/kvm/x86.c | 44 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+), 42 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 356226c7ebbdc..aa1ba803659cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5907,47 +5907,18 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
const struct kvm_memory_slot *memslot,
int start_level)
{
- bool flush = false;
-
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
- flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
- start_level, KVM_MAX_HUGEPAGE_LEVEL,
- false);
+ slot_handle_level(kvm, memslot, slot_rmap_write_protect,
+ start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
write_unlock(&kvm->mmu_lock);
}
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
- flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
+ kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
read_unlock(&kvm->mmu_lock);
}
-
- /*
- * Flush TLBs if any SPTEs had to be write-protected to ensure that
- * guest writes are reflected in the dirty bitmap before the memslot
- * update completes, i.e. before enabling dirty logging is visible to
- * userspace.
- *
- * Perform the TLB flush outside the mmu_lock to reduce the amount of
- * time the lock is held. However, this does mean that another CPU can
- * now grab mmu_lock and encounter a write-protected SPTE while CPUs
- * still have a writable mapping for the associated GFN in their TLB.
- *
- * This is safe but requires KVM to be careful when making decisions
- * based on the write-protection status of an SPTE. Specifically, KVM
- * also write-protects SPTEs to monitor changes to guest page tables
- * during shadow paging, and must guarantee no CPUs can write to those
- * page before the lock is dropped. As mentioned in the previous
- * paragraph, a write-protected SPTE is no guarantee that CPU cannot
- * perform writes. So to determine if a TLB flush is truly required, KVM
- * will clear a separate software-only bit (MMU-writable) and skip the
- * flush if-and-only-if this bit was already clear.
- *
- * See is_writable_pte() for more details.
- */
- if (flush)
- kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
}
/* Must be called with the mmu_lock held in write-mode. */
@@ -6070,32 +6041,30 @@ void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
const struct kvm_memory_slot *memslot)
{
- bool flush = false;
-
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(&kvm->mmu_lock);
/*
* Clear dirty bits only on 4k SPTEs since the legacy MMU only
* support dirty logging at a 4k granularity.
*/
- flush = slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
+ slot_handle_level_4k(kvm, memslot, __rmap_clear_dirty, false);
write_unlock(&kvm->mmu_lock);
}
if (is_tdp_mmu_enabled(kvm)) {
read_lock(&kvm->mmu_lock);
- flush |= kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
+ kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
read_unlock(&kvm->mmu_lock);
}
/*
+ * The caller will flush the TLBs after this function returns.
+ *
* It's also safe to flush TLBs out of mmu lock here as currently this
* function is only used for dirty logging, in which case flushing TLB
* out of mmu lock also guarantees no dirty pages will be lost in
* dirty_bitmap.
*/
- if (flush)
- kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
}
void kvm_mmu_zap_all(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f80dbb628df57..e09bdcf1e47c5 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -326,7 +326,7 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
}
/*
- * An shadow-present leaf SPTE may be non-writable for 3 possible reasons:
+ * A shadow-present leaf SPTE may be non-writable for 4 possible reasons:
*
* 1. To intercept writes for dirty logging. KVM write-protects huge pages
* so that they can be split be split down into the dirty logging
@@ -344,8 +344,13 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
* read-only memslot or guest memory backed by a read-only VMA. Writes to
* such pages are disallowed entirely.
*
- * To keep track of why a given SPTE is write-protected, KVM uses 2
- * software-only bits in the SPTE:
+ * 4. To emulate the Accessed bit for SPTEs without A/D bits. Note, in this
+ * case, the SPTE is access-protected, not just write-protected!
+ *
+ * For cases #1 and #4, KVM can safely make such SPTEs writable without taking
+ * mmu_lock as capturing the Accessed/Dirty state doesn't require taking it.
+ * To differentiate #1 and #4 from #2 and #3, KVM uses two software-only bits
+ * in the SPTE:
*
* shadow_mmu_writable_mask, aka MMU-writable -
* Cleared on SPTEs that KVM is currently write-protecting for shadow paging
@@ -374,7 +379,8 @@ static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
* shadow page tables between vCPUs. Write-protecting an SPTE for dirty logging
* (which does not clear the MMU-writable bit), does not flush TLBs before
* dropping the lock, as it only needs to synchronize guest writes with the
- * dirty bitmap.
+ * dirty bitmap. Similarly, making the SPTE inaccessible (and non-writable) for
+ * access-tracking via the clear_young() MMU notifier also does not flush TLBs.
*
* So, there is the problem: clearing the MMU-writable bit can encounter a
* write-protected SPTE while CPUs still have writable mappings for that SPTE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bc411d19dac08..e2c3561edb3e7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12248,6 +12248,50 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
} else {
kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
}
+
+ /*
+ * Unconditionally flush the TLBs after enabling dirty logging.
+ * A flush is almost always going to be necessary (see below),
+ * and unconditionally flushing allows the helpers to omit
+ * the subtly complex checks when removing write access.
+ *
+ * Do the flush outside of mmu_lock to reduce the amount of
+ * time mmu_lock is held. Flushing after dropping mmu_lock is
+ * safe as KVM only needs to guarantee the slot is fully
+ * write-protected before returning to userspace, i.e. before
+ * userspace can consume the dirty status.
+ *
+ * Flushing outside of mmu_lock requires KVM to be careful when
+ * making decisions based on writable status of an SPTE, e.g. a
+ * !writable SPTE doesn't guarantee a CPU can't perform writes.
+ *
+ * Specifically, KVM also write-protects guest page tables to
+ * monitor changes when using shadow paging, and must guarantee
+ * no CPUs can write to those page before mmu_lock is dropped.
+ * Because CPUs may have stale TLB entries at this point, a
+ * !writable SPTE doesn't guarantee CPUs can't perform writes.
+ *
+ * KVM also allows making SPTES writable outside of mmu_lock,
+ * e.g. to allow dirty logging without taking mmu_lock.
+ *
+ * To handle these scenarios, KVM uses a separate software-only
+ * bit (MMU-writable) to track if a SPTE is !writable due to
+ * a guest page table being write-protected (KVM clears the
+ * MMU-writable flag when write-protecting for shadow paging).
+ *
+ * The use of MMU-writable is also the primary motivation for
+ * the unconditional flush. Because KVM must guarantee that a
+ * CPU doesn't contain stale, writable TLB entries for a
+ * !MMU-writable SPTE, KVM must flush if it encounters any
+ * MMU-writable SPTE regardless of whether the actual hardware
+ * writable bit was set. I.e. KVM is almost guaranteed to need
+ * to flush, while unconditionally flushing allows the "remove
+ * write access" helpers to ignore MMU-writable entirely.
+ *
+ * See is_writable_pte() for more details (the case involving
+ * access-tracked SPTEs is particularly relevant).
+ */
+ kvm_arch_flush_remote_tlbs_memslot(kvm, new);
}
}
--
2.35.1
next prev parent reply other threads:[~2022-09-10 21:16 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-10 21:15 [PATCH AUTOSEL 5.19 01/38] Input: goodix - add support for GT1158 Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 02/38] platform/surface: aggregator_registry: Add support for Surface Laptop Go 2 Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 03/38] drm/msm/rd: Fix FIFO-full deadlock Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 04/38] peci: cpu: Fix use-after-free in adev_release() Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 05/38] hwmon: (pmbus) Use dev_err_probe() to filter -EPROBE_DEFER error messages Sasha Levin
2022-09-10 21:15 ` Sasha Levin [this message]
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 07/38] dt-bindings: iio: gyroscope: bosch,bmg160: correct number of pins Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 08/38] HID: hidraw: fix memory leak in hidraw_release() Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 09/38] HID: asus: ROG NKey: Ignore portion of 0x5a report Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 10/38] HID: ishtp-hid-clientHID: ishtp-hid-client: Fix comment typo Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 11/38] hid: intel-ish-hid: ishtp: Fix ishtp client sending disordered message Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 12/38] HID: thrustmaster: Add sparco wheel and fix array length Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 13/38] HID: AMD_SFH: Add a DMI quirk entry for Chromebooks Sasha Levin
2022-09-10 21:15 ` [PATCH AUTOSEL 5.19 14/38] HID: add Lenovo Yoga C630 battery quirk Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 15/38] HID: Add Apple Touchbar on T2 Macs in hid_have_special_driver list Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 16/38] HID: intel-ish-hid: ipc: Add Meteor Lake PCI device ID Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 17/38] HID: nintendo: fix rumble worker null pointer deref Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 18/38] Bluetooth: MGMT: Fix Get Device Flags Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 19/38] tg3: Disable tg3 device on system reboot to avoid triggering AER Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 20/38] r8152: add PID for the Lenovo OneLink+ Dock Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 21/38] gpio: mockup: remove gpio debugfs when remove device Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 22/38] ieee802154: cc2520: add rc code in cc2520_tx() Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 23/38] Input: iforce - add support for Boeder Force Feedback Wheel Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 24/38] drm/amdgpu: disable FRU access on special SIENNA CICHLID card Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 25/38] drm/amd/pm: use vbios carried pptable for all SMU13.0.7 SKUs Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 26/38] nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM610 Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 27/38] nvmet-tcp: fix unhandled tcp states in nvmet_tcp_state_change() Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 28/38] drm/amd/amdgpu: skip ucode loading if ucode_size == 0 Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 29/38] net: dsa: hellcreek: Print warning only once Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 30/38] perf/arm_pmu_platform: fix tests for platform_get_irq() failure Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 31/38] platform/x86: acer-wmi: Acer Aspire One AOD270/Packard Bell Dot keymap fixes Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 32/38] usb: storage: Add ASUS <0x0b05:0x1932> to IGNORE_UAS Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 33/38] misc: fastrpc: increase maximum session count Sasha Levin
2022-09-11 9:31 ` Johan Hovold
2022-09-12 8:57 ` Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 34/38] platform/x86: asus-wmi: Increase FAN_CURVE_BUF_LEN to 32 Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 35/38] LoongArch: Fix section mismatch due to acpi_os_ioremap() Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 36/38] LoongArch: Fix arch_remove_memory() undefined build error Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 37/38] gpio: 104-dio-48e: Make irq_chip immutable Sasha Levin
2022-09-10 21:16 ` [PATCH AUTOSEL 5.19 38/38] gpio: 104-idio-16: " Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220910211623.69825-6-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=junaids@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mingo@redhat.com \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=stable@vger.kernel.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox