* [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes
@ 2026-04-28 10:30 Fuad Tabba
2026-04-28 10:30 ` [PATCH 1/8] KVM: arm64: Make EL2 exception entry and exit context-synchronization events Fuad Tabba
` (7 more replies)
0 siblings, 8 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
Hi folks,
This is yet another series of fixes I'd like to land before posting a
follow-up to Will's pKVM infrastructure series [1].
I found these while developing KVM and arm64 system guides for
review-prompts [2], an open-source set of AI-assisted review prompts
used by sashiko [3]. While writing the guides I tried to find cases
that would be easy to miss or trip up an LLM, and stumbled on these
bugs. A local run with the updated guides flagged all of them
correctly (some of the commit messages incorporate feedback from that
run, e.g., the impact of WARN_ON() in hyp). I plan to upstream the
guides once they are complete.
The patches fall into three groups:
EL2 context-synchronisation (patches 1-2):
Patch 1 sets SCTLR_EL2.EIS and SCTLR_EL2.EOS in
INIT_SCTLR_EL2_MMU_ON. On FEAT_ExS hardware these bits are
UNKNOWN at reset; without them EL2 exception entry and exit are
not architecturally guaranteed to be Context Synchronisation
Events. KVM/arm64 hot paths rely on that guarantee implicitly to
elide explicit ISBs after MSRs to context-switching sysregs.
Patch 2 adds an explicit ISB after write_sysreg_hcr() on the
__deactivate_traps() path. The activate path is covered by the
ERET that follows (a CSE, guaranteed by patch 1); on the
deactivate path, subsequent EL2 sysreg accesses run before any
natural CSE.
Minor fixes (patches 3-4):
Patch 3 guards the VHE hyp panic path against a NULL vcpu pointer;
the nVHE counterpart already has this guard.
Patch 4 fixes a parameter-name typo in __deactivate_fgt() that
causes it to silently capture a variable from the enclosing scope
rather than use its declared parameter.
pKVM stage-2 error propagation (patches 5-8):
At EL2 in nVHE/pKVM, WARN_ON() is not warn-and-continue: it
expands to a BRK that enters the invalid-host-el2 vector and
branches to hyp_panic(), which is __noreturn.
Four pKVM memory-transition functions wrapped the return value of
kvm_pgtable_stage2_map() in WARN_ON() and discarded it. For the
share and donation paths the map can fail via -ENOMEM when the
vcpu memcache is exhausted, converting a recoverable hypercall
error into a fatal hyp panic. The four patches capture and
propagate the return value, with appropriate stage-2 unmap and
host-side rollback for the reachable failure cases.
Cheers,
/fuad
[1] https://lore.kernel.org/all/20260105154939.11041-1-will@kernel.org/
[2] https://github.com/masoncl/review-prompts
[3] https://sashiko.dev/
Fuad Tabba (8):
KVM: arm64: Make EL2 exception entry and exit context-synchronization
events
KVM: arm64: Synchronise HCR_EL2 writes on the guest exit path
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Propagate stage-2 map failure on host->guest share
KVM: arm64: Propagate stage-2 map failure on host->guest donation
KVM: arm64: Propagate stage-2 map failure on guest->host share
KVM: arm64: Propagate stage-2 map failure on guest->host unshare
arch/arm64/include/asm/sysreg.h | 2 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 2 +-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 99 +++++++++++++++++++++----
arch/arm64/kvm/hyp/nvhe/switch.c | 11 +++
arch/arm64/kvm/hyp/vhe/switch.c | 14 +++-
5 files changed, 111 insertions(+), 17 deletions(-)
--
2.54.0.545.g6539524ca2-goog
* [PATCH 1/8] KVM: arm64: Make EL2 exception entry and exit context-synchronization events
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 2/8] KVM: arm64: Synchronise HCR_EL2 writes on the guest exit path Fuad Tabba
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
SCTLR_EL2.EIS and SCTLR_EL2.EOS control whether exception entry and
exit at EL2 are Context Synchronisation Events (CSEs). Per ARM DDI
0487 M.b, EIS is governed by D1.4.2 rule RBBSRF (p. D1-7205) and EOS
by D1.4.4.1 rule RBWCFK (p. D1-7209). Per the SCTLR_EL2 description
in D24.2.175 (p. D24-9754):
- !FEAT_ExS: the bit is RES1, so the entry/exit is unconditionally
a CSE.
- FEAT_ExS: the reset value is architecturally UNKNOWN; software
must set the bit to make the entry/exit a CSE.
INIT_SCTLR_EL2_MMU_ON in arch/arm64/include/asm/sysreg.h sets neither
bit. KVM/arm64 hot paths rely on ERET from EL2 being a CSE, and on
synchronous EL1->EL2 entry being a CSE, to elide explicit ISBs after
MSRs to context-switching system registers (HCR_EL2, HFGxTR_EL2,
HCRX_EL2, ZCR_EL2, CPACR_EL1, CPTR_EL2, SCTLR_EL1, ptrauth keys,
etc.); examples include the activate-traps path,
ptrauth_switch_to_guest, and the FPSIMD trap re-enable in
kvm_hyp_handle_fpsimd. On FEAT_ExS hardware those reliances are not
architecturally backed unless EOS=1 (and, for entry, EIS=1), and
whether they hold today depends on firmware initialisation outside
the kernel's control.
Make the guarantee explicit: include SCTLR_ELx_EIS | SCTLR_ELx_EOS in
INIT_SCTLR_EL2_MMU_ON so that EL2 exception entry and exit are
unconditionally CSEs regardless of whether FEAT_ExS is implemented.
This matches the pairing in arch/arm64/kvm/config.c which treats EIS
and EOS together as RES1 under !FEAT_ExS.
INIT_SCTLR_EL2_MMU_OFF is left unchanged: that path is used during
very early EL2 init and the EL2 MMU-off transition, neither of which
relies on these bits in the same way.
Fixes: fe2c8d19189e ("KVM: arm64: Turn SCTLR_ELx_FLAGS into INIT_SCTLR_EL2_MMU_ON")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/include/asm/sysreg.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 736561480f36..7aa08d59d494 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -844,7 +844,7 @@
#define INIT_SCTLR_EL2_MMU_ON \
(SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_SA | SCTLR_ELx_I | \
SCTLR_ELx_IESB | SCTLR_ELx_WXN | ENDIAN_SET_EL2 | \
- SCTLR_ELx_ITFSB | SCTLR_EL2_RES1)
+ SCTLR_ELx_ITFSB | SCTLR_ELx_EIS | SCTLR_ELx_EOS | SCTLR_EL2_RES1)
#define INIT_SCTLR_EL2_MMU_OFF \
(SCTLR_EL2_RES1 | ENDIAN_SET_EL2)
--
2.54.0.545.g6539524ca2-goog
* [PATCH 2/8] KVM: arm64: Synchronise HCR_EL2 writes on the guest exit path
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
2026-04-28 10:30 ` [PATCH 1/8] KVM: arm64: Make EL2 exception entry and exit context-synchronization events Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 3/8] KVM: arm64: Guard against NULL vcpu on VHE hyp panic path Fuad Tabba
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
MSR HCR_EL2 is not self-synchronising. Per ARM DDI 0487 M.b K1.2.4
(p.K1-16823) and B2.6.1 (p.B2-297), a Context Synchronisation Event
is required between an HCR_EL2 write and any subsequent direct
register access at the same EL that depends on the new value being
in effect.
On the entry path, the HCR_EL2 write in __activate_traps is followed
by further EL2 sysreg work (MDCR_EL2, CPTR_EL2, VBAR_EL2, and on the
speculative-AT errata path SCTLR_EL1/TCR_EL1) before ERET into the
guest. None of those intervening accesses depend on the new HCR_EL2
value, and ERET is a CSE per ARM DDI 0487 M.b D1.4.4.1 rule RBWCFK
(p. D1-7209) conditional on SCTLR_EL2.EOS=1, which is set
unconditionally by INIT_SCTLR_EL2_MMU_ON (see the prerequisite patch
in this series). The requirement is therefore satisfied implicitly
on the activate path.
The deactivate path is different: after write_sysreg_hcr() in
__deactivate_traps() further EL2 sysreg work runs before any natural
CSE - on nVHE, __deactivate_cptr_traps and the VBAR_EL2 write; on
VHE, the timer context save which reads CNTP_CVAL_EL0 under the new
TGE/E2H, and the EL1 sysreg restore. Add an explicit isb() at each
of the two deactivate sites.
The practical impact today is bounded: HCR_EL2.E2H does not toggle
in either path, and the trap bits being changed primarily affect
EL1&0 behaviour. But the architectural rule should be honoured.
Note that write_sysreg_hcr() itself already issues isb() on the
Ampere errata path (sysreg.h), confirming the architectural
expectation; the fast path optimises that away.
The fix is at the call sites rather than inside write_sysreg_hcr()
because the macro has many users (e.g. the activate path, at.c,
hardirq.h, ptrauth alternatives) where the immediately-following
code either reaches ERET or has another CSE; making the macro emit
an unconditional ISB would impose unnecessary cost on those
well-formed users.
Fixes: 9404673293b0 ("KVM: arm64: timers: Correctly handle TGE flip with CNTPOFF_EL2")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/nvhe/switch.c | 11 +++++++++++
arch/arm64/kvm/hyp/vhe/switch.c | 11 +++++++++++
2 files changed, 22 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 8d1df3d33595..9d7ead5a5503 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -105,6 +105,17 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
__deactivate_traps_common(vcpu);
write_sysreg_hcr(this_cpu_ptr(&kvm_init_params)->hcr_el2);
+ /*
+ * MSR HCR_EL2 is not self-synchronising. Per ARM ARM K1.2.4 p.K1-16823
+ * and B2.6.1 p.B2-297, a Context Synchronisation Event is required
+ * between an HCR_EL2 write and any subsequent direct register access at
+ * the same EL that depends on the new value being in effect.
+ * The activate_traps path falls through to ERET (a CSE), but the
+ * deactivate path still executes further EL2 sysreg work (CPTR/VBAR
+ * writes below) before any natural CSE, so make the synchronisation
+ * explicit.
+ */
+ isb();
__deactivate_cptr_traps(vcpu);
write_sysreg(__kvm_hyp_host_vector, vbar_el2);
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 9db3f11a4754..140d3bcb5651 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -149,6 +149,17 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
___deactivate_traps(vcpu);
write_sysreg_hcr(HCR_HOST_VHE_FLAGS);
+ /*
+ * MSR HCR_EL2 is not self-synchronising. Per ARM ARM K1.2.4 p.K1-16823
+ * and B2.6.1 p.B2-297, a Context Synchronisation Event is required
+ * between an HCR_EL2 write and any subsequent direct register access at
+ * the same EL that depends on the new value being in effect.
+ * The activate_traps path falls through to ERET (a CSE), but the
+ * deactivate path still executes further EL2 sysreg work (CPTR/VBAR
+ * writes below) before any natural CSE, so make the synchronisation
+ * explicit.
+ */
+ isb();
if (has_cntpoff()) {
struct timer_map map;
--
2.54.0.545.g6539524ca2-goog
* [PATCH 3/8] KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
2026-04-28 10:30 ` [PATCH 1/8] KVM: arm64: Make EL2 exception entry and exit context-synchronization events Fuad Tabba
2026-04-28 10:30 ` [PATCH 2/8] KVM: arm64: Synchronise HCR_EL2 writes on the guest exit path Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 4/8] KVM: arm64: Fix __deactivate_fgt macro parameter typo Fuad Tabba
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
On VHE, __hyp_call_panic() unconditionally calls __deactivate_traps(vcpu)
on the vcpu pointer read from host_ctxt->__hyp_running_vcpu. That pointer
is cleared after every guest exit (and is never set when no guest is
running), so an unexpected EL2 exception landing in __guest_exit_panic
(e.g. via the el2t*_invalid / el2h_irq_invalid vectors) reaches this
function with vcpu == NULL. __deactivate_traps() then dereferences vcpu
via ___deactivate_traps() -> vserror_state_is_nested() -> vcpu_has_nv()
-> vcpu->arch.features, faulting inside the panic handler and obscuring
the original failure.
The nVHE counterpart (hyp_panic() in arch/arm64/kvm/hyp/nvhe/switch.c)
already guards its vcpu-using cleanup with "if (vcpu)"; mirror that
here. sysreg_restore_host_state_vhe() and the final panic() do not depend
on vcpu and continue to run unconditionally, preserving panic forensics.
The trailing panic("...VCPU:%p", vcpu) prints "(null)" safely via
printk's %p handling.
Fixes: 6a0259ed29bb ("KVM: arm64: Remove hyp_panic arguments")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/vhe/switch.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 140d3bcb5651..8912863cc238 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -674,7 +674,8 @@ static void __noreturn __hyp_call_panic(u64 spsr, u64 elr, u64 par)
host_ctxt = host_data_ptr(host_ctxt);
vcpu = host_ctxt->__hyp_running_vcpu;
- __deactivate_traps(vcpu);
+ if (vcpu)
+ __deactivate_traps(vcpu);
sysreg_restore_host_state_vhe(host_ctxt);
panic("HYP panic:\nPS:%08llx PC:%016llx ESR:%08llx\nFAR:%016llx HPFAR:%016llx PAR:%016llx\nVCPU:%p\n",
--
2.54.0.545.g6539524ca2-goog
* [PATCH 4/8] KVM: arm64: Fix __deactivate_fgt macro parameter typo
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
` (2 preceding siblings ...)
2026-04-28 10:30 ` [PATCH 3/8] KVM: arm64: Guard against NULL vcpu on VHE hyp panic path Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 5/8] KVM: arm64: Propagate stage-2 map failure on host->guest share Fuad Tabba
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
__deactivate_fgt() declares its first parameter as "htcxt" but the body
references "hctxt". The parameter is unused; the macro silently captures
"hctxt" from the enclosing scope. Both existing callers
(__deactivate_traps_hfgxtr() and __deactivate_traps_ich_hfgxtr()) happen
to define a local "struct kvm_cpu_context *hctxt", so the macro works
by coincidence.
A future caller with no "hctxt" in scope would fail to compile, and one
with an unrelated "hctxt" visible would compile but bind to the wrong
context. Align the parameter name with the sibling __activate_fgt()
macro.
The "vcpu" parameter remains unused in the body, kept for API symmetry
with __activate_fgt() (which uses it).
Fixes: f5a5a406b4b8 ("KVM: arm64: Propagate and handle Fine-Grained UNDEF bits")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/include/hyp/switch.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 98b2976837b1..bf0eb5e43427 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -245,7 +245,7 @@ static inline void __activate_traps_ich_hfgxtr(struct kvm_vcpu *vcpu)
__activate_fgt(hctxt, vcpu, ICH_HFGITR_EL2);
}
-#define __deactivate_fgt(htcxt, vcpu, reg) \
+#define __deactivate_fgt(hctxt, vcpu, reg) \
do { \
write_sysreg_s(ctxt_sys_reg(hctxt, reg), \
SYS_ ## reg); \
--
2.54.0.545.g6539524ca2-goog
* [PATCH 5/8] KVM: arm64: Propagate stage-2 map failure on host->guest share
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
` (3 preceding siblings ...)
2026-04-28 10:30 ` [PATCH 4/8] KVM: arm64: Fix __deactivate_fgt macro parameter typo Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 6/8] KVM: arm64: Propagate stage-2 map failure on host->guest donation Fuad Tabba
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
__pkvm_host_share_guest() mutates the host vmemmap for every page in
the range (sets PKVM_PAGE_SHARED_OWNED and increments
host_share_guest_count) and then calls kvm_pgtable_stage2_map() to
install the guest stage-2 mapping. The stage-2 map's return value was
wrapped in WARN_ON() and otherwise discarded.
At EL2 in nVHE/pKVM, WARN_ON() is not warn-and-continue: it expands
to a BRK that enters the invalid-host-el2 vector and branches to
hyp_panic(), declared __noreturn. WARN_ON of a reachable failure at
EL2 is a panic primitive, not a debug aid.
kvm_pgtable_stage2_map() can fail in reachable ways: the stage-2
walker requests fresh pages from the caller's memcache and returns
-ENOMEM when the memcache is exhausted mid-walk. The host controls
the vcpu memcache via the topup interface, so an under-provisioned
share request converts a recoverable error into a fatal hyp panic.
Capture the stage-2 map return value and propagate it. The walker
may have installed leaf entries for some pages in the IPA range
before failing, so unmap the range to clear any partial mappings;
otherwise the guest would retain stage-2 access to pages the host is
about to reclaim. Then roll back the host vmemmap mutations from the
forward pass: the forward pass increments the count by 1 on every
page, and the only forward state transition is OWNED -> SHARED_OWNED
(the count 0 -> 1 transition). The reverse pass decrements the count
and, if it drops back to zero, restores PKVM_PAGE_OWNED. Pages
already SHARED_OWNED with other sharers (count > 1 after the forward
pass) only need the count decremented.
Fixes: d0bd3e6570ae ("KVM: arm64: Introduce __pkvm_host_share_guest()")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 30 ++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 28a471d1927c..7044913a0758 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -1458,9 +1458,33 @@ int __pkvm_host_share_guest(u64 pfn, u64 gfn, u64 nr_pages, struct pkvm_hyp_vcpu
page->host_share_guest_count++;
}
- WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, size, phys,
- pkvm_mkstate(prot, PKVM_PAGE_SHARED_BORROWED),
- &vcpu->vcpu.arch.pkvm_memcache, 0));
+ ret = kvm_pgtable_stage2_map(&vm->pgt, ipa, size, phys,
+ pkvm_mkstate(prot, PKVM_PAGE_SHARED_BORROWED),
+ &vcpu->vcpu.arch.pkvm_memcache, 0);
+ if (ret) {
+ /*
+ * Stage-2 map can fail mid-walk (e.g. -ENOMEM from the
+ * memcache), leaving partial leaf entries installed in the
+ * guest stage-2. Tear them down before rolling back host
+ * bookkeeping; otherwise the guest would retain access to
+ * pages the host is about to reclaim as PKVM_PAGE_OWNED.
+ */
+ kvm_pgtable_stage2_unmap(&vm->pgt, ipa, size);
+
+ /*
+ * Roll back the host vmemmap mutations applied above. A page
+ * whose host_share_guest_count is now 1 was PKVM_PAGE_OWNED
+ * before this call (count 0->1, state OWNED->SHARED_OWNED);
+ * undo both. A page with count > 1 was already
+ * PKVM_PAGE_SHARED_OWNED with other sharers; only the count
+ * needs to be decremented.
+ */
+ for_each_hyp_page(page, phys, size) {
+ page->host_share_guest_count--;
+ if (!page->host_share_guest_count)
+ set_host_state(page, PKVM_PAGE_OWNED);
+ }
+ }
unlock:
guest_unlock_component(vm);
--
2.54.0.545.g6539524ca2-goog
* [PATCH 6/8] KVM: arm64: Propagate stage-2 map failure on host->guest donation
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
` (4 preceding siblings ...)
2026-04-28 10:30 ` [PATCH 5/8] KVM: arm64: Propagate stage-2 map failure on host->guest share Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 7/8] KVM: arm64: Propagate stage-2 map failure on guest->host share Fuad Tabba
2026-04-28 10:30 ` [PATCH 8/8] KVM: arm64: Propagate stage-2 map failure on guest->host unshare Fuad Tabba
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
__pkvm_host_donate_guest() flips the host stage-2 PTE for the donated
page to a non-valid annotation (KVM_HOST_INVALID_PTE_TYPE_DONATION,
owner = PKVM_ID_GUEST) via host_stage2_set_owner_metadata_locked()
and then calls kvm_pgtable_stage2_map() to install the matching guest
stage-2 mapping. The map's return value was wrapped in WARN_ON() and
otherwise discarded.
At EL2 in nVHE/pKVM, WARN_ON() is not warn-and-continue: it expands
to a BRK that enters the invalid-host-el2 vector and branches to
hyp_panic(), declared __noreturn. WARN_ON of a reachable failure at
EL2 is a panic primitive, not a debug aid.
kvm_pgtable_stage2_map() can fail in reachable ways even at PAGE_SIZE
granularity: __pkvm_host_donate_guest() verifies PKVM_NOPAGE for the
guest IPA before the map, meaning no valid stage-2 entry exists. The
walker must allocate new page-table pages from the vcpu memcache to
install the mapping, returning -ENOMEM if exhausted. The host
controls the vcpu memcache via the topup interface, so an
under-provisioned donation request converts a recoverable error into
a fatal hyp panic.
Capture the stage-2 map return value and propagate it. The walker
may have installed partial leaf entries for the IPA before failing,
so unmap the range to clear them; otherwise the guest would retain
stage-2 access to a page the host is about to reclaim as
PKVM_PAGE_OWNED. Then roll back the host stage-2 mutation: the only
forward mutation is host_stage2_set_owner_metadata_locked() flipping
the host vmemmap from PKVM_PAGE_OWNED to PKVM_NOPAGE and the host
stage-2 PTE from idmap to invalid+annotation.
host_stage2_set_owner_locked(_, _, PKVM_ID_HOST) restores both.
The rollback calls host_stage2_set_owner_locked() under WARN_ON.
This is the correct use: host_stage2_set_owner_metadata_locked()
just wrote the host leaf PTE as an invalid+annotation entry, so the
reverse idmap rewrite cannot require new page-table allocation; it
rewrites the leaf in-place. The WARN_ON asserts an impossible state
under correct EL2 execution, semantically distinct from the misuse
being fixed.
Fixes: 1e579adca177 ("KVM: arm64: Introduce __pkvm_host_donate_guest()")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 7044913a0758..b8c57a95e9bf 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -1391,9 +1391,30 @@ int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu)
meta = host_stage2_encode_gfn_meta(vm, gfn);
WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE,
PKVM_ID_GUEST, meta));
- WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
- pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
- &vcpu->vcpu.arch.pkvm_memcache, 0));
+ ret = kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
+ pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
+ &vcpu->vcpu.arch.pkvm_memcache, 0);
+ if (ret) {
+ /*
+ * Stage-2 map can fail mid-walk (e.g. -ENOMEM from the
+ * memcache), leaving partial leaf entries installed in the
+ * guest stage-2. Tear them down before rolling back the host
+ * stage-2; otherwise the guest would retain access to a page
+ * the host is about to reclaim as PKVM_PAGE_OWNED.
+ */
+ kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
+
+ /*
+ * Roll back the donation annotation applied above by
+ * host_stage2_set_owner_metadata_locked() (host vmemmap
+ * PKVM_NOPAGE -> PKVM_PAGE_OWNED, host stage-2 PTE
+ * invalid+annotation -> idmap). The leaf PTE was just
+ * installed by the forward call, so reinstating the idmap
+ * rewrites it without needing fresh page-table pages from
+ * host_s2_pool.
+ */
+ WARN_ON(host_stage2_set_owner_locked(phys, PAGE_SIZE, PKVM_ID_HOST));
+ }
unlock:
guest_unlock_component(vm);
--
2.54.0.545.g6539524ca2-goog
* [PATCH 7/8] KVM: arm64: Propagate stage-2 map failure on guest->host share
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
` (5 preceding siblings ...)
2026-04-28 10:30 ` [PATCH 6/8] KVM: arm64: Propagate stage-2 map failure on host->guest donation Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
2026-04-28 10:30 ` [PATCH 8/8] KVM: arm64: Propagate stage-2 map failure on guest->host unshare Fuad Tabba
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
__pkvm_guest_share_host() updates the guest stage-2 PTE for a
guest-OWNED page to PKVM_PAGE_SHARED_OWNED via
kvm_pgtable_stage2_map() and then transitions the host vmemmap and
stage-2 PTE to PKVM_PAGE_SHARED_BORROWED. The map's return value was
wrapped in WARN_ON() and otherwise discarded.
At EL2 in nVHE/pKVM, WARN_ON() is not warn-and-continue: it expands
to a BRK that enters the invalid-host-el2 vector and branches to
hyp_panic(), declared __noreturn.
__pkvm_guest_share_host() calls get_valid_guest_pte() before the
map, which verifies that a valid last-level (PAGE_SIZE) leaf PTE
already exists for the IPA. Because the leaf and all intermediate
tables are in place, the subsequent kvm_pgtable_stage2_map()
replacing it cannot fail via -ENOMEM: no block to split, no new
tables to install. The failure path is not currently reachable.
Nevertheless, WARN_ON() on any fallible call is the wrong pattern at
EL2: if the get_valid_guest_pte() precondition were ever relaxed, or
the walker gained a new failure mode, the WARN_ON would convert a
recoverable error into a fatal hyp panic. Capture the return value
and propagate it. The unmap() is kept as a defensive guard for the
currently unreachable failure path; no host-side unwinding is needed
since the host vmemmap and stage-2 update is the next step and is
correctly skipped on error.
Fixes: 03313efed5e2 ("KVM: arm64: Implement the MEM_SHARE hypercall for protected VMs")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index b8c57a95e9bf..6fb546af699f 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -979,10 +979,23 @@ int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn)
if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_NOPAGE))
goto unlock;
- ret = 0;
- WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
- pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED),
- &vcpu->vcpu.arch.pkvm_memcache, 0));
+ ret = kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
+ pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED),
+ &vcpu->vcpu.arch.pkvm_memcache, 0);
+ if (ret) {
+ /*
+ * Stage-2 map can fail mid-walk (e.g. -ENOMEM from the
+ * memcache), leaving partial leaf entries in the guest
+ * stage-2 transitioned to PKVM_PAGE_SHARED_OWNED. Tear
+ * them down so the host does not see a partially-shared
+ * mapping it has not yet acknowledged via the host
+ * stage-2 update below. No host bookkeeping needs
+ * unwinding here: the only mutation prior to the failed
+ * map is the (now-discarded) guest stage-2 update itself.
+ */
+ kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
+ goto unlock;
+ }
WARN_ON(__host_set_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED));
unlock:
guest_unlock_component(vm);
--
2.54.0.545.g6539524ca2-goog
* [PATCH 8/8] KVM: arm64: Propagate stage-2 map failure on guest->host unshare
2026-04-28 10:30 [PATCH 0/8] KVM: arm64: EL2 synchronisation and pKVM stage-2 error propagation fixes Fuad Tabba
` (6 preceding siblings ...)
2026-04-28 10:30 ` [PATCH 7/8] KVM: arm64: Propagate stage-2 map failure on guest->host share Fuad Tabba
@ 2026-04-28 10:30 ` Fuad Tabba
7 siblings, 0 replies; 9+ messages in thread
From: Fuad Tabba @ 2026-04-28 10:30 UTC (permalink / raw)
To: maz, oliver.upton
Cc: james.morse, suzuki.poulose, yuzenghui, qperret, vdonnefort,
tabba, catalin.marinas, will, linux-arm-kernel, kvmarm,
linux-kernel, stable
__pkvm_guest_unshare_host() re-acquires exclusive guest ownership of
a page by (i) annotating the host stage-2 PTE via
host_stage2_set_owner_metadata_locked(), (ii) mapping the page in
the guest stage-2 as PKVM_PAGE_OWNED via kvm_pgtable_stage2_map(),
and (iii) restoring host ownership via
host_stage2_set_owner_locked(). The map's return value was wrapped
in WARN_ON() and otherwise discarded.
At EL2 in nVHE/pKVM, WARN_ON() is not warn-and-continue: it expands
to a BRK that enters the invalid-host-el2 vector and branches to
hyp_panic(), declared __noreturn.
__pkvm_guest_unshare_host() calls get_valid_guest_pte() before the
map, which verifies that a valid last-level (PAGE_SIZE) leaf PTE
already exists for the IPA. Because the leaf and all intermediate
tables are in place, the subsequent kvm_pgtable_stage2_map()
replacing it cannot fail via -ENOMEM: no block to split, no new
tables to install. The failure path is not currently reachable.
Nevertheless, WARN_ON() on any fallible call is the wrong pattern at
EL2. Capture the return value and propagate it. The unmap() and
host-side rollback are kept as defensive guards for the currently
unreachable failure path. The rollback's
WARN_ON(__host_set_page_state_range()) asserts an impossible state:
the host leaf PTE was just written by
host_stage2_set_owner_metadata_locked(), so the reverse idmap
rewrite cannot require new page-table allocation from host_s2_pool.
This is the correct use of WARN_ON at EL2: an impossible-state
assertion, not a reachable error being ignored.
Fixes: 246c976c370d ("KVM: arm64: Implement the MEM_UNSHARE hypercall for protected VMs")
Signed-off-by: Fuad Tabba <tabba@google.com>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 37 ++++++++++++++++++---------
1 file changed, 25 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 6fb546af699f..12f3ea7a2d75 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -984,14 +984,10 @@ int __pkvm_guest_share_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn)
&vcpu->vcpu.arch.pkvm_memcache, 0);
if (ret) {
/*
- * Stage-2 map can fail mid-walk (e.g. -ENOMEM from the
- * memcache), leaving partial leaf entries in the guest
- * stage-2 transitioned to PKVM_PAGE_SHARED_OWNED. Tear
- * them down so the host does not see a partially-shared
- * mapping it has not yet acknowledged via the host
- * stage-2 update below. No host bookkeeping needs
- * unwinding here: the only mutation prior to the failed
- * map is the (now-discarded) guest stage-2 update itself.
+ * Defensive: get_valid_guest_pte() guarantees a last-level
+ * leaf PTE already exists, so stage-2 map() cannot currently
+ * fail here. The unmap() restores the IPA to a clean state as
+ * a guard should the precondition ever change.
*/
kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
goto unlock;
@@ -1024,13 +1020,30 @@ int __pkvm_guest_unshare_host(struct pkvm_hyp_vcpu *vcpu, u64 gfn)
if (__host_check_page_state_range(phys, PAGE_SIZE, PKVM_PAGE_SHARED_BORROWED))
goto unlock;
- ret = 0;
meta = host_stage2_encode_gfn_meta(vm, gfn);
WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE,
PKVM_ID_GUEST, meta));
- WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
- pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
- &vcpu->vcpu.arch.pkvm_memcache, 0));
+ ret = kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
+ pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
+ &vcpu->vcpu.arch.pkvm_memcache, 0);
+ if (ret) {
+ /*
+ * Defensive: get_valid_guest_pte() guarantees a last-level
+ * leaf PTE already exists, so stage-2 map() cannot currently
+ * fail here. The unmap() and host-side rollback below are
+ * kept as guards should the precondition ever change.
+ */
+ kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
+
+ /*
+ * Roll back the host stage-2 mutation above: the host leaf
+ * PTE was just written by host_stage2_set_owner_metadata_locked(),
+ * so __host_set_page_state_range() rewrites it in-place
+ * without needing fresh page-table pages from host_s2_pool.
+ */
+ WARN_ON(__host_set_page_state_range(phys, PAGE_SIZE,
+ PKVM_PAGE_SHARED_BORROWED));
+ }
unlock:
guest_unlock_component(vm);
host_unlock_component();
--
2.54.0.545.g6539524ca2-goog