[PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes

Linux s390 Architecture development
 help / color / mirror / Atom feed

* [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes
@ 2026-06-23 15:33 Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
                   ` (9 more replies)
  0 siblings, 10 replies; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

Fix handling of _PAGE_UNUSED pte bit. The bit was lingering around
longer than it should have, causing issues. This is yet a different
approach from the previous version.

Fix some cmma dirty tracking functions.

Also fix a few other small issues in gmap.

v5->v6:
* Move locking back in the individual cases in kvm_s390_set_mem_control()
* Add commends and rework the handling of cmma dirty flags in
  kvm_s390_vm_{start,stop}_migration()
* Avoid potential deadlock in the unlikely case of a nested guest running
  with a real address space

v4->v5:
* Make sure the counter of cmma-dirty pages does not explode
* Improve comments in cmma selftest
* Rework and fix locking in kvm_s390_set_mem_control()
* Do not return success if an allocation in kvm_s390_set_cmma_bits() fails

v3->v4:
* Improve/fix descriptions of patches 1 and 3
* Fix handle_{sske,pfmf} under memory pressure
* Fix cmma dirty tracking and its selftest

v2->v3:
* Fix _PAGE_UNUSED in yet a different way
* Fix a potential overflow in dat_peek_cmma()
* Some minor gmap fixes that are not actual bugs yet

v1->v2:
* Completely different approach, entirely new patch

Claudio Imbrenda (10):
  s390/mm: Fix handling of _PAGE_UNUSED pte bit
  KVM: s390: Fix dat_peek_cmma() overflow
  KVM: s390: Do not set special large pages dirty
  KVM: s390: Fix code typo in gmap_protect_asce_top_level()
  KVM: s390: Fix handle_{sske,pfmf} under memory pressure
  KVM: s390: Fix locking in kvm_s390_set_mem_control()
  KVM: s390: Fix cmma dirty tracking
  KVM: s390: selftests: Fix cmma selftest
  KVM: s390: Return failure in case of failure in
    kvm_s390_set_cmma_bits()
  KVM: s390: vsie: Avoid potential deadlock with real spaces

 arch/s390/include/asm/pgtable.h              |  4 +-
 arch/s390/kvm/dat.c                          |  5 +-
 arch/s390/kvm/gmap.c                         | 41 +++++++++---
 arch/s390/kvm/gmap.h                         | 12 +++-
 arch/s390/kvm/kvm-s390.c                     | 66 +++++++++++++-------
 arch/s390/kvm/priv.c                         | 10 ++-
 arch/s390/mm/gmap_helpers.c                  |  3 +-
 tools/testing/selftests/kvm/s390/cmma_test.c |  6 ++
 8 files changed, 111 insertions(+), 36 deletions(-)

-- 
2.54.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:54   ` sashiko-bot
  2026-06-23 15:33 ` [PATCH v6 02/10] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

The _PAGE_UNUSED softbit should not really be lying around. Its sole
purpose is to signal to try_to_unmap_one() and try_to_migrate_one()
that the page can be discarded instead of being moved / swapped.

KVM has no way to know why a page is being unmapped, so it sets the bit
on userspace ptes corresponding to unused guest pages every time they
get unmapped. KVM has no reasonable way to clear the bit once the page
is in use again.

While set_ptes() checks and clears the bit, other paths that set new
ptes did not. This led to used pages being thrown out as if they were
unused, causing guest corruption.

Fix the issue by clearing the _PAGE_UNUSED bit for present ptes in
set_pte(), i.e. whenever a present pte is getting set. The check in
set_ptes() is then redundant and can be removed.

Also fix gmap_helper_try_set_pte_unused() to only set the bit if the
pte is present; the _PAGE_UNUSED bit is only defined for present ptes
and thus should not be set for non-present ptes.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: c98175b7917f ("KVM: s390: Add gmap_helper_set_unused()")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 4 ++--
 arch/s390/mm/gmap_helpers.c     | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index ca376a9b8e41..d03663483f76 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -980,6 +980,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
+	if (pte_present(pte))
+		pte = clear_pte_bit(pte, __pgprot(_PAGE_UNUSED));
 	WRITE_ONCE(*ptep, pte);
 }
 
@@ -1332,8 +1334,6 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t entry, unsigned int nr)
 {
-	if (pte_present(entry))
-		entry = clear_pte_bit(entry, __pgprot(_PAGE_UNUSED));
 	page_table_check_ptes_set(mm, addr, ptep, entry, nr);
 	for (;;) {
 		set_pte(ptep, entry);
diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
index 1cfe4724fbe2..60023b6fdcb1 100644
--- a/arch/s390/mm/gmap_helpers.c
+++ b/arch/s390/mm/gmap_helpers.c
@@ -181,7 +181,8 @@ void gmap_helper_try_set_pte_unused(struct mm_struct *mm, unsigned long vmaddr)
 	if (IS_ERR_OR_NULL(ptep))
 		return;
 
-	__atomic64_or(_PAGE_UNUSED, (long *)ptep);
+	if (pte_present(*ptep))
+		__atomic64_or(_PAGE_UNUSED, (long *)ptep);
 	pte_unmap_unlock(ptep, ptl);
 }
 EXPORT_SYMBOL_GPL(gmap_helper_try_set_pte_unused);
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 02/10] KVM: s390: Fix dat_peek_cmma() overflow
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:49   ` sashiko-bot
  2026-06-23 15:33 ` [PATCH v6 03/10] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

If userspace passes a start address that is out of bounds,
_dat_walk_gfn_range() will fail with -EFAULT, but state.end will not be
touched and will stay 0. This will cause *count to underflow and report
a very high number, and the function will end up erroneously reporting
success.

Fix by only setting *count if the end address is not smaller than the
starting address. This way invalid starting addresses will correctly
return -EFAULT and *count will correctly indicate that no values have
been returned.

Fixes: 7b368470e1a4 ("KVM: s390: KVM page table management functions: CMMA")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/dat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
index 4a41c0247ffa..cffac7782c4b 100644
--- a/arch/s390/kvm/dat.c
+++ b/arch/s390/kvm/dat.c
@@ -1209,7 +1209,7 @@ int dat_peek_cmma(gfn_t start, union asce asce, unsigned int *count, u8 *values)
 	int rc;
 
 	rc = _dat_walk_gfn_range(start, start + *count, asce, &ops, DAT_WALK_DEFAULT, &state);
-	*count = state.end - start;
+	*count = state.end >= start ? state.end - start : 0;
 	/* Return success if at least one value was saved, otherwise an error. */
 	return (rc == -EFAULT && *count > 0) ? 0 : rc;
 }
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 03/10] KVM: s390: Do not set special large pages dirty
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 02/10] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:55   ` sashiko-bot
  2026-06-23 15:33 ` [PATCH v6 04/10] KVM: s390: Fix code typo in gmap_protect_asce_top_level() Claudio Imbrenda
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

Special pages / folios should not be set dirty. This also applies to
large pages.

Add a missing check in gmap_clear_young_crste() to prevent setting the
large page dirty if it is a special page.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
---
 arch/s390/kvm/gmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index 52d55ddea8d4..3192f610f696 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -327,7 +327,7 @@ static long gmap_clear_young_crste(union crste *crstep, gfn_t gfn, gfn_t end, st
 		new.h.i = 1;
 		new.s.fc1.y = 0;
 		new.s.fc1.prefix_notif = 0;
-		if (new.s.fc1.d || !new.h.p)
+		if ((new.s.fc1.d || !new.h.p) && !new.s.fc1.s)
 			folio_set_dirty(phys_to_folio(crste_origin_large(crste)));
 		new.s.fc1.d = 0;
 		new.h.p = 1;
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 04/10] KVM: s390: Fix code typo in gmap_protect_asce_top_level()
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (2 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 03/10] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 05/10] KVM: s390: Fix handle_{sske,pfmf} under memory pressure Claudio Imbrenda
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

The correct length to pass to kvm_s390_get_guest_pages() is asce.tl + 1,
not asce.dt + 1. It was a typo, which, due to fortuitous circumstances,
did not cause bugs. It should nonetheless be fixed.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: e5f98a6899bd ("KVM: s390: Add some helper functions needed for vSIE")
---
 arch/s390/kvm/gmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index 3192f610f696..e6e786811db8 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -1262,7 +1262,7 @@ static int gmap_protect_asce_top_level(struct kvm_s390_mmu_cache *mc, struct gma
 	/* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
 	smp_rmb();
 
-	rc = kvm_s390_get_guest_pages(sg->kvm, context.f, asce.rsto, asce.dt + 1, false);
+	rc = kvm_s390_get_guest_pages(sg->kvm, context.f, asce.rsto, asce.tl + 1, false);
 	if (rc > 0)
 		rc = -EFAULT;
 	if (!rc)
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 05/10] KVM: s390: Fix handle_{sske,pfmf} under memory pressure
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (3 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 04/10] KVM: s390: Fix code typo in gmap_protect_asce_top_level() Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 06/10] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

Under heavy memory pressure, handle_sske() and handle_pfmf() might
cause an endless loop if the mmu cache runs empty, the atomic
allocations fail, and the top-up function also fails. While quite
unlikely, that scenario is not impossible.

Fix the issue by not ignoring the return value of
kvm_s390_mmu_cache_topup(), and appropriately returning an error code
in case of failure.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
---
 arch/s390/kvm/priv.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 447ec7ed423d..9bc6fd02ff77 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -366,7 +366,9 @@ static int handle_sske(struct kvm_vcpu *vcpu)
 		if (rc > 1)
 			return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 		if (rc == -ENOMEM) {
-			kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+			rc = kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+			if (rc)
+				return rc;
 			continue;
 		}
 		if (rc < 0)
@@ -1122,7 +1124,9 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
 			if (rc > 1)
 				return kvm_s390_inject_program_int(vcpu, rc);
 			if (rc == -ENOMEM) {
-				kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+				rc = kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+				if (rc)
+					return rc;
 				continue;
 			}
 			if (rc < 0)
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 06/10] KVM: s390: Fix locking in kvm_s390_set_mem_control()
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (4 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 05/10] KVM: s390: Fix handle_{sske,pfmf} under memory pressure Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:49   ` sashiko-bot
  2026-06-23 15:33 ` [PATCH v6 07/10] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

Add the missing locking around dat_reset_cmma().

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/kvm-s390.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ffb20a64d328..fe0ae15e5ad0 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -990,9 +990,11 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
 		if (!kvm->arch.use_cmma)
 			break;
 
+		guard(mutex)(&kvm->lock);
 		VM_EVENT(kvm, 3, "%s", "RESET: CMMA states");
 		do {
-			start_gfn = dat_reset_cmma(kvm->arch.gmap->asce, start_gfn);
+			scoped_guard(read_lock, &kvm->mmu_lock)
+				start_gfn = dat_reset_cmma(kvm->arch.gmap->asce, start_gfn);
 			cond_resched();
 		} while (start_gfn);
 		ret = 0;
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 07/10] KVM: s390: Fix cmma dirty tracking
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (5 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 06/10] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:50   ` sashiko-bot
  2026-06-23 15:33 ` [PATCH v6 08/10] KVM: s390: selftests: Fix cmma selftest Claudio Imbrenda
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

It is possible that some guest memory areas have not been touched yet
when starting migration mode, and thus have no ptes allocated. Only
existing and allocated ptes should count toward the total of dirty cmma
entries.

When starting migration mode, enable the migration_mode flag
immediately, so that any subsequent ESSA will trap in the host and
cause cmma_dirty_pages to be increased as needed.
Subsequently, set the cmma_d bit on all existing cmma-clean PGSTEs,
increasing cmma_dirty_pages as needed. Skipping cmma-dirty pages
prevents double counting.

Conversely, when disabling migration mode, set cmma_dirty_pages to 0
and clear the cmma_d bit in all existing PGSTEs.

The invariant is that when migration mode is off, no PGSTE has its
cmma_d bit set, and cmma_dirty_pages is 0. kvm->slots_lock protects
kvm_s390_vm_start_migration() and kvm_s390_vm_stop_migration() from
each other and from kvm_s390_get_cmma_bits().

Also fix dat_get_cmma() to properly wrap around if the first attempt
reached the end of guest memory without finding cmma-dirty pages.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/dat.c      |  3 +++
 arch/s390/kvm/gmap.c     | 31 ++++++++++++++++++++++++----
 arch/s390/kvm/gmap.h     | 12 ++++++++++-
 arch/s390/kvm/kvm-s390.c | 44 ++++++++++++++++++++++++++++++++--------
 arch/s390/kvm/priv.c     |  2 +-
 5 files changed, 77 insertions(+), 15 deletions(-)

diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
index cffac7782c4b..0ad4ebc80eba 100644
--- a/arch/s390/kvm/dat.c
+++ b/arch/s390/kvm/dat.c
@@ -1253,6 +1253,9 @@ int dat_get_cmma(union asce asce, gfn_t *start, unsigned int *count, u8 *values,
 	};
 
 	_dat_walk_gfn_range(*start, asce_end(asce), asce, &ops, DAT_WALK_IGN_HOLES, &state);
+	/* If no dirty pages were found, wrap around and continue searching */
+	if (*start && state.start == -1)
+		_dat_walk_gfn_range(0, *start, asce, &ops, DAT_WALK_IGN_HOLES, &state);
 
 	if (state.start == -1) {
 		*count = 0;
diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index e6e786811db8..0f944944badf 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -1073,23 +1073,46 @@ int gmap_protect_rmap(struct kvm_s390_mmu_cache *mc, struct gmap *sg, gfn_t p_gf
 	return 0;
 }
 
+static long __set_cmma_clean_pte(union pte *ptep, gfn_t gfn, gfn_t next, struct dat_walk *walk)
+{
+	union pgste pgste;
+
+	pgste = pgste_get_lock(ptep);
+	pgste.cmma_d = 0;
+	pgste_set_unlock(ptep, pgste);
+
+	if (need_resched())
+		return next;
+	return 0;
+}
+
 static long __set_cmma_dirty_pte(union pte *ptep, gfn_t gfn, gfn_t next, struct dat_walk *walk)
 {
-	__atomic64_or(PGSTE_CMMA_D_BIT, &pgste_of(ptep)->val);
+	union pgste pgste;
+
+	pgste = pgste_get_lock(ptep);
+	if (!pgste.cmma_d)
+		atomic64_inc(walk->priv);
+	pgste.cmma_d = 1;
+	pgste_set_unlock(ptep, pgste);
+
 	if (need_resched())
 		return next;
 	return 0;
 }
 
-void gmap_set_cmma_all_dirty(struct gmap *gmap)
+void _gmap_set_cmma_all(struct gmap *gmap, bool dirty)
 {
-	const struct dat_walk_ops ops = { .pte_entry = __set_cmma_dirty_pte, };
+	const struct dat_walk_ops ops = {
+		.pte_entry = dirty ? __set_cmma_dirty_pte : __set_cmma_clean_pte,
+	};
 	gfn_t gfn = 0;
 
 	do {
 		scoped_guard(read_lock, &gmap->kvm->mmu_lock)
 			gfn = _dat_walk_gfn_range(gfn, asce_end(gmap->asce), gmap->asce, &ops,
-						  DAT_WALK_IGN_HOLES, NULL);
+						  DAT_WALK_IGN_HOLES,
+						  &gmap->kvm->arch.cmma_dirty_pages);
 		cond_resched();
 	} while (gfn);
 }
diff --git a/arch/s390/kvm/gmap.h b/arch/s390/kvm/gmap.h
index 5374f21aaf8d..39cb2ee1eede 100644
--- a/arch/s390/kvm/gmap.h
+++ b/arch/s390/kvm/gmap.h
@@ -103,7 +103,7 @@ int gmap_pv_destroy_range(struct gmap *gmap, gfn_t start, gfn_t end, bool interr
 int gmap_insert_rmap(struct gmap *sg, gfn_t p_gfn, gfn_t r_gfn, int level);
 int gmap_protect_rmap(struct kvm_s390_mmu_cache *mc, struct gmap *sg, gfn_t p_gfn, gfn_t r_gfn,
 		      kvm_pfn_t pfn, int level, bool wr);
-void gmap_set_cmma_all_dirty(struct gmap *gmap);
+void _gmap_set_cmma_all(struct gmap *gmap, bool dirty);
 void _gmap_handle_vsie_unshadow_event(struct gmap *parent, gfn_t gfn);
 struct gmap *gmap_create_shadow(struct kvm_s390_mmu_cache *mc, struct gmap *gmap,
 				union asce asce, int edat_level);
@@ -167,6 +167,16 @@ static inline bool gmap_unmap_prefix(struct gmap *gmap, gfn_t gfn, gfn_t end)
 	return _gmap_unmap_prefix(gmap, gfn, end, false);
 }
 
+static inline void gmap_set_cmma_all_dirty(struct gmap *gmap)
+{
+	_gmap_set_cmma_all(gmap, true);
+}
+
+static inline void gmap_set_cmma_all_clean(struct gmap *gmap)
+{
+	_gmap_set_cmma_all(gmap, false);
+}
+
 /**
  * pte_needs_unshadow() -- Check if the pte operations triggers unshadowing.
  * @oldpte: the previous value for the guest pte.
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index fe0ae15e5ad0..91cf7a3c55c7 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1187,13 +1187,13 @@ static void kvm_s390_sync_request_broadcast(struct kvm *kvm, int req)
 
 /*
  * Must be called with kvm->srcu held to avoid races on memslots, and with
- * kvm->slots_lock to avoid races with ourselves and kvm_s390_vm_stop_migration.
+ * kvm->slots_lock to avoid races with ourselves, kvm_s390_vm_stop_migration(),
+ * and kvm_s390_get_cmma_bits().
  */
 static int kvm_s390_vm_start_migration(struct kvm *kvm)
 {
 	struct kvm_memory_slot *ms;
 	struct kvm_memslots *slots;
-	unsigned long ram_pages = 0;
 	int bkt;
 
 	/* migration mode already enabled */
@@ -1210,28 +1210,54 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
 	kvm_for_each_memslot(ms, bkt, slots) {
 		if (!ms->dirty_bitmap)
 			return -EINVAL;
-		ram_pages += ms->npages;
 	}
-	/* mark all the pages as dirty */
+	/*
+	 * Set the flag and let KVM handle ESSA manually, potentially setting
+	 * the cmma_d bit in some PGSTEs and increasing cmma_dirty_pages.
+	 * At this point cmma_dirty_pages is still 0, and all existing PGSTEs
+	 * have their cmma_d bit set to 0.
+	 * Any newly allocated page table has its entries marked as cmma-clean,
+	 * which is fine because the CMMA values are not dirty.
+	 */
+	WRITE_ONCE(kvm->arch.migration_mode, 1);
+	/*
+	 * Mark all PGSTEs as cmma-dirty, increasing cmma_dirty_pages as needed,
+	 * but without double-counting pages that have become dirty on their own
+	 * in the meantime.
+	 * At this point some pages might have become dirty on their own already
+	 * and cmma_dirty_pages might therefore be non-zero.
+	 */
 	gmap_set_cmma_all_dirty(kvm->arch.gmap);
-	atomic64_set(&kvm->arch.cmma_dirty_pages, ram_pages);
-	kvm->arch.migration_mode = 1;
 	kvm_s390_sync_request_broadcast(kvm, KVM_REQ_START_MIGRATION);
 	return 0;
 }
 
 /*
- * Must be called with kvm->slots_lock to avoid races with ourselves and
- * kvm_s390_vm_start_migration.
+ * Must be called with kvm->slots_lock to avoid races with ourselves,
+ * kvm_s390_vm_start_migration() and kvm_s390_get_cmma_bits().
  */
 static int kvm_s390_vm_stop_migration(struct kvm *kvm)
 {
 	/* migration mode already disabled */
 	if (!kvm->arch.migration_mode)
 		return 0;
-	kvm->arch.migration_mode = 0;
+	/*
+	 * Unset the flag and propagate to all vCPUs. From now on the cmma_d
+	 * bit will not be touched on any PGSTE.
+	 * At this point cmma_dirty_pages is possibly non-zero, and thus some
+	 * PGSTEs might have cmma_d set.
+	 */
+	WRITE_ONCE(kvm->arch.migration_mode, 0);
 	if (kvm->arch.use_cmma)
 		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
+	/* Clear cmma_d on all existing PGSTEs and set cmma_dirty_pages to 0. */
+	gmap_set_cmma_all_clean(kvm->arch.gmap);
+	atomic64_set(&kvm->arch.cmma_dirty_pages, 0);
+	/*
+	 * At this point the system has the expected state: migration_mode is 0,
+	 * cmma_dirty_pages is 0, and all existing PGSTEs have their cmma_d bit
+	 * set to 0.
+	 */
 	return 0;
 }
 
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 9bc6fd02ff77..ad0ddc433a73 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -1236,7 +1236,7 @@ static int handle_essa(struct kvm_vcpu *vcpu)
 						: ESSA_SET_STABLE_IF_RESIDENT))
 		return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
 
-	if (!vcpu->kvm->arch.migration_mode) {
+	if (!READ_ONCE(vcpu->kvm->arch.migration_mode)) {
 		/*
 		 * CMMA is enabled in the KVM settings, but is disabled in
 		 * the SIE block and in the mm_context, and we are not doing
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 08/10] KVM: s390: selftests: Fix cmma selftest
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (6 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 07/10] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 09/10] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 10/10] KVM: s390: vsie: Avoid potential deadlock with real spaces Claudio Imbrenda
  9 siblings, 0 replies; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

The existing cmma selftest depended on the host allocating page tables
for all present memslots. Since the gmap rewrite, memory that is not
accessed by the guest might not have page tables allocated yet.

This caused the test to fail due to a mismatch in the assertion.

Fix by having the guest access also the second half of the test
memslot, thus guaranteeing that its page tables are present.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
---
 tools/testing/selftests/kvm/s390/cmma_test.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/kvm/s390/cmma_test.c b/tools/testing/selftests/kvm/s390/cmma_test.c
index e39a724fe860..15d81b2ed7ad 100644
--- a/tools/testing/selftests/kvm/s390/cmma_test.c
+++ b/tools/testing/selftests/kvm/s390/cmma_test.c
@@ -34,16 +34,22 @@ static char cmma_value_buf[MAIN_PAGE_COUNT + TEST_DATA_PAGE_COUNT];
 /**
  * Dirty CMMA attributes of exactly one page in the TEST_DATA memslot,
  * so use_cmma goes on and the CMMA related ioctls do something.
+ * Touch the page at offset 1M inside TEST_DATA to make sure its page
+ * tables are allocated in the host.
  */
 static void guest_do_one_essa(void)
 {
 	asm volatile(
 		/* load TEST_DATA_START_GFN into r1 */
+		"	xgr 1,1\n"
 		"	llilf 1,%[start_gfn]\n"
 		/* calculate the address from the gfn */
 		"	sllg 1,1,12(0)\n"
 		/* set the first page in TEST_DATA memslot to STABLE */
 		"	.insn rrf,0xb9ab0000,2,1,1,0\n"
+		"	agfi 1,0x100000\n"
+		/* also touch the first page of the second MB of TEST_DATA */
+		"	.insn rrf,0xb9ab0000,2,1,1,0\n"
 		/* hypercall */
 		"	diag 0,0,0x501\n"
 		"0:	j 0b"
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 09/10] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits()
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (7 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 08/10] KVM: s390: selftests: Fix cmma selftest Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:33 ` [PATCH v6 10/10] KVM: s390: vsie: Avoid potential deadlock with real spaces Claudio Imbrenda
  9 siblings, 0 replies; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

If the allocation of the bits array failed, kvm_s390_set_cmma_bits()
would return 0 instead of an error code.

Rework the function to use the __free() macros and thus simplify the
code flow; when the above mentioned allocation fails, simply return
-ENOMEM.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
---
 arch/s390/kvm/kvm-s390.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 91cf7a3c55c7..c08a63a1bfb9 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2313,8 +2313,8 @@ static int kvm_s390_get_cmma_bits(struct kvm *kvm,
 static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 				  const struct kvm_s390_cmma_log *args)
 {
-	struct kvm_s390_mmu_cache *mc;
-	u8 *bits = NULL;
+	struct kvm_s390_mmu_cache *mc __free(kvm_s390_mmu_cache) = NULL;
+	u8 *bits __free(kvfree) = NULL;
 	int r = 0;
 
 	if (!kvm->arch.use_cmma)
@@ -2334,18 +2334,16 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 		return -ENOMEM;
 	bits = vmalloc(array_size(sizeof(*bits), args->count));
 	if (!bits)
-		goto out;
+		return -ENOMEM;
 
 	r = copy_from_user(bits, (void __user *)args->values, args->count);
-	if (r) {
-		r = -EFAULT;
-		goto out;
-	}
+	if (r)
+		return -EFAULT;
 
 	do {
 		r = kvm_s390_mmu_cache_topup(mc);
 		if (r)
-			break;
+			return r;
 		scoped_guard(read_lock, &kvm->mmu_lock) {
 			r = dat_set_cmma_bits(mc, kvm->arch.gmap->asce, args->start_gfn,
 					      args->count, args->mask, bits);
@@ -2353,9 +2351,7 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 	} while (r == -ENOMEM);
 
 	set_bit(GMAP_FLAG_USES_CMM, &kvm->arch.gmap->flags);
-out:
-	kvm_s390_free_mmu_cache(mc);
-	vfree(bits);
+
 	return r;
 }
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v6 10/10] KVM: s390: vsie: Avoid potential deadlock with real spaces
  2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
                   ` (8 preceding siblings ...)
  2026-06-23 15:33 ` [PATCH v6 09/10] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() Claudio Imbrenda
@ 2026-06-23 15:33 ` Claudio Imbrenda
  2026-06-23 15:51   ` sashiko-bot
  9 siblings, 1 reply; 17+ messages in thread
From: Claudio Imbrenda @ 2026-06-23 15:33 UTC (permalink / raw)
  To: linux-kernel
  Cc: kvm, linux-s390, borntraeger, frankja, david, seiden, nrb,
	schlameuss, gra, hca, gerald.schaefer, gor, agordeev, svens

The natural lock ordering is mmu_lock -> children_lock, but in
gmap_create_shadow() the reverse order is used when handling shadowing
of real address spaces.

Convert the inner locking of kvm->mmu_lock to a trylock; return -EAGAIN
if the lock is busy, and let the caller try again.

This path is not expected to happen in real-life scenarios, so its
performance is not important.

Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/gmap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index 0f944944badf..1fc8d848b2bc 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -1349,8 +1349,12 @@ struct gmap *gmap_create_shadow(struct kvm_s390_mmu_cache *mc, struct gmap *pare
 			/* Only allow one real-space gmap shadow. */
 			list_for_each_entry(sg, &parent->children, list) {
 				if (sg->guest_asce.r) {
-					scoped_guard(write_lock, &parent->kvm->mmu_lock)
+					if (write_trylock(&parent->kvm->mmu_lock)) {
 						gmap_unshadow(sg);
+						write_unlock(&parent->kvm->mmu_lock);
+					} else {
+						return ERR_PTR(-EAGAIN);
+					}
 					break;
 				}
 			}
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 02/10] KVM: s390: Fix dat_peek_cmma() overflow
  2026-06-23 15:33 ` [PATCH v6 02/10] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
@ 2026-06-23 15:49   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-06-23 15:49 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: kvm, Alexander Gordeev, Heiko Carstens, linux-s390, Vasily Gorbik

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] The KVM_S390_GET_CMMA_BITS ioctl allocates the `values` array using `vmalloc()` without zeroing the memory. During the page table walk, `dat_peek_cmma()` and `dat_get_cmma()` can skip unmapped segments or clean pages, leaving uninitialized gaps in the array. When copied back to userspace, this leaks up to 1MB of uninitialized kernel heap memory.
--

commit 561661723442584bc83ff1dfa1a8372280efbf28
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Fix dat_peek_cmma() overflow

If userspace passes a start address that is out of bounds,
_dat_walk_gfn_range() will fail with -EFAULT, but state.end will not be
touched and will stay 0. This will cause *count to underflow and report
a very high number, and the function will end up erroneously reporting
success.

> diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
> index 4a41c0247ffa2..cffac7782c4bf 100644
> --- a/arch/s390/kvm/dat.c
> +++ b/arch/s390/kvm/dat.c
> @@ -1209,7 +1209,7 @@ int dat_peek_cmma(gfn_t start, union asce asce, unsigned int *count, u8 *values)
>  	int rc;
>  
>  	rc = _dat_walk_gfn_range(start, start + *count, asce, &ops, DAT_WALK_DEFAULT, &state);
> -	*count = state.end - start;
> +	*count = state.end >= start ? state.end - start : 0;

[Severity: High]
This is a pre-existing issue, but does this code leak uninitialized kernel
memory to userspace?

The KVM_S390_GET_CMMA_BITS ioctl allocates the values array using vmalloc()
without zeroing it:

arch/s390/kvm/kvm-s390.c:kvm_s390_get_cmma_bits() {
    ...
    values = vmalloc(args->count);
    if (!values)
        return -ENOMEM;
    ...
}

During the page table walk here, unmapped segments or clean pages can be
skipped without writing to the corresponding index in the values array,
such as in __dat_get_cmma_pte():

arch/s390/kvm/dat.c:__dat_get_cmma_pte() {
    ...
    state->values[gfn - state->start] = pgste.usage | pgste.nodat << 6;
    ...
}

Because state.end can advance without filling all gaps in the array, the count
calculated here will include uninitialized elements. Since the ioctl copies
this entire buffer back to userspace based on the returned count, could this
leak previous kernel heap content?

Would it be safer to use vzalloc() in kvm_s390_get_cmma_bits() instead?

>  	/* Return success if at least one value was saved, otherwise an error. */
>  	return (rc == -EFAULT && *count > 0) ? 0 : rc;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623153331.233784-1-imbrenda@linux.ibm.com?part=2

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 06/10] KVM: s390: Fix locking in kvm_s390_set_mem_control()
  2026-06-23 15:33 ` [PATCH v6 06/10] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
@ 2026-06-23 15:49   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-06-23 15:49 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-s390, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Lost early-exit return codes in `dat_crste_walk_range` leads to soft lockups and aging logic failures.
--

commit b7c3f62b3078686b2bddd109edeee90c43ecceb5
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Fix locking in kvm_s390_set_mem_control()

Add the missing locking around dat_reset_cmma().

> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index ffb20a64d328f..fe0ae15e5ad07 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -990,9 +990,11 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
>  		if (!kvm->arch.use_cmma)
>  			break;
>  
> +		guard(mutex)(&kvm->lock);
>  		VM_EVENT(kvm, 3, "%s", "RESET: CMMA states");
>  		do {
> -			start_gfn = dat_reset_cmma(kvm->arch.gmap->asce, start_gfn);
> +			scoped_guard(read_lock, &kvm->mmu_lock)
> +				start_gfn = dat_reset_cmma(kvm->arch.gmap->asce, start_gfn);

[Severity: High]
This is a pre-existing issue, but looking at how dat_reset_cmma() and other
DAT walk operations use _dat_walk_gfn_range() under the hood, is there a bug
in arch/s390/kvm/dat.c:dat_crste_walk_range() where we can silently lose
early-exit return codes?

When dat_crste_walk_range() performs lower-level walks, it assigns the result
to 'rc', but lacks an immediate check to break the loop:

arch/s390/kvm/dat.c:dat_crste_walk_range() {
	...
		if (!crste.h.i && !crste.h.fc) {
			if (!is_pmd(crste))
				rc = dat_crste_walk_range(max(start, cur), min(end, next),
							  _dereference_crste(crste), walk);
			else if (walk->ops->pte_entry)
				rc = dat_pte_walk_range(max(start, cur), min(end, next),
							dereference_pmd(crste.pmd), walk);
		}
	}
	return rc;
}

If a walk operation defines both crste_ops and pte_entry (such as
dat_reset_skeys or test_age_ops) and dat_pte_walk_range() returns a non-zero
code (like a gfn for a reschedule request or 1 for finding a young page),
the loop continues to the next iteration.

In the next iteration, if the upper-level crste_op evaluates to 0, could it
silently overwrite 'rc' and allow the walk to incorrectly continue?

Can this cause soft lockups on the host for guests with large memory due to
lost reschedule requests, or incorrect page aging data causing performance
degradation?

>  			cond_resched();
>  		} while (start_gfn);
>  		ret = 0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623153331.233784-1-imbrenda@linux.ibm.com?part=6

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 07/10] KVM: s390: Fix cmma dirty tracking
  2026-06-23 15:33 ` [PATCH v6 07/10] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
@ 2026-06-23 15:50   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-06-23 15:50 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-s390, Heiko Carstens, Vasily Gorbik, Alexander Gordeev, kvm

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

New issues:
- [High] Race condition between kvm_s390_vm_stop_migration() and handle_essa() causes cmma_dirty_pages undercounting and data loss.
- [Medium] Missing curly braces around the kvm->arch.use_cmma condition in kvm_s390_vm_stop_migration() causes an unconditional, highly expensive guest page table walk.

Pre-existing issues:
- [High] Page table walk skips pages and silently overwrites return values in dat_crste_walk_range(), breaking dat_test_age_gfn() and dat_reset_skeys().
- [High] Uninitialized vmalloc memory is copied to user space in kvm_s390_get_cmma_bits(), leaking kernel memory.
- [High] The CMMA dirty tracking allows native hardware execution of ESSA instructions during the full memory walk, potentially causing dirty CMMA state to be missed during live migration.
--

commit 9bad94601846b4f7b13bcf1ade41fc1b47e4dedf
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Fix cmma dirty tracking

This commit modifies the CMMA dirty tracking to set the migration_mode flag
before marking all existing PGSTEs as cmma-dirty. It also attempts to clear
the cmma_d bit and reset the dirty page count when stopping migration, and
addresses a wrap-around case in dat_get_cmma() when searching for pages.

> diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
> index cffac7782c4bf..0ad4ebc80ebaf 100644
> --- a/arch/s390/kvm/dat.c
> +++ b/arch/s390/kvm/dat.c
> @@ -1253,6 +1253,9 @@ int dat_get_cmma(union asce asce, gfn_t *start, unsigned int *count, u8 *values,
>  	};
>  
>  	_dat_walk_gfn_range(*start, asce_end(asce), asce, &ops, DAT_WALK_IGN_HOLES, &state);

[Severity: High]
This is a pre-existing issue, but does dat_get_cmma() leave memory uninitialized
when skipping clean pages?

Looking at kvm_s390_get_cmma_bits(), the values array is allocated using
vmalloc(args->count), which leaves the memory uninitialized. When
dat_get_cmma() writes the dirty states, it skips clean pages (cmma_d == 0)
and unmapped holes without writing to those indices. Does the subsequent
copy_to_user() in kvm_s390_get_cmma_bits() then leak this uninitialized
kernel memory to user space?

[Severity: High]
This is also a pre-existing issue, but does the page table walk in
dat_crste_walk_range() silently overwrite return values?

When _dat_walk_gfn_range() calls dat_crste_walk_range(), the loop logic
places the if (rc) break; condition before the child walk is checked:

arch/s390/kvm/dat.c:dat_crste_walk_range() {
    ...
    the_op = walk->ops->crste_ops[crste.h.tt];
    if (the_op) {
        rc = the_op(walk->last, cur, next, walk);
        ...
    }
    if (rc)
        break;
    if (!crste.h.i && !crste.h.fc) {
        if (!is_pmd(crste))
            rc = dat_crste_walk_range(...);
        else if (walk->ops->pte_entry)
            rc = dat_pte_walk_range(...);
    }
}

When dat_pte_walk_range() returns a positive rc, the loop proceeds to the
next iteration. If the_op executes (for example, _dat_test_young_crste
returning 0), does it overwrite the previous positive rc, causing the loop
to fail to break and ignoring yield requests or young-page hits?

[ ... ]

> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index fe0ae15e5ad07..91cf7a3c55c79 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1210,28 +1210,54 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
>  	kvm_for_each_memslot(ms, bkt, slots) {
>  		if (!ms->dirty_bitmap)
>  			return -EINVAL;
> -		ram_pages += ms->npages;
>  	}
> -	/* mark all the pages as dirty */
> +	/*
> +	 * Set the flag and let KVM handle ESSA manually, potentially setting
> +	 * the cmma_d bit in some PGSTEs and increasing cmma_dirty_pages.
> +	 * At this point cmma_dirty_pages is still 0, and all existing PGSTEs
> +	 * have their cmma_d bit set to 0.
> +	 * Any newly allocated page table has its entries marked as cmma-clean,
> +	 * which is fine because the CMMA values are not dirty.
> +	 */
> +	WRITE_ONCE(kvm->arch.migration_mode, 1);
> +	/*
> +	 * Mark all PGSTEs as cmma-dirty, increasing cmma_dirty_pages as needed,
> +	 * but without double-counting pages that have become dirty on their own
> +	 * in the meantime.
> +	 * At this point some pages might have become dirty on their own already
> +	 * and cmma_dirty_pages might therefore be non-zero.
> +	 */
>  	gmap_set_cmma_all_dirty(kvm->arch.gmap);
> -	atomic64_set(&kvm->arch.cmma_dirty_pages, ram_pages);
> -	kvm->arch.migration_mode = 1;
>  	kvm_s390_sync_request_broadcast(kvm, KVM_REQ_START_MIGRATION);

[Severity: High]
This is a pre-existing issue, but does this sequence allow native hardware
execution of ESSA instructions during the full memory walk?

Since the KVM_REQ_START_MIGRATION broadcast (which clears ECB2_CMMA to
disable hardware ESSA execution) is issued after gmap_set_cmma_all_dirty()
completes, a vCPU could execute ESSA natively on a newly allocated page
that lies behind the current walk pointer. Since newly allocated PGSTEs
start with cmma_d = 0 and the walk has already passed them, could this
cause dirty CMMA state to be missed during live migration?

>  	return 0;
>  }
>  
>  /*
> - * Must be called with kvm->slots_lock to avoid races with ourselves and
> - * kvm_s390_vm_start_migration.
> + * Must be called with kvm->slots_lock to avoid races with ourselves,
> + * kvm_s390_vm_start_migration() and kvm_s390_get_cmma_bits().
>   */
>  static int kvm_s390_vm_stop_migration(struct kvm *kvm)
>  {
>  	/* migration mode already disabled */
>  	if (!kvm->arch.migration_mode)
>  		return 0;
> -	kvm->arch.migration_mode = 0;
> +	/*
> +	 * Unset the flag and propagate to all vCPUs. From now on the cmma_d
> +	 * bit will not be touched on any PGSTE.
> +	 * At this point cmma_dirty_pages is possibly non-zero, and thus some
> +	 * PGSTEs might have cmma_d set.
> +	 */
> +	WRITE_ONCE(kvm->arch.migration_mode, 0);
>  	if (kvm->arch.use_cmma)
>  		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
> +	/* Clear cmma_d on all existing PGSTEs and set cmma_dirty_pages to 0. */
> +	gmap_set_cmma_all_clean(kvm->arch.gmap);
> +	atomic64_set(&kvm->arch.cmma_dirty_pages, 0);

[Severity: Medium]
Are curly braces missing around the kvm->arch.use_cmma condition?

Because they are absent, gmap_set_cmma_all_clean() and atomic64_set()
execute unconditionally. For VMs where use_cmma is false, this seems to force
an unnecessary full walk of the entire guest page table under mmu_lock to
clear bits that were never set, which could stall the host and VM.

[Severity: High]
Can a race condition between kvm_s390_vm_stop_migration() and handle_essa()
cause cmma_dirty_pages to undercount?

If handle_essa() reads migration_mode locklessly and enters __do_essa()
concurrently with kvm_s390_vm_stop_migration():

CPU1 (kvm_s390_vm_stop_migration):
    gmap_set_cmma_all_clean(kvm->arch.gmap);

CPU2 (handle_essa -> __do_essa):
    Sets cmma_d = 1 on a PTE that CPU1 already cleaned
    atomic64_inc(&kvm->arch.cmma_dirty_pages);

CPU1 (kvm_s390_vm_stop_migration):
    atomic64_set(&kvm->arch.cmma_dirty_pages, 0);

Would this unconditional reset to 0 overwrite CPU2's increment, leaving a
PTE with cmma_d == 1 but cmma_dirty_pages == 0, leading to premature stops
in dat_get_cmma() and data loss during the next migration?

> +	/*
> +	 * At this point the system has the expected state: migration_mode is 0,
> +	 * cmma_dirty_pages is 0, and all existing PGSTEs have their cmma_d bit
> +	 * set to 0.
> +	 */
>  	return 0;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623153331.233784-1-imbrenda@linux.ibm.com?part=7

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 10/10] KVM: s390: vsie: Avoid potential deadlock with real spaces
  2026-06-23 15:33 ` [PATCH v6 10/10] KVM: s390: vsie: Avoid potential deadlock with real spaces Claudio Imbrenda
@ 2026-06-23 15:51   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-06-23 15:51 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-s390, kvm, Heiko Carstens, Vasily Gorbik, Alexander Gordeev

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Returning `-EAGAIN` without releasing the newly allocated `gmap` structure causes a severe memory leak, potentially leading to a host DoS under lock contention.
--

commit 50740591beb78c10a0032430dbcc9d8cf4291017
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: vsie: Avoid potential deadlock with real spaces

This commit converts the inner locking of kvm->mmu_lock in gmap_create_shadow()
to a trylock. It returns -EAGAIN if the lock is busy to avoid a potential
deadlock with real spaces when the natural lock ordering is reversed.

> diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
> --- a/arch/s390/kvm/gmap.c
> +++ b/arch/s390/kvm/gmap.c
[ ... ]
> @@ -1349,8 +1349,12 @@ struct gmap *gmap_create_shadow(struct kvm_s390_mmu_cache *mc, struct gmap *pare
>  			/* Only allow one real-space gmap shadow. */
>  			list_for_each_entry(sg, &parent->children, list) {
>  				if (sg->guest_asce.r) {
> -					scoped_guard(write_lock, &parent->kvm->mmu_lock)
> +					if (write_trylock(&parent->kvm->mmu_lock)) {
>  						gmap_unshadow(sg);
> +						write_unlock(&parent->kvm->mmu_lock);
> +					} else {
> +						return ERR_PTR(-EAGAIN);
> +					}

[Severity: High]
Does returning here leak the memory allocated for the new gmap structure?

Looking at the code earlier in gmap_create_shadow(), the new gmap is allocated
unconditionally before this lock attempt:

arch/s390/kvm/gmap.c:gmap_create_shadow() {
    ...
	/* Create a new shadow gmap. */
	new = gmap_new(parent->kvm, asce.r ? 1UL << (64 - PAGE_SHIFT) : asce_end(asce));
	if (!new)
		return ERR_PTR(-ENOMEM);
    ...
	scoped_guard(spinlock, &parent->children_lock) {
		...
		if (asce.r) {
			/* Only allow one real-space gmap shadow. */
			list_for_each_entry(sg, &parent->children, list) {
				if (sg->guest_asce.r) {
					if (write_trylock(&parent->kvm->mmu_lock)) {
						gmap_unshadow(sg);
						write_unlock(&parent->kvm->mmu_lock);
					} else {
						return ERR_PTR(-EAGAIN);
					}

When write_trylock() fails, should we call gmap_put(new) before returning
ERR_PTR(-EAGAIN) to avoid leaking the gmap structure and its page tables?

Since callers like vsie_run() will loop and retry on -EAGAIN, could this
lead to rapid memory exhaustion under lock contention?

>  					break;
>  				}
>  			}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623153331.233784-1-imbrenda@linux.ibm.com?part=10

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit
  2026-06-23 15:33 ` [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
@ 2026-06-23 15:54   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-06-23 15:54 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-s390, Vasily Gorbik, Alexander Gordeev, kvm, Heiko Carstens

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [Critical] The patch fails to fully resolve the guest corruption issue because `_PAGE_UNUSED` is not cleared when a guest re-uses a writable host page, since no host page fault occurs to trigger `set_pte()`.
--

commit 5e54edccf96c9de6555009d16a3fc257dbca326f
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

s390/mm: Fix handling of _PAGE_UNUSED pte bit

The _PAGE_UNUSED softbit signals that a page can be discarded instead of being
moved or swapped. This patch clears the _PAGE_UNUSED bit for present ptes in
set_pte() and gmap_helper_try_set_pte_unused() to prevent used guest pages
from being incorrectly thrown out.

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -980,6 +980,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
> +	if (pte_present(pte))
> +		pte = clear_pte_bit(pte, __pgprot(_PAGE_UNUSED));
>  	WRITE_ONCE(*ptep, pte);
>  }

[Severity: Critical]
This is a pre-existing issue, but does this approach fully resolve the guest
data corruption?

When KVM unmaps a guest page, it sets _PAGE_UNUSED on the host PTE via
gmap_helper_try_set_pte_unused(). This atomic OR preserves the present and
writable bits.

If the guest subsequently accesses the page, a DAT fault occurs and KVM
resolves it via __kvm_faultin_pfn() and get_user_pages(). Because the host
PTE is already present and writable, get_user_pages() can succeed without
triggering a host page fault.

In this lockless host PTE resolution path, set_pte() is never invoked, so
the _PAGE_UNUSED bit remains set. KVM then links the page into the guest
DAT table, and the guest can write new data.

If host memory pressure later causes try_to_unmap_one() to process this
page, it will see pte_unused() as true. Will this cause the page to still
be silently discarded instead of swapped out, leading to the same guest
corruption?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623153331.233784-1-imbrenda@linux.ibm.com?part=1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 03/10] KVM: s390: Do not set special large pages dirty
  2026-06-23 15:33 ` [PATCH v6 03/10] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
@ 2026-06-23 15:55   ` sashiko-bot
  0 siblings, 0 replies; 17+ messages in thread
From: sashiko-bot @ 2026-06-23 15:55 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Vasily Gorbik, Alexander Gordeev, linux-s390, Heiko Carstens, kvm

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Missing check for special pages (`!s.s` / `!old.s.fc1.s`) in KVM gmap unmap and destroy paths causes operations on garbage folio/page pointers.
- [Critical] The UVC handler `_kvm_s390_pv_make_secure()` unconditionally converts the faulting PFN to a folio, enabling a malicious PV guest to crash the host by attempting to secure an MMIO region.
- [High] The `s` (Special) bit is lost when a large page (PMD) is split into PTEs, leading to illegal `folio_set_dirty()` calls on special pages.
--

commit c8813ca7d60b80b32bbd7cf3d235de6086bea6b4
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Do not set special large pages dirty

Special pages / folios should not be set dirty. This also applies to
large pages. Add a missing check in gmap_clear_young_crste() to prevent
setting the large page dirty if it is a special page.

> diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
> --- a/arch/s390/kvm/gmap.c
> +++ b/arch/s390/kvm/gmap.c
> @@ -327,7 +327,7 @@ static long gmap_clear_young_crste(union crste *crstep, gfn_t gfn, gfn_t end, st
>  		new.h.i = 1;
>  		new.s.fc1.y = 0;
>  		new.s.fc1.prefix_notif = 0;
> -		if (new.s.fc1.d || !new.h.p)
> +		if ((new.s.fc1.d || !new.h.p) && !new.s.fc1.s)
>  			folio_set_dirty(phys_to_folio(crste_origin_large(crste)));
>  		new.s.fc1.d = 0;
>  		new.h.p = 1;

[Severity: Critical]
This isn't a bug introduced by this patch, but since you are fixing the
special bit check here, I noticed other unmap and destroy paths might be
missing similar checks.

In _gmap_unmap_pte() and _gmap_unmap_crste(), the folio is extracted and
converted without verifying if the special bit is clear:

arch/s390/kvm/gmap.c:_gmap_unmap_pte() {
    ...
    if (ptep->s.pr && test_bit(GMAP_FLAG_EXPORT_ON_UNMAP, &priv->gmap->flags))
        folio = pfn_folio(ptep->h.pfra);
    ...
}

Similarly, _destroy_pages_pte() and _destroy_pages_crste() check for presence
but seem to skip checking the special bit before calling phys_to_page().

Can this lead to host kernel panics? If a protected VM has MMIO regions
mapped, could unmapping or destroying the VM cause these functions to extract
a garbage pointer from the special PFN and crash the host upon dereference?

[Severity: Critical]
This is a pre-existing issue, but while looking at PFN conversions, does
the UVC handler correctly validate the faulting PFN before converting it?

arch/s390/kvm/pv.c:_kvm_s390_pv_make_secure() {
    ...
    folio = pfn_folio(f->pfn);
    priv->rc = -EAGAIN;
    ...
    if (folio_trylock(folio)) {
    ...
}

Could a malicious PV guest issue a secure storage conversion request on a
guest physical address corresponding to an MMIO region? If KVM's fault-in
logic resolves the PFN without a backing struct page, would the unconditional
pfn_folio() conversion and subsequent trylock result in a host panic?

[Severity: High]
This is a pre-existing issue, but I noticed the special bit might be getting
lost during page table splitting.

When dat_split_ste() copies properties from the old PMD to the new PTEs,
it appears to omit the special bit:

arch/s390/kvm/dat.c:dat_split_ste() {
    ...
    init.val = pmd_origin_large(old);
    init.h.p = old.h.p;
    init.h.i = old.h.i;
    init.s.d = old.s.fc1.d;
    init.s.w = old.s.fc1.w;
    init.s.y = old.s.fc1.y;
    init.s.sd = old.s.fc1.sd;
    init.s.pr = old.s.fc1.pr;
    ...
}

If a special large page is split, will the resulting PTEs permanently lose
their special designation? Could this cause aging paths to mistakenly skip
the special check and attempt to call folio_set_dirty() on an invalid
pointer later?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260623153331.233784-1-imbrenda@linux.ibm.com?part=3

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-06-23 15:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-23 15:33 [PATCH v6 00/10] KVM: s390: A bunch of gmap-related fixes Claudio Imbrenda
2026-06-23 15:33 ` [PATCH v6 01/10] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
2026-06-23 15:54   ` sashiko-bot
2026-06-23 15:33 ` [PATCH v6 02/10] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
2026-06-23 15:49   ` sashiko-bot
2026-06-23 15:33 ` [PATCH v6 03/10] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
2026-06-23 15:55   ` sashiko-bot
2026-06-23 15:33 ` [PATCH v6 04/10] KVM: s390: Fix code typo in gmap_protect_asce_top_level() Claudio Imbrenda
2026-06-23 15:33 ` [PATCH v6 05/10] KVM: s390: Fix handle_{sske,pfmf} under memory pressure Claudio Imbrenda
2026-06-23 15:33 ` [PATCH v6 06/10] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
2026-06-23 15:49   ` sashiko-bot
2026-06-23 15:33 ` [PATCH v6 07/10] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
2026-06-23 15:50   ` sashiko-bot
2026-06-23 15:33 ` [PATCH v6 08/10] KVM: s390: selftests: Fix cmma selftest Claudio Imbrenda
2026-06-23 15:33 ` [PATCH v6 09/10] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() Claudio Imbrenda
2026-06-23 15:33 ` [PATCH v6 10/10] KVM: s390: vsie: Avoid potential deadlock with real spaces Claudio Imbrenda
2026-06-23 15:51   ` sashiko-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox