Kernel KVM virtualization development
* [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes
@ 2026-06-24 11:19 Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74 Claudio Imbrenda
                   ` (11 more replies)
  0 siblings, 12 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

Ciao Paolo,

here are some fixes for the second half of the merge window:

* Fix S390_USER_OPEREXEC so it can now be enabled regardless of other
  unrelated capabilities
* Fix handling of the _PAGE_UNUSED pte bit that could lead to guest
  memory corruption in some scenarios
* A bunch of misc gmap fixes (locking, behaviour under memory pressure)
* Fix CMMA dirty tracking

The following changes since commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6:

  Linux 7.1 (2026-06-14 15:58:38 +0100)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git tags/kvm-s390-next-7.2-2

for you to fetch changes up to babe08404e1993697a523e60bc0f9d096ffe1ef8:

  KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() (2026-06-24 10:08:57 +0200)

----------------------------------------------------------------
KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes

----------------------------------------------------------------
Claudio Imbrenda (9):
      s390/mm: Fix handling of _PAGE_UNUSED pte bit
      KVM: s390: Fix dat_peek_cmma() overflow
      KVM: s390: Do not set special large pages dirty
      KVM: s390: Fix code typo in gmap_protect_asce_top_level()
      KVM: s390: Fix handle_{sske,pfmf} under memory pressure
      KVM: s390: Fix locking in kvm_s390_set_mem_control()
      KVM: s390: Fix cmma dirty tracking
      KVM: s390: selftests: Fix cmma selftest
      KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits()

Eric Farman (3):
      KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74
      KVM: s390: selftests: Extended user_operexec tests
      KVM: s390: Fix typo in UCONTROL documentation

 Documentation/virt/kvm/api.rst                     |   2 +-
 arch/s390/include/asm/pgtable.h                    |   4 +-
 arch/s390/kvm/dat.c                                |   5 +-
 arch/s390/kvm/gmap.c                               |  35 +++++--
 arch/s390/kvm/gmap.h                               |  12 ++-
 arch/s390/kvm/kvm-s390.c                           |  71 ++++++++-----
 arch/s390/kvm/priv.c                               |  10 +-
 arch/s390/mm/gmap_helpers.c                        |   3 +-
 .../testing/selftests/kvm/include/s390/facility.h  |   6 ++
 tools/testing/selftests/kvm/s390/cmma_test.c       |   6 ++
 tools/testing/selftests/kvm/s390/user_operexec.c   | 110 +++++++++++++++++++++
 11 files changed, 226 insertions(+), 38 deletions(-)


* [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:40   ` sashiko-bot
  2026-06-24 11:19 ` [GIT PULL v1 02/12] KVM: s390: selftests: Extended user_operexec tests Claudio Imbrenda
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

From: Eric Farman <farman@linux.ibm.com>

The KVM_CAP_S390_USER_OPEREXEC capability allows operation exceptions
to be forwarded to userspace. But the actual enablement at the hardware
level occurs in kvm_arch_vcpu_postcreate(), and only if STFLE.74 or
user_instr0 is enabled. The latter is associated with a separate
capability (KVM_CAP_S390_USER_INSTR0), so the only way this happens
for the USER_OPEREXEC capability is if STFLE.74 is enabled. KVM
unconditionally enables this bit in kvm_arch_init_vm(), but userspace
can remove it from the guest CPU model, in which case the capability
silently has no effect.

Add USER_OPEREXEC to the check in kvm_arch_vcpu_postcreate(), such that
either capability would enable this type of exception.
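
A minimal sketch of the condition change, modelled as plain C outside
the kernel. The names vm_model, operexc_before() and operexc_after() are
hypothetical stand-ins for the kernel state; only the three inputs and
the OR-ed condition are taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-VM state consulted by the check. */
struct vm_model {
	bool facility_74;   /* STFLE bit 74 present in the CPU model */
	bool user_instr0;   /* KVM_CAP_S390_USER_INSTR0 enabled */
	bool user_operexec; /* KVM_CAP_S390_USER_OPEREXEC enabled */
};

/* Condition before the fix: user_operexec alone is not considered. */
static bool operexc_before(const struct vm_model *vm)
{
	return vm->facility_74 || vm->user_instr0;
}

/* Condition after the fix: any of the three requests the intercept. */
static bool operexc_after(const struct vm_model *vm)
{
	return vm->facility_74 || vm->user_instr0 || vm->user_operexec;
}
```

With facility 74 cleared and only user_operexec set, the old condition
evaluates to false and the new one to true.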

Fixes: 8e8678e740ec ("KVM: s390: Add capability that forwards operation exceptions")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
[Fixed patch title, as recommended by frankja@linux.ibm.com]
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260507200836.3500368-2-farman@linux.ibm.com>
---
 arch/s390/kvm/kvm-s390.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ffb20a64d328..8a3d55410f06 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3542,7 +3542,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 		vcpu->arch.gmap = vcpu->kvm->arch.gmap;
 		sca_add_vcpu(vcpu);
 	}
-	if (test_kvm_facility(vcpu->kvm, 74) || vcpu->kvm->arch.user_instr0)
+	if (test_kvm_facility(vcpu->kvm, 74) || vcpu->kvm->arch.user_instr0 ||
+	    vcpu->kvm->arch.user_operexec)
 		vcpu->arch.sie_block->ictl |= ICTL_OPEREXC;
 }
 
-- 
2.54.0



* [GIT PULL v1 02/12] KVM: s390: selftests: Extended user_operexec tests
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74 Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 03/12] KVM: s390: Fix typo in UCONTROL documentation Claudio Imbrenda
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

From: Eric Farman <farman@linux.ibm.com>

Previously, the user_operexec capability only took effect if facility
bit 74 was enabled. This is now fixed; add selftests that exercise the
user_instr0 and user_operexec capabilities, alone and combined, with
facility bit 74 cleared from the CPU model.
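
The selftest clears facility 74 with a helper that maps the bit number
through nr ^ (BITS_PER_LONG - 1), because facility bits are numbered
starting from the most significant bit of each word. A small sketch of
that mapping, assuming 64-bit words; the function names below are
hypothetical and use uint64_t instead of unsigned long so the example
does not depend on the platform:

```c
#include <stdbool.h>
#include <stdint.h>

/* Clear bit 'nr' where bit 0 is the MSB of words[0] (64-bit words assumed). */
static void clear_bit_inv64(unsigned int nr, uint64_t *words)
{
	unsigned int lsb_nr = nr ^ 63; /* flip the position inside the word */

	words[lsb_nr / 64] &= ~(UINT64_C(1) << (lsb_nr % 64));
}

/* Test bit 'nr' using the same MSB-first numbering. */
static bool test_bit_inv64(unsigned int nr, const uint64_t *words)
{
	unsigned int lsb_nr = nr ^ 63;

	return (words[lsb_nr / 64] >> (lsb_nr % 64)) & 1;
}
```

For nr = 74 the XOR yields 117, i.e. word 1, conventional bit 53, which
is the eleventh bit from the top of the second doubleword.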

Signed-off-by: Eric Farman <farman@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260507200836.3500368-3-farman@linux.ibm.com>
---
 .../selftests/kvm/include/s390/facility.h     |   6 +
 .../selftests/kvm/s390/user_operexec.c        | 110 ++++++++++++++++++
 2 files changed, 116 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/s390/facility.h b/tools/testing/selftests/kvm/include/s390/facility.h
index 41a265742666..e5259f63be22 100644
--- a/tools/testing/selftests/kvm/include/s390/facility.h
+++ b/tools/testing/selftests/kvm/include/s390/facility.h
@@ -11,6 +11,7 @@
 #ifndef SELFTEST_KVM_FACILITY_H
 #define SELFTEST_KVM_FACILITY_H
 
+#include <linux/atomic.h>
 #include <linux/bitops.h>
 
 /* alt_stfle_fac_list[16] + stfle_fac_list[16] */
@@ -19,6 +20,11 @@
 extern u64 stfl_doublewords[NB_STFL_DOUBLEWORDS];
 extern bool stfle_flag;
 
+static inline bool clear_bit_inv(unsigned long nr, unsigned long *ptr)
+{
+	return clear_bit(nr ^ (BITS_PER_LONG - 1), ptr);
+}
+
 static inline bool test_bit_inv(unsigned long nr, const unsigned long *ptr)
 {
 	return test_bit(nr ^ (BITS_PER_LONG - 1), ptr);
diff --git a/tools/testing/selftests/kvm/s390/user_operexec.c b/tools/testing/selftests/kvm/s390/user_operexec.c
index 714906c1d12a..b24c1f9dbbe8 100644
--- a/tools/testing/selftests/kvm/s390/user_operexec.c
+++ b/tools/testing/selftests/kvm/s390/user_operexec.c
@@ -6,6 +6,7 @@
  * Authors:
  *  Janosch Frank <frankja@linux.ibm.com>
  */
+#include "facility.h"
 #include "kselftest.h"
 #include "kvm_util.h"
 #include "test_util.h"
@@ -109,6 +110,111 @@ static void test_user_operexec_combined(void)
 	kvm_vm_free(vm);
 }
 
+static struct kvm_vm *create_vm_without_sthyi(void)
+{
+	struct kvm_s390_vm_cpu_processor info;
+	struct kvm_vm *vm;
+
+	vm = vm_create(1);
+
+	kvm_device_attr_get(vm->fd, KVM_S390_VM_CPU_MODEL,
+			    KVM_S390_VM_CPU_PROCESSOR, &info);
+
+	clear_bit_inv(74, (unsigned long *)&info.fac_list);
+	kvm_device_attr_set(vm->fd, KVM_S390_VM_CPU_MODEL,
+			    KVM_S390_VM_CPU_PROCESSOR, &info);
+
+	return vm;
+}
+
+static void test_user_instr0_no_stfle_74(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int rc;
+
+	vm = create_vm_without_sthyi();
+
+	rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_INSTR0, 0);
+	TEST_ASSERT_EQ(0, rc);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code_instr0);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0000);
+
+	kvm_vm_free(vm);
+}
+
+static void test_user_operexec_no_stfle_74(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int rc;
+
+	vm = create_vm_without_sthyi();
+
+	rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
+	TEST_ASSERT_EQ(0, rc);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code_user_operexec);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0807);
+
+	kvm_vm_free(vm);
+}
+
+static void test_instr0_combined_no_stfle_74(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int rc;
+
+	vm = create_vm_without_sthyi();
+
+	rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_INSTR0, 0);
+	TEST_ASSERT_EQ(0, rc);
+	rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
+	TEST_ASSERT_EQ(0, rc);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code_instr0);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0000);
+
+	kvm_vm_free(vm);
+}
+
+static void test_operexec_combined_no_stfle_74(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int rc;
+
+	vm = create_vm_without_sthyi();
+
+	rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_INSTR0, 0);
+	TEST_ASSERT_EQ(0, rc);
+	rc = __vm_enable_cap(vm, KVM_CAP_S390_USER_OPEREXEC, 0);
+	TEST_ASSERT_EQ(0, rc);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code_user_operexec);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_S390_SIEIC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.icptcode, ICPT_OPEREXC);
+	TEST_ASSERT_EQ(vcpu->run->s390_sieic.ipa, 0x0807);
+
+	kvm_vm_free(vm);
+}
+
 /*
  * Run all tests above.
  *
@@ -122,6 +228,10 @@ static struct testdef {
 	{ "instr0", test_user_instr0 },
 	{ "operexec", test_user_operexec },
 	{ "operexec_combined", test_user_operexec_combined},
+	{ "instr0_no_stfle_74", test_user_instr0_no_stfle_74 },
+	{ "instr0_combined_no_stfle_74", test_instr0_combined_no_stfle_74 },
+	{ "operexec_combined_no_stfle_74", test_operexec_combined_no_stfle_74 },
+	{ "operexec_no_stfle_74", test_user_operexec_no_stfle_74 },
 };
 
 int main(int argc, char *argv[])
-- 
2.54.0



* [GIT PULL v1 03/12] KVM: s390: Fix typo in UCONTROL documentation
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74 Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 02/12] KVM: s390: selftests: Extended user_operexec tests Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 04/12] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

From: Eric Farman <farman@linux.ibm.com>

Small typo noticed while writing the USER_OPEREXEC selftest.

Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260507200836.3500368-4-farman@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 52bbbb553ce1..f0eba90602f0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6827,7 +6827,7 @@ s390 specific.
 		} s390_ucontrol;
 
 s390 specific. A page fault has occurred for a user controlled virtual
-machine (KVM_VM_S390_UNCONTROL) on its host page table that cannot be
+machine (KVM_VM_S390_UCONTROL) on its host page table that cannot be
 resolved by the kernel.
 The program code and the translation exception code that were placed
 in the cpu's lowcore are presented here as defined by the z Architecture
-- 
2.54.0



* [GIT PULL v1 04/12] s390/mm: Fix handling of _PAGE_UNUSED pte bit
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (2 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 03/12] KVM: s390: Fix typo in UCONTROL documentation Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:42   ` sashiko-bot
  2026-06-24 11:19 ` [GIT PULL v1 05/12] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

The _PAGE_UNUSED softbit should not stay set on a pte longer than
necessary. Its sole purpose is to signal to try_to_unmap_one() and
try_to_migrate_one() that the page can be discarded instead of being
moved / swapped.

KVM has no way to know why a page is being unmapped, so it sets the bit
on userspace ptes corresponding to unused guest pages every time they
get unmapped. KVM has no reasonable way to clear the bit once the page
is in use again.

While set_ptes() checks and clears the bit, other paths that set new
ptes did not. This led to used pages being thrown out as if they were
unused, causing guest corruption.

Fix the issue by clearing the _PAGE_UNUSED bit for present ptes in
set_pte(), i.e. whenever a present pte is getting set. The check in
set_ptes() is then redundant and can be removed.

Also fix gmap_helper_try_set_pte_unused() to only set the bit if the
pte is present; the _PAGE_UNUSED bit is only defined for present ptes
and thus should not be set for non-present ptes.
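
The two rules can be sketched on a plain integer pte as follows. This is
only a model: the bit values are illustrative and not the real s390
definitions, the function names are hypothetical, and the atomic and
WRITE_ONCE() semantics of the kernel code are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; the real s390 values differ. */
#define MODEL_PAGE_PRESENT UINT64_C(0x001)
#define MODEL_PAGE_UNUSED  UINT64_C(0x080)

static bool model_pte_present(uint64_t pte)
{
	return (pte & MODEL_PAGE_PRESENT) != 0;
}

/* Models the patched set_pte(): drop the soft bit on every present pte. */
static void model_set_pte(uint64_t *ptep, uint64_t pte)
{
	if (model_pte_present(pte))
		pte &= ~MODEL_PAGE_UNUSED;
	*ptep = pte;
}

/* Models the patched helper: only present ptes are marked unused. */
static void model_try_set_pte_unused(uint64_t *ptep)
{
	if (model_pte_present(*ptep))
		*ptep |= MODEL_PAGE_UNUSED;
}
```

Any path that installs a present pte through model_set_pte() ends up
with the bit cleared, which is the property the patch establishes for
set_pte().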

Fixes: c98175b7917f ("KVM: s390: Add gmap_helper_set_unused()")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-2-imbrenda@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h | 4 ++--
 arch/s390/mm/gmap_helpers.c     | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c6cee8241e0..4740c75649eb 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -980,6 +980,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
+	if (pte_present(pte))
+		pte = clear_pte_bit(pte, __pgprot(_PAGE_UNUSED));
 	WRITE_ONCE(*ptep, pte);
 }
 
@@ -1332,8 +1334,6 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t entry, unsigned int nr)
 {
-	if (pte_present(entry))
-		entry = clear_pte_bit(entry, __pgprot(_PAGE_UNUSED));
 	page_table_check_ptes_set(mm, addr, ptep, entry, nr);
 	for (;;) {
 		set_pte(ptep, entry);
diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
index 1cfe4724fbe2..60023b6fdcb1 100644
--- a/arch/s390/mm/gmap_helpers.c
+++ b/arch/s390/mm/gmap_helpers.c
@@ -181,7 +181,8 @@ void gmap_helper_try_set_pte_unused(struct mm_struct *mm, unsigned long vmaddr)
 	if (IS_ERR_OR_NULL(ptep))
 		return;
 
-	__atomic64_or(_PAGE_UNUSED, (long *)ptep);
+	if (pte_present(*ptep))
+		__atomic64_or(_PAGE_UNUSED, (long *)ptep);
 	pte_unmap_unlock(ptep, ptl);
 }
 EXPORT_SYMBOL_GPL(gmap_helper_try_set_pte_unused);
-- 
2.54.0



* [GIT PULL v1 05/12] KVM: s390: Fix dat_peek_cmma() overflow
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (3 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 04/12] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:38   ` sashiko-bot
  2026-06-24 11:19 ` [GIT PULL v1 06/12] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

If userspace passes a start address that is out of bounds,
_dat_walk_gfn_range() will fail with -EFAULT, but state.end will not be
touched and will stay 0. This will cause *count to underflow and report
a very high number, and the function will end up erroneously reporting
success.

Fix by only setting *count if the end address is not smaller than the
starting address. This way invalid starting addresses will correctly
return -EFAULT and *count will correctly indicate that no values have
been returned.
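
The wrap-around can be reproduced in isolation. The sketch below assumes
a 64-bit gfn_t and a 32-bit unsigned int; the two function names are
hypothetical and only mirror the old and the patched expression.

```c
#include <stdint.h>

typedef uint64_t gfn_t; /* assumption: 64-bit guest frame numbers */

/* Old computation: wraps around when end < start. */
static unsigned int count_unchecked(gfn_t start, gfn_t end)
{
	return (unsigned int)(end - start);
}

/* Patched computation: report 0 when the walk never advanced. */
static unsigned int count_clamped(gfn_t start, gfn_t end)
{
	return end >= start ? (unsigned int)(end - start) : 0;
}
```

With start = 0x1000 and an untouched end of 0, the unchecked variant
yields 0xFFFFF000, which satisfies the "*count > 0" test and turns
-EFAULT into success; the clamped variant yields 0.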

Fixes: 7b368470e1a4 ("KVM: s390: KVM page table management functions: CMMA")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-3-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/dat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
index 4a41c0247ffa..cffac7782c4b 100644
--- a/arch/s390/kvm/dat.c
+++ b/arch/s390/kvm/dat.c
@@ -1209,7 +1209,7 @@ int dat_peek_cmma(gfn_t start, union asce asce, unsigned int *count, u8 *values)
 	int rc;
 
 	rc = _dat_walk_gfn_range(start, start + *count, asce, &ops, DAT_WALK_DEFAULT, &state);
-	*count = state.end - start;
+	*count = state.end >= start ? state.end - start : 0;
 	/* Return success if at least one value was saved, otherwise an error. */
 	return (rc == -EFAULT && *count > 0) ? 0 : rc;
 }
-- 
2.54.0



* [GIT PULL v1 06/12] KVM: s390: Do not set special large pages dirty
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (4 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 05/12] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:37   ` sashiko-bot
  2026-06-24 11:19 ` [GIT PULL v1 07/12] KVM: s390: Fix code typo in gmap_protect_asce_top_level() Claudio Imbrenda
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

Special pages / folios should not be set dirty. This also applies to
large pages.

Add a missing check in gmap_clear_young_crste() to prevent setting the
large page dirty if it is a special page.

Fixes: a2c17f9270cc ("KVM: s390: New gmap code")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-4-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/gmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index 52d55ddea8d4..3192f610f696 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -327,7 +327,7 @@ static long gmap_clear_young_crste(union crste *crstep, gfn_t gfn, gfn_t end, st
 		new.h.i = 1;
 		new.s.fc1.y = 0;
 		new.s.fc1.prefix_notif = 0;
-		if (new.s.fc1.d || !new.h.p)
+		if ((new.s.fc1.d || !new.h.p) && !new.s.fc1.s)
 			folio_set_dirty(phys_to_folio(crste_origin_large(crste)));
 		new.s.fc1.d = 0;
 		new.h.p = 1;
-- 
2.54.0



* [GIT PULL v1 07/12] KVM: s390: Fix code typo in gmap_protect_asce_top_level()
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (5 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 06/12] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 08/12] KVM: s390: Fix handle_{sske,pfmf} under memory pressure Claudio Imbrenda
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

The correct length to pass to kvm_s390_get_guest_pages() is asce.tl + 1,
not asce.dt + 1. It was a typo, which, due to fortuitous circumstances,
did not cause bugs. It should nonetheless be fixed.

Fixes: e5f98a6899bd ("KVM: s390: Add some helper functions needed for vSIE")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-5-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/gmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index 3192f610f696..e6e786811db8 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -1262,7 +1262,7 @@ static int gmap_protect_asce_top_level(struct kvm_s390_mmu_cache *mc, struct gma
 	/* Pairs with the smp_wmb() in kvm_mmu_invalidate_end(). */
 	smp_rmb();
 
-	rc = kvm_s390_get_guest_pages(sg->kvm, context.f, asce.rsto, asce.dt + 1, false);
+	rc = kvm_s390_get_guest_pages(sg->kvm, context.f, asce.rsto, asce.tl + 1, false);
 	if (rc > 0)
 		rc = -EFAULT;
 	if (!rc)
-- 
2.54.0



* [GIT PULL v1 08/12] KVM: s390: Fix handle_{sske,pfmf} under memory pressure
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (6 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 07/12] KVM: s390: Fix code typo in gmap_protect_asce_top_level() Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 09/12] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

Under heavy memory pressure, handle_sske() and handle_pfmf() might
cause an endless loop if the mmu cache runs empty, the atomic
allocations fail, and the top-up function also fails. While quite
unlikely, that scenario is not impossible.

Fix the issue by not ignoring the return value of
kvm_s390_mmu_cache_topup(), and appropriately returning an error code
in case of failure.
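
The shape of the fixed retry loop, as a userspace sketch. Everything
here is a hypothetical model (cache_model, model_op(), model_topup(),
model_handle()); it only illustrates why propagating the top-up result
bounds the loop.

```c
#include <errno.h>

/* Hypothetical cache model: 'objects' left, 'topup_fails' forces failure. */
struct cache_model {
	int objects;
	int topup_fails;
};

/* Stand-in for the operation that consumes one cached object. */
static int model_op(struct cache_model *mc)
{
	if (mc->objects == 0)
		return -ENOMEM;
	mc->objects--;
	return 0;
}

/* Stand-in for the top-up; returns 0 on success, -ENOMEM on failure. */
static int model_topup(struct cache_model *mc)
{
	if (mc->topup_fails)
		return -ENOMEM;
	mc->objects = 4;
	return 0;
}

/* Patched loop shape: a failed top-up ends the loop with an error. */
static int model_handle(struct cache_model *mc)
{
	int rc;

	for (;;) {
		rc = model_op(mc);
		if (rc == -ENOMEM) {
			rc = model_topup(mc);
			if (rc)
				return rc;
			continue;
		}
		return rc;
	}
}
```

If the result of model_topup() were ignored, a cache that stays empty
would make the loop retry forever.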

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-6-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/priv.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 447ec7ed423d..9bc6fd02ff77 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -366,7 +366,9 @@ static int handle_sske(struct kvm_vcpu *vcpu)
 		if (rc > 1)
 			return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 		if (rc == -ENOMEM) {
-			kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+			rc = kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+			if (rc)
+				return rc;
 			continue;
 		}
 		if (rc < 0)
@@ -1122,7 +1124,9 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
 			if (rc > 1)
 				return kvm_s390_inject_program_int(vcpu, rc);
 			if (rc == -ENOMEM) {
-				kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+				rc = kvm_s390_mmu_cache_topup(vcpu->arch.mc);
+				if (rc)
+					return rc;
 				continue;
 			}
 			if (rc < 0)
-- 
2.54.0



* [GIT PULL v1 09/12] KVM: s390: Fix locking in kvm_s390_set_mem_control()
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (7 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 08/12] KVM: s390: Fix handle_{sske,pfmf} under memory pressure Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 10/12] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

Add the missing locking around dat_reset_cmma().

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-7-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/kvm-s390.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 8a3d55410f06..221b2fb199d4 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -990,9 +990,11 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
 		if (!kvm->arch.use_cmma)
 			break;
 
+		guard(mutex)(&kvm->lock);
 		VM_EVENT(kvm, 3, "%s", "RESET: CMMA states");
 		do {
-			start_gfn = dat_reset_cmma(kvm->arch.gmap->asce, start_gfn);
+			scoped_guard(read_lock, &kvm->mmu_lock)
+				start_gfn = dat_reset_cmma(kvm->arch.gmap->asce, start_gfn);
 			cond_resched();
 		} while (start_gfn);
 		ret = 0;
-- 
2.54.0



* [GIT PULL v1 10/12] KVM: s390: Fix cmma dirty tracking
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (8 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 09/12] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:46   ` sashiko-bot
  2026-06-24 11:19 ` [GIT PULL v1 11/12] KVM: s390: selftests: Fix cmma selftest Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 12/12] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() Claudio Imbrenda
  11 siblings, 1 reply; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

It is possible that some guest memory areas have not been touched yet
when starting migration mode, and thus have no ptes allocated. Only
existing and allocated ptes should count toward the total of dirty cmma
entries.

When starting migration mode, enable the migration_mode flag
immediately, so that any subsequent ESSA will trap in the host and
cause cmma_dirty_pages to be increased as needed.
Subsequently, set the cmma_d bit on all existing cmma-clean PGSTEs,
increasing cmma_dirty_pages as needed. Skipping cmma-dirty pages
prevents double counting.

Conversely, when disabling migration mode, set cmma_dirty_pages to 0
and clear the cmma_d bit in all existing PGSTEs.
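
The counting rule can be sketched with a flat array in place of the
page tables. The structure and function names are hypothetical, and the
PGSTE locking and atomic counter of the real code are left out; the
point is only that an entry is counted when it goes from clean to dirty.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PGSTES 8

/* Hypothetical model: one flag per existing PGSTE plus a global counter. */
struct cmma_model {
	bool cmma_d[MODEL_NR_PGSTES];
	long dirty_pages;
};

/* Start of migration: count only entries that were still clean. */
static void model_set_all_dirty(struct cmma_model *m)
{
	for (size_t i = 0; i < MODEL_NR_PGSTES; i++) {
		if (!m->cmma_d[i])
			m->dirty_pages++;
		m->cmma_d[i] = true;
	}
}

/* End of migration: restore the "nothing dirty, counter 0" invariant. */
static void model_set_all_clean(struct cmma_model *m)
{
	for (size_t i = 0; i < MODEL_NR_PGSTES; i++)
		m->cmma_d[i] = false;
	m->dirty_pages = 0;
}
```

An entry already marked dirty by a trapped ESSA before the walk reaches
it contributes to the counter once, not twice.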

The invariant is that when migration mode is off, no PGSTE has its
cmma_d bit set, and cmma_dirty_pages is 0. kvm->slots_lock protects
kvm_s390_vm_start_migration() and kvm_s390_vm_stop_migration() from
each other and from kvm_s390_get_cmma_bits().

Also fix dat_get_cmma() to properly wrap around if the first attempt
reached the end of guest memory without finding cmma-dirty pages.
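
A sketch of the wrap-around search over a flat array of dirty flags.
The helper names are hypothetical; -1 plays the role of the "nothing
found" marker, and the second pass is skipped when the first pass
already started at 0.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return the first dirty index in [from, to), or -1 if there is none. */
static long first_dirty(const bool *dirty, size_t from, size_t to)
{
	for (size_t i = from; i < to; i++)
		if (dirty[i])
			return (long)i;
	return -1;
}

/* Search [start, end) first, then wrap around to [0, start). */
static long first_dirty_wrapping(const bool *dirty, size_t start, size_t end)
{
	long found = first_dirty(dirty, start, end);

	if (start && found == -1)
		found = first_dirty(dirty, 0, start);
	return found;
}
```

Without the second pass, a dirty entry located below the starting point
would be reported as "no dirty pages".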

[ imbrenda: Moved kvm_s390_sync_request_broadcast() before gmap_set_cmma_all_dirty() ]

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-8-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/dat.c      |  3 +++
 arch/s390/kvm/gmap.c     | 31 +++++++++++++++++++++++----
 arch/s390/kvm/gmap.h     | 12 ++++++++++-
 arch/s390/kvm/kvm-s390.c | 46 +++++++++++++++++++++++++++++++---------
 arch/s390/kvm/priv.c     |  2 +-
 5 files changed, 78 insertions(+), 16 deletions(-)

diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
index cffac7782c4b..0ad4ebc80eba 100644
--- a/arch/s390/kvm/dat.c
+++ b/arch/s390/kvm/dat.c
@@ -1253,6 +1253,9 @@ int dat_get_cmma(union asce asce, gfn_t *start, unsigned int *count, u8 *values,
 	};
 
 	_dat_walk_gfn_range(*start, asce_end(asce), asce, &ops, DAT_WALK_IGN_HOLES, &state);
+	/* If no dirty pages were found, wrap around and continue searching */
+	if (*start && state.start == -1)
+		_dat_walk_gfn_range(0, *start, asce, &ops, DAT_WALK_IGN_HOLES, &state);
 
 	if (state.start == -1) {
 		*count = 0;
diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
index e6e786811db8..0f944944badf 100644
--- a/arch/s390/kvm/gmap.c
+++ b/arch/s390/kvm/gmap.c
@@ -1073,23 +1073,46 @@ int gmap_protect_rmap(struct kvm_s390_mmu_cache *mc, struct gmap *sg, gfn_t p_gf
 	return 0;
 }
 
+static long __set_cmma_clean_pte(union pte *ptep, gfn_t gfn, gfn_t next, struct dat_walk *walk)
+{
+	union pgste pgste;
+
+	pgste = pgste_get_lock(ptep);
+	pgste.cmma_d = 0;
+	pgste_set_unlock(ptep, pgste);
+
+	if (need_resched())
+		return next;
+	return 0;
+}
+
 static long __set_cmma_dirty_pte(union pte *ptep, gfn_t gfn, gfn_t next, struct dat_walk *walk)
 {
-	__atomic64_or(PGSTE_CMMA_D_BIT, &pgste_of(ptep)->val);
+	union pgste pgste;
+
+	pgste = pgste_get_lock(ptep);
+	if (!pgste.cmma_d)
+		atomic64_inc(walk->priv);
+	pgste.cmma_d = 1;
+	pgste_set_unlock(ptep, pgste);
+
 	if (need_resched())
 		return next;
 	return 0;
 }
 
-void gmap_set_cmma_all_dirty(struct gmap *gmap)
+void _gmap_set_cmma_all(struct gmap *gmap, bool dirty)
 {
-	const struct dat_walk_ops ops = { .pte_entry = __set_cmma_dirty_pte, };
+	const struct dat_walk_ops ops = {
+		.pte_entry = dirty ? __set_cmma_dirty_pte : __set_cmma_clean_pte,
+	};
 	gfn_t gfn = 0;
 
 	do {
 		scoped_guard(read_lock, &gmap->kvm->mmu_lock)
 			gfn = _dat_walk_gfn_range(gfn, asce_end(gmap->asce), gmap->asce, &ops,
-						  DAT_WALK_IGN_HOLES, NULL);
+						  DAT_WALK_IGN_HOLES,
+						  &gmap->kvm->arch.cmma_dirty_pages);
 		cond_resched();
 	} while (gfn);
 }
diff --git a/arch/s390/kvm/gmap.h b/arch/s390/kvm/gmap.h
index 5374f21aaf8d..4e04fbd07696 100644
--- a/arch/s390/kvm/gmap.h
+++ b/arch/s390/kvm/gmap.h
@@ -103,7 +103,7 @@ int gmap_pv_destroy_range(struct gmap *gmap, gfn_t start, gfn_t end, bool interr
 int gmap_insert_rmap(struct gmap *sg, gfn_t p_gfn, gfn_t r_gfn, int level);
 int gmap_protect_rmap(struct kvm_s390_mmu_cache *mc, struct gmap *sg, gfn_t p_gfn, gfn_t r_gfn,
 		      kvm_pfn_t pfn, int level, bool wr);
-void gmap_set_cmma_all_dirty(struct gmap *gmap);
+void _gmap_set_cmma_all(struct gmap *gmap, bool dirty);
 void _gmap_handle_vsie_unshadow_event(struct gmap *parent, gfn_t gfn);
 struct gmap *gmap_create_shadow(struct kvm_s390_mmu_cache *mc, struct gmap *gmap,
 				union asce asce, int edat_level);
@@ -197,6 +197,16 @@ static inline bool pte_needs_unshadow(union pte oldpte, union pte newpte, union
 	return !newpte.h.p || !newpte.s.pr;
 }
 
+static inline void gmap_set_cmma_all_dirty(struct gmap *gmap)
+{
+	_gmap_set_cmma_all(gmap, true);
+}
+
+static inline void gmap_set_cmma_all_clean(struct gmap *gmap)
+{
+	_gmap_set_cmma_all(gmap, false);
+}
+
 static inline union pgste _gmap_ptep_xchg(struct gmap *gmap, union pte *ptep, union pte newpte,
 					  union pgste pgste, gfn_t gfn, bool needs_lock)
 {
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 221b2fb199d4..9ad6bd4edbce 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1187,13 +1187,13 @@ static void kvm_s390_sync_request_broadcast(struct kvm *kvm, int req)
 
 /*
  * Must be called with kvm->srcu held to avoid races on memslots, and with
- * kvm->slots_lock to avoid races with ourselves and kvm_s390_vm_stop_migration.
+ * kvm->slots_lock to avoid races with ourselves, kvm_s390_vm_stop_migration(),
+ * and kvm_s390_get_cmma_bits().
  */
 static int kvm_s390_vm_start_migration(struct kvm *kvm)
 {
 	struct kvm_memory_slot *ms;
 	struct kvm_memslots *slots;
-	unsigned long ram_pages = 0;
 	int bkt;
 
 	/* migration mode already enabled */
@@ -1210,28 +1210,54 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
 	kvm_for_each_memslot(ms, bkt, slots) {
 		if (!ms->dirty_bitmap)
 			return -EINVAL;
-		ram_pages += ms->npages;
 	}
-	/* mark all the pages as dirty */
-	gmap_set_cmma_all_dirty(kvm->arch.gmap);
-	atomic64_set(&kvm->arch.cmma_dirty_pages, ram_pages);
-	kvm->arch.migration_mode = 1;
+	/*
+	 * Set the flag and let KVM handle ESSA manually, potentially setting
+	 * the cmma_d bit in some PGSTEs and increasing cmma_dirty_pages.
+	 * At this point cmma_dirty_pages is still 0, and all existing PGSTEs
+	 * have their cmma_d bit set to 0.
+	 * Any newly allocated page table has its entries marked as cmma-clean,
+	 * which is fine because the CMMA values are not dirty.
+	 */
+	WRITE_ONCE(kvm->arch.migration_mode, 1);
 	kvm_s390_sync_request_broadcast(kvm, KVM_REQ_START_MIGRATION);
+	/*
+	 * Mark all PGSTEs as cmma-dirty, increasing cmma_dirty_pages as needed,
+	 * but without double-counting pages that have become dirty on their own
+	 * in the meantime.
+	 * At this point some pages might have become dirty on their own already
+	 * and cmma_dirty_pages might therefore be non-zero.
+	 */
+	gmap_set_cmma_all_dirty(kvm->arch.gmap);
 	return 0;
 }
 
 /*
- * Must be called with kvm->slots_lock to avoid races with ourselves and
- * kvm_s390_vm_start_migration.
+ * Must be called with kvm->slots_lock to avoid races with ourselves,
+ * kvm_s390_vm_start_migration() and kvm_s390_get_cmma_bits().
  */
 static int kvm_s390_vm_stop_migration(struct kvm *kvm)
 {
 	/* migration mode already disabled */
 	if (!kvm->arch.migration_mode)
 		return 0;
-	kvm->arch.migration_mode = 0;
+	/*
+	 * Unset the flag and propagate to all vCPUs. From now on the cmma_d
+	 * bit will not be touched on any PGSTE.
+	 * At this point cmma_dirty_pages is possibly non-zero, and thus some
+	 * PGSTEs might have cmma_d set.
+	 */
+	WRITE_ONCE(kvm->arch.migration_mode, 0);
 	if (kvm->arch.use_cmma)
 		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
+	/* Clear cmma_d on all existing PGSTEs and set cmma_dirty_pages to 0. */
+	gmap_set_cmma_all_clean(kvm->arch.gmap);
+	atomic64_set(&kvm->arch.cmma_dirty_pages, 0);
+	/*
+	 * At this point the system has the expected state: migration_mode is 0,
+	 * cmma_dirty_pages is 0, and all existing PGSTEs have their cmma_d bit
+	 * set to 0.
+	 */
 	return 0;
 }
 
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 9bc6fd02ff77..ad0ddc433a73 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -1236,7 +1236,7 @@ static int handle_essa(struct kvm_vcpu *vcpu)
 						: ESSA_SET_STABLE_IF_RESIDENT))
 		return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
 
-	if (!vcpu->kvm->arch.migration_mode) {
+	if (!READ_ONCE(vcpu->kvm->arch.migration_mode)) {
 		/*
 		 * CMMA is enabled in the KVM settings, but is disabled in
 		 * the SIE block and in the mm_context, and we are not doing
-- 
2.54.0



* [GIT PULL v1 11/12] KVM: s390: selftests: Fix cmma selftest
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (9 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 10/12] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  2026-06-24 11:19 ` [GIT PULL v1 12/12] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() Claudio Imbrenda
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC (permalink / raw)
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

The existing cmma selftest depended on the host allocating page tables
for all present memslots. Since the gmap rewrite, memory that is not
accessed by the guest might not have page tables allocated yet.

This caused the test to fail due to a mismatch in the assertion.

Fix by having the guest also access the second half of the test
memslot, thus guaranteeing that its page tables are present.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-9-imbrenda@linux.ibm.com>
---
 tools/testing/selftests/kvm/s390/cmma_test.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tools/testing/selftests/kvm/s390/cmma_test.c b/tools/testing/selftests/kvm/s390/cmma_test.c
index e39a724fe860..15d81b2ed7ad 100644
--- a/tools/testing/selftests/kvm/s390/cmma_test.c
+++ b/tools/testing/selftests/kvm/s390/cmma_test.c
@@ -34,16 +34,22 @@ static char cmma_value_buf[MAIN_PAGE_COUNT + TEST_DATA_PAGE_COUNT];
 /**
  * Dirty CMMA attributes of exactly one page in the TEST_DATA memslot,
  * so use_cmma goes on and the CMMA related ioctls do something.
+ * Touch the page at offset 1M inside TEST_DATA to make sure its page
+ * tables are allocated in the host.
  */
 static void guest_do_one_essa(void)
 {
 	asm volatile(
 		/* load TEST_DATA_START_GFN into r1 */
+		"	xgr 1,1\n"
 		"	llilf 1,%[start_gfn]\n"
 		/* calculate the address from the gfn */
 		"	sllg 1,1,12(0)\n"
 		/* set the first page in TEST_DATA memslot to STABLE */
 		"	.insn rrf,0xb9ab0000,2,1,1,0\n"
+		"	agfi 1,0x100000\n"
+		/* also touch the first page of the second MB of TEST_DATA */
+		"	.insn rrf,0xb9ab0000,2,1,1,0\n"
 		/* hypercall */
 		"	diag 0,0,0x501\n"
 		"0:	j 0b"
-- 
2.54.0



* [GIT PULL v1 12/12] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits()
  2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
                   ` (10 preceding siblings ...)
  2026-06-24 11:19 ` [GIT PULL v1 11/12] KVM: s390: selftests: Fix cmma selftest Claudio Imbrenda
@ 2026-06-24 11:19 ` Claudio Imbrenda
  11 siblings, 0 replies; 18+ messages in thread
From: Claudio Imbrenda @ 2026-06-24 11:19 UTC (permalink / raw)
  To: pbonzini; +Cc: kvm, linux-s390, frankja, borntraeger

If the allocation of the bits array failed, kvm_s390_set_cmma_bits()
would return 0 instead of an error code.

Rework the function to use the __free() macros and thus simplify the
code flow; when the above mentioned allocation fails, simply return
-ENOMEM.

Fixes: e38c884df921 ("KVM: s390: Switch to new gmap")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20260623153331.233784-10-imbrenda@linux.ibm.com>
---
 arch/s390/kvm/kvm-s390.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 9ad6bd4edbce..3b26c909ad0f 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2313,8 +2313,8 @@ static int kvm_s390_get_cmma_bits(struct kvm *kvm,
 static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 				  const struct kvm_s390_cmma_log *args)
 {
-	struct kvm_s390_mmu_cache *mc;
-	u8 *bits = NULL;
+	struct kvm_s390_mmu_cache *mc __free(kvm_s390_mmu_cache) = NULL;
+	u8 *bits __free(kvfree) = NULL;
 	int r = 0;
 
 	if (!kvm->arch.use_cmma)
@@ -2334,18 +2334,16 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 		return -ENOMEM;
 	bits = vmalloc(array_size(sizeof(*bits), args->count));
 	if (!bits)
-		goto out;
+		return -ENOMEM;
 
 	r = copy_from_user(bits, (void __user *)args->values, args->count);
-	if (r) {
-		r = -EFAULT;
-		goto out;
-	}
+	if (r)
+		return -EFAULT;
 
 	do {
 		r = kvm_s390_mmu_cache_topup(mc);
 		if (r)
-			break;
+			return r;
 		scoped_guard(read_lock, &kvm->mmu_lock) {
 			r = dat_set_cmma_bits(mc, kvm->arch.gmap->asce, args->start_gfn,
 					      args->count, args->mask, bits);
@@ -2353,9 +2351,7 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 	} while (r == -ENOMEM);
 
 	set_bit(GMAP_FLAG_USES_CMM, &kvm->arch.gmap->flags);
-out:
-	kvm_s390_free_mmu_cache(mc);
-	vfree(bits);
+
 	return r;
 }
 
-- 
2.54.0



* Re: [GIT PULL v1 06/12] KVM: s390: Do not set special large pages dirty
  2026-06-24 11:19 ` [GIT PULL v1 06/12] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
@ 2026-06-24 11:37   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-24 11:37 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Alexander Gordeev, kvm, Heiko Carstens, linux-s390, Vasily Gorbik

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [Critical] Missing validation for special pages in KVM GMAP unmap and destroy paths leads to host kernel crash on invalid folio pointer dereference.
- [Critical] Missing check for a valid `struct page` before folio conversion during PV guest secure page creation causes host crash.
--

commit 5f6958e6ac1a501221bda61b5fd8d27af20f157d
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Do not set special large pages dirty

Special pages / folios should not be set dirty. This also applies to
large pages.

> diff --git a/arch/s390/kvm/gmap.c b/arch/s390/kvm/gmap.c
> index 52d55ddea8d4f..3192f610f6962 100644
> --- a/arch/s390/kvm/gmap.c
> +++ b/arch/s390/kvm/gmap.c
> @@ -327,7 +327,7 @@ static long gmap_clear_young_crste(union crste *crstep, gfn_t gfn, gfn_t end, st
>  		new.h.i = 1;
>  		new.s.fc1.y = 0;
>  		new.s.fc1.prefix_notif = 0;
> -		if (new.s.fc1.d || !new.h.p)
> +		if ((new.s.fc1.d || !new.h.p) && !new.s.fc1.s)
>  			folio_set_dirty(phys_to_folio(crste_origin_large(crste)));
>  		new.s.fc1.d = 0;
>  		new.h.p = 1;

[Severity: Critical]
This isn't a bug introduced by this patch, but since you are fixing the
special page check here, do we need similar checks in other places?

For instance, in _gmap_unmap_pte(), does this code safely handle special 
pages when calling pfn_folio() on the PTE's PFN?

arch/s390/kvm/gmap.c:_gmap_unmap_pte() {
    ...
	if (ptep->s.pr && test_bit(GMAP_FLAG_EXPORT_ON_UNMAP, &priv->gmap->flags))
		folio = pfn_folio(ptep->h.pfra);
    ...
}

Similarly, in _gmap_unmap_crste(), does this code need a check for 
!old.s.fc1.s before calling phys_to_folio() on a large page origin?

arch/s390/kvm/gmap.c:_gmap_unmap_crste() {
    ...
	if (old.s.fc1.pr && test_bit(GMAP_FLAG_EXPORT_ON_UNMAP, &priv->gmap->flags))
		folio = phys_to_folio(crste_origin_large(old));
    ...
}

And in _destroy_pages_pte(), does it need a check for !ptep->s.s to avoid 
an invalid struct page dereference?

arch/s390/kvm/gmap.c:_destroy_pages_pte() {
    ...
	__kvm_s390_pv_destroy_page(phys_to_page(pte_origin(*ptep)));
    ...
}

Also in _destroy_pages_crste(), should there be a check for !crstep->s.fc1.s
before calling phys_to_page()?

arch/s390/kvm/gmap.c:_destroy_pages_crste() {
    ...
	for ( ; cur < end; cur += PAGE_SIZE)
		__kvm_s390_pv_destroy_page(phys_to_page(cur));
    ...
}

Can unprivileged userspace trigger a host kernel crash by mapping a memory
region backed by special PFNs without struct pages, converting it to a
protected VM, and then tearing it down or unmapping it?

[Severity: Critical]
This is also a pre-existing issue, but does _kvm_s390_pv_make_secure() 
need a check for a valid struct page before converting the folio and 
attempting to acquire its lock?

arch/s390/kvm/pv.c:_kvm_s390_pv_make_secure() {
    ...
	folio = pfn_folio(f->pfn);
    ...
	if (folio_trylock(folio)) {
    ...
}

Could unprivileged userspace map PFNMAP memory without a backing struct
page into a PV guest and cause a host crash when this path is triggered
during a guest memory fault?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624111928.144283-1-imbrenda@linux.ibm.com?part=6


* Re: [GIT PULL v1 05/12] KVM: s390: Fix dat_peek_cmma() overflow
  2026-06-24 11:19 ` [GIT PULL v1 05/12] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
@ 2026-06-24 11:38   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-24 11:38 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-s390, Heiko Carstens, Alexander Gordeev, Vasily Gorbik, kvm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] Uninitialized kernel memory from `vmalloc` is copied to userspace when CMMA attributes are retrieved because skipped pages or unmapped CRSTEs leave uninitialized gaps in the buffer.
- [Medium] The `KVM_S390_SET_CMMA_BITS` ioctl incorrectly returns a positive program exception code (`PGM_ADDRESSING` = 5) on out-of-bounds access instead of a standard negative errno.
--

commit 8344f31bea75084c2603931b9dc4f97ab83d6263
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Fix dat_peek_cmma() overflow

This commit fixes an underflow issue in dat_peek_cmma() when userspace
provides an out-of-bounds start address. It ensures the count is correctly
zeroed out instead of underflowing, so the function correctly returns an error
instead of erroneously reporting success.
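
As a rough userspace illustration of the clamped subtraction described
above (not the kernel code itself): gfn_t is a stand-in typedef and
peeked_count() is a hypothetical helper name used only for this sketch.

```c
#include <stdint.h>

typedef uint64_t gfn_t; /* stand-in for the kernel type (assumption) */

/*
 * Hypothetical helper: number of frames covered by [start, end).
 * If the walk ended before start, report 0 instead of letting the
 * unsigned subtraction wrap around to a huge value.
 */
static unsigned int peeked_count(gfn_t start, gfn_t end)
{
	return end >= start ? (unsigned int)(end - start) : 0;
}
```

With start = 10 and end = 3, the unclamped expression end - start would
wrap around; the clamped form yields 0, which lets the caller treat the
request as having saved no values.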

Message-ID: <20260623153331.233784-3-imbrenda@linux.ibm.com>

> diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
> index 4a41c0247ffa2..cffac7782c4bf 100644
> --- a/arch/s390/kvm/dat.c
> +++ b/arch/s390/kvm/dat.c
> @@ -1209,7 +1209,7 @@ int dat_peek_cmma(gfn_t start, union asce asce, unsigned int *count, u8 *values)
>  	int rc;
>  
>  	rc = _dat_walk_gfn_range(start, start + *count, asce, &ops, DAT_WALK_DEFAULT, &state);
> -	*count = state.end - start;
> +	*count = state.end >= start ? state.end - start : 0;

[Severity: High]
This is a pre-existing issue, but since we are modifying how count is
calculated for the values array, does this code leak uninitialized kernel
memory to userspace?

If KVM_S390_GET_CMMA_BITS is called on a memory range containing clean pages
or unmapped holes, it allocates the values buffer using vmalloc() which does
not zero-initialize memory:

arch/s390/kvm/kvm-s390.c:kvm_s390_get_cmma_bits() {
    ...
	values = vmalloc(args->count);
	if (!values)
		return -ENOMEM;
    ...
	if (copy_to_user((void __user *)args->values, values, args->count))
    ...
}

During the walk, dat_peek_cmma() and dat_get_cmma() can skip clean PTEs and
empty CRSTEs without writing to the values array. For example:

arch/s390/kvm/dat.c:__dat_peek_cmma_crste() {
    ...
	if (crstep->h.i)
		state->end = min(walk->end, next);
    ...
}

arch/s390/kvm/dat.c:__dat_get_cmma_pte() {
    ...
	if (!READ_ONCE(*pgste_of(ptep)).cmma_d)
		return 0;
    ...
}

Since the gap in GFNs is included in the returned count, does the
copy_to_user() call in kvm_s390_get_cmma_bits() end up copying the entire
length and leaking those uninitialized vmalloc gaps to userspace?
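
A minimal userspace sketch of why a zeroed allocation would close such
gaps, assuming the walker only writes slots for dirty entries. calloc()
stands in for a zeroing kernel allocator, and fill_dirty_only() and
collect_values() are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical walker: writes a value only for "dirty" slots (odd
 * indices here) and skips every other slot, as a stand-in for a walk
 * that skips clean or unmapped entries.
 */
static void fill_dirty_only(uint8_t *values, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (i & 1)
			values[i] = 0x01;
}

/*
 * Allocate the buffer zeroed so that skipped slots read back as 0
 * rather than as stale heap contents. The caller frees the result.
 */
static uint8_t *collect_values(size_t count)
{
	uint8_t *values = calloc(count, 1);

	if (!values)
		return NULL;
	fill_dirty_only(values, count);
	return values;
}
```

Whether zeroing the buffer or shrinking the copied length is the better
fix is a design question this sketch does not settle.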

[Severity: Medium]
This isn't a bug introduced by this patch, but while looking at the CMMA
addressing failure paths, is there an issue with how out-of-bounds accesses
are reported by KVM_S390_SET_CMMA_BITS?

If an invalid start_gfn is provided, dat_entry_walk() returns a positive
program exception code:

arch/s390/kvm/dat.c:dat_entry_walk() {
    ...
	if (!asce_contains_gfn(asce, gfn))
		return PGM_ADDRESSING;
    ...
}

This positive value (5) appears to be propagated directly back to the ioctl
result in kvm_s390_set_cmma_bits():

arch/s390/kvm/kvm-s390.c:kvm_s390_set_cmma_bits() {
    ...
			r = dat_set_cmma_bits(mc, kvm->arch.gmap->asce, args->start_gfn,
					      args->count, args->mask, bits);
		}
	} while (r == -ENOMEM);

	set_bit(GMAP_FLAG_USES_CMM, &kvm->arch.gmap->flags);

	return r;
}

Could this cause userspace checking for standard Linux error codes
(ioctl() < 0) to incorrectly assume the KVM_S390_SET_CMMA_BITS ioctl
succeeded when it actually failed?
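
One possible shape for translating such a code at the ioctl boundary,
as a sketch only. cmma_rc_to_errno() is a hypothetical helper, the
choice of -EFAULT is an assumption, and PGM_ADDRESSING is defined
locally with the value quoted above.

```c
#include <errno.h>

#define PGM_ADDRESSING 0x05 /* value as quoted in this review */

/*
 * Hypothetical helper: pass through 0 and negative errno values, and
 * map any positive program exception code to a negative errno so that
 * userspace checking ioctl() < 0 sees the failure.
 */
static int cmma_rc_to_errno(int rc)
{
	if (rc > 0)
		return -EFAULT; /* assumed mapping, not taken from the patch */
	return rc;
}
```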

>  	/* Return success if at least one value was saved, otherwise an error. */
>  	return (rc == -EFAULT && *count > 0) ? 0 : rc;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624111928.144283-1-imbrenda@linux.ibm.com?part=5


* Re: [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74
  2026-06-24 11:19 ` [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74 Claudio Imbrenda
@ 2026-06-24 11:40   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-24 11:40 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Alexander Gordeev, Heiko Carstens, Vasily Gorbik, linux-s390, kvm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Medium] A lockless data race between concurrent KVM_ENABLE_CAP and KVM_CREATE_VCPU can result in a vCPU silently missing the operation exception intercept.

Pre-existing issues:
- [Critical] Missing `vcpu->mutex` in `kvm_arch_vcpu_postcreate` allows a concurrent `KVM_RUN` ioctl to trigger a host kernel panic via a NULL pointer dereference on `vcpu->arch.gmap`.
--

commit 40c6c07119e484ffc2bb8e5fccec2d6c44c1df03
Author: Eric Farman <farman@linux.ibm.com>

KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74

This commit ensures the USER_OPEREXEC capability is properly enabled
in kvm_arch_vcpu_postcreate() even if the guest disables STFLE.74.
It adds a check for vcpu->kvm->arch.user_operexec alongside the
existing checks for STFLE.74 and user_instr0.

> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -3542,7 +3542,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>  		vcpu->arch.gmap = vcpu->kvm->arch.gmap;

[Severity: Critical]
This is a pre-existing issue, but does this function need to hold
vcpu->mutex while initializing vcpu->arch.gmap?

When kvm_vm_ioctl_create_vcpu() executes, it unlocks vcpu->mutex and
increments kvm->online_vcpus before calling kvm_arch_vcpu_postcreate():

virt/kvm/kvm_main.c:kvm_vm_ioctl_create_vcpu()
    atomic_inc(&kvm->online_vcpus);
    mutex_unlock(&vcpu->mutex);
    mutex_unlock(&kvm->lock);
    kvm_arch_vcpu_postcreate(vcpu);

If a second thread immediately calls the KVM_RUN ioctl on the new vCPU,
it can acquire vcpu->mutex and proceed to __vcpu_run() before
kvm_arch_vcpu_postcreate() finishes.  Could this result in a NULL pointer
dereference when __vcpu_run() accesses vcpu->arch.gmap->asce.val?

arch/s390/kvm/kvm-s390.c:__vcpu_run()
    sie_return = kvm_s390_enter_exit_sie(vcpu->arch.sie_block,
                                         vcpu->run->s.regs.gprs,
                                         vcpu->arch.gmap->asce.val);

>  		sca_add_vcpu(vcpu);
>  	}
> -	if (test_kvm_facility(vcpu->kvm, 74) || vcpu->kvm->arch.user_instr0)
> +	if (test_kvm_facility(vcpu->kvm, 74) || vcpu->kvm->arch.user_instr0 ||
> +	    vcpu->kvm->arch.user_operexec)

[Severity: Medium]
Does the lockless read of user_operexec introduce a data race with
KVM_ENABLE_CAP?

If a thread enables KVM_CAP_S390_USER_OPEREXEC concurrently with vCPU
creation, kvm_vm_ioctl_enable_cap() writes to kvm->arch.user_operexec and
iterates over online vCPUs without a full memory barrier:

arch/s390/kvm/kvm-s390.c:kvm_vm_ioctl_enable_cap()
    kvm->arch.user_operexec = 1;
    icpt_operexc_on_all_vcpus(kvm);

Meanwhile, kvm_vm_ioctl_create_vcpu() increments online_vcpus before
calling kvm_arch_vcpu_postcreate().

Because of this sequence, is it possible for both threads to observe the
older state of the other's variables? If so, could this cause the newly
created vCPU to silently miss setting the ICTL_OPEREXC bit?

>  		vcpu->arch.sie_block->ictl |= ICTL_OPEREXC;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624111928.144283-1-imbrenda@linux.ibm.com?part=1


* Re: [GIT PULL v1 04/12] s390/mm: Fix handling of _PAGE_UNUSED pte bit
  2026-06-24 11:19 ` [GIT PULL v1 04/12] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
@ 2026-06-24 11:42   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-24 11:42 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-s390, Vasily Gorbik, Heiko Carstens, kvm, Alexander Gordeev

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [Medium] The KVM unused page discard optimization is completely defeated by page aging.

Pre-existing issues:
- [Critical] Guest memory corruption because hardware memory access does not clear `_PAGE_UNUSED`.
--

commit 4123a9573ff29d2fd29637723b6e4e5841aed14e
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

s390/mm: Fix handling of _PAGE_UNUSED pte bit

The _PAGE_UNUSED softbit should not really be lying around. Its sole
purpose is to signal to try_to_unmap_one() and try_to_migrate_one()
that the page can be discarded instead of being moved / swapped.

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 2c6cee8241e04..4740c75649ebb 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -980,6 +980,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  
>  static inline void set_pte(pte_t *ptep, pte_t pte)
>  {
> +	if (pte_present(pte))
> +		pte = clear_pte_bit(pte, __pgprot(_PAGE_UNUSED));

[Severity: Critical]
This is a pre-existing issue, but does moving the clearing logic into
set_pte() fully resolve the guest memory corruption?

When the guest resumes using a page, the s390 hardware transparently updates
the referenced and dirty bits in the hardware Storage Key without triggering
a host page fault. Because no software page fault occurs, set_pte() is never
invoked, leaving _PAGE_UNUSED erroneously set on the active host PTE.

If memory compaction occurs, try_to_migrate_one() checks pte_unused() and
drops the page without first validating the Storage Key:

mm/rmap.c:try_to_migrate_one() {
    ...
    } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
               !userfaultfd_armed(vma)) {
        /*
         * The guest indicated that the page content is of no
         * interest anymore. Simply discard the pte, vmscan
         * will take care of the rest.
         */
    ...
}

Could this lead to the silent loss of active guest data since the migration
path incorrectly assumes the page is discardable?

[Severity: Medium]
Does clearing _PAGE_UNUSED unconditionally here in set_pte() defeat the KVM
unused page discard optimization during page aging?

When the system is under memory pressure, shrink_page_list() calls
folio_referenced(), which invokes ptep_test_and_clear_young() to inspect the
Storage Key and clear the young bit.

On s390, ptep_test_and_clear_young() writes the updated PTE back via an
unconditional call to set_pte(). This would permanently destroy the
KVM-provided _PAGE_UNUSED marker.

Immediately after aging, try_to_unmap_one() checks the unused status:

mm/rmap.c:try_to_unmap_one() {
    ...
    } else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
               !userfaultfd_armed(vma)) {
        /*
         * The guest indicated that the page content is of no
         * interest anymore. Simply discard the pte
         */
    ...
}

Because the bit was just cleared by the aging process, won't this evaluate
to false, causing truly unused guest pages to be unnecessarily swapped out
instead of efficiently discarded?

>  	WRITE_ONCE(*ptep, pte);
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624111928.144283-1-imbrenda@linux.ibm.com?part=4


* Re: [GIT PULL v1 10/12] KVM: s390: Fix cmma dirty tracking
  2026-06-24 11:19 ` [GIT PULL v1 10/12] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
@ 2026-06-24 11:46   ` sashiko-bot
  0 siblings, 0 replies; 18+ messages in thread
From: sashiko-bot @ 2026-06-24 11:46 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Alexander Gordeev, linux-s390, kvm, Heiko Carstens, Vasily Gorbik

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:

New issues:
- [High] Race between `kvm_s390_vm_stop_migration()` and `handle_essa()` causes permanent corruption of the `cmma_dirty_pages` counter.
- [High] Unbounded guest address space scan in `dat_get_cmma()` while holding the `mmu_lock` read lock causes soft lockups.
- [High] The `cmma_dirty_pages` counter is not decremented when PGSTEs are zapped, leading to a permanent counter leak.

Pre-existing issues:
- [High] Decoupled locking in `__do_essa()` causes `cmma_dirty_pages` to temporarily underflow, exposing a massive `remaining` value to userspace.
- [High] `KVM_S390_GET_CMMA_BITS` leaks uninitialized kernel heap memory to userspace due to unpopulated array elements.
--

commit f98e5de8f917aa1f013fc20f1b092133a210644d
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>

KVM: s390: Fix cmma dirty tracking

It is possible that some guest memory areas have not been touched yet
when starting migration mode, and thus have no ptes allocated. Only
existing and allocated ptes should count toward the total of dirty cmma
entries.
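
The counting rule can be sketched in plain C as follows. This is a
simplified model, not the kernel code: struct entry and set_all_dirty()
are hypothetical, and the per-entry lock that the patch takes is left
out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one page table entry (assumption). */
struct entry {
	bool present; /* a pte is allocated for this guest frame */
	bool cmma_d;  /* CMMA dirty bit */
};

/*
 * Mark every allocated entry dirty. The counter is bumped only on a
 * clean-to-dirty transition, so entries that were already dirty are
 * not counted twice and unallocated entries are not counted at all.
 */
static void set_all_dirty(struct entry *e, size_t n, int64_t *dirty_pages)
{
	for (size_t i = 0; i < n; i++) {
		if (!e[i].present)
			continue;
		if (!e[i].cmma_d)
			(*dirty_pages)++;
		e[i].cmma_d = true;
	}
}
```

Starting from a counter of 1 with one entry already dirty, two clean
entries, and one unallocated entry, the counter ends at 3.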

[ ... ]

> diff --git a/arch/s390/kvm/dat.c b/arch/s390/kvm/dat.c
> index cffac7782c4bf..0ad4ebc80ebaf 100644
> --- a/arch/s390/kvm/dat.c
> +++ b/arch/s390/kvm/dat.c
> @@ -1253,6 +1253,9 @@ int dat_get_cmma(union asce asce, gfn_t *start, unsigned int *count, u8 *values,
>  	};
>  
>  	_dat_walk_gfn_range(*start, asce_end(asce), asce, &ops, DAT_WALK_IGN_HOLES, &state);
> +	/* If no dirty pages were found, wrap around and continue searching */
> +	if (*start && state.start == -1)
> +		_dat_walk_gfn_range(0, *start, asce, &ops, DAT_WALK_IGN_HOLES, &state);

[Severity: High]
Can this wrap-around scan trigger a host soft lockup?

Since both the initial and wrap-around scans execute entirely under the
mmu_lock read lock, preemption is disabled. If a guest has a massive
memory address space that is fully populated but contains no dirty CMMA
pages, dat_get_cmma could iterate over millions of page table entries in
a tight loop without ever checking need_resched or yielding.
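
For comparison, the resumable pattern that _gmap_set_cmma_all() uses
(the walker returns the next gfn, the caller drops the lock and calls
cond_resched() between rounds) can be modelled like this. walk_chunk()
and walk_all() are hypothetical names and the fixed budget stands in
for the need_resched() check.

```c
#include <stdint.h>

typedef uint64_t gfn_t; /* stand-in for the kernel type (assumption) */

/*
 * Hypothetical walker: visits at most `budget` frames starting at gfn
 * and returns the gfn to resume from, or 0 once `end` is reached.
 * `budget` must be non-zero.
 */
static gfn_t walk_chunk(gfn_t gfn, gfn_t end, gfn_t budget, uint64_t *visited)
{
	gfn_t stop = (end - gfn > budget) ? gfn + budget : end;

	*visited += stop - gfn;
	return stop < end ? stop : 0;
}

/* Drive the walk in rounds; returns how many rounds were needed. */
static unsigned int walk_all(gfn_t end, gfn_t budget, uint64_t *visited)
{
	unsigned int rounds = 0;
	gfn_t gfn = 0;

	do {
		/* the read lock would be taken here */
		gfn = walk_chunk(gfn, end, budget, visited);
		/* lock dropped; cond_resched() would run here */
		rounds++;
	} while (gfn);
	return rounds;
}
```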

[Severity: High]
This is a pre-existing issue, but does dat_get_cmma leak uninitialized
kernel heap memory to userspace here?

When KVM_S390_GET_CMMA_BITS is invoked, the values array is allocated using
vmalloc without zero-initialization. If dat_get_cmma encounters clean pages
or unpopulated holes interleaved with dirty pages, __dat_get_cmma_pte skips
those array indices. The subsequent copy_to_user in kvm_s390_get_cmma_bits
then copies the entire contiguous array back to userspace, taking those
uninitialized vmalloc bytes with it.

>  	if (state.start == -1) {
>  		*count = 0;

[ ... ]

> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 221b2fb199d4e..9ad6bd4edbce3 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
[ ... ]
> @@ -1210,28 +1210,54 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
>  	kvm_for_each_memslot(ms, bkt, slots) {
>  		if (!ms->dirty_bitmap)
>  			return -EINVAL;
> -		ram_pages += ms->npages;
>  	}
> -	/* mark all the pages as dirty */
> -	gmap_set_cmma_all_dirty(kvm->arch.gmap);
> -	atomic64_set(&kvm->arch.cmma_dirty_pages, ram_pages);
> -	kvm->arch.migration_mode = 1;
> +	/*
> +	 * Set the flag and let KVM handle ESSA manually, potentially setting
> +	 * the cmma_d bit in some PGSTEs and increasing cmma_dirty_pages.
> +	 * At this point cmma_dirty_pages is still 0, and all existing PGSTEs
> +	 * have their cmma_d bit set to 0.
> +	 * Any newly allocated page table has its entries marked as cmma-clean,
> +	 * which is fine because the CMMA values are not dirty.
> +	 */
> +	WRITE_ONCE(kvm->arch.migration_mode, 1);
>  	kvm_s390_sync_request_broadcast(kvm, KVM_REQ_START_MIGRATION);
> +	/*
> +	 * Mark all PGSTEs as cmma-dirty, increasing cmma_dirty_pages as needed,
> +	 * but without double-counting pages that have become dirty on their own
> +	 * in the meantime.
> +	 * At this point some pages might have become dirty on their own already
> +	 * and cmma_dirty_pages might therefore be non-zero.
> +	 */
> +	gmap_set_cmma_all_dirty(kvm->arch.gmap);

[Severity: High]
Could this lead to a permanent leak in the cmma_dirty_pages counter?

This establishes an invariant that the cmma_dirty_pages counter mirrors the
global number of PGSTEs with the cmma_d bit set. However, if host userspace
unmaps memory or deletes a memslot while migration mode is active,
dat_free_pt destroys the page tables and associated PGSTEs without
decrementing cmma_dirty_pages. This leaves the counter artificially inflated,
potentially trapping the VMM in an infinite migration loop.

>  	return 0;
>  }
>  
>  /*
> - * Must be called with kvm->slots_lock to avoid races with ourselves and
> - * kvm_s390_vm_start_migration.
> + * Must be called with kvm->slots_lock to avoid races with ourselves,
> + * kvm_s390_vm_start_migration() and kvm_s390_get_cmma_bits().
>   */
>  static int kvm_s390_vm_stop_migration(struct kvm *kvm)
>  {
>  	/* migration mode already disabled */
>  	if (!kvm->arch.migration_mode)
>  		return 0;
> -	kvm->arch.migration_mode = 0;
> +	/*
> +	 * Unset the flag and propagate to all vCPUs. From now on the cmma_d
> +	 * bit will not be touched on any PGSTE.
> +	 * At this point cmma_dirty_pages is possibly non-zero, and thus some
> +	 * PGSTEs might have cmma_d set.
> +	 */
> +	WRITE_ONCE(kvm->arch.migration_mode, 0);
>  	if (kvm->arch.use_cmma)
>  		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
> +	/* Clear cmma_d on all existing PGSTEs and set cmma_dirty_pages to 0. */
> +	gmap_set_cmma_all_clean(kvm->arch.gmap);
> +	atomic64_set(&kvm->arch.cmma_dirty_pages, 0);

[Severity: High]
Is there a race condition here with concurrent ESSA instructions?

kvm_s390_sync_request_broadcast does not wait for vCPUs currently executing
host code. If a vCPU executing handle_essa reads migration_mode as 1, it
will proceed to __do_essa. Because gmap_set_cmma_all_clean and __do_essa
can overlap, the stop-migration thread might clear cmma_dirty_pages to 0
right before __do_essa calls atomic64_inc. This would leave the counter at
1 with no actual dirty PGSTEs, breaking migration invariants.

> +	/*
> +	 * At this point the system has the expected state: migration_mode is 0,
> +	 * cmma_dirty_pages is 0, and all existing PGSTEs have their cmma_d bit
> +	 * set to 0.
> +	 */
>  	return 0;
>  }

[ ... ]

> diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
> index 9bc6fd02ff777..ad0ddc433a73c 100644
> --- a/arch/s390/kvm/priv.c
> +++ b/arch/s390/kvm/priv.c
> @@ -1236,7 +1236,7 @@ static int handle_essa(struct kvm_vcpu *vcpu)
>  						: ESSA_SET_STABLE_IF_RESIDENT))
>  		return kvm_s390_inject_program_int(vcpu, PGM_SPECIFICATION);
>  
> -	if (!vcpu->kvm->arch.migration_mode) {
> +	if (!READ_ONCE(vcpu->kvm->arch.migration_mode)) {

[Severity: High]
This is a pre-existing issue, but does the decoupling of the PGSTE lock
and the global counter increment in __do_essa open a race window?

In __do_essa, dat_perform_essa sets the cmma_d bit under the PGSTE lock,
but the global cmma_dirty_pages counter is incremented outside of that
critical section. Concurrently, dat_get_cmma could acquire the same PGSTE
lock, see the bit set, clear it, and decrement cmma_dirty_pages before
__do_essa has incremented it. If the counter starts at 0, it temporarily
underflows to -1. If kvm_s390_get_cmma_bits reads it at that moment, it
exposes ULLONG_MAX to userspace as the remaining count.
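
The arithmetic behind the ULLONG_MAX value can be sketched in user space.
This is not kernel code: struct counter_model, remaining_as_u64() and
observe_remaining_during_underflow() are hypothetical names, a plain
int64_t stands in for the atomic64_t counter, and the sketch assumes the
reader converts the signed counter to an unsigned 64-bit value before
reporting it.

```c
#include <stdint.h>

/* Hypothetical model: a plain int64_t stands in for the atomic64_t counter. */
struct counter_model {
	int64_t cmma_dirty_pages;
};

/* What a reader reports if it converts the signed counter to unsigned. */
static uint64_t remaining_as_u64(const struct counter_model *m)
{
	return (uint64_t)m->cmma_dirty_pages;
}

/* Replays: bit set, reader decrements first, late increment follows. */
static uint64_t observe_remaining_during_underflow(void)
{
	struct counter_model m = { .cmma_dirty_pages = 0 };
	uint64_t observed;

	/*
	 * vCPU: cmma_d has been set under the PGSTE lock and the lock has
	 * been dropped, but the counter has not been incremented yet.
	 */

	/* reader: finds cmma_d set, clears it and decrements the counter. */
	m.cmma_dirty_pages--;

	/* reader: samples the remaining count while the counter is -1. */
	observed = remaining_as_u64(&m);

	/* vCPU: the delayed increment brings the counter back to 0. */
	m.cmma_dirty_pages++;

	return observed;
}
```

In this model the sampled value is UINT64_MAX even though the counter
returns to 0 one step later, so the window is narrow but visible to a
reader that samples inside it.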

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260624111928.144283-1-imbrenda@linux.ibm.com?part=10


Thread overview: 18+ messages
2026-06-24 11:19 [GIT PULL v1 00/12] KVM: s390: Fix S390_USER_OPEREXEC and more gmap fixes Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 01/12] KVM: s390: Fix S390_USER_OPEREXEC enablement without STFLE 74 Claudio Imbrenda
2026-06-24 11:40   ` sashiko-bot
2026-06-24 11:19 ` [GIT PULL v1 02/12] KVM: s390: selftests: Extended user_operexec tests Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 03/12] KVM: s390: Fix typo in UCONTROL documentation Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 04/12] s390/mm: Fix handling of _PAGE_UNUSED pte bit Claudio Imbrenda
2026-06-24 11:42   ` sashiko-bot
2026-06-24 11:19 ` [GIT PULL v1 05/12] KVM: s390: Fix dat_peek_cmma() overflow Claudio Imbrenda
2026-06-24 11:38   ` sashiko-bot
2026-06-24 11:19 ` [GIT PULL v1 06/12] KVM: s390: Do not set special large pages dirty Claudio Imbrenda
2026-06-24 11:37   ` sashiko-bot
2026-06-24 11:19 ` [GIT PULL v1 07/12] KVM: s390: Fix code typo in gmap_protect_asce_top_level() Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 08/12] KVM: s390: Fix handle_{sske,pfmf} under memory pressure Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 09/12] KVM: s390: Fix locking in kvm_s390_set_mem_control() Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 10/12] KVM: s390: Fix cmma dirty tracking Claudio Imbrenda
2026-06-24 11:46   ` sashiko-bot
2026-06-24 11:19 ` [GIT PULL v1 11/12] KVM: s390: selftests: Fix cmma selftest Claudio Imbrenda
2026-06-24 11:19 ` [GIT PULL v1 12/12] KVM: s390: Return failure in case of failure in kvm_s390_set_cmma_bits() Claudio Imbrenda
