[PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty
@ 2025-05-16 12:39 David Hildenbrand
  2025-05-16 12:39 ` [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful David Hildenbrand
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 12:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, kvm, linux-mm, David Hildenbrand,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Thomas Huth, Matthew Wilcox, Zi Yan, Sebastian Mitterle

From patch #3:

"
Currently, starting a PV VM on an iomap-based filesystem with large
folio support, such as XFS, will not work. We'll be stuck in
unpack_one()->gmap_make_secure(), because we can't seem to make progress
splitting the large folio.

The problem is that we require a writable PTE but a writable PTE under such
filesystems will imply a dirty folio.

So whenever we have a writable PTE, we'll have a dirty folio, and dirty
iomap folios cannot currently get split, because
split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
will fail in iomap_release_folio().

So we will not make any progress splitting such large folios.
"

Let's fix one related problem during unpack first, to then handle such
folios by triggering writeback before immediately trying to split them
again.

This makes it work on XFS with large folios again.

Long-term, we should cleanly supporting splitting such folios even
without writeback, but that's a bit harder to implement and not a quick
fix.

Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Sebastian Mitterle <smitterl@redhat.com>

David Hildenbrand (3):
  s390/uv: don't return 0 from make_hva_secure() if the operation was
    not successful
  s390/uv: always return 0 from s390_wiggle_split_folio() if successful
  s390/uv: improve splitting of large folios that cannot be split while
    dirty

 arch/s390/kernel/uv.c | 85 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 72 insertions(+), 13 deletions(-)

base-commit: 088d13246a4672bc03aec664675138e3f5bff68c
-- 
2.49.0

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful
  2025-05-16 12:39 [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty David Hildenbrand
@ 2025-05-16 12:39 ` David Hildenbrand
  2025-05-16 14:07   ` David Hildenbrand
  2025-05-16 21:08   ` Zi Yan
  2025-05-16 12:39 ` [PATCH v1 2/3] s390/uv: always return 0 from s390_wiggle_split_folio() if successful David Hildenbrand
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 12:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, kvm, linux-mm, David Hildenbrand,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Thomas Huth, Matthew Wilcox, Zi Yan, Sebastian Mitterle, stable

If s390_wiggle_split_folio() returns 0 because splitting a large folio
succeeded, we will return 0 from make_hva_secure() even though a retry
is required. Return -EAGAIN in that case.

Otherwise, we'll return 0 from gmap_make_secure(), and consequently from
unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking
succeeded and skip unpacking this page. Later on, we run into issues
and fail booting the VM.

So far, this issue was only observed with follow-up patches where we
split large pagecache XFS folios. Maybe it can also be triggered with
shmem?

We'll cleanup s390_wiggle_split_folio() a bit next, to also return 0
if no split was required.

Fixes: d8dfda5af0be ("KVM: s390: pv: fix race when making a page secure")
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/kernel/uv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 9a5d5be8acf41..2cc3b599c7fe3 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -393,8 +393,11 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
 	folio_walk_end(&fw, vma);
 	mmap_read_unlock(mm);
 
-	if (rc == -E2BIG || rc == -EBUSY)
+	if (rc == -E2BIG || rc == -EBUSY) {
 		rc = s390_wiggle_split_folio(mm, folio, rc == -E2BIG);
+		if (!rc)
+			rc = -EAGAIN;
+	}
 	folio_put(folio);
 
 	return rc;
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful
  2025-05-16 12:39 ` [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful David Hildenbrand
@ 2025-05-16 14:07   ` David Hildenbrand
  2025-05-16 21:08   ` Zi Yan
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 14:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, kvm, linux-mm, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Thomas Huth, Matthew Wilcox,
	Zi Yan, Sebastian Mitterle, stable

On 16.05.25 14:39, David Hildenbrand wrote:
> If s390_wiggle_split_folio() returns 0 because splitting a large folio
> succeeded, we will return 0 from make_hva_secure() even though a retry
> is required. Return -EAGAIN in that case.
> 
> Otherwise, we'll return 0 from gmap_make_secure(), and consequently from
> unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking
> succeeded and skip unpacking this page. Later on, we run into issues
> and fail booting the VM.
> 
> So far, this issue was only observed with follow-up patches where we
> split large pagecache XFS folios. Maybe it can also be triggered with
> shmem?

Yes! I can reproduce it when allocating pages outside of the qemu process.

$ echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled
$ rm /dev/shm/vm-ram
$ fallocate -l 4G /dev/shm/vm-ram
$ /usr/libexec/qemu-kvm ... -object 
memory-backend-file,id=mem0,size=4g,share=on,mem-path=/dev/shm/vm-ram -M 
memory-backend=mem0

LOADPARM=[        ]

Using virtio-blk.
Using SCSI scheme.
.........................................................................................................................
qemu-kvm: KVM PV command 4 (KVM_PV_VERIFY) failed: header rc 102 rrc 1a 
IOCTL rc: -22
Protected boot has failed: 0xa02
Guest crashed on cpu 0: disabled-wait
PSW: 0x0002000080000000 0x0000000000004608


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful
  2025-05-16 12:39 ` [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful David Hildenbrand
  2025-05-16 14:07   ` David Hildenbrand
@ 2025-05-16 21:08   ` Zi Yan
  2025-05-16 21:20     ` David Hildenbrand
  1 sibling, 1 reply; 13+ messages in thread
From: Zi Yan @ 2025-05-16 21:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Thomas Huth, Matthew Wilcox,
	Sebastian Mitterle, stable

On 16 May 2025, at 8:39, David Hildenbrand wrote:

> If s390_wiggle_split_folio() returns 0 because splitting a large folio
> succeeded, we will return 0 from make_hva_secure() even though a retry
> is required. Return -EAGAIN in that case.
>
> Otherwise, we'll return 0 from gmap_make_secure(), and consequently from
> unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking
> succeeded and skip unpacking this page. Later on, we run into issues
> and fail booting the VM.
>
> So far, this issue was only observed with follow-up patches where we
> split large pagecache XFS folios. Maybe it can also be triggered with
> shmem?
>
> We'll cleanup s390_wiggle_split_folio() a bit next, to also return 0
> if no split was required.
>
> Fixes: d8dfda5af0be ("KVM: s390: pv: fix race when making a page secure")
> Cc: stable@vger.kernel.org
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/kernel/uv.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 9a5d5be8acf41..2cc3b599c7fe3 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -393,8 +393,11 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
>  	folio_walk_end(&fw, vma);
>  	mmap_read_unlock(mm);
>
> -	if (rc == -E2BIG || rc == -EBUSY)
> +	if (rc == -E2BIG || rc == -EBUSY) {
>  		rc = s390_wiggle_split_folio(mm, folio, rc == -E2BIG);
> +		if (!rc)
> +			rc = -EAGAIN;

Why not just folio_put() then jump back to the beginning of the
function to do the retry? This could avoid going all the way back
to kvm_s390_unpack().

> +	}
>  	folio_put(folio);
>
>  	return rc;
> -- 
> 2.49.0


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful
  2025-05-16 21:08   ` Zi Yan
@ 2025-05-16 21:20     ` David Hildenbrand
  2025-05-17  0:02       ` Zi Yan
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 21:20 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Thomas Huth, Matthew Wilcox,
	Sebastian Mitterle, stable

On 16.05.25 23:08, Zi Yan wrote:
> On 16 May 2025, at 8:39, David Hildenbrand wrote:
> 
>> If s390_wiggle_split_folio() returns 0 because splitting a large folio
>> succeeded, we will return 0 from make_hva_secure() even though a retry
>> is required. Return -EAGAIN in that case.
>>
>> Otherwise, we'll return 0 from gmap_make_secure(), and consequently from
>> unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking
>> succeeded and skip unpacking this page. Later on, we run into issues
>> and fail booting the VM.
>>
>> So far, this issue was only observed with follow-up patches where we
>> split large pagecache XFS folios. Maybe it can also be triggered with
>> shmem?
>>
>> We'll cleanup s390_wiggle_split_folio() a bit next, to also return 0
>> if no split was required.
>>
>> Fixes: d8dfda5af0be ("KVM: s390: pv: fix race when making a page secure")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   arch/s390/kernel/uv.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
>> index 9a5d5be8acf41..2cc3b599c7fe3 100644
>> --- a/arch/s390/kernel/uv.c
>> +++ b/arch/s390/kernel/uv.c
>> @@ -393,8 +393,11 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
>>   	folio_walk_end(&fw, vma);
>>   	mmap_read_unlock(mm);
>>
>> -	if (rc == -E2BIG || rc == -EBUSY)
>> +	if (rc == -E2BIG || rc == -EBUSY) {
>>   		rc = s390_wiggle_split_folio(mm, folio, rc == -E2BIG);
>> +		if (!rc)
>> +			rc = -EAGAIN;
> 
> Why not just folio_put() then jump back to the beginning of the
> function to do the retry? This could avoid going all the way back
> to kvm_s390_unpack().

Hi, thanks for the review.

We had a pretty optimized version with such tricks before Claudio 
refactored it in:

commit 5cbe24350b7d8ef6d466a37d56b07ae643c622ca
Author: Claudio Imbrenda <imbrenda@linux.ibm.com>
Date:   Thu Jan 23 15:46:17 2025 +0100

     KVM: s390: move pv gmap functions into kvm



In particular, one relevant hunk was:

-       switch (rc) {
-       case -E2BIG:
-               folio_lock(folio);
-               rc = split_folio(folio);
-               folio_unlock(folio);
-               folio_put(folio);
-
-               switch (rc) {
-               case 0:
-                       /* Splitting succeeded, try again immediately. */
-                       goto again;
-               case -EAGAIN:
-                       /* Additional folio references. */
-                       if (drain_lru(&drain_lru_called))
-                               goto again;
-                       return -EAGAIN;



Claudio probably had a good reason to rewrite the code -- and I hope 
we'll be able to rip all of that out soon, so ...

... minimal changes until then :)


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful
  2025-05-16 21:20     ` David Hildenbrand
@ 2025-05-17  0:02       ` Zi Yan
  0 siblings, 0 replies; 13+ messages in thread
From: Zi Yan @ 2025-05-17  0:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Thomas Huth, Matthew Wilcox,
	Sebastian Mitterle, stable

On 16 May 2025, at 17:20, David Hildenbrand wrote:

> On 16.05.25 23:08, Zi Yan wrote:
>> On 16 May 2025, at 8:39, David Hildenbrand wrote:
>>
>>> If s390_wiggle_split_folio() returns 0 because splitting a large folio
>>> succeeded, we will return 0 from make_hva_secure() even though a retry
>>> is required. Return -EAGAIN in that case.
>>>
>>> Otherwise, we'll return 0 from gmap_make_secure(), and consequently from
>>> unpack_one(). In kvm_s390_pv_unpack(), we assume that unpacking
>>> succeeded and skip unpacking this page. Later on, we run into issues
>>> and fail booting the VM.
>>>
>>> So far, this issue was only observed with follow-up patches where we
>>> split large pagecache XFS folios. Maybe it can also be triggered with
>>> shmem?
>>>
>>> We'll cleanup s390_wiggle_split_folio() a bit next, to also return 0
>>> if no split was required.
>>>
>>> Fixes: d8dfda5af0be ("KVM: s390: pv: fix race when making a page secure")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   arch/s390/kernel/uv.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
>>> index 9a5d5be8acf41..2cc3b599c7fe3 100644
>>> --- a/arch/s390/kernel/uv.c
>>> +++ b/arch/s390/kernel/uv.c
>>> @@ -393,8 +393,11 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
>>>   	folio_walk_end(&fw, vma);
>>>   	mmap_read_unlock(mm);
>>>
>>> -	if (rc == -E2BIG || rc == -EBUSY)
>>> +	if (rc == -E2BIG || rc == -EBUSY) {
>>>   		rc = s390_wiggle_split_folio(mm, folio, rc == -E2BIG);
>>> +		if (!rc)
>>> +			rc = -EAGAIN;
>>
>> Why not just folio_put() then jump back to the beginning of the
>> function to do the retry? This could avoid going all the way back
>> to kvm_s390_unpack().
>
> Hi, thanks for the review.
>
> We had a pretty optimized version with such tricks before Claudio refactored it in:
>
> commit 5cbe24350b7d8ef6d466a37d56b07ae643c622ca
> Author: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Date:   Thu Jan 23 15:46:17 2025 +0100
>
>     KVM: s390: move pv gmap functions into kvm
>
>
>
> In particular, one relevant hunk was:
>
> -       switch (rc) {
> -       case -E2BIG:
> -               folio_lock(folio);
> -               rc = split_folio(folio);
> -               folio_unlock(folio);
> -               folio_put(folio);
> -
> -               switch (rc) {
> -               case 0:
> -                       /* Splitting succeeded, try again immediately. */
> -                       goto again;
> -               case -EAGAIN:
> -                       /* Additional folio references. */
> -                       if (drain_lru(&drain_lru_called))
> -                               goto again;
> -                       return -EAGAIN;
>
>
>
> Claudio probably had a good reason to rewrite the code -- and I hope we'll be able to rip all of that out soon, so ...
>
> ... minimal changes until then :)

Got it. Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v1 2/3] s390/uv: always return 0 from s390_wiggle_split_folio() if successful
  2025-05-16 12:39 [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty David Hildenbrand
  2025-05-16 12:39 ` [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful David Hildenbrand
@ 2025-05-16 12:39 ` David Hildenbrand
  2025-05-17  0:08   ` Zi Yan
  2025-05-16 12:39 ` [PATCH v1 3/3] s390/uv: improve splitting of large folios that cannot be split while dirty David Hildenbrand
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 12:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, kvm, linux-mm, David Hildenbrand,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Thomas Huth, Matthew Wilcox, Zi Yan, Sebastian Mitterle

Let's consistently return 0 if the operation was successful, and just
detect ourselves whether splitting is required -- folio_test_large() is
a cheap operation.

Update the documentation.

Should we simply always return -EAGAIN instead of 0, so we don't have
to handle it in the caller? Not sure, staring at the documentation, this
way looks a bit cleaner.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/kernel/uv.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 2cc3b599c7fe3..f6ddb2b54032e 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -324,34 +324,36 @@ static int make_folio_secure(struct mm_struct *mm, struct folio *folio, struct u
 }
 
 /**
- * s390_wiggle_split_folio() - try to drain extra references to a folio and optionally split.
+ * s390_wiggle_split_folio() - try to drain extra references to a folio and
+ *			       split the folio if it is large.
  * @mm:    the mm containing the folio to work on
  * @folio: the folio
- * @split: whether to split a large folio
  *
  * Context: Must be called while holding an extra reference to the folio;
  *          the mm lock should not be held.
- * Return: 0 if the folio was split successfully;
- *         -EAGAIN if the folio was not split successfully but another attempt
- *                 can be made, or if @split was set to false;
- *         -EINVAL in case of other errors. See split_folio().
+ * Return: 0 if the operation was successful;
+ *	   -EAGAIN if splitting the large folio was not successful,
+ *		   but another attempt can be made;
+ *	   -EINVAL in case of other folio splitting errors. See split_folio().
  */
-static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio, bool split)
+static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio)
 {
 	int rc;
 
 	lockdep_assert_not_held(&mm->mmap_lock);
 	folio_wait_writeback(folio);
 	lru_add_drain_all();
-	if (split) {
+
+	if (folio_test_large(folio)) {
 		folio_lock(folio);
 		rc = split_folio(folio);
 		folio_unlock(folio);
 
 		if (rc != -EBUSY)
 			return rc;
+		return -EAGAIN;
 	}
-	return -EAGAIN;
+	return 0;
 }
 
 int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header *uvcb)
@@ -394,7 +396,7 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
 	mmap_read_unlock(mm);
 
 	if (rc == -E2BIG || rc == -EBUSY) {
-		rc = s390_wiggle_split_folio(mm, folio, rc == -E2BIG);
+		rc = s390_wiggle_split_folio(mm, folio);
 		if (!rc)
 			rc = -EAGAIN;
 	}
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 2/3] s390/uv: always return 0 from s390_wiggle_split_folio() if successful
  2025-05-16 12:39 ` [PATCH v1 2/3] s390/uv: always return 0 from s390_wiggle_split_folio() if successful David Hildenbrand
@ 2025-05-17  0:08   ` Zi Yan
  0 siblings, 0 replies; 13+ messages in thread
From: Zi Yan @ 2025-05-17  0:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Thomas Huth, Matthew Wilcox,
	Sebastian Mitterle

On 16 May 2025, at 8:39, David Hildenbrand wrote:

> Let's consistently return 0 if the operation was successful, and just
> detect ourselves whether splitting is required -- folio_test_large() is
> a cheap operation.
>
> Update the documentation.
>
> Should we simply always return -EAGAIN instead of 0, so we don't have
> to handle it in the caller? Not sure, staring at the documentation, this
> way looks a bit cleaner.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/kernel/uv.c | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 2cc3b599c7fe3..f6ddb2b54032e 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -324,34 +324,36 @@ static int make_folio_secure(struct mm_struct *mm, struct folio *folio, struct u
>  }
>
>  /**
> - * s390_wiggle_split_folio() - try to drain extra references to a folio and optionally split.
> + * s390_wiggle_split_folio() - try to drain extra references to a folio and
> + *			       split the folio if it is large.
>   * @mm:    the mm containing the folio to work on
>   * @folio: the folio
> - * @split: whether to split a large folio
>   *
>   * Context: Must be called while holding an extra reference to the folio;
>   *          the mm lock should not be held.
> - * Return: 0 if the folio was split successfully;
> - *         -EAGAIN if the folio was not split successfully but another attempt
> - *                 can be made, or if @split was set to false;
> - *         -EINVAL in case of other errors. See split_folio().
> + * Return: 0 if the operation was successful;
> + *	   -EAGAIN if splitting the large folio was not successful,
> + *		   but another attempt can be made;
> + *	   -EINVAL in case of other folio splitting errors. See split_folio().
>   */
> -static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio, bool split)
> +static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio)
>  {
>  	int rc;
>
>  	lockdep_assert_not_held(&mm->mmap_lock);
>  	folio_wait_writeback(folio);
>  	lru_add_drain_all();
> -	if (split) {
> +
> +	if (folio_test_large(folio)) {
>  		folio_lock(folio);
>  		rc = split_folio(folio);
>  		folio_unlock(folio);
>
>  		if (rc != -EBUSY)
>  			return rc;
> +		return -EAGAIN;
>  	}
> -	return -EAGAIN;
> +	return 0;
>  }

I can see how this function is written to service as two purposes,
trying to get rid of pcp ref of a folio and split a folio (to avoid
the extra pcp ref from failing split, lru_add_drain_all() is
called before split). Hope it will be refactored later.

>
>  int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header *uvcb)
> @@ -394,7 +396,7 @@ int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header
>  	mmap_read_unlock(mm);
>
>  	if (rc == -E2BIG || rc == -EBUSY) {
> -		rc = s390_wiggle_split_folio(mm, folio, rc == -E2BIG);
> +		rc = s390_wiggle_split_folio(mm, folio);
>  		if (!rc)
>  			rc = -EAGAIN;
>  	}
> -- 
> 2.49.0

The changes look good to me. Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v1 3/3] s390/uv: improve splitting of large folios that cannot be split while dirty
  2025-05-16 12:39 [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty David Hildenbrand
  2025-05-16 12:39 ` [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful David Hildenbrand
  2025-05-16 12:39 ` [PATCH v1 2/3] s390/uv: always return 0 from s390_wiggle_split_folio() if successful David Hildenbrand
@ 2025-05-16 12:39 ` David Hildenbrand
  2025-05-16 17:07 ` [PATCH v1 0/3] s390/uv: handle " Claudio Imbrenda
  2025-05-16 17:17 ` Claudio Imbrenda
  4 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 12:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-s390, kvm, linux-mm, David Hildenbrand,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Thomas Huth, Matthew Wilcox, Zi Yan, Sebastian Mitterle

Currently, starting a PV VM on an iomap-based filesystem with large
folio support, such as XFS, will not work. We'll be stuck in
unpack_one()->gmap_make_secure(), because we can't seem to make progress
splitting the large folio.

The problem is that we require a writable PTE but a writable PTE under such
filesystems will imply a dirty folio.

So whenever we have a writable PTE, we'll have a dirty folio, and dirty
iomap folios cannot currently get split, because
split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
will fail in iomap_release_folio().

So we will not make any progress splitting such large folios.

Until dirty folios can be split more reliably, let's manually trigger
writeback of the problematic folio using
filemap_write_and_wait_range(), and retry the split immediately
afterwards exactly once, before looking up the folio again.

Should this logic be part of split_folio()? Likely not; most split users
don't have to split so eagerly to make any progress.

For now, this seems to affect xfs, zonefs and erofs, and this patch
makes it work again (tested on xfs only).

While this could be considered a fix for 6795801366da ("xfs: Support
large folios"), df2f9708ff1f ("zonefs: enable support for large folios")
and ce529cc25b18 ("erofs: enable large folios for iomap mode"), before
commit eef88fe45ac9 ("s390/uv: Split large folios in gmap_make_secure()"),
we did not try splitting large folios at all. So it's all rather part of
making SE compatible with file systems that support large folios. But to
have some "Fixes:" tag, let's just use eef88fe45ac9.

Not CCing stable, because there are a lot of dependencies, and it simply
not working is not critical in stable kernels.

Reported-by: Sebastian Mitterle <smitterl@redhat.com>
Closes: https://issues.redhat.com/browse/RHEL-58218
Fixes: eef88fe45ac9 ("s390/uv: Split large folios in gmap_make_secure()")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/kernel/uv.c | 66 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index f6ddb2b54032e..d278bf0c09d1b 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -15,6 +15,7 @@
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 #include <linux/pagewalk.h>
+#include <linux/backing-dev.h>
 #include <asm/facility.h>
 #include <asm/sections.h>
 #include <asm/uv.h>
@@ -338,22 +339,75 @@ static int make_folio_secure(struct mm_struct *mm, struct folio *folio, struct u
  */
 static int s390_wiggle_split_folio(struct mm_struct *mm, struct folio *folio)
 {
-	int rc;
+	int rc, tried_splits;
 
 	lockdep_assert_not_held(&mm->mmap_lock);
 	folio_wait_writeback(folio);
 	lru_add_drain_all();
 
-	if (folio_test_large(folio)) {
+	if (!folio_test_large(folio))
+		return 0;
+
+	for (tried_splits = 0; tried_splits < 2; tried_splits++) {
+		struct address_space *mapping;
+		loff_t lstart, lend;
+		struct inode *inode;
+
 		folio_lock(folio);
 		rc = split_folio(folio);
+		if (rc != -EBUSY) {
+			folio_unlock(folio);
+			return rc;
+		}
+
+		/*
+		 * Splitting with -EBUSY can fail for various reasons, but we
+		 * have to handle one case explicitly for now: some mappings
+		 * don't allow for splitting dirty folios; writeback will
+		 * mark them clean again, including marking all page table
+		 * entries mapping the folio read-only, to catch future write
+		 * attempts.
+		 *
+		 * While the system should be writing back dirty folios in the
+		 * background, we obtained this folio by looking up a writable
+		 * page table entry. On these problematic mappings, writable
+		 * page table entries imply dirty folios, preventing the
+		 * split in the first place.
+		 *
+		 * To prevent a livelock when trigger writeback manually and
+		 * letting the caller look up the folio again in the page
+		 * table (turning it dirty), immediately try to split again.
+		 *
+		 * This is only a problem for some mappings (e.g., XFS);
+		 * mappings that do not support writeback (e.g., shmem) do not
+		 * apply.
+		 */
+		if (!folio_test_dirty(folio) || folio_test_anon(folio) ||
+		    !folio->mapping || !mapping_can_writeback(folio->mapping)) {
+			folio_unlock(folio);
+			break;
+		}
+
+		/*
+		 * Ideally, we'd only trigger writeback on this exact folio. But
+		 * there is no easy way to do that, so we'll stabilize the
+		 * mapping while we still hold the folio lock, so we can drop
+		 * the folio lock to trigger writeback on the range currently
+		 * covered by the folio instead.
+		 */
+		mapping = folio->mapping;
+		lstart = folio_pos(folio);
+		lend = lstart + folio_size(folio) - 1;
+		inode = igrab(mapping->host);
 		folio_unlock(folio);
 
-		if (rc != -EBUSY)
-			return rc;
-		return -EAGAIN;
+		if (unlikely(!inode))
+			break;
+
+		filemap_write_and_wait_range(mapping, lstart, lend);
+		iput(mapping->host);
 	}
-	return 0;
+	return -EAGAIN;
 }
 
 int make_hva_secure(struct mm_struct *mm, unsigned long hva, struct uv_cb_header *uvcb)
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty
  2025-05-16 12:39 [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-05-16 12:39 ` [PATCH v1 3/3] s390/uv: improve splitting of large folios that cannot be split while dirty David Hildenbrand
@ 2025-05-16 17:07 ` Claudio Imbrenda
  2025-05-16 18:55   ` David Hildenbrand
  2025-05-16 17:17 ` Claudio Imbrenda
  4 siblings, 1 reply; 13+ messages in thread
From: Claudio Imbrenda @ 2025-05-16 17:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Thomas Huth, Matthew Wilcox, Zi Yan,
	Sebastian Mitterle

On Fri, 16 May 2025 14:39:43 +0200
David Hildenbrand <david@redhat.com> wrote:

> From patch #3:
> 
> "
> Currently, starting a PV VM on an iomap-based filesystem with large
> folio support, such as XFS, will not work. We'll be stuck in
> unpack_one()->gmap_make_secure(), because we can't seem to make progress
> splitting the large folio.
> 
> The problem is that we require a writable PTE but a writable PTE under such
> filesystems will imply a dirty folio.
> 
> So whenever we have a writable PTE, we'll have a dirty folio, and dirty
> iomap folios cannot currently get split, because
> split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
> will fail in iomap_release_folio().
> 
> So we will not make any progress splitting such large folios.
> "
> 
> Let's fix one related problem during unpack first, to then handle such
> folios by triggering writeback before immediately trying to split them
> again.
> 
> This makes it work on XFS with large folios again.
> 
> Long-term, we should cleanly supporting splitting such folios even
> without writeback, but that's a bit harder to implement and not a quick
> fix.

yet another layer of duck tape

I really dislike the current interaction between secure execution and
I/O, I hope I can get a cleaner solution as soon as possible


meanwhile, let's keep the boat afloat; whole series:

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>


David: thanks for fixing this mess!

> 
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Sebastian Mitterle <smitterl@redhat.com>
> 
> David Hildenbrand (3):
>   s390/uv: don't return 0 from make_hva_secure() if the operation was
>     not successful
>   s390/uv: always return 0 from s390_wiggle_split_folio() if successful
>   s390/uv: improve splitting of large folios that cannot be split while
>     dirty
> 
>  arch/s390/kernel/uv.c | 85 ++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 72 insertions(+), 13 deletions(-)
> 
> 
> base-commit: 088d13246a4672bc03aec664675138e3f5bff68c



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty
  2025-05-16 17:07 ` [PATCH v1 0/3] s390/uv: handle " Claudio Imbrenda
@ 2025-05-16 18:55   ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 18:55 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Thomas Huth, Matthew Wilcox, Zi Yan,
	Sebastian Mitterle

On 16.05.25 19:07, Claudio Imbrenda wrote:
> On Fri, 16 May 2025 14:39:43 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>  From patch #3:
>>
>> "
>> Currently, starting a PV VM on an iomap-based filesystem with large
>> folio support, such as XFS, will not work. We'll be stuck in
>> unpack_one()->gmap_make_secure(), because we can't seem to make progress
>> splitting the large folio.
>>
>> The problem is that we require a writable PTE but a writable PTE under such
>> filesystems will imply a dirty folio.
>>
>> So whenever we have a writable PTE, we'll have a dirty folio, and dirty
>> iomap folios cannot currently get split, because
>> split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
>> will fail in iomap_release_folio().
>>
>> So we will not make any progress splitting such large folios.
>> "
>>
>> Let's fix one related problem during unpack first, to then handle such
>> folios by triggering writeback before immediately trying to split them
>> again.
>>
>> This makes it work on XFS with large folios again.
>>
>> Long-term, we should cleanly supporting splitting such folios even
>> without writeback, but that's a bit harder to implement and not a quick
>> fix.
> 
> yet another layer of duck tape
> 
> I really dislike the current interaction between secure execution and
> I/O, I hope I can get a cleaner solution as soon as possible

I'll be more than happy to review such a series -- hoping we can just 
support large folios naturally :)

> 
> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> 
> 

Thanks!

> David: thanks for fixing this mess!


NP; I had a prototype of patch #3 for a long time. But after rebasing on 
top of your work I saw these weird validation errors and just couldn't 
find the issue. And I only saw them with patch #3 on ordinary pagecache 
folios, not with shmem, which severely confused.

... gave it another try today after ~1month and almost immediately 
spotted the issue.

Some things just need time :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty
  2025-05-16 12:39 [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-05-16 17:07 ` [PATCH v1 0/3] s390/uv: handle " Claudio Imbrenda
@ 2025-05-16 17:17 ` Claudio Imbrenda
  2025-05-16 18:56   ` David Hildenbrand
  4 siblings, 1 reply; 13+ messages in thread
From: Claudio Imbrenda @ 2025-05-16 17:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Thomas Huth, Matthew Wilcox, Zi Yan,
	Sebastian Mitterle

On Fri, 16 May 2025 14:39:43 +0200
David Hildenbrand <david@redhat.com> wrote:

> From patch #3:
> 
> "
> Currently, starting a PV VM on an iomap-based filesystem with large
> folio support, such as XFS, will not work. We'll be stuck in
> unpack_one()->gmap_make_secure(), because we can't seem to make progress
> splitting the large folio.
> 
> The problem is that we require a writable PTE but a writable PTE under such
> filesystems will imply a dirty folio.
> 
> So whenever we have a writable PTE, we'll have a dirty folio, and dirty
> iomap folios cannot currently get split, because
> split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
> will fail in iomap_release_folio().
> 
> So we will not make any progress splitting such large folios.
> "
> 
> Let's fix one related problem during unpack first, to then handle such
> folios by triggering writeback before immediately trying to split them
> again.
> 
> This makes it work on XFS with large folios again.
> 
> Long-term, we should cleanly supporting splitting such folios even
> without writeback, but that's a bit harder to implement and not a quick
> fix.

picked for 6.16, I think it will survive the CI without issues, since
I assume you tested this thoroughly

> 
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Sebastian Mitterle <smitterl@redhat.com>
> 
> David Hildenbrand (3):
>   s390/uv: don't return 0 from make_hva_secure() if the operation was
>     not successful
>   s390/uv: always return 0 from s390_wiggle_split_folio() if successful
>   s390/uv: improve splitting of large folios that cannot be split while
>     dirty
> 
>  arch/s390/kernel/uv.c | 85 ++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 72 insertions(+), 13 deletions(-)
> 
> 
> base-commit: 088d13246a4672bc03aec664675138e3f5bff68c



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty
  2025-05-16 17:17 ` Claudio Imbrenda
@ 2025-05-16 18:56   ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-16 18:56 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-s390, kvm, linux-mm, Christian Borntraeger,
	Janosch Frank, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Sven Schnelle, Thomas Huth, Matthew Wilcox, Zi Yan,
	Sebastian Mitterle

On 16.05.25 19:17, Claudio Imbrenda wrote:
> On Fri, 16 May 2025 14:39:43 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>  From patch #3:
>>
>> "
>> Currently, starting a PV VM on an iomap-based filesystem with large
>> folio support, such as XFS, will not work. We'll be stuck in
>> unpack_one()->gmap_make_secure(), because we can't seem to make progress
>> splitting the large folio.
>>
>> The problem is that we require a writable PTE but a writable PTE under such
>> filesystems will imply a dirty folio.
>>
>> So whenever we have a writable PTE, we'll have a dirty folio, and dirty
>> iomap folios cannot currently get split, because
>> split_folio()->split_huge_page_to_list_to_order()->filemap_release_folio()
>> will fail in iomap_release_folio().
>>
>> So we will not make any progress splitting such large folios.
>> "
>>
>> Let's fix one related problem during unpack first, to then handle such
>> folios by triggering writeback before immediately trying to split them
>> again.
>>
>> This makes it work on XFS with large folios again.
>>
>> Long-term, we should cleanly supporting splitting such folios even
>> without writeback, but that's a bit harder to implement and not a quick
>> fix.
> 
> picked for 6.16, I think it will survive the CI without issues, since
> I assume you tested this thoroughly

I did test what was known to be broken, but our QE did not run a bigger 
test on it. So giving it some soaking time + waiting for a bit for more 
review might be a good idea!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-05-17  0:09 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-16 12:39 [PATCH v1 0/3] s390/uv: handle folios that cannot be split while dirty David Hildenbrand
2025-05-16 12:39 ` [PATCH v1 1/3] s390/uv: don't return 0 from make_hva_secure() if the operation was not successful David Hildenbrand
2025-05-16 14:07   ` David Hildenbrand
2025-05-16 21:08   ` Zi Yan
2025-05-16 21:20     ` David Hildenbrand
2025-05-17  0:02       ` Zi Yan
2025-05-16 12:39 ` [PATCH v1 2/3] s390/uv: always return 0 from s390_wiggle_split_folio() if successful David Hildenbrand
2025-05-17  0:08   ` Zi Yan
2025-05-16 12:39 ` [PATCH v1 3/3] s390/uv: improve splitting of large folios that cannot be split while dirty David Hildenbrand
2025-05-16 17:07 ` [PATCH v1 0/3] s390/uv: handle " Claudio Imbrenda
2025-05-16 18:55   ` David Hildenbrand
2025-05-16 17:17 ` Claudio Imbrenda
2025-05-16 18:56   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).