* [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
@ 2025-10-28 21:15 Luiz Capitulino
2025-10-28 21:53 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Luiz Capitulino @ 2025-10-28 21:15 UTC (permalink / raw)
To: hca, borntraeger, joao.m.martins, mike.kravetz, linux-kernel,
linux-mm, linux-s390, gor, gerald.schaefer, agordeev
Cc: osalvador, akpm, david, aneesh.kumar
A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
on s390. The crash was debugged and the proposed fix was developed on an s390
KVM guest running on an older hypervisor, as I don't have access to an LPAR.
However, the same issue should occur on bare-metal.
Reproducer (it may take a few runs to trigger):
# sysctl vm.hugetlb_optimize_vmemmap=1
# echo 1 > /proc/sys/vm/nr_hugepages
# echo 0 > /proc/sys/vm/nr_hugepages
Crash log:
[ 52.340369] list_del corruption. prev->next should be 000000d382110008, but was 000000d7116d3880. (prev=000000d7116d3910)
[ 52.340420] ------------[ cut here ]------------
[ 52.340424] kernel BUG at lib/list_debug.c:62!
[ 52.340566] monitor event: 0040 ilc:2 [#1]SMP
[ 52.340573] Modules linked in: ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng des_s390 libdes sha3_512_s390 sha3_256_s390 virtio_net virtio_blk net_failover sha_common failover dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390
[ 52.340606] CPU: 1 UID: 0 PID: 1672 Comm: root-rep2 Kdump: loaded Not tainted 6.18.0-rc3 #1 NONE
[ 52.340610] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
[ 52.340611] Krnl PSW : 0704c00180000000 0000015710cda7fe (__list_del_entry_valid_or_report+0xfe/0x128)
[ 52.340619] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 52.340622] Krnl GPRS: c0000000ffffefff 0000000100000027 000000000000006d 0000000000000000
[ 52.340623] 000000d7116d35d8 000000d7116d35d0 0000000000000002 000000d7116d39b0
[ 52.340625] 000000d7116d3880 000000d7116d3910 000000d7116d3910 000000d382110008
[ 52.340626] 000003ffac1ccd08 000000d7116d39b0 0000015710cda7fa 000000d7116d37d0
[ 52.340632] Krnl Code: 0000015710cda7ee: c020003e496f larl %r2,00000157114a3acc
0000015710cda7f4: c0e5ffd5280e brasl %r14,000001571077f810
#0000015710cda7fa: af000000 mc 0,0
>0000015710cda7fe: b9040029 lgr %r2,%r9
0000015710cda802: c0e5ffe5e193 brasl %r14,0000015710996b28
0000015710cda808: e34090080004 lg %r4,8(%r9)
0000015710cda80e: b9040059 lgr %r5,%r9
0000015710cda812: b9040038 lgr %r3,%r8
[ 52.340643] Call Trace:
[ 52.340645] [<0000015710cda7fe>] __list_del_entry_valid_or_report+0xfe/0x128
[ 52.340649] ([<0000015710cda7fa>] __list_del_entry_valid_or_report+0xfa/0x128)
[ 52.340652] [<0000015710a30b2e>] hugetlb_vmemmap_restore_folios+0x96/0x138
[ 52.340655] [<0000015710a268ac>] update_and_free_pages_bulk+0x64/0x150
[ 52.340659] [<0000015710a26f8a>] set_max_huge_pages+0x4ca/0x6f0
[ 52.340662] [<0000015710a273ba>] hugetlb_sysctl_handler_common+0xea/0x120
[ 52.340665] [<0000015710a27484>] hugetlb_sysctl_handler+0x44/0x50
[ 52.340667] [<0000015710b53ffa>] proc_sys_call_handler+0x17a/0x280
[ 52.340672] [<0000015710a90968>] vfs_write+0x2c8/0x3a0
[ 52.340676] [<0000015710a90bd2>] ksys_write+0x72/0x100
[ 52.340679] [<00000157111483a8>] __do_syscall+0x150/0x318
[ 52.340682] [<0000015711153a5e>] system_call+0x6e/0x90
[ 52.340684] Last Breaking-Event-Address:
[ 52.340684] [<000001571077f85c>] _printk+0x4c/0x58
[ 52.340690] Kernel panic - not syncing: Fatal exception: panic_on_oops
This issue was introduced by commit f13b83fdd996 ("hugetlb: batch TLB
flushes when freeing vmemmap"). Before that change, the HVO
implementation called flush_tlb_kernel_range() each time a vmemmap
PMD split and remapping was performed. The mentioned commit changed this
to issue a few flush_tlb_all() calls after performing all remappings.
However, on s390, flush_tlb_kernel_range() expands to
__tlb_flush_kernel() while flush_tlb_all() is not implemented. As a
result, we went from flushing the TLB for every remapping to no flushing
at all.
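The effect described above can be reproduced in a small userspace model. This
is only a sketch: the counter, toy_flush_kernel_range(), toy_flush_all_noop()
and the two remap_* helpers are hypothetical names made up for illustration,
not kernel code. It shows that once the per-remapping flush is replaced by a
batched call that expands to do { } while (0), the code still compiles cleanly
but no flush is issued.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code: counts how many "TLB flushes" were issued. */
static unsigned int toy_flush_count;

/* Stand-in for a flush primitive that really does something. */
static void toy_flush_kernel_range(void)
{
	toy_flush_count++;
}

/* Stand-in for the old s390 definition: an empty statement. */
#define toy_flush_all_noop() do { } while (0)

/* Before batching: one flush per remapping. */
static unsigned int remap_flush_each(size_t nr_remaps)
{
	toy_flush_count = 0;
	for (size_t i = 0; i < nr_remaps; i++)
		toy_flush_kernel_range();	/* remap, then flush */
	return toy_flush_count;
}

/* After batching: a single flush at the end, which here is a no-op. */
static unsigned int remap_flush_batched_noop(size_t nr_remaps)
{
	toy_flush_count = 0;
	for (size_t i = 0; i < nr_remaps; i++)
		(void)i;			/* remap only, flush deferred */
	toy_flush_all_noop();			/* expands to nothing */
	return toy_flush_count;
}
```

In this model, remap_flush_each(n) reports n flushes while
remap_flush_batched_noop(n) reports none, and the compiler gives no hint that
anything was lost.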
This commit fixes this by implementing flush_tlb_all() on s390 as an
alias to __tlb_flush_global(). This should cause a flush on all TLB
entries on all CPUs as expected by the flush_tlb_all() semantics.
Fixes: f13b83fdd996 ("hugetlb: batch TLB flushes when freeing vmemmap")
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
---
arch/s390/include/asm/tlbflush.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/s390/include/asm/tlbflush.h b/arch/s390/include/asm/tlbflush.h
index 75491baa21974..0d53993534840 100644
--- a/arch/s390/include/asm/tlbflush.h
+++ b/arch/s390/include/asm/tlbflush.h
@@ -103,9 +103,13 @@ static inline void __tlb_flush_mm_lazy(struct mm_struct * mm)
* flush_tlb_range functions need to do the flush.
*/
#define flush_tlb() do { } while (0)
-#define flush_tlb_all() do { } while (0)
#define flush_tlb_page(vma, addr) do { } while (0)
+static inline void flush_tlb_all(void)
+{
+ __tlb_flush_global();
+}
+
static inline void flush_tlb_mm(struct mm_struct *mm)
{
__tlb_flush_mm_lazy(mm);
--
2.51.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-28 21:15 [PATCH v2] s390: fix HugeTLB vmemmap optimization crash Luiz Capitulino
@ 2025-10-28 21:53 ` Andrew Morton
2025-10-30 14:59 ` Heiko Carstens
2025-10-29 6:36 ` Alexander Gordeev
2025-10-29 9:57 ` David Hildenbrand
2 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2025-10-28 21:53 UTC (permalink / raw)
To: Luiz Capitulino
Cc: hca, borntraeger, joao.m.martins, mike.kravetz, linux-kernel,
linux-mm, linux-s390, gor, gerald.schaefer, agordeev, osalvador,
david, aneesh.kumar
On Tue, 28 Oct 2025 17:15:33 -0400 Luiz Capitulino <luizcap@redhat.com> wrote:
> A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
> on s390. The crash and the proposed fix were worked on an s390 KVM guest
> running on an older hypervisor, as I don't have access to an LPAR. However,
> the same issue should occur on bare-metal.
>
> Reproducer (it may take a few runs to trigger):
>
> # sysctl vm.hugetlb_optimize_vmemmap=1
> # echo 1 > /proc/sys/vm/nr_hugepages
> # echo 0 > /proc/sys/vm/nr_hugepages
>
> ...
>
> This commit fixes this by implementing flush_tlb_all() on s390 as an
> alias to __tlb_flush_global(). This should cause a flush on all TLB
> entries on all CPUs as expected by the flush_tlb_all() semantics.
>
> ...
>
> arch/s390/include/asm/tlbflush.h | 6 +++++-
Thanks, I'll add this to mm.git. If s390 people prefer to merge it
(or nack it!) then please do so and I'll drop the mm.git copy.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-28 21:53 ` Andrew Morton
@ 2025-10-30 14:59 ` Heiko Carstens
0 siblings, 0 replies; 11+ messages in thread
From: Heiko Carstens @ 2025-10-30 14:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Luiz Capitulino, borntraeger, joao.m.martins, mike.kravetz,
linux-kernel, linux-mm, linux-s390, gor, gerald.schaefer,
agordeev, osalvador, david, aneesh.kumar
On Tue, Oct 28, 2025 at 02:53:34PM -0700, Andrew Morton wrote:
> On Tue, 28 Oct 2025 17:15:33 -0400 Luiz Capitulino <luizcap@redhat.com> wrote:
> > A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
> > on s390. The crash and the proposed fix were worked on an s390 KVM guest
> > running on an older hypervisor, as I don't have access to an LPAR. However,
> > the same issue should occur on bare-metal.
> >
> > Reproducer (it may take a few runs to trigger):
> >
> > # sysctl vm.hugetlb_optimize_vmemmap=1
> > # echo 1 > /proc/sys/vm/nr_hugepages
> > # echo 0 > /proc/sys/vm/nr_hugepages
> >
> > ...
> >
> > This commit fixes this by implementing flush_tlb_all() on s390 as an
> > alias to __tlb_flush_global(). This should cause a flush on all TLB
> > entries on all CPUs as expected by the flush_tlb_all() semantics.
> >
> > ...
> >
> > arch/s390/include/asm/tlbflush.h | 6 +++++-
>
> Thanks, I'll add this to mm.git. If s390 people prefer to merge it
> (or nack it!) then please do so and I'll drop the mm.git copy.
Andrew, could you drop this one please? After looking a bit deeper
into the real problem, this patch would just paper over the real bug
(and it could still happen).
I added you on Cc for the bug fix, but that is supposed to go via the
s390 tree - just in case you are wondering :)
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-28 21:15 [PATCH v2] s390: fix HugeTLB vmemmap optimization crash Luiz Capitulino
2025-10-28 21:53 ` Andrew Morton
@ 2025-10-29 6:36 ` Alexander Gordeev
2025-10-29 9:57 ` David Hildenbrand
2 siblings, 0 replies; 11+ messages in thread
From: Alexander Gordeev @ 2025-10-29 6:36 UTC (permalink / raw)
To: Luiz Capitulino
Cc: hca, borntraeger, joao.m.martins, mike.kravetz, linux-kernel,
linux-mm, linux-s390, gor, gerald.schaefer, osalvador, akpm,
david, aneesh.kumar
On Tue, Oct 28, 2025 at 05:15:33PM -0400, Luiz Capitulino wrote:
> A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
> on s390. The crash and the proposed fix were worked on an s390 KVM guest
> running on an older hypervisor, as I don't have access to an LPAR. However,
> the same issue should occur on bare-metal.
>
> Reproducer (it may take a few runs to trigger):
>
> # sysctl vm.hugetlb_optimize_vmemmap=1
> # echo 1 > /proc/sys/vm/nr_hugepages
> # echo 0 > /proc/sys/vm/nr_hugepages
>
> Crash log:
>
> [ 52.340369] list_del corruption. prev->next should be 000000d382110008, but was 000000d7116d3880. (prev=000000d7116d3910)
> [ 52.340420] ------------[ cut here ]------------
> [ 52.340424] kernel BUG at lib/list_debug.c:62!
> [ 52.340566] monitor event: 0040 ilc:2 [#1]SMP
> [ 52.340573] Modules linked in: ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng des_s390 libdes sha3_512_s390 sha3_256_s390 virtio_net virtio_blk net_failover sha_common failover dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390
> [ 52.340606] CPU: 1 UID: 0 PID: 1672 Comm: root-rep2 Kdump: loaded Not tainted 6.18.0-rc3 #1 NONE
> [ 52.340610] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
> [ 52.340611] Krnl PSW : 0704c00180000000 0000015710cda7fe (__list_del_entry_valid_or_report+0xfe/0x128)
> [ 52.340619] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [ 52.340622] Krnl GPRS: c0000000ffffefff 0000000100000027 000000000000006d 0000000000000000
> [ 52.340623] 000000d7116d35d8 000000d7116d35d0 0000000000000002 000000d7116d39b0
> [ 52.340625] 000000d7116d3880 000000d7116d3910 000000d7116d3910 000000d382110008
> [ 52.340626] 000003ffac1ccd08 000000d7116d39b0 0000015710cda7fa 000000d7116d37d0
> [ 52.340632] Krnl Code: 0000015710cda7ee: c020003e496f larl %r2,00000157114a3acc
> 0000015710cda7f4: c0e5ffd5280e brasl %r14,000001571077f810
> #0000015710cda7fa: af000000 mc 0,0
> >0000015710cda7fe: b9040029 lgr %r2,%r9
> 0000015710cda802: c0e5ffe5e193 brasl %r14,0000015710996b28
> 0000015710cda808: e34090080004 lg %r4,8(%r9)
> 0000015710cda80e: b9040059 lgr %r5,%r9
> 0000015710cda812: b9040038 lgr %r3,%r8
> [ 52.340643] Call Trace:
> [ 52.340645] [<0000015710cda7fe>] __list_del_entry_valid_or_report+0xfe/0x128
> [ 52.340649] ([<0000015710cda7fa>] __list_del_entry_valid_or_report+0xfa/0x128)
> [ 52.340652] [<0000015710a30b2e>] hugetlb_vmemmap_restore_folios+0x96/0x138
> [ 52.340655] [<0000015710a268ac>] update_and_free_pages_bulk+0x64/0x150
> [ 52.340659] [<0000015710a26f8a>] set_max_huge_pages+0x4ca/0x6f0
> [ 52.340662] [<0000015710a273ba>] hugetlb_sysctl_handler_common+0xea/0x120
> [ 52.340665] [<0000015710a27484>] hugetlb_sysctl_handler+0x44/0x50
> [ 52.340667] [<0000015710b53ffa>] proc_sys_call_handler+0x17a/0x280
> [ 52.340672] [<0000015710a90968>] vfs_write+0x2c8/0x3a0
> [ 52.340676] [<0000015710a90bd2>] ksys_write+0x72/0x100
> [ 52.340679] [<00000157111483a8>] __do_syscall+0x150/0x318
> [ 52.340682] [<0000015711153a5e>] system_call+0x6e/0x90
> [ 52.340684] Last Breaking-Event-Address:
> [ 52.340684] [<000001571077f85c>] _printk+0x4c/0x58
> [ 52.340690] Kernel panic - not syncing: Fatal exception: panic_on_oops
>
> This issue was introduced by commit f13b83fdd996 ("hugetlb: batch TLB
> flushes when freeing vmemmap"). Before that change, the HVO
> implementation called flush_tlb_kernel_range() each time a vmemmap
> PMD split and remapping was performed. The mentioned commit changed this
> to issue a few flush_tlb_all() calls after performing all remappings.
>
> However, on s390, flush_tlb_kernel_range() expands to
> __tlb_flush_kernel() while flush_tlb_all() is not implemented. As a
> result, we went from flushing the TLB for every remapping to no flushing
> at all.
>
> This commit fixes this by implementing flush_tlb_all() on s390 as an
> alias to __tlb_flush_global(). This should cause a flush on all TLB
> entries on all CPUs as expected by the flush_tlb_all() semantics.
>
> Fixes: f13b83fdd996 ("hugetlb: batch TLB flushes when freeing vmemmap")
> Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
> ---
> arch/s390/include/asm/tlbflush.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/include/asm/tlbflush.h b/arch/s390/include/asm/tlbflush.h
> index 75491baa21974..0d53993534840 100644
> --- a/arch/s390/include/asm/tlbflush.h
> +++ b/arch/s390/include/asm/tlbflush.h
> @@ -103,9 +103,13 @@ static inline void __tlb_flush_mm_lazy(struct mm_struct * mm)
> * flush_tlb_range functions need to do the flush.
> */
> #define flush_tlb() do { } while (0)
> -#define flush_tlb_all() do { } while (0)
> #define flush_tlb_page(vma, addr) do { } while (0)
>
> +static inline void flush_tlb_all(void)
> +{
> + __tlb_flush_global();
> +}
> +
> static inline void flush_tlb_mm(struct mm_struct *mm)
> {
> __tlb_flush_mm_lazy(mm);
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-28 21:15 [PATCH v2] s390: fix HugeTLB vmemmap optimization crash Luiz Capitulino
2025-10-28 21:53 ` Andrew Morton
2025-10-29 6:36 ` Alexander Gordeev
@ 2025-10-29 9:57 ` David Hildenbrand
2025-10-29 10:44 ` Heiko Carstens
2 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-10-29 9:57 UTC (permalink / raw)
To: Luiz Capitulino, hca, borntraeger, joao.m.martins, mike.kravetz,
linux-kernel, linux-mm, linux-s390, gor, gerald.schaefer,
agordeev
Cc: osalvador, akpm, aneesh.kumar
On 28.10.25 22:15, Luiz Capitulino wrote:
> A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
> on s390. The crash and the proposed fix were worked on an s390 KVM guest
> running on an older hypervisor, as I don't have access to an LPAR. However,
> the same issue should occur on bare-metal.
>
> Reproducer (it may take a few runs to trigger):
>
> # sysctl vm.hugetlb_optimize_vmemmap=1
> # echo 1 > /proc/sys/vm/nr_hugepages
> # echo 0 > /proc/sys/vm/nr_hugepages
>
> Crash log:
>
> [ 52.340369] list_del corruption. prev->next should be 000000d382110008, but was 000000d7116d3880. (prev=000000d7116d3910)
> [ 52.340420] ------------[ cut here ]------------
> [ 52.340424] kernel BUG at lib/list_debug.c:62!
> [ 52.340566] monitor event: 0040 ilc:2 [#1]SMP
> [ 52.340573] Modules linked in: ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng des_s390 libdes sha3_512_s390 sha3_256_s390 virtio_net virtio_blk net_failover sha_common failover dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390
> [ 52.340606] CPU: 1 UID: 0 PID: 1672 Comm: root-rep2 Kdump: loaded Not tainted 6.18.0-rc3 #1 NONE
> [ 52.340610] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
> [ 52.340611] Krnl PSW : 0704c00180000000 0000015710cda7fe (__list_del_entry_valid_or_report+0xfe/0x128)
> [ 52.340619] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [ 52.340622] Krnl GPRS: c0000000ffffefff 0000000100000027 000000000000006d 0000000000000000
> [ 52.340623] 000000d7116d35d8 000000d7116d35d0 0000000000000002 000000d7116d39b0
> [ 52.340625] 000000d7116d3880 000000d7116d3910 000000d7116d3910 000000d382110008
> [ 52.340626] 000003ffac1ccd08 000000d7116d39b0 0000015710cda7fa 000000d7116d37d0
> [ 52.340632] Krnl Code: 0000015710cda7ee: c020003e496f larl %r2,00000157114a3acc
> 0000015710cda7f4: c0e5ffd5280e brasl %r14,000001571077f810
> #0000015710cda7fa: af000000 mc 0,0
> >0000015710cda7fe: b9040029 lgr %r2,%r9
> 0000015710cda802: c0e5ffe5e193 brasl %r14,0000015710996b28
> 0000015710cda808: e34090080004 lg %r4,8(%r9)
> 0000015710cda80e: b9040059 lgr %r5,%r9
> 0000015710cda812: b9040038 lgr %r3,%r8
> [ 52.340643] Call Trace:
> [ 52.340645] [<0000015710cda7fe>] __list_del_entry_valid_or_report+0xfe/0x128
> [ 52.340649] ([<0000015710cda7fa>] __list_del_entry_valid_or_report+0xfa/0x128)
> [ 52.340652] [<0000015710a30b2e>] hugetlb_vmemmap_restore_folios+0x96/0x138
> [ 52.340655] [<0000015710a268ac>] update_and_free_pages_bulk+0x64/0x150
> [ 52.340659] [<0000015710a26f8a>] set_max_huge_pages+0x4ca/0x6f0
> [ 52.340662] [<0000015710a273ba>] hugetlb_sysctl_handler_common+0xea/0x120
> [ 52.340665] [<0000015710a27484>] hugetlb_sysctl_handler+0x44/0x50
> [ 52.340667] [<0000015710b53ffa>] proc_sys_call_handler+0x17a/0x280
> [ 52.340672] [<0000015710a90968>] vfs_write+0x2c8/0x3a0
> [ 52.340676] [<0000015710a90bd2>] ksys_write+0x72/0x100
> [ 52.340679] [<00000157111483a8>] __do_syscall+0x150/0x318
> [ 52.340682] [<0000015711153a5e>] system_call+0x6e/0x90
> [ 52.340684] Last Breaking-Event-Address:
> [ 52.340684] [<000001571077f85c>] _printk+0x4c/0x58
> [ 52.340690] Kernel panic - not syncing: Fatal exception: panic_on_oops
>
> This issue was introduced by commit f13b83fdd996 ("hugetlb: batch TLB
> flushes when freeing vmemmap"). Before that change, the HVO
> implementation called flush_tlb_kernel_range() each time a vmemmap
> PMD split and remapping was performed. The mentioned commit changed this
> to issue a few flush_tlb_all() calls after performing all remappings.
>
> However, on s390, flush_tlb_kernel_range() expands to
> __tlb_flush_kernel() while flush_tlb_all() is not implemented. As a
> result, we went from flushing the TLB for every remapping to no flushing
> at all.
>
> This commit fixes this by implementing flush_tlb_all() on s390 as an
> alias to __tlb_flush_global(). This should cause a flush on all TLB
> entries on all CPUs as expected by the flush_tlb_all() semantics.
>
> Fixes: f13b83fdd996 ("hugetlb: batch TLB flushes when freeing vmemmap")
> Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
> ---
Nice finding!
Makes me wonder whether the default flush_tlb_all() should actually map
to a BUILD_BUG(), such that we don't silently not-flush on archs that
don't implement it.
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-29 9:57 ` David Hildenbrand
@ 2025-10-29 10:44 ` Heiko Carstens
2025-10-29 12:15 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: Heiko Carstens @ 2025-10-29 10:44 UTC (permalink / raw)
To: David Hildenbrand
Cc: Luiz Capitulino, borntraeger, joao.m.martins, mike.kravetz,
linux-kernel, linux-mm, linux-s390, gor, gerald.schaefer,
agordeev, osalvador, akpm, aneesh.kumar
On Wed, Oct 29, 2025 at 10:57:15AM +0100, David Hildenbrand wrote:
> On 28.10.25 22:15, Luiz Capitulino wrote:
> > A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
> > on s390. The crash and the proposed fix were worked on an s390 KVM guest
> > running on an older hypervisor, as I don't have access to an LPAR. However,
> > the same issue should occur on bare-metal.
...
> > This commit fixes this by implementing flush_tlb_all() on s390 as an
> > alias to __tlb_flush_global(). This should cause a flush on all TLB
> > entries on all CPUs as expected by the flush_tlb_all() semantics.
> >
> > Fixes: f13b83fdd996 ("hugetlb: batch TLB flushes when freeing vmemmap")
> > Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
> > ---
>
> Nice finding!
>
> Makes me wonder whether the default flush_tlb_all() should actually map to a
> BUILD_BUG(), such that we don't silently not-flush on archs that don't
> implement it.
Which default flush_tlb_all()? :)
There was a no-op implementation for s390, and besides drivers/xen/balloon.c
there is only mm/hugetlb_vmemmap.c in common code which makes use of this. To
me it looks like both call sites only need to flush TLB entries of the kernel
address space. So I'd rather prefer if flush_tlb_all() would die instead.
But I'm also wondering about the correctness of the whole thing even with this
patch. If I'm not mistaken then vmemmap_split_pmd() changes an active pmd
entry of the kernel mapping. That is: an active leaf entry (aka large page) is
changed to an active entry pointing to a page table.
Changing active entries without the detour over an invalid entry or using
proper instructions like crdte or cspg is not allowed on s390. This was solved
for other parts that change active entries of the kernel mapping in an
architecture compliant way for s390 (see arch/s390/mm/pageattr.c).
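To make the sequence concrete, here is a toy userspace model of the "detour
over an invalid entry". It is a sketch under stated assumptions, not s390 code:
struct toy_pmd, the TOY_* flags and both toy_split_* helpers are hypothetical,
and real kernel code would use the proper instructions (crdte or cspg) or the
existing helpers in arch/s390/mm/pageattr.c instead. The model only checks one
rule: an entry must never go directly from one valid state to a different
valid state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy flags, made up for this sketch; they do not match real s390 bits. */
#define TOY_INVALID	0x1u	/* entry is not used for translation */
#define TOY_LARGE	0x2u	/* entry is a leaf mapping a large page */

struct toy_pmd {
	uint64_t val;		/* current entry contents */
	bool violated;		/* set if a valid entry was changed in place */
	unsigned int flushes;	/* number of modelled TLB flushes */
};

static bool toy_valid(uint64_t val)
{
	return !(val & TOY_INVALID);
}

/* Every store goes through here so the rule can be checked. */
static void toy_set(struct toy_pmd *pmd, uint64_t newval)
{
	if (toy_valid(pmd->val) && toy_valid(newval) && pmd->val != newval)
		pmd->violated = true;
	pmd->val = newval;
}

/* Direct replacement: valid leaf entry -> valid page table pointer. */
static void toy_split_direct(struct toy_pmd *pmd, uint64_t table)
{
	toy_set(pmd, table);
}

/* Detour: invalidate, flush, then install the new entry. */
static void toy_split_bbm(struct toy_pmd *pmd, uint64_t table)
{
	toy_set(pmd, pmd->val | TOY_INVALID);
	pmd->flushes++;
	toy_set(pmd, table);
}
```

Starting from a valid large-page entry, toy_split_direct() trips the violated
flag, while toy_split_bbm() reaches the same final entry without doing so, at
the cost of one flush in between.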
Am I missing something?
Gerald, since you enabled the corresponding Kconfig option for s390: is there
any reason why this should work in an architecture compliant way?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-29 10:44 ` Heiko Carstens
@ 2025-10-29 12:15 ` David Hildenbrand
2025-10-29 12:49 ` Heiko Carstens
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-10-29 12:15 UTC (permalink / raw)
To: Heiko Carstens
Cc: Luiz Capitulino, borntraeger, joao.m.martins, mike.kravetz,
linux-kernel, linux-mm, linux-s390, gor, gerald.schaefer,
agordeev, osalvador, akpm, aneesh.kumar
On 29.10.25 11:44, Heiko Carstens wrote:
> On Wed, Oct 29, 2025 at 10:57:15AM +0100, David Hildenbrand wrote:
>> On 28.10.25 22:15, Luiz Capitulino wrote:
>>> A reproducible crash occurs when enabling HugeTLB vmemmap optimization (HVO)
>>> on s390. The crash and the proposed fix were worked on an s390 KVM guest
>>> running on an older hypervisor, as I don't have access to an LPAR. However,
>>> the same issue should occur on bare-metal.
> ...
>>> This commit fixes this by implementing flush_tlb_all() on s390 as an
>>> alias to __tlb_flush_global(). This should cause a flush on all TLB
>>> entries on all CPUs as expected by the flush_tlb_all() semantics.
>>>
>>> Fixes: f13b83fdd996 ("hugetlb: batch TLB flushes when freeing vmemmap")
>>> Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
>>> ---
>>
>> Nice finding!
>>
>> Makes me wonder whether the default flush_tlb_all() should actually map to a
>> BUILD_BUG(), such that we don't silently not-flush on archs that don't
>> implement it.
>
> Which default flush_tlb_all()? :)
What I meant is: all such functions that an architecture doesn't expect
to be called because they are effectively unimplemented.
Taking a look at flush_tlb_all(), there is really only a dummy
implementation on s390x and on riscv without MMU.
So yeah, there is no "default" fallback one :)
BTW, I'm staring at s390x's flush_tlb() function and wonder why that one
is defined. I'm sure there is a good reason ;)
>
> There was a no-op implementation for s390, and besides drivers/xen/balloon.c
> there is only mm/hugetlb_vmemmap.c in common code which makes use of this. To
> me it looks like both call sites only need to flush TLB entries of the kernel
> address space. So I'd rather prefer if flush_tlb_all() would die instead.
I'd assume that we only modify the kernel virtual address space, so I agree.
>
> But I'm also wondering about the correctness of the whole thing even with this
> patch. If I'm not mistaken then vmemmap_split_pmd() changes an active pmd
> entry of the kernel mapping. That is: an active leaf entry (aka large page) is
> changed to an active entry pointing to a page table.
That's my understanding as well.
>
> Changing active entries without the detour over an invalid entry or using
> proper instructions like crdte or cspg is not allowed on s390. This was solved
> for other parts that change active entries of the kernel mapping in an
> architecture compliant way for s390 (see arch/s390/mm/pageattr.c).
Good point. I recall ARM64 has similar break-before-make requirements
because they cannot tolerate two different TLB entries (small vs. large)
for the same virtual address.
And if I remember correctly, that's the reason why arm64 does not enable
ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP just yet.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-29 12:15 ` David Hildenbrand
@ 2025-10-29 12:49 ` Heiko Carstens
2025-10-30 14:38 ` Gerald Schaefer
0 siblings, 1 reply; 11+ messages in thread
From: Heiko Carstens @ 2025-10-29 12:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Luiz Capitulino, borntraeger, joao.m.martins, mike.kravetz,
linux-kernel, linux-mm, linux-s390, gor, gerald.schaefer,
agordeev, osalvador, akpm, aneesh.kumar
On Wed, Oct 29, 2025 at 01:15:44PM +0100, David Hildenbrand wrote:
> BTW, I'm staring at s390x's flush_tlb() function and wonder why that one is
> defined. I'm sure there is a good reason ;)
Yes, I stumbled across that yesterday evening as well. I think its only
purpose is that it wants to be deleted :). I just didn't do it yet since I
don't want to see a merge conflict with this patch.
I also need to check if the only usage of flush_tlb_page(), which is also a
no-op for s390, in mm/memory.c is not indicating a problem too.
> > Changing active entries without the detour over an invalid entry or using
> > proper instructions like crdte or cspg is not allowed on s390. This was solved
> > for other parts that change active entries of the kernel mapping in an
> > architecture compliant way for s390 (see arch/s390/mm/pageattr.c).
>
> Good point. I recall ARM64 has similar break-before-make requirements
> because they cannot tolerate two different TLB entries (small vs. large) for
> the same virtual address.
>
> And if I remember correctly, that's the reason why arm64 does not enable
> ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP just yet.
Ok, let's wait for Gerald. Maybe there is a non-obvious reason why this works
anyway.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-29 12:49 ` Heiko Carstens
@ 2025-10-30 14:38 ` Gerald Schaefer
2025-10-30 14:54 ` Luiz Capitulino
0 siblings, 1 reply; 11+ messages in thread
From: Gerald Schaefer @ 2025-10-30 14:38 UTC (permalink / raw)
To: Heiko Carstens
Cc: David Hildenbrand, Luiz Capitulino, borntraeger, joao.m.martins,
mike.kravetz, linux-kernel, linux-mm, linux-s390, gor, agordeev,
osalvador, akpm, aneesh.kumar
On Wed, 29 Oct 2025 13:49:53 +0100
Heiko Carstens <hca@linux.ibm.com> wrote:
> On Wed, Oct 29, 2025 at 01:15:44PM +0100, David Hildenbrand wrote:
> > BTW, I'm staring at s390x's flush_tlb() function and wonder why that one is
> > defined. I'm sure there is a good reason ;)
>
> Yes, I stumbled across that yesterday evening as well. I think its only
> purpose is that it wants to be deleted :). I just didn't do it yet since I
> don't want to see a merge conflict with this patch.
>
> I also need to check if the only usage of flush_tlb_page(), which is also a
> no-op for s390, in mm/memory.c is not indicating a problem too.
>
> > > Changing active entries without the detour over an invalid entry or using
> > > proper instructions like crdte or cspg is not allowed on s390. This was solved
> > > for other parts that change active entries of the kernel mapping in an
> > > architecture compliant way for s390 (see arch/s390/mm/pageattr.c).
> >
> > Good point. I recall ARM64 has similar break-before-make requirements
> > because they cannot tolerate two different TLB entries (small vs. large) for
> > the same virtual address.
> >
> > And if I remember correctly, that's the reason why arm64 does not enable
> > ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP just yet.
>
> Ok, let's wait for Gerald. Maybe there is a non-obvious reason why this works
> anyway.
No, using pmd_populate_kernel() on an active/valid PMD in vmemmap_split_pmd()
should violate the architecture, as you described. So this would not work
with current code, and also should not have worked when I did the change,
or only by chance.
Therefore, we should disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP again, for
now. Doing it right would most likely require common code changes and
CRDTE / CSPG usage on s390. Not sure if this feature is really worth the
hassle, reading all the drawbacks that I mentioned in my commit 00a34d5a99c0
("s390: select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP").
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-30 14:38 ` Gerald Schaefer
@ 2025-10-30 14:54 ` Luiz Capitulino
2025-10-30 14:56 ` Heiko Carstens
0 siblings, 1 reply; 11+ messages in thread
From: Luiz Capitulino @ 2025-10-30 14:54 UTC (permalink / raw)
To: Gerald Schaefer, Heiko Carstens
Cc: David Hildenbrand, borntraeger, joao.m.martins, mike.kravetz,
linux-kernel, linux-mm, linux-s390, gor, agordeev, osalvador,
akpm, aneesh.kumar
On 2025-10-30 10:38, Gerald Schaefer wrote:
> On Wed, 29 Oct 2025 13:49:53 +0100
> Heiko Carstens <hca@linux.ibm.com> wrote:
>
>> On Wed, Oct 29, 2025 at 01:15:44PM +0100, David Hildenbrand wrote:
>>> BTW, I'm staring at s390x's flush_tlb() function and wonder why that one is
>>> defined. I'm sure there is a good reason ;)
>>
>> Yes, I stumbled across that yesterday evening as well. I think its only
>> purpose is that it wants to be deleted :). I just didn't do it yet since I
>> don't want to see a merge conflict with this patch.
>>
>> I also need to check if the only usage of flush_tlb_page(), which is also a
>> no-op for s390, in mm/memory.c is not indicating a problem too.
>>
>>>> Changing active entries without the detour over an invalid entry or using
>>>> proper instructions like crdte or cspg is not allowed on s390. This was solved
>>>> for other parts that change active entries of the kernel mapping in an
>>>> architecture compliant way for s390 (see arch/s390/mm/pageattr.c).
>>>
>>> Good point. I recall ARM64 has similar break-before-make requirements
>>> because they cannot tolerate two different TLB entries (small vs. large) for
>>> the same virtual address.
>>>
>>> And if I remember correctly, that's the reason why arm64 does not enable
>>> ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP just yet.
>>
>> Ok, let's wait for Gerald. Maybe there is a non-obvious reason why this works
>> anyway.
>
> No, using pmd_populate_kernel() on an active/valid PMD in vmemmap_split_pmd()
> should violate the architecture, as you described. So this would not work
> with current code, and also should not have worked when I did the change,
> or only by chance.
>
> Therefore, we should disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP again, for
> now. Doing it right would most likely require common code changes and
> CRDTE / CSPG usage on s390. Not sure if this feature is really worth the
> hassle, reading all the drawbacks that I mentioned in my commit 00a34d5a99c0
> ("s390: select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP").
OK, let's do the right thing. Do you plan to post a patch? I can do it
if you would like.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v2] s390: fix HugeTLB vmemmap optimization crash
2025-10-30 14:54 ` Luiz Capitulino
@ 2025-10-30 14:56 ` Heiko Carstens
0 siblings, 0 replies; 11+ messages in thread
From: Heiko Carstens @ 2025-10-30 14:56 UTC (permalink / raw)
To: Luiz Capitulino
Cc: Gerald Schaefer, David Hildenbrand, borntraeger, joao.m.martins,
mike.kravetz, linux-kernel, linux-mm, linux-s390, gor, agordeev,
osalvador, akpm, aneesh.kumar
On Thu, Oct 30, 2025 at 10:54:47AM -0400, Luiz Capitulino wrote:
> On 2025-10-30 10:38, Gerald Schaefer wrote:
> > On Wed, 29 Oct 2025 13:49:53 +0100
> > Heiko Carstens <hca@linux.ibm.com> wrote:
> > Therefore, we should disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP again, for
> > now. Doing it right would most likely require common code changes and
> > CRDTE / CSPG usage on s390. Not sure if this feature is really worth the
> > hassle, reading all the drawbacks that I mentioned in my commit 00a34d5a99c0
> > ("s390: select ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP").
>
> OK, let's do the right thing. Do you plan to post a patch? I can do it
> if you would like.
It is already in your inbox :)
^ permalink raw reply [flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2025-10-28 21:15 [PATCH v2] s390: fix HugeTLB vmemmap optimization crash Luiz Capitulino
2025-10-28 21:53 ` Andrew Morton
2025-10-30 14:59 ` Heiko Carstens
2025-10-29 6:36 ` Alexander Gordeev
2025-10-29 9:57 ` David Hildenbrand
2025-10-29 10:44 ` Heiko Carstens
2025-10-29 12:15 ` David Hildenbrand
2025-10-29 12:49 ` Heiko Carstens
2025-10-30 14:38 ` Gerald Schaefer
2025-10-30 14:54 ` Luiz Capitulino
2025-10-30 14:56 ` Heiko Carstens