From: Christian Borntraeger <borntraeger@linux.ibm.com>
To: David Hildenbrand <david@redhat.com>,
Balbir Singh <balbirs@nvidia.com>,
Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Liam.Howlett@oracle.com, airlied@gmail.com,
akpm@linux-foundation.org, apopple@nvidia.com, baohua@kernel.org,
baolin.wang@linux.alibaba.com, byungchul@sk.com, dakr@kernel.org,
dev.jain@arm.com, dri-devel@lists.freedesktop.org,
francois.dugast@intel.com, gourry@gourry.net,
joshua.hahnjy@gmail.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, lorenzo.stoakes@oracle.com, lyude@redhat.com,
matthew.brost@intel.com, mpenttil@redhat.com, npache@redhat.com,
osalvador@suse.de, rakie.kim@sk.com, rcampbell@nvidia.com,
ryan.roberts@arm.com, simona@ffwll.ch,
ying.huang@linux.alibaba.com, ziy@nvidia.com,
kvm@vger.kernel.org, linux-s390@vger.kernel.org,
linux-next@vger.kernel.org
Subject: Re: linux-next: KVM/s390x regression
Date: Mon, 20 Oct 2025 09:01:35 +0200
Message-ID: <c163a247-4f02-4010-a860-5060e34a34db@linux.ibm.com>
In-Reply-To: <cb85aaa3-e456-4fd8-b323-46c75d453a02@redhat.com>
On 18.10.25 at 00:41, David Hildenbrand wrote:
> On 18.10.25 00:15, David Hildenbrand wrote:
>> On 17.10.25 23:56, Balbir Singh wrote:
>>> On 10/18/25 04:07, David Hildenbrand wrote:
>>>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>>>
>>>>>
>>>>> On 17.10.25 at 17:07, David Hildenbrand wrote:
>>>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>>>> On 17.10.25 at 16:54, David Hildenbrand wrote:
>>>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>>>> This patch triggers a regression for s390x KVM, as QEMU guests can no longer start:
>>>>>>>>>
>>>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>>>
>>>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>>>
>>>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>>>
>>>>>>> We have other weirdness in linux-next, but in different areas. Could that somehow be
>>>>>>> related to us disabling THP for the KVM address space?
>>>>>>
>>>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>>>
>>>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>>>
>>>>>>
>>>>>> I assume your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>>>
>>>>> yes.
>>>>>
>>>>>>
>>>>>> I'd rule out copy_huge_pmd() and zap_huge_pmd() as well.
>>>>>>
>>>>>>
>>>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>>>
>>>>> That partial revert seems to fix the issue
>>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>>> index 0c847cdf4fd3..567e2d084071 100644
>>>>> --- a/mm/pgtable-generic.c
>>>>> +++ b/mm/pgtable-generic.c
>>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>>  	if (pmdvalp)
>>>>>  		*pmdvalp = pmdval;
>>>>> -	if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>>> +	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>>>
>>>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>>>
>>>> And I would expect that it's a page table, because otherwise the change
>>>> wouldn't make a difference.
>>>>
>>>> And the weird thing is that this only triggers sometimes, because if
>>>> it always triggered, nothing would ever work.
>>>>
>>>> Is there some weird scenario where s390x might set a leaf page table mapped in a PMD to non-present?
>>>>
>>>
>>> Good point
>>>
>>>> Staring at the definition of pmd_present() on s390x it's really just
>>>>
>>>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>>>
>>>>
>>>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>>>
>>>
>>>
>>> I am not an s390 expert, but just looking at the code
>>>
>>> So the check on s390 is effectively:
>>>
>>> segment_entry/present = false or segment_entry_empty/invalid = true
>>
>> pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
>>
>> because
>>
>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>
>> is the same as
>>
>> return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
>>
>> But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
>>
>> I suspect that can only be the gmap tables.
>>
>> Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
>> because it's a software managed bit for "ordinary" page tables, not gmap
>> tables.
>>
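The hypothesis above can be sketched as a small userspace C model. Everything in it is an assumption made for illustration: the model_ names and the bit values are hypothetical and do not match the real s390 definitions. It only tries to show why an entry that points at a page table but lacks the present bit would pass the is_pmd_migration_entry() style check and yet be rejected by the !pmd_present() style check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration only, not the real s390 values. */
#define MODEL_ENTRY_PRESENT   0x1ull    /* stands in for _SEGMENT_ENTRY_PRESENT */
#define MODEL_ENTRY_MIGRATION 0x2ull    /* marks a modelled migration entry */
#define MODEL_TABLE_ADDR      0x1000ull /* pretend address of a PTE table */

typedef uint64_t model_pmd_t;

static bool model_pmd_none(model_pmd_t pmd)
{
	return pmd == 0;
}

static bool model_pmd_present(model_pmd_t pmd)
{
	return (pmd & MODEL_ENTRY_PRESENT) != 0;
}

static bool model_is_migration_entry(model_pmd_t pmd)
{
	return !model_pmd_present(pmd) && (pmd & MODEL_ENTRY_MIGRATION) != 0;
}

/* Shape of the check that the partial revert restores. */
static bool reverted_check_rejects(model_pmd_t pmd)
{
	return model_pmd_none(pmd) || model_is_migration_entry(pmd);
}

/* Shape of the check that the partial revert removes. */
static bool next_check_rejects(model_pmd_t pmd)
{
	return model_pmd_none(pmd) || !model_pmd_present(pmd);
}

static void model_demo(void)
{
	model_pmd_t user_pmd = MODEL_TABLE_ADDR | MODEL_ENTRY_PRESENT;
	model_pmd_t gmap_pmd = MODEL_TABLE_ADDR; /* present bit never set */

	/* An ordinary user entry is accepted by both checks. */
	assert(!reverted_check_rejects(user_pmd) && !next_check_rejects(user_pmd));
	/* The gmap-style entry is accepted by one check and rejected by the other. */
	assert(!reverted_check_rejects(gmap_pmd) && next_check_rejects(gmap_pmd));
}
```

Calling model_demo() exercises both cases. Whether the real gmap entries look like gmap_pmd here is exactly the open question in this thread, not something the model can confirm.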
>> Which raises the question why someone would wrongly use
>> pte_offset_map()/__pte_offset_map() on the gmap tables.
>>
>> I cannot immediately spot any such usage in kvm/gmap code, though.
>>
>
> Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.
>
> Oh my.
>
> So we're mapping a user PTE table that is linked into the gmap tables through a PMD table that does not have the software bits set that we would expect in a user PMD table.
>
> What's also scary is that pte_alloc_map_lock() would try to pte_alloc() a user page table in the gmap, which sounds completely wrong?
>
> Yeah, when walking the gmap and wanting to lock the linked user PTE table, we should probably never use the pte_*map variants but obtain
> the lock through pte_lockptr().
>
> All the magic we end up doing with RCU etc. in __pte_offset_map_lock()
> does not apply to the gmap PMD table.
>
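The difference between the two locking approaches can be sketched in a userspace C model. All names here are hypothetical stand-ins, the lock is a plain int rather than a spinlock, and no RCU is modelled. The assumption being illustrated is only this: a map-and-lock helper validates the PMD entry first and may return NULL, whereas a lockptr-style helper merely locates the lock of the PTE table the entry points to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ENTRY_PRESENT 0x1ull /* hypothetical software bit */
#define MODEL_NR_PTES 4

struct model_pte_table {
	int lock_held;               /* stands in for the PTE table lock */
	uint64_t pte[MODEL_NR_PTES];
};

struct model_pmd {
	uint64_t flags;
	struct model_pte_table *table;
};

/* Models the map-and-lock family: validates the PMD first, may return NULL. */
static uint64_t *model_map_lock(struct model_pmd *pmd, size_t idx, int **lockp)
{
	if (!pmd->table || !(pmd->flags & MODEL_ENTRY_PRESENT))
		return NULL;
	pmd->table->lock_held = 1;
	*lockp = &pmd->table->lock_held;
	return &pmd->table->pte[idx];
}

/* Models the lockptr approach: only locates the lock, checks no bits. */
static int *model_lockptr(struct model_pmd *pmd)
{
	return &pmd->table->lock_held;
}

static void model_lock_demo(void)
{
	struct model_pte_table table = { 0 };
	struct model_pmd gmap_pmd = { .flags = 0, .table = &table };
	int *lock = NULL;

	/* The validating helper refuses the entry without the software bit. */
	assert(model_map_lock(&gmap_pmd, 0, &lock) == NULL);

	/* The lockptr-style helper still reaches the table's lock. */
	lock = model_lockptr(&gmap_pmd);
	*lock = 1;
	assert(table.lock_held == 1);
	*lock = 0;
}
```

In this model the caller of model_lockptr() is responsible for knowing that the entry really points at a PTE table, which matches the situation described above, where the gmap walker already knows what it linked.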
CC Claudio.