Morning, > On Mar 26, 2026, at 1:50 AM, Andrew Morton wrote: > > […] > > AI review has a couple of questions: > > […] > > Is there a race condition between the validation of pudval and the > dereference in pmd_offset()? I don’t think so - as the comment states: If we spot a PMD table, it will not go away, so we can continue to walk it. > Since pmd_offset() dereferences the original pud pointer rather than > using the validated pudval snapshot, could a concurrent thread collapse > the PUD into a huge leaf right before pmd_offset() is called? > > If that happens, it looks like pmd_offset() might compute a page table > pointer using the physical address of the huge leaf, which would then be > dereferenced, leading to the same crash this is trying to prevent. It shouldn’t change, and the check specifically confirms that the underlying entry is a table which can not be changed into a huge leaf (given that we have the mmap read lock). > Should pmd_offset() be passed the address of the snapshot (&pudval) instead? No, the snapshot just wraps it in READ_ONCE so that we have guaranteed coherence when doing the check. Using the reference does not have any benefits in my perspective - I also don’t see it doing much harm per se, but let’s not increase the scope of the change unnecessarily. > Can this loop infinitely on unsplittable PUD leaves? > > If walk_pmd_range() encounters a PUD leaf (such as a VFIO or DAX mapping) > and returns ACTION_AGAIN, this code jumps back to the again label. > > During the retry, split_huge_pud() is called, but it only splits Transparent > Huge Pages. For device memory mapped as large PUD leaves, split_huge_pud() > does nothing and the entry remains a leaf. > > When walk_pmd_range() is called again, it will see the same leaf entry and > return ACTION_AGAIN, creating a deterministic infinite loop while holding > the mmap lock. We shouldn’t hit an infinite loop, theoretically I guess that we can when the two threads are concurrently splitting and refaulting in perfect lockstep, and it is something discussed in another patch [1], but adding extra locking is quite expensive for something so astronomically small. With regards to “unsplittable” PUDs such as VFIO (i presume this refers to device PFNMAPs) or DAX (special) mappings. As far as I’m aware, if something’s a vma it should be splittable like this. Previously split_huge functions had an extra check for pud_devmap(), and the trans_huge check should return true for VFIO mappings (i.e. on x86, it just checks whether the PSE bit is set which happens for huge PFNMAPs). [1] https://lore.kernel.org/all/45e50068-751c-4e8c-a6b0-62cf8d1e58e6@kernel.org/