From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com, linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He, willy@infradead.org,
	youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
	shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org,
	baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com,
	npache@redhat.com, Liam R. Howlett, ryan.roberts@arm.com,
	Vlastimil Babka, lance.yang@linux.dev, linux-kernel@vger.kernel.org,
	nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com,
	Usama Arif
Subject: [PATCH v3 06/11] mm: handle PMD swap entries in non-present PMD walkers
Date: Fri, 3 Jul 2026 10:38:23 -0700
Message-ID: <20260703173903.3789516-7-usama.arif@linux.dev>
In-Reply-To: <20260703173903.3789516-1-usama.arif@linux.dev>
References: <20260703173903.3789516-1-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

Teach the remaining non-present PMD walkers about swap entries,
mirroring the PTE-level equivalents.

smaps_pmd_entry() accounts swap and swap_pss via a new shared
smaps_account_swap() helper used by both PTE and PMD paths.

move_soft_dirty_pmd(), clear_soft_dirty_pmd(), make_uffd_wp_pmd(),
pagemap_pmd_range_thp(), and change_huge_pmd() handle swap entries
alongside migration entries.
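The proportional accounting that smaps_account_swap() shares between the PTE and PMD paths can be illustrated with a small userspace sketch. This is not the kernel code: the struct, the function name, and the SKETCH_PSS_SHIFT value are assumptions made for illustration, and the swap count is passed in directly instead of being looked up with swp_swapcount().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift; stands in for the kernel's PSS_SHIFT. */
#define SKETCH_PSS_SHIFT 12

/* Hypothetical stand-in for the swap fields of struct mem_size_stats. */
struct swap_stats_sketch {
	uint64_t swap;		/* bytes of swap referenced */
	uint64_t swap_pss;	/* proportional share, scaled by 2^SKETCH_PSS_SHIFT */
};

/*
 * Charge "size" bytes of swap. "swapcount" plays the role of the value
 * the patch obtains from swp_swapcount(): with two or more sharers the
 * proportional share is divided by the count, otherwise the full size
 * is charged.
 */
static void account_swap_sketch(struct swap_stats_sketch *st, int swapcount,
				uint64_t size)
{
	uint64_t pss_delta = size << SKETCH_PSS_SHIFT;

	assert(swapcount >= 0);
	st->swap += size;
	if (swapcount >= 2)
		pss_delta /= (uint64_t)swapcount;
	st->swap_pss += pss_delta;
}
```

In this sketch a PTE-level caller would pass one base page as the size and a PMD-level caller would pass the huge page size, which is the only difference between the two call sites in the patch.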

hmm_vma_handle_absent_pmd() faults in PMD swap entries via
hmm_vma_fault() instead of returning -EFAULT. The first per-page
handle_mm_fault() call triggers do_huge_pmd_swap_page(), which maps the
entire folio; subsequent calls reduce to harmless
huge_pmd_set_accessed() calls, and the walker retries with a present
PMD.

madvise_free_huge_pmd() handles PMD swap entries directly: for a
full-range MADV_FREE it clears the PMD, frees the deposited page table,
and releases the swap slots; for a partial range it splits to PTE swap
entries. Without this, MADV_FREE silently becomes a no-op on swapped-out
THPs, leaking swap slots.

zap_huge_pmd() frees swap slots via swap_put_entries_direct(), matching
zap_nonpresent_ptes().

change_non_present_huge_pmd() skips write-permission changes for swap
entries and only updates uffd_wp, matching change_softleaf_pte().

madvise_cold_or_pageout_pte_range() skips PMD swap entries early.
MADV_COLD and MADV_PAGEOUT operate on resident folios, so a swapped-out
THP has nothing to deactivate or reclaim; skipping also prevents the
walker from descending into or splitting the PMD swap entry. The locked
THP path also treats a racing PMD swap entry as handled before checking
for other non-present PMD types.

mincore_pte_range() routes PMD swap entries found under
pmd_trans_huge_lock() through a new mincore_pmd_swap() helper, matching
how the PTE path already calls mincore_swap() for non-present PTEs.
Without this a swapped-out PMD-mapped THP would be reported as resident,
because pmd_is_huge() (and therefore pmd_trans_huge_lock()) accepts any
non-present non-none PMD and the old branch unconditionally did
memset(vec, 1, nr). Other non-present PMDs (migration / device-private
entries) keep the prior behavior and are still reported as resident. For
swap entries, mincore_pmd_swap() checks swap-cache residency and falls
back to per-slot mincore_swap() calls if the PMD-sized swap-cache folio
was split.

queue_folios_pmd() in mempolicy silently skips swap entries, matching
the PTE walker which only counts migration entries as failures.
Without this, mbind(MPOL_MF_STRICT) would spuriously return -EIO on a
swapped-out THP.

check_pmd_state() in khugepaged returns SCAN_PMD_MAPPED for PMD swap
entries, treating a swapped-out THP as still being a THP from
khugepaged's perspective and matching the existing migration-entry
handling.

Signed-off-by: Usama Arif
---
 fs/proc/task_mmu.c | 43 +++++++++++++++++++++-------------
 mm/hmm.c           |  3 ++-
 mm/huge_memory.c   | 58 +++++++++++++++++++++++++++++++++++-----------
 mm/khugepaged.c    |  6 +++++
 mm/madvise.c       | 14 ++++++++++-
 mm/mempolicy.c     |  2 ++
 mm/mincore.c       | 45 ++++++++++++++++++++++++++++++++++-
 7 files changed, 139 insertions(+), 32 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1fb5acd88ad0..f85899eec80f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1046,6 +1046,23 @@ static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
 #endif
 }
 
+static void smaps_account_swap(struct mem_size_stats *mss,
+			       softleaf_t entry, unsigned long size)
+{
+	int mapcount;
+
+	mss->swap += size;
+	mapcount = swp_swapcount(entry);
+	if (mapcount >= 2) {
+		u64 pss_delta = (u64)size << PSS_SHIFT;
+
+		do_div(pss_delta, mapcount);
+		mss->swap_pss += pss_delta;
+	} else {
+		mss->swap_pss += (u64)size << PSS_SHIFT;
+	}
+}
+
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		struct mm_walk *walk)
 {
@@ -1067,18 +1084,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		const softleaf_t entry = softleaf_from_pte(ptent);
 
 		if (softleaf_is_swap(entry)) {
-			int mapcount;
-
-			mss->swap += PAGE_SIZE;
-			mapcount = swp_swapcount(entry);
-			if (mapcount >= 2) {
-				u64 pss_delta = (u64)PAGE_SIZE << PSS_SHIFT;
-
-				do_div(pss_delta, mapcount);
-				mss->swap_pss += pss_delta;
-			} else {
-				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
-			}
+			smaps_account_swap(mss, entry, PAGE_SIZE);
 		} else if (softleaf_has_pfn(entry)) {
 			if (softleaf_is_device_private(entry))
 				present = true;
@@ -1108,9 +1114,13 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (pmd_present(*pmd)) {
 		page = vm_normal_page_pmd(vma, addr, *pmd);
 		present = true;
-	} else if (unlikely(thp_migration_supported())) {
+	} else {
 		const softleaf_t entry = softleaf_from_pmd(*pmd);
 
+		if (softleaf_is_swap(entry)) {
+			smaps_account_swap(mss, entry, HPAGE_PMD_SIZE);
+			return;
+		}
 		if (softleaf_has_pfn(entry))
 			page = softleaf_to_page(entry);
 	}
@@ -1752,7 +1762,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 
 		pmd = pmd_clear_soft_dirty(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-	} else if (pmd_is_migration_entry(pmd)) {
+	} else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
 		pmd = pmd_swp_clear_soft_dirty(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
@@ -2112,7 +2122,8 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
 			flags |= PM_UFFD_WP;
 		if (pm->show_pfn)
 			frame = pmd_pfn(pmd) + idx;
-	} else if (thp_migration_supported()) {
+	} else if (pmd_is_swap_entry(pmd) ||
+		   (thp_migration_supported() && pmd_is_migration_entry(pmd))) {
 		const softleaf_t entry = softleaf_from_pmd(pmd);
 		unsigned long offset;
 
@@ -2550,7 +2561,7 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
 		old = pmdp_invalidate_ad(vma, addr, pmdp);
 		pmd = pmd_mkuffd_wp(old);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-	} else if (pmd_is_migration_entry(pmd)) {
+	} else if (pmd_is_migration_entry(pmd) || pmd_is_swap_entry(pmd)) {
 		pmd = pmd_swp_mkuffd_wp(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
diff --git a/mm/hmm.c b/mm/hmm.c
index 4f3f627d2b47..c5356910c580 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -370,7 +370,8 @@ static int hmm_vma_handle_absent_pmd(struct mm_walk *walk, unsigned long start,
 	required_fault = hmm_range_need_fault(hmm_vma_walk, hmm_pfns,
 					      npages, 0);
 	if (required_fault) {
-		if (softleaf_is_device_private(entry))
+		if (softleaf_is_device_private(entry) ||
+		    softleaf_is_swap(entry))
 			return hmm_vma_fault(addr, end, required_fault, walk);
 		else
 			return -EFAULT;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 69e4e09ac1f6..4cbd6123bf18 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2312,6 +2312,14 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+	pgtable_t pgtable;
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pte_free(mm, pgtable);
+	mm_dec_nr_ptes(mm);
+}
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -2336,8 +2344,23 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		goto out;
 
 	if (unlikely(!pmd_present(orig_pmd))) {
+		if (pmd_is_swap_entry(orig_pmd)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				spin_unlock(ptl);
+				__split_huge_pmd(vma, pmd, addr, false);
+				goto out_unlocked;
+			}
+			softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+			pmdp_huge_get_and_clear(mm, addr, pmd);
+			zap_deposited_table(mm, pmd);
+			spin_unlock(ptl);
+			swap_put_entries_direct(sl, HPAGE_PMD_NR);
+			add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+			return true;
+		}
 		VM_BUG_ON(thp_migration_supported() &&
-				  !pmd_is_migration_entry(orig_pmd));
+			  !pmd_is_migration_entry(orig_pmd));
 		goto out;
 	}
 
@@ -2386,15 +2409,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
-	pgtable_t pgtable;
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pte_free(mm, pgtable);
-	mm_dec_nr_ptes(mm);
-}
-
 static void zap_huge_pmd_folio(struct mm_struct *mm, struct vm_area_struct *vma,
 		pmd_t pmdval, struct folio *folio, bool is_present)
 {
@@ -2487,6 +2501,16 @@ bool zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	arch_check_zapped_pmd(vma, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 
+	if (pmd_is_swap_entry(orig_pmd)) {
+		softleaf_t sl = softleaf_from_pmd(orig_pmd);
+
+		zap_deposited_table(mm, pmd);
+		spin_unlock(ptl);
+		swap_put_entries_direct(sl, HPAGE_PMD_NR);
+		add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+		return true;
+	}
+
 	is_present = pmd_present(orig_pmd);
 	folio = normal_or_softleaf_folio_pmd(vma, addr, orig_pmd, is_present);
 	has_deposit = has_deposited_pgtable(vma, orig_pmd, folio);
@@ -2519,7 +2543,8 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 static pmd_t move_soft_dirty_pmd(pmd_t pmd)
 {
 	if (pgtable_supports_soft_dirty()) {
-		if (unlikely(pmd_is_migration_entry(pmd)))
+		if (unlikely(pmd_is_migration_entry(pmd) ||
+			     pmd_is_swap_entry(pmd)))
 			pmd = pmd_swp_mksoft_dirty(pmd);
 		else if (pmd_present(pmd))
 			pmd = pmd_mksoft_dirty(pmd);
@@ -2599,7 +2624,14 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
 	pmd_t newpmd;
 
 	VM_WARN_ON(!pmd_is_valid_softleaf(*pmd));
-	if (softleaf_is_migration_write(entry)) {
+
+	/*
+	 * PMD swap entries don't encode write permission in the entry type,
+	 * so only uffd_wp flag changes apply. No folio lookup needed.
+	 */
+	if (softleaf_is_swap(entry)) {
+		newpmd = *pmd;
+	} else if (softleaf_is_migration_write(entry)) {
 		const struct folio *folio = softleaf_to_folio(entry);
 
 		/*
@@ -2658,7 +2690,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (!ptl)
 		return 0;
 
-	if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) {
+	if (pmd_is_valid_softleaf(*pmd)) {
 		change_non_present_huge_pmd(mm, addr, pmd, uffd_wp,
 					    uffd_wp_resolve);
 		goto unlock;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76db49..8c10e7e6fc0d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1101,6 +1101,12 @@ static inline enum scan_result check_pmd_state(pmd_t *pmd)
 	 */
 	if (pmd_is_migration_entry(pmde))
 		return SCAN_PMD_MAPPED;
+	/*
+	 * A PMD-mapped THP that has been swapped out is still a THP from
+	 * khugepaged's perspective; treat it like a present huge PMD.
+	 */
+	if (pmd_is_swap_entry(pmde))
+		return SCAN_PMD_MAPPED;
 	if (!pmd_present(pmde))
 		return SCAN_NO_PTE_TABLE;
 	if (pmd_trans_huge(pmde))
diff --git a/mm/madvise.c b/mm/madvise.c
index 9292f60b19aa..0d6aa0608f70 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -374,6 +374,15 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 					!can_do_file_pageout(vma);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * Swapped-out THPs have no resident folio to deactivate or reclaim.
+	 * Avoid descending into or splitting a PMD swap entry.
+	 */
+	if (pmd_is_swap_entry(*pmd)) {
+		walk->action = ACTION_CONTINUE;
+		return 0;
+	}
+
 	if (pmd_trans_huge(*pmd)) {
 		pmd_t orig_pmd;
 		unsigned long next = pmd_addr_end(addr, end);
@@ -384,6 +393,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			return 0;
 
 		orig_pmd = *pmd;
+		if (pmd_is_swap_entry(orig_pmd))
+			goto huge_unlock;
+
 		if (is_huge_zero_pmd(orig_pmd))
 			goto huge_unlock;
 
@@ -665,7 +677,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	int nr, max_nr;
 
 	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd))
+	if (pmd_trans_huge(*pmd) || pmd_is_swap_entry(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			return 0;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bba65898aee1..584ce81d4781 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -658,6 +658,8 @@ static void queue_folios_pmd(pmd_t *pmd, struct mm_walk *walk)
 		qp->nr_failed++;
 		return;
 	}
+	if (unlikely(pmd_is_swap_entry(*pmd)))
+		return;
 	folio = pmd_folio(*pmd);
 	if (is_huge_zero_folio(folio)) {
 		walk->action = ACTION_CONTINUE;
diff --git a/mm/mincore.c b/mm/mincore.c
index 53b982803771..ddf7c96964b0 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -99,6 +99,41 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
 	return present;
 }
 
+#ifdef CONFIG_THP_SWAP
+static void mincore_pmd_swap(swp_entry_t entry, unsigned long addr,
+			     unsigned long end, unsigned char *vec)
+{
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	unsigned long start = (addr - haddr) >> PAGE_SHIFT;
+	unsigned long nr = (end - addr) >> PAGE_SHIFT;
+	struct folio *folio;
+	enum swap_pmd_cache state;
+	int i;
+
+	state = swap_pmd_cache_lookup(entry, &folio);
+	if (state == SWAP_PMD_CACHE_HUGE) {
+		memset(vec, folio_test_uptodate(folio), nr);
+		folio_put(folio);
+		return;
+	}
+
+	if (state == SWAP_PMD_CACHE_EMPTY) {
+		memset(vec, 0, nr);
+		return;
+	}
+
+	/*
+	 * The PMD swap entry is only a compact encoding for consecutive swap
+	 * slots. If the PMD-sized swapcache folio was split, report residency
+	 * from the individual slots covered by this mincore() range.
+	 */
+	for (i = 0; i < nr; i++)
+		vec[i] = mincore_swap(swp_entry(swp_type(entry),
+						swp_offset(entry) + start + i),
+				      false);
+}
+#endif
+
 /*
  * Later we can get more picky about what "in core" means precisely.
  * For now, simply check to see if the page is in the page cache,
@@ -172,7 +207,15 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		memset(vec, 1, nr);
+		if (pmd_is_swap_entry(*pmd)) {
+#ifdef CONFIG_THP_SWAP
+			mincore_pmd_swap(softleaf_from_pmd(*pmd), addr, end, vec);
+#else
+			memset(vec, 0, nr);
+#endif
+		} else {
+			memset(vec, 1, nr);
+		}
 		spin_unlock(ptl);
 		goto out;
 	}
-- 
2.53.0-Meta
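
For readers following the per-slot fallback in mincore_pmd_swap(), the index arithmetic can be tried out in userspace. The sketch below is illustrative only: the SKETCH_* constants assume 4 KiB base pages and 2 MiB PMD-sized huge pages, and the struct and function names are hypothetical, not kernel API. It computes which consecutive swap offsets a mincore() sub-range [addr, end) maps to when the PMD swap entry starts at base_offset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 4 KiB base pages and 2 MiB PMD-sized huge pages. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_HPAGE_PMD_SIZE	((uint64_t)1 << 21)
#define SKETCH_HPAGE_PMD_MASK	(~(SKETCH_HPAGE_PMD_SIZE - 1))

/* Hypothetical result type: first swap offset and number of slots. */
struct slot_range_sketch {
	uint64_t first_offset;
	uint64_t nr;
};

/*
 * Map the virtual range [addr, end), which must lie within one
 * PMD-sized region, to the swap slots it covers. "base_offset" stands
 * in for swp_offset() of the PMD swap entry, i.e. the slot backing the
 * first page of the huge page.
 */
static struct slot_range_sketch
pmd_swap_slot_range_sketch(uint64_t base_offset, uint64_t addr, uint64_t end)
{
	uint64_t haddr = addr & SKETCH_HPAGE_PMD_MASK;
	struct slot_range_sketch r;

	assert(addr <= end);
	assert(end - haddr <= SKETCH_HPAGE_PMD_SIZE);

	/* Pages skipped at the front of the huge page shift the first slot. */
	r.first_offset = base_offset + ((addr - haddr) >> SKETCH_PAGE_SHIFT);
	r.nr = (end - addr) >> SKETCH_PAGE_SHIFT;
	return r;
}
```

Under these assumptions, a range starting three pages into the huge page and spanning two pages maps to slots base_offset + 3 and base_offset + 4, which is the same start + i offset used in the fallback loop above.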