From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
	shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [PATCH 11/13] mm: handle PMD swap entry faults on swap-in
Date: Mon, 27 Apr 2026 03:02:00 -0700
Message-ID: <20260427100553.2754667-12-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add do_huge_pmd_swap_page() and dispatch to it from __handle_mm_fault()
when vmf->orig_pmd encodes a swap entry.
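In outline, the swap-in fast path is (a condensed sketch of the handler
added below; locking, revalidation, error handling, the uffd-wp and
soft-dirty checks, and the split fallback are all omitted here, and
every name is taken from the diff):

	folio = swap_cache_get_folio(swp_entry);
	if (!folio) {
		/* cache miss: allocate a PMD-order folio and read it in */
		folio = swapin_alloc_pmd_folio(swp_entry, mm);
		ret = VM_FAULT_MAJOR;
	}
	pmd = folio_mk_pmd(folio, vma->vm_page_prot);
	if ((exclusive || folio_ref_count(folio) == 1) &&
	    (vma->vm_flags & VM_WRITE))
		pmd = pmd_mkwrite(pmd, vma);
	set_pmd_at(mm, haddr, vmf->pmd, pmd);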
The handler resolves the entire 2 MB mapping in one shot, mirroring
do_swap_page() (the PTE path) at PMD granularity:

- Look up the folio in the swap cache; on a miss, allocate a PMD-order
  folio and read it from swap (shared with unuse_pmd_entry() via
  swapin_alloc_pmd_folio() in mm/swap_state.c).

- After locking, re-validate that the folio still corresponds to our
  entry and is still PMD-sized. Between the unlocked cache lookup and
  the lock, a racing swap-in on the same entry may have removed it from
  the cache via folio_free_swap(), or reclaim / memory_failure /
  deferred split may have split the folio into smaller folios.

- Restore soft_dirty and uffd_wp from the swap PMD. Map writable only
  when the entry was exclusive, the VMA permits writes, and uffd-wp is
  not armed. Drop the exclusive marker when the cached folio is under
  writeback to a SWP_STABLE_WRITES backend (e.g. zram or an encrypted
  device), so the PMD is mapped read-only and a later write COWs into a
  fresh folio rather than corrupting the in-flight writeback. This
  mirrors do_swap_page().

- When the resulting PMD is read-only but the fault was a write, update
  vmf->orig_pmd and call wp_huge_pmd() from the same handler to COW
  immediately rather than forcing a second fault. Mask
  VM_FAULT_FALLBACK from its return value: a PMD COW that splits to PTE
  level is a normal outcome, but the bit is part of VM_FAULT_ERROR, and
  arch fault handlers BUG() on any VM_FAULT_ERROR bit other than
  SIGBUS/HWPOISON/SIGSEGV. This requires exposing wp_huge_pmd() via
  mm/internal.h.

- Free the swap slot via should_try_to_free_swap() (hoisted from
  mm/memory.c into mm/internal.h so the PTE- and PMD-level swap-in
  paths share the heuristic).

When PMD-order resources are unavailable (folio allocation fails, the
cached folio was split, the memcg charge fails, or swapin_folio()
races), split the PMD swap entry into 512 PTE swap entries via
__split_huge_pmd() and return 0. The fault retries and do_swap_page()
takes over per PTE. This avoids returning VM_FAULT_OOM for transient
PMD-order allocation failures.

Signed-off-by: Usama Arif
---
 include/linux/huge_mm.h |   9 ++
 mm/huge_memory.c        | 197 ++++++++++++++++++++++++++++++++++++++++
 mm/internal.h           |  36 ++++++++
 mm/memory.c             |  40 +-------
 mm/swap_state.c         |   2 +-
 5 files changed, 247 insertions(+), 37 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..93ee6c36d6ea 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -522,6 +522,15 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
 vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
+#ifdef CONFIG_THP_SWAP
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+#else
+static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+	return 0;
+}
+#endif
+
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bfcc9b274be7..141ab45adee4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2375,6 +2375,203 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
+#ifdef CONFIG_THP_SWAP
+/**
+ * do_huge_pmd_swap_page() - Handle a fault on a PMD-level swap entry.
+ * @vmf: Fault context. vmf->orig_pmd contains the swap PMD.
+ *
+ * Looks up the folio in the swap cache, and if it is a PMD-sized folio,
+ * maps it directly at the PMD level. If the folio is not in the swap
+ * cache, allocates a PMD-sized folio and reads from swap. On allocation
+ * failure, splits the PMD swap entry into PTE-level entries and retries
+ * at PTE granularity.
+ *
+ * Return: VM_FAULT_* flags.
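+ *
+ * Context: Called from __handle_mm_fault() with the mmap lock or a
+ * per-VMA lock held. Takes and drops the folio lock and the PMD
+ * spinlock internally; folio_lock_or_retry() may drop the mmap lock
+ * and make the handler return VM_FAULT_RETRY.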
+ */
+vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct folio *folio;
+	struct page *page;
+	struct swap_info_struct *si;
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	softleaf_t entry;
+	swp_entry_t swp_entry;
+	pmd_t pmd;
+	vm_fault_t ret = 0;
+	bool exclusive;
+	rmap_t rmap_flags = RMAP_NONE;
+
+	entry = softleaf_from_pmd(vmf->orig_pmd);
+	if (unlikely(!softleaf_is_swap(entry)))
+		return 0;
+
+	swp_entry = entry;
+
+	/* Prevent swapoff from happening to us. */
+	si = get_swap_device(swp_entry);
+	if (unlikely(!si))
+		return 0;
+
+	folio = swap_cache_get_folio(swp_entry);
+	if (!folio) {
+		folio = swapin_alloc_pmd_folio(swp_entry, mm);
+		if (!folio)
+			goto split_fallback;
+
+		/* Had to read from swap area: Major fault */
+		ret = VM_FAULT_MAJOR;
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(mm, PGMAJFAULT);
+	}
+
+	ret |= folio_lock_or_retry(folio, vmf);
+	if (ret & VM_FAULT_RETRY)
+		goto out_release;
+
+	/* Verify the folio is still in swap cache and matches our entry */
+	if (unlikely(!folio_matches_swap_entry(folio, swp_entry)))
+		goto out_page;
+
+	/*
+	 * Folio should be PMD-sized; if not (e.g. split in swap cache),
+	 * split the PMD swap entry and retry at PTE level.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		goto split_fallback;
+	}
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_page;
+	}
+
+	page = folio_page(folio, 0);
+	arch_swap_restore(folio_swap(swp_entry, folio), folio);
+
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !folio_test_lru(folio))
+		lru_add_drain();
+
+	folio_throttle_swaprate(folio, GFP_KERNEL);
+
+	/* Lock the PMD and verify it hasn't changed */
+	vmf->ptl = pmd_lock(mm, vmf->pmd);
+	if (unlikely(!pmd_same(vmf->orig_pmd, pmdp_get(vmf->pmd)))) {
+		spin_unlock(vmf->ptl);
+		goto out_page;
+	}
+
+	exclusive = pmd_swp_exclusive(vmf->orig_pmd);
+
+	/*
+	 * Some swap backends (e.g. zram) don't support concurrent page
+	 * modifications while under writeback. If we map exclusive on such
+	 * a backend while the folio is still under writeback, the writeback
+	 * may see partial modifications and corrupt the swap slot. Drop the
+	 * exclusive marker and only map R/O for that case; further GUP
+	 * references can't appear once the page is fully unmapped, so this
+	 * is safe.
+	 */
+	if (exclusive && folio_test_writeback(folio) &&
+	    data_race(si->flags & SWP_STABLE_WRITES))
+		exclusive = false;
+
+	/*
+	 * Set up the PMD mapping. Similar to do_swap_page() but at PMD level.
+	 */
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	pmd = pmd_mkyoung(pmd);
+
+	if (pmd_swp_soft_dirty(vmf->orig_pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+	if (pmd_swp_uffd_wp(vmf->orig_pmd))
+		pmd = pmd_mkuffd_wp(pmd);
+
+	/*
+	 * Check exclusivity to determine if we can map writable.
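+	 * The swap PMD's exclusive marker means the folio had no other
+	 * mapping or GUP pin when it was swapped out; a folio refcount
+	 * of 1 (only our lookup reference) shows we are likewise the
+	 * sole owner now.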
+	 */
+	if (exclusive || folio_ref_count(folio) == 1) {
+		if ((vma->vm_flags & VM_WRITE) &&
+		    !userfaultfd_huge_pmd_wp(vma, pmd) &&
+		    !pmd_needs_soft_dirty_wp(vma, pmd)) {
+			pmd = pmd_mkwrite(pmd, vma);
+			if (vmf->flags & FAULT_FLAG_WRITE) {
+				pmd = pmd_mkdirty(pmd);
+				vmf->flags &= ~FAULT_FLAG_WRITE;
+			}
+		}
+		rmap_flags |= RMAP_EXCLUSIVE;
+	}
+
+	flush_icache_pages(vma, page, HPAGE_PMD_NR);
+
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, haddr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, haddr, rmap_flags);
+
+	folio_put_swap(folio, NULL);
+
+	set_pmd_at(mm, haddr, vmf->pmd, pmd);
+	update_mmu_cache_pmd(vma, haddr, vmf->pmd);
+
+	/* Update orig_pmd for any follow-up wp_huge_pmd() below. */
+	vmf->orig_pmd = pmd;
+
+	/*
+	 * Conditionally try to free up the swap cache. Do it after mapping,
+	 * so raced page faults will likely see the folio in swap cache and
+	 * wait on the folio lock.
+	 */
+	if (should_try_to_free_swap(si, folio, vma, 1, vmf->flags))
+		folio_free_swap(folio);
+
+	spin_unlock(vmf->ptl);
+
+	folio_unlock(folio);
+	put_swap_device(si);
+
+	/*
+	 * If the write fault wasn't satisfied above (folio is shared without
+	 * exclusivity), fall through to wp_huge_pmd to handle COW or
+	 * userfaultfd-wp without forcing a second fault.
+	 *
+	 * wp_huge_pmd() may return VM_FAULT_FALLBACK if it had to split the
+	 * PMD; that's a normal outcome: the natural PTE-level refault will
+	 * complete the COW. Mask it so callers (and the arch fault handler)
+	 * don't see VM_FAULT_FALLBACK as a fatal VM_FAULT_ERROR.
+	 */
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		vm_fault_t wp_ret = wp_huge_pmd(vmf);
+
+		wp_ret &= ~VM_FAULT_FALLBACK;
+		ret |= wp_ret;
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
+	}
+
+	return ret;
+
+out_page:
+	folio_unlock(folio);
+out_release:
+	folio_put(folio);
+	put_swap_device(si);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, vmf->pmd, haddr, false);
+	put_swap_device(si);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 {
 	pgtable_t pgtable;
diff --git a/mm/internal.h b/mm/internal.h
index 7de489689f54..c522bff72688 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -508,6 +508,42 @@ static inline vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
 }
 
 vm_fault_t do_swap_page(struct vm_fault *vmf);
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf);
+
+/*
+ * Check if we should call folio_free_swap to free the swap cache.
+ * folio_free_swap only frees the swap cache to release the slot if swap
+ * count is zero, so we don't need to check the swap count here.
+ */
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+					   struct folio *folio,
+					   struct vm_area_struct *vma,
+					   unsigned int extra_refs,
+					   unsigned int fault_flags)
+{
+	if (!folio_test_swapcache(folio))
+		return false;
+	/*
+	 * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
+	 * cache can help save some IO or memory overhead, but these devices
+	 * are fast, and meanwhile, swap cache pinning the slot deferring the
+	 * release of metadata or fragmentation is a more critical issue.
+	 */
+	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+		return true;
+	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
+	    folio_test_mlocked(folio))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * user. Try freeing the swapcache to get rid of the swapcache
+	 * reference only in case it's likely that we'll be the exclusive user.
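+	 * The swap cache holds folio_nr_pages() references and the
+	 * caller holds @extra_refs, so a refcount equal to their sum
+	 * suggests nobody else holds a reference to the folio.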
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
+	       folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+}
+
 void folio_rotate_reclaimable(struct folio *folio);
 bool __folio_end_writeback(struct folio *folio);
 void deactivate_file_folio(struct folio *folio);
diff --git a/mm/memory.c b/mm/memory.c
index 8aa90afd601a..3006e1bc2bd7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4481,40 +4481,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
-/*
- * Check if we should call folio_free_swap to free the swap cache.
- * folio_free_swap only frees the swap cache to release the slot if swap
- * count is zero, so we don't need to check the swap count here.
- */
-static inline bool should_try_to_free_swap(struct swap_info_struct *si,
-					   struct folio *folio,
-					   struct vm_area_struct *vma,
-					   unsigned int extra_refs,
-					   unsigned int fault_flags)
-{
-	if (!folio_test_swapcache(folio))
-		return false;
-	/*
-	 * Always try to free swap cache for SWP_SYNCHRONOUS_IO devices. Swap
-	 * cache can help save some IO or memory overhead, but these devices
-	 * are fast, and meanwhile, swap cache pinning the slot deferring the
-	 * release of metadata or fragmentation is a more critical issue.
-	 */
-	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
-		return true;
-	if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
-	    folio_test_mlocked(folio))
-		return true;
-	/*
-	 * If we want to map a page that's in the swapcache writable, we
-	 * have to detect via the refcount if we're really the exclusive
-	 * user. Try freeing the swapcache to get rid of the swapcache
-	 * reference only in case it's likely that we'll be the exclusive user.
-	 */
-	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-	       folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
-}
-
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 {
 	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -6233,8 +6199,7 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 	return VM_FAULT_FALLBACK;
 }
 
-/* `inline' is required to avoid gcc 4.1.2 build error */
-static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
+vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
@@ -6518,6 +6483,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 		if (pmd_is_migration_entry(vmf.orig_pmd))
 			pmd_migration_entry_wait(mm, vmf.pmd);
+		else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+			 pmd_is_swap_entry(vmf.orig_pmd))
+			return do_huge_pmd_swap_page(&vmf);
 		return 0;
 	}
 	if (pmd_trans_huge(vmf.orig_pmd)) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c2e8c76658f5..19c6759006bb 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -592,7 +592,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
  *
  * Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, and
  * issue the swap-in via swapin_folio(). Used by callers that need to map a
- * PMD swap entry as a whole THP (PMD swapoff).
+ * PMD swap entry as a whole THP (PMD swap-in fault and swapoff).
  *
  * Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in
  * which case the caller should fall back to splitting the PMD).
-- 
2.52.0