From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka, lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [PATCH 12/13] mm: install PMD swap entries on swap-out
Date: Mon, 27 Apr 2026 03:02:01 -0700
Message-ID: <20260427100553.2754667-13-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap entries
before unmap, losing the huge mapping across the swap round-trip and forcing
khugepaged to rebuild it later. The contiguous swap range was already secured
when the folio was added to the swap cache (a non-contiguous allocation would
have split the folio earlier), so the PMD can be replaced by a single
PMD-level swap entry instead.
This patch mirrors the existing PTE swap-out path at PMD granularity:

- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable swapcache
  folios, gated on zswap_never_enabled() since zswap cannot reconstruct a
  2 MB folio from per-page blobs; the zswap case is best handled separately.

- try_to_unmap_one() gains a PMD branch that calls set_pmd_swap_entry()
  and adjusts MM_ANONPAGES / MM_SWAPENTS by HPAGE_PMD_NR before walk_done.
  TTU_SPLIT_HUGE_PMD remains the fallback.

- set_pmd_swap_entry() is the installer. Mirroring the PTE swap-out
  sequence at PMD granularity, it clears the present mapping (keeping the
  original for rollback), bumps the swap_map refcount for the folio's 512
  slots, drops the exclusive mark if the page was anon-exclusive,
  propagates the dirty bit to the folio so writeback is not lost, and
  installs a swap PMD that preserves the original soft-dirty / uffd-wp /
  exclusive bits. Any failing step rolls back the present mapping.

The swap entry value matches what 512 PTE swap entries would encode, so
swap_map refcounting is unchanged: each of the 512 slots carries a count
of 1, released individually on a later split or together on swap-in.
Signed-off-by: Usama Arif
---
 include/linux/huge_mm.h       |  2 +
 include/linux/vm_event_item.h |  1 +
 mm/huge_memory.c              | 78 +++++++++++++++++++++++++++++++++++
 mm/rmap.c                     | 20 +++++++++
 mm/vmscan.c                   | 14 ++++++-
 mm/vmstat.c                   |  1 +
 6 files changed, 115 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93ee6c36d6ea..cbfac4720fc9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -524,6 +524,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 #ifdef CONFIG_THP_SWAP
 vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio);
 #else
 static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_SWPOUT_PMD,
 #endif
 #ifdef CONFIG_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 141ab45adee4..47ff7fb9ee9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5497,3 +5497,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ *        pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = folio_page(folio, 0);
+	bool anon_exclusive;
+	pmd_t pmdval;
+	swp_entry_t entry;
+	pmd_t pmdswp;
+
+	if (!(pvmw->pmd && !pvmw->pte))
+		return 0;
+
+	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+	if (unlikely(folio_test_swapbacked(folio) !=
+		     folio_test_swapcache(folio))) {
+		WARN_ON_ONCE(1);
+		return -EBUSY;
+	}
+
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	if (folio_dup_swap(folio, NULL) < 0) {
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -ENOMEM;
+	}
+
+	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
+	 */
+	anon_exclusive = PageAnonExclusive(page);
+	if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+		folio_put_swap(folio, NULL);
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -EBUSY;
+	}
+
+	if (pmd_dirty(pmdval))
+		folio_mark_dirty(folio);
+
+	entry = folio->swap;
+	pmdswp = softleaf_to_pmd(entry);
+	if (pmd_soft_dirty(pmdval))
+		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+	if (pmd_uffd_wp(pmdval))
+		pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+	if (anon_exclusive)
+		pmdswp = pmd_swp_mkexclusive(pmdswp);
+	set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+	folio_remove_rmap_pmd(folio, page, vma);
+	folio_put(folio);
+
+	count_vm_event(THP_SWPOUT_PMD);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 057e18cb80b0..b188213648c5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2077,6 +2077,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			goto walk_abort;
 		}
 
+#ifdef CONFIG_THP_SWAP
+		/*
+		 * If the folio is in the swap cache and we're not
+		 * asked to split, install a PMD-level swap entry.
+		 */
+		if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+		    folio_test_anon(folio) &&
+		    folio_test_swapcache(folio)) {
+			if (set_pmd_swap_entry(&pvmw, folio))
+				goto walk_abort;
+
+			ensure_on_mmlist(mm);
+			add_mm_counter(mm, MM_ANONPAGES,
+				       -HPAGE_PMD_NR);
+			add_mm_counter(mm, MM_SWAPENTS,
+				       HPAGE_PMD_NR);
+			goto walk_done;
+		}
+#endif
+
 		if (flags & TTU_SPLIT_HUGE_PMD) {
 			/*
 			 * We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..e895aaade8f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -64,6 +64,7 @@
 #include
 #include
+#include <linux/zswap.h>

 #include "internal.h"
 #include "swap.h"
@@ -1330,7 +1331,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);

-			if (folio_test_pmd_mappable(folio))
+			/*
+			 * With THP_SWAP, PMD-mappable folios already in the
+			 * swap cache can be unmapped with a PMD-level swap
+			 * entry, avoiding the cost of splitting the PMD.
+			 * Skip this when zswap has been enabled because
+			 * zswap stores pages individually and cannot
+			 * reconstruct a large folio on swap-in.
+			 */
+			if (folio_test_pmd_mappable(folio) &&
+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
+			      folio_test_swapcache(folio) &&
+			      zswap_never_enabled()))
 				flags |= TTU_SPLIT_HUGE_PMD;
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..9b4963a7eb04 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
 	[I(THP_ZERO_PAGE_ALLOC_FAILED)] = "thp_zero_page_alloc_failed",
 	[I(THP_SWPOUT)] = "thp_swpout",
 	[I(THP_SWPOUT_FALLBACK)] = "thp_swpout_fallback",
+	[I(THP_SWPOUT_PMD)] = "thp_swpout_pmd",
 #endif
 #ifdef CONFIG_BALLOON
 	[I(BALLOON_INFLATE)] = "balloon_inflate",
--
2.52.0