From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com, linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He, willy@infradead.org, youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com, Liam R. Howlett, ryan.roberts@arm.com, Vlastimil Babka, lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [PATCH v3 10/11] mm: install PMD swap entries on swap-out
Date: Fri, 3 Jul 2026 10:38:27 -0700
Message-ID: <20260703173903.3789516-11-usama.arif@linux.dev>
In-Reply-To: <20260703173903.3789516-1-usama.arif@linux.dev>
References: <20260703173903.3789516-1-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

Reclaim today splits a PMD-mapped anonymous THP into 512 PTE swap entries
before unmap, losing the huge mapping across the swap round-trip and
forcing khugepaged to rebuild it later.
The contiguous swap range was already secured when the folio was added to
the swap cache (a non-contiguous allocation would have split the folio
earlier), so the PMD can be replaced by a single PMD-level swap entry
instead.

This patch mirrors the existing PTE swap-out path at PMD granularity:

- shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for PMD-mappable swapcache
  folios. zswap is handled by the PMD swap-in users: if any covered slot
  currently has a zswap entry, they split the PMD swap entry and fall
  back to the per-PTE path.

- try_to_unmap_one() now has a PMD branch that calls set_pmd_swap_entry()
  and adjusts MM_ANONPAGES / MM_SWAPENTS by HPAGE_PMD_NR before
  walk_done. TTU_SPLIT_HUGE_PMD remains the fallback.

- set_pmd_swap_entry() is the installer. Mirroring the PTE swap-out
  sequence at PMD granularity, it clears the present mapping (keeping
  the original for rollback), bumps the swap_map refcount for the
  folio's 512 slots, transfers the exclusive state in the swap entry,
  propagates the dirty bit to the folio so writeback is not lost, and
  installs a swap PMD that preserves the original soft-dirty / uffd-wp /
  exclusive bits. Any failing step rolls back the present mapping.

The swap entry value matches what 512 PTE swap entries would encode, so
swap_map refcounting is unchanged: each of the 512 slots carries a count
of 1, released individually on later split or together on swap-in.

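For readers who want to check the accounting described above, here is a
minimal user-space sketch of it. It is not kernel code and is not taken
from the patch: every name in it (model_mm, model_dup_slots,
model_pmd_swap_out, MODEL_PMD_NR, MODEL_SLOT_MAX) is hypothetical, the
value 512 assumes 4 KiB base pages with 2 MiB PMDs, and the per-slot
ceiling is an arbitrary stand-in. It also folds the counter update that
the patch performs in try_to_unmap_one() into the same function, purely
to keep the example short.

```c
/* User-space model only; all names here are hypothetical, not kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PMD_NR   512	/* stands in for HPAGE_PMD_NR (assumed 4 KiB pages) */
#define MODEL_SLOT_MAX 255	/* arbitrary per-slot refcount ceiling for the model */

struct model_mm {
	long anon_pages;	/* stands in for MM_ANONPAGES */
	long swap_ents;		/* stands in for MM_SWAPENTS */
	bool pmd_present;	/* true while the huge mapping is installed */
};

/* Take one reference on each of the 512 slots, or on none of them. */
static int model_dup_slots(unsigned char *slots)
{
	size_t i;

	for (i = 0; i < MODEL_PMD_NR; i++) {
		if (slots[i] == MODEL_SLOT_MAX)
			goto undo;
		slots[i]++;
	}
	return 0;

undo:
	/* Drop the references taken so far, in reverse order. */
	while (i-- > 0)
		slots[i]--;
	return -1;
}

/* Clear the mapping, reference the slots, then move the counters. */
static int model_pmd_swap_out(struct model_mm *mm, unsigned char *slots)
{
	assert(mm != NULL && slots != NULL);

	if (!mm->pmd_present)
		return -1;

	mm->pmd_present = false;	/* clear the present mapping first */

	if (model_dup_slots(slots) < 0) {
		mm->pmd_present = true;	/* roll back to the original mapping */
		return -1;
	}

	/* One huge mapping becomes 512 swap references. */
	mm->anon_pages -= MODEL_PMD_NR;
	mm->swap_ents += MODEL_PMD_NR;
	return 0;
}
```

On success the model leaves every slot with one extra reference and moves
512 pages from the anonymous counter to the swap-entry counter; on failure
it leaves the mapping, the slots and both counters exactly as they were,
which is the rollback property the commit message describes.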
Signed-off-by: Usama Arif
---
 include/linux/huge_mm.h       |  2 +
 include/linux/vm_event_item.h |  1 +
 mm/huge_memory.c              | 78 +++++++++++++++++++++++++++++++++++
 mm/rmap.c                     | 20 +++++++++
 mm/vmscan.c                   |  9 +++-
 mm/vmstat.c                   |  1 +
 6 files changed, 110 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9ec475ccfc91..b746f8c8db69 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -533,6 +533,8 @@ vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf);
 
 #ifdef CONFIG_THP_SWAP
 vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf);
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio);
 #else
 static inline vm_fault_t do_huge_pmd_swap_page(struct vm_fault *vmf)
 {
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..7267c06674c0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -108,6 +108,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_SWPOUT_PMD,
 #endif
 #ifdef CONFIG_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa60324a2f0..7ec81a9c4bc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -5450,3 +5450,81 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	trace_remove_migration_pmd(address, pmd_val(pmde));
 }
 #endif
+
+#ifdef CONFIG_THP_SWAP
+/**
+ * set_pmd_swap_entry() - Replace a PMD mapping with a PMD-level swap entry.
+ * @pvmw: Page vma mapped walk context, must have pvmw->pmd set and
+ *        pvmw->pte NULL (i.e. PMD-mapped).
+ * @folio: The folio being swapped out. Must be in the swap cache.
+ *
+ * This installs a PMD-level swap entry in place of a present PMD mapping,
+ * avoiding the need to split the PMD into PTE-level swap entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+		       struct folio *folio)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address = pvmw->address;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	struct page *page = folio_page(folio, 0);
+	bool anon_exclusive;
+	pmd_t pmdval;
+	swp_entry_t entry;
+	pmd_t pmdswp;
+
+	if (!(pvmw->pmd && !pvmw->pte))
+		return 0;
+
+	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+	VM_BUG_ON_FOLIO(!folio_test_anon(folio), folio);
+
+	if (unlikely(folio_test_swapbacked(folio) !=
+		     folio_test_swapcache(folio))) {
+		WARN_ON_ONCE(1);
+		return -EBUSY;
+	}
+
+	flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+
+	pmdval = pmdp_invalidate(vma, haddr, pvmw->pmd);
+
+	/* Update high watermark before we lower rss */
+	update_hiwater_rss(mm);
+
+	if (folio_dup_swap(folio, NULL) < 0) {
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -ENOMEM;
+	}
+
+	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+	anon_exclusive = PageAnonExclusive(page);
+	if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+		folio_put_swap(folio, NULL);
+		set_pmd_at(mm, haddr, pvmw->pmd, pmdval);
+		return -EBUSY;
+	}
+
+	if (pmd_dirty(pmdval))
+		folio_mark_dirty(folio);
+
+	entry = folio->swap;
+	pmdswp = softleaf_to_pmd(entry);
+	if (pmd_soft_dirty(pmdval))
+		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+	if (pmd_uffd_wp(pmdval))
+		pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+	if (anon_exclusive)
+		pmdswp = pmd_swp_mkexclusive(pmdswp);
+	set_pmd_at(mm, haddr, pvmw->pmd, pmdswp);
+
+	folio_remove_rmap_pmd(folio, page, vma);
+	folio_put(folio);
+
+	count_vm_event(THP_SWPOUT_PMD);
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
diff --git a/mm/rmap.c b/mm/rmap.c
index 0fb7a1b82cf3..ffc7aa62a29e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2079,6 +2079,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto walk_abort;
 			}
 
+#ifdef CONFIG_THP_SWAP
+			/*
+			 * If the folio is in the swap cache and we're not
+			 * asked to split, install a PMD-level swap entry.
+			 */
+			if (!(flags & TTU_SPLIT_HUGE_PMD) &&
+			    folio_test_anon(folio) &&
+			    folio_test_swapcache(folio)) {
+				if (set_pmd_swap_entry(&pvmw, folio))
+					goto walk_abort;
+
+				mm_prepare_for_swap_entries(mm);
+				add_mm_counter(mm, MM_ANONPAGES,
+					       -HPAGE_PMD_NR);
+				add_mm_counter(mm, MM_SWAPENTS,
+					       HPAGE_PMD_NR);
+				goto walk_done;
+			}
+#endif
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				/*
 				 * We temporarily have to drop the PTL and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56fe5393f30f..3d7999c3f1ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1321,7 +1321,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
 			bool was_swapbacked = folio_test_swapbacked(folio);
 
-			if (folio_test_pmd_mappable(folio))
+			/*
+			 * With THP_SWAP, PMD-mappable folios already in the
+			 * swap cache can be unmapped with a PMD-level swap
+			 * entry, avoiding the cost of splitting the PMD.
+			 */
+			if (folio_test_pmd_mappable(folio) &&
+			    !(IS_ENABLED(CONFIG_THP_SWAP) &&
+			      folio_test_swapcache(folio)))
 				flags |= TTU_SPLIT_HUGE_PMD;
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7b93fbf9af09..629055399987 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1421,6 +1421,7 @@ const char * const vmstat_text[] = {
 	[I(THP_ZERO_PAGE_ALLOC_FAILED)]		= "thp_zero_page_alloc_failed",
 	[I(THP_SWPOUT)]				= "thp_swpout",
 	[I(THP_SWPOUT_FALLBACK)]		= "thp_swpout_fallback",
+	[I(THP_SWPOUT_PMD)]			= "thp_swpout_pmd",
 #endif
 #ifdef CONFIG_BALLOON
 	[I(BALLOON_INFLATE)]			= "balloon_inflate",
-- 
2.53.0-Meta