From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com, linux-mm@kvack.org
Cc: ying.huang@linux.alibaba.com, Baoquan He, willy@infradead.org,
	youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
	shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
	dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam R. Howlett, ryan.roberts@arm.com, Vlastimil Babka,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
	shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [PATCH v3 05/11] mm: swap in PMD swap entries as whole THPs during swapoff
Date: Fri, 3 Jul 2026 10:38:22 -0700
Message-ID: <20260703173903.3789516-6-usama.arif@linux.dev>
In-Reply-To: <20260703173903.3789516-1-usama.arif@linux.dev>
References: <20260703173903.3789516-1-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

Add swap_pmd_cache_lookup() to classify the swap cache behind a PMD swap
entry as empty, backed by one PMD-sized folio, or requiring per-page
handling because at least one covered slot has a smaller folio in the
swap cache.
PMD swap entries are handled at PMD granularity only while the covered
cache range is empty or backed by a PMD-sized folio; a split cache forces
the entry to be split and retried through the PTE path.

Add unuse_pmd() and unuse_pmd_entry(), and call the latter from
unuse_pmd_range(), to swap in PMD-level swap entries as whole THPs during
swapoff. This mirrors the existing unuse_pte_range() but operates at PMD
granularity.

If the PMD-order folio cannot be allocated, the swap cache already
contains per-page folios in the covered range (e.g. split in the swap
cache by deferred_split_scan() or memory_failure() while the PMD swap
entry was installed), or the folio is not uptodate, the PMD swap entry is
split into PTE-level entries via __split_huge_pmd() and a non-zero error
is returned so unuse_pmd_range() falls through to unuse_pte_range(),
which handles the individual entries at order-0.

Signed-off-by: Usama Arif
---
 mm/swap.h       |  17 ++++++
 mm/swap_state.c |  44 ++++++++++++++
 mm/swapfile.c   | 154 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 215 insertions(+)

diff --git a/mm/swap.h b/mm/swap.h
index 44ab8e1e595b..17c2c57e0da4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -303,6 +303,23 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
 bool swap_cache_has_folio(swp_entry_t entry);
 struct folio *swap_cache_get_folio(swp_entry_t entry);
 void *swap_cache_get_shadow(swp_entry_t entry);
+enum swap_pmd_cache {
+	SWAP_PMD_CACHE_EMPTY,
+	SWAP_PMD_CACHE_HUGE,
+	SWAP_PMD_CACHE_SPLIT,
+};
+
+#ifdef CONFIG_THP_SWAP
+enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+					  struct folio **foliop);
+#else
+static inline enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+							 struct folio **foliop)
+{
+	*foliop = NULL;
+	return SWAP_PMD_CACHE_EMPTY;
+}
+#endif
 void swap_cache_del_folio(struct folio *folio);
 struct folio *swap_cache_alloc_folio(swp_entry_t target_entry, gfp_t gfp_mask,
 				     unsigned long orders, struct vm_fault *vmf,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6fd6e3415b71..9b9ca82ace4b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -118,6 +118,50 @@ bool swap_cache_has_folio(swp_entry_t entry)
 	return swp_tb_is_folio(swp_tb);
 }
 
+#ifdef CONFIG_THP_SWAP
+/**
+ * swap_pmd_cache_lookup - classify the swap cache behind a PMD swap entry
+ * @entry: first swap slot encoded by the PMD swap entry
+ * @foliop: returned PMD-sized folio, with a reference, if present
+ *
+ * A PMD swap entry is a compact page-table encoding for HPAGE_PMD_NR
+ * consecutive swap slots. The swap cache behind those slots can be empty,
+ * one PMD-sized folio, or per-slot folios after the original folio was split.
+ *
+ * Context: Caller must keep @entry valid using the usual swap cache rules.
+ * Return: SWAP_PMD_CACHE_EMPTY if no slot in the PMD range has a cached folio,
+ * SWAP_PMD_CACHE_HUGE if one PMD-sized folio covers the range, or
+ * SWAP_PMD_CACHE_SPLIT if the range needs per-page handling.
+ */
+enum swap_pmd_cache swap_pmd_cache_lookup(swp_entry_t entry,
+					  struct folio **foliop)
+{
+	unsigned int type = swp_type(entry);
+	pgoff_t offset = swp_offset(entry);
+	struct folio *folio;
+	int i;
+
+	*foliop = NULL;
+
+	folio = swap_cache_get_folio(entry);
+	if (folio) {
+		if (folio_nr_pages(folio) == HPAGE_PMD_NR) {
+			*foliop = folio;
+			return SWAP_PMD_CACHE_HUGE;
+		}
+		folio_put(folio);
+		return SWAP_PMD_CACHE_SPLIT;
+	}
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		if (swap_cache_has_folio(swp_entry(type, offset + i)))
+			return SWAP_PMD_CACHE_SPLIT;
+	}
+
+	return SWAP_PMD_CACHE_EMPTY;
+}
+#endif
+
 /**
  * swap_cache_get_shadow - Looks up a shadow in the swap cache.
  * @entry: swap entry used for the lookup.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0695dbd1a8b1..664956da60c8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -2641,6 +2642,147 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return 0;
 }
 
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio. Returns 0 on success
+ * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse will
+ * rescan). Returns -EIO if the folio is not uptodate; in that case the
+ * PMD is split so unuse_pte_range() can handle individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, softleaf_t entry,
+		     struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pmd_t new_pmd, old_pmd;
+	spinlock_t *ptl;
+	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive;
+
+	if (unlikely(!folio_matches_swap_entry(folio, entry)))
+		return -EAGAIN;
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		__split_huge_pmd(vma, pmd, addr, false);
+		return -EIO;
+	}
+
+	page = folio_page(folio, 0);
+
+	ptl = pmd_lock(mm, pmd);
+	old_pmd = pmdp_get(pmd);
+
+	if (!pmd_is_swap_entry(old_pmd) ||
+	    softleaf_from_pmd(old_pmd).val != entry.val) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	exclusive = pmd_swp_exclusive(old_pmd);
+
+	/*
+	 * Some architectures may have to restore extra metadata to the folio
+	 * when reading from swap. This metadata may be indexed by swap entry
+	 * so this must be called before folio_put_swap().
+	 */
+	arch_swap_restore(folio_swap(entry, folio), folio);
+
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	new_pmd = pmd_mkold(new_pmd);
+	if (pmd_swp_soft_dirty(old_pmd))
+		new_pmd = pmd_mksoft_dirty(new_pmd);
+	if (pmd_swp_uffd_wp(old_pmd))
+		new_pmd = pmd_mkuffd_wp(new_pmd);
+
+	if (exclusive)
+		rmap_flags |= RMAP_EXCLUSIVE;
+
+	folio_get(folio);
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+	set_pmd_at(mm, addr, pmd, new_pmd);
+	folio_put_swap(folio, NULL);
+
+	spin_unlock(ptl);
+
+	folio_free_swap(folio);
+	return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * If the swap cache no longer has one PMD-sized folio, zswap may require
+ * per-page loading, or a PMD-order allocation/read fails, split the PMD so
+ * the caller can fall back to unuse_pte_range(). Otherwise propagates the
+ * error from unuse_pmd().
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+			   unsigned long addr, softleaf_t entry)
+{
+	struct folio *folio;
+	enum swap_pmd_cache cache_state;
+	int ret;
+
+	cache_state = swap_pmd_cache_lookup(entry, &folio);
+	if (cache_state == SWAP_PMD_CACHE_SPLIT) {
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	if (!folio) {
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
+		if (zswap_range_has_entry(entry, HPAGE_PMD_NR)) {
+			ret = -EAGAIN;
+			goto split_fallback;
+		}
+
+		folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+				    BIT(HPAGE_PMD_ORDER), &vmf, NULL, 0);
+		if (IS_ERR_OR_NULL(folio)) {
+			ret = folio ? PTR_ERR(folio) : -ENOMEM;
+			goto split_fallback;
+		}
+	}
+
+	folio_lock(folio);
+	folio_wait_writeback(folio);
+	/*
+	 * If the cached folio is no longer PMD-sized (e.g. split in the
+	 * swap cache by deferred_split_scan() or memory_failure() while
+	 * the PMD swap entry was installed), the PMD swap entry no longer
+	 * maps a single contiguous folio. Split the PMD swap entry so
+	 * unuse_pte_range() can swap the per-slot folios in individually.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	ret = unuse_pmd(vma, pmd, addr, entry, folio);
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, pmd, addr, false);
+	return ret;
+}
+
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				unsigned int type)
@@ -2653,6 +2795,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+
+		pmd_t pmdval = pmdp_get(pmd);
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t sl = softleaf_from_pmd(pmdval);
+
+			if (swp_type(sl) == type) {
+				if (!unuse_pmd_entry(vma, pmd, addr, sl))
+					continue;
+			}
+		}
+
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
 			return ret;
-- 
2.53.0-Meta
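
The three-way classification done by swap_pmd_cache_lookup() can be
tried out in userspace with the rough model below. This is only a sketch
and not kernel code: the slot_pages[] array (pages in the folio cached at
each slot, 0 meaning no folio), the model_* and MODEL_* names, and the
fixed 512-slot PMD range are all assumptions made for illustration. The
real function queries the swap cache and takes a folio reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption for this model: 512 base pages per PMD (4K pages, 2M PMD). */
#define MODEL_PMD_NR 512

/* Hypothetical stand-in for enum swap_pmd_cache from the patch. */
enum model_pmd_cache {
	MODEL_CACHE_EMPTY,	/* no slot in the range has a cached folio */
	MODEL_CACHE_HUGE,	/* one PMD-sized folio covers the range */
	MODEL_CACHE_SPLIT,	/* at least one slot has a smaller folio */
};

/*
 * slot_pages[i] is the size, in pages, of the folio cached at slot i,
 * or 0 if that slot has no folio in the (modelled) swap cache.
 */
enum model_pmd_cache model_pmd_cache_lookup(const unsigned int *slot_pages,
					    size_t nr_slots)
{
	size_t i;

	assert(nr_slots == MODEL_PMD_NR);

	/* A folio at the first slot decides the answer on its own. */
	if (slot_pages[0]) {
		if (slot_pages[0] == MODEL_PMD_NR)
			return MODEL_CACHE_HUGE;
		return MODEL_CACHE_SPLIT;
	}

	/* First slot is empty: any folio further in means a split range. */
	for (i = 1; i < nr_slots; i++) {
		if (slot_pages[i])
			return MODEL_CACHE_SPLIT;
	}

	return MODEL_CACHE_EMPTY;
}

/* True if the caller may stay on the PMD path, false if it must split. */
bool model_can_use_pmd_path(enum model_pmd_cache state)
{
	return state != MODEL_CACHE_SPLIT;
}
```

In this model, EMPTY and HUGE keep the caller on the PMD path, while
SPLIT corresponds to the case where unuse_pmd_entry() calls
__split_huge_pmd() and lets unuse_pte_range() handle the slots one by
one.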