From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com, Liam.Howlett@oracle.com,
	ryan.roberts@arm.com, Vlastimil Babka, lance.yang@linux.dev,
	linux-kernel@vger.kernel.org, nphamcs@gmail.com, shikemeng@huaweicloud.com,
	kernel-team@meta.com, Usama Arif
Subject: [PATCH 08/13] mm: swap in PMD swap entries as whole THPs during swapoff
Date: Mon, 27 Apr 2026 03:01:57 -0700
Message-ID: <20260427100553.2754667-9-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>

Add unuse_pmd() and call it from unuse_pmd_range() to swap in PMD-level
swap entries as whole THPs during swapoff. This mirrors the existing
unuse_pte_range() but operates at PMD granularity.

If the PMD-order folio cannot be allocated, the cached folio is no
longer PMD-sized (e.g. split in the swap cache by deferred_split_scan()
or memory_failure() while the PMD swap entry was installed), or the
folio is not uptodate, the PMD swap entry is split into PTE-level
entries via __split_huge_pmd() and a non-zero error is returned, so
unuse_pmd_range() falls through to unuse_pte_range(), which handles the
individual entries at order-0.

swapin_alloc_pmd_folio() is a separate function in swap_state.c because
it will be reused by the swapin path in a later patch.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
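Not for the commit log: a condensed sketch of the per-PMD dispatch this
patch adds, taken from the unuse_pmd_range() hunk at the end of the
diff (surrounding loop, locking and error paths elided):

	pmd_t pmdval = pmdp_get(pmd);

	if (pmd_is_swap_entry(pmdval)) {
		softleaf_t sl = softleaf_from_pmd(pmdval);

		/* Whole-THP path first; 0 means the PMD is now mapped. */
		if (swp_type(sl) == type &&
		    !unuse_pmd_entry(vma, pmd, addr, sl))
			continue;
		/* Otherwise the PMD was split or is left for a rescan. */
	}
	/* Fall back to the existing order-0 path. */
	ret = unuse_pte_range(vma, pmd, addr, next, type);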
 mm/swap.h       |   7 +++
 mm/swap_state.c |  35 +++++++++++++
 mm/swapfile.c   | 137 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 179 insertions(+)

diff --git a/mm/swap.h b/mm/swap.h
index a77016f2423b..76752df71693 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -301,6 +301,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
 struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
 			       struct vm_fault *vmf);
 struct folio *swapin_folio(swp_entry_t entry, struct folio *folio);
+struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, struct mm_struct *mm);
 void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
 			   unsigned long addr);
@@ -438,6 +439,12 @@ static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
 	return NULL;
 }
 
+static inline struct folio *swapin_alloc_pmd_folio(swp_entry_t entry,
+						   struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline void swap_update_readahead(struct folio *folio,
 		struct vm_area_struct *vma, unsigned long addr)
 {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1415a5c54a43..c2e8c76658f5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -584,6 +584,41 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
 	return swapcache;
 }
 
+#ifdef CONFIG_THP_SWAP
+/**
+ * swapin_alloc_pmd_folio - allocate, charge, and read a PMD-sized swap folio.
+ * @entry: starting swap entry to swap in
+ * @mm: mm to charge for the swap-in
+ *
+ * Allocate a HPAGE_PMD_ORDER folio, charge it to @mm's memcg for @entry, and
+ * issue the swap-in via swapin_folio(). Used by callers that need to map a
+ * PMD swap entry as a whole THP (PMD swapoff).
+ *
+ * Return: the swapped-in folio, or NULL on alloc/charge/swapin failure (in
+ * which case the caller should fall back to splitting the PMD).
+ */
+struct folio *swapin_alloc_pmd_folio(swp_entry_t entry, struct mm_struct *mm)
+{
+	struct folio *folio;
+
+	folio = folio_alloc(GFP_HIGHUSER_MOVABLE, HPAGE_PMD_ORDER);
+	if (!folio)
+		return NULL;
+
+	if (mem_cgroup_swapin_charge_folio(folio, mm, GFP_KERNEL, entry)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	if (!swapin_folio(entry, folio)) {
+		folio_put(folio);
+		return NULL;
+	}
+
+	return folio;
+}
+#endif /* CONFIG_THP_SWAP */
+
 /*
  * Locate a page of swap in physical memory, reserving swap cache space
  * and reading the disk if it is not already cached.
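Reviewer note, not part of the diff: condensed from unuse_pmd_entry()
in the mm/swapfile.c hunk below, this is how the helper above is
consumed (locking and error codes elided):

	/* Prefer a folio that is already in the swap cache... */
	struct folio *folio = swap_cache_get_folio(entry);

	if (!folio)
		/* ...else allocate, charge and read a PMD-order folio. */
		folio = swapin_alloc_pmd_folio(entry, vma->vm_mm);
	if (!folio)
		/* No THP available: split so the PTE path can finish. */
		__split_huge_pmd(vma, pmd, addr, false);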
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 390f191be9a6..7256edf4ce66 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -2519,6 +2520,130 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return 0;
 }
 
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio. Returns 0 on success
+ * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse()
+ * will rescan). Returns -EIO if the folio is not uptodate; in that case
+ * the PMD is split so unuse_pte_range() can handle the individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, softleaf_t entry,
+		     struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pmd_t new_pmd, old_pmd;
+	spinlock_t *ptl;
+	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive;
+
+	if (unlikely(!folio_matches_swap_entry(folio, entry)))
+		return -EAGAIN;
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		__split_huge_pmd(vma, pmd, addr, false);
+		return -EIO;
+	}
+
+	page = folio_page(folio, 0);
+
+	ptl = pmd_lock(mm, pmd);
+	old_pmd = pmdp_get(pmd);
+
+	if (!pmd_is_swap_entry(old_pmd) ||
+	    softleaf_from_pmd(old_pmd).val != entry.val) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	exclusive = pmd_swp_exclusive(old_pmd);
+
+	/*
+	 * Some architectures may have to restore extra metadata to the folio
+	 * when reading from swap. This metadata may be indexed by swap entry
+	 * so this must be called before folio_put_swap().
+	 */
+	arch_swap_restore(folio_swap(entry, folio), folio);
+
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	new_pmd = pmd_mkold(new_pmd);
+	if (pmd_swp_soft_dirty(old_pmd))
+		new_pmd = pmd_mksoft_dirty(new_pmd);
+	if (pmd_swp_uffd_wp(old_pmd))
+		new_pmd = pmd_mkuffd_wp(new_pmd);
+
+	if (exclusive)
+		rmap_flags |= RMAP_EXCLUSIVE;
+
+	folio_get(folio);
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+	set_pmd_at(mm, addr, pmd, new_pmd);
+	folio_put_swap(folio, NULL);
+
+	spin_unlock(ptl);
+
+	folio_free_swap(folio);
+	return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * Returns -ENOMEM if the PMD-order folio could not be allocated, charged,
+ * or read in, or -EAGAIN if the cached folio is no longer PMD-sized; in
+ * both cases the PMD is split so the caller can fall back to
+ * unuse_pte_range(). Otherwise propagates the error from unuse_pmd().
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+			   unsigned long addr, softleaf_t entry)
+{
+	struct folio *folio;
+	int ret;
+
+	folio = swap_cache_get_folio(entry);
+	if (!folio) {
+		folio = swapin_alloc_pmd_folio(entry, vma->vm_mm);
+		if (!folio) {
+			ret = -ENOMEM;
+			goto split_fallback;
+		}
+	}
+
+	folio_lock(folio);
+	folio_wait_writeback(folio);
+	/*
+	 * If the cached folio is no longer PMD-sized (e.g. split in the
+	 * swap cache by deferred_split_scan() or memory_failure() while
+	 * the PMD swap entry was installed), the PMD swap entry no longer
+	 * maps a single contiguous folio. Split the PMD swap entry so
+	 * unuse_pte_range() can swap in the per-slot folios individually.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	ret = unuse_pmd(vma, pmd, addr, entry, folio);
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, pmd, addr, false);
+	return ret;
+}
+
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 				  unsigned long addr, unsigned long end,
 				  unsigned int type)
@@ -2531,6 +2656,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+
+		pmd_t pmdval = pmdp_get(pmd);
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t sl = softleaf_from_pmd(pmdval);
+
+			if (swp_type(sl) == type) {
+				if (!unuse_pmd_entry(vma, pmd, addr, sl))
+					continue;
+			}
+		}
+
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
 			return ret;
-- 
2.52.0