From: Usama Arif
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
 ljs@kernel.org, ziy@nvidia.com
Cc: ying.huang@linux.alibaba.com, Baoquan He, willy@infradead.org,
 youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
 shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
 dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com,
 Liam R. Howlett, ryan.roberts@arm.com, Vlastimil Babka,
 lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
 shikemeng@huaweicloud.com, kernel-team@meta.com, Usama Arif
Subject: [v2 10/16] mm: swap in PMD swap entries as whole THPs during swapoff
Date: Tue, 2 Jun 2026 07:24:18 -0700
Message-ID: <20260602142537.198755-11-usama.arif@linux.dev>
In-Reply-To: <20260602142537.198755-1-usama.arif@linux.dev>
References: <20260602142537.198755-1-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

Add unuse_pmd() and unuse_pmd_entry(), and call the latter from
unuse_pmd_range() to swap in PMD-level swap entries as whole THPs during
swapoff. This mirrors the existing unuse_pte_range() but operates at PMD
granularity.

If the PMD-order folio cannot be allocated, the cached folio is no longer
PMD-sized (e.g.
split in the swap cache by deferred_split_scan() or memory_failure() while
the PMD swap entry was installed), or the folio is not uptodate, the PMD
swap entry is split into PTE-level entries via __split_huge_pmd() and a
non-zero error is returned so unuse_pmd_range() falls through to
unuse_pte_range(), which handles the individual entries at order-0.

Signed-off-by: Usama Arif
---
 mm/swapfile.c | 145 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 37408905490e..56454e486324 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -2641,6 +2642,138 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return 0;
 }
 
+/*
+ * unuse_pmd - Map a locked folio at PMD granularity during swapoff.
+ *
+ * The caller provides a locked, swapped-in folio. Returns 0 on success
+ * (PMD was mapped). Returns -EAGAIN if the swap cache folio no longer
+ * matches the entry or the PMD changed under the lock (try_to_unuse will
+ * rescan). Returns -EIO if the folio is not uptodate; in that case the
+ * PMD is split so unuse_pte_range() can handle individual pages.
+ */
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, softleaf_t entry,
+		     struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pmd_t new_pmd, old_pmd;
+	spinlock_t *ptl;
+	rmap_t rmap_flags = RMAP_NONE;
+	bool exclusive;
+
+	if (unlikely(!folio_matches_swap_entry(folio, entry)))
+		return -EAGAIN;
+
+	if (unlikely(!folio_test_uptodate(folio))) {
+		__split_huge_pmd(vma, pmd, addr, false);
+		return -EIO;
+	}
+
+	page = folio_page(folio, 0);
+
+	ptl = pmd_lock(mm, pmd);
+	old_pmd = pmdp_get(pmd);
+
+	if (!pmd_is_swap_entry(old_pmd) ||
+	    softleaf_from_pmd(old_pmd).val != entry.val) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	exclusive = pmd_swp_exclusive(old_pmd);
+
+	/*
+	 * Some architectures may have to restore extra metadata to the folio
+	 * when reading from swap. This metadata may be indexed by swap entry
+	 * so this must be called before folio_put_swap().
+	 */
+	arch_swap_restore(folio_swap(entry, folio), folio);
+
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+
+	new_pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+	new_pmd = pmd_mkold(new_pmd);
+	if (pmd_swp_soft_dirty(old_pmd))
+		new_pmd = pmd_mksoft_dirty(new_pmd);
+	if (pmd_swp_uffd_wp(old_pmd))
+		new_pmd = pmd_mkuffd_wp(new_pmd);
+
+	if (exclusive)
+		rmap_flags |= RMAP_EXCLUSIVE;
+
+	folio_get(folio);
+	if (!folio_test_anon(folio))
+		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+	else
+		folio_add_anon_rmap_pmd(folio, page, vma, addr, rmap_flags);
+
+	set_pmd_at(mm, addr, pmd, new_pmd);
+	folio_put_swap(folio, NULL);
+
+	spin_unlock(ptl);
+
+	folio_free_swap(folio);
+	return 0;
+}
+
+/*
+ * Try to swap in a PMD swap entry as a whole THP. Returns 0 on success.
+ * Returns -ENOMEM if the PMD-order folio could not be allocated/charged,
+ * -EIO if swap-in failed, or -EAGAIN if the cached folio is no longer
+ * PMD-sized; in all of these the PMD is split so the caller can fall
+ * back to unuse_pte_range(). Otherwise propagates the error from
+ * unuse_pmd().
+ */
+static int unuse_pmd_entry(struct vm_area_struct *vma, pmd_t *pmd,
+			   unsigned long addr, softleaf_t entry)
+{
+	struct folio *folio;
+	int ret;
+
+	folio = swap_cache_get_folio(entry);
+	if (!folio) {
+		struct vm_fault vmf = {
+			.vma = vma,
+			.address = addr,
+			.real_address = addr,
+			.pmd = pmd,
+		};
+
+		folio = swapin_sync(entry, GFP_HIGHUSER_MOVABLE,
+				    BIT(HPAGE_PMD_ORDER), &vmf, NULL, 0);
+		if (IS_ERR_OR_NULL(folio)) {
+			ret = -ENOMEM;
+			goto split_fallback;
+		}
+	}
+
+	folio_lock(folio);
+	folio_wait_writeback(folio);
+	/*
+	 * If the cached folio is no longer PMD-sized (e.g. split in the
+	 * swap cache by deferred_split_scan() or memory_failure() while
+	 * the PMD swap entry was installed), the PMD swap entry no longer
+	 * maps a single contiguous folio. Split the PMD swap entry so
+	 * unuse_pte_range() can swap the per-slot folios in individually.
+	 */
+	if (folio_nr_pages(folio) != HPAGE_PMD_NR) {
+		folio_unlock(folio);
+		folio_put(folio);
+		ret = -EAGAIN;
+		goto split_fallback;
+	}
+	ret = unuse_pmd(vma, pmd, addr, entry, folio);
+	folio_unlock(folio);
+	folio_put(folio);
+	return ret;
+
+split_fallback:
+	__split_huge_pmd(vma, pmd, addr, false);
+	return ret;
+}
+
 static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				unsigned int type)
@@ -2653,6 +2786,18 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+
+		pmd_t pmdval = pmdp_get(pmd);
+
+		if (pmd_is_swap_entry(pmdval)) {
+			softleaf_t sl = softleaf_from_pmd(pmdval);
+
+			if (swp_type(sl) == type) {
+				if (!unuse_pmd_entry(vma, pmd, addr, sl))
+					continue;
+			}
+		}
+
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
 			return ret;
-- 
2.52.0
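
The fallback contract described in the commit message (try the whole PMD first, split and return non-zero on failure, let the PTE path finish the job) can be sketched as a small userspace model. This is only an illustration, not kernel code: every toy_* name, the TOY_SLOTS_PER_PMD value and the outcome enum are hypothetical, and the -EAGAIN cases in unuse_pmd() that return without splitting are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for HPAGE_PMD_NR; the real value is arch-dependent. */
#define TOY_SLOTS_PER_PMD 4

/* Toy model of one PMD-sized range; none of these names exist in the kernel. */
struct toy_range {
	bool pmd_swap_entry;  /* a single PMD-level swap entry is installed */
	bool mapped_as_thp;   /* range was swapped in as one huge folio */
	int ptes_swapped_in;  /* order-0 entries swapped in after a split */
};

/* Outcomes the PMD-level attempt can hit in this model. */
enum toy_outcome { TOY_OK, TOY_ALLOC_FAILED, TOY_FOLIO_SPLIT, TOY_NOT_UPTODATE };

/* Models unuse_pmd_entry(): 0 on success, otherwise split and return an error. */
static int toy_unuse_pmd_entry(struct toy_range *r, enum toy_outcome outcome)
{
	int ret;

	switch (outcome) {
	case TOY_OK:
		r->pmd_swap_entry = false;
		r->mapped_as_thp = true;
		return 0;
	case TOY_ALLOC_FAILED:
		ret = -ENOMEM;
		break;
	case TOY_FOLIO_SPLIT:
		ret = -EAGAIN;
		break;
	default:
		ret = -EIO;
		break;
	}
	/* Stands in for __split_huge_pmd(): the entry becomes per-slot entries. */
	r->pmd_swap_entry = false;
	return ret;
}

/* Models unuse_pte_range(): swaps in whatever order-0 entries remain. */
static int toy_unuse_pte_range(struct toy_range *r)
{
	if (!r->mapped_as_thp)
		r->ptes_swapped_in = TOY_SLOTS_PER_PMD;
	return 0;
}

/* Models one iteration of the loop in unuse_pmd_range(). */
static int toy_unuse_one(struct toy_range *r, enum toy_outcome outcome)
{
	if (r->pmd_swap_entry && !toy_unuse_pmd_entry(r, outcome))
		return 0; /* the "continue" case: nothing left for the PTE path */
	return toy_unuse_pte_range(r);
}
```

In this model, calling toy_unuse_one() with TOY_OK leaves the range mapped as a THP with no order-0 work done, while any failing outcome ends with all TOY_SLOTS_PER_PMD slots handled by the PTE path and an overall return of 0, which is the behaviour the commit message describes for swapoff.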