From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton, david@kernel.org, chrisl@kernel.org, kasong@tencent.com,
	ljs@kernel.org, ziy@nvidia.com
Cc: bhe@redhat.com, willy@infradead.org, youngjun.park@lge.com,
	hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
	alex@ghiti.fr, kas@kernel.org, baohua@kernel.org, dev.jain@arm.com,
	baolin.wang@linux.alibaba.com, npache@redhat.com,
	Liam.Howlett@oracle.com, ryan.roberts@arm.com, Vlastimil Babka,
	lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
	shikemeng@huaweicloud.com, kernel-team@meta.com,
	Usama Arif <usama.arif@linux.dev>
Subject: [PATCH 10/13] mm: handle PMD swap entries in UFFDIO_MOVE
Date: Mon, 27 Apr 2026 03:01:59 -0700
Message-ID: <20260427100553.2754667-11-usama.arif@linux.dev>
In-Reply-To: <20260427100553.2754667-1-usama.arif@linux.dev>
References: <20260427100553.2754667-1-usama.arif@linux.dev>

move_pages_huge_pmd() returns -ENOENT for any non-trans_huge,
non-migration PMD, which fails an aligned UFFDIO_MOVE on a swapped-out
THP -- the PMD swap entry is a perfectly valid mapping that should be
moved whole.
Splitting via the move_pages_ptes() fallback isn't a substitute either:
__split_huge_pmd_locked() splits a PMD swap entry into HPAGE_PMD_NR PTE
swap entries pointing at the same swap-cache folio, but move_swap_pte()
refuses any swap-cache folio that is still large and returns -EBUSY.

Add move_swap_pmd(), modeled on move_swap_pte(), that moves the swap
entry whole-PMD and re-anchors the swap-cache folio's anon rmap to the
destination VMA. Reject !pmd_swp_exclusive() entries with -EBUSY to
preserve UFFDIO_MOVE's single-owner semantics, mark the moved entry
soft-dirty so the move is visible to soft-dirty tracking, and carry the
deposited page table across with the entry.

The dispatcher in move_pages_huge_pmd() now waits for migration on a
PMD migration entry (matching the PTE path) and routes PMD swap entries
through move_swap_pmd() after pinning the swap device, fetching and
locking any cached folio, and arming an mmu_notifier range so secondary
MMUs see the move.

If the swap-cache folio was split (e.g. by deferred_split_scan or
memory_failure) between swap-out and UFFDIO_MOVE, src_folio is no
longer PMD-sized but the PMD swap entry still covers all 512 slots.
Moving the entry whole would only re-anchor one folio's anon rmap,
leaving the other 511 with a stale anon_vma. Return -EBUSY in this
case, matching move_pages_pte()'s rejection of large folios, so the
caller falls back to PTE-level moves.

Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
 mm/huge_memory.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 112 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 109e4dc4a167..bfcc9b274be7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2871,6 +2871,62 @@ int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 #endif
 
 #ifdef CONFIG_USERFAULTFD
+/*
+ * Move a PMD-level swap entry from src_pmd to dst_pmd. Both PMD locks are
+ * acquired here; src_folio (if present) must already be locked. The deposited
+ * page table backing the source THP is moved across with the entry.
+ */
+static int move_swap_pmd(struct mm_struct *mm, struct vm_area_struct *dst_vma,
+			 unsigned long dst_addr, unsigned long src_addr,
+			 pmd_t *dst_pmd, pmd_t *src_pmd,
+			 pmd_t orig_dst_pmd, pmd_t orig_src_pmd,
+			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
+			 struct folio *src_folio, swp_entry_t entry)
+{
+	pgtable_t src_pgtable;
+	pmd_t moved_pmd;
+
+	/*
+	 * The folio may have been freed and reused for a different swap entry
+	 * while it was unlocked. Re-verify the association.
+	 */
+	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+				  entry.val != src_folio->swap.val))
+		return -EAGAIN;
+
+	double_pt_lock(dst_ptl, src_ptl);
+
+	if (!pmd_same(*src_pmd, orig_src_pmd) ||
+	    !pmd_same(*dst_pmd, orig_dst_pmd)) {
+		double_pt_unlock(dst_ptl, src_ptl);
+		return -EAGAIN;
+	}
+
+	/*
+	 * If the folio is in the swap cache, re-anchor its anon rmap to the
+	 * destination VMA so a future swap-in fault at dst_addr finds it.
+	 * Otherwise, re-check that no folio was newly inserted under us.
+	 */
+	if (src_folio) {
+		folio_move_anon_rmap(src_folio, dst_vma);
+		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else if (swap_cache_has_folio(entry)) {
+		double_pt_unlock(dst_ptl, src_ptl);
+		return -EAGAIN;
+	}
+
+	moved_pmd = pmdp_huge_get_and_clear(mm, src_addr, src_pmd);
+	if (pgtable_supports_soft_dirty())
+		moved_pmd = pmd_swp_mksoft_dirty(moved_pmd);
+	set_pmd_at(mm, dst_addr, dst_pmd, moved_pmd);
+
+	src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
+	pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
+
+	double_pt_unlock(dst_ptl, src_ptl);
+	return 0;
+}
+
 /*
  * The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
  * the caller, but it must return after releasing the page_table_lock.
@@ -2905,11 +2961,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
 	}
 
 	if (!pmd_trans_huge(src_pmdval)) {
-		spin_unlock(src_ptl);
 		if (pmd_is_migration_entry(src_pmdval)) {
+			spin_unlock(src_ptl);
 			pmd_migration_entry_wait(mm, &src_pmdval);
 			return -EAGAIN;
 		}
+		if (pmd_is_swap_entry(src_pmdval)) {
+			swp_entry_t entry;
+			struct swap_info_struct *si;
+
+			/*
+			 * UFFDIO_MOVE on anon mappings requires single-owner
+			 * semantics; refuse to move a shared swap entry.
+			 */
+			if (!pmd_swp_exclusive(src_pmdval)) {
+				spin_unlock(src_ptl);
+				return -EBUSY;
+			}
+
+			entry = softleaf_from_pmd(src_pmdval);
+			spin_unlock(src_ptl);
+
+			/* Pin the swap device against a racing swapoff. */
+			si = get_swap_device(entry);
+			if (unlikely(!si))
+				return -EAGAIN;
+
+			src_folio = swap_cache_get_folio(entry);
+
+			mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
+						mm, src_addr,
+						src_addr + HPAGE_PMD_SIZE);
+			mmu_notifier_invalidate_range_start(&range);
+
+			if (src_folio) {
+				folio_lock(src_folio);
+				if (folio_nr_pages(src_folio) != HPAGE_PMD_NR) {
+					err = -EBUSY;
+					folio_unlock(src_folio);
+					folio_put(src_folio);
+					mmu_notifier_invalidate_range_end(&range);
+					put_swap_device(si);
+					return err;
+				}
+			}
+
+			dst_ptl = pmd_lockptr(mm, dst_pmd);
+			err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
+					    dst_pmd, src_pmd, dst_pmdval,
+					    src_pmdval, dst_ptl, src_ptl,
+					    src_folio, entry);
+
+			mmu_notifier_invalidate_range_end(&range);
+			if (src_folio) {
+				folio_unlock(src_folio);
+				folio_put(src_folio);
+			}
+			put_swap_device(si);
+			return err;
+		}
+		spin_unlock(src_ptl);
 		return -ENOENT;
 	}
 

-- 
2.52.0
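
For readers unfamiliar with this path, below is a minimal, hypothetical
userspace sketch of the case the patch enables: fault in a PMD-mapped
THP, push it out with MADV_PAGEOUT so a PMD swap entry is left behind,
then move it with UFFDIO_MOVE. It is not part of the patch; it assumes
x86-64's 2 MiB PMD size, a THP-enabled kernel, and the UFFDIO_MOVE UAPI
from v6.8. Alignment is handled crudely and error checking is elided.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

#define PMD_SZ	(2UL << 20)	/* assumed x86-64 PMD size */

/* Over-map and round up so src/dst can be PMD-mapped (tails leak). */
static char *map_pmd_aligned(void)
{
	char *p = mmap(NULL, 2 * PMD_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return (char *)(((uintptr_t)p + PMD_SZ - 1) & ~(uintptr_t)(PMD_SZ - 1));
}

int main(void)
{
	long uffd = syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
	struct uffdio_api api = { .api = UFFD_API };
	char *src = map_pmd_aligned();
	char *dst = map_pmd_aligned();

	ioctl(uffd, UFFDIO_API, &api);

	madvise(src, PMD_SZ, MADV_HUGEPAGE);
	memset(src, 0xab, PMD_SZ);		/* fault in a THP at src */
	madvise(src, PMD_SZ, MADV_PAGEOUT);	/* leave a PMD swap entry */

	/* UFFDIO_MOVE requires dst to be registered in missing mode. */
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)dst, .len = PMD_SZ },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	struct uffdio_move mv = {
		.dst = (uintptr_t)dst,
		.src = (uintptr_t)src,
		.len = PMD_SZ,
	};
	/* Previously failed on a swapped-out PMD; now moves it whole. */
	if (ioctl(uffd, UFFDIO_MOVE, &mv) < 0)
		perror("UFFDIO_MOVE");
	return 0;
}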