From: Lance Yang
To: usama.arif@linux.dev
Cc: akpm@linux-foundation.org, david@kernel.org, chrisl@kernel.org,
 kasong@tencent.com, ljs@kernel.org, ziy@nvidia.com,
 ying.huang@linux.alibaba.com, baoquan.he@linux.dev, willy@infradead.org,
 youngjun.park@lge.com, hannes@cmpxchg.org, riel@surriel.com,
 shakeel.butt@linux.dev, alex@ghiti.fr, kas@kernel.org, baohua@kernel.org,
 dev.jain@arm.com, baolin.wang@linux.alibaba.com, npache@redhat.com,
 liam@infradead.org, ryan.roberts@arm.com, vbabka@kernel.org,
 lance.yang@linux.dev, linux-kernel@vger.kernel.org, nphamcs@gmail.com,
 shikemeng@huaweicloud.com, kernel-team@meta.com, linux-mm@kvack.org
Subject: Re: [v2 13/16] mm: handle PMD swap entries in UFFDIO_MOVE
Date: Fri, 12 Jun 2026 16:50:27 +0800
Message-Id: <20260612085027.5401-1-lance.yang@linux.dev>
In-Reply-To: <20260602142537.198755-14-usama.arif@linux.dev>
References: <20260602142537.198755-14-usama.arif@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

+Cc linux-mm

On Tue, Jun 02, 2026 at 07:24:21AM -0700, Usama Arif wrote:
[...]
>@@ -2846,11 +2902,66 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
> 	}
>
> 	if (!pmd_trans_huge(src_pmdval)) {
>-		spin_unlock(src_ptl);
> 		if (pmd_is_migration_entry(src_pmdval)) {
>+			spin_unlock(src_ptl);
> 			pmd_migration_entry_wait(mm, &src_pmdval);
> 			return -EAGAIN;
> 		}
>+		if (pmd_is_swap_entry(src_pmdval)) {

Looks buggy ... unless I missed something ...

>+			swp_entry_t entry;
>+			struct swap_info_struct *si;
>+
>+			/*
>+			 * UFFDIO_MOVE on anon mappings requires single-owner
>+			 * semantics; refuse to move a shared swap entry.
>+			 */
>+			if (!pmd_swp_exclusive(src_pmdval)) {
>+				spin_unlock(src_ptl);
>+				return -EBUSY;
>+			}
>+
>+			entry = softleaf_from_pmd(src_pmdval);
>+			spin_unlock(src_ptl);
>+
>+			/* Pin the swap device against a racing swapoff. */
>+			si = get_swap_device(entry);
>+			if (unlikely(!si))
>+				return -EAGAIN;
>+
>+			src_folio = swap_cache_get_folio(entry);

We only check the first swap slot.

Imagine we have something like this after the PMD-sized swapcache folio
was split while the PMD swap entry was installed:

page table:
  src PMD -> swap entry S

swap cache:
  S + 0 -> no folio
  S + 1 -> order-0 folio in the swap cache
  S + 2 -> no folio
  S + 3 -> order-0 folio in the swap cache
  ...

>+
>+			mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0,
>+						mm, src_addr,
>+						src_addr + HPAGE_PMD_SIZE);
>+			mmu_notifier_invalidate_range_start(&range);
>+
>+			if (src_folio) {
>+				folio_lock(src_folio);
>+				if (folio_nr_pages(src_folio) != HPAGE_PMD_NR) {

If S has a non-PMD-sized folio, this returns -EBUSY.

>+					err = -EBUSY;
>+					folio_unlock(src_folio);
>+					folio_put(src_folio);
>+					mmu_notifier_invalidate_range_end(&range);
>+					put_swap_device(si);
>+					return err;
>+				}
>+			}
>+
>+			dst_ptl = pmd_lockptr(mm, dst_pmd);

But if S has no folio, the initial lookup passes src_folio == NULL to
move_swap_pmd(), which only rechecks S:

	if (src_folio) {
		[...]
	} else if (swap_cache_has_folio(entry)) {
		double_pt_unlock(dst_ptl, src_ptl);
		return -EAGAIN;
	}

So if S is empty, the move can still go ahead even if S + 1 ... S + 511
contain folios in the swap cache.

>+			err = move_swap_pmd(mm, dst_vma, dst_addr, src_addr,
>+					    dst_pmd, src_pmd, dst_pmdval,
>+					    src_pmdval, dst_ptl, src_ptl,
>+					    src_folio, entry);
>+

In that case, checking only S misses the order-0 folios in later slots.
move_swap_pmd() can then move the PMD swap entry whole without calling
folio_move_anon_rmap() or updating folio->index for those later folios.

Note that move_swap_pte() already does that update for PTE-mapped swap
entries, because a folio in the swap cache needs its index and mapping
updated to align with dst_vma.

If those folios are later faulted in at dst, their rmap metadata still
points at the old anon_vma/index. Later rmap users derive the virtual
address from folio->mapping and folio->index, so they can look at the
wrong VMA/address ...

Should we check the whole PMD swap range before deciding there is no
folio in the swap cache to update?

Am I reading that code right?

>+			mmu_notifier_invalidate_range_end(&range);
>+			if (src_folio) {
>+				folio_unlock(src_folio);
>+				folio_put(src_folio);
>+			}
>+			put_swap_device(si);
>+			return err;
>+		}
>+		spin_unlock(src_ptl);
> 		return -ENOENT;
> 	}
>
>-- 
>2.52.0
>

Cheers,
Lance
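
To make the scenario above concrete, here is a minimal userspace sketch
in C. It is a toy model, not kernel code: struct toy_swap_cache,
TOY_PMD_NR and both toy_* helpers are hypothetical names invented for
illustration, and the swap cache is reduced to one flag per slot. It
only contrasts a check of slot S + 0 with a check of the whole
S + 0 ... S + 511 range, assuming 512 slots per PMD swap entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for HPAGE_PMD_NR (512 slots assumed). */
#define TOY_PMD_NR 512

/* Toy swap cache: one flag per swap slot covered by the PMD swap entry. */
struct toy_swap_cache {
	bool has_folio[TOY_PMD_NR];
};

/* Models the check discussed in the review: only slot S + 0 is examined. */
bool toy_first_slot_has_folio(const struct toy_swap_cache *cache)
{
	return cache->has_folio[0];
}

/* Models the suggested check: any folio in S + 0 .. S + nr - 1 counts. */
bool toy_range_has_folio(const struct toy_swap_cache *cache, size_t nr)
{
	assert(nr <= TOY_PMD_NR);

	for (size_t i = 0; i < nr; i++) {
		if (cache->has_folio[i])
			return true;
	}
	return false;
}
```

With flags set only at slots 1 and 3, as in the layout above,
toy_first_slot_has_folio() reports nothing to update, while
toy_range_has_folio(&cache, TOY_PMD_NR) finds the later folios. How a
real range check would be done against the swap cache is left open here.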