From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-186.mta1.migadu.com (out-186.mta1.migadu.com [95.215.58.186]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 12452301711 for ; Sun, 9 Nov 2025 16:26:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.186 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762705621; cv=none; b=L40gBVQA8Iux4aTWRdk2bUp+HtcSeSJrL4rLMZxGy6whRtMwtGKAwgEmP0LTqH+tb3Jb65Sdmw5PS6wthxBA7Ezy4ZpgkUL6mEBZ9/nsE2Z9baGmtEoML6j9jeWGo5Uj+MMIZZpKBoK1EpH7HS5gkMjS9o1br7WHe+R5utiUNvA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762705621; c=relaxed/simple; bh=M6/vBucdxVbLAF6O/PzO5KEcefqj1NXVjYQTgRfyGgM=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=lti7Bxgdn9jwFeG8LHhpQczaCUlBNOx9oPzPsvstN/eiJWbRi24CyJQBo5wUxqzGbgZQshXx1tAU6DfzvmrN8xPGZUfIHYkGxrVAY+KgN5s3/vRrDbUHDQrmQ+vXfCE8qdupnRMl3xGjeJPJZM8l7fRGwBWmz+nDzoSB+CJ4uEI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=pwyW2EW6; arc=none smtp.client-ip=95.215.58.186 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="pwyW2EW6" Message-ID: <5ed51639-604c-4e15-84ae-4bf3777f83c1@linux.dev> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1762705611; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Oc0nS0Oa6VgyqrdvhkSTK7B+nOWIW7alkYgFANGDZPQ=; b=pwyW2EW686jL52qUi8I2VuTz1EmNkWpfSru0DsgZMM+9Q0p0FNTrkdRidIItTWCToFu7KE n9z0y9FEp1NXnIrrKUnS0acP2LaEfRrFzFZYM6R/qjdwf2GSGPEVniFjn2kGZ4QpoXZKdK dRzKFf0NJRNE89MT5AE5NrJjf4gimIk= Date: Mon, 10 Nov 2025 00:26:26 +0800 Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Subject: Re: [PATCH v2 01/16] mm: correctly handle UFFD PTE markers Content-Language: en-US To: Lorenzo Stoakes Cc: Christian Borntraeger , Janosch Frank , Claudio Imbrenda , David Hildenbrand , Alexander Gordeev , Gerald Schaefer , Heiko Carstens , Vasily Gorbik , Sven Schnelle , Peter Xu , Alexander Viro , Christian Brauner , Jan Kara , Andrew Morton , Arnd Bergmann , Zi Yan , Baolin Wang , "Liam R . Howlett" , Nico Pache , Ryan Roberts , Dev Jain , Barry Song , Muchun Song , Oscar Salvador , Vlastimil Babka , Mike Rapoport , Suren Baghdasaryan , Michal Hocko , Matthew Brost , Joshua Hahn , Rakie Kim , Byungchul Park , Gregory Price , Ying Huang , Alistair Popple , Axel Rasmussen , Yuanchu Xie , Wei Xu , Kemeng Shi , Kairui Song , Nhat Pham , Baoquan He , Chris Li , SeongJae Park , Matthew Wilcox , Jason Gunthorpe , Leon Romanovsky , Xu Xin , Chengming Zhou , Jann Horn , Miaohe Lin , Naoya Horiguchi , Pedro Falcato , Pasha Tatashin , Rik van Riel , Harry Yoo , Hugh Dickins , linux-kernel@vger.kernel.org, kvm@vger.kernel.org, linux-s390@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, damon@lists.linux.dev References: <0b50fd4b1d3241d0965e6b969fb49bcc14704d9b.1762621568.git.lorenzo.stoakes@oracle.com> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Lance Yang In-Reply-To: <0b50fd4b1d3241d0965e6b969fb49bcc14704d9b.1762621568.git.lorenzo.stoakes@oracle.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT On 2025/11/9 01:08, Lorenzo Stoakes wrote: > PTE markers were previously only concerned with UFFD-specific logic - that > is, PTE entries with the UFFD WP marker set or those marked via > UFFDIO_POISON. > > However since the introduction of guard markers in commit > 7c53dfbdb024 ("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer > been the case. > > Issues have been avoided as guard regions are not permitted in conjunction > with UFFD, but it still leaves very confusing logic in place, most notably > the misleading and poorly named pte_none_mostly() and > huge_pte_none_mostly(). > > This predicate returns true for PTE entries that ought to be treated as > none, but only in certain circumstances, and on the assumption we are > dealing with H/W poison markers or UFFD WP markers. > > This patch removes these functions and makes each invocation of these > functions instead explicitly check what it needs to check. > > As part of this effort it introduces is_uffd_pte_marker() to explicitly > determine if a marker in fact is used as part of UFFD or not. > > In the HMM logic we note that the only time we would need to check for a > fault is in the case of a UFFD WP marker, otherwise we simply encounter a > fault error (VM_FAULT_HWPOISON for H/W poisoned marker, VM_FAULT_SIGSEGV > for a guard marker), so only check for the UFFD WP case. > > While we're here we also refactor code to make it easier to understand. > > Reviewed-by: Vlastimil Babka > Signed-off-by: Lorenzo Stoakes > --- > fs/userfaultfd.c | 83 +++++++++++++++++++---------------- > include/asm-generic/hugetlb.h | 8 ---- > include/linux/swapops.h | 18 -------- > include/linux/userfaultfd_k.h | 21 +++++++++ > mm/hmm.c | 2 +- > mm/hugetlb.c | 47 ++++++++++---------- > mm/mincore.c | 17 +++++-- > mm/userfaultfd.c | 27 +++++++----- > 8 files changed, 123 insertions(+), 100 deletions(-) > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index 54c6cc7fe9c6..04c66b5001d5 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -233,40 +233,46 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, > { > struct vm_area_struct *vma = vmf->vma; > pte_t *ptep, pte; > - bool ret = true; > > assert_fault_locked(vmf); > > ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma)); > if (!ptep) > - goto out; > + return true; > > - ret = false; > pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep); > > /* > * Lockless access: we're in a wait_event so it's ok if it > - * changes under us. PTE markers should be handled the same as none > - * ptes here. > + * changes under us. > */ > - if (huge_pte_none_mostly(pte)) > - ret = true; > + > + /* If missing entry, wait for handler. */ > + if (huge_pte_none(pte)) > + return true; > + /* UFFD PTE markers require handling. */ > + if (is_uffd_pte_marker(pte)) > + return true; > + /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */ > if (!huge_pte_write(pte) && (reason & VM_UFFD_WP)) > - ret = true; > -out: > - return ret; > + return true; > + > + /* Otherwise, if entry isn't present, let fault handler deal with it. */ > + return false; > } > #else > static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, > struct vm_fault *vmf, > unsigned long reason) > { > - return false; /* should never get here */ > + /* Should never get here. */ > + VM_WARN_ON_ONCE(1); > + return false; > } > #endif /* CONFIG_HUGETLB_PAGE */ > > /* > - * Verify the pagetables are still not ok after having reigstered into > + * Verify the pagetables are still not ok after having registered into > * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any > * userfault that has already been resolved, if userfaultfd_read_iter and > * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different > @@ -284,53 +290,55 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx, > pmd_t *pmd, _pmd; > pte_t *pte; > pte_t ptent; > - bool ret = true; > + bool ret; > > assert_fault_locked(vmf); > > pgd = pgd_offset(mm, address); > if (!pgd_present(*pgd)) > - goto out; > + return true; > p4d = p4d_offset(pgd, address); > if (!p4d_present(*p4d)) > - goto out; > + return true; > pud = pud_offset(p4d, address); > if (!pud_present(*pud)) > - goto out; > + return true; > pmd = pmd_offset(pud, address); > again: > _pmd = pmdp_get_lockless(pmd); > if (pmd_none(_pmd)) > - goto out; > + return true; > > - ret = false; > if (!pmd_present(_pmd)) > - goto out; > + return false; > > - if (pmd_trans_huge(_pmd)) { > - if (!pmd_write(_pmd) && (reason & VM_UFFD_WP)) > - ret = true; > - goto out; > - } > + if (pmd_trans_huge(_pmd)) > + return !pmd_write(_pmd) && (reason & VM_UFFD_WP); > > pte = pte_offset_map(pmd, address); > - if (!pte) { > - ret = true; > + if (!pte) > goto again; > - } > + > /* > * Lockless access: we're in a wait_event so it's ok if it > - * changes under us. PTE markers should be handled the same as none > - * ptes here. > + * changes under us. > */ > ptent = ptep_get(pte); > - if (pte_none_mostly(ptent)) > - ret = true; > + > + ret = true; > + /* If missing entry, wait for handler. */ > + if (pte_none(ptent)) > + goto out; > + /* UFFD PTE markers require handling. */ > + if (is_uffd_pte_marker(ptent)) > + goto out; > + /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */ > if (!pte_write(ptent) && (reason & VM_UFFD_WP)) > - ret = true; > - pte_unmap(pte); > + goto out; > > + ret = false; > out: > + pte_unmap(pte); > return ret; > } > > @@ -490,12 +498,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) > set_current_state(blocking_state); > spin_unlock_irq(&ctx->fault_pending_wqh.lock); > > - if (!is_vm_hugetlb_page(vma)) > - must_wait = userfaultfd_must_wait(ctx, vmf, reason); > - else > + if (is_vm_hugetlb_page(vma)) { > must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason); > - if (is_vm_hugetlb_page(vma)) > hugetlb_vma_unlock_read(vma); > + } else { > + must_wait = userfaultfd_must_wait(ctx, vmf, reason); > + } > + > release_fault_lock(vmf); > > if (likely(must_wait && !READ_ONCE(ctx->released))) { > diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h > index dcb8727f2b82..e1a2e1b7c8e7 100644 > --- a/include/asm-generic/hugetlb.h > +++ b/include/asm-generic/hugetlb.h > @@ -97,14 +97,6 @@ static inline int huge_pte_none(pte_t pte) > } > #endif > > -/* Please refer to comments above pte_none_mostly() for the usage */ > -#ifndef __HAVE_ARCH_HUGE_PTE_NONE_MOSTLY > -static inline int huge_pte_none_mostly(pte_t pte) > -{ > - return huge_pte_none(pte) || is_pte_marker(pte); > -} > -#endif > - > #ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT > static inline void huge_ptep_set_wrprotect(struct mm_struct *mm, > unsigned long addr, pte_t *ptep) > diff --git a/include/linux/swapops.h b/include/linux/swapops.h > index 2687928a8146..d1f665935cfc 100644 > --- a/include/linux/swapops.h > +++ b/include/linux/swapops.h > @@ -469,24 +469,6 @@ static inline int is_guard_swp_entry(swp_entry_t entry) > (pte_marker_get(entry) & PTE_MARKER_GUARD); > } > > -/* > - * This is a special version to check pte_none() just to cover the case when > - * the pte is a pte marker. It existed because in many cases the pte marker > - * should be seen as a none pte; it's just that we have stored some information > - * onto the none pte so it becomes not-none any more. > - * > - * It should be used when the pte is file-backed, ram-based and backing > - * userspace pages, like shmem. It is not needed upon pgtables that do not > - * support pte markers at all. For example, it's not needed on anonymous > - * memory, kernel-only memory (including when the system is during-boot), > - * non-ram based generic file-system. It's fine to be used even there, but the > - * extra pte marker check will be pure overhead. > - */ > -static inline int pte_none_mostly(pte_t pte) > -{ > - return pte_none(pte) || is_pte_marker(pte); > -} > - > static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) > { > struct page *p = pfn_to_page(swp_offset_pfn(entry)); > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h > index c0e716aec26a..da0b4fcc566f 100644 > --- a/include/linux/userfaultfd_k.h > +++ b/include/linux/userfaultfd_k.h > @@ -479,4 +479,25 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte) > return false; > } > > + > +static inline bool is_uffd_pte_marker(pte_t pte) > +{ > + swp_entry_t entry; > + > + if (pte_present(pte)) > + return false; > + > + entry = pte_to_swp_entry(pte); > + if (!is_pte_marker_entry(entry)) > + return false; > + > + /* UFFD WP, poisoned swap entries are UFFD handled. */ > + if (pte_marker_entry_uffd_wp(entry)) > + return true; > + if (is_poisoned_swp_entry(entry)) > + return true; > + > + return false; > +} > + > #endif /* _LINUX_USERFAULTFD_K_H */ > diff --git a/mm/hmm.c b/mm/hmm.c > index a56081d67ad6..43d4a91035ff 100644 > --- a/mm/hmm.c > +++ b/mm/hmm.c > @@ -244,7 +244,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, > uint64_t pfn_req_flags = *hmm_pfn; > uint64_t new_pfn_flags = 0; > > - if (pte_none_mostly(pte)) { > + if (pte_none(pte) || pte_marker_uffd_wp(pte)) { > required_fault = > hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0); > if (required_fault) > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 1ea459723cce..01c784547d1e 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -6743,29 +6743,28 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, > } > > vmf.orig_pte = huge_ptep_get(mm, vmf.address, vmf.pte); > - if (huge_pte_none_mostly(vmf.orig_pte)) { > - if (is_pte_marker(vmf.orig_pte)) { > - pte_marker marker = > - pte_marker_get(pte_to_swp_entry(vmf.orig_pte)); > - > - if (marker & PTE_MARKER_POISONED) { > - ret = VM_FAULT_HWPOISON_LARGE | > - VM_FAULT_SET_HINDEX(hstate_index(h)); > - goto out_mutex; > - } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) { > - /* This isn't supported in hugetlb. */ > - ret = VM_FAULT_SIGSEGV; > - goto out_mutex; > - } > - } > - > + if (huge_pte_none(vmf.orig_pte)) > /* > - * Other PTE markers should be handled the same way as none PTE. > - * > * hugetlb_no_page will drop vma lock and hugetlb fault > * mutex internally, which make us return immediately. > */ > return hugetlb_no_page(mapping, &vmf); > + > + if (is_pte_marker(vmf.orig_pte)) { > + const pte_marker marker = > + pte_marker_get(pte_to_swp_entry(vmf.orig_pte)); > + > + if (marker & PTE_MARKER_POISONED) { > + ret = VM_FAULT_HWPOISON_LARGE | > + VM_FAULT_SET_HINDEX(hstate_index(h)); > + goto out_mutex; > + } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) { > + /* This isn't supported in hugetlb. */ > + ret = VM_FAULT_SIGSEGV; > + goto out_mutex; > + } > + > + return hugetlb_no_page(mapping, &vmf); > } > > ret = 0; > @@ -6934,6 +6933,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, > int ret = -ENOMEM; > struct folio *folio; > bool folio_in_pagecache = false; > + pte_t dst_ptep; > > if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) { > ptl = huge_pte_lock(h, dst_mm, dst_pte); > @@ -7073,13 +7073,14 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, > if (folio_test_hwpoison(folio)) > goto out_release_unlock; > > + ret = -EEXIST; > + > + dst_ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte); > /* > - * We allow to overwrite a pte marker: consider when both MISSING|WP > - * registered, we firstly wr-protect a none pte which has no page cache > - * page backing it, then access the page. > + * See comment about UFFD marker overwriting in > + * mfill_atomic_install_pte(). > */ > - ret = -EEXIST; > - if (!huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) > + if (!huge_pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep)) > goto out_release_unlock; > > if (folio_in_pagecache) > diff --git a/mm/mincore.c b/mm/mincore.c > index 8ec4719370e1..151b2dbb783b 100644 > --- a/mm/mincore.c > +++ b/mm/mincore.c > @@ -32,11 +32,22 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, > spinlock_t *ptl; > > ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte); > + > /* > * Hugepages under user process are always in RAM and never > * swapped out, but theoretically it needs to be checked. > */ > - present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte)); > + if (!pte) { > + present = 0; > + } else { > + const pte_t ptep = huge_ptep_get(walk->mm, addr, pte); > + > + if (huge_pte_none(ptep) || is_pte_marker(ptep)) > + present = 0; > + else > + present = 1; > + } > + > for (; addr != end; vec++, addr += PAGE_SIZE) > *vec = present; > walk->private = vec; > @@ -175,8 +186,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > pte_t pte = ptep_get(ptep); > > step = 1; > - /* We need to do cache lookup too for pte markers */ > - if (pte_none_mostly(pte)) > + /* We need to do cache lookup too for UFFD pte markers */ > + if (pte_none(pte) || is_uffd_pte_marker(pte)) Seems like something is changed, new is_uffd_pte_marker check will miss non-UFFD markers (like guard markers) , and then would fall through to the swap entry logic to be misreported as resident by mincore_swap(). ``` /* We need to do cache lookup too for UFFD pte markers */ if (pte_none(pte) || is_uffd_pte_marker(pte)) __mincore_unmapped_range(addr, addr + PAGE_SIZE, vma, vec); else if (pte_present(pte)) { unsigned int batch = pte_batch_hint(ptep, pte); if (batch > 1) { unsigned int max_nr = (end - addr) >> PAGE_SHIFT; step = min_t(unsigned int, batch, max_nr); } for (i = 0; i < step; i++) vec[i] = 1; } else { /* pte is a swap entry */ *vec = mincore_swap(pte_to_swp_entry(pte), false); } ``` Wouldn't the generic is_pte_marker() be safer here? Thanks, Lance > __mincore_unmapped_range(addr, addr + PAGE_SIZE, > vma, vec); > else if (pte_present(pte)) { > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index 00122f42718c..cc4ce205bbec 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -178,6 +178,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, > spinlock_t *ptl; > struct folio *folio = page_folio(page); > bool page_in_cache = folio_mapping(folio); > + pte_t dst_ptep; > > _dst_pte = mk_pte(page, dst_vma->vm_page_prot); > _dst_pte = pte_mkdirty(_dst_pte); > @@ -199,12 +200,15 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, > } > > ret = -EEXIST; > + > + dst_ptep = ptep_get(dst_pte); > + > /* > - * We allow to overwrite a pte marker: consider when both MISSING|WP > - * registered, we firstly wr-protect a none pte which has no page cache > - * page backing it, then access the page. > + * We are allowed to overwrite a UFFD pte marker: consider when both > + * MISSING|WP registered, we firstly wr-protect a none pte which has no > + * page cache page backing it, then access the page. > */ > - if (!pte_none_mostly(ptep_get(dst_pte))) > + if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep)) > goto out_unlock; > > if (page_in_cache) { > @@ -583,12 +587,15 @@ static __always_inline ssize_t mfill_atomic_hugetlb( > goto out_unlock; > } > > - if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && > - !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) { > - err = -EEXIST; > - hugetlb_vma_unlock_read(dst_vma); > - mutex_unlock(&hugetlb_fault_mutex_table[hash]); > - goto out_unlock; > + if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) { > + const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte); > + > + if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) { > + err = -EEXIST; > + hugetlb_vma_unlock_read(dst_vma); > + mutex_unlock(&hugetlb_fault_mutex_table[hash]); > + goto out_unlock; > + } > } > > err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr,