All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mike Rapoport <rppt@kernel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Janosch Frank <frankja@linux.ibm.com>,
	Claudio Imbrenda <imbrenda@linux.ibm.com>,
	David Hildenbrand <david@redhat.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Gerald Schaefer <gerald.schaefer@linux.ibm.com>,
	Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>, Peter Xu <peterx@redhat.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Arnd Bergmann <arnd@arndb.de>, Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Oscar Salvador <osalvador@suse.de>,
	Vlastimil Babka <vbabka@suse.cz>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Matthew Brost <matthew.brost@intel.com>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Rakie Kim <rakie.kim@sk.com>, Byungchul Park <byungchul@sk.com>,
	Gregory Price <gourry@gourry.net>,
	Ying Huang <ying.huang@linux.alibaba.com>,
	Alistair Popple <apopple@nvidia.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Kairui Song <kasong@tencent.com>, Nhat Pham <nphamcs@gmail.com>,
	Baoquan He <bhe@redhat.com>, Chris Li <chrisl@kernel.org>,
	SeongJae Park <sj@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Jason Gunthorpe <jgg@ziepe.ca>, Leon Romanovsky <leon@kernel.org>,
	Xu Xin <xu.xin16@zte.com.cn>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Jann Horn <jannh@google.com>, Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <nao.horiguchi@gmail.com>,
	Pedro Falcato <pfalcato@suse.de>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Rik van Riel <riel@surriel.com>, Harry Yoo <harry.yoo@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	linux-s390@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-arch@vger.kernel.org,
	damon@lists.linux.dev
Subject: Re: [PATCH v2 01/16] mm: correctly handle UFFD PTE markers
Date: Mon, 10 Nov 2025 13:17:37 +0200	[thread overview]
Message-ID: <aRHJ0RDu9fJGEBF8@kernel.org> (raw)
In-Reply-To: <0b50fd4b1d3241d0965e6b969fb49bcc14704d9b.1762621568.git.lorenzo.stoakes@oracle.com>

On Sat, Nov 08, 2025 at 05:08:15PM +0000, Lorenzo Stoakes wrote:
> PTE markers were previously only concerned with UFFD-specific logic - that
> is, PTE entries with the UFFD WP marker set or those marked via
> UFFDIO_POISON.
> 
> However since the introduction of guard markers in commit
>  7c53dfbdb024 ("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer
>  been the case.
> 
> Issues have been avoided as guard regions are not permitted in conjunction
> with UFFD, but it still leaves very confusing logic in place, most notably
> the misleading and poorly named pte_none_mostly() and
> huge_pte_none_mostly().
> 
> This predicate returns true for PTE entries that ought to be treated as
> none, but only in certain circumstances, and on the assumption we are
> dealing with H/W poison markers or UFFD WP markers.
> 
> This patch removes these functions and makes each invocation of these
> functions instead explicitly check what it needs to check.
> 
> As part of this effort it introduces is_uffd_pte_marker() to explicitly
> determine if a marker in fact is used as part of UFFD or not.
> 
> In the HMM logic we note that the only time we would need to check for a
> fault is in the case of a UFFD WP marker, otherwise we simply encounter a
> fault error (VM_FAULT_HWPOISON for H/W poisoned marker, VM_FAULT_SIGSEGV
> for a guard marker), so only check for the UFFD WP case.
> 
> While we're here we also refactor code to make it easier to understand.
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  fs/userfaultfd.c              | 83 +++++++++++++++++++----------------
>  include/asm-generic/hugetlb.h |  8 ----
>  include/linux/swapops.h       | 18 --------
>  include/linux/userfaultfd_k.h | 21 +++++++++
>  mm/hmm.c                      |  2 +-
>  mm/hugetlb.c                  | 47 ++++++++++----------
>  mm/mincore.c                  | 17 +++++--
>  mm/userfaultfd.c              | 27 +++++++-----
>  8 files changed, 123 insertions(+), 100 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 54c6cc7fe9c6..04c66b5001d5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -233,40 +233,46 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	pte_t *ptep, pte;
> -	bool ret = true;
>  
>  	assert_fault_locked(vmf);
>  
>  	ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
>  	if (!ptep)
> -		goto out;
> +		return true;
>  
> -	ret = false;
>  	pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);
>  
>  	/*
>  	 * Lockless access: we're in a wait_event so it's ok if it
> -	 * changes under us.  PTE markers should be handled the same as none
> -	 * ptes here.
> +	 * changes under us.
>  	 */
> -	if (huge_pte_none_mostly(pte))
> -		ret = true;
> +
> +	/* If missing entry, wait for handler. */

It's actually #PF handler that waits ;-)

When userfaultfd_(huge_)must_wait() return true, it means that process that
caused a fault should wait until userspace resolves the fault and return
false means that it's ok to retry the #PF.

So the comment here should probably read as

	/* entry is still missing, wait for userspace to resolve the fault */

and the rest of the comments here and in userfaultfd_must_wait() need
similar update.

> +	if (huge_pte_none(pte))
> +		return true;
> +	/* UFFD PTE markers require handling. */
> +	if (is_uffd_pte_marker(pte))
> +		return true;
> +	/* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
>  	if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> -		ret = true;
> -out:
> -	return ret;
> +		return true;
> +
> +	/* Otherwise, if entry isn't present, let fault handler deal with it. */

Entry is actually present here, e.g because there is a thread that called
UFFDIO_COPY in parallel with the fault, so no need to stuck the faulting
process.

> +	return false;
>  }
>  #else
>  static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>  					      struct vm_fault *vmf,
>  					      unsigned long reason)
>  {
> -	return false;	/* should never get here */
> +	/* Should never get here. */
> +	VM_WARN_ON_ONCE(1);
> +	return false;
>  }
>  #endif /* CONFIG_HUGETLB_PAGE */
>  
>  /*
> - * Verify the pagetables are still not ok after having reigstered into
> + * Verify the pagetables are still not ok after having registered into
>   * the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
>   * userfault that has already been resolved, if userfaultfd_read_iter and
>   * UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
> @@ -284,53 +290,55 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>  	pmd_t *pmd, _pmd;
>  	pte_t *pte;
>  	pte_t ptent;
> -	bool ret = true;
> +	bool ret;
>  
>  	assert_fault_locked(vmf);
>  
>  	pgd = pgd_offset(mm, address);
>  	if (!pgd_present(*pgd))
> -		goto out;
> +		return true;
>  	p4d = p4d_offset(pgd, address);
>  	if (!p4d_present(*p4d))
> -		goto out;
> +		return true;
>  	pud = pud_offset(p4d, address);
>  	if (!pud_present(*pud))
> -		goto out;
> +		return true;
>  	pmd = pmd_offset(pud, address);
>  again:
>  	_pmd = pmdp_get_lockless(pmd);
>  	if (pmd_none(_pmd))
> -		goto out;
> +		return true;
>  
> -	ret = false;
>  	if (!pmd_present(_pmd))
> -		goto out;
> +		return false;

This one is actually tricky, maybe it's worth adding a gist of commit log
from a365ac09d334 ("mm, userfaultfd, THP: avoid waiting when PMD under THP migration")
as a comment.

>  
> -	if (pmd_trans_huge(_pmd)) {
> -		if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
> -			ret = true;
> -		goto out;
> -	}
> +	if (pmd_trans_huge(_pmd))
> +		return !pmd_write(_pmd) && (reason & VM_UFFD_WP);

...

> diff --git a/mm/hmm.c b/mm/hmm.c
> index a56081d67ad6..43d4a91035ff 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -244,7 +244,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  	uint64_t pfn_req_flags = *hmm_pfn;
>  	uint64_t new_pfn_flags = 0;
>  
> -	if (pte_none_mostly(pte)) {
> +	if (pte_none(pte) || pte_marker_uffd_wp(pte)) {

Would be nice to add the note from the changelog as a comment here.

>  		required_fault =
>  			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
>  		if (required_fault)

-- 
Sincerely yours,
Mike.

  parent reply	other threads:[~2025-11-10 11:18 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-08 17:08 [PATCH v2 00/16] mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 01/16] mm: correctly handle UFFD PTE markers Lorenzo Stoakes
2025-11-09 16:26   ` Lance Yang
2025-11-10  6:36     ` Lorenzo Stoakes
2025-11-10 11:17   ` Mike Rapoport [this message]
2025-11-10 13:01     ` Lorenzo Stoakes
2025-11-10 13:44       ` Mike Rapoport
2025-11-10 18:05         ` Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic Lorenzo Stoakes
2025-11-09 12:34   ` Lance Yang
2025-11-10 18:48     ` Lorenzo Stoakes
2025-11-09 13:10   ` Kairui Song
2025-11-10 18:34     ` Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 03/16] mm: avoid unnecessary uses of is_swap_pte() Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 04/16] mm: eliminate is_swap_pte() when softleaf_from_pte() suffices Lorenzo Stoakes
2025-11-09 12:49   ` Kairui Song
2025-11-10 19:38     ` Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 05/16] mm: use leaf entries in debug pgtable + remove is_swap_pte() Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 06/16] fs/proc/task_mmu: refactor pagemap_pmd_range() Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 07/16] mm: avoid unnecessary use of is_swap_pmd() Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 08/16] mm/huge_memory: refactor copy_huge_pmd() non-present logic Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 09/16] mm/huge_memory: refactor change_huge_pmd() " Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 10/16] mm: replace pmd_to_swp_entry() with softleaf_from_pmd() Lorenzo Stoakes
2025-11-08 17:18   ` SeongJae Park
2025-11-10 22:03     ` Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 11/16] mm: introduce pmd_is_huge() and use where appropriate Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 12/16] mm: remove remaining is_swap_pmd() users and is_swap_pmd() Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 13/16] mm: remove non_swap_entry() and use softleaf helpers instead Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 14/16] mm: remove is_hugetlb_entry_[migration, hwpoisoned]() Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 15/16] mm: eliminate further swapops predicates Lorenzo Stoakes
2025-11-08 17:08 ` [PATCH v2 16/16] mm: replace remaining pte_to_swp_entry() with softleaf_from_pte() Lorenzo Stoakes
2025-11-08 18:01 ` [PATCH v2 00/16] mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries Andrew Morton
2025-11-10  7:32 ` Chris Li
2025-11-10 10:18   ` Lorenzo Stoakes
2025-11-10 11:04     ` Chris Li
2025-11-10 11:27       ` Lorenzo Stoakes
2025-11-10 23:38         ` Hugh Dickins
2025-11-11  0:23           ` Andrew Morton
2025-11-11  4:07             ` Hugh Dickins
2025-11-11  6:51               ` Lorenzo Stoakes
2025-11-11  4:16           ` Kairui Song
2025-11-11  6:55             ` Lorenzo Stoakes
2025-11-11  9:19         ` Chris Li
2025-11-11 10:03           ` Lorenzo Stoakes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aRHJ0RDu9fJGEBF8@kernel.org \
    --to=rppt@kernel.org \
    --cc=Liam.Howlett@oracle.com \
    --cc=agordeev@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=arnd@arndb.de \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bhe@redhat.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=brauner@kernel.org \
    --cc=byungchul@sk.com \
    --cc=chengming.zhou@linux.dev \
    --cc=chrisl@kernel.org \
    --cc=damon@lists.linux.dev \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=frankja@linux.ibm.com \
    --cc=gerald.schaefer@linux.ibm.com \
    --cc=gor@linux.ibm.com \
    --cc=gourry@gourry.net \
    --cc=harry.yoo@oracle.com \
    --cc=hca@linux.ibm.com \
    --cc=hughd@google.com \
    --cc=imbrenda@linux.ibm.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kasong@tencent.com \
    --cc=kvm@vger.kernel.org \
    --cc=lance.yang@linux.dev \
    --cc=leon@kernel.org \
    --cc=linmiaohe@huawei.com \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=matthew.brost@intel.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=npache@redhat.com \
    --cc=nphamcs@gmail.com \
    --cc=osalvador@suse.de \
    --cc=pasha.tatashin@soleen.com \
    --cc=peterx@redhat.com \
    --cc=pfalcato@suse.de \
    --cc=rakie.kim@sk.com \
    --cc=riel@surriel.com \
    --cc=ryan.roberts@arm.com \
    --cc=shikemeng@huaweicloud.com \
    --cc=sj@kernel.org \
    --cc=surenb@google.com \
    --cc=svens@linux.ibm.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=weixugc@google.com \
    --cc=willy@infradead.org \
    --cc=xu.xin16@zte.com.cn \
    --cc=ying.huang@linux.alibaba.com \
    --cc=yuanchu@google.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.