From: Nikita Kalyazin <kalyazin@amazon.com>
To: James Houghton <jthoughton@google.com>
Cc: <akpm@linux-foundation.org>, <pbonzini@redhat.com>,
<shuah@kernel.org>, <kvm@vger.kernel.org>,
<linux-kselftest@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-mm@kvack.org>, <lorenzo.stoakes@oracle.com>,
<david@redhat.com>, <ryan.roberts@arm.com>,
<quic_eberman@quicinc.com>, <peterx@redhat.com>, <graf@amazon.de>,
<jgowans@amazon.com>, <roypat@amazon.co.uk>, <derekmn@amazon.com>,
<nsaenz@amazon.es>, <xmarcalx@amazon.com>
Subject: Re: [PATCH v2 1/5] mm: userfaultfd: generic continue for non hugetlbfs
Date: Thu, 3 Apr 2025 18:01:28 +0100 [thread overview]
Message-ID: <16763ece-fa3b-4a68-9b3c-5331954d96e7@amazon.com> (raw)
In-Reply-To: <CADrL8HW2bFsFqifTytY8+Fy1nH-arFZ2JqAKi44_4nMtkPksDA@mail.gmail.com>
On 02/04/2025 20:04, James Houghton wrote:
> On Wed, Apr 2, 2025 at 9:07 AM Nikita Kalyazin <kalyazin@amazon.com> wrote:
>>
>> Remove shmem-specific code from UFFDIO_CONTINUE implementation for
>> non-huge pages by calling vm_ops->fault(). A new VMF flag,
>> FAULT_FLAG_NO_USERFAULT_MINOR, is introduced to avoid recursive call to
>> handle_userfault().
>>
>> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
>> ---
>> include/linux/mm_types.h | 3 +++
>> mm/hugetlb.c | 2 +-
>> mm/shmem.c | 3 ++-
>> mm/userfaultfd.c | 25 ++++++++++++++++++-------
>> 4 files changed, 24 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 0234f14f2aa6..91a00f2cd565 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1429,6 +1429,8 @@ enum tlb_flush_reason {
>> * @FAULT_FLAG_ORIG_PTE_VALID: whether the fault has vmf->orig_pte cached.
>> * We should only access orig_pte if this flag set.
>> * @FAULT_FLAG_VMA_LOCK: The fault is handled under VMA lock.
>> + * @FAULT_FLAG_NO_USERFAULT_MINOR: The fault handler must not call userfaultfd
>> + * minor handler.
>
> Perhaps instead a flag that says to avoid the userfaultfd minor fault
> handler, maybe we should have a flag to indicate that vm_ops->fault()
> has been called by UFFDIO_CONTINUE. See below.
>
>> *
>> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>> * whether we would allow page faults to retry by specifying these two
>> @@ -1467,6 +1469,7 @@ enum fault_flag {
>> FAULT_FLAG_UNSHARE = 1 << 10,
>> FAULT_FLAG_ORIG_PTE_VALID = 1 << 11,
>> FAULT_FLAG_VMA_LOCK = 1 << 12,
>> + FAULT_FLAG_NO_USERFAULT_MINOR = 1 << 13,
>> };
>>
>> typedef unsigned int __bitwise zap_flags_t;
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 97930d44d460..ba90d48144fc 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -6228,7 +6228,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>> }
>>
>> /* Check for page in userfault range. */
>> - if (userfaultfd_minor(vma)) {
>> + if (userfaultfd_minor(vma) && !(vmf->flags & FAULT_FLAG_NO_USERFAULT_MINOR)) {
>> folio_unlock(folio);
>> folio_put(folio);
>> /* See comment in userfaultfd_missing() block above */
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 1ede0800e846..5e1911e39dec 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -2467,7 +2467,8 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
>> fault_mm = vma ? vma->vm_mm : NULL;
>>
>> folio = filemap_get_entry(inode->i_mapping, index);
>> - if (folio && vma && userfaultfd_minor(vma)) {
>> + if (folio && vma && userfaultfd_minor(vma) &&
>> + !(vmf->flags & FAULT_FLAG_NO_USERFAULT_MINOR)) {
>> if (!xa_is_value(folio))
>> folio_put(folio);
>> *fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>> index d06453fa8aba..68a995216789 100644
>> --- a/mm/userfaultfd.c
>> +++ b/mm/userfaultfd.c
>> @@ -386,24 +386,35 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
>> unsigned long dst_addr,
>> uffd_flags_t flags)
>> {
>> - struct inode *inode = file_inode(dst_vma->vm_file);
>> - pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
>> struct folio *folio;
>> struct page *page;
>> int ret;
>> + struct vm_fault vmf = {
>> + .vma = dst_vma,
>> + .address = dst_addr,
>> + .flags = FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE |
>> + FAULT_FLAG_NO_USERFAULT_MINOR,
>> + .pte = NULL,
>> + .page = NULL,
>> + .pgoff = linear_page_index(dst_vma, dst_addr),
>> + };
>> +
>> + if (!dst_vma->vm_ops || !dst_vma->vm_ops->fault)
>> + return -EINVAL;
>>
>> - ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
>> - /* Our caller expects us to return -EFAULT if we failed to find folio */
>> - if (ret == -ENOENT)
>> + ret = dst_vma->vm_ops->fault(&vmf);
>
> shmem_get_folio() was being called with SGP_NOALLOC, and now it is
> being called with SGP_CACHE (by shmem_fault()). This will result in a
> UAPI change: UFFDIO_CONTINUE for a VA without a page in the page cache
> should result in EFAULT, but now the page will be allocated.
> SGP_NOALLOC was carefully chosen[1], so I think a better way to do
> this will be to:
>
> 1. Have a FAULT_FLAG_USERFAULT_CONTINUE (or something)
> 2. In shmem_fault(), if FAULT_FLAG_USERFAULT_CONTINUE, use SGP_NOALLOC
> instead of SGP_CACHE (and make sure not to drop into
> handle_userfault(), of course)
Yes, that is very true. Thanks for pointing out. Will apply in the
next version.
>
> [1]: https://lore.kernel.org/linux-mm/20220610173812.1768919-1-axelrasmussen@google.com/
>
>> + if (ret & VM_FAULT_ERROR) {
>> ret = -EFAULT;
>> - if (ret)
>> goto out;
>> + }
>> +
>> + page = vmf.page;
>> + folio = page_folio(page);
>> if (!folio) {
>
> What if ret == VM_FAULT_RETRY? I think we should retry instead instead
> of returning -EFAULT.
True. I'm going to retry the vm_ops->fault() call in this case.
>
> And I'm not sure how VM_FAULT_NOPAGE should be handled, like if we
> need special logic for it or not.
As far as I see, the only case it can come up in shmem is during a fault
when a hole is being punched [1]. I'm thinking of returning EAGAIN to
the userspace if this happens.
[1] https://elixir.bootlin.com/linux/v6.14-rc6/source/mm/shmem.c#L2670
>
>> ret = -EFAULT;
>> goto out;
>> }
>>
>> - page = folio_file_page(folio, pgoff);
>> if (PageHWPoison(page)) {
>> ret = -EIO;
>> goto out_release;
>> --
>> 2.47.1
>>
next prev parent reply other threads:[~2025-04-03 17:01 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-02 16:07 [PATCH v2 0/5] KVM: guest_memfd: support for uffd minor Nikita Kalyazin
2025-04-02 16:07 ` [PATCH v2 1/5] mm: userfaultfd: generic continue for non hugetlbfs Nikita Kalyazin
2025-04-02 19:04 ` James Houghton
2025-04-03 17:01 ` Nikita Kalyazin [this message]
2025-04-02 16:07 ` [PATCH v2 2/5] KVM: guest_memfd: add kvm_gmem_vma_is_gmem Nikita Kalyazin
2025-04-02 16:07 ` [PATCH v2 3/5] mm: userfaultfd: allow to register continue for guest_memfd Nikita Kalyazin
2025-04-02 21:25 ` James Houghton
2025-04-03 17:01 ` Nikita Kalyazin
2025-04-02 16:07 ` [PATCH v2 4/5] KVM: guest_memfd: add support for userfaultfd minor Nikita Kalyazin
2025-04-02 16:07 ` [PATCH v2 5/5] KVM: selftests: test userfaultfd minor for guest_memfd Nikita Kalyazin
2025-04-02 21:10 ` James Houghton
2025-04-03 17:02 ` Nikita Kalyazin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=16763ece-fa3b-4a68-9b3c-5331954d96e7@amazon.com \
--to=kalyazin@amazon.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=derekmn@amazon.com \
--cc=graf@amazon.de \
--cc=jgowans@amazon.com \
--cc=jthoughton@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lorenzo.stoakes@oracle.com \
--cc=nsaenz@amazon.es \
--cc=pbonzini@redhat.com \
--cc=peterx@redhat.com \
--cc=quic_eberman@quicinc.com \
--cc=roypat@amazon.co.uk \
--cc=ryan.roberts@arm.com \
--cc=shuah@kernel.org \
--cc=xmarcalx@amazon.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).