[PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
@ 2025-04-23 23:37 Jens Axboe
  2025-04-24 14:03 ` Johannes Weiner
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2025-04-23 23:37 UTC (permalink / raw)
  To: Andrew Morton, Peter Xu
  Cc: linux-fsdevel@vger.kernel.org, Linux-MM, Johannes Weiner

userfaultfd may use interruptible sleeps to wait on userspace filling
a page fault, which works fine if the task can be reliably put to
sleeping waiting for that. However, if the task has a normal (ie
non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
cause schedule() to be a no-op.

For a task that registers a page with userfaultfd and then proceeds
to do a write from it, if that task also has a signal pending then
it'll essentially busy loop from do_page_fault() -> handle_userfault()
until that fault has been filled. Normally it'd be expected that the
task would sleep until that happens. Here's a trace from an application
doing just that:

handle_userfault+0x4b8/0xa00 (P)
hugetlb_fault+0xe24/0x1060
handle_mm_fault+0x2bc/0x318
do_page_fault+0x1e8/0x6f0
do_translation_fault+0x9c/0xd0
do_mem_abort+0x44/0xa0
el1_abort+0x3c/0x68
el1h_64_sync_handler+0xd4/0x100
el1h_64_sync+0x6c/0x70
fault_in_readable+0x74/0x108 (P)
iomap_file_buffered_write+0x14c/0x438
blkdev_write_iter+0x1a8/0x340
vfs_write+0x20c/0x348
ksys_write+0x64/0x108
__arm64_sys_write+0x1c/0x38

where the task is looping with 100% CPU time in the above mentioned
fault path.

Since it's impossible to handle signals, or other conditions like
TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
signals will still be handled by the caller, and the timeout is short
enough to hopefully not cause any issues. If this is the first invocation
of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
is used.

Cc: stable@vger.kernel.org
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Reported-by: Zhiwei Jiang <qq282012236@gmail.com>
Link: https://lore.kernel.org/io-uring/20250422162913.1242057-1-qq282012236@gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d80f94346199..1016268c7b51 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
-static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
+struct userfault_wait {
+	unsigned int task_state;
+	bool timeout;
+};
+
+static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
 {
+	/*
+	 * If the fault has already been tried AND there's a signal pending
+	 * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
+	 * This prevents busy looping where schedule() otherwise does nothing
+	 * for TASK_INTERRUPTIBLE when the task has a signal pending.
+	 */
+	if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
+		return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
+
 	if (flags & FAULT_FLAG_INTERRUPTIBLE)
-		return TASK_INTERRUPTIBLE;
+		return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
 
 	if (flags & FAULT_FLAG_KILLABLE)
-		return TASK_KILLABLE;
+		return (struct userfault_wait) { TASK_KILLABLE, false };
 
-	return TASK_UNINTERRUPTIBLE;
+	return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
 }
 
 /*
@@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	struct userfaultfd_wait_queue uwq;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 	bool must_wait;
-	unsigned int blocking_state;
+	struct userfault_wait wait_mode;
 
 	/*
 	 * We don't do userfault handling for the final child pid update
@@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	uwq.ctx = ctx;
 	uwq.waken = false;
 
-	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
+	wait_mode = userfaultfd_get_blocking_state(vmf->flags);
 
         /*
          * Take the vma lock now, in order to safely call
@@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(blocking_state);
+	set_current_state(wait_mode.task_state);
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
 	if (!is_vm_hugetlb_page(vma))
@@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
 		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
-		schedule();
+		/* See comment in userfaultfd_get_blocking_state() */
+		if (!wait_mode.timeout)
+			schedule();
+		else
+			schedule_timeout(HZ / 10);
 	}
 
 	__set_current_state(TASK_RUNNING);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-23 23:37 [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending Jens Axboe
@ 2025-04-24 14:03 ` Johannes Weiner
  2025-04-24 14:54   ` Jens Axboe
  2025-04-24 18:26   ` Peter Xu
  0 siblings, 2 replies; 16+ messages in thread
From: Johannes Weiner @ 2025-04-24 14:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, Peter Xu, linux-fsdevel@vger.kernel.org, Linux-MM

On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
> userfaultfd may use interruptible sleeps to wait on userspace filling
> a page fault, which works fine if the task can be reliably put to
> sleeping waiting for that. However, if the task has a normal (ie
> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
> cause schedule() to be a no-op.
> 
> For a task that registers a page with userfaultfd and then proceeds
> to do a write from it, if that task also has a signal pending then
> it'll essentially busy loop from do_page_fault() -> handle_userfault()
> until that fault has been filled. Normally it'd be expected that the
> task would sleep until that happens. Here's a trace from an application
> doing just that:
> 
> handle_userfault+0x4b8/0xa00 (P)
> hugetlb_fault+0xe24/0x1060
> handle_mm_fault+0x2bc/0x318
> do_page_fault+0x1e8/0x6f0

Makes sense. There is a fault_signal_pending() check before retrying:

static inline bool fault_signal_pending(vm_fault_t fault_flags,
                                        struct pt_regs *regs)
{
        return unlikely((fault_flags & VM_FAULT_RETRY) &&
                        (fatal_signal_pending(current) ||
                         (user_mode(regs) && signal_pending(current))));
}

Since it's an in-kernel fault, and the signal is non-fatal, it won't
stop looping until the fault is handled.

This in itself seems a bit sketchy. You have to hope there is no
dependency between handling the signal -> handling the fault inside
the userspace components.

> do_translation_fault+0x9c/0xd0
> do_mem_abort+0x44/0xa0
> el1_abort+0x3c/0x68
> el1h_64_sync_handler+0xd4/0x100
> el1h_64_sync+0x6c/0x70
> fault_in_readable+0x74/0x108 (P)
> iomap_file_buffered_write+0x14c/0x438
> blkdev_write_iter+0x1a8/0x340
> vfs_write+0x20c/0x348
> ksys_write+0x64/0x108
> __arm64_sys_write+0x1c/0x38
>
> where the task is looping with 100% CPU time in the above mentioned
> fault path.
> 
> Since it's impossible to handle signals, or other conditions like
> TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
> fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
> modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
> signals will still be handled by the caller, and the timeout is short
> enough to hopefully not cause any issues. If this is the first invocation
> of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
> is used.
>
> Cc: stable@vger.kernel.org
> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")

When this patch was first introduced, VM_FAULT_RETRY would work only
once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
causing handle_userfault() to return VM_SIGBUS, which would bubble
through the fixup table (kernel fault), -EFAULT from
iomap_file_buffered_write() and unwind the kernel stack this way.

So I'm thinking this is the more likely commit for Fixes: and stable:

commit 4064b982706375025628094e51d11cf1a958a5d3
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Apr 1 21:08:45 2020 -0700

    mm: allow VM_FAULT_RETRY for multiple times

> Reported-by: Zhiwei Jiang <qq282012236@gmail.com>
> Link: https://lore.kernel.org/io-uring/20250422162913.1242057-1-qq282012236@gmail.com/
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 14:03 ` Johannes Weiner
@ 2025-04-24 14:54   ` Jens Axboe
  2025-04-24 15:11     ` Johannes Weiner
  2025-04-24 18:26   ` Peter Xu
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2025-04-24 14:54 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Peter Xu, linux-fsdevel@vger.kernel.org, Linux-MM

On 4/24/25 8:03 AM, Johannes Weiner wrote:
> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
>> userfaultfd may use interruptible sleeps to wait on userspace filling
>> a page fault, which works fine if the task can be reliably put to
>> sleeping waiting for that. However, if the task has a normal (ie
>> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
>> cause schedule() to be a no-op.
>>
>> For a task that registers a page with userfaultfd and then proceeds
>> to do a write from it, if that task also has a signal pending then
>> it'll essentially busy loop from do_page_fault() -> handle_userfault()
>> until that fault has been filled. Normally it'd be expected that the
>> task would sleep until that happens. Here's a trace from an application
>> doing just that:
>>
>> handle_userfault+0x4b8/0xa00 (P)
>> hugetlb_fault+0xe24/0x1060
>> handle_mm_fault+0x2bc/0x318
>> do_page_fault+0x1e8/0x6f0
> 
> Makes sense. There is a fault_signal_pending() check before retrying:
> 
> static inline bool fault_signal_pending(vm_fault_t fault_flags,
>                                         struct pt_regs *regs)
> {
>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
>                         (fatal_signal_pending(current) ||
>                          (user_mode(regs) && signal_pending(current))));
> }
> 
> Since it's an in-kernel fault, and the signal is non-fatal, it won't
> stop looping until the fault is handled.
> 
> This in itself seems a bit sketchy. You have to hope there is no
> dependency between handling the signal -> handling the fault inside
> the userspace components.

Indeed... But that's generic userfaultfd sketchiness, not really related
to this patch.

> 
>> do_translation_fault+0x9c/0xd0
>> do_mem_abort+0x44/0xa0
>> el1_abort+0x3c/0x68
>> el1h_64_sync_handler+0xd4/0x100
>> el1h_64_sync+0x6c/0x70
>> fault_in_readable+0x74/0x108 (P)
>> iomap_file_buffered_write+0x14c/0x438
>> blkdev_write_iter+0x1a8/0x340
>> vfs_write+0x20c/0x348
>> ksys_write+0x64/0x108
>> __arm64_sys_write+0x1c/0x38
>>
>> where the task is looping with 100% CPU time in the above mentioned
>> fault path.
>>
>> Since it's impossible to handle signals, or other conditions like
>> TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
>> fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
>> modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
>> signals will still be handled by the caller, and the timeout is short
>> enough to hopefully not cause any issues. If this is the first invocation
>> of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
>> is used.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> 
> When this patch was first introduced, VM_FAULT_RETRY would work only
> once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
> causing handle_userfault() to return VM_SIGBUS, which would bubble
> through the fixup table (kernel fault), -EFAULT from
> iomap_file_buffered_write() and unwind the kernel stack this way.
> 
> So I'm thinking this is the more likely commit for Fixes: and stable:
> 
> commit 4064b982706375025628094e51d11cf1a958a5d3
> Author: Peter Xu <peterx@redhat.com>
> Date:   Wed Apr 1 21:08:45 2020 -0700
> 
>     mm: allow VM_FAULT_RETRY for multiple times

Thanks for checking that - yep that sounds fine to me, we can adjust the
fixes tag appropriately.

>> Reported-by: Zhiwei Jiang <qq282012236@gmail.com>
>> Link: https://lore.kernel.org/io-uring/20250422162913.1242057-1-qq282012236@gmail.com/
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 14:54   ` Jens Axboe
@ 2025-04-24 15:11     ` Johannes Weiner
  2025-04-24 15:22       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2025-04-24 15:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, Peter Xu, linux-fsdevel@vger.kernel.org, Linux-MM

On Thu, Apr 24, 2025 at 08:54:40AM -0600, Jens Axboe wrote:
> On 4/24/25 8:03 AM, Johannes Weiner wrote:
> > On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
> >> userfaultfd may use interruptible sleeps to wait on userspace filling
> >> a page fault, which works fine if the task can be reliably put to
> >> sleeping waiting for that. However, if the task has a normal (ie
> >> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
> >> cause schedule() to be a no-op.
> >>
> >> For a task that registers a page with userfaultfd and then proceeds
> >> to do a write from it, if that task also has a signal pending then
> >> it'll essentially busy loop from do_page_fault() -> handle_userfault()
> >> until that fault has been filled. Normally it'd be expected that the
> >> task would sleep until that happens. Here's a trace from an application
> >> doing just that:
> >>
> >> handle_userfault+0x4b8/0xa00 (P)
> >> hugetlb_fault+0xe24/0x1060
> >> handle_mm_fault+0x2bc/0x318
> >> do_page_fault+0x1e8/0x6f0
> > 
> > Makes sense. There is a fault_signal_pending() check before retrying:
> > 
> > static inline bool fault_signal_pending(vm_fault_t fault_flags,
> >                                         struct pt_regs *regs)
> > {
> >         return unlikely((fault_flags & VM_FAULT_RETRY) &&
> >                         (fatal_signal_pending(current) ||
> >                          (user_mode(regs) && signal_pending(current))));
> > }
> > 
> > Since it's an in-kernel fault, and the signal is non-fatal, it won't
> > stop looping until the fault is handled.
> > 
> > This in itself seems a bit sketchy. You have to hope there is no
> > dependency between handling the signal -> handling the fault inside
> > the userspace components.
> 
> Indeed... But that's generic userfaultfd sketchiness, not really related
> to this patch.

Definitely, it wasn't meant as an objection to the patch. The bug just
highlights a fairly subtle dependency chain between signals and
userfault handling that users of the feature might not be aware of.
Sorry if I was being unclear.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 15:11     ` Johannes Weiner
@ 2025-04-24 15:22       ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-04-24 15:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Peter Xu, linux-fsdevel@vger.kernel.org, Linux-MM

On 4/24/25 9:11 AM, Johannes Weiner wrote:
> On Thu, Apr 24, 2025 at 08:54:40AM -0600, Jens Axboe wrote:
>> On 4/24/25 8:03 AM, Johannes Weiner wrote:
>>> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
>>>> userfaultfd may use interruptible sleeps to wait on userspace filling
>>>> a page fault, which works fine if the task can be reliably put to
>>>> sleeping waiting for that. However, if the task has a normal (ie
>>>> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
>>>> cause schedule() to be a no-op.
>>>>
>>>> For a task that registers a page with userfaultfd and then proceeds
>>>> to do a write from it, if that task also has a signal pending then
>>>> it'll essentially busy loop from do_page_fault() -> handle_userfault()
>>>> until that fault has been filled. Normally it'd be expected that the
>>>> task would sleep until that happens. Here's a trace from an application
>>>> doing just that:
>>>>
>>>> handle_userfault+0x4b8/0xa00 (P)
>>>> hugetlb_fault+0xe24/0x1060
>>>> handle_mm_fault+0x2bc/0x318
>>>> do_page_fault+0x1e8/0x6f0
>>>
>>> Makes sense. There is a fault_signal_pending() check before retrying:
>>>
>>> static inline bool fault_signal_pending(vm_fault_t fault_flags,
>>>                                         struct pt_regs *regs)
>>> {
>>>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
>>>                         (fatal_signal_pending(current) ||
>>>                          (user_mode(regs) && signal_pending(current))));
>>> }
>>>
>>> Since it's an in-kernel fault, and the signal is non-fatal, it won't
>>> stop looping until the fault is handled.
>>>
>>> This in itself seems a bit sketchy. You have to hope there is no
>>> dependency between handling the signal -> handling the fault inside
>>> the userspace components.
>>
>> Indeed... But that's generic userfaultfd sketchiness, not really related
>> to this patch.
> 
> Definitely, it wasn't meant as an objection to the patch. The bug just
> highlights a fairly subtle dependency chain between signals and
> userfault handling that users of the feature might not be aware of.
> Sorry if I was being unclear.

It was more for the clarification for others, I know that you're well
aware of that!

All good, I'll send out a v2 with the Fixes tag adjusted, just to make
it easier on everyone.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 14:03 ` Johannes Weiner
  2025-04-24 14:54   ` Jens Axboe
@ 2025-04-24 18:26   ` Peter Xu
  2025-04-24 18:40     ` Jens Axboe
  2025-04-24 19:42     ` Matthew Wilcox
  1 sibling, 2 replies; 16+ messages in thread
From: Peter Xu @ 2025-04-24 18:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jens Axboe, Andrew Morton, linux-fsdevel@vger.kernel.org,
	Linux-MM

On Thu, Apr 24, 2025 at 10:03:44AM -0400, Johannes Weiner wrote:
> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
> > userfaultfd may use interruptible sleeps to wait on userspace filling
> > a page fault, which works fine if the task can be reliably put to
> > sleeping waiting for that. However, if the task has a normal (ie
> > non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
> > cause schedule() to be a no-op.
> > 
> > For a task that registers a page with userfaultfd and then proceeds
> > to do a write from it, if that task also has a signal pending then
> > it'll essentially busy loop from do_page_fault() -> handle_userfault()
> > until that fault has been filled. Normally it'd be expected that the
> > task would sleep until that happens. Here's a trace from an application
> > doing just that:
> > 
> > handle_userfault+0x4b8/0xa00 (P)
> > hugetlb_fault+0xe24/0x1060
> > handle_mm_fault+0x2bc/0x318
> > do_page_fault+0x1e8/0x6f0
> 
> Makes sense. There is a fault_signal_pending() check before retrying:
> 
> static inline bool fault_signal_pending(vm_fault_t fault_flags,
>                                         struct pt_regs *regs)
> {
>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
>                         (fatal_signal_pending(current) ||
>                          (user_mode(regs) && signal_pending(current))));
> }
> 
> Since it's an in-kernel fault, and the signal is non-fatal, it won't
> stop looping until the fault is handled.
> 
> This in itself seems a bit sketchy. You have to hope there is no
> dependency between handling the signal -> handling the fault inside
> the userspace components.

True. So far, my understanding is e.g. in an userfaultfd context the signal
handler is responsible for not touching any possible trapped pages, or the
sighandler needs fixing on its own.

> 
> > do_translation_fault+0x9c/0xd0
> > do_mem_abort+0x44/0xa0
> > el1_abort+0x3c/0x68
> > el1h_64_sync_handler+0xd4/0x100
> > el1h_64_sync+0x6c/0x70
> > fault_in_readable+0x74/0x108 (P)
> > iomap_file_buffered_write+0x14c/0x438
> > blkdev_write_iter+0x1a8/0x340
> > vfs_write+0x20c/0x348
> > ksys_write+0x64/0x108
> > __arm64_sys_write+0x1c/0x38
> >
> > where the task is looping with 100% CPU time in the above mentioned
> > fault path.
> > 
> > Since it's impossible to handle signals, or other conditions like
> > TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
> > fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
> > modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
> > signals will still be handled by the caller, and the timeout is short
> > enough to hopefully not cause any issues. If this is the first invocation
> > of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
> > is used.
> >
> > Cc: stable@vger.kernel.org
> > Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> 
> When this patch was first introduced, VM_FAULT_RETRY would work only
> once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
> causing handle_userfault() to return VM_SIGBUS, which would bubble
> through the fixup table (kernel fault), -EFAULT from
> iomap_file_buffered_write() and unwind the kernel stack this way.

AFAIU we can't rely on the exception fixups because when reaching there it
means the user access is going to get a -EFAULT, but here the right
behavior is we keep waiting, aka, UNINTERRUPTIBLE wait until it's done.

> 
> So I'm thinking this is the more likely commit for Fixes: and stable:
> 
> commit 4064b982706375025628094e51d11cf1a958a5d3
> Author: Peter Xu <peterx@redhat.com>
> Date:   Wed Apr 1 21:08:45 2020 -0700
> 
>     mm: allow VM_FAULT_RETRY for multiple times

IMHO the multiple attempts are still fine, instead it's problematic if we
wait in INTERRUPTIBLE mode even in !user mode..  so maybe it's slightly
more suitable to use this as Fixes:

commit c270a7eedcf278304e05ebd2c96807487c97db61
Author: Peter Xu <peterx@redhat.com>
Date:   Wed Apr 1 21:08:41 2020 -0700

    mm: introduce FAULT_FLAG_INTERRUPTIBLE

The important change there is:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 888272621f38..c076d3295958 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
        uwq.ctx = ctx;
        uwq.waken = false;
 
-       return_to_userland =
-               (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
-               (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
+       return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
        blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
                         TASK_KILLABLE;

I think we still need to avoid checking FAULT_FLAG_USER, because e.g. in
some other use cases like GUP we'd still want the threads (KVM does GUP and
it's a heavy user of userfaultfd) to respond to non-fatals.

However maybe we shouldn't really set INTERRUPTIBLE at all if it's non-GUP
and if it's non-user either.

So in general, some trivial concerns here on the patch..

Firstly, waiting UNINTERRUPTIBLE (even if with a small timeout) if
FAULT_FLAG_INTERRUPTIBLE is set is a slight ABI violation to me - after
all, FAULT_FLAG_INTERRUPTIBLE says "please respond to non-fatal signals
too!".

Secondly, userfaultfd is indeed the only consumer of
FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
this patch resolves it for userfaultfd, it might get caught again later if
something else in the kernel starts to respects the _INTERRUPTIBLE flag
request.  For example, __folio_lock_or_retry() ignores that flag so far,
but logically it should obey too (with a folio_wait_locked_interruptible)..

I also think it's not as elegant to have the magic HZ/10, and it's also
destined even the loop is less frequent that's a waste of time (as if the
user page access comes from kernel context, we must wait... until the page
fault is resolved..).

Is it possible we simply unset the request from the top?  As discussed
above, I think we still need to make sure GUP at least works for
non-fatals, however I think it might be more reasonable we never set
_INTERRUPTIBLE for !gup, then this problem might go away too with all above
concerns addressed.

A not-even-compiled patch just to clarify what I meant (and it won't work
unless it makes sense to both of you and we'll need to touch all archs when
changing the default flags):

===8<===
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 296d294142c8..fa721525d93a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
         * We set FAULT_FLAG_USER based on the register state, not
         * based on X86_PF_USER. User space accesses that cause
         * system page faults are still user accesses.
+        *
+        * When we're in user mode, allow fast response on non-fatal
+        * signals.  Do not set this in kernel mode faults because normally
+        * a kernel fault means the fault must be resolved anyway before
+        * going back to userspace.
         */
        if (user_mode(regs))
-               flags |= FAULT_FLAG_USER;
+               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
 
 #ifdef CONFIG_X86_64
        /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9b701cfbef22..a80f3f609b37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
  * arch-specific page fault handlers.
  */
 #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
-                            FAULT_FLAG_KILLABLE | \
-                            FAULT_FLAG_INTERRUPTIBLE)
+                            FAULT_FLAG_KILLABLE)
===8<===

That also kind of matches with what we do with fault_signal_pending().
Would it make sense?

Thanks,

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 18:26   ` Peter Xu
@ 2025-04-24 18:40     ` Jens Axboe
  2025-04-24 19:13       ` Peter Xu
  2025-04-24 19:42     ` Matthew Wilcox
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2025-04-24 18:40 UTC (permalink / raw)
  To: Peter Xu, Johannes Weiner
  Cc: Andrew Morton, linux-fsdevel@vger.kernel.org, Linux-MM

On 4/24/25 12:26 PM, Peter Xu wrote:
> On Thu, Apr 24, 2025 at 10:03:44AM -0400, Johannes Weiner wrote:
>> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
>>> userfaultfd may use interruptible sleeps to wait on userspace filling
>>> a page fault, which works fine if the task can be reliably put to
>>> sleeping waiting for that. However, if the task has a normal (ie
>>> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
>>> cause schedule() to be a no-op.
>>>
>>> For a task that registers a page with userfaultfd and then proceeds
>>> to do a write from it, if that task also has a signal pending then
>>> it'll essentially busy loop from do_page_fault() -> handle_userfault()
>>> until that fault has been filled. Normally it'd be expected that the
>>> task would sleep until that happens. Here's a trace from an application
>>> doing just that:
>>>
>>> handle_userfault+0x4b8/0xa00 (P)
>>> hugetlb_fault+0xe24/0x1060
>>> handle_mm_fault+0x2bc/0x318
>>> do_page_fault+0x1e8/0x6f0
>>
>> Makes sense. There is a fault_signal_pending() check before retrying:
>>
>> static inline bool fault_signal_pending(vm_fault_t fault_flags,
>>                                         struct pt_regs *regs)
>> {
>>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
>>                         (fatal_signal_pending(current) ||
>>                          (user_mode(regs) && signal_pending(current))));
>> }
>>
>> Since it's an in-kernel fault, and the signal is non-fatal, it won't
>> stop looping until the fault is handled.
>>
>> This in itself seems a bit sketchy. You have to hope there is no
>> dependency between handling the signal -> handling the fault inside
>> the userspace components.
> 
> True. So far, my understanding is e.g. in an userfaultfd context the signal
> handler is responsible for not touching any possible trapped pages, or the
> sighandler needs fixing on its own.
> 
>>
>>> do_translation_fault+0x9c/0xd0
>>> do_mem_abort+0x44/0xa0
>>> el1_abort+0x3c/0x68
>>> el1h_64_sync_handler+0xd4/0x100
>>> el1h_64_sync+0x6c/0x70
>>> fault_in_readable+0x74/0x108 (P)
>>> iomap_file_buffered_write+0x14c/0x438
>>> blkdev_write_iter+0x1a8/0x340
>>> vfs_write+0x20c/0x348
>>> ksys_write+0x64/0x108
>>> __arm64_sys_write+0x1c/0x38
>>>
>>> where the task is looping with 100% CPU time in the above mentioned
>>> fault path.
>>>
>>> Since it's impossible to handle signals, or other conditions like
>>> TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
>>> fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
>>> modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
>>> signals will still be handled by the caller, and the timeout is short
>>> enough to hopefully not cause any issues. If this is the first invocation
>>> of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
>>> is used.
>>>
>>> Cc: stable@vger.kernel.org
>>> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
>>
>> When this patch was first introduced, VM_FAULT_RETRY would work only
>> once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
>> causing handle_userfault() to return VM_SIGBUS, which would bubble
>> through the fixup table (kernel fault), -EFAULT from
>> iomap_file_buffered_write() and unwind the kernel stack this way.
> 
> AFAIU we can't rely on the exception fixups because when reaching there it
> means the user access is going to get a -EFAULT, but here the right
> behavior is we keep waiting, aka, UNINTERRUPTIBLE wait until it's done.
> 
>>
>> So I'm thinking this is the more likely commit for Fixes: and stable:
>>
>> commit 4064b982706375025628094e51d11cf1a958a5d3
>> Author: Peter Xu <peterx@redhat.com>
>> Date:   Wed Apr 1 21:08:45 2020 -0700
>>
>>     mm: allow VM_FAULT_RETRY for multiple times
> 
> IMHO the multiple attempts are still fine, instead it's problematic if we
> wait in INTERRUPTIBLE mode even in !user mode..  so maybe it's slightly
> more suitable to use this as Fixes:
> 
> commit c270a7eedcf278304e05ebd2c96807487c97db61
> Author: Peter Xu <peterx@redhat.com>
> Date:   Wed Apr 1 21:08:41 2020 -0700
> 
>     mm: introduce FAULT_FLAG_INTERRUPTIBLE
> 
> The important change there is:
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 888272621f38..c076d3295958 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>         uwq.ctx = ctx;
>         uwq.waken = false;
>  
> -       return_to_userland =
> -               (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
> -               (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
> +       return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
>         blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
>                          TASK_KILLABLE;
> 
> I think we still need to avoid checking FAULT_FLAG_USER, because e.g. in
> some other use cases like GUP we'd still want the threads (KVM does GUP and
> it's a heavy user of userfaultfd) to respond to non-fatals.
> 
> However maybe we shouldn't really set INTERRUPTIBLE at all if it's non-GUP
> and if it's non-user either.
> 
> So in general, some trivial concerns here on the patch..
> 
> Firstly, waiting UNINTERRUPTIBLE (even if with a small timeout) if
> FAULT_FLAG_INTERRUPTIBLE is set is a slight ABI violation to me - after
> all, FAULT_FLAG_INTERRUPTIBLE says "please respond to non-fatal signals
> too!".

First of all, it won't respond to signals _right now_ if waiting on
userfaultd, so that ABI violation already exists. The UNINTERRUPTIBLE
doesn't really change that at all.

> Secondly, userfaultfd is indeed the only consumer of
> FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
> this patch resolves it for userfaultfd, it might get caught again later if
> something else in the kernel starts to respects the _INTERRUPTIBLE flag
> request.  For example, __folio_lock_or_retry() ignores that flag so far,
> but logically it should obey too (with a folio_wait_locked_interruptible)..
> 
> I also think it's not as elegant to have the magic HZ/10, and it's also
> destined even the loop is less frequent that's a waste of time (as if the
> user page access comes from kernel context, we must wait... until the page
> fault is resolved..).

Yeah I don't love the magic either, but the actual value of it isn't
important - it's just to prevent a CPU spin for these cases.

> Is it possible we simply unset the request from the top?  As discussed
> above, I think we still need to make sure GUP at least works for
> non-fatals, however I think it might be more reasonable we never set
> _INTERRUPTIBLE for !gup, then this problem might go away too with all above
> concerns addressed.
> 
> A not-even-compiled patch just to clarify what I meant (and it won't work
> unless it makes sense to both of you and we'll need to touch all archs when
> changing the default flags):
> 
> ===8<===
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 296d294142c8..fa721525d93a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
>          * We set FAULT_FLAG_USER based on the register state, not
>          * based on X86_PF_USER. User space accesses that cause
>          * system page faults are still user accesses.
> +        *
> +        * When we're in user mode, allow fast response on non-fatal
> +        * signals.  Do not set this in kernel mode faults because normally
> +        * a kernel fault means the fault must be resolved anyway before
> +        * going back to userspace.
>          */
>         if (user_mode(regs))
> -               flags |= FAULT_FLAG_USER;
> +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
>  
>  #ifdef CONFIG_X86_64
>         /*
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9b701cfbef22..a80f3f609b37 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
>   * arch-specific page fault handlers.
>   */
>  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
> -                            FAULT_FLAG_KILLABLE | \
> -                            FAULT_FLAG_INTERRUPTIBLE)
> +                            FAULT_FLAG_KILLABLE)
> ===8<===
> 
> That also kind of matches with what we do with fault_signal_pending().
> Would it make sense?

I don't think doing a non-bounded non-interruptible sleep for a
condition that may never resolve (eg userfaultfd never fills the fault)
is a good idea. What happens if the condition never becomes true? You
can't even kill the task at that point... Generally UNINTERRUPTIBLE
sleep should only be used if it's a bounded wait.

For example, if I ran my previous write(2) reproducer here and the task
got killed or exited before the userfaultfd fills the fault, then you'd
have the task stuck in 'D' forever. Can't be killed, can't get
reclaimed.

In other words, this won't work.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 18:40     ` Jens Axboe
@ 2025-04-24 19:13       ` Peter Xu
  2025-04-24 19:20         ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-04-24 19:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Weiner, Andrew Morton, linux-fsdevel@vger.kernel.org,
	Linux-MM

On Thu, Apr 24, 2025 at 12:40:56PM -0600, Jens Axboe wrote:
> On 4/24/25 12:26 PM, Peter Xu wrote:
> > On Thu, Apr 24, 2025 at 10:03:44AM -0400, Johannes Weiner wrote:
> >> On Wed, Apr 23, 2025 at 05:37:06PM -0600, Jens Axboe wrote:
> >>> userfaultfd may use interruptible sleeps to wait on userspace filling
> >>> a page fault, which works fine if the task can be reliably put to
> >>> sleeping waiting for that. However, if the task has a normal (ie
> >>> non-fatal) signal pending, then TASK_INTERRUPTIBLE sleep will simply
> >>> cause schedule() to be a no-op.
> >>>
> >>> For a task that registers a page with userfaultfd and then proceeds
> >>> to do a write from it, if that task also has a signal pending then
> >>> it'll essentially busy loop from do_page_fault() -> handle_userfault()
> >>> until that fault has been filled. Normally it'd be expected that the
> >>> task would sleep until that happens. Here's a trace from an application
> >>> doing just that:
> >>>
> >>> handle_userfault+0x4b8/0xa00 (P)
> >>> hugetlb_fault+0xe24/0x1060
> >>> handle_mm_fault+0x2bc/0x318
> >>> do_page_fault+0x1e8/0x6f0
> >>
> >> Makes sense. There is a fault_signal_pending() check before retrying:
> >>
> >> static inline bool fault_signal_pending(vm_fault_t fault_flags,
> >>                                         struct pt_regs *regs)
> >> {
> >>         return unlikely((fault_flags & VM_FAULT_RETRY) &&
> >>                         (fatal_signal_pending(current) ||
> >>                          (user_mode(regs) && signal_pending(current))));
> >> }
> >>
> >> Since it's an in-kernel fault, and the signal is non-fatal, it won't
> >> stop looping until the fault is handled.
> >>
> >> This in itself seems a bit sketchy. You have to hope there is no
> >> dependency between handling the signal -> handling the fault inside
> >> the userspace components.
> > 
> > True. So far, my understanding is e.g. in an userfaultfd context the signal
> > handler is responsible for not touching any possible trapped pages, or the
> > sighandler needs fixing on its own.
> > 
> >>
> >>> do_translation_fault+0x9c/0xd0
> >>> do_mem_abort+0x44/0xa0
> >>> el1_abort+0x3c/0x68
> >>> el1h_64_sync_handler+0xd4/0x100
> >>> el1h_64_sync+0x6c/0x70
> >>> fault_in_readable+0x74/0x108 (P)
> >>> iomap_file_buffered_write+0x14c/0x438
> >>> blkdev_write_iter+0x1a8/0x340
> >>> vfs_write+0x20c/0x348
> >>> ksys_write+0x64/0x108
> >>> __arm64_sys_write+0x1c/0x38
> >>>
> >>> where the task is looping with 100% CPU time in the above mentioned
> >>> fault path.
> >>>
> >>> Since it's impossible to handle signals, or other conditions like
> >>> TIF_NOTIFY_SIGNAL that also prevents interruptible sleeping, from the
> >>> fault path, use TASK_UNINTERRUPTIBLE with a short timeout even for vmf
> >>> modes that would normally ask for INTERRUPTIBLE or KILLABLE sleep. Fatal
> >>> signals will still be handled by the caller, and the timeout is short
> >>> enough to hopefully not cause any issues. If this is the first invocation
> >>> of this fault, eg FAULT_FLAG_TRIED isn't set, then the normal sleep mode
> >>> is used.
> >>>
> >>> Cc: stable@vger.kernel.org
> >>> Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
> >>
> >> When this patch was first introduced, VM_FAULT_RETRY would work only
> >> once. The second try would have FAULT_FLAG_ALLOW_RETRY cleared,
> >> causing handle_userfault() to return VM_SIGBUS, which would bubble
> >> through the fixup table (kernel fault), -EFAULT from
> >> iomap_file_buffered_write() and unwind the kernel stack this way.
> > 
> > AFAIU we can't rely on the exception fixups because when reaching there it
> > means the user access is going to get a -EFAULT, but here the right
> > behavior is we keep waiting, aka, UNINTERRUPTIBLE wait until it's done.
> > 
> >>
> >> So I'm thinking this is the more likely commit for Fixes: and stable:
> >>
> >> commit 4064b982706375025628094e51d11cf1a958a5d3
> >> Author: Peter Xu <peterx@redhat.com>
> >> Date:   Wed Apr 1 21:08:45 2020 -0700
> >>
> >>     mm: allow VM_FAULT_RETRY for multiple times
> > 
> > IMHO the multiple attempts are still fine, instead it's problematic if we
> > wait in INTERRUPTIBLE mode even in !user mode..  so maybe it's slightly
> > more suitable to use this as Fixes:
> > 
> > commit c270a7eedcf278304e05ebd2c96807487c97db61
> > Author: Peter Xu <peterx@redhat.com>
> > Date:   Wed Apr 1 21:08:41 2020 -0700
> > 
> >     mm: introduce FAULT_FLAG_INTERRUPTIBLE
> > 
> > The important change there is:
> > 
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 888272621f38..c076d3295958 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -462,9 +462,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >         uwq.ctx = ctx;
> >         uwq.waken = false;
> >  
> > -       return_to_userland =
> > -               (vmf->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
> > -               (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
> > +       return_to_userland = vmf->flags & FAULT_FLAG_INTERRUPTIBLE;
> >         blocking_state = return_to_userland ? TASK_INTERRUPTIBLE :
> >                          TASK_KILLABLE;
> > 
> > I think we still need to avoid checking FAULT_FLAG_USER, because e.g. in
> > some other use cases like GUP we'd still want the threads (KVM does GUP and
> > it's a heavy user of userfaultfd) to respond to non-fatals.
> > 
> > However maybe we shouldn't really set INTERRUPTIBLE at all if it's non-GUP
> > and if it's non-user either.
> > 
> > So in general, some trivial concerns here on the patch..
> > 
> > Firstly, waiting UNINTERRUPTIBLE (even if with a small timeout) if
> > FAULT_FLAG_INTERRUPTIBLE is set is a slight ABI violation to me - after
> > all, FAULT_FLAG_INTERRUPTIBLE says "please respond to non-fatal signals
> > too!".
> 
> First of all, it won't respond to signals _right now_ if waiting on
> userfaultd, so that ABI violation already exists. The UNINTERRUPTIBLE
> doesn't really change that at all.

Since we can't check pending signals in a loop.. IIUC it's still ok if we
only check it right before a major sleep would happen.  IIUC that means the
old code (at least the userfaultfd path..) is following the rules.

But yeah, I think even with HZ/10 timeout, it will still respond to
non-fatal signals more or less.  It isn't that bad, so my current concerns
are more of nitpicking category.  Sorry if that wasn't clear.  I'm just
trying to see whether we have something even better (and if possible, easier).

> 
> > Secondly, userfaultfd is indeed the only consumer of
> > FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
> > this patch resolves it for userfaultfd, it might get caught again later if
> > something else in the kernel starts to respects the _INTERRUPTIBLE flag
> > request.  For example, __folio_lock_or_retry() ignores that flag so far,
> > but logically it should obey too (with a folio_wait_locked_interruptible)..
> > 
> > I also think it's not as elegant to have the magic HZ/10, and it's also
> > destined even the loop is less frequent that's a waste of time (as if the
> > user page access comes from kernel context, we must wait... until the page
> > fault is resolved..).
> 
> Yeah I don't love the magic either, but the actual value of it isn't
> important - it's just to prevent a CPU spin for these cases.

Right.  If this patch is the only thing so far can fix it, I'm OK we go
with it first.  Said that..

> 
> > Is it possible we simply unset the request from the top?  As discussed
> > above, I think we still need to make sure GUP at least works for
> > non-fatals, however I think it might be more reasonable we never set
> > _INTERRUPTIBLE for !gup, then this problem might go away too with all above
> > concerns addressed.
> > 
> > A not-even-compiled patch just to clarify what I meant (and it won't work
> > unless it makes sense to both of you and we'll need to touch all archs when
> > changing the default flags):
> > 
> > ===8<===
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 296d294142c8..fa721525d93a 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
> >          * We set FAULT_FLAG_USER based on the register state, not
> >          * based on X86_PF_USER. User space accesses that cause
> >          * system page faults are still user accesses.
> > +        *
> > +        * When we're in user mode, allow fast response on non-fatal
> > +        * signals.  Do not set this in kernel mode faults because normally
> > +        * a kernel fault means the fault must be resolved anyway before
> > +        * going back to userspace.
> >          */
> >         if (user_mode(regs))
> > -               flags |= FAULT_FLAG_USER;
> > +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
> >  
> >  #ifdef CONFIG_X86_64
> >         /*
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 9b701cfbef22..a80f3f609b37 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
> >   * arch-specific page fault handlers.
> >   */
> >  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
> > -                            FAULT_FLAG_KILLABLE | \
> > -                            FAULT_FLAG_INTERRUPTIBLE)
> > +                            FAULT_FLAG_KILLABLE)
> > ===8<===
> > 
> > That also kind of matches with what we do with fault_signal_pending().
> > Would it make sense?
> 
> I don't think doing a non-bounded non-interruptible sleep for a
> condition that may never resolve (eg userfaultfd never fills the fault)
> is a good idea. What happens if the condition never becomes true? You

If page fault is never going to be resolved, normally we sigkill the
program as it can't move any further with no way to resolve the page fault.

But yeah that's based on the fact sigkill will work first..

> can't even kill the task at that point... Generally UNINTERRUPTIBLE
> sleep should only be used if it's a bounded wait.
> 
> For example, if I ran my previous write(2) reproducer here and the task
> got killed or exited before the userfaultfd fills the fault, then you'd
> have the task stuck in 'D' forever. Can't be killed, can't get
> reclaimed.
> 
> In other words, this won't work.

.. Would you help explain why it didn't work even for SIGKILL?  Above will
still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
regardless.

For such kernel user page access, IIUC it should respond to SIGKILL in
handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
time -> going kernel fixups. Then the upper stack should get -EFAULT in the
exception fixup path.

I could have missed something..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 19:13       ` Peter Xu
@ 2025-04-24 19:20         ` Jens Axboe
  2025-04-24 19:57           ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2025-04-24 19:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Johannes Weiner, Andrew Morton, linux-fsdevel@vger.kernel.org,
	Linux-MM

On 4/24/25 1:13 PM, Peter Xu wrote:

(skipping to this bit as I think we're mostly in agreement on the above)

>>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>>> index 296d294142c8..fa721525d93a 100644
>>> --- a/arch/x86/mm/fault.c
>>> +++ b/arch/x86/mm/fault.c
>>> @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
>>>          * We set FAULT_FLAG_USER based on the register state, not
>>>          * based on X86_PF_USER. User space accesses that cause
>>>          * system page faults are still user accesses.
>>> +        *
>>> +        * When we're in user mode, allow fast response on non-fatal
>>> +        * signals.  Do not set this in kernel mode faults because normally
>>> +        * a kernel fault means the fault must be resolved anyway before
>>> +        * going back to userspace.
>>>          */
>>>         if (user_mode(regs))
>>> -               flags |= FAULT_FLAG_USER;
>>> +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
>>>  
>>>  #ifdef CONFIG_X86_64
>>>         /*
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 9b701cfbef22..a80f3f609b37 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
>>>   * arch-specific page fault handlers.
>>>   */
>>>  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
>>> -                            FAULT_FLAG_KILLABLE | \
>>> -                            FAULT_FLAG_INTERRUPTIBLE)
>>> +                            FAULT_FLAG_KILLABLE)
>>> ===8<===
>>>
>>> That also kind of matches with what we do with fault_signal_pending().
>>> Would it make sense?
>>
>> I don't think doing a non-bounded non-interruptible sleep for a
>> condition that may never resolve (eg userfaultfd never fills the fault)
>> is a good idea. What happens if the condition never becomes true? You
> 
> If page fault is never going to be resolved, normally we sigkill the
> program as it can't move any further with no way to resolve the page fault.
> 
> But yeah that's based on the fact sigkill will work first..

Yep

>> can't even kill the task at that point... Generally UNINTERRUPTIBLE
>> sleep should only be used if it's a bounded wait.
>>
>> For example, if I ran my previous write(2) reproducer here and the task
>> got killed or exited before the userfaultfd fills the fault, then you'd
>> have the task stuck in 'D' forever. Can't be killed, can't get
>> reclaimed.
>>
>> In other words, this won't work.
> 
> .. Would you help explain why it didn't work even for SIGKILL?  Above will
> still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
> regardless.
> 
> For such kernel user page access, IIUC it should respond to SIGKILL in
> handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
> time -> going kernel fixups. Then the upper stack should get -EFAULT in the
> exception fixup path.
> 
> I could have missed something..

It won't work because sending the signal will not wake the process in
question as it's sleeping uninterruptibly, forever. My looping approach
still works for fatal signals as we abort the loop every now and then,
hence we know it won't be stuck forever. But if you don't have a timeout
on that uninterruptible sleep, it's not waking from being sent a signal
alone.

Example:

axboe@m2max-kvm ~> sudo ./tufd 
got buf 0xffff89800000
child will write
Page fault
flags = 0; address = ffff89800000
wait on child
fish: Job 1, 'sudo ./tufd' terminated by signal SIGKILL (Forced quit)

meanwhile in ps:

root         837     837  0.0    2  0.0  14628  1220 ?        Dl   12:37   0:00 ./tufd
root         837     838  0.0    2  0.0  14628  1220 ?        Sl   12:37   0:00 ./tufd

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 18:26   ` Peter Xu
  2025-04-24 18:40     ` Jens Axboe
@ 2025-04-24 19:42     ` Matthew Wilcox
  2025-04-24 21:45       ` Peter Xu
  1 sibling, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2025-04-24 19:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Johannes Weiner, Jens Axboe, Andrew Morton,
	linux-fsdevel@vger.kernel.org, Linux-MM

On Thu, Apr 24, 2025 at 02:26:37PM -0400, Peter Xu wrote:
> Secondly, userfaultfd is indeed the only consumer of
> FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
> this patch resolves it for userfaultfd, it might get caught again later if
> something else in the kernel starts to respects the _INTERRUPTIBLE flag
> request.  For example, __folio_lock_or_retry() ignores that flag so far,
> but logically it should obey too (with a folio_wait_locked_interruptible)..

No.  Hell, no.  We don't want non-fatal signals being able to interrupt
that.  There's a reason we introduced killable as a concept in the first
place.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 19:20         ` Jens Axboe
@ 2025-04-24 19:57           ` Peter Xu
  2025-05-01 16:18             ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-04-24 19:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Weiner, Andrew Morton, linux-fsdevel@vger.kernel.org,
	Linux-MM

On Thu, Apr 24, 2025 at 01:20:46PM -0600, Jens Axboe wrote:
> On 4/24/25 1:13 PM, Peter Xu wrote:
> 
> (skipping to this bit as I think we're mostly in agreement on the above)
> 
> >>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> >>> index 296d294142c8..fa721525d93a 100644
> >>> --- a/arch/x86/mm/fault.c
> >>> +++ b/arch/x86/mm/fault.c
> >>> @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
> >>>          * We set FAULT_FLAG_USER based on the register state, not
> >>>          * based on X86_PF_USER. User space accesses that cause
> >>>          * system page faults are still user accesses.
> >>> +        *
> >>> +        * When we're in user mode, allow fast response on non-fatal
> >>> +        * signals.  Do not set this in kernel mode faults because normally
> >>> +        * a kernel fault means the fault must be resolved anyway before
> >>> +        * going back to userspace.
> >>>          */
> >>>         if (user_mode(regs))
> >>> -               flags |= FAULT_FLAG_USER;
> >>> +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
> >>>  
> >>>  #ifdef CONFIG_X86_64
> >>>         /*
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index 9b701cfbef22..a80f3f609b37 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
> >>>   * arch-specific page fault handlers.
> >>>   */
> >>>  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
> >>> -                            FAULT_FLAG_KILLABLE | \
> >>> -                            FAULT_FLAG_INTERRUPTIBLE)
> >>> +                            FAULT_FLAG_KILLABLE)
> >>> ===8<===
> >>>
> >>> That also kind of matches with what we do with fault_signal_pending().
> >>> Would it make sense?
> >>
> >> I don't think doing a non-bounded non-interruptible sleep for a
> >> condition that may never resolve (eg userfaultfd never fills the fault)
> >> is a good idea. What happens if the condition never becomes true? You
> > 
> > If page fault is never going to be resolved, normally we sigkill the
> > program as it can't move any further with no way to resolve the page fault.
> > 
> > But yeah that's based on the fact sigkill will work first..
> 
> Yep
> 
> >> can't even kill the task at that point... Generally UNINTERRUPTIBLE
> >> sleep should only be used if it's a bounded wait.
> >>
> >> For example, if I ran my previous write(2) reproducer here and the task
> >> got killed or exited before the userfaultfd fills the fault, then you'd
> >> have the task stuck in 'D' forever. Can't be killed, can't get
> >> reclaimed.
> >>
> >> In other words, this won't work.
> > 
> > .. Would you help explain why it didn't work even for SIGKILL?  Above will
> > still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
> > regardless.
> > 
> > For such kernel user page access, IIUC it should respond to SIGKILL in
> > handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
> > time -> going kernel fixups. Then the upper stack should get -EFAULT in the
> > exception fixup path.
> > 
> > I could have missed something..
> 
> It won't work because sending the signal will not wake the process in
> question as it's sleeping uninterruptibly, forever. My looping approach
> still works for fatal signals as we abort the loop every now and then,
> hence we know it won't be stuck forever. But if you don't have a timeout
> on that uninterruptible sleep, it's not waking from being sent a signal
> alone.
> 
> Example:
> 
> axboe@m2max-kvm ~> sudo ./tufd 
> got buf 0xffff89800000
> child will write
> Page fault
> flags = 0; address = ffff89800000
> wait on child
> fish: Job 1, 'sudo ./tufd' terminated by signal SIGKILL (Forced quit)
> 
> meanwhile in ps:
> 
> root         837     837  0.0    2  0.0  14628  1220 ?        Dl   12:37   0:00 ./tufd
> root         837     838  0.0    2  0.0  14628  1220 ?        Sl   12:37   0:00 ./tufd

I don't know TASK_WAKEKILL well, but I was hoping it would work in this
case.. E.g., even if with the patch, we still have chance to not use any
timeout at [1] below?

        if (likely(must_wait && !READ_ONCE(ctx->released))) {
                wake_up_poll(&ctx->fd_wqh, EPOLLIN);
-               schedule();
+               /* See comment in userfaultfd_get_blocking_state() */
+               if (!wait_mode.timeout)
+                       schedule();   <----------------------------- [1]
+               else
+                       schedule_timeout(HZ / 10);
        }

So my understanding is sigkill also need to work always for [1] if
FAULT_FLAG_KILLABLE is set (which should always be, iiuc).

Did I miss something else? It would be helpful too if you could share the
reproducer; I can give it a shot.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 19:42     ` Matthew Wilcox
@ 2025-04-24 21:45       ` Peter Xu
  2025-04-25  4:52         ` Matthew Wilcox
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-04-24 21:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, Jens Axboe, Andrew Morton,
	linux-fsdevel@vger.kernel.org, Linux-MM

On Thu, Apr 24, 2025 at 08:42:00PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 24, 2025 at 02:26:37PM -0400, Peter Xu wrote:
> > Secondly, userfaultfd is indeed the only consumer of
> > FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
> > this patch resolves it for userfaultfd, it might get caught again later if
> > something else in the kernel starts to respects the _INTERRUPTIBLE flag
> > request.  For example, __folio_lock_or_retry() ignores that flag so far,
> > but logically it should obey too (with a folio_wait_locked_interruptible)..
> 
> No.  Hell, no.  We don't want non-fatal signals being able to interrupt
> that.  There's a reason we introduced killable as a concept in the first
> place.

Not really proposing that as I don't have a use caes.  Just curious, could
you explain a bit why having it interruptible is against the killable
concept if (IIUC) it is still killable?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 21:45       ` Peter Xu
@ 2025-04-25  4:52         ` Matthew Wilcox
  2025-04-25 15:44           ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Matthew Wilcox @ 2025-04-25  4:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Johannes Weiner, Jens Axboe, Andrew Morton,
	linux-fsdevel@vger.kernel.org, Linux-MM

On Thu, Apr 24, 2025 at 05:45:37PM -0400, Peter Xu wrote:
> On Thu, Apr 24, 2025 at 08:42:00PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 24, 2025 at 02:26:37PM -0400, Peter Xu wrote:
> > > Secondly, userfaultfd is indeed the only consumer of
> > > FAULT_FLAG_INTERRUPTIBLE but not necessary always in the future.  While
> > > this patch resolves it for userfaultfd, it might get caught again later if
> > > something else in the kernel starts to respects the _INTERRUPTIBLE flag
> > > request.  For example, __folio_lock_or_retry() ignores that flag so far,
> > > but logically it should obey too (with a folio_wait_locked_interruptible)..
> > 
> > No.  Hell, no.  We don't want non-fatal signals being able to interrupt
> > that.  There's a reason we introduced killable as a concept in the first
> > place.
> 
> Not really proposing that as I don't have a use caes.  Just curious, could
> you explain a bit why having it interruptible is against the killable
> concept if (IIUC) it is still killable?

Because "interruptible" means it can be interrupted by inane stuff like
SIGWINCH and SIGALRM.  And then we return from a page fault prematurely
and can't actually handle the situation, so we end up going back into the
page fault handler anyway having accomplished nothing other than burn CPU.

At least it's better than interruptible system calls which just gets
you short reads, corrupted data and crashing programs.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-25  4:52         ` Matthew Wilcox
@ 2025-04-25 15:44           ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-04-25 15:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Johannes Weiner, Jens Axboe, Andrew Morton,
	linux-fsdevel@vger.kernel.org, Linux-MM

On Fri, Apr 25, 2025 at 05:52:20AM +0100, Matthew Wilcox wrote:
> Because "interruptible" means it can be interrupted by inane stuff like
> SIGWINCH and SIGALRM.  And then we return from a page fault prematurely
> and can't actually handle the situation, so we end up going back into the
> page fault handler anyway having accomplished nothing other than burn CPU.
> 
> At least it's better than interruptible system calls which just gets
> you short reads, corrupted data and crashing programs.

I see where it came from now, thanks.

IIUC it'll be a major spinning issue only if the fault is generated from
the sighandler itself which spins on its own. The hope is that sighandler
should almost always be suggested to be as tiny as possible, to make it
unlikely to happen.  Said that, it's a valid point indeed.

Maybe we should make FAULT_FLAG_INTERRUPTIBLE a hint rather than a request
to handle_mm_fault(), so the internal of page resolution can decide whether
to respect the hint.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-04-24 19:57           ` Peter Xu
@ 2025-05-01 16:18             ` Peter Xu
  2025-05-01 16:28               ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-05-01 16:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Weiner, Andrew Morton, linux-fsdevel@vger.kernel.org,
	Linux-MM

On Thu, Apr 24, 2025 at 03:57:09PM -0400, Peter Xu wrote:
> On Thu, Apr 24, 2025 at 01:20:46PM -0600, Jens Axboe wrote:
> > On 4/24/25 1:13 PM, Peter Xu wrote:
> > 
> > (skipping to this bit as I think we're mostly in agreement on the above)
> > 
> > >>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > >>> index 296d294142c8..fa721525d93a 100644
> > >>> --- a/arch/x86/mm/fault.c
> > >>> +++ b/arch/x86/mm/fault.c
> > >>> @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >>>          * We set FAULT_FLAG_USER based on the register state, not
> > >>>          * based on X86_PF_USER. User space accesses that cause
> > >>>          * system page faults are still user accesses.
> > >>> +        *
> > >>> +        * When we're in user mode, allow fast response on non-fatal
> > >>> +        * signals.  Do not set this in kernel mode faults because normally
> > >>> +        * a kernel fault means the fault must be resolved anyway before
> > >>> +        * going back to userspace.
> > >>>          */
> > >>>         if (user_mode(regs))
> > >>> -               flags |= FAULT_FLAG_USER;
> > >>> +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
> > >>>  
> > >>>  #ifdef CONFIG_X86_64
> > >>>         /*
> > >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> > >>> index 9b701cfbef22..a80f3f609b37 100644
> > >>> --- a/include/linux/mm.h
> > >>> +++ b/include/linux/mm.h
> > >>> @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
> > >>>   * arch-specific page fault handlers.
> > >>>   */
> > >>>  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
> > >>> -                            FAULT_FLAG_KILLABLE | \
> > >>> -                            FAULT_FLAG_INTERRUPTIBLE)
> > >>> +                            FAULT_FLAG_KILLABLE)
> > >>> ===8<===
> > >>>
> > >>> That also kind of matches with what we do with fault_signal_pending().
> > >>> Would it make sense?
> > >>
> > >> I don't think doing a non-bounded non-interruptible sleep for a
> > >> condition that may never resolve (eg userfaultfd never fills the fault)
> > >> is a good idea. What happens if the condition never becomes true? You
> > > 
> > > If page fault is never going to be resolved, normally we sigkill the
> > > program as it can't move any further with no way to resolve the page fault.
> > > 
> > > But yeah that's based on the fact sigkill will work first..
> > 
> > Yep
> > 
> > >> can't even kill the task at that point... Generally UNINTERRUPTIBLE
> > >> sleep should only be used if it's a bounded wait.
> > >>
> > >> For example, if I ran my previous write(2) reproducer here and the task
> > >> got killed or exited before the userfaultfd fills the fault, then you'd
> > >> have the task stuck in 'D' forever. Can't be killed, can't get
> > >> reclaimed.
> > >>
> > >> In other words, this won't work.
> > > 
> > > .. Would you help explain why it didn't work even for SIGKILL?  Above will
> > > still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
> > > regardless.
> > > 
> > > For such kernel user page access, IIUC it should respond to SIGKILL in
> > > handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
> > > time -> going kernel fixups. Then the upper stack should get -EFAULT in the
> > > exception fixup path.
> > > 
> > > I could have missed something..
> > 
> > It won't work because sending the signal will not wake the process in
> > question as it's sleeping uninterruptibly, forever. My looping approach
> > still works for fatal signals as we abort the loop every now and then,
> > hence we know it won't be stuck forever. But if you don't have a timeout
> > on that uninterruptible sleep, it's not waking from being sent a signal
> > alone.
> > 
> > Example:
> > 
> > axboe@m2max-kvm ~> sudo ./tufd 
> > got buf 0xffff89800000
> > child will write
> > Page fault
> > flags = 0; address = ffff89800000
> > wait on child
> > fish: Job 1, 'sudo ./tufd' terminated by signal SIGKILL (Forced quit)
> > 
> > meanwhile in ps:
> > 
> > root         837     837  0.0    2  0.0  14628  1220 ?        Dl   12:37   0:00 ./tufd
> > root         837     838  0.0    2  0.0  14628  1220 ?        Sl   12:37   0:00 ./tufd
> 
> I don't know TASK_WAKEKILL well, but I was hoping it would work in this
> case.. E.g., even if with the patch, we still have chance to not use any
> timeout at [1] below?
> 
>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> -               schedule();
> +               /* See comment in userfaultfd_get_blocking_state() */
> +               if (!wait_mode.timeout)
> +                       schedule();   <----------------------------- [1]
> +               else
> +                       schedule_timeout(HZ / 10);
>         }
> 
> So my understanding is sigkill also need to work always for [1] if
> FAULT_FLAG_KILLABLE is set (which should always be, iiuc).
> 
> Did I miss something else? It would be helpful too if you could share the
> reproducer; I can give it a shot.

Since the signal issue alone can definitely be reproduced with any
reproducer that triggers the fault in the kernel.. I wrote one today with
write() syscall, I'll attach that at the end.

I did try this patch, meanwhile I also verified that actually what I
provided previously (at the end of the reply) can also avoid the cpu
spinning, and it is also killable at least here..

https://lore.kernel.org/all/aAqCXfPirHqWMlb4@x1.local/

Jens, could you help me to find why that simpler (and IMHO must cleaner)
change wouldn't work for your case?

The important thing is, as I mentioned above sigkill need to also work for
[1], and I really want to know when it won't.. meanwhile it's logically
incorrect to set INTERRUPTIBLE for kernel faults, which this patch didn't
really address.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
  2025-05-01 16:18             ` Peter Xu
@ 2025-05-01 16:28               ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-05-01 16:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Johannes Weiner, Andrew Morton, linux-fsdevel@vger.kernel.org,
	Linux-MM

On Thu, May 01, 2025 at 12:18:35PM -0400, Peter Xu wrote:
> On Thu, Apr 24, 2025 at 03:57:09PM -0400, Peter Xu wrote:
> > On Thu, Apr 24, 2025 at 01:20:46PM -0600, Jens Axboe wrote:
> > > On 4/24/25 1:13 PM, Peter Xu wrote:
> > > 
> > > (skipping to this bit as I think we're mostly in agreement on the above)
> > > 
> > > >>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > >>> index 296d294142c8..fa721525d93a 100644
> > > >>> --- a/arch/x86/mm/fault.c
> > > >>> +++ b/arch/x86/mm/fault.c
> > > >>> @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
> > > >>>          * We set FAULT_FLAG_USER based on the register state, not
> > > >>>          * based on X86_PF_USER. User space accesses that cause
> > > >>>          * system page faults are still user accesses.
> > > >>> +        *
> > > >>> +        * When we're in user mode, allow fast response on non-fatal
> > > >>> +        * signals.  Do not set this in kernel mode faults because normally
> > > >>> +        * a kernel fault means the fault must be resolved anyway before
> > > >>> +        * going back to userspace.
> > > >>>          */
> > > >>>         if (user_mode(regs))
> > > >>> -               flags |= FAULT_FLAG_USER;
> > > >>> +               flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
> > > >>>  
> > > >>>  #ifdef CONFIG_X86_64
> > > >>>         /*
> > > >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > >>> index 9b701cfbef22..a80f3f609b37 100644
> > > >>> --- a/include/linux/mm.h
> > > >>> +++ b/include/linux/mm.h
> > > >>> @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
> > > >>>   * arch-specific page fault handlers.
> > > >>>   */
> > > >>>  #define FAULT_FLAG_DEFAULT  (FAULT_FLAG_ALLOW_RETRY | \
> > > >>> -                            FAULT_FLAG_KILLABLE | \
> > > >>> -                            FAULT_FLAG_INTERRUPTIBLE)
> > > >>> +                            FAULT_FLAG_KILLABLE)
> > > >>> ===8<===
> > > >>>
> > > >>> That also kind of matches with what we do with fault_signal_pending().
> > > >>> Would it make sense?
> > > >>
> > > >> I don't think doing a non-bounded non-interruptible sleep for a
> > > >> condition that may never resolve (eg userfaultfd never fills the fault)
> > > >> is a good idea. What happens if the condition never becomes true? You
> > > > 
> > > > If page fault is never going to be resolved, normally we sigkill the
> > > > program as it can't move any further with no way to resolve the page fault.
> > > > 
> > > > But yeah that's based on the fact sigkill will work first..
> > > 
> > > Yep
> > > 
> > > >> can't even kill the task at that point... Generally UNINTERRUPTIBLE
> > > >> sleep should only be used if it's a bounded wait.
> > > >>
> > > >> For example, if I ran my previous write(2) reproducer here and the task
> > > >> got killed or exited before the userfaultfd fills the fault, then you'd
> > > >> have the task stuck in 'D' forever. Can't be killed, can't get
> > > >> reclaimed.
> > > >>
> > > >> In other words, this won't work.
> > > > 
> > > > .. Would you help explain why it didn't work even for SIGKILL?  Above will
> > > > still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
> > > > regardless.
> > > > 
> > > > For such kernel user page access, IIUC it should respond to SIGKILL in
> > > > handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
> > > > time -> going kernel fixups. Then the upper stack should get -EFAULT in the
> > > > exception fixup path.
> > > > 
> > > > I could have missed something..
> > > 
> > > It won't work because sending the signal will not wake the process in
> > > question as it's sleeping uninterruptibly, forever. My looping approach
> > > still works for fatal signals as we abort the loop every now and then,
> > > hence we know it won't be stuck forever. But if you don't have a timeout
> > > on that uninterruptible sleep, it's not waking from being sent a signal
> > > alone.
> > > 
> > > Example:
> > > 
> > > axboe@m2max-kvm ~> sudo ./tufd 
> > > got buf 0xffff89800000
> > > child will write
> > > Page fault
> > > flags = 0; address = ffff89800000
> > > wait on child
> > > fish: Job 1, 'sudo ./tufd' terminated by signal SIGKILL (Forced quit)
> > > 
> > > meanwhile in ps:
> > > 
> > > root         837     837  0.0    2  0.0  14628  1220 ?        Dl   12:37   0:00 ./tufd
> > > root         837     838  0.0    2  0.0  14628  1220 ?        Sl   12:37   0:00 ./tufd
> > 
> > I don't know TASK_WAKEKILL well, but I was hoping it would work in this
> > case.. E.g., even if with the patch, we still have chance to not use any
> > timeout at [1] below?
> > 
> >         if (likely(must_wait && !READ_ONCE(ctx->released))) {
> >                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> > -               schedule();
> > +               /* See comment in userfaultfd_get_blocking_state() */
> > +               if (!wait_mode.timeout)
> > +                       schedule();   <----------------------------- [1]
> > +               else
> > +                       schedule_timeout(HZ / 10);
> >         }
> > 
> > So my understanding is sigkill also need to work always for [1] if
> > FAULT_FLAG_KILLABLE is set (which should always be, iiuc).
> > 
> > Did I miss something else? It would be helpful too if you could share the
> > reproducer; I can give it a shot.
> 
> Since the signal issue alone can definitely be reproduced with any
> reproducer that triggers the fault in the kernel.. I wrote one today with
> write() syscall, I'll attach that at the end.
> 
> I did try this patch, meanwhile I also verified that actually what I
> provided previously (at the end of the reply) can also avoid the cpu
> spinning, and it is also killable at least here..
> 
> https://lore.kernel.org/all/aAqCXfPirHqWMlb4@x1.local/
> 
> Jens, could you help me to find why that simpler (and IMHO must cleaner)
> change wouldn't work for your case?
> 
> The important thing is, as I mentioned above sigkill need to also work for
> [1], and I really want to know when it won't.. meanwhile it's logically
> incorrect to set INTERRUPTIBLE for kernel faults, which this patch didn't
> really address.

My reproducer:

$ cat uffd-kernel-sig.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <fcntl.h>
#include <signal.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <assert.h>

#define PAGE_SIZE 4096
#define BUFFER_PAGES 2

void sigusr1_handler(int signum) {
    printf("SIGUSR1 SIGNAL\n");
}

static int setup_userfaultfd(void *addr, size_t len) {
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd == -1) {
        perror("userfaultfd");
        exit(1);
    }

    struct uffdio_api ua = {
        .api = UFFD_API
    };
    if (ioctl(uffd, UFFDIO_API, &ua) == -1) {
        perror("UFFDIO_API");
        exit(1);
    }

    struct uffdio_register ur = {
        .range = {
            .start = (unsigned long)addr,
            .len = len
        },
        .mode = UFFDIO_REGISTER_MODE_MISSING
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &ur) == -1) {
        perror("UFFDIO_REGISTER");
        exit(1);
    }

    return uffd;
}

void *signal_sender(void *arg) {
    pid_t pid = getpid();
    usleep(100000);
    kill(pid, SIGUSR1);
    return NULL;
}

int main() {
    struct sigaction sa;

    sa.sa_handler = sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) == -1) {
        perror("sigaction");
        exit(1);
    }

    size_t buffer_size = BUFFER_PAGES * PAGE_SIZE;

    void *src_buf = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src_buf == MAP_FAILED) {
        perror("mmap src_buf");
        exit(1);
    }

    if (madvise(src_buf, buffer_size, MADV_DONTNEED) == -1) {
        perror("madvise");
        exit(1);
    }

    void *dst_buf = malloc(buffer_size);
    if (!dst_buf) {
        perror("malloc dst_buf");
        exit(1);
    }

    int uffd = setup_userfaultfd(src_buf, buffer_size);

    pthread_t thread;
    if (pthread_create(&thread, NULL, signal_sender, NULL) != 0) {
        perror("pthread_create");
        exit(1);
    }

    int tmp = open("/tmp/file", O_WRONLY | O_CREAT, 0644);
    if (tmp < 0) {
        exit(EXIT_FAILURE);
    }

    ssize_t written = write(tmp, src_buf, buffer_size);
    printf("write returned：%zd\n", written);

    close(tmp);

    return 0;
}

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-05-01 16:28 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-23 23:37 [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending Jens Axboe
2025-04-24 14:03 ` Johannes Weiner
2025-04-24 14:54   ` Jens Axboe
2025-04-24 15:11     ` Johannes Weiner
2025-04-24 15:22       ` Jens Axboe
2025-04-24 18:26   ` Peter Xu
2025-04-24 18:40     ` Jens Axboe
2025-04-24 19:13       ` Peter Xu
2025-04-24 19:20         ` Jens Axboe
2025-04-24 19:57           ` Peter Xu
2025-05-01 16:18             ` Peter Xu
2025-05-01 16:28               ` Peter Xu
2025-04-24 19:42     ` Matthew Wilcox
2025-04-24 21:45       ` Peter Xu
2025-04-25  4:52         ` Matthew Wilcox
2025-04-25 15:44           ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).