From: Peter Xu <peterx@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Andrew Morton <akpm@linux-foundation.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>
Subject: Re: [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending
Date: Thu, 1 May 2025 12:18:35 -0400 [thread overview]
Message-ID: <aBOe27gBqlwIj6lD@x1.local> (raw)
In-Reply-To: <aAqXlcYI9j39zQnE@x1.local>
On Thu, Apr 24, 2025 at 03:57:09PM -0400, Peter Xu wrote:
> On Thu, Apr 24, 2025 at 01:20:46PM -0600, Jens Axboe wrote:
> > On 4/24/25 1:13 PM, Peter Xu wrote:
> >
> > (skipping to this bit as I think we're mostly in agreement on the above)
> >
> > >>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > >>> index 296d294142c8..fa721525d93a 100644
> > >>> --- a/arch/x86/mm/fault.c
> > >>> +++ b/arch/x86/mm/fault.c
> > >>> @@ -1300,9 +1300,14 @@ void do_user_addr_fault(struct pt_regs *regs,
> > >>> * We set FAULT_FLAG_USER based on the register state, not
> > >>> * based on X86_PF_USER. User space accesses that cause
> > >>> * system page faults are still user accesses.
> > >>> + *
> > >>> + * When we're in user mode, allow fast response on non-fatal
> > >>> + * signals. Do not set this in kernel mode faults because normally
> > >>> + * a kernel fault means the fault must be resolved anyway before
> > >>> + * going back to userspace.
> > >>> */
> > >>> if (user_mode(regs))
> > >>> - flags |= FAULT_FLAG_USER;
> > >>> + flags |= FAULT_FLAG_USER | FAULT_FLAG_INTERRUPTIBLE;
> > >>>
> > >>> #ifdef CONFIG_X86_64
> > >>> /*
> > >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> > >>> index 9b701cfbef22..a80f3f609b37 100644
> > >>> --- a/include/linux/mm.h
> > >>> +++ b/include/linux/mm.h
> > >>> @@ -487,8 +487,7 @@ extern unsigned int kobjsize(const void *objp);
> > >>> * arch-specific page fault handlers.
> > >>> */
> > >>> #define FAULT_FLAG_DEFAULT (FAULT_FLAG_ALLOW_RETRY | \
> > >>> - FAULT_FLAG_KILLABLE | \
> > >>> - FAULT_FLAG_INTERRUPTIBLE)
> > >>> + FAULT_FLAG_KILLABLE)
> > >>> ===8<===
> > >>>
> > >>> That also kind of matches with what we do with fault_signal_pending().
> > >>> Would it make sense?
> > >>
> > >> I don't think doing a non-bounded non-interruptible sleep for a
> > >> condition that may never resolve (eg userfaultfd never fills the fault)
> > >> is a good idea. What happens if the condition never becomes true? You
> > >
> > > If page fault is never going to be resolved, normally we sigkill the
> > > program as it can't move any further with no way to resolve the page fault.
> > >
> > > But yeah that's based on the fact sigkill will work first..
> >
> > Yep
> >
> > >> can't even kill the task at that point... Generally UNINTERRUPTIBLE
> > >> sleep should only be used if it's a bounded wait.
> > >>
> > >> For example, if I ran my previous write(2) reproducer here and the task
> > >> got killed or exited before the userfaultfd fills the fault, then you'd
> > >> have the task stuck in 'D' forever. Can't be killed, can't get
> > >> reclaimed.
> > >>
> > >> In other words, this won't work.
> > >
> > > .. Would you help explain why it didn't work even for SIGKILL? Above will
> > > still set FAULT_FLAG_KILLABLE, hence I thought SIGKILL would always work
> > > regardless.
> > >
> > > For such kernel user page access, IIUC it should respond to SIGKILL in
> > > handle_userfault(), then fault_signal_pending() would trap the SIGKILL this
> > > time -> going kernel fixups. Then the upper stack should get -EFAULT in the
> > > exception fixup path.
> > >
> > > I could have missed something..
> >
> > It won't work because sending the signal will not wake the process in
> > question as it's sleeping uninterruptibly, forever. My looping approach
> > still works for fatal signals as we abort the loop every now and then,
> > hence we know it won't be stuck forever. But if you don't have a timeout
> > on that uninterruptible sleep, it's not waking from being sent a signal
> > alone.
> >
> > Example:
> >
> > axboe@m2max-kvm ~> sudo ./tufd
> > got buf 0xffff89800000
> > child will write
> > Page fault
> > flags = 0; address = ffff89800000
> > wait on child
> > fish: Job 1, 'sudo ./tufd' terminated by signal SIGKILL (Forced quit)
> >
> > meanwhile in ps:
> >
> > root 837 837 0.0 2 0.0 14628 1220 ? Dl 12:37 0:00 ./tufd
> > root 837 838 0.0 2 0.0 14628 1220 ? Sl 12:37 0:00 ./tufd
>
> I don't know TASK_WAKEKILL well, but I was hoping it would work in this
> case.. E.g., even if with the patch, we still have chance to not use any
> timeout at [1] below?
>
> if (likely(must_wait && !READ_ONCE(ctx->released))) {
> wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> - schedule();
> + /* See comment in userfaultfd_get_blocking_state() */
> + if (!wait_mode.timeout)
> + schedule(); <----------------------------- [1]
> + else
> + schedule_timeout(HZ / 10);
> }
>
> So my understanding is sigkill also need to work always for [1] if
> FAULT_FLAG_KILLABLE is set (which should always be, iiuc).
>
> Did I miss something else? It would be helpful too if you could share the
> reproducer; I can give it a shot.
Since the signal issue alone can definitely be reproduced with any
reproducer that triggers the fault in the kernel.. I wrote one today with
write() syscall, I'll attach that at the end.
I did try this patch, meanwhile I also verified that actually what I
provided previously (at the end of the reply) can also avoid the cpu
spinning, and it is also killable at least here..
https://lore.kernel.org/all/aAqCXfPirHqWMlb4@x1.local/
Jens, could you help me to find why that simpler (and IMHO must cleaner)
change wouldn't work for your case?
The important thing is, as I mentioned above sigkill need to also work for
[1], and I really want to know when it won't.. meanwhile it's logically
incorrect to set INTERRUPTIBLE for kernel faults, which this patch didn't
really address.
Thanks,
--
Peter Xu
next prev parent reply other threads:[~2025-05-01 16:18 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-23 23:37 [PATCH] mm/userfaultfd: prevent busy looping for tasks with signals pending Jens Axboe
2025-04-24 14:03 ` Johannes Weiner
2025-04-24 14:54 ` Jens Axboe
2025-04-24 15:11 ` Johannes Weiner
2025-04-24 15:22 ` Jens Axboe
2025-04-24 18:26 ` Peter Xu
2025-04-24 18:40 ` Jens Axboe
2025-04-24 19:13 ` Peter Xu
2025-04-24 19:20 ` Jens Axboe
2025-04-24 19:57 ` Peter Xu
2025-05-01 16:18 ` Peter Xu [this message]
2025-05-01 16:28 ` Peter Xu
2025-04-24 19:42 ` Matthew Wilcox
2025-04-24 21:45 ` Peter Xu
2025-04-25 4:52 ` Matthew Wilcox
2025-04-25 15:44 ` Peter Xu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aBOe27gBqlwIj6lD@x1.local \
--to=peterx@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=hannes@cmpxchg.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.