From: Oleg Nesterov <oleg@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Florian Weimer <fweimer@redhat.com>,
Shawn Landden <shawn@git.icu>,
libc-alpha@sourceware.org, linux-api@vger.kernel.org,
LKML <linux-kernel@vger.kernel.org>,
Arnd Bergmann <arnd@arndb.de>,
Deepa Dinamani <deepa.kernel@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Keith Packard <keithp@keithp.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: handle_exit_race && PF_EXITING
Date: Wed, 6 Nov 2019 09:55:29 +0100
Message-ID: <20191106085529.GA12575@redhat.com>
In-Reply-To: <alpine.DEB.2.21.1911051959260.1869@nanos.tec.linutronix.de>
On 11/05, Thomas Gleixner wrote:
>
> sys_futex()
> loop infinite because
> PF_EXITING is set,
> but PF_EXITPIDONE not
Yes.
IOW, the problem is very simple: an RT task preempts the exiting lock owner
after it sets PF_EXITING but before it sets PF_EXITPIDONE. If they run on
the same CPU, futex_lock_pi() will spin forever.
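The spin described above can be modelled in user space. This is only a minimal sketch, not kernel code: the flag values and the names model_old_owner_check() and model_retries() are hypothetical, and the retry loop is bounded artificially so that the sketch terminates.

```c
#include <errno.h>

/* Illustrative values only; the kernel defines the real flags in its own headers. */
#define MODEL_PF_EXITING    0x1u
#define MODEL_PF_EXITPIDONE 0x2u

/*
 * Hypothetical model of the owner check under discussion: an exiting
 * owner that has not yet set PF_EXITPIDONE yields -EAGAIN (retry).
 */
static int model_old_owner_check(unsigned int owner_flags)
{
	if (owner_flags & MODEL_PF_EXITING)
		return (owner_flags & MODEL_PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
	return 0;
}

/*
 * Count how many times a waiter would retry. The owner's flags never
 * change inside the loop, which models an owner that cannot run because
 * the higher-priority waiter occupies the same CPU. max_tries exists
 * only so that this sketch terminates.
 */
static int model_retries(unsigned int owner_flags, int max_tries)
{
	int tries = 0;

	while (tries < max_tries &&
	       model_old_owner_check(owner_flags) == -EAGAIN)
		tries++;
	return tries;
}
```

With owner_flags set to MODEL_PF_EXITING alone, model_retries() always runs to max_tries: nothing inside the loop can change the owner's state.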
> So the obvious question is why PF_EXITPIDONE is set way after the futex
> exit cleanup has run,
Another obvious question is why this code checks PF_EXITING. I still think
it should not.
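A user-space sketch of a check that ignores PF_EXITING, in the spirit of the patch attached to this message. The flag values and the name model_new_owner_check() are hypothetical; only the decision itself is taken from the patch.

```c
#include <errno.h>

/* Illustrative values only; the kernel defines the real flags in its own headers. */
#define MODEL_PF_EXITING    0x1u
#define MODEL_PF_EXITPIDONE 0x2u

/*
 * Hypothetical model: only PF_EXITPIDONE matters. An owner that is
 * merely PF_EXITING is treated like any live owner, so the waiter is
 * not sent back to retry because of the owner's exit state.
 */
static int model_new_owner_check(unsigned int owner_flags)
{
	if (owner_flags & MODEL_PF_EXITPIDONE)
		return -ESRCH;	/* the PI state cleanup was already started */
	return 0;		/* proceed to attach to the owner */
}
```

In this model no flag combination produces -EAGAIN, so the owner's exit state alone cannot keep a waiter retrying.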
> The way we can deal with that is:
>
>     do_exit()
>         tsk->flags |= PF_EXITING;
>         ...
>         mutex_lock(&tsk->futex_exit_mutex);
>         futex_exit();
>         tsk->flags |= PF_EXITPIDONE;
>         mutex_unlock(&tsk->futex_exit_mutex);
>
> and on the futex lock_pi side:
>
>     if (!(tsk->flags & PF_EXITING))
>         return 0;               <- All good
>
>     if (tsk->flags & PF_EXITPIDONE)
>         return -EOWNERDEAD;     <- Locker can take over
>
>     mutex_lock(&tsk->futex_exit_mutex);
>     if (tsk->flags & PF_EXITPIDONE) {
>         mutex_unlock(&tsk->futex_exit_mutex);
>         return -EOWNERDEAD;     <- Locker can take over
>     }
>
>     queue_futex();
>     mutex_unlock(&tsk->futex_exit_mutex);
>
> Not that I think it's pretty, but it plugs all holes AFAICT.
I have found the fix I sent in 2015, attached below. I forgot everything
I knew about futex.c, so I need some time to adapt it to the current code.
But I think it is clear what this patch tries to do. Do you see any hole?
Oleg.
[PATCH] futex: don't spin waiting for PF_EXITING -> PF_EXITPIDONE transition
It is absolutely not clear why attach_to_pi_owner() returns -EAGAIN, which
triggers "retry", if the lock owner is PF_EXITING but not PF_EXITPIDONE.
This burns CPU for no reason, and it can even livelock if the rt_task()
caller preempts a PF_EXITING owner.
Remove the PF_EXITING check altogether. We do not care whether the owner is
exiting; all we need to know is whether we can rely on exit_pi_state_list().
So we also need to set PF_EXITPIDONE before we flush ->pi_state_list, and
call exit_pi_state_list() unconditionally.
Perhaps we can add the fast-path list_empty() check in mm_release() back,
but let's fix the problem first. Besides, in theory this check is already
incorrect: at the very least it should be list_empty_careful(), to avoid
the race with a free_pi_state() in progress.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/exit.c | 22 +---------------------
kernel/fork.c | 3 +--
kernel/futex.c | 40 ++++++++++------------------------------
3 files changed, 12 insertions(+), 53 deletions(-)
diff --git a/kernel/exit.c b/kernel/exit.c
index 6806c55..bc969ed 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -683,27 +683,13 @@ void do_exit(long code)
*/
if (unlikely(tsk->flags & PF_EXITING)) {
pr_alert("Fixing recursive fault but reboot is needed!\n");
- /*
- * We can do this unlocked here. The futex code uses
- * this flag just to verify whether the pi state
- * cleanup has been done or not. In the worst case it
- * loops once more. We pretend that the cleanup was
- * done as there is no way to return. Either the
- * OWNER_DIED bit is set by now or we push the blocked
- * task into the wait for ever nirwana as well.
- */
+ /* Avoid the new additions to ->pi_state_list at least */
tsk->flags |= PF_EXITPIDONE;
set_current_state(TASK_UNINTERRUPTIBLE);
schedule();
}
exit_signals(tsk); /* sets PF_EXITING */
- /*
- * tsk->flags are checked in the futex code to protect against
- * an exiting task cleaning up the robust pi futexes.
- */
- smp_mb();
- raw_spin_unlock_wait(&tsk->pi_lock);
if (unlikely(in_atomic()))
pr_info("note: %s[%d] exited with preempt_count %d\n",
@@ -779,12 +765,6 @@ void do_exit(long code)
* Make sure we are holding no locks:
*/
debug_check_no_locks_held();
- /*
- * We can do this unlocked here. The futex code uses this flag
- * just to verify whether the pi state cleanup has been done
- * or not. In the worst case it loops once more.
- */
- tsk->flags |= PF_EXITPIDONE;
if (tsk->io_context)
exit_io_context(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..ec3208e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -803,8 +803,7 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
tsk->compat_robust_list = NULL;
}
#endif
- if (unlikely(!list_empty(&tsk->pi_state_list)))
- exit_pi_state_list(tsk);
+ exit_pi_state_list(tsk);
#endif
uprobe_free_utask(tsk);
diff --git a/kernel/futex.c b/kernel/futex.c
index b101381..c1104a8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -716,11 +716,13 @@ void exit_pi_state_list(struct task_struct *curr)
if (!futex_cmpxchg_enabled)
return;
+
/*
- * We are a ZOMBIE and nobody can enqueue itself on
- * pi_state_list anymore, but we have to be careful
- * versus waiters unqueueing themselves:
+ * attach_to_pi_owner() can no longer add the new entry. But
+ * we have to be careful versus waiters unqueueing themselves.
*/
+ curr->flags |= PF_EXITPIDONE;
+
raw_spin_lock_irq(&curr->pi_lock);
while (!list_empty(head)) {
@@ -905,24 +907,12 @@ static int attach_to_pi_owner(u32 uval, union futex_key *key,
return -EPERM;
}
- /*
- * We need to look at the task state flags to figure out,
- * whether the task is exiting. To protect against the do_exit
- * change of the task flags, we do this protected by
- * p->pi_lock:
- */
raw_spin_lock_irq(&p->pi_lock);
- if (unlikely(p->flags & PF_EXITING)) {
- /*
- * The task is on the way out. When PF_EXITPIDONE is
- * set, we know that the task has finished the
- * cleanup:
- */
- int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
-
+ if (unlikely(p->flags & PF_EXITPIDONE)) {
+ /* exit_pi_state_list() was already called */
raw_spin_unlock_irq(&p->pi_lock);
put_task_struct(p);
- return ret;
+ return -ESRCH;
}
/*
@@ -1633,12 +1623,7 @@ retry_private:
goto retry;
goto out;
case -EAGAIN:
- /*
- * Two reasons for this:
- * - Owner is exiting and we just wait for the
- * exit to complete.
- * - The user space value changed.
- */
+ /* The user space value changed. */
free_pi_state(pi_state);
pi_state = NULL;
double_unlock_hb(hb1, hb2);
@@ -2295,12 +2280,7 @@ retry_private:
case -EFAULT:
goto uaddr_faulted;
case -EAGAIN:
- /*
- * Two reasons for this:
- * - Task is exiting and we just wait for the
- * exit to complete.
- * - The user space value changed.
- */
+ /* The user space value changed. */
queue_unlock(hb);
put_futex_key(&q.key);
cond_resched();
--
1.5.5.1