public inbox for linux-kernel@vger.kernel.org
From: Peter Zijlstra <peterz@infradead.org>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-kernel@vger.kernel.org,
	"André Almeida" <andrealmeid@igalia.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"Waiman Long" <longman@redhat.com>
Subject: Re: [PATCH v8 00/15] futex: Add support task local hash maps.
Date: Wed, 5 Feb 2025 13:52:50 +0100
Message-ID: <20250205125250.GD7145@noisy.programming.kicks-ass.net>
In-Reply-To: <20250205122026.l6AQ2lf7@linutronix.de>

On Wed, Feb 05, 2025 at 01:20:26PM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-02-04 16:14:05 [+0100], Peter Zijlstra wrote:
> 
> This does not compile. Let me fix this up, a few comments…

Moo, clangd didn't complain :/ But yeah, I didn't actually compile this,
only had neovim running clangd.

> > diff --git a/io_uring/futex.c b/io_uring/futex.c
> > index 3159a2b7eeca..18cd5ccde36d 100644
> > --- a/io_uring/futex.c
> > +++ b/io_uring/futex.c
> > @@ -332,13 +331,13 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
> >  	ifd->q.wake = io_futex_wake_fn;
> >  	ifd->req = req;
> >  
> > +	// XXX task->state is messed up
> >  	ret = futex_wait_setup(iof->uaddr, iof->futex_val, iof->futex_flags,
> > -			       &ifd->q, &hb);
> > +			       &ifd->q, NULL);
> >  	if (!ret) {
> >  		hlist_add_head(&req->hash_node, &ctx->futex_list);
> >  		io_ring_submit_unlock(ctx, issue_flags);
> >  
> > -		futex_queue(&ifd->q, hb);
> >  		return IOU_ISSUE_SKIP_COMPLETE;
> 
> This looks interesting. This is called from
> req->io_task_work.func = io_req_task_submit
> | io_req_task_submit()
> | -> io_issue_sqe()
> |    -> def->issue() <- io_futex_wait
> 
> and
> io_fallback_req_func() iterates over a list and invokes
> req->io_task_work.func. This seems to be also invoked from
> io_sq_thread() (via io_sq_tw() -> io_handle_tw_list()).
> 
> If this (wait and wake) is only used within kernel threads then it is
> fine. If the waker and/or waiter are in user context then we are in
> trouble because one will use the private hash of the process and the
> other won't because it is a kernel thread. So the messed-up task->state
> is the least of the problems.

Right, so the io-uring stuff is tricky, I think this more or less does
what it used to though. I 'simply' moved the futex_queue() into
futex_wait_setup().

IIRC the io-uring threads share the process-mm but will never hit
userspace.

> >  	}
> …
> > --- a/kernel/futex/waitwake.c
> > +++ b/kernel/futex/waitwake.c
> > @@ -266,67 +264,69 @@ int futex_wake_op(u32 __user *uaddr1, unsigned int flags, u32 __user *uaddr2,
> >  	if (unlikely(ret != 0))
> >  		return ret;
> >  
> > -	hb1 = futex_hash(&key1);
> > -	hb2 = futex_hash(&key2);
> > -
> >  retry_private:
> > -	double_lock_hb(hb1, hb2);
> > -	op_ret = futex_atomic_op_inuser(op, uaddr2);
> > -	if (unlikely(op_ret < 0)) {
> > -		double_unlock_hb(hb1, hb2);
> > -
> > -		if (!IS_ENABLED(CONFIG_MMU) ||
> > -		    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> > -			/*
> > -			 * we don't get EFAULT from MMU faults if we don't have
> > -			 * an MMU, but we might get them from range checking
> > -			 */
> > -			ret = op_ret;
> > -			return ret;
> > -		}
> > -
> > -		if (op_ret == -EFAULT) {
> > -			ret = fault_in_user_writeable(uaddr2);
> > -			if (ret)
> > +	if (1) {
> > +		CLASS(hb, hb1)(&key1);
> > +		CLASS(hb, hb2)(&key2);
> 
> I don't know if hiding these things makes it better because this will do
> futex_hash_put() if it gets out of scope. This means we still hold the
> reference while in fault_in_user_writeable() and cond_resched(). Is this
> on purpose?

Sorta, I found it very hard to figure out what your patches did exactly,
and..

> I guess it does not matter much. The resize will be delayed until the
> task gets back and releases the reference. This will make progress. So
> it is okay.

this.
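
To make the lifetime point concrete, here is a minimal userspace sketch of
scope-based cleanup using the compiler's cleanup attribute, which is the
mechanism CLASS() builds on. Everything in it (struct bucket, bucket_get(),
bucket_put(), slow_path()) is a hypothetical stand-in and not the futex code;
it only shows that the reference stays held across whatever runs inside the
scope and is dropped when the scope closes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted hash bucket; not the kernel struct. */
struct bucket {
	int refs;
};

static struct bucket table = { .refs = 0 };

/* Constructor analogue: take a reference. */
static struct bucket *bucket_get(void)
{
	table.refs++;
	return &table;
}

/* Destructor analogue: runs when the guarded variable leaves its scope. */
static void bucket_put(struct bucket **b)
{
	if (*b)
		(*b)->refs--;
}

/* Reference count observed while the "slow path" runs. */
static int refs_seen_in_slow_path;

/* Stand-in for something slow, such as faulting in a user page. */
static void slow_path(void)
{
	refs_seen_in_slow_path = table.refs;
}

static int wake_op_sketch(void)
{
	{
		struct bucket *hb __attribute__((cleanup(bucket_put))) = bucket_get();

		assert(hb->refs == 1);
		/* Still inside the scope, so the reference is held here. */
		slow_path();
	}
	/* The scope has closed, so the reference is gone again. */
	return table.refs;
}
```

In this sketch a resize that waits for the count to reach zero would simply
be delayed until the scope exits, which matches the "this will make progress"
reasoning quoted above.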

> > +		double_lock_hb(hb1, hb2);
> > +		op_ret = futex_atomic_op_inuser(op, uaddr2);
> > +		if (unlikely(op_ret < 0)) {
> > +			double_unlock_hb(hb1, hb2);
> > +
> > +			if (!IS_ENABLED(CONFIG_MMU) ||
> > +			    unlikely(op_ret != -EFAULT && op_ret != -EAGAIN)) {
> > +				/*
> > +				 * we don't get EFAULT from MMU faults if we don't have
> > +				 * an MMU, but we might get them from range checking
> > +				 */
> > +				ret = op_ret;
> >  				return ret;
> …
> > @@ -451,20 +442,22 @@ int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
> >  		struct futex_q *q = &vs[i].q;
> >  		u32 val = vs[i].w.val;
> >  
> > -		hb = futex_q_lock(q);
> > -		ret = futex_get_value_locked(&uval, uaddr);
> > +		if (1) {
> > +			CLASS(hb_q_lock, hb)(q);
> > +			ret = futex_get_value_locked(&uval, uaddr);
> 
> This confused me at the beginning because I expected hb_q_lock to have
> the lock part in the constructor and also the matching unlock in the
> destructor. But no, this is not the case.

Agreed, that *is* rather ugly. The sane way to fix that might be to
untangle futex_q_lock() from futex_hash(). And instead do:

			CLASS(hb, hb)(&q->key);
			futex_q_lock(q, hb);

Or somesuch. That might be a nice cleanup either way.
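
A rough userspace sketch of that split, assuming the guard only manages the
reference and the lock is taken and released explicitly. The types and
helpers (struct bucket, struct waiter, bucket_get(), waiter_lock(),
waiter_unlock()) are made up for illustration and do not match the real
futex signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified types; the real futex structures differ. */
struct bucket {
	int refs;
	bool locked;
};

struct waiter {
	struct bucket *home;	/* bucket this waiter hashes to */
};

/* Guard constructor: only takes the reference, no locking. */
static struct bucket *bucket_get(struct waiter *q)
{
	q->home->refs++;
	return q->home;
}

/* Guard destructor: only drops the reference, no unlocking. */
static void bucket_put(struct bucket **b)
{
	if (*b)
		(*b)->refs--;
}

/* Locking is a separate, visible step that takes the bucket as argument. */
static void waiter_lock(struct waiter *q, struct bucket *hb)
{
	(void)q;
	assert(!hb->locked);
	hb->locked = true;
}

static void waiter_unlock(struct bucket *hb)
{
	hb->locked = false;
}

static int read_value_locked(struct waiter *q, const int *value)
{
	struct bucket *hb __attribute__((cleanup(bucket_put))) = bucket_get(q);
	int seen;

	waiter_lock(q, hb);
	seen = *value;
	waiter_unlock(hb);

	return seen;	/* reference dropped by the guard on return */
}
```

With the two steps separated, the guard name no longer has to mention the
lock, so nobody expects an unlock in the destructor.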

> > @@ -618,26 +611,42 @@ int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
> …
> >  
> > +		if (uval != val) {
> > +			futex_q_unlock(hb);
> > +			return -EWOULDBLOCK;
> > +		}
> > +
> > +		if (key2 && !futex_match(&q->key, key2)) {
> 
> There should be no !

Duh..
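
For reference, a small sketch of the intended check, assuming key2 is the
optional second key and that the wait must be rejected when both keys refer
to the same futex. struct key and key_match() are simplified, hypothetical
versions of the real key type and futex_match().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified key; the real union futex_key has more fields. */
struct key {
	unsigned long word;
	unsigned int offset;
};

static bool key_match(const struct key *a, const struct key *b)
{
	return a->word == b->word && a->offset == b->offset;
}

/* Returns -EINVAL only when a second key is given and it matches the first. */
static int check_second_key(const struct key *key, const struct key *key2)
{
	if (key2 && key_match(key, key2))	/* note: no negation */
		return -EINVAL;

	return 0;
}
```

With the stray negation, every pair of distinct keys would have been rejected
and an identical pair accepted.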

> > +			futex_q_unlock(hb);
> > +			return -EINVAL;
> > +		}
> >  
> > -	if (uval != val) {
> > -		futex_q_unlock(*hb);
> > -		ret = -EWOULDBLOCK;
> > +		/*
> > +		 * The task state is guaranteed to be set before another task can
> > +		 * wake it. set_current_state() is implemented using smp_store_mb() and
> > +		 * futex_queue() calls spin_unlock() upon completion, both serializing
> > +		 * access to the hash list and forcing another memory barrier.
> > +		 */
> > +		set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
> > +		futex_queue(q, hb);
> >  	}
> >  
> >  	return ret;
> 
> So the beauty of it is that you enforce a ref drop on hb once it gets
> out of scope. So you can't use it by chance once the ref is dropped.

Right.

> But this does not help in futex_lock_pi() where you have to drop the
> reference before __rt_mutex_start_proxy_lock() (or at least before
> rt_mutex_wait_proxy_lock()) but still hold it if you go for the no_block
> shortcut. At which point even the lock is still owned.
> 
> While it makes the other cases nicer, the futex_lock_pi() function was
> the only one where I was thinking about setting hb to NULL to avoid
> accidental usage later on.

OK, so yeah, I got completely lost in futex_lock_pi(), and I couldn't
figure out what you did there. Let me try and untangle that again.
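
One way the two ideas could be combined, sketched in userspace under the
assumption that the guard's destructor tolerates a NULL pointer: drop the
reference explicitly before blocking and clear the pointer, so the automatic
cleanup becomes a no-op and any later use of hb is an obvious NULL
dereference. The names (bucket_get(), bucket_put(), lock_pi_sketch()) are
hypothetical and this is not what the posted patches do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted bucket; not the kernel's futex_hash_bucket. */
struct bucket {
	int refs;
};

static struct bucket *bucket_get(struct bucket *b)
{
	b->refs++;
	return b;
}

/* Drops the reference once and clears the pointer; safe to call again. */
static void bucket_put(struct bucket **b)
{
	if (*b) {
		(*b)->refs--;
		*b = NULL;
	}
}

/* Reference count observed at the point where the task would block. */
static int refs_while_blocked = -1;

static int lock_pi_sketch(struct bucket *table, bool no_block)
{
	struct bucket *hb __attribute__((cleanup(bucket_put))) = bucket_get(table);

	if (no_block)
		return 1;	/* shortcut: the guard drops the reference on return */

	/* Drop the reference early, before the blocking part. */
	bucket_put(&hb);
	assert(hb == NULL);	/* accidental use from here on would fault */

	refs_while_blocked = table->refs;
	return 0;		/* the guard runs again but sees NULL and does nothing */
}
```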
