public inbox for io-uring@vger.kernel.org
From: Thomas Gleixner <tglx@linutronix.de>
To: Pavel Begunkov <asml.silence@gmail.com>,
	Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org
Cc: Andres Freund <andres@anarazel.de>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Darren Hart <dvhart@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 4/4] io_uring: implement futex wait
Date: Tue, 01 Jun 2021 23:53:00 +0200
Message-ID: <87sg211ccj.ffs@nanos.tec.linutronix.de>
In-Reply-To: <5ab4c8bd-3e82-e87b-1ae8-3b32ced72009@gmail.com>

Pavel, Jens,

On Tue, Jun 01 2021 at 17:29, Pavel Begunkov wrote:
> On 6/1/21 5:01 PM, Jens Axboe wrote:
>>> Yes, that would be preferable, but it looks like futexes don't use
>>> waitqueues but rather some manual enqueuing onto a plist_node; see
>>> futex_wait_queue_me() or mark_wake_futex().
>>> Did I miss it somewhere?
>> 
>> Yes, we'd need to augment that with a callback. I do think that's going
>
> Yeah, that was the first idea, but it's also more intrusive for the
> futex codebase. Can be piled on top for next revision of patches.
>
> A question to the futex maintainers: how much resistance to merging
> something like that should I expect?

Adding a waitqueue-like callback, or the proposed thing?

TBH, neither one has any charm.

1) The proposed solution: I can't figure out from the changelogs or the
   cover letter what kind of problems it solves and what the exact
   semantics are. If you ever consider submitting futex patches, may I
   recommend studying Documentation/process and getting some
   inspiration from the git log?

   What are the lifetime rules? What is the interaction with regular
   futexes, and with the robust list ...? Without interaction with
   regular futexes such functionality does not make any sense at all.

   Also, once we open that can of worms, where is this going to end and
   where do we draw the line? This is going to be a bottomless pit,
   because I don't believe for a split second that this simple
   interface is going to be sufficient.

   Aside from that, we are definitely _not_ going to expose any of the
   internal functions, simply because they evade the sanity checks
   which happen at the syscall wrappers, and I have zero interest in
   dealing with the fallout of unfiltered input coming in via io-uring
   interfaces, or in trying to keep those checks up to date when the
   core filtering changes.

2) Adding a waitqueue-like callback is daft.

   a) Lifetime rules

      See mark_wake_futex().

   b) Locking

      The wakeup mechanism is designed to avoid hb->lock contention as much
      as possible. The dequeue/mark for wakeup happens under hb->lock
      and the actual wakeup happens after dropping hb->lock.

      This is not going to change. It's not even debatable.

      Aside from that, the code is optimized for minimal hb->lock hold
      time in general.

   So the only way to do that would be to invoke the callback from
   mark_wake_futex() _before_ invalidating futex_q, and the callback
   plus its argument(s) would have to be stored in futex_q.

   Where does this information come from? Which context would invoke the
   wait with callback and callback arguments? User space, io-uring state
   machine or what?

   Aside from the extra storage (on the stack), yet more undefined
   lifetime rules, and no semantics for error cases etc., that would
   also force the callback to be invoked with hb->lock held. IOW, it
   makes the hb->lock hold time larger, which is exactly what the
   existing code tries to avoid by all means.

But what would that solve?

I can't tell because the provided information is absolutely useless
for anyone not familiar with your great idea:
   
  "Add futex wait requests, those always go through io-wq for
   simplicity."
      
What am I supposed to read out of this? Doing it elsewhere would be
more complex? Really useful information.

And I can't tell either what Jens means here:

  "Not a huge fan of that, I think this should tap into the waitqueue
   instead and just rely on the wakeup callback to trigger the
   event. That would be a lot more efficient than punting to io-wq, both
   in terms of latency on trigger, but also for efficiency if the app is
   waiting on a lot of futexes."

and here:

  "Yes, we'd need to augment that with a callback. I do think that's
   going to be necessary, I don't see the io-wq solution working well
   outside of the most basic of use cases. And even for that, it won't
   be particularly efficient for single waits."

All of these quotes are useless word salad without context and worse
without the minimal understanding how futexes work.

So can you folks please sit down and write up a coherent description of:

 1) The problem you are trying to solve

 2) How this futex functionality should be integrated into io-uring
    including the contexts which invoke it.

 3) Interaction with regular sys_futex() operations.

 4) Lifetime and locking rules.

Unless that materializes any futex related changes are not even going to
be reviewed.

I did not even try to review this stuff, I just tried to make sense out
of it, but while skimming it, it was inevitable to spot this gem:

 +int futex_wake_op_single(u32 __user *uaddr, int nr_wake, unsigned int op,
 +			 bool shared, bool try);
...
 +		ret = futex_wake_op_single(f->uaddr, f->nr_wake, f->wake_op_arg,
 +					   !(f->flags & IORING_FUTEX_SHARED),
 +					   nonblock);

You surely made your point that this is well thought out.

Thanks,

        tglx

  reply	other threads:[~2021-06-01 21:53 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-01 14:58 [RFC 0/4] futex request support Pavel Begunkov
2021-06-01 14:58 ` [RFC 1/4] futex: add op wake for a single key Pavel Begunkov
2021-06-01 14:58 ` [RFC 2/4] io_uring: frame out futex op Pavel Begunkov
2021-06-01 14:58 ` [RFC 3/4] io_uring: support futex wake requests Pavel Begunkov
2021-06-01 14:58 ` [RFC 4/4] io_uring: implement futex wait Pavel Begunkov
2021-06-01 15:45   ` Jens Axboe
2021-06-01 15:58     ` Pavel Begunkov
2021-06-01 16:01       ` Jens Axboe
2021-06-01 16:29         ` Pavel Begunkov
2021-06-01 21:53           ` Thomas Gleixner [this message]
2021-06-03 10:31             ` Pavel Begunkov
2021-06-04  9:19               ` Thomas Gleixner
2021-06-04 11:58                 ` Pavel Begunkov
2021-06-05  2:09                   ` Thomas Gleixner
2021-06-07 12:14                     ` Pavel Begunkov
2021-06-03 19:03             ` Andres Freund
2021-06-03 21:10               ` Peter Zijlstra
2021-06-03 21:21                 ` Andres Freund
2021-06-05  0:43               ` Thomas Gleixner
2021-06-07 11:31                 ` Pavel Begunkov
2021-06-07 11:48                   ` Peter Zijlstra
2021-06-03 18:59 ` [RFC 0/4] futex request support Andres Freund
2021-06-04 15:26   ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87sg211ccj.ffs@nanos.tec.linutronix.de \
    --to=tglx@linutronix.de \
    --cc=andres@anarazel.de \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=dave@stgolabs.net \
    --cc=dvhart@infradead.org \
    --cc=io-uring@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox