From: "David Hildenbrand (arm)" <david@kernel.org>
To: Jens Axboe <axboe@kernel.dk>, Gabriel Krisman Bertazi <krisman@suse.de>
Cc: io-uring@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/2] Introduce IORING_OP_MMAP
Date: Wed, 4 Feb 2026 20:47:52 +0100
Message-ID: <7faa5721-cd73-4140-9d63-fa5a279dbce3@kernel.org>
In-Reply-To: <01839e70-5a71-4969-ad5f-2495754250e1@kernel.dk>

On 2/2/26 15:34, Jens Axboe wrote:
> On 2/2/26 2:02 AM, David Hildenbrand (arm) wrote:
>> On 2/1/26 19:16, Jens Axboe wrote:
>>>
>>> The hard part isn't enabling all syscalls at once, that could be
>>> trivially done with an IORING_OP_SYSCALL and the SQE carries arg0..argN.
>>> And for any nonblocking/simple syscall, that would Just Work.
>>
>> Right, that's what I had in mind.
>>
>>> The
>>> challenge is for syscalls that block - the whole point of io_uring is
>>> that you should be able to do nonblock issues with sane retries. The
>>> futex series I did some time back is a good example of that - you modify
>>> the existing syscall to expose the waitqueue mechanism, which you can
>>> then use to wait in an async way, and get a callback when some action
>>> needs to be taken.
>>>
>>> If you just allow blocking, then you're blocking the entire io_uring
>>> issue pipeline. Which was exactly my main complaint on this patchset,
>>> see the review reply to patch 2.
>>
>> Makes sense. I was wondering whether that could be optimized
>> internally in the stream of IORING_OP_SYSCALL.
>>
>> But likely that would make it more tricky to optimize.
> 
> Are we talking generically, or mmap/munmap/mremap? 

Well, a bit of both :)

munmap() could be a bit challenging as it downgrades the mmap_lock for 
removal of the page tables. So quite a bit of rework would be required 
to batch that over multiple operations I suppose.

> You could trivially
> make IORING_OP_SYSCALL available and use it for everything, it'd just
> require basically all of those to be offloaded to io-wq internally in
> io_uring. And that's not a great approach. The fast path for io_uring is
> running the opcode inline, which means that by the time the syscall
> returns, you have also posted the completion. If the operation can't
> complete inline, then the next best thing is to have it be triggered
> when it can complete, and then retry and post the completion. Think of
> reading from a pipe - if the data is there, the read is done inside
> io_uring_enter() when the read is attempted, and we're done. If no data
> is available, the operation is queued. When data becomes available, a
> retry is triggered, data is read, and a completion is posted.

Thanks for the explanation.
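
To make the pipe example concrete, here is a minimal userspace sketch of
the "try inline, retry once readiness is signalled" pattern. It is only an
analogy: it uses plain read(2) with O_NONBLOCK and poll(2), not io_uring
itself, and the helper names (try_read_inline, demo_pipe_retry) are made
up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* One nonblocking attempt: bytes read on success, -errno on failure. */
static ssize_t try_read_inline(int fd, void *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	return n < 0 ? -errno : n;
}

/* Hypothetical demo helper; returns 0 if every step behaved as expected. */
int demo_pipe_retry(void)
{
	struct pollfd pfd;
	char buf[8];
	int fds[2];
	int ret = -1;

	if (pipe(fds) < 0)
		return -1;
	if (fcntl(fds[0], F_SETFL, O_NONBLOCK) < 0)
		goto out;

	/* 1) Inline attempt on an empty pipe cannot complete: -EAGAIN. */
	if (try_read_inline(fds[0], buf, sizeof(buf)) != -EAGAIN)
		goto out;

	/* 2) Data becomes available. */
	if (write(fds[1], "hi", 2) != 2)
		goto out;

	/* 3) The readiness notification stands in for the retry trigger. */
	pfd.fd = fds[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	if (poll(&pfd, 1, 1000) != 1 || !(pfd.revents & POLLIN))
		goto out;

	/* 4) The retry now completes without blocking. */
	if (try_read_inline(fds[0], buf, sizeof(buf)) != 2)
		goto out;
	if (buf[0] != 'h' || buf[1] != 'i')
		goto out;

	ret = 0;
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

A syscall that only offers "block until done" has no equivalent of steps 1
and 3, which is why it would have to be punted to io-wq.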

> 
> An old school kind of syscall ("do this thing, and just block the
> task until it's done") doesn't work that way at all. Running those in
> io_uring would necessitate punting the operation to io-wq, which are
> helper userspace threads for io_uring. As there's no way of knowing
> whether syscallN will complete fast inline or block for 2 seconds,
> io_uring has no other option than to offload it to io-wq. If it's a 2
> second operation, that's fine, you won't see any difference in the
> application, other than it can now do syscallN async in an efficient
> way. If syscallN would've completed inline in 1 usec, then offloading to
> io-wq is suddenly a big performance problem.
> 
>> The patch set says "serving as base for batching
>> multiple mappings in a single operation", and I was wondering why one
>> wouldn't just also batch with mremap/munmap etc. in the future.
>>
>> (BUT I am also skeptical whether holding the mmap lock in write mode
>> longer instead of repeatedly grabbing it, allowing other operations
>> that need it in read mode etc to make progress, is actually
>> preferable)
> 
> That's always a trade off - if the frequency is high, then a certain
> level of batching makes sense. The good news is that you get to control
> that, you can just batch more or less.
> 
> Outside of mmap locking frequencies, I suspect potentially nicer wins
> might be around TLB flush reductions for this family of operations.
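
As a rough illustration of the lock-batching trade-off quoted above, here
is a toy model. It takes no real lock and is not kernel code; it only
counts how often a write lock would be acquired and how many operations
the longest hold would cover. The names (struct lock_stats, run_batched)
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct lock_stats {
	size_t acquisitions;     /* write-lock acquire/release cycles */
	size_t max_ops_per_hold; /* most operations done under one hold */
};

/* Model nr_ops operations, taking the lock once per group of 'batch'. */
struct lock_stats run_batched(size_t nr_ops, size_t batch)
{
	struct lock_stats st = { 0, 0 };
	size_t done = 0;

	if (batch == 0)
		batch = 1;

	while (done < nr_ops) {
		size_t n = nr_ops - done;

		if (n > batch)
			n = batch;

		st.acquisitions++;	/* lock taken in write mode here */
		if (n > st.max_ops_per_hold)
			st.max_ops_per_hold = n;
		done += n;		/* n operations; readers would wait */
		/* lock released here, readers can make progress */
	}
	return st;
}
```

In this model, 64 operations with a batch size of 1 mean 64 acquisitions
with one operation per hold, while a batch size of 16 means 4 acquisitions
with 16 operations per hold. Fewer acquisitions come at the cost of longer
stretches during which read-mode users are held off.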

For mremap() and munmap(), yes, just like for MADV_DONTNEED.

mmap() maybe, if we do a MAP_FIXED that implies an munmap(), IIRC.

But then we are again in "hairy to reasonably batch" territory I think. 
These are all extremely involved operations.
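
For reference, a small userspace sketch of the MAP_FIXED case mentioned
above: mapping with MAP_FIXED over an existing mapping discards what was
there, as if it had been unmapped first. The helper name
(demo_map_fixed_replaces) is made up for illustration, and the sketch
assumes Linux with anonymous private mappings.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo helper; returns 0 if the old mapping was replaced. */
int demo_map_fixed_replaces(void)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile unsigned char *p, *q;
	int ok;

	if (page <= 0)
		return -1;

	/* First mapping: dirty the page so its content is recognizable. */
	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = 0xAA;

	/* Second mapping at the same address: the old page is dropped. */
	q = mmap((void *)p, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (q == MAP_FAILED) {
		munmap((void *)p, (size_t)page);
		return -1;
	}

	/* Same address, but a fresh zero-filled page instead of 0xAA. */
	ok = (q == p) && (q[0] == 0);

	munmap((void *)q, (size_t)page);
	return ok ? 0 : -1;
}
```

The second mmap() therefore has to tear down existing page table entries,
which is where a TLB flush comes into play for mmap() as well.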


Is there any use case for the patch set at hand, in particular, in an 
un-optimized form?

-- 
Cheers,

David
