bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Pavel Begunkov <asml.silence@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org,
	Caleb Sander Mateos <csander@purestorage.com>,
	Akilesh Kailash <akailash@google.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Subject: Re: [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring
Date: Wed, 19 Nov 2025 19:00:08 +0000	[thread overview]
Message-ID: <0527a07c-57ac-41f2-acfd-cfd057922e4a@gmail.com> (raw)
In-Reply-To: <aRVcAFOsb7X3kxB9@fedora>

Hey Ming,

Sorry for a late reply

On 11/13/25 04:18, Ming Lei wrote:
...
>> both cases you have bpf implementing some logic that was previously
>> done in userspace. To emphasize, you can do the desired parts of
>> handling in BPF, and I'm not suggesting moving the entirety of
>> request processing in there.
> 
> The problem with your patch is that SQE is built in bpf prog(kernel), then

It's an option, not a requirement. It should be perfectly fine,
for example, to only process CQEs and run some kfuncs, and return
back to userspace.

> inevitable application logic is moved to bpf prog, which isn't good at
> handling complicated logic.
> 
> Then people have to run kernel<->user communication for setting up the SQE.
> 
> And the SQE in bpf prog may need to be linked with previous and following SQEs in
> usersapce, which basically partitions application logic into two parts: one
> is in userspace, another is in bpf prog(kernel).

I'm not a huge fan of links. They add enough of complexity to the
kernel. I'd rather see them gone / sidelined out of the normal
execution paths if there is an alternative.

> The patch I am suggesting doesn't have this problem, all SQEs are built in
> userspace, and just the minimized part(standalone and well defined function) is
> done in bpf prog.
> 
>>
>>>>>> for short BPF programs is not great because of io_uring request handling
>>>>>> overhead. And flexibility was severely lacking, so even simple use cases
>>>>>
>>>>> What is the overhead? In this patch, OP's prep() and issue() are defined in
>>>>
>>>> The overhead of creating, freeing and executing a request. If you use
>>>> it with links, it's also overhead of that. That prototype could also
>>>> optionally wait for completions, and it wasn't free either.
>>>
>>> IORING_OP_BPF is same with existing normal io_uring request and link, wrt
>>> all above you mentioned.
>>
>> It is, but it's an extra request, and in previous testing overhead
>> for that extra request was affecting total performance, that's why
>> linking or not is also important.
> 
> Yes, but does the extra request matters for whole performance?

It did in previous tests with small pre-buffered IO, but that
depends on how well ammortised it is with other requests and
BPF execution.

> I did have such test:
> 
> 1) in tools/testing/selftests/ublk/null.c
> 
> - for zero copy test, one extra nop is submitted
> 
> 2) rublk test
> 
> - for zero copy test, it simply returns without submitting nop
> 
> The IOPS gap is pretty small.
> 
> Also in your approach, without allocating one new SQE in bpf, how to
> provide generic interface for bpf prog to work on different functions, such
> as, memory copy or raid5 parity or compression ..., all require flexible
> handling, such as, variable parameters, buffer could be plain user memory
> , fixed, vectored or fixed vectored,..., so one SQE or new operation is the
> easiest way for providing the abstraction and generic bpf prog interface.

Or it can be a kfunc

...
>>> It is easy to say, how can the BPF prog know the next completion is
>>> exactly waiting for? You have to rely on bpf map to communicate with userspace
>>
>> By taking a peek at and maybe dereferencing cqe->user_data.
> 
> Yes, but you have to pass the interested ->user_data to bpf prog first.

It can be looked it up from the CQ

> There could be many inflight interested IOs, how to query them efficiently?
> 
> Scan each one after every CQE is posted? But ebpf just support bound loops,
> the complexity may be run out of easily[1].
> 
> https://docs.ebpf.io/linux/concepts/loops/

Good point, I need to take a look at the looping.

>>> to understanding what completion is what you are interested in, also
>>> need all information from userpace for preparing the SQE for submission
>>> from bpf prog. Tons of userspace and kernel communication.
>>
>> You can setup a BPF arena, and all that comm will be working with
>> a block of shared memory. Or same but via io_uring parameter region.
>> That sounds pretty simple.
> 
> But application logic has to splitted into two parts, both two have to
> rely on the shared memory to communicate.
> 
> The exiting io_uring application has been complicated enough, adding one
> extra shared memory communication for holding application logic just makes
> things worse. Even in userspace programming, it is horrible to model logic
> into data, that is why state machine pattern is usually not readable.
> 
> Think about writing high performance raid5 application based on ublk zero
> copy & io_uring, for example, handling one simple write:
> 
> - one ublk write command comes for raid5
> 
> - suppose the command just writes data to one single stripe exactly
> 
> - submitting each write to N - 1 disks
> 
> - When all N writes are done, the new SQE needs to work:
> 
> 	- calculate parity by reading buffers from the N request kernel buffer
> 	  and writing resulted XOR parity to one user specified buffer
> 
> - then new FS IO need to be submitted to write the parity data to one calculated
> disk(N)
> 
> So the involved things for bpf prog SQE:
> 
> 	- monitoring N - 1 writes
> 	- do the parity calculation job, which has to define one kfunc
> 	- mark parity is ready & notify userspace for writing parity(how to
> 	  notify?)

And something still needs to do all that. The only silver lining
for userspace handling is that there is more language sugar helping
with it like coroutines.

> Now there can be variable(many) such WRITEs to handle concurrently, and the
> bpf prog has to cover them all.
> 
> The above just the simplest case, the write command may not align with
> stripe, so parity calculation may need to read data from other stripes.
> 
> If you think it is `pretty simple`, care to provide one example to show your
> approach is workable?
> 
>>
>>>> you introduced. After it can optionally queue up requests
>>>> writing it to the storage or anything else.
>>>
>>> Again, I do not want to move userspace logic into bpf prog(kernel), what
>>> IORING_BPF_OP provides is to define one operation, then userspace
>>> can use it just like in-kernel operations.
>>
>> Right, but that's rather limited. I want to cover all those
>> use cases with one implementation instead of fragmenting users,
>> if that can be achieved.
> 
> I don't know when your ambitious plan can land or be doable.
> 
> I am going to write V2 with the approach of IORING_BPF_OP which is at least
> workable for some cases, and much easier to take in userspace. Also it
> doesn't conflict with your approach.

-- 
Pavel Begunkov


      reply	other threads:[~2025-11-19 19:00 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-04 16:21 [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring Ming Lei
2025-11-04 16:21 ` [PATCH 1/5] io_uring: prepare for extending io_uring with bpf Ming Lei
2025-11-04 16:21 ` [PATCH 2/5] io_uring: bpf: add io_uring_ctx setup for BPF into one list Ming Lei
2025-11-04 16:21 ` [PATCH 3/5] io_uring: bpf: extend io_uring with bpf struct_ops Ming Lei
2025-11-07 19:02   ` kernel test robot
2025-11-08  6:53   ` kernel test robot
2025-11-13 10:32   ` Stefan Metzmacher
2025-11-13 10:59     ` Ming Lei
2025-11-13 11:19       ` Stefan Metzmacher
2025-11-14  3:00         ` Ming Lei
2025-12-08 22:45           ` Caleb Sander Mateos
2025-12-09  3:08             ` Ming Lei
2025-12-10 16:11               ` Caleb Sander Mateos
2025-11-19 14:39   ` Jonathan Corbet
2025-11-20  1:46     ` Ming Lei
2025-11-20  1:51       ` Ming Lei
2025-11-04 16:21 ` [PATCH 4/5] io_uring: bpf: add buffer support for IORING_OP_BPF Ming Lei
2025-11-13 10:42   ` Stefan Metzmacher
2025-11-13 11:04     ` Ming Lei
2025-11-13 11:25       ` Stefan Metzmacher
2025-11-04 16:21 ` [PATCH 5/5] io_uring: bpf: add io_uring_bpf_req_memcpy() kfunc Ming Lei
2025-11-07 18:51   ` kernel test robot
2025-11-05 12:47 ` [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring Pavel Begunkov
2025-11-05 15:57   ` Ming Lei
2025-11-06 16:03     ` Pavel Begunkov
2025-11-07 15:54       ` Ming Lei
2025-11-11 14:07         ` Pavel Begunkov
2025-11-13  4:18           ` Ming Lei
2025-11-19 19:00             ` Pavel Begunkov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0527a07c-57ac-41f2-acfd-cfd057922e4a@gmail.com \
    --to=asml.silence@gmail.com \
    --cc=akailash@google.com \
    --cc=ast@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bpf@vger.kernel.org \
    --cc=csander@purestorage.com \
    --cc=io-uring@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).