From: Pavel Begunkov <asml.silence@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
io-uring@vger.kernel.org,
Caleb Sander Mateos <csander@purestorage.com>,
Akilesh Kailash <akailash@google.com>,
bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Subject: Re: [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring
Date: Wed, 19 Nov 2025 19:00:08 +0000 [thread overview]
Message-ID: <0527a07c-57ac-41f2-acfd-cfd057922e4a@gmail.com> (raw)
In-Reply-To: <aRVcAFOsb7X3kxB9@fedora>
Hey Ming,
Sorry for a late reply
On 11/13/25 04:18, Ming Lei wrote:
...
>> both cases you have bpf implementing some logic that was previously
>> done in userspace. To emphasize, you can do the desired parts of
>> handling in BPF, and I'm not suggesting moving the entirety of
>> request processing in there.
>
> The problem with your patch is that SQE is built in bpf prog(kernel), then
It's an option, not a requirement. It should be perfectly fine,
for example, to only process CQEs and run some kfuncs, and return
back to userspace.
> inevitable application logic is moved to bpf prog, which isn't good at
> handling complicated logic.
>
> Then people have to run kernel<->user communication for setting up the SQE.
>
> And the SQE in bpf prog may need to be linked with previous and following SQEs in
> usersapce, which basically partitions application logic into two parts: one
> is in userspace, another is in bpf prog(kernel).
I'm not a huge fan of links. They add enough of complexity to the
kernel. I'd rather see them gone / sidelined out of the normal
execution paths if there is an alternative.
> The patch I am suggesting doesn't have this problem, all SQEs are built in
> userspace, and just the minimized part(standalone and well defined function) is
> done in bpf prog.
>
>>
>>>>>> for short BPF programs is not great because of io_uring request handling
>>>>>> overhead. And flexibility was severely lacking, so even simple use cases
>>>>>
>>>>> What is the overhead? In this patch, OP's prep() and issue() are defined in
>>>>
>>>> The overhead of creating, freeing and executing a request. If you use
>>>> it with links, it's also overhead of that. That prototype could also
>>>> optionally wait for completions, and it wasn't free either.
>>>
>>> IORING_OP_BPF is same with existing normal io_uring request and link, wrt
>>> all above you mentioned.
>>
>> It is, but it's an extra request, and in previous testing overhead
>> for that extra request was affecting total performance, that's why
>> linking or not is also important.
>
> Yes, but does the extra request matters for whole performance?
It did in previous tests with small pre-buffered IO, but that
depends on how well ammortised it is with other requests and
BPF execution.
> I did have such test:
>
> 1) in tools/testing/selftests/ublk/null.c
>
> - for zero copy test, one extra nop is submitted
>
> 2) rublk test
>
> - for zero copy test, it simply returns without submitting nop
>
> The IOPS gap is pretty small.
>
> Also in your approach, without allocating one new SQE in bpf, how to
> provide generic interface for bpf prog to work on different functions, such
> as, memory copy or raid5 parity or compression ..., all require flexible
> handling, such as, variable parameters, buffer could be plain user memory
> , fixed, vectored or fixed vectored,..., so one SQE or new operation is the
> easiest way for providing the abstraction and generic bpf prog interface.
Or it can be a kfunc
...
>>> It is easy to say, how can the BPF prog know the next completion is
>>> exactly waiting for? You have to rely on bpf map to communicate with userspace
>>
>> By taking a peek at and maybe dereferencing cqe->user_data.
>
> Yes, but you have to pass the interested ->user_data to bpf prog first.
It can be looked it up from the CQ
> There could be many inflight interested IOs, how to query them efficiently?
>
> Scan each one after every CQE is posted? But ebpf just support bound loops,
> the complexity may be run out of easily[1].
>
> https://docs.ebpf.io/linux/concepts/loops/
Good point, I need to take a look at the looping.
>>> to understanding what completion is what you are interested in, also
>>> need all information from userpace for preparing the SQE for submission
>>> from bpf prog. Tons of userspace and kernel communication.
>>
>> You can setup a BPF arena, and all that comm will be working with
>> a block of shared memory. Or same but via io_uring parameter region.
>> That sounds pretty simple.
>
> But application logic has to splitted into two parts, both two have to
> rely on the shared memory to communicate.
>
> The exiting io_uring application has been complicated enough, adding one
> extra shared memory communication for holding application logic just makes
> things worse. Even in userspace programming, it is horrible to model logic
> into data, that is why state machine pattern is usually not readable.
>
> Think about writing high performance raid5 application based on ublk zero
> copy & io_uring, for example, handling one simple write:
>
> - one ublk write command comes for raid5
>
> - suppose the command just writes data to one single stripe exactly
>
> - submitting each write to N - 1 disks
>
> - When all N writes are done, the new SQE needs to work:
>
> - calculate parity by reading buffers from the N request kernel buffer
> and writing resulted XOR parity to one user specified buffer
>
> - then new FS IO need to be submitted to write the parity data to one calculated
> disk(N)
>
> So the involved things for bpf prog SQE:
>
> - monitoring N - 1 writes
> - do the parity calculation job, which has to define one kfunc
> - mark parity is ready & notify userspace for writing parity(how to
> notify?)
And something still needs to do all that. The only silver lining
for userspace handling is that there is more language sugar helping
with it like coroutines.
> Now there can be variable(many) such WRITEs to handle concurrently, and the
> bpf prog has to cover them all.
>
> The above just the simplest case, the write command may not align with
> stripe, so parity calculation may need to read data from other stripes.
>
> If you think it is `pretty simple`, care to provide one example to show your
> approach is workable?
>
>>
>>>> you introduced. After it can optionally queue up requests
>>>> writing it to the storage or anything else.
>>>
>>> Again, I do not want to move userspace logic into bpf prog(kernel), what
>>> IORING_BPF_OP provides is to define one operation, then userspace
>>> can use it just like in-kernel operations.
>>
>> Right, but that's rather limited. I want to cover all those
>> use cases with one implementation instead of fragmenting users,
>> if that can be achieved.
>
> I don't know when your ambitious plan can land or be doable.
>
> I am going to write V2 with the approach of IORING_BPF_OP which is at least
> workable for some cases, and much easier to take in userspace. Also it
> doesn't conflict with your approach.
--
Pavel Begunkov
prev parent reply other threads:[~2025-11-19 19:00 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-04 16:21 [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring Ming Lei
2025-11-04 16:21 ` [PATCH 1/5] io_uring: prepare for extending io_uring with bpf Ming Lei
2025-11-04 16:21 ` [PATCH 2/5] io_uring: bpf: add io_uring_ctx setup for BPF into one list Ming Lei
2025-11-04 16:21 ` [PATCH 3/5] io_uring: bpf: extend io_uring with bpf struct_ops Ming Lei
2025-11-07 19:02 ` kernel test robot
2025-11-08 6:53 ` kernel test robot
2025-11-13 10:32 ` Stefan Metzmacher
2025-11-13 10:59 ` Ming Lei
2025-11-13 11:19 ` Stefan Metzmacher
2025-11-14 3:00 ` Ming Lei
2025-12-08 22:45 ` Caleb Sander Mateos
2025-12-09 3:08 ` Ming Lei
2025-12-10 16:11 ` Caleb Sander Mateos
2025-11-19 14:39 ` Jonathan Corbet
2025-11-20 1:46 ` Ming Lei
2025-11-20 1:51 ` Ming Lei
2025-11-04 16:21 ` [PATCH 4/5] io_uring: bpf: add buffer support for IORING_OP_BPF Ming Lei
2025-11-13 10:42 ` Stefan Metzmacher
2025-11-13 11:04 ` Ming Lei
2025-11-13 11:25 ` Stefan Metzmacher
2025-11-04 16:21 ` [PATCH 5/5] io_uring: bpf: add io_uring_bpf_req_memcpy() kfunc Ming Lei
2025-11-07 18:51 ` kernel test robot
2025-11-05 12:47 ` [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring Pavel Begunkov
2025-11-05 15:57 ` Ming Lei
2025-11-06 16:03 ` Pavel Begunkov
2025-11-07 15:54 ` Ming Lei
2025-11-11 14:07 ` Pavel Begunkov
2025-11-13 4:18 ` Ming Lei
2025-11-19 19:00 ` Pavel Begunkov [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0527a07c-57ac-41f2-acfd-cfd057922e4a@gmail.com \
--to=asml.silence@gmail.com \
--cc=akailash@google.com \
--cc=ast@kernel.org \
--cc=axboe@kernel.dk \
--cc=bpf@vger.kernel.org \
--cc=csander@purestorage.com \
--cc=io-uring@vger.kernel.org \
--cc=ming.lei@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).