From: Ming Lei <tom.leiming@gmail.com>
To: Bernd Schubert <bernd@niova.io>
Cc: Ming Lei <ming.lei@redhat.com>,
	fuse-devel@lists.linux.dev, Joanne Koong <joannelkoong@gmail.com>,
	io-uring <io-uring@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: fuse/io-uring: Proposal to support pBuf in additon to kBuf
Date: Fri, 17 Apr 2026 22:35:38 +0800
Message-ID: <aeJFOmvCF3ArL9iq@fedora>
In-Reply-To: <55db9a65-4408-42d2-8958-3bf3aa79d554@niova.io>

On Thu, Apr 16, 2026 at 09:13:41PM +0200, Bernd Schubert wrote:
> 
> 
> On 4/16/26 17:48, Ming Lei wrote:
> > On Thu, Apr 16, 2026 at 04:46:01PM +0200, Bernd Schubert wrote:
> >> Hi Ming,
> >>
> >> On 4/16/26 15:49, Ming Lei wrote:
> >>> Hi Bernd,
> >>>
> >>> On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
> >>>>
> >>>> Hi Joanne, et al,
> >>>>
> >>>> this is a bit of duplication of the discussion we had before, but I was
> >>>> badly distracted with other work and also switching employer - didn't
> >>>> manage to reply [1].
> >>>>
> >>>>
> >>>> I'm still not too happy about kBuf and its restriction of locked-only
> >>>> memory. Right now I'm reviewing your patches from the view of what needs
> >>>> to be done for ublk (for my current employer) and also for fuse to
> >>>> support different buffer sizes. Let's say fuse only supports kBuf and its
> >>>> restriction of pinned memory, I think we would be forced to add support
> >>>> for different buffer sizes to the current ring-entry-provides-the-buffer
> >>>> and the new kBuf interface - from my point of view code dup.
> >>>> If we would allow pBuf for fuse, we could put the current
> >>>> 'ring-entry-provides-the-buffer' interface into maintenance mode and
> >>>> support new features with the new interface only. I know you disagree on
> >>>> using pBuf [1] with the argument that userspace could free the buffer.
> >>>> Well, if it does, it does something totally wrong and the same could
> >>>> happen today over /dev/fuse and also the existing fuse-over-io-uring.
> >>>> Just the window is smaller, as the pages are extracted from the buffer
> >>>> during the copy.
> >>>>
> >>>> I was looking into what would be needed to support pBuf and I think
> >>>> io-uring could extract pages from pBuf when the buffer is obtained - it
> >>>> would limit the window when userspace can do something wrong in a
> >>>> similar way current fuse and ublk works.
> >>>>
> >>>> Suggested changes:
> >>>>
> >>>> io_uring:
> >>>>
> >>>>   - io_pin_pages() gets a 'bool longterm' parameter.
> >>>> The new pBuf path would pass false, every other existing caller true.
> >>>>
> >>>>   - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
> >>>>   - io_ring_buf_get_pages()/io_ring_buf_put_pages() -> fills the
> >>>> provided bvec
> >>>>   - New struct io_ring_buf (in cmd.h)
> >>>>
> >>>> struct io_ring_buf {
> >>>>        size_t                  len;
> >>>>        unsigned int            buf_id;
> >>>>        unsigned int            nr_bvecs;
> >>>>
> >>>>        /* private */
> >>>>        u64                     addr;
> >>>>        u8                      is_pinned;
> >>>> };
> >>>>
> >>>>
> >>>> Fuse changes:
> >>>>
> >>>>   - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
> >>>>     replaced by io_ring_buf + pre-allocated bvec array.
> >>>>   - Buffer selection under queue->lock removed.  The lock only protects
> >>>>     request dequeue and entry state transitions.  Page access happens
> >>>>     after the lock is dropped, in the context where the copy runs.
> >>>>   - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
> >>>>     iov_iter_bvec() and would continue to use iov_iter_get_pages2()
> >>>>
> >>>> What do you think?
> >>>>
> >>>> And my current primary goal is to let ublk support multiple buffer
> >>>> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> >>>
> >>> Ublk server is just one liburing application, and it supports all generic
> >>> io_uring buffer types, so kbuf/pbuf should be fine for your ublk server
> >>> in theory.
> >>>
> >>> It really depends on how your ublk server is implemented.
> >>>
> >>> Maybe you can share your motivation first before discussing kbuf/pbuf support.
> >>> If it is for DMA, there are other candidates too, such as hugepage, the
> >>> recently added UBLK_U_CMD_REG_BUF, ...
> >> Joanne had actually removed kBuf and switched to pBuf alone and that
> >> simplifies things a bit.
> >>
> >> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
> >> saturate streaming bandwidth, but still want to get smaller IOs through,
> >> for these smaller IOs you don't want to assign the 1MB buffer for each
> >> queue entry / tag.
> > 
> > Thanks for sharing the motivation.
> > 
> > Maybe you can pass UBLK_F_USER_COPY, and each IO buffer can be allocated
> > dynamically completely from userspace, then pre-allocation can be avoided.
> 
> I had looked into it, but that is still another syscall / roundtrip. It will
> have the same performance issue as UBLK_F_NEED_GET_DATA, and is probably
> worse, because compared to ring IO that is a syscall per IO.

Yeah, that seems true for your use case, where compression follows the copy,
so the pread/pwrite for the read/write io buffer can't be linked into the
io_uring SQE pipeline.

However, I am not sure how you would use pbuf for this use case. One big issue
is that the buffer has to be provided to the ublk fetch commands
(UBLK_IO_FETCH_REQ / UBLK_IO_COMMIT_AND_FETCH_REQ) beforehand, for handling the
incoming ublk IO request, whose size can't be known at that time. I will study
the pBuf patchset later, but it also depends on how the ublk driver uses it, IMO.

Meanwhile, another (more flexible) way is to use bpf struct_ops for allocating &
freeing the IO buffer, following this basic idea:

- define struct_ops(alloc_io_buf, free_io_buf) for allocating & freeing the io
buffer, which is used for copying data between the request pages and this buffer

- ->alloc_io_buf() can be called from ublk_map_io() and ->free_io_buf()
can be called from ublk_unmap_io()

- the allocated buffer can be accessed directly from both the userspace ublk
server and the bpf prog; bpf arena is a perfect match for this use case, and
page pinning is avoided as well.

- the two callbacks are not called when any of the following features is
enabled: UBLK_F_SUPPORT_ZERO_COPY, UBLK_F_USER_COPY, UBLK_F_AUTO_BUF_REG, or
when UBLK_IO_F_SHMEM_ZC is set for this IO

- the motivation is to avoid a big pre-allocation, so the ublk server can
use a dynamic per-queue heap to allocate io buffers in a space-efficient way.

- with this feature, userspace needn't pre-allocate io buffers with the max
  buffer size; a typical implementation provides one bpf arena heap for the
  bpf prog to alloc & free buffers from. It can still fall back to the
  user-copy code path if allocation from the bpf prog fails.
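
The idea above can be modelled in plain userspace C. This is only a
sketch under stated assumptions, not the ublk or bpf API: struct
io_buf_ops, struct queue_heap, map_io() and the 4 MiB / 64 KiB sizes
are all hypothetical, and a simple chunk bitmap stands in for the bpf
arena heap. It shows the two callbacks plus the fallback to the
user-copy path when allocation fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names and sizes below are hypothetical, for illustration only. */
#define HEAP_SIZE	(4u << 20)	/* assumed per-queue heap: 4 MiB */
#define CHUNK_SIZE	(64u << 10)	/* assumed granule: 64 KiB */
#define NR_CHUNKS	(HEAP_SIZE / CHUNK_SIZE)

struct queue_heap {
	uint8_t used[NR_CHUNKS];	/* 1 if the chunk is allocated */
};

/* Stand-in for the proposed struct_ops pair. */
struct io_buf_ops {
	long (*alloc_io_buf)(struct queue_heap *h, size_t len);
	void (*free_io_buf)(struct queue_heap *h, long off, size_t len);
};

/* First-fit search for a contiguous run; returns heap offset or -1. */
static long heap_alloc(struct queue_heap *h, size_t len)
{
	size_t need = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
	size_t run = 0;

	if (need == 0 || need > NR_CHUNKS)
		return -1;
	for (size_t i = 0; i < NR_CHUNKS; i++) {
		run = h->used[i] ? 0 : run + 1;
		if (run == need) {
			size_t start = i + 1 - need;

			for (size_t j = start; j <= i; j++)
				h->used[j] = 1;
			return (long)(start * CHUNK_SIZE);
		}
	}
	return -1;
}

/* Release the chunks covering [off, off + len). */
static void heap_free(struct queue_heap *h, long off, size_t len)
{
	size_t start = (size_t)off / CHUNK_SIZE;
	size_t need = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;

	assert(off >= 0 && (size_t)off % CHUNK_SIZE == 0);
	assert(start + need <= NR_CHUNKS);
	for (size_t j = start; j < start + need; j++)
		h->used[j] = 0;
}

enum copy_path { COPY_VIA_HEAP, COPY_VIA_USER_COPY };

/* Models the map step: try the heap, else fall back to user copy. */
static enum copy_path map_io(const struct io_buf_ops *ops,
			     struct queue_heap *h, size_t len, long *off)
{
	*off = ops->alloc_io_buf(h, len);
	return *off >= 0 ? COPY_VIA_HEAP : COPY_VIA_USER_COPY;
}

static struct queue_heap demo_heap;
static const struct io_buf_ops demo_ops = {
	.alloc_io_buf = heap_alloc,
	.free_io_buf = heap_free,
};
```

In this model, four 1 MiB requests fill the heap, a fifth request of
any size takes the user-copy fallback, and after one of the large
buffers is freed a 4 KiB request is served from the heap again.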

You may compare the two approaches for your use case.

> 
> > 
> >> Zero copy is currently still out of question for us, although I will
> >> look into your recent work for integration of eBPF and if erasure
> >> coding, compression and checksums could be done with that (I guess
> >> checksums is the easy part).
> > 
> > Got it, compression could be the hardest one; however, the recently added
> > bpf iterator based buffer interface may simplify everything. I'd suggest you
> > look at it, and provide some feedback if possible.
> > 
> > Also, if your client application uses direct IO, the recently added
> > UBLK_F_SHMEM_ZC could simplify the implementation a lot, while also giving
> > you zero copy & a user-mapped address.
> 
> Oh I see, that was just merged. Nice, thank you! I don't think our users will
> be DIO only, but nice to have that ZC option!

It can be thought of as a speedup or optimization for the DIO use case.

Thanks,
Ming


Thread overview: 13+ messages
2026-04-13 21:33 fuse/io-uring: Proposal to support pBuf in additon to kBuf Bernd Schubert
2026-04-14  0:56 ` Joanne Koong
2026-04-14 17:34   ` Bernd Schubert
2026-04-15  0:19     ` Joanne Koong
2026-04-16 13:49 ` Ming Lei
2026-04-16 14:46   ` Bernd Schubert
2026-04-16 15:48     ` Ming Lei
2026-04-16 19:13       ` Bernd Schubert
2026-04-17 14:35         ` Ming Lei [this message]
2026-04-17 21:02     ` Joanne Koong
2026-04-29 10:09       ` Bernd Schubert
2026-04-30 15:20         ` Joanne Koong
2026-04-30 16:55           ` Bernd Schubert
