* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-01-30 21:28 [LSF/MM/BPF TOPIC] FUSE io_uring zero copy David Wei
@ 2025-01-30 22:05 ` Bernd Schubert
2025-01-30 22:51 ` David Wei
2025-01-31 14:13 ` Amir Goldstein
2025-01-30 22:22 ` Keith Busch
2025-02-05 2:27 ` Ming Lei
2 siblings, 2 replies; 8+ messages in thread
From: Bernd Schubert @ 2025-01-30 22:05 UTC (permalink / raw)
To: David Wei, lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, Keith Busch, Ming Lei, Jens Axboe,
Josef Bacik, Joanne Koong
Hi David,
I would love to participate in this discussion and the page
migration/tmp-page discussions, but I don't think I can make it to LSF/MM.
On 1/30/25 22:28, David Wei wrote:
> Hi folks, I want to propose a discussion on adding zero copy to FUSE
> io_uring in the kernel. The source is some userspace buffer or device
> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
> will then either forward it over the network or to an underlying
> FS/block device. The FUSE server may want to read the data.
>
> My goal is to eliminate copies in this entire data path, including the
> initial hop between the userspace client and the kernel. I know Ming and
> Keith are working on adding ublk zero copy but it does not cover this
> initial hop and it does not allow the ublk/FUSE server to read the data.
>
> My idea is to use shared memory or dma-buf, i.e. the source data is
> encapsulated in an mmap()able fd. The client and FUSE server exchange
> this fd through a back channel with no kernel involvement. The FUSE
> server maps the fd into its address space and registers the fd with
> io_uring via the io_uring_register() infra. When the client does e.g. a
> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
> a lookup and understands that the pages belong to the fd that was
> registered from the FUSE server. Then io_uring tells the FUSE server
> that the data is in the fd it registered, so there is no need to copy
> anything at all.
For specific applications that know the protocol, that should work.
>
> I would like to discuss this and get feedback from the community. My top
> question is why do this in the kernel at all? It is entirely possible to
> bypass the kernel entirely by having the client and FUSE server exchange
> the fd and then do the I/O purely through IPC.
Because then we leave POSIX, and it becomes rather FUSE-specific.
Thanks,
Bernd
* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-01-30 22:05 ` Bernd Schubert
@ 2025-01-30 22:51 ` David Wei
2025-01-31 14:13 ` Amir Goldstein
1 sibling, 0 replies; 8+ messages in thread
From: David Wei @ 2025-01-30 22:51 UTC (permalink / raw)
To: Bernd Schubert, lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, Keith Busch, Ming Lei, Jens Axboe,
Josef Bacik, Joanne Koong
On 2025-01-30 14:05, Bernd Schubert wrote:
> Hi David,
>
> I would love to participate in this discussion and the page
> migration/tmp-page discussions, but I don't think I can make it to LSF/MM.
Thanks Bernd! Looking forward to discussing this with you.
>
> On 1/30/25 22:28, David Wei wrote:
>> Hi folks, I want to propose a discussion on adding zero copy to FUSE
>> io_uring in the kernel. The source is some userspace buffer or device
>> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
>> will then either forward it over the network or to an underlying
>> FS/block device. The FUSE server may want to read the data.
>>
>> My goal is to eliminate copies in this entire data path, including the
>> initial hop between the userspace client and the kernel. I know Ming and
>> Keith are working on adding ublk zero copy but it does not cover this
>> initial hop and it does not allow the ublk/FUSE server to read the data.
>>
>> My idea is to use shared memory or dma-buf, i.e. the source data is
>> encapsulated in an mmap()able fd. The client and FUSE server exchange
>> this fd through a back channel with no kernel involvement. The FUSE
>> server maps the fd into its address space and registers the fd with
>> io_uring via the io_uring_register() infra. When the client does e.g. a
>> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
>> a lookup and understands that the pages belong to the fd that was
>> registered from the FUSE server. Then io_uring tells the FUSE server
>> that the data is in the fd it registered, so there is no need to copy
>> anything at all.
>
> For specific applications that know the protocol, that should work.
>
>>
>> I would like to discuss this and get feedback from the community. My top
>> question is why do this in the kernel at all? It is entirely possible to
>> bypass the kernel entirely by having the client and FUSE server exchange
>> the fd and then do the I/O purely through IPC.
>
> Because then we leave POSIX, and it becomes rather FUSE-specific.
Yeah, good point. Another line of thought is in ease of use from the
client's perspective. Yes, they have to do a back channel IPC with the
FUSE server to do the setup. Though it could be as simple as using one
of the many ways of passing and installing fds between two processes,
e.g. io_uring or SCM_RIGHTS.
But the advantage is that DIO write() is the same as before. The kernel
takes over from that point onwards, all via standard kernel concepts.
Doing it _purely_ in userspace would need completely custom code.
I think this is a useful addition to the kernel and FUSE that someone
else can make use of without needing to write their own code. If there
is a +1 voice at the conference, that would be a great result and would
give me the confidence to go and build it.
>
>
> Thanks,
> Bernd
>
>
* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-01-30 22:05 ` Bernd Schubert
2025-01-30 22:51 ` David Wei
@ 2025-01-31 14:13 ` Amir Goldstein
1 sibling, 0 replies; 8+ messages in thread
From: Amir Goldstein @ 2025-01-31 14:13 UTC (permalink / raw)
To: Bernd Schubert
Cc: David Wei, lsf-pc@lists.linux-foundation.org,
linux-fsdevel@vger.kernel.org, Keith Busch, Ming Lei, Jens Axboe,
Josef Bacik, Joanne Koong
On Thu, Jan 30, 2025 at 11:06 PM Bernd Schubert <bschubert@ddn.com> wrote:
>
> Hi David,
>
> I would love to participate in this discussion and the page
> migration/tmp-page discussions, but I don't think I can make it to LSF/MM.
>
Bernd,
We will do our best to make sure that you can participate in this discussion.
Worst case, we can take a small room and call you from one of the laptops,
the way you guys added me to the FUSE meeting at Plumbers ;-)
Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-01-30 21:28 [LSF/MM/BPF TOPIC] FUSE io_uring zero copy David Wei
2025-01-30 22:05 ` Bernd Schubert
@ 2025-01-30 22:22 ` Keith Busch
2025-01-30 22:40 ` David Wei
2025-02-05 2:27 ` Ming Lei
2 siblings, 1 reply; 8+ messages in thread
From: Keith Busch @ 2025-01-30 22:22 UTC (permalink / raw)
To: David Wei
Cc: lsf-pc, linux-fsdevel, Bernd Schubert, Ming Lei, Jens Axboe,
Josef Bacik, Joanne Koong
On Thu, Jan 30, 2025 at 01:28:55PM -0800, David Wei wrote:
> Hi folks, I want to propose a discussion on adding zero copy to FUSE
> io_uring in the kernel. The source is some userspace buffer or device
> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
> will then either forward it over the network or to an underlying
> FS/block device. The FUSE server may want to read the data.
>
> My goal is to eliminate copies in this entire data path, including the
> initial hop between the userspace client and the kernel. I know Ming and
> Keith are working on adding ublk zero copy but it does not cover this
> initial hop and it does not allow the ublk/FUSE server to read the data.
If the server side has to be able to access the data for whatever
reason, copying does appear to be the best option for compatibility. But
if the server doesn't need to see the data, it's very efficient to reuse
the iov from the original IO without bringing it into a different
process' address space.
> My idea is to use shared memory or dma-buf, i.e. the source data is
> encapsulated in an mmap()able fd. The client and FUSE server exchange
> this fd through a back channel with no kernel involvement. The FUSE
> server maps the fd into its address space and registers the fd with
> io_uring via the io_uring_register() infra. When the client does e.g. a
> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
> a lookup and understands that the pages belong to the fd that was
> registered from the FUSE server. Then io_uring tells the FUSE server
> that the data is in the fd it registered, so there is no need to copy
> anything at all.
>
> I would like to discuss this and get feedback from the community. My top
> question is why do this in the kernel at all? It is entirely possible to
> bypass the kernel entirely by having the client and FUSE server exchange
> the fd and then do the I/O purely through IPC.
This kind of sounds like "paravirtual" features, in that both sides need
to cooperate to make use of the enhancement. Interesting thought that if
everyone is going this far to bypass memory copies, it doesn't look like
much more of a heavy lift to just bypass the kernel too. There's
probably value in retaining the filesystem semantics, though.
* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-01-30 22:22 ` Keith Busch
@ 2025-01-30 22:40 ` David Wei
0 siblings, 0 replies; 8+ messages in thread
From: David Wei @ 2025-01-30 22:40 UTC (permalink / raw)
To: Keith Busch
Cc: lsf-pc, linux-fsdevel, Bernd Schubert, Ming Lei, Jens Axboe,
Josef Bacik, Joanne Koong
On 2025-01-30 14:22, Keith Busch wrote:
> On Thu, Jan 30, 2025 at 01:28:55PM -0800, David Wei wrote:
>> Hi folks, I want to propose a discussion on adding zero copy to FUSE
>> io_uring in the kernel. The source is some userspace buffer or device
>> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
>> will then either forward it over the network or to an underlying
>> FS/block device. The FUSE server may want to read the data.
>>
>> My goal is to eliminate copies in this entire data path, including the
>> initial hop between the userspace client and the kernel. I know Ming and
>> Keith are working on adding ublk zero copy but it does not cover this
>> initial hop and it does not allow the ublk/FUSE server to read the data.
>
> If the server side has to be able to access the data for whatever
> reason, copying does appear to be the best option for compatibility. But
> if the server doesn't need to see the data, it's very efficient to reuse
> the iov from the original IO without bringing it into a different
> process' address space.
For a write operation, the server may want to optionally _read_ the
data. Does this force pages to be pulled into the cache anyway, so we may
as well copy? If so, then I can look into removing this read
requirement, which then makes it possible to use your ublk zero copy
work. That is, make whoever is generating the data do the work that the
FUSE server was going to do _before_ shoving it into the zero copy data
pipeline.
>
>> My idea is to use shared memory or dma-buf, i.e. the source data is
>> encapsulated in an mmap()able fd. The client and FUSE server exchange
>> this fd through a back channel with no kernel involvement. The FUSE
>> server maps the fd into its address space and registers the fd with
>> io_uring via the io_uring_register() infra. When the client does e.g. a
>> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
>> a lookup and understands that the pages belong to the fd that was
>> registered from the FUSE server. Then io_uring tells the FUSE server
>> that the data is in the fd it registered, so there is no need to copy
>> anything at all.
>>
>> I would like to discuss this and get feedback from the community. My top
>> question is why do this in the kernel at all? It is entirely possible to
>> bypass the kernel entirely by having the client and FUSE server exchange
>> the fd and then do the I/O purely through IPC.
>
> This kind of sounds like "paravirtual" features, in that both sides need
> to cooperate to make use of the enhancement. Interesting thought that if
> everyone is going this far to bypass memory copies, it doesn't look like
> much more of a heavy lift to just bypass the kernel too. There's
> probably value in retaining the filesystem semantics, though.
Right, sorry, I didn't phrase it well in my initial email. I _want_ to put
it in the kernel because I want it to be in the open and I want it to be
generically useful to others. But _is it_ generically useful enough to
codify it in the kernel? I have one use case in mind but I think it
would be a bad kernel API if it _only_ worked for my case! That's why I
want to lead a discussion on this at LSFMMBPF to gather feedback.
* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-01-30 21:28 [LSF/MM/BPF TOPIC] FUSE io_uring zero copy David Wei
2025-01-30 22:05 ` Bernd Schubert
2025-01-30 22:22 ` Keith Busch
@ 2025-02-05 2:27 ` Ming Lei
2025-02-07 2:10 ` David Wei
2 siblings, 1 reply; 8+ messages in thread
From: Ming Lei @ 2025-02-05 2:27 UTC (permalink / raw)
To: David Wei
Cc: lsf-pc, linux-fsdevel, Bernd Schubert, Keith Busch, Jens Axboe,
Josef Bacik, Joanne Koong
Hello David,
On Thu, Jan 30, 2025 at 01:28:55PM -0800, David Wei wrote:
> Hi folks, I want to propose a discussion on adding zero copy to FUSE
> io_uring in the kernel. The source is some userspace buffer or device
> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
> will then either forward it over the network or to an underlying
> FS/block device. The FUSE server may want to read the data.
>
> My goal is to eliminate copies in this entire data path, including the
> initial hop between the userspace client and the kernel. I know Ming and
> Keith are working on adding ublk zero copy but it does not cover this
> initial hop and it does not allow the ublk/FUSE server to read the data.
Not sure I get the point; it depends on whether the kernel buffer is
initialized, and you can't read data from an uninitialized kernel buffer.
But if it is a userspace or device buffer, the limit may be relaxed.
>
> My idea is to use shared memory or dma-buf, i.e. the source data is
> encapsulated in an mmap()able fd. The client and FUSE server exchange
> this fd through a back channel with no kernel involvement. The FUSE
> server maps the fd into its address space and registers the fd with
This approach needs client code modification, which isn't generic and
can't cover existing POSIX applications.
There could be too many client processes; does this way really scale?
> io_uring via the io_uring_register() infra. When the client does e.g. a
> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
BTW, fuse supports write zero copy already, just read zero copy isn't
supported.
> a lookup and understands that the pages belong to the fd that was
> registered from the FUSE server. Then io_uring tells the FUSE server
> that the data is in the fd it registered, so there is no need to copy
> anything at all.
>
> I would like to discuss this and get feedback from the community. My top
> question is why do this in the kernel at all? It is entirely possible to
> bypass the kernel entirely by having the client and FUSE server exchange
> the fd and then do the I/O purely through IPC.
IMO, client code modification may not be accepted for existing applications.
Thanks,
Ming
* Re: [LSF/MM/BPF TOPIC] FUSE io_uring zero copy
2025-02-05 2:27 ` Ming Lei
@ 2025-02-07 2:10 ` David Wei
0 siblings, 0 replies; 8+ messages in thread
From: David Wei @ 2025-02-07 2:10 UTC (permalink / raw)
To: Ming Lei
Cc: lsf-pc, linux-fsdevel, Bernd Schubert, Keith Busch, Jens Axboe,
Josef Bacik, Joanne Koong
On 2025-02-04 18:27, Ming Lei wrote:
> Hello David,
>
> On Thu, Jan 30, 2025 at 01:28:55PM -0800, David Wei wrote:
>> Hi folks, I want to propose a discussion on adding zero copy to FUSE
>> io_uring in the kernel. The source is some userspace buffer or device
>> memory e.g. GPU VRAM. The destination is FUSE server in userspace, which
>> will then either forward it over the network or to an underlying
>> FS/block device. The FUSE server may want to read the data.
>>
>> My goal is to eliminate copies in this entire data path, including the
>> initial hop between the userspace client and the kernel. I know Ming and
>> Keith are working on adding ublk zero copy but it does not cover this
>> initial hop and it does not allow the ublk/FUSE server to read the data.
>
> Not sure I get the point; it depends on whether the kernel buffer is
> initialized, and you can't read data from an uninitialized kernel buffer.
>
> But if it is a userspace or device buffer, the limit may be relaxed.
When a client does a DIO write() to a FUSE filefd, the pages are pinned
by the kernel and then passed to FUSE kernel. It is possible to then
send these to the FUSE server, but it cannot read the data, only pass it
onwards.
>
>>
>> My idea is to use shared memory or dma-buf, i.e. the source data is
>> encapsulated in an mmap()able fd. The client and FUSE server exchange
>> this fd through a back channel with no kernel involvement. The FUSE
>> server maps the fd into its address space and registers the fd with
>
> This approach needs client code modification, which isn't generic and
> can't cover existing POSIX applications.
Yes, the fd exchange is not POSIX. But we could encode the API using say
io_uring cmd if it is seen to be generically useful.
>
> There could be too many client processes; does this way really scale?
For zero copy there is a cutover point where it performs better than
copying. The trade off is between memcpy and the overheads of setting up
zero copy. In this case, the client is required to be long lived and
ideally the same shmfd is shared across multiple transactions. So the
overhead is paid once and then amortised over multiple transactions.
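As a rough illustration of that cutover point, the sketch below models a one-time setup cost plus a per-transaction cost for each path and returns the first transaction count at which zero copy comes out ahead. The function name and all the numbers are hypothetical, not measurements.

```python
import math


def breakeven_transactions(setup_cost_us: float,
                           copy_cost_us: float,
                           zc_cost_us: float) -> int:
    """Smallest n such that setup + n * zc < n * copy (all costs in microseconds)."""
    if zc_cost_us >= copy_cost_us:
        raise ValueError("zero copy never wins unless its per-transaction cost is lower")
    saving_per_txn = copy_cost_us - zc_cost_us
    # n must be strictly greater than setup / saving, hence floor(...) + 1.
    return math.floor(setup_cost_us / saving_per_txn) + 1


# Hypothetical numbers: 500 us of one-time setup, 50 us per copied
# transaction, 5 us per zero copy transaction.
n = breakeven_transactions(500.0, 50.0, 5.0)
```

With these made-up inputs the saving is 45 us per transaction, so the setup cost is recovered once the shared fd has been reused for 12 transactions, which is why a long-lived client matters.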
>
>> io_uring via the io_uring_register() infra. When the client does e.g. a
>> DIO write, the pages are pinned and forwarded to FUSE kernel, which does
>
> BTW, fuse supports write zero copy already, just read zero copy isn't
> supported.
Could you clarify exactly which direction and how much of the data path
"zero copy" covers?
>
>> a lookup and understands that the pages belong to the fd that was
>> registered from the FUSE server. Then io_uring tells the FUSE server
>> that the data is in the fd it registered, so there is no need to copy
>> anything at all.
>>
>> I would like to discuss this and get feedback from the community. My top
>> question is why do this in the kernel at all? It is entirely possible to
>> bypass the kernel entirely by having the client and FUSE server exchange
>> the fd and then do the I/O purely through IPC.
>
> IMO, client code modification may not be accepted for existing applications.
That's up to userspace. I don't think we need to limit ourselves to "no
userspace code changes" or "POSIX only".
>
>
> Thanks,
> Ming
>