* [RFC] Extension to POSIX API for zero-copy data transfers
@ 2025-01-19 20:21 mcaju95
From: mcaju95 @ 2025-01-19 20:21 UTC (permalink / raw)
To: linux-api, alx, serge
Greetings,
I've been thinking about a POSIX-like API that would allow
read/write/send/recv to be zero-copy instead of being buffered. This
would let data be transferred directly between storage devices or
network sockets and a user-space application's buffers.
My focus was initially on network stacks and I drew inspiration from
DPDK. I'm also aware of some work underway on extending io_uring to
support zero copy.
A draft API would work as follows:
* The application fills out a series of iovecs with buffers in its own
memory that can store data from protocols such as TCP or UDP. These
iovecs serve as hints telling the network stack that it can safely map
part of a frame's contents anywhere within the described buffers. For
example, an iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }.
In this case the data payload may end up anywhere between 0x4000 and
0xe000, and after the syscall the iovec's fields will be overwritten to
something like { .iov_base = 0x4036, .iov_len = 1460 }.
* In order to receive packets, the application calls readv or a
readv-like syscall, and its array of iovecs will be modified to point
to the data payloads (see the receive-path sketch after this list).
Given that the pages will be mapped directly into user space, some
header fields or tail room may have to be zeroed out before being
mapped, in order to prevent information leaks. Any array of iovecs
passed to such a readv syscall should be sanity-checked: each buffer
must be able to hold a data payload even in corner cases, the buffers
must not overlap with each other, and their addresses must be suitable
for mapping pages to.
* The return value would be the number of data payloads that have been
populated. Only the first that many elements of the provided array
would end up containing data payloads.
* The syscall's prototype would be nearly identical to that of readv,
except that iov would not be a const struct iovec * but a plain struct
iovec *, and the return type would be modified. Like so:
int zc_readv(int fd, struct iovec *iov, int iovcnt);
* In the case of writes, a struct iovec may not suffice, as the
provided buffers should not only give the location and size of the data
to be sent but also guarantee that the buffers have sufficient head and
tail room. A hackish syscall would look like so:
int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
* While the first iovec should describe the entire memory area
available to a packet, including enough head and tail room for headers,
CRCs, or other fields specific to the NIC, the second should describe a
sub-buffer that holds the data to be written.
* Again, sanity checks should be performed on the entire array, for
things like having enough room for other fields, not overlapping,
proper alignment, ability to DMA to a device, etc.
* After calling zc_writev, the pages associated with the provided
iovecs are immediately swapped for zero pages to avoid data leaks.
* For writes, arbitrary physical pages may not work for every NIC, as
some are bound by 32-bit addressing constraints on the PCIe bus, etc.
As such, the application would have to manage a memory pool associated
with each file descriptor (possibly a NIC) that would contain memory
physically mapped to areas that can be DMA'ed to the proper devices.
For example, one may mmap the file descriptor to obtain a pool of a
certain size for this purpose.
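To make the intended calling convention concrete, here is a minimal
receive-path sketch. It assumes the hypothetical zc_readv() syscall and
the mmap-on-the-fd pool semantics described above; none of this exists
in the kernel today, and POOL_SIZE, FRAME_SPAN and receive_some() are
just illustrative names.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/uio.h>

#define POOL_SIZE  (1 << 20)    /* DMA-able pool obtained from the fd  */
#define FRAME_SPAN (16 * 1024)  /* one frame plus generous slack       */
#define NFRAMES    8

int zc_readv(int fd, struct iovec *iov, int iovcnt); /* proposed syscall */

static int receive_some(int sockfd)
{
        /* Map a pool the kernel guarantees is DMA-able for this NIC. */
        char *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, sockfd, 0);
        if (pool == MAP_FAILED)
                return -1;

        /* Each iovec is only a hint: the payload may land anywhere
         * inside the described range. */
        struct iovec iov[NFRAMES];
        for (int i = 0; i < NFRAMES; i++) {
                iov[i].iov_base = pool + (size_t)i * FRAME_SPAN;
                iov[i].iov_len  = FRAME_SPAN;
        }

        /* On return, the first n iovecs have been rewritten to point
         * at the actual payloads. */
        int n = zc_readv(sockfd, iov, NFRAMES);
        for (int i = 0; i < n; i++)
                printf("payload %d: %p, %zu bytes\n",
                       i, iov[i].iov_base, iov[i].iov_len);

        munmap(pool, POOL_SIZE);
        return n;
}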
This concept can be extended to storage devices; unfortunately, I am
unfamiliar with NVMe and SCSI, so I can only guess that they work in a
manner similar to NIC rings, in that data can be written to and read
from arbitrary physical RAM (as allowed by the IOMMU). Syscalls similar
to zc_readv and zc_writev could be used on file descriptors pointing to
storage devices to fetch or write sectors that contain data belonging
to files. Some data should be zeroed out in this case as well, as
sectors more often than not will contain data that does not belong to
the intended files.
For example, one could mix such syscalls to read data directly from
storage into NIC buffers, perform in-place encryption on the way (via
TLS), and send it to a client, much like Netflix does with in-kernel
TLS and sendfile.
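A sketch of that disk-to-NIC path follows, again assuming the
hypothetical zc_readv()/zc_writev() syscalls; encrypt_in_place() and
serve_file_chunk() are placeholder names, and the code is only meant to
show how the two-iovec pairs for zc_writev would be built.

#include <stddef.h>
#include <sys/uio.h>

int zc_readv(int fd, struct iovec *iov, int iovcnt);             /* proposed */
int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt); /* proposed */
void encrypt_in_place(void *buf, size_t len);  /* e.g. TLS record sealing */

/* slots[] describes frame-sized buffers carved out of the NIC's pool,
 * each with enough head and tail room for the NIC's own fields. */
static int serve_file_chunk(int filefd, int sockfd,
                            const struct iovec *slots, int nslots)
{
        /* Working copy: zc_readv() rewrites the iovecs it fills in. */
        struct iovec payload[nslots];
        for (int i = 0; i < nslots; i++)
                payload[i] = slots[i];

        int n = zc_readv(filefd, payload, nslots);
        if (n <= 0)
                return n;

        struct iovec pairs[n][2];
        for (int i = 0; i < n; i++) {
                encrypt_in_place(payload[i].iov_base, payload[i].iov_len);

                pairs[i][0] = slots[i];    /* whole slot, head/tail room */
                pairs[i][1] = payload[i];  /* sub-buffer with the data   */
        }
        return zc_writev(sockfd, (const struct iovec (*)[2])pairs, n);
}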
All the best,
Mihai
* Re: [RFC] Extension to POSIX API for zero-copy data transfers
From: Serge E. Hallyn @ 2025-08-26 12:40 UTC (permalink / raw)
To: mcaju95; +Cc: linux-api, alx, serge
On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@gmail.com wrote:
> Greetings,
>
> I've been thinking about a POSIX-like API that would allow
> read/write/send/recv to be zero-copy instead of being buffered. This
> would let data be transferred directly between storage devices or
> network sockets and a user-space application's buffers.
Hi Mihai,
You're proposing a particular API. Do you have a kernel side
implementation of something along these lines? Do you have a particular
user space use case of your own in mind, or have you spoken to any
potential users?
> My focus was initially on network stacks and I drew inspiration from DPDK.
> I'm also aware of some work underway on extending io_uring to support zero
> copy.
I've not really been following io_uring work. Can you summarize the
status of their zero copy support and the advantages that this new
API would bring?
thanks,
-serge
* Re: [RFC] Extension to POSIX API for zero-copy data transfers
From: Mihai-Drosi Câju @ 2025-08-26 13:17 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: linux-api, alx
On Tue, Aug 26, 2025 at 3:40 PM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@gmail.com wrote:
> > Greetings,
> >
> > I've been thinking about a POSIX-like API that would allow
> > read/write/send/recv to be zero-copy instead of being buffered. This
> > would let data be transferred directly between storage devices or
> > network sockets and a user-space application's buffers.
>
> Hi Mihai,
>
> You're proposing a particular API. Do you have a kernel side
> implementation of something along these lines? Do you have a particular
> user space use case of your own in mind, or have you spoken to any
> potential users?
>
I have a user-space implementation based on DPDK; it has a different
and less hackish API than the one presented here. I thought it best to
seek feedback from the Linux community before actually writing code for
the kernel.
The use case is the same as for the normal Berkeley sockets API, except
that it's faster because buffers are not copied between kernel and user
space on each send and recv. You can even receive a buffer that will be
written to disk, or vice versa, thereby making KTLS and the like
obsolete.
I have not spoken to potential users, but I am aware of several
attempts at a zero-copy TCP/IP stack: F-Stack, mTCP, io_uring.
> > My focus was initially on network stacks and I drew inspiration from DPDK.
> > I'm also aware of some work underway on extending io_uring to support zero
> > copy.
>
> I've not really been following io_uring work. Can you summarize the
> status of their zero copy support and the advantages that this new
> API would bring?
>
So far, io_uring only supports zero-copy reception of TCP segments:
https://docs.kernel.org/networking/iou-zcrx.html
The interface is rather cluttered...