* [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd @ 2025-08-22 12:55 Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 1/5] rust: bindings: add io_uring headers in bindings_helper.h Sidong Yang ` (4 more replies) 0 siblings, 5 replies; 33+ messages in thread From: Sidong Yang @ 2025-08-22 12:55 UTC (permalink / raw) To: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin Cc: Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring, Sidong Yang This patch series implements the groundwork for using io-uring cmd interface in rust miscdevice. 1. Io-uring headers for bindings We needs additional headers for implementing rust abstraction. 2. zero-init pdu in io_uring_cmd We need to initialize pdu for avoiding UB when reading pdu in rust. 3. Core abstractions Provides 2 core struct for abstraction. One is for io_uring_cmd and another is for io_uring_sqe. 4. Method for MiscDevice MiscDevice has some method for `file_operations`. This patch adds a method `uring_cmd` which could be used in miscdevice driver. 5. uring_cmd interface for MiscDevice sample Added sample code for using new interface for `MiscDevice::uring_cmd()` Together, these 5 patches. - io-uring headers for bindings - zero-init pdu in io_uring_cmd - abstraction types for io-uring interface - add uring_cmd method for miscdevice - sample that implements uring_cmd method Changelog: v2: * use pinned &mut for IoUringCmd * add missing safety comments * use write_volatile for read uring_cmd in sample v3: * fixes various comments including safety related ones. * zero-init pdu in io_uring_cmd * use `read_pdu`/`write_pdu` with `AsBytes`/`FromBytes` for pdu * `IoUringSqe::cmd_data` checkes opcode and returns `FromBytes` * use `_IOR`/`_IOW` macros for cmd_op in sample Sidong Yang (5): rust: bindings: add io_uring headers in bindings_helper.h io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB rust: io_uring: introduce rust abstraction for io-uring cmd rust: miscdevice: Add `uring_cmd` support samples: rust: Add `uring_cmd` example to `rust_misc_device` io_uring/uring_cmd.c | 1 + rust/bindings/bindings_helper.h | 2 + rust/kernel/io_uring.rs | 306 +++++++++++++++++++++++++++++++ rust/kernel/lib.rs | 1 + rust/kernel/miscdevice.rs | 53 +++++- samples/rust/rust_misc_device.rs | 27 +++ 6 files changed, 389 insertions(+), 1 deletion(-) create mode 100644 rust/kernel/io_uring.rs -- 2.43.0 ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH v3 1/5] rust: bindings: add io_uring headers in bindings_helper.h 2025-08-22 12:55 [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd Sidong Yang @ 2025-08-22 12:55 ` Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB Sidong Yang ` (3 subsequent siblings) 4 siblings, 0 replies; 33+ messages in thread From: Sidong Yang @ 2025-08-22 12:55 UTC (permalink / raw) To: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin Cc: Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring, Sidong Yang This patch adds two headers io_uring.h io_uring/cmd.h in bindings_helper for implementing rust io_uring abstraction. Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> --- rust/bindings/bindings_helper.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 84d60635e8a9..96beaea73755 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -75,6 +75,8 @@ #include <linux/wait.h> #include <linux/workqueue.h> #include <linux/xarray.h> +#include <linux/io_uring.h> +#include <linux/io_uring/cmd.h> #include <trace/events/rust_sample.h> #if defined(CONFIG_DRM_PANIC_SCREEN_QR_CODE) -- 2.43.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-08-22 12:55 [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 1/5] rust: bindings: add io_uring headers in bindings_helper.h Sidong Yang @ 2025-08-22 12:55 ` Sidong Yang 2025-09-02 0:34 ` Caleb Sander Mateos 2025-08-22 12:55 ` [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd Sidong Yang ` (2 subsequent siblings) 4 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-08-22 12:55 UTC (permalink / raw) To: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin Cc: Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring, Sidong Yang The pdu field in io_uring_cmd may contain stale data when a request object is recycled from the slab cache. Accessing uninitialized or garbage memory can lead to undefined behavior in users of the pdu. Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that each command starts from a well-defined state. This avoids exposing uninitialized memory and prevents potential misinterpretation of data from previous requests. No functional change is intended other than guaranteeing that pdu is always zero-initialized before use. Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> --- io_uring/uring_cmd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 053bac89b6c0..2492525d4e43 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (!ac) return -ENOMEM; ioucmd->sqe = sqe; + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); return 0; } -- 2.43.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-08-22 12:55 ` [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB Sidong Yang @ 2025-09-02 0:34 ` Caleb Sander Mateos 2025-09-02 10:23 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-02 0:34 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > The pdu field in io_uring_cmd may contain stale data when a request > object is recycled from the slab cache. Accessing uninitialized or > garbage memory can lead to undefined behavior in users of the pdu. > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > each command starts from a well-defined state. This avoids exposing > uninitialized memory and prevents potential misinterpretation of data > from previous requests. > > No functional change is intended other than guaranteeing that pdu is > always zero-initialized before use. > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > --- > io_uring/uring_cmd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 053bac89b6c0..2492525d4e43 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > if (!ac) > return -ENOMEM; > ioucmd->sqe = sqe; > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); Adding this overhead to every existing uring_cmd() implementation is unfortunate. Could we instead track the initialized/uninitialized state by using different types on the Rust side? The io_uring_cmd could start as an IoUringCmd, where the PDU field is MaybeUninit, write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the PDU has been initialized. Best, Caleb > return 0; > } > > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-02 0:34 ` Caleb Sander Mateos @ 2025-09-02 10:23 ` Sidong Yang 2025-09-02 15:31 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-02 10:23 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > The pdu field in io_uring_cmd may contain stale data when a request > > object is recycled from the slab cache. Accessing uninitialized or > > garbage memory can lead to undefined behavior in users of the pdu. > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > each command starts from a well-defined state. This avoids exposing > > uninitialized memory and prevents potential misinterpretation of data > > from previous requests. > > > > No functional change is intended other than guaranteeing that pdu is > > always zero-initialized before use. > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > --- > > io_uring/uring_cmd.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > index 053bac89b6c0..2492525d4e43 100644 > > --- a/io_uring/uring_cmd.c > > +++ b/io_uring/uring_cmd.c > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > if (!ac) > > return -ENOMEM; > > ioucmd->sqe = sqe; > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > Adding this overhead to every existing uring_cmd() implementation is > unfortunate. Could we instead track the initialized/uninitialized > state by using different types on the Rust side? The io_uring_cmd > could start as an IoUringCmd, where the PDU field is MaybeUninit, > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > PDU has been initialized. I've found a flag IORING_URING_CMD_REISSUE that we could initialize the pdu. In uring_cmd callback, we can fill zero when it's not reissued. But I don't know that we could call T::default() in miscdevice. If we make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. How about assign a byte in pdu for checking initialized? In uring_cmd(), We could set a byte flag that it's not initialized. And we could return error that it's not initialized in read_pdu(). Thanks, Sidong > > Best, > Caleb > > > return 0; > > } > > > > -- > > 2.43.0 > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-02 10:23 ` Sidong Yang @ 2025-09-02 15:31 ` Caleb Sander Mateos 2025-09-06 14:28 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-02 15:31 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > object is recycled from the slab cache. Accessing uninitialized or > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > each command starts from a well-defined state. This avoids exposing > > > uninitialized memory and prevents potential misinterpretation of data > > > from previous requests. > > > > > > No functional change is intended other than guaranteeing that pdu is > > > always zero-initialized before use. > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > --- > > > io_uring/uring_cmd.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > index 053bac89b6c0..2492525d4e43 100644 > > > --- a/io_uring/uring_cmd.c > > > +++ b/io_uring/uring_cmd.c > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > if (!ac) > > > return -ENOMEM; > > > ioucmd->sqe = sqe; > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > Adding this overhead to every existing uring_cmd() implementation is > > unfortunate. Could we instead track the initialized/uninitialized > > state by using different types on the Rust side? The io_uring_cmd > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > PDU has been initialized. > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > But I don't know that we could call T::default() in miscdevice. If we > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > We could set a byte flag that it's not initialized. And we could return > error that it's not initialized in read_pdu(). Could we do the zero-initialization (or T::default()) in MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag isn't set (i.e. on the initial issue)? That way, we avoid any performance penalty for the existing C uring_cmd() implementations. I'm not quite sure what you mean by "assign a byte in pdu for checking initialized". Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-02 15:31 ` Caleb Sander Mateos @ 2025-09-06 14:28 ` Sidong Yang 2025-09-08 19:45 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-06 14:28 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > each command starts from a well-defined state. This avoids exposing > > > > uninitialized memory and prevents potential misinterpretation of data > > > > from previous requests. > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > always zero-initialized before use. > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > --- > > > > io_uring/uring_cmd.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > --- a/io_uring/uring_cmd.c > > > > +++ b/io_uring/uring_cmd.c > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > if (!ac) > > > > return -ENOMEM; > > > > ioucmd->sqe = sqe; > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > unfortunate. Could we instead track the initialized/uninitialized > > > state by using different types on the Rust side? The io_uring_cmd > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > PDU has been initialized. > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > But I don't know that we could call T::default() in miscdevice. If we > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > We could set a byte flag that it's not initialized. And we could return > > error that it's not initialized in read_pdu(). > > Could we do the zero-initialization (or T::default()) in > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > isn't set (i.e. on the initial issue)? That way, we avoid any > performance penalty for the existing C uring_cmd() implementations. > I'm not quite sure what you mean by "assign a byte in pdu for checking > initialized". Sure, we could fill zero when it's the first time uring_cmd called with checking the flag. I would remove this commit for next version. I also suggests that we would provide the method that read_pdu() and write_pdu(). In read_pdu() I want to check write_pdu() is called before. So along the 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is initialized? But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and issue_flags. How about some additional field for pdu is initialized like below? struct IoUringCmdArgs { ioucmd: Pin<&mut IoUringCmd>, issue_flags: u32, pdu_initialized: bool, } > > Best, > Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-06 14:28 ` Sidong Yang @ 2025-09-08 19:45 ` Caleb Sander Mateos 2025-09-09 14:43 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-08 19:45 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > each command starts from a well-defined state. This avoids exposing > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > from previous requests. > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > always zero-initialized before use. > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > --- > > > > > io_uring/uring_cmd.c | 1 + > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > --- a/io_uring/uring_cmd.c > > > > > +++ b/io_uring/uring_cmd.c > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > if (!ac) > > > > > return -ENOMEM; > > > > > ioucmd->sqe = sqe; > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > state by using different types on the Rust side? The io_uring_cmd > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > PDU has been initialized. > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > But I don't know that we could call T::default() in miscdevice. If we > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > We could set a byte flag that it's not initialized. And we could return > > > error that it's not initialized in read_pdu(). > > > > Could we do the zero-initialization (or T::default()) in > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > isn't set (i.e. on the initial issue)? That way, we avoid any > > performance penalty for the existing C uring_cmd() implementations. > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > initialized". > > Sure, we could fill zero when it's the first time uring_cmd called with > checking the flag. I would remove this commit for next version. I also > suggests that we would provide the method that read_pdu() and write_pdu(). > In read_pdu() I want to check write_pdu() is called before. So along the > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > initialized? Not sure what you mean about "20 bytes for pdu". It seems like it would be preferable to enforce that write_pdu() has been called before read_pdu() using the Rust type system instead of a runtime check. I was thinking a signature like fn write_pdu(cmd: IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a reason that wouldn't work and a runtime check would be necessary? > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > issue_flags. How about some additional field for pdu is initialized like below? > > struct IoUringCmdArgs { > ioucmd: Pin<&mut IoUringCmd>, > issue_flags: u32, > pdu_initialized: bool, > } One other thing I realized is that issue_flags should come from the *current* context rather than the context the uring_cmd() callback was called in. For example, if io_uring_cmd_done() is called from task work context, issue_flags should match the issue_flags passed to the io_uring_cmd_tw_t callback, not the issue_flags originally passed to the uring_cmd() callback. So it probably makes more sense to decouple issue_flags from the (owned) IoUringCmd. I think you could pass it by reference (&IssueFlags) or with a phantom reference lifetime (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to ensure it can't be used after those callbacks have returned. Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-08 19:45 ` Caleb Sander Mateos @ 2025-09-09 14:43 ` Sidong Yang 2025-09-09 16:32 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-09 14:43 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > from previous requests. > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > always zero-initialized before use. > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > --- > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > --- a/io_uring/uring_cmd.c > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > if (!ac) > > > > > > return -ENOMEM; > > > > > > ioucmd->sqe = sqe; > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > PDU has been initialized. > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > We could set a byte flag that it's not initialized. And we could return > > > > error that it's not initialized in read_pdu(). > > > > > > Could we do the zero-initialization (or T::default()) in > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > performance penalty for the existing C uring_cmd() implementations. > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > initialized". > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > checking the flag. I would remove this commit for next version. I also > > suggests that we would provide the method that read_pdu() and write_pdu(). > > In read_pdu() I want to check write_pdu() is called before. So along the > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > initialized? > > Not sure what you mean about "20 bytes for pdu". > It seems like it would be preferable to enforce that write_pdu() has > been called before read_pdu() using the Rust type system instead of a > runtime check. I was thinking a signature like fn write_pdu(cmd: > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > reason that wouldn't work and a runtime check would be necessary? I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. I think it's good way to pdu is safe without adding a new generic param for MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > issue_flags. How about some additional field for pdu is initialized like below? > > > > struct IoUringCmdArgs { > > ioucmd: Pin<&mut IoUringCmd>, > > issue_flags: u32, > > pdu_initialized: bool, > > } > > One other thing I realized is that issue_flags should come from the > *current* context rather than the context the uring_cmd() callback was > called in. For example, if io_uring_cmd_done() is called from task > work context, issue_flags should match the issue_flags passed to the > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > the uring_cmd() callback. So it probably makes more sense to decouple > issue_flags from the (owned) IoUringCmd. I think you could pass it by > reference (&IssueFlags) or with a phantom reference lifetime > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > ensure it can't be used after those callbacks have returned. I have had no idea about task work context. I agree with you that it would be better to separate issue_flags from IoUringCmd. So, IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > Best, > Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-09 14:43 ` Sidong Yang @ 2025-09-09 16:32 ` Caleb Sander Mateos 2025-09-12 16:41 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-09 16:32 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > from previous requests. > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > --- > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > if (!ac) > > > > > > > return -ENOMEM; > > > > > > > ioucmd->sqe = sqe; > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > PDU has been initialized. > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > error that it's not initialized in read_pdu(). > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > performance penalty for the existing C uring_cmd() implementations. > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > initialized". > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > checking the flag. I would remove this commit for next version. I also > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > initialized? > > > > Not sure what you mean about "20 bytes for pdu". > > It seems like it would be preferable to enforce that write_pdu() has > > been called before read_pdu() using the Rust type system instead of a > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > reason that wouldn't work and a runtime check would be necessary? > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > I think it's good way to pdu is safe without adding a new generic param for > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. Yes, that's what I was thinking. > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > struct IoUringCmdArgs { > > > ioucmd: Pin<&mut IoUringCmd>, > > > issue_flags: u32, > > > pdu_initialized: bool, > > > } > > > > One other thing I realized is that issue_flags should come from the > > *current* context rather than the context the uring_cmd() callback was > > called in. For example, if io_uring_cmd_done() is called from task > > work context, issue_flags should match the issue_flags passed to the > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > the uring_cmd() callback. So it probably makes more sense to decouple > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > reference (&IssueFlags) or with a phantom reference lifetime > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > ensure it can't be used after those callbacks have returned. > > I have had no idea about task work context. I agree with you that > it would be better to separate issue_flags from IoUringCmd. So, > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? "Task work" is a mechanism io_uring uses to queue work to run on the thread that submitted an io_uring operation. It's basically a per-thread atomic queue of callbacks that the thread will process whenever it returns from the kernel to userspace (after a syscall or an interrupt). This is the context where asynchronous uring_cmd completions are generally processed (see io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I can't speak to the history of why io_uring uses task work, but my guess would be that it provides a safe context to acquire the io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be called from an interrupt handler, so it's not allowed to take a mutex). Processing all the task work at once also provides natural opportunities for batching. Yes, we probably don't need to bundle anything else with the IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut IoUringCmd> will work for uring_cmds that complete asynchronously, as they will need to outlive the uring_cmd() call. So uring_cmd() needs to transfer ownership of the struct io_uring_cmd. Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-09 16:32 ` Caleb Sander Mateos @ 2025-09-12 16:41 ` Sidong Yang 2025-09-12 17:56 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-12 16:41 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote: > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > > from previous requests. > > > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > > --- > > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > > if (!ac) > > > > > > > > return -ENOMEM; > > > > > > > > ioucmd->sqe = sqe; > > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > > PDU has been initialized. > > > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > > error that it's not initialized in read_pdu(). > > > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > > performance penalty for the existing C uring_cmd() implementations. > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > > initialized". > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > > checking the flag. I would remove this commit for next version. I also > > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > > initialized? > > > > > > Not sure what you mean about "20 bytes for pdu". > > > It seems like it would be preferable to enforce that write_pdu() has > > > been called before read_pdu() using the Rust type system instead of a > > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > > reason that wouldn't work and a runtime check would be necessary? > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > > I think it's good way to pdu is safe without adding a new generic param for > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > Yes, that's what I was thinking. Good, I'll change api in this way. Thanks! > > > > > > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > > > struct IoUringCmdArgs { > > > > ioucmd: Pin<&mut IoUringCmd>, > > > > issue_flags: u32, > > > > pdu_initialized: bool, > > > > } > > > > > > One other thing I realized is that issue_flags should come from the > > > *current* context rather than the context the uring_cmd() callback was > > > called in. For example, if io_uring_cmd_done() is called from task > > > work context, issue_flags should match the issue_flags passed to the > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > > the uring_cmd() callback. So it probably makes more sense to decouple > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > > reference (&IssueFlags) or with a phantom reference lifetime > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > > ensure it can't be used after those callbacks have returned. > > > > I have had no idea about task work context. I agree with you that > > it would be better to separate issue_flags from IoUringCmd. So, > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > "Task work" is a mechanism io_uring uses to queue work to run on the > thread that submitted an io_uring operation. It's basically a > per-thread atomic queue of callbacks that the thread will process > whenever it returns from the kernel to userspace (after a syscall or > an interrupt). This is the context where asynchronous uring_cmd > completions are generally processed (see > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I > can't speak to the history of why io_uring uses task work, but my > guess would be that it provides a safe context to acquire the > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be > called from an interrupt handler, so it's not allowed to take a > mutex). Processing all the task work at once also provides natural > opportunities for batching. Thanks, I've checked io_uring_cmd_complete_in_task() that it receives callback that has issue_flags different with io_uring_cmd(). I'll try to add a api that wrapping io_uring_cmd_complete_in_task() for next version. > > Yes, we probably don't need to bundle anything else with the > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut > IoUringCmd> will work for uring_cmds that complete asynchronously, as > they will need to outlive the uring_cmd() call. So uring_cmd() needs > to transfer ownership of the struct io_uring_cmd. I can't think that how to take ownership of struct io_uring_cmd. The struct allocated with io_alloc_req() and should be freed with io_free_req(). If taking ownership means having pointer of struct io_uring_cmd, I think it's no difference with current version. Also could it be called with mem::forget() if it has ownership? Thanks, Sidong > Best, > Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-12 16:41 ` Sidong Yang @ 2025-09-12 17:56 ` Caleb Sander Mateos 2025-09-13 12:42 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-12 17:56 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote: > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > > > from previous requests. > > > > > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > > > --- > > > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > > > if (!ac) > > > > > > > > > return -ENOMEM; > > > > > > > > > ioucmd->sqe = sqe; > > > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > > > PDU has been initialized. > > > > > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > > > error that it's not initialized in read_pdu(). > > > > > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > > > performance penalty for the existing C uring_cmd() implementations. > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > > > initialized". > > > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > > > checking the flag. I would remove this commit for next version. I also > > > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > > > initialized? > > > > > > > > Not sure what you mean about "20 bytes for pdu". > > > > It seems like it would be preferable to enforce that write_pdu() has > > > > been called before read_pdu() using the Rust type system instead of a > > > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > > > reason that wouldn't work and a runtime check would be necessary? > > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > > > I think it's good way to pdu is safe without adding a new generic param for > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > > > Yes, that's what I was thinking. > > Good, I'll change api in this way. Thanks! > > > > > > > > > > > > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > > > > > struct IoUringCmdArgs { > > > > > ioucmd: Pin<&mut IoUringCmd>, > > > > > issue_flags: u32, > > > > > pdu_initialized: bool, > > > > > } > > > > > > > > One other thing I realized is that issue_flags should come from the > > > > *current* context rather than the context the uring_cmd() callback was > > > > called in. For example, if io_uring_cmd_done() is called from task > > > > work context, issue_flags should match the issue_flags passed to the > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > > > the uring_cmd() callback. So it probably makes more sense to decouple > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > > > reference (&IssueFlags) or with a phantom reference lifetime > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > > > ensure it can't be used after those callbacks have returned. > > > > > > I have had no idea about task work context. I agree with you that > > > it would be better to separate issue_flags from IoUringCmd. So, > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > > > "Task work" is a mechanism io_uring uses to queue work to run on the > > thread that submitted an io_uring operation. It's basically a > > per-thread atomic queue of callbacks that the thread will process > > whenever it returns from the kernel to userspace (after a syscall or > > an interrupt). This is the context where asynchronous uring_cmd > > completions are generally processed (see > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I > > can't speak to the history of why io_uring uses task work, but my > > guess would be that it provides a safe context to acquire the > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be > > called from an interrupt handler, so it's not allowed to take a > > mutex). Processing all the task work at once also provides natural > > opportunities for batching. > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives > callback that has issue_flags different with io_uring_cmd(). I'll try to add > a api that wrapping io_uring_cmd_complete_in_task() for next version. > > > > > Yes, we probably don't need to bundle anything else with the > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut > > IoUringCmd> will work for uring_cmds that complete asynchronously, as > > they will need to outlive the uring_cmd() call. So uring_cmd() needs > > to transfer ownership of the struct io_uring_cmd. > > I can't think that how to take ownership of struct io_uring_cmd. The > struct allocated with io_alloc_req() and should be freed with io_free_req(). > If taking ownership means having pointer of struct io_uring_cmd, I think > it's no difference with current version. Also could it be called with > mem::forget() if it has ownership? I don't mean ownership of the io_uring_cmd allocation; that's the responsibility of the io_uring layer. But once the io_uring_cmd is handed to the uring_cmd() implementation, it belongs to that layer until it completes the command back to io_uring. Maybe a better way to describe it would be as ownership of the "executing io_uring_cmd". The problem with Pin<&mut IoUringCmd> is that it is a borrowed reference to the io_uring_cmd, so it can't outlive the uring_cmd() callback. Yes, it's possible to leak the io_uring_cmd by never calling io_uring_cmd_done() to return it to the io_uring layer. I would imagine something like this: #[derive(Clone, Copy)] struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>); // Indicates ownership of the io_uring_cmd between uring_cmd() and io_uring_cmd_done() struct IoUringCmd(NonNull<bindings::io_uring_cmd>); impl IoUringCmd { // ... fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) { let cmd = self.0.as_ptr(); let issue_flags = issue_flags.0; unsafe { bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags) } } } // Can choose whether to complete the command synchronously or asynchronously. // If take_async() is called, IoUringCmd::done() needs to be called to complete the command. // If take_async() isn't called, the command is completed synchronously // with the return value from MiscDevice::uring_cmd(). struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>); impl UringCmdInput<'_> { fn take_async(self) -> IoUringCmd { IoUringCmd(self.0.take().unwrap()) } } trait MiscDevice { // ... fn uring_cmd( _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, _cmd: UringCmdInput<'_>, _ issue_flags: IssueFlags <'_>, ) -> Result<i32> { build_error!(VTABLE_DEFAULT_ERROR) } } impl<T: MiscDevice> MiscdeviceVTable<T> { // ... unsafe extern "C" fn uring_cmd( ioucmd: *mut bindings::io_uring_cmd, issue_flags: c_uint, ) -> c_int { let raw_file = unsafe { (*ioucmd).file }; let private = unsafe { (*raw_file).private_data }.cast(); let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) }; let mut ioucmd = Some(NonNull::new(ioucmd).unwrap()); let issue_flags = IssueFlags(issue_flags, PhantomData); let ret = T::uring_cmd(device, UringCmdInput(&mut ioucmd), issue_flags); // -EIOCBQUEUED indicates ownership of io_uring_cmd has been taken if iou_cmd.is_none() { return -bindings::EIOCBQUEUED; } let ret = from_result(|| ret); if ret == -bindings::EIOCBQUEUED { -bindings::EINVAL } else { ret } } } Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-12 17:56 ` Caleb Sander Mateos @ 2025-09-13 12:42 ` Sidong Yang 2025-09-15 16:54 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-13 12:42 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote: > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote: > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > > > > from previous requests. > > > > > > > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > > > > --- > > > > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > > > > if (!ac) > > > > > > > > > > return -ENOMEM; > > > > > > > > > > ioucmd->sqe = sqe; > > > > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > > > > PDU has been initialized. > > > > > > > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > > > > error that it's not initialized in read_pdu(). > > > > > > > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > > > > performance penalty for the existing C uring_cmd() implementations. > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > > > > initialized". > > > > > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > > > > checking the flag. I would remove this commit for next version. I also > > > > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > > > > initialized? > > > > > > > > > > Not sure what you mean about "20 bytes for pdu". > > > > > It seems like it would be preferable to enforce that write_pdu() has > > > > > been called before read_pdu() using the Rust type system instead of a > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > > > > reason that wouldn't work and a runtime check would be necessary? > > > > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > > > > I think it's good way to pdu is safe without adding a new generic param for > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > > > > > Yes, that's what I was thinking. > > > > Good, I'll change api in this way. Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > > > > > > > struct IoUringCmdArgs { > > > > > > ioucmd: Pin<&mut IoUringCmd>, > > > > > > issue_flags: u32, > > > > > > pdu_initialized: bool, > > > > > > } > > > > > > > > > > One other thing I realized is that issue_flags should come from the > > > > > *current* context rather than the context the uring_cmd() callback was > > > > > called in. For example, if io_uring_cmd_done() is called from task > > > > > work context, issue_flags should match the issue_flags passed to the > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > > > > the uring_cmd() callback. So it probably makes more sense to decouple > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > > > > reference (&IssueFlags) or with a phantom reference lifetime > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > > > > ensure it can't be used after those callbacks have returned. > > > > > > > > I have had no idea about task work context. I agree with you that > > > > it would be better to separate issue_flags from IoUringCmd. So, > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > > > > > "Task work" is a mechanism io_uring uses to queue work to run on the > > > thread that submitted an io_uring operation. It's basically a > > > per-thread atomic queue of callbacks that the thread will process > > > whenever it returns from the kernel to userspace (after a syscall or > > > an interrupt). This is the context where asynchronous uring_cmd > > > completions are generally processed (see > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I > > > can't speak to the history of why io_uring uses task work, but my > > > guess would be that it provides a safe context to acquire the > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be > > > called from an interrupt handler, so it's not allowed to take a > > > mutex). Processing all the task work at once also provides natural > > > opportunities for batching. > > > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives > > callback that has issue_flags different with io_uring_cmd(). I'll try to add > > a api that wrapping io_uring_cmd_complete_in_task() for next version. > > > > > > > > Yes, we probably don't need to bundle anything else with the > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs > > > to transfer ownership of the struct io_uring_cmd. > > > > I can't think that how to take ownership of struct io_uring_cmd. The > > struct allocated with io_alloc_req() and should be freed with io_free_req(). > > If taking ownership means having pointer of struct io_uring_cmd, I think > > it's no difference with current version. Also could it be called with > > mem::forget() if it has ownership? > > I don't mean ownership of the io_uring_cmd allocation; that's the > responsibility of the io_uring layer. But once the io_uring_cmd is > handed to the uring_cmd() implementation, it belongs to that layer > until it completes the command back to io_uring. Maybe a better way to > describe it would be as ownership of the "executing io_uring_cmd". The > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference > to the io_uring_cmd, so it can't outlive the uring_cmd() callback. > Yes, it's possible to leak the io_uring_cmd by never calling > io_uring_cmd_done() to return it to the io_uring layer. Thanks, I understood that IoUringCmd could be outlive uring_cmd callback. But it's sad that it could be leaked without any unsafe code. > > I would imagine something like this: > > #[derive(Clone, Copy)] > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>); > > // Indicates ownership of the io_uring_cmd between uring_cmd() and > io_uring_cmd_done() > struct IoUringCmd(NonNull<bindings::io_uring_cmd>); > > impl IoUringCmd { > // ... > > fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) { > let cmd = self.0.as_ptr(); > let issue_flags = issue_flags.0; > unsafe { > bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags) > } > } > } > > // Can choose whether to complete the command synchronously or asynchronously. > // If take_async() is called, IoUringCmd::done() needs to be called to > complete the command. > // If take_async() isn't called, the command is completed synchronously > // with the return value from MiscDevice::uring_cmd(). > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>); Thanks for a detailed example! But rather than this, We could introduce new return type that has a callback that user could take IoUringCmd instead of returning -EIOCBQUEUED. But I prefer that we provide just one type IoUringCmd without UringCmdInput. Although UringCmdInput, user could call done() in uring_cmd callback and it makes confusion that whether task_async() called and returning -EIOCBQUEUED is mismatched that returns -EINVAL. We don't need to make it complex. Thanks, Sidong > > impl UringCmdInput<'_> { > fn take_async(self) -> IoUringCmd { > IoUringCmd(self.0.take().unwrap()) > } > } > > trait MiscDevice { > // ... > > fn uring_cmd( > _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, > _cmd: UringCmdInput<'_>, > _ issue_flags: IssueFlags <'_>, > ) -> Result<i32> { > build_error!(VTABLE_DEFAULT_ERROR) > } > } > > impl<T: MiscDevice> MiscdeviceVTable<T> { > // ... > > unsafe extern "C" fn uring_cmd( > ioucmd: *mut bindings::io_uring_cmd, > issue_flags: c_uint, > ) -> c_int { > let raw_file = unsafe { (*ioucmd).file }; > let private = unsafe { (*raw_file).private_data }.cast(); > let device = unsafe { <T::Ptr as > ForeignOwnable>::borrow(private) }; > let mut ioucmd = Some(NonNull::new(ioucmd).unwrap()); > let issue_flags = IssueFlags(issue_flags, PhantomData); > let ret = T::uring_cmd(device, UringCmdInput(&mut > ioucmd), issue_flags); > // -EIOCBQUEUED indicates ownership of io_uring_cmd > has been taken > if iou_cmd.is_none() { > return -bindings::EIOCBQUEUED; > } > > let ret = from_result(|| ret); > if ret == -bindings::EIOCBQUEUED { > -bindings::EINVAL > } else { > ret > } > } > } > > Best, > Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-13 12:42 ` Sidong Yang @ 2025-09-15 16:54 ` Caleb Sander Mateos 2025-09-17 14:56 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-15 16:54 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Sat, Sep 13, 2025 at 5:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote: > > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote: > > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > > > > > from previous requests. > > > > > > > > > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > > > > > --- > > > > > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > > > > > if (!ac) > > > > > > > > > > > return -ENOMEM; > > > > > > > > > > > ioucmd->sqe = sqe; > > > > > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > > > > > PDU has been initialized. > > > > > > > > > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > > > > > error that it's not initialized in read_pdu(). > > > > > > > > > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > > > > > performance penalty for the existing C uring_cmd() implementations. > > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > > > > > initialized". > > > > > > > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > > > > > checking the flag. I would remove this commit for next version. I also > > > > > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > > > > > initialized? > > > > > > > > > > > > Not sure what you mean about "20 bytes for pdu". > > > > > > It seems like it would be preferable to enforce that write_pdu() has > > > > > > been called before read_pdu() using the Rust type system instead of a > > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > > > > > reason that wouldn't work and a runtime check would be necessary? > > > > > > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > > > > > I think it's good way to pdu is safe without adding a new generic param for > > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > > > > > > > Yes, that's what I was thinking. > > > > > > Good, I'll change api in this way. Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > > > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > > > > > > > > > struct IoUringCmdArgs { > > > > > > > ioucmd: Pin<&mut IoUringCmd>, > > > > > > > issue_flags: u32, > > > > > > > pdu_initialized: bool, > > > > > > > } > > > > > > > > > > > > One other thing I realized is that issue_flags should come from the > > > > > > *current* context rather than the context the uring_cmd() callback was > > > > > > called in. For example, if io_uring_cmd_done() is called from task > > > > > > work context, issue_flags should match the issue_flags passed to the > > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > > > > > the uring_cmd() callback. So it probably makes more sense to decouple > > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > > > > > reference (&IssueFlags) or with a phantom reference lifetime > > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > > > > > ensure it can't be used after those callbacks have returned. > > > > > > > > > > I have had no idea about task work context. I agree with you that > > > > > it would be better to separate issue_flags from IoUringCmd. So, > > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > > > > > > > "Task work" is a mechanism io_uring uses to queue work to run on the > > > > thread that submitted an io_uring operation. It's basically a > > > > per-thread atomic queue of callbacks that the thread will process > > > > whenever it returns from the kernel to userspace (after a syscall or > > > > an interrupt). This is the context where asynchronous uring_cmd > > > > completions are generally processed (see > > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I > > > > can't speak to the history of why io_uring uses task work, but my > > > > guess would be that it provides a safe context to acquire the > > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be > > > > called from an interrupt handler, so it's not allowed to take a > > > > mutex). Processing all the task work at once also provides natural > > > > opportunities for batching. > > > > > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives > > > callback that has issue_flags different with io_uring_cmd(). I'll try to add > > > a api that wrapping io_uring_cmd_complete_in_task() for next version. > > > > > > > > > > > Yes, we probably don't need to bundle anything else with the > > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut > > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as > > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs > > > > to transfer ownership of the struct io_uring_cmd. > > > > > > I can't think that how to take ownership of struct io_uring_cmd. The > > > struct allocated with io_alloc_req() and should be freed with io_free_req(). > > > If taking ownership means having pointer of struct io_uring_cmd, I think > > > it's no difference with current version. Also could it be called with > > > mem::forget() if it has ownership? > > > > I don't mean ownership of the io_uring_cmd allocation; that's the > > responsibility of the io_uring layer. But once the io_uring_cmd is > > handed to the uring_cmd() implementation, it belongs to that layer > > until it completes the command back to io_uring. Maybe a better way to > > describe it would be as ownership of the "executing io_uring_cmd". The > > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference > > to the io_uring_cmd, so it can't outlive the uring_cmd() callback. > > Yes, it's possible to leak the io_uring_cmd by never calling > > io_uring_cmd_done() to return it to the io_uring layer. > > Thanks, I understood that IoUringCmd could be outlive uring_cmd callback. > But it's sad that it could be leaked without any unsafe code. Safety in Rust doesn't require destructors to run, which means any resource can be safely leaked (https://faultlore.com/blah/everyone-poops/ has some historical background on why Rust decided leaks had to be considered safe). Leaking an io_uring_cmd is analogous to leaking a Box, both are perfectly possible in safe Rust. > > > > > I would imagine something like this: > > > > #[derive(Clone, Copy)] > > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>); > > > > // Indicates ownership of the io_uring_cmd between uring_cmd() and > > io_uring_cmd_done() > > struct IoUringCmd(NonNull<bindings::io_uring_cmd>); > > > > impl IoUringCmd { > > // ... > > > > fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) { > > let cmd = self.0.as_ptr(); > > let issue_flags = issue_flags.0; > > unsafe { > > bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags) > > } > > } > > } > > > > // Can choose whether to complete the command synchronously or asynchronously. > > // If take_async() is called, IoUringCmd::done() needs to be called to > > complete the command. > > // If take_async() isn't called, the command is completed synchronously > > // with the return value from MiscDevice::uring_cmd(). > > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>); > > Thanks for a detailed example! > > But rather than this, We could introduce new return type that has a callback that > user could take IoUringCmd instead of returning -EIOCBQUEUED. I'm not following what you're suggesting, maybe a code sample would help? > > But I prefer that we provide just one type IoUringCmd without UringCmdInput. > Although UringCmdInput, user could call done() in uring_cmd callback and > it makes confusion that whether task_async() called and returning -EIOCBQUEUED > is mismatched that returns -EINVAL. We don't need to make it complex. Sure, if you only want to support asynchronous io_uring_cmd completions, than you can just pass IoUringCmd to MiscDevice::uring_cmd() and require it to call IoUringCmd::done() to complete the command. There's a small performance overhead to that over just returning the result from the uring_cmd() callback for synchronous completions (and it's more verbose), but I think that would be fine for an initial implementation. Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-15 16:54 ` Caleb Sander Mateos @ 2025-09-17 14:56 ` Sidong Yang 2025-09-22 18:09 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-17 14:56 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Mon, Sep 15, 2025 at 09:54:50AM -0700, Caleb Sander Mateos wrote: > On Sat, Sep 13, 2025 at 5:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote: > > > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote: > > > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > > > > > > from previous requests. > > > > > > > > > > > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > > > > > > --- > > > > > > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > > > > > > if (!ac) > > > > > > > > > > > > return -ENOMEM; > > > > > > > > > > > > ioucmd->sqe = sqe; > > > > > > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > > > > > > PDU has been initialized. > > > > > > > > > > > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > > > > > > error that it's not initialized in read_pdu(). > > > > > > > > > > > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > > > > > > performance penalty for the existing C uring_cmd() implementations. > > > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > > > > > > initialized". > > > > > > > > > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > > > > > > checking the flag. I would remove this commit for next version. I also > > > > > > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > > > > > > initialized? > > > > > > > > > > > > > > Not sure what you mean about "20 bytes for pdu". > > > > > > > It seems like it would be preferable to enforce that write_pdu() has > > > > > > > been called before read_pdu() using the Rust type system instead of a > > > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > > > > > > reason that wouldn't work and a runtime check would be necessary? > > > > > > > > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > > > > > > I think it's good way to pdu is safe without adding a new generic param for > > > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > > > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > > > > > > > > > Yes, that's what I was thinking. > > > > > > > > Good, I'll change api in this way. Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > > > > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > > > > > > > > > > > struct IoUringCmdArgs { > > > > > > > > ioucmd: Pin<&mut IoUringCmd>, > > > > > > > > issue_flags: u32, > > > > > > > > pdu_initialized: bool, > > > > > > > > } > > > > > > > > > > > > > > One other thing I realized is that issue_flags should come from the > > > > > > > *current* context rather than the context the uring_cmd() callback was > > > > > > > called in. For example, if io_uring_cmd_done() is called from task > > > > > > > work context, issue_flags should match the issue_flags passed to the > > > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > > > > > > the uring_cmd() callback. So it probably makes more sense to decouple > > > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > > > > > > reference (&IssueFlags) or with a phantom reference lifetime > > > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > > > > > > ensure it can't be used after those callbacks have returned. > > > > > > > > > > > > I have had no idea about task work context. I agree with you that > > > > > > it would be better to separate issue_flags from IoUringCmd. So, > > > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > > > > > > > > > "Task work" is a mechanism io_uring uses to queue work to run on the > > > > > thread that submitted an io_uring operation. It's basically a > > > > > per-thread atomic queue of callbacks that the thread will process > > > > > whenever it returns from the kernel to userspace (after a syscall or > > > > > an interrupt). This is the context where asynchronous uring_cmd > > > > > completions are generally processed (see > > > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I > > > > > can't speak to the history of why io_uring uses task work, but my > > > > > guess would be that it provides a safe context to acquire the > > > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be > > > > > called from an interrupt handler, so it's not allowed to take a > > > > > mutex). Processing all the task work at once also provides natural > > > > > opportunities for batching. > > > > > > > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives > > > > callback that has issue_flags different with io_uring_cmd(). I'll try to add > > > > a api that wrapping io_uring_cmd_complete_in_task() for next version. > > > > > > > > > > > > > > Yes, we probably don't need to bundle anything else with the > > > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut > > > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as > > > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs > > > > > to transfer ownership of the struct io_uring_cmd. > > > > > > > > I can't think that how to take ownership of struct io_uring_cmd. The > > > > struct allocated with io_alloc_req() and should be freed with io_free_req(). > > > > If taking ownership means having pointer of struct io_uring_cmd, I think > > > > it's no difference with current version. Also could it be called with > > > > mem::forget() if it has ownership? > > > > > > I don't mean ownership of the io_uring_cmd allocation; that's the > > > responsibility of the io_uring layer. But once the io_uring_cmd is > > > handed to the uring_cmd() implementation, it belongs to that layer > > > until it completes the command back to io_uring. Maybe a better way to > > > describe it would be as ownership of the "executing io_uring_cmd". The > > > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference > > > to the io_uring_cmd, so it can't outlive the uring_cmd() callback. > > > Yes, it's possible to leak the io_uring_cmd by never calling > > > io_uring_cmd_done() to return it to the io_uring layer. > > > > Thanks, I understood that IoUringCmd could be outlive uring_cmd callback. > > But it's sad that it could be leaked without any unsafe code. > > Safety in Rust doesn't require destructors to run, which means any > resource can be safely leaked > (https://faultlore.com/blah/everyone-poops/ has some historical > background on why Rust decided leaks had to be considered safe). > Leaking an io_uring_cmd is analogous to leaking a Box, both are > perfectly possible in safe Rust. Thanks for the reference. If driver just drops `IoUringCmd` without taking, the request wouldn't be completed until io-uring instance deinitialized. I understood that we cannot handle this. > > > > > > > > > I would imagine something like this: > > > > > > #[derive(Clone, Copy)] > > > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>); > > > > > > // Indicates ownership of the io_uring_cmd between uring_cmd() and > > > io_uring_cmd_done() > > > struct IoUringCmd(NonNull<bindings::io_uring_cmd>); > > > > > > impl IoUringCmd { > > > // ... > > > > > > fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) { > > > let cmd = self.0.as_ptr(); > > > let issue_flags = issue_flags.0; > > > unsafe { > > > bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags) > > > } > > > } > > > } > > > > > > // Can choose whether to complete the command synchronously or asynchronously. > > > // If take_async() is called, IoUringCmd::done() needs to be called to > > > complete the command. > > > // If take_async() isn't called, the command is completed synchronously > > > // with the return value from MiscDevice::uring_cmd(). > > > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>); > > > > Thanks for a detailed example! > > > > But rather than this, We could introduce new return type that has a callback that > > user could take IoUringCmd instead of returning -EIOCBQUEUED. > > I'm not following what you're suggesting, maybe a code sample would help? maybe like below... #[vtable] pub trait MiscDevice: Sized { /// ... /// If user returns `EIOCBQUEUED`, this function would be called for /// users who want to take `IoUringCmdAsync`. It could call `done()` for complete /// request. fn uring_cmd_async_prep( device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, io_uring_cmd_async: IoUringCmdAsync); /// ... } impl<T: MiscDevice> MiscdeviceVTable<T> { // ... unsafe extern "C" fn uring_cmd( ioucmd: *mut bindings::io_uring_cmd, issue_flags: ffi::c_uint, ) -> c_int { // ... let result = T::uring_cmd(device, ioucmd, issue_flags); if let Err(EIOCBQUEUED) = result { T::uring_cmd_async_prep(device, ioucmd.to_async()); } from_result(|| result) } } > > > > > But I prefer that we provide just one type IoUringCmd without UringCmdInput. > > Although UringCmdInput, user could call done() in uring_cmd callback and > > it makes confusion that whether task_async() called and returning -EIOCBQUEUED > > is mismatched that returns -EINVAL. We don't need to make it complex. > > Sure, if you only want to support asynchronous io_uring_cmd > completions, than you can just pass IoUringCmd to > MiscDevice::uring_cmd() and require it to call IoUringCmd::done() to > complete the command. There's a small performance overhead to that > over just returning the result from the uring_cmd() callback for > synchronous completions (and it's more verbose), but I think that > would be fine for an initial implementation. I appreciate for your understanding. I think it would be good to have just one simple struct `IoUringCmd`. I'll make next version patch soon. Thanks, Sidong > > Best, > Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB 2025-09-17 14:56 ` Sidong Yang @ 2025-09-22 18:09 ` Caleb Sander Mateos 0 siblings, 0 replies; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-22 18:09 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Wed, Sep 17, 2025 at 7:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Mon, Sep 15, 2025 at 09:54:50AM -0700, Caleb Sander Mateos wrote: > > On Sat, Sep 13, 2025 at 5:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > On Fri, Sep 12, 2025 at 10:56:31AM -0700, Caleb Sander Mateos wrote: > > > > On Fri, Sep 12, 2025 at 9:42 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > On Tue, Sep 09, 2025 at 09:32:37AM -0700, Caleb Sander Mateos wrote: > > > > > > On Tue, Sep 9, 2025 at 7:43 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > On Mon, Sep 08, 2025 at 12:45:58PM -0700, Caleb Sander Mateos wrote: > > > > > > > > On Sat, Sep 6, 2025 at 7:28 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > On Tue, Sep 02, 2025 at 08:31:00AM -0700, Caleb Sander Mateos wrote: > > > > > > > > > > On Tue, Sep 2, 2025 at 3:23 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 01, 2025 at 05:34:28PM -0700, Caleb Sander Mateos wrote: > > > > > > > > > > > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > The pdu field in io_uring_cmd may contain stale data when a request > > > > > > > > > > > > > object is recycled from the slab cache. Accessing uninitialized or > > > > > > > > > > > > > garbage memory can lead to undefined behavior in users of the pdu. > > > > > > > > > > > > > > > > > > > > > > > > > > Ensure the pdu buffer is cleared during io_uring_cmd_prep() so that > > > > > > > > > > > > > each command starts from a well-defined state. This avoids exposing > > > > > > > > > > > > > uninitialized memory and prevents potential misinterpretation of data > > > > > > > > > > > > > from previous requests. > > > > > > > > > > > > > > > > > > > > > > > > > > No functional change is intended other than guaranteeing that pdu is > > > > > > > > > > > > > always zero-initialized before use. > > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > > > > > > > > > > > --- > > > > > > > > > > > > > io_uring/uring_cmd.c | 1 + > > > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > > > > > > > > > > > > > index 053bac89b6c0..2492525d4e43 100644 > > > > > > > > > > > > > --- a/io_uring/uring_cmd.c > > > > > > > > > > > > > +++ b/io_uring/uring_cmd.c > > > > > > > > > > > > > @@ -203,6 +203,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > > > > > > > > > > > > > if (!ac) > > > > > > > > > > > > > return -ENOMEM; > > > > > > > > > > > > > ioucmd->sqe = sqe; > > > > > > > > > > > > > + memset(&ioucmd->pdu, 0, sizeof(ioucmd->pdu)); > > > > > > > > > > > > > > > > > > > > > > > > Adding this overhead to every existing uring_cmd() implementation is > > > > > > > > > > > > unfortunate. Could we instead track the initialized/uninitialized > > > > > > > > > > > > state by using different types on the Rust side? The io_uring_cmd > > > > > > > > > > > > could start as an IoUringCmd, where the PDU field is MaybeUninit, > > > > > > > > > > > > write_pdu<T>() could return a new IoUringCmdPdu<T> that guarantees the > > > > > > > > > > > > PDU has been initialized. > > > > > > > > > > > > > > > > > > > > > > I've found a flag IORING_URING_CMD_REISSUE that we could initialize > > > > > > > > > > > the pdu. In uring_cmd callback, we can fill zero when it's not reissued. > > > > > > > > > > > But I don't know that we could call T::default() in miscdevice. If we > > > > > > > > > > > make IoUringCmdPdu<T>, MiscDevice also should be MiscDevice<T>. > > > > > > > > > > > > > > > > > > > > > > How about assign a byte in pdu for checking initialized? In uring_cmd(), > > > > > > > > > > > We could set a byte flag that it's not initialized. And we could return > > > > > > > > > > > error that it's not initialized in read_pdu(). > > > > > > > > > > > > > > > > > > > > Could we do the zero-initialization (or T::default()) in > > > > > > > > > > MiscdeviceVTable::uring_cmd() if the IORING_URING_CMD_REISSUE flag > > > > > > > > > > isn't set (i.e. on the initial issue)? That way, we avoid any > > > > > > > > > > performance penalty for the existing C uring_cmd() implementations. > > > > > > > > > > I'm not quite sure what you mean by "assign a byte in pdu for checking > > > > > > > > > > initialized". > > > > > > > > > > > > > > > > > > Sure, we could fill zero when it's the first time uring_cmd called with > > > > > > > > > checking the flag. I would remove this commit for next version. I also > > > > > > > > > suggests that we would provide the method that read_pdu() and write_pdu(). > > > > > > > > > In read_pdu() I want to check write_pdu() is called before. So along the > > > > > > > > > 20 bytes for pdu, maybe we could use a bytes for the flag that pdu is > > > > > > > > > initialized? > > > > > > > > > > > > > > > > Not sure what you mean about "20 bytes for pdu". > > > > > > > > It seems like it would be preferable to enforce that write_pdu() has > > > > > > > > been called before read_pdu() using the Rust type system instead of a > > > > > > > > runtime check. I was thinking a signature like fn write_pdu(cmd: > > > > > > > > IoUringCmd, value: T) -> IoUringCmdPdu<T>. Do you feel there's a > > > > > > > > reason that wouldn't work and a runtime check would be necessary? > > > > > > > > > > > > > > I didn't think about make write_pdu() to return IoUringCmdPdu<T> before. > > > > > > > I think it's good way to pdu is safe without adding a new generic param for > > > > > > > MiscDevice. write_pdu() would return IoUringCmdPdu<T> and it could call > > > > > > > IoUringCmdPdu<T>::pdu(&mut self) -> &mut T safely maybe. > > > > > > > > > > > > Yes, that's what I was thinking. > > > > > > > > > > Good, I'll change api in this way. Thanks! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But maybe I would introduce a new struct that has Pin<&mut IoUringCmd> and > > > > > > > > > issue_flags. How about some additional field for pdu is initialized like below? > > > > > > > > > > > > > > > > > > struct IoUringCmdArgs { > > > > > > > > > ioucmd: Pin<&mut IoUringCmd>, > > > > > > > > > issue_flags: u32, > > > > > > > > > pdu_initialized: bool, > > > > > > > > > } > > > > > > > > > > > > > > > > One other thing I realized is that issue_flags should come from the > > > > > > > > *current* context rather than the context the uring_cmd() callback was > > > > > > > > called in. For example, if io_uring_cmd_done() is called from task > > > > > > > > work context, issue_flags should match the issue_flags passed to the > > > > > > > > io_uring_cmd_tw_t callback, not the issue_flags originally passed to > > > > > > > > the uring_cmd() callback. So it probably makes more sense to decouple > > > > > > > > issue_flags from the (owned) IoUringCmd. I think you could pass it by > > > > > > > > reference (&IssueFlags) or with a phantom reference lifetime > > > > > > > > (IssueFlags<'_>) to the Rust uring_cmd() and task work callbacks to > > > > > > > > ensure it can't be used after those callbacks have returned. > > > > > > > > > > > > > > I have had no idea about task work context. I agree with you that > > > > > > > it would be better to separate issue_flags from IoUringCmd. So, > > > > > > > IoUringCmdArgs would have a only field Pin<&mut IoUringCmd>? > > > > > > > > > > > > "Task work" is a mechanism io_uring uses to queue work to run on the > > > > > > thread that submitted an io_uring operation. It's basically a > > > > > > per-thread atomic queue of callbacks that the thread will process > > > > > > whenever it returns from the kernel to userspace (after a syscall or > > > > > > an interrupt). This is the context where asynchronous uring_cmd > > > > > > completions are generally processed (see > > > > > > io_uring_cmd_complete_in_task() and io_uring_cmd_do_in_task_lazy()). I > > > > > > can't speak to the history of why io_uring uses task work, but my > > > > > > guess would be that it provides a safe context to acquire the > > > > > > io_ring_ctx uring_lock mutex (e.g. nvme_uring_cmd_end_io() can be > > > > > > called from an interrupt handler, so it's not allowed to take a > > > > > > mutex). Processing all the task work at once also provides natural > > > > > > opportunities for batching. > > > > > > > > > > Thanks, I've checked io_uring_cmd_complete_in_task() that it receives > > > > > callback that has issue_flags different with io_uring_cmd(). I'll try to add > > > > > a api that wrapping io_uring_cmd_complete_in_task() for next version. > > > > > > > > > > > > > > > > > Yes, we probably don't need to bundle anything else with the > > > > > > IoUringCmd after all. As I mentioned earlier, I don't think Pin<&mut > > > > > > IoUringCmd> will work for uring_cmds that complete asynchronously, as > > > > > > they will need to outlive the uring_cmd() call. So uring_cmd() needs > > > > > > to transfer ownership of the struct io_uring_cmd. > > > > > > > > > > I can't think that how to take ownership of struct io_uring_cmd. The > > > > > struct allocated with io_alloc_req() and should be freed with io_free_req(). > > > > > If taking ownership means having pointer of struct io_uring_cmd, I think > > > > > it's no difference with current version. Also could it be called with > > > > > mem::forget() if it has ownership? > > > > > > > > I don't mean ownership of the io_uring_cmd allocation; that's the > > > > responsibility of the io_uring layer. But once the io_uring_cmd is > > > > handed to the uring_cmd() implementation, it belongs to that layer > > > > until it completes the command back to io_uring. Maybe a better way to > > > > describe it would be as ownership of the "executing io_uring_cmd". The > > > > problem with Pin<&mut IoUringCmd> is that it is a borrowed reference > > > > to the io_uring_cmd, so it can't outlive the uring_cmd() callback. > > > > Yes, it's possible to leak the io_uring_cmd by never calling > > > > io_uring_cmd_done() to return it to the io_uring layer. > > > > > > Thanks, I understood that IoUringCmd could be outlive uring_cmd callback. > > > But it's sad that it could be leaked without any unsafe code. > > > > Safety in Rust doesn't require destructors to run, which means any > > resource can be safely leaked > > (https://faultlore.com/blah/everyone-poops/ has some historical > > background on why Rust decided leaks had to be considered safe). > > Leaking an io_uring_cmd is analogous to leaking a Box, both are > > perfectly possible in safe Rust. > > Thanks for the reference. If driver just drops `IoUringCmd` without taking, > the request wouldn't be completed until io-uring instance deinitialized. > I understood that we cannot handle this. > > > > > > > > > > > > > > I would imagine something like this: > > > > > > > > #[derive(Clone, Copy)] > > > > struct IssueFlags<'a>(c_uint, PhantomData<&'a ()>); > > > > > > > > // Indicates ownership of the io_uring_cmd between uring_cmd() and > > > > io_uring_cmd_done() > > > > struct IoUringCmd(NonNull<bindings::io_uring_cmd>); > > > > > > > > impl IoUringCmd { > > > > // ... > > > > > > > > fn done(self, ret: i32, res2: u64, issue_flags: IssueFlags<'_>) { > > > > let cmd = self.0.as_ptr(); > > > > let issue_flags = issue_flags.0; > > > > unsafe { > > > > bindings::io_uring_cmd_done(cmd, ret, res2, issue_flags) > > > > } > > > > } > > > > } > > > > > > > > // Can choose whether to complete the command synchronously or asynchronously. > > > > // If take_async() is called, IoUringCmd::done() needs to be called to > > > > complete the command. > > > > // If take_async() isn't called, the command is completed synchronously > > > > // with the return value from MiscDevice::uring_cmd(). > > > > struct UringCmdInput<'a>(&mut Option<NonNull<bindings::io_uring_cmd>>); > > > > > > Thanks for a detailed example! > > > > > > But rather than this, We could introduce new return type that has a callback that > > > user could take IoUringCmd instead of returning -EIOCBQUEUED. > > > > I'm not following what you're suggesting, maybe a code sample would help? > > maybe like below... > > #[vtable] > pub trait MiscDevice: Sized { > /// ... > > /// If user returns `EIOCBQUEUED`, this function would be called for > /// users who want to take `IoUringCmdAsync`. It could call `done()` for complete > /// request. > fn uring_cmd_async_prep( > device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, > io_uring_cmd_async: IoUringCmdAsync); > /// ... > } > > impl<T: MiscDevice> MiscdeviceVTable<T> { > // ... > unsafe extern "C" fn uring_cmd( > ioucmd: *mut bindings::io_uring_cmd, > issue_flags: ffi::c_uint, > ) -> c_int { > // ... > let result = T::uring_cmd(device, ioucmd, issue_flags); > > if let Err(EIOCBQUEUED) = result { > T::uring_cmd_async_prep(device, ioucmd.to_async()); > } I see. Yes, this could work, but I worry separating the implementation into 2 calls makes it difficult to pass state between them. If T::uring_cmd() can determine whether the command will complete synchronously or asynchronously just by inspecting the io_uring_cmd/device, it may be a convenient interface. But if T::uring_cmd() performs a bunch of work before deciding the command should go async, it may be a pain to pass those computed values to T::uring_cmd_async_prep(). > > from_result(|| result) > } > } > > > > > > > > But I prefer that we provide just one type IoUringCmd without UringCmdInput. > > > Although UringCmdInput, user could call done() in uring_cmd callback and > > > it makes confusion that whether task_async() called and returning -EIOCBQUEUED > > > is mismatched that returns -EINVAL. We don't need to make it complex. > > > > Sure, if you only want to support asynchronous io_uring_cmd > > completions, than you can just pass IoUringCmd to > > MiscDevice::uring_cmd() and require it to call IoUringCmd::done() to > > complete the command. There's a small performance overhead to that > > over just returning the result from the uring_cmd() callback for > > synchronous completions (and it's more verbose), but I think that > > would be fine for an initial implementation. > > I appreciate for your understanding. I think it would be good to have just one > simple struct `IoUringCmd`. I'll make next version patch soon. Yeah, I think it will probably be easiest for an initial implementation to always transfer ownership of the io_uring_cmd to MiscDevice::uring_cmd() and requiring it to call IoUringCmd::done(). The synchronous case could be optimized in the future, but that's not the primary io_uring_cmd use case. Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-22 12:55 [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 1/5] rust: bindings: add io_uring headers in bindings_helper.h Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB Sidong Yang @ 2025-08-22 12:55 ` Sidong Yang 2025-08-27 20:41 ` Daniel Almeida ` (2 more replies) 2025-08-22 12:55 ` [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 5/5] samples: rust: Add `uring_cmd` example to `rust_misc_device` Sidong Yang 4 siblings, 3 replies; 33+ messages in thread From: Sidong Yang @ 2025-08-22 12:55 UTC (permalink / raw) To: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin Cc: Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring, Sidong Yang Implment the io-uring abstractions needed for miscdevicecs and other char devices that have io-uring command interface. * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which will be used as arg for `MiscDevice::uring_cmd()`. And driver can get `cmd_op` sent from userspace. Also it has `flags` which includes option that is reissued. * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` from userspace. Also `IoUringSqe` has more data like opcode could be used in driver. Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> --- rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/lib.rs | 1 + 2 files changed, 307 insertions(+) create mode 100644 rust/kernel/io_uring.rs diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs new file mode 100644 index 000000000000..61e88bdf4e42 --- /dev/null +++ b/rust/kernel/io_uring.rs @@ -0,0 +1,306 @@ +// SPDX-License-Identifier: GPL-2.0 +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI + +//! Abstractions for io-uring. +//! +//! This module provides types for implements io-uring interface for char device. +//! +//! +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) + +use core::{mem::MaybeUninit, pin::Pin}; + +use crate::error::from_result; +use crate::transmute::{AsBytes, FromBytes}; +use crate::{fs::File, types::Opaque}; + +use crate::prelude::*; + +/// io-uring opcode +pub mod opcode { + /// opcode for uring cmd + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; +} + +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. +/// +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` +/// binding from the Linux kernel. It represents a command structure used +/// in io_uring operations within the kernel. +/// This type is used internally by the io_uring subsystem to manage +/// asynchronous I/O commands. +/// +/// This type should not be constructed or manipulated directly by +/// kernel module developers. +/// +/// # INVARIANT +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. +#[repr(transparent)] +pub struct IoUringCmd { + /// An opaque wrapper containing the actual `io_uring_cmd` data. + inner: Opaque<bindings::io_uring_cmd>, +} + +impl IoUringCmd { + /// Returns the cmd_op with associated with the `io_uring_cmd`. + #[inline] + pub fn cmd_op(&self) -> u32 { + // SAFETY: `self.inner` is guaranteed by the type invariant to point + // to a live `io_uring_cmd`, so dereferencing is safe. + unsafe { (*self.inner.get()).cmd_op } + } + + /// Returns the flags with associated with the `io_uring_cmd`. + #[inline] + pub fn flags(&self) -> u32 { + // SAFETY: `self.inner` is guaranteed by the type invariant to point + // to a live `io_uring_cmd`, so dereferencing is safe. + unsafe { (*self.inner.get()).flags } + } + + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd + /// + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. + #[inline] + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { + // SAFETY: `self.inner` is guaranteed by the type invariant to point + // to a live `io_uring_cmd`, so dereferencing is safe. + let inner = unsafe { &mut *self.inner.get() }; + + let len = size_of::<T>(); + if len > inner.pdu.len() { + return Err(EFAULT); + } + + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); + let ptr = &raw mut inner.pdu as *const c_void; + + // SAFETY: + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked + // size of `T` is smaller than pdu size. + unsafe { + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); + } + + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements + // `FromBytes`, any bit-pattern is a valid value for this type. + Ok(unsafe { out.assume_init() }) + } + + /// Writes the provided `value` to `pdu` in uring_cmd `self` + /// + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. + #[inline] + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { + // SAFETY: `self.inner` is guaranteed by the type invariant to point + // to a live `io_uring_cmd`, so dereferencing is safe. + let inner = unsafe { &mut *self.inner.get() }; + + let len = size_of::<T>(); + if len > inner.pdu.len() { + return Err(EFAULT); + } + + let src = (value as *const T).cast::<c_void>(); + let dst = &raw mut inner.pdu as *mut c_void; + + // SAFETY: + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. + // * It's safe to copy because size of `T` is no more than len of pdu. + unsafe { + core::ptr::copy_nonoverlapping(src, dst, len); + } + + Ok(()) + } + + /// Constructs a new [`IoUringCmd`] from a raw `io_uring_cmd` + /// + /// # Safety + /// + /// The caller must guarantee that: + /// - `ptr` is non-null, properly aligned, and points to a valid + /// `bindings::io_uring_cmd`. + /// - The pointed-to memory remains initialized and valid for the entire + /// lifetime `'a` of the returned reference. + /// - While the returned `Pin<&'a mut IoUringCmd>` is alive, the underlying + /// object is **not moved** (pinning requirement). + /// - **Aliasing rules:** the returned `&mut` has **exclusive** access to the same + /// object for its entire lifetime: + /// - No other `&mut` **or** `&` references to the same `io_uring_cmd` may be + /// alive at the same time. + /// - There must be no concurrent reads/writes through raw pointers, FFI, or + /// other kernel paths to the same object during this lifetime. + /// - If the object can be touched from other contexts (e.g. IRQ/another CPU), + /// the caller must provide synchronization to uphold this exclusivity. + /// - This function relies on `IoUringCmd` being `repr(transparent)` over + /// `bindings::io_uring_cmd` so the cast preserves layout. + #[inline] + pub unsafe fn from_raw<'a>(ptr: *mut bindings::io_uring_cmd) -> Pin<&'a mut IoUringCmd> { + // SAFETY: + // * The caller guarantees that the pointer is not dangling and stays + // valid for the duration of 'a. + // * The cast is okay because `IoUringCmd` is `repr(transparent)` and + // has the same memory layout as `bindings::io_uring_cmd`. + // * The returned `Pin` ensures that the object cannot be moved, which + // is required because the kernel may hold pointers to this memory + // location and moving it would invalidate those pointers. + unsafe { Pin::new_unchecked(&mut *ptr.cast()) } + } + + /// Returns the file that referenced by uring cmd self. + #[inline] + pub fn file(&self) -> &File { + // SAFETY: `self.inner` is guaranteed by the type invariant to point + // to a live `io_uring_cmd`, so dereferencing is safe. + let file = unsafe { (*self.inner.get()).file }; + + // SAFETY: + // * The `file` points valid file. + // * refcount is positive after submission queue entry issued. + // * There is no active fdget_pos region on the file on this thread. + unsafe { File::from_raw_file(file) } + } + + /// Returns an reference to the [`IoUringSqe`] associated with this command. + #[inline] + pub fn sqe(&self) -> &IoUringSqe { + // SAFETY: `self.inner` is guaranteed by the type invariant to point + // to a live `io_uring_cmd`, so dereferencing is safe. + let sqe = unsafe { (*self.inner.get()).sqe }; + // SAFETY: The call guarantees that the `sqe` points valid io_uring_sqe. + unsafe { IoUringSqe::from_raw(sqe) } + } + + /// Completes an this [`IoUringCmd`] request that was previously queued. + /// + /// # Safety + /// + /// - This function must be called **only** for a command whose `uring_cmd` + /// handler previously returned **`-EIOCBQUEUED`** to io_uring. + /// + /// # Parameters + /// + /// - `ret`: Result to return to userspace. + /// - `res2`: Extra for big completion queue entry `IORING_SETUP_CQE32`. + /// - `issue_flags`: Flags associated with this request, typically the same + /// as those passed to the `uring_cmd` handler. + #[inline] + pub fn done(self: Pin<&mut IoUringCmd>, ret: Result<i32>, res2: u64, issue_flags: u32) { + let ret = from_result(|| ret) as isize; + // SAFETY: The call guarantees that `self.inner` is not dangling and stays valid + unsafe { + bindings::io_uring_cmd_done(self.inner.get(), ret, res2, issue_flags); + } + } +} + +/// A Rust abstraction for the Linux kernel's `io_uring_sqe` structure. +/// +/// This structure is a safe, opaque wrapper around the raw C [`io_uring_sqe`](srctree/include/uapi/linux/io_uring.h) +/// binding from the Linux kernel. It represents a Submission Queue Entry +/// used in io_uring operations within the kernel. +/// +/// # Type Safety +/// +/// The `#[repr(transparent)]` attribute ensures that this wrapper has +/// the same memory layout as the underlying `io_uring_sqe` structure, +/// allowing it to be safely transmuted between the two representations. +/// +/// # Fields +/// +/// * `inner` - An opaque wrapper containing the actual `io_uring_sqe` data. +/// The `Opaque` type prevents direct access to the internal +/// structure fields, ensuring memory safety and encapsulation. +/// +/// # Usage +/// +/// This type represents a submission queue entry that describes an I/O +/// operation to be executed by the io_uring subsystem. It contains +/// information such as the operation type, file descriptor, buffer +/// pointers, and other operation-specific data. +/// +/// Users can obtain this type from [`IoUringCmd::sqe()`] method, which +/// extracts the submission queue entry associated with a command. +/// +/// This type should not be constructed or manipulated directly by +/// kernel module developers. +/// +/// # INVARIANT +/// - `self.inner` always points to a valid, live `bindings::io_uring_sqe`. +#[repr(transparent)] +pub struct IoUringSqe { + inner: Opaque<bindings::io_uring_sqe>, +} + +impl IoUringSqe { + /// Reads and interprets the `cmd` field of an `bindings::io_uring_sqe` as a value of type `T`. + /// + /// # Safety & Invariants + /// - Construction of `T` is delegated to `FromBytes`, which guarantees that `T` has no + /// invalid bit patterns and can be safely reconstructed from raw bytes. + /// - **Limitation:** This implementation does not support `IORING_SETUP_SQE128` (larger SQE entries). + /// Only the standard `io_uring_sqe` layout is handled here. + /// + /// # Errors + /// * Returns `EINVAL` if the `self` does not hold a `opcode::URING_CMD`. + /// * Returns `EFAULT` if the command buffer is smaller than the requested type `T`. + /// + /// # Returns + /// * On success, returns a `T` deserialized from the `cmd`. + /// * On failure, returns an appropriate error as described above. + pub fn cmd_data<T: FromBytes>(&self) -> Result<T> { + // SAFETY: `self.inner` guaranteed by the type invariant to point + // to a live `io_uring_sqe`, so dereferencing is safe. + let sqe = unsafe { &*self.inner.get() }; + + if u32::from(sqe.opcode) != opcode::URING_CMD { + return Err(EINVAL); + } + + // SAFETY: Accessing the `sqe.cmd` union field is safe because we've + // verified that `sqe.opcode == IORING_OP_URING_CMD`, which guarantees + // that this union variant is initialized and valid. + let cmd = unsafe { sqe.__bindgen_anon_6.cmd.as_ref() }; + let cmd_len = size_of_val(&sqe.__bindgen_anon_6.bindgen_union_field); + + if cmd_len < size_of::<T>() { + return Err(EFAULT); + } + + let cmd_ptr = cmd.as_ptr() as *mut T; + + // SAFETY: `cmd_ptr` is valid from `self.inner` which is guaranteed by + // type variant. And also it points to initialized `T` from userspace. + let ret = unsafe { core::ptr::read_unaligned(cmd_ptr) }; + + Ok(ret) + } + + /// Constructs a new `IoUringSqe` from a raw `io_uring_sqe`. + /// + /// # Safety + /// + /// The caller must guarantee that: + /// - `ptr` is non-null, properly aligned, and points to a valid initialized + /// `bindings::io_uring_sqe`. + /// - The pointed-to memory remains valid (not freed or repurposed) for the + /// entire lifetime `'a` of the returned reference. + /// - **Aliasing rules (for `&T`):** while the returned `&'a IoUringSqe` is + /// alive, there must be **no mutable access** to the same object through any + /// path (no `&mut`, no raw-pointer writes, no FFI/IRQ/other-CPU writers). + /// Multiple `&` is fine **only if all of them are read-only** for the entire + /// overlapping lifetime. + /// - This relies on `IoUringSqe` being `repr(transparent)` over + /// `bindings::io_uring_sqe`, so the cast preserves layout. + #[inline] + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the + // duration of 'a. The cast is okay because `IoUringSqe` is `repr(transparent)` and has the + // same memory layout as `bindings::io_uring_sqe`. + unsafe { &*ptr.cast() } + } +} diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index ed53169e795c..d38cf7137401 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -91,6 +91,7 @@ pub mod fs; pub mod init; pub mod io; +pub mod io_uring; pub mod ioctl; pub mod jump_label; #[cfg(CONFIG_KUNIT)] -- 2.43.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-22 12:55 ` [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd Sidong Yang @ 2025-08-27 20:41 ` Daniel Almeida 2025-08-28 7:24 ` Benno Lossin 2025-08-29 15:43 ` Sidong Yang 2025-08-28 0:36 ` Ming Lei 2025-09-02 1:11 ` Caleb Sander Mateos 2 siblings, 2 replies; 33+ messages in thread From: Daniel Almeida @ 2025-08-27 20:41 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Caleb Sander Mateos, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring Hi Sidong, > On 22 Aug 2025, at 09:55, Sidong Yang <sidong.yang@furiosa.ai> wrote: > > Implment the io-uring abstractions needed for miscdevicecs and other > char devices that have io-uring command interface. Can you expand on this last part? > > * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which > will be used as arg for `MiscDevice::uring_cmd()`. And driver can get > `cmd_op` sent from userspace. Also it has `flags` which includes option > that is reissued. > This is a bit hard to parse. > * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which > could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` > from userspace. Also `IoUringSqe` has more data like opcode could be used in > driver. Same here. > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > --- > rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ > rust/kernel/lib.rs | 1 + > 2 files changed, 307 insertions(+) > create mode 100644 rust/kernel/io_uring.rs > > diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs > new file mode 100644 > index 000000000000..61e88bdf4e42 > --- /dev/null > +++ b/rust/kernel/io_uring.rs > @@ -0,0 +1,306 @@ > +// SPDX-License-Identifier: GPL-2.0 > +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI > + > +//! Abstractions for io-uring. > +//! > +//! This module provides types for implements io-uring interface for char device. This is also hard to parse. > +//! > +//! > +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and > +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) > + > +use core::{mem::MaybeUninit, pin::Pin}; > + > +use crate::error::from_result; > +use crate::transmute::{AsBytes, FromBytes}; > +use crate::{fs::File, types::Opaque}; > + > +use crate::prelude::*; > + > +/// io-uring opcode /// `IoUring` opcodes. Notice: a) The capitalization, b) The use of backticks, c) The period in the end. This is an ongoing effort to keep the docs tidy :) > +pub mod opcode { > + /// opcode for uring cmd > + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; > +} Should this be its own type? This way we can use the type system to enforce that only valid opcodes are used where an opcode is expected. > + > +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. /// A Rust abstraction for `io_uring_cmd`. > +/// > +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` > +/// binding from the Linux kernel. It represents a command structure used > +/// in io_uring operations within the kernel. This code will also be part of the kernel, so mentioning “the Linux kernel” is superfluous. > +/// This type is used internally by the io_uring subsystem to manage > +/// asynchronous I/O commands. > +/// > +/// This type should not be constructed or manipulated directly by > +/// kernel module developers. “…by drivers”. > +/// > +/// # INVARIANT /// # Invariants > +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. Blank here > +#[repr(transparent)] > +pub struct IoUringCmd { > + /// An opaque wrapper containing the actual `io_uring_cmd` data. > + inner: Opaque<bindings::io_uring_cmd>, > +} > + > +impl IoUringCmd { > + /// Returns the cmd_op with associated with the `io_uring_cmd`. This sentence does not parse very well. > + #[inline] > + pub fn cmd_op(&self) -> u32 { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + unsafe { (*self.inner.get()).cmd_op } Perhaps add an as_raw() method so this becomes: unsafe {*self.as_raw()}.cmd_op > + } > + > + /// Returns the flags with associated with the `io_uring_cmd`. With the command, or something like that. The user doesn’t see the raw bindings::io_uring_cmd so we shouldn’t be mentioning it if we can help it. > + #[inline] > + pub fn flags(&self) -> u32 { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + unsafe { (*self.inner.get()).flags } > + } > + > + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd This sentence does not parse very well. > + /// > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. /// # Errors > + #[inline] > + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { This takes &self, > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let inner = unsafe { &mut *self.inner.get() }; But this creates a &mut to self.inner using unsafe code. Avoid doing this in general. All of a sudden your type is not thread-safe anymore. If you need to mutate &self here, then take &mut self as an argument. > + > + let len = size_of::<T>(); > + if len > inner.pdu.len() { > + return Err(EFAULT); EFAULT? How about EINVAL? > + } > + > + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); > + let ptr = &raw mut inner.pdu as *const c_void; > + > + // SAFETY: > + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > + // size of `T` is smaller than pdu size. > + unsafe { > + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); I don’t think you need to manually specify c_void here. Benno, can’t we use core::mem::zeroed() or something like that to avoid this unsafe? The input was zeroed in prep() and the output can just be a zeroed T on the stack, unless I missed something? > + } > + > + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > + // `FromBytes`, any bit-pattern is a valid value for this type. > + Ok(unsafe { out.assume_init() }) > + } > + > + /// Writes the provided `value` to `pdu` in uring_cmd `self` Writes the provided `value` to `pdu`. > + /// /// # Errors /// > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > + #[inline] > + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let inner = unsafe { &mut *self.inner.get() }; > + > + let len = size_of::<T>(); > + if len > inner.pdu.len() { > + return Err(EFAULT); > + } > + > + let src = (value as *const T).cast::<c_void>(); as_ptr().cast() > + let dst = &raw mut inner.pdu as *mut c_void; (&raw mut inner.pdu).cast() > + > + // SAFETY: > + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > + // * It's safe to copy because size of `T` is no more than len of pdu. > + unsafe { > + core::ptr::copy_nonoverlapping(src, dst, len); > + } > + > + Ok(()) > + } > + > + /// Constructs a new [`IoUringCmd`] from a raw `io_uring_cmd` Missing period. > + /// > + /// # Safety > + /// > + /// The caller must guarantee that: > + /// - `ptr` is non-null, properly aligned, and points to a valid > + /// `bindings::io_uring_cmd`. Blanks for every bullet point, please. > + /// - The pointed-to memory remains initialized and valid for the entire > + /// lifetime `'a` of the returned reference. > + /// - While the returned `Pin<&'a mut IoUringCmd>` is alive, the underlying > + /// object is **not moved** (pinning requirement). They can’t move an !Unpin type in safe code. > + /// - **Aliasing rules:** the returned `&mut` has **exclusive** access to the same > + /// object for its entire lifetime: You really don’t need to emphasize these. > + /// - No other `&mut` **or** `&` references to the same `io_uring_cmd` may be > + /// alive at the same time. This and the point above are identical. > + /// - There must be no concurrent reads/writes through raw pointers, FFI, or > + /// other kernel paths to the same object during this lifetime. This and the first point say the same thing. > + /// - If the object can be touched from other contexts (e.g. IRQ/another CPU), > + /// the caller must provide synchronization to uphold this exclusivity. I am not sure what you mean. > + /// - This function relies on `IoUringCmd` being `repr(transparent)` over > + /// `bindings::io_uring_cmd` so the cast preserves layout. This is not a safety requirement. Just adapt the requirements from other instances of from_raw(), they all convert a *mut T to a &T so the safety requirements are similar. > + #[inline] > + pub unsafe fn from_raw<'a>(ptr: *mut bindings::io_uring_cmd) -> Pin<&'a mut IoUringCmd> { Why is this pub? Sounds like a massive footgun? This should be private or at best pub(crate). > + // SAFETY: > + // * The caller guarantees that the pointer is not dangling and stays > + // valid for the duration of 'a. > + // * The cast is okay because `IoUringCmd` is `repr(transparent)` and > + // has the same memory layout as `bindings::io_uring_cmd`. > + // * The returned `Pin` ensures that the object cannot be moved, which > + // is required because the kernel may hold pointers to this memory > + // location and moving it would invalidate those pointers. > + unsafe { Pin::new_unchecked(&mut *ptr.cast()) } > + } > + > + /// Returns the file that referenced by uring cmd self. > + #[inline] > + pub fn file(&self) -> &File { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let file = unsafe { (*self.inner.get()).file }; > + > + // SAFETY: > + // * The `file` points valid file. Why? > + // * refcount is positive after submission queue entry issued. > + // * There is no active fdget_pos region on the file on this thread. > + unsafe { File::from_raw_file(file) } > + } > + > + /// Returns an reference to the [`IoUringSqe`] associated with this command. s/an/a > + #[inline] > + pub fn sqe(&self) -> &IoUringSqe { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let sqe = unsafe { (*self.inner.get()).sqe }; > + // SAFETY: The call guarantees that the `sqe` points valid io_uring_sqe. What do you mean by “the call guarantees” ? > + unsafe { IoUringSqe::from_raw(sqe) } > + } > + > + /// Completes an this [`IoUringCmd`] request that was previously queued. This sentence does not parse very well. > + /// > + /// # Safety > + /// > + /// - This function must be called **only** for a command whose `uring_cmd` Please no emphasis. > + /// handler previously returned **`-EIOCBQUEUED`** to io_uring. To what? Are you referring to a Rust type, or to the C part of the kernel? > + /// > + /// # Parameters > + /// > + /// - `ret`: Result to return to userspace. > + /// - `res2`: Extra for big completion queue entry `IORING_SETUP_CQE32`. This sentence does not parse very well. Also, can you rename this? > + /// - `issue_flags`: Flags associated with this request, typically the same > + /// as those passed to the `uring_cmd` handler. > + #[inline] > + pub fn done(self: Pin<&mut IoUringCmd>, ret: Result<i32>, res2: u64, issue_flags: u32) { > + let ret = from_result(|| ret) as isize; What does this do? > + // SAFETY: The call guarantees that `self.inner` is not dangling and stays valid What do you mean “the call” ? > + unsafe { > + bindings::io_uring_cmd_done(self.inner.get(), ret, res2, issue_flags); > + } > + } > +} > + > +/// A Rust abstraction for the Linux kernel's `io_uring_sqe` structure. Please don’t mention the words “Linux kernel” here either. > +/// > +/// This structure is a safe, opaque wrapper around the raw C [`io_uring_sqe`](srctree/include/uapi/linux/io_uring.h) This line needs to be wrapped. > +/// binding from the Linux kernel. It represents a Submission Queue Entry Can you link somewhere here? Perhaps there’s docs for “Submission Queue Entry”. > +/// used in io_uring operations within the kernel. > +/// > +/// # Type Safety > +/// > +/// The `#[repr(transparent)]` attribute ensures that this wrapper has > +/// the same memory layout as the underlying `io_uring_sqe` structure, > +/// allowing it to be safely transmuted between the two representations. This is an invariant, please move it there. > +/// > +/// # Fields > +/// > +/// * `inner` - An opaque wrapper containing the actual `io_uring_sqe` data. > +/// The `Opaque` type prevents direct access to the internal > +/// structure fields, ensuring memory safety and encapsulation. Inline docs please. > +/// > +/// # Usage I don’t think we specifically need to mention “# Usage”. > +/// > +/// This type represents a submission queue entry that describes an I/O You can start with “Represents a…”. No need to say “this type” here. > +/// operation to be executed by the io_uring subsystem. It contains > +/// information such as the operation type, file descriptor, buffer > +/// pointers, and other operation-specific data. > +/// > +/// Users can obtain this type from [`IoUringCmd::sqe()`] method, which > +/// extracts the submission queue entry associated with a command. > +/// > +/// This type should not be constructed or manipulated directly by > +/// kernel module developers. By drivers. > +/// > +/// # INVARIANT /// # Invariants > +/// - `self.inner` always points to a valid, live `bindings::io_uring_sqe`. > +#[repr(transparent)] > +pub struct IoUringSqe { > + inner: Opaque<bindings::io_uring_sqe>, > +} > + > +impl IoUringSqe { > + /// Reads and interprets the `cmd` field of an `bindings::io_uring_sqe` as a value of type `T`. > + /// > + /// # Safety & Invariants Safety section for a safe function. > + /// - Construction of `T` is delegated to `FromBytes`, which guarantees that `T` has no > + /// invalid bit patterns and can be safely reconstructed from raw bytes. > + /// - **Limitation:** This implementation does not support `IORING_SETUP_SQE128` (larger SQE entries). Please no emphasis. > + /// Only the standard `io_uring_sqe` layout is handled here. > + /// > + /// # Errors Blank here. > + /// * Returns `EINVAL` if the `self` does not hold a `opcode::URING_CMD`. > + /// * Returns `EFAULT` if the command buffer is smaller than the requested type `T`. > + /// > + /// # Returns I don’t think we need a specific section for this. Just write this in normal prose please. > + /// * On success, returns a `T` deserialized from the `cmd`. > + /// * On failure, returns an appropriate error as described above. > + pub fn cmd_data<T: FromBytes>(&self) -> Result<T> { > + // SAFETY: `self.inner` guaranteed by the type invariant to point > + // to a live `io_uring_sqe`, so dereferencing is safe. > + let sqe = unsafe { &*self.inner.get() }; > + > + if u32::from(sqe.opcode) != opcode::URING_CMD { > + return Err(EINVAL); > + } > + > + // SAFETY: Accessing the `sqe.cmd` union field is safe because we've > + // verified that `sqe.opcode == IORING_OP_URING_CMD`, which guarantees > + // that this union variant is initialized and valid. > + let cmd = unsafe { sqe.__bindgen_anon_6.cmd.as_ref() }; > + let cmd_len = size_of_val(&sqe.__bindgen_anon_6.bindgen_union_field); > + > + if cmd_len < size_of::<T>() { > + return Err(EFAULT); EINVAL > + } > + > + let cmd_ptr = cmd.as_ptr() as *mut T; cast() > + > + // SAFETY: `cmd_ptr` is valid from `self.inner` which is guaranteed by > + // type variant. And also it points to initialized `T` from userspace. “Invariant”. “[…] an initialized T”. > + let ret = unsafe { core::ptr::read_unaligned(cmd_ptr) }; > + > + Ok(ret) > + } > + > + /// Constructs a new `IoUringSqe` from a raw `io_uring_sqe`. [`IoUringSqe`]. Please build the docs and make sure all your docs look nice. > + /// > + /// # Safety > + /// > + /// The caller must guarantee that: > + /// - `ptr` is non-null, properly aligned, and points to a valid initialized > + /// `bindings::io_uring_sqe`. > + /// - The pointed-to memory remains valid (not freed or repurposed) for the > + /// entire lifetime `'a` of the returned reference. > + /// - **Aliasing rules (for `&T`):** while the returned `&'a IoUringSqe` is > + /// alive, there must be **no mutable access** to the same object through any > + /// path (no `&mut`, no raw-pointer writes, no FFI/IRQ/other-CPU writers). > + /// Multiple `&` is fine **only if all of them are read-only** for the entire > + /// overlapping lifetime. > + /// - This relies on `IoUringSqe` being `repr(transparent)` over > + /// `bindings::io_uring_sqe`, so the cast preserves layout. Please rewrite this entire section given the feedback I gave higher up in this patch. > + #[inline] > + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { Private or pub(crate) at best. > + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the > + // duration of 'a. The cast is okay because `IoUringSqe` is `repr(transparent)` and has the > + // same memory layout as `bindings::io_uring_sqe`. > + unsafe { &*ptr.cast() } > + } > +} > diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs > index ed53169e795c..d38cf7137401 100644 > --- a/rust/kernel/lib.rs > +++ b/rust/kernel/lib.rs > @@ -91,6 +91,7 @@ > pub mod fs; > pub mod init; > pub mod io; > +pub mod io_uring; > pub mod ioctl; > pub mod jump_label; > #[cfg(CONFIG_KUNIT)] > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-27 20:41 ` Daniel Almeida @ 2025-08-28 7:24 ` Benno Lossin 2025-08-29 15:43 ` Sidong Yang 1 sibling, 0 replies; 33+ messages in thread From: Benno Lossin @ 2025-08-28 7:24 UTC (permalink / raw) To: Daniel Almeida, Sidong Yang Cc: Jens Axboe, Caleb Sander Mateos, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Wed Aug 27, 2025 at 10:41 PM CEST, Daniel Almeida wrote: >> On 22 Aug 2025, at 09:55, Sidong Yang <sidong.yang@furiosa.ai> wrote: >> + } >> + >> + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); >> + let ptr = &raw mut inner.pdu as *const c_void; >> + >> + // SAFETY: >> + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. >> + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked >> + // size of `T` is smaller than pdu size. >> + unsafe { >> + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > > I don’t think you need to manually specify c_void here. In fact using `c_void` as the pointee type is wrong, since it's a ZST and thus this call copies no bytes whatsoever. > Benno, can’t we use core::mem::zeroed() or something like that to avoid this unsafe? > > The input was zeroed in prep() and the output can just be a zeroed T on the > stack, unless I missed something? Hmm I'm not sure I follow, I don't think that `mem::zeroed` will help (it also is an unsafe function). We have a helper in flight that might be useful in this case: https://lore.kernel.org/all/20250826-nova_firmware-v2-1-93566252fe3a@nvidia.com --- Cheers, Benno ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-27 20:41 ` Daniel Almeida 2025-08-28 7:24 ` Benno Lossin @ 2025-08-29 15:43 ` Sidong Yang 2025-08-29 16:11 ` Daniel Almeida 1 sibling, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-08-29 15:43 UTC (permalink / raw) To: Daniel Almeida Cc: Jens Axboe, Caleb Sander Mateos, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Wed, Aug 27, 2025 at 05:41:15PM -0300, Daniel Almeida wrote: Hi Daniel, > Hi Sidong, > > > On 22 Aug 2025, at 09:55, Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > Implment the io-uring abstractions needed for miscdevicecs and other > > char devices that have io-uring command interface. > > Can you expand on this last part? Sure. > > > > > * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which > > will be used as arg for `MiscDevice::uring_cmd()`. And driver can get > > `cmd_op` sent from userspace. Also it has `flags` which includes option > > that is reissued. > > > > This is a bit hard to parse. I'll fix this. > > > * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which > > could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` > > from userspace. Also `IoUringSqe` has more data like opcode could be used in > > driver. > > Same here. Same here. > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > --- > > rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ > > rust/kernel/lib.rs | 1 + > > 2 files changed, 307 insertions(+) > > create mode 100644 rust/kernel/io_uring.rs > > > > diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs > > new file mode 100644 > > index 000000000000..61e88bdf4e42 > > --- /dev/null > > +++ b/rust/kernel/io_uring.rs > > @@ -0,0 +1,306 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI > > + > > +//! Abstractions for io-uring. > > +//! > > +//! This module provides types for implements io-uring interface for char device. > > This is also hard to parse. I'll fix this. > > > +//! > > +//! > > +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and > > +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) > > + > > +use core::{mem::MaybeUninit, pin::Pin}; > > + > > +use crate::error::from_result; > > +use crate::transmute::{AsBytes, FromBytes}; > > +use crate::{fs::File, types::Opaque}; > > + > > +use crate::prelude::*; > > + > > +/// io-uring opcode > > /// `IoUring` opcodes. > > Notice: > > a) The capitalization, > b) The use of backticks, > c) The period in the end. > > This is an ongoing effort to keep the docs tidy :) Thanks :) > > > +pub mod opcode { > > + /// opcode for uring cmd > > + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; > > +} > > Should this be its own type? This way we can use the type system to enforce > that only valid opcodes are used where an opcode is expected. Sure. How about a transparent struct like below. #[repr(transparent)] pub struct Opcode(u32); impl Opcode { pub const URING_CMD: Self = Self(bindings::io_uring_op_IORING_OP_URING_CMD as u32); } > > > + > > +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. > > /// A Rust abstraction for `io_uring_cmd`. Okay, I'll fixed all comments mentioning "Linux kernel". > > > +/// > > +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` > > +/// binding from the Linux kernel. It represents a command structure used > > +/// in io_uring operations within the kernel. > > This code will also be part of the kernel, so mentioning "the Linux kernel" is superfluous. Thanks. > > > +/// This type is used internally by the io_uring subsystem to manage > > +/// asynchronous I/O commands. > > +/// > > +/// This type should not be constructed or manipulated directly by > > +/// kernel module developers. > > "...by drivers". Thanks. > > > +/// > > +/// # INVARIANT > > /// # Invariants Thanks. > > > +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. > > Blank here Thanks. > > > +#[repr(transparent)] > > +pub struct IoUringCmd { > > + /// An opaque wrapper containing the actual `io_uring_cmd` data. > > + inner: Opaque<bindings::io_uring_cmd>, > > +} > > + > > +impl IoUringCmd { > > + /// Returns the cmd_op with associated with the `io_uring_cmd`. > > This sentence does not parse very well. I'll fix this. Like you said, it's better to not to mention the c structure. > > > + #[inline] > > + pub fn cmd_op(&self) -> u32 { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + unsafe { (*self.inner.get()).cmd_op } > > Perhaps add an as_raw() method so this becomes: > > unsafe {*self.as_raw()}.cmd_op Agreed. Also it would return the Opcode type than u32. > > > + } > > + > > + /// Returns the flags with associated with the `io_uring_cmd`. > > With the command, or something like that. The user doesn´t see the raw > bindings::io_uring_cmd so we shouldn´t be mentioning it if we can help it. Agreed. I'll try to fix this without mentioning the io_uring_cmd. > > > + #[inline] > > + pub fn flags(&self) -> u32 { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + unsafe { (*self.inner.get()).flags } > > + } > > + > > + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd > > This sentence does not parse very well. I'll fix this. > > > + /// > > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > > /// # Errors Thanks. > > > + #[inline] > > + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { > > This takes &self, > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let inner = unsafe { &mut *self.inner.get() }; > > But this creates a &mut to self.inner using unsafe code. Avoid doing this in > general. All of a sudden your type is not thread-safe anymore. > > If you need to mutate &self here, then take &mut self as an argument. > > > + > > + let len = size_of::<T>(); > > + if len > inner.pdu.len() { > > + return Err(EFAULT); > > EFAULT? How about EINVAL? Good. > > > + } > > + > > + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); > > + let ptr = &raw mut inner.pdu as *const c_void; > > + > > + // SAFETY: > > + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > > + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > > + // size of `T` is smaller than pdu size. > > + unsafe { > > + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > > I don´t think you need to manually specify c_void here. > > Benno, can´t we use core::mem::zeroed() or something like that to avoid this unsafe? > > The input was zeroed in prep() and the output can just be a zeroed T on the > stack, unless I missed something? > > > + } > > + > > + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > > + // `FromBytes`, any bit-pattern is a valid value for this type. > > + Ok(unsafe { out.assume_init() }) > > + } > > + > > + /// Writes the provided `value` to `pdu` in uring_cmd `self` > > Writes the provided `value` to `pdu`. Thanks. > > > + /// > > /// # Errors > /// Thanks. > > > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > > > + #[inline] > > + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let inner = unsafe { &mut *self.inner.get() }; > > + > > + let len = size_of::<T>(); > > + if len > inner.pdu.len() { > > + return Err(EFAULT); > > + } > > + > > + let src = (value as *const T).cast::<c_void>(); > > as_ptr().cast() > > > + let dst = &raw mut inner.pdu as *mut c_void; > > (&raw mut inner.pdu).cast() > Thanks. > > + > > + // SAFETY: > > + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > > + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > > + // * It's safe to copy because size of `T` is no more than len of pdu. > > + unsafe { > > + core::ptr::copy_nonoverlapping(src, dst, len); > > + } > > + > > + Ok(()) > > + } > > + > > + /// Constructs a new [`IoUringCmd`] from a raw `io_uring_cmd` > > Missing period. Thanks. > > > + /// > > + /// # Safety > > + /// > > + /// The caller must guarantee that: > > + /// - `ptr` is non-null, properly aligned, and points to a valid > > + /// `bindings::io_uring_cmd`. > > Blanks for every bullet point, please. Thanks. > > > + /// - The pointed-to memory remains initialized and valid for the entire > > + /// lifetime `'a` of the returned reference. > > + /// - While the returned `Pin<&'a mut IoUringCmd>` is alive, the underlying > > + /// object is **not moved** (pinning requirement). > > They can´t move an !Unpin type in safe code. Okay, this could be removed. > > > + /// - **Aliasing rules:** the returned `&mut` has **exclusive** access to the same > > + /// object for its entire lifetime: > > You really don´t need to emphasize these. Okay. > > > + /// - No other `&mut` **or** `&` references to the same `io_uring_cmd` may be > > + /// alive at the same time. > > This and the point above are identical. It would be removed. > > > + /// - There must be no concurrent reads/writes through raw pointers, FFI, or > > + /// other kernel paths to the same object during this lifetime. > > This and the first point say the same thing. > > > + /// - If the object can be touched from other contexts (e.g. IRQ/another CPU), > > + /// the caller must provide synchronization to uphold this exclusivity. > > I am not sure what you mean. > > + /// - This function relies on `IoUringCmd` being `repr(transparent)` over > > + /// `bindings::io_uring_cmd` so the cast preserves layout. > > This is not a safety requirement. > > Just adapt the requirements from other instances of from_raw(), they all > convert a *mut T to a &T so the safety requirements are similar. > Okay, This safety comments are too redundant. I'll rewrite this. Thanks. > > + #[inline] > > + pub unsafe fn from_raw<'a>(ptr: *mut bindings::io_uring_cmd) -> Pin<&'a mut IoUringCmd> { > > Why is this pub? Sounds like a massive footgun? This should be private or at > best pub(crate). Because from_raw() would be used in miscdevice. pub(crate) will be okay. > > > > + // SAFETY: > > + // * The caller guarantees that the pointer is not dangling and stays > > + // valid for the duration of 'a. > > + // * The cast is okay because `IoUringCmd` is `repr(transparent)` and > > + // has the same memory layout as `bindings::io_uring_cmd`. > > + // * The returned `Pin` ensures that the object cannot be moved, which > > + // is required because the kernel may hold pointers to this memory > > + // location and moving it would invalidate those pointers. > > > + unsafe { Pin::new_unchecked(&mut *ptr.cast()) } > > + } > > + > > + /// Returns the file that referenced by uring cmd self. > > + #[inline] > > + pub fn file(&self) -> &File { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let file = unsafe { (*self.inner.get()).file }; > > + > > + // SAFETY: > > + // * The `file` points valid file. > > Why? `file` is from `self.inner` which is guranteed by the type invariant. I missed the comment. > > > + // * refcount is positive after submission queue entry issued. > > + // * There is no active fdget_pos region on the file on this thread. > > + unsafe { File::from_raw_file(file) } > > + } > > + > > + /// Returns an reference to the [`IoUringSqe`] associated with this command. > > s/an/a Thanks. > > > + #[inline] > > + pub fn sqe(&self) -> &IoUringSqe { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let sqe = unsafe { (*self.inner.get()).sqe }; > > + // SAFETY: The call guarantees that the `sqe` points valid io_uring_sqe. > > What do you mean by "the call guarantees" ? It's just miss. This should be "This call is guaranteed to be safe because...". > > > + unsafe { IoUringSqe::from_raw(sqe) } > > + } > > + > > + /// Completes an this [`IoUringCmd`] request that was previously queued. > > This sentence does not parse very well. I'll fix this. Thanks. > > > + /// > > + /// # Safety > > + /// > > + /// - This function must be called **only** for a command whose `uring_cmd` > > Please no emphasis. Thanks. > > > + /// handler previously returned **`-EIOCBQUEUED`** to io_uring. > > To what? Are you referring to a Rust type, or to the C part of the kernel? I referred C type but it's better to use Rust return type. > > > + /// > > + /// # Parameters > > + /// > > + /// - `ret`: Result to return to userspace. > > + /// - `res2`: Extra for big completion queue entry `IORING_SETUP_CQE32`. > > This sentence does not parse very well. Also, can you rename this? Okay. > > > + /// - `issue_flags`: Flags associated with this request, typically the same > > + /// as those passed to the `uring_cmd` handler. > > + #[inline] > > + pub fn done(self: Pin<&mut IoUringCmd>, ret: Result<i32>, res2: u64, issue_flags: u32) { > > + let ret = from_result(|| ret) as isize; > > What does this do? It casts Result<i32> to isize. `bindings::io_uring_cmd_done` receives `isize`. I wanted that `ret` would be `Result` than just i32. `ret` should be just isize? > > > + // SAFETY: The call guarantees that `self.inner` is not dangling and stays valid > > What do you mean "the call" ? Sorry, "This call is guruanteed to be safe ..." > > > + unsafe { > > + bindings::io_uring_cmd_done(self.inner.get(), ret, res2, issue_flags); > > + } > > + } > > +} > > + > > +/// A Rust abstraction for the Linux kernel's `io_uring_sqe` structure. > > Please don´t mention the words "Linux kernel" here either. Yes. > > > +/// > > +/// This structure is a safe, opaque wrapper around the raw C [`io_uring_sqe`](srctree/include/uapi/linux/io_uring.h) > > This line needs to be wrapped. Okay. > > > +/// binding from the Linux kernel. It represents a Submission Queue Entry > > Can you link somewhere here? Perhaps there´s docs for "Submission Queue > Entry". Okay I'll find it. > > > +/// used in io_uring operations within the kernel. > > +/// > > +/// # Type Safety > > +/// > > +/// The `#[repr(transparent)]` attribute ensures that this wrapper has > > +/// the same memory layout as the underlying `io_uring_sqe` structure, > > +/// allowing it to be safely transmuted between the two representations. > > This is an invariant, please move it there. Thanks. > > > +/// > > +/// # Fields > > +/// > > +/// * `inner` - An opaque wrapper containing the actual `io_uring_sqe` data. > > +/// The `Opaque` type prevents direct access to the internal > > +/// structure fields, ensuring memory safety and encapsulation. > > Inline docs please. Okay. > > > +/// > > +/// # Usage > > I don´t think we specifically need to mention "# Usage". Okay it would be deleted. > > > +/// > > +/// This type represents a submission queue entry that describes an I/O > > You can start with "Represents a...". No need to say "this type" here. Thanks. > > > +/// operation to be executed by the io_uring subsystem. It contains > > +/// information such as the operation type, file descriptor, buffer > > +/// pointers, and other operation-specific data. > > +/// > > +/// Users can obtain this type from [`IoUringCmd::sqe()`] method, which > > +/// extracts the submission queue entry associated with a command. > > +/// > > +/// This type should not be constructed or manipulated directly by > > +/// kernel module developers. > > By drivers. Thanks. > > > +/// > > +/// # INVARIANT > > /// # Invariants Thanks. > > > +/// - `self.inner` always points to a valid, live `bindings::io_uring_sqe`. > > +#[repr(transparent)] > > +pub struct IoUringSqe { > > + inner: Opaque<bindings::io_uring_sqe>, > > +} > > + > > +impl IoUringSqe { > > + /// Reads and interprets the `cmd` field of an `bindings::io_uring_sqe` as a value of type `T`. > > + /// > > + /// # Safety & Invariants > > Safety section for a safe function. Thanks. > > > + /// - Construction of `T` is delegated to `FromBytes`, which guarantees that `T` has no > > + /// invalid bit patterns and can be safely reconstructed from raw bytes. > > + /// - **Limitation:** This implementation does not support `IORING_SETUP_SQE128` (larger SQE entries). > > Please no emphasis. Okay. > > > > + /// Only the standard `io_uring_sqe` layout is handled here. > > + /// > > + /// # Errors > > Blank here. Thanks. > > > + /// * Returns `EINVAL` if the `self` does not hold a `opcode::URING_CMD`. > > + /// * Returns `EFAULT` if the command buffer is smaller than the requested type `T`. > > + /// > > + /// # Returns > > I don´t think we need a specific section for this. Just write this in > normal prose please. Okay. > > > > + /// * On success, returns a `T` deserialized from the `cmd`. > > + /// * On failure, returns an appropriate error as described above. > > + pub fn cmd_data<T: FromBytes>(&self) -> Result<T> { > > + // SAFETY: `self.inner` guaranteed by the type invariant to point > > + // to a live `io_uring_sqe`, so dereferencing is safe. > > + let sqe = unsafe { &*self.inner.get() }; > > + > > + if u32::from(sqe.opcode) != opcode::URING_CMD { > > + return Err(EINVAL); > > + } > > + > > + // SAFETY: Accessing the `sqe.cmd` union field is safe because we've > > + // verified that `sqe.opcode == IORING_OP_URING_CMD`, which guarantees > > + // that this union variant is initialized and valid. > > + let cmd = unsafe { sqe.__bindgen_anon_6.cmd.as_ref() }; > > + let cmd_len = size_of_val(&sqe.__bindgen_anon_6.bindgen_union_field); > > + > > + if cmd_len < size_of::<T>() { > > + return Err(EFAULT); > > EINVAL Thanks. > > > + } > > + > > + let cmd_ptr = cmd.as_ptr() as *mut T; > > cast() Thanks. > > > + > > + // SAFETY: `cmd_ptr` is valid from `self.inner` which is guaranteed by > > + // type variant. And also it points to initialized `T` from userspace. > > "Invariant". Thanks! > > "[...] an initialized T". Thanks. > > > > + let ret = unsafe { core::ptr::read_unaligned(cmd_ptr) }; > > + > > + Ok(ret) > > + } > > + > > + /// Constructs a new `IoUringSqe` from a raw `io_uring_sqe`. > > [`IoUringSqe`]. > > Please build the docs and make sure all your docs look nice. Okay, I'll review comments by docs. > > > + /// > > + /// # Safety > > + /// > > + /// The caller must guarantee that: > > + /// - `ptr` is non-null, properly aligned, and points to a valid initialized > > + /// `bindings::io_uring_sqe`. > > + /// - The pointed-to memory remains valid (not freed or repurposed) for the > > + /// entire lifetime `'a` of the returned reference. > > + /// - **Aliasing rules (for `&T`):** while the returned `&'a IoUringSqe` is > > + /// alive, there must be **no mutable access** to the same object through any > > + /// path (no `&mut`, no raw-pointer writes, no FFI/IRQ/other-CPU writers). > > + /// Multiple `&` is fine **only if all of them are read-only** for the entire > > + /// overlapping lifetime. > > + /// - This relies on `IoUringSqe` being `repr(transparent)` over > > + /// `bindings::io_uring_sqe`, so the cast preserves layout. > > Please rewrite this entire section given the feedback I gave higher up in this > patch. Okay. I'll rewrite this. > > > + #[inline] > > + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { > > Private or pub(crate) at best. Okay. pub(crate) > > > + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the > > + // duration of 'a. The cast is okay because `IoUringSqe` is `repr(transparent)` and has the > > + // same memory layout as `bindings::io_uring_sqe`. > > + unsafe { &*ptr.cast() } > > + } > > +} > > diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs > > index ed53169e795c..d38cf7137401 100644 > > --- a/rust/kernel/lib.rs > > +++ b/rust/kernel/lib.rs > > @@ -91,6 +91,7 @@ > > pub mod fs; > > pub mod init; > > pub mod io; > > +pub mod io_uring; > > pub mod ioctl; > > pub mod jump_label; > > #[cfg(CONFIG_KUNIT)] > > -- > > 2.43.0 > > > Thanks for detailed review! I'll revise the comments overall. Also there would be new type for opcode. Thanks, Sidong ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-29 15:43 ` Sidong Yang @ 2025-08-29 16:11 ` Daniel Almeida 0 siblings, 0 replies; 33+ messages in thread From: Daniel Almeida @ 2025-08-29 16:11 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Caleb Sander Mateos, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring […] >> >>> + #[inline] >>> + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { >> >> Private or pub(crate) at best. > > Okay. pub(crate) Try to make things private and reach for pub(crate) here only when it’s actually needed. — Daniel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-22 12:55 ` [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd Sidong Yang 2025-08-27 20:41 ` Daniel Almeida @ 2025-08-28 0:36 ` Ming Lei 2025-08-28 7:25 ` Benno Lossin 2025-09-02 1:11 ` Caleb Sander Mateos 2 siblings, 1 reply; 33+ messages in thread From: Ming Lei @ 2025-08-28 0:36 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Aug 22, 2025 at 12:55:53PM +0000, Sidong Yang wrote: > Implment the io-uring abstractions needed for miscdevicecs and other > char devices that have io-uring command interface. > > * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which > will be used as arg for `MiscDevice::uring_cmd()`. And driver can get > `cmd_op` sent from userspace. Also it has `flags` which includes option > that is reissued. > > * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which > could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` > from userspace. Also `IoUringSqe` has more data like opcode could be used in > driver. > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > --- > rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ > rust/kernel/lib.rs | 1 + > 2 files changed, 307 insertions(+) > create mode 100644 rust/kernel/io_uring.rs > > diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs > new file mode 100644 > index 000000000000..61e88bdf4e42 > --- /dev/null > +++ b/rust/kernel/io_uring.rs > @@ -0,0 +1,306 @@ > +// SPDX-License-Identifier: GPL-2.0 > +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI > + > +//! Abstractions for io-uring. > +//! > +//! This module provides types for implements io-uring interface for char device. > +//! > +//! > +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and > +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) > + > +use core::{mem::MaybeUninit, pin::Pin}; > + > +use crate::error::from_result; > +use crate::transmute::{AsBytes, FromBytes}; > +use crate::{fs::File, types::Opaque}; > + > +use crate::prelude::*; > + > +/// io-uring opcode > +pub mod opcode { > + /// opcode for uring cmd > + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; > +} > + > +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. > +/// > +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` > +/// binding from the Linux kernel. It represents a command structure used > +/// in io_uring operations within the kernel. > +/// This type is used internally by the io_uring subsystem to manage > +/// asynchronous I/O commands. > +/// > +/// This type should not be constructed or manipulated directly by > +/// kernel module developers. > +/// > +/// # INVARIANT > +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. > +#[repr(transparent)] > +pub struct IoUringCmd { > + /// An opaque wrapper containing the actual `io_uring_cmd` data. > + inner: Opaque<bindings::io_uring_cmd>, > +} > + > +impl IoUringCmd { > + /// Returns the cmd_op with associated with the `io_uring_cmd`. > + #[inline] > + pub fn cmd_op(&self) -> u32 { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + unsafe { (*self.inner.get()).cmd_op } > + } > + > + /// Returns the flags with associated with the `io_uring_cmd`. > + #[inline] > + pub fn flags(&self) -> u32 { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + unsafe { (*self.inner.get()).flags } > + } > + > + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd > + /// > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > + #[inline] > + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let inner = unsafe { &mut *self.inner.get() }; > + > + let len = size_of::<T>(); > + if len > inner.pdu.len() { > + return Err(EFAULT); > + } > + > + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); > + let ptr = &raw mut inner.pdu as *const c_void; > + > + // SAFETY: > + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > + // size of `T` is smaller than pdu size. > + unsafe { > + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > + } > + > + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > + // `FromBytes`, any bit-pattern is a valid value for this type. > + Ok(unsafe { out.assume_init() }) > + } > + > + /// Writes the provided `value` to `pdu` in uring_cmd `self` > + /// > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > + #[inline] > + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let inner = unsafe { &mut *self.inner.get() }; > + > + let len = size_of::<T>(); > + if len > inner.pdu.len() { > + return Err(EFAULT); > + } > + > + let src = (value as *const T).cast::<c_void>(); > + let dst = &raw mut inner.pdu as *mut c_void; > + > + // SAFETY: > + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > + // * It's safe to copy because size of `T` is no more than len of pdu. > + unsafe { > + core::ptr::copy_nonoverlapping(src, dst, len); > + } > + > + Ok(()) > + } pdu is part of IoUringCmd, which is live in the whole uring_cmd lifetime. But both read_pdu()/write_pdu() needs copy to read or write any byte in the pdu, which is slow and hard to use, it could be more efficient to add two methods to return Result<&T> and Result<mut &T> for user to manipulate uring_cmd's pdu. Thanks, Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-28 0:36 ` Ming Lei @ 2025-08-28 7:25 ` Benno Lossin 2025-08-28 10:05 ` Ming Lei 0 siblings, 1 reply; 33+ messages in thread From: Benno Lossin @ 2025-08-28 7:25 UTC (permalink / raw) To: Ming Lei, Sidong Yang Cc: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Thu Aug 28, 2025 at 2:36 AM CEST, Ming Lei wrote: > On Fri, Aug 22, 2025 at 12:55:53PM +0000, Sidong Yang wrote: >> + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd >> + /// >> + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. >> + #[inline] >> + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { >> + // SAFETY: `self.inner` is guaranteed by the type invariant to point >> + // to a live `io_uring_cmd`, so dereferencing is safe. >> + let inner = unsafe { &mut *self.inner.get() }; >> + >> + let len = size_of::<T>(); >> + if len > inner.pdu.len() { >> + return Err(EFAULT); >> + } >> + >> + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); >> + let ptr = &raw mut inner.pdu as *const c_void; >> + >> + // SAFETY: >> + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. >> + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked >> + // size of `T` is smaller than pdu size. >> + unsafe { >> + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); >> + } >> + >> + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements >> + // `FromBytes`, any bit-pattern is a valid value for this type. >> + Ok(unsafe { out.assume_init() }) >> + } >> + >> + /// Writes the provided `value` to `pdu` in uring_cmd `self` >> + /// >> + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. >> + #[inline] >> + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { >> + // SAFETY: `self.inner` is guaranteed by the type invariant to point >> + // to a live `io_uring_cmd`, so dereferencing is safe. >> + let inner = unsafe { &mut *self.inner.get() }; >> + >> + let len = size_of::<T>(); >> + if len > inner.pdu.len() { >> + return Err(EFAULT); >> + } >> + >> + let src = (value as *const T).cast::<c_void>(); >> + let dst = &raw mut inner.pdu as *mut c_void; >> + >> + // SAFETY: >> + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` >> + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. >> + // * It's safe to copy because size of `T` is no more than len of pdu. >> + unsafe { >> + core::ptr::copy_nonoverlapping(src, dst, len); >> + } >> + >> + Ok(()) >> + } > > pdu is part of IoUringCmd, which is live in the whole uring_cmd lifetime. But > both read_pdu()/write_pdu() needs copy to read or write any byte in the pdu, which > is slow and hard to use, it could be more efficient to add two methods to return > Result<&T> and Result<mut &T> for user to manipulate uring_cmd's pdu. That can also be useful, but you do need to ensure that the pdu is aligned to at least the required alignment of `T` for this to be sound. I also don't follow your argument that reading & writing the pdu is slow. --- Cheers, Benno ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-28 7:25 ` Benno Lossin @ 2025-08-28 10:05 ` Ming Lei 0 siblings, 0 replies; 33+ messages in thread From: Ming Lei @ 2025-08-28 10:05 UTC (permalink / raw) To: Benno Lossin Cc: Sidong Yang, Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Thu, Aug 28, 2025 at 09:25:56AM +0200, Benno Lossin wrote: > On Thu Aug 28, 2025 at 2:36 AM CEST, Ming Lei wrote: > > On Fri, Aug 22, 2025 at 12:55:53PM +0000, Sidong Yang wrote: > >> + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd > >> + /// > >> + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > >> + #[inline] > >> + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { > >> + // SAFETY: `self.inner` is guaranteed by the type invariant to point > >> + // to a live `io_uring_cmd`, so dereferencing is safe. > >> + let inner = unsafe { &mut *self.inner.get() }; > >> + > >> + let len = size_of::<T>(); > >> + if len > inner.pdu.len() { > >> + return Err(EFAULT); > >> + } > >> + > >> + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); > >> + let ptr = &raw mut inner.pdu as *const c_void; > >> + > >> + // SAFETY: > >> + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > >> + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > >> + // size of `T` is smaller than pdu size. > >> + unsafe { > >> + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > >> + } > >> + > >> + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > >> + // `FromBytes`, any bit-pattern is a valid value for this type. > >> + Ok(unsafe { out.assume_init() }) > >> + } > >> + > >> + /// Writes the provided `value` to `pdu` in uring_cmd `self` > >> + /// > >> + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > >> + #[inline] > >> + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > >> + // SAFETY: `self.inner` is guaranteed by the type invariant to point > >> + // to a live `io_uring_cmd`, so dereferencing is safe. > >> + let inner = unsafe { &mut *self.inner.get() }; > >> + > >> + let len = size_of::<T>(); > >> + if len > inner.pdu.len() { > >> + return Err(EFAULT); > >> + } > >> + > >> + let src = (value as *const T).cast::<c_void>(); > >> + let dst = &raw mut inner.pdu as *mut c_void; > >> + > >> + // SAFETY: > >> + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > >> + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > >> + // * It's safe to copy because size of `T` is no more than len of pdu. > >> + unsafe { > >> + core::ptr::copy_nonoverlapping(src, dst, len); > >> + } > >> + > >> + Ok(()) > >> + } > > > > pdu is part of IoUringCmd, which is live in the whole uring_cmd lifetime. But > > both read_pdu()/write_pdu() needs copy to read or write any byte in the pdu, which > > is slow and hard to use, it could be more efficient to add two methods to return > > Result<&T> and Result<mut &T> for user to manipulate uring_cmd's pdu. > > That can also be useful, but you do need to ensure that the pdu is > aligned to at least the required alignment of `T` for this to be sound. Yes, `IoUringCmd` is actually one C struct, so `T` is supposed to be `#[repr(C)]` too, and the usage should be just like what io_uring_cmd_to_pdu() provides. > I also don't follow your argument that reading & writing the pdu is > slow. Please look at how pdu is used in existing C users, such as, the tag field of `pdu` can be read/write directly via: `io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu)->tag` but read_pdu()/write_pdu() needs whole 32bytes copy for read/write any single field. User may only need to store single byte data in pdu... Thanks, Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-08-22 12:55 ` [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd Sidong Yang 2025-08-27 20:41 ` Daniel Almeida 2025-08-28 0:36 ` Ming Lei @ 2025-09-02 1:11 ` Caleb Sander Mateos 2025-09-02 11:11 ` Sidong Yang 2 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-02 1:11 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > Implment the io-uring abstractions needed for miscdevicecs and other typo: "Implement" > char devices that have io-uring command interface. > > * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which > will be used as arg for `MiscDevice::uring_cmd()`. And driver can get > `cmd_op` sent from userspace. Also it has `flags` which includes option > that is reissued. > > * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which > could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` > from userspace. Also `IoUringSqe` has more data like opcode could be used in > driver. > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > --- > rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ > rust/kernel/lib.rs | 1 + > 2 files changed, 307 insertions(+) > create mode 100644 rust/kernel/io_uring.rs > > diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs > new file mode 100644 > index 000000000000..61e88bdf4e42 > --- /dev/null > +++ b/rust/kernel/io_uring.rs > @@ -0,0 +1,306 @@ > +// SPDX-License-Identifier: GPL-2.0 > +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI > + > +//! Abstractions for io-uring. > +//! > +//! This module provides types for implements io-uring interface for char device. > +//! > +//! > +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and > +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) > + > +use core::{mem::MaybeUninit, pin::Pin}; > + > +use crate::error::from_result; > +use crate::transmute::{AsBytes, FromBytes}; > +use crate::{fs::File, types::Opaque}; > + > +use crate::prelude::*; > + > +/// io-uring opcode > +pub mod opcode { > + /// opcode for uring cmd > + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; > +} > + > +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. > +/// > +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` > +/// binding from the Linux kernel. It represents a command structure used > +/// in io_uring operations within the kernel. > +/// This type is used internally by the io_uring subsystem to manage > +/// asynchronous I/O commands. > +/// > +/// This type should not be constructed or manipulated directly by > +/// kernel module developers. > +/// > +/// # INVARIANT > +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. > +#[repr(transparent)] > +pub struct IoUringCmd { > + /// An opaque wrapper containing the actual `io_uring_cmd` data. > + inner: Opaque<bindings::io_uring_cmd>, > +} > + > +impl IoUringCmd { > + /// Returns the cmd_op with associated with the `io_uring_cmd`. > + #[inline] > + pub fn cmd_op(&self) -> u32 { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + unsafe { (*self.inner.get()).cmd_op } > + } > + > + /// Returns the flags with associated with the `io_uring_cmd`. > + #[inline] > + pub fn flags(&self) -> u32 { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + unsafe { (*self.inner.get()).flags } > + } > + > + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd > + /// > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > + #[inline] > + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let inner = unsafe { &mut *self.inner.get() }; Why does this reference need to be mutable? > + > + let len = size_of::<T>(); > + if len > inner.pdu.len() { > + return Err(EFAULT); > + } > + > + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); Why is the intermediate MaybeUninit necessary? Would core::ptr::read_unaligned() not work? > + let ptr = &raw mut inner.pdu as *const c_void; > + > + // SAFETY: > + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > + // size of `T` is smaller than pdu size. > + unsafe { > + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > + } > + > + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > + // `FromBytes`, any bit-pattern is a valid value for this type. > + Ok(unsafe { out.assume_init() }) > + } > + > + /// Writes the provided `value` to `pdu` in uring_cmd `self` > + /// > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > + #[inline] > + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let inner = unsafe { &mut *self.inner.get() }; > + > + let len = size_of::<T>(); > + if len > inner.pdu.len() { > + return Err(EFAULT); > + } > + > + let src = (value as *const T).cast::<c_void>(); > + let dst = &raw mut inner.pdu as *mut c_void; > + > + // SAFETY: > + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > + // * It's safe to copy because size of `T` is no more than len of pdu. > + unsafe { > + core::ptr::copy_nonoverlapping(src, dst, len); > + } > + > + Ok(()) > + } > + > + /// Constructs a new [`IoUringCmd`] from a raw `io_uring_cmd` > + /// > + /// # Safety > + /// > + /// The caller must guarantee that: > + /// - `ptr` is non-null, properly aligned, and points to a valid > + /// `bindings::io_uring_cmd`. > + /// - The pointed-to memory remains initialized and valid for the entire > + /// lifetime `'a` of the returned reference. > + /// - While the returned `Pin<&'a mut IoUringCmd>` is alive, the underlying > + /// object is **not moved** (pinning requirement). > + /// - **Aliasing rules:** the returned `&mut` has **exclusive** access to the same > + /// object for its entire lifetime: > + /// - No other `&mut` **or** `&` references to the same `io_uring_cmd` may be > + /// alive at the same time. > + /// - There must be no concurrent reads/writes through raw pointers, FFI, or > + /// other kernel paths to the same object during this lifetime. > + /// - If the object can be touched from other contexts (e.g. IRQ/another CPU), > + /// the caller must provide synchronization to uphold this exclusivity. > + /// - This function relies on `IoUringCmd` being `repr(transparent)` over > + /// `bindings::io_uring_cmd` so the cast preserves layout. > + #[inline] > + pub unsafe fn from_raw<'a>(ptr: *mut bindings::io_uring_cmd) -> Pin<&'a mut IoUringCmd> { > + // SAFETY: > + // * The caller guarantees that the pointer is not dangling and stays > + // valid for the duration of 'a. > + // * The cast is okay because `IoUringCmd` is `repr(transparent)` and > + // has the same memory layout as `bindings::io_uring_cmd`. > + // * The returned `Pin` ensures that the object cannot be moved, which > + // is required because the kernel may hold pointers to this memory > + // location and moving it would invalidate those pointers. > + unsafe { Pin::new_unchecked(&mut *ptr.cast()) } > + } > + > + /// Returns the file that referenced by uring cmd self. > + #[inline] > + pub fn file(&self) -> &File { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let file = unsafe { (*self.inner.get()).file }; > + > + // SAFETY: > + // * The `file` points valid file. > + // * refcount is positive after submission queue entry issued. > + // * There is no active fdget_pos region on the file on this thread. > + unsafe { File::from_raw_file(file) } > + } > + > + /// Returns an reference to the [`IoUringSqe`] associated with this command. > + #[inline] > + pub fn sqe(&self) -> &IoUringSqe { > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > + // to a live `io_uring_cmd`, so dereferencing is safe. > + let sqe = unsafe { (*self.inner.get()).sqe }; > + // SAFETY: The call guarantees that the `sqe` points valid io_uring_sqe. > + unsafe { IoUringSqe::from_raw(sqe) } > + } > + > + /// Completes an this [`IoUringCmd`] request that was previously queued. > + /// > + /// # Safety > + /// > + /// - This function must be called **only** for a command whose `uring_cmd` > + /// handler previously returned **`-EIOCBQUEUED`** to io_uring. > + /// > + /// # Parameters > + /// > + /// - `ret`: Result to return to userspace. > + /// - `res2`: Extra for big completion queue entry `IORING_SETUP_CQE32`. > + /// - `issue_flags`: Flags associated with this request, typically the same > + /// as those passed to the `uring_cmd` handler. > + #[inline] > + pub fn done(self: Pin<&mut IoUringCmd>, ret: Result<i32>, res2: u64, issue_flags: u32) { Since this takes the IoUringCmd by reference, it allows a single io_uring_cmd to be completed multiple times, which would not be safe. This method probably needs to take ownership of the IoUringCmd. Even better would be to enforce at compile time that the number of times IoUringCmd::done() is called matches the return value from uring_cmd(): io_uring_cmd_done() may only be called if uring_cmd() returns -EIOCBQUEUED; -EAGAIN will result in another call to uring_cmd() and all other return values synchronously complete the io_uring_cmd. It's also not safe for the caller to pass an arbitrary value for issue_flags here; it needs to exactly match what was passed into uring_cmd(). Maybe we could introduce a type that couples the IoUringCmd and issue_flags passed to uring_cmd()? > + let ret = from_result(|| ret) as isize; > + // SAFETY: The call guarantees that `self.inner` is not dangling and stays valid > + unsafe { > + bindings::io_uring_cmd_done(self.inner.get(), ret, res2, issue_flags); > + } > + } > +} > + > +/// A Rust abstraction for the Linux kernel's `io_uring_sqe` structure. > +/// > +/// This structure is a safe, opaque wrapper around the raw C [`io_uring_sqe`](srctree/include/uapi/linux/io_uring.h) > +/// binding from the Linux kernel. It represents a Submission Queue Entry > +/// used in io_uring operations within the kernel. > +/// > +/// # Type Safety > +/// > +/// The `#[repr(transparent)]` attribute ensures that this wrapper has > +/// the same memory layout as the underlying `io_uring_sqe` structure, > +/// allowing it to be safely transmuted between the two representations. > +/// > +/// # Fields > +/// > +/// * `inner` - An opaque wrapper containing the actual `io_uring_sqe` data. > +/// The `Opaque` type prevents direct access to the internal > +/// structure fields, ensuring memory safety and encapsulation. > +/// > +/// # Usage > +/// > +/// This type represents a submission queue entry that describes an I/O > +/// operation to be executed by the io_uring subsystem. It contains > +/// information such as the operation type, file descriptor, buffer > +/// pointers, and other operation-specific data. > +/// > +/// Users can obtain this type from [`IoUringCmd::sqe()`] method, which > +/// extracts the submission queue entry associated with a command. > +/// > +/// This type should not be constructed or manipulated directly by > +/// kernel module developers. > +/// > +/// # INVARIANT > +/// - `self.inner` always points to a valid, live `bindings::io_uring_sqe`. > +#[repr(transparent)] > +pub struct IoUringSqe { > + inner: Opaque<bindings::io_uring_sqe>, > +} > + > +impl IoUringSqe { > + /// Reads and interprets the `cmd` field of an `bindings::io_uring_sqe` as a value of type `T`. > + /// > + /// # Safety & Invariants > + /// - Construction of `T` is delegated to `FromBytes`, which guarantees that `T` has no > + /// invalid bit patterns and can be safely reconstructed from raw bytes. > + /// - **Limitation:** This implementation does not support `IORING_SETUP_SQE128` (larger SQE entries). > + /// Only the standard `io_uring_sqe` layout is handled here. > + /// > + /// # Errors > + /// * Returns `EINVAL` if the `self` does not hold a `opcode::URING_CMD`. > + /// * Returns `EFAULT` if the command buffer is smaller than the requested type `T`. > + /// > + /// # Returns > + /// * On success, returns a `T` deserialized from the `cmd`. > + /// * On failure, returns an appropriate error as described above. > + pub fn cmd_data<T: FromBytes>(&self) -> Result<T> { > + // SAFETY: `self.inner` guaranteed by the type invariant to point > + // to a live `io_uring_sqe`, so dereferencing is safe. > + let sqe = unsafe { &*self.inner.get() }; > + > + if u32::from(sqe.opcode) != opcode::URING_CMD { io_uring opcodes are u8 values. Can the URING_CMD constant be made a u8 instead of converting sqe.opcode here? The read of the opcode should also be using read_volatile(), as it may lie in the userspace-mapped SQE region, which could be concurrently written by another userspace thread. That would probably be buggy behavior on the userspace side, but it can cause undefined behavior on the kernel side without a volatile read, as the compiler could choose to re-read the value multiple times assuming it will get the same value each time. > + return Err(EINVAL); > + } > + > + // SAFETY: Accessing the `sqe.cmd` union field is safe because we've > + // verified that `sqe.opcode == IORING_OP_URING_CMD`, which guarantees > + // that this union variant is initialized and valid. The opcode check isn't necessary. It doesn't provide any assurances that this variant of the union is actually initialized, since a buggy userspace process could set the opcode without initializing the cmd field. It's always valid to access this memory since it's part of the SQE region created at io_uring setup time. So I would drop the opcode check. > + let cmd = unsafe { sqe.__bindgen_anon_6.cmd.as_ref() }; > + let cmd_len = size_of_val(&sqe.__bindgen_anon_6.bindgen_union_field); > + > + if cmd_len < size_of::<T>() { > + return Err(EFAULT); > + } > + > + let cmd_ptr = cmd.as_ptr() as *mut T; > + > + // SAFETY: `cmd_ptr` is valid from `self.inner` which is guaranteed by > + // type variant. And also it points to initialized `T` from userspace. > + let ret = unsafe { core::ptr::read_unaligned(cmd_ptr) }; Similarly, needs to be volatile. The C uring_cmd() implementations use READ_ONCE() to read the cmd field. Best, Caleb > + > + Ok(ret) > + } > + > + /// Constructs a new `IoUringSqe` from a raw `io_uring_sqe`. > + /// > + /// # Safety > + /// > + /// The caller must guarantee that: > + /// - `ptr` is non-null, properly aligned, and points to a valid initialized > + /// `bindings::io_uring_sqe`. > + /// - The pointed-to memory remains valid (not freed or repurposed) for the > + /// entire lifetime `'a` of the returned reference. > + /// - **Aliasing rules (for `&T`):** while the returned `&'a IoUringSqe` is > + /// alive, there must be **no mutable access** to the same object through any > + /// path (no `&mut`, no raw-pointer writes, no FFI/IRQ/other-CPU writers). > + /// Multiple `&` is fine **only if all of them are read-only** for the entire > + /// overlapping lifetime. > + /// - This relies on `IoUringSqe` being `repr(transparent)` over > + /// `bindings::io_uring_sqe`, so the cast preserves layout. > + #[inline] > + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { > + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the > + // duration of 'a. The cast is okay because `IoUringSqe` is `repr(transparent)` and has the > + // same memory layout as `bindings::io_uring_sqe`. > + unsafe { &*ptr.cast() } > + } > +} > diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs > index ed53169e795c..d38cf7137401 100644 > --- a/rust/kernel/lib.rs > +++ b/rust/kernel/lib.rs > @@ -91,6 +91,7 @@ > pub mod fs; > pub mod init; > pub mod io; > +pub mod io_uring; > pub mod ioctl; > pub mod jump_label; > #[cfg(CONFIG_KUNIT)] > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-09-02 1:11 ` Caleb Sander Mateos @ 2025-09-02 11:11 ` Sidong Yang 2025-09-02 15:41 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-02 11:11 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Mon, Sep 01, 2025 at 06:11:01PM -0700, Caleb Sander Mateos wrote: > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > Implment the io-uring abstractions needed for miscdevicecs and other > > typo: "Implement" Thanks. > > > char devices that have io-uring command interface. > > > > * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which > > will be used as arg for `MiscDevice::uring_cmd()`. And driver can get > > `cmd_op` sent from userspace. Also it has `flags` which includes option > > that is reissued. > > > > * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which > > could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` > > from userspace. Also `IoUringSqe` has more data like opcode could be used in > > driver. > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > --- > > rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ > > rust/kernel/lib.rs | 1 + > > 2 files changed, 307 insertions(+) > > create mode 100644 rust/kernel/io_uring.rs > > > > diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs > > new file mode 100644 > > index 000000000000..61e88bdf4e42 > > --- /dev/null > > +++ b/rust/kernel/io_uring.rs > > @@ -0,0 +1,306 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI > > + > > +//! Abstractions for io-uring. > > +//! > > +//! This module provides types for implements io-uring interface for char device. > > +//! > > +//! > > +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and > > +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) > > + > > +use core::{mem::MaybeUninit, pin::Pin}; > > + > > +use crate::error::from_result; > > +use crate::transmute::{AsBytes, FromBytes}; > > +use crate::{fs::File, types::Opaque}; > > + > > +use crate::prelude::*; > > + > > +/// io-uring opcode > > +pub mod opcode { > > + /// opcode for uring cmd > > + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; > > +} > > + > > +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. > > +/// > > +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` > > +/// binding from the Linux kernel. It represents a command structure used > > +/// in io_uring operations within the kernel. > > +/// This type is used internally by the io_uring subsystem to manage > > +/// asynchronous I/O commands. > > +/// > > +/// This type should not be constructed or manipulated directly by > > +/// kernel module developers. > > +/// > > +/// # INVARIANT > > +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. > > +#[repr(transparent)] > > +pub struct IoUringCmd { > > + /// An opaque wrapper containing the actual `io_uring_cmd` data. > > + inner: Opaque<bindings::io_uring_cmd>, > > +} > > + w> > +impl IoUringCmd { > > + /// Returns the cmd_op with associated with the `io_uring_cmd`. > > + #[inline] > > + pub fn cmd_op(&self) -> u32 { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + unsafe { (*self.inner.get()).cmd_op } > > + } > > + > > + /// Returns the flags with associated with the `io_uring_cmd`. > > + #[inline] > > + pub fn flags(&self) -> u32 { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + unsafe { (*self.inner.get()).flags } > > + } > > + > > + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd > > + /// > > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > > + #[inline] > > + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let inner = unsafe { &mut *self.inner.get() }; > > Why does this reference need to be mutable? It don't need to be mutable. This could be borrow without mutable. > > > + > > + let len = size_of::<T>(); > > + if len > inner.pdu.len() { > > + return Err(EFAULT); > > + } > > + > > + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); > > Why is the intermediate MaybeUninit necessary? Would > core::ptr::read_unaligned() not work? It's okay to use `read_unaligned`. It would be simpler than now. Thanks. > > > + let ptr = &raw mut inner.pdu as *const c_void; > > + > > + // SAFETY: > > + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > > + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > > + // size of `T` is smaller than pdu size. > > + unsafe { > > + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > > + } > > + > > + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > > + // `FromBytes`, any bit-pattern is a valid value for this type. > > + Ok(unsafe { out.assume_init() }) > > + } > > + > > + /// Writes the provided `value` to `pdu` in uring_cmd `self` > > + /// > > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > > + #[inline] > > + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let inner = unsafe { &mut *self.inner.get() }; > > + > > + let len = size_of::<T>(); > > + if len > inner.pdu.len() { > > + return Err(EFAULT); > > + } > > + > > + let src = (value as *const T).cast::<c_void>(); > > + let dst = &raw mut inner.pdu as *mut c_void; > > + > > + // SAFETY: > > + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > > + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > > + // * It's safe to copy because size of `T` is no more than len of pdu. > > + unsafe { > > + core::ptr::copy_nonoverlapping(src, dst, len); > > + } > > + > > + Ok(()) > > + } > > + > > + /// Constructs a new [`IoUringCmd`] from a raw `io_uring_cmd` > > + /// > > + /// # Safety > > + /// > > + /// The caller must guarantee that: > > + /// - `ptr` is non-null, properly aligned, and points to a valid > > + /// `bindings::io_uring_cmd`. > > + /// - The pointed-to memory remains initialized and valid for the entire > > + /// lifetime `'a` of the returned reference. > > + /// - While the returned `Pin<&'a mut IoUringCmd>` is alive, the underlying > > + /// object is **not moved** (pinning requirement). > > + /// - **Aliasing rules:** the returned `&mut` has **exclusive** access to the same > > + /// object for its entire lifetime: > > + /// - No other `&mut` **or** `&` references to the same `io_uring_cmd` may be > > + /// alive at the same time. > > + /// - There must be no concurrent reads/writes through raw pointers, FFI, or > > + /// other kernel paths to the same object during this lifetime. > > + /// - If the object can be touched from other contexts (e.g. IRQ/another CPU), > > + /// the caller must provide synchronization to uphold this exclusivity. > > + /// - This function relies on `IoUringCmd` being `repr(transparent)` over > > + /// `bindings::io_uring_cmd` so the cast preserves layout. > > + #[inline] > > + pub unsafe fn from_raw<'a>(ptr: *mut bindings::io_uring_cmd) -> Pin<&'a mut IoUringCmd> { > > + // SAFETY: > > + // * The caller guarantees that the pointer is not dangling and stays > > + // valid for the duration of 'a. > > + // * The cast is okay because `IoUringCmd` is `repr(transparent)` and > > + // has the same memory layout as `bindings::io_uring_cmd`. > > + // * The returned `Pin` ensures that the object cannot be moved, which > > + // is required because the kernel may hold pointers to this memory > > + // location and moving it would invalidate those pointers. > > + unsafe { Pin::new_unchecked(&mut *ptr.cast()) } > > + } > > + > > + /// Returns the file that referenced by uring cmd self. > > + #[inline] > > + pub fn file(&self) -> &File { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let file = unsafe { (*self.inner.get()).file }; > > + > > + // SAFETY: ?> > + // * The `file` points valid file. > > + // * refcount is positive after submission queue entry issued. > > + // * There is no active fdget_pos region on the file on this thread. > > + unsafe { File::from_raw_file(file) } > > + } > > + > > + /// Returns an reference to the [`IoUringSqe`] associated with this command. > > + #[inline] > > + pub fn sqe(&self) -> &IoUringSqe { > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > + let sqe = unsafe { (*self.inner.get()).sqe }; > > + // SAFETY: The call guarantees that the `sqe` points valid io_uring_sqe. > > + unsafe { IoUringSqe::from_raw(sqe) } > > + } > > + > > + /// Completes an this [`IoUringCmd`] request that was previously queued. > > + /// > > + /// # Safety > > + /// > > + /// - This function must be called **only** for a command whose `uring_cmd` > > + /// handler previously returned **`-EIOCBQUEUED`** to io_uring. > > + /// > > + /// # Parameters > > + /// > > + /// - `ret`: Result to return to userspace. > > + /// - `res2`: Extra for big completion queue entry `IORING_SETUP_CQE32`. > > + /// - `issue_flags`: Flags associated with this request, typically the same > > + /// as those passed to the `uring_cmd` handler. > > + #[inline] > > + pub fn done(self: Pin<&mut IoUringCmd>, ret: Result<i32>, res2: u64, issue_flags: u32) { > > Since this takes the IoUringCmd by reference, it allows a single > io_uring_cmd to be completed multiple times, which would not be safe. > This method probably needs to take ownership of the IoUringCmd. Even > better would be to enforce at compile time that the number of times > IoUringCmd::done() is called matches the return value from > uring_cmd(): io_uring_cmd_done() may only be called if uring_cmd() > returns -EIOCBQUEUED; -EAGAIN will result in another call to > uring_cmd() and all other return values synchronously complete the > io_uring_cmd. > > It's also not safe for the caller to pass an arbitrary value for > issue_flags here; it needs to exactly match what was passed into > uring_cmd(). Maybe we could introduce a type that couples the > IoUringCmd and issue_flags passed to uring_cmd()? Agreed. We could introduce a new type like below pub struct IoUringCmd<'a> { cmd: Pin<&'a mut IoUringCmdInner>, issue_flags: IssueFlags, } Is `io_uring_cmd_done` should be called in iopoll callback? If so, it's better to make new type for iopoll and move this function to the type. > > > + let ret = from_result(|| ret) as isize; > > + // SAFETY: The call guarantees that `self.inner` is not dangling and stays valid > > + unsafe { > > + bindings::io_uring_cmd_done(self.inner.get(), ret, res2, issue_flags); > > + } > > + } > > +} > > + > > +/// A Rust abstraction for the Linux kernel's `io_uring_sqe` structure. > > +/// > > +/// This structure is a safe, opaque wrapper around the raw C [`io_uring_sqe`](srctree/include/uapi/linux/io_uring.h) > > +/// binding from the Linux kernel. It represents a Submission Queue Entry > > +/// used in io_uring operations within the kernel. > > +/// > > +/// # Type Safety > > +/// > > +/// The `#[repr(transparent)]` attribute ensures that this wrapper has > > +/// the same memory layout as the underlying `io_uring_sqe` structure, > > +/// allowing it to be safely transmuted between the two representations. > > +/// > > +/// # Fields > > +/// > > +/// * `inner` - An opaque wrapper containing the actual `io_uring_sqe` data. > > +/// The `Opaque` type prevents direct access to the internal > > +/// structure fields, ensuring memory safety and encapsulation. > > +/// > > +/// # Usage > > +/// > > +/// This type represents a submission queue entry that describes an I/O > > +/// operation to be executed by the io_uring subsystem. It contains > > +/// information such as the operation type, file descriptor, buffer > > +/// pointers, and other operation-specific data. > > +/// > > +/// Users can obtain this type from [`IoUringCmd::sqe()`] method, which > > +/// extracts the submission queue entry associated with a command. > > +/// > > +/// This type should not be constructed or manipulated directly by > > +/// kernel module developers. > > +/// > > +/// # INVARIANT > > +/// - `self.inner` always points to a valid, live `bindings::io_uring_sqe`. > > +#[repr(transparent)] > > +pub struct IoUringSqe { > > + inner: Opaque<bindings::io_uring_sqe>, > > +} > > + > > +impl IoUringSqe { > > + /// Reads and interprets the `cmd` field of an `bindings::io_uring_sqe` as a value of type `T`. > > + /// > > + /// # Safety & Invariants > > + /// - Construction of `T` is delegated to `FromBytes`, which guarantees that `T` has no > > + /// invalid bit patterns and can be safely reconstructed from raw bytes. > > + /// - **Limitation:** This implementation does not support `IORING_SETUP_SQE128` (larger SQE entries). > > + /// Only the standard `io_uring_sqe` layout is handled here. > > + /// > > + /// # Errors > > + /// * Returns `EINVAL` if the `self` does not hold a `opcode::URING_CMD`. > > + /// * Returns `EFAULT` if the command buffer is smaller than the requested type `T`. > > + /// > > + /// # Returns > > + /// * On success, returns a `T` deserialized from the `cmd`. > > + /// * On failure, returns an appropriate error as described above. > > + pub fn cmd_data<T: FromBytes>(&self) -> Result<T> { > > + // SAFETY: `self.inner` guaranteed by the type invariant to point > > + // to a live `io_uring_sqe`, so dereferencing is safe. > > + let sqe = unsafe { &*self.inner.get() }; > > + > > + if u32::from(sqe.opcode) != opcode::URING_CMD { > > io_uring opcodes are u8 values. Can the URING_CMD constant be made a > u8 instead of converting sqe.opcode here? > > The read of the opcode should also be using read_volatile(), as it may > lie in the userspace-mapped SQE region, which could be concurrently > written by another userspace thread. That would probably be buggy > behavior on the userspace side, but it can cause undefined behavior on > the kernel side without a volatile read, as the compiler could choose > to re-read the value multiple times assuming it will get the same > value each time. Thanks, opcode should be read with read_volatile(). And I would introduce new type for opcode. > > > + return Err(EINVAL); > > + } > > + > > + // SAFETY: Accessing the `sqe.cmd` union field is safe because we've > > + // verified that `sqe.opcode == IORING_OP_URING_CMD`, which guarantees > > + // that this union variant is initialized and valid. > > The opcode check isn't necessary. It doesn't provide any assurances > that this variant of the union is actually initialized, since a buggy > userspace process could set the opcode without initializing the cmd > field. It's always valid to access this memory since it's part of the > SQE region created at io_uring setup time. So I would drop the opcode > check. > > > + let cmd = unsafe { sqe.__bindgen_anon_6.cmd.as_ref() }; > > + let cmd_len = size_of_val(&sqe.__bindgen_anon_6.bindgen_union_field); > > + > > + if cmd_len < size_of::<T>() { > > + return Err(EFAULT); > > + } > > + > > + let cmd_ptr = cmd.as_ptr() as *mut T; > > + > > + // SAFETY: `cmd_ptr` is valid from `self.inner` which is guaranteed by > > + // type variant. And also it points to initialized `T` from userspace. > > + let ret = unsafe { core::ptr::read_unaligned(cmd_ptr) }; > > Similarly, needs to be volatile. The C uring_cmd() implementations use > READ_ONCE() to read the cmd field. Okay, This should use read_volatile too. Thanks, Sidong > > Best, > Caleb > > > + > > + Ok(ret) > > + } > > + > > + /// Constructs a new `IoUringSqe` from a raw `io_uring_sqe`. > > + /// > > + /// # Safety > > + /// > > + /// The caller must guarantee that: > > + /// - `ptr` is non-null, properly aligned, and points to a valid initialized > > + /// `bindings::io_uring_sqe`. > > + /// - The pointed-to memory remains valid (not freed or repurposed) for the > > + /// entire lifetime `'a` of the returned reference. > > + /// - **Aliasing rules (for `&T`):** while the returned `&'a IoUringSqe` is > > + /// alive, there must be **no mutable access** to the same object through any > > + /// path (no `&mut`, no raw-pointer writes, no FFI/IRQ/other-CPU writers). > > + /// Multiple `&` is fine **only if all of them are read-only** for the entire > > + /// overlapping lifetime. > > + /// - This relies on `IoUringSqe` being `repr(transparent)` over > > + /// `bindings::io_uring_sqe`, so the cast preserves layout. > > + #[inline] > > + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { > > + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the > > + // duration of 'a. The cast is okay because `IoUringSqe` is `repr(transparent)` and has the > > + // same memory layout as `bindings::io_uring_sqe`. > > + unsafe { &*ptr.cast() } > > + } > > +} > > diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs > > index ed53169e795c..d38cf7137401 100644 > > --- a/rust/kernel/lib.rs > > +++ b/rust/kernel/lib.rs > > @@ -91,6 +91,7 @@ > > pub mod fs; > > pub mod init; > > pub mod io; > > +pub mod io_uring; > > pub mod ioctl; > > pub mod jump_label; > > #[cfg(CONFIG_KUNIT)] > > -- > > 2.43.0 > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd 2025-09-02 11:11 ` Sidong Yang @ 2025-09-02 15:41 ` Caleb Sander Mateos 0 siblings, 0 replies; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-02 15:41 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Tue, Sep 2, 2025 at 4:11 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Mon, Sep 01, 2025 at 06:11:01PM -0700, Caleb Sander Mateos wrote: > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > Implment the io-uring abstractions needed for miscdevicecs and other > > > > typo: "Implement" > > Thanks. > > > > > char devices that have io-uring command interface. > > > > > > * `io_uring::IoUringCmd` : Rust abstraction for `io_uring_cmd` which > > > will be used as arg for `MiscDevice::uring_cmd()`. And driver can get > > > `cmd_op` sent from userspace. Also it has `flags` which includes option > > > that is reissued. > > > > > > * `io_uring::IoUringSqe` : Rust abstraction for `io_uring_sqe` which > > > could be get from `IoUringCmd::sqe()` and driver could get `cmd_data` > > > from userspace. Also `IoUringSqe` has more data like opcode could be used in > > > driver. > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > --- > > > rust/kernel/io_uring.rs | 306 ++++++++++++++++++++++++++++++++++++++++ > > > rust/kernel/lib.rs | 1 + > > > 2 files changed, 307 insertions(+) > > > create mode 100644 rust/kernel/io_uring.rs > > > > > > diff --git a/rust/kernel/io_uring.rs b/rust/kernel/io_uring.rs > > > new file mode 100644 > > > index 000000000000..61e88bdf4e42 > > > --- /dev/null > > > +++ b/rust/kernel/io_uring.rs > > > @@ -0,0 +1,306 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +// SPDX-FileCopyrightText: (C) 2025 Furiosa AI > > > + > > > +//! Abstractions for io-uring. > > > +//! > > > +//! This module provides types for implements io-uring interface for char device. > > > +//! > > > +//! > > > +//! C headers: [`include/linux/io_uring/cmd.h`](srctree/include/linux/io_uring/cmd.h) and > > > +//! [`include/linux/io_uring/io_uring.h`](srctree/include/linux/io_uring/io_uring.h) > > > + > > > +use core::{mem::MaybeUninit, pin::Pin}; > > > + > > > +use crate::error::from_result; > > > +use crate::transmute::{AsBytes, FromBytes}; > > > +use crate::{fs::File, types::Opaque}; > > > + > > > +use crate::prelude::*; > > > + > > > +/// io-uring opcode > > > +pub mod opcode { > > > + /// opcode for uring cmd > > > + pub const URING_CMD: u32 = bindings::io_uring_op_IORING_OP_URING_CMD; > > > +} > > > + > > > +/// A Rust abstraction for the Linux kernel's `io_uring_cmd` structure. > > > +/// > > > +/// This structure is a safe, opaque wrapper around the raw C `io_uring_cmd` > > > +/// binding from the Linux kernel. It represents a command structure used > > > +/// in io_uring operations within the kernel. > > > +/// This type is used internally by the io_uring subsystem to manage > > > +/// asynchronous I/O commands. > > > +/// > > > +/// This type should not be constructed or manipulated directly by > > > +/// kernel module developers. > > > +/// > > > +/// # INVARIANT > > > +/// - `self.inner` always points to a valid, live `bindings::io_uring_cmd`. > > > +#[repr(transparent)] > > > +pub struct IoUringCmd { > > > + /// An opaque wrapper containing the actual `io_uring_cmd` data. > > > + inner: Opaque<bindings::io_uring_cmd>, > > > +} > > > + > w> > +impl IoUringCmd { > > > + /// Returns the cmd_op with associated with the `io_uring_cmd`. > > > + #[inline] > > > + pub fn cmd_op(&self) -> u32 { > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > > + unsafe { (*self.inner.get()).cmd_op } > > > + } > > > + > > > + /// Returns the flags with associated with the `io_uring_cmd`. > > > + #[inline] > > > + pub fn flags(&self) -> u32 { > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > > + unsafe { (*self.inner.get()).flags } > > > + } > > > + > > > + /// Reads protocol data unit as `T` that impl `FromBytes` from uring cmd > > > + /// > > > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > > > + #[inline] > > > + pub fn read_pdu<T: FromBytes>(&self) -> Result<T> { > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > > + let inner = unsafe { &mut *self.inner.get() }; > > > > Why does this reference need to be mutable? > > It don't need to be mutable. This could be borrow without mutable. > > > > > + > > > + let len = size_of::<T>(); > > > + if len > inner.pdu.len() { > > > + return Err(EFAULT); > > > + } > > > + > > > + let mut out: MaybeUninit<T> = MaybeUninit::uninit(); > > > > Why is the intermediate MaybeUninit necessary? Would > > core::ptr::read_unaligned() not work? > > It's okay to use `read_unaligned`. It would be simpler than now. Thanks. > > > > > > + let ptr = &raw mut inner.pdu as *const c_void; > > > + > > > + // SAFETY: > > > + // * The `ptr` is valid pointer from `self.inner` that is guaranteed by type invariant. > > > + // * The `out` is valid pointer that points `T` which impls `FromBytes` and checked > > > + // size of `T` is smaller than pdu size. > > > + unsafe { > > > + core::ptr::copy_nonoverlapping(ptr, out.as_mut_ptr().cast::<c_void>(), len); > > > + } > > > + > > > + // SAFETY: The read above has initialized all bytes in `out`, and since `T` implements > > > + // `FromBytes`, any bit-pattern is a valid value for this type. > > > + Ok(unsafe { out.assume_init() }) > > > + } > > > + > > > + /// Writes the provided `value` to `pdu` in uring_cmd `self` > > > + /// > > > + /// Fails with [`EFAULT`] if size of `T` is bigger than pdu size. > > > + #[inline] > > > + pub fn write_pdu<T: AsBytes>(&mut self, value: &T) -> Result<()> { > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > > + let inner = unsafe { &mut *self.inner.get() }; > > > + > > > + let len = size_of::<T>(); > > > + if len > inner.pdu.len() { > > > + return Err(EFAULT); > > > + } > > > + > > > + let src = (value as *const T).cast::<c_void>(); > > > + let dst = &raw mut inner.pdu as *mut c_void; > > > + > > > + // SAFETY: > > > + // * The `src` is points valid memory that is guaranteed by `T` impls `AsBytes` > > > + // * The `dst` is valid. It's from `self.inner` that is guaranteed by type invariant. > > > + // * It's safe to copy because size of `T` is no more than len of pdu. > > > + unsafe { > > > + core::ptr::copy_nonoverlapping(src, dst, len); > > > + } > > > + > > > + Ok(()) > > > + } > > > + > > > + /// Constructs a new [`IoUringCmd`] from a raw `io_uring_cmd` > > > + /// > > > + /// # Safety > > > + /// > > > + /// The caller must guarantee that: > > > + /// - `ptr` is non-null, properly aligned, and points to a valid > > > + /// `bindings::io_uring_cmd`. > > > + /// - The pointed-to memory remains initialized and valid for the entire > > > + /// lifetime `'a` of the returned reference. > > > + /// - While the returned `Pin<&'a mut IoUringCmd>` is alive, the underlying > > > + /// object is **not moved** (pinning requirement). > > > + /// - **Aliasing rules:** the returned `&mut` has **exclusive** access to the same > > > + /// object for its entire lifetime: > > > + /// - No other `&mut` **or** `&` references to the same `io_uring_cmd` may be > > > + /// alive at the same time. > > > + /// - There must be no concurrent reads/writes through raw pointers, FFI, or > > > + /// other kernel paths to the same object during this lifetime. > > > + /// - If the object can be touched from other contexts (e.g. IRQ/another CPU), > > > + /// the caller must provide synchronization to uphold this exclusivity. > > > + /// - This function relies on `IoUringCmd` being `repr(transparent)` over > > > + /// `bindings::io_uring_cmd` so the cast preserves layout. > > > + #[inline] > > > + pub unsafe fn from_raw<'a>(ptr: *mut bindings::io_uring_cmd) -> Pin<&'a mut IoUringCmd> { > > > + // SAFETY: > > > + // * The caller guarantees that the pointer is not dangling and stays > > > + // valid for the duration of 'a. > > > + // * The cast is okay because `IoUringCmd` is `repr(transparent)` and > > > + // has the same memory layout as `bindings::io_uring_cmd`. > > > + // * The returned `Pin` ensures that the object cannot be moved, which > > > + // is required because the kernel may hold pointers to this memory > > > + // location and moving it would invalidate those pointers. > > > + unsafe { Pin::new_unchecked(&mut *ptr.cast()) } > > > + } > > > + > > > + /// Returns the file that referenced by uring cmd self. > > > + #[inline] > > > + pub fn file(&self) -> &File { > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > > + let file = unsafe { (*self.inner.get()).file }; > > > + > > > + // SAFETY: > ?> > + // * The `file` points valid file. > > > + // * refcount is positive after submission queue entry issued. > > > + // * There is no active fdget_pos region on the file on this thread. > > > + unsafe { File::from_raw_file(file) } > > > + } > > > + > > > + /// Returns an reference to the [`IoUringSqe`] associated with this command. > > > + #[inline] > > > + pub fn sqe(&self) -> &IoUringSqe { > > > + // SAFETY: `self.inner` is guaranteed by the type invariant to point > > > + // to a live `io_uring_cmd`, so dereferencing is safe. > > > + let sqe = unsafe { (*self.inner.get()).sqe }; > > > + // SAFETY: The call guarantees that the `sqe` points valid io_uring_sqe. > > > + unsafe { IoUringSqe::from_raw(sqe) } > > > + } > > > + > > > + /// Completes an this [`IoUringCmd`] request that was previously queued. > > > + /// > > > + /// # Safety > > > + /// > > > + /// - This function must be called **only** for a command whose `uring_cmd` > > > + /// handler previously returned **`-EIOCBQUEUED`** to io_uring. > > > + /// > > > + /// # Parameters > > > + /// > > > + /// - `ret`: Result to return to userspace. > > > + /// - `res2`: Extra for big completion queue entry `IORING_SETUP_CQE32`. > > > + /// - `issue_flags`: Flags associated with this request, typically the same > > > + /// as those passed to the `uring_cmd` handler. > > > + #[inline] > > > + pub fn done(self: Pin<&mut IoUringCmd>, ret: Result<i32>, res2: u64, issue_flags: u32) { > > > > Since this takes the IoUringCmd by reference, it allows a single > > io_uring_cmd to be completed multiple times, which would not be safe. > > This method probably needs to take ownership of the IoUringCmd. Even > > better would be to enforce at compile time that the number of times > > IoUringCmd::done() is called matches the return value from > > uring_cmd(): io_uring_cmd_done() may only be called if uring_cmd() > > returns -EIOCBQUEUED; -EAGAIN will result in another call to > > uring_cmd() and all other return values synchronously complete the > > io_uring_cmd. > > > > It's also not safe for the caller to pass an arbitrary value for > > issue_flags here; it needs to exactly match what was passed into > > uring_cmd(). Maybe we could introduce a type that couples the > > IoUringCmd and issue_flags passed to uring_cmd()? > > Agreed. We could introduce a new type like below > > pub struct IoUringCmd<'a> { > cmd: Pin<&'a mut IoUringCmdInner>, > issue_flags: IssueFlags, > } Yeah, that looks reasonable. > > Is `io_uring_cmd_done` should be called in iopoll callback? If so, it's > better to make new type for iopoll and move this function to the type. I'm not too familiar with IOPOLL. As far as I'm aware, only the NVMe passthru uring_cmd() implementation supports it? It looks like commit 9ce6c9875f3e ("nvme: always punt polled uring_cmd end_io work to task_work") removed the IOPOLL special case, so all NVMe passthru completions go through nvme_uring_task_cb(), which calls io_uring_cmd_done(). > > > > > > + let ret = from_result(|| ret) as isize; > > > + // SAFETY: The call guarantees that `self.inner` is not dangling and stays valid > > > + unsafe { > > > + bindings::io_uring_cmd_done(self.inner.get(), ret, res2, issue_flags); > > > + } > > > + } > > > +} > > > + > > > +/// A Rust abstraction for the Linux kernel's `io_uring_sqe` structure. > > > +/// > > > +/// This structure is a safe, opaque wrapper around the raw C [`io_uring_sqe`](srctree/include/uapi/linux/io_uring.h) > > > +/// binding from the Linux kernel. It represents a Submission Queue Entry > > > +/// used in io_uring operations within the kernel. > > > +/// > > > +/// # Type Safety > > > +/// > > > +/// The `#[repr(transparent)]` attribute ensures that this wrapper has > > > +/// the same memory layout as the underlying `io_uring_sqe` structure, > > > +/// allowing it to be safely transmuted between the two representations. > > > +/// > > > +/// # Fields > > > +/// > > > +/// * `inner` - An opaque wrapper containing the actual `io_uring_sqe` data. > > > +/// The `Opaque` type prevents direct access to the internal > > > +/// structure fields, ensuring memory safety and encapsulation. > > > +/// > > > +/// # Usage > > > +/// > > > +/// This type represents a submission queue entry that describes an I/O > > > +/// operation to be executed by the io_uring subsystem. It contains > > > +/// information such as the operation type, file descriptor, buffer > > > +/// pointers, and other operation-specific data. > > > +/// > > > +/// Users can obtain this type from [`IoUringCmd::sqe()`] method, which > > > +/// extracts the submission queue entry associated with a command. > > > +/// > > > +/// This type should not be constructed or manipulated directly by > > > +/// kernel module developers. > > > +/// > > > +/// # INVARIANT > > > +/// - `self.inner` always points to a valid, live `bindings::io_uring_sqe`. > > > +#[repr(transparent)] > > > +pub struct IoUringSqe { > > > + inner: Opaque<bindings::io_uring_sqe>, > > > +} > > > + > > > +impl IoUringSqe { > > > + /// Reads and interprets the `cmd` field of an `bindings::io_uring_sqe` as a value of type `T`. > > > + /// > > > + /// # Safety & Invariants > > > + /// - Construction of `T` is delegated to `FromBytes`, which guarantees that `T` has no > > > + /// invalid bit patterns and can be safely reconstructed from raw bytes. > > > + /// - **Limitation:** This implementation does not support `IORING_SETUP_SQE128` (larger SQE entries). > > > + /// Only the standard `io_uring_sqe` layout is handled here. > > > + /// > > > + /// # Errors > > > + /// * Returns `EINVAL` if the `self` does not hold a `opcode::URING_CMD`. > > > + /// * Returns `EFAULT` if the command buffer is smaller than the requested type `T`. > > > + /// > > > + /// # Returns > > > + /// * On success, returns a `T` deserialized from the `cmd`. > > > + /// * On failure, returns an appropriate error as described above. > > > + pub fn cmd_data<T: FromBytes>(&self) -> Result<T> { > > > + // SAFETY: `self.inner` guaranteed by the type invariant to point > > > + // to a live `io_uring_sqe`, so dereferencing is safe. > > > + let sqe = unsafe { &*self.inner.get() }; > > > + > > > + if u32::from(sqe.opcode) != opcode::URING_CMD { > > > > io_uring opcodes are u8 values. Can the URING_CMD constant be made a > > u8 instead of converting sqe.opcode here? > > > > The read of the opcode should also be using read_volatile(), as it may > > lie in the userspace-mapped SQE region, which could be concurrently > > written by another userspace thread. That would probably be buggy > > behavior on the userspace side, but it can cause undefined behavior on > > the kernel side without a volatile read, as the compiler could choose > > to re-read the value multiple times assuming it will get the same > > value each time. > > Thanks, opcode should be read with read_volatile(). And I would introduce new type > for opcode. I would rather remove the opcode check entirely. As mentioned below, it's not required for safety, it doesn't actually ensure that userspace has written to the cmd field, and it adds overhead in a hot path. Best, Caleb > > > > > > + return Err(EINVAL); > > > + } > > > + > > > + // SAFETY: Accessing the `sqe.cmd` union field is safe because we've > > > + // verified that `sqe.opcode == IORING_OP_URING_CMD`, which guarantees > > > + // that this union variant is initialized and valid. > > > > The opcode check isn't necessary. It doesn't provide any assurances > > that this variant of the union is actually initialized, since a buggy > > userspace process could set the opcode without initializing the cmd > > field. It's always valid to access this memory since it's part of the > > SQE region created at io_uring setup time. So I would drop the opcode > > check. > > > > > + let cmd = unsafe { sqe.__bindgen_anon_6.cmd.as_ref() }; > > > + let cmd_len = size_of_val(&sqe.__bindgen_anon_6.bindgen_union_field); > > > + > > > + if cmd_len < size_of::<T>() { > > > + return Err(EFAULT); > > > + } > > > + > > > + let cmd_ptr = cmd.as_ptr() as *mut T; > > > + > > > + // SAFETY: `cmd_ptr` is valid from `self.inner` which is guaranteed by > > > + // type variant. And also it points to initialized `T` from userspace. > > > + let ret = unsafe { core::ptr::read_unaligned(cmd_ptr) }; > > > > Similarly, needs to be volatile. The C uring_cmd() implementations use > > READ_ONCE() to read the cmd field. > > Okay, This should use read_volatile too. > > Thanks, > Sidong > > > > Best, > > Caleb > > > > > + > > > + Ok(ret) > > > + } > > > + > > > + /// Constructs a new `IoUringSqe` from a raw `io_uring_sqe`. > > > + /// > > > + /// # Safety > > > + /// > > > + /// The caller must guarantee that: > > > + /// - `ptr` is non-null, properly aligned, and points to a valid initialized > > > + /// `bindings::io_uring_sqe`. > > > + /// - The pointed-to memory remains valid (not freed or repurposed) for the > > > + /// entire lifetime `'a` of the returned reference. > > > + /// - **Aliasing rules (for `&T`):** while the returned `&'a IoUringSqe` is > > > + /// alive, there must be **no mutable access** to the same object through any > > > + /// path (no `&mut`, no raw-pointer writes, no FFI/IRQ/other-CPU writers). > > > + /// Multiple `&` is fine **only if all of them are read-only** for the entire > > > + /// overlapping lifetime. > > > + /// - This relies on `IoUringSqe` being `repr(transparent)` over > > > + /// `bindings::io_uring_sqe`, so the cast preserves layout. > > > + #[inline] > > > + pub unsafe fn from_raw<'a>(ptr: *const bindings::io_uring_sqe) -> &'a IoUringSqe { > > > + // SAFETY: The caller guarantees that the pointer is not dangling and stays valid for the > > > + // duration of 'a. The cast is okay because `IoUringSqe` is `repr(transparent)` and has the > > > + // same memory layout as `bindings::io_uring_sqe`. > > > + unsafe { &*ptr.cast() } > > > + } > > > +} > > > diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs > > > index ed53169e795c..d38cf7137401 100644 > > > --- a/rust/kernel/lib.rs > > > +++ b/rust/kernel/lib.rs > > > @@ -91,6 +91,7 @@ > > > pub mod fs; > > > pub mod init; > > > pub mod io; > > > +pub mod io_uring; > > > pub mod ioctl; > > > pub mod jump_label; > > > #[cfg(CONFIG_KUNIT)] > > > -- > > > 2.43.0 > > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support 2025-08-22 12:55 [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd Sidong Yang ` (2 preceding siblings ...) 2025-08-22 12:55 ` [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd Sidong Yang @ 2025-08-22 12:55 ` Sidong Yang 2025-09-02 1:12 ` Caleb Sander Mateos 2025-08-22 12:55 ` [RFC PATCH v3 5/5] samples: rust: Add `uring_cmd` example to `rust_misc_device` Sidong Yang 4 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-08-22 12:55 UTC (permalink / raw) To: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin Cc: Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring, Sidong Yang This patch introduces support for `uring_cmd` to the `miscdevice` framework. This is achieved by adding a new `uring_cmd` method to the `MiscDevice` trait and wiring it up to the corresponding `file_operations` entry. The `uring_cmd` function provides a mechanism for `io_uring` to issue commands to a device driver. The new `uring_cmd` method takes the device, an `IoUringCmd` object, and issue flags as arguments. The `IoUringCmd` object is a safe Rust abstraction around the raw `io_uring_cmd` struct. To enable `uring_cmd` for a specific misc device, the `HAS_URING_CMD` constant must be set to `true` in the `MiscDevice` implementation. Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> --- rust/kernel/miscdevice.rs | 53 ++++++++++++++++++++++++++++++++++++++- 1 file changed, 52 insertions(+), 1 deletion(-) diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs index 6373fe183b27..fcef579218ba 100644 --- a/rust/kernel/miscdevice.rs +++ b/rust/kernel/miscdevice.rs @@ -11,9 +11,10 @@ use crate::{ bindings, device::Device, - error::{to_result, Error, Result, VTABLE_DEFAULT_ERROR}, + error::{from_result, to_result, Error, Result, VTABLE_DEFAULT_ERROR}, ffi::{c_int, c_long, c_uint, c_ulong}, fs::File, + io_uring::IoUringCmd, mm::virt::VmaNew, prelude::*, seq_file::SeqFile, @@ -180,6 +181,21 @@ fn show_fdinfo( ) { build_error!(VTABLE_DEFAULT_ERROR) } + + /// Handler for uring_cmd. + /// + /// This function is invoked when userspace process submits an uring_cmd op + /// on io-uring submission queue. The `device` is borrowed instance defined + /// by `Ptr`. The `io_uring_cmd` would be used for get arguments cmd_op, sqe, + /// cmd_data. The `issue_flags` is the flags includes options for uring_cmd. + /// The options are listed in `kernel::io_uring::cmd_flags`. + fn uring_cmd( + _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, + _io_uring_cmd: Pin<&mut IoUringCmd>, + _issue_flags: u32, + ) -> Result<i32> { + build_error!(VTABLE_DEFAULT_ERROR) + } } /// A vtable for the file operations of a Rust miscdevice. @@ -337,6 +353,36 @@ impl<T: MiscDevice> MiscdeviceVTable<T> { T::show_fdinfo(device, m, file); } + /// # Safety + /// + /// The caller must ensure that: + /// - The pointer `ioucmd` is not null and points to a valid `bindings::io_uring_cmd`. + unsafe extern "C" fn uring_cmd( + ioucmd: *mut bindings::io_uring_cmd, + issue_flags: ffi::c_uint, + ) -> c_int { + // SAFETY: `file` referenced by `ioucmd` is valid pointer. It's assigned in + // uring cmd preparation. So dereferencing is safe. + let raw_file = unsafe { (*ioucmd).file }; + + // SAFETY: `private_data` is guaranteed that it has valid pointer after + // this file opened. So dereferencing is safe. + let private = unsafe { (*raw_file).private_data }.cast(); + + // SAFETY: `ioucmd` is not null and points to valid memory `bindings::io_uring_cmd` + // and the memory pointed by `ioucmd` is valid and will not be moved or + // freed for the lifetime of returned value `ioucmd` + let ioucmd = unsafe { IoUringCmd::from_raw(ioucmd) }; + + // SAFETY: This call is safe because `private` is returned by + // `into_foreign` in [`open`]. And it's guaranteed + // that `from_foreign` is called by [`release`] after the end of + // the lifetime of `device` + let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) }; + + from_result(|| T::uring_cmd(device, ioucmd, issue_flags)) + } + const VTABLE: bindings::file_operations = bindings::file_operations { open: Some(Self::open), release: Some(Self::release), @@ -359,6 +405,11 @@ impl<T: MiscDevice> MiscdeviceVTable<T> { } else { None }, + uring_cmd: if T::HAS_URING_CMD { + Some(Self::uring_cmd) + } else { + None + }, // SAFETY: All zeros is a valid value for `bindings::file_operations`. ..unsafe { MaybeUninit::zeroed().assume_init() } }; -- 2.43.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support 2025-08-22 12:55 ` [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support Sidong Yang @ 2025-09-02 1:12 ` Caleb Sander Mateos 2025-09-02 11:18 ` Sidong Yang 0 siblings, 1 reply; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-02 1:12 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > This patch introduces support for `uring_cmd` to the `miscdevice` > framework. This is achieved by adding a new `uring_cmd` method to the > `MiscDevice` trait and wiring it up to the corresponding > `file_operations` entry. > > The `uring_cmd` function provides a mechanism for `io_uring` to issue > commands to a device driver. > > The new `uring_cmd` method takes the device, an `IoUringCmd` object, > and issue flags as arguments. The `IoUringCmd` object is a safe Rust > abstraction around the raw `io_uring_cmd` struct. > > To enable `uring_cmd` for a specific misc device, the `HAS_URING_CMD` > constant must be set to `true` in the `MiscDevice` implementation. > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > --- > rust/kernel/miscdevice.rs | 53 ++++++++++++++++++++++++++++++++++++++- > 1 file changed, 52 insertions(+), 1 deletion(-) > > diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs > index 6373fe183b27..fcef579218ba 100644 > --- a/rust/kernel/miscdevice.rs > +++ b/rust/kernel/miscdevice.rs > @@ -11,9 +11,10 @@ > use crate::{ > bindings, > device::Device, > - error::{to_result, Error, Result, VTABLE_DEFAULT_ERROR}, > + error::{from_result, to_result, Error, Result, VTABLE_DEFAULT_ERROR}, > ffi::{c_int, c_long, c_uint, c_ulong}, > fs::File, > + io_uring::IoUringCmd, > mm::virt::VmaNew, > prelude::*, > seq_file::SeqFile, > @@ -180,6 +181,21 @@ fn show_fdinfo( > ) { > build_error!(VTABLE_DEFAULT_ERROR) > } > + > + /// Handler for uring_cmd. > + /// > + /// This function is invoked when userspace process submits an uring_cmd op > + /// on io-uring submission queue. The `device` is borrowed instance defined > + /// by `Ptr`. The `io_uring_cmd` would be used for get arguments cmd_op, sqe, > + /// cmd_data. The `issue_flags` is the flags includes options for uring_cmd. > + /// The options are listed in `kernel::io_uring::cmd_flags`. > + fn uring_cmd( > + _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, > + _io_uring_cmd: Pin<&mut IoUringCmd>, Passing the IoUringCmd by reference doesn't allow the uring_cmd() implementation to store it beyond the function return. That precludes any asynchronous uring_cmd() implementation, which is kind of the whole point of uring_cmd. I think uring_cmd() needs to transfer ownership of the IoUringCmd so the implementation can complete it asynchronously. Best, Caleb > + _issue_flags: u32, > + ) -> Result<i32> { > + build_error!(VTABLE_DEFAULT_ERROR) > + } > } > > /// A vtable for the file operations of a Rust miscdevice. > @@ -337,6 +353,36 @@ impl<T: MiscDevice> MiscdeviceVTable<T> { > T::show_fdinfo(device, m, file); > } > > + /// # Safety > + /// > + /// The caller must ensure that: > + /// - The pointer `ioucmd` is not null and points to a valid `bindings::io_uring_cmd`. > + unsafe extern "C" fn uring_cmd( > + ioucmd: *mut bindings::io_uring_cmd, > + issue_flags: ffi::c_uint, > + ) -> c_int { > + // SAFETY: `file` referenced by `ioucmd` is valid pointer. It's assigned in > + // uring cmd preparation. So dereferencing is safe. > + let raw_file = unsafe { (*ioucmd).file }; > + > + // SAFETY: `private_data` is guaranteed that it has valid pointer after > + // this file opened. So dereferencing is safe. > + let private = unsafe { (*raw_file).private_data }.cast(); > + > + // SAFETY: `ioucmd` is not null and points to valid memory `bindings::io_uring_cmd` > + // and the memory pointed by `ioucmd` is valid and will not be moved or > + // freed for the lifetime of returned value `ioucmd` > + let ioucmd = unsafe { IoUringCmd::from_raw(ioucmd) }; > + > + // SAFETY: This call is safe because `private` is returned by > + // `into_foreign` in [`open`]. And it's guaranteed > + // that `from_foreign` is called by [`release`] after the end of > + // the lifetime of `device` > + let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) }; > + > + from_result(|| T::uring_cmd(device, ioucmd, issue_flags)) > + } > + > const VTABLE: bindings::file_operations = bindings::file_operations { > open: Some(Self::open), > release: Some(Self::release), > @@ -359,6 +405,11 @@ impl<T: MiscDevice> MiscdeviceVTable<T> { > } else { > None > }, > + uring_cmd: if T::HAS_URING_CMD { > + Some(Self::uring_cmd) > + } else { > + None > + }, > // SAFETY: All zeros is a valid value for `bindings::file_operations`. > ..unsafe { MaybeUninit::zeroed().assume_init() } > }; > -- > 2.43.0 > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support 2025-09-02 1:12 ` Caleb Sander Mateos @ 2025-09-02 11:18 ` Sidong Yang 2025-09-02 15:53 ` Caleb Sander Mateos 0 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-09-02 11:18 UTC (permalink / raw) To: Caleb Sander Mateos Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Mon, Sep 01, 2025 at 06:12:44PM -0700, Caleb Sander Mateos wrote: > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > This patch introduces support for `uring_cmd` to the `miscdevice` > > framework. This is achieved by adding a new `uring_cmd` method to the > > `MiscDevice` trait and wiring it up to the corresponding > > `file_operations` entry. > > > > The `uring_cmd` function provides a mechanism for `io_uring` to issue > > commands to a device driver. > > > > The new `uring_cmd` method takes the device, an `IoUringCmd` object, > > and issue flags as arguments. The `IoUringCmd` object is a safe Rust > > abstraction around the raw `io_uring_cmd` struct. > > > > To enable `uring_cmd` for a specific misc device, the `HAS_URING_CMD` > > constant must be set to `true` in the `MiscDevice` implementation. > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > --- > > rust/kernel/miscdevice.rs | 53 ++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 52 insertions(+), 1 deletion(-) > > > > diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs > > index 6373fe183b27..fcef579218ba 100644 > > --- a/rust/kernel/miscdevice.rs > > +++ b/rust/kernel/miscdevice.rs > > @@ -11,9 +11,10 @@ > > use crate::{ > > bindings, > > device::Device, > > - error::{to_result, Error, Result, VTABLE_DEFAULT_ERROR}, > > + error::{from_result, to_result, Error, Result, VTABLE_DEFAULT_ERROR}, > > ffi::{c_int, c_long, c_uint, c_ulong}, > > fs::File, > > + io_uring::IoUringCmd, > > mm::virt::VmaNew, > > prelude::*, > > seq_file::SeqFile, > > @@ -180,6 +181,21 @@ fn show_fdinfo( > > ) { > > build_error!(VTABLE_DEFAULT_ERROR) > > } > > + > > + /// Handler for uring_cmd. > > + /// > > + /// This function is invoked when userspace process submits an uring_cmd op > > + /// on io-uring submission queue. The `device` is borrowed instance defined > > + /// by `Ptr`. The `io_uring_cmd` would be used for get arguments cmd_op, sqe, > > + /// cmd_data. The `issue_flags` is the flags includes options for uring_cmd. > > + /// The options are listed in `kernel::io_uring::cmd_flags`. > > + fn uring_cmd( > > + _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, > > + _io_uring_cmd: Pin<&mut IoUringCmd>, > > Passing the IoUringCmd by reference doesn't allow the uring_cmd() > implementation to store it beyond the function return. That precludes > any asynchronous uring_cmd() implementation, which is kind of the > whole point of uring_cmd. I think uring_cmd() needs to transfer > ownership of the IoUringCmd so the implementation can complete it > asynchronously. I didn't know that I can take IoUringCmd ownership and calling done(). In C implementation, is it safe to call `io_uring_cmd_done()` in any context? > > Best, > Caleb > > > + _issue_flags: u32, > > + ) -> Result<i32> { > > + build_error!(VTABLE_DEFAULT_ERROR) > > + } > > } > > > > /// A vtable for the file operations of a Rust miscdevice. > > @@ -337,6 +353,36 @@ impl<T: MiscDevice> MiscdeviceVTable<T> { > > T::show_fdinfo(device, m, file); > > } > > > > + /// # Safety > > + /// > > + /// The caller must ensure that: > > + /// - The pointer `ioucmd` is not null and points to a valid `bindings::io_uring_cmd`. > > + unsafe extern "C" fn uring_cmd( > > + ioucmd: *mut bindings::io_uring_cmd, > > + issue_flags: ffi::c_uint, > > + ) -> c_int { > > + // SAFETY: `file` referenced by `ioucmd` is valid pointer. It's assigned in > > + // uring cmd preparation. So dereferencing is safe. > > + let raw_file = unsafe { (*ioucmd).file }; > > + > > + // SAFETY: `private_data` is guaranteed that it has valid pointer after > > + // this file opened. So dereferencing is safe. > > + let private = unsafe { (*raw_file).private_data }.cast(); > > + > > + // SAFETY: `ioucmd` is not null and points to valid memory `bindings::io_uring_cmd` > > + // and the memory pointed by `ioucmd` is valid and will not be moved or > > + // freed for the lifetime of returned value `ioucmd` > > + let ioucmd = unsafe { IoUringCmd::from_raw(ioucmd) }; > > + > > + // SAFETY: This call is safe because `private` is returned by > > + // `into_foreign` in [`open`]. And it's guaranteed > > + // that `from_foreign` is called by [`release`] after the end of > > + // the lifetime of `device` > > + let device = unsafe { <T::Ptr as ForeignOwnable>::borrow(private) }; > > + > > + from_result(|| T::uring_cmd(device, ioucmd, issue_flags)) > > + } > > + > > const VTABLE: bindings::file_operations = bindings::file_operations { > > open: Some(Self::open), > > release: Some(Self::release), > > @@ -359,6 +405,11 @@ impl<T: MiscDevice> MiscdeviceVTable<T> { > > } else { > > None > > }, > > + uring_cmd: if T::HAS_URING_CMD { > > + Some(Self::uring_cmd) > > + } else { > > + None > > + }, > > // SAFETY: All zeros is a valid value for `bindings::file_operations`. > > ..unsafe { MaybeUninit::zeroed().assume_init() } > > }; > > -- > > 2.43.0 > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support 2025-09-02 11:18 ` Sidong Yang @ 2025-09-02 15:53 ` Caleb Sander Mateos 0 siblings, 0 replies; 33+ messages in thread From: Caleb Sander Mateos @ 2025-09-02 15:53 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Tue, Sep 2, 2025 at 4:18 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > On Mon, Sep 01, 2025 at 06:12:44PM -0700, Caleb Sander Mateos wrote: > > On Fri, Aug 22, 2025 at 5:56 AM Sidong Yang <sidong.yang@furiosa.ai> wrote: > > > > > > This patch introduces support for `uring_cmd` to the `miscdevice` > > > framework. This is achieved by adding a new `uring_cmd` method to the > > > `MiscDevice` trait and wiring it up to the corresponding > > > `file_operations` entry. > > > > > > The `uring_cmd` function provides a mechanism for `io_uring` to issue > > > commands to a device driver. > > > > > > The new `uring_cmd` method takes the device, an `IoUringCmd` object, > > > and issue flags as arguments. The `IoUringCmd` object is a safe Rust > > > abstraction around the raw `io_uring_cmd` struct. > > > > > > To enable `uring_cmd` for a specific misc device, the `HAS_URING_CMD` > > > constant must be set to `true` in the `MiscDevice` implementation. > > > > > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > > > --- > > > rust/kernel/miscdevice.rs | 53 ++++++++++++++++++++++++++++++++++++++- > > > 1 file changed, 52 insertions(+), 1 deletion(-) > > > > > > diff --git a/rust/kernel/miscdevice.rs b/rust/kernel/miscdevice.rs > > > index 6373fe183b27..fcef579218ba 100644 > > > --- a/rust/kernel/miscdevice.rs > > > +++ b/rust/kernel/miscdevice.rs > > > @@ -11,9 +11,10 @@ > > > use crate::{ > > > bindings, > > > device::Device, > > > - error::{to_result, Error, Result, VTABLE_DEFAULT_ERROR}, > > > + error::{from_result, to_result, Error, Result, VTABLE_DEFAULT_ERROR}, > > > ffi::{c_int, c_long, c_uint, c_ulong}, > > > fs::File, > > > + io_uring::IoUringCmd, > > > mm::virt::VmaNew, > > > prelude::*, > > > seq_file::SeqFile, > > > @@ -180,6 +181,21 @@ fn show_fdinfo( > > > ) { > > > build_error!(VTABLE_DEFAULT_ERROR) > > > } > > > + > > > + /// Handler for uring_cmd. > > > + /// > > > + /// This function is invoked when userspace process submits an uring_cmd op > > > + /// on io-uring submission queue. The `device` is borrowed instance defined > > > + /// by `Ptr`. The `io_uring_cmd` would be used for get arguments cmd_op, sqe, > > > + /// cmd_data. The `issue_flags` is the flags includes options for uring_cmd. > > > + /// The options are listed in `kernel::io_uring::cmd_flags`. > > > + fn uring_cmd( > > > + _device: <Self::Ptr as ForeignOwnable>::Borrowed<'_>, > > > + _io_uring_cmd: Pin<&mut IoUringCmd>, > > > > Passing the IoUringCmd by reference doesn't allow the uring_cmd() > > implementation to store it beyond the function return. That precludes > > any asynchronous uring_cmd() implementation, which is kind of the > > whole point of uring_cmd. I think uring_cmd() needs to transfer > > ownership of the IoUringCmd so the implementation can complete it > > asynchronously. > > I didn't know that I can take IoUringCmd ownership and calling done(). > In C implementation, is it safe to call `io_uring_cmd_done()` in any context? It depends on the issue_flags. If IO_URING_F_UNLOCKED is set, it can be called from any context. But if IO_URING_F_UNLOCKED is not set, it needs to be called with the io_ring_ctx's uring_lock held. That generally requires it to either be called during uring_cmd() or from a task work callback. In either case, issue_flags needs to match what was passed to uring_cmd() or the task work callback. You can look at NVMe passthru as an example. It calls io_uring_cmd_done() from nvme_uring_task_cb(), which is a task work callback scheduled with io_uring_cmd_do_in_task_lazy() in nvme_uring_cmd_end_io(). Best, Caleb ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH v3 5/5] samples: rust: Add `uring_cmd` example to `rust_misc_device` 2025-08-22 12:55 [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd Sidong Yang ` (3 preceding siblings ...) 2025-08-22 12:55 ` [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support Sidong Yang @ 2025-08-22 12:55 ` Sidong Yang 2025-08-28 0:48 ` Ming Lei 4 siblings, 1 reply; 33+ messages in thread From: Sidong Yang @ 2025-08-22 12:55 UTC (permalink / raw) To: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin Cc: Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring, Sidong Yang This patch extends the `rust_misc_device` sample to demonstrate how to use the `uring_cmd` interface for asynchronous device operations. The new implementation handles two `uring_cmd` operations: * `RUST_MISC_DEV_URING_CMD_SET_VALUE`: Sets a value in the device. * `RUST_MISC_DEV_URING_CMD_GET_VALUE`: Gets a value from the device. To use this new functionality, users can submit `IORING_OP_URING_CMD` operations to the `rust_misc_device` character device. Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> --- samples/rust/rust_misc_device.rs | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/samples/rust/rust_misc_device.rs b/samples/rust/rust_misc_device.rs index e7ab77448f75..1f25d2b1f4d8 100644 --- a/samples/rust/rust_misc_device.rs +++ b/samples/rust/rust_misc_device.rs @@ -101,6 +101,7 @@ c_str, device::Device, fs::File, + io_uring::IoUringCmd, ioctl::{_IO, _IOC_SIZE, _IOR, _IOW}, miscdevice::{MiscDevice, MiscDeviceOptions, MiscDeviceRegistration}, new_mutex, @@ -114,6 +115,9 @@ const RUST_MISC_DEV_GET_VALUE: u32 = _IOR::<i32>('|' as u32, 0x81); const RUST_MISC_DEV_SET_VALUE: u32 = _IOW::<i32>('|' as u32, 0x82); +const RUST_MISC_DEV_URING_CMD_SET_VALUE: u32 = _IOR::<i32>('|' as u32, 0x83); +const RUST_MISC_DEV_URING_CMD_GET_VALUE: u32 = _IOW::<i32>('|' as u32, 0x84); + module! { type: RustMiscDeviceModule, name: "rust_misc_device", @@ -192,6 +196,29 @@ fn ioctl(me: Pin<&RustMiscDevice>, _file: &File, cmd: u32, arg: usize) -> Result Ok(0) } + + fn uring_cmd( + me: Pin<&RustMiscDevice>, + io_uring_cmd: Pin<&mut IoUringCmd>, + _issue_flags: u32, + ) -> Result<i32> { + dev_info!(me.dev, "UringCmd Rust Misc Device Sample\n"); + + let cmd = io_uring_cmd.cmd_op(); + let addr: usize = io_uring_cmd.sqe().cmd_data()?; + let user_ptr = UserPtr::from_addr(addr); + let user_slice = UserSlice::new(user_ptr, 8); + + match cmd { + RUST_MISC_DEV_URING_CMD_SET_VALUE => me.set_value(user_slice.reader())?, + RUST_MISC_DEV_URING_CMD_GET_VALUE => me.get_value(user_slice.writer())?, + _ => { + dev_err!(me.dev, "-> uring_cmd not recognised: {}\n", cmd); + return Err(ENOTTY); + } + }; + Ok(0) + } } #[pinned_drop] -- 2.43.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH v3 5/5] samples: rust: Add `uring_cmd` example to `rust_misc_device` 2025-08-22 12:55 ` [RFC PATCH v3 5/5] samples: rust: Add `uring_cmd` example to `rust_misc_device` Sidong Yang @ 2025-08-28 0:48 ` Ming Lei 0 siblings, 0 replies; 33+ messages in thread From: Ming Lei @ 2025-08-28 0:48 UTC (permalink / raw) To: Sidong Yang Cc: Jens Axboe, Daniel Almeida, Caleb Sander Mateos, Benno Lossin, Miguel Ojeda, Arnd Bergmann, Greg Kroah-Hartman, rust-for-linux, linux-kernel, io-uring On Fri, Aug 22, 2025 at 12:55:55PM +0000, Sidong Yang wrote: > This patch extends the `rust_misc_device` sample to demonstrate how to > use the `uring_cmd` interface for asynchronous device operations. > > The new implementation handles two `uring_cmd` operations: > > * `RUST_MISC_DEV_URING_CMD_SET_VALUE`: Sets a value in the device. > * `RUST_MISC_DEV_URING_CMD_GET_VALUE`: Gets a value from the device. > > To use this new functionality, users can submit `IORING_OP_URING_CMD` > operations to the `rust_misc_device` character device. > > Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai> > --- > samples/rust/rust_misc_device.rs | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/samples/rust/rust_misc_device.rs b/samples/rust/rust_misc_device.rs > index e7ab77448f75..1f25d2b1f4d8 100644 > --- a/samples/rust/rust_misc_device.rs > +++ b/samples/rust/rust_misc_device.rs > @@ -101,6 +101,7 @@ > c_str, > device::Device, > fs::File, > + io_uring::IoUringCmd, > ioctl::{_IO, _IOC_SIZE, _IOR, _IOW}, > miscdevice::{MiscDevice, MiscDeviceOptions, MiscDeviceRegistration}, > new_mutex, > @@ -114,6 +115,9 @@ > const RUST_MISC_DEV_GET_VALUE: u32 = _IOR::<i32>('|' as u32, 0x81); > const RUST_MISC_DEV_SET_VALUE: u32 = _IOW::<i32>('|' as u32, 0x82); > > +const RUST_MISC_DEV_URING_CMD_SET_VALUE: u32 = _IOR::<i32>('|' as u32, 0x83); > +const RUST_MISC_DEV_URING_CMD_GET_VALUE: u32 = _IOW::<i32>('|' as u32, 0x84); > + > module! { > type: RustMiscDeviceModule, > name: "rust_misc_device", > @@ -192,6 +196,29 @@ fn ioctl(me: Pin<&RustMiscDevice>, _file: &File, cmd: u32, arg: usize) -> Result > > Ok(0) > } > + > + fn uring_cmd( > + me: Pin<&RustMiscDevice>, > + io_uring_cmd: Pin<&mut IoUringCmd>, > + _issue_flags: u32, > + ) -> Result<i32> { > + dev_info!(me.dev, "UringCmd Rust Misc Device Sample\n"); > + > + let cmd = io_uring_cmd.cmd_op(); > + let addr: usize = io_uring_cmd.sqe().cmd_data()?; > + let user_ptr = UserPtr::from_addr(addr); > + let user_slice = UserSlice::new(user_ptr, 8); > + > + match cmd { > + RUST_MISC_DEV_URING_CMD_SET_VALUE => me.set_value(user_slice.reader())?, > + RUST_MISC_DEV_URING_CMD_GET_VALUE => me.get_value(user_slice.writer())?, > + _ => { > + dev_err!(me.dev, "-> uring_cmd not recognised: {}\n", cmd); > + return Err(ENOTTY); > + } > + }; > + Ok(0) > + } > } I'd suggest to cover most of methods added in patch 2, so that any kernel side change can be verified easily. Thanks, Ming ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2025-09-22 18:09 UTC | newest] Thread overview: 33+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-22 12:55 [RFC PATCH v3 0/5] rust: miscdevice: abstraction for uring_cmd Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 1/5] rust: bindings: add io_uring headers in bindings_helper.h Sidong Yang 2025-08-22 12:55 ` [RFC PATCH v3 2/5] io_uring/cmd: zero-init pdu in io_uring_cmd_prep() to avoid UB Sidong Yang 2025-09-02 0:34 ` Caleb Sander Mateos 2025-09-02 10:23 ` Sidong Yang 2025-09-02 15:31 ` Caleb Sander Mateos 2025-09-06 14:28 ` Sidong Yang 2025-09-08 19:45 ` Caleb Sander Mateos 2025-09-09 14:43 ` Sidong Yang 2025-09-09 16:32 ` Caleb Sander Mateos 2025-09-12 16:41 ` Sidong Yang 2025-09-12 17:56 ` Caleb Sander Mateos 2025-09-13 12:42 ` Sidong Yang 2025-09-15 16:54 ` Caleb Sander Mateos 2025-09-17 14:56 ` Sidong Yang 2025-09-22 18:09 ` Caleb Sander Mateos 2025-08-22 12:55 ` [RFC PATCH v3 3/5] rust: io_uring: introduce rust abstraction for io-uring cmd Sidong Yang 2025-08-27 20:41 ` Daniel Almeida 2025-08-28 7:24 ` Benno Lossin 2025-08-29 15:43 ` Sidong Yang 2025-08-29 16:11 ` Daniel Almeida 2025-08-28 0:36 ` Ming Lei 2025-08-28 7:25 ` Benno Lossin 2025-08-28 10:05 ` Ming Lei 2025-09-02 1:11 ` Caleb Sander Mateos 2025-09-02 11:11 ` Sidong Yang 2025-09-02 15:41 ` Caleb Sander Mateos 2025-08-22 12:55 ` [RFC PATCH v3 4/5] rust: miscdevice: Add `uring_cmd` support Sidong Yang 2025-09-02 1:12 ` Caleb Sander Mateos 2025-09-02 11:18 ` Sidong Yang 2025-09-02 15:53 ` Caleb Sander Mateos 2025-08-22 12:55 ` [RFC PATCH v3 5/5] samples: rust: Add `uring_cmd` example to `rust_misc_device` Sidong Yang 2025-08-28 0:48 ` Ming Lei
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).