[RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU

* [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU
@ 2026-03-16 21:05 Serapheim Dimitropoulos
  2026-03-17  5:42 ` Stefan Hajnoczi
  0 siblings, 1 reply; 3+ messages in thread
From: Serapheim Dimitropoulos @ 2026-03-16 21:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, pbonzini, stefanha, sgarzare, xieyongji, weijunji,
	xiongweimin, Serapheim Dimitropoulos

Hi all,

Apologies if this is not the right place but I'm looking to propose
the addition of a new virtio device type to QEMU for RDMA emulation.
Before sending patches, I wanted to introduce the idea and get early
feedback on the design so that I don't go too far with my implementation.

= Motivation =

QEMU removed pvrdma in v9.1 (deprecated in v8.2). There is currently
zero RDMA emulation in QEMU. Anyone wanting to learn, develop, or
test RDMA software needs Mellanox/NVIDIA hardware which at times is
hard to come by.

Software RDMA stacks (rxe, siw) exist in the kernel but they run
entirely on the guest CPU. One-sided operations (RDMA WRITE/READ)
still involve the remote CPU (it receives a UDP packet, then
decapsulates it, copies data in software). My point is that they
give you the RDMA API but not the hardware behavior.

Meanwhile, the vDPA framework has matured significantly (VDUSE in
5.15+, generic vhost-vdpa-device-pci in QEMU 8.0+), creating a
natural abstraction layer for virtio devices that can span software
emulation and hardware offload. No one has applied this to RDMA yet.

= My Proposal =

A new virtio-rdma device type (one guest driver + multiple backends):

 1. QEMU device model (in-process, C): The reference implementation.
    Software emulation for development, CI tests, and learning. No
    host RDMA stack or hardware needed. The idea: two QEMU instances
    on a dev laptop doing RDMA to each other.

 2. VDUSE backend (userspace daemon): The same virtio-rdma protocol
    implemented as a standalone process via /dev/vduse. Could be
    written in Rust (or not; no preference here). It's crash-isolated
    from the host kernel and the goal is to make it a natural fit
    for DPU control planes (i.e. the DPU runs a VDUSE daemon that
    presents virtio-rdma to the host VM).

 3. Hardware vDPA (potential long-term goal): when/if DPU/SmartNIC
    vendors ever implement the virtio-rdma data path in silicon, the
    guest driver would work unchanged via vhost-vdpa-device-pci (no
    QEMU device code needed).

The guest driver is the same in all three cases. It submits verbs
through virtqueues, completely unaware of the backend.

The main idea is that this is basically self-contained (no host RDMA
stack or hardware needed for [1]), it uses standard virtio-pci
transport making it hypervisor-agnostic, and provides RDMA "hardware"
behavior without the actual hardware (compared to rxe/siw which is
RDMA "API" behavior without hardware). Again this is complementary
to rxe/siw and not a replacement.

The follow-up on that is that I hope it could fit naturally into the
vDPA ecosystem alongside virtio-net and virtio-blk as a first-class
offloadable device.

= Design Overview =

Four virtqueues:

  VQ 0 (command)    - resource management: create/destroy PD, MR,
                      CQ, QP, AH. Synchronous request/response.
  VQ 1 (completion) - device returns CQEs to pre-posted driver
                      buffers when operations complete.
  VQ 2 (data-tx)    - driver posts SEND/RDMA WRITE/READ work
                      requests with scatter/gather data.
  VQ 3 (data-rx)    - driver pre-posts receive buffers; device
                      fills them on incoming SEND.

The device maintains full resource state: protection domains, memory
regions with page tables and access keys, QPs with the standard IB
state machine (RESET -> INIT -> RTR -> RTS), and completion queues.
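
A minimal sketch of the queue numbering and the QP state check
described above, in C. All enum and function names here are
hypothetical and are not taken from the draft spec or the patches. It
covers only the forward path RESET -> INIT -> RTR -> RTS, plus the
assumption that any state may be moved to RESET or to an error state;
the other IB states are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Virtqueue indexes from the list above (names are hypothetical). */
enum virtio_rdma_vq {
    VIRTIO_RDMA_VQ_COMMAND    = 0, /* resource management */
    VIRTIO_RDMA_VQ_COMPLETION = 1, /* CQEs returned by the device */
    VIRTIO_RDMA_VQ_DATA_TX    = 2, /* SEND / RDMA WRITE / READ requests */
    VIRTIO_RDMA_VQ_DATA_RX    = 3, /* pre-posted receive buffers */
    VIRTIO_RDMA_VQ_COUNT
};

static_assert(VIRTIO_RDMA_VQ_COUNT == 4, "the proposal uses four virtqueues");

/* Simplified QP states: the forward path plus an error state. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS, QP_ERR };

/*
 * Return true if a MODIFY_QP style command may move the QP from cur to
 * next. Assumption: RESET and ERR are reachable from every state, and
 * otherwise only the next step of the forward path is allowed.
 */
bool qp_transition_ok(enum qp_state cur, enum qp_state next)
{
    if (next == QP_RESET || next == QP_ERR)
        return true;

    switch (cur) {
    case QP_RESET: return next == QP_INIT;
    case QP_INIT:  return next == QP_RTR;
    case QP_RTR:   return next == QP_RTS;
    default:       return false; /* RTS and ERR have no forward step here */
    }
}
```

A command VQ handler could call qp_transition_ok() before touching QP
state and reject the command otherwise, so that a QP can never reach
RTS without passing through INIT and RTR.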

The QEMU device model uses a socket backend for SEND/RECV currently.
A shared-memory backend (ivshmem) would be needed for truly one-sided
RDMA WRITE/READ where the remote CPU is not involved (matches real HCA
DMA behavior). The same shared-memory transport would also be reused
by the VDUSE daemon backend.

QEMU device model currently looks something like this in terms of
structure:

  include/hw/virtio/virtio-rdma.h     - device structs, wire protocol
  hw/virtio/virtio-rdma.c             - resource manager, command VQ,
                                        datapath, completions
  hw/virtio/virtio-rdma-pci.c         - PCI wrapper (boilerplate)
  hw/virtio/virtio-rdma-backend.c     - socket backend

I have a minimally working kernel driver (modeled after
drivers/infiniband/hw/efa/) and a very early draft virtio spec.  I'll
gladly send RFC patches for all three, provided the general direction
of this makes sense to you.

= The vDPA/VDUSE case =

I want to highlight why I believe this matters beyond pure emulation.
The vDPA framework already handles virtio-net and virtio-blk offload to
hardware. virtio-rdma would be the first vDPA-compatible RDMA device
type.

The specific use case for my employer would be DPU-based RDMA. A DPU
(think NVIDIA BlueField) runs a VDUSE daemon that presents a
virtio-rdma device to the host VM. The daemon handles the RDMA control
plane (connection setup, memory registration, key exchange) in
userspace, then programs the DPU's physical NIC for the data plane.
The host VM sees a standard virtio device and uses the standard
virtio-rdma driver (no vendor-specific drivers needed).

If I understand correctly, the vhost-vdpa-device-pci architecture
was designed for exactly that, so RDMA could fit naturally as a device
type after net and block.

VDUSE currently only supports virtio-block (security scoping in
drivers/vdpa/vduse/). Extending it to virtio-rdma would require
kernel patches to whitelist the device type, with appropriate
validation in the virtio-rdma driver to handle untrusted device
input safely.

= Prior work =

I was able to find a few explorations of virtio-RDMA but no prior
effort produced upstream-viable code. Yuval Shaia (Oracle) posted
an RFC in April 2019 with a QEMU device model and kernel driver,
but it only implemented probing and basic ibverbs (no data path nor
progress past RFC v1). The MIKELANGELO/Huawei vRDMA project
(2016-2017) targeted QEMU 2.3 and the OSv unikernel as part of an EU
Horizon 2020 research effort; it now seems obsolete.

The most relevant prior work I could find is from Xie Yongji and Wei
Junji at ByteDance, who posted an [RFC v2] to virtio-comment in May
2022 [1] proposing to add RoCE as a VIRTIO_NET_F_ROCE feature extension
to virtio-net (rather than a standalone device type). Their v1 was a
standalone device; the v2 reworked it as a virtio-net extension. They
had working code (kernel driver, QEMU, rdma-core, vhost-user-rdma
backend) but the effort seems to have gone quiet after 2022 with no
v3. Notably, Yongji is also the author of VDUSE itself so I'd really
value his input. I'm proposing a standalone device type rather than a
virtio-net extension because RDMA isn't inherently tied to Ethernet
(InfiniBand exists), it maps more cleanly onto the vDPA offload model
as a separate device, and it avoids burdening virtio-net with RDMA
complexity. But I could be wrong — happy to hear arguments either way.

In addition, very recently, Xiong Weimin posted a vhost-user-rdma/DPDK
concept to netdev (Dec 2025) [2] which takes a different architectural
approach (DPDK-based vhost-user backend); that work seems
complementary but has not produced formal patches.

I CC'd both Yongji and Xiong on this email in case they have opinions.

This series builds on the same virtio-RDMA concept with a complete
data path, modern QEMU/kernel APIs (virtio-1.0, QIOChannel, kernel
6.x ib_device_ops), and a vDPA-native architecture that none of the
prior efforts had.

[1] https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?MessageKey=20912da6-db56-441c-9117-0148b9c86ea5&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226
[2] https://lists.openwall.net/netdev/2025/12/19/13

= Questions for the community =

 1. Does a virtio-rdma device make sense as a direction? I've studied
    virtio-sound and virtio-can as precedents for new device types.

 2. The device is written in C to match existing virtio devices. I saw
    the Rust PCI+DMA bindings are still in progress — should I wait,
    or is C the right choice today?

 3. For the shared-memory backend, I'm planning to use ivshmem. Is
    there a preferred mechanism for zero-copy inter-VM memory sharing
    in QEMU today, or is ivshmem still the way to go?

 4. Any concerns about the virtqueue layout? I split command and
    completion into separate VQs (rather than putting responses
    inline) to allow async completions — similar to how real HCAs
    separate CQ from QP.

 5. For the VDUSE backend path: should the QEMU integration rely on
    the existing generic vhost-vdpa-device-pci, or would a dedicated
    vhost-vdpa-rdma device (like vhost-vdpa-net) be preferred for
    control-plane visibility?

 6. I'm planning to extend VDUSE to support virtio-rdma as a device
    type. Is there ongoing work or discussion about expanding VDUSE
    beyond virtio-block that I should coordinate with?

 7. Would it be ok to use the experimental device ID 0x1045 until
    the OASIS TC assignment?

I'm happy to send the RFC patch series, kernel driver, and spec draft
whenever you'd like to see code.

Thanks,
Serapheim

* Re: [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU
  2026-03-16 21:05 [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU Serapheim Dimitropoulos
@ 2026-03-17  5:42 ` Stefan Hajnoczi
  2026-03-20 17:23   ` Serapheim Dimitropoulos
  0 siblings, 1 reply; 3+ messages in thread
From: Stefan Hajnoczi @ 2026-03-17  5:42 UTC (permalink / raw)
  To: Serapheim Dimitropoulos
  Cc: qemu-devel, mst, pbonzini, stefanha, sgarzare, xieyongji,
	weijunji, xiongweimin

On Tue, Mar 17, 2026 at 5:39 AM Serapheim Dimitropoulos
<serapheimd@gmail.com> wrote:

Hi Serapheim,
Some thoughts below. I only have a rough understanding of RDMA, so I
hope others who worked on previous virtio-rdma and pvrdma devices will
also join the discussion.

> QEMU removed pvrdma in v9.1 (deprecated in v8.2). There is currently
> zero RDMA emulation in QEMU. Anyone wanting to learn, develop, or
> test RDMA software needs Mellanox/NVIDIA hardware which at times is
> hard to come by.

I think this is a sign that virtual RDMA has a small community without
someone willing to maintain it over the long term. Can you see
yourself actively maintaining this over the coming years?

If not, then it may be more appropriate to treat it as an experimental
and out-of-tree project. That way the spec and code can be shared in
case others want to build on it in the future without any commitment
or the overhead of going through the full process of getting a device
merged into the VIRTIO spec, QEMU emulating code merged, and Linux
guest code merged.

If you are going to ship products that rely on this, then it's
probably necessary to go through the full process of getting
everything merged upstream.

> Software RDMA stacks (rxe, siw) exist in the kernel but they run
> entirely on the guest CPU. One-sided operations (RDMA WRITE/READ)
> still involve the remote CPU (it receives a UDP packet, then
> decapsulates it, copies data in software). My point is that they
> give you the RDMA API but not the hardware behavior.
>
> Meanwhile, the vDPA framework has matured significantly (VDUSE in
> 5.15+, generic vhost-vdpa-device-pci in QEMU 8.0+), creating a
> natural abstraction layer for virtio devices that can span software
> emulation and hardware offload. No one has applied this to RDMA yet.
>
> = My Proposal =
>
> A new virtio-rdma device type (one guest driver + multiple backends):
>
>  1. QEMU device model (in-process, C): The reference implementation.
>     Software emulation for development, CI tests, and learning. No
>     host RDMA stack or hardware needed. The idea: two QEMU instances
>     on a dev laptop doing RDMA to each other.

This sounds like it would be implemented along the same lines as
vhost-user and vfio-user. These protocols use a UNIX domain socket to
coordinate shared memory and signalling. In the RDMA case there would
be two QEMUs talking to each other rather than one QEMU and a device
emulation process.

If you are eager to use Rust, QEMU has support for Rust code too.
Bindings are missing for existing C APIs but they can be added.

>  2. VDUSE backend (userspace daemon): The same virtio-rdma protocol
>     implemented as a standalone process via /dev/vduse. Could be
>     written in Rust (or not; no preference here). It's crash-isolated
>     from the host kernel and the goal is to make it a natural fit
>     for DPU control planes (i.e. the DPU runs a VDUSE daemon that
>     presents virtio-rdma to the host VM).
>
>  3. Hardware vDPA (potential long-term goal): when/if DPU/SmartNIC
>     vendors ever implement the virtio-rdma data path in silicon, the
>     guest driver would work unchanged via vhost-vdpa-device-pci (no
>     QEMU device code needed).
>
> The guest driver is the same in all three cases. It submits verbs
> through virtqueues, completely unaware of the backend.
>
> The main idea is that this is basically self-contained (no host RDMA
> stack or hardware needed for [1]), it uses standard virtio-pci
> transport making it hypervisor-agnostic, and provides RDMA "hardware"
> behavior without the actual hardware (compared to rxe/siw which is
> RDMA "API" behavior without hardware). Again this is complementary
> to rxe/siw and not a replacement.
>
> The follow-up on that is that I hope it could fit naturally into the
> vDPA ecosystem alongside virtio-net and virtio-blk as a first-class
> offloadable device.
>
> = Design Overview =

It would be nice to also include virtio-comment@lists.linux.dev in the
design discussion. It may help get the attention of VIRTIO folks who
are less focused on QEMU but still interested in RDMA.

>
> Four virtqueues:
>
>   VQ 0 (command)    - resource management: create/destroy PD, MR,
>                       CQ, QP, AH. Synchronous request/response.

Does this mean that memory is registered on VQ 0 and incoming RDMA
WRITE (without immediate) requests modify that memory directly without
virtqueue activity? I think this is necessary because registered
memory is available continuously and the virtqueue model doesn't
really work for this mode of operation.

It's worth clarifying this because accessing memory outside of
virtqueue buffers is a violation of the VIRTIO device model. That's
okay, VIRTIO is pragmatic and some devices do this but it's worth
mentioning explicitly.

Stepping outside the VIRTIO device model can create implementation
challenges because interfaces like vDPA/VDUSE may not be designed for
it though.

>   VQ 1 (completion) - device returns CQEs to pre-posted driver
>                       buffers when operations complete.
>   VQ 2 (data-tx)    - driver posts SEND/RDMA WRITE/READ work
>                       requests with scatter/gather data.
>   VQ 3 (data-rx)    - driver pre-posts receive buffers; device
>                       fills them on incoming SEND.
>
> The device maintains full resource state: protection domains, memory
> regions with page tables and access keys, QPs with the standard IB
> state machine (RESET -> INIT -> RTR -> RTS), and completion queues.
>
> The QEMU device model uses a socket backend for SEND/RECV currently.
> A shared-memory backend (ivshmem) would be needed for truly one-sided
> RDMA WRITE/READ where the remote CPU is not involved (matches real HCA
> DMA behavior). The same shared-memory transport would also be reused
> by the VDUSE daemon backend.

vhost-user and vfio-user build the shared memory support into a UNIX
domain socket protocol. The guest RAM is shared with the external
process via file descriptor passing. The external process can then
read/write guest RAM at will. See
https://gitlab.com/qemu-project/qemu/-/blob/master/docs/interop/vhost-user.rst?ref_type=heads#id54.

There are limitations to this approach with regard to IOMMU and
isolation (security), but it works well for basic use cases.

I haven't thought about how it would integrate with VDUSE.

> QEMU device model currently looks something like this in terms of
> structure:
>
>   include/hw/virtio/virtio-rdma.h     - device structs, wire protocol
>   hw/virtio/virtio-rdma.c             - resource manager, command VQ,
>                                         datapath, completions
>   hw/virtio/virtio-rdma-pci.c         - PCI wrapper (boilerplate)
>   hw/virtio/virtio-rdma-backend.c     - socket backend
>
> I have a minimally working kernel driver (modeled after
> drivers/infiniband/hw/efa/) and a very early draft virtio spec.  I'll
> gladly send RFC patches for all three, provided the general direction
> of this makes sense to you.

Makes sense to me.

> = The vDPA/VDUSE case =
>
> I want to highlight why I believe this matters beyond pure emulation.
> The vDPA framework already handles virtio-net and virtio-blk offload to
> hardware. virtio-rdma would be the first vDPA-compatible RDMA device
> type.
>
> The specific use case for my employer would be DPU-based RDMA. A DPU
> (think NVIDIA BlueField) runs a VDUSE daemon that presents a
> virtio-rdma device to the host VM. The daemon handles the RDMA control
> plane (connection setup, memory registration, key exchange) in
> userspace, then programs the DPU's physical NIC for the data plane.
> The host VM sees a standard virtio device and uses the standard
> virtio-rdma driver (no vendor-specific drivers needed).
>
> If I understand correctly, the vhost-vdpa-device-pci architecture
> was designed for exactly that, so RDMA could fit naturally as a device
> type after net and block.

Yes, but check how registered memory can be implemented both in VDUSE
and in-kernel vDPA drivers.

> VDUSE currently only supports virtio-block (security scoping in
> drivers/vdpa/vduse/). Extending it to virtio-rdma would require
> kernel patches to whitelist the device type, with appropriate
> validation in the virtio-rdma driver to handle untrusted device
> input safely.
>
> = Prior work =
>
> I was able to find a few explorations of virtio-RDMA but no prior
> effort produced upstream-viable code. Yuval Shaia (Oracle) posted
> an RFC in April 2019 with a QEMU device model and kernel driver,
> but it only implemented probing and basic ibverbs (no data path nor
> progress past RFC v1). The MIKELANGELO/Huawei vRDMA project
> (2016-2017) targeted QEMU 2.3 and the OSv unikernel as part of an EU
> Horizon 2020 research effort; it now seems obsolete.
>
> The most relevant prior work I could find is from Xie Yongji and Wei
> Junji at ByteDance, who posted an [RFC v2] to virtio-comment in May
> 2022 [1] proposing to add RoCE as a VIRTIO_NET_F_ROCE feature extension
> to virtio-net (rather than a standalone device type). Their v1 was a
> standalone device; the v2 reworked it as a virtio-net extension. They
> had working code (kernel driver, QEMU, rdma-core, vhost-user-rdma
> backend) but the effort seems to have gone quiet after 2022 with no
> v3. Notably, Yongji is also the author of VDUSE itself so I'd really
> value his input. I'm proposing a standalone device type rather than a
> virtio-net extension because RDMA isn't inherently tied to Ethernet
> (InfiniBand exists), it maps more cleanly onto the vDPA offload model
> as a separate device, and it avoids burdening virtio-net with RDMA
> complexity. But I could be wrong — happy to hear arguments either way.
>
> In addition, very recently, Xiong Weimin posted a vhost-user-rdma/DPDK
> concept to netdev (Dec 2025) [2] which takes a different architectural
> approach (DPDK-based vhost-user backend); that work seems
> complementary but has not produced formal patches.
>
> I CC'd both Yongji and Xiong on this email in case they have opinions.
>
> This series builds on the same virtio-RDMA concept with a complete
> data path, modern QEMU/kernel APIs (virtio-1.0, QIOChannel, kernel
> 6.x ib_device_ops), and a vDPA-native architecture that none of the
> prior efforts had.
>
> [1] https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?MessageKey=20912da6-db56-441c-9117-0148b9c86ea5&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226
> [2] https://lists.openwall.net/netdev/2025/12/19/13
>
> = Questions for the community =
>
>  1. Does a virtio-rdma device make sense as a direction? I've studied
>     virtio-sound and virtio-can as precedents for new device types.
>
>  2. The device is written in C to match existing virtio devices. I saw
>     the Rust PCI+DMA bindings are still in progress — should I wait,
>     or is C the right choice today?

Writing C devices is still perfectly acceptable in QEMU. With Rust you
are likely to have to work on bindings and may hit issues just because
Rust is new in QEMU, but it's there and you can absolutely use it.

>  3. For the shared-memory backend, I'm planning to use ivshmem. Is
>     there a preferred mechanism for zero-copy inter-VM memory sharing
>     in QEMU today, or is ivshmem still the way to go?

For the userspace virtio-rdma device implementation I expected a new
UNIX domain socket protocol along the lines of vhost-user and
vfio-user. That's because sharing guest RAM is only part of the
communication that must happen between two QEMUs and I guess you'll
need to define your own protocol to coordinate RDMA between QEMU
processes anyway.

When using vDPA or VDUSE, QEMU shares guest RAM with the device
through the /dev/vhost ioctls.

In both cases, I'm not sure if ivshmem is necessary.

>  4. Any concerns about the virtqueue layout? I split command and
>     completion into separate VQs (rather than putting responses
>     inline) to allow async completions — similar to how real HCAs
>     separate CQ from QP.

It depends what you mean by async. Virtqueues can complete requests
out-of-order, so a separate completion virtqueue is not needed from
that perspective.

There could be other reasons why a separate completion virtqueue makes
sense. If RDMA relies on the separate CQ design to emit multiple CQEs
for the same request or emits CQEs not associated with any request,
then a single virtqueue won't work. I don't know RDMA well enough to
say either way.

>  5. For the VDUSE backend path: should the QEMU integration rely on
>     the existing generic vhost-vdpa-device-pci, or would a dedicated
>     vhost-vdpa-rdma device (like vhost-vdpa-net) be preferred for
>     control-plane visibility?

I think vhost-vdpa-device-pci is good practice. It makes sure the
device is self-contained and doesn't rely on device-specific VMM
support. If it's possible to use just vhost-vdpa-device-pci, then
that's great. It scales better because it avoids the need to implement
a device in every VMM (like QEMU, Firecracker, etc).

>  6. I'm planning to extend VDUSE to support virtio-rdma as a device
>     type. Is there ongoing work or discussion about expanding VDUSE
>     beyond virtio-block that I should coordinate with?
>
>  7. Would it be ok to use the experimental device ID 0x1045 until
>     the OASIS TC assignment?

You can reserve a device ID from the VIRTIO Technical Committee
separately from getting the spec merged. Ask
virtio-comment@lists.linux.dev.

> I'm happy to send the RFC patch series, kernel driver, and spec draft
> whenever you'd like to see code.

It is helpful to see the draft VIRTIO spec and RFC patches at the same
time. So as soon as you want to discuss the specifics of the VIRTIO
spec patches it would be a good time to send RFC patches showing how
the spec is implemented.

Stefan

* Re: [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU
  2026-03-17  5:42 ` Stefan Hajnoczi
@ 2026-03-20 17:23   ` Serapheim Dimitropoulos
  0 siblings, 0 replies; 3+ messages in thread
From: Serapheim Dimitropoulos @ 2026-03-20 17:23 UTC (permalink / raw)
  To: stefanha
  Cc: qemu-devel, mst, pbonzini, sgarzarella, xieyongji, weijunji,
	15927021679

Hi Stefan,

Thank you for the quick reply and thorough review! I waited a few days
before replying to see if any of the folks from the previous virtio-rdma
efforts would chime in, but nothing so far. In any case, my responses
are inline below.

> I think this is a sign that virtual RDMA has a small community without
> someone willing to maintain it over the long term. Can you see
> yourself actively maintaining this over the coming years?
>
> If not, then it may be more appropriate to treat it as an experimental
> and out-of-tree project. That way the spec and code can be shared in
> case others want to build on it in the future without any commitment
> or the overhead of going through the full process of getting a device
> merged into the VIRTIO spec, QEMU emulating code merged, and Linux
> guest code merged.
>
> If you are going to ship products that rely on this, then it's
> probably necessary to go through the full process of getting
> everything merged upstream.

I understand the hesitation given the pvrdma precedent. I want to be
explicit that this is not a side-project. I'm a kernel engineer at
CoreWeave and having something like virtio-rdma is a requirement for
some of our current projects. Two that I feel comfortable disclosing
are our work with BlueField DPUs and Kata containers.

For NVIDIA BlueFields we currently do most of our work with real
hardware which for us developers is at times hard to come by as
we want to make sure that our customers take priority. Being able
to do emulation with QEMU means that every engineer on the team
can iterate without a dedicated BlueField. Moreover, our test suite
can run more often this way too.

For Kata containers today, to get RDMA you have two options: either
use virtio-net and give up latency, or use vfio-pci passthrough, which
pins the pod to a NIC. The latter breaks not only the security model
but also isolation (not to mention any prospect of live-migration).

I'm ok keeping this as an out-of-tree project initially in the short
term but would hate it if something else comes along later and we
have to re-work everything on our end. As far as long-term commitment
goes I commit to maintaining virtio-rdma for as long as it's upstream.
If I ever leave my role at CoreWeave and my next role is not related
to virtio-rdma, my team at CoreWeave will designate a successor
maintainer as it has organizational interest in this work. I'm happy
to formalize my commitment in a MAINTAINERS entry when/if the time
comes.

> Does this mean that memory is registered on VQ 0 and incoming RDMA
> WRITE (without immediate) requests modify that memory directly without
> virtqueue activity? I think this is necessary because registered
> memory is available continuously and the virtqueue model doesn't
> really work for this mode of operation.
>
> It's worth clarifying this because accessing memory outside of
> virtqueue buffers is a violation of the VIRTIO device model. That's
> okay, VIRTIO is pragmatic and some devices do this but it's worth
> mentioning explicitly.
>
> Stepping outside the VIRTIO device model can create implementation
> challenges because interfaces like vDPA/VDUSE may not be designed for
> it though.

Ok great point actually. Evaluating the potential paths forward I
thought of the following (though I'm open to other ideas if you have
them):

A] Use a shared memory window (like virtio-fs DAX) - this wouldn't
work because it changes the RDMA programming model. Real HCAs let
you register *any* part of memory via ibv_reg_mr(). Restricting MRs
to a pre-allocated window would thus break standard applications.

B] Just acknowledge the deviation explicitly and move on - as you
said, besides being a spec violation it doesn't concretely solve the
vDPA/VDUSE case as they may want to enforce IOMMU boundaries.

C] Go the IOTLB route - when the driver registers an MR, the device
triggers IOTLB updates for every page in the MR giving the backend
legal IOMMU mappings.

Let me know if you can think of any other ways but I believe [C]
may be the way to go as real HCAs do the same thing. In our case
this would look like so:

1. REG_USER_MR sends the page list (guest physical addresses) via
   the command VQ (virtio-compliant command).

2. The device uses the platform's DMA mapping mechanism to establish
   mappings for each page in the MR. I believe for QEMU that would be
   address_space_map() since it has full guest RAM access.
   (VHOST_USER_IOTLB_MSG for vhost-user and VDUSE_IOTLB_REG_UMEM for
   VDUSE).

3. RDMA WRITE/READ resolve (remote_addr + rkey) through those mappings.

4. DEREG_MR invalidates them.
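
Steps 3 and 4 can be sketched roughly as follows. This is only an
illustration under stated assumptions: struct mr_entry, the access
flags and the function names are hypothetical, and each MR is assumed
to be mapped as one contiguous host range, whereas a real backend
would walk per-page mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access flags, not taken from the draft spec. */
#define MR_ACCESS_REMOTE_READ  (1u << 0)
#define MR_ACCESS_REMOTE_WRITE (1u << 1)

struct mr_entry {
    bool     valid;     /* cleared by DEREG_MR */
    uint32_t rkey;      /* key the remote side presents */
    uint64_t iova;      /* start address of the registered range */
    uint64_t length;    /* size of the registered range in bytes */
    uint32_t access;    /* MR_ACCESS_* bits granted at registration */
    uint8_t *host_base; /* where the backend mapped the range */
};

/*
 * Resolve (remote_addr, rkey) for an incoming RDMA WRITE/READ of len
 * bytes. Returns a host pointer, or NULL if the key is unknown, the
 * access right is missing, or the range falls outside the MR.
 */
uint8_t *mr_resolve(struct mr_entry *tbl, size_t n, uint32_t rkey,
                    uint64_t remote_addr, uint64_t len, uint32_t need)
{
    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        struct mr_entry *mr = &tbl[i];

        if (!mr->valid || mr->rkey != rkey)
            continue;
        if ((mr->access & need) != need)
            return NULL;
        if (remote_addr < mr->iova)
            return NULL;

        uint64_t off = remote_addr - mr->iova;
        /* Written this way so that off + len cannot overflow. */
        if (off > mr->length || len > mr->length - off)
            return NULL;
        return mr->host_base + off;
    }
    return NULL;
}

/* DEREG_MR: after this, the rkey no longer resolves. */
void mr_dereg(struct mr_entry *mr)
{
    mr->valid = false;
    mr->host_base = NULL;
}
```

The bounds check is done on the offset rather than on the end address
so that a hostile remote_addr/len pair cannot wrap around and pass.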

The above should require VIRTIO_F_ACCESS_PLATFORM when used with
IOMMU-protected backends like vDPA/VDUSE. As for the per-page mapping
cost at registration I'm open to ideas but I wonder if it is acceptable
for the v1 pass as it is a one time cost.

One potential future scalability issue with the flat list is that
for very large MRs (128GB/32M page addresses) it can become too
long/heavy in the command VQ which could be prohibitive for any
potential hardware implementations (if there were to be any). For
v1 maybe we could just reserve VIRTIO_RDMA_F_INDIRECT_MR and leave
room for a future indirect page table model?
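
To put a number on the flat-list concern, here is a small C helper.
It assumes 4 KiB pages and one 64-bit guest physical address per page;
both are assumptions, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to cover an MR of mr_bytes (rounded up). */
uint64_t mr_page_count(uint64_t mr_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return mr_bytes / page_size + (mr_bytes % page_size != 0);
}

/* Size of a flat page list holding one 64-bit address per page. */
uint64_t mr_page_list_bytes(uint64_t mr_bytes, uint64_t page_size)
{
    return mr_page_count(mr_bytes, page_size) * sizeof(uint64_t);
}
```

With these assumptions a 128 GiB MR covers 2^25 (about 32M) pages, so
the flat list alone is 256 MiB of command VQ payload for a single
registration.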

Let me know how the above sounds to you and I can make sure to document
it more formally in the spec draft.

> [...] check how registered memory can be implemented both in VDUSE
> and in-kernel vDPA drivers.

The IOTLB model above should cover both cases but I can double-check.
For VDUSE, MR registration triggers VDUSE_IOTLB_REG_UMEM calls. For
in-kernel vDPA, the vDPA bus provides DMA mapping APIs that map to the
parent IOMMU. The guest driver should ideally be unaware of which
backend is in use and just send REG_USER_MR with the device handling
the rest.

> For the userspace virtio-rdma device implementation I expected a new
> UNIX domain socket protocol along the lines of vhost-user and
> vfio-user. That's because sharing guest RAM is only part of the
> communication that must happen between two QEMUs and I guess you'll
> need to define your own protocol to coordinate RDMA between QEMU
> processes anyway.
>
> When using vDPA or VDUSE, QEMU shares guest RAM with the device
> through the /dev/vhost ioctls.
>
> In both cases, I'm not sure if ivshmem is necessary.

Makes sense - thank you for the pointers! I'm almost done switching
to domain sockets per your recommendation. The new scheme is currently
peer-to-peer (not a re-use of vhost-user which is VMM-to-backend). As
a first phase/stage each side exchanges MEM_REGIONS messages with
memfd descriptors for guest RAM regions and the peer mmap()s them
(handshake). Then we forward send/recv via framed messages on the
socket. RDMA WRITE/READ operate directly on the mmap'd peer memory
(no message nor remote CPU involvement).
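
The handshake step can be sketched with plain SCM_RIGHTS descriptor
passing over a UNIX domain socket. The struct layout and the function
names below are hypothetical and do not reflect the actual wire format;
the sketch sends one region per message and assumes the small fixed
header arrives in a single recvmsg() call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical MEM_REGIONS payload: one guest RAM region. */
struct mem_region_msg {
    uint64_t guest_phys_addr; /* start of the region in guest physical space */
    uint64_t size;            /* region length in bytes */
};

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send the region header with its backing fd attached. 0 on success. */
int send_mem_region(int sock, const struct mem_region_msg *msg, int fd)
{
    assert(msg != NULL);

    struct iovec iov = { .iov_base = (void *)msg, .iov_len = sizeof(*msg) };
    union fd_ctrl ctrl;
    struct msghdr hdr;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&hdr, 0, sizeof(hdr));
    hdr.msg_iov = &iov;
    hdr.msg_iovlen = 1;
    hdr.msg_control = ctrl.buf;
    hdr.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&hdr);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &hdr, 0) == (ssize_t)sizeof(*msg) ? 0 : -1;
}

/* Receive one region header; returns the passed fd, or -1 on error. */
int recv_mem_region(int sock, struct mem_region_msg *msg)
{
    assert(msg != NULL);

    struct iovec iov = { .iov_base = msg, .iov_len = sizeof(*msg) };
    union fd_ctrl ctrl;
    struct msghdr hdr;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&hdr, 0, sizeof(hdr));
    hdr.msg_iov = &iov;
    hdr.msg_iovlen = 1;
    hdr.msg_control = ctrl.buf;
    hdr.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &hdr, 0) != (ssize_t)sizeof(*msg))
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&hdr);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver would then mmap() the returned descriptor for size bytes
and use that mapping as the target of RDMA WRITE/READ, which is what
keeps the remote CPU out of the data path.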

BTW I don't need to add that level of detail in my spec, correct?
From what I can tell specs seem to define device-to-driver behavior
only (e.g. virtio-net doesn't say anything about TAP/vhost-user,
etc.)

> It depends what you mean by async. Virtqueues can complete requests
> out-of-order, so a separate completion virtqueue is not needed from
> that perspective.
>
> There could be other reasons why a separate completion virtqueue makes
> sense. If RDMA relies on the separate CQ design to emit multiple CQEs
> for the same request or emits CQEs not associated with any request,
> then a single virtqueue won't work. I don't know RDMA well enough to
> say either way.

OK, looking more at RDMA CQ semantics and assuming I understand what you
propose correctly, I do believe we need the separate queue for the
following reasons:

- An RDMA application may create one CQ and bind multiple QPs to it
  (multi-QP fan-in). ibv_poll_cq() returns completions from all
  associated QPs in one call. If completions live in per-VQ used
  rings, then polling means scanning 2N VQs, O(N) per poll. It also
  seems like a mismatch in the CQ abstraction being a single
  aggregation point and virtio VQ a per-queue used ring. The dedicated
  completion VQ gives you the fan-in O(1) for free.

- The RDMA spec mandates that when a CQ overflows the device raises
  IBV_EVENT_CQ_ERR, which cascades to IBV_EVENT_QP_FATAL on every QP
  bound to that CQ. The device must be able to detect overflows to
  trigger this. Detecting that with a dedicated completion VQ is
  straightforward. The virtio used ring on the other hand doesn't have
  any overflow semantics from what I can tell so we'd need to make
  something ourselves.

- ibv_req_notify_cq(solicited_only=1) fires only for CQEs with the
  solicited flag set. I'm not sure how we could express this as-is
  when virtio event index suppresses by count, rather than per-WR
  flag. This is less critical though as we could work around it
  by always notifying and having the driver filter in its interrupt
  handler in order to be functionally correct.

Let me know if the above reasons seem legitimate on your end. It
generally seems to me the used ring has no place to put whatever
fields are needed for an RDMA CQE (only handles generic completion
metadata). So using a dedicated one to return buffers that contain
CQEs makes more sense than trying to encode CQE data in the used
ring itself.
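
The fan-in and overflow points above can be sketched as a single
bounded ring that several QPs complete into. This is a toy model with
hypothetical names and a deliberately tiny depth, not the device's
actual CQ; the error flag stands in for the point at which the device
would raise the CQ error event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CQ_DEPTH 4 /* tiny on purpose, to make overflow easy to hit */

struct cqe {
    uint32_t qp_num; /* which QP the completion belongs to */
    uint64_t wr_id;  /* work request identifier from the driver */
    uint32_t status; /* 0 means success in this sketch */
};

struct cq {
    struct cqe ring[CQ_DEPTH];
    uint32_t head;   /* next entry to poll */
    uint32_t tail;   /* next free slot; tail - head entries are pending */
    bool error;      /* set once on overflow, never cleared here */
};

/* Device side: any QP bound to this CQ completes into the same ring. */
bool cq_push(struct cq *cq, struct cqe e)
{
    if (cq->error)
        return false;
    if (cq->tail - cq->head == CQ_DEPTH) {
        cq->error = true; /* overflow detected */
        return false;
    }
    cq->ring[cq->tail++ % CQ_DEPTH] = e;
    return true;
}

/* Driver side: one poll drains completions from all bound QPs. */
int cq_poll(struct cq *cq, struct cqe *out, int max)
{
    assert(max >= 0);

    int n = 0;
    while (n < max && cq->head != cq->tail)
        out[n++] = cq->ring[cq->head++ % CQ_DEPTH];
    return n;
}
```

Polling cost depends only on the number of pending entries, not on how
many QPs are attached, and the full-ring check gives the device a
single obvious place to detect overflow.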

> Writing C devices is still perfectly acceptable in QEMU. With Rust you
> are likely to have to work on bindings and may hit issues just because
> Rust is new in QEMU, but it's there and you can absolutely use it.

Yeah that's what it seemed like when I checked. I'd like to keep things
simple as a first pass and make sure we get something working while
monitoring for any Rust PCI+DMA bindings for future versions of this.

> I think vhost-vdpa-device-pci is good practice. It makes sure the
> device is self-contained and doesn't rely on device-specific VMM
> support. If it's possible to use just vhost-vdpa-device-pci, then
> that's great. It scales better because it avoids the need to implement
> a device in every VMM (like QEMU, Firecracker, etc).

Great! Will proceed with that.

> You can reserve a device ID from the VIRTIO Technical Committee
> separately from getting the spec merged. Ask
> virtio-comment@lists.linux.dev.

Will do!

> It is helpful to see the draft VIRTIO spec and RFC patches at the same
> time. So as soon as you want to discuss the specifics of the VIRTIO
> spec patches it would be a good time to send RFC patches showing how
> the spec is implemented.

Will do the above shortly. I definitely need to revise my spec and code
first to incorporate some of the above. I really appreciate your
feedback on this!

Serapheim
