* [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU
@ 2026-03-16 21:05 Serapheim Dimitropoulos
  2026-03-17  5:42 ` Stefan Hajnoczi
  0 siblings, 1 reply; 3+ messages in thread
From: Serapheim Dimitropoulos @ 2026-03-16 21:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: mst, pbonzini, stefanha, sgarzare, xieyongji, weijunji,
	xiongweimin, Serapheim Dimitropoulos

Hi all,

Apologies if this is not the right place, but I'd like to propose
adding a new virtio device type to QEMU for RDMA emulation. Before
sending patches, I wanted to introduce the idea and get early feedback
on the design so that I don't go too far with my implementation.

= Motivation =

QEMU removed pvrdma in v9.1 (deprecated in v8.2). There is currently
no RDMA device emulation in QEMU. Anyone wanting to learn, develop, or
test RDMA software needs Mellanox/NVIDIA hardware, which can be
hard to come by.

Software RDMA stacks (rxe, siw) exist in the kernel, but they run
entirely on the guest CPU. One-sided operations (RDMA WRITE/READ)
still involve the remote CPU: it receives a packet (UDP for rxe, TCP
for siw), decapsulates it, and copies the data in software. My point
is that they give you the RDMA API but not the hardware behavior.

Meanwhile, the vDPA framework has matured significantly (VDUSE in
5.15+, generic vhost-vdpa-device-pci in QEMU 8.0+), creating a
natural abstraction layer for virtio devices that can span software
emulation and hardware offload. No one has applied this to RDMA yet.

= My Proposal =

A new virtio-rdma device type (one guest driver + multiple backends):

 1. QEMU device model (in-process, C): The reference implementation.
    Software emulation for development, CI tests, and learning. No
    host RDMA stack or hardware needed. The idea: two QEMU instances
    on a dev laptop doing RDMA to each other.

 2. VDUSE backend (userspace daemon): The same virtio-rdma protocol
    implemented as a standalone process via /dev/vduse. Could be
    written in Rust (or not; no preference here). It's crash-isolated
    from the host kernel and the goal is to make it a natural fit
    for DPU control planes (i.e. the DPU runs a VDUSE daemon that
    presents virtio-rdma to the host VM).

 3. Hardware vDPA (potential long-term goal): when/if DPU/SmartNIC
    vendors ever implement the virtio-rdma data path in silicon, the
    guest driver would work unchanged via vhost-vdpa-device-pci (no
    QEMU device code needed).

The guest driver is the same in all three cases. It submits verbs
through virtqueues, completely unaware of the backend.

The main idea is that this is basically self-contained (no host RDMA
stack or hardware needed for backend 1), it uses the standard
virtio-pci transport, making it hypervisor-agnostic, and it provides
RDMA "hardware" behavior without the actual hardware (compared to
rxe/siw, which provide RDMA "API" behavior without hardware). Again,
this is complementary to rxe/siw and not a replacement.

The follow-up on that is that I hope it could fit naturally into the
vDPA ecosystem alongside virtio-net and virtio-blk as a first-class
offloadable device.

= Design Overview =

Four virtqueues:

  VQ 0 (command)    - resource management: create/destroy PD, MR,
                      CQ, QP, AH. Synchronous request/response.
  VQ 1 (completion) - device returns CQEs to pre-posted driver
                      buffers when operations complete.
  VQ 2 (data-tx)    - driver posts SEND/RDMA WRITE/READ work
                      requests with scatter/gather data.
  VQ 3 (data-rx)    - driver pre-posts receive buffers; device
                      fills them on incoming SEND.
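
To make the layout above concrete, here is a minimal sketch of how the
queue indices and a command header might be declared. All identifiers
(vrdma_*, VRDMA_*) and field layouts are hypothetical placeholders for
illustration, not the names or wire format of the draft spec.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical virtqueue indices, mirroring the four-VQ layout. */
enum vrdma_vq_index {
    VRDMA_VQ_COMMAND    = 0,  /* resource management */
    VRDMA_VQ_COMPLETION = 1,  /* CQEs returned by the device */
    VRDMA_VQ_DATA_TX    = 2,  /* SEND / RDMA WRITE / RDMA READ requests */
    VRDMA_VQ_DATA_RX    = 3,  /* pre-posted receive buffers */
    VRDMA_VQ_COUNT
};

/* Hypothetical command opcodes for the objects managed on VQ 0. */
enum vrdma_cmd_opcode {
    VRDMA_CMD_CREATE_PD = 1, VRDMA_CMD_DESTROY_PD,
    VRDMA_CMD_CREATE_MR,     VRDMA_CMD_DESTROY_MR,
    VRDMA_CMD_CREATE_CQ,     VRDMA_CMD_DESTROY_CQ,
    VRDMA_CMD_CREATE_QP,     VRDMA_CMD_DESTROY_QP,
    VRDMA_CMD_CREATE_AH,     VRDMA_CMD_DESTROY_AH,
};

/*
 * Assumed request/response framing. A real wire format would use
 * explicit little-endian types, as other virtio devices do.
 */
struct vrdma_cmd_hdr {
    uint32_t opcode;       /* enum vrdma_cmd_opcode */
    uint32_t payload_len;  /* bytes of opcode-specific data that follow */
};

struct vrdma_cmd_resp {
    uint32_t status;       /* 0 on success */
    uint32_t handle;       /* object handle for create commands */
};

/* Map a queue index to a human-readable name; NULL if out of range. */
static const char *vrdma_vq_name(unsigned int idx)
{
    static const char *const names[VRDMA_VQ_COUNT] = {
        [VRDMA_VQ_COMMAND]    = "command",
        [VRDMA_VQ_COMPLETION] = "completion",
        [VRDMA_VQ_DATA_TX]    = "data-tx",
        [VRDMA_VQ_DATA_RX]    = "data-rx",
    };
    return idx < VRDMA_VQ_COUNT ? names[idx] : NULL;
}
```

In this sketch the driver would place a vrdma_cmd_hdr plus payload in a
device-readable descriptor and a vrdma_cmd_resp in a device-writable
one, which matches the synchronous request/response model of VQ 0.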

The device maintains full resource state: protection domains, memory
regions with page tables and access keys, QPs with the standard IB
state machine (RESET -> INIT -> RTR -> RTS), and completion queues.
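
As an illustration of the QP state machine, the following sketch shows
one way the device could validate a modify-QP request. The enum and
function names are hypothetical, and the ERR state is an assumption
borrowed from the usual IB model (only RESET, INIT, RTR and RTS are
listed above). It is simplified and omits states such as SQD and SQE.

```c
#include <stdbool.h>

/* Hypothetical QP states; ERR is an assumption, see the note above. */
enum vrdma_qp_state {
    VRDMA_QPS_RESET,
    VRDMA_QPS_INIT,
    VRDMA_QPS_RTR,
    VRDMA_QPS_RTS,
    VRDMA_QPS_ERR,
};

/* Return true if a modify-QP command may move the QP from cur to next. */
static bool vrdma_qp_transition_ok(enum vrdma_qp_state cur,
                                   enum vrdma_qp_state next)
{
    /* Any state may be torn down to RESET or forced into ERR. */
    if (next == VRDMA_QPS_RESET || next == VRDMA_QPS_ERR) {
        return true;
    }

    switch (cur) {
    case VRDMA_QPS_RESET:
        return next == VRDMA_QPS_INIT;
    case VRDMA_QPS_INIT:
        /* INIT -> INIT lets the driver update attributes in place. */
        return next == VRDMA_QPS_INIT || next == VRDMA_QPS_RTR;
    case VRDMA_QPS_RTR:
        return next == VRDMA_QPS_RTS;
    case VRDMA_QPS_RTS:
        return next == VRDMA_QPS_RTS;
    default:
        /* From ERR, only RESET (handled above) is allowed. */
        return false;
    }
}
```

A command handler would call this before applying new attributes and
return an error status on the command VQ when the transition is
rejected, for example for RESET -> RTS.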

The QEMU device model uses a socket backend for SEND/RECV currently.
A shared-memory backend (ivshmem) would be needed for truly one-sided
RDMA WRITE/READ where the remote CPU is not involved (matches real HCA
DMA behavior). The same shared-memory transport would also be reused
by the VDUSE daemon backend.
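
A rough sketch of what a one-sided RDMA WRITE over shared memory could
look like on the device side: check the remote key and access rights,
bounds-check the target range, then copy directly into the mapped
region without involving the remote guest. The struct, flag and
function names are hypothetical, and a single contiguous mapping is
assumed instead of a real page table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VRDMA_ACCESS_REMOTE_WRITE (1u << 0)  /* hypothetical flag */

/* Hypothetical memory region backed by a shared-memory mapping. */
struct vrdma_mr {
    uint32_t rkey;    /* remote access key handed out at registration */
    uint32_t access;  /* VRDMA_ACCESS_* bits */
    uint64_t iova;    /* start address as seen by the remote peer */
    uint64_t length;  /* region size in bytes */
    uint8_t *base;    /* local mapping of the shared memory */
};

/* Returns 0 on success, -1 if the key, rights or range are invalid. */
static int vrdma_rdma_write(const struct vrdma_mr *mr, uint32_t rkey,
                            uint64_t remote_addr, const void *src,
                            uint64_t len)
{
    uint64_t off;

    if (rkey != mr->rkey || !(mr->access & VRDMA_ACCESS_REMOTE_WRITE)) {
        return -1;
    }
    if (remote_addr < mr->iova) {
        return -1;
    }
    off = remote_addr - mr->iova;
    /* Written this way to avoid overflow in off + len. */
    if (off > mr->length || len > mr->length - off) {
        return -1;
    }
    memcpy(mr->base + off, src, (size_t)len);
    return 0;
}
```

With the socket backend, the equivalent step has to be carried out by
the receiving QEMU process. With a shared mapping, the sender's device
model can perform the copy itself, which is closer to HCA DMA behavior.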

The QEMU device model currently looks something like this in terms of
structure:

  include/hw/virtio/virtio-rdma.h     - device structs, wire protocol
  hw/virtio/virtio-rdma.c             - resource manager, command VQ,
                                        datapath, completions
  hw/virtio/virtio-rdma-pci.c         - PCI wrapper (boilerplate)
  hw/virtio/virtio-rdma-backend.c     - socket backend

I have a minimally working kernel driver (modeled after
drivers/infiniband/hw/efa/) and a very early draft virtio spec. I'll
gladly send RFC patches for all three if the general direction
makes sense to you.

= The vDPA/VDUSE case =

I want to highlight why I believe this matters beyond pure emulation.
The vDPA framework already handles virtio-net and virtio-blk offload to
hardware. virtio-rdma would be the first vDPA-compatible RDMA device
type.

The specific use case for my employer would be DPU-based RDMA. A DPU
(think NVIDIA BlueField) runs a VDUSE daemon that presents a
virtio-rdma device to the host VM. The daemon handles the RDMA control
plane (connection setup, memory registration, key exchange) in
userspace, then programs the DPU's physical NIC for the data plane.
The host VM sees a standard virtio device and uses the standard
virtio-rdma driver (no vendor-specific drivers needed).

If I understand correctly, the vhost-vdpa-device-pci architecture
was designed for exactly that, so RDMA could fit naturally as a device
type after net and block.

VDUSE currently only supports virtio-block (security scoping in
drivers/vdpa/vduse/). Extending it to virtio-rdma would require
kernel patches to whitelist the device type, with appropriate
validation in the virtio-rdma driver to handle untrusted device
input safely.
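
To illustrate the kind of validation meant here, below is a minimal
sketch of a driver-side sanity check on a completion entry written by
a possibly untrusted device. The CQE layout, status codes and function
name are hypothetical, not taken from the draft spec or driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion status codes. */
enum vrdma_wc_status {
    VRDMA_WC_SUCCESS = 0,
    VRDMA_WC_LOC_LEN_ERR,
    VRDMA_WC_REM_ACCESS_ERR,
    VRDMA_WC_STATUS_MAX
};

/* Hypothetical CQE layout as written by the device. */
struct vrdma_cqe {
    uint64_t wr_id;     /* opaque cookie from the work request */
    uint32_t qpn;       /* QP the completion belongs to */
    uint32_t status;    /* enum vrdma_wc_status */
    uint32_t byte_len;  /* bytes transferred */
    uint32_t reserved;
};

/*
 * Reject any CQE whose fields fall outside what the driver set up.
 * used_len is the length the device reported for the used buffer,
 * num_qps is the size of the driver's QP table, and posted_len is
 * the size of the buffer the driver posted for this work request.
 */
static bool vrdma_cqe_sane(const struct vrdma_cqe *cqe, uint32_t used_len,
                           uint32_t num_qps, uint32_t posted_len)
{
    if (used_len != sizeof(*cqe)) {
        return false;  /* short or oversized write by the device */
    }
    if (cqe->qpn >= num_qps) {
        return false;  /* would index past the QP table */
    }
    if (cqe->status >= VRDMA_WC_STATUS_MAX) {
        return false;  /* unknown status code */
    }
    if (cqe->byte_len > posted_len) {
        return false;  /* claims more data than the buffer holds */
    }
    return true;
}
```

The driver would also want to copy the CQE out of the shared buffer
before checking it, so the device cannot change the fields between
validation and use.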

= Prior work =

I was able to find a few explorations of virtio-RDMA but no prior
effort produced upstream-viable code. Yuval Shaia (Oracle) posted
an RFC in April 2019 with a QEMU device model and kernel driver,
but it only implemented probing and basic ibverbs (no data path, and
no progress past RFC v1). The MIKELANGELO/Huawei vRDMA project
(2016-2017) targeted QEMU 2.3 and the OSv unikernel as part of an EU
Horizon 2020 research effort; it now seems obsolete.

The most relevant prior work I could find is from Xie Yongji and Wei
Junji at ByteDance, who posted an [RFC v2] to virtio-comment in May
2022 [1] proposing to add RoCE as a VIRTIO_NET_F_ROCE feature extension
to virtio-net (rather than a standalone device type). Their v1 was a
standalone device; the v2 reworked it as a virtio-net extension. They
had working code (kernel driver, QEMU, rdma-core, vhost-user-rdma
backend) but the effort seems to have gone quiet after 2022 with no
v3. Notably, Yongji is also the author of VDUSE itself so I'd really
value his input. I'm proposing a standalone device type rather than a
virtio-net extension because RDMA isn't inherently tied to Ethernet
(InfiniBand exists), it maps more cleanly onto the vDPA offload model
as a separate device, and it avoids burdening virtio-net with RDMA
complexity. But I could be wrong — happy to hear arguments either way.

In addition, very recently, Xiong Weimin posted a vhost-user-rdma/DPDK
concept to netdev (Dec 2025) [2] which takes a different architectural
approach (DPDK-based vhost-user backend); that work seems
complementary but has not produced formal patches.

I CC'd both Yongji and Xiong on this email in case they have opinions.

This proposal builds on the same virtio-RDMA concept with a complete
data path, modern QEMU/kernel APIs (virtio-1.0, QIOChannel, kernel
6.x ib_device_ops), and a vDPA-native architecture that none of the
prior efforts had.

[1] https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?MessageKey=20912da6-db56-441c-9117-0148b9c86ea5&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226
[2] https://lists.openwall.net/netdev/2025/12/19/13

= Questions for the community =

 1. Does a virtio-rdma device make sense as a direction? I've studied
    virtio-sound and virtio-can as precedents for new device types.

 2. The device is written in C to match existing virtio devices. I saw
    the Rust PCI+DMA bindings are still in progress — should I wait,
    or is C the right choice today?

 3. For the shared-memory backend, I'm planning to use ivshmem. Is
    there a preferred mechanism for zero-copy inter-VM memory sharing
    in QEMU today, or is ivshmem still the way to go?

 4. Any concerns about the virtqueue layout? I split command and
    completion into separate VQs (rather than putting responses
    inline) to allow async completions — similar to how real HCAs
    separate CQ from QP.

 5. For the VDUSE backend path: should the QEMU integration rely on
    the existing generic vhost-vdpa-device-pci, or would a dedicated
    vhost-vdpa-rdma device (like vhost-vdpa-net) be preferred for
    control-plane visibility?

 6. I'm planning to extend VDUSE to support virtio-rdma as a device
    type. Is there ongoing work or discussion about expanding VDUSE
    beyond virtio-block that I should coordinate with?

 7. Would it be ok to use the experimental device ID 0x1045 until
    the OASIS TC assignment?

I'm happy to send the RFC patch series, kernel driver, and spec draft
whenever you'd like to see code.

Thanks,
Serapheim

