public inbox for qemu-devel@nongnu.org
From: Serapheim Dimitropoulos <serapheimd@gmail.com>
To: stefanha@redhat.com
Cc: qemu-devel@nongnu.org, mst@redhat.com, pbonzini@redhat.com,
	sgarzarella@redhat.com, xieyongji@bytedance.com,
	weijunji@bytedance.com, 15927021679@163.com
Subject: Re: [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU
Date: Fri, 20 Mar 2026 13:23:45 -0400	[thread overview]
Message-ID: <20260320172345.4688-1-serapheimd@gmail.com> (raw)
In-Reply-To: <CAJSP0QX3JtwhXDTnEgS-heLYYncjJtttg1KmFqG-+_VUXbH+Ww@mail.gmail.com>

Hi Stefan,

Thank you for the quick reply and thorough review! I waited a few days
before replying to see if any of the folks from the previous virtio-rdma
efforts would chime in, but nothing so far. In any case, my responses
are inline below.

> I think this is a sign that virtual RDMA has a small community without
> someone willing to maintain it over the long term. Can you see
> yourself actively maintaining this over the coming years?
>
> If not, then it may be more appropriate to treat it as an experimental
> and out-of-tree project. That way the spec and code can be shared in
> case others want to build on it in the future without any commitment
> or the overhead of going through the full process of getting a device
> merged into the VIRTIO spec, QEMU emulating code merged, and Linux
> guest code merged.
>
> If you are going to ship products that rely on this, then it's
> probably necessary to go through the full process of getting
> everything merged upstream.

I understand the hesitation given the pvrdma precedent. I want to be
explicit that this is not a side project. I'm a kernel engineer at
CoreWeave, and something like virtio-rdma is a requirement for some of
our current projects. Two that I feel comfortable disclosing are our
work with BlueField DPUs and Kata containers.

For NVIDIA BlueFields we currently do most of our work on real
hardware, which is at times hard for us developers to come by since we
want to make sure our customers take priority. Being able to emulate
the device in QEMU means every engineer on the team can iterate
without a dedicated BlueField, and our test suite can run more often
this way too.

For Kata containers today there are two options for getting RDMA:
use virtio-net and give up latency, or use vfio-pci passthrough and
pin the pod to a NIC. The latter breaks both the security model and
isolation, not to mention any prospect of live migration.

I'm OK keeping this as an out-of-tree project in the short term, but
I'd hate for something else to come along later and force us to rework
everything on our end. As far as long-term commitment goes, I commit
to maintaining virtio-rdma for as long as it's upstream. If I ever
leave my role at CoreWeave and my next role is unrelated to
virtio-rdma, my team at CoreWeave will designate a successor
maintainer, as it has an organizational interest in this work. I'm
happy to formalize this commitment in a MAINTAINERS entry when/if the
time comes.

> Does this mean that memory is registered on VQ 0 and incoming RDMA
> WRITE (without immediate) requests modify that memory directly without
> virtqueue activity? I think this is necessary because registered
> memory is available continuously and the virtqueue model doesn't
> really work for this mode of operation.
>
> It's worth clarifying this because accessing memory outside of
> virtqueue buffers is a violation of the VIRTIO device model. That's
> okay, VIRTIO is pragmatic and some devices do this but it's worth
> mentioning explicitly.
>
> Stepping outside the VIRTIO device model can create implementation
> challenges because interfaces like vDPA/VDUSE may not be designed for
> it though.

That's a great point, actually. Evaluating the potential paths
forward, I thought of the following (though I'm open to other ideas if
you have them):

A] Use a shared memory window (like virtio-fs DAX) - this wouldn't
work because it changes the RDMA programming model. Real HCAs let
you register *any* part of memory via ibv_reg_mr(); restricting MRs
to a pre-allocated window would thus break standard applications.

B] Just acknowledge the deviation explicitly and move on - as you
said, besides being a spec violation this doesn't concretely solve the
vDPA/VDUSE case, since those backends may want to enforce IOMMU
boundaries.

C] Go the IOTLB route - when the driver registers an MR, the device
triggers IOTLB updates for every page in the MR giving the backend
legal IOMMU mappings.

Let me know if you can think of any other ways, but I believe [C]
may be the way to go since real HCAs do the same thing. In our case
it would look like this:

1. REG_USER_MR sends the page list (guest physical addresses) via
   the command VQ (virtio-compliant command).

2. The device uses the platform's DMA mapping mechanism to establish
   mappings for each page in the MR. I believe for QEMU that would be
   address_space_map(), since it has full guest RAM access; for
   vhost-user it's VHOST_USER_IOTLB_MSG, and for VDUSE it's
   VDUSE_IOTLB_REG_UMEM.

3. RDMA WRITE/READ resolve (remote_addr + rkey) through those mappings.

4. DEREG_MR invalidates them.

The above should require VIRTIO_F_ACCESS_PLATFORM when used with
IOMMU-protected backends like vDPA/VDUSE. As for the per-page mapping
cost at registration, I'm open to ideas, but I wonder if it's
acceptable for the v1 pass since it's a one-time cost.
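To make sure we're picturing the same flow, here is a minimal sketch
of steps 1-2 and 4. The struct layout and all names are purely
illustrative (nothing here is from a spec draft), and
platform_dma_map() is just a stand-in for whichever mapping primitive
the backend exposes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical REG_USER_MR payload on the command VQ (illustrative only). */
struct virtio_rdma_cmd_reg_mr {
    uint32_t pd_handle;      /* protection domain handle */
    uint32_t access_flags;   /* remote read/write permissions, etc. */
    uint64_t length;         /* MR length in bytes */
    uint64_t num_pages;      /* number of entries in page_addrs[] */
    uint64_t page_addrs[];   /* guest-physical addresses, one per page */
};

/* Stand-in for the platform mapping primitive: address_space_map() in
 * QEMU, VHOST_USER_IOTLB_MSG for vhost-user, VDUSE_IOTLB_REG_UMEM for
 * VDUSE. Here it just counts mappings so the flow is observable. */
static uint64_t mapped_pages;
static void platform_dma_map(uint64_t gpa)   { (void)gpa; mapped_pages++; }
static void platform_dma_unmap(uint64_t gpa) { (void)gpa; mapped_pages--; }

/* Step 2: on REG_USER_MR, establish a mapping for every page in the MR. */
static void handle_reg_user_mr(const struct virtio_rdma_cmd_reg_mr *cmd)
{
    for (uint64_t i = 0; i < cmd->num_pages; i++) {
        platform_dma_map(cmd->page_addrs[i]);
    }
}

/* Step 4: DEREG_MR tears the mappings back down. */
static void handle_dereg_mr(const struct virtio_rdma_cmd_reg_mr *cmd)
{
    for (uint64_t i = 0; i < cmd->num_pages; i++) {
        platform_dma_unmap(cmd->page_addrs[i]);
    }
}
```

RDMA WRITE/READ (step 3) would then translate (remote_addr + rkey)
through whatever the mapping primitive returned, which I've elided
here.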

One potential future scalability issue with the flat list is that for
very large MRs (128 GiB, i.e. 32M page addresses) it can become too
long/heavy for the command VQ, which could be prohibitive for any
potential hardware implementation (if there were ever to be one). For
v1 maybe we could just reserve VIRTIO_RDMA_F_INDIRECT_MR and leave
room for a future indirect page-table model?
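As a back-of-envelope check on how heavy the flat list gets (assuming
4 KiB pages and 8-byte guest-physical addresses, both of which are
assumptions on my part):

```c
#include <stdint.h>

/* Number of page addresses needed to describe an MR, at 4 KiB/page. */
static uint64_t flat_list_entries(uint64_t mr_bytes)
{
    return mr_bytes / 4096;
}

/* Size of the flat page list itself, at 8 bytes per guest-physical
 * address: a 128 GiB MR needs 32Mi entries, i.e. a 256 MiB list. */
static uint64_t flat_list_bytes(uint64_t mr_bytes)
{
    return flat_list_entries(mr_bytes) * sizeof(uint64_t);
}
```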

Let me know how the above sounds to you and I'll make sure to document
it more formally in the spec draft.

> [...] check how registered memory can be implemented both in VDUSE
> and in-kernel vDPA drivers.

The IOTLB model above should cover both cases, but I can double-check.
For VDUSE, MR registration triggers VDUSE_IOTLB_REG_UMEM calls; for
in-kernel vDPA, the vDPA bus provides DMA mapping APIs that map
through the parent IOMMU. The guest driver should ideally be unaware
of which backend is in use: it just sends REG_USER_MR and the device
handles the rest.

> For the userspace virtio-rdma device implementation I expected a new
> UNIX domain socket protocol along the lines of vhost-user and
> vfio-user. That's because sharing guest RAM is only part of the
> communication that must happen between two QEMUs and I guess you'll
> need to define your own protocol to coordinate RDMA between QEMU
> processes anyway.
>
> When using vDPA or VDUSE, QEMU shares guest RAM with the device
> through the /dev/vhost ioctls.
>
> In both cases, I'm not sure if ivshmem is necessary.

Makes sense - thank you for the pointers! I'm almost done switching
to domain sockets per your recommendation. The new scheme is
peer-to-peer (not a reuse of vhost-user, which is VMM-to-backend). In
the handshake phase each side exchanges MEM_REGIONS messages carrying
memfd descriptors for its guest RAM regions, and the peer mmap()s
them. Send/recv are then forwarded as framed messages on the socket,
while RDMA WRITE/READ operate directly on the mmap'd peer memory
(no message nor remote CPU involvement).
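The handshake part is essentially memfd + SCM_RIGHTS + mmap(). A
self-contained sketch of that mechanism (Linux-only; the MEM_REGIONS
framing itself is elided, and everything here is my own placeholder,
not protocol text):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Pass one file descriptor over a UNIX domain socket via SCM_RIGHTS,
 * the way a MEM_REGIONS message would carry a memfd for a guest RAM
 * region during the handshake. */
static void send_fd(int sock, int fd)
{
    char byte = 'M';  /* one byte of payload to anchor the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    memset(ctrl, 0, sizeof(ctrl));
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    assert(sendmsg(sock, &msg, 0) == 1);
}

static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
    assert(recvmsg(sock, &msg, 0) == 1);
    int fd;
    memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
    return fd;
}
```

Once both sides hold the same memfd, a "WRITE" is literally a store
through the peer mapping, which is what makes the no-message data path
possible.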

BTW, I don't need that level of detail in my spec, correct? From what
I can tell, specs only define device-to-driver behavior (e.g.
virtio-net doesn't say anything about TAP/vhost-user, etc.).

> It depends what you mean by async. Virtqueues can complete requests
> out-of-order, so a separate completion virtqueue is not needed from
> that perspective.
>
> There could be other reasons why a separate completion virtqueue makes
> sense. If RDMA relies on the separate CQ design to emit multiple CQEs
> for the same request or emits CQEs not associated with any request,
> then a single virtqueue won't work. I don't know RDMA well enough to
> say either way.

OK, looking more at RDMA CQ semantics, and assuming I understand your
proposal correctly, I do believe we need the separate queue for the
following reasons:

- An RDMA application may create one CQ and bind multiple QPs to it
  (multi-QP fan-in). ibv_poll_cq() returns completions from all
  associated QPs in one call. If completions lived in per-VQ used
  rings, polling would mean scanning 2N VQs, i.e. O(n) per poll.
  There's also an abstraction mismatch: a CQ is a single aggregation
  point, while a virtio VQ has a per-queue used ring. A dedicated
  completion VQ gives us O(1) fan-in for free.

- The RDMA spec mandates that when a CQ overflows the device raises
  IBV_EVENT_CQ_ERR, which cascades to IBV_EVENT_QP_FATAL on every QP
  bound to that CQ, so the device must be able to detect the overflow.
  That's straightforward with a dedicated completion VQ, whereas the
  virtio used ring has no overflow semantics from what I can tell, so
  we'd need to invent something ourselves.

- ibv_req_notify_cq(solicited_only=1) fires only for CQEs with the
  solicited flag set. I'm not sure how to express that as-is, since
  the virtio event index suppresses by count rather than by per-WR
  flag. This is less critical, though: we could always notify and have
  the driver filter in its interrupt handler, which would be
  functionally correct.

Let me know if the above reasons seem legitimate on your end. It
generally seems to me that the used ring has nowhere to put the fields
an RDMA CQE needs (it only carries generic completion metadata), so
using a dedicated VQ to return buffers containing CQEs makes more
sense than trying to encode CQE data in the used ring itself.
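Concretely, I'm picturing the CQ as its own aggregation ring that CQEs
from many QPs funnel into. A toy model of the first two points (the
CQE fields loosely mirror struct ibv_wc; everything else, including
the names, is illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative CQE layout, not from any spec draft. */
struct virtio_rdma_cqe {
    uint64_t wr_id;     /* work request id */
    uint32_t qp_num;    /* which QP this completion belongs to */
    uint32_t status;    /* completion status */
    bool     solicited; /* for solicited-only notification */
};

#define CQ_DEPTH 4

/* The CQ as a single aggregation ring: many QPs push, one poll pops. */
struct cq {
    struct virtio_rdma_cqe ring[CQ_DEPTH];
    unsigned head, tail;
    bool overflowed;    /* device raises IBV_EVENT_CQ_ERR when set */
};

static void cq_push(struct cq *cq, struct virtio_rdma_cqe cqe)
{
    if (cq->tail - cq->head == CQ_DEPTH) {  /* overflow is detectable */
        cq->overflowed = true;
        return;
    }
    cq->ring[cq->tail++ % CQ_DEPTH] = cqe;
}

static bool cq_poll(struct cq *cq, struct virtio_rdma_cqe *out)
{
    if (cq->head == cq->tail) {
        return false;
    }
    *out = cq->ring[cq->head++ % CQ_DEPTH];  /* O(1), however many QPs */
    return true;
}
```

A per-VQ used ring gives you neither the single pop point nor the
explicit full/overflow state above, which is the crux of the argument.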

> Writing C devices is still perfectly acceptable in QEMU. With Rust you
> are likely to have to work on bindings and may hit issues just because
> Rust is new in QEMU, but it's there and you can absolutely use it.

Yeah, that's what it seemed like when I checked. I'd like to keep
things simple for the first pass and make sure we get something
working, while keeping an eye on Rust PCI+DMA bindings for future
versions of this.

> I think vhost-vpda-device-pci is good practice. It makes sure the
> device is self-contained and doesn't rely on device-specific VMM
> support. If it's possible to use just vhost-vdpa-device-pci, then
> that's great. It scales better because it avoids the need to implement
> a device in every VMM (like QEMU, Firecracker, etc).

Great! Will proceed with that.

> You can reserve a device ID from the VIRTIO Technical Committee
> separately from getting the spec merged. Ask
> virtio-comment@lists.linux.dev.

Will do!

> It is helpful to see the draft VIRTIO spec and RFC patches at the same
> time. So as soon as you want to discuss the specifics of the VIRTIO
> spec patches it would be a good time to send RFC patches showing how
> the spec is implemented.

Will do the above shortly. I definitely need to revise my spec and code
first to incorporate some of the above. I really appreciate your
feedback on this!

Serapheim



2026-03-16 21:05 [RFC DISCUSSION] virtio-rdma: RDMA device emulation for QEMU Serapheim Dimitropoulos
2026-03-17  5:42 ` Stefan Hajnoczi
2026-03-20 17:23   ` Serapheim Dimitropoulos [this message]
