[PATCH 0/6] refactor RDMA live migration based on rsocket API

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] refactor RDMA live migration based on rsocket API
@ 2024-06-04 12:14 Gonglei via
  2024-06-04 12:14 ` [PATCH 1/6] migration: remove RDMA live migration temporarily Gonglei via
                   ` (9 more replies)
  0 siblings, 10 replies; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

Hi,

This patch series attempts to refactor RDMA live migration by
introducing a new QIOChannelRDMA class based on the rsocket API.

The /usr/include/rdma/rsocket.h provides a higher level rsocket API
that is a 1-1 match of the normal kernel 'sockets' API, which hides the
detail of rdma protocol into rsocket and allows us to add support for
some modern features like multifd more easily.

Here is the previous discussion on refactoring RDMA live migration using
the rsocket API:

https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/

We have encountered some bugs when using rsocket and plan to submit them to
the rdma-core community.

In addition, the use of rsocket makes our programming more convenient,
but it must be noted that this method introduces multiple memory copies,
which can be imagined that there will be a certain performance degradation,
hoping that friends with RDMA network cards can help verify, thank you!

Jialin Wang (6):
  migration: remove RDMA live migration temporarily
  io: add QIOChannelRDMA class
  io/channel-rdma: support working in coroutine
  tests/unit: add test-io-channel-rdma.c
  migration: introduce new RDMA live migration
  migration/rdma: support multifd for RDMA migration

 docs/rdma.txt                     |  420 ---
 include/io/channel-rdma.h         |  165 ++
 io/channel-rdma.c                 |  798 ++++++
 io/meson.build                    |    1 +
 io/trace-events                   |   14 +
 meson.build                       |    6 -
 migration/meson.build             |    3 +-
 migration/migration-stats.c       |    5 +-
 migration/migration-stats.h       |    4 -
 migration/migration.c             |   13 +-
 migration/migration.h             |    9 -
 migration/multifd.c               |   10 +
 migration/options.c               |   16 -
 migration/options.h               |    2 -
 migration/qemu-file.c             |    1 -
 migration/ram.c                   |   90 +-
 migration/rdma.c                  | 4205 +----------------------------
 migration/rdma.h                  |   67 +-
 migration/savevm.c                |    2 +-
 migration/trace-events            |   68 +-
 qapi/migration.json               |   13 +-
 scripts/analyze-migration.py      |    3 -
 tests/unit/meson.build            |    1 +
 tests/unit/test-io-channel-rdma.c |  276 ++
 24 files changed, 1360 insertions(+), 4832 deletions(-)
 delete mode 100644 docs/rdma.txt
 create mode 100644 include/io/channel-rdma.h
 create mode 100644 io/channel-rdma.c
 create mode 100644 tests/unit/test-io-channel-rdma.c

-- 
2.43.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 1/6] migration: remove RDMA live migration temporarily
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
@ 2024-06-04 12:14 ` Gonglei via
  2024-06-04 14:01   ` David Hildenbrand
  2024-06-10 11:45   ` Markus Armbruster
  2024-06-04 12:14 ` [PATCH 2/6] io: add QIOChannelRDMA class Gonglei via
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

The new RDMA live migration will be introduced in the upcoming
few commits.

Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 docs/rdma.txt                |  420 ----
 meson.build                  |    6 -
 migration/meson.build        |    1 -
 migration/migration-stats.c  |    5 +-
 migration/migration-stats.h  |    4 -
 migration/migration.c        |   20 -
 migration/migration.h        |    9 -
 migration/options.c          |   16 -
 migration/options.h          |    2 -
 migration/qemu-file.c        |    1 -
 migration/ram.c              |   90 +-
 migration/rdma.c             | 4184 ----------------------------------
 migration/rdma.h             |   69 -
 migration/savevm.c           |    2 +-
 migration/trace-events       |   68 +-
 qapi/migration.json          |   13 +-
 scripts/analyze-migration.py |    3 -
 17 files changed, 10 insertions(+), 4903 deletions(-)
 delete mode 100644 docs/rdma.txt
 delete mode 100644 migration/rdma.c
 delete mode 100644 migration/rdma.h

diff --git a/docs/rdma.txt b/docs/rdma.txt
deleted file mode 100644
index bd8dd799a9..0000000000
--- a/docs/rdma.txt
+++ /dev/null
@@ -1,420 +0,0 @@
-(RDMA: Remote Direct Memory Access)
-RDMA Live Migration Specification, Version # 1
-==============================================
-Wiki: https://wiki.qemu.org/Features/RDMALiveMigration
-Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
-
-Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
-
-An *exhaustive* paper (2010) shows additional performance details
-linked on the QEMU wiki above.
-
-Contents:
-=========
-* Introduction
-* Before running
-* Running
-* Performance
-* RDMA Migration Protocol Description
-* Versioning and Capabilities
-* QEMUFileRDMA Interface
-* Migration of VM's ram
-* Error handling
-* TODO
-
-Introduction:
-=============
-
-RDMA helps make your migration more deterministic under heavy load because
-of the significantly lower latency and higher throughput over TCP/IP. This is
-because the RDMA I/O architecture reduces the number of interrupts and
-data copies by bypassing the host networking stack. In particular, a TCP-based
-migration, under certain types of memory-bound workloads, may take a more
-unpredictable amount of time to complete the migration if the amount of
-memory tracked during each live migration iteration round cannot keep pace
-with the rate of dirty memory produced by the workload.
-
-RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Converged Ethernet) as well as Infiniband-based. This implementation of
-migration using RDMA is capable of using both technologies because of
-the use of the OpenFabrics OFED software stack that abstracts out the
-programming model irrespective of the underlying hardware.
-
-Refer to openfabrics.org or your respective RDMA hardware vendor for
-an understanding on how to verify that you have the OFED software stack
-installed in your environment. You should be able to successfully link
-against the "librdmacm" and "libibverbs" libraries and development headers
-for a working build of QEMU to run successfully using RDMA Migration.
-
-BEFORE RUNNING:
-===============
-
-Use of RDMA during migration requires pinning and registering memory
-with the hardware. This means that memory must be physically resident
-before the hardware can transmit that memory to another machine.
-If this is not acceptable for your application or product, then the use
-of RDMA migration may in fact be harmful to co-located VMs or other
-software on the machine if there is not sufficient memory available to
-relocate the entire footprint of the virtual machine. If so, then the
-use of RDMA is discouraged and it is recommended to use standard TCP migration.
-
-Experimental: Next, decide if you want dynamic page registration.
-For example, if you have an 8GB RAM virtual machine, but only 1GB
-is in active use, then enabling this feature will cause all 8GB to
-be pinned and resident in memory. This feature mostly affects the
-bulk-phase round of the migration and can be enabled for extremely
-high-performance RDMA hardware using the following command:
-
-QEMU Monitor Command:
-$ migrate_set_capability rdma-pin-all on # disabled by default
-
-Performing this action will cause all 8GB to be pinned, so if that's
-not what you want, then please ignore this step altogether.
-
-On the other hand, this will also significantly speed up the bulk round
-of the migration, which can greatly reduce the "total" time of your migration.
-Example performance of this using an idle VM in the previous example
-can be found in the "Performance" section.
-
-Note: for very large virtual machines (hundreds of GBs), pinning all
-*all* of the memory of your virtual machine in the kernel is very expensive
-may extend the initial bulk iteration time by many seconds,
-and thus extending the total migration time. However, this will not
-affect the determinism or predictability of your migration you will
-still gain from the benefits of advanced pinning with RDMA.
-
-RUNNING:
-========
-
-First, set the migration speed to match your hardware's capabilities:
-
-QEMU Monitor Command:
-$ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device
-
-Next, on the destination machine, add the following to the QEMU command line:
-
-qemu ..... -incoming rdma:host:port
-
-Finally, perform the actual migration on the source machine:
-
-QEMU Monitor Command:
-$ migrate -d rdma:host:port
-
-PERFORMANCE
-===========
-
-Here is a brief summary of total migration time and downtime using RDMA:
-Using a 40gbps infiniband link performing a worst-case stress test,
-using an 8GB RAM virtual machine:
-
-Using the following command:
-$ apt-get install stress
-$ stress --vm-bytes 7500M --vm 1 --vm-keep
-
-1. Migration throughput: 26 gigabits/second.
-2. Downtime (stop time) varies between 15 and 100 milliseconds.
-
-EFFECTS of memory registration on bulk phase round:
-
-For example, in the same 8GB RAM example with all 8GB of memory in
-active use and the VM itself is completely idle using the same 40 gbps
-infiniband link:
-
-1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
-
-These numbers would of course scale up to whatever size virtual machine
-you have to migrate using RDMA.
-
-Enabling this feature does *not* have any measurable affect on
-migration *downtime*. This is because, without this feature, all of the
-memory will have already been registered already in advance during
-the bulk round and does not need to be re-registered during the successive
-iteration rounds.
-
-RDMA Protocol Description:
-==========================
-
-Migration with RDMA is separated into two parts:
-
-1. The transmission of the pages using RDMA
-2. Everything else (a control channel is introduced)
-
-"Everything else" is transmitted using a formal
-protocol now, consisting of infiniband SEND messages.
-
-An infiniband SEND message is the standard ibverbs
-message used by applications of infiniband hardware.
-The only difference between a SEND message and an RDMA
-message is that SEND messages cause notifications
-to be posted to the completion queue (CQ) on the
-infiniband receiver side, whereas RDMA messages (used
-for VM's ram) do not (to behave like an actual DMA).
-
-Messages in infiniband require two things:
-
-1. registration of the memory that will be transmitted
-2. (SEND only) work requests to be posted on both
-   sides of the network before the actual transmission
-   can occur.
-
-RDMA messages are much easier to deal with. Once the memory
-on the receiver side is registered and pinned, we're
-basically done. All that is required is for the sender
-side to start dumping bytes onto the link.
-
-(Memory is not released from pinning until the migration
-completes, given that RDMA migrations are very fast.)
-
-SEND messages require more coordination because the
-receiver must have reserved space (using a receive
-work request) on the receive queue (RQ) before QEMUFileRDMA
-can start using them to carry all the bytes as
-a control transport for migration of device state.
-
-To begin the migration, the initial connection setup is
-as follows (migration-rdma.c):
-
-1. Receiver and Sender are started (command line or libvirt):
-2. Both sides post two RQ work requests
-3. Receiver does listen()
-4. Sender does connect()
-5. Receiver accept()
-6. Check versioning and capabilities (described later)
-
-At this point, we define a control channel on top of SEND messages
-which is described by a formal protocol. Each SEND message has a
-header portion and a data portion (but together are transmitted
-as a single SEND message).
-
-Header:
-    * Length               (of the data portion, uint32, network byte order)
-    * Type                 (what command to perform, uint32, network byte order)
-    * Repeat               (Number of commands in data portion, same type only)
-
-The 'Repeat' field is here to support future multiple page registrations
-in a single message without any need to change the protocol itself
-so that the protocol is compatible against multiple versions of QEMU.
-Version #1 requires that all server implementations of the protocol must
-check this field and register all requests found in the array of commands located
-in the data portion and return an equal number of results in the response.
-The maximum number of repeats is hard-coded to 4096. This is a conservative
-limit based on the maximum size of a SEND message along with empirical
-observations on the maximum future benefit of simultaneous page registrations.
-
-The 'type' field has 12 different command values:
-     1. Unused
-     2. Error                      (sent to the source during bad things)
-     3. Ready                      (control-channel is available)
-     4. QEMU File                  (for sending non-live device state)
-     5. RAM Blocks request         (used right after connection setup)
-     6. RAM Blocks result          (used right after connection setup)
-     7. Compress page              (zap zero page and skip registration)
-     8. Register request           (dynamic chunk registration)
-     9. Register result            ('rkey' to be used by sender)
-    10. Register finished          (registration for current iteration finished)
-    11. Unregister request         (unpin previously registered memory)
-    12. Unregister finished        (confirmation that unpin completed)
-
-A single control message, as hinted above, can contain within the data
-portion an array of many commands of the same type. If there is more than
-one command, then the 'repeat' field will be greater than 1.
-
-After connection setup, message 5 & 6 are used to exchange ram block
-information and optionally pin all the memory if requested by the user.
-
-After ram block exchange is completed, we have two protocol-level
-functions, responsible for communicating control-channel commands
-using the above list of values:
-
-Logically:
-
-qemu_rdma_exchange_recv(header, expected command type)
-
-1. We transmit a READY command to let the sender know that
-   we are *ready* to receive some data bytes on the control channel.
-2. Before attempting to receive the expected command, we post another
-   RQ work request to replace the one we just used up.
-3. Block on a CQ event channel and wait for the SEND to arrive.
-4. When the send arrives, librdmacm will unblock us.
-5. Verify that the command-type and version received matches the one we expected.
-
-qemu_rdma_exchange_send(header, data, optional response header & data):
-
-1. Block on the CQ event channel waiting for a READY command
-   from the receiver to tell us that the receiver
-   is *ready* for us to transmit some new bytes.
-2. Optionally: if we are expecting a response from the command
-   (that we have not yet transmitted), let's post an RQ
-   work request to receive that data a few moments later.
-3. When the READY arrives, librdmacm will
-   unblock us and we immediately post a RQ work request
-   to replace the one we just used up.
-4. Now, we can actually post the work request to SEND
-   the requested command type of the header we were asked for.
-5. Optionally, if we are expecting a response (as before),
-   we block again and wait for that response using the additional
-   work request we previously posted. (This is used to carry
-   'Register result' commands #6 back to the sender which
-   hold the rkey need to perform RDMA. Note that the virtual address
-   corresponding to this rkey was already exchanged at the beginning
-   of the connection (described below).
-
-All of the remaining command types (not including 'ready')
-described above all use the aforementioned two functions to do the hard work:
-
-1. After connection setup, RAMBlock information is exchanged using
-   this protocol before the actual migration begins. This information includes
-   a description of each RAMBlock on the server side as well as the virtual addresses
-   and lengths of each RAMBlock. This is used by the client to determine the
-   start and stop locations of chunks and how to register them dynamically
-   before performing the RDMA operations.
-2. During runtime, once a 'chunk' becomes full of pages ready to
-   be sent with RDMA, the registration commands are used to ask the
-   other side to register the memory for this chunk and respond
-   with the result (rkey) of the registration.
-3. Also, the QEMUFile interfaces also call these functions (described below)
-   when transmitting non-live state, such as devices or to send
-   its own protocol information during the migration process.
-4. Finally, zero pages are only checked if a page has not yet been registered
-   using chunk registration (or not checked at all and unconditionally
-   written if chunk registration is disabled. This is accomplished using
-   the "Compress" command listed above. If the page *has* been registered
-   then we check the entire chunk for zero. Only if the entire chunk is
-   zero, then we send a compress command to zap the page on the other side.
-
-Versioning and Capabilities
-===========================
-Current version of the protocol is version #1.
-
-The same version applies to both for protocol traffic and capabilities
-negotiation. (i.e. There is only one version number that is referred to
-by all communication).
-
-librdmacm provides the user with a 'private data' area to be exchanged
-at connection-setup time before any infiniband traffic is generated.
-
-Header:
-    * Version (protocol version validated before send/recv occurs),
-                                               uint32, network byte order
-    * Flags   (bitwise OR of each capability),
-                                               uint32, network byte order
-
-There is no data portion of this header right now, so there is
-no length field. The maximum size of the 'private data' section
-is only 192 bytes per the Infiniband specification, so it's not
-very useful for data anyway. This structure needs to remain small.
-
-This private data area is a convenient place to check for protocol
-versioning because the user does not need to register memory to
-transmit a few bytes of version information.
-
-This is also a convenient place to negotiate capabilities
-(like dynamic page registration).
-
-If the version is invalid, we throw an error.
-
-If the version is new, we only negotiate the capabilities that the
-requested version is able to perform and ignore the rest.
-
-Currently there is only one capability in Version #1: dynamic page registration
-
-Finally: Negotiation happens with the Flags field: If the primary-VM
-sets a flag, but the destination does not support this capability, it
-will return a zero-bit for that flag and the primary-VM will understand
-that as not being an available capability and will thus disable that
-capability on the primary-VM side.
-
-QEMUFileRDMA Interface:
-=======================
-
-QEMUFileRDMA introduces a couple of new functions:
-
-1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)
-
-These two functions are very short and simply use the protocol
-describe above to deliver bytes without changing the upper-level
-users of QEMUFile that depend on a bytestream abstraction.
-
-Finally, how do we handoff the actual bytes to get_buffer()?
-
-Again, because we're trying to "fake" a bytestream abstraction
-using an analogy not unlike individual UDP frames, we have
-to hold on to the bytes received from control-channel's SEND
-messages in memory.
-
-Each time we receive a complete "QEMU File" control-channel
-message, the bytes from SEND are copied into a small local holding area.
-
-Then, we return the number of bytes requested by get_buffer()
-and leave the remaining bytes in the holding area until get_buffer()
-comes around for another pass.
-
-If the buffer is empty, then we follow the same steps
-listed above and issue another "QEMU File" protocol command,
-asking for a new SEND message to re-fill the buffer.
-
-Migration of VM's ram:
-====================
-
-At the beginning of the migration, (migration-rdma.c),
-the sender and the receiver populate the list of RAMBlocks
-to be registered with each other into a structure.
-Then, using the aforementioned protocol, they exchange a
-description of these blocks with each other, to be used later
-during the iteration of main memory. This description includes
-a list of all the RAMBlocks, their offsets and lengths, virtual
-addresses and possibly includes pre-registered RDMA keys in case dynamic
-page registration was disabled on the server-side, otherwise not.
-
-Main memory is not migrated with the aforementioned protocol,
-but is instead migrated with normal RDMA Write operations.
-
-Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
-Chunk size is not dynamic, but it could be in a future implementation.
-There's nothing to indicate that this is useful right now.
-
-When a chunk is full (or a flush() occurs), the memory backed by
-the chunk is registered with librdmacm is pinned in memory on
-both sides using the aforementioned protocol.
-After pinning, an RDMA Write is generated and transmitted
-for the entire chunk.
-
-Chunks are also transmitted in batches: This means that we
-do not request that the hardware signal the completion queue
-for the completion of *every* chunk. The current batch size
-is about 64 chunks (corresponding to 64 MB of memory).
-Only the last chunk in a batch must be signaled.
-This helps keep everything as asynchronous as possible
-and helps keep the hardware busy performing RDMA operations.
-
-Error-handling:
-===============
-
-Infiniband has what is called a "Reliable, Connected"
-link (one of 4 choices). This is the mode in which
-we use for RDMA migration.
-
-If a *single* message fails,
-the decision is to abort the migration entirely and
-cleanup all the RDMA descriptors and unregister all
-the memory.
-
-After cleanup, the Virtual Machine is returned to normal
-operation the same way that would happen if the TCP
-socket is broken during a non-RDMA based migration.
-
-TODO:
-=====
-1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
-   are not compatible with infiniband memory pinning and will result in
-   an aborted migration (but with the source VM left unaffected).
-2. Use of the recent /proc/<pid>/pagemap would likely speed up
-   the use of KSM and ballooning while using RDMA.
-3. Also, some form of balloon-device usage tracking would also
-   help alleviate some issues.
-4. Use LRU to provide more fine-grained direction of UNREGISTER
-   requests for unpinning memory in an overcommitted environment.
-5. Expose UNREGISTER support to the user by way of workload-specific
-   hints about application behavior.
diff --git a/meson.build b/meson.build
index 6386607144..3894f1f942 100644
--- a/meson.build
+++ b/meson.build
@@ -2425,12 +2425,6 @@ if rbd.found()
                                        dependencies: rbd,
                                        prefix: '#include <rbd/librbd.h>'))
 endif
-if rdma.found()
-  config_host_data.set('HAVE_IBV_ADVISE_MR',
-                       cc.has_function('ibv_advise_mr',
-                                       dependencies: rdma,
-                                       prefix: '#include <infiniband/verbs.h>'))
-endif
 
 have_asan_fiber = false
 if get_option('sanitizers') and \
diff --git a/migration/meson.build b/migration/meson.build
index bdc3244bce..4e8a9ccf3e 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -37,7 +37,6 @@ else
   system_ss.add(files('colo-stubs.c'))
 endif
 
-system_ss.add(when: rdma, if_true: files('rdma.c'))
 system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
 
 specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
diff --git a/migration/migration-stats.c b/migration/migration-stats.c
index f690b98a03..9bc8d7018f 100644
--- a/migration/migration-stats.c
+++ b/migration/migration-stats.c
@@ -62,9 +62,8 @@ void migration_rate_reset(void)
 uint64_t migration_transferred_bytes(void)
 {
     uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
-    uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
     uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred);
 
-    trace_migration_transferred_bytes(qemu_file, multifd, rdma);
-    return qemu_file + multifd + rdma;
+    trace_migration_transferred_bytes(qemu_file, multifd);
+    return qemu_file + multifd;
 }
diff --git a/migration/migration-stats.h b/migration/migration-stats.h
index 05290ade76..6b87e133f1 100644
--- a/migration/migration-stats.h
+++ b/migration/migration-stats.h
@@ -93,10 +93,6 @@ typedef struct {
      * Maximum amount of data we can send in a cycle.
      */
     Stat64 rate_limit_max;
-    /*
-     * Number of bytes sent through RDMA.
-     */
-    Stat64 rdma_bytes;
     /*
      * Number of pages transferred that were full of zeros.
      */
diff --git a/migration/migration.c b/migration/migration.c
index e1b269624c..6b9ad4ff5f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -25,7 +25,6 @@
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/cpu-throttle.h"
-#include "rdma.h"
 #include "ram.h"
 #include "migration/global_state.h"
 #include "migration/misc.h"
@@ -645,18 +644,6 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_incoming_migration(saddr->u.fd.str, errp);
         }
-#ifdef CONFIG_RDMA
-    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
-        if (migrate_xbzrle()) {
-            error_setg(errp, "RDMA and XBZRLE can't be used together");
-            return;
-        }
-        if (migrate_multifd()) {
-            error_setg(errp, "RDMA and multifd can't be used together");
-            return;
-        }
-        rdma_start_incoming_migration(&addr->u.rdma, errp);
-#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_incoming_migration(addr->u.exec.args, errp);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
@@ -744,9 +731,7 @@ process_incoming_migration_co(void *opaque)
     migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
                       MIGRATION_STATUS_ACTIVE);
 
-    mis->loadvm_co = qemu_coroutine_self();
     ret = qemu_loadvm_state(mis->from_src_file);
-    mis->loadvm_co = NULL;
 
     trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
 
@@ -1668,7 +1653,6 @@ int migrate_init(MigrationState *s, Error **errp)
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
     s->switchover_acked = false;
-    s->rdma_migration = false;
     /*
      * set mig_stats memory to zero for a new migration
      */
@@ -2062,10 +2046,6 @@ void qmp_migrate(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_outgoing_migration(s, saddr->u.fd.str, &local_err);
         }
-#ifdef CONFIG_RDMA
-    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
-        rdma_start_outgoing_migration(s, &addr->u.rdma, &local_err);
-#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_outgoing_migration(s, addr->u.exec.args, &local_err);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
diff --git a/migration/migration.h b/migration/migration.h
index 6af01362d4..714643fe7e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -162,13 +162,6 @@ struct MigrationIncomingState {
 
     int state;
 
-    /*
-     * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
-     * Used to wake the migration incoming coroutine from rdma code. How much is
-     * it safe - it's a question.
-     */
-    Coroutine *loadvm_co;
-
     /* The coroutine we should enter (back) after failover */
     Coroutine *colo_incoming_co;
     QemuSemaphore colo_incoming_sem;
@@ -455,8 +448,6 @@ struct MigrationState {
      * switchover has been received.
      */
     bool switchover_acked;
-    /* Is this a rdma migration */
-    bool rdma_migration;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
diff --git a/migration/options.c b/migration/options.c
index 5ab5b6d85d..601cd712b7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -165,7 +165,6 @@ Property migration_properties[] = {
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
-    DEFINE_PROP_MIG_CAP("x-rdma-pin-all", MIGRATION_CAPABILITY_RDMA_PIN_ALL),
     DEFINE_PROP_MIG_CAP("x-auto-converge", MIGRATION_CAPABILITY_AUTO_CONVERGE),
     DEFINE_PROP_MIG_CAP("x-zero-blocks", MIGRATION_CAPABILITY_ZERO_BLOCKS),
     DEFINE_PROP_MIG_CAP("x-events", MIGRATION_CAPABILITY_EVENTS),
@@ -287,13 +286,6 @@ bool migrate_postcopy_ram(void)
     return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
 }
 
-bool migrate_rdma_pin_all(void)
-{
-    MigrationState *s = migrate_get_current();
-
-    return s->capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
-}
-
 bool migrate_release_ram(void)
 {
     MigrationState *s = migrate_get_current();
@@ -357,13 +349,6 @@ bool migrate_postcopy(void)
     return migrate_postcopy_ram() || migrate_dirty_bitmaps();
 }
 
-bool migrate_rdma(void)
-{
-    MigrationState *s = migrate_get_current();
-
-    return s->rdma_migration;
-}
-
 bool migrate_tls(void)
 {
     MigrationState *s = migrate_get_current();
@@ -422,7 +407,6 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
     MIGRATION_CAPABILITY_AUTO_CONVERGE,
     MIGRATION_CAPABILITY_RELEASE_RAM,
-    MIGRATION_CAPABILITY_RDMA_PIN_ALL,
     MIGRATION_CAPABILITY_XBZRLE,
     MIGRATION_CAPABILITY_X_COLO,
     MIGRATION_CAPABILITY_VALIDATE_UUID,
diff --git a/migration/options.h b/migration/options.h
index 4b21cc2669..cb26708ebf 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -35,7 +35,6 @@ bool migrate_multifd(void);
 bool migrate_pause_before_switchover(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_postcopy_preempt(void);
-bool migrate_rdma_pin_all(void);
 bool migrate_release_ram(void);
 bool migrate_return_path(void);
 bool migrate_validate_uuid(void);
@@ -52,7 +51,6 @@ bool migrate_zero_copy_send(void);
 
 bool migrate_multifd_flush_after_each_section(void);
 bool migrate_postcopy(void);
-bool migrate_rdma(void);
 bool migrate_tls(void);
 
 /* capabilities helpers */
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index b6d2f588bd..09fdfc2b4d 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -31,7 +31,6 @@
 #include "trace.h"
 #include "options.h"
 #include "qapi/error.h"
-#include "rdma.h"
 #include "io/channel-file.h"
 
 #define IO_BUF_SIZE 32768
diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06..6b027c7fd7 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -57,7 +57,6 @@
 #include "qemu/iov.h"
 #include "multifd.h"
 #include "sysemu/runstate.h"
-#include "rdma.h"
 #include "options.h"
 #include "sysemu/dirtylimit.h"
 #include "sysemu/kvm.h"
@@ -88,7 +87,6 @@
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
-/* 0x80 is reserved in rdma.h for RAM_SAVE_FLAG_HOOK */
 #define RAM_SAVE_FLAG_MULTIFD_FLUSH    0x200
 /* We can't use any flag that is bigger than 0x200 */
 
@@ -1168,32 +1166,6 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
     return len;
 }
 
-/*
- * @pages: the number of pages written by the control path,
- *        < 0 - error
- *        > 0 - number of pages written
- *
- * Return true if the pages has been saved, otherwise false is returned.
- */
-static bool control_save_page(PageSearchStatus *pss,
-                              ram_addr_t offset, int *pages)
-{
-    int ret;
-
-    ret = rdma_control_save_page(pss->pss_channel, pss->block->offset, offset,
-                                 TARGET_PAGE_SIZE);
-    if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
-        return false;
-    }
-
-    if (ret == RAM_SAVE_CONTROL_DELAYED) {
-        *pages = 1;
-        return true;
-    }
-    *pages = ret;
-    return true;
-}
-
 /*
  * directly send the page to the stream
  *
@@ -1997,11 +1969,6 @@ int ram_save_queue_pages(const char *rbname, ram_addr_t start, ram_addr_t len,
 static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
 {
     ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
-    int res;
-
-    if (control_save_page(pss, offset, &res)) {
-        return res;
-    }
 
     if (save_zero_page(rs, pss, offset)) {
         return 1;
@@ -3041,20 +3008,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
         }
     }
 
-    ret = rdma_registration_start(f, RAM_CONTROL_SETUP);
-    if (ret < 0) {
-        error_setg(errp, "%s: failed to start RDMA registration", __func__);
-        qemu_file_set_error(f, ret);
-        return ret;
-    }
-
-    ret = rdma_registration_stop(f, RAM_CONTROL_SETUP);
-    if (ret < 0) {
-        error_setg(errp, "%s: failed to stop RDMA registration", __func__);
-        qemu_file_set_error(f, ret);
-        return ret;
-    }
-
     migration_ops = g_malloc0(sizeof(MigrationOps));
 
     if (migrate_multifd()) {
@@ -3148,12 +3101,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
             /* Read version before ram_list.blocks */
             smp_rmb();
 
-            ret = rdma_registration_start(f, RAM_CONTROL_ROUND);
-            if (ret < 0) {
-                qemu_file_set_error(f, ret);
-                goto out;
-            }
-
             t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
             i = 0;
             while ((ret = migration_rate_exceeded(f)) == 0 ||
@@ -3197,16 +3144,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         }
     }
 
-    /*
-     * Must occur before EOS (or any QEMUFile operation)
-     * because of RDMA protocol.
-     */
-    ret = rdma_registration_stop(f, RAM_CONTROL_ROUND);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-    }
-
-out:
     if (ret >= 0
         && migration_is_setup_or_active()) {
         if (migrate_multifd() && migrate_multifd_flush_after_each_section() &&
@@ -3251,12 +3188,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
             migration_bitmap_sync_precopy(rs, true);
         }
 
-        ret = rdma_registration_start(f, RAM_CONTROL_FINISH);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return ret;
-        }
-
         /* try transferring iterative blocks of memory */
 
         /* flush all remaining blocks regardless of rate limiting */
@@ -3275,12 +3206,6 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
             }
         }
         qemu_mutex_unlock(&rs->bitmap_mutex);
-
-        ret = rdma_registration_stop(f, RAM_CONTROL_FINISH);
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-            return ret;
-        }
     }
 
     ret = multifd_send_sync_main();
@@ -3493,8 +3418,7 @@ static inline void *colo_cache_from_block_offset(RAMBlock *block,
 /**
  * ram_handle_zero: handle the zero page case
  *
- * If a page (or a whole RDMA chunk) has been
- * determined to be zero, then zap it.
+ * If a page has been determined to be zero, then zap it.
  *
  * @host: host address for the zero page
  * @ch: what the page is filled from.  We only support zero
@@ -4071,10 +3995,6 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
             return -EINVAL;
         }
     }
-    ret = rdma_block_notification_handle(f, block->idstr);
-    if (ret < 0) {
-        qemu_file_set_error(f, ret);
-    }
 
     return ret;
 }
@@ -4124,7 +4044,7 @@ static int ram_load_precopy(QEMUFile *f)
     int flags = 0, ret = 0, invalid_flags = 0, i = 0;
 
     if (migrate_mapped_ram()) {
-        invalid_flags |= (RAM_SAVE_FLAG_HOOK | RAM_SAVE_FLAG_MULTIFD_FLUSH |
+        invalid_flags |= (RAM_SAVE_FLAG_MULTIFD_FLUSH |
                           RAM_SAVE_FLAG_PAGE | RAM_SAVE_FLAG_XBZRLE |
                           RAM_SAVE_FLAG_ZERO);
     }
@@ -4255,12 +4175,6 @@ static int ram_load_precopy(QEMUFile *f)
                 multifd_recv_sync_main();
             }
             break;
-        case RAM_SAVE_FLAG_HOOK:
-            ret = rdma_registration_handle(f);
-            if (ret < 0) {
-                qemu_file_set_error(f, ret);
-            }
-            break;
         default:
             error_report("Unknown combination of migration flags: 0x%x", flags);
             ret = -EINVAL;
diff --git a/migration/rdma.c b/migration/rdma.c
deleted file mode 100644
index 855753c671..0000000000
--- a/migration/rdma.c
+++ /dev/null
@@ -1,4184 +0,0 @@
-/*
- * RDMA protocol and interfaces
- *
- * Copyright IBM, Corp. 2010-2013
- * Copyright Red Hat, Inc. 2015-2016
- *
- * Authors:
- *  Michael R. Hines <mrhines@us.ibm.com>
- *  Jiuxing Liu <jl@us.ibm.com>
- *  Daniel P. Berrange <berrange@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or
- * later.  See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/osdep.h"
-#include "qapi/error.h"
-#include "qemu/cutils.h"
-#include "exec/target_page.h"
-#include "rdma.h"
-#include "migration.h"
-#include "migration-stats.h"
-#include "qemu-file.h"
-#include "ram.h"
-#include "qemu/error-report.h"
-#include "qemu/main-loop.h"
-#include "qemu/module.h"
-#include "qemu/rcu.h"
-#include "qemu/sockets.h"
-#include "qemu/bitmap.h"
-#include "qemu/coroutine.h"
-#include "exec/memory.h"
-#include <sys/socket.h>
-#include <netdb.h>
-#include <arpa/inet.h>
-#include <rdma/rdma_cma.h>
-#include "trace.h"
-#include "qom/object.h"
-#include "options.h"
-#include <poll.h>
-
-#define RDMA_RESOLVE_TIMEOUT_MS 10000
-
-/* Do not merge data if larger than this. */
-#define RDMA_MERGE_MAX (2 * 1024 * 1024)
-#define RDMA_SIGNALED_SEND_MAX (RDMA_MERGE_MAX / 4096)
-
-#define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
-
-/*
- * This is only for non-live state being migrated.
- * Instead of RDMA_WRITE messages, we use RDMA_SEND
- * messages for that state, which requires a different
- * delivery design than main memory.
- */
-#define RDMA_SEND_INCREMENT 32768
-
-/*
- * Maximum size infiniband SEND message
- */
-#define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
-#define RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE 4096
-
-#define RDMA_CONTROL_VERSION_CURRENT 1
-/*
- * Capabilities for negotiation.
- */
-#define RDMA_CAPABILITY_PIN_ALL 0x01
-
-/*
- * Add the other flags above to this list of known capabilities
- * as they are introduced.
- */
-static uint32_t known_capabilities = RDMA_CAPABILITY_PIN_ALL;
-
-/*
- * A work request ID is 64-bits and we split up these bits
- * into 3 parts:
- *
- * bits 0-15 : type of control message, 2^16
- * bits 16-29: ram block index, 2^14
- * bits 30-63: ram block chunk number, 2^34
- *
- * The last two bit ranges are only used for RDMA writes,
- * in order to track their completion and potentially
- * also track unregistration status of the message.
- */
-#define RDMA_WRID_TYPE_SHIFT  0UL
-#define RDMA_WRID_BLOCK_SHIFT 16UL
-#define RDMA_WRID_CHUNK_SHIFT 30UL
-
-#define RDMA_WRID_TYPE_MASK \
-    ((1UL << RDMA_WRID_BLOCK_SHIFT) - 1UL)
-
-#define RDMA_WRID_BLOCK_MASK \
-    (~RDMA_WRID_TYPE_MASK & ((1UL << RDMA_WRID_CHUNK_SHIFT) - 1UL))
-
-#define RDMA_WRID_CHUNK_MASK (~RDMA_WRID_BLOCK_MASK & ~RDMA_WRID_TYPE_MASK)
-
-/*
- * RDMA migration protocol:
- * 1. RDMA Writes (data messages, i.e. RAM)
- * 2. IB Send/Recv (control channel messages)
- */
-enum {
-    RDMA_WRID_NONE = 0,
-    RDMA_WRID_RDMA_WRITE = 1,
-    RDMA_WRID_SEND_CONTROL = 2000,
-    RDMA_WRID_RECV_CONTROL = 4000,
-};
-
-/*
- * Work request IDs for IB SEND messages only (not RDMA writes).
- * This is used by the migration protocol to transmit
- * control messages (such as device state and registration commands)
- *
- * We could use more WRs, but we have enough for now.
- */
-enum {
-    RDMA_WRID_READY = 0,
-    RDMA_WRID_DATA,
-    RDMA_WRID_CONTROL,
-    RDMA_WRID_MAX,
-};
-
-/*
- * SEND/RECV IB Control Messages.
- */
-enum {
-    RDMA_CONTROL_NONE = 0,
-    RDMA_CONTROL_ERROR,
-    RDMA_CONTROL_READY,               /* ready to receive */
-    RDMA_CONTROL_QEMU_FILE,           /* QEMUFile-transmitted bytes */
-    RDMA_CONTROL_RAM_BLOCKS_REQUEST,  /* RAMBlock synchronization */
-    RDMA_CONTROL_RAM_BLOCKS_RESULT,   /* RAMBlock synchronization */
-    RDMA_CONTROL_COMPRESS,            /* page contains repeat values */
-    RDMA_CONTROL_REGISTER_REQUEST,    /* dynamic page registration */
-    RDMA_CONTROL_REGISTER_RESULT,     /* key to use after registration */
-    RDMA_CONTROL_REGISTER_FINISHED,   /* current iteration finished */
-    RDMA_CONTROL_UNREGISTER_REQUEST,  /* dynamic UN-registration */
-    RDMA_CONTROL_UNREGISTER_FINISHED, /* unpinning finished */
-};
-
-
-/*
- * Memory and MR structures used to represent an IB Send/Recv work request.
- * This is *not* used for RDMA writes, only IB Send/Recv.
- */
-typedef struct {
-    uint8_t  control[RDMA_CONTROL_MAX_BUFFER]; /* actual buffer to register */
-    struct   ibv_mr *control_mr;               /* registration metadata */
-    size_t   control_len;                      /* length of the message */
-    uint8_t *control_curr;                     /* start of unconsumed bytes */
-} RDMAWorkRequestData;
-
-/*
- * Negotiate RDMA capabilities during connection-setup time.
- */
-typedef struct {
-    uint32_t version;
-    uint32_t flags;
-} RDMACapabilities;
-
-static void caps_to_network(RDMACapabilities *cap)
-{
-    cap->version = htonl(cap->version);
-    cap->flags = htonl(cap->flags);
-}
-
-static void network_to_caps(RDMACapabilities *cap)
-{
-    cap->version = ntohl(cap->version);
-    cap->flags = ntohl(cap->flags);
-}
-
-/*
- * Representation of a RAMBlock from an RDMA perspective.
- * This is not transmitted, only local.
- * This and subsequent structures cannot be linked lists
- * because we're using a single IB message to transmit
- * the information. It's small anyway, so a list is overkill.
- */
-typedef struct RDMALocalBlock {
-    char          *block_name;
-    uint8_t       *local_host_addr; /* local virtual address */
-    uint64_t       remote_host_addr; /* remote virtual address */
-    uint64_t       offset;
-    uint64_t       length;
-    struct         ibv_mr **pmr;    /* MRs for chunk-level registration */
-    struct         ibv_mr *mr;      /* MR for non-chunk-level registration */
-    uint32_t      *remote_keys;     /* rkeys for chunk-level registration */
-    uint32_t       remote_rkey;     /* rkeys for non-chunk-level registration */
-    int            index;           /* which block are we */
-    unsigned int   src_index;       /* (Only used on dest) */
-    bool           is_ram_block;
-    int            nb_chunks;
-    unsigned long *transit_bitmap;
-    unsigned long *unregister_bitmap;
-} RDMALocalBlock;
-
-/*
- * Also represents a RAMblock, but only on the dest.
- * This gets transmitted by the dest during connection-time
- * to the source VM and then is used to populate the
- * corresponding RDMALocalBlock with
- * the information needed to perform the actual RDMA.
- */
-typedef struct QEMU_PACKED RDMADestBlock {
-    uint64_t remote_host_addr;
-    uint64_t offset;
-    uint64_t length;
-    uint32_t remote_rkey;
-    uint32_t padding;
-} RDMADestBlock;
-
-static const char *control_desc(unsigned int rdma_control)
-{
-    static const char *strs[] = {
-        [RDMA_CONTROL_NONE] = "NONE",
-        [RDMA_CONTROL_ERROR] = "ERROR",
-        [RDMA_CONTROL_READY] = "READY",
-        [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
-        [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST",
-        [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT",
-        [RDMA_CONTROL_COMPRESS] = "COMPRESS",
-        [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
-        [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
-        [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
-        [RDMA_CONTROL_UNREGISTER_REQUEST] = "UNREGISTER REQUEST",
-        [RDMA_CONTROL_UNREGISTER_FINISHED] = "UNREGISTER FINISHED",
-    };
-
-    if (rdma_control > RDMA_CONTROL_UNREGISTER_FINISHED) {
-        return "??BAD CONTROL VALUE??";
-    }
-
-    return strs[rdma_control];
-}
-
-#if !defined(htonll)
-static uint64_t htonll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.lv[0] = htonl(v >> 32);
-    u.lv[1] = htonl(v & 0xFFFFFFFFULL);
-    return u.llv;
-}
-#endif
-
-#if !defined(ntohll)
-static uint64_t ntohll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.llv = v;
-    return ((uint64_t)ntohl(u.lv[0]) << 32) | (uint64_t) ntohl(u.lv[1]);
-}
-#endif
-
-static void dest_block_to_network(RDMADestBlock *db)
-{
-    db->remote_host_addr = htonll(db->remote_host_addr);
-    db->offset = htonll(db->offset);
-    db->length = htonll(db->length);
-    db->remote_rkey = htonl(db->remote_rkey);
-}
-
-static void network_to_dest_block(RDMADestBlock *db)
-{
-    db->remote_host_addr = ntohll(db->remote_host_addr);
-    db->offset = ntohll(db->offset);
-    db->length = ntohll(db->length);
-    db->remote_rkey = ntohl(db->remote_rkey);
-}
-
-/*
- * Virtual address of the above structures used for transmitting
- * the RAMBlock descriptions at connection-time.
- * This structure is *not* transmitted.
- */
-typedef struct RDMALocalBlocks {
-    int nb_blocks;
-    bool     init;             /* main memory init complete */
-    RDMALocalBlock *block;
-} RDMALocalBlocks;
-
-/*
- * Main data structure for RDMA state.
- * While there is only one copy of this structure being allocated right now,
- * this is the place where one would start if you wanted to consider
- * having more than one RDMA connection open at the same time.
- */
-typedef struct RDMAContext {
-    char *host;
-    int port;
-
-    RDMAWorkRequestData wr_data[RDMA_WRID_MAX];
-
-    /*
-     * This is used by *_exchange_send() to figure out whether or not
-     * the initial "READY" message has already been received or not.
-     * This is because other functions may potentially poll() and detect
-     * the READY message before send() does, in which case we need to
-     * know if it completed.
-     */
-    int control_ready_expected;
-
-    /* number of outstanding writes */
-    int nb_sent;
-
-    /* store info about current buffer so that we can
-       merge it with future sends */
-    uint64_t current_addr;
-    uint64_t current_length;
-    /* index of ram block the current buffer belongs to */
-    int current_index;
-    /* index of the chunk in the current ram block */
-    int current_chunk;
-
-    bool pin_all;
-
-    /*
-     * infiniband-specific variables for opening the device
-     * and maintaining connection state and so forth.
-     *
-     * cm_id also has ibv_context, rdma_event_channel, and ibv_qp in
-     * cm_id->verbs, cm_id->channel, and cm_id->qp.
-     */
-    struct rdma_cm_id *cm_id;               /* connection manager ID */
-    struct rdma_cm_id *listen_id;
-    bool connected;
-
-    struct ibv_context          *verbs;
-    struct rdma_event_channel   *channel;
-    struct ibv_qp *qp;                      /* queue pair */
-    struct ibv_comp_channel *recv_comp_channel;  /* recv completion channel */
-    struct ibv_comp_channel *send_comp_channel;  /* send completion channel */
-    struct ibv_pd *pd;                      /* protection domain */
-    struct ibv_cq *recv_cq;                 /* recvieve completion queue */
-    struct ibv_cq *send_cq;                 /* send completion queue */
-
-    /*
-     * If a previous write failed (perhaps because of a failed
-     * memory registration, then do not attempt any future work
-     * and remember the error state.
-     */
-    bool errored;
-    bool error_reported;
-    bool received_error;
-
-    /*
-     * Description of ram blocks used throughout the code.
-     */
-    RDMALocalBlocks local_ram_blocks;
-    RDMADestBlock  *dest_blocks;
-
-    /* Index of the next RAMBlock received during block registration */
-    unsigned int    next_src_index;
-
-    /*
-     * Migration on *destination* started.
-     * Then use coroutine yield function.
-     * Source runs in a thread, so we don't care.
-     */
-    int migration_started_on_destination;
-
-    int total_registrations;
-    int total_writes;
-
-    int unregister_current, unregister_next;
-    uint64_t unregistrations[RDMA_SIGNALED_SEND_MAX];
-
-    GHashTable *blockmap;
-
-    /* the RDMAContext for return path */
-    struct RDMAContext *return_path;
-    bool is_return_path;
-} RDMAContext;
-
-#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
-OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
-
-
-
-struct QIOChannelRDMA {
-    QIOChannel parent;
-    RDMAContext *rdmain;
-    RDMAContext *rdmaout;
-    QEMUFile *file;
-    bool blocking; /* XXX we don't actually honour this yet */
-};
-
-/*
- * Main structure for IB Send/Recv control messages.
- * This gets prepended at the beginning of every Send/Recv.
- */
-typedef struct QEMU_PACKED {
-    uint32_t len;     /* Total length of data portion */
-    uint32_t type;    /* which control command to perform */
-    uint32_t repeat;  /* number of commands in data portion of same type */
-    uint32_t padding;
-} RDMAControlHeader;
-
-static void control_to_network(RDMAControlHeader *control)
-{
-    control->type = htonl(control->type);
-    control->len = htonl(control->len);
-    control->repeat = htonl(control->repeat);
-}
-
-static void network_to_control(RDMAControlHeader *control)
-{
-    control->type = ntohl(control->type);
-    control->len = ntohl(control->len);
-    control->repeat = ntohl(control->repeat);
-}
-
-/*
- * Register a single Chunk.
- * Information sent by the source VM to inform the dest
- * to register an single chunk of memory before we can perform
- * the actual RDMA operation.
- */
-typedef struct QEMU_PACKED {
-    union QEMU_PACKED {
-        uint64_t current_addr;  /* offset into the ram_addr_t space */
-        uint64_t chunk;         /* chunk to lookup if unregistering */
-    } key;
-    uint32_t current_index; /* which ramblock the chunk belongs to */
-    uint32_t padding;
-    uint64_t chunks;            /* how many sequential chunks to register */
-} RDMARegister;
-
-static bool rdma_errored(RDMAContext *rdma)
-{
-    if (rdma->errored && !rdma->error_reported) {
-        error_report("RDMA is in an error state waiting migration"
-                     " to abort!");
-        rdma->error_reported = true;
-    }
-    return rdma->errored;
-}
-
-static void register_to_network(RDMAContext *rdma, RDMARegister *reg)
-{
-    RDMALocalBlock *local_block;
-    local_block  = &rdma->local_ram_blocks.block[reg->current_index];
-
-    if (local_block->is_ram_block) {
-        /*
-         * current_addr as passed in is an address in the local ram_addr_t
-         * space, we need to translate this for the destination
-         */
-        reg->key.current_addr -= local_block->offset;
-        reg->key.current_addr += rdma->dest_blocks[reg->current_index].offset;
-    }
-    reg->key.current_addr = htonll(reg->key.current_addr);
-    reg->current_index = htonl(reg->current_index);
-    reg->chunks = htonll(reg->chunks);
-}
-
-static void network_to_register(RDMARegister *reg)
-{
-    reg->key.current_addr = ntohll(reg->key.current_addr);
-    reg->current_index = ntohl(reg->current_index);
-    reg->chunks = ntohll(reg->chunks);
-}
-
-typedef struct QEMU_PACKED {
-    uint32_t value;     /* if zero, we will madvise() */
-    uint32_t block_idx; /* which ram block index */
-    uint64_t offset;    /* Address in remote ram_addr_t space */
-    uint64_t length;    /* length of the chunk */
-} RDMACompress;
-
-static void compress_to_network(RDMAContext *rdma, RDMACompress *comp)
-{
-    comp->value = htonl(comp->value);
-    /*
-     * comp->offset as passed in is an address in the local ram_addr_t
-     * space, we need to translate this for the destination
-     */
-    comp->offset -= rdma->local_ram_blocks.block[comp->block_idx].offset;
-    comp->offset += rdma->dest_blocks[comp->block_idx].offset;
-    comp->block_idx = htonl(comp->block_idx);
-    comp->offset = htonll(comp->offset);
-    comp->length = htonll(comp->length);
-}
-
-static void network_to_compress(RDMACompress *comp)
-{
-    comp->value = ntohl(comp->value);
-    comp->block_idx = ntohl(comp->block_idx);
-    comp->offset = ntohll(comp->offset);
-    comp->length = ntohll(comp->length);
-}
-
-/*
- * The result of the dest's memory registration produces an "rkey"
- * which the source VM must reference in order to perform
- * the RDMA operation.
- */
-typedef struct QEMU_PACKED {
-    uint32_t rkey;
-    uint32_t padding;
-    uint64_t host_addr;
-} RDMARegisterResult;
-
-static void result_to_network(RDMARegisterResult *result)
-{
-    result->rkey = htonl(result->rkey);
-    result->host_addr = htonll(result->host_addr);
-};
-
-static void network_to_result(RDMARegisterResult *result)
-{
-    result->rkey = ntohl(result->rkey);
-    result->host_addr = ntohll(result->host_addr);
-};
-
-static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx,
-                                   int (*callback)(RDMAContext *rdma,
-                                                   Error **errp),
-                                   Error **errp);
-
-static inline uint64_t ram_chunk_index(const uint8_t *start,
-                                       const uint8_t *host)
-{
-    return ((uintptr_t) host - (uintptr_t) start) >> RDMA_REG_CHUNK_SHIFT;
-}
-
-static inline uint8_t *ram_chunk_start(const RDMALocalBlock *rdma_ram_block,
-                                       uint64_t i)
-{
-    return (uint8_t *)(uintptr_t)(rdma_ram_block->local_host_addr +
-                                  (i << RDMA_REG_CHUNK_SHIFT));
-}
-
-static inline uint8_t *ram_chunk_end(const RDMALocalBlock *rdma_ram_block,
-                                     uint64_t i)
-{
-    uint8_t *result = ram_chunk_start(rdma_ram_block, i) +
-                                         (1UL << RDMA_REG_CHUNK_SHIFT);
-
-    if (result > (rdma_ram_block->local_host_addr + rdma_ram_block->length)) {
-        result = rdma_ram_block->local_host_addr + rdma_ram_block->length;
-    }
-
-    return result;
-}
-
-static void rdma_add_block(RDMAContext *rdma, const char *block_name,
-                           void *host_addr,
-                           ram_addr_t block_offset, uint64_t length)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *block;
-    RDMALocalBlock *old = local->block;
-
-    local->block = g_new0(RDMALocalBlock, local->nb_blocks + 1);
-
-    if (local->nb_blocks) {
-        if (rdma->blockmap) {
-            for (int x = 0; x < local->nb_blocks; x++) {
-                g_hash_table_remove(rdma->blockmap,
-                                    (void *)(uintptr_t)old[x].offset);
-                g_hash_table_insert(rdma->blockmap,
-                                    (void *)(uintptr_t)old[x].offset,
-                                    &local->block[x]);
-            }
-        }
-        memcpy(local->block, old, sizeof(RDMALocalBlock) * local->nb_blocks);
-        g_free(old);
-    }
-
-    block = &local->block[local->nb_blocks];
-
-    block->block_name = g_strdup(block_name);
-    block->local_host_addr = host_addr;
-    block->offset = block_offset;
-    block->length = length;
-    block->index = local->nb_blocks;
-    block->src_index = ~0U; /* Filled in by the receipt of the block list */
-    block->nb_chunks = ram_chunk_index(host_addr, host_addr + length) + 1UL;
-    block->transit_bitmap = bitmap_new(block->nb_chunks);
-    bitmap_clear(block->transit_bitmap, 0, block->nb_chunks);
-    block->unregister_bitmap = bitmap_new(block->nb_chunks);
-    bitmap_clear(block->unregister_bitmap, 0, block->nb_chunks);
-    block->remote_keys = g_new0(uint32_t, block->nb_chunks);
-
-    block->is_ram_block = local->init ? false : true;
-
-    if (rdma->blockmap) {
-        g_hash_table_insert(rdma->blockmap, (void *)(uintptr_t)block_offset, block);
-    }
-
-    trace_rdma_add_block(block_name, local->nb_blocks,
-                         (uintptr_t) block->local_host_addr,
-                         block->offset, block->length,
-                         (uintptr_t) (block->local_host_addr + block->length),
-                         BITS_TO_LONGS(block->nb_chunks) *
-                             sizeof(unsigned long) * 8,
-                         block->nb_chunks);
-
-    local->nb_blocks++;
-}
-
-/*
- * Memory regions need to be registered with the device and queue pairs setup
- * in advanced before the migration starts. This tells us where the RAM blocks
- * are so that we can register them individually.
- */
-static int qemu_rdma_init_one_block(RAMBlock *rb, void *opaque)
-{
-    const char *block_name = qemu_ram_get_idstr(rb);
-    void *host_addr = qemu_ram_get_host_addr(rb);
-    ram_addr_t block_offset = qemu_ram_get_offset(rb);
-    ram_addr_t length = qemu_ram_get_used_length(rb);
-    rdma_add_block(opaque, block_name, host_addr, block_offset, length);
-    return 0;
-}
-
-/*
- * Identify the RAMBlocks and their quantity. They will be references to
- * identify chunk boundaries inside each RAMBlock and also be referenced
- * during dynamic page registration.
- */
-static void qemu_rdma_init_ram_blocks(RDMAContext *rdma)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    int ret;
-
-    assert(rdma->blockmap == NULL);
-    memset(local, 0, sizeof *local);
-    ret = foreach_not_ignored_block(qemu_rdma_init_one_block, rdma);
-    assert(!ret);
-    trace_qemu_rdma_init_ram_blocks(local->nb_blocks);
-    rdma->dest_blocks = g_new0(RDMADestBlock,
-                               rdma->local_ram_blocks.nb_blocks);
-    local->init = true;
-}
-
-/*
- * Note: If used outside of cleanup, the caller must ensure that the destination
- * block structures are also updated
- */
-static void rdma_delete_block(RDMAContext *rdma, RDMALocalBlock *block)
-{
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *old = local->block;
-
-    if (rdma->blockmap) {
-        g_hash_table_remove(rdma->blockmap, (void *)(uintptr_t)block->offset);
-    }
-    if (block->pmr) {
-        for (int j = 0; j < block->nb_chunks; j++) {
-            if (!block->pmr[j]) {
-                continue;
-            }
-            ibv_dereg_mr(block->pmr[j]);
-            rdma->total_registrations--;
-        }
-        g_free(block->pmr);
-        block->pmr = NULL;
-    }
-
-    if (block->mr) {
-        ibv_dereg_mr(block->mr);
-        rdma->total_registrations--;
-        block->mr = NULL;
-    }
-
-    g_free(block->transit_bitmap);
-    block->transit_bitmap = NULL;
-
-    g_free(block->unregister_bitmap);
-    block->unregister_bitmap = NULL;
-
-    g_free(block->remote_keys);
-    block->remote_keys = NULL;
-
-    g_free(block->block_name);
-    block->block_name = NULL;
-
-    if (rdma->blockmap) {
-        for (int x = 0; x < local->nb_blocks; x++) {
-            g_hash_table_remove(rdma->blockmap,
-                                (void *)(uintptr_t)old[x].offset);
-        }
-    }
-
-    if (local->nb_blocks > 1) {
-
-        local->block = g_new0(RDMALocalBlock, local->nb_blocks - 1);
-
-        if (block->index) {
-            memcpy(local->block, old, sizeof(RDMALocalBlock) * block->index);
-        }
-
-        if (block->index < (local->nb_blocks - 1)) {
-            memcpy(local->block + block->index, old + (block->index + 1),
-                sizeof(RDMALocalBlock) *
-                    (local->nb_blocks - (block->index + 1)));
-            for (int x = block->index; x < local->nb_blocks - 1; x++) {
-                local->block[x].index--;
-            }
-        }
-    } else {
-        assert(block == local->block);
-        local->block = NULL;
-    }
-
-    trace_rdma_delete_block(block, (uintptr_t)block->local_host_addr,
-                           block->offset, block->length,
-                            (uintptr_t)(block->local_host_addr + block->length),
-                           BITS_TO_LONGS(block->nb_chunks) *
-                               sizeof(unsigned long) * 8, block->nb_chunks);
-
-    g_free(old);
-
-    local->nb_blocks--;
-
-    if (local->nb_blocks && rdma->blockmap) {
-        for (int x = 0; x < local->nb_blocks; x++) {
-            g_hash_table_insert(rdma->blockmap,
-                                (void *)(uintptr_t)local->block[x].offset,
-                                &local->block[x]);
-        }
-    }
-}
-
-/*
- * Trace RDMA device open, with device details.
- */
-static void qemu_rdma_dump_id(const char *who, struct ibv_context *verbs)
-{
-    struct ibv_port_attr port;
-
-    if (ibv_query_port(verbs, 1, &port)) {
-        trace_qemu_rdma_dump_id_failed(who);
-        return;
-    }
-
-    trace_qemu_rdma_dump_id(who,
-                verbs->device->name,
-                verbs->device->dev_name,
-                verbs->device->dev_path,
-                verbs->device->ibdev_path,
-                port.link_layer,
-                port.link_layer == IBV_LINK_LAYER_INFINIBAND ? "Infiniband"
-                : port.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet"
-                : "Unknown");
-}
-
-/*
- * Trace RDMA gid addressing information.
- * Useful for understanding the RDMA device hierarchy in the kernel.
- */
-static void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
-{
-    char sgid[33];
-    char dgid[33];
-    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.sgid, sgid, sizeof sgid);
-    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.dgid, dgid, sizeof dgid);
-    trace_qemu_rdma_dump_gid(who, sgid, dgid);
-}
-
-/*
- * As of now, IPv6 over RoCE / iWARP is not supported by linux.
- * We will try the next addrinfo struct, and fail if there are
- * no other valid addresses to bind against.
- *
- * If user is listening on '[::]', then we will not have a opened a device
- * yet and have no way of verifying if the device is RoCE or not.
- *
- * In this case, the source VM will throw an error for ALL types of
- * connections (both IPv4 and IPv6) if the destination machine does not have
- * a regular infiniband network available for use.
- *
- * The only way to guarantee that an error is thrown for broken kernels is
- * for the management software to choose a *specific* interface at bind time
- * and validate what time of hardware it is.
- *
- * Unfortunately, this puts the user in a fix:
- *
- *  If the source VM connects with an IPv4 address without knowing that the
- *  destination has bound to '[::]' the migration will unconditionally fail
- *  unless the management software is explicitly listening on the IPv4
- *  address while using a RoCE-based device.
- *
- *  If the source VM connects with an IPv6 address, then we're OK because we can
- *  throw an error on the source (and similarly on the destination).
- *
- *  But in mixed environments, this will be broken for a while until it is fixed
- *  inside linux.
- *
- * We do provide a *tiny* bit of help in this function: We can list all of the
- * devices in the system and check to see if all the devices are RoCE or
- * Infiniband.
- *
- * If we detect that we have a *pure* RoCE environment, then we can safely
- * thrown an error even if the management software has specified '[::]' as the
- * bind address.
- *
- * However, if there is are multiple hetergeneous devices, then we cannot make
- * this assumption and the user just has to be sure they know what they are
- * doing.
- *
- * Patches are being reviewed on linux-rdma.
- */
-static int qemu_rdma_broken_ipv6_kernel(struct ibv_context *verbs, Error **errp)
-{
-    /* This bug only exists in linux, to our knowledge. */
-#ifdef CONFIG_LINUX
-    struct ibv_port_attr port_attr;
-
-    /*
-     * Verbs are only NULL if management has bound to '[::]'.
-     *
-     * Let's iterate through all the devices and see if there any pure IB
-     * devices (non-ethernet).
-     *
-     * If not, then we can safely proceed with the migration.
-     * Otherwise, there are no guarantees until the bug is fixed in linux.
-     */
-    if (!verbs) {
-        int num_devices;
-        struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
-        bool roce_found = false;
-        bool ib_found = false;
-
-        for (int x = 0; x < num_devices; x++) {
-            verbs = ibv_open_device(dev_list[x]);
-            /*
-             * ibv_open_device() is not documented to set errno.  If
-             * it does, it's somebody else's doc bug.  If it doesn't,
-             * the use of errno below is wrong.
-             * TODO Find out whether ibv_open_device() sets errno.
-             */
-            if (!verbs) {
-                if (errno == EPERM) {
-                    continue;
-                } else {
-                    error_setg_errno(errp, errno,
-                                     "could not open RDMA device context");
-                    return -1;
-                }
-            }
-
-            if (ibv_query_port(verbs, 1, &port_attr)) {
-                ibv_close_device(verbs);
-                error_setg(errp,
-                           "RDMA ERROR: Could not query initial IB port");
-                return -1;
-            }
-
-            if (port_attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
-                ib_found = true;
-            } else if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
-                roce_found = true;
-            }
-
-            ibv_close_device(verbs);
-
-        }
-
-        if (roce_found) {
-            if (ib_found) {
-                warn_report("migrations may fail:"
-                            " IPv6 over RoCE / iWARP in linux"
-                            " is broken. But since you appear to have a"
-                            " mixed RoCE / IB environment, be sure to only"
-                            " migrate over the IB fabric until the kernel "
-                            " fixes the bug.");
-            } else {
-                error_setg(errp, "RDMA ERROR: "
-                           "You only have RoCE / iWARP devices in your systems"
-                           " and your management software has specified '[::]'"
-                           ", but IPv6 over RoCE / iWARP is not supported in Linux.");
-                return -1;
-            }
-        }
-
-        return 0;
-    }
-
-    /*
-     * If we have a verbs context, that means that some other than '[::]' was
-     * used by the management software for binding. In which case we can
-     * actually warn the user about a potentially broken kernel.
-     */
-
-    /* IB ports start with 1, not 0 */
-    if (ibv_query_port(verbs, 1, &port_attr)) {
-        error_setg(errp, "RDMA ERROR: Could not query initial IB port");
-        return -1;
-    }
-
-    if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
-        error_setg(errp, "RDMA ERROR: "
-                   "Linux kernel's RoCE / iWARP does not support IPv6 "
-                   "(but patches on linux-rdma in progress)");
-        return -1;
-    }
-
-#endif
-
-    return 0;
-}
-
-/*
- * Figure out which RDMA device corresponds to the requested IP hostname
- * Also create the initial connection manager identifiers for opening
- * the connection.
- */
-static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
-{
-    Error *err = NULL;
-    int ret;
-    struct rdma_addrinfo *res;
-    char port_str[16];
-    struct rdma_cm_event *cm_event;
-    char ip[40] = "unknown";
-
-    if (rdma->host == NULL || !strcmp(rdma->host, "")) {
-        error_setg(errp, "RDMA ERROR: RDMA hostname has not been set");
-        return -1;
-    }
-
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        error_setg(errp, "RDMA ERROR: could not create CM channel");
-        return -1;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &rdma->cm_id, NULL, RDMA_PS_TCP);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not create channel id");
-        goto err_resolve_create_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret) {
-        error_setg(errp, "RDMA ERROR: could not rdma_getaddrinfo address %s",
-                   rdma->host);
-        goto err_resolve_get_addr;
-    }
-
-    /* Try all addresses, saving the first error in @err */
-    for (struct rdma_addrinfo *e = res; e != NULL; e = e->ai_next) {
-        Error **local_errp = err ? NULL : &err;
-
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        trace_qemu_rdma_resolve_host_trying(rdma->host, ip);
-
-        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_dst_addr,
-                RDMA_RESOLVE_TIMEOUT_MS);
-        if (ret >= 0) {
-            if (e->ai_family == AF_INET6) {
-                ret = qemu_rdma_broken_ipv6_kernel(rdma->cm_id->verbs,
-                                                   local_errp);
-                if (ret < 0) {
-                    continue;
-                }
-            }
-            error_free(err);
-            goto route;
-        }
-    }
-
-    rdma_freeaddrinfo(res);
-    if (err) {
-        error_propagate(errp, err);
-    } else {
-        error_setg(errp, "RDMA ERROR: could not resolve address %s",
-                   rdma->host);
-    }
-    goto err_resolve_get_addr;
-
-route:
-    rdma_freeaddrinfo(res);
-    qemu_rdma_dump_gid("source_resolve_addr", rdma->cm_id);
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not perform event_addr_resolved");
-        goto err_resolve_get_addr;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
-        error_setg(errp,
-                   "RDMA ERROR: result not equal to event_addr_resolved %s",
-                   rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-
-    /* resolve route */
-    ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not resolve rdma route");
-        goto err_resolve_get_addr;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not perform event_route_resolved");
-        goto err_resolve_get_addr;
-    }
-    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
-        error_setg(errp, "RDMA ERROR: "
-                   "result not equal to event_route_resolved: %s",
-                   rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-    rdma->verbs = rdma->cm_id->verbs;
-    qemu_rdma_dump_id("source_resolve_host", rdma->cm_id->verbs);
-    qemu_rdma_dump_gid("source_resolve_host", rdma->cm_id);
-    return 0;
-
-err_resolve_get_addr:
-    rdma_destroy_id(rdma->cm_id);
-    rdma->cm_id = NULL;
-err_resolve_create_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    return -1;
-}
-
-/*
- * Create protection domain and completion queues
- */
-static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma, Error **errp)
-{
-    /* allocate pd */
-    rdma->pd = ibv_alloc_pd(rdma->verbs);
-    if (!rdma->pd) {
-        error_setg(errp, "failed to allocate protection domain");
-        return -1;
-    }
-
-    /* create receive completion channel */
-    rdma->recv_comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->recv_comp_channel) {
-        error_setg(errp, "failed to allocate receive completion channel");
-        goto err_alloc_pd_cq;
-    }
-
-    /*
-     * Completion queue can be filled by read work requests.
-     */
-    rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-                                  NULL, rdma->recv_comp_channel, 0);
-    if (!rdma->recv_cq) {
-        error_setg(errp, "failed to allocate receive completion queue");
-        goto err_alloc_pd_cq;
-    }
-
-    /* create send completion channel */
-    rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->send_comp_channel) {
-        error_setg(errp, "failed to allocate send completion channel");
-        goto err_alloc_pd_cq;
-    }
-
-    rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-                                  NULL, rdma->send_comp_channel, 0);
-    if (!rdma->send_cq) {
-        error_setg(errp, "failed to allocate send completion queue");
-        goto err_alloc_pd_cq;
-    }
-
-    return 0;
-
-err_alloc_pd_cq:
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-    }
-    if (rdma->recv_comp_channel) {
-        ibv_destroy_comp_channel(rdma->recv_comp_channel);
-    }
-    if (rdma->send_comp_channel) {
-        ibv_destroy_comp_channel(rdma->send_comp_channel);
-    }
-    if (rdma->recv_cq) {
-        ibv_destroy_cq(rdma->recv_cq);
-        rdma->recv_cq = NULL;
-    }
-    rdma->pd = NULL;
-    rdma->recv_comp_channel = NULL;
-    rdma->send_comp_channel = NULL;
-    return -1;
-
-}
-
-/*
- * Create queue pairs.
- */
-static int qemu_rdma_alloc_qp(RDMAContext *rdma)
-{
-    struct ibv_qp_init_attr attr = { 0 };
-
-    attr.cap.max_send_wr = RDMA_SIGNALED_SEND_MAX;
-    attr.cap.max_recv_wr = 3;
-    attr.cap.max_send_sge = 1;
-    attr.cap.max_recv_sge = 1;
-    attr.send_cq = rdma->send_cq;
-    attr.recv_cq = rdma->recv_cq;
-    attr.qp_type = IBV_QPT_RC;
-
-    if (rdma_create_qp(rdma->cm_id, rdma->pd, &attr) < 0) {
-        return -1;
-    }
-
-    rdma->qp = rdma->cm_id->qp;
-    return 0;
-}
-
-/* Check whether On-Demand Paging is supported by RDAM device */
-static bool rdma_support_odp(struct ibv_context *dev)
-{
-    struct ibv_device_attr_ex attr = {0};
-
-    if (ibv_query_device_ex(dev, NULL, &attr)) {
-        return false;
-    }
-
-    if (attr.odp_caps.general_caps & IBV_ODP_SUPPORT) {
-        return true;
-    }
-
-    return false;
-}
-
-/*
- * ibv_advise_mr to avoid RNR NAK error as far as possible.
- * The responder mr registering with ODP will sent RNR NAK back to
- * the requester in the face of the page fault.
- */
-static void qemu_rdma_advise_prefetch_mr(struct ibv_pd *pd, uint64_t addr,
-                                         uint32_t len,  uint32_t lkey,
-                                         const char *name, bool wr)
-{
-#ifdef HAVE_IBV_ADVISE_MR
-    int ret;
-    int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
-                 IBV_ADVISE_MR_ADVICE_PREFETCH;
-    struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
-
-    ret = ibv_advise_mr(pd, advice,
-                        IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
-    /* ignore the error */
-    trace_qemu_rdma_advise_mr(name, len, addr, strerror(ret));
-#endif
-}
-
-static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, Error **errp)
-{
-    int i;
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-
-    for (i = 0; i < local->nb_blocks; i++) {
-        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
-
-        local->block[i].mr =
-            ibv_reg_mr(rdma->pd,
-                    local->block[i].local_host_addr,
-                    local->block[i].length, access
-                    );
-        /*
-         * ibv_reg_mr() is not documented to set errno.  If it does,
-         * it's somebody else's doc bug.  If it doesn't, the use of
-         * errno below is wrong.
-         * TODO Find out whether ibv_reg_mr() sets errno.
-         */
-        if (!local->block[i].mr &&
-            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
-                access |= IBV_ACCESS_ON_DEMAND;
-                /* register ODP mr */
-                local->block[i].mr =
-                    ibv_reg_mr(rdma->pd,
-                               local->block[i].local_host_addr,
-                               local->block[i].length, access);
-                trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
-
-                if (local->block[i].mr) {
-                    qemu_rdma_advise_prefetch_mr(rdma->pd,
-                                    (uintptr_t)local->block[i].local_host_addr,
-                                    local->block[i].length,
-                                    local->block[i].mr->lkey,
-                                    local->block[i].block_name,
-                                    true);
-                }
-        }
-
-        if (!local->block[i].mr) {
-            error_setg_errno(errp, errno,
-                             "Failed to register local dest ram block!");
-            goto err;
-        }
-        rdma->total_registrations++;
-    }
-
-    return 0;
-
-err:
-    for (i--; i >= 0; i--) {
-        ibv_dereg_mr(local->block[i].mr);
-        local->block[i].mr = NULL;
-        rdma->total_registrations--;
-    }
-
-    return -1;
-
-}
-
-/*
- * Find the ram block that corresponds to the page requested to be
- * transmitted by QEMU.
- *
- * Once the block is found, also identify which 'chunk' within that
- * block that the page belongs to.
- */
-static void qemu_rdma_search_ram_block(RDMAContext *rdma,
-                                       uintptr_t block_offset,
-                                       uint64_t offset,
-                                       uint64_t length,
-                                       uint64_t *block_index,
-                                       uint64_t *chunk_index)
-{
-    uint64_t current_addr = block_offset + offset;
-    RDMALocalBlock *block = g_hash_table_lookup(rdma->blockmap,
-                                                (void *) block_offset);
-    assert(block);
-    assert(current_addr >= block->offset);
-    assert((current_addr + length) <= (block->offset + block->length));
-
-    *block_index = block->index;
-    *chunk_index = ram_chunk_index(block->local_host_addr,
-                block->local_host_addr + (current_addr - block->offset));
-}
-
-/*
- * Register a chunk with IB. If the chunk was already registered
- * previously, then skip.
- *
- * Also return the keys associated with the registration needed
- * to perform the actual RDMA operation.
- */
-static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
-        RDMALocalBlock *block, uintptr_t host_addr,
-        uint32_t *lkey, uint32_t *rkey, int chunk,
-        uint8_t *chunk_start, uint8_t *chunk_end)
-{
-    if (block->mr) {
-        if (lkey) {
-            *lkey = block->mr->lkey;
-        }
-        if (rkey) {
-            *rkey = block->mr->rkey;
-        }
-        return 0;
-    }
-
-    /* allocate memory to store chunk MRs */
-    if (!block->pmr) {
-        block->pmr = g_new0(struct ibv_mr *, block->nb_chunks);
-    }
-
-    /*
-     * If 'rkey', then we're the destination, so grant access to the source.
-     *
-     * If 'lkey', then we're the source VM, so grant access only to ourselves.
-     */
-    if (!block->pmr[chunk]) {
-        uint64_t len = chunk_end - chunk_start;
-        int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE :
-                     0;
-
-        trace_qemu_rdma_register_and_get_keys(len, chunk_start);
-
-        block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
-        /*
-         * ibv_reg_mr() is not documented to set errno.  If it does,
-         * it's somebody else's doc bug.  If it doesn't, the use of
-         * errno below is wrong.
-         * TODO Find out whether ibv_reg_mr() sets errno.
-         */
-        if (!block->pmr[chunk] &&
-            errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
-            access |= IBV_ACCESS_ON_DEMAND;
-            /* register ODP mr */
-            block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
-            trace_qemu_rdma_register_odp_mr(block->block_name);
-
-            if (block->pmr[chunk]) {
-                qemu_rdma_advise_prefetch_mr(rdma->pd, (uintptr_t)chunk_start,
-                                            len, block->pmr[chunk]->lkey,
-                                            block->block_name, rkey);
-
-            }
-        }
-    }
-    if (!block->pmr[chunk]) {
-        return -1;
-    }
-    rdma->total_registrations++;
-
-    if (lkey) {
-        *lkey = block->pmr[chunk]->lkey;
-    }
-    if (rkey) {
-        *rkey = block->pmr[chunk]->rkey;
-    }
-    return 0;
-}
-
-/*
- * Register (at connection time) the memory used for control
- * channel messages.
- */
-static int qemu_rdma_reg_control(RDMAContext *rdma, int idx)
-{
-    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->pd,
-            rdma->wr_data[idx].control, RDMA_CONTROL_MAX_BUFFER,
-            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
-    if (rdma->wr_data[idx].control_mr) {
-        rdma->total_registrations++;
-        return 0;
-    }
-    return -1;
-}
-
-/*
- * Perform a non-optimized memory unregistration after every transfer
- * for demonstration purposes, only if pin-all is not requested.
- *
- * Potential optimizations:
- * 1. Start a new thread to run this function continuously
-        - for bit clearing
-        - and for receipt of unregister messages
- * 2. Use an LRU.
- * 3. Use workload hints.
- */
-static int qemu_rdma_unregister_waiting(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    while (rdma->unregistrations[rdma->unregister_current]) {
-        int ret;
-        uint64_t wr_id = rdma->unregistrations[rdma->unregister_current];
-        uint64_t chunk =
-            (wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
-            (wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block =
-            &(rdma->local_ram_blocks.block[index]);
-        RDMARegister reg = { .current_index = index };
-        RDMAControlHeader resp = { .type = RDMA_CONTROL_UNREGISTER_FINISHED,
-                                 };
-        RDMAControlHeader head = { .len = sizeof(RDMARegister),
-                                   .type = RDMA_CONTROL_UNREGISTER_REQUEST,
-                                   .repeat = 1,
-                                 };
-
-        trace_qemu_rdma_unregister_waiting_proc(chunk,
-                                                rdma->unregister_current);
-
-        rdma->unregistrations[rdma->unregister_current] = 0;
-        rdma->unregister_current++;
-
-        if (rdma->unregister_current == RDMA_SIGNALED_SEND_MAX) {
-            rdma->unregister_current = 0;
-        }
-
-
-        /*
-         * Unregistration is speculative (because migration is single-threaded
-         * and we cannot break the protocol's inifinband message ordering).
-         * Thus, if the memory is currently being used for transmission,
-         * then abort the attempt to unregister and try again
-         * later the next time a completion is received for this memory.
-         */
-        clear_bit(chunk, block->unregister_bitmap);
-
-        if (test_bit(chunk, block->transit_bitmap)) {
-            trace_qemu_rdma_unregister_waiting_inflight(chunk);
-            continue;
-        }
-
-        trace_qemu_rdma_unregister_waiting_send(chunk);
-
-        ret = ibv_dereg_mr(block->pmr[chunk]);
-        block->pmr[chunk] = NULL;
-        block->remote_keys[chunk] = 0;
-
-        if (ret != 0) {
-            error_report("unregistration chunk failed: %s",
-                         strerror(ret));
-            return -1;
-        }
-        rdma->total_registrations--;
-
-        reg.key.chunk = chunk;
-        register_to_network(rdma, &reg);
-        ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                      &resp, NULL, NULL, &err);
-        if (ret < 0) {
-            error_report_err(err);
-            return -1;
-        }
-
-        trace_qemu_rdma_unregister_waiting_complete(chunk);
-    }
-
-    return 0;
-}
-
-static uint64_t qemu_rdma_make_wrid(uint64_t wr_id, uint64_t index,
-                                         uint64_t chunk)
-{
-    uint64_t result = wr_id & RDMA_WRID_TYPE_MASK;
-
-    result |= (index << RDMA_WRID_BLOCK_SHIFT);
-    result |= (chunk << RDMA_WRID_CHUNK_SHIFT);
-
-    return result;
-}
-
-/*
- * Consult the connection manager to see a work request
- * (of any kind) has completed.
- * Return the work request ID that completed.
- */
-static int qemu_rdma_poll(RDMAContext *rdma, struct ibv_cq *cq,
-                          uint64_t *wr_id_out, uint32_t *byte_len)
-{
-    int ret;
-    struct ibv_wc wc;
-    uint64_t wr_id;
-
-    ret = ibv_poll_cq(cq, 1, &wc);
-
-    if (!ret) {
-        *wr_id_out = RDMA_WRID_NONE;
-        return 0;
-    }
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    wr_id = wc.wr_id & RDMA_WRID_TYPE_MASK;
-
-    if (wc.status != IBV_WC_SUCCESS) {
-        return -1;
-    }
-
-    if (rdma->control_ready_expected &&
-        (wr_id >= RDMA_WRID_RECV_CONTROL)) {
-        trace_qemu_rdma_poll_recv(wr_id - RDMA_WRID_RECV_CONTROL, wr_id,
-                                  rdma->nb_sent);
-        rdma->control_ready_expected = 0;
-    }
-
-    if (wr_id == RDMA_WRID_RDMA_WRITE) {
-        uint64_t chunk =
-            (wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
-            (wc.wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block = &(rdma->local_ram_blocks.block[index]);
-
-        trace_qemu_rdma_poll_write(wr_id, rdma->nb_sent,
-                                   index, chunk, block->local_host_addr,
-                                   (void *)(uintptr_t)block->remote_host_addr);
-
-        clear_bit(chunk, block->transit_bitmap);
-
-        if (rdma->nb_sent > 0) {
-            rdma->nb_sent--;
-        }
-    } else {
-        trace_qemu_rdma_poll_other(wr_id, rdma->nb_sent);
-    }
-
-    *wr_id_out = wc.wr_id;
-    if (byte_len) {
-        *byte_len = wc.byte_len;
-    }
-
-    return  0;
-}
-
-/* Wait for activity on the completion channel.
- * Returns 0 on success, none-0 on error.
- */
-static int qemu_rdma_wait_comp_channel(RDMAContext *rdma,
-                                       struct ibv_comp_channel *comp_channel)
-{
-    struct rdma_cm_event *cm_event;
-
-    /*
-     * Coroutine doesn't start until migration_fd_process_incoming()
-     * so don't yield unless we know we're running inside of a coroutine.
-     */
-    if (rdma->migration_started_on_destination &&
-        migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) {
-        yield_until_fd_readable(comp_channel->fd);
-    } else {
-        /* This is the source side, we're in a separate thread
-         * or destination prior to migration_fd_process_incoming()
-         * after postcopy, the destination also in a separate thread.
-         * we can't yield; so we have to poll the fd.
-         * But we need to be able to handle 'cancel' or an error
-         * without hanging forever.
-         */
-        while (!rdma->errored && !rdma->received_error) {
-            GPollFD pfds[2];
-            pfds[0].fd = comp_channel->fd;
-            pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-            pfds[0].revents = 0;
-
-            pfds[1].fd = rdma->channel->fd;
-            pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
-            pfds[1].revents = 0;
-
-            /* 0.1s timeout, should be fine for a 'cancel' */
-            switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
-            case 2:
-            case 1: /* fd active */
-                if (pfds[0].revents) {
-                    return 0;
-                }
-
-                if (pfds[1].revents) {
-                    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-                        return -1;
-                    }
-
-                    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-                        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-                        rdma_ack_cm_event(cm_event);
-                        return -1;
-                    }
-                    rdma_ack_cm_event(cm_event);
-                }
-                break;
-
-            case 0: /* Timeout, go around again */
-                break;
-
-            default: /* Error of some type -
-                      * I don't trust errno from qemu_poll_ns
-                     */
-                return -1;
-            }
-
-            if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
-                /* Bail out and let the cancellation happen */
-                return -1;
-            }
-        }
-    }
-
-    if (rdma->received_error) {
-        return -1;
-    }
-    return -rdma->errored;
-}
-
-static struct ibv_comp_channel *to_channel(RDMAContext *rdma, uint64_t wrid)
-{
-    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_comp_channel :
-           rdma->recv_comp_channel;
-}
-
-static struct ibv_cq *to_cq(RDMAContext *rdma, uint64_t wrid)
-{
-    return wrid < RDMA_WRID_RECV_CONTROL ? rdma->send_cq : rdma->recv_cq;
-}
-
-/*
- * Block until the next work request has completed.
- *
- * First poll to see if a work request has already completed,
- * otherwise block.
- *
- * If we encounter completed work requests for IDs other than
- * the one we're interested in, then that's generally an error.
- *
- * The only exception is actual RDMA Write completions. These
- * completions only need to be recorded, but do not actually
- * need further processing.
- */
-static int qemu_rdma_block_for_wrid(RDMAContext *rdma,
-                                    uint64_t wrid_requested,
-                                    uint32_t *byte_len)
-{
-    int num_cq_events = 0, ret;
-    struct ibv_cq *cq;
-    void *cq_ctx;
-    uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
-    struct ibv_comp_channel *ch = to_channel(rdma, wrid_requested);
-    struct ibv_cq *poll_cq = to_cq(rdma, wrid_requested);
-
-    if (ibv_req_notify_cq(poll_cq, 0)) {
-        return -1;
-    }
-    /* poll cq first */
-    while (wr_id != wrid_requested) {
-        ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
-        if (ret < 0) {
-            return -1;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-        if (wr_id != wrid_requested) {
-            trace_qemu_rdma_block_for_wrid_miss(wrid_requested, wr_id);
-        }
-    }
-
-    if (wr_id == wrid_requested) {
-        return 0;
-    }
-
-    while (1) {
-        ret = qemu_rdma_wait_comp_channel(rdma, ch);
-        if (ret < 0) {
-            goto err_block_for_wrid;
-        }
-
-        ret = ibv_get_cq_event(ch, &cq, &cq_ctx);
-        if (ret < 0) {
-            goto err_block_for_wrid;
-        }
-
-        num_cq_events++;
-
-        if (ibv_req_notify_cq(cq, 0)) {
-            goto err_block_for_wrid;
-        }
-
-        while (wr_id != wrid_requested) {
-            ret = qemu_rdma_poll(rdma, poll_cq, &wr_id_in, byte_len);
-            if (ret < 0) {
-                goto err_block_for_wrid;
-            }
-
-            wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-            if (wr_id == RDMA_WRID_NONE) {
-                break;
-            }
-            if (wr_id != wrid_requested) {
-                trace_qemu_rdma_block_for_wrid_miss(wrid_requested, wr_id);
-            }
-        }
-
-        if (wr_id == wrid_requested) {
-            goto success_block_for_wrid;
-        }
-    }
-
-success_block_for_wrid:
-    if (num_cq_events) {
-        ibv_ack_cq_events(cq, num_cq_events);
-    }
-    return 0;
-
-err_block_for_wrid:
-    if (num_cq_events) {
-        ibv_ack_cq_events(cq, num_cq_events);
-    }
-
-    rdma->errored = true;
-    return -1;
-}
-
-/*
- * Post a SEND message work request for the control channel
- * containing some data and block until the post completes.
- */
-static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
-                                       RDMAControlHeader *head,
-                                       Error **errp)
-{
-    int ret;
-    RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_WRID_CONTROL];
-    struct ibv_send_wr *bad_wr;
-    struct ibv_sge sge = {
-                           .addr = (uintptr_t)(wr->control),
-                           .length = head->len + sizeof(RDMAControlHeader),
-                           .lkey = wr->control_mr->lkey,
-                         };
-    struct ibv_send_wr send_wr = {
-                                   .wr_id = RDMA_WRID_SEND_CONTROL,
-                                   .opcode = IBV_WR_SEND,
-                                   .send_flags = IBV_SEND_SIGNALED,
-                                   .sg_list = &sge,
-                                   .num_sge = 1,
-                                };
-
-    trace_qemu_rdma_post_send_control(control_desc(head->type));
-
-    /*
-     * We don't actually need to do a memcpy() in here if we used
-     * the "sge" properly, but since we're only sending control messages
-     * (not RAM in a performance-critical path), then its OK for now.
-     *
-     * The copy makes the RDMAControlHeader simpler to manipulate
-     * for the time being.
-     */
-    assert(head->len <= RDMA_CONTROL_MAX_BUFFER - sizeof(*head));
-    memcpy(wr->control, head, sizeof(RDMAControlHeader));
-    control_to_network((void *) wr->control);
-
-    if (buf) {
-        memcpy(wr->control + sizeof(RDMAControlHeader), buf, head->len);
-    }
-
-
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
-
-    if (ret > 0) {
-        error_setg(errp, "Failed to use post IB SEND for control");
-        return -1;
-    }
-
-    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
-    if (ret < 0) {
-        error_setg(errp, "rdma migration: send polling control error");
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Post a RECV work request in anticipation of some future receipt
- * of data on the control channel.
- */
-static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx,
-                                       Error **errp)
-{
-    struct ibv_recv_wr *bad_wr;
-    struct ibv_sge sge = {
-                            .addr = (uintptr_t)(rdma->wr_data[idx].control),
-                            .length = RDMA_CONTROL_MAX_BUFFER,
-                            .lkey = rdma->wr_data[idx].control_mr->lkey,
-                         };
-
-    struct ibv_recv_wr recv_wr = {
-                                    .wr_id = RDMA_WRID_RECV_CONTROL + idx,
-                                    .sg_list = &sge,
-                                    .num_sge = 1,
-                                 };
-
-
-    if (ibv_post_recv(rdma->qp, &recv_wr, &bad_wr)) {
-        error_setg(errp, "error posting control recv");
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Block and wait for a RECV control channel message to arrive.
- */
-static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
-                RDMAControlHeader *head, uint32_t expecting, int idx,
-                Error **errp)
-{
-    uint32_t byte_len;
-    int ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx,
-                                       &byte_len);
-
-    if (ret < 0) {
-        error_setg(errp, "rdma migration: recv polling control error!");
-        return -1;
-    }
-
-    network_to_control((void *) rdma->wr_data[idx].control);
-    memcpy(head, rdma->wr_data[idx].control, sizeof(RDMAControlHeader));
-
-    trace_qemu_rdma_exchange_get_response_start(control_desc(expecting));
-
-    if (expecting == RDMA_CONTROL_NONE) {
-        trace_qemu_rdma_exchange_get_response_none(control_desc(head->type),
-                                             head->type);
-    } else if (head->type != expecting || head->type == RDMA_CONTROL_ERROR) {
-        error_setg(errp, "Was expecting a %s (%d) control message"
-                ", but got: %s (%d), length: %d",
-                control_desc(expecting), expecting,
-                control_desc(head->type), head->type, head->len);
-        if (head->type == RDMA_CONTROL_ERROR) {
-            rdma->received_error = true;
-        }
-        return -1;
-    }
-    if (head->len > RDMA_CONTROL_MAX_BUFFER - sizeof(*head)) {
-        error_setg(errp, "too long length: %d", head->len);
-        return -1;
-    }
-    if (sizeof(*head) + head->len != byte_len) {
-        error_setg(errp, "Malformed length: %d byte_len %d",
-                   head->len, byte_len);
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * When a RECV work request has completed, the work request's
- * buffer is pointed at the header.
- *
- * This will advance the pointer to the data portion
- * of the control message of the work request's buffer that
- * was populated after the work request finished.
- */
-static void qemu_rdma_move_header(RDMAContext *rdma, int idx,
-                                  RDMAControlHeader *head)
-{
-    rdma->wr_data[idx].control_len = head->len;
-    rdma->wr_data[idx].control_curr =
-        rdma->wr_data[idx].control + sizeof(RDMAControlHeader);
-}
-
-/*
- * This is an 'atomic' high-level operation to deliver a single, unified
- * control-channel message.
- *
- * Additionally, if the user is expecting some kind of reply to this message,
- * they can request a 'resp' response message be filled in by posting an
- * additional work request on behalf of the user and waiting for an additional
- * completion.
- *
- * The extra (optional) response is used during registration to us from having
- * to perform an *additional* exchange of message just to provide a response by
- * instead piggy-backing on the acknowledgement.
- */
-static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx,
-                                   int (*callback)(RDMAContext *rdma,
-                                                   Error **errp),
-                                   Error **errp)
-{
-    int ret;
-
-    /*
-     * Wait until the dest is ready before attempting to deliver the message
-     * by waiting for a READY message.
-     */
-    if (rdma->control_ready_expected) {
-        RDMAControlHeader resp_ignored;
-
-        ret = qemu_rdma_exchange_get_response(rdma, &resp_ignored,
-                                              RDMA_CONTROL_READY,
-                                              RDMA_WRID_READY, errp);
-        if (ret < 0) {
-            return -1;
-        }
-    }
-
-    /*
-     * If the user is expecting a response, post a WR in anticipation of it.
-     */
-    if (resp) {
-        ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_DATA, errp);
-        if (ret < 0) {
-            return -1;
-        }
-    }
-
-    /*
-     * Post a WR to replace the one we just consumed for the READY message.
-     */
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * Deliver the control message that was requested.
-     */
-    ret = qemu_rdma_post_send_control(rdma, data, head, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * If we're expecting a response, block and wait for it.
-     */
-    if (resp) {
-        if (callback) {
-            trace_qemu_rdma_exchange_send_issue_callback();
-            ret = callback(rdma, errp);
-            if (ret < 0) {
-                return -1;
-            }
-        }
-
-        trace_qemu_rdma_exchange_send_waiting(control_desc(resp->type));
-        ret = qemu_rdma_exchange_get_response(rdma, resp,
-                                              resp->type, RDMA_WRID_DATA,
-                                              errp);
-
-        if (ret < 0) {
-            return -1;
-        }
-
-        qemu_rdma_move_header(rdma, RDMA_WRID_DATA, resp);
-        if (resp_idx) {
-            *resp_idx = RDMA_WRID_DATA;
-        }
-        trace_qemu_rdma_exchange_send_received(control_desc(resp->type));
-    }
-
-    rdma->control_ready_expected = 1;
-
-    return 0;
-}
-
-/*
- * This is an 'atomic' high-level operation to receive a single, unified
- * control-channel message.
- */
-static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
-                                   uint32_t expecting, Error **errp)
-{
-    RDMAControlHeader ready = {
-                                .len = 0,
-                                .type = RDMA_CONTROL_READY,
-                                .repeat = 1,
-                              };
-    int ret;
-
-    /*
-     * Inform the source that we're ready to receive a message.
-     */
-    ret = qemu_rdma_post_send_control(rdma, NULL, &ready, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    /*
-     * Block and wait for the message.
-     */
-    ret = qemu_rdma_exchange_get_response(rdma, head,
-                                          expecting, RDMA_WRID_READY, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    qemu_rdma_move_header(rdma, RDMA_WRID_READY, head);
-
-    /*
-     * Post a new RECV work request to replace the one we just consumed.
-     */
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        return -1;
-    }
-
-    return 0;
-}
-
-/*
- * Write an actual chunk of memory using RDMA.
- *
- * If we're using dynamic registration on the dest-side, we have to
- * send a registration command first.
- */
-static int qemu_rdma_write_one(RDMAContext *rdma,
-                               int current_index, uint64_t current_addr,
-                               uint64_t length, Error **errp)
-{
-    struct ibv_sge sge;
-    struct ibv_send_wr send_wr = { 0 };
-    struct ibv_send_wr *bad_wr;
-    int reg_result_idx, ret, count = 0;
-    uint64_t chunk, chunks;
-    uint8_t *chunk_start, *chunk_end;
-    RDMALocalBlock *block = &(rdma->local_ram_blocks.block[current_index]);
-    RDMARegister reg;
-    RDMARegisterResult *reg_result;
-    RDMAControlHeader resp = { .type = RDMA_CONTROL_REGISTER_RESULT };
-    RDMAControlHeader head = { .len = sizeof(RDMARegister),
-                               .type = RDMA_CONTROL_REGISTER_REQUEST,
-                               .repeat = 1,
-                             };
-
-retry:
-    sge.addr = (uintptr_t)(block->local_host_addr +
-                            (current_addr - block->offset));
-    sge.length = length;
-
-    chunk = ram_chunk_index(block->local_host_addr,
-                            (uint8_t *)(uintptr_t)sge.addr);
-    chunk_start = ram_chunk_start(block, chunk);
-
-    if (block->is_ram_block) {
-        chunks = length / (1UL << RDMA_REG_CHUNK_SHIFT);
-
-        if (chunks && ((length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    } else {
-        chunks = block->length / (1UL << RDMA_REG_CHUNK_SHIFT);
-
-        if (chunks && ((block->length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    }
-
-    trace_qemu_rdma_write_one_top(chunks + 1,
-                                  (chunks + 1) *
-                                  (1UL << RDMA_REG_CHUNK_SHIFT) / 1024 / 1024);
-
-    chunk_end = ram_chunk_end(block, chunk + chunks);
-
-
-    while (test_bit(chunk, block->transit_bitmap)) {
-        (void)count;
-        trace_qemu_rdma_write_one_block(count++, current_index, chunk,
-                sge.addr, length, rdma->nb_sent, block->nb_chunks);
-
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
-
-        if (ret < 0) {
-            error_setg(errp, "Failed to Wait for previous write to complete "
-                    "block %d chunk %" PRIu64
-                    " current %" PRIu64 " len %" PRIu64 " %d",
-                    current_index, chunk, sge.addr, length, rdma->nb_sent);
-            return -1;
-        }
-    }
-
-    if (!rdma->pin_all || !block->is_ram_block) {
-        if (!block->remote_keys[chunk]) {
-            /*
-             * This chunk has not yet been registered, so first check to see
-             * if the entire chunk is zero. If so, tell the other size to
-             * memset() + madvise() the entire chunk without RDMA.
-             */
-
-            if (buffer_is_zero((void *)(uintptr_t)sge.addr, length)) {
-                RDMACompress comp = {
-                                        .offset = current_addr,
-                                        .value = 0,
-                                        .block_idx = current_index,
-                                        .length = length,
-                                    };
-
-                head.len = sizeof(comp);
-                head.type = RDMA_CONTROL_COMPRESS;
-
-                trace_qemu_rdma_write_one_zero(chunk, sge.length,
-                                               current_index, current_addr);
-
-                compress_to_network(rdma, &comp);
-                ret = qemu_rdma_exchange_send(rdma, &head,
-                                (uint8_t *) &comp, NULL, NULL, NULL, errp);
-
-                if (ret < 0) {
-                    return -1;
-                }
-
-                /*
-                 * TODO: Here we are sending something, but we are not
-                 * accounting for anything transferred.  The following is wrong:
-                 *
-                 * stat64_add(&mig_stats.rdma_bytes, sge.length);
-                 *
-                 * because we are using some kind of compression.  I
-                 * would think that head.len would be the more similar
-                 * thing to a correct value.
-                 */
-                stat64_add(&mig_stats.zero_pages,
-                           sge.length / qemu_target_page_size());
-                return 1;
-            }
-
-            /*
-             * Otherwise, tell other side to register.
-             */
-            reg.current_index = current_index;
-            if (block->is_ram_block) {
-                reg.key.current_addr = current_addr;
-            } else {
-                reg.key.chunk = chunk;
-            }
-            reg.chunks = chunks;
-
-            trace_qemu_rdma_write_one_sendreg(chunk, sge.length, current_index,
-                                              current_addr);
-
-            register_to_network(rdma, &reg);
-            ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                    &resp, &reg_result_idx, NULL, errp);
-            if (ret < 0) {
-                return -1;
-            }
-
-            /* try to overlap this single registration with the one we sent. */
-            if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
-                error_setg(errp, "cannot get lkey");
-                return -1;
-            }
-
-            reg_result = (RDMARegisterResult *)
-                    rdma->wr_data[reg_result_idx].control_curr;
-
-            network_to_result(reg_result);
-
-            trace_qemu_rdma_write_one_recvregres(block->remote_keys[chunk],
-                                                 reg_result->rkey, chunk);
-
-            block->remote_keys[chunk] = reg_result->rkey;
-            block->remote_host_addr = reg_result->host_addr;
-        } else {
-            /* already registered before */
-            if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
-                error_setg(errp, "cannot get lkey!");
-                return -1;
-            }
-        }
-
-        send_wr.wr.rdma.rkey = block->remote_keys[chunk];
-    } else {
-        send_wr.wr.rdma.rkey = block->remote_rkey;
-
-        if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr,
-                                                     &sge.lkey, NULL, chunk,
-                                                     chunk_start, chunk_end)) {
-            error_setg(errp, "cannot get lkey!");
-            return -1;
-        }
-    }
-
-    /*
-     * Encode the ram block index and chunk within this wrid.
-     * We will use this information at the time of completion
-     * to figure out which bitmap to check against and then which
-     * chunk in the bitmap to look for.
-     */
-    send_wr.wr_id = qemu_rdma_make_wrid(RDMA_WRID_RDMA_WRITE,
-                                        current_index, chunk);
-
-    send_wr.opcode = IBV_WR_RDMA_WRITE;
-    send_wr.send_flags = IBV_SEND_SIGNALED;
-    send_wr.sg_list = &sge;
-    send_wr.num_sge = 1;
-    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
-                                (current_addr - block->offset);
-
-    trace_qemu_rdma_write_one_post(chunk, sge.addr, send_wr.wr.rdma.remote_addr,
-                                   sge.length);
-
-    /*
-     * ibv_post_send() does not return negative error numbers,
-     * per the specification they are positive - no idea why.
-     */
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
-
-    if (ret == ENOMEM) {
-        trace_qemu_rdma_write_one_queue_full();
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
-        if (ret < 0) {
-            error_setg(errp, "rdma migration: failed to make "
-                         "room in full send queue!");
-            return -1;
-        }
-
-        goto retry;
-
-    } else if (ret > 0) {
-        error_setg_errno(errp, ret,
-                         "rdma migration: post rdma write failed");
-        return -1;
-    }
-
-    set_bit(chunk, block->transit_bitmap);
-    stat64_add(&mig_stats.normal_pages, sge.length / qemu_target_page_size());
-    /*
-     * We are adding to transferred the amount of data written, but no
-     * overhead at all.  I will assume that RDMA is magicaly and don't
-     * need to transfer (at least) the addresses where it wants to
-     * write the pages.  Here it looks like it should be something
-     * like:
-     *     sizeof(send_wr) + sge.length
-     * but this being RDMA, who knows.
-     */
-    stat64_add(&mig_stats.rdma_bytes, sge.length);
-    ram_transferred_add(sge.length);
-    rdma->total_writes++;
-
-    return 0;
-}
-
-/*
- * Push out any unwritten RDMA operations.
- *
- * We support sending out multiple chunks at the same time.
- * Not all of them need to get signaled in the completion queue.
- */
-static int qemu_rdma_write_flush(RDMAContext *rdma, Error **errp)
-{
-    int ret;
-
-    if (!rdma->current_length) {
-        return 0;
-    }
-
-    ret = qemu_rdma_write_one(rdma, rdma->current_index, rdma->current_addr,
-                              rdma->current_length, errp);
-
-    if (ret < 0) {
-        return -1;
-    }
-
-    if (ret == 0) {
-        rdma->nb_sent++;
-        trace_qemu_rdma_write_flush(rdma->nb_sent);
-    }
-
-    rdma->current_length = 0;
-    rdma->current_addr = 0;
-
-    return 0;
-}
-
-static inline bool qemu_rdma_buffer_mergeable(RDMAContext *rdma,
-                    uint64_t offset, uint64_t len)
-{
-    RDMALocalBlock *block;
-    uint8_t *host_addr;
-    uint8_t *chunk_end;
-
-    if (rdma->current_index < 0) {
-        return false;
-    }
-
-    if (rdma->current_chunk < 0) {
-        return false;
-    }
-
-    block = &(rdma->local_ram_blocks.block[rdma->current_index]);
-    host_addr = block->local_host_addr + (offset - block->offset);
-    chunk_end = ram_chunk_end(block, rdma->current_chunk);
-
-    if (rdma->current_length == 0) {
-        return false;
-    }
-
-    /*
-     * Only merge into chunk sequentially.
-     */
-    if (offset != (rdma->current_addr + rdma->current_length)) {
-        return false;
-    }
-
-    if (offset < block->offset) {
-        return false;
-    }
-
-    if ((offset + len) > (block->offset + block->length)) {
-        return false;
-    }
-
-    if ((host_addr + len) > chunk_end) {
-        return false;
-    }
-
-    return true;
-}
-
-/*
- * We're not actually writing here, but doing three things:
- *
- * 1. Identify the chunk the buffer belongs to.
- * 2. If the chunk is full or the buffer doesn't belong to the current
- *    chunk, then start a new chunk and flush() the old chunk.
- * 3. To keep the hardware busy, we also group chunks into batches
- *    and only require that a batch gets acknowledged in the completion
- *    queue instead of each individual chunk.
- */
-static int qemu_rdma_write(RDMAContext *rdma,
-                           uint64_t block_offset, uint64_t offset,
-                           uint64_t len, Error **errp)
-{
-    uint64_t current_addr = block_offset + offset;
-    uint64_t index = rdma->current_index;
-    uint64_t chunk = rdma->current_chunk;
-
-    /* If we cannot merge it, we flush the current buffer first. */
-    if (!qemu_rdma_buffer_mergeable(rdma, current_addr, len)) {
-        if (qemu_rdma_write_flush(rdma, errp) < 0) {
-            return -1;
-        }
-        rdma->current_length = 0;
-        rdma->current_addr = current_addr;
-
-        qemu_rdma_search_ram_block(rdma, block_offset,
-                                   offset, len, &index, &chunk);
-        rdma->current_index = index;
-        rdma->current_chunk = chunk;
-    }
-
-    /* merge it */
-    rdma->current_length += len;
-
-    /* flush it if buffer is too large */
-    if (rdma->current_length >= RDMA_MERGE_MAX) {
-        return qemu_rdma_write_flush(rdma, errp);
-    }
-
-    return 0;
-}
-
-static void qemu_rdma_cleanup(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    if (rdma->cm_id && rdma->connected) {
-        if ((rdma->errored ||
-             migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) &&
-            !rdma->received_error) {
-            RDMAControlHeader head = { .len = 0,
-                                       .type = RDMA_CONTROL_ERROR,
-                                       .repeat = 1,
-                                     };
-            warn_report("Early error. Sending error.");
-            if (qemu_rdma_post_send_control(rdma, NULL, &head, &err) < 0) {
-                warn_report_err(err);
-            }
-        }
-
-        rdma_disconnect(rdma->cm_id);
-        trace_qemu_rdma_cleanup_disconnect();
-        rdma->connected = false;
-    }
-
-    if (rdma->channel) {
-        qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
-    }
-    g_free(rdma->dest_blocks);
-    rdma->dest_blocks = NULL;
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        if (rdma->wr_data[i].control_mr) {
-            rdma->total_registrations--;
-            ibv_dereg_mr(rdma->wr_data[i].control_mr);
-        }
-        rdma->wr_data[i].control_mr = NULL;
-    }
-
-    if (rdma->local_ram_blocks.block) {
-        while (rdma->local_ram_blocks.nb_blocks) {
-            rdma_delete_block(rdma, &rdma->local_ram_blocks.block[0]);
-        }
-    }
-
-    if (rdma->qp) {
-        rdma_destroy_qp(rdma->cm_id);
-        rdma->qp = NULL;
-    }
-    if (rdma->recv_cq) {
-        ibv_destroy_cq(rdma->recv_cq);
-        rdma->recv_cq = NULL;
-    }
-    if (rdma->send_cq) {
-        ibv_destroy_cq(rdma->send_cq);
-        rdma->send_cq = NULL;
-    }
-    if (rdma->recv_comp_channel) {
-        ibv_destroy_comp_channel(rdma->recv_comp_channel);
-        rdma->recv_comp_channel = NULL;
-    }
-    if (rdma->send_comp_channel) {
-        ibv_destroy_comp_channel(rdma->send_comp_channel);
-        rdma->send_comp_channel = NULL;
-    }
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-        rdma->pd = NULL;
-    }
-    if (rdma->cm_id) {
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
-    }
-
-    /* the destination side, listen_id and channel is shared */
-    if (rdma->listen_id) {
-        if (!rdma->is_return_path) {
-            rdma_destroy_id(rdma->listen_id);
-        }
-        rdma->listen_id = NULL;
-
-        if (rdma->channel) {
-            if (!rdma->is_return_path) {
-                rdma_destroy_event_channel(rdma->channel);
-            }
-            rdma->channel = NULL;
-        }
-    }
-
-    if (rdma->channel) {
-        rdma_destroy_event_channel(rdma->channel);
-        rdma->channel = NULL;
-    }
-    g_free(rdma->host);
-    rdma->host = NULL;
-}
-
-
-static int qemu_rdma_source_init(RDMAContext *rdma, bool pin_all, Error **errp)
-{
-    int ret;
-
-    /*
-     * Will be validated against destination's actual capabilities
-     * after the connect() completes.
-     */
-    rdma->pin_all = pin_all;
-
-    ret = qemu_rdma_resolve_host(rdma, errp);
-    if (ret < 0) {
-        goto err_rdma_source_init;
-    }
-
-    ret = qemu_rdma_alloc_pd_cq(rdma, errp);
-    if (ret < 0) {
-        goto err_rdma_source_init;
-    }
-
-    ret = qemu_rdma_alloc_qp(rdma);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: rdma migration: error allocating qp!");
-        goto err_rdma_source_init;
-    }
-
-    qemu_rdma_init_ram_blocks(rdma);
-
-    /* Build the hash that maps from offset to RAMBlock */
-    rdma->blockmap = g_hash_table_new(g_direct_hash, g_direct_equal);
-    for (int i = 0; i < rdma->local_ram_blocks.nb_blocks; i++) {
-        g_hash_table_insert(rdma->blockmap,
-                (void *)(uintptr_t)rdma->local_ram_blocks.block[i].offset,
-                &rdma->local_ram_blocks.block[i]);
-    }
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        ret = qemu_rdma_reg_control(rdma, i);
-        if (ret < 0) {
-            error_setg(errp, "RDMA ERROR: rdma migration: error "
-                       "registering %d control!", i);
-            goto err_rdma_source_init;
-        }
-    }
-
-    return 0;
-
-err_rdma_source_init:
-    qemu_rdma_cleanup(rdma);
-    return -1;
-}
-
-static int qemu_get_cm_event_timeout(RDMAContext *rdma,
-                                     struct rdma_cm_event **cm_event,
-                                     long msec, Error **errp)
-{
-    int ret;
-    struct pollfd poll_fd = {
-                                .fd = rdma->channel->fd,
-                                .events = POLLIN,
-                                .revents = 0
-                            };
-
-    do {
-        ret = poll(&poll_fd, 1, msec);
-    } while (ret < 0 && errno == EINTR);
-
-    if (ret == 0) {
-        error_setg(errp, "RDMA ERROR: poll cm event timeout");
-        return -1;
-    } else if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: failed to poll cm event, errno=%i",
-                   errno);
-        return -1;
-    } else if (poll_fd.revents & POLLIN) {
-        if (rdma_get_cm_event(rdma->channel, cm_event) < 0) {
-            error_setg(errp, "RDMA ERROR: failed to get cm event");
-            return -1;
-        }
-        return 0;
-    } else {
-        error_setg(errp, "RDMA ERROR: no POLLIN event, revent=%x",
-                   poll_fd.revents);
-        return -1;
-    }
-}
-
-static int qemu_rdma_connect(RDMAContext *rdma, bool return_path,
-                             Error **errp)
-{
-    RDMACapabilities cap = {
-                                .version = RDMA_CONTROL_VERSION_CURRENT,
-                                .flags = 0,
-                           };
-    struct rdma_conn_param conn_param = { .initiator_depth = 2,
-                                          .retry_count = 5,
-                                          .private_data = &cap,
-                                          .private_data_len = sizeof(cap),
-                                        };
-    struct rdma_cm_event *cm_event;
-    int ret;
-
-    /*
-     * Only negotiate the capability with destination if the user
-     * on the source first requested the capability.
-     */
-    if (rdma->pin_all) {
-        trace_qemu_rdma_connect_pin_all_requested();
-        cap.flags |= RDMA_CAPABILITY_PIN_ALL;
-    }
-
-    caps_to_network(&cap);
-
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, errp);
-    if (ret < 0) {
-        goto err_rdma_source_connect;
-    }
-
-    ret = rdma_connect(rdma->cm_id, &conn_param);
-    if (ret < 0) {
-        error_setg_errno(errp, errno,
-                         "RDMA ERROR: connecting to destination!");
-        goto err_rdma_source_connect;
-    }
-
-    if (return_path) {
-        ret = qemu_get_cm_event_timeout(rdma, &cm_event, 5000, errp);
-    } else {
-        ret = rdma_get_cm_event(rdma->channel, &cm_event);
-        if (ret < 0) {
-            error_setg_errno(errp, errno,
-                             "RDMA ERROR: failed to get cm event");
-        }
-    }
-    if (ret < 0) {
-        goto err_rdma_source_connect;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        error_setg(errp, "RDMA ERROR: connecting to destination!");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_source_connect;
-    }
-    rdma->connected = true;
-
-    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
-    network_to_caps(&cap);
-
-    /*
-     * Verify that the *requested* capabilities are supported by the destination
-     * and disable them otherwise.
-     */
-    if (rdma->pin_all && !(cap.flags & RDMA_CAPABILITY_PIN_ALL)) {
-        warn_report("RDMA: Server cannot support pinning all memory. "
-                    "Will register memory dynamically.");
-        rdma->pin_all = false;
-    }
-
-    trace_qemu_rdma_connect_pin_all_outcome(rdma->pin_all);
-
-    rdma_ack_cm_event(cm_event);
-
-    rdma->control_ready_expected = 1;
-    rdma->nb_sent = 0;
-    return 0;
-
-err_rdma_source_connect:
-    qemu_rdma_cleanup(rdma);
-    return -1;
-}
-
-static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
-{
-    Error *err = NULL;
-    int ret;
-    struct rdma_cm_id *listen_id;
-    char ip[40] = "unknown";
-    struct rdma_addrinfo *res, *e;
-    char port_str[16];
-    int reuse = 1;
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        rdma->wr_data[i].control_len = 0;
-        rdma->wr_data[i].control_curr = NULL;
-    }
-
-    if (!rdma->host || !rdma->host[0]) {
-        error_setg(errp, "RDMA ERROR: RDMA host is not set!");
-        rdma->errored = true;
-        return -1;
-    }
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        error_setg(errp, "RDMA ERROR: could not create rdma event channel");
-        rdma->errored = true;
-        return -1;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &listen_id, NULL, RDMA_PS_TCP);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: could not create cm_id!");
-        goto err_dest_init_create_listen_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret) {
-        error_setg(errp, "RDMA ERROR: could not rdma_getaddrinfo address %s",
-                   rdma->host);
-        goto err_dest_init_bind_addr;
-    }
-
-    ret = rdma_set_option(listen_id, RDMA_OPTION_ID, RDMA_OPTION_ID_REUSEADDR,
-                          &reuse, sizeof reuse);
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: Error: could not set REUSEADDR option");
-        goto err_dest_init_bind_addr;
-    }
-
-    /* Try all addresses, saving the first error in @err */
-    for (e = res; e != NULL; e = e->ai_next) {
-        Error **local_errp = err ? NULL : &err;
-
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        trace_qemu_rdma_dest_init_trying(rdma->host, ip);
-        ret = rdma_bind_addr(listen_id, e->ai_dst_addr);
-        if (ret < 0) {
-            continue;
-        }
-        if (e->ai_family == AF_INET6) {
-            ret = qemu_rdma_broken_ipv6_kernel(listen_id->verbs,
-                                               local_errp);
-            if (ret < 0) {
-                continue;
-            }
-        }
-        error_free(err);
-        break;
-    }
-
-    rdma_freeaddrinfo(res);
-    if (!e) {
-        if (err) {
-            error_propagate(errp, err);
-        } else {
-            error_setg(errp, "RDMA ERROR: Error: could not rdma_bind_addr!");
-        }
-        goto err_dest_init_bind_addr;
-    }
-
-    rdma->listen_id = listen_id;
-    qemu_rdma_dump_gid("dest_init", listen_id);
-    return 0;
-
-err_dest_init_bind_addr:
-    rdma_destroy_id(listen_id);
-err_dest_init_create_listen_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    rdma->errored = true;
-    return -1;
-
-}
-
-static void qemu_rdma_return_path_dest_init(RDMAContext *rdma_return_path,
-                                            RDMAContext *rdma)
-{
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        rdma_return_path->wr_data[i].control_len = 0;
-        rdma_return_path->wr_data[i].control_curr = NULL;
-    }
-
-    /*the CM channel and CM id is shared*/
-    rdma_return_path->channel = rdma->channel;
-    rdma_return_path->listen_id = rdma->listen_id;
-
-    rdma->return_path = rdma_return_path;
-    rdma_return_path->return_path = rdma;
-    rdma_return_path->is_return_path = true;
-}
-
-static RDMAContext *qemu_rdma_data_init(InetSocketAddress *saddr, Error **errp)
-{
-    RDMAContext *rdma = NULL;
-
-    rdma = g_new0(RDMAContext, 1);
-    rdma->current_index = -1;
-    rdma->current_chunk = -1;
-
-    rdma->host = g_strdup(saddr->host);
-    rdma->port = atoi(saddr->port);
-    return rdma;
-}
-
-/*
- * QEMUFile interface to the control channel.
- * SEND messages for control only.
- * VM's ram is handled with regular RDMA messages.
- */
-static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
-                                       const struct iovec *iov,
-                                       size_t niov,
-                                       int *fds,
-                                       size_t nfds,
-                                       int flags,
-                                       Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdma;
-    int ret;
-    ssize_t done = 0;
-    size_t len;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-
-    if (!rdma) {
-        error_setg(errp, "RDMA control channel output is not set");
-        return -1;
-    }
-
-    if (rdma->errored) {
-        error_setg(errp,
-                   "RDMA is in an error state waiting migration to abort!");
-        return -1;
-    }
-
-    /*
-     * Push out any writes that
-     * we're queued up for VM's ram.
-     */
-    ret = qemu_rdma_write_flush(rdma, errp);
-    if (ret < 0) {
-        rdma->errored = true;
-        return -1;
-    }
-
-    for (int i = 0; i < niov; i++) {
-        size_t remaining = iov[i].iov_len;
-        uint8_t * data = (void *)iov[i].iov_base;
-        while (remaining) {
-            RDMAControlHeader head = {};
-
-            len = MIN(remaining, RDMA_SEND_INCREMENT);
-            remaining -= len;
-
-            head.len = len;
-            head.type = RDMA_CONTROL_QEMU_FILE;
-
-            ret = qemu_rdma_exchange_send(rdma, &head,
-                                          data, NULL, NULL, NULL, errp);
-
-            if (ret < 0) {
-                rdma->errored = true;
-                return -1;
-            }
-
-            data += len;
-            done += len;
-        }
-    }
-
-    return done;
-}
-
-static size_t qemu_rdma_fill(RDMAContext *rdma, uint8_t *buf,
-                             size_t size, int idx)
-{
-    size_t len = 0;
-
-    if (rdma->wr_data[idx].control_len) {
-        trace_qemu_rdma_fill(rdma->wr_data[idx].control_len, size);
-
-        len = MIN(size, rdma->wr_data[idx].control_len);
-        memcpy(buf, rdma->wr_data[idx].control_curr, len);
-        rdma->wr_data[idx].control_curr += len;
-        rdma->wr_data[idx].control_len -= len;
-    }
-
-    return len;
-}
-
-/*
- * QEMUFile interface to the control channel.
- * RDMA links don't use bytestreams, so we have to
- * return bytes to QEMUFile opportunistically.
- */
-static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
-                                      const struct iovec *iov,
-                                      size_t niov,
-                                      int **fds,
-                                      size_t *nfds,
-                                      int flags,
-                                      Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdma;
-    RDMAControlHeader head;
-    int ret;
-    ssize_t done = 0;
-    size_t len;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        error_setg(errp, "RDMA control channel input is not set");
-        return -1;
-    }
-
-    if (rdma->errored) {
-        error_setg(errp,
-                   "RDMA is in an error state waiting migration to abort!");
-        return -1;
-    }
-
-    for (int i = 0; i < niov; i++) {
-        size_t want = iov[i].iov_len;
-        uint8_t *data = (void *)iov[i].iov_base;
-
-        /*
-         * First, we hold on to the last SEND message we
-         * were given and dish out the bytes until we run
-         * out of bytes.
-         */
-        len = qemu_rdma_fill(rdma, data, want, 0);
-        done += len;
-        want -= len;
-        /* Got what we needed, so go to next iovec */
-        if (want == 0) {
-            continue;
-        }
-
-        /* If we got any data so far, then don't wait
-         * for more, just return what we have */
-        if (done > 0) {
-            break;
-        }
-
-
-        /* We've got nothing at all, so lets wait for
-         * more to arrive
-         */
-        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_QEMU_FILE,
-                                      errp);
-
-        if (ret < 0) {
-            rdma->errored = true;
-            return -1;
-        }
-
-        /*
-         * SEND was received with new bytes, now try again.
-         */
-        len = qemu_rdma_fill(rdma, data, want, 0);
-        done += len;
-        want -= len;
-
-        /* Still didn't get enough, so lets just return */
-        if (want) {
-            if (done == 0) {
-                return QIO_CHANNEL_ERR_BLOCK;
-            } else {
-                break;
-            }
-        }
-    }
-    return done;
-}
-
-/*
- * Block until all the outstanding chunks have been delivered by the hardware.
- */
-static int qemu_rdma_drain_cq(RDMAContext *rdma)
-{
-    Error *err = NULL;
-
-    if (qemu_rdma_write_flush(rdma, &err) < 0) {
-        error_report_err(err);
-        return -1;
-    }
-
-    while (rdma->nb_sent) {
-        if (qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL) < 0) {
-            error_report("rdma migration: complete polling error!");
-            return -1;
-        }
-    }
-
-    qemu_rdma_unregister_waiting(rdma);
-
-    return 0;
-}
-
-
-static int qio_channel_rdma_set_blocking(QIOChannel *ioc,
-                                         bool blocking,
-                                         Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    /* XXX we should make readv/writev actually honour this :-) */
-    rioc->blocking = blocking;
-    return 0;
-}
-
-
-typedef struct QIOChannelRDMASource QIOChannelRDMASource;
-struct QIOChannelRDMASource {
-    GSource parent;
-    QIOChannelRDMA *rioc;
-    GIOCondition condition;
-};
-
-static gboolean
-qio_channel_rdma_source_prepare(GSource *source,
-                                gint *timeout)
-{
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-    *timeout = -1;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when prepare Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return cond & rsource->condition;
-}
-
-static gboolean
-qio_channel_rdma_source_check(GSource *source)
-{
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when check Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return cond & rsource->condition;
-}
-
-static gboolean
-qio_channel_rdma_source_dispatch(GSource *source,
-                                 GSourceFunc callback,
-                                 gpointer user_data)
-{
-    QIOChannelFunc func = (QIOChannelFunc)callback;
-    QIOChannelRDMASource *rsource = (QIOChannelRDMASource *)source;
-    RDMAContext *rdma;
-    GIOCondition cond = 0;
-
-    RCU_READ_LOCK_GUARD();
-    if (rsource->condition == G_IO_IN) {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmain);
-    } else {
-        rdma = qatomic_rcu_read(&rsource->rioc->rdmaout);
-    }
-
-    if (!rdma) {
-        error_report("RDMAContext is NULL when dispatch Gsource");
-        return FALSE;
-    }
-
-    if (rdma->wr_data[0].control_len) {
-        cond |= G_IO_IN;
-    }
-    cond |= G_IO_OUT;
-
-    return (*func)(QIO_CHANNEL(rsource->rioc),
-                   (cond & rsource->condition),
-                   user_data);
-}
-
-static void
-qio_channel_rdma_source_finalize(GSource *source)
-{
-    QIOChannelRDMASource *ssource = (QIOChannelRDMASource *)source;
-
-    object_unref(OBJECT(ssource->rioc));
-}
-
-static GSourceFuncs qio_channel_rdma_source_funcs = {
-    qio_channel_rdma_source_prepare,
-    qio_channel_rdma_source_check,
-    qio_channel_rdma_source_dispatch,
-    qio_channel_rdma_source_finalize
-};
-
-static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
-                                              GIOCondition condition)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    QIOChannelRDMASource *ssource;
-    GSource *source;
-
-    source = g_source_new(&qio_channel_rdma_source_funcs,
-                          sizeof(QIOChannelRDMASource));
-    ssource = (QIOChannelRDMASource *)source;
-
-    ssource->rioc = rioc;
-    object_ref(OBJECT(rioc));
-
-    ssource->condition = condition;
-
-    return source;
-}
-
-static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
-                                                AioContext *read_ctx,
-                                                IOHandler *io_read,
-                                                AioContext *write_ctx,
-                                                IOHandler *io_write,
-                                                void *opaque)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    if (io_read) {
-        aio_set_fd_handler(read_ctx, rioc->rdmain->recv_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-        aio_set_fd_handler(read_ctx, rioc->rdmain->send_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-    } else {
-        aio_set_fd_handler(write_ctx, rioc->rdmaout->recv_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-        aio_set_fd_handler(write_ctx, rioc->rdmaout->send_comp_channel->fd,
-                           io_read, io_write, NULL, NULL, opaque);
-    }
-}
-
-struct rdma_close_rcu {
-    struct rcu_head rcu;
-    RDMAContext *rdmain;
-    RDMAContext *rdmaout;
-};
-
-/* callback from qio_channel_rdma_close via call_rcu */
-static void qio_channel_rdma_close_rcu(struct rdma_close_rcu *rcu)
-{
-    if (rcu->rdmain) {
-        qemu_rdma_cleanup(rcu->rdmain);
-    }
-
-    if (rcu->rdmaout) {
-        qemu_rdma_cleanup(rcu->rdmaout);
-    }
-
-    g_free(rcu->rdmain);
-    g_free(rcu->rdmaout);
-    g_free(rcu);
-}
-
-static int qio_channel_rdma_close(QIOChannel *ioc,
-                                  Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdmain, *rdmaout;
-    struct rdma_close_rcu *rcu = g_new(struct rdma_close_rcu, 1);
-
-    trace_qemu_rdma_close();
-
-    rdmain = rioc->rdmain;
-    if (rdmain) {
-        qatomic_rcu_set(&rioc->rdmain, NULL);
-    }
-
-    rdmaout = rioc->rdmaout;
-    if (rdmaout) {
-        qatomic_rcu_set(&rioc->rdmaout, NULL);
-    }
-
-    rcu->rdmain = rdmain;
-    rcu->rdmaout = rdmaout;
-    call_rcu(rcu, qio_channel_rdma_close_rcu, rcu);
-
-    return 0;
-}
-
-static int
-qio_channel_rdma_shutdown(QIOChannel *ioc,
-                            QIOChannelShutdown how,
-                            Error **errp)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
-    RDMAContext *rdmain, *rdmaout;
-
-    RCU_READ_LOCK_GUARD();
-
-    rdmain = qatomic_rcu_read(&rioc->rdmain);
-    rdmaout = qatomic_rcu_read(&rioc->rdmain);
-
-    switch (how) {
-    case QIO_CHANNEL_SHUTDOWN_READ:
-        if (rdmain) {
-            rdmain->errored = true;
-        }
-        break;
-    case QIO_CHANNEL_SHUTDOWN_WRITE:
-        if (rdmaout) {
-            rdmaout->errored = true;
-        }
-        break;
-    case QIO_CHANNEL_SHUTDOWN_BOTH:
-    default:
-        if (rdmain) {
-            rdmain->errored = true;
-        }
-        if (rdmaout) {
-            rdmaout->errored = true;
-        }
-        break;
-    }
-
-    return 0;
-}
-
-/*
- * Parameters:
- *    @offset == 0 :
- *        This means that 'block_offset' is a full virtual address that does not
- *        belong to a RAMBlock of the virtual machine and instead
- *        represents a private malloc'd memory area that the caller wishes to
- *        transfer.
- *
- *    @offset != 0 :
- *        Offset is an offset to be added to block_offset and used
- *        to also lookup the corresponding RAMBlock.
- *
- *    @size : Number of bytes to transfer
- *
- *    @pages_sent : User-specificed pointer to indicate how many pages were
- *                  sent. Usually, this will not be more than a few bytes of
- *                  the protocol because most transfers are sent asynchronously.
- */
-static int qemu_rdma_save_page(QEMUFile *f, ram_addr_t block_offset,
-                               ram_addr_t offset, size_t size)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    Error *err = NULL;
-    RDMAContext *rdma;
-    int ret;
-
-    RCU_READ_LOCK_GUARD();
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    qemu_fflush(f);
-
-    /*
-     * Add this page to the current 'chunk'. If the chunk
-     * is full, or the page doesn't belong to the current chunk,
-     * an actual RDMA write will occur and a new chunk will be formed.
-     */
-    ret = qemu_rdma_write(rdma, block_offset, offset, size, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err;
-    }
-
-    /*
-     * Drain the Completion Queue if possible, but do not block,
-     * just poll.
-     *
-     * If nothing to poll, the end of the iteration will do this
-     * again to make sure we don't overflow the request queue.
-     */
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        ret = qemu_rdma_poll(rdma, rdma->recv_cq, &wr_id_in, NULL);
-
-        if (ret < 0) {
-            error_report("rdma migration: polling error");
-            goto err;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-    }
-
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        ret = qemu_rdma_poll(rdma, rdma->send_cq, &wr_id_in, NULL);
-
-        if (ret < 0) {
-            error_report("rdma migration: polling error");
-            goto err;
-        }
-
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
-
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
-    }
-
-    return RAM_SAVE_CONTROL_DELAYED;
-
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size)
-{
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return RAM_SAVE_CONTROL_NOT_SUPP;
-    }
-
-    int ret = qemu_rdma_save_page(f, block_offset, offset, size);
-
-    if (ret != RAM_SAVE_CONTROL_DELAYED &&
-        ret != RAM_SAVE_CONTROL_NOT_SUPP) {
-        if (ret < 0) {
-            qemu_file_set_error(f, ret);
-        }
-    }
-    return ret;
-}
-
-static void rdma_accept_incoming_migration(void *opaque);
-
-static void rdma_cm_poll_handler(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    struct rdma_cm_event *cm_event;
-    MigrationIncomingState *mis = migration_incoming_get_current();
-
-    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-        error_report("get_cm_event failed %d", errno);
-        return;
-    }
-
-    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-        if (!rdma->errored &&
-            migration_incoming_get_current()->state !=
-              MIGRATION_STATUS_COMPLETED) {
-            error_report("receive cm event, cm event is %d", cm_event->event);
-            rdma->errored = true;
-            if (rdma->return_path) {
-                rdma->return_path->errored = true;
-            }
-        }
-        rdma_ack_cm_event(cm_event);
-        if (mis->loadvm_co) {
-            qemu_coroutine_enter(mis->loadvm_co);
-        }
-        return;
-    }
-    rdma_ack_cm_event(cm_event);
-}
-
-static int qemu_rdma_accept(RDMAContext *rdma)
-{
-    Error *err = NULL;
-    RDMACapabilities cap;
-    struct rdma_conn_param conn_param = {
-                                            .responder_resources = 2,
-                                            .private_data = &cap,
-                                            .private_data_len = sizeof(cap),
-                                         };
-    RDMAContext *rdma_return_path = NULL;
-    g_autoptr(InetSocketAddress) isock = g_new0(InetSocketAddress, 1);
-    struct rdma_cm_event *cm_event;
-    struct ibv_context *verbs;
-    int ret;
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    isock->host = g_strdup(rdma->host);
-    isock->port = g_strdup_printf("%d", rdma->port);
-
-    /*
-     * initialize the RDMAContext for return path for postcopy after first
-     * connection request reached.
-     */
-    if ((migrate_postcopy() || migrate_return_path())
-        && !rdma->is_return_path) {
-        rdma_return_path = qemu_rdma_data_init(isock, NULL);
-        if (rdma_return_path == NULL) {
-            rdma_ack_cm_event(cm_event);
-            goto err_rdma_dest_wait;
-        }
-
-        qemu_rdma_return_path_dest_init(rdma_return_path, rdma);
-    }
-
-    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
-
-    network_to_caps(&cap);
-
-    if (cap.version < 1 || cap.version > RDMA_CONTROL_VERSION_CURRENT) {
-        error_report("Unknown source RDMA version: %d, bailing...",
-                     cap.version);
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    /*
-     * Respond with only the capabilities this version of QEMU knows about.
-     */
-    cap.flags &= known_capabilities;
-
-    /*
-     * Enable the ones that we do know about.
-     * Add other checks here as new ones are introduced.
-     */
-    if (cap.flags & RDMA_CAPABILITY_PIN_ALL) {
-        rdma->pin_all = true;
-    }
-
-    rdma->cm_id = cm_event->id;
-    verbs = cm_event->id->verbs;
-
-    rdma_ack_cm_event(cm_event);
-
-    trace_qemu_rdma_accept_pin_state(rdma->pin_all);
-
-    caps_to_network(&cap);
-
-    trace_qemu_rdma_accept_pin_verbsc(verbs);
-
-    if (!rdma->verbs) {
-        rdma->verbs = verbs;
-    } else if (rdma->verbs != verbs) {
-        error_report("ibv context not matching %p, %p!", rdma->verbs,
-                     verbs);
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_dump_id("dest_init", verbs);
-
-    ret = qemu_rdma_alloc_pd_cq(rdma, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err_rdma_dest_wait;
-    }
-
-    ret = qemu_rdma_alloc_qp(rdma);
-    if (ret < 0) {
-        error_report("rdma migration: error allocating qp!");
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_init_ram_blocks(rdma);
-
-    for (int i = 0; i < RDMA_WRID_MAX; i++) {
-        ret = qemu_rdma_reg_control(rdma, i);
-        if (ret < 0) {
-            error_report("rdma: error registering %d control", i);
-            goto err_rdma_dest_wait;
-        }
-    }
-
-    /* Accept the second connection request for return path */
-    if ((migrate_postcopy() || migrate_return_path())
-        && !rdma->is_return_path) {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_accept_incoming_migration,
-                            NULL,
-                            (void *)(intptr_t)rdma->return_path);
-    } else {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
-                            NULL, rdma);
-    }
-
-    ret = rdma_accept(rdma->cm_id, &conn_param);
-    if (ret < 0) {
-        error_report("rdma_accept failed");
-        goto err_rdma_dest_wait;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret < 0) {
-        error_report("rdma_accept get_cm_event failed");
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        error_report("rdma_accept not event established");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    rdma_ack_cm_event(cm_event);
-    rdma->connected = true;
-
-    ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY, &err);
-    if (ret < 0) {
-        error_report_err(err);
-        goto err_rdma_dest_wait;
-    }
-
-    qemu_rdma_dump_gid("dest_connect", rdma->cm_id);
-
-    return 0;
-
-err_rdma_dest_wait:
-    rdma->errored = true;
-    qemu_rdma_cleanup(rdma);
-    g_free(rdma_return_path);
-    return -1;
-}
-
-static int dest_ram_sort_func(const void *a, const void *b)
-{
-    unsigned int a_index = ((const RDMALocalBlock *)a)->src_index;
-    unsigned int b_index = ((const RDMALocalBlock *)b)->src_index;
-
-    return (a_index < b_index) ? -1 : (a_index != b_index);
-}
-
-/*
- * During each iteration of the migration, we listen for instructions
- * by the source VM to perform dynamic page registrations before they
- * can perform RDMA operations.
- *
- * We respond with the 'rkey'.
- *
- * Keep doing this until the source tells us to stop.
- */
-int rdma_registration_handle(QEMUFile *f)
-{
-    RDMAControlHeader reg_resp = { .len = sizeof(RDMARegisterResult),
-                               .type = RDMA_CONTROL_REGISTER_RESULT,
-                               .repeat = 0,
-                             };
-    RDMAControlHeader unreg_resp = { .len = 0,
-                               .type = RDMA_CONTROL_UNREGISTER_FINISHED,
-                               .repeat = 0,
-                             };
-    RDMAControlHeader blocks = { .type = RDMA_CONTROL_RAM_BLOCKS_RESULT,
-                                 .repeat = 1 };
-    QIOChannelRDMA *rioc;
-    Error *err = NULL;
-    RDMAContext *rdma;
-    RDMALocalBlocks *local;
-    RDMAControlHeader head;
-    RDMARegister *reg, *registers;
-    RDMACompress *comp;
-    RDMARegisterResult *reg_result;
-    static RDMARegisterResult results[RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE];
-    RDMALocalBlock *block;
-    void *host_addr;
-    int ret;
-    int idx = 0;
-
-    if (!migrate_rdma()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    local = &rdma->local_ram_blocks;
-    do {
-        trace_rdma_registration_handle_wait();
-
-        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_NONE, &err);
-
-        if (ret < 0) {
-            error_report_err(err);
-            break;
-        }
-
-        if (head.repeat > RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE) {
-            error_report("rdma: Too many requests in this message (%d)."
-                            "Bailing.", head.repeat);
-            break;
-        }
-
-        switch (head.type) {
-        case RDMA_CONTROL_COMPRESS:
-            comp = (RDMACompress *) rdma->wr_data[idx].control_curr;
-            network_to_compress(comp);
-
-            trace_rdma_registration_handle_compress(comp->length,
-                                                    comp->block_idx,
-                                                    comp->offset);
-            if (comp->block_idx >= rdma->local_ram_blocks.nb_blocks) {
-                error_report("rdma: 'compress' bad block index %u (vs %d)",
-                             (unsigned int)comp->block_idx,
-                             rdma->local_ram_blocks.nb_blocks);
-                goto err;
-            }
-            block = &(rdma->local_ram_blocks.block[comp->block_idx]);
-
-            host_addr = block->local_host_addr +
-                            (comp->offset - block->offset);
-            if (comp->value) {
-                error_report("rdma: Zero page with non-zero (%d) value",
-                             comp->value);
-                goto err;
-            }
-            ram_handle_zero(host_addr, comp->length);
-            break;
-
-        case RDMA_CONTROL_REGISTER_FINISHED:
-            trace_rdma_registration_handle_finished();
-            return 0;
-
-        case RDMA_CONTROL_RAM_BLOCKS_REQUEST:
-            trace_rdma_registration_handle_ram_blocks();
-
-            /* Sort our local RAM Block list so it's the same as the source,
-             * we can do this since we've filled in a src_index in the list
-             * as we received the RAMBlock list earlier.
-             */
-            qsort(rdma->local_ram_blocks.block,
-                  rdma->local_ram_blocks.nb_blocks,
-                  sizeof(RDMALocalBlock), dest_ram_sort_func);
-            for (int i = 0; i < local->nb_blocks; i++) {
-                local->block[i].index = i;
-            }
-
-            if (rdma->pin_all) {
-                ret = qemu_rdma_reg_whole_ram_blocks(rdma, &err);
-                if (ret < 0) {
-                    error_report_err(err);
-                    goto err;
-                }
-            }
-
-            /*
-             * Dest uses this to prepare to transmit the RAMBlock descriptions
-             * to the source VM after connection setup.
-             * Both sides use the "remote" structure to communicate and update
-             * their "local" descriptions with what was sent.
-             */
-            for (int i = 0; i < local->nb_blocks; i++) {
-                rdma->dest_blocks[i].remote_host_addr =
-                    (uintptr_t)(local->block[i].local_host_addr);
-
-                if (rdma->pin_all) {
-                    rdma->dest_blocks[i].remote_rkey = local->block[i].mr->rkey;
-                }
-
-                rdma->dest_blocks[i].offset = local->block[i].offset;
-                rdma->dest_blocks[i].length = local->block[i].length;
-
-                dest_block_to_network(&rdma->dest_blocks[i]);
-                trace_rdma_registration_handle_ram_blocks_loop(
-                    local->block[i].block_name,
-                    local->block[i].offset,
-                    local->block[i].length,
-                    local->block[i].local_host_addr,
-                    local->block[i].src_index);
-            }
-
-            blocks.len = rdma->local_ram_blocks.nb_blocks
-                                                * sizeof(RDMADestBlock);
-
-
-            ret = qemu_rdma_post_send_control(rdma,
-                                    (uint8_t *) rdma->dest_blocks, &blocks,
-                                    &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-
-            break;
-        case RDMA_CONTROL_REGISTER_REQUEST:
-            trace_rdma_registration_handle_register(head.repeat);
-
-            reg_resp.repeat = head.repeat;
-            registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
-
-            for (int count = 0; count < head.repeat; count++) {
-                uint64_t chunk;
-                uint8_t *chunk_start, *chunk_end;
-
-                reg = &registers[count];
-                network_to_register(reg);
-
-                reg_result = &results[count];
-
-                trace_rdma_registration_handle_register_loop(count,
-                         reg->current_index, reg->key.current_addr, reg->chunks);
-
-                if (reg->current_index >= rdma->local_ram_blocks.nb_blocks) {
-                    error_report("rdma: 'register' bad block index %u (vs %d)",
-                                 (unsigned int)reg->current_index,
-                                 rdma->local_ram_blocks.nb_blocks);
-                    goto err;
-                }
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-                if (block->is_ram_block) {
-                    if (block->offset > reg->key.current_addr) {
-                        error_report("rdma: bad register address for block %s"
-                            " offset: %" PRIx64 " current_addr: %" PRIx64,
-                            block->block_name, block->offset,
-                            reg->key.current_addr);
-                        goto err;
-                    }
-                    host_addr = (block->local_host_addr +
-                                (reg->key.current_addr - block->offset));
-                    chunk = ram_chunk_index(block->local_host_addr,
-                                            (uint8_t *) host_addr);
-                } else {
-                    chunk = reg->key.chunk;
-                    host_addr = block->local_host_addr +
-                        (reg->key.chunk * (1UL << RDMA_REG_CHUNK_SHIFT));
-                    /* Check for particularly bad chunk value */
-                    if (host_addr < (void *)block->local_host_addr) {
-                        error_report("rdma: bad chunk for block %s"
-                            " chunk: %" PRIx64,
-                            block->block_name, reg->key.chunk);
-                        goto err;
-                    }
-                }
-                chunk_start = ram_chunk_start(block, chunk);
-                chunk_end = ram_chunk_end(block, chunk + reg->chunks);
-                /* avoid "-Waddress-of-packed-member" warning */
-                uint32_t tmp_rkey = 0;
-                if (qemu_rdma_register_and_get_keys(rdma, block,
-                            (uintptr_t)host_addr, NULL, &tmp_rkey,
-                            chunk, chunk_start, chunk_end)) {
-                    error_report("cannot get rkey");
-                    goto err;
-                }
-                reg_result->rkey = tmp_rkey;
-
-                reg_result->host_addr = (uintptr_t)block->local_host_addr;
-
-                trace_rdma_registration_handle_register_rkey(reg_result->rkey);
-
-                result_to_network(reg_result);
-            }
-
-            ret = qemu_rdma_post_send_control(rdma,
-                            (uint8_t *) results, &reg_resp, &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-            break;
-        case RDMA_CONTROL_UNREGISTER_REQUEST:
-            trace_rdma_registration_handle_unregister(head.repeat);
-            unreg_resp.repeat = head.repeat;
-            registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
-
-            for (int count = 0; count < head.repeat; count++) {
-                reg = &registers[count];
-                network_to_register(reg);
-
-                trace_rdma_registration_handle_unregister_loop(count,
-                           reg->current_index, reg->key.chunk);
-
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-
-                ret = ibv_dereg_mr(block->pmr[reg->key.chunk]);
-                block->pmr[reg->key.chunk] = NULL;
-
-                if (ret != 0) {
-                    error_report("rdma unregistration chunk failed: %s",
-                                 strerror(errno));
-                    goto err;
-                }
-
-                rdma->total_registrations--;
-
-                trace_rdma_registration_handle_unregister_success(reg->key.chunk);
-            }
-
-            ret = qemu_rdma_post_send_control(rdma, NULL, &unreg_resp, &err);
-
-            if (ret < 0) {
-                error_report_err(err);
-                goto err;
-            }
-            break;
-        case RDMA_CONTROL_REGISTER_RESULT:
-            error_report("Invalid RESULT message at dest.");
-            goto err;
-        default:
-            error_report("Unknown control message %s", control_desc(head.type));
-            goto err;
-        }
-    } while (1);
-
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-/* Destination:
- * Called during the initial RAM load section which lists the
- * RAMBlocks by name.  This lets us know the order of the RAMBlocks on
- * the source.  We've already built our local RAMBlock list, but not
- * yet sent the list to the source.
- */
-int rdma_block_notification_handle(QEMUFile *f, const char *name)
-{
-    int curr;
-    int found = -1;
-
-    if (!migrate_rdma()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    RDMAContext *rdma = qatomic_rcu_read(&rioc->rdmain);
-
-    if (!rdma) {
-        return -1;
-    }
-
-    /* Find the matching RAMBlock in our local list */
-    for (curr = 0; curr < rdma->local_ram_blocks.nb_blocks; curr++) {
-        if (!strcmp(rdma->local_ram_blocks.block[curr].block_name, name)) {
-            found = curr;
-            break;
-        }
-    }
-
-    if (found == -1) {
-        error_report("RAMBlock '%s' not found on destination", name);
-        return -1;
-    }
-
-    rdma->local_ram_blocks.block[curr].src_index = rdma->next_src_index;
-    trace_rdma_block_notification_handle(name, rdma->next_src_index);
-    rdma->next_src_index++;
-
-    return 0;
-}
-
-int rdma_registration_start(QEMUFile *f, uint64_t flags)
-{
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return 0;
-    }
-
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    RCU_READ_LOCK_GUARD();
-    RDMAContext *rdma = qatomic_rcu_read(&rioc->rdmaout);
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    trace_rdma_registration_start(flags);
-    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
-    return qemu_fflush(f);
-}
-
-/*
- * Inform dest that dynamic registrations are done for now.
- * First, flush writes, if any.
- */
-int rdma_registration_stop(QEMUFile *f, uint64_t flags)
-{
-    QIOChannelRDMA *rioc;
-    Error *err = NULL;
-    RDMAContext *rdma;
-    RDMAControlHeader head = { .len = 0, .repeat = 1 };
-    int ret;
-
-    if (!migrate_rdma() || migration_in_postcopy()) {
-        return 0;
-    }
-
-    RCU_READ_LOCK_GUARD();
-    rioc = QIO_CHANNEL_RDMA(qemu_file_get_ioc(f));
-    rdma = qatomic_rcu_read(&rioc->rdmaout);
-    if (!rdma) {
-        return -1;
-    }
-
-    if (rdma_errored(rdma)) {
-        return -1;
-    }
-
-    qemu_fflush(f);
-    ret = qemu_rdma_drain_cq(rdma);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    if (flags == RAM_CONTROL_SETUP) {
-        RDMAControlHeader resp = {.type = RDMA_CONTROL_RAM_BLOCKS_RESULT };
-        RDMALocalBlocks *local = &rdma->local_ram_blocks;
-        int reg_result_idx, nb_dest_blocks;
-
-        head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
-        trace_rdma_registration_stop_ram();
-
-        /*
-         * Make sure that we parallelize the pinning on both sides.
-         * For very large guests, doing this serially takes a really
-         * long time, so we have to 'interleave' the pinning locally
-         * with the control messages by performing the pinning on this
-         * side before we receive the control response from the other
-         * side that the pinning has completed.
-         */
-        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp,
-                    &reg_result_idx, rdma->pin_all ?
-                    qemu_rdma_reg_whole_ram_blocks : NULL,
-                    &err);
-        if (ret < 0) {
-            error_report_err(err);
-            return -1;
-        }
-
-        nb_dest_blocks = resp.len / sizeof(RDMADestBlock);
-
-        /*
-         * The protocol uses two different sets of rkeys (mutually exclusive):
-         * 1. One key to represent the virtual address of the entire ram block.
-         *    (dynamic chunk registration disabled - pin everything with one rkey.)
-         * 2. One to represent individual chunks within a ram block.
-         *    (dynamic chunk registration enabled - pin individual chunks.)
-         *
-         * Once the capability is successfully negotiated, the destination transmits
-         * the keys to use (or sends them later) including the virtual addresses
-         * and then propagates the remote ram block descriptions to his local copy.
-         */
-
-        if (local->nb_blocks != nb_dest_blocks) {
-            error_report("ram blocks mismatch (Number of blocks %d vs %d)",
-                         local->nb_blocks, nb_dest_blocks);
-            error_printf("Your QEMU command line parameters are probably "
-                         "not identical on both the source and destination.");
-            rdma->errored = true;
-            return -1;
-        }
-
-        qemu_rdma_move_header(rdma, reg_result_idx, &resp);
-        memcpy(rdma->dest_blocks,
-            rdma->wr_data[reg_result_idx].control_curr, resp.len);
-        for (int i = 0; i < nb_dest_blocks; i++) {
-            network_to_dest_block(&rdma->dest_blocks[i]);
-
-            /* We require that the blocks are in the same order */
-            if (rdma->dest_blocks[i].length != local->block[i].length) {
-                error_report("Block %s/%d has a different length %" PRIu64
-                             "vs %" PRIu64,
-                             local->block[i].block_name, i,
-                             local->block[i].length,
-                             rdma->dest_blocks[i].length);
-                rdma->errored = true;
-                return -1;
-            }
-            local->block[i].remote_host_addr =
-                    rdma->dest_blocks[i].remote_host_addr;
-            local->block[i].remote_rkey = rdma->dest_blocks[i].remote_rkey;
-        }
-    }
-
-    trace_rdma_registration_stop(flags);
-
-    head.type = RDMA_CONTROL_REGISTER_FINISHED;
-    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL, NULL, &err);
-
-    if (ret < 0) {
-        error_report_err(err);
-        goto err;
-    }
-
-    return 0;
-err:
-    rdma->errored = true;
-    return -1;
-}
-
-static void qio_channel_rdma_finalize(Object *obj)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(obj);
-    if (rioc->rdmain) {
-        qemu_rdma_cleanup(rioc->rdmain);
-        g_free(rioc->rdmain);
-        rioc->rdmain = NULL;
-    }
-    if (rioc->rdmaout) {
-        qemu_rdma_cleanup(rioc->rdmaout);
-        g_free(rioc->rdmaout);
-        rioc->rdmaout = NULL;
-    }
-}
-
-static void qio_channel_rdma_class_init(ObjectClass *klass,
-                                        void *class_data G_GNUC_UNUSED)
-{
-    QIOChannelClass *ioc_klass = QIO_CHANNEL_CLASS(klass);
-
-    ioc_klass->io_writev = qio_channel_rdma_writev;
-    ioc_klass->io_readv = qio_channel_rdma_readv;
-    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
-    ioc_klass->io_close = qio_channel_rdma_close;
-    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
-    ioc_klass->io_set_aio_fd_handler = qio_channel_rdma_set_aio_fd_handler;
-    ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
-}
-
-static const TypeInfo qio_channel_rdma_info = {
-    .parent = TYPE_QIO_CHANNEL,
-    .name = TYPE_QIO_CHANNEL_RDMA,
-    .instance_size = sizeof(QIOChannelRDMA),
-    .instance_finalize = qio_channel_rdma_finalize,
-    .class_init = qio_channel_rdma_class_init,
-};
-
-static void qio_channel_rdma_register_types(void)
-{
-    type_register_static(&qio_channel_rdma_info);
-}
-
-type_init(qio_channel_rdma_register_types);
-
-static QEMUFile *rdma_new_input(RDMAContext *rdma)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
-
-    rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
-    rioc->rdmain = rdma;
-    rioc->rdmaout = rdma->return_path;
-
-    return rioc->file;
-}
-
-static QEMUFile *rdma_new_output(RDMAContext *rdma)
-{
-    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
-
-    rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
-    rioc->rdmaout = rdma;
-    rioc->rdmain = rdma->return_path;
-
-    return rioc->file;
-}
-
-static void rdma_accept_incoming_migration(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    QEMUFile *f;
-
-    trace_qemu_rdma_accept_incoming_migration();
-    if (qemu_rdma_accept(rdma) < 0) {
-        error_report("RDMA ERROR: Migration initialization failed");
-        return;
-    }
-
-    trace_qemu_rdma_accept_incoming_migration_accepted();
-
-    if (rdma->is_return_path) {
-        return;
-    }
-
-    f = rdma_new_input(rdma);
-    if (f == NULL) {
-        error_report("RDMA ERROR: could not open RDMA for input");
-        qemu_rdma_cleanup(rdma);
-        return;
-    }
-
-    rdma->migration_started_on_destination = 1;
-    migration_fd_process_incoming(f);
-}
-
-void rdma_start_incoming_migration(InetSocketAddress *host_port,
-                                   Error **errp)
-{
-    MigrationState *s = migrate_get_current();
-    int ret;
-    RDMAContext *rdma;
-
-    trace_rdma_start_incoming_migration();
-
-    /* Avoid ram_block_discard_disable(), cannot change during migration. */
-    if (ram_block_discard_is_required()) {
-        error_setg(errp, "RDMA: cannot disable RAM discard");
-        return;
-    }
-
-    rdma = qemu_rdma_data_init(host_port, errp);
-    if (rdma == NULL) {
-        goto err;
-    }
-
-    ret = qemu_rdma_dest_init(rdma, errp);
-    if (ret < 0) {
-        goto err;
-    }
-
-    trace_rdma_start_incoming_migration_after_dest_init();
-
-    ret = rdma_listen(rdma->listen_id, 5);
-
-    if (ret < 0) {
-        error_setg(errp, "RDMA ERROR: listening on socket!");
-        goto cleanup_rdma;
-    }
-
-    trace_rdma_start_incoming_migration_after_rdma_listen();
-    s->rdma_migration = true;
-    qemu_set_fd_handler(rdma->channel->fd, rdma_accept_incoming_migration,
-                        NULL, (void *)(intptr_t)rdma);
-    return;
-
-cleanup_rdma:
-    qemu_rdma_cleanup(rdma);
-err:
-    if (rdma) {
-        g_free(rdma->host);
-    }
-    g_free(rdma);
-}
-
-void rdma_start_outgoing_migration(void *opaque,
-                            InetSocketAddress *host_port, Error **errp)
-{
-    MigrationState *s = opaque;
-    RDMAContext *rdma_return_path = NULL;
-    RDMAContext *rdma;
-    int ret;
-
-    /* Avoid ram_block_discard_disable(), cannot change during migration. */
-    if (ram_block_discard_is_required()) {
-        error_setg(errp, "RDMA: cannot disable RAM discard");
-        return;
-    }
-
-    rdma = qemu_rdma_data_init(host_port, errp);
-    if (rdma == NULL) {
-        goto err;
-    }
-
-    ret = qemu_rdma_source_init(rdma, migrate_rdma_pin_all(), errp);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    trace_rdma_start_outgoing_migration_after_rdma_source_init();
-    ret = qemu_rdma_connect(rdma, false, errp);
-
-    if (ret < 0) {
-        goto err;
-    }
-
-    /* RDMA postcopy need a separate queue pair for return path */
-    if (migrate_postcopy() || migrate_return_path()) {
-        rdma_return_path = qemu_rdma_data_init(host_port, errp);
-
-        if (rdma_return_path == NULL) {
-            goto return_path_err;
-        }
-
-        ret = qemu_rdma_source_init(rdma_return_path,
-                                    migrate_rdma_pin_all(), errp);
-
-        if (ret < 0) {
-            goto return_path_err;
-        }
-
-        ret = qemu_rdma_connect(rdma_return_path, true, errp);
-
-        if (ret < 0) {
-            goto return_path_err;
-        }
-
-        rdma->return_path = rdma_return_path;
-        rdma_return_path->return_path = rdma;
-        rdma_return_path->is_return_path = true;
-    }
-
-    trace_rdma_start_outgoing_migration_after_rdma_connect();
-
-    s->to_dst_file = rdma_new_output(rdma);
-    s->rdma_migration = true;
-    migrate_fd_connect(s, NULL);
-    return;
-return_path_err:
-    qemu_rdma_cleanup(rdma);
-err:
-    g_free(rdma);
-    g_free(rdma_return_path);
-}
diff --git a/migration/rdma.h b/migration/rdma.h
deleted file mode 100644
index a8d27f33b8..0000000000
--- a/migration/rdma.h
+++ /dev/null
@@ -1,69 +0,0 @@
-/*
- * RDMA protocol and interfaces
- *
- * Copyright IBM, Corp. 2010-2013
- * Copyright Red Hat, Inc. 2015-2016
- *
- * Authors:
- *  Michael R. Hines <mrhines@us.ibm.com>
- *  Jiuxing Liu <jl@us.ibm.com>
- *  Daniel P. Berrange <berrange@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or
- * later.  See the COPYING file in the top-level directory.
- *
- */
-
-#include "qemu/sockets.h"
-
-#ifndef QEMU_MIGRATION_RDMA_H
-#define QEMU_MIGRATION_RDMA_H
-
-#include "exec/memory.h"
-
-void rdma_start_outgoing_migration(void *opaque, InetSocketAddress *host_port,
-                                   Error **errp);
-
-void rdma_start_incoming_migration(InetSocketAddress *host_port, Error **errp);
-
-/*
- * Constants used by rdma return codes
- */
-#define RAM_CONTROL_SETUP     0
-#define RAM_CONTROL_ROUND     1
-#define RAM_CONTROL_FINISH    3
-
-/*
- * Whenever this is found in the data stream, the flags
- * will be passed to rdma functions in the incoming-migration
- * side.
- */
-#define RAM_SAVE_FLAG_HOOK     0x80
-
-#define RAM_SAVE_CONTROL_NOT_SUPP -1000
-#define RAM_SAVE_CONTROL_DELAYED  -2000
-
-#ifdef CONFIG_RDMA
-int rdma_registration_handle(QEMUFile *f);
-int rdma_registration_start(QEMUFile *f, uint64_t flags);
-int rdma_registration_stop(QEMUFile *f, uint64_t flags);
-int rdma_block_notification_handle(QEMUFile *f, const char *name);
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size);
-#else
-static inline
-int rdma_registration_handle(QEMUFile *f) { return 0; }
-static inline
-int rdma_registration_start(QEMUFile *f, uint64_t flags) { return 0; }
-static inline
-int rdma_registration_stop(QEMUFile *f, uint64_t flags) { return 0; }
-static inline
-int rdma_block_notification_handle(QEMUFile *f, const char *name) { return 0; }
-static inline
-int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                           ram_addr_t offset, size_t size)
-{
-    return RAM_SAVE_CONTROL_NOT_SUPP;
-}
-#endif
-#endif
diff --git a/migration/savevm.c b/migration/savevm.c
index c621f2359b..3941d65693 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2996,7 +2996,7 @@ int qemu_loadvm_state(QEMUFile *f)
 
     /* We've got to be careful; if we don't read the data and just shut the fd
      * then the sender can error if we close while it's still sending.
-     * We also mustn't read data that isn't there; some transports (RDMA)
+     * We also mustn't read data that isn't there; some transports
      * will stall waiting for that data when the source has already closed.
      */
     if (ret == 0 && should_send_vmdesc()) {
diff --git a/migration/trace-events b/migration/trace-events
index 0b7c3324fb..72e0517f09 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -193,7 +193,7 @@ process_incoming_migration_co_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 
 # migration-stats
-migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd, uint64_t rdma) "qemu_file %" PRIu64 " multifd %" PRIu64 " RDMA %" PRIu64
+migration_transferred_bytes(uint64_t qemu_file, uint64_t multifd) "qemu_file %" PRIu64 " multifd %" PRIu64
 
 # channel.c
 migration_set_incoming_channel(void *ioc, const char *ioctype) "ioc=%p ioctype=%s"
@@ -204,72 +204,6 @@ migrate_state_too_big(void) ""
 migrate_global_state_post_load(const char *state) "loaded state: %s"
 migrate_global_state_pre_save(const char *state) "saved state: %s"
 
-# rdma.c
-qemu_rdma_accept_incoming_migration(void) ""
-qemu_rdma_accept_incoming_migration_accepted(void) ""
-qemu_rdma_accept_pin_state(bool pin) "%d"
-qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context after listen: %p"
-qemu_rdma_block_for_wrid_miss(uint64_t wcomp, uint64_t req) "A Wanted wrid %" PRIu64 " but got %" PRIu64
-qemu_rdma_cleanup_disconnect(void) ""
-qemu_rdma_close(void) ""
-qemu_rdma_connect_pin_all_requested(void) ""
-qemu_rdma_connect_pin_all_outcome(bool pin) "%d"
-qemu_rdma_dest_init_trying(const char *host, const char *ip) "%s => %s"
-qemu_rdma_dump_id_failed(const char *who) "%s RDMA Device opened, but can't query port information"
-qemu_rdma_dump_id(const char *who, const char *name, const char *dev_name, const char *dev_path, const char *ibdev_path, int transport, const char *transport_name) "%s RDMA Device opened: kernel name %s uverbs device name %s, infiniband_verbs class device path %s, infiniband class device path %s, transport: (%d) %s"
-qemu_rdma_dump_gid(const char *who, const char *src, const char *dst) "%s Source GID: %s, Dest GID: %s"
-qemu_rdma_exchange_get_response_start(const char *desc) "CONTROL: %s receiving..."
-qemu_rdma_exchange_get_response_none(const char *desc, int type) "Surprise: got %s (%d)"
-qemu_rdma_exchange_send_issue_callback(void) ""
-qemu_rdma_exchange_send_waiting(const char *desc) "Waiting for response %s"
-qemu_rdma_exchange_send_received(const char *desc) "Response %s received."
-qemu_rdma_fill(size_t control_len, size_t size) "RDMA %zd of %zd bytes already in buffer"
-qemu_rdma_init_ram_blocks(int blocks) "Allocated %d local ram block structures"
-qemu_rdma_poll_recv(uint64_t comp, int64_t id, int sent) "completion %" PRIu64 " received (%" PRId64 ") left %d"
-qemu_rdma_poll_write(uint64_t comp, int left, uint64_t block, uint64_t chunk, void *local, void *remote) "completions %" PRIu64 " left %d, block %" PRIu64 ", chunk: %" PRIu64 " %p %p"
-qemu_rdma_poll_other(uint64_t comp, int left) "other completion %" PRIu64 " received left %d"
-qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
-qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" PRIu64 " bytes @ %p"
-qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand Paging memory region: %s"
-qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const char *res) "Try to advise block %s prefetch at %" PRIu32 "@0x%" PRIx64 ": %s"
-qemu_rdma_resolve_host_trying(const char *host, const char *ip) "Trying %s => %s"
-qemu_rdma_signal_unregister_append(uint64_t chunk, int pos) "Appending unregister chunk %" PRIu64 " at position %d"
-qemu_rdma_signal_unregister_already(uint64_t chunk) "Unregister chunk %" PRIu64 " already in queue"
-qemu_rdma_unregister_waiting_inflight(uint64_t chunk) "Cannot unregister inflight chunk: %" PRIu64
-qemu_rdma_unregister_waiting_proc(uint64_t chunk, int pos) "Processing unregister for chunk: %" PRIu64 " at position %d"
-qemu_rdma_unregister_waiting_send(uint64_t chunk) "Sending unregister for chunk: %" PRIu64
-qemu_rdma_unregister_waiting_complete(uint64_t chunk) "Unregister for chunk: %" PRIu64 " complete."
-qemu_rdma_write_flush(int sent) "sent total: %d"
-qemu_rdma_write_one_block(int count, int block, uint64_t chunk, uint64_t current, uint64_t len, int nb_sent, int nb_chunks) "(%d) Not clobbering: block: %d chunk %" PRIu64 " current %" PRIu64 " len %" PRIu64 " %d %d"
-qemu_rdma_write_one_post(uint64_t chunk, long addr, long remote, uint32_t len) "Posting chunk: %" PRIu64 ", addr: 0x%lx remote: 0x%lx, bytes %" PRIu32
-qemu_rdma_write_one_queue_full(void) ""
-qemu_rdma_write_one_recvregres(int mykey, int theirkey, uint64_t chunk) "Received registration result: my key: 0x%x their key 0x%x, chunk %" PRIu64
-qemu_rdma_write_one_sendreg(uint64_t chunk, int len, int index, int64_t offset) "Sending registration request chunk %" PRIu64 " for %d bytes, index: %d, offset: %" PRId64
-qemu_rdma_write_one_top(uint64_t chunks, uint64_t size) "Writing %" PRIu64 " chunks, (%" PRIu64 " MB)"
-qemu_rdma_write_one_zero(uint64_t chunk, int len, int index, int64_t offset) "Entire chunk is zero, sending compress: %" PRIu64 " for %d bytes, index: %d, offset: %" PRId64
-rdma_add_block(const char *block_name, int block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Added Block: '%s':%d, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-rdma_block_notification_handle(const char *name, int index) "%s at %d"
-rdma_delete_block(void *block, uint64_t addr, uint64_t offset, uint64_t len, uint64_t end, uint64_t bits, int chunks) "Deleted Block: %p, addr: %" PRIu64 ", offset: %" PRIu64 " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d"
-rdma_registration_handle_compress(int64_t length, int index, int64_t offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
-rdma_registration_handle_finished(void) ""
-rdma_registration_handle_ram_blocks(void) ""
-rdma_registration_handle_ram_blocks_loop(const char *name, uint64_t offset, uint64_t length, void *local_host_addr, unsigned int src_index) "%s: @0x%" PRIx64 "/%" PRIu64 " host:@%p src_index: %u"
-rdma_registration_handle_register(int requests) "%d requests"
-rdma_registration_handle_register_loop(int req, int index, uint64_t addr, uint64_t chunks) "Registration request (%d): index %d, current_addr %" PRIu64 " chunks: %" PRIu64
-rdma_registration_handle_register_rkey(int rkey) "0x%x"
-rdma_registration_handle_unregister(int requests) "%d requests"
-rdma_registration_handle_unregister_loop(int count, int index, uint64_t chunk) "Unregistration request (%d): index %d, chunk %" PRIu64
-rdma_registration_handle_unregister_success(uint64_t chunk) "%" PRIu64
-rdma_registration_handle_wait(void) ""
-rdma_registration_start(uint64_t flags) "%" PRIu64
-rdma_registration_stop(uint64_t flags) "%" PRIu64
-rdma_registration_stop_ram(void) ""
-rdma_start_incoming_migration(void) ""
-rdma_start_incoming_migration_after_dest_init(void) ""
-rdma_start_incoming_migration_after_rdma_listen(void) ""
-rdma_start_outgoing_migration_after_rdma_connect(void) ""
-rdma_start_outgoing_migration_after_rdma_source_init(void) ""
-
 # postcopy-ram.c
 postcopy_discard_send_finish(const char *ramblock, int nwords, int ncmds) "%s mask words sent=%d in %d commands"
 postcopy_discard_send_range(const char *ramblock, unsigned long start, unsigned long length) "%s:%lx/%lx"
diff --git a/qapi/migration.json b/qapi/migration.json
index a351fd3714..4d7d49bfec 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -210,9 +210,9 @@
 #
 # @setup-time: amount of setup time in milliseconds *before* the
 #     iterations begin but *after* the QMP command is issued.  This is
-#     designed to provide an accounting of any activities (such as
-#     RDMA pinning) which may be expensive, but do not actually occur
-#     during the iterative migration rounds themselves.  (since 1.6)
+#     designed to provide an accounting of any activities which may be
+#     expensive, but do not actually occur during the iterative migration
+#     rounds themselves.  (since 1.6)
 #
 # @cpu-throttle-percentage: percentage of time guest cpus are being
 #     throttled during auto-converge.  This is only present when
@@ -378,10 +378,6 @@
 #     for certain work loads, by sending compressed difference of the
 #     pages
 #
-# @rdma-pin-all: Controls whether or not the entire VM memory
-#     footprint is mlock()'d on demand or all at once.  Refer to
-#     docs/rdma.txt for usage.  Disabled by default.  (since 2.0)
-#
 # @zero-blocks: During storage migration encode blocks of zeroes
 #     efficiently.  This essentially saves 1MB of zeroes per block on
 #     the wire.  Enabling requires source and target VM to support
@@ -476,7 +472,7 @@
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
+  'data': ['xbzrle', 'auto-converge', 'zero-blocks',
            'events', 'postcopy-ram',
            { 'name': 'x-colo', 'features': [ 'unstable' ] },
            'release-ram',
@@ -533,7 +529,6 @@
 #     -> { "execute": "query-migrate-capabilities" }
 #     <- { "return": [
 #           {"state": false, "capability": "xbzrle"},
-#           {"state": false, "capability": "rdma-pin-all"},
 #           {"state": false, "capability": "auto-converge"},
 #           {"state": false, "capability": "zero-blocks"},
 #           {"state": true, "capability": "events"},
diff --git a/scripts/analyze-migration.py b/scripts/analyze-migration.py
index 8a254a5b6a..70e77622d3 100755
--- a/scripts/analyze-migration.py
+++ b/scripts/analyze-migration.py
@@ -110,7 +110,6 @@ class RamSection(object):
     RAM_SAVE_FLAG_EOS      = 0x10
     RAM_SAVE_FLAG_CONTINUE = 0x20
     RAM_SAVE_FLAG_XBZRLE   = 0x40
-    RAM_SAVE_FLAG_HOOK     = 0x80
     RAM_SAVE_FLAG_COMPRESS_PAGE = 0x100
     RAM_SAVE_FLAG_MULTIFD_FLUSH = 0x200
 
@@ -203,8 +202,6 @@ def read(self):
                 flags &= ~self.RAM_SAVE_FLAG_PAGE
             elif flags & self.RAM_SAVE_FLAG_XBZRLE:
                 raise Exception("XBZRLE RAM compression is not supported yet")
-            elif flags & self.RAM_SAVE_FLAG_HOOK:
-                raise Exception("RAM hooks don't make sense with files")
             if flags & self.RAM_SAVE_FLAG_MULTIFD_FLUSH:
                 continue
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 2/6] io: add QIOChannelRDMA class
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
  2024-06-04 12:14 ` [PATCH 1/6] migration: remove RDMA live migration temporarily Gonglei via
@ 2024-06-04 12:14 ` Gonglei via
  2024-06-10  6:54   ` Jinpu Wang
  2024-06-04 12:14 ` [PATCH 3/6] io/channel-rdma: support working in coroutine Gonglei via
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

Implement a QIOChannelRDMA subclass that is based on the rsocket
API (similar to socket API).

Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 include/io/channel-rdma.h | 152 +++++++++++++
 io/channel-rdma.c         | 437 ++++++++++++++++++++++++++++++++++++++
 io/meson.build            |   1 +
 io/trace-events           |  14 ++
 4 files changed, 604 insertions(+)
 create mode 100644 include/io/channel-rdma.h
 create mode 100644 io/channel-rdma.c

diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
new file mode 100644
index 0000000000..8cab2459e5
--- /dev/null
+++ b/include/io/channel-rdma.h
@@ -0,0 +1,152 @@
+/*
+ * QEMU I/O channels RDMA driver
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * Authors:
+ *  Jialin Wang <wangjialin23@huawei.com>
+ *  Gonglei <arei.gonglei@huawei.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef QIO_CHANNEL_RDMA_H
+#define QIO_CHANNEL_RDMA_H
+
+#include "io/channel.h"
+#include "io/task.h"
+#include "qemu/sockets.h"
+#include "qom/object.h"
+
+#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
+OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
+
+/**
+ * QIOChannelRDMA:
+ *
+ * The QIOChannelRDMA object provides a channel implementation
+ * that discards all writes and returns EOF for all reads.
+ */
+struct QIOChannelRDMA {
+    QIOChannel parent;
+    /* the rsocket fd */
+    int fd;
+
+    struct sockaddr_storage localAddr;
+    socklen_t localAddrLen;
+    struct sockaddr_storage remoteAddr;
+    socklen_t remoteAddrLen;
+};
+
+/**
+ * qio_channel_rdma_new:
+ *
+ * Create a channel for performing I/O on a rdma
+ * connection, that is initially closed. After
+ * creating the rdma, it must be setup as a client
+ * connection or server.
+ *
+ * Returns: the rdma channel object
+ */
+QIOChannelRDMA *qio_channel_rdma_new(void);
+
+/**
+ * qio_channel_rdma_connect_sync:
+ * @ioc: the rdma channel object
+ * @addr: the address to connect to
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to connect to the address @addr. This method
+ * will run in the foreground so the caller will not regain
+ * execution control until the connection is established or
+ * an error occurs.
+ */
+int qio_channel_rdma_connect_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
+                                  Error **errp);
+
+/**
+ * qio_channel_rdma_connect_async:
+ * @ioc: the rdma channel object
+ * @addr: the address to connect to
+ * @callback: the function to invoke on completion
+ * @opaque: user data to pass to @callback
+ * @destroy: the function to free @opaque
+ * @context: the context to run the async task. If %NULL, the default
+ *           context will be used.
+ *
+ * Attempt to connect to the address @addr. This method
+ * will run in the background so the caller will regain
+ * execution control immediately. The function @callback
+ * will be invoked on completion or failure. The @addr
+ * parameter will be copied, so may be freed as soon
+ * as this function returns without waiting for completion.
+ */
+void qio_channel_rdma_connect_async(QIOChannelRDMA *ioc,
+                                    InetSocketAddress *addr,
+                                    QIOTaskFunc callback, gpointer opaque,
+                                    GDestroyNotify destroy,
+                                    GMainContext *context);
+
+/**
+ * qio_channel_rdma_listen_sync:
+ * @ioc: the rdma channel object
+ * @addr: the address to listen to
+ * @num: the expected amount of connections
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Attempt to listen to the address @addr. This method
+ * will run in the foreground so the caller will not regain
+ * execution control until the connection is established or
+ * an error occurs.
+ */
+int qio_channel_rdma_listen_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
+                                 int num, Error **errp);
+
+/**
+ * qio_channel_rdma_listen_async:
+ * @ioc: the rdma channel object
+ * @addr: the address to listen to
+ * @num: the expected amount of connections
+ * @callback: the function to invoke on completion
+ * @opaque: user data to pass to @callback
+ * @destroy: the function to free @opaque
+ * @context: the context to run the async task. If %NULL, the default
+ *           context will be used.
+ *
+ * Attempt to listen to the address @addr. This method
+ * will run in the background so the caller will regain
+ * execution control immediately. The function @callback
+ * will be invoked on completion or failure. The @addr
+ * parameter will be copied, so may be freed as soon
+ * as this function returns without waiting for completion.
+ */
+void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
+                                   int num, QIOTaskFunc callback,
+                                   gpointer opaque, GDestroyNotify destroy,
+                                   GMainContext *context);
+
+/**
+ * qio_channel_rdma_accept:
+ * @ioc: the rdma channel object
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * If the rdma represents a server, then this accepts
+ * a new client connection. The returned channel will
+ * represent the connected client rdma.
+ *
+ * Returns: the new client channel, or NULL on error
+ */
+QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
+
+#endif /* QIO_CHANNEL_RDMA_H */
diff --git a/io/channel-rdma.c b/io/channel-rdma.c
new file mode 100644
index 0000000000..92c362df52
--- /dev/null
+++ b/io/channel-rdma.c
@@ -0,0 +1,437 @@
+/*
+ * QEMU I/O channels RDMA driver
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * Authors:
+ *  Jialin Wang <wangjialin23@huawei.com>
+ *  Gonglei <arei.gonglei@huawei.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "io/channel-rdma.h"
+#include "io/channel.h"
+#include "qapi/clone-visitor.h"
+#include "qapi/error.h"
+#include "qapi/qapi-visit-sockets.h"
+#include "trace.h"
+#include <errno.h>
+#include <netdb.h>
+#include <rdma/rsocket.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/eventfd.h>
+#include <sys/poll.h>
+#include <unistd.h>
+
+QIOChannelRDMA *qio_channel_rdma_new(void)
+{
+    QIOChannelRDMA *rioc;
+    QIOChannel *ioc;
+
+    rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
+    ioc = QIO_CHANNEL(rioc);
+    qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
+
+    trace_qio_channel_rdma_new(ioc);
+
+    return rioc;
+}
+
+static int qio_channel_rdma_set_fd(QIOChannelRDMA *rioc, int fd, Error **errp)
+{
+    if (rioc->fd != -1) {
+        error_setg(errp, "Socket is already open");
+        return -1;
+    }
+
+    rioc->fd = fd;
+    rioc->remoteAddrLen = sizeof(rioc->remoteAddr);
+    rioc->localAddrLen = sizeof(rioc->localAddr);
+
+    if (rgetpeername(fd, (struct sockaddr *)&rioc->remoteAddr,
+                     &rioc->remoteAddrLen) < 0) {
+        if (errno == ENOTCONN) {
+            memset(&rioc->remoteAddr, 0, sizeof(rioc->remoteAddr));
+            rioc->remoteAddrLen = sizeof(rioc->remoteAddr);
+        } else {
+            error_setg_errno(errp, errno,
+                             "Unable to query remote rsocket address");
+            goto error;
+        }
+    }
+
+    if (rgetsockname(fd, (struct sockaddr *)&rioc->localAddr,
+                     &rioc->localAddrLen) < 0) {
+        error_setg_errno(errp, errno, "Unable to query local rsocket address");
+        goto error;
+    }
+
+    return 0;
+
+error:
+    rioc->fd = -1; /* Let the caller close FD on failure */
+    return -1;
+}
+
+int qio_channel_rdma_connect_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
+                                  Error **errp)
+{
+    int ret, fd = -1;
+    struct rdma_addrinfo *ai;
+
+    trace_qio_channel_rdma_connect_sync(rioc, addr);
+    ret = rdma_getaddrinfo(addr->host, addr->port, NULL, &ai);
+    if (ret) {
+        error_setg(errp, "Failed to rdma_getaddrinfo: %s", gai_strerror(ret));
+        goto out;
+    }
+
+    fd = rsocket(ai->ai_family, SOCK_STREAM, 0);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "Failed to create rsocket");
+        goto out;
+    }
+    qemu_set_cloexec(fd);
+
+retry:
+    ret = rconnect(fd, ai->ai_dst_addr, ai->ai_dst_len);
+    if (ret) {
+        if (errno == EINTR) {
+            goto retry;
+        }
+        error_setg_errno(errp, errno, "Failed to rconnect");
+        goto out;
+    }
+
+    trace_qio_channel_rdma_connect_complete(rioc, fd);
+    ret = qio_channel_rdma_set_fd(rioc, fd, errp);
+    if (ret) {
+        goto out;
+    }
+
+out:
+    if (ret) {
+        trace_qio_channel_rdma_connect_fail(rioc);
+        if (fd >= 0) {
+            rclose(fd);
+        }
+    }
+    if (ai) {
+        rdma_freeaddrinfo(ai);
+    }
+
+    return ret;
+}
+
+static void qio_channel_rdma_connect_worker(QIOTask *task, gpointer opaque)
+{
+    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(qio_task_get_source(task));
+    InetSocketAddress *addr = opaque;
+    Error *err = NULL;
+
+    qio_channel_rdma_connect_sync(ioc, addr, &err);
+
+    qio_task_set_error(task, err);
+}
+
+void qio_channel_rdma_connect_async(QIOChannelRDMA *ioc,
+                                    InetSocketAddress *addr,
+                                    QIOTaskFunc callback, gpointer opaque,
+                                    GDestroyNotify destroy,
+                                    GMainContext *context)
+{
+    QIOTask *task = qio_task_new(OBJECT(ioc), callback, opaque, destroy);
+    InetSocketAddress *addrCopy;
+
+    addrCopy = QAPI_CLONE(InetSocketAddress, addr);
+
+    /* rdma_getaddrinfo() blocks in DNS lookups, so we must use a thread */
+    trace_qio_channel_rdma_connect_async(ioc, addr);
+    qio_task_run_in_thread(task, qio_channel_rdma_connect_worker, addrCopy,
+                           (GDestroyNotify)qapi_free_InetSocketAddress,
+                           context);
+}
+
+int qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
+                                 int num, Error **errp)
+{
+    int ret, fd = -1;
+    struct rdma_addrinfo *ai;
+    struct rdma_addrinfo ai_hints = { 0 };
+
+    trace_qio_channel_rdma_listen_sync(rioc, addr, num);
+    ai_hints.ai_port_space = RDMA_PS_TCP;
+    ai_hints.ai_flags |= RAI_PASSIVE;
+    ret = rdma_getaddrinfo(addr->host, addr->port, &ai_hints, &ai);
+    if (ret) {
+        error_setg(errp, "Failed to rdma_getaddrinfo: %s", gai_strerror(ret));
+        goto out;
+    }
+
+    fd = rsocket(ai->ai_family, SOCK_STREAM, 0);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "Failed to create rsocket");
+        goto out;
+    }
+    qemu_set_cloexec(fd);
+
+    ret = rbind(fd, ai->ai_src_addr, ai->ai_src_len);
+    if (ret) {
+        error_setg_errno(errp, errno, "Failed to rbind");
+        goto out;
+    }
+
+    ret = rlisten(fd, num);
+    if (ret) {
+        error_setg_errno(errp, errno, "Failed to rlisten");
+        goto out;
+    }
+
+    ret = qio_channel_rdma_set_fd(rioc, fd, errp);
+    if (ret) {
+        goto out;
+    }
+
+    qio_channel_set_feature(QIO_CHANNEL(rioc), QIO_CHANNEL_FEATURE_LISTEN);
+    trace_qio_channel_rdma_listen_complete(rioc, fd);
+
+out:
+    if (ret) {
+        trace_qio_channel_rdma_listen_fail(rioc);
+        if (fd >= 0) {
+            rclose(fd);
+        }
+    }
+    if (ai) {
+        rdma_freeaddrinfo(ai);
+    }
+
+    return ret;
+}
+
+struct QIOChannelListenWorkerData {
+    InetSocketAddress *addr;
+    int num; /* amount of expected connections */
+};
+
+static void qio_channel_listen_worker_free(gpointer opaque)
+{
+    struct QIOChannelListenWorkerData *data = opaque;
+
+    qapi_free_InetSocketAddress(data->addr);
+    g_free(data);
+}
+
+static void qio_channel_rdma_listen_worker(QIOTask *task, gpointer opaque)
+{
+    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(qio_task_get_source(task));
+    struct QIOChannelListenWorkerData *data = opaque;
+    Error *err = NULL;
+
+    qio_channel_rdma_listen_sync(ioc, data->addr, data->num, &err);
+
+    qio_task_set_error(task, err);
+}
+
+void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
+                                   int num, QIOTaskFunc callback,
+                                   gpointer opaque, GDestroyNotify destroy,
+                                   GMainContext *context)
+{
+    QIOTask *task = qio_task_new(OBJECT(ioc), callback, opaque, destroy);
+    struct QIOChannelListenWorkerData *data;
+
+    data = g_new0(struct QIOChannelListenWorkerData, 1);
+    data->addr = QAPI_CLONE(InetSocketAddress, addr);
+    data->num = num;
+
+    /* rdma_getaddrinfo() blocks in DNS lookups, so we must use a thread */
+    trace_qio_channel_rdma_listen_async(ioc, addr, num);
+    qio_task_run_in_thread(task, qio_channel_rdma_listen_worker, data,
+                           qio_channel_listen_worker_free, context);
+}
+
+QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc, Error **errp)
+{
+    QIOChannelRDMA *cioc;
+
+    cioc = qio_channel_rdma_new();
+    cioc->remoteAddrLen = sizeof(rioc->remoteAddr);
+    cioc->localAddrLen = sizeof(rioc->localAddr);
+
+    trace_qio_channel_rdma_accept(rioc);
+retry:
+    cioc->fd = raccept(rioc->fd, (struct sockaddr *)&cioc->remoteAddr,
+                       &cioc->remoteAddrLen);
+    if (cioc->fd < 0) {
+        if (errno == EINTR) {
+            goto retry;
+        }
+        error_setg_errno(errp, errno, "Unable to accept connection");
+        goto error;
+    }
+    qemu_set_cloexec(cioc->fd);
+
+    if (rgetsockname(cioc->fd, (struct sockaddr *)&cioc->localAddr,
+                     &cioc->localAddrLen) < 0) {
+        error_setg_errno(errp, errno, "Unable to query local rsocket address");
+        goto error;
+    }
+
+    trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
+    return cioc;
+
+error:
+    trace_qio_channel_rdma_accept_fail(rioc);
+    object_unref(OBJECT(cioc));
+    return NULL;
+}
+
+static void qio_channel_rdma_init(Object *obj)
+{
+    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
+    ioc->fd = -1;
+}
+
+static void qio_channel_rdma_finalize(Object *obj)
+{
+    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
+
+    if (ioc->fd != -1) {
+        rclose(ioc->fd);
+        ioc->fd = -1;
+    }
+}
+
+static ssize_t qio_channel_rdma_readv(QIOChannel *ioc, const struct iovec *iov,
+                                      size_t niov, int **fds G_GNUC_UNUSED,
+                                      size_t *nfds G_GNUC_UNUSED,
+                                      int flags G_GNUC_UNUSED, Error **errp)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+    ssize_t ret;
+
+retry:
+    ret = rreadv(rioc->fd, iov, niov);
+    if (ret < 0) {
+        if (errno == EINTR) {
+            goto retry;
+        }
+        error_setg_errno(errp, errno, "Unable to write to rsocket");
+        return -1;
+    }
+
+    return ret;
+}
+
+static ssize_t qio_channel_rdma_writev(QIOChannel *ioc, const struct iovec *iov,
+                                       size_t niov, int *fds G_GNUC_UNUSED,
+                                       size_t nfds G_GNUC_UNUSED,
+                                       int flags G_GNUC_UNUSED, Error **errp)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+    ssize_t ret;
+
+retry:
+    ret = rwritev(rioc->fd, iov, niov);
+    if (ret <= 0) {
+        if (errno == EINTR) {
+            goto retry;
+        }
+        error_setg_errno(errp, errno, "Unable to write to rsocket");
+        return -1;
+    }
+
+    return ret;
+}
+
+static void qio_channel_rdma_set_delay(QIOChannel *ioc, bool enabled)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+    int v = enabled ? 0 : 1;
+
+    rsetsockopt(rioc->fd, IPPROTO_TCP, TCP_NODELAY, &v, sizeof(v));
+}
+
+static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+
+    if (rioc->fd != -1) {
+        rclose(rioc->fd);
+        rioc->fd = -1;
+    }
+
+    return 0;
+}
+
+static int qio_channel_rdma_shutdown(QIOChannel *ioc, QIOChannelShutdown how,
+                                     Error **errp)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+    int sockhow;
+
+    switch (how) {
+    case QIO_CHANNEL_SHUTDOWN_READ:
+        sockhow = SHUT_RD;
+        break;
+    case QIO_CHANNEL_SHUTDOWN_WRITE:
+        sockhow = SHUT_WR;
+        break;
+    case QIO_CHANNEL_SHUTDOWN_BOTH:
+    default:
+        sockhow = SHUT_RDWR;
+        break;
+    }
+
+    if (rshutdown(rioc->fd, sockhow) < 0) {
+        error_setg_errno(errp, errno, "Unable to shutdown rsocket");
+        return -1;
+    }
+
+    return 0;
+}
+
+static void qio_channel_rdma_class_init(ObjectClass *klass,
+                                        void *class_data G_GNUC_UNUSED)
+{
+    QIOChannelClass *ioc_klass = QIO_CHANNEL_CLASS(klass);
+
+    ioc_klass->io_writev = qio_channel_rdma_writev;
+    ioc_klass->io_readv = qio_channel_rdma_readv;
+    ioc_klass->io_close = qio_channel_rdma_close;
+    ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
+    ioc_klass->io_set_delay = qio_channel_rdma_set_delay;
+}
+
+static const TypeInfo qio_channel_rdma_info = {
+    .parent = TYPE_QIO_CHANNEL,
+    .name = TYPE_QIO_CHANNEL_RDMA,
+    .instance_size = sizeof(QIOChannelRDMA),
+    .instance_init = qio_channel_rdma_init,
+    .instance_finalize = qio_channel_rdma_finalize,
+    .class_init = qio_channel_rdma_class_init,
+};
+
+static void qio_channel_rdma_register_types(void)
+{
+    type_register_static(&qio_channel_rdma_info);
+}
+
+type_init(qio_channel_rdma_register_types);
diff --git a/io/meson.build b/io/meson.build
index 283b9b2bdb..e0dbd5183f 100644
--- a/io/meson.build
+++ b/io/meson.build
@@ -14,3 +14,4 @@ io_ss.add(files(
   'net-listener.c',
   'task.c',
 ), gnutls)
+io_ss.add(when: rdma, if_true: files('channel-rdma.c'))
diff --git a/io/trace-events b/io/trace-events
index d4c0f84a9a..33026a2224 100644
--- a/io/trace-events
+++ b/io/trace-events
@@ -67,3 +67,17 @@ qio_channel_command_new_pid(void *ioc, int writefd, int readfd, int pid) "Comman
 qio_channel_command_new_spawn(void *ioc, const char *binary, int flags) "Command new spawn ioc=%p binary=%s flags=%d"
 qio_channel_command_abort(void *ioc, int pid) "Command abort ioc=%p pid=%d"
 qio_channel_command_wait(void *ioc, int pid, int ret, int status) "Command abort ioc=%p pid=%d ret=%d status=%d"
+
+# channel-rdma.c
+qio_channel_rdma_new(void *ioc) "RDMA rsocket new ioc=%p"
+qio_channel_rdma_connect_sync(void *ioc, void *addr) "RDMA rsocket connect sync ioc=%p addr=%p"
+qio_channel_rdma_connect_async(void *ioc, void *addr) "RDMA rsocket connect async ioc=%p addr=%p"
+qio_channel_rdma_connect_fail(void *ioc) "RDMA rsocket connect fail ioc=%p"
+qio_channel_rdma_connect_complete(void *ioc, int fd) "RDMA rsocket connect complete ioc=%p fd=%d"
+qio_channel_rdma_listen_sync(void *ioc, void *addr, int num) "RDMA rsocket listen sync ioc=%p addr=%p num=%d"
+qio_channel_rdma_listen_fail(void *ioc) "RDMA rsocket listen fail ioc=%p"
+qio_channel_rdma_listen_async(void *ioc, void *addr, int num) "RDMA rsocket listen async ioc=%p addr=%p num=%d"
+qio_channel_rdma_listen_complete(void *ioc, int fd) "RDMA rsocket listen complete ioc=%p fd=%d"
+qio_channel_rdma_accept(void *ioc) "Socket accept start ioc=%p"
+qio_channel_rdma_accept_fail(void *ioc) "RDMA rsocket accept fail ioc=%p"
+qio_channel_rdma_accept_complete(void *ioc, void *cioc, int fd) "RDMA rsocket accept complete ioc=%p cioc=%p fd=%d"
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 3/6] io/channel-rdma: support working in coroutine
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
  2024-06-04 12:14 ` [PATCH 1/6] migration: remove RDMA live migration temporarily Gonglei via
  2024-06-04 12:14 ` [PATCH 2/6] io: add QIOChannelRDMA class Gonglei via
@ 2024-06-04 12:14 ` Gonglei via
  2024-06-06 13:34   ` Haris Iqbal
  2024-06-07  9:04   ` Daniel P. Berrangé
  2024-06-04 12:14 ` [PATCH 4/6] tests/unit: add test-io-channel-rdma.c Gonglei via
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

It is not feasible to obtain RDMA completion queue notifications
through poll/ppoll on the rsocket fd. Therefore, we create a thread
named rpoller for each rsocket fd and two eventfds: pollin_eventfd
and pollout_eventfd.

When using io_create_watch or io_set_aio_fd_handler waits for POLLIN
or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
and pollout_eventfd instead of the rsocket fd.

The rpoller rpoll() on the rsocket fd to receive POLLIN and POLLOUT
events.
When a POLLIN event occurs, the rpoller write the pollin_eventfd,
and then poll/ppoll will return the POLLIN event.
When a POLLOUT event occurs, the rpoller read the pollout_eventfd,
and then poll/ppoll will return the POLLOUT event.

For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
read/write the pollin/pollout_eventfd, preventing poll/ppoll from
returning POLLIN/POLLOUT events.

Known limitations:

  For a blocking rsocket fd, if we use io_create_watch to wait for
  POLLIN or POLLOUT events, since the rsocket fd is blocking, we
  cannot determine when it is not ready to read/write as we can with
  non-blocking fds. Therefore, when an event occurs, it will occurs
  always, potentially leave the qemu hanging. So we need be cautious
  to avoid hanging when using io_create_watch .

Luckily, channel-rdma works well in coroutines :)

Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 include/io/channel-rdma.h |  15 +-
 io/channel-rdma.c         | 363 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 376 insertions(+), 2 deletions(-)

diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
index 8cab2459e5..cb56127d76 100644
--- a/include/io/channel-rdma.h
+++ b/include/io/channel-rdma.h
@@ -47,6 +47,18 @@ struct QIOChannelRDMA {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+
+    /* private */
+
+    /* qemu g_poll/ppoll() POLLIN event on it */
+    int pollin_eventfd;
+    /* qemu g_poll/ppoll() POLLOUT event on it */
+    int pollout_eventfd;
+
+    /* the index in the rpoller's fds array */
+    int index;
+    /* rpoller will rpoll() rpoll_events on the rsocket fd */
+    short int rpoll_events;
 };
 
 /**
@@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
  *
  * Returns: the new client channel, or NULL on error
  */
-QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
+QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
+                                                           Error **errp);
 
 #endif /* QIO_CHANNEL_RDMA_H */
diff --git a/io/channel-rdma.c b/io/channel-rdma.c
index 92c362df52..9792add5cf 100644
--- a/io/channel-rdma.c
+++ b/io/channel-rdma.c
@@ -23,10 +23,15 @@
 
 #include "qemu/osdep.h"
 #include "io/channel-rdma.h"
+#include "io/channel-util.h"
+#include "io/channel-watch.h"
 #include "io/channel.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/error.h"
 #include "qapi/qapi-visit-sockets.h"
+#include "qemu/atomic.h"
+#include "qemu/error-report.h"
+#include "qemu/thread.h"
 #include "trace.h"
 #include <errno.h>
 #include <netdb.h>
@@ -39,11 +44,274 @@
 #include <sys/poll.h>
 #include <unistd.h>
 
+typedef enum {
+    CLEAR_POLLIN,
+    CLEAR_POLLOUT,
+    SET_POLLIN,
+    SET_POLLOUT,
+} UpdateEvent;
+
+typedef enum {
+    RP_CMD_ADD_IOC,
+    RP_CMD_DEL_IOC,
+    RP_CMD_UPDATE,
+} RpollerCMD;
+
+typedef struct {
+    RpollerCMD cmd;
+    QIOChannelRDMA *rioc;
+} RpollerMsg;
+
+/*
+ * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT event
+ * occurs, it will write/read the pollin_eventfd/pollout_eventfd to allow
+ * qemu g_poll/ppoll() get the POLLIN/POLLOUT event
+ */
+static struct Rpoller {
+    QemuThread thread;
+    bool is_running;
+    int sock[2];
+    int count; /* the number of rsocket fds being rpoll() */
+    int size; /* the size of fds/riocs */
+    struct pollfd *fds;
+    QIOChannelRDMA **riocs;
+} rpoller;
+
+static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
+                                            RpollerCMD cmd)
+{
+    RpollerMsg msg;
+    int ret;
+
+    msg.cmd = cmd;
+    msg.rioc = rioc;
+
+    ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
+    if (ret != sizeof msg) {
+        error_report("%s: failed to send msg, errno: %d", __func__, errno);
+    }
+}
+
+static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
+                                               UpdateEvent action,
+                                               bool notify_rpoller)
+{
+    /* An eventfd with the value of ULLONG_MAX - 1 is readable but unwritable */
+    unsigned long long buf = ULLONG_MAX - 1;
+
+    switch (action) {
+    /* only rpoller do SET_* action, to allow qemu ppoll() get the event */
+    case SET_POLLIN:
+        RETRY_ON_EINTR(write(rioc->pollin_eventfd, &buf, sizeof buf));
+        rioc->rpoll_events &= ~POLLIN;
+        break;
+    case SET_POLLOUT:
+        RETRY_ON_EINTR(read(rioc->pollout_eventfd, &buf, sizeof buf));
+        rioc->rpoll_events &= ~POLLOUT;
+        break;
+
+    /* the rsocket fd is not ready to rread/rwrite */
+    case CLEAR_POLLIN:
+        RETRY_ON_EINTR(read(rioc->pollin_eventfd, &buf, sizeof buf));
+        rioc->rpoll_events |= POLLIN;
+        break;
+    case CLEAR_POLLOUT:
+        RETRY_ON_EINTR(write(rioc->pollout_eventfd, &buf, sizeof buf));
+        rioc->rpoll_events |= POLLOUT;
+        break;
+    default:
+        break;
+    }
+
+    /* notify rpoller to rpoll() POLLIN/POLLOUT events */
+    if (notify_rpoller) {
+        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_UPDATE);
+    }
+}
+
+static void qio_channel_rdma_rpoller_add_rioc(QIOChannelRDMA *rioc)
+{
+    if (rioc->index != -1) {
+        error_report("%s: rioc already exsits", __func__);
+        return;
+    }
+
+    rioc->index = ++rpoller.count;
+
+    if (rpoller.count + 1 > rpoller.size) {
+        rpoller.size *= 2;
+        rpoller.fds = g_renew(struct pollfd, rpoller.fds, rpoller.size);
+        rpoller.riocs = g_renew(QIOChannelRDMA *, rpoller.riocs, rpoller.size);
+    }
+
+    rpoller.fds[rioc->index].fd = rioc->fd;
+    rpoller.fds[rioc->index].events = rioc->rpoll_events;
+    rpoller.riocs[rioc->index] = rioc;
+}
+
+static void qio_channel_rdma_rpoller_del_rioc(QIOChannelRDMA *rioc)
+{
+    if (rioc->index == -1) {
+        error_report("%s: rioc not exsits", __func__);
+        return;
+    }
+
+    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];
+    rpoller.riocs[rioc->index] = rpoller.riocs[rpoller.count];
+    rpoller.riocs[rioc->index]->index = rioc->index;
+    rpoller.count--;
+
+    close(rioc->pollin_eventfd);
+    close(rioc->pollout_eventfd);
+    rioc->index = -1;
+    rioc->rpoll_events = 0;
+}
+
+static void qio_channel_rdma_rpoller_update_ioc(QIOChannelRDMA *rioc)
+{
+    if (rioc->index == -1) {
+        error_report("%s: rioc not exsits", __func__);
+        return;
+    }
+
+    rpoller.fds[rioc->index].fd = rioc->fd;
+    rpoller.fds[rioc->index].events = rioc->rpoll_events;
+}
+
+static void qio_channel_rdma_rpoller_process_msg(void)
+{
+    RpollerMsg msg;
+    int ret;
+
+    ret = RETRY_ON_EINTR(read(rpoller.sock[1], &msg, sizeof msg));
+    if (ret != sizeof msg) {
+        error_report("%s: rpoller failed to recv msg: %s", __func__,
+                     strerror(errno));
+        return;
+    }
+
+    switch (msg.cmd) {
+    case RP_CMD_ADD_IOC:
+        qio_channel_rdma_rpoller_add_rioc(msg.rioc);
+        break;
+    case RP_CMD_DEL_IOC:
+        qio_channel_rdma_rpoller_del_rioc(msg.rioc);
+        break;
+    case RP_CMD_UPDATE:
+        qio_channel_rdma_rpoller_update_ioc(msg.rioc);
+        break;
+    default:
+        break;
+    }
+}
+
+static void qio_channel_rdma_rpoller_cleanup(void)
+{
+    close(rpoller.sock[0]);
+    close(rpoller.sock[1]);
+    rpoller.sock[0] = -1;
+    rpoller.sock[1] = -1;
+    g_free(rpoller.fds);
+    g_free(rpoller.riocs);
+    rpoller.fds = NULL;
+    rpoller.riocs = NULL;
+    rpoller.count = 0;
+    rpoller.size = 0;
+    rpoller.is_running = false;
+}
+
+static void *qio_channel_rdma_rpoller_thread(void *opaque)
+{
+    int i, ret, error_events = POLLERR | POLLHUP | POLLNVAL;
+
+    do {
+        ret = rpoll(rpoller.fds, rpoller.count + 1, -1);
+        if (ret < 0 && errno != -EINTR) {
+            error_report("%s: rpoll() error: %s", __func__, strerror(errno));
+            break;
+        }
+
+        for (i = 1; i <= rpoller.count; i++) {
+            if (rpoller.fds[i].revents & (POLLIN | error_events)) {
+                qio_channel_rdma_update_poll_event(rpoller.riocs[i], SET_POLLIN,
+                                                   false);
+                rpoller.fds[i].events &= ~POLLIN;
+            }
+            if (rpoller.fds[i].revents & (POLLOUT | error_events)) {
+                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
+                                                   SET_POLLOUT, false);
+                rpoller.fds[i].events &= ~POLLOUT;
+            }
+            /* ignore this fd */
+            if (rpoller.fds[i].revents & (error_events)) {
+                rpoller.fds[i].fd = -1;
+            }
+        }
+
+        if (rpoller.fds[0].revents) {
+            qio_channel_rdma_rpoller_process_msg();
+        }
+    } while (rpoller.count >= 1);
+
+    qio_channel_rdma_rpoller_cleanup();
+
+    return NULL;
+}
+
+static void qio_channel_rdma_rpoller_start(void)
+{
+    if (qatomic_xchg(&rpoller.is_running, true)) {
+        return;
+    }
+
+    if (qemu_socketpair(AF_UNIX, SOCK_STREAM, 0, rpoller.sock)) {
+        rpoller.is_running = false;
+        error_report("%s: failed to create socketpair %s", __func__,
+                     strerror(errno));
+        return;
+    }
+
+    rpoller.count = 0;
+    rpoller.size = 4;
+    rpoller.fds = g_malloc0_n(rpoller.size, sizeof(struct pollfd));
+    rpoller.riocs = g_malloc0_n(rpoller.size, sizeof(QIOChannelRDMA *));
+    rpoller.fds[0].fd = rpoller.sock[1];
+    rpoller.fds[0].events = POLLIN;
+
+    qemu_thread_create(&rpoller.thread, "qio-channel-rdma-rpoller",
+                       qio_channel_rdma_rpoller_thread, NULL,
+                       QEMU_THREAD_JOINABLE);
+}
+
+static void qio_channel_rdma_add_rioc_to_rpoller(QIOChannelRDMA *rioc)
+{
+    int flags = EFD_CLOEXEC | EFD_NONBLOCK;
+
+    /*
+     * A single eventfd is either readable or writable. A single eventfd cannot
+     * represent a state where it is neither readable nor writable. so use two
+     * eventfds here.
+     */
+    rioc->pollin_eventfd = eventfd(0, flags);
+    rioc->pollout_eventfd = eventfd(0, flags);
+    /* pollout_eventfd with the value 0, means writable, make it unwritable */
+    qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, false);
+
+    /* tell the rpoller to rpoll() events on rioc->socketfd */
+    rioc->rpoll_events = POLLIN | POLLOUT;
+    qio_channel_rdma_notify_rpoller(rioc, RP_CMD_ADD_IOC);
+}
+
 QIOChannelRDMA *qio_channel_rdma_new(void)
 {
     QIOChannelRDMA *rioc;
     QIOChannel *ioc;
 
+    qio_channel_rdma_rpoller_start();
+    if (!rpoller.is_running) {
+        return NULL;
+    }
+
     rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
     ioc = QIO_CHANNEL(rioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -125,6 +393,8 @@ retry:
         goto out;
     }
 
+    qio_channel_rdma_add_rioc_to_rpoller(rioc);
+
 out:
     if (ret) {
         trace_qio_channel_rdma_connect_fail(rioc);
@@ -211,6 +481,8 @@ int qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
     qio_channel_set_feature(QIO_CHANNEL(rioc), QIO_CHANNEL_FEATURE_LISTEN);
     trace_qio_channel_rdma_listen_complete(rioc, fd);
 
+    qio_channel_rdma_add_rioc_to_rpoller(rioc);
+
 out:
     if (ret) {
         trace_qio_channel_rdma_listen_fail(rioc);
@@ -267,8 +539,10 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
                            qio_channel_listen_worker_free, context);
 }
 
-QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc, Error **errp)
+QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *rioc,
+                                                           Error **errp)
 {
+    QIOChannel *ioc = QIO_CHANNEL(rioc);
     QIOChannelRDMA *cioc;
 
     cioc = qio_channel_rdma_new();
@@ -283,6 +557,17 @@ retry:
         if (errno == EINTR) {
             goto retry;
         }
+        if (errno == EAGAIN) {
+            if (!(rioc->rpoll_events & POLLIN)) {
+                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLIN, true);
+            }
+            if (qemu_in_coroutine()) {
+                qio_channel_yield(ioc, G_IO_IN);
+            } else {
+                qio_channel_wait(ioc, G_IO_IN);
+            }
+            goto retry;
+        }
         error_setg_errno(errp, errno, "Unable to accept connection");
         goto error;
     }
@@ -294,6 +579,8 @@ retry:
         goto error;
     }
 
+    qio_channel_rdma_add_rioc_to_rpoller(cioc);
+
     trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
     return cioc;
 
@@ -307,6 +594,10 @@ static void qio_channel_rdma_init(Object *obj)
 {
     QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
     ioc->fd = -1;
+    ioc->pollin_eventfd = -1;
+    ioc->pollout_eventfd = -1;
+    ioc->index = -1;
+    ioc->rpoll_events = 0;
 }
 
 static void qio_channel_rdma_finalize(Object *obj)
@@ -314,6 +605,7 @@ static void qio_channel_rdma_finalize(Object *obj)
     QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
 
     if (ioc->fd != -1) {
+        qio_channel_rdma_notify_rpoller(ioc, RP_CMD_DEL_IOC);
         rclose(ioc->fd);
         ioc->fd = -1;
     }
@@ -330,6 +622,12 @@ static ssize_t qio_channel_rdma_readv(QIOChannel *ioc, const struct iovec *iov,
 retry:
     ret = rreadv(rioc->fd, iov, niov);
     if (ret < 0) {
+        if (errno == EAGAIN) {
+            if (!(rioc->rpoll_events & POLLIN)) {
+                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLIN, true);
+            }
+            return QIO_CHANNEL_ERR_BLOCK;
+        }
         if (errno == EINTR) {
             goto retry;
         }
@@ -351,6 +649,12 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc, const struct iovec *iov,
 retry:
     ret = rwritev(rioc->fd, iov, niov);
     if (ret <= 0) {
+        if (errno == EAGAIN) {
+            if (!(rioc->rpoll_events & POLLOUT)) {
+                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, true);
+            }
+            return QIO_CHANNEL_ERR_BLOCK;
+        }
         if (errno == EINTR) {
             goto retry;
         }
@@ -361,6 +665,28 @@ retry:
     return ret;
 }
 
+static int qio_channel_rdma_set_blocking(QIOChannel *ioc, bool enabled,
+                                         Error **errp G_GNUC_UNUSED)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+    int flags, ret;
+
+    flags = rfcntl(rioc->fd, F_GETFL);
+    if (enabled) {
+        flags &= ~O_NONBLOCK;
+    } else {
+        flags |= O_NONBLOCK;
+    }
+
+    ret = rfcntl(rioc->fd, F_SETFL, flags);
+    if (ret) {
+        error_setg_errno(errp, errno,
+                         "Unable to rfcntl rsocket fd with flags %d", flags);
+    }
+
+    return ret;
+}
+
 static void qio_channel_rdma_set_delay(QIOChannel *ioc, bool enabled)
 {
     QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
@@ -374,6 +700,7 @@ static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
     QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
 
     if (rioc->fd != -1) {
+        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_DEL_IOC);
         rclose(rioc->fd);
         rioc->fd = -1;
     }
@@ -408,6 +735,37 @@ static int qio_channel_rdma_shutdown(QIOChannel *ioc, QIOChannelShutdown how,
     return 0;
 }
 
+static void
+qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc, AioContext *read_ctx,
+                                    IOHandler *io_read, AioContext *write_ctx,
+                                    IOHandler *io_write, void *opaque)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+
+    qio_channel_util_set_aio_fd_handler(rioc->pollin_eventfd, read_ctx, io_read,
+                                        rioc->pollout_eventfd, write_ctx,
+                                        io_write, opaque);
+}
+
+static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
+                                              GIOCondition condition)
+{
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
+
+    switch (condition) {
+    case G_IO_IN:
+        return qio_channel_create_fd_watch(ioc, rioc->pollin_eventfd,
+                                           condition);
+    case G_IO_OUT:
+        return qio_channel_create_fd_watch(ioc, rioc->pollout_eventfd,
+                                           condition);
+    default:
+        error_report("%s: do not support watch 0x%x event", __func__,
+                     condition);
+        return NULL;
+    }
+}
+
 static void qio_channel_rdma_class_init(ObjectClass *klass,
                                         void *class_data G_GNUC_UNUSED)
 {
@@ -415,9 +773,12 @@ static void qio_channel_rdma_class_init(ObjectClass *klass,
 
     ioc_klass->io_writev = qio_channel_rdma_writev;
     ioc_klass->io_readv = qio_channel_rdma_readv;
+    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
     ioc_klass->io_close = qio_channel_rdma_close;
     ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
     ioc_klass->io_set_delay = qio_channel_rdma_set_delay;
+    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
+    ioc_klass->io_set_aio_fd_handler = qio_channel_rdma_set_aio_fd_handler;
 }
 
 static const TypeInfo qio_channel_rdma_info = {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 4/6] tests/unit: add test-io-channel-rdma.c
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (2 preceding siblings ...)
  2024-06-04 12:14 ` [PATCH 3/6] io/channel-rdma: support working in coroutine Gonglei via
@ 2024-06-04 12:14 ` Gonglei via
  2024-06-04 12:14 ` [PATCH 5/6] migration: introduce new RDMA live migration Gonglei via
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 tests/unit/meson.build            |   1 +
 tests/unit/test-io-channel-rdma.c | 276 ++++++++++++++++++++++++++++++
 2 files changed, 277 insertions(+)
 create mode 100644 tests/unit/test-io-channel-rdma.c

diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index 26c109c968..c44020a3b5 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -85,6 +85,7 @@ if have_block
     'test-authz-listfile': [authz],
     'test-io-task': [testblock],
     'test-io-channel-socket': ['socket-helpers.c', 'io-channel-helpers.c', io],
+    'test-io-channel-rdma': ['io-channel-helpers.c', io],
     'test-io-channel-file': ['io-channel-helpers.c', io],
     'test-io-channel-command': ['io-channel-helpers.c', io],
     'test-io-channel-buffer': ['io-channel-helpers.c', io],
diff --git a/tests/unit/test-io-channel-rdma.c b/tests/unit/test-io-channel-rdma.c
new file mode 100644
index 0000000000..e96b55c8c7
--- /dev/null
+++ b/tests/unit/test-io-channel-rdma.c
@@ -0,0 +1,276 @@
+/*
+ * QEMU I/O channel RDMA test
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "io/channel-rdma.h"
+#include "qapi/error.h"
+#include "qemu/main-loop.h"
+#include "qemu/module.h"
+#include "io-channel-helpers.h"
+#include "qapi-types-sockets.h"
+#include <rdma/rsocket.h>
+
+static SocketAddress *l_addr;
+static SocketAddress *c_addr;
+
+static void test_io_channel_set_rdma_bufs(QIOChannel *src, QIOChannel *dst)
+{
+    int buflen = 64 * 1024;
+
+    /*
+     * Make the socket buffers small so that we see
+     * the effects of partial reads/writes
+     */
+    rsetsockopt(((QIOChannelRDMA *)src)->fd, SOL_SOCKET, SO_SNDBUF,
+                (char *)&buflen, sizeof(buflen));
+
+    rsetsockopt(((QIOChannelRDMA *)dst)->fd, SOL_SOCKET, SO_SNDBUF,
+                (char *)&buflen, sizeof(buflen));
+}
+
+static void test_io_channel_setup_sync(InetSocketAddress *listen_addr,
+                                       InetSocketAddress *connect_addr,
+                                       QIOChannel **srv, QIOChannel **src,
+                                       QIOChannel **dst)
+{
+    QIOChannelRDMA *lioc;
+
+    lioc = qio_channel_rdma_new();
+    qio_channel_rdma_listen_sync(lioc, listen_addr, 1, &error_abort);
+
+    *src = QIO_CHANNEL(qio_channel_rdma_new());
+    qio_channel_rdma_connect_sync(QIO_CHANNEL_RDMA(*src), connect_addr,
+                                  &error_abort);
+    qio_channel_set_delay(*src, false);
+
+    qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
+    *dst = QIO_CHANNEL(qio_channel_rdma_accept(lioc, &error_abort));
+    g_assert(*dst);
+
+    test_io_channel_set_rdma_bufs(*src, *dst);
+
+    *srv = QIO_CHANNEL(lioc);
+}
+
+struct TestIOChannelData {
+    bool err;
+    GMainLoop *loop;
+};
+
+static void test_io_channel_complete(QIOTask *task, gpointer opaque)
+{
+    struct TestIOChannelData *data = opaque;
+    data->err = qio_task_propagate_error(task, NULL);
+    g_main_loop_quit(data->loop);
+}
+
+static void test_io_channel_setup_async(InetSocketAddress *listen_addr,
+                                        InetSocketAddress *connect_addr,
+                                        QIOChannel **srv, QIOChannel **src,
+                                        QIOChannel **dst)
+{
+    QIOChannelRDMA *lioc;
+    struct TestIOChannelData data;
+
+    data.loop = g_main_loop_new(g_main_context_default(), TRUE);
+
+    lioc = qio_channel_rdma_new();
+    qio_channel_rdma_listen_async(lioc, listen_addr, 1,
+                                  test_io_channel_complete, &data, NULL, NULL);
+
+    g_main_loop_run(data.loop);
+    g_main_context_iteration(g_main_context_default(), FALSE);
+
+    g_assert(!data.err);
+
+    *src = QIO_CHANNEL(qio_channel_rdma_new());
+
+    qio_channel_rdma_connect_async(QIO_CHANNEL_RDMA(*src), connect_addr,
+                                   test_io_channel_complete, &data, NULL, NULL);
+
+    g_main_loop_run(data.loop);
+    g_main_context_iteration(g_main_context_default(), FALSE);
+
+    g_assert(!data.err);
+
+    if (qemu_in_coroutine()) {
+        qio_channel_yield(QIO_CHANNEL(lioc), G_IO_IN);
+    } else {
+        qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
+    }
+    *dst = QIO_CHANNEL(qio_channel_rdma_accept(lioc, &error_abort));
+    g_assert(*dst);
+
+    qio_channel_set_delay(*src, false);
+    test_io_channel_set_rdma_bufs(*src, *dst);
+
+    *srv = QIO_CHANNEL(lioc);
+
+    g_main_loop_unref(data.loop);
+}
+
+static void test_io_channel(bool async, InetSocketAddress *listen_addr,
+                            InetSocketAddress *connect_addr)
+{
+    QIOChannel *src, *dst, *srv;
+    QIOChannelTest *test;
+
+    if (async) {
+        /* async + blocking */
+
+        test_io_channel_setup_async(listen_addr, connect_addr, &srv, &src,
+                                    &dst);
+
+        g_assert(qio_channel_has_feature(src, QIO_CHANNEL_FEATURE_SHUTDOWN));
+        g_assert(qio_channel_has_feature(dst, QIO_CHANNEL_FEATURE_SHUTDOWN));
+
+        test = qio_channel_test_new();
+        qio_channel_test_run_threads(test, true, src, dst);
+        qio_channel_test_validate(test);
+
+        /* unref without close, to ensure finalize() cleans up */
+
+        object_unref(OBJECT(src));
+        object_unref(OBJECT(dst));
+        object_unref(OBJECT(srv));
+
+        /* async + non-blocking */
+
+        test_io_channel_setup_async(listen_addr, connect_addr, &srv, &src,
+                                    &dst);
+
+        g_assert(qio_channel_has_feature(src, QIO_CHANNEL_FEATURE_SHUTDOWN));
+        g_assert(qio_channel_has_feature(dst, QIO_CHANNEL_FEATURE_SHUTDOWN));
+
+        test = qio_channel_test_new();
+        qio_channel_test_run_threads(test, false, src, dst);
+        qio_channel_test_validate(test);
+
+        /* close before unref, to ensure finalize copes with already closed */
+
+        qio_channel_close(src, &error_abort);
+        qio_channel_close(dst, &error_abort);
+        object_unref(OBJECT(src));
+        object_unref(OBJECT(dst));
+
+        qio_channel_close(srv, &error_abort);
+        object_unref(OBJECT(srv));
+    } else {
+        /* sync + blocking */
+
+        test_io_channel_setup_sync(listen_addr, connect_addr, &srv, &src, &dst);
+
+        g_assert(qio_channel_has_feature(src, QIO_CHANNEL_FEATURE_SHUTDOWN));
+        g_assert(qio_channel_has_feature(dst, QIO_CHANNEL_FEATURE_SHUTDOWN));
+
+        test = qio_channel_test_new();
+        qio_channel_test_run_threads(test, true, src, dst);
+        qio_channel_test_validate(test);
+
+        /* unref without close, to ensure finalize() cleans up */
+
+        object_unref(OBJECT(src));
+        object_unref(OBJECT(dst));
+        object_unref(OBJECT(srv));
+
+        /* sync + non-blocking */
+
+        test_io_channel_setup_sync(listen_addr, connect_addr, &srv, &src, &dst);
+
+        g_assert(qio_channel_has_feature(src, QIO_CHANNEL_FEATURE_SHUTDOWN));
+        g_assert(qio_channel_has_feature(dst, QIO_CHANNEL_FEATURE_SHUTDOWN));
+
+        test = qio_channel_test_new();
+        qio_channel_test_run_threads(test, false, src, dst);
+        qio_channel_test_validate(test);
+
+        /* close before unref, to ensure finalize copes with already closed */
+
+        qio_channel_close(src, &error_abort);
+        qio_channel_close(dst, &error_abort);
+        object_unref(OBJECT(src));
+        object_unref(OBJECT(dst));
+
+        qio_channel_close(srv, &error_abort);
+        object_unref(OBJECT(srv));
+    }
+}
+
+static void test_io_channel_rdma(bool async)
+{
+    InetSocketAddress *listen_addr;
+    InetSocketAddress *connect_addr;
+
+    listen_addr = &l_addr->u.inet;
+    connect_addr = &l_addr->u.inet;
+
+    test_io_channel(async, listen_addr, connect_addr);
+}
+
+static void test_io_channel_rdma_sync(void)
+{
+    test_io_channel_rdma(false);
+}
+
+static void test_io_channel_rdma_async(void)
+{
+    test_io_channel_rdma(true);
+}
+
+static void test_io_channel_rdma_co(void *opaque)
+{
+    test_io_channel_rdma(true);
+}
+
+static void test_io_channel_rdma_coroutine(void)
+{
+    Coroutine *coroutine;
+
+    coroutine = qemu_coroutine_create(test_io_channel_rdma_co, NULL);
+    qemu_coroutine_enter(coroutine);
+}
+
+int main(int argc, char **argv)
+{
+    module_call_init(MODULE_INIT_QOM);
+    qemu_init_main_loop(&error_abort);
+
+    if (argc != 3) {
+        fprintf(stderr, "Usage: %s listen_addr connect_addr\n", argv[0]);
+        exit(-1);
+    }
+
+    l_addr = socket_parse(argv[1], NULL);
+    c_addr = socket_parse(argv[2], NULL);
+    if (l_addr == NULL || c_addr == NULL ||
+        l_addr->type != SOCKET_ADDRESS_TYPE_INET ||
+        c_addr->type != SOCKET_ADDRESS_TYPE_INET) {
+        fprintf(stderr, "Only socket address types 'inet' is supported\n");
+        exit(-1);
+    }
+
+    g_test_init(&argc, &argv, NULL);
+
+    g_test_add_func("/io/channel/rdma/sync", test_io_channel_rdma_sync);
+    g_test_add_func("/io/channel/rdma/async", test_io_channel_rdma_async);
+    g_test_add_func("/io/channel/rdma/coroutine",
+                    test_io_channel_rdma_coroutine);
+
+    return g_test_run();
+}
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 5/6] migration: introduce new RDMA live migration
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (3 preceding siblings ...)
  2024-06-04 12:14 ` [PATCH 4/6] tests/unit: add test-io-channel-rdma.c Gonglei via
@ 2024-06-04 12:14 ` Gonglei via
  2024-06-04 12:14 ` [PATCH 6/6] migration/rdma: support multifd for RDMA migration Gonglei via
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 migration/meson.build |  2 +
 migration/migration.c | 11 +++++-
 migration/rdma.c      | 88 +++++++++++++++++++++++++++++++++++++++++++
 migration/rdma.h      | 24 ++++++++++++
 4 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 migration/rdma.c
 create mode 100644 migration/rdma.h

diff --git a/migration/meson.build b/migration/meson.build
index 4e8a9ccf3e..04e2e16239 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -42,3 +42,5 @@ system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
 specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
                 if_true: files('ram.c',
                                'target.c'))
+
+system_ss.add(when: rdma, if_true: files('rdma.c'))
diff --git a/migration/migration.c b/migration/migration.c
index 6b9ad4ff5f..77c301d351 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -25,6 +25,7 @@
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/cpu-throttle.h"
+#include "rdma.h"
 #include "ram.h"
 #include "migration/global_state.h"
 #include "migration/misc.h"
@@ -145,7 +146,7 @@ static bool transport_supports_multi_channels(MigrationAddress *addr)
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
         return migrate_mapped_ram();
     } else {
-        return false;
+        return addr->transport == MIGRATION_ADDRESS_TYPE_RDMA;
     }
 }
 
@@ -644,6 +645,10 @@ static void qemu_start_incoming_migration(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_incoming_migration(saddr->u.fd.str, errp);
         }
+#ifdef CONFIG_RDMA
+    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
+        rdma_start_incoming_migration(&addr->u.rdma, errp);
+#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_incoming_migration(addr->u.exec.args, errp);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
@@ -2046,6 +2051,10 @@ void qmp_migrate(const char *uri, bool has_channels,
         } else if (saddr->type == SOCKET_ADDRESS_TYPE_FD) {
             fd_start_outgoing_migration(s, saddr->u.fd.str, &local_err);
         }
+#ifdef CONFIG_RDMA
+    } else if (addr->transport == MIGRATION_ADDRESS_TYPE_RDMA) {
+        rdma_start_outgoing_migration(s, &addr->u.rdma, &local_err);
+#endif
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_EXEC) {
         exec_start_outgoing_migration(s, addr->u.exec.args, &local_err);
     } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
diff --git a/migration/rdma.c b/migration/rdma.c
new file mode 100644
index 0000000000..09a4de7f59
--- /dev/null
+++ b/migration/rdma.c
@@ -0,0 +1,88 @@
+/*
+ * QEMU live migration via RDMA
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * Authors:
+ *  Jialin Wang <wangjialin23@huawei.com>
+ *  Gonglei <arei.gonglei@huawei.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "io/channel-rdma.h"
+#include "io/channel.h"
+#include "qapi/clone-visitor.h"
+#include "qapi/qapi-types-sockets.h"
+#include "qapi/qapi-visit-sockets.h"
+#include "channel.h"
+#include "migration.h"
+#include "rdma.h"
+#include "trace.h"
+#include <stdio.h>
+
+static struct RDMAOutgoingArgs {
+    InetSocketAddress *addr;
+} outgoing_args;
+
+static void rdma_outgoing_migration(QIOTask *task, gpointer opaque)
+{
+    MigrationState *s = opaque;
+    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(qio_task_get_source(task));
+
+    migration_channel_connect(s, QIO_CHANNEL(rioc), outgoing_args.addr->host,
+                              NULL);
+    object_unref(OBJECT(rioc));
+}
+
+void rdma_start_outgoing_migration(MigrationState *s, InetSocketAddress *iaddr,
+                                   Error **errp)
+{
+    QIOChannelRDMA *rioc = qio_channel_rdma_new();
+
+    /* in case previous migration leaked it */
+    qapi_free_InetSocketAddress(outgoing_args.addr);
+    outgoing_args.addr = QAPI_CLONE(InetSocketAddress, iaddr);
+
+    qio_channel_set_name(QIO_CHANNEL(rioc), "migration-rdma-outgoing");
+    qio_channel_rdma_connect_async(rioc, iaddr, rdma_outgoing_migration, s,
+                                   NULL, NULL);
+}
+
+static void coroutine_fn rdma_accept_incoming_migration(void *opaque)
+{
+    QIOChannelRDMA *rioc = opaque;
+    QIOChannelRDMA *cioc;
+
+    while (!migration_has_all_channels()) {
+        cioc = qio_channel_rdma_accept(rioc, NULL);
+
+        qio_channel_set_name(QIO_CHANNEL(cioc), "migration-rdma-incoming");
+        migration_channel_process_incoming(QIO_CHANNEL(cioc));
+        object_unref(OBJECT(cioc));
+    }
+}
+
+void rdma_start_incoming_migration(InetSocketAddress *addr, Error **errp)
+{
+    QIOChannelRDMA *rioc = qio_channel_rdma_new();
+    MigrationIncomingState *mis = migration_incoming_get_current();
+    Coroutine *co;
+    int num = 1;
+
+    qio_channel_set_name(QIO_CHANNEL(rioc), "migration-rdma-listener");
+
+    if (qio_channel_rdma_listen_sync(rioc, addr, num, errp) < 0) {
+        object_unref(OBJECT(rioc));
+        return;
+    }
+
+    mis->transport_data = rioc;
+    mis->transport_cleanup = object_unref;
+
+    qio_channel_set_blocking(QIO_CHANNEL(rioc), false, NULL);
+    co = qemu_coroutine_create(rdma_accept_incoming_migration, rioc);
+    aio_co_schedule(qemu_get_current_aio_context(), co);
+}
diff --git a/migration/rdma.h b/migration/rdma.h
new file mode 100644
index 0000000000..4c3eb9a972
--- /dev/null
+++ b/migration/rdma.h
@@ -0,0 +1,24 @@
+/*
+ * QEMU live migration via RDMA
+ *
+ * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
+ *
+ * Authors:
+ *  Jialin Wang <wangjialin23@huawei.com>
+ *  Gonglei <arei.gonglei@huawei.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_MIGRATION_RDMA_H
+#define QEMU_MIGRATION_RDMA_H
+
+#include "qemu/sockets.h"
+
+void rdma_start_outgoing_migration(MigrationState *s, InetSocketAddress *addr,
+                                   Error **errp);
+
+void rdma_start_incoming_migration(InetSocketAddress *addr, Error **errp);
+
+#endif /* QEMU_MIGRATION_RDMA_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 6/6] migration/rdma: support multifd for RDMA migration
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (4 preceding siblings ...)
  2024-06-04 12:14 ` [PATCH 5/6] migration: introduce new RDMA live migration Gonglei via
@ 2024-06-04 12:14 ` Gonglei via
  2024-06-04 19:32 ` [PATCH 0/6] refactor RDMA live migration based on rsocket API Peter Xu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 55+ messages in thread
From: Gonglei via @ 2024-06-04 12:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, arei.gonglei, jinpu.wang, Jialin Wang

From: Jialin Wang <wangjialin23@huawei.com>

Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 migration/multifd.c | 10 ++++++++++
 migration/rdma.c    | 27 +++++++++++++++++++++++++++
 migration/rdma.h    |  6 ++++++
 3 files changed, 43 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index f317bff077..cee9858ad1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -32,6 +32,7 @@
 #include "io/channel-file.h"
 #include "io/channel-socket.h"
 #include "yank_functions.h"
+#include "rdma.h"
 
 /* Multiple fd's */
 
@@ -793,6 +794,9 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
 static void multifd_send_cleanup_state(void)
 {
     file_cleanup_outgoing_migration();
+#ifdef CONFIG_RDMA
+    rdma_cleanup_outgoing_migration();
+#endif
     socket_cleanup_outgoing_migration();
     qemu_sem_destroy(&multifd_send_state->channels_created);
     qemu_sem_destroy(&multifd_send_state->channels_ready);
@@ -1139,6 +1143,12 @@ static bool multifd_new_send_channel_create(gpointer opaque, Error **errp)
         return file_send_channel_create(opaque, errp);
     }
 
+#ifdef CONFIG_RDMA
+    if (rdma_send_channel_create(multifd_new_send_channel_async, opaque)) {
+        return true;
+    }
+#endif
+
     socket_send_channel_create(multifd_new_send_channel_async, opaque);
     return true;
 }
diff --git a/migration/rdma.c b/migration/rdma.c
index 09a4de7f59..af4d2b5a5a 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -19,6 +19,7 @@
 #include "qapi/qapi-visit-sockets.h"
 #include "channel.h"
 #include "migration.h"
+#include "options.h"
 #include "rdma.h"
 #include "trace.h"
 #include <stdio.h>
@@ -27,6 +28,28 @@ static struct RDMAOutgoingArgs {
     InetSocketAddress *addr;
 } outgoing_args;
 
+bool rdma_send_channel_create(QIOTaskFunc f, void *data)
+{
+    QIOChannelRDMA *rioc;
+
+    if (!outgoing_args.addr) {
+        return false;
+    }
+
+    rioc = qio_channel_rdma_new();
+    qio_channel_rdma_connect_async(rioc, outgoing_args.addr, f, data, NULL,
+                                   NULL);
+    return true;
+}
+
+void rdma_cleanup_outgoing_migration(void)
+{
+    if (outgoing_args.addr) {
+        qapi_free_InetSocketAddress(outgoing_args.addr);
+        outgoing_args.addr = NULL;
+    }
+}
+
 static void rdma_outgoing_migration(QIOTask *task, gpointer opaque)
 {
     MigrationState *s = opaque;
@@ -74,6 +97,10 @@ void rdma_start_incoming_migration(InetSocketAddress *addr, Error **errp)
 
     qio_channel_set_name(QIO_CHANNEL(rioc), "migration-rdma-listener");
 
+    if (migrate_multifd()) {
+        num = migrate_multifd_channels();
+    }
+
     if (qio_channel_rdma_listen_sync(rioc, addr, num, errp) < 0) {
         object_unref(OBJECT(rioc));
         return;
diff --git a/migration/rdma.h b/migration/rdma.h
index 4c3eb9a972..cefccac61c 100644
--- a/migration/rdma.h
+++ b/migration/rdma.h
@@ -16,6 +16,12 @@
 
 #include "qemu/sockets.h"
 
+#include <stdbool.h>
+
+bool rdma_send_channel_create(QIOTaskFunc f, void *data);
+
+void rdma_cleanup_outgoing_migration(void);
+
 void rdma_start_outgoing_migration(MigrationState *s, InetSocketAddress *addr,
                                    Error **errp);
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/6] migration: remove RDMA live migration temporarily
  2024-06-04 12:14 ` [PATCH 1/6] migration: remove RDMA live migration temporarily Gonglei via
@ 2024-06-04 14:01   ` David Hildenbrand
  2024-06-05 10:02     ` Gonglei (Arei) via
  2024-06-10 11:45   ` Markus Armbruster
  1 sibling, 1 reply; 55+ messages in thread
From: David Hildenbrand @ 2024-06-04 14:01 UTC (permalink / raw)
  To: Gonglei, qemu-devel
  Cc: peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang

On 04.06.24 14:14, Gonglei via wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> The new RDMA live migration will be introduced in the upcoming
> few commits.
> 
> Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---

[...]

> -
> -    /* Avoid ram_block_discard_disable(), cannot change during migration. */
> -    if (ram_block_discard_is_required()) {
> -        error_setg(errp, "RDMA: cannot disable RAM discard");
> -        return;
> -    }

I'm particularly interested in the interaction with 
virtio-balloon/virtio-mem.

Do we still have to disable discarding of RAM, and where would you do 
that in the rewrite?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (5 preceding siblings ...)
  2024-06-04 12:14 ` [PATCH 6/6] migration/rdma: support multifd for RDMA migration Gonglei via
@ 2024-06-04 19:32 ` Peter Xu
  2024-06-05 10:09   ` Gonglei (Arei) via
  2024-06-07 10:06   ` Daniel P. Berrangé
  2024-06-05  7:57 ` Michael S. Tsirkin
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 55+ messages in thread
From: Peter Xu @ 2024-06-04 19:32 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang, Fabiano Rosas

Hi, Lei, Jialin,

Thanks a lot for working on this!

I think we'll need to wait a bit on feedbacks from Jinpu and his team on
RDMA side, also Daniel for iochannels.  Also, please remember to copy
Fabiano Rosas in any relevant future posts.  We'd also like to know whether
he has any comments too.  I have him copied in this reply.

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!

It'll be good to elaborate if you tested it in-house. What people should
expect on the numbers exactly?  Is that okay from Huawei's POV?

Besides that, the code looks pretty good at a first glance to me.  Before
others chim in, here're some high level comments..

Firstly, can we avoid using coroutine when listen()?  Might be relevant
when I see that rdma_accept_incoming_migration() runs in a loop to do
raccept(), but would that also hang the qemu main loop even with the
coroutine, before all channels are ready?  I'm not a coroutine person, but
I think the hope is that we can make dest QEMU run in a thread in the
future just like the src QEMU, so the less coroutine the better in this
path.

I think I also left a comment elsewhere on whether it would be possible to
allow iochannels implement their own poll() functions to avoid the
per-channel poll thread that is proposed in this series.

https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n

Personally I think even with the thread proposal it's better than the old
rdma code, but I just still want to double check with you guys.  E.g.,
maybe that just won't work at all?  Again, that'll also be based on the
fact that we move migration incoming into a thread first to keep the dest
QEMU main loop intact, I think, but I hope we will reach that irrelevant of
rdma, IOW it'll be nice to happen even earlier if possible.

Another nitpick is that qio_channel_rdma_listen_async() doesn't look used
and may prone to removal.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (6 preceding siblings ...)
  2024-06-04 19:32 ` [PATCH 0/6] refactor RDMA live migration based on rsocket API Peter Xu
@ 2024-06-05  7:57 ` Michael S. Tsirkin
  2024-06-05 10:00   ` Gonglei (Arei) via
  2024-06-07  5:53 ` Jinpu Wang
  2024-08-27 20:15 ` Peter Xu
  9 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2024-06-05  7:57 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	berrange, armbru, lizhijian, pbonzini, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!

So you didn't test it with an RDMA card?
You really should test with an RDMA card though, for correctness
as much as performance.


> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration
> 
>  docs/rdma.txt                     |  420 ---
>  include/io/channel-rdma.h         |  165 ++
>  io/channel-rdma.c                 |  798 ++++++
>  io/meson.build                    |    1 +
>  io/trace-events                   |   14 +
>  meson.build                       |    6 -
>  migration/meson.build             |    3 +-
>  migration/migration-stats.c       |    5 +-
>  migration/migration-stats.h       |    4 -
>  migration/migration.c             |   13 +-
>  migration/migration.h             |    9 -
>  migration/multifd.c               |   10 +
>  migration/options.c               |   16 -
>  migration/options.h               |    2 -
>  migration/qemu-file.c             |    1 -
>  migration/ram.c                   |   90 +-
>  migration/rdma.c                  | 4205 +----------------------------
>  migration/rdma.h                  |   67 +-
>  migration/savevm.c                |    2 +-
>  migration/trace-events            |   68 +-
>  qapi/migration.json               |   13 +-
>  scripts/analyze-migration.py      |    3 -
>  tests/unit/meson.build            |    1 +
>  tests/unit/test-io-channel-rdma.c |  276 ++
>  24 files changed, 1360 insertions(+), 4832 deletions(-)
>  delete mode 100644 docs/rdma.txt
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>  create mode 100644 tests/unit/test-io-channel-rdma.c
> 
> -- 
> 2.43.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-05  7:57 ` Michael S. Tsirkin
@ 2024-06-05 10:00   ` Gonglei (Arei) via
  2024-06-05 10:23     ` Michael S. Tsirkin
                       ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-05 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel@nongnu.org, peterx@redhat.com, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin



> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst@redhat.com]
> Sent: Wednesday, June 5, 2024 3:57 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory
> > copies, which can be imagined that there will be a certain performance
> > degradation, hoping that friends with RDMA network cards can help verify,
> thank you!
> 
> So you didn't test it with an RDMA card?

Yep, we tested it by Soft-ROCE.

> You really should test with an RDMA card though, for correctness as much as
> performance.
> 
We will, we just don't have RDMA cards environment on hand at the moment.

Regards,
-Gonglei

> 
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> >
> >  docs/rdma.txt                     |  420 ---
> >  include/io/channel-rdma.h         |  165 ++
> >  io/channel-rdma.c                 |  798 ++++++
> >  io/meson.build                    |    1 +
> >  io/trace-events                   |   14 +
> >  meson.build                       |    6 -
> >  migration/meson.build             |    3 +-
> >  migration/migration-stats.c       |    5 +-
> >  migration/migration-stats.h       |    4 -
> >  migration/migration.c             |   13 +-
> >  migration/migration.h             |    9 -
> >  migration/multifd.c               |   10 +
> >  migration/options.c               |   16 -
> >  migration/options.h               |    2 -
> >  migration/qemu-file.c             |    1 -
> >  migration/ram.c                   |   90 +-
> >  migration/rdma.c                  | 4205 +----------------------------
> >  migration/rdma.h                  |   67 +-
> >  migration/savevm.c                |    2 +-
> >  migration/trace-events            |   68 +-
> >  qapi/migration.json               |   13 +-
> >  scripts/analyze-migration.py      |    3 -
> >  tests/unit/meson.build            |    1 +
> >  tests/unit/test-io-channel-rdma.c |  276 ++
> >  24 files changed, 1360 insertions(+), 4832 deletions(-)  delete mode
> > 100644 docs/rdma.txt  create mode 100644 include/io/channel-rdma.h
> > create mode 100644 io/channel-rdma.c  create mode 100644
> > tests/unit/test-io-channel-rdma.c
> >
> > --
> > 2.43.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 1/6] migration: remove RDMA live migration temporarily
  2024-06-04 14:01   ` David Hildenbrand
@ 2024-06-05 10:02     ` Gonglei (Arei) via
  0 siblings, 0 replies; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-05 10:02 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel@nongnu.org
  Cc: peterx@redhat.com, yu.zhang@ionos.com, mgalaxy@akamai.com,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin



> -----Original Message-----
> From: David Hildenbrand [mailto:david@redhat.com]
> Sent: Tuesday, June 4, 2024 10:02 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org
> Cc: peterx@redhat.com; yu.zhang@ionos.com; mgalaxy@akamai.com;
> elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 1/6] migration: remove RDMA live migration temporarily
> 
> On 04.06.24 14:14, Gonglei via wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > The new RDMA live migration will be introduced in the upcoming few
> > commits.
> >
> > Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> > Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > ---
> 
> [...]
> 
> > -
> > -    /* Avoid ram_block_discard_disable(), cannot change during migration.
> */
> > -    if (ram_block_discard_is_required()) {
> > -        error_setg(errp, "RDMA: cannot disable RAM discard");
> > -        return;
> > -    }
> 
> I'm particularly interested in the interaction with virtio-balloon/virtio-mem.
> 
> Do we still have to disable discarding of RAM, and where would you do that in
> the rewrite?
> 

Yes, we do. We didn't change the logic. Thanks for your catching.

Regards,
-Gonglei

> --
> Cheers,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-04 19:32 ` [PATCH 0/6] refactor RDMA live migration based on rsocket API Peter Xu
@ 2024-06-05 10:09   ` Gonglei (Arei) via
  2024-06-05 14:18     ` Peter Xu
  2024-06-07 10:06   ` Daniel P. Berrangé
  1 sibling, 1 reply; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-05 10:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel@nongnu.org, yu.zhang@ionos.com, mgalaxy@akamai.com,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin, Fabiano Rosas

Hi Peter,

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, June 5, 2024 3:32 AM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com; mgalaxy@akamai.com;
> elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi, Lei, Jialin,
> 
> Thanks a lot for working on this!
> 
> I think we'll need to wait a bit on feedbacks from Jinpu and his team on RDMA
> side, also Daniel for iochannels.  Also, please remember to copy Fabiano
> Rosas in any relevant future posts.  We'd also like to know whether he has any
> comments too.  I have him copied in this reply.
> 
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory
> > copies, which can be imagined that there will be a certain performance
> > degradation, hoping that friends with RDMA network cards can help verify,
> thank you!
> 
> It'll be good to elaborate if you tested it in-house. What people should expect
> on the numbers exactly?  Is that okay from Huawei's POV?
> 
> Besides that, the code looks pretty good at a first glance to me.  Before
> others chim in, here're some high level comments..
> 
> Firstly, can we avoid using coroutine when listen()?  Might be relevant when I
> see that rdma_accept_incoming_migration() runs in a loop to do raccept(), but
> would that also hang the qemu main loop even with the coroutine, before all
> channels are ready?  I'm not a coroutine person, but I think the hope is that
> we can make dest QEMU run in a thread in the future just like the src QEMU, so
> the less coroutine the better in this path.
> 

Because rsocket is set to non-blocking, raccept will return EAGAIN when no connection 
is received, coroutine will yield, and will not hang the qemu main loop.

> I think I also left a comment elsewhere on whether it would be possible to allow
> iochannels implement their own poll() functions to avoid the per-channel poll
> thread that is proposed in this series.
> 
> https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> 

We noticed that, and it's a big operation. I'm not sure that's a better way.

> Personally I think even with the thread proposal it's better than the old rdma
> code, but I just still want to double check with you guys.  E.g., maybe that just
> won't work at all?  Again, that'll also be based on the fact that we move
> migration incoming into a thread first to keep the dest QEMU main loop intact,
> I think, but I hope we will reach that irrelevant of rdma, IOW it'll be nice to
> happen even earlier if possible.
> 
Yep. This is a fairly big change, I wonder what other people's suggestions are?

> Another nitpick is that qio_channel_rdma_listen_async() doesn't look used and
> may prone to removal.
> 

Yes. This is because when we wrote the test case, we wanted to test qio_channel_rdma_connect_async, 
and also I added qio_channel_rdma_listen_async. It is not used in the RDMA hot migration code.

Regards,
-Gonglei


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-05 10:00   ` Gonglei (Arei) via
@ 2024-06-05 10:23     ` Michael S. Tsirkin
  2024-06-06 11:31     ` Leon Romanovsky
  2024-06-07 16:24     ` Yu Zhang
  2 siblings, 0 replies; 55+ messages in thread
From: Michael S. Tsirkin @ 2024-06-05 10:23 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: qemu-devel@nongnu.org, peterx@redhat.com, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

On Wed, Jun 05, 2024 at 10:00:24AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > 
> > So you didn't test it with an RDMA card?
> 
> Yep, we tested it by Soft-ROCE.
> 
> > You really should test with an RDMA card though, for correctness as much as
> > performance.
> > 
> We will, we just don't have RDMA cards environment on hand at the moment.
> 
> Regards,
> -Gonglei

Until it's tested on real hardware it is probably best to tag this
series as RFC in the subject.

> > 
> > > Jialin Wang (6):
> > >   migration: remove RDMA live migration temporarily
> > >   io: add QIOChannelRDMA class
> > >   io/channel-rdma: support working in coroutine
> > >   tests/unit: add test-io-channel-rdma.c
> > >   migration: introduce new RDMA live migration
> > >   migration/rdma: support multifd for RDMA migration
> > >
> > >  docs/rdma.txt                     |  420 ---
> > >  include/io/channel-rdma.h         |  165 ++
> > >  io/channel-rdma.c                 |  798 ++++++
> > >  io/meson.build                    |    1 +
> > >  io/trace-events                   |   14 +
> > >  meson.build                       |    6 -
> > >  migration/meson.build             |    3 +-
> > >  migration/migration-stats.c       |    5 +-
> > >  migration/migration-stats.h       |    4 -
> > >  migration/migration.c             |   13 +-
> > >  migration/migration.h             |    9 -
> > >  migration/multifd.c               |   10 +
> > >  migration/options.c               |   16 -
> > >  migration/options.h               |    2 -
> > >  migration/qemu-file.c             |    1 -
> > >  migration/ram.c                   |   90 +-
> > >  migration/rdma.c                  | 4205 +----------------------------
> > >  migration/rdma.h                  |   67 +-
> > >  migration/savevm.c                |    2 +-
> > >  migration/trace-events            |   68 +-
> > >  qapi/migration.json               |   13 +-
> > >  scripts/analyze-migration.py      |    3 -
> > >  tests/unit/meson.build            |    1 +
> > >  tests/unit/test-io-channel-rdma.c |  276 ++
> > >  24 files changed, 1360 insertions(+), 4832 deletions(-)  delete mode
> > > 100644 docs/rdma.txt  create mode 100644 include/io/channel-rdma.h
> > > create mode 100644 io/channel-rdma.c  create mode 100644
> > > tests/unit/test-io-channel-rdma.c
> > >
> > > --
> > > 2.43.0



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-05 10:09   ` Gonglei (Arei) via
@ 2024-06-05 14:18     ` Peter Xu
  2024-06-07  8:49       ` Gonglei (Arei) via
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Xu @ 2024-06-05 14:18 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: qemu-devel@nongnu.org, yu.zhang@ionos.com, mgalaxy@akamai.com,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin, Fabiano Rosas

On Wed, Jun 05, 2024 at 10:09:43AM +0000, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:32 AM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com; mgalaxy@akamai.com;
> > elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> > berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> > pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi, Lei, Jialin,
> > 
> > Thanks a lot for working on this!
> > 
> > I think we'll need to wait a bit on feedbacks from Jinpu and his team on RDMA
> > side, also Daniel for iochannels.  Also, please remember to copy Fabiano
> > Rosas in any relevant future posts.  We'd also like to know whether he has any
> > comments too.  I have him copied in this reply.
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > 
> > It'll be good to elaborate if you tested it in-house. What people should expect
> > on the numbers exactly?  Is that okay from Huawei's POV?
> > 
> > Besides that, the code looks pretty good at a first glance to me.  Before
> > others chim in, here're some high level comments..
> > 
> > Firstly, can we avoid using coroutine when listen()?  Might be relevant when I
> > see that rdma_accept_incoming_migration() runs in a loop to do raccept(), but
> > would that also hang the qemu main loop even with the coroutine, before all
> > channels are ready?  I'm not a coroutine person, but I think the hope is that
> > we can make dest QEMU run in a thread in the future just like the src QEMU, so
> > the less coroutine the better in this path.
> > 
> 
> Because rsocket is set to non-blocking, raccept will return EAGAIN when no connection 
> is received, coroutine will yield, and will not hang the qemu main loop.

Ah that's ok.  And also I just noticed it may not be a big deal either as
long as we're before migration_incoming_process().

I'm wondering whether it can do it similarly like what we do with sockets
in qio_net_listener_set_client_func_full().  After all, rsocket wants to
mimic the socket API.  It'll make sense if rsocket code tries to match with
socket, or even reuse.

> 
> > I think I also left a comment elsewhere on whether it would be possible to allow
> > iochannels implement their own poll() functions to avoid the per-channel poll
> > thread that is proposed in this series.
> > 
> > https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> > 
> 
> We noticed that, and it's a big operation. I'm not sure that's a better way.
> 
> > Personally I think even with the thread proposal it's better than the old rdma
> > code, but I just still want to double check with you guys.  E.g., maybe that just
> > won't work at all?  Again, that'll also be based on the fact that we move
> > migration incoming into a thread first to keep the dest QEMU main loop intact,
> > I think, but I hope we will reach that irrelevant of rdma, IOW it'll be nice to
> > happen even earlier if possible.
> > 
> Yep. This is a fairly big change, I wonder what other people's suggestions are?

Yes we can wait for others' opinions.  And btw I'm not asking for it and I
don't think it'll be a blocker for this approach to land, as I said this is
better than the current code so it's definitely an improvement to me.

I'm purely curious, because if you're not going to do it for rdma, maybe
someday I'll try to do that, and I want to know what "big change" could be
as I didn't dig further.  It may help me by sharing what issues you've found.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-05 10:00   ` Gonglei (Arei) via
  2024-06-05 10:23     ` Michael S. Tsirkin
@ 2024-06-06 11:31     ` Leon Romanovsky
  2024-06-07  1:04       ` Zhijian Li (Fujitsu) via
  2024-06-07 16:24     ` Yu Zhang
  2 siblings, 1 reply; 55+ messages in thread
From: Leon Romanovsky @ 2024-06-06 11:31 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Michael S. Tsirkin, qemu-devel@nongnu.org, peterx@redhat.com,
	yu.zhang@ionos.com, mgalaxy@akamai.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

On Wed, Jun 05, 2024 at 10:00:24AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > 
> > So you didn't test it with an RDMA card?
> 
> Yep, we tested it by Soft-ROCE.

Does Soft-RoCE (RXE) support live migration?

Thanks


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
  2024-06-04 12:14 ` [PATCH 3/6] io/channel-rdma: support working in coroutine Gonglei via
@ 2024-06-06 13:34   ` Haris Iqbal
  2024-06-07  8:45     ` Gonglei (Arei) via
  2024-06-07  9:04   ` Daniel P. Berrangé
  1 sibling, 1 reply; 55+ messages in thread
From: Haris Iqbal @ 2024-06-06 13:34 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	berrange, armbru, lizhijian, pbonzini, mst, xiexiangyou,
	linux-rdma, lixiao91, jinpu.wang, Jialin Wang

On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
>
> From: Jialin Wang <wangjialin23@huawei.com>
>
> It is not feasible to obtain RDMA completion queue notifications
> through poll/ppoll on the rsocket fd. Therefore, we create a thread
> named rpoller for each rsocket fd and two eventfds: pollin_eventfd
> and pollout_eventfd.
>
> When using io_create_watch or io_set_aio_fd_handler waits for POLLIN
> or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
> and pollout_eventfd instead of the rsocket fd.
>
> The rpoller rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> events.
> When a POLLIN event occurs, the rpoller write the pollin_eventfd,
> and then poll/ppoll will return the POLLIN event.
> When a POLLOUT event occurs, the rpoller read the pollout_eventfd,
> and then poll/ppoll will return the POLLOUT event.
>
> For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> returning POLLIN/POLLOUT events.
>
> Known limitations:
>
>   For a blocking rsocket fd, if we use io_create_watch to wait for
>   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
>   cannot determine when it is not ready to read/write as we can with
>   non-blocking fds. Therefore, when an event occurs, it will occurs
>   always, potentially leave the qemu hanging. So we need be cautious
>   to avoid hanging when using io_create_watch .
>
> Luckily, channel-rdma works well in coroutines :)
>
> Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  include/io/channel-rdma.h |  15 +-
>  io/channel-rdma.c         | 363 +++++++++++++++++++++++++++++++++++++-
>  2 files changed, 376 insertions(+), 2 deletions(-)
>
> diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> index 8cab2459e5..cb56127d76 100644
> --- a/include/io/channel-rdma.h
> +++ b/include/io/channel-rdma.h
> @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
>      socklen_t localAddrLen;
>      struct sockaddr_storage remoteAddr;
>      socklen_t remoteAddrLen;
> +
> +    /* private */
> +
> +    /* qemu g_poll/ppoll() POLLIN event on it */
> +    int pollin_eventfd;
> +    /* qemu g_poll/ppoll() POLLOUT event on it */
> +    int pollout_eventfd;
> +
> +    /* the index in the rpoller's fds array */
> +    int index;
> +    /* rpoller will rpoll() rpoll_events on the rsocket fd */
> +    short int rpoll_events;
>  };
>
>  /**
> @@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
>   *
>   * Returns: the new client channel, or NULL on error
>   */
> -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
> +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> +                                                           Error **errp);
>
>  #endif /* QIO_CHANNEL_RDMA_H */
> diff --git a/io/channel-rdma.c b/io/channel-rdma.c
> index 92c362df52..9792add5cf 100644
> --- a/io/channel-rdma.c
> +++ b/io/channel-rdma.c
> @@ -23,10 +23,15 @@
>
>  #include "qemu/osdep.h"
>  #include "io/channel-rdma.h"
> +#include "io/channel-util.h"
> +#include "io/channel-watch.h"
>  #include "io/channel.h"
>  #include "qapi/clone-visitor.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-visit-sockets.h"
> +#include "qemu/atomic.h"
> +#include "qemu/error-report.h"
> +#include "qemu/thread.h"
>  #include "trace.h"
>  #include <errno.h>
>  #include <netdb.h>
> @@ -39,11 +44,274 @@
>  #include <sys/poll.h>
>  #include <unistd.h>
>
> +typedef enum {
> +    CLEAR_POLLIN,
> +    CLEAR_POLLOUT,
> +    SET_POLLIN,
> +    SET_POLLOUT,
> +} UpdateEvent;
> +
> +typedef enum {
> +    RP_CMD_ADD_IOC,
> +    RP_CMD_DEL_IOC,
> +    RP_CMD_UPDATE,
> +} RpollerCMD;
> +
> +typedef struct {
> +    RpollerCMD cmd;
> +    QIOChannelRDMA *rioc;
> +} RpollerMsg;
> +
> +/*
> + * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT event
> + * occurs, it will write/read the pollin_eventfd/pollout_eventfd to allow
> + * qemu g_poll/ppoll() get the POLLIN/POLLOUT event
> + */
> +static struct Rpoller {
> +    QemuThread thread;
> +    bool is_running;
> +    int sock[2];
> +    int count; /* the number of rsocket fds being rpoll() */
> +    int size; /* the size of fds/riocs */
> +    struct pollfd *fds;
> +    QIOChannelRDMA **riocs;
> +} rpoller;
> +
> +static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
> +                                            RpollerCMD cmd)
> +{
> +    RpollerMsg msg;
> +    int ret;
> +
> +    msg.cmd = cmd;
> +    msg.rioc = rioc;
> +
> +    ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
> +    if (ret != sizeof msg) {
> +        error_report("%s: failed to send msg, errno: %d", __func__, errno);
> +    }
> +}
> +
> +static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
> +                                               UpdateEvent action,
> +                                               bool notify_rpoller)
> +{
> +    /* An eventfd with the value of ULLONG_MAX - 1 is readable but unwritable */
> +    unsigned long long buf = ULLONG_MAX - 1;
> +
> +    switch (action) {
> +    /* only rpoller do SET_* action, to allow qemu ppoll() get the event */
> +    case SET_POLLIN:
> +        RETRY_ON_EINTR(write(rioc->pollin_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events &= ~POLLIN;
> +        break;
> +    case SET_POLLOUT:
> +        RETRY_ON_EINTR(read(rioc->pollout_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events &= ~POLLOUT;
> +        break;
> +
> +    /* the rsocket fd is not ready to rread/rwrite */
> +    case CLEAR_POLLIN:
> +        RETRY_ON_EINTR(read(rioc->pollin_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events |= POLLIN;
> +        break;
> +    case CLEAR_POLLOUT:
> +        RETRY_ON_EINTR(write(rioc->pollout_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events |= POLLOUT;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* notify rpoller to rpoll() POLLIN/POLLOUT events */
> +    if (notify_rpoller) {
> +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_UPDATE);
> +    }
> +}
> +
> +static void qio_channel_rdma_rpoller_add_rioc(QIOChannelRDMA *rioc)
> +{
> +    if (rioc->index != -1) {
> +        error_report("%s: rioc already exsits", __func__);
> +        return;
> +    }
> +
> +    rioc->index = ++rpoller.count;
> +
> +    if (rpoller.count + 1 > rpoller.size) {
> +        rpoller.size *= 2;
> +        rpoller.fds = g_renew(struct pollfd, rpoller.fds, rpoller.size);
> +        rpoller.riocs = g_renew(QIOChannelRDMA *, rpoller.riocs, rpoller.size);
> +    }
> +
> +    rpoller.fds[rioc->index].fd = rioc->fd;
> +    rpoller.fds[rioc->index].events = rioc->rpoll_events;

The allotment of rioc fds and events to rpoller slots are sequential,
but making the deletion also sequentials means that the del_rioc needs
to be called in the exact opposite sequence as they were added
(through add_rioc). Otherwise we leaves holes in between, and
readditions might step on an already used slot.

Does this setup make sure that the above restriction is satisfied, or
am I missing something?

> +    rpoller.riocs[rioc->index] = rioc;
> +}
> +
> +static void qio_channel_rdma_rpoller_del_rioc(QIOChannelRDMA *rioc)
> +{
> +    if (rioc->index == -1) {
> +        error_report("%s: rioc not exsits", __func__);
> +        return;
> +    }
> +
> +    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];

Should this be rpoller.count-1?

> +    rpoller.riocs[rioc->index] = rpoller.riocs[rpoller.count];
> +    rpoller.riocs[rioc->index]->index = rioc->index;
> +    rpoller.count--;
> +
> +    close(rioc->pollin_eventfd);
> +    close(rioc->pollout_eventfd);
> +    rioc->index = -1;
> +    rioc->rpoll_events = 0;
> +}
> +
> +static void qio_channel_rdma_rpoller_update_ioc(QIOChannelRDMA *rioc)
> +{
> +    if (rioc->index == -1) {
> +        error_report("%s: rioc not exsits", __func__);
> +        return;
> +    }
> +
> +    rpoller.fds[rioc->index].fd = rioc->fd;
> +    rpoller.fds[rioc->index].events = rioc->rpoll_events;
> +}
> +
> +static void qio_channel_rdma_rpoller_process_msg(void)
> +{
> +    RpollerMsg msg;
> +    int ret;
> +
> +    ret = RETRY_ON_EINTR(read(rpoller.sock[1], &msg, sizeof msg));
> +    if (ret != sizeof msg) {
> +        error_report("%s: rpoller failed to recv msg: %s", __func__,
> +                     strerror(errno));
> +        return;
> +    }
> +
> +    switch (msg.cmd) {
> +    case RP_CMD_ADD_IOC:
> +        qio_channel_rdma_rpoller_add_rioc(msg.rioc);
> +        break;
> +    case RP_CMD_DEL_IOC:
> +        qio_channel_rdma_rpoller_del_rioc(msg.rioc);
> +        break;
> +    case RP_CMD_UPDATE:
> +        qio_channel_rdma_rpoller_update_ioc(msg.rioc);
> +        break;
> +    default:
> +        break;
> +    }
> +}
> +
> +static void qio_channel_rdma_rpoller_cleanup(void)
> +{
> +    close(rpoller.sock[0]);
> +    close(rpoller.sock[1]);
> +    rpoller.sock[0] = -1;
> +    rpoller.sock[1] = -1;
> +    g_free(rpoller.fds);
> +    g_free(rpoller.riocs);
> +    rpoller.fds = NULL;
> +    rpoller.riocs = NULL;
> +    rpoller.count = 0;
> +    rpoller.size = 0;
> +    rpoller.is_running = false;
> +}
> +
> +static void *qio_channel_rdma_rpoller_thread(void *opaque)
> +{
> +    int i, ret, error_events = POLLERR | POLLHUP | POLLNVAL;
> +
> +    do {
> +        ret = rpoll(rpoller.fds, rpoller.count + 1, -1);
> +        if (ret < 0 && errno != -EINTR) {
> +            error_report("%s: rpoll() error: %s", __func__, strerror(errno));
> +            break;
> +        }
> +
> +        for (i = 1; i <= rpoller.count; i++) {
> +            if (rpoller.fds[i].revents & (POLLIN | error_events)) {
> +                qio_channel_rdma_update_poll_event(rpoller.riocs[i], SET_POLLIN,
> +                                                   false);
> +                rpoller.fds[i].events &= ~POLLIN;
> +            }
> +            if (rpoller.fds[i].revents & (POLLOUT | error_events)) {
> +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> +                                                   SET_POLLOUT, false);
> +                rpoller.fds[i].events &= ~POLLOUT;
> +            }
> +            /* ignore this fd */
> +            if (rpoller.fds[i].revents & (error_events)) {
> +                rpoller.fds[i].fd = -1;
> +            }
> +        }
> +
> +        if (rpoller.fds[0].revents) {
> +            qio_channel_rdma_rpoller_process_msg();
> +        }
> +    } while (rpoller.count >= 1);
> +
> +    qio_channel_rdma_rpoller_cleanup();
> +
> +    return NULL;
> +}
> +
> +static void qio_channel_rdma_rpoller_start(void)
> +{
> +    if (qatomic_xchg(&rpoller.is_running, true)) {
> +        return;
> +    }
> +
> +    if (qemu_socketpair(AF_UNIX, SOCK_STREAM, 0, rpoller.sock)) {
> +        rpoller.is_running = false;
> +        error_report("%s: failed to create socketpair %s", __func__,
> +                     strerror(errno));
> +        return;
> +    }
> +
> +    rpoller.count = 0;
> +    rpoller.size = 4;
> +    rpoller.fds = g_malloc0_n(rpoller.size, sizeof(struct pollfd));
> +    rpoller.riocs = g_malloc0_n(rpoller.size, sizeof(QIOChannelRDMA *));
> +    rpoller.fds[0].fd = rpoller.sock[1];
> +    rpoller.fds[0].events = POLLIN;
> +
> +    qemu_thread_create(&rpoller.thread, "qio-channel-rdma-rpoller",
> +                       qio_channel_rdma_rpoller_thread, NULL,
> +                       QEMU_THREAD_JOINABLE);
> +}
> +
> +static void qio_channel_rdma_add_rioc_to_rpoller(QIOChannelRDMA *rioc)
> +{
> +    int flags = EFD_CLOEXEC | EFD_NONBLOCK;
> +
> +    /*
> +     * A single eventfd is either readable or writable. A single eventfd cannot
> +     * represent a state where it is neither readable nor writable. so use two
> +     * eventfds here.
> +     */
> +    rioc->pollin_eventfd = eventfd(0, flags);
> +    rioc->pollout_eventfd = eventfd(0, flags);
> +    /* pollout_eventfd with the value 0, means writable, make it unwritable */
> +    qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, false);
> +
> +    /* tell the rpoller to rpoll() events on rioc->socketfd */
> +    rioc->rpoll_events = POLLIN | POLLOUT;
> +    qio_channel_rdma_notify_rpoller(rioc, RP_CMD_ADD_IOC);
> +}
> +
>  QIOChannelRDMA *qio_channel_rdma_new(void)
>  {
>      QIOChannelRDMA *rioc;
>      QIOChannel *ioc;
>
> +    qio_channel_rdma_rpoller_start();
> +    if (!rpoller.is_running) {
> +        return NULL;
> +    }
> +
>      rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
>      ioc = QIO_CHANNEL(rioc);
>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@ -125,6 +393,8 @@ retry:
>          goto out;
>      }
>
> +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> +
>  out:
>      if (ret) {
>          trace_qio_channel_rdma_connect_fail(rioc);
> @@ -211,6 +481,8 @@ int qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
>      qio_channel_set_feature(QIO_CHANNEL(rioc), QIO_CHANNEL_FEATURE_LISTEN);
>      trace_qio_channel_rdma_listen_complete(rioc, fd);
>
> +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> +
>  out:
>      if (ret) {
>          trace_qio_channel_rdma_listen_fail(rioc);
> @@ -267,8 +539,10 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
>                             qio_channel_listen_worker_free, context);
>  }
>
> -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc, Error **errp)
> +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> +                                                           Error **errp)
>  {
> +    QIOChannel *ioc = QIO_CHANNEL(rioc);
>      QIOChannelRDMA *cioc;
>
>      cioc = qio_channel_rdma_new();
> @@ -283,6 +557,17 @@ retry:
>          if (errno == EINTR) {
>              goto retry;
>          }
> +        if (errno == EAGAIN) {
> +            if (!(rioc->rpoll_events & POLLIN)) {
> +                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLIN, true);
> +            }
> +            if (qemu_in_coroutine()) {
> +                qio_channel_yield(ioc, G_IO_IN);
> +            } else {
> +                qio_channel_wait(ioc, G_IO_IN);
> +            }
> +            goto retry;
> +        }
>          error_setg_errno(errp, errno, "Unable to accept connection");
>          goto error;
>      }
> @@ -294,6 +579,8 @@ retry:
>          goto error;
>      }
>
> +    qio_channel_rdma_add_rioc_to_rpoller(cioc);
> +
>      trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
>      return cioc;
>
> @@ -307,6 +594,10 @@ static void qio_channel_rdma_init(Object *obj)
>  {
>      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
>      ioc->fd = -1;
> +    ioc->pollin_eventfd = -1;
> +    ioc->pollout_eventfd = -1;
> +    ioc->index = -1;
> +    ioc->rpoll_events = 0;
>  }
>
>  static void qio_channel_rdma_finalize(Object *obj)
> @@ -314,6 +605,7 @@ static void qio_channel_rdma_finalize(Object *obj)
>      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
>
>      if (ioc->fd != -1) {
> +        qio_channel_rdma_notify_rpoller(ioc, RP_CMD_DEL_IOC);
>          rclose(ioc->fd);
>          ioc->fd = -1;
>      }
> @@ -330,6 +622,12 @@ static ssize_t qio_channel_rdma_readv(QIOChannel *ioc, const struct iovec *iov,
>  retry:
>      ret = rreadv(rioc->fd, iov, niov);
>      if (ret < 0) {
> +        if (errno == EAGAIN) {
> +            if (!(rioc->rpoll_events & POLLIN)) {
> +                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLIN, true);
> +            }
> +            return QIO_CHANNEL_ERR_BLOCK;
> +        }
>          if (errno == EINTR) {
>              goto retry;
>          }
> @@ -351,6 +649,12 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc, const struct iovec *iov,
>  retry:
>      ret = rwritev(rioc->fd, iov, niov);
>      if (ret <= 0) {
> +        if (errno == EAGAIN) {
> +            if (!(rioc->rpoll_events & POLLOUT)) {
> +                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, true);
> +            }
> +            return QIO_CHANNEL_ERR_BLOCK;
> +        }
>          if (errno == EINTR) {
>              goto retry;
>          }
> @@ -361,6 +665,28 @@ retry:
>      return ret;
>  }
>
> +static int qio_channel_rdma_set_blocking(QIOChannel *ioc, bool enabled,
> +                                         Error **errp G_GNUC_UNUSED)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +    int flags, ret;
> +
> +    flags = rfcntl(rioc->fd, F_GETFL);
> +    if (enabled) {
> +        flags &= ~O_NONBLOCK;
> +    } else {
> +        flags |= O_NONBLOCK;
> +    }
> +
> +    ret = rfcntl(rioc->fd, F_SETFL, flags);
> +    if (ret) {
> +        error_setg_errno(errp, errno,
> +                         "Unable to rfcntl rsocket fd with flags %d", flags);
> +    }
> +
> +    return ret;
> +}
> +
>  static void qio_channel_rdma_set_delay(QIOChannel *ioc, bool enabled)
>  {
>      QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> @@ -374,6 +700,7 @@ static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
>      QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>
>      if (rioc->fd != -1) {
> +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_DEL_IOC);
>          rclose(rioc->fd);
>          rioc->fd = -1;
>      }
> @@ -408,6 +735,37 @@ static int qio_channel_rdma_shutdown(QIOChannel *ioc, QIOChannelShutdown how,
>      return 0;
>  }
>
> +static void
> +qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc, AioContext *read_ctx,
> +                                    IOHandler *io_read, AioContext *write_ctx,
> +                                    IOHandler *io_write, void *opaque)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +
> +    qio_channel_util_set_aio_fd_handler(rioc->pollin_eventfd, read_ctx, io_read,
> +                                        rioc->pollout_eventfd, write_ctx,
> +                                        io_write, opaque);
> +}
> +
> +static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
> +                                              GIOCondition condition)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +
> +    switch (condition) {
> +    case G_IO_IN:
> +        return qio_channel_create_fd_watch(ioc, rioc->pollin_eventfd,
> +                                           condition);
> +    case G_IO_OUT:
> +        return qio_channel_create_fd_watch(ioc, rioc->pollout_eventfd,
> +                                           condition);
> +    default:
> +        error_report("%s: do not support watch 0x%x event", __func__,
> +                     condition);
> +        return NULL;
> +    }
> +}
> +
>  static void qio_channel_rdma_class_init(ObjectClass *klass,
>                                          void *class_data G_GNUC_UNUSED)
>  {
> @@ -415,9 +773,12 @@ static void qio_channel_rdma_class_init(ObjectClass *klass,
>
>      ioc_klass->io_writev = qio_channel_rdma_writev;
>      ioc_klass->io_readv = qio_channel_rdma_readv;
> +    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
>      ioc_klass->io_close = qio_channel_rdma_close;
>      ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
>      ioc_klass->io_set_delay = qio_channel_rdma_set_delay;
> +    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
> +    ioc_klass->io_set_aio_fd_handler = qio_channel_rdma_set_aio_fd_handler;
>  }
>
>  static const TypeInfo qio_channel_rdma_info = {
> --
> 2.43.0
>
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-06 11:31     ` Leon Romanovsky
@ 2024-06-07  1:04       ` Zhijian Li (Fujitsu) via
  0 siblings, 0 replies; 55+ messages in thread
From: Zhijian Li (Fujitsu) via @ 2024-06-07  1:04 UTC (permalink / raw)
  To: Leon Romanovsky, Gonglei (Arei)
  Cc: Michael S. Tsirkin, qemu-devel@nongnu.org, peterx@redhat.com,
	yu.zhang@ionos.com, mgalaxy@akamai.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin



On 06/06/2024 19:31, Leon Romanovsky wrote:
> On Wed, Jun 05, 2024 at 10:00:24AM +0000, Gonglei (Arei) wrote:
>>
>>
>>> -----Original Message-----
>>> From: Michael S. Tsirkin [mailto:mst@redhat.com]
>>> Sent: Wednesday, June 5, 2024 3:57 PM
>>> To: Gonglei (Arei) <arei.gonglei@huawei.com>
>>> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
>>> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
>>> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
>>> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
>>> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
>>> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
>>> <wangjialin23@huawei.com>
>>> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
>>>
>>> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
>>>> From: Jialin Wang <wangjialin23@huawei.com>
>>>>
>>>> Hi,
>>>>
>>>> This patch series attempts to refactor RDMA live migration by
>>>> introducing a new QIOChannelRDMA class based on the rsocket API.
>>>>
>>>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
>>>> that is a 1-1 match of the normal kernel 'sockets' API, which hides
>>>> the detail of rdma protocol into rsocket and allows us to add support
>>>> for some modern features like multifd more easily.
>>>>
>>>> Here is the previous discussion on refactoring RDMA live migration
>>>> using the rsocket API:
>>>>
>>>> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
>>>> o.org/
>>>>
>>>> We have encountered some bugs when using rsocket and plan to submit
>>>> them to the rdma-core community.
>>>>
>>>> In addition, the use of rsocket makes our programming more convenient,
>>>> but it must be noted that this method introduces multiple memory
>>>> copies, which can be imagined that there will be a certain performance
>>>> degradation, hoping that friends with RDMA network cards can help verify,
>>> thank you!
>>>
>>> So you didn't test it with an RDMA card?
>>
>> Yep, we tested it by Soft-ROCE.
> 
> Does Soft-RoCE (RXE) support live migration?


Yes, it does


Thanks
Zhijian

> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (7 preceding siblings ...)
  2024-06-05  7:57 ` Michael S. Tsirkin
@ 2024-06-07  5:53 ` Jinpu Wang
  2024-06-07  8:28   ` Gonglei (Arei) via
  2024-08-27 20:15 ` Peter Xu
  9 siblings, 1 reply; 55+ messages in thread
From: Jinpu Wang @ 2024-06-07  5:53 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	berrange, armbru, lizhijian, pbonzini, mst, xiexiangyou,
	linux-rdma, lixiao91, Jialin Wang

Hi Gonglei, hi folks on the list,

On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
>
> From: Jialin Wang <wangjialin23@huawei.com>
>
> Hi,
>
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
>
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
>
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
>
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
>
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
>
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!
First thx for the effort, we are running migration tests on our IB
fabric, different generation of HCA from mellanox, the migration works
ok,
there are a few failures,  Yu will share the result later separately.

The one blocker for the change is the old implementation and the new
rsocket implementation;
they don't talk to each other due to the effect of different wire
protocol during connection establishment.
eg the old RDMA migration has special control message during the
migration flow, which rsocket use a different control message, so
there lead to no way
to migrate VM using rdma transport pre to the rsocket patchset to a
new version with rsocket implementation.

Probably we should keep both implementation for a while, mark the old
implementation as deprecated, and promote the new implementation, and
high light in doc,
they are not compatible.

Regards!
Jinpu



>
> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration
>
>  docs/rdma.txt                     |  420 ---
>  include/io/channel-rdma.h         |  165 ++
>  io/channel-rdma.c                 |  798 ++++++
>  io/meson.build                    |    1 +
>  io/trace-events                   |   14 +
>  meson.build                       |    6 -
>  migration/meson.build             |    3 +-
>  migration/migration-stats.c       |    5 +-
>  migration/migration-stats.h       |    4 -
>  migration/migration.c             |   13 +-
>  migration/migration.h             |    9 -
>  migration/multifd.c               |   10 +
>  migration/options.c               |   16 -
>  migration/options.h               |    2 -
>  migration/qemu-file.c             |    1 -
>  migration/ram.c                   |   90 +-
>  migration/rdma.c                  | 4205 +----------------------------
>  migration/rdma.h                  |   67 +-
>  migration/savevm.c                |    2 +-
>  migration/trace-events            |   68 +-
>  qapi/migration.json               |   13 +-
>  scripts/analyze-migration.py      |    3 -
>  tests/unit/meson.build            |    1 +
>  tests/unit/test-io-channel-rdma.c |  276 ++
>  24 files changed, 1360 insertions(+), 4832 deletions(-)
>  delete mode 100644 docs/rdma.txt
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>  create mode 100644 tests/unit/test-io-channel-rdma.c
>
> --
> 2.43.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-07  5:53 ` Jinpu Wang
@ 2024-06-07  8:28   ` Gonglei (Arei) via
  2024-06-10 16:31     ` Peter Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-07  8:28 UTC (permalink / raw)
  To: Jinpu Wang
  Cc: qemu-devel@nongnu.org, peterx@redhat.com, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, mst@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), Wangjialin



> -----Original Message-----
> From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
> Sent: Friday, June 7, 2024 1:54 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; Wangjialin <wangjialin23@huawei.com>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi Gonglei, hi folks on the list,
> 
> On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
> >
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > Hi,
> >
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> >
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > the detail of rdma protocol into rsocket and allows us to add support
> > for some modern features like multifd more easily.
> >
> > Here is the previous discussion on refactoring RDMA live migration
> > using the rsocket API:
> >
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > o.org/
> >
> > We have encountered some bugs when using rsocket and plan to submit
> > them to the rdma-core community.
> >
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory
> > copies, which can be imagined that there will be a certain performance
> > degradation, hoping that friends with RDMA network cards can help verify,
> thank you!
> First thx for the effort, we are running migration tests on our IB fabric, different
> generation of HCA from mellanox, the migration works ok, there are a few
> failures,  Yu will share the result later separately.
> 

Thank you so much. 

> The one blocker for the change is the old implementation and the new rsocket
> implementation; they don't talk to each other due to the effect of different wire
> protocol during connection establishment.
> eg the old RDMA migration has special control message during the migration
> flow, which rsocket use a different control message, so there lead to no way to
> migrate VM using rdma transport pre to the rsocket patchset to a new version
> with rsocket implementation.
> 
> Probably we should keep both implementation for a while, mark the old
> implementation as deprecated, and promote the new implementation, and
> high light in doc, they are not compatible.
> 

IMO It makes sense. What's your opinion? @Peter.


Regards,
-Gonglei

> Regards!
> Jinpu
> 
> 
> 
> >
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> >
> >  docs/rdma.txt                     |  420 ---
> >  include/io/channel-rdma.h         |  165 ++
> >  io/channel-rdma.c                 |  798 ++++++
> >  io/meson.build                    |    1 +
> >  io/trace-events                   |   14 +
> >  meson.build                       |    6 -
> >  migration/meson.build             |    3 +-
> >  migration/migration-stats.c       |    5 +-
> >  migration/migration-stats.h       |    4 -
> >  migration/migration.c             |   13 +-
> >  migration/migration.h             |    9 -
> >  migration/multifd.c               |   10 +
> >  migration/options.c               |   16 -
> >  migration/options.h               |    2 -
> >  migration/qemu-file.c             |    1 -
> >  migration/ram.c                   |   90 +-
> >  migration/rdma.c                  | 4205 +----------------------------
> >  migration/rdma.h                  |   67 +-
> >  migration/savevm.c                |    2 +-
> >  migration/trace-events            |   68 +-
> >  qapi/migration.json               |   13 +-
> >  scripts/analyze-migration.py      |    3 -
> >  tests/unit/meson.build            |    1 +
> >  tests/unit/test-io-channel-rdma.c |  276 ++
> >  24 files changed, 1360 insertions(+), 4832 deletions(-)  delete mode
> > 100644 docs/rdma.txt  create mode 100644 include/io/channel-rdma.h
> > create mode 100644 io/channel-rdma.c  create mode 100644
> > tests/unit/test-io-channel-rdma.c
> >
> > --
> > 2.43.0
> >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 3/6] io/channel-rdma: support working in coroutine
  2024-06-06 13:34   ` Haris Iqbal
@ 2024-06-07  8:45     ` Gonglei (Arei) via
  2024-06-07 10:01       ` Haris Iqbal
  0 siblings, 1 reply; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-07  8:45 UTC (permalink / raw)
  To: Haris Iqbal
  Cc: qemu-devel@nongnu.org, peterx@redhat.com, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, mst@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin



> -----Original Message-----
> From: Haris Iqbal [mailto:haris.iqbal@ionos.com]
> Sent: Thursday, June 6, 2024 9:35 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
> 
> On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
> >
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > It is not feasible to obtain RDMA completion queue notifications
> > through poll/ppoll on the rsocket fd. Therefore, we create a thread
> > named rpoller for each rsocket fd and two eventfds: pollin_eventfd and
> > pollout_eventfd.
> >
> > When using io_create_watch or io_set_aio_fd_handler waits for POLLIN
> > or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
> > and pollout_eventfd instead of the rsocket fd.
> >
> > The rpoller rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> > events.
> > When a POLLIN event occurs, the rpoller write the pollin_eventfd, and
> > then poll/ppoll will return the POLLIN event.
> > When a POLLOUT event occurs, the rpoller read the pollout_eventfd, and
> > then poll/ppoll will return the POLLOUT event.
> >
> > For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> > read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> > returning POLLIN/POLLOUT events.
> >
> > Known limitations:
> >
> >   For a blocking rsocket fd, if we use io_create_watch to wait for
> >   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
> >   cannot determine when it is not ready to read/write as we can with
> >   non-blocking fds. Therefore, when an event occurs, it will occurs
> >   always, potentially leave the qemu hanging. So we need be cautious
> >   to avoid hanging when using io_create_watch .
> >
> > Luckily, channel-rdma works well in coroutines :)
> >
> > Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> > Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > ---
> >  include/io/channel-rdma.h |  15 +-
> >  io/channel-rdma.c         | 363
> +++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 376 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> > index 8cab2459e5..cb56127d76 100644
> > --- a/include/io/channel-rdma.h
> > +++ b/include/io/channel-rdma.h
> > @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
> >      socklen_t localAddrLen;
> >      struct sockaddr_storage remoteAddr;
> >      socklen_t remoteAddrLen;
> > +
> > +    /* private */
> > +
> > +    /* qemu g_poll/ppoll() POLLIN event on it */
> > +    int pollin_eventfd;
> > +    /* qemu g_poll/ppoll() POLLOUT event on it */
> > +    int pollout_eventfd;
> > +
> > +    /* the index in the rpoller's fds array */
> > +    int index;
> > +    /* rpoller will rpoll() rpoll_events on the rsocket fd */
> > +    short int rpoll_events;
> >  };
> >
> >  /**
> > @@ -147,6 +159,7 @@ void
> qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress
> *addr,
> >   *
> >   * Returns: the new client channel, or NULL on error
> >   */
> > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> Error
> > **errp);
> > +QIOChannelRDMA *coroutine_mixed_fn
> qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> > +
> Error
> > +**errp);
> >
> >  #endif /* QIO_CHANNEL_RDMA_H */
> > diff --git a/io/channel-rdma.c b/io/channel-rdma.c index
> > 92c362df52..9792add5cf 100644
> > --- a/io/channel-rdma.c
> > +++ b/io/channel-rdma.c
> > @@ -23,10 +23,15 @@
> >
> >  #include "qemu/osdep.h"
> >  #include "io/channel-rdma.h"
> > +#include "io/channel-util.h"
> > +#include "io/channel-watch.h"
> >  #include "io/channel.h"
> >  #include "qapi/clone-visitor.h"
> >  #include "qapi/error.h"
> >  #include "qapi/qapi-visit-sockets.h"
> > +#include "qemu/atomic.h"
> > +#include "qemu/error-report.h"
> > +#include "qemu/thread.h"
> >  #include "trace.h"
> >  #include <errno.h>
> >  #include <netdb.h>
> > @@ -39,11 +44,274 @@
> >  #include <sys/poll.h>
> >  #include <unistd.h>
> >
> > +typedef enum {
> > +    CLEAR_POLLIN,
> > +    CLEAR_POLLOUT,
> > +    SET_POLLIN,
> > +    SET_POLLOUT,
> > +} UpdateEvent;
> > +
> > +typedef enum {
> > +    RP_CMD_ADD_IOC,
> > +    RP_CMD_DEL_IOC,
> > +    RP_CMD_UPDATE,
> > +} RpollerCMD;
> > +
> > +typedef struct {
> > +    RpollerCMD cmd;
> > +    QIOChannelRDMA *rioc;
> > +} RpollerMsg;
> > +
> > +/*
> > + * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT
> > +event
> > + * occurs, it will write/read the pollin_eventfd/pollout_eventfd to
> > +allow
> > + * qemu g_poll/ppoll() get the POLLIN/POLLOUT event  */ static struct
> > +Rpoller {
> > +    QemuThread thread;
> > +    bool is_running;
> > +    int sock[2];
> > +    int count; /* the number of rsocket fds being rpoll() */
> > +    int size; /* the size of fds/riocs */
> > +    struct pollfd *fds;
> > +    QIOChannelRDMA **riocs;
> > +} rpoller;
> > +
> > +static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
> > +                                            RpollerCMD cmd) {
> > +    RpollerMsg msg;
> > +    int ret;
> > +
> > +    msg.cmd = cmd;
> > +    msg.rioc = rioc;
> > +
> > +    ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
> > +    if (ret != sizeof msg) {
> > +        error_report("%s: failed to send msg, errno: %d", __func__,
> errno);
> > +    }
> > +}
> > +
> > +static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
> > +                                               UpdateEvent
> action,
> > +                                               bool notify_rpoller)
> {
> > +    /* An eventfd with the value of ULLONG_MAX - 1 is readable but
> unwritable */
> > +    unsigned long long buf = ULLONG_MAX - 1;
> > +
> > +    switch (action) {
> > +    /* only rpoller do SET_* action, to allow qemu ppoll() get the event */
> > +    case SET_POLLIN:
> > +        RETRY_ON_EINTR(write(rioc->pollin_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events &= ~POLLIN;
> > +        break;
> > +    case SET_POLLOUT:
> > +        RETRY_ON_EINTR(read(rioc->pollout_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events &= ~POLLOUT;
> > +        break;
> > +
> > +    /* the rsocket fd is not ready to rread/rwrite */
> > +    case CLEAR_POLLIN:
> > +        RETRY_ON_EINTR(read(rioc->pollin_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events |= POLLIN;
> > +        break;
> > +    case CLEAR_POLLOUT:
> > +        RETRY_ON_EINTR(write(rioc->pollout_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events |= POLLOUT;
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +
> > +    /* notify rpoller to rpoll() POLLIN/POLLOUT events */
> > +    if (notify_rpoller) {
> > +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_UPDATE);
> > +    }
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_add_rioc(QIOChannelRDMA *rioc) {
> > +    if (rioc->index != -1) {
> > +        error_report("%s: rioc already exsits", __func__);
> > +        return;
> > +    }
> > +
> > +    rioc->index = ++rpoller.count;
> > +
> > +    if (rpoller.count + 1 > rpoller.size) {
> > +        rpoller.size *= 2;
> > +        rpoller.fds = g_renew(struct pollfd, rpoller.fds, rpoller.size);
> > +        rpoller.riocs = g_renew(QIOChannelRDMA *, rpoller.riocs,
> rpoller.size);
> > +    }
> > +
> > +    rpoller.fds[rioc->index].fd = rioc->fd;
> > +    rpoller.fds[rioc->index].events = rioc->rpoll_events;
> 
> The allotment of rioc fds and events to rpoller slots are sequential, but making
> the deletion also sequentials means that the del_rioc needs to be called in the
> exact opposite sequence as they were added (through add_rioc). Otherwise we
> leaves holes in between, and readditions might step on an already used slot.
> 
> Does this setup make sure that the above restriction is satisfied, or am I
> missing something?
> 

Actually, we use an O (1) algorithm for deletion, that is, each time we replace the array element to be deleted with the last one.
Pls see qio_channel_rdma_rpoller_del_rioc():

   rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];

> > +    rpoller.riocs[rioc->index] = rioc; }
> > +
> > +static void qio_channel_rdma_rpoller_del_rioc(QIOChannelRDMA *rioc) {
> > +    if (rioc->index == -1) {
> > +        error_report("%s: rioc not exsits", __func__);
> > +        return;
> > +    }
> > +
> > +    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];
> 
> Should this be rpoller.count-1?
> 
No. the first element is the sockpairs' fd. Pls see qio_channel_rdma_rpoller_start():

   rpoller.fds[0].fd = rpoller.sock[1];
   rpoller.fds[0].events = POLLIN;


Regards,
-Gonglei

> > +    rpoller.riocs[rioc->index] = rpoller.riocs[rpoller.count];
> > +    rpoller.riocs[rioc->index]->index = rioc->index;
> > +    rpoller.count--;
> > +
> > +    close(rioc->pollin_eventfd);
> > +    close(rioc->pollout_eventfd);
> > +    rioc->index = -1;
> > +    rioc->rpoll_events = 0;
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_update_ioc(QIOChannelRDMA *rioc)
> > +{
> > +    if (rioc->index == -1) {
> > +        error_report("%s: rioc not exsits", __func__);
> > +        return;
> > +    }
> > +
> > +    rpoller.fds[rioc->index].fd = rioc->fd;
> > +    rpoller.fds[rioc->index].events = rioc->rpoll_events; }
> > +
> > +static void qio_channel_rdma_rpoller_process_msg(void)
> > +{
> > +    RpollerMsg msg;
> > +    int ret;
> > +
> > +    ret = RETRY_ON_EINTR(read(rpoller.sock[1], &msg, sizeof msg));
> > +    if (ret != sizeof msg) {
> > +        error_report("%s: rpoller failed to recv msg: %s", __func__,
> > +                     strerror(errno));
> > +        return;
> > +    }
> > +
> > +    switch (msg.cmd) {
> > +    case RP_CMD_ADD_IOC:
> > +        qio_channel_rdma_rpoller_add_rioc(msg.rioc);
> > +        break;
> > +    case RP_CMD_DEL_IOC:
> > +        qio_channel_rdma_rpoller_del_rioc(msg.rioc);
> > +        break;
> > +    case RP_CMD_UPDATE:
> > +        qio_channel_rdma_rpoller_update_ioc(msg.rioc);
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_cleanup(void)
> > +{
> > +    close(rpoller.sock[0]);
> > +    close(rpoller.sock[1]);
> > +    rpoller.sock[0] = -1;
> > +    rpoller.sock[1] = -1;
> > +    g_free(rpoller.fds);
> > +    g_free(rpoller.riocs);
> > +    rpoller.fds = NULL;
> > +    rpoller.riocs = NULL;
> > +    rpoller.count = 0;
> > +    rpoller.size = 0;
> > +    rpoller.is_running = false;
> > +}
> > +
> > +static void *qio_channel_rdma_rpoller_thread(void *opaque) {
> > +    int i, ret, error_events = POLLERR | POLLHUP | POLLNVAL;
> > +
> > +    do {
> > +        ret = rpoll(rpoller.fds, rpoller.count + 1, -1);
> > +        if (ret < 0 && errno != -EINTR) {
> > +            error_report("%s: rpoll() error: %s", __func__,
> strerror(errno));
> > +            break;
> > +        }
> > +
> > +        for (i = 1; i <= rpoller.count; i++) {
> > +            if (rpoller.fds[i].revents & (POLLIN | error_events)) {
> > +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> SET_POLLIN,
> > +                                                   false);
> > +                rpoller.fds[i].events &= ~POLLIN;
> > +            }
> > +            if (rpoller.fds[i].revents & (POLLOUT | error_events)) {
> > +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> > +
> SET_POLLOUT, false);
> > +                rpoller.fds[i].events &= ~POLLOUT;
> > +            }
> > +            /* ignore this fd */
> > +            if (rpoller.fds[i].revents & (error_events)) {
> > +                rpoller.fds[i].fd = -1;
> > +            }
> > +        }
> > +
> > +        if (rpoller.fds[0].revents) {
> > +            qio_channel_rdma_rpoller_process_msg();
> > +        }
> > +    } while (rpoller.count >= 1);
> > +
> > +    qio_channel_rdma_rpoller_cleanup();
> > +
> > +    return NULL;
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_start(void)
> > +{
> > +    if (qatomic_xchg(&rpoller.is_running, true)) {
> > +        return;
> > +    }
> > +
> > +    if (qemu_socketpair(AF_UNIX, SOCK_STREAM, 0, rpoller.sock)) {
> > +        rpoller.is_running = false;
> > +        error_report("%s: failed to create socketpair %s", __func__,
> > +                     strerror(errno));
> > +        return;
> > +    }
> > +
> > +    rpoller.count = 0;
> > +    rpoller.size = 4;
> > +    rpoller.fds = g_malloc0_n(rpoller.size, sizeof(struct pollfd));
> > +    rpoller.riocs = g_malloc0_n(rpoller.size, sizeof(QIOChannelRDMA *));
> > +    rpoller.fds[0].fd = rpoller.sock[1];
> > +    rpoller.fds[0].events = POLLIN;
> > +
> > +    qemu_thread_create(&rpoller.thread, "qio-channel-rdma-rpoller",
> > +                       qio_channel_rdma_rpoller_thread, NULL,
> > +                       QEMU_THREAD_JOINABLE); }
> > +
> > +static void qio_channel_rdma_add_rioc_to_rpoller(QIOChannelRDMA
> > +*rioc) {
> > +    int flags = EFD_CLOEXEC | EFD_NONBLOCK;
> > +
> > +    /*
> > +     * A single eventfd is either readable or writable. A single eventfd
> cannot
> > +     * represent a state where it is neither readable nor writable. so use
> two
> > +     * eventfds here.
> > +     */
> > +    rioc->pollin_eventfd = eventfd(0, flags);
> > +    rioc->pollout_eventfd = eventfd(0, flags);
> > +    /* pollout_eventfd with the value 0, means writable, make it
> unwritable */
> > +    qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, false);
> > +
> > +    /* tell the rpoller to rpoll() events on rioc->socketfd */
> > +    rioc->rpoll_events = POLLIN | POLLOUT;
> > +    qio_channel_rdma_notify_rpoller(rioc, RP_CMD_ADD_IOC); }
> > +
> >  QIOChannelRDMA *qio_channel_rdma_new(void)  {
> >      QIOChannelRDMA *rioc;
> >      QIOChannel *ioc;
> >
> > +    qio_channel_rdma_rpoller_start();
> > +    if (!rpoller.is_running) {
> > +        return NULL;
> > +    }
> > +
> >      rioc =
> QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> >      ioc = QIO_CHANNEL(rioc);
> >      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@
> > -125,6 +393,8 @@ retry:
> >          goto out;
> >      }
> >
> > +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> > +
> >  out:
> >      if (ret) {
> >          trace_qio_channel_rdma_connect_fail(rioc);
> > @@ -211,6 +481,8 @@ int
> qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress
> *addr,
> >      qio_channel_set_feature(QIO_CHANNEL(rioc),
> QIO_CHANNEL_FEATURE_LISTEN);
> >      trace_qio_channel_rdma_listen_complete(rioc, fd);
> >
> > +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> > +
> >  out:
> >      if (ret) {
> >          trace_qio_channel_rdma_listen_fail(rioc);
> > @@ -267,8 +539,10 @@ void
> qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress
> *addr,
> >                             qio_channel_listen_worker_free,
> context);
> > }
> >
> > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> Error
> > **errp)
> > +QIOChannelRDMA *coroutine_mixed_fn
> qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> > +
> Error
> > +**errp)
> >  {
> > +    QIOChannel *ioc = QIO_CHANNEL(rioc);
> >      QIOChannelRDMA *cioc;
> >
> >      cioc = qio_channel_rdma_new();
> > @@ -283,6 +557,17 @@ retry:
> >          if (errno == EINTR) {
> >              goto retry;
> >          }
> > +        if (errno == EAGAIN) {
> > +            if (!(rioc->rpoll_events & POLLIN)) {
> > +                qio_channel_rdma_update_poll_event(rioc,
> CLEAR_POLLIN, true);
> > +            }
> > +            if (qemu_in_coroutine()) {
> > +                qio_channel_yield(ioc, G_IO_IN);
> > +            } else {
> > +                qio_channel_wait(ioc, G_IO_IN);
> > +            }
> > +            goto retry;
> > +        }
> >          error_setg_errno(errp, errno, "Unable to accept connection");
> >          goto error;
> >      }
> > @@ -294,6 +579,8 @@ retry:
> >          goto error;
> >      }
> >
> > +    qio_channel_rdma_add_rioc_to_rpoller(cioc);
> > +
> >      trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
> >      return cioc;
> >
> > @@ -307,6 +594,10 @@ static void qio_channel_rdma_init(Object *obj)  {
> >      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> >      ioc->fd = -1;
> > +    ioc->pollin_eventfd = -1;
> > +    ioc->pollout_eventfd = -1;
> > +    ioc->index = -1;
> > +    ioc->rpoll_events = 0;
> >  }
> >
> >  static void qio_channel_rdma_finalize(Object *obj) @@ -314,6 +605,7
> > @@ static void qio_channel_rdma_finalize(Object *obj)
> >      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> >
> >      if (ioc->fd != -1) {
> > +        qio_channel_rdma_notify_rpoller(ioc, RP_CMD_DEL_IOC);
> >          rclose(ioc->fd);
> >          ioc->fd = -1;
> >      }
> > @@ -330,6 +622,12 @@ static ssize_t
> qio_channel_rdma_readv(QIOChannel
> > *ioc, const struct iovec *iov,
> >  retry:
> >      ret = rreadv(rioc->fd, iov, niov);
> >      if (ret < 0) {
> > +        if (errno == EAGAIN) {
> > +            if (!(rioc->rpoll_events & POLLIN)) {
> > +                qio_channel_rdma_update_poll_event(rioc,
> CLEAR_POLLIN, true);
> > +            }
> > +            return QIO_CHANNEL_ERR_BLOCK;
> > +        }
> >          if (errno == EINTR) {
> >              goto retry;
> >          }
> > @@ -351,6 +649,12 @@ static ssize_t
> qio_channel_rdma_writev(QIOChannel
> > *ioc, const struct iovec *iov,
> >  retry:
> >      ret = rwritev(rioc->fd, iov, niov);
> >      if (ret <= 0) {
> > +        if (errno == EAGAIN) {
> > +            if (!(rioc->rpoll_events & POLLOUT)) {
> > +                qio_channel_rdma_update_poll_event(rioc,
> CLEAR_POLLOUT, true);
> > +            }
> > +            return QIO_CHANNEL_ERR_BLOCK;
> > +        }
> >          if (errno == EINTR) {
> >              goto retry;
> >          }
> > @@ -361,6 +665,28 @@ retry:
> >      return ret;
> >  }
> >
> > +static int qio_channel_rdma_set_blocking(QIOChannel *ioc, bool enabled,
> > +                                         Error **errp
> G_GNUC_UNUSED)
> > +{
> > +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > +    int flags, ret;
> > +
> > +    flags = rfcntl(rioc->fd, F_GETFL);
> > +    if (enabled) {
> > +        flags &= ~O_NONBLOCK;
> > +    } else {
> > +        flags |= O_NONBLOCK;
> > +    }
> > +
> > +    ret = rfcntl(rioc->fd, F_SETFL, flags);
> > +    if (ret) {
> > +        error_setg_errno(errp, errno,
> > +                         "Unable to rfcntl rsocket fd with flags %d",
> flags);
> > +    }
> > +
> > +    return ret;
> > +}
> > +
> >  static void qio_channel_rdma_set_delay(QIOChannel *ioc, bool enabled)
> > {
> >      QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc); @@ -374,6
> +700,7 @@
> > static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
> >      QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> >
> >      if (rioc->fd != -1) {
> > +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_DEL_IOC);
> >          rclose(rioc->fd);
> >          rioc->fd = -1;
> >      }
> > @@ -408,6 +735,37 @@ static int qio_channel_rdma_shutdown(QIOChannel
> *ioc, QIOChannelShutdown how,
> >      return 0;
> >  }
> >
> > +static void
> > +qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc, AioContext
> *read_ctx,
> > +                                    IOHandler *io_read,
> AioContext *write_ctx,
> > +                                    IOHandler *io_write, void
> > +*opaque) {
> > +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > +
> > +    qio_channel_util_set_aio_fd_handler(rioc->pollin_eventfd, read_ctx,
> io_read,
> > +                                        rioc->pollout_eventfd,
> write_ctx,
> > +                                        io_write, opaque); }
> > +
> > +static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
> > +                                              GIOCondition
> condition)
> > +{
> > +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > +
> > +    switch (condition) {
> > +    case G_IO_IN:
> > +        return qio_channel_create_fd_watch(ioc, rioc->pollin_eventfd,
> > +                                           condition);
> > +    case G_IO_OUT:
> > +        return qio_channel_create_fd_watch(ioc, rioc->pollout_eventfd,
> > +                                           condition);
> > +    default:
> > +        error_report("%s: do not support watch 0x%x event", __func__,
> > +                     condition);
> > +        return NULL;
> > +    }
> > +}
> > +
> >  static void qio_channel_rdma_class_init(ObjectClass *klass,
> >                                          void *class_data
> > G_GNUC_UNUSED)  { @@ -415,9 +773,12 @@ static void
> > qio_channel_rdma_class_init(ObjectClass *klass,
> >
> >      ioc_klass->io_writev = qio_channel_rdma_writev;
> >      ioc_klass->io_readv = qio_channel_rdma_readv;
> > +    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
> >      ioc_klass->io_close = qio_channel_rdma_close;
> >      ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
> >      ioc_klass->io_set_delay = qio_channel_rdma_set_delay;
> > +    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
> > +    ioc_klass->io_set_aio_fd_handler =
> > + qio_channel_rdma_set_aio_fd_handler;
> >  }
> >
> >  static const TypeInfo qio_channel_rdma_info = {
> > --
> > 2.43.0
> >
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-05 14:18     ` Peter Xu
@ 2024-06-07  8:49       ` Gonglei (Arei) via
  2024-06-10 16:35         ` Peter Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-07  8:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel@nongnu.org, yu.zhang@ionos.com, mgalaxy@akamai.com,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin, Fabiano Rosas



> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Wednesday, June 5, 2024 10:19 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com; mgalaxy@akamai.com;
> elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> On Wed, Jun 05, 2024 at 10:09:43AM +0000, Gonglei (Arei) wrote:
> > Hi Peter,
> >
> > > -----Original Message-----
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Wednesday, June 5, 2024 3:32 AM
> > > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Cc: qemu-devel@nongnu.org; yu.zhang@ionos.com;
> mgalaxy@akamai.com;
> > > elmar.gerdes@ionos.com; zhengchuan <zhengchuan@huawei.com>;
> > > berrange@redhat.com; armbru@redhat.com; lizhijian@fujitsu.com;
> > > pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > > <wangjialin23@huawei.com>; Fabiano Rosas <farosas@suse.de>
> > > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on
> > > rsocket API
> > >
> > > Hi, Lei, Jialin,
> > >
> > > Thanks a lot for working on this!
> > >
> > > I think we'll need to wait a bit on feedbacks from Jinpu and his
> > > team on RDMA side, also Daniel for iochannels.  Also, please
> > > remember to copy Fabiano Rosas in any relevant future posts.  We'd
> > > also like to know whether he has any comments too.  I have him copied in
> this reply.
> > >
> > > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > > From: Jialin Wang <wangjialin23@huawei.com>
> > > >
> > > > Hi,
> > > >
> > > > This patch series attempts to refactor RDMA live migration by
> > > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > > >
> > > > The /usr/include/rdma/rsocket.h provides a higher level rsocket
> > > > API that is a 1-1 match of the normal kernel 'sockets' API, which
> > > > hides the detail of rdma protocol into rsocket and allows us to
> > > > add support for some modern features like multifd more easily.
> > > >
> > > > Here is the previous discussion on refactoring RDMA live migration
> > > > using the rsocket API:
> > > >
> > > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@l
> > > > inar
> > > > o.org/
> > > >
> > > > We have encountered some bugs when using rsocket and plan to
> > > > submit them to the rdma-core community.
> > > >
> > > > In addition, the use of rsocket makes our programming more
> > > > convenient, but it must be noted that this method introduces
> > > > multiple memory copies, which can be imagined that there will be a
> > > > certain performance degradation, hoping that friends with RDMA
> > > > network cards can help verify,
> > > thank you!
> > >
> > > It'll be good to elaborate if you tested it in-house. What people
> > > should expect on the numbers exactly?  Is that okay from Huawei's POV?
> > >
> > > Besides that, the code looks pretty good at a first glance to me.
> > > Before others chim in, here're some high level comments..
> > >
> > > Firstly, can we avoid using coroutine when listen()?  Might be
> > > relevant when I see that rdma_accept_incoming_migration() runs in a
> > > loop to do raccept(), but would that also hang the qemu main loop
> > > even with the coroutine, before all channels are ready?  I'm not a
> > > coroutine person, but I think the hope is that we can make dest QEMU
> > > run in a thread in the future just like the src QEMU, so the less coroutine
> the better in this path.
> > >
> >
> > Because rsocket is set to non-blocking, raccept will return EAGAIN
> > when no connection is received, coroutine will yield, and will not hang the
> qemu main loop.
> 
> Ah that's ok.  And also I just noticed it may not be a big deal either as long as
> we're before migration_incoming_process().
> 
> I'm wondering whether it can do it similarly like what we do with sockets in
> qio_net_listener_set_client_func_full().  After all, rsocket wants to mimic the
> socket API.  It'll make sense if rsocket code tries to match with socket, or
> even reuse.
> 

Actually we tried this solution, but it didn't work. Pls see patch 3/6

Known limitations: 
  For a blocking rsocket fd, if we use io_create_watch to wait for
  POLLIN or POLLOUT events, since the rsocket fd is blocking, we
  cannot determine when it is not ready to read/write as we can with
  non-blocking fds. Therefore, when an event occurs, it will occurs
  always, potentially leave the qemu hanging. So we need be cautious
  to avoid hanging when using io_create_watch .


Regards,
-Gonglei

> >
> > > I think I also left a comment elsewhere on whether it would be
> > > possible to allow iochannels implement their own poll() functions to
> > > avoid the per-channel poll thread that is proposed in this series.
> > >
> > > https://lore.kernel.org/r/ZldY21xVExtiMddB@x1n
> > >
> >
> > We noticed that, and it's a big operation. I'm not sure that's a better way.
> >
> > > Personally I think even with the thread proposal it's better than
> > > the old rdma code, but I just still want to double check with you
> > > guys.  E.g., maybe that just won't work at all?  Again, that'll also
> > > be based on the fact that we move migration incoming into a thread
> > > first to keep the dest QEMU main loop intact, I think, but I hope we
> > > will reach that irrelevant of rdma, IOW it'll be nice to happen even earlier if
> possible.
> > >
> > Yep. This is a fairly big change, I wonder what other people's suggestions
> are?
> 
> Yes we can wait for others' opinions.  And btw I'm not asking for it and I don't
> think it'll be a blocker for this approach to land, as I said this is better than the
> current code so it's definitely an improvement to me.
> 
> I'm purely curious, because if you're not going to do it for rdma, maybe
> someday I'll try to do that, and I want to know what "big change" could be as I
> didn't dig further.  It may help me by sharing what issues you've found.
> 
> Thanks,
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
  2024-06-04 12:14 ` [PATCH 3/6] io/channel-rdma: support working in coroutine Gonglei via
  2024-06-06 13:34   ` Haris Iqbal
@ 2024-06-07  9:04   ` Daniel P. Berrangé
  2024-06-07  9:28     ` Gonglei (Arei) via
  1 sibling, 1 reply; 55+ messages in thread
From: Daniel P. Berrangé @ 2024-06-07  9:04 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang

On Tue, Jun 04, 2024 at 08:14:09PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> It is not feasible to obtain RDMA completion queue notifications
> through poll/ppoll on the rsocket fd. Therefore, we create a thread
> named rpoller for each rsocket fd and two eventfds: pollin_eventfd
> and pollout_eventfd.
> 
> When using io_create_watch or io_set_aio_fd_handler waits for POLLIN
> or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
> and pollout_eventfd instead of the rsocket fd.
> 
> The rpoller rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> events.
> When a POLLIN event occurs, the rpoller write the pollin_eventfd,
> and then poll/ppoll will return the POLLIN event.
> When a POLLOUT event occurs, the rpoller read the pollout_eventfd,
> and then poll/ppoll will return the POLLOUT event.
> 
> For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> returning POLLIN/POLLOUT events.
> 
> Known limitations:
> 
>   For a blocking rsocket fd, if we use io_create_watch to wait for
>   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
>   cannot determine when it is not ready to read/write as we can with
>   non-blocking fds. Therefore, when an event occurs, it will occurs
>   always, potentially leave the qemu hanging. So we need be cautious
>   to avoid hanging when using io_create_watch .
> 
> Luckily, channel-rdma works well in coroutines :)
> 
> Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  include/io/channel-rdma.h |  15 +-
>  io/channel-rdma.c         | 363 +++++++++++++++++++++++++++++++++++++-
>  2 files changed, 376 insertions(+), 2 deletions(-)
> 
> diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> index 8cab2459e5..cb56127d76 100644
> --- a/include/io/channel-rdma.h
> +++ b/include/io/channel-rdma.h
> @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
>      socklen_t localAddrLen;
>      struct sockaddr_storage remoteAddr;
>      socklen_t remoteAddrLen;
> +
> +    /* private */
> +
> +    /* qemu g_poll/ppoll() POLLIN event on it */
> +    int pollin_eventfd;
> +    /* qemu g_poll/ppoll() POLLOUT event on it */
> +    int pollout_eventfd;
> +
> +    /* the index in the rpoller's fds array */
> +    int index;
> +    /* rpoller will rpoll() rpoll_events on the rsocket fd */
> +    short int rpoll_events;
>  };
>  
>  /**
> @@ -147,6 +159,7 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
>   *
>   * Returns: the new client channel, or NULL on error
>   */
> -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
> +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> +                                                           Error **errp);
>  
>  #endif /* QIO_CHANNEL_RDMA_H */
> diff --git a/io/channel-rdma.c b/io/channel-rdma.c
> index 92c362df52..9792add5cf 100644
> --- a/io/channel-rdma.c
> +++ b/io/channel-rdma.c
> @@ -23,10 +23,15 @@
>  
>  #include "qemu/osdep.h"
>  #include "io/channel-rdma.h"
> +#include "io/channel-util.h"
> +#include "io/channel-watch.h"
>  #include "io/channel.h"
>  #include "qapi/clone-visitor.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-visit-sockets.h"
> +#include "qemu/atomic.h"
> +#include "qemu/error-report.h"
> +#include "qemu/thread.h"
>  #include "trace.h"
>  #include <errno.h>
>  #include <netdb.h>
> @@ -39,11 +44,274 @@
>  #include <sys/poll.h>
>  #include <unistd.h>
>  
> +typedef enum {
> +    CLEAR_POLLIN,
> +    CLEAR_POLLOUT,
> +    SET_POLLIN,
> +    SET_POLLOUT,
> +} UpdateEvent;
> +
> +typedef enum {
> +    RP_CMD_ADD_IOC,
> +    RP_CMD_DEL_IOC,
> +    RP_CMD_UPDATE,
> +} RpollerCMD;
> +
> +typedef struct {
> +    RpollerCMD cmd;
> +    QIOChannelRDMA *rioc;
> +} RpollerMsg;
> +
> +/*
> + * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT event
> + * occurs, it will write/read the pollin_eventfd/pollout_eventfd to allow
> + * qemu g_poll/ppoll() get the POLLIN/POLLOUT event
> + */
> +static struct Rpoller {
> +    QemuThread thread;
> +    bool is_running;
> +    int sock[2];
> +    int count; /* the number of rsocket fds being rpoll() */
> +    int size; /* the size of fds/riocs */
> +    struct pollfd *fds;
> +    QIOChannelRDMA **riocs;
> +} rpoller;
> +
> +static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
> +                                            RpollerCMD cmd)
> +{
> +    RpollerMsg msg;
> +    int ret;
> +
> +    msg.cmd = cmd;
> +    msg.rioc = rioc;
> +
> +    ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));

So this message is handled asynchronously by the poll thread, but
you're not acquiring any reference on teh 'rioc' object. So there's
the possibility that the owner of the rioc calls 'unref' free'ing
the last reference, before the poll thread has finished processing
the message.  IMHO the poll thread must hold a reference on the
rioc for as long as it needs the object.

> +    if (ret != sizeof msg) {
> +        error_report("%s: failed to send msg, errno: %d", __func__, errno);
> +    }

I feel like this should be propagated to the caller via an Error **errp
parameter.

> +}
> +
> +static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
> +                                               UpdateEvent action,
> +                                               bool notify_rpoller)
> +{
> +    /* An eventfd with the value of ULLONG_MAX - 1 is readable but unwritable */
> +    unsigned long long buf = ULLONG_MAX - 1;
> +
> +    switch (action) {
> +    /* only rpoller do SET_* action, to allow qemu ppoll() get the event */
> +    case SET_POLLIN:
> +        RETRY_ON_EINTR(write(rioc->pollin_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events &= ~POLLIN;
> +        break;
> +    case SET_POLLOUT:
> +        RETRY_ON_EINTR(read(rioc->pollout_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events &= ~POLLOUT;
> +        break;
> +
> +    /* the rsocket fd is not ready to rread/rwrite */
> +    case CLEAR_POLLIN:
> +        RETRY_ON_EINTR(read(rioc->pollin_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events |= POLLIN;
> +        break;
> +    case CLEAR_POLLOUT:
> +        RETRY_ON_EINTR(write(rioc->pollout_eventfd, &buf, sizeof buf));
> +        rioc->rpoll_events |= POLLOUT;
> +        break;
> +    default:
> +        break;
> +    }
> +
> +    /* notify rpoller to rpoll() POLLIN/POLLOUT events */
> +    if (notify_rpoller) {
> +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_UPDATE);
> +    }
> +}
> +
> +static void qio_channel_rdma_rpoller_add_rioc(QIOChannelRDMA *rioc)
> +{
> +    if (rioc->index != -1) {
> +        error_report("%s: rioc already exsits", __func__);
> +        return;
> +    }
> +
> +    rioc->index = ++rpoller.count;
> +
> +    if (rpoller.count + 1 > rpoller.size) {
> +        rpoller.size *= 2;
> +        rpoller.fds = g_renew(struct pollfd, rpoller.fds, rpoller.size);
> +        rpoller.riocs = g_renew(QIOChannelRDMA *, rpoller.riocs, rpoller.size);
> +    }
> +
> +    rpoller.fds[rioc->index].fd = rioc->fd;
> +    rpoller.fds[rioc->index].events = rioc->rpoll_events;
> +    rpoller.riocs[rioc->index] = rioc;
> +}
> +
> +static void qio_channel_rdma_rpoller_del_rioc(QIOChannelRDMA *rioc)
> +{
> +    if (rioc->index == -1) {
> +        error_report("%s: rioc not exsits", __func__);
> +        return;
> +    }
> +
> +    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];
> +    rpoller.riocs[rioc->index] = rpoller.riocs[rpoller.count];
> +    rpoller.riocs[rioc->index]->index = rioc->index;
> +    rpoller.count--;
> +
> +    close(rioc->pollin_eventfd);
> +    close(rioc->pollout_eventfd);
> +    rioc->index = -1;
> +    rioc->rpoll_events = 0;
> +}
> +
> +static void qio_channel_rdma_rpoller_update_ioc(QIOChannelRDMA *rioc)
> +{
> +    if (rioc->index == -1) {
> +        error_report("%s: rioc not exsits", __func__);
> +        return;
> +    }
> +
> +    rpoller.fds[rioc->index].fd = rioc->fd;
> +    rpoller.fds[rioc->index].events = rioc->rpoll_events;
> +}
> +
> +static void qio_channel_rdma_rpoller_process_msg(void)
> +{
> +    RpollerMsg msg;
> +    int ret;
> +
> +    ret = RETRY_ON_EINTR(read(rpoller.sock[1], &msg, sizeof msg));
> +    if (ret != sizeof msg) {
> +        error_report("%s: rpoller failed to recv msg: %s", __func__,
> +                     strerror(errno));
> +        return;
> +    }
> +
> +    switch (msg.cmd) {
> +    case RP_CMD_ADD_IOC:
> +        qio_channel_rdma_rpoller_add_rioc(msg.rioc);
> +        break;
> +    case RP_CMD_DEL_IOC:
> +        qio_channel_rdma_rpoller_del_rioc(msg.rioc);
> +        break;
> +    case RP_CMD_UPDATE:
> +        qio_channel_rdma_rpoller_update_ioc(msg.rioc);
> +        break;
> +    default:
> +        break;
> +    }
> +}
> +
> +static void qio_channel_rdma_rpoller_cleanup(void)
> +{
> +    close(rpoller.sock[0]);
> +    close(rpoller.sock[1]);
> +    rpoller.sock[0] = -1;
> +    rpoller.sock[1] = -1;
> +    g_free(rpoller.fds);
> +    g_free(rpoller.riocs);
> +    rpoller.fds = NULL;
> +    rpoller.riocs = NULL;
> +    rpoller.count = 0;
> +    rpoller.size = 0;
> +    rpoller.is_running = false;
> +}
> +
> +static void *qio_channel_rdma_rpoller_thread(void *opaque)
> +{
> +    int i, ret, error_events = POLLERR | POLLHUP | POLLNVAL;
> +
> +    do {
> +        ret = rpoll(rpoller.fds, rpoller.count + 1, -1);
> +        if (ret < 0 && errno != -EINTR) {
> +            error_report("%s: rpoll() error: %s", __func__, strerror(errno));
> +            break;
> +        }
> +
> +        for (i = 1; i <= rpoller.count; i++) {
> +            if (rpoller.fds[i].revents & (POLLIN | error_events)) {
> +                qio_channel_rdma_update_poll_event(rpoller.riocs[i], SET_POLLIN,
> +                                                   false);
> +                rpoller.fds[i].events &= ~POLLIN;
> +            }
> +            if (rpoller.fds[i].revents & (POLLOUT | error_events)) {
> +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> +                                                   SET_POLLOUT, false);
> +                rpoller.fds[i].events &= ~POLLOUT;
> +            }
> +            /* ignore this fd */
> +            if (rpoller.fds[i].revents & (error_events)) {
> +                rpoller.fds[i].fd = -1;
> +            }
> +        }
> +
> +        if (rpoller.fds[0].revents) {
> +            qio_channel_rdma_rpoller_process_msg();
> +        }
> +    } while (rpoller.count >= 1);
> +
> +    qio_channel_rdma_rpoller_cleanup();
> +
> +    return NULL;
> +}
> +
> +static void qio_channel_rdma_rpoller_start(void)
> +{
> +    if (qatomic_xchg(&rpoller.is_running, true)) {
> +        return;
> +    }
> +
> +    if (qemu_socketpair(AF_UNIX, SOCK_STREAM, 0, rpoller.sock)) {
> +        rpoller.is_running = false;
> +        error_report("%s: failed to create socketpair %s", __func__,
> +                     strerror(errno));
> +        return;
> +    }
> +
> +    rpoller.count = 0;
> +    rpoller.size = 4;
> +    rpoller.fds = g_malloc0_n(rpoller.size, sizeof(struct pollfd));
> +    rpoller.riocs = g_malloc0_n(rpoller.size, sizeof(QIOChannelRDMA *));
> +    rpoller.fds[0].fd = rpoller.sock[1];
> +    rpoller.fds[0].events = POLLIN;
> +
> +    qemu_thread_create(&rpoller.thread, "qio-channel-rdma-rpoller",
> +                       qio_channel_rdma_rpoller_thread, NULL,
> +                       QEMU_THREAD_JOINABLE);
> +}
> +
> +static void qio_channel_rdma_add_rioc_to_rpoller(QIOChannelRDMA *rioc)
> +{
> +    int flags = EFD_CLOEXEC | EFD_NONBLOCK;
> +
> +    /*
> +     * A single eventfd is either readable or writable. A single eventfd cannot
> +     * represent a state where it is neither readable nor writable. so use two
> +     * eventfds here.
> +     */
> +    rioc->pollin_eventfd = eventfd(0, flags);
> +    rioc->pollout_eventfd = eventfd(0, flags);
> +    /* pollout_eventfd with the value 0, means writable, make it unwritable */
> +    qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, false);
> +
> +    /* tell the rpoller to rpoll() events on rioc->socketfd */
> +    rioc->rpoll_events = POLLIN | POLLOUT;
> +    qio_channel_rdma_notify_rpoller(rioc, RP_CMD_ADD_IOC);
> +}
> +
>  QIOChannelRDMA *qio_channel_rdma_new(void)
>  {
>      QIOChannelRDMA *rioc;
>      QIOChannel *ioc;
>  
> +    qio_channel_rdma_rpoller_start();
> +    if (!rpoller.is_running) {
> +        return NULL;
> +    }
> +
>      rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
>      ioc = QIO_CHANNEL(rioc);
>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@ -125,6 +393,8 @@ retry:
>          goto out;
>      }
>  
> +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> +
>  out:
>      if (ret) {
>          trace_qio_channel_rdma_connect_fail(rioc);
> @@ -211,6 +481,8 @@ int qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
>      qio_channel_set_feature(QIO_CHANNEL(rioc), QIO_CHANNEL_FEATURE_LISTEN);
>      trace_qio_channel_rdma_listen_complete(rioc, fd);
>  
> +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> +
>  out:
>      if (ret) {
>          trace_qio_channel_rdma_listen_fail(rioc);
> @@ -267,8 +539,10 @@ void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
>                             qio_channel_listen_worker_free, context);
>  }
>  
> -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc, Error **errp)
> +QIOChannelRDMA *coroutine_mixed_fn qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> +                                                           Error **errp)
>  {
> +    QIOChannel *ioc = QIO_CHANNEL(rioc);
>      QIOChannelRDMA *cioc;
>  
>      cioc = qio_channel_rdma_new();
> @@ -283,6 +557,17 @@ retry:
>          if (errno == EINTR) {
>              goto retry;
>          }
> +        if (errno == EAGAIN) {
> +            if (!(rioc->rpoll_events & POLLIN)) {
> +                qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLIN, true);
> +            }
> +            if (qemu_in_coroutine()) {
> +                qio_channel_yield(ioc, G_IO_IN);
> +            } else {
> +                qio_channel_wait(ioc, G_IO_IN);
> +            }
> +            goto retry;
> +        }
>          error_setg_errno(errp, errno, "Unable to accept connection");
>          goto error;
>      }
> @@ -294,6 +579,8 @@ retry:
>          goto error;
>      }
>  
> +    qio_channel_rdma_add_rioc_to_rpoller(cioc);
> +
>      trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
>      return cioc;
>  
> @@ -307,6 +594,10 @@ static void qio_channel_rdma_init(Object *obj)
>  {
>      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
>      ioc->fd = -1;
> +    ioc->pollin_eventfd = -1;
> +    ioc->pollout_eventfd = -1;
> +    ioc->index = -1;
> +    ioc->rpoll_events = 0;
>  }
>  
>  static void qio_channel_rdma_finalize(Object *obj)
> @@ -314,6 +605,7 @@ static void qio_channel_rdma_finalize(Object *obj)
>      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
>  
>      if (ioc->fd != -1) {
> +        qio_channel_rdma_notify_rpoller(ioc, RP_CMD_DEL_IOC);

This is unsafe.

When finalize runs, the object has dropped its last reference and
is about to be free()d.  The notify_rpoller() method, however,
sends an async message to the poll thread, which the poll thread
will end up processing after the rioc is free()d. ie a use-after-free.

If you take my earlier suggestion that the poll thread should hold
its own reference on the ioc, then it becomes impossible for the
rioc to be freed while there is still an active I/O watch, and
thus this call can go away, and so will the use after free.

>          rclose(ioc->fd);
>          ioc->fd = -1;
>      }

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 3/6] io/channel-rdma: support working in coroutine
  2024-06-07  9:04   ` Daniel P. Berrangé
@ 2024-06-07  9:28     ` Gonglei (Arei) via
  0 siblings, 0 replies; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-06-07  9:28 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel@nongnu.org, peterx@redhat.com, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

Hi Daniel,

> -----Original Message-----
> From: Daniel P. Berrangé [mailto:berrange@redhat.com]
> Sent: Friday, June 7, 2024 5:04 PM
> To: Gonglei (Arei) <arei.gonglei@huawei.com>
> Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; armbru@redhat.com; lizhijian@fujitsu.com;
> pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
> 
> On Tue, Jun 04, 2024 at 08:14:09PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> >
> > It is not feasible to obtain RDMA completion queue notifications
> > through poll/ppoll on the rsocket fd. Therefore, we create a thread
> > named rpoller for each rsocket fd and two eventfds: pollin_eventfd and
> > pollout_eventfd.
> >
> > When using io_create_watch or io_set_aio_fd_handler waits for POLLIN
> > or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
> > and pollout_eventfd instead of the rsocket fd.
> >
> > The rpoller rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> > events.
> > When a POLLIN event occurs, the rpoller write the pollin_eventfd, and
> > then poll/ppoll will return the POLLIN event.
> > When a POLLOUT event occurs, the rpoller read the pollout_eventfd, and
> > then poll/ppoll will return the POLLOUT event.
> >
> > For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> > read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> > returning POLLIN/POLLOUT events.
> >
> > Known limitations:
> >
> >   For a blocking rsocket fd, if we use io_create_watch to wait for
> >   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
> >   cannot determine when it is not ready to read/write as we can with
> >   non-blocking fds. Therefore, when an event occurs, it will occurs
> >   always, potentially leave the qemu hanging. So we need be cautious
> >   to avoid hanging when using io_create_watch .
> >
> > Luckily, channel-rdma works well in coroutines :)
> >
> > Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> > Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > ---
> >  include/io/channel-rdma.h |  15 +-
> >  io/channel-rdma.c         | 363
> +++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 376 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> > index 8cab2459e5..cb56127d76 100644
> > --- a/include/io/channel-rdma.h
> > +++ b/include/io/channel-rdma.h
> > @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
> >      socklen_t localAddrLen;
> >      struct sockaddr_storage remoteAddr;
> >      socklen_t remoteAddrLen;
> > +
> > +    /* private */
> > +
> > +    /* qemu g_poll/ppoll() POLLIN event on it */
> > +    int pollin_eventfd;
> > +    /* qemu g_poll/ppoll() POLLOUT event on it */
> > +    int pollout_eventfd;
> > +
> > +    /* the index in the rpoller's fds array */
> > +    int index;
> > +    /* rpoller will rpoll() rpoll_events on the rsocket fd */
> > +    short int rpoll_events;
> >  };
> >
> >  /**
> > @@ -147,6 +159,7 @@ void
> qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress
> *addr,
> >   *
> >   * Returns: the new client channel, or NULL on error
> >   */
> > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> Error
> > **errp);
> > +QIOChannelRDMA *coroutine_mixed_fn
> qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> > +
> Error
> > +**errp);
> >
> >  #endif /* QIO_CHANNEL_RDMA_H */
> > diff --git a/io/channel-rdma.c b/io/channel-rdma.c index
> > 92c362df52..9792add5cf 100644
> > --- a/io/channel-rdma.c
> > +++ b/io/channel-rdma.c
> > @@ -23,10 +23,15 @@
> >
> >  #include "qemu/osdep.h"
> >  #include "io/channel-rdma.h"
> > +#include "io/channel-util.h"
> > +#include "io/channel-watch.h"
> >  #include "io/channel.h"
> >  #include "qapi/clone-visitor.h"
> >  #include "qapi/error.h"
> >  #include "qapi/qapi-visit-sockets.h"
> > +#include "qemu/atomic.h"
> > +#include "qemu/error-report.h"
> > +#include "qemu/thread.h"
> >  #include "trace.h"
> >  #include <errno.h>
> >  #include <netdb.h>
> > @@ -39,11 +44,274 @@
> >  #include <sys/poll.h>
> >  #include <unistd.h>
> >
> > +typedef enum {
> > +    CLEAR_POLLIN,
> > +    CLEAR_POLLOUT,
> > +    SET_POLLIN,
> > +    SET_POLLOUT,
> > +} UpdateEvent;
> > +
> > +typedef enum {
> > +    RP_CMD_ADD_IOC,
> > +    RP_CMD_DEL_IOC,
> > +    RP_CMD_UPDATE,
> > +} RpollerCMD;
> > +
> > +typedef struct {
> > +    RpollerCMD cmd;
> > +    QIOChannelRDMA *rioc;
> > +} RpollerMsg;
> > +
> > +/*
> > + * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT
> > +event
> > + * occurs, it will write/read the pollin_eventfd/pollout_eventfd to
> > +allow
> > + * qemu g_poll/ppoll() get the POLLIN/POLLOUT event  */ static struct
> > +Rpoller {
> > +    QemuThread thread;
> > +    bool is_running;
> > +    int sock[2];
> > +    int count; /* the number of rsocket fds being rpoll() */
> > +    int size; /* the size of fds/riocs */
> > +    struct pollfd *fds;
> > +    QIOChannelRDMA **riocs;
> > +} rpoller;
> > +
> > +static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
> > +                                            RpollerCMD cmd) {
> > +    RpollerMsg msg;
> > +    int ret;
> > +
> > +    msg.cmd = cmd;
> > +    msg.rioc = rioc;
> > +
> > +    ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
> 
> So this message is handled asynchronously by the poll thread, but you're not
> acquiring any reference on teh 'rioc' object. So there's the possibility that the
> owner of the rioc calls 'unref' free'ing the last reference, before the poll thread
> has finished processing the message.  IMHO the poll thread must hold a
> reference on the rioc for as long as it needs the object.
> 
Yes. You're right.


> > +    if (ret != sizeof msg) {
> > +        error_report("%s: failed to send msg, errno: %d", __func__,
> errno);
> > +    }
> 
> I feel like this should be propagated to the caller via an Error **errp parameter.
> 

OK. 


> > +}
> > +
> > +static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
> > +                                               UpdateEvent
> action,
> > +                                               bool notify_rpoller)
> {
> > +    /* An eventfd with the value of ULLONG_MAX - 1 is readable but
> unwritable */
> > +    unsigned long long buf = ULLONG_MAX - 1;
> > +
> > +    switch (action) {
> > +    /* only rpoller do SET_* action, to allow qemu ppoll() get the event */
> > +    case SET_POLLIN:
> > +        RETRY_ON_EINTR(write(rioc->pollin_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events &= ~POLLIN;
> > +        break;
> > +    case SET_POLLOUT:
> > +        RETRY_ON_EINTR(read(rioc->pollout_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events &= ~POLLOUT;
> > +        break;
> > +
> > +    /* the rsocket fd is not ready to rread/rwrite */
> > +    case CLEAR_POLLIN:
> > +        RETRY_ON_EINTR(read(rioc->pollin_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events |= POLLIN;
> > +        break;
> > +    case CLEAR_POLLOUT:
> > +        RETRY_ON_EINTR(write(rioc->pollout_eventfd, &buf, sizeof buf));
> > +        rioc->rpoll_events |= POLLOUT;
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +
> > +    /* notify rpoller to rpoll() POLLIN/POLLOUT events */
> > +    if (notify_rpoller) {
> > +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_UPDATE);
> > +    }
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_add_rioc(QIOChannelRDMA *rioc) {
> > +    if (rioc->index != -1) {
> > +        error_report("%s: rioc already exsits", __func__);
> > +        return;
> > +    }
> > +
> > +    rioc->index = ++rpoller.count;
> > +
> > +    if (rpoller.count + 1 > rpoller.size) {
> > +        rpoller.size *= 2;
> > +        rpoller.fds = g_renew(struct pollfd, rpoller.fds, rpoller.size);
> > +        rpoller.riocs = g_renew(QIOChannelRDMA *, rpoller.riocs,
> rpoller.size);
> > +    }
> > +
> > +    rpoller.fds[rioc->index].fd = rioc->fd;
> > +    rpoller.fds[rioc->index].events = rioc->rpoll_events;
> > +    rpoller.riocs[rioc->index] = rioc; }
> > +
> > +static void qio_channel_rdma_rpoller_del_rioc(QIOChannelRDMA *rioc) {
> > +    if (rioc->index == -1) {
> > +        error_report("%s: rioc not exsits", __func__);
> > +        return;
> > +    }
> > +
> > +    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];
> > +    rpoller.riocs[rioc->index] = rpoller.riocs[rpoller.count];
> > +    rpoller.riocs[rioc->index]->index = rioc->index;
> > +    rpoller.count--;
> > +
> > +    close(rioc->pollin_eventfd);
> > +    close(rioc->pollout_eventfd);
> > +    rioc->index = -1;
> > +    rioc->rpoll_events = 0;
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_update_ioc(QIOChannelRDMA *rioc)
> > +{
> > +    if (rioc->index == -1) {
> > +        error_report("%s: rioc not exsits", __func__);
> > +        return;
> > +    }
> > +
> > +    rpoller.fds[rioc->index].fd = rioc->fd;
> > +    rpoller.fds[rioc->index].events = rioc->rpoll_events; }
> > +
> > +static void qio_channel_rdma_rpoller_process_msg(void)
> > +{
> > +    RpollerMsg msg;
> > +    int ret;
> > +
> > +    ret = RETRY_ON_EINTR(read(rpoller.sock[1], &msg, sizeof msg));
> > +    if (ret != sizeof msg) {
> > +        error_report("%s: rpoller failed to recv msg: %s", __func__,
> > +                     strerror(errno));
> > +        return;
> > +    }
> > +
> > +    switch (msg.cmd) {
> > +    case RP_CMD_ADD_IOC:
> > +        qio_channel_rdma_rpoller_add_rioc(msg.rioc);
> > +        break;
> > +    case RP_CMD_DEL_IOC:
> > +        qio_channel_rdma_rpoller_del_rioc(msg.rioc);
> > +        break;
> > +    case RP_CMD_UPDATE:
> > +        qio_channel_rdma_rpoller_update_ioc(msg.rioc);
> > +        break;
> > +    default:
> > +        break;
> > +    }
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_cleanup(void)
> > +{
> > +    close(rpoller.sock[0]);
> > +    close(rpoller.sock[1]);
> > +    rpoller.sock[0] = -1;
> > +    rpoller.sock[1] = -1;
> > +    g_free(rpoller.fds);
> > +    g_free(rpoller.riocs);
> > +    rpoller.fds = NULL;
> > +    rpoller.riocs = NULL;
> > +    rpoller.count = 0;
> > +    rpoller.size = 0;
> > +    rpoller.is_running = false;
> > +}
> > +
> > +static void *qio_channel_rdma_rpoller_thread(void *opaque) {
> > +    int i, ret, error_events = POLLERR | POLLHUP | POLLNVAL;
> > +
> > +    do {
> > +        ret = rpoll(rpoller.fds, rpoller.count + 1, -1);
> > +        if (ret < 0 && errno != -EINTR) {
> > +            error_report("%s: rpoll() error: %s", __func__,
> strerror(errno));
> > +            break;
> > +        }
> > +
> > +        for (i = 1; i <= rpoller.count; i++) {
> > +            if (rpoller.fds[i].revents & (POLLIN | error_events)) {
> > +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> SET_POLLIN,
> > +                                                   false);
> > +                rpoller.fds[i].events &= ~POLLIN;
> > +            }
> > +            if (rpoller.fds[i].revents & (POLLOUT | error_events)) {
> > +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> > +
> SET_POLLOUT, false);
> > +                rpoller.fds[i].events &= ~POLLOUT;
> > +            }
> > +            /* ignore this fd */
> > +            if (rpoller.fds[i].revents & (error_events)) {
> > +                rpoller.fds[i].fd = -1;
> > +            }
> > +        }
> > +
> > +        if (rpoller.fds[0].revents) {
> > +            qio_channel_rdma_rpoller_process_msg();
> > +        }
> > +    } while (rpoller.count >= 1);
> > +
> > +    qio_channel_rdma_rpoller_cleanup();
> > +
> > +    return NULL;
> > +}
> > +
> > +static void qio_channel_rdma_rpoller_start(void)
> > +{
> > +    if (qatomic_xchg(&rpoller.is_running, true)) {
> > +        return;
> > +    }
> > +
> > +    if (qemu_socketpair(AF_UNIX, SOCK_STREAM, 0, rpoller.sock)) {
> > +        rpoller.is_running = false;
> > +        error_report("%s: failed to create socketpair %s", __func__,
> > +                     strerror(errno));
> > +        return;
> > +    }
> > +
> > +    rpoller.count = 0;
> > +    rpoller.size = 4;
> > +    rpoller.fds = g_malloc0_n(rpoller.size, sizeof(struct pollfd));
> > +    rpoller.riocs = g_malloc0_n(rpoller.size, sizeof(QIOChannelRDMA *));
> > +    rpoller.fds[0].fd = rpoller.sock[1];
> > +    rpoller.fds[0].events = POLLIN;
> > +
> > +    qemu_thread_create(&rpoller.thread, "qio-channel-rdma-rpoller",
> > +                       qio_channel_rdma_rpoller_thread, NULL,
> > +                       QEMU_THREAD_JOINABLE); }
> > +
> > +static void qio_channel_rdma_add_rioc_to_rpoller(QIOChannelRDMA
> > +*rioc) {
> > +    int flags = EFD_CLOEXEC | EFD_NONBLOCK;
> > +
> > +    /*
> > +     * A single eventfd is either readable or writable. A single eventfd
> cannot
> > +     * represent a state where it is neither readable nor writable. so use
> two
> > +     * eventfds here.
> > +     */
> > +    rioc->pollin_eventfd = eventfd(0, flags);
> > +    rioc->pollout_eventfd = eventfd(0, flags);
> > +    /* pollout_eventfd with the value 0, means writable, make it
> unwritable */
> > +    qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, false);
> > +
> > +    /* tell the rpoller to rpoll() events on rioc->socketfd */
> > +    rioc->rpoll_events = POLLIN | POLLOUT;
> > +    qio_channel_rdma_notify_rpoller(rioc, RP_CMD_ADD_IOC); }
> > +
> >  QIOChannelRDMA *qio_channel_rdma_new(void)  {
> >      QIOChannelRDMA *rioc;
> >      QIOChannel *ioc;
> >
> > +    qio_channel_rdma_rpoller_start();
> > +    if (!rpoller.is_running) {
> > +        return NULL;
> > +    }
> > +
> >      rioc =
> QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> >      ioc = QIO_CHANNEL(rioc);
> >      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@
> > -125,6 +393,8 @@ retry:
> >          goto out;
> >      }
> >
> > +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> > +
> >  out:
> >      if (ret) {
> >          trace_qio_channel_rdma_connect_fail(rioc);
> > @@ -211,6 +481,8 @@ int
> qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress
> *addr,
> >      qio_channel_set_feature(QIO_CHANNEL(rioc),
> QIO_CHANNEL_FEATURE_LISTEN);
> >      trace_qio_channel_rdma_listen_complete(rioc, fd);
> >
> > +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> > +
> >  out:
> >      if (ret) {
> >          trace_qio_channel_rdma_listen_fail(rioc);
> > @@ -267,8 +539,10 @@ void
> qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress
> *addr,
> >                             qio_channel_listen_worker_free,
> context);
> > }
> >
> > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> Error
> > **errp)
> > +QIOChannelRDMA *coroutine_mixed_fn
> qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> > +
> Error
> > +**errp)
> >  {
> > +    QIOChannel *ioc = QIO_CHANNEL(rioc);
> >      QIOChannelRDMA *cioc;
> >
> >      cioc = qio_channel_rdma_new();
> > @@ -283,6 +557,17 @@ retry:
> >          if (errno == EINTR) {
> >              goto retry;
> >          }
> > +        if (errno == EAGAIN) {
> > +            if (!(rioc->rpoll_events & POLLIN)) {
> > +                qio_channel_rdma_update_poll_event(rioc,
> CLEAR_POLLIN, true);
> > +            }
> > +            if (qemu_in_coroutine()) {
> > +                qio_channel_yield(ioc, G_IO_IN);
> > +            } else {
> > +                qio_channel_wait(ioc, G_IO_IN);
> > +            }
> > +            goto retry;
> > +        }
> >          error_setg_errno(errp, errno, "Unable to accept connection");
> >          goto error;
> >      }
> > @@ -294,6 +579,8 @@ retry:
> >          goto error;
> >      }
> >
> > +    qio_channel_rdma_add_rioc_to_rpoller(cioc);
> > +
> >      trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
> >      return cioc;
> >
> > @@ -307,6 +594,10 @@ static void qio_channel_rdma_init(Object *obj)  {
> >      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> >      ioc->fd = -1;
> > +    ioc->pollin_eventfd = -1;
> > +    ioc->pollout_eventfd = -1;
> > +    ioc->index = -1;
> > +    ioc->rpoll_events = 0;
> >  }
> >
> >  static void qio_channel_rdma_finalize(Object *obj) @@ -314,6 +605,7
> > @@ static void qio_channel_rdma_finalize(Object *obj)
> >      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> >
> >      if (ioc->fd != -1) {
> > +        qio_channel_rdma_notify_rpoller(ioc, RP_CMD_DEL_IOC);
> 
> This is unsafe.
> 
> When finalize runs, the object has dropped its last reference and is about to be
> free()d.  The notify_rpoller() method, however, sends an async message to the
> poll thread, which the poll thread will end up processing after the rioc is free()d.
> ie a use-after-free.
> 
> If you take my earlier suggestion that the poll thread should hold its own
> reference on the ioc, then it becomes impossible for the rioc to be freed while
> there is still an active I/O watch, and thus this call can go away, and so will the
> use after free.
> 

Yes, will fixed in the next version.

Regards,
-Gonglei

> >          rclose(ioc->fd);
> >          ioc->fd = -1;
> >      }
> 
> With regards,
> Daniel
> --
> |: https://berrange.com      -o-
> https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-
> https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-
> https://www.instagram.com/dberrange :|
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
  2024-06-07  8:45     ` Gonglei (Arei) via
@ 2024-06-07 10:01       ` Haris Iqbal
  0 siblings, 0 replies; 55+ messages in thread
From: Haris Iqbal @ 2024-06-07 10:01 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: qemu-devel@nongnu.org, peterx@redhat.com, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, mst@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

On Fri, Jun 7, 2024 at 10:45 AM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Haris Iqbal [mailto:haris.iqbal@ionos.com]
> > Sent: Thursday, June 6, 2024 9:35 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 3/6] io/channel-rdma: support working in coroutine
> >
> > On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
> > >
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > It is not feasible to obtain RDMA completion queue notifications
> > > through poll/ppoll on the rsocket fd. Therefore, we create a thread
> > > named rpoller for each rsocket fd and two eventfds: pollin_eventfd and
> > > pollout_eventfd.
> > >
> > > When using io_create_watch or io_set_aio_fd_handler waits for POLLIN
> > > or POLLOUT events, it will actually poll/ppoll on the pollin_eventfd
> > > and pollout_eventfd instead of the rsocket fd.
> > >
> > > The rpoller rpoll() on the rsocket fd to receive POLLIN and POLLOUT
> > > events.
> > > When a POLLIN event occurs, the rpoller write the pollin_eventfd, and
> > > then poll/ppoll will return the POLLIN event.
> > > When a POLLOUT event occurs, the rpoller read the pollout_eventfd, and
> > > then poll/ppoll will return the POLLOUT event.
> > >
> > > For a non-blocking rsocket fd, if rread/rwrite returns EAGAIN, it will
> > > read/write the pollin/pollout_eventfd, preventing poll/ppoll from
> > > returning POLLIN/POLLOUT events.
> > >
> > > Known limitations:
> > >
> > >   For a blocking rsocket fd, if we use io_create_watch to wait for
> > >   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
> > >   cannot determine when it is not ready to read/write as we can with
> > >   non-blocking fds. Therefore, when an event occurs, it will occurs
> > >   always, potentially leave the qemu hanging. So we need be cautious
> > >   to avoid hanging when using io_create_watch .
> > >
> > > Luckily, channel-rdma works well in coroutines :)
> > >
> > > Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> > > Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> > > ---
> > >  include/io/channel-rdma.h |  15 +-
> > >  io/channel-rdma.c         | 363
> > +++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 376 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> > > index 8cab2459e5..cb56127d76 100644
> > > --- a/include/io/channel-rdma.h
> > > +++ b/include/io/channel-rdma.h
> > > @@ -47,6 +47,18 @@ struct QIOChannelRDMA {
> > >      socklen_t localAddrLen;
> > >      struct sockaddr_storage remoteAddr;
> > >      socklen_t remoteAddrLen;
> > > +
> > > +    /* private */
> > > +
> > > +    /* qemu g_poll/ppoll() POLLIN event on it */
> > > +    int pollin_eventfd;
> > > +    /* qemu g_poll/ppoll() POLLOUT event on it */
> > > +    int pollout_eventfd;
> > > +
> > > +    /* the index in the rpoller's fds array */
> > > +    int index;
> > > +    /* rpoller will rpoll() rpoll_events on the rsocket fd */
> > > +    short int rpoll_events;
> > >  };
> > >
> > >  /**
> > > @@ -147,6 +159,7 @@ void
> > qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress
> > *addr,
> > >   *
> > >   * Returns: the new client channel, or NULL on error
> > >   */
> > > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> > Error
> > > **errp);
> > > +QIOChannelRDMA *coroutine_mixed_fn
> > qio_channel_rdma_accept(QIOChannelRDMA *ioc,
> > > +
> > Error
> > > +**errp);
> > >
> > >  #endif /* QIO_CHANNEL_RDMA_H */
> > > diff --git a/io/channel-rdma.c b/io/channel-rdma.c index
> > > 92c362df52..9792add5cf 100644
> > > --- a/io/channel-rdma.c
> > > +++ b/io/channel-rdma.c
> > > @@ -23,10 +23,15 @@
> > >
> > >  #include "qemu/osdep.h"
> > >  #include "io/channel-rdma.h"
> > > +#include "io/channel-util.h"
> > > +#include "io/channel-watch.h"
> > >  #include "io/channel.h"
> > >  #include "qapi/clone-visitor.h"
> > >  #include "qapi/error.h"
> > >  #include "qapi/qapi-visit-sockets.h"
> > > +#include "qemu/atomic.h"
> > > +#include "qemu/error-report.h"
> > > +#include "qemu/thread.h"
> > >  #include "trace.h"
> > >  #include <errno.h>
> > >  #include <netdb.h>
> > > @@ -39,11 +44,274 @@
> > >  #include <sys/poll.h>
> > >  #include <unistd.h>
> > >
> > > +typedef enum {
> > > +    CLEAR_POLLIN,
> > > +    CLEAR_POLLOUT,
> > > +    SET_POLLIN,
> > > +    SET_POLLOUT,
> > > +} UpdateEvent;
> > > +
> > > +typedef enum {
> > > +    RP_CMD_ADD_IOC,
> > > +    RP_CMD_DEL_IOC,
> > > +    RP_CMD_UPDATE,
> > > +} RpollerCMD;
> > > +
> > > +typedef struct {
> > > +    RpollerCMD cmd;
> > > +    QIOChannelRDMA *rioc;
> > > +} RpollerMsg;
> > > +
> > > +/*
> > > + * rpoll() on the rsocket fd with rpoll_events, when POLLIN/POLLOUT
> > > +event
> > > + * occurs, it will write/read the pollin_eventfd/pollout_eventfd to
> > > +allow
> > > + * qemu g_poll/ppoll() get the POLLIN/POLLOUT event  */ static struct
> > > +Rpoller {
> > > +    QemuThread thread;
> > > +    bool is_running;
> > > +    int sock[2];
> > > +    int count; /* the number of rsocket fds being rpoll() */
> > > +    int size; /* the size of fds/riocs */
> > > +    struct pollfd *fds;
> > > +    QIOChannelRDMA **riocs;
> > > +} rpoller;
> > > +
> > > +static void qio_channel_rdma_notify_rpoller(QIOChannelRDMA *rioc,
> > > +                                            RpollerCMD cmd) {
> > > +    RpollerMsg msg;
> > > +    int ret;
> > > +
> > > +    msg.cmd = cmd;
> > > +    msg.rioc = rioc;
> > > +
> > > +    ret = RETRY_ON_EINTR(write(rpoller.sock[0], &msg, sizeof msg));
> > > +    if (ret != sizeof msg) {
> > > +        error_report("%s: failed to send msg, errno: %d", __func__,
> > errno);
> > > +    }
> > > +}
> > > +
> > > +static void qio_channel_rdma_update_poll_event(QIOChannelRDMA *rioc,
> > > +                                               UpdateEvent
> > action,
> > > +                                               bool notify_rpoller)
> > {
> > > +    /* An eventfd with the value of ULLONG_MAX - 1 is readable but
> > unwritable */
> > > +    unsigned long long buf = ULLONG_MAX - 1;
> > > +
> > > +    switch (action) {
> > > +    /* only rpoller do SET_* action, to allow qemu ppoll() get the event */
> > > +    case SET_POLLIN:
> > > +        RETRY_ON_EINTR(write(rioc->pollin_eventfd, &buf, sizeof buf));
> > > +        rioc->rpoll_events &= ~POLLIN;
> > > +        break;
> > > +    case SET_POLLOUT:
> > > +        RETRY_ON_EINTR(read(rioc->pollout_eventfd, &buf, sizeof buf));
> > > +        rioc->rpoll_events &= ~POLLOUT;
> > > +        break;
> > > +
> > > +    /* the rsocket fd is not ready to rread/rwrite */
> > > +    case CLEAR_POLLIN:
> > > +        RETRY_ON_EINTR(read(rioc->pollin_eventfd, &buf, sizeof buf));
> > > +        rioc->rpoll_events |= POLLIN;
> > > +        break;
> > > +    case CLEAR_POLLOUT:
> > > +        RETRY_ON_EINTR(write(rioc->pollout_eventfd, &buf, sizeof buf));
> > > +        rioc->rpoll_events |= POLLOUT;
> > > +        break;
> > > +    default:
> > > +        break;
> > > +    }
> > > +
> > > +    /* notify rpoller to rpoll() POLLIN/POLLOUT events */
> > > +    if (notify_rpoller) {
> > > +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_UPDATE);
> > > +    }
> > > +}
> > > +
> > > +static void qio_channel_rdma_rpoller_add_rioc(QIOChannelRDMA *rioc) {
> > > +    if (rioc->index != -1) {
> > > +        error_report("%s: rioc already exsits", __func__);
> > > +        return;
> > > +    }
> > > +
> > > +    rioc->index = ++rpoller.count;
> > > +
> > > +    if (rpoller.count + 1 > rpoller.size) {
> > > +        rpoller.size *= 2;
> > > +        rpoller.fds = g_renew(struct pollfd, rpoller.fds, rpoller.size);
> > > +        rpoller.riocs = g_renew(QIOChannelRDMA *, rpoller.riocs,
> > rpoller.size);
> > > +    }
> > > +
> > > +    rpoller.fds[rioc->index].fd = rioc->fd;
> > > +    rpoller.fds[rioc->index].events = rioc->rpoll_events;
> >
> > The allotment of rioc fds and events to rpoller slots are sequential, but making
> > the deletion also sequentials means that the del_rioc needs to be called in the
> > exact opposite sequence as they were added (through add_rioc). Otherwise we
> > leaves holes in between, and readditions might step on an already used slot.
> >
> > Does this setup make sure that the above restriction is satisfied, or am I
> > missing something?
> >
>
> Actually, we use an O (1) algorithm for deletion, that is, each time we replace the array element to be deleted with the last one.
> Pls see qio_channel_rdma_rpoller_del_rioc():

Ah yes. I missed that. Thanks for the response.

>
>    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];
>
> > > +    rpoller.riocs[rioc->index] = rioc; }
> > > +
> > > +static void qio_channel_rdma_rpoller_del_rioc(QIOChannelRDMA *rioc) {
> > > +    if (rioc->index == -1) {
> > > +        error_report("%s: rioc not exsits", __func__);
> > > +        return;
> > > +    }
> > > +
> > > +    rpoller.fds[rioc->index] = rpoller.fds[rpoller.count];
> >
> > Should this be rpoller.count-1?
> >
> No. the first element is the sockpairs' fd. Pls see qio_channel_rdma_rpoller_start():
>
>    rpoller.fds[0].fd = rpoller.sock[1];
>    rpoller.fds[0].events = POLLIN;
>
>
> Regards,
> -Gonglei
>
> > > +    rpoller.riocs[rioc->index] = rpoller.riocs[rpoller.count];
> > > +    rpoller.riocs[rioc->index]->index = rioc->index;
> > > +    rpoller.count--;
> > > +
> > > +    close(rioc->pollin_eventfd);
> > > +    close(rioc->pollout_eventfd);
> > > +    rioc->index = -1;
> > > +    rioc->rpoll_events = 0;
> > > +}
> > > +
> > > +static void qio_channel_rdma_rpoller_update_ioc(QIOChannelRDMA *rioc)
> > > +{
> > > +    if (rioc->index == -1) {
> > > +        error_report("%s: rioc not exsits", __func__);
> > > +        return;
> > > +    }
> > > +
> > > +    rpoller.fds[rioc->index].fd = rioc->fd;
> > > +    rpoller.fds[rioc->index].events = rioc->rpoll_events; }
> > > +
> > > +static void qio_channel_rdma_rpoller_process_msg(void)
> > > +{
> > > +    RpollerMsg msg;
> > > +    int ret;
> > > +
> > > +    ret = RETRY_ON_EINTR(read(rpoller.sock[1], &msg, sizeof msg));
> > > +    if (ret != sizeof msg) {
> > > +        error_report("%s: rpoller failed to recv msg: %s", __func__,
> > > +                     strerror(errno));
> > > +        return;
> > > +    }
> > > +
> > > +    switch (msg.cmd) {
> > > +    case RP_CMD_ADD_IOC:
> > > +        qio_channel_rdma_rpoller_add_rioc(msg.rioc);
> > > +        break;
> > > +    case RP_CMD_DEL_IOC:
> > > +        qio_channel_rdma_rpoller_del_rioc(msg.rioc);
> > > +        break;
> > > +    case RP_CMD_UPDATE:
> > > +        qio_channel_rdma_rpoller_update_ioc(msg.rioc);
> > > +        break;
> > > +    default:
> > > +        break;
> > > +    }
> > > +}
> > > +
> > > +static void qio_channel_rdma_rpoller_cleanup(void)
> > > +{
> > > +    close(rpoller.sock[0]);
> > > +    close(rpoller.sock[1]);
> > > +    rpoller.sock[0] = -1;
> > > +    rpoller.sock[1] = -1;
> > > +    g_free(rpoller.fds);
> > > +    g_free(rpoller.riocs);
> > > +    rpoller.fds = NULL;
> > > +    rpoller.riocs = NULL;
> > > +    rpoller.count = 0;
> > > +    rpoller.size = 0;
> > > +    rpoller.is_running = false;
> > > +}
> > > +
> > > +static void *qio_channel_rdma_rpoller_thread(void *opaque) {
> > > +    int i, ret, error_events = POLLERR | POLLHUP | POLLNVAL;
> > > +
> > > +    do {
> > > +        ret = rpoll(rpoller.fds, rpoller.count + 1, -1);
> > > +        if (ret < 0 && errno != -EINTR) {
> > > +            error_report("%s: rpoll() error: %s", __func__,
> > strerror(errno));
> > > +            break;
> > > +        }
> > > +
> > > +        for (i = 1; i <= rpoller.count; i++) {
> > > +            if (rpoller.fds[i].revents & (POLLIN | error_events)) {
> > > +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> > SET_POLLIN,
> > > +                                                   false);
> > > +                rpoller.fds[i].events &= ~POLLIN;
> > > +            }
> > > +            if (rpoller.fds[i].revents & (POLLOUT | error_events)) {
> > > +                qio_channel_rdma_update_poll_event(rpoller.riocs[i],
> > > +
> > SET_POLLOUT, false);
> > > +                rpoller.fds[i].events &= ~POLLOUT;
> > > +            }
> > > +            /* ignore this fd */
> > > +            if (rpoller.fds[i].revents & (error_events)) {
> > > +                rpoller.fds[i].fd = -1;
> > > +            }
> > > +        }
> > > +
> > > +        if (rpoller.fds[0].revents) {
> > > +            qio_channel_rdma_rpoller_process_msg();
> > > +        }
> > > +    } while (rpoller.count >= 1);
> > > +
> > > +    qio_channel_rdma_rpoller_cleanup();
> > > +
> > > +    return NULL;
> > > +}
> > > +
> > > +static void qio_channel_rdma_rpoller_start(void)
> > > +{
> > > +    if (qatomic_xchg(&rpoller.is_running, true)) {
> > > +        return;
> > > +    }
> > > +
> > > +    if (qemu_socketpair(AF_UNIX, SOCK_STREAM, 0, rpoller.sock)) {
> > > +        rpoller.is_running = false;
> > > +        error_report("%s: failed to create socketpair %s", __func__,
> > > +                     strerror(errno));
> > > +        return;
> > > +    }
> > > +
> > > +    rpoller.count = 0;
> > > +    rpoller.size = 4;
> > > +    rpoller.fds = g_malloc0_n(rpoller.size, sizeof(struct pollfd));
> > > +    rpoller.riocs = g_malloc0_n(rpoller.size, sizeof(QIOChannelRDMA *));
> > > +    rpoller.fds[0].fd = rpoller.sock[1];
> > > +    rpoller.fds[0].events = POLLIN;
> > > +
> > > +    qemu_thread_create(&rpoller.thread, "qio-channel-rdma-rpoller",
> > > +                       qio_channel_rdma_rpoller_thread, NULL,
> > > +                       QEMU_THREAD_JOINABLE); }
> > > +
> > > +static void qio_channel_rdma_add_rioc_to_rpoller(QIOChannelRDMA
> > > +*rioc) {
> > > +    int flags = EFD_CLOEXEC | EFD_NONBLOCK;
> > > +
> > > +    /*
> > > +     * A single eventfd is either readable or writable. A single eventfd
> > cannot
> > > +     * represent a state where it is neither readable nor writable. so use
> > two
> > > +     * eventfds here.
> > > +     */
> > > +    rioc->pollin_eventfd = eventfd(0, flags);
> > > +    rioc->pollout_eventfd = eventfd(0, flags);
> > > +    /* pollout_eventfd with the value 0, means writable, make it
> > unwritable */
> > > +    qio_channel_rdma_update_poll_event(rioc, CLEAR_POLLOUT, false);
> > > +
> > > +    /* tell the rpoller to rpoll() events on rioc->socketfd */
> > > +    rioc->rpoll_events = POLLIN | POLLOUT;
> > > +    qio_channel_rdma_notify_rpoller(rioc, RP_CMD_ADD_IOC); }
> > > +
> > >  QIOChannelRDMA *qio_channel_rdma_new(void)  {
> > >      QIOChannelRDMA *rioc;
> > >      QIOChannel *ioc;
> > >
> > > +    qio_channel_rdma_rpoller_start();
> > > +    if (!rpoller.is_running) {
> > > +        return NULL;
> > > +    }
> > > +
> > >      rioc =
> > QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> > >      ioc = QIO_CHANNEL(rioc);
> > >      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> > @@
> > > -125,6 +393,8 @@ retry:
> > >          goto out;
> > >      }
> > >
> > > +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> > > +
> > >  out:
> > >      if (ret) {
> > >          trace_qio_channel_rdma_connect_fail(rioc);
> > > @@ -211,6 +481,8 @@ int
> > qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress
> > *addr,
> > >      qio_channel_set_feature(QIO_CHANNEL(rioc),
> > QIO_CHANNEL_FEATURE_LISTEN);
> > >      trace_qio_channel_rdma_listen_complete(rioc, fd);
> > >
> > > +    qio_channel_rdma_add_rioc_to_rpoller(rioc);
> > > +
> > >  out:
> > >      if (ret) {
> > >          trace_qio_channel_rdma_listen_fail(rioc);
> > > @@ -267,8 +539,10 @@ void
> > qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress
> > *addr,
> > >                             qio_channel_listen_worker_free,
> > context);
> > > }
> > >
> > > -QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> > Error
> > > **errp)
> > > +QIOChannelRDMA *coroutine_mixed_fn
> > qio_channel_rdma_accept(QIOChannelRDMA *rioc,
> > > +
> > Error
> > > +**errp)
> > >  {
> > > +    QIOChannel *ioc = QIO_CHANNEL(rioc);
> > >      QIOChannelRDMA *cioc;
> > >
> > >      cioc = qio_channel_rdma_new();
> > > @@ -283,6 +557,17 @@ retry:
> > >          if (errno == EINTR) {
> > >              goto retry;
> > >          }
> > > +        if (errno == EAGAIN) {
> > > +            if (!(rioc->rpoll_events & POLLIN)) {
> > > +                qio_channel_rdma_update_poll_event(rioc,
> > CLEAR_POLLIN, true);
> > > +            }
> > > +            if (qemu_in_coroutine()) {
> > > +                qio_channel_yield(ioc, G_IO_IN);
> > > +            } else {
> > > +                qio_channel_wait(ioc, G_IO_IN);
> > > +            }
> > > +            goto retry;
> > > +        }
> > >          error_setg_errno(errp, errno, "Unable to accept connection");
> > >          goto error;
> > >      }
> > > @@ -294,6 +579,8 @@ retry:
> > >          goto error;
> > >      }
> > >
> > > +    qio_channel_rdma_add_rioc_to_rpoller(cioc);
> > > +
> > >      trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
> > >      return cioc;
> > >
> > > @@ -307,6 +594,10 @@ static void qio_channel_rdma_init(Object *obj)  {
> > >      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> > >      ioc->fd = -1;
> > > +    ioc->pollin_eventfd = -1;
> > > +    ioc->pollout_eventfd = -1;
> > > +    ioc->index = -1;
> > > +    ioc->rpoll_events = 0;
> > >  }
> > >
> > >  static void qio_channel_rdma_finalize(Object *obj) @@ -314,6 +605,7
> > > @@ static void qio_channel_rdma_finalize(Object *obj)
> > >      QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> > >
> > >      if (ioc->fd != -1) {
> > > +        qio_channel_rdma_notify_rpoller(ioc, RP_CMD_DEL_IOC);
> > >          rclose(ioc->fd);
> > >          ioc->fd = -1;
> > >      }
> > > @@ -330,6 +622,12 @@ static ssize_t
> > qio_channel_rdma_readv(QIOChannel
> > > *ioc, const struct iovec *iov,
> > >  retry:
> > >      ret = rreadv(rioc->fd, iov, niov);
> > >      if (ret < 0) {
> > > +        if (errno == EAGAIN) {
> > > +            if (!(rioc->rpoll_events & POLLIN)) {
> > > +                qio_channel_rdma_update_poll_event(rioc,
> > CLEAR_POLLIN, true);
> > > +            }
> > > +            return QIO_CHANNEL_ERR_BLOCK;
> > > +        }
> > >          if (errno == EINTR) {
> > >              goto retry;
> > >          }
> > > @@ -351,6 +649,12 @@ static ssize_t
> > qio_channel_rdma_writev(QIOChannel
> > > *ioc, const struct iovec *iov,
> > >  retry:
> > >      ret = rwritev(rioc->fd, iov, niov);
> > >      if (ret <= 0) {
> > > +        if (errno == EAGAIN) {
> > > +            if (!(rioc->rpoll_events & POLLOUT)) {
> > > +                qio_channel_rdma_update_poll_event(rioc,
> > CLEAR_POLLOUT, true);
> > > +            }
> > > +            return QIO_CHANNEL_ERR_BLOCK;
> > > +        }
> > >          if (errno == EINTR) {
> > >              goto retry;
> > >          }
> > > @@ -361,6 +665,28 @@ retry:
> > >      return ret;
> > >  }
> > >
> > > +static int qio_channel_rdma_set_blocking(QIOChannel *ioc, bool enabled,
> > > +                                         Error **errp
> > G_GNUC_UNUSED)
> > > +{
> > > +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > > +    int flags, ret;
> > > +
> > > +    flags = rfcntl(rioc->fd, F_GETFL);
> > > +    if (enabled) {
> > > +        flags &= ~O_NONBLOCK;
> > > +    } else {
> > > +        flags |= O_NONBLOCK;
> > > +    }
> > > +
> > > +    ret = rfcntl(rioc->fd, F_SETFL, flags);
> > > +    if (ret) {
> > > +        error_setg_errno(errp, errno,
> > > +                         "Unable to rfcntl rsocket fd with flags %d",
> > flags);
> > > +    }
> > > +
> > > +    return ret;
> > > +}
> > > +
> > >  static void qio_channel_rdma_set_delay(QIOChannel *ioc, bool enabled)
> > > {
> > >      QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc); @@ -374,6
> > +700,7 @@
> > > static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
> > >      QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > >
> > >      if (rioc->fd != -1) {
> > > +        qio_channel_rdma_notify_rpoller(rioc, RP_CMD_DEL_IOC);
> > >          rclose(rioc->fd);
> > >          rioc->fd = -1;
> > >      }
> > > @@ -408,6 +735,37 @@ static int qio_channel_rdma_shutdown(QIOChannel
> > *ioc, QIOChannelShutdown how,
> > >      return 0;
> > >  }
> > >
> > > +static void
> > > +qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc, AioContext
> > *read_ctx,
> > > +                                    IOHandler *io_read,
> > AioContext *write_ctx,
> > > +                                    IOHandler *io_write, void
> > > +*opaque) {
> > > +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > > +
> > > +    qio_channel_util_set_aio_fd_handler(rioc->pollin_eventfd, read_ctx,
> > io_read,
> > > +                                        rioc->pollout_eventfd,
> > write_ctx,
> > > +                                        io_write, opaque); }
> > > +
> > > +static GSource *qio_channel_rdma_create_watch(QIOChannel *ioc,
> > > +                                              GIOCondition
> > condition)
> > > +{
> > > +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> > > +
> > > +    switch (condition) {
> > > +    case G_IO_IN:
> > > +        return qio_channel_create_fd_watch(ioc, rioc->pollin_eventfd,
> > > +                                           condition);
> > > +    case G_IO_OUT:
> > > +        return qio_channel_create_fd_watch(ioc, rioc->pollout_eventfd,
> > > +                                           condition);
> > > +    default:
> > > +        error_report("%s: do not support watch 0x%x event", __func__,
> > > +                     condition);
> > > +        return NULL;
> > > +    }
> > > +}
> > > +
> > >  static void qio_channel_rdma_class_init(ObjectClass *klass,
> > >                                          void *class_data
> > > G_GNUC_UNUSED)  { @@ -415,9 +773,12 @@ static void
> > > qio_channel_rdma_class_init(ObjectClass *klass,
> > >
> > >      ioc_klass->io_writev = qio_channel_rdma_writev;
> > >      ioc_klass->io_readv = qio_channel_rdma_readv;
> > > +    ioc_klass->io_set_blocking = qio_channel_rdma_set_blocking;
> > >      ioc_klass->io_close = qio_channel_rdma_close;
> > >      ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
> > >      ioc_klass->io_set_delay = qio_channel_rdma_set_delay;
> > > +    ioc_klass->io_create_watch = qio_channel_rdma_create_watch;
> > > +    ioc_klass->io_set_aio_fd_handler =
> > > + qio_channel_rdma_set_aio_fd_handler;
> > >  }
> > >
> > >  static const TypeInfo qio_channel_rdma_info = {
> > > --
> > > 2.43.0
> > >
> > >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-04 19:32 ` [PATCH 0/6] refactor RDMA live migration based on rsocket API Peter Xu
  2024-06-05 10:09   ` Gonglei (Arei) via
@ 2024-06-07 10:06   ` Daniel P. Berrangé
  1 sibling, 0 replies; 55+ messages in thread
From: Daniel P. Berrangé @ 2024-06-07 10:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: Gonglei, qemu-devel, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang, Fabiano Rosas

On Tue, Jun 04, 2024 at 03:32:19PM -0400, Peter Xu wrote:
> Hi, Lei, Jialin,
> 
> Thanks a lot for working on this!
> 
> I think we'll need to wait a bit on feedbacks from Jinpu and his team on
> RDMA side, also Daniel for iochannels.  Also, please remember to copy
> Fabiano Rosas in any relevant future posts.  We'd also like to know whether
> he has any comments too.  I have him copied in this reply.

I've not formally reviewed it, but I had a quick glance through the
I/O channel patches and they all look sensible. Pretty much  exactly
what I was hoping it would end up looking like.

> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory copies,
> > which can be imagined that there will be a certain performance degradation,
> > hoping that friends with RDMA network cards can help verify, thank you!
> 
> It'll be good to elaborate if you tested it in-house. What people should
> expect on the numbers exactly?  Is that okay from Huawei's POV?
> 
> Besides that, the code looks pretty good at a first glance to me.  Before

snip

> Personally I think even with the thread proposal it's better than the old
> rdma code, but I just still want to double check with you guys.  E.g.,
> maybe that just won't work at all?  Again, that'll also be based on the
> fact that we move migration incoming into a thread first to keep the dest
> QEMU main loop intact, I think, but I hope we will reach that irrelevant of
> rdma, IOW it'll be nice to happen even earlier if possible.

Yes, from the migration code POV, this is a massive step forward - the
RDMA integration is no completely trivial for migration code.

The $million question is what the performance of this new implmentation
looks like on real hardware. As mentioned above the extra memory copies
will probably hurt performance compared to the old version. We need the
performance of the new RDMA impl to still be better than the plain TCP
sockets backend to make it worthwhile having RDMA.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-05 10:00   ` Gonglei (Arei) via
  2024-06-05 10:23     ` Michael S. Tsirkin
  2024-06-06 11:31     ` Leon Romanovsky
@ 2024-06-07 16:24     ` Yu Zhang
  2 siblings, 0 replies; 55+ messages in thread
From: Yu Zhang @ 2024-06-07 16:24 UTC (permalink / raw)
  To: Gonglei (Arei), Peter Xu, Michael Galaxy, Jinpu Wang,
	Elmar Gerdes
  Cc: qemu-devel@nongnu.org, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), Wangjialin, Fabiano Rosas

Hello Gonglei,

Jinpu and I have tested your patchset by using our migration test
cases on the physical RDMA cards. The result is: among 59 migration
test cases, 10 failed. They are successful when using the original
RDMA migration coed, but always fail when using the patchset. The
syslog on the source server shows an error below:

Jun  6 13:35:20 ps402a-43 WARN: Migration failed
uuid="44449999-3333-48dc-9082-1b6950e74ee1"
target=2a02:247f:401:2:2:0:a:2c error=Failed(Unable to write to
rsocket: Connection reset by peer)

We also tried to compare the migration speed between w/o the patchset.
Without the patchset, a big VM (with 16 cores, 64 GB memory) stressed
with heavy memory workload can be migrated successfully. With the
patchset, only a small idle VM (1-2 cores, 2-4 GB memory) can be
migrated successfully. In each failed migration, the above error is
issued on the source server.

Therefore, I assume that this version is not yet quite capable of
handling heavy load yet. I'm also looking in the code to see if
anything can be improved. We really appreciate your excellent work!

Best regards,
Yu Zhang @ IONOS cloud

On Wed, Jun 5, 2024 at 12:00 PM Gonglei (Arei) <arei.gonglei@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst@redhat.com]
> > Sent: Wednesday, June 5, 2024 3:57 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> >
> > On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> >
> > So you didn't test it with an RDMA card?
>
> Yep, we tested it by Soft-ROCE.
>
> > You really should test with an RDMA card though, for correctness as much as
> > performance.
> >
> We will, we just don't have RDMA cards environment on hand at the moment.
>
> Regards,
> -Gonglei
>
> >
> > > Jialin Wang (6):
> > >   migration: remove RDMA live migration temporarily
> > >   io: add QIOChannelRDMA class
> > >   io/channel-rdma: support working in coroutine
> > >   tests/unit: add test-io-channel-rdma.c
> > >   migration: introduce new RDMA live migration
> > >   migration/rdma: support multifd for RDMA migration
> > >
> > >  docs/rdma.txt                     |  420 ---
> > >  include/io/channel-rdma.h         |  165 ++
> > >  io/channel-rdma.c                 |  798 ++++++
> > >  io/meson.build                    |    1 +
> > >  io/trace-events                   |   14 +
> > >  meson.build                       |    6 -
> > >  migration/meson.build             |    3 +-
> > >  migration/migration-stats.c       |    5 +-
> > >  migration/migration-stats.h       |    4 -
> > >  migration/migration.c             |   13 +-
> > >  migration/migration.h             |    9 -
> > >  migration/multifd.c               |   10 +
> > >  migration/options.c               |   16 -
> > >  migration/options.h               |    2 -
> > >  migration/qemu-file.c             |    1 -
> > >  migration/ram.c                   |   90 +-
> > >  migration/rdma.c                  | 4205 +----------------------------
> > >  migration/rdma.h                  |   67 +-
> > >  migration/savevm.c                |    2 +-
> > >  migration/trace-events            |   68 +-
> > >  qapi/migration.json               |   13 +-
> > >  scripts/analyze-migration.py      |    3 -
> > >  tests/unit/meson.build            |    1 +
> > >  tests/unit/test-io-channel-rdma.c |  276 ++
> > >  24 files changed, 1360 insertions(+), 4832 deletions(-)  delete mode
> > > 100644 docs/rdma.txt  create mode 100644 include/io/channel-rdma.h
> > > create mode 100644 io/channel-rdma.c  create mode 100644
> > > tests/unit/test-io-channel-rdma.c
> > >
> > > --
> > > 2.43.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/6] io: add QIOChannelRDMA class
  2024-06-04 12:14 ` [PATCH 2/6] io: add QIOChannelRDMA class Gonglei via
@ 2024-06-10  6:54   ` Jinpu Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Jinpu Wang @ 2024-06-10  6:54 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	berrange, armbru, lizhijian, pbonzini, mst, xiexiangyou,
	linux-rdma, lixiao91, Jialin Wang

On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
>
> From: Jialin Wang <wangjialin23@huawei.com>
>
> Implement a QIOChannelRDMA subclass that is based on the rsocket
> API (similar to socket API).
>
> Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>
> ---
>  include/io/channel-rdma.h | 152 +++++++++++++
>  io/channel-rdma.c         | 437 ++++++++++++++++++++++++++++++++++++++
>  io/meson.build            |   1 +
>  io/trace-events           |  14 ++
>  4 files changed, 604 insertions(+)
>  create mode 100644 include/io/channel-rdma.h
>  create mode 100644 io/channel-rdma.c
>
> diff --git a/include/io/channel-rdma.h b/include/io/channel-rdma.h
> new file mode 100644
> index 0000000000..8cab2459e5
> --- /dev/null
> +++ b/include/io/channel-rdma.h
> @@ -0,0 +1,152 @@
> +/*
> + * QEMU I/O channels RDMA driver
> + *
> + * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
> + *
> + * Authors:
> + *  Jialin Wang <wangjialin23@huawei.com>
> + *  Gonglei <arei.gonglei@huawei.com>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#ifndef QIO_CHANNEL_RDMA_H
> +#define QIO_CHANNEL_RDMA_H
> +
> +#include "io/channel.h"
> +#include "io/task.h"
> +#include "qemu/sockets.h"
> +#include "qom/object.h"
> +
> +#define TYPE_QIO_CHANNEL_RDMA "qio-channel-rdma"
> +OBJECT_DECLARE_SIMPLE_TYPE(QIOChannelRDMA, QIO_CHANNEL_RDMA)
> +
> +/**
> + * QIOChannelRDMA:
> + *
> + * The QIOChannelRDMA object provides a channel implementation
> + * that discards all writes and returns EOF for all reads.
> + */
> +struct QIOChannelRDMA {
> +    QIOChannel parent;
> +    /* the rsocket fd */
> +    int fd;
> +
> +    struct sockaddr_storage localAddr;
> +    socklen_t localAddrLen;
> +    struct sockaddr_storage remoteAddr;
> +    socklen_t remoteAddrLen;
> +};
> +
> +/**
> + * qio_channel_rdma_new:
> + *
> + * Create a channel for performing I/O on a rdma
> + * connection, that is initially closed. After
> + * creating the rdma, it must be setup as a client
> + * connection or server.
> + *
> + * Returns: the rdma channel object
> + */
> +QIOChannelRDMA *qio_channel_rdma_new(void);
> +
> +/**
> + * qio_channel_rdma_connect_sync:
> + * @ioc: the rdma channel object
> + * @addr: the address to connect to
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Attempt to connect to the address @addr. This method
> + * will run in the foreground so the caller will not regain
> + * execution control until the connection is established or
> + * an error occurs.
> + */
> +int qio_channel_rdma_connect_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> +                                  Error **errp);
> +
> +/**
> + * qio_channel_rdma_connect_async:
> + * @ioc: the rdma channel object
> + * @addr: the address to connect to
> + * @callback: the function to invoke on completion
> + * @opaque: user data to pass to @callback
> + * @destroy: the function to free @opaque
> + * @context: the context to run the async task. If %NULL, the default
> + *           context will be used.
> + *
> + * Attempt to connect to the address @addr. This method
> + * will run in the background so the caller will regain
> + * execution control immediately. The function @callback
> + * will be invoked on completion or failure. The @addr
> + * parameter will be copied, so may be freed as soon
> + * as this function returns without waiting for completion.
> + */
> +void qio_channel_rdma_connect_async(QIOChannelRDMA *ioc,
> +                                    InetSocketAddress *addr,
> +                                    QIOTaskFunc callback, gpointer opaque,
> +                                    GDestroyNotify destroy,
> +                                    GMainContext *context);
> +
> +/**
> + * qio_channel_rdma_listen_sync:
> + * @ioc: the rdma channel object
> + * @addr: the address to listen to
> + * @num: the expected amount of connections
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Attempt to listen to the address @addr. This method
> + * will run in the foreground so the caller will not regain
> + * execution control until the connection is established or
> + * an error occurs.
> + */
> +int qio_channel_rdma_listen_sync(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> +                                 int num, Error **errp);
> +
> +/**
> + * qio_channel_rdma_listen_async:
> + * @ioc: the rdma channel object
> + * @addr: the address to listen to
> + * @num: the expected amount of connections
> + * @callback: the function to invoke on completion
> + * @opaque: user data to pass to @callback
> + * @destroy: the function to free @opaque
> + * @context: the context to run the async task. If %NULL, the default
> + *           context will be used.
> + *
> + * Attempt to listen to the address @addr. This method
> + * will run in the background so the caller will regain
> + * execution control immediately. The function @callback
> + * will be invoked on completion or failure. The @addr
> + * parameter will be copied, so may be freed as soon
> + * as this function returns without waiting for completion.
> + */
> +void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> +                                   int num, QIOTaskFunc callback,
> +                                   gpointer opaque, GDestroyNotify destroy,
> +                                   GMainContext *context);
> +
> +/**
> + * qio_channel_rdma_accept:
> + * @ioc: the rdma channel object
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * If the rdma represents a server, then this accepts
> + * a new client connection. The returned channel will
> + * represent the connected client rdma.
> + *
> + * Returns: the new client channel, or NULL on error
> + */
> +QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *ioc, Error **errp);
> +
> +#endif /* QIO_CHANNEL_RDMA_H */
> diff --git a/io/channel-rdma.c b/io/channel-rdma.c
> new file mode 100644
> index 0000000000..92c362df52
> --- /dev/null
> +++ b/io/channel-rdma.c
> @@ -0,0 +1,437 @@
> +/*
> + * QEMU I/O channels RDMA driver
> + *
> + * Copyright (c) 2024 HUAWEI TECHNOLOGIES CO., LTD.
> + *
> + * Authors:
> + *  Jialin Wang <wangjialin23@huawei.com>
> + *  Gonglei <arei.gonglei@huawei.com>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "io/channel-rdma.h"
> +#include "io/channel.h"
> +#include "qapi/clone-visitor.h"
> +#include "qapi/error.h"
> +#include "qapi/qapi-visit-sockets.h"
> +#include "trace.h"
> +#include <errno.h>
> +#include <netdb.h>
> +#include <rdma/rsocket.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/eventfd.h>
> +#include <sys/poll.h>
> +#include <unistd.h>
> +
> +QIOChannelRDMA *qio_channel_rdma_new(void)
> +{
> +    QIOChannelRDMA *rioc;
> +    QIOChannel *ioc;
> +
> +    rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> +    ioc = QIO_CHANNEL(rioc);
> +    qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> +
> +    trace_qio_channel_rdma_new(ioc);
> +
> +    return rioc;
> +}
> +
> +static int qio_channel_rdma_set_fd(QIOChannelRDMA *rioc, int fd, Error **errp)
> +{
> +    if (rioc->fd != -1) {
> +        error_setg(errp, "Socket is already open");
> +        return -1;
> +    }
> +
> +    rioc->fd = fd;
> +    rioc->remoteAddrLen = sizeof(rioc->remoteAddr);
> +    rioc->localAddrLen = sizeof(rioc->localAddr);
> +
> +    if (rgetpeername(fd, (struct sockaddr *)&rioc->remoteAddr,
> +                     &rioc->remoteAddrLen) < 0) {
> +        if (errno == ENOTCONN) {
> +            memset(&rioc->remoteAddr, 0, sizeof(rioc->remoteAddr));
> +            rioc->remoteAddrLen = sizeof(rioc->remoteAddr);
> +        } else {
> +            error_setg_errno(errp, errno,
> +                             "Unable to query remote rsocket address");
> +            goto error;
> +        }
> +    }
> +
> +    if (rgetsockname(fd, (struct sockaddr *)&rioc->localAddr,
> +                     &rioc->localAddrLen) < 0) {
> +        error_setg_errno(errp, errno, "Unable to query local rsocket address");
> +        goto error;
> +    }
> +
> +    return 0;
> +
> +error:
> +    rioc->fd = -1; /* Let the caller close FD on failure */
> +    return -1;
> +}
> +
> +int qio_channel_rdma_connect_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
> +                                  Error **errp)
> +{
> +    int ret, fd = -1;
> +    struct rdma_addrinfo *ai;
> +
> +    trace_qio_channel_rdma_connect_sync(rioc, addr);
> +    ret = rdma_getaddrinfo(addr->host, addr->port, NULL, &ai);
> +    if (ret) {
> +        error_setg(errp, "Failed to rdma_getaddrinfo: %s", gai_strerror(ret));
> +        goto out;
> +    }
> +
> +    fd = rsocket(ai->ai_family, SOCK_STREAM, 0);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "Failed to create rsocket");
> +        goto out;
> +    }
> +    qemu_set_cloexec(fd);
> +
> +retry:
> +    ret = rconnect(fd, ai->ai_dst_addr, ai->ai_dst_len);
> +    if (ret) {
> +        if (errno == EINTR) {
> +            goto retry;
> +        }
> +        error_setg_errno(errp, errno, "Failed to rconnect");
> +        goto out;
> +    }
> +
> +    trace_qio_channel_rdma_connect_complete(rioc, fd);
> +    ret = qio_channel_rdma_set_fd(rioc, fd, errp);
> +    if (ret) {
> +        goto out;
> +    }
> +
> +out:
> +    if (ret) {
> +        trace_qio_channel_rdma_connect_fail(rioc);
> +        if (fd >= 0) {
> +            rclose(fd);
> +        }
> +    }
> +    if (ai) {
> +        rdma_freeaddrinfo(ai);
> +    }
> +
> +    return ret;
> +}
> +
> +static void qio_channel_rdma_connect_worker(QIOTask *task, gpointer opaque)
> +{
> +    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(qio_task_get_source(task));
> +    InetSocketAddress *addr = opaque;
> +    Error *err = NULL;
> +
> +    qio_channel_rdma_connect_sync(ioc, addr, &err);
> +
> +    qio_task_set_error(task, err);
> +}
> +
> +void qio_channel_rdma_connect_async(QIOChannelRDMA *ioc,
> +                                    InetSocketAddress *addr,
> +                                    QIOTaskFunc callback, gpointer opaque,
> +                                    GDestroyNotify destroy,
> +                                    GMainContext *context)
> +{
> +    QIOTask *task = qio_task_new(OBJECT(ioc), callback, opaque, destroy);
> +    InetSocketAddress *addrCopy;
> +
> +    addrCopy = QAPI_CLONE(InetSocketAddress, addr);
> +
> +    /* rdma_getaddrinfo() blocks in DNS lookups, so we must use a thread */
> +    trace_qio_channel_rdma_connect_async(ioc, addr);
> +    qio_task_run_in_thread(task, qio_channel_rdma_connect_worker, addrCopy,
> +                           (GDestroyNotify)qapi_free_InetSocketAddress,
> +                           context);
> +}
> +
> +int qio_channel_rdma_listen_sync(QIOChannelRDMA *rioc, InetSocketAddress *addr,
> +                                 int num, Error **errp)
> +{
> +    int ret, fd = -1;
> +    struct rdma_addrinfo *ai;
> +    struct rdma_addrinfo ai_hints = { 0 };
> +
> +    trace_qio_channel_rdma_listen_sync(rioc, addr, num);
> +    ai_hints.ai_port_space = RDMA_PS_TCP;
> +    ai_hints.ai_flags |= RAI_PASSIVE;
> +    ret = rdma_getaddrinfo(addr->host, addr->port, &ai_hints, &ai);
> +    if (ret) {
> +        error_setg(errp, "Failed to rdma_getaddrinfo: %s", gai_strerror(ret));
> +        goto out;
> +    }
> +
> +    fd = rsocket(ai->ai_family, SOCK_STREAM, 0);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "Failed to create rsocket");
> +        goto out;
> +    }
> +    qemu_set_cloexec(fd);
> +
> +    ret = rbind(fd, ai->ai_src_addr, ai->ai_src_len);
> +    if (ret) {
> +        error_setg_errno(errp, errno, "Failed to rbind");
> +        goto out;
> +    }
> +
> +    ret = rlisten(fd, num);
> +    if (ret) {
> +        error_setg_errno(errp, errno, "Failed to rlisten");
> +        goto out;
> +    }
> +
> +    ret = qio_channel_rdma_set_fd(rioc, fd, errp);
> +    if (ret) {
> +        goto out;
> +    }
> +
> +    qio_channel_set_feature(QIO_CHANNEL(rioc), QIO_CHANNEL_FEATURE_LISTEN);
> +    trace_qio_channel_rdma_listen_complete(rioc, fd);
> +
> +out:
> +    if (ret) {
> +        trace_qio_channel_rdma_listen_fail(rioc);
> +        if (fd >= 0) {
> +            rclose(fd);
> +        }
> +    }
> +    if (ai) {
> +        rdma_freeaddrinfo(ai);
> +    }
> +
> +    return ret;
> +}
> +
> +struct QIOChannelListenWorkerData {
> +    InetSocketAddress *addr;
> +    int num; /* amount of expected connections */
> +};
> +
> +static void qio_channel_listen_worker_free(gpointer opaque)
> +{
> +    struct QIOChannelListenWorkerData *data = opaque;
> +
> +    qapi_free_InetSocketAddress(data->addr);
> +    g_free(data);
> +}
> +
> +static void qio_channel_rdma_listen_worker(QIOTask *task, gpointer opaque)
> +{
> +    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(qio_task_get_source(task));
> +    struct QIOChannelListenWorkerData *data = opaque;
> +    Error *err = NULL;
> +
> +    qio_channel_rdma_listen_sync(ioc, data->addr, data->num, &err);
> +
> +    qio_task_set_error(task, err);
> +}
> +
> +void qio_channel_rdma_listen_async(QIOChannelRDMA *ioc, InetSocketAddress *addr,
> +                                   int num, QIOTaskFunc callback,
> +                                   gpointer opaque, GDestroyNotify destroy,
> +                                   GMainContext *context)
> +{
> +    QIOTask *task = qio_task_new(OBJECT(ioc), callback, opaque, destroy);
> +    struct QIOChannelListenWorkerData *data;
> +
> +    data = g_new0(struct QIOChannelListenWorkerData, 1);
> +    data->addr = QAPI_CLONE(InetSocketAddress, addr);
> +    data->num = num;
> +
> +    /* rdma_getaddrinfo() blocks in DNS lookups, so we must use a thread */
> +    trace_qio_channel_rdma_listen_async(ioc, addr, num);
> +    qio_task_run_in_thread(task, qio_channel_rdma_listen_worker, data,
> +                           qio_channel_listen_worker_free, context);
> +}
> +
> +QIOChannelRDMA *qio_channel_rdma_accept(QIOChannelRDMA *rioc, Error **errp)
> +{
> +    QIOChannelRDMA *cioc;
> +
> +    cioc = qio_channel_rdma_new();
> +    cioc->remoteAddrLen = sizeof(rioc->remoteAddr);
> +    cioc->localAddrLen = sizeof(rioc->localAddr);
> +
> +    trace_qio_channel_rdma_accept(rioc);
> +retry:
> +    cioc->fd = raccept(rioc->fd, (struct sockaddr *)&cioc->remoteAddr,
> +                       &cioc->remoteAddrLen);
> +    if (cioc->fd < 0) {
> +        if (errno == EINTR) {
> +            goto retry;
> +        }
> +        error_setg_errno(errp, errno, "Unable to accept connection");
> +        goto error;
> +    }
> +    qemu_set_cloexec(cioc->fd);
> +
> +    if (rgetsockname(cioc->fd, (struct sockaddr *)&cioc->localAddr,
> +                     &cioc->localAddrLen) < 0) {
> +        error_setg_errno(errp, errno, "Unable to query local rsocket address");
> +        goto error;
> +    }
> +
> +    trace_qio_channel_rdma_accept_complete(rioc, cioc, cioc->fd);
> +    return cioc;
> +
> +error:
> +    trace_qio_channel_rdma_accept_fail(rioc);
> +    object_unref(OBJECT(cioc));
> +    return NULL;
> +}
> +
> +static void qio_channel_rdma_init(Object *obj)
> +{
> +    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> +    ioc->fd = -1;
> +}
> +
> +static void qio_channel_rdma_finalize(Object *obj)
> +{
> +    QIOChannelRDMA *ioc = QIO_CHANNEL_RDMA(obj);
> +
> +    if (ioc->fd != -1) {
> +        rclose(ioc->fd);
> +        ioc->fd = -1;
> +    }
> +}
> +
> +static ssize_t qio_channel_rdma_readv(QIOChannel *ioc, const struct iovec *iov,
> +                                      size_t niov, int **fds G_GNUC_UNUSED,
> +                                      size_t *nfds G_GNUC_UNUSED,
> +                                      int flags G_GNUC_UNUSED, Error **errp)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +    ssize_t ret;
> +
> +retry:
> +    ret = rreadv(rioc->fd, iov, niov);
> +    if (ret < 0) {
> +        if (errno == EINTR) {
> +            goto retry;
> +        }
> +        error_setg_errno(errp, errno, "Unable to write to rsocket");
This is a typo. s/write/read.
> +        return -1;
> +    }
> +
> +    return ret;
> +}
> +
> +static ssize_t qio_channel_rdma_writev(QIOChannel *ioc, const struct iovec *iov,
> +                                       size_t niov, int *fds G_GNUC_UNUSED,
> +                                       size_t nfds G_GNUC_UNUSED,
> +                                       int flags G_GNUC_UNUSED, Error **errp)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +    ssize_t ret;
> +
> +retry:
> +    ret = rwritev(rioc->fd, iov, niov);
> +    if (ret <= 0) {
> +        if (errno == EINTR) {
> +            goto retry;
> +        }
> +        error_setg_errno(errp, errno, "Unable to write to rsocket");
> +        return -1;
> +    }
> +
> +    return ret;
> +}
> +
> +static void qio_channel_rdma_set_delay(QIOChannel *ioc, bool enabled)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +    int v = enabled ? 0 : 1;
> +
> +    rsetsockopt(rioc->fd, IPPROTO_TCP, TCP_NODELAY, &v, sizeof(v));
> +}
> +
> +static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +
> +    if (rioc->fd != -1) {
> +        rclose(rioc->fd);
> +        rioc->fd = -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static int qio_channel_rdma_shutdown(QIOChannel *ioc, QIOChannelShutdown how,
> +                                     Error **errp)
> +{
> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
> +    int sockhow;
> +
> +    switch (how) {
> +    case QIO_CHANNEL_SHUTDOWN_READ:
> +        sockhow = SHUT_RD;
> +        break;
> +    case QIO_CHANNEL_SHUTDOWN_WRITE:
> +        sockhow = SHUT_WR;
> +        break;
> +    case QIO_CHANNEL_SHUTDOWN_BOTH:
> +    default:
> +        sockhow = SHUT_RDWR;
> +        break;
> +    }
> +
> +    if (rshutdown(rioc->fd, sockhow) < 0) {
> +        error_setg_errno(errp, errno, "Unable to shutdown rsocket");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static void qio_channel_rdma_class_init(ObjectClass *klass,
> +                                        void *class_data G_GNUC_UNUSED)
> +{
> +    QIOChannelClass *ioc_klass = QIO_CHANNEL_CLASS(klass);
> +
> +    ioc_klass->io_writev = qio_channel_rdma_writev;
> +    ioc_klass->io_readv = qio_channel_rdma_readv;
> +    ioc_klass->io_close = qio_channel_rdma_close;
> +    ioc_klass->io_shutdown = qio_channel_rdma_shutdown;
> +    ioc_klass->io_set_delay = qio_channel_rdma_set_delay;
> +}
> +
> +static const TypeInfo qio_channel_rdma_info = {
> +    .parent = TYPE_QIO_CHANNEL,
> +    .name = TYPE_QIO_CHANNEL_RDMA,
> +    .instance_size = sizeof(QIOChannelRDMA),
> +    .instance_init = qio_channel_rdma_init,
> +    .instance_finalize = qio_channel_rdma_finalize,
> +    .class_init = qio_channel_rdma_class_init,
> +};
> +
> +static void qio_channel_rdma_register_types(void)
> +{
> +    type_register_static(&qio_channel_rdma_info);
> +}
> +
> +type_init(qio_channel_rdma_register_types);
> diff --git a/io/meson.build b/io/meson.build
> index 283b9b2bdb..e0dbd5183f 100644
> --- a/io/meson.build
> +++ b/io/meson.build
> @@ -14,3 +14,4 @@ io_ss.add(files(
>    'net-listener.c',
>    'task.c',
>  ), gnutls)
> +io_ss.add(when: rdma, if_true: files('channel-rdma.c'))
> diff --git a/io/trace-events b/io/trace-events
> index d4c0f84a9a..33026a2224 100644
> --- a/io/trace-events
> +++ b/io/trace-events
> @@ -67,3 +67,17 @@ qio_channel_command_new_pid(void *ioc, int writefd, int readfd, int pid) "Comman
>  qio_channel_command_new_spawn(void *ioc, const char *binary, int flags) "Command new spawn ioc=%p binary=%s flags=%d"
>  qio_channel_command_abort(void *ioc, int pid) "Command abort ioc=%p pid=%d"
>  qio_channel_command_wait(void *ioc, int pid, int ret, int status) "Command abort ioc=%p pid=%d ret=%d status=%d"
> +
> +# channel-rdma.c
> +qio_channel_rdma_new(void *ioc) "RDMA rsocket new ioc=%p"
> +qio_channel_rdma_connect_sync(void *ioc, void *addr) "RDMA rsocket connect sync ioc=%p addr=%p"
> +qio_channel_rdma_connect_async(void *ioc, void *addr) "RDMA rsocket connect async ioc=%p addr=%p"
> +qio_channel_rdma_connect_fail(void *ioc) "RDMA rsocket connect fail ioc=%p"
> +qio_channel_rdma_connect_complete(void *ioc, int fd) "RDMA rsocket connect complete ioc=%p fd=%d"
> +qio_channel_rdma_listen_sync(void *ioc, void *addr, int num) "RDMA rsocket listen sync ioc=%p addr=%p num=%d"
> +qio_channel_rdma_listen_fail(void *ioc) "RDMA rsocket listen fail ioc=%p"
> +qio_channel_rdma_listen_async(void *ioc, void *addr, int num) "RDMA rsocket listen async ioc=%p addr=%p num=%d"
> +qio_channel_rdma_listen_complete(void *ioc, int fd) "RDMA rsocket listen complete ioc=%p fd=%d"
> +qio_channel_rdma_accept(void *ioc) "Socket accept start ioc=%p"
> +qio_channel_rdma_accept_fail(void *ioc) "RDMA rsocket accept fail ioc=%p"
> +qio_channel_rdma_accept_complete(void *ioc, void *cioc, int fd) "RDMA rsocket accept complete ioc=%p cioc=%p fd=%d"
> --
> 2.43.0
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/6] migration: remove RDMA live migration temporarily
  2024-06-04 12:14 ` [PATCH 1/6] migration: remove RDMA live migration temporarily Gonglei via
  2024-06-04 14:01   ` David Hildenbrand
@ 2024-06-10 11:45   ` Markus Armbruster
  1 sibling, 0 replies; 55+ messages in thread
From: Markus Armbruster @ 2024-06-10 11:45 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, peterx, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	berrange, armbru, lizhijian, pbonzini, mst, xiexiangyou,
	linux-rdma, lixiao91, jinpu.wang, Jialin Wang

Gonglei <arei.gonglei@huawei.com> writes:

> From: Jialin Wang <wangjialin23@huawei.com>
>
> The new RDMA live migration will be introduced in the upcoming
> few commits.
>
> Signed-off-by: Jialin Wang <wangjialin23@huawei.com>
> Signed-off-by: Gonglei <arei.gonglei@huawei.com>

[...]

> diff --git a/qapi/migration.json b/qapi/migration.json
> index a351fd3714..4d7d49bfec 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -210,9 +210,9 @@
>  #
>  # @setup-time: amount of setup time in milliseconds *before* the
>  #     iterations begin but *after* the QMP command is issued.  This is
> -#     designed to provide an accounting of any activities (such as
> -#     RDMA pinning) which may be expensive, but do not actually occur
> -#     during the iterative migration rounds themselves.  (since 1.6)
> +#     designed to provide an accounting of any activities which may be
> +#     expensive, but do not actually occur during the iterative migration
> +#     rounds themselves.  (since 1.6)

I guess the new RDMA migration code will not do RDMA pinning.  Correct?

>  #
>  # @cpu-throttle-percentage: percentage of time guest cpus are being
>  #     throttled during auto-converge.  This is only present when
> @@ -378,10 +378,6 @@
>  #     for certain work loads, by sending compressed difference of the
>  #     pages
>  #
> -# @rdma-pin-all: Controls whether or not the entire VM memory
> -#     footprint is mlock()'d on demand or all at once.  Refer to
> -#     docs/rdma.txt for usage.  Disabled by default.  (since 2.0)
> -#
>  # @zero-blocks: During storage migration encode blocks of zeroes
>  #     efficiently.  This essentially saves 1MB of zeroes per block on
>  #     the wire.  Enabling requires source and target VM to support
> @@ -476,7 +472,7 @@
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> +  'data': ['xbzrle', 'auto-converge', 'zero-blocks',
>             'events', 'postcopy-ram',
>             { 'name': 'x-colo', 'features': [ 'unstable' ] },
>             'release-ram',

I guess you remove @rdma-pin-all, because it makes no sense with the new
migration code.  However, this is an incompatible change.

Here's the orderly way to remove it:

1. Document it doesn't do anything anymore, and deprecate it.

2. Remove after the deprecation grace period (two releases, see
docs/about/deprecated.rst.

> @@ -533,7 +529,6 @@
>  #     -> { "execute": "query-migrate-capabilities" }
>  #     <- { "return": [
>  #           {"state": false, "capability": "xbzrle"},
> -#           {"state": false, "capability": "rdma-pin-all"},
>  #           {"state": false, "capability": "auto-converge"},
>  #           {"state": false, "capability": "zero-blocks"},
>  #           {"state": true, "capability": "events"},

[...]



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-07  8:28   ` Gonglei (Arei) via
@ 2024-06-10 16:31     ` Peter Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Xu @ 2024-06-10 16:31 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Jinpu Wang, qemu-devel@nongnu.org, yu.zhang@ionos.com,
	mgalaxy@akamai.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, mst@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), Wangjialin

On Fri, Jun 07, 2024 at 08:28:29AM +0000, Gonglei (Arei) wrote:
> 
> 
> > -----Original Message-----
> > From: Jinpu Wang [mailto:jinpu.wang@ionos.com]
> > Sent: Friday, June 7, 2024 1:54 PM
> > To: Gonglei (Arei) <arei.gonglei@huawei.com>
> > Cc: qemu-devel@nongnu.org; peterx@redhat.com; yu.zhang@ionos.com;
> > mgalaxy@akamai.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; mst@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; Wangjialin <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi Gonglei, hi folks on the list,
> > 
> > On Tue, Jun 4, 2024 at 2:14 PM Gonglei <arei.gonglei@huawei.com> wrote:
> > >
> > > From: Jialin Wang <wangjialin23@huawei.com>
> > >
> > > Hi,
> > >
> > > This patch series attempts to refactor RDMA live migration by
> > > introducing a new QIOChannelRDMA class based on the rsocket API.
> > >
> > > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > > that is a 1-1 match of the normal kernel 'sockets' API, which hides
> > > the detail of rdma protocol into rsocket and allows us to add support
> > > for some modern features like multifd more easily.
> > >
> > > Here is the previous discussion on refactoring RDMA live migration
> > > using the rsocket API:
> > >
> > > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linar
> > > o.org/
> > >
> > > We have encountered some bugs when using rsocket and plan to submit
> > > them to the rdma-core community.
> > >
> > > In addition, the use of rsocket makes our programming more convenient,
> > > but it must be noted that this method introduces multiple memory
> > > copies, which can be imagined that there will be a certain performance
> > > degradation, hoping that friends with RDMA network cards can help verify,
> > thank you!
> > First thx for the effort, we are running migration tests on our IB fabric, different
> > generation of HCA from mellanox, the migration works ok, there are a few
> > failures,  Yu will share the result later separately.
> > 
> 
> Thank you so much. 
> 
> > The one blocker for the change is the old implementation and the new rsocket
> > implementation; they don't talk to each other due to the effect of different wire
> > protocol during connection establishment.
> > eg the old RDMA migration has special control message during the migration
> > flow, which rsocket use a different control message, so there lead to no way to
> > migrate VM using rdma transport pre to the rsocket patchset to a new version
> > with rsocket implementation.
> > 
> > Probably we should keep both implementation for a while, mark the old
> > implementation as deprecated, and promote the new implementation, and
> > high light in doc, they are not compatible.
> > 
> 
> IMO It makes sense. What's your opinion? @Peter.

Sounds good to me.  We can use an internal property field and enable
rsocket rdma migration on new machine types with rdma protocol, deprecating
both old rdma and that internal field after 2 releases.  So that when
receiving rdma migrations it'll use old property (as old qemu will use old
machine types), but when initiating rdma migration on new binary it'll
switch to rsocket.

It might be more important to address either the failures or perf concerns
that others raised, though.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-07  8:49       ` Gonglei (Arei) via
@ 2024-06-10 16:35         ` Peter Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Xu @ 2024-06-10 16:35 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: qemu-devel@nongnu.org, yu.zhang@ionos.com, mgalaxy@akamai.com,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	mst@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin, Fabiano Rosas

On Fri, Jun 07, 2024 at 08:49:01AM +0000, Gonglei (Arei) wrote:
> Actually we tried this solution, but it didn't work. Pls see patch 3/6
> 
> Known limitations: 
>   For a blocking rsocket fd, if we use io_create_watch to wait for
>   POLLIN or POLLOUT events, since the rsocket fd is blocking, we
>   cannot determine when it is not ready to read/write as we can with
>   non-blocking fds. Therefore, when an event occurs, it will occurs
>   always, potentially leave the qemu hanging. So we need be cautious
>   to avoid hanging when using io_create_watch .

I'm not sure I fully get that part, though.  In:

https://lore.kernel.org/all/ZldY21xVExtiMddB@x1n/

I was thinking of iochannel implements its own poll with the _POLL flag, so
in that case it'll call qio_channel_poll() which should call rpoll()
directly. So I didn't expect using qio_channel_create_watch().  I thought
the context was gmainloop won't work with rsocket fds in general, but maybe
I missed something.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
                   ` (8 preceding siblings ...)
  2024-06-07  5:53 ` Jinpu Wang
@ 2024-08-27 20:15 ` Peter Xu
  2024-08-27 20:57   ` Michael S. Tsirkin
  9 siblings, 1 reply; 55+ messages in thread
From: Peter Xu @ 2024-08-27 20:15 UTC (permalink / raw)
  To: Gonglei
  Cc: qemu-devel, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, mst, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang

On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> From: Jialin Wang <wangjialin23@huawei.com>
> 
> Hi,
> 
> This patch series attempts to refactor RDMA live migration by
> introducing a new QIOChannelRDMA class based on the rsocket API.
> 
> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> detail of rdma protocol into rsocket and allows us to add support for
> some modern features like multifd more easily.
> 
> Here is the previous discussion on refactoring RDMA live migration using
> the rsocket API:
> 
> https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> 
> We have encountered some bugs when using rsocket and plan to submit them to
> the rdma-core community.
> 
> In addition, the use of rsocket makes our programming more convenient,
> but it must be noted that this method introduces multiple memory copies,
> which can be imagined that there will be a certain performance degradation,
> hoping that friends with RDMA network cards can help verify, thank you!
> 
> Jialin Wang (6):
>   migration: remove RDMA live migration temporarily
>   io: add QIOChannelRDMA class
>   io/channel-rdma: support working in coroutine
>   tests/unit: add test-io-channel-rdma.c
>   migration: introduce new RDMA live migration
>   migration/rdma: support multifd for RDMA migration

This series has been idle for a while; we still need to know how to move
forward.  I guess I lost the latest status quo..

Any update (from anyone..) on what stage are we in?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-08-27 20:15 ` Peter Xu
@ 2024-08-27 20:57   ` Michael S. Tsirkin
  2024-09-22 19:29     ` Michael Galaxy
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2024-08-27 20:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Gonglei, qemu-devel, yu.zhang, mgalaxy, elmar.gerdes, zhengchuan,
	berrange, armbru, lizhijian, pbonzini, xiexiangyou, linux-rdma,
	lixiao91, jinpu.wang, Jialin Wang

On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> > From: Jialin Wang <wangjialin23@huawei.com>
> > 
> > Hi,
> > 
> > This patch series attempts to refactor RDMA live migration by
> > introducing a new QIOChannelRDMA class based on the rsocket API.
> > 
> > The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> > that is a 1-1 match of the normal kernel 'sockets' API, which hides the
> > detail of rdma protocol into rsocket and allows us to add support for
> > some modern features like multifd more easily.
> > 
> > Here is the previous discussion on refactoring RDMA live migration using
> > the rsocket API:
> > 
> > https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/
> > 
> > We have encountered some bugs when using rsocket and plan to submit them to
> > the rdma-core community.
> > 
> > In addition, the use of rsocket makes our programming more convenient,
> > but it must be noted that this method introduces multiple memory copies,
> > which can be imagined that there will be a certain performance degradation,
> > hoping that friends with RDMA network cards can help verify, thank you!
> > 
> > Jialin Wang (6):
> >   migration: remove RDMA live migration temporarily
> >   io: add QIOChannelRDMA class
> >   io/channel-rdma: support working in coroutine
> >   tests/unit: add test-io-channel-rdma.c
> >   migration: introduce new RDMA live migration
> >   migration/rdma: support multifd for RDMA migration
> 
> This series has been idle for a while; we still need to know how to move
> forward.


What exactly is the question? This got a bunch of comments,
the first thing to do would be to address them.


>  I guess I lost the latest status quo..
> 
> Any update (from anyone..) on what stage are we in?
> 
> Thanks,
> -- 
> Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-08-27 20:57   ` Michael S. Tsirkin
@ 2024-09-22 19:29     ` Michael Galaxy
  2024-09-23  1:04       ` Gonglei (Arei) via
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Galaxy @ 2024-09-22 19:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, Peter Xu
  Cc: Gonglei, qemu-devel, yu.zhang, elmar.gerdes, zhengchuan, berrange,
	armbru, lizhijian, pbonzini, xiexiangyou, linux-rdma, lixiao91,
	jinpu.wang, Jialin Wang

Hi All,

I have met with the team from IONOS about their testing on actual IB 
hardware here at KVM Forum today and the requirements are starting to 
make more sense to me. I didn't say much in our previous thread because 
I misunderstood the requirements, so let me try to explain and see if 
we're all on the same page. There appears to be a fundamental limitation 
here with rsocket, for which I don't see how it is possible to overcome.

The basic problem is that rsocket is trying to present a stream 
abstraction, a concept that is fundamentally incompatible with RDMA. The 
whole point of using RDMA in the first place is to avoid using the CPU, 
and to do that, all of the memory (potentially hundreds of gigabytes) 
need to be registered with the hardware *in advance* (this is how the 
original implementation works).

The need to fake a socket/bytestream abstraction eventually breaks down 
=> There is a limit (a few GB) in rsocket (which the IONOS team previous 
reported in testing.... see that email), it appears that means that 
rsocket is only going to be able to map a certain limited amount of 
memory with the hardware until its internal "buffer" runs out before it 
can then unmap and remap the next batch of memory with the hardware to 
continue along with the fake bytestream. This is very much sticking a 
square peg in a round hole. If you were to "relax" the rsocket 
implementation to register the entire VM memory space (as my original 
implementation does), then there wouldn't be any need for rsocket in the 
first place.

I think there is just some misunderstanding here in the group in the way 
infiniband is intended to work. Does that make sense so far? I do 
understand the need for testing, but rsocket is simply not intended to 
be used for kind of massive bulk data transfer purposes that we're 
proposing using it here for, simply for the purposes of making our lives 
better in testing.

Regarding testing: During our previous thread earlier this summer, why 
did we not consider making a better integration test to solve the test 
burden problem? To explain better: If a new integration test were 
written for QEMU and submitted and reviewed (a reasonably complex test 
that was in line with a traditional live migration integration test that 
actually spins up QEMU) which used softRoCE in a localhost configuration 
that has full libibverbs supports and still allowed for compatibility 
testing with QEMU, would such an integration not be sufficient to handle 
the testing burden?

Comments welcome,
- Michael

On 8/27/24 15:57, Michael S. Tsirkin wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
>> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
>>> From: Jialin Wang <wangjialin23@huawei.com>
>>>
>>> Hi,
>>>
>>> This patch series attempts to refactor RDMA live migration by
>>> introducing a new QIOChannelRDMA class based on the rsocket API.
>>>
>>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
>>> that is a 1-1 match of the normal kernel 'sockets' API, which hides the
>>> detail of rdma protocol into rsocket and allows us to add support for
>>> some modern features like multifd more easily.
>>>
>>> Here is the previous discussion on refactoring RDMA live migration using
>>> the rsocket API:
>>>
>>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/20240328130255.52257-1-philmd@linaro.org/__;!!GjvTz_vk!TuRaotO-yMj82o2kQo3x743jLoDElYgrXmp2wOfMTuCS1Y4k2Son1WGsRnZG_YYS9ZgBZ8uRHQ$
>>>
>>> We have encountered some bugs when using rsocket and plan to submit them to
>>> the rdma-core community.
>>>
>>> In addition, the use of rsocket makes our programming more convenient,
>>> but it must be noted that this method introduces multiple memory copies,
>>> which can be imagined that there will be a certain performance degradation,
>>> hoping that friends with RDMA network cards can help verify, thank you!
>>>
>>> Jialin Wang (6):
>>>    migration: remove RDMA live migration temporarily
>>>    io: add QIOChannelRDMA class
>>>    io/channel-rdma: support working in coroutine
>>>    tests/unit: add test-io-channel-rdma.c
>>>    migration: introduce new RDMA live migration
>>>    migration/rdma: support multifd for RDMA migration
>> This series has been idle for a while; we still need to know how to move
>> forward.
>
> What exactly is the question? This got a bunch of comments,
> the first thing to do would be to address them.
>
>
>>   I guess I lost the latest status quo..
>>
>> Any update (from anyone..) on what stage are we in?
>>
>> Thanks,
>> -- 
>> Peter Xu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-22 19:29     ` Michael Galaxy
@ 2024-09-23  1:04       ` Gonglei (Arei) via
  2024-09-25 15:08         ` Peter Xu
  2024-09-27 20:34         ` Michael Galaxy
  0 siblings, 2 replies; 55+ messages in thread
From: Gonglei (Arei) via @ 2024-09-23  1:04 UTC (permalink / raw)
  To: Michael Galaxy, Michael S. Tsirkin, Peter Xu
  Cc: qemu-devel@nongnu.org, yu.zhang@ionos.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

Hi,

> -----Original Message-----
> From: Michael Galaxy [mailto:mgalaxy@akamai.com]
> Sent: Monday, September 23, 2024 3:29 AM
> To: Michael S. Tsirkin <mst@redhat.com>; Peter Xu <peterx@redhat.com>
> Cc: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org;
> yu.zhang@ionos.com; elmar.gerdes@ionos.com; zhengchuan
> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> <wangjialin23@huawei.com>
> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> 
> Hi All,
> 
> I have met with the team from IONOS about their testing on actual IB
> hardware here at KVM Forum today and the requirements are starting to make
> more sense to me. I didn't say much in our previous thread because I
> misunderstood the requirements, so let me try to explain and see if we're all on
> the same page. There appears to be a fundamental limitation here with rsocket,
> for which I don't see how it is possible to overcome.
> 
> The basic problem is that rsocket is trying to present a stream abstraction, a
> concept that is fundamentally incompatible with RDMA. The whole point of
> using RDMA in the first place is to avoid using the CPU, and to do that, all of the
> memory (potentially hundreds of gigabytes) need to be registered with the
> hardware *in advance* (this is how the original implementation works).
> 
> The need to fake a socket/bytestream abstraction eventually breaks down =>
> There is a limit (a few GB) in rsocket (which the IONOS team previous reported
> in testing.... see that email), it appears that means that rsocket is only going to
> be able to map a certain limited amount of memory with the hardware until its
> internal "buffer" runs out before it can then unmap and remap the next batch
> of memory with the hardware to continue along with the fake bytestream. This
> is very much sticking a square peg in a round hole. If you were to "relax" the
> rsocket implementation to register the entire VM memory space (as my
> original implementation does), then there wouldn't be any need for rsocket in
> the first place.
> 

Thank you for your opinion. You're right. RSocket has encountered difficulties in 
transferring large amounts of data. We haven't even figured it out yet. Although
in this practice, we solved several problems with rsocket.

In our practice, we need to quickly complete VM live migration and the downtime 
of live migration must be within 50 ms or less. Therefore, we use RDMA, which is 
an essential requirement. Next, I think we'll do it based on Qemu's native RDMA 
live migration solution. During this period, we really doubted whether RDMA live 
migration was really feasible through rsocket refactoring, so the refactoring plan 
was shelved.


Regards,
-Gonglei

> I think there is just some misunderstanding here in the group in the way
> infiniband is intended to work. Does that make sense so far? I do understand
> the need for testing, but rsocket is simply not intended to be used for kind of
> massive bulk data transfer purposes that we're proposing using it here for,
> simply for the purposes of making our lives better in testing.
> 
> Regarding testing: During our previous thread earlier this summer, why did we
> not consider making a better integration test to solve the test burden problem?
> To explain better: If a new integration test were written for QEMU and
> submitted and reviewed (a reasonably complex test that was in line with a
> traditional live migration integration test that actually spins up QEMU) which
> used softRoCE in a localhost configuration that has full libibverbs supports and
> still allowed for compatibility testing with QEMU, would such an integration not
> be sufficient to handle the testing burden?
> 
> Comments welcome,
> - Michael
> 
> On 8/27/24 15:57, Michael S. Tsirkin wrote:
> > !-------------------------------------------------------------------|
> >    This Message Is From an External Sender
> >    This message came from outside your organization.
> > |-------------------------------------------------------------------!
> >
> > On Tue, Aug 27, 2024 at 04:15:42PM -0400, Peter Xu wrote:
> >> On Tue, Jun 04, 2024 at 08:14:06PM +0800, Gonglei wrote:
> >>> From: Jialin Wang <wangjialin23@huawei.com>
> >>>
> >>> Hi,
> >>>
> >>> This patch series attempts to refactor RDMA live migration by
> >>> introducing a new QIOChannelRDMA class based on the rsocket API.
> >>>
> >>> The /usr/include/rdma/rsocket.h provides a higher level rsocket API
> >>> that is a 1-1 match of the normal kernel 'sockets' API, which hides
> >>> the detail of rdma protocol into rsocket and allows us to add
> >>> support for some modern features like multifd more easily.
> >>>
> >>> Here is the previous discussion on refactoring RDMA live migration
> >>> using the rsocket API:
> >>>
> >>> https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/20240
> >>>
> 328130255.52257-1-philmd@linaro.org/__;!!GjvTz_vk!TuRaotO-yMj82o2kQo
> >>> 3x743jLoDElYgrXmp2wOfMTuCS1Y4k2Son1WGsRnZG_YYS9ZgBZ8uRHQ$
> >>>
> >>> We have encountered some bugs when using rsocket and plan to submit
> >>> them to the rdma-core community.
> >>>
> >>> In addition, the use of rsocket makes our programming more
> >>> convenient, but it must be noted that this method introduces
> >>> multiple memory copies, which can be imagined that there will be a
> >>> certain performance degradation, hoping that friends with RDMA network
> cards can help verify, thank you!
> >>>
> >>> Jialin Wang (6):
> >>>    migration: remove RDMA live migration temporarily
> >>>    io: add QIOChannelRDMA class
> >>>    io/channel-rdma: support working in coroutine
> >>>    tests/unit: add test-io-channel-rdma.c
> >>>    migration: introduce new RDMA live migration
> >>>    migration/rdma: support multifd for RDMA migration
> >> This series has been idle for a while; we still need to know how to
> >> move forward.
> >
> > What exactly is the question? This got a bunch of comments, the first
> > thing to do would be to address them.
> >
> >
> >>   I guess I lost the latest status quo..
> >>
> >> Any update (from anyone..) on what stage are we in?
> >>
> >> Thanks,
> >> --
> >> Peter Xu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-23  1:04       ` Gonglei (Arei) via
@ 2024-09-25 15:08         ` Peter Xu
  2024-09-27 21:45           ` Sean Hefty
  2024-09-27 20:34         ` Michael Galaxy
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Xu @ 2024-09-25 15:08 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: Michael Galaxy, Michael S. Tsirkin, qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

On Mon, Sep 23, 2024 at 01:04:17AM +0000, Gonglei (Arei) wrote:
> Hi,
> 
> > -----Original Message-----
> > From: Michael Galaxy [mailto:mgalaxy@akamai.com]
> > Sent: Monday, September 23, 2024 3:29 AM
> > To: Michael S. Tsirkin <mst@redhat.com>; Peter Xu <peterx@redhat.com>
> > Cc: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org;
> > yu.zhang@ionos.com; elmar.gerdes@ionos.com; zhengchuan
> > <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
> > lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
> > <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
> > <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
> > <wangjialin23@huawei.com>
> > Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
> > 
> > Hi All,
> > 
> > I have met with the team from IONOS about their testing on actual IB
> > hardware here at KVM Forum today and the requirements are starting to make
> > more sense to me. I didn't say much in our previous thread because I
> > misunderstood the requirements, so let me try to explain and see if we're all on
> > the same page. There appears to be a fundamental limitation here with rsocket,
> > for which I don't see how it is possible to overcome.
> > 
> > The basic problem is that rsocket is trying to present a stream abstraction, a
> > concept that is fundamentally incompatible with RDMA. The whole point of
> > using RDMA in the first place is to avoid using the CPU, and to do that, all of the
> > memory (potentially hundreds of gigabytes) need to be registered with the
> > hardware *in advance* (this is how the original implementation works).
> > 
> > The need to fake a socket/bytestream abstraction eventually breaks down =>
> > There is a limit (a few GB) in rsocket (which the IONOS team previous reported
> > in testing.... see that email), it appears that means that rsocket is only going to
> > be able to map a certain limited amount of memory with the hardware until its
> > internal "buffer" runs out before it can then unmap and remap the next batch
> > of memory with the hardware to continue along with the fake bytestream. This
> > is very much sticking a square peg in a round hole. If you were to "relax" the
> > rsocket implementation to register the entire VM memory space (as my
> > original implementation does), then there wouldn't be any need for rsocket in
> > the first place.

Yes, some test like this can be helpful.

And thanks for the summary.  That's definitely helpful.

One question from my side (as someone knows nothing on RDMA/rsocket): is
that "a few GBs" limitation a software guard?  Would it be possible that
rsocket provide some option to allow user opt-in on setting that value, so
that it might work for VM use case?  Would that consume similar resources
v.s. the current QEMU impl but allows it to use rsockets with no perf
regressions?

> 
> Thank you for your opinion. You're right. RSocket has encountered difficulties in 
> transferring large amounts of data. We haven't even figured it out yet. Although
> in this practice, we solved several problems with rsocket.
> 
> In our practice, we need to quickly complete VM live migration and the downtime 
> of live migration must be within 50 ms or less. Therefore, we use RDMA, which is 
> an essential requirement. Next, I think we'll do it based on Qemu's native RDMA 
> live migration solution. During this period, we really doubted whether RDMA live 
> migration was really feasible through rsocket refactoring, so the refactoring plan 
> was shelved.

To me, 50ms guaranteed is hard.  I'm personally not sure how much RDMA
helps if that's only about the transport.

I meant, at least I feel like someone would need to work out some general
limitations, like:

https://wiki.qemu.org/ToDo/LiveMigration#Optimize_memory_updates_for_non-iterative_vmstates
https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxclwt@bytedance.com/

I also remember we always have outliers that when save()/load() device
states it can simply be slower (takes 100ms or more on a single device; I
think it's normally has kernel/kvm involved).  That one device can already
break the rule, even if happens rarely.

We also haven't looked into multiple other issues during downtime:

  - vm start/stop will invoke notifiers, and notifiers can (in some cases)
    take quite some time to finish

  - some features may enlarge downtime in an unpredictable way, but so far
    we don't yet have full control of it, e.g. pause-before-switchover for
    block layers

There can be other stuff floating, just to provide some examples.  All
these cases I mentioned above are not relevant to transport on its own.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-23  1:04       ` Gonglei (Arei) via
  2024-09-25 15:08         ` Peter Xu
@ 2024-09-27 20:34         ` Michael Galaxy
  1 sibling, 0 replies; 55+ messages in thread
From: Michael Galaxy @ 2024-09-27 20:34 UTC (permalink / raw)
  To: Gonglei (Arei), Michael S. Tsirkin, Peter Xu
  Cc: qemu-devel@nongnu.org, yu.zhang@ionos.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

Hi Gonglei,

On 9/22/24 20:04, Gonglei (Arei) wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> Hi,
>
>> -----Original Message-----
>> From: Michael Galaxy [mailto:mgalaxy@akamai.com]
>> Sent: Monday, September 23, 2024 3:29 AM
>> To: Michael S. Tsirkin <mst@redhat.com>; Peter Xu <peterx@redhat.com>
>> Cc: Gonglei (Arei) <arei.gonglei@huawei.com>; qemu-devel@nongnu.org;
>> yu.zhang@ionos.com; elmar.gerdes@ionos.com; zhengchuan
>> <zhengchuan@huawei.com>; berrange@redhat.com; armbru@redhat.com;
>> lizhijian@fujitsu.com; pbonzini@redhat.com; Xiexiangyou
>> <xiexiangyou@huawei.com>; linux-rdma@vger.kernel.org; lixiao (H)
>> <lixiao91@huawei.com>; jinpu.wang@ionos.com; Wangjialin
>> <wangjialin23@huawei.com>
>> Subject: Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
>>
>> Hi All,
>>
>> I have met with the team from IONOS about their testing on actual IB
>> hardware here at KVM Forum today and the requirements are starting to make
>> more sense to me. I didn't say much in our previous thread because I
>> misunderstood the requirements, so let me try to explain and see if we're all on
>> the same page. There appears to be a fundamental limitation here with rsocket,
>> for which I don't see how it is possible to overcome.
>>
>> The basic problem is that rsocket is trying to present a stream abstraction, a
>> concept that is fundamentally incompatible with RDMA. The whole point of
>> using RDMA in the first place is to avoid using the CPU, and to do that, all of the
>> memory (potentially hundreds of gigabytes) need to be registered with the
>> hardware *in advance* (this is how the original implementation works).
>>
>> The need to fake a socket/bytestream abstraction eventually breaks down =>
>> There is a limit (a few GB) in rsocket (which the IONOS team previous reported
>> in testing.... see that email), it appears that means that rsocket is only going to
>> be able to map a certain limited amount of memory with the hardware until its
>> internal "buffer" runs out before it can then unmap and remap the next batch
>> of memory with the hardware to continue along with the fake bytestream. This
>> is very much sticking a square peg in a round hole. If you were to "relax" the
>> rsocket implementation to register the entire VM memory space (as my
>> original implementation does), then there wouldn't be any need for rsocket in
>> the first place.
>>
> Thank you for your opinion. You're right. RSocket has encountered difficulties in
> transferring large amounts of data. We haven't even figured it out yet. Although
> in this practice, we solved several problems with rsocket.
>
> In our practice, we need to quickly complete VM live migration and the downtime
> of live migration must be within 50 ms or less. Therefore, we use RDMA, which is
> an essential requirement. Next, I think we'll do it based on Qemu's native RDMA
> live migration solution. During this period, we really doubted whether RDMA live
> migration was really feasible through rsocket refactoring, so the refactoring plan
> was shelved.
>
>
> Regards,
> -Gonglei

OK, this is helpful. Thanks for the response.

So that means we do still have two consumers of the native libibverbs 
RDMA solution.

Comments are still welcome. Is there still a reason to pursue this line 
of work that I might be missing?

- Michael




^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-25 15:08         ` Peter Xu
@ 2024-09-27 21:45           ` Sean Hefty
  2024-09-28 17:52             ` Michael Galaxy
  0 siblings, 1 reply; 55+ messages in thread
From: Sean Hefty @ 2024-09-27 21:45 UTC (permalink / raw)
  To: Peter Xu, Gonglei (Arei)
  Cc: Michael Galaxy, Michael S. Tsirkin, qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

> > > I have met with the team from IONOS about their testing on actual IB
> > > hardware here at KVM Forum today and the requirements are starting
> > > to make more sense to me. I didn't say much in our previous thread
> > > because I misunderstood the requirements, so let me try to explain
> > > and see if we're all on the same page. There appears to be a
> > > fundamental limitation here with rsocket, for which I don't see how it is
> possible to overcome.
> > >
> > > The basic problem is that rsocket is trying to present a stream
> > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > The whole point of using RDMA in the first place is to avoid using
> > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > gigabytes) need to be registered with the hardware *in advance* (this is
> how the original implementation works).
> > >
> > > The need to fake a socket/bytestream abstraction eventually breaks
> > > down => There is a limit (a few GB) in rsocket (which the IONOS team
> > > previous reported in testing.... see that email), it appears that
> > > means that rsocket is only going to be able to map a certain limited
> > > amount of memory with the hardware until its internal "buffer" runs
> > > out before it can then unmap and remap the next batch of memory with
> > > the hardware to continue along with the fake bytestream. This is
> > > very much sticking a square peg in a round hole. If you were to
> > > "relax" the rsocket implementation to register the entire VM memory
> > > space (as my original implementation does), then there wouldn't be any
> need for rsocket in the first place.
> 
> Yes, some test like this can be helpful.
> 
> And thanks for the summary.  That's definitely helpful.
> 
> One question from my side (as someone knows nothing on RDMA/rsocket): is
> that "a few GBs" limitation a software guard?  Would it be possible that rsocket
> provide some option to allow user opt-in on setting that value, so that it might
> work for VM use case?  Would that consume similar resources v.s. the current
> QEMU impl but allows it to use rsockets with no perf regressions?

Rsockets is emulated the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32-bits.

This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.

Does your kernel allocate > 4 GBs of buffer space to an individual socket?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-27 21:45           ` Sean Hefty
@ 2024-09-28 17:52             ` Michael Galaxy
  2024-09-29 18:14               ` Michael S. Tsirkin
  2024-09-30 18:16               ` Peter Xu
  0 siblings, 2 replies; 55+ messages in thread
From: Michael Galaxy @ 2024-09-28 17:52 UTC (permalink / raw)
  To: Sean Hefty, Peter Xu, Gonglei (Arei)
  Cc: Michael S. Tsirkin, qemu-devel@nongnu.org, yu.zhang@ionos.com,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	Xiexiangyou, linux-rdma@vger.kernel.org, lixiao (H),
	jinpu.wang@ionos.com, Wangjialin

On 9/27/24 16:45, Sean Hefty wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
>>>> I have met with the team from IONOS about their testing on actual IB
>>>> hardware here at KVM Forum today and the requirements are starting
>>>> to make more sense to me. I didn't say much in our previous thread
>>>> because I misunderstood the requirements, so let me try to explain
>>>> and see if we're all on the same page. There appears to be a
>>>> fundamental limitation here with rsocket, for which I don't see how it is
>> possible to overcome.
>>>> The basic problem is that rsocket is trying to present a stream
>>>> abstraction, a concept that is fundamentally incompatible with RDMA.
>>>> The whole point of using RDMA in the first place is to avoid using
>>>> the CPU, and to do that, all of the memory (potentially hundreds of
>>>> gigabytes) need to be registered with the hardware *in advance* (this is
>> how the original implementation works).
>>>> The need to fake a socket/bytestream abstraction eventually breaks
>>>> down => There is a limit (a few GB) in rsocket (which the IONOS team
>>>> previous reported in testing.... see that email), it appears that
>>>> means that rsocket is only going to be able to map a certain limited
>>>> amount of memory with the hardware until its internal "buffer" runs
>>>> out before it can then unmap and remap the next batch of memory with
>>>> the hardware to continue along with the fake bytestream. This is
>>>> very much sticking a square peg in a round hole. If you were to
>>>> "relax" the rsocket implementation to register the entire VM memory
>>>> space (as my original implementation does), then there wouldn't be any
>> need for rsocket in the first place.
>>
>> Yes, some test like this can be helpful.
>>
>> And thanks for the summary.  That's definitely helpful.
>>
>> One question from my side (as someone knows nothing on RDMA/rsocket): is
>> that "a few GBs" limitation a software guard?  Would it be possible that rsocket
>> provide some option to allow user opt-in on setting that value, so that it might
>> work for VM use case?  Would that consume similar resources v.s. the current
>> QEMU impl but allows it to use rsockets with no perf regressions?
> Rsockets is emulated the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32-bits.
>
> This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.
Understood.
> Does your kernel allocate > 4 GBs of buffer space to an individual socket?
Yes, it absolutely does. We're dealing with virtual machines here, 
right? It is possible (and likely) to have a virtual machine that is 
hundreds of GBs of RAM in size.

A bounce buffer defeats the entire purpose of using RDMA in these cases. 
When using RDMA for very large transfers like this, the goal here is to 
map the entire memory region at once and avoid all CPU interactions 
(except for message management within libibverbs) so that the NIC is 
doing all of the work.

I'm sure rsocket has its place with much smaller transfer sizes, but 
this is very different.

- Michael



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-28 17:52             ` Michael Galaxy
@ 2024-09-29 18:14               ` Michael S. Tsirkin
  2024-09-29 20:26                 ` Michael Galaxy
  2024-09-30 18:16               ` Peter Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2024-09-29 18:14 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Sean Hefty, Peter Xu, Gonglei (Arei), qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
> 
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.

To clarify, are you actively using rdma based migration in production? Stepping up
to help maintain it?

-- 
MST



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-29 18:14               ` Michael S. Tsirkin
@ 2024-09-29 20:26                 ` Michael Galaxy
  2024-09-29 22:26                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Galaxy @ 2024-09-29 20:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sean Hefty, Peter Xu, Gonglei (Arei), qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin


On 9/29/24 13:14, Michael S. Tsirkin wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
>> A bounce buffer defeats the entire purpose of using RDMA in these cases.
>> When using RDMA for very large transfers like this, the goal here is to map
>> the entire memory region at once and avoid all CPU interactions (except for
>> message management within libibverbs) so that the NIC is doing all of the
>> work.
>>
>> I'm sure rsocket has its place with much smaller transfer sizes, but this is
>> very different.
> To clarify, are you actively using rdma based migration in production? Stepping up
> to help maintain it?
>
Yes, both Huawei and IONOS have both been contributing here in this 
email thread.

They are both using it in production.

- Michael


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-29 20:26                 ` Michael Galaxy
@ 2024-09-29 22:26                   ` Michael S. Tsirkin
  2024-09-30 15:00                     ` Michael Galaxy
  0 siblings, 1 reply; 55+ messages in thread
From: Michael S. Tsirkin @ 2024-09-29 22:26 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Sean Hefty, Peter Xu, Gonglei (Arei), qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
> 
> On 9/29/24 13:14, Michael S. Tsirkin wrote:
> > !-------------------------------------------------------------------|
> >    This Message Is From an External Sender
> >    This message came from outside your organization.
> > |-------------------------------------------------------------------!
> > 
> > On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> > > A bounce buffer defeats the entire purpose of using RDMA in these cases.
> > > When using RDMA for very large transfers like this, the goal here is to map
> > > the entire memory region at once and avoid all CPU interactions (except for
> > > message management within libibverbs) so that the NIC is doing all of the
> > > work.
> > > 
> > > I'm sure rsocket has its place with much smaller transfer sizes, but this is
> > > very different.
> > To clarify, are you actively using rdma based migration in production? Stepping up
> > to help maintain it?
> > 
> Yes, both Huawei and IONOS have both been contributing here in this email
> thread.
> 
> They are both using it in production.
> 
> - Michael

Well, any plans to work on it? for example, postcopy does not really
do zero copy last time I checked, there's also a long TODO list.

-- 
MST



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-29 22:26                   ` Michael S. Tsirkin
@ 2024-09-30 15:00                     ` Michael Galaxy
  2024-09-30 15:31                       ` Yu Zhang
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Galaxy @ 2024-09-30 15:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sean Hefty, Peter Xu, Gonglei (Arei), qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin


On 9/29/24 17:26, Michael S. Tsirkin wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
>> On 9/29/24 13:14, Michael S. Tsirkin wrote:
>>> !-------------------------------------------------------------------|
>>>     This Message Is From an External Sender
>>>     This message came from outside your organization.
>>> |-------------------------------------------------------------------!
>>>
>>> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
>>>> A bounce buffer defeats the entire purpose of using RDMA in these cases.
>>>> When using RDMA for very large transfers like this, the goal here is to map
>>>> the entire memory region at once and avoid all CPU interactions (except for
>>>> message management within libibverbs) so that the NIC is doing all of the
>>>> work.
>>>>
>>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is
>>>> very different.
>>> To clarify, are you actively using rdma based migration in production? Stepping up
>>> to help maintain it?
>>>
>> Yes, both Huawei and IONOS have both been contributing here in this email
>> thread.
>>
>> They are both using it in production.
>>
>> - Michael
> Well, any plans to work on it? for example, postcopy does not really
> do zero copy last time I checked, there's also a long TODO list.
>
I apologize, I'm not following the question here. Isn't that what this 
thread is about?

So, some background is missing here, perhaps: A few months ago, there 
was a proposal
to remove native RDMA support from live migration due to concerns about 
lack of testability.
Both IONOS and Huawei have stepped up that they are using it and are 
engaging with the
community here. I also proposed transferring over maintainership to them 
as well.  (I  no longer
have any of this hardware, so I cannot provide testing support anymore).

During that time, rsocket was proposed as an alternative, but as I have 
laid out above, I believe
it cannot work for technical reasons.

I also asked earlier in the thread if we can cover the community's 
testing concerns using softroce,
so that an integration test can be made to work (presumably through 
avocado or something similar).

Does that history make sense?

- Michael



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-30 15:00                     ` Michael Galaxy
@ 2024-09-30 15:31                       ` Yu Zhang
  0 siblings, 0 replies; 55+ messages in thread
From: Yu Zhang @ 2024-09-30 15:31 UTC (permalink / raw)
  To: Michael Galaxy, Michael S. Tsirkin
  Cc: Sean Hefty, Peter Xu, Gonglei (Arei), qemu-devel@nongnu.org,
	elmar.gerdes@ionos.com, zhengchuan, berrange@redhat.com,
	armbru@redhat.com, lizhijian@fujitsu.com, pbonzini@redhat.com,
	Xiexiangyou, linux-rdma@vger.kernel.org, lixiao (H),
	jinpu.wang@ionos.com, Wangjialin

Hello Michael,

That's true. To my understanding, to ease the maintenance, Gonglei's
team has taken efforts to refactorize the RDMA migration code by using
rsocket. However, due to a certain limitation in rsocket, it turned
out that only small VM (in terms of core number and memory) can be
migrated successfully. As long as this limitation persists, no
progress can be achieved in this direction. One the other hand, a
proper test environment and integration / regression test cases are
expected to catch any possible regression due to new changes. It seems
that currently, we can go in this direction.

Best regards,
Yu Zhang @ IONOS cloud

On Mon, Sep 30, 2024 at 5:00 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
>
> On 9/29/24 17:26, Michael S. Tsirkin wrote:
> > !-------------------------------------------------------------------|
> >    This Message Is From an External Sender
> >    This message came from outside your organization.
> > |-------------------------------------------------------------------!
> >
> > On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
> >> On 9/29/24 13:14, Michael S. Tsirkin wrote:
> >>> !-------------------------------------------------------------------|
> >>>     This Message Is From an External Sender
> >>>     This message came from outside your organization.
> >>> |-------------------------------------------------------------------!
> >>>
> >>> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> >>>> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> >>>> When using RDMA for very large transfers like this, the goal here is to map
> >>>> the entire memory region at once and avoid all CPU interactions (except for
> >>>> message management within libibverbs) so that the NIC is doing all of the
> >>>> work.
> >>>>
> >>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> >>>> very different.
> >>> To clarify, are you actively using rdma based migration in production? Stepping up
> >>> to help maintain it?
> >>>
> >> Yes, both Huawei and IONOS have both been contributing here in this email
> >> thread.
> >>
> >> They are both using it in production.
> >>
> >> - Michael
> > Well, any plans to work on it? for example, postcopy does not really
> > do zero copy last time I checked, there's also a long TODO list.
> >
> I apologize, I'm not following the question here. Isn't that what this
> thread is about?
>
> So, some background is missing here, perhaps: A few months ago, there
> was a proposal
> to remove native RDMA support from live migration due to concerns about
> lack of testability.
> Both IONOS and Huawei have stepped up that they are using it and are
> engaging with the
> community here. I also proposed transferring over maintainership to them
> as well.  (I  no longer
> have any of this hardware, so I cannot provide testing support anymore).
>
> During that time, rsocket was proposed as an alternative, but as I have
> laid out above, I believe
> it cannot work for technical reasons.
>
> I also asked earlier in the thread if we can cover the community's
> testing concerns using softroce,
> so that an integration test can be made to work (presumably through
> avocado or something similar).
>
> Does that history make sense?
>
> - Michael
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-28 17:52             ` Michael Galaxy
  2024-09-29 18:14               ` Michael S. Tsirkin
@ 2024-09-30 18:16               ` Peter Xu
  2024-09-30 19:20                 ` Sean Hefty
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Xu @ 2024-09-30 18:16 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, yu.zhang@ionos.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> On 9/27/24 16:45, Sean Hefty wrote:
> > !-------------------------------------------------------------------|
> >    This Message Is From an External Sender
> >    This message came from outside your organization.
> > |-------------------------------------------------------------------!
> > 
> > > > > I have met with the team from IONOS about their testing on actual IB
> > > > > hardware here at KVM Forum today and the requirements are starting
> > > > > to make more sense to me. I didn't say much in our previous thread
> > > > > because I misunderstood the requirements, so let me try to explain
> > > > > and see if we're all on the same page. There appears to be a
> > > > > fundamental limitation here with rsocket, for which I don't see how it is
> > > possible to overcome.
> > > > > The basic problem is that rsocket is trying to present a stream
> > > > > abstraction, a concept that is fundamentally incompatible with RDMA.
> > > > > The whole point of using RDMA in the first place is to avoid using
> > > > > the CPU, and to do that, all of the memory (potentially hundreds of
> > > > > gigabytes) need to be registered with the hardware *in advance* (this is
> > > how the original implementation works).
> > > > > The need to fake a socket/bytestream abstraction eventually breaks
> > > > > down => There is a limit (a few GB) in rsocket (which the IONOS team
> > > > > previous reported in testing.... see that email), it appears that
> > > > > means that rsocket is only going to be able to map a certain limited
> > > > > amount of memory with the hardware until its internal "buffer" runs
> > > > > out before it can then unmap and remap the next batch of memory with
> > > > > the hardware to continue along with the fake bytestream. This is
> > > > > very much sticking a square peg in a round hole. If you were to
> > > > > "relax" the rsocket implementation to register the entire VM memory
> > > > > space (as my original implementation does), then there wouldn't be any
> > > need for rsocket in the first place.
> > > 
> > > Yes, some test like this can be helpful.
> > > 
> > > And thanks for the summary.  That's definitely helpful.
> > > 
> > > One question from my side (as someone knows nothing on RDMA/rsocket): is
> > > that "a few GBs" limitation a software guard?  Would it be possible that rsocket
> > > provide some option to allow user opt-in on setting that value, so that it might
> > > work for VM use case?  Would that consume similar resources v.s. the current
> > > QEMU impl but allows it to use rsockets with no perf regressions?
> > Rsockets is emulated the streaming socket API.  The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting.  It is also configurable via rsetsockopt() SO_SNDBUF.  Both of those are similar to TCP settings.  The SW field used to store this value is 32-bits.
> > 
> > This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into the asynchronous RDMA transfers.  Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.
> Understood.
> > Does your kernel allocate > 4 GBs of buffer space to an individual socket?
> Yes, it absolutely does. We're dealing with virtual machines here, right? It
> is possible (and likely) to have a virtual machine that is hundreds of GBs
> of RAM in size.
> 
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map
> the entire memory region at once and avoid all CPU interactions (except for
> message management within libibverbs) so that the NIC is doing all of the
> work.
> 
> I'm sure rsocket has its place with much smaller transfer sizes, but this is
> very different.

Is it possible to make rsocket be friendly with large buffers (>4GB) like
the VM use case?

I also wonder whether there're other applications that may benefit from
this outside of QEMU.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-30 18:16               ` Peter Xu
@ 2024-09-30 19:20                 ` Sean Hefty
  2024-09-30 19:47                   ` Peter Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Sean Hefty @ 2024-09-30 19:20 UTC (permalink / raw)
  To: Peter Xu, Michael Galaxy
  Cc: Gonglei (Arei), Michael S. Tsirkin, qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

> > I'm sure rsocket has its place with much smaller transfer sizes, but
> > this is very different.
> 
> Is it possible to make rsocket be friendly with large buffers (>4GB) like the VM
> use case?

If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies.  The problem is the socket API semantics.

There are rsocket API extensions (riowrite, riomap) to support RDMA write operations.  This avoids the data copy at the target, but not the sender.   (riowrite follows the socket send semantics on buffer ownership.)

It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at.  True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.

- Sean

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-30 19:20                 ` Sean Hefty
@ 2024-09-30 19:47                   ` Peter Xu
  2024-10-03 21:26                     ` Michael Galaxy
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Xu @ 2024-09-30 19:47 UTC (permalink / raw)
  To: Sean Hefty
  Cc: Michael Galaxy, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, yu.zhang@ionos.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
> > > I'm sure rsocket has its place with much smaller transfer sizes, but
> > > this is very different.
> > 
> > Is it possible to make rsocket be friendly with large buffers (>4GB) like the VM
> > use case?
> 
> If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies.  The problem is the socket API semantics.
> 
> There are rsocket API extensions (riowrite, riomap) to support RDMA write operations.  This avoids the data copy at the target, but not the sender.   (riowrite follows the socket send semantics on buffer ownership.)
> 
> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at.  True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.

Thanks, Sean.

One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
which already supports MSG_ZEROCOPY but only on sender side, and only if
when multifd is enabled, because it requires page pinning and alignments,
while it's more challenging to pin a random buffer than a guest page.

Nobody moved on yet with zerocopy recv for TCP; there might be similar
challenges that normal socket APIs may not work easily on top of current
iochannel design, but I don't know well to say..

Not sure whether it means there can be a shared goal with QEMU ultimately
supporting better zerocopy via either TCP or RDMA.  If that's true, maybe
there's chance we can move towards rsocket with all the above facilities,
meanwhile RDMA can, ideally, run similiarly like TCP with the same (to be
enhanced..) iochannel API, so that it can do zerocopy on both sides with
either transport.

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-09-30 19:47                   ` Peter Xu
@ 2024-10-03 21:26                     ` Michael Galaxy
  2024-10-03 21:43                       ` Peter Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Galaxy @ 2024-10-03 21:26 UTC (permalink / raw)
  To: Peter Xu, Sean Hefty
  Cc: Gonglei (Arei), Michael S. Tsirkin, qemu-devel@nongnu.org,
	yu.zhang@ionos.com, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

On 9/30/24 14:47, Peter Xu wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
>>>> I'm sure rsocket has its place with much smaller transfer sizes, but
>>>> this is very different.
>>> Is it possible to make rsocket be friendly with large buffers (>4GB) like the VM
>>> use case?
>> If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies.  The problem is the socket API semantics.
>>
>> There are rsocket API extensions (riowrite, riomap) to support RDMA write operations.  This avoids the data copy at the target, but not the sender.   (riowrite follows the socket send semantics on buffer ownership.)
>>
>> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at.  True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.
> Thanks, Sean.
>
> One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
> which already supports MSG_ZEROCOPY but only on sender side, and only if
> when multifd is enabled, because it requires page pinning and alignments,
> while it's more challenging to pin a random buffer than a guest page.
>
> Nobody moved on yet with zerocopy recv for TCP; there might be similar
> challenges that normal socket APIs may not work easily on top of current
> iochannel design, but I don't know well to say..
>
> Not sure whether it means there can be a shared goal with QEMU ultimately
> supporting better zerocopy via either TCP or RDMA.  If that's true, maybe
> there's chance we can move towards rsocket with all the above facilities,
> meanwhile RDMA can, ideally, run similiarly like TCP with the same (to be
> enhanced..) iochannel API, so that it can do zerocopy on both sides with
> either transport.
>
What about the testing solution that I mentioned?

Does that satisfy your concerns? Or is there still a gap here that needs 
to be met?

- Michael


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-03 21:26                     ` Michael Galaxy
@ 2024-10-03 21:43                       ` Peter Xu
  2024-10-04 14:04                         ` Michael Galaxy
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Xu @ 2024-10-03 21:43 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, yu.zhang@ionos.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin

On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> What about the testing solution that I mentioned?
> 
> Does that satisfy your concerns? Or is there still a gap here that needs to
> be met?

I think such testing framework would be helpful, especially if we can kick
it off in CI when preparing pull requests, then we can make sure nothing
will break RDMA easily.

Meanwhile, we still need people committed to this and actively maintain it,
who knows the rdma code well.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-03 21:43                       ` Peter Xu
@ 2024-10-04 14:04                         ` Michael Galaxy
  2024-10-07  8:47                           ` Yu Zhang
  0 siblings, 1 reply; 55+ messages in thread
From: Michael Galaxy @ 2024-10-04 14:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, yu.zhang@ionos.com, elmar.gerdes@ionos.com,
	zhengchuan, berrange@redhat.com, armbru@redhat.com,
	lizhijian@fujitsu.com, pbonzini@redhat.com, Xiexiangyou,
	linux-rdma@vger.kernel.org, lixiao (H), jinpu.wang@ionos.com,
	Wangjialin


On 10/3/24 16:43, Peter Xu wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>> What about the testing solution that I mentioned?
>>
>> Does that satisfy your concerns? Or is there still a gap here that needs to
>> be met?
> I think such testing framework would be helpful, especially if we can kick
> it off in CI when preparing pull requests, then we can make sure nothing
> will break RDMA easily.
>
> Meanwhile, we still need people committed to this and actively maintain it,
> who knows the rdma code well.
>
> Thanks,
>

OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test 
along these lines that would ensure that future RDMA breakages are 
detected more easily?

What do you think?

- Michael



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-04 14:04                         ` Michael Galaxy
@ 2024-10-07  8:47                           ` Yu Zhang
  2024-10-07 13:45                             ` Michael Galaxy
  0 siblings, 1 reply; 55+ messages in thread
From: Yu Zhang @ 2024-10-07  8:47 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

Sure, as we talked at the KVM Forum, a possible approach is to set up
two VMs on a physical host, configure the SoftRoCE, and run the
migration test in two nested VMs to ensure that the migration data
traffic goes through the emulated RDMA hardware. I will continue with
this and let you know.


On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
>
> On 10/3/24 16:43, Peter Xu wrote:
> > !-------------------------------------------------------------------|
> >    This Message Is From an External Sender
> >    This message came from outside your organization.
> > |-------------------------------------------------------------------!
> >
> > On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> >> What about the testing solution that I mentioned?
> >>
> >> Does that satisfy your concerns? Or is there still a gap here that needs to
> >> be met?
> > I think such testing framework would be helpful, especially if we can kick
> > it off in CI when preparing pull requests, then we can make sure nothing
> > will break RDMA easily.
> >
> > Meanwhile, we still need people committed to this and actively maintain it,
> > who knows the rdma code well.
> >
> > Thanks,
> >
>
> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test
> along these lines that would ensure that future RDMA breakages are
> detected more easily?
>
> What do you think?
>
> - Michael
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-07  8:47                           ` Yu Zhang
@ 2024-10-07 13:45                             ` Michael Galaxy
  2024-10-07 18:15                               ` Leon Romanovsky
  2024-10-23 13:42                               ` Michael Galaxy
  0 siblings, 2 replies; 55+ messages in thread
From: Michael Galaxy @ 2024-10-07 13:45 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

Hi,

On 10/7/24 03:47, Yu Zhang wrote:
> !-------------------------------------------------------------------|
>    This Message Is From an External Sender
>    This message came from outside your organization.
> |-------------------------------------------------------------------!
>
> Sure, as we talked at the KVM Forum, a possible approach is to set up
> two VMs on a physical host, configure the SoftRoCE, and run the
> migration test in two nested VMs to ensure that the migration data
> traffic goes through the emulated RDMA hardware. I will continue with
> this and let you know.
>
Acknowledged. Do share if you have any problems with it, like if it has 
compatibility issues
or if we need a different solution. We're open to change.

I'm not familiar with the "current state" of this or how well it would 
even work.

- Michael


> On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>>
>> On 10/3/24 16:43, Peter Xu wrote:
>>> !-------------------------------------------------------------------|
>>>     This Message Is From an External Sender
>>>     This message came from outside your organization.
>>> |-------------------------------------------------------------------!
>>>
>>> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>>>> What about the testing solution that I mentioned?
>>>>
>>>> Does that satisfy your concerns? Or is there still a gap here that needs to
>>>> be met?
>>> I think such testing framework would be helpful, especially if we can kick
>>> it off in CI when preparing pull requests, then we can make sure nothing
>>> will break RDMA easily.
>>>
>>> Meanwhile, we still need people committed to this and actively maintain it,
>>> who knows the rdma code well.
>>>
>>> Thanks,
>>>
>> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test
>> along these lines that would ensure that future RDMA breakages are
>> detected more easily?
>>
>> What do you think?
>>
>> - Michael
>>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-07 13:45                             ` Michael Galaxy
@ 2024-10-07 18:15                               ` Leon Romanovsky
  2024-10-08  9:31                                 ` Zhu Yanjun
  2024-10-23 13:42                               ` Michael Galaxy
  1 sibling, 1 reply; 55+ messages in thread
From: Leon Romanovsky @ 2024-10-07 18:15 UTC (permalink / raw)
  To: Michael Galaxy
  Cc: Yu Zhang, Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
> Hi,
> 
> On 10/7/24 03:47, Yu Zhang wrote:
> > !-------------------------------------------------------------------|
> >    This Message Is From an External Sender
> >    This message came from outside your organization.
> > |-------------------------------------------------------------------!
> > 
> > Sure, as we talked at the KVM Forum, a possible approach is to set up
> > two VMs on a physical host, configure the SoftRoCE, and run the
> > migration test in two nested VMs to ensure that the migration data
> > traffic goes through the emulated RDMA hardware. I will continue with
> > this and let you know.
> > 
> Acknowledged. Do share if you have any problems with it, like if it has
> compatibility issues
> or if we need a different solution. We're open to change.
> 
> I'm not familiar with the "current state" of this or how well it would even
> work.

Any compatibility issue between versions of RXE (SoftRoCE) or between
RXE and real devices is a bug in RXE, which should be fixed.

RXE is expected to be compatible with rest RoCE devices, both virtual
and physical.

Thanks

> 
> - Michael
> 
> 
> > On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> > > 
> > > On 10/3/24 16:43, Peter Xu wrote:
> > > > !-------------------------------------------------------------------|
> > > >     This Message Is From an External Sender
> > > >     This message came from outside your organization.
> > > > |-------------------------------------------------------------------!
> > > > 
> > > > On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> > > > > What about the testing solution that I mentioned?
> > > > > 
> > > > > Does that satisfy your concerns? Or is there still a gap here that needs to
> > > > > be met?
> > > > I think such testing framework would be helpful, especially if we can kick
> > > > it off in CI when preparing pull requests, then we can make sure nothing
> > > > will break RDMA easily.
> > > > 
> > > > Meanwhile, we still need people committed to this and actively maintain it,
> > > > who knows the rdma code well.
> > > > 
> > > > Thanks,
> > > > 
> > > OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test
> > > along these lines that would ensure that future RDMA breakages are
> > > detected more easily?
> > > 
> > > What do you think?
> > > 
> > > - Michael
> > > 
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-07 18:15                               ` Leon Romanovsky
@ 2024-10-08  9:31                                 ` Zhu Yanjun
  0 siblings, 0 replies; 55+ messages in thread
From: Zhu Yanjun @ 2024-10-08  9:31 UTC (permalink / raw)
  To: Leon Romanovsky, Michael Galaxy
  Cc: Yu Zhang, Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

在 2024/10/8 2:15, Leon Romanovsky 写道:
> On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
>> Hi,
>>
>> On 10/7/24 03:47, Yu Zhang wrote:
>>> !-------------------------------------------------------------------|
>>>     This Message Is From an External Sender
>>>     This message came from outside your organization.
>>> |-------------------------------------------------------------------!
>>>
>>> Sure, as we talked at the KVM Forum, a possible approach is to set up
>>> two VMs on a physical host, configure the SoftRoCE, and run the
>>> migration test in two nested VMs to ensure that the migration data
>>> traffic goes through the emulated RDMA hardware. I will continue with
>>> this and let you know.
>>>
>> Acknowledged. Do share if you have any problems with it, like if it has
>> compatibility issues
>> or if we need a different solution. We're open to change.
>>
>> I'm not familiar with the "current state" of this or how well it would even
>> work.
> 
> Any compatibility issue between versions of RXE (SoftRoCE) or between
> RXE and real devices is a bug in RXE, which should be fixed.
> 
> RXE is expected to be compatible with rest RoCE devices, both virtual
> and physical.

 From my tests, about physical RoCE devices, for example, Nvidia MLX5 
and intel E810 (iRDMA), if RDMA feature is disabled on those devices. 
RXE can work well with them.

About Virtual devices, most virtual devices can work well with RXE, for 
example,bonding, veth. I have done a lot of tests with them.

If some virtual devices can not work well with RXE, please share the 
error messages in RDMA maillist.

Zhu Yanjun

> 
> Thanks
> 
>>
>> - Michael
>>
>>
>>> On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>>>>
>>>> On 10/3/24 16:43, Peter Xu wrote:
>>>>> !-------------------------------------------------------------------|
>>>>>      This Message Is From an External Sender
>>>>>      This message came from outside your organization.
>>>>> |-------------------------------------------------------------------!
>>>>>
>>>>> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>>>>>> What about the testing solution that I mentioned?
>>>>>>
>>>>>> Does that satisfy your concerns? Or is there still a gap here that needs to
>>>>>> be met?
>>>>> I think such testing framework would be helpful, especially if we can kick
>>>>> it off in CI when preparing pull requests, then we can make sure nothing
>>>>> will break RDMA easily.
>>>>>
>>>>> Meanwhile, we still need people committed to this and actively maintain it,
>>>>> who knows the rdma code well.
>>>>>
>>>>> Thanks,
>>>>>
>>>> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test
>>>> along these lines that would ensure that future RDMA breakages are
>>>> detected more easily?
>>>>
>>>> What do you think?
>>>>
>>>> - Michael
>>>>
>>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 0/6] refactor RDMA live migration based on rsocket API
  2024-10-07 13:45                             ` Michael Galaxy
  2024-10-07 18:15                               ` Leon Romanovsky
@ 2024-10-23 13:42                               ` Michael Galaxy
  1 sibling, 0 replies; 55+ messages in thread
From: Michael Galaxy @ 2024-10-23 13:42 UTC (permalink / raw)
  To: Yu Zhang
  Cc: Sean Hefty, Gonglei (Arei), Michael S. Tsirkin,
	qemu-devel@nongnu.org, elmar.gerdes@ionos.com, zhengchuan,
	berrange@redhat.com, armbru@redhat.com, lizhijian@fujitsu.com,
	pbonzini@redhat.com, Xiexiangyou, linux-rdma@vger.kernel.org,
	lixiao (H), jinpu.wang@ionos.com, Wangjialin

Hi All,

This is just a heads up: I will be changing employment soon, so my 
Akamai email address will cease to operate this week.

My personal email: michael@flatgalaxy.com. I'll re-subscribe later once 
I have come back online to work soon.

Thanks!

- Michael

On 10/7/24 08:45, Michael Galaxy wrote:
> Hi,
>
> On 10/7/24 03:47, Yu Zhang wrote:
>> !-------------------------------------------------------------------|
>>    This Message Is From an External Sender
>>    This message came from outside your organization.
>> |-------------------------------------------------------------------!
>>
>> Sure, as we talked at the KVM Forum, a possible approach is to set up
>> two VMs on a physical host, configure the SoftRoCE, and run the
>> migration test in two nested VMs to ensure that the migration data
>> traffic goes through the emulated RDMA hardware. I will continue with
>> this and let you know.
>>
> Acknowledged. Do share if you have any problems with it, like if it 
> has compatibility issues
> or if we need a different solution. We're open to change.
>
> I'm not familiar with the "current state" of this or how well it would 
> even work.
>
> - Michael
>
>
>> On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> 
>> wrote:
>>>
>>> On 10/3/24 16:43, Peter Xu wrote:
>>>> !-------------------------------------------------------------------|
>>>>     This Message Is From an External Sender
>>>>     This message came from outside your organization.
>>>> |-------------------------------------------------------------------!
>>>>
>>>> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>>>>> What about the testing solution that I mentioned?
>>>>>
>>>>> Does that satisfy your concerns? Or is there still a gap here that 
>>>>> needs to
>>>>> be met?
>>>> I think such testing framework would be helpful, especially if we 
>>>> can kick
>>>> it off in CI when preparing pull requests, then we can make sure 
>>>> nothing
>>>> will break RDMA easily.
>>>>
>>>> Meanwhile, we still need people committed to this and actively 
>>>> maintain it,
>>>> who knows the rdma code well.
>>>>
>>>> Thanks,
>>>>
>>> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test
>>> along these lines that would ensure that future RDMA breakages are
>>> detected more easily?
>>>
>>> What do you think?
>>>
>>> - Michael
>>>


^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2024-10-23 13:43 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-04 12:14 [PATCH 0/6] refactor RDMA live migration based on rsocket API Gonglei via
2024-06-04 12:14 ` [PATCH 1/6] migration: remove RDMA live migration temporarily Gonglei via
2024-06-04 14:01   ` David Hildenbrand
2024-06-05 10:02     ` Gonglei (Arei) via
2024-06-10 11:45   ` Markus Armbruster
2024-06-04 12:14 ` [PATCH 2/6] io: add QIOChannelRDMA class Gonglei via
2024-06-10  6:54   ` Jinpu Wang
2024-06-04 12:14 ` [PATCH 3/6] io/channel-rdma: support working in coroutine Gonglei via
2024-06-06 13:34   ` Haris Iqbal
2024-06-07  8:45     ` Gonglei (Arei) via
2024-06-07 10:01       ` Haris Iqbal
2024-06-07  9:04   ` Daniel P. Berrangé
2024-06-07  9:28     ` Gonglei (Arei) via
2024-06-04 12:14 ` [PATCH 4/6] tests/unit: add test-io-channel-rdma.c Gonglei via
2024-06-04 12:14 ` [PATCH 5/6] migration: introduce new RDMA live migration Gonglei via
2024-06-04 12:14 ` [PATCH 6/6] migration/rdma: support multifd for RDMA migration Gonglei via
2024-06-04 19:32 ` [PATCH 0/6] refactor RDMA live migration based on rsocket API Peter Xu
2024-06-05 10:09   ` Gonglei (Arei) via
2024-06-05 14:18     ` Peter Xu
2024-06-07  8:49       ` Gonglei (Arei) via
2024-06-10 16:35         ` Peter Xu
2024-06-07 10:06   ` Daniel P. Berrangé
2024-06-05  7:57 ` Michael S. Tsirkin
2024-06-05 10:00   ` Gonglei (Arei) via
2024-06-05 10:23     ` Michael S. Tsirkin
2024-06-06 11:31     ` Leon Romanovsky
2024-06-07  1:04       ` Zhijian Li (Fujitsu) via
2024-06-07 16:24     ` Yu Zhang
2024-06-07  5:53 ` Jinpu Wang
2024-06-07  8:28   ` Gonglei (Arei) via
2024-06-10 16:31     ` Peter Xu
2024-08-27 20:15 ` Peter Xu
2024-08-27 20:57   ` Michael S. Tsirkin
2024-09-22 19:29     ` Michael Galaxy
2024-09-23  1:04       ` Gonglei (Arei) via
2024-09-25 15:08         ` Peter Xu
2024-09-27 21:45           ` Sean Hefty
2024-09-28 17:52             ` Michael Galaxy
2024-09-29 18:14               ` Michael S. Tsirkin
2024-09-29 20:26                 ` Michael Galaxy
2024-09-29 22:26                   ` Michael S. Tsirkin
2024-09-30 15:00                     ` Michael Galaxy
2024-09-30 15:31                       ` Yu Zhang
2024-09-30 18:16               ` Peter Xu
2024-09-30 19:20                 ` Sean Hefty
2024-09-30 19:47                   ` Peter Xu
2024-10-03 21:26                     ` Michael Galaxy
2024-10-03 21:43                       ` Peter Xu
2024-10-04 14:04                         ` Michael Galaxy
2024-10-07  8:47                           ` Yu Zhang
2024-10-07 13:45                             ` Michael Galaxy
2024-10-07 18:15                               ` Leon Romanovsky
2024-10-08  9:31                                 ` Zhu Yanjun
2024-10-23 13:42                               ` Michael Galaxy
2024-09-27 20:34         ` Michael Galaxy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).