* [PATCH v8 0/5] MSG_ZEROCOPY + multifd
@ 2022-02-01 6:28 Leonardo Bras
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
` (4 more replies)
0 siblings, 5 replies; 24+ messages in thread
From: Leonardo Bras @ 2022-02-01 6:28 UTC (permalink / raw)
To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
Markus Armbruster, Fam Zheng, Peter Xu
Cc: Leonardo Bras, qemu-devel, qemu-block
This patch series enables MSG_ZEROCOPY in QIOChannel and makes use of it
to improve multifd migration performance by reducing CPU usage.
Patch #1 creates new callbacks for QIOChannel, allowing the implementation
of zero copy writing.
Patch #2 implements io_writev flags and io_flush() on QIOChannelSocket,
making use of MSG_ZEROCOPY on Linux.
Patch #3 adds a "zero_copy_send" migration property, only available with
CONFIG_LINUX, and compiled out on any other platform.
This migration property has to be enabled before the multifd migration starts.
Patch #4 adds a helper function that allows checking whether TLS is going
to be used. This helper is later used in patch #5.
Patch #5 makes use of the QIOChannelSocket zero copy implementation in
nocomp multifd migration.
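For reference, enabling the feature from the HMP monitor looks roughly like
this (illustrative only; destination and channel count are placeholders, and
zero-copy-send is the parameter introduced in patch #3):

  (qemu) migrate_set_capability multifd on
  (qemu) migrate_set_parameter multifd-channels 8
  (qemu) migrate_set_parameter zero-copy-send on
  (qemu) migrate -d tcp:<dest-host>:<port>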
Results:
In preliminary tests, the resource usage of __sys_sendmsg() dropped by a
factor of 15, and the overall migration took 13-22% less time, measured on
a synthetic cpu workload.
In further tests, it was noted that, on multifd migration with 8 channels:
- On idle hosts, migration time was reduced by 10% to 21%.
- On hosts busy with heavy cpu stress (1 stress thread per cpu, but
not cpu-pinned), migration time was reduced by ~25% by enabling zero-copy.
- On hosts with heavy cpu-pinned workloads (1 stress thread per cpu,
cpu-pinned), migration time was reduced by ~66% by enabling zero-copy.
Setup for the above tests:
- Sending and receiving hosts:
- CPU: Intel(R) Xeon(R) Platinum 8276L @ 2.20GHz (448 CPUs)
- Network card: E810-C (100Gbps)
- >1TB RAM
- QEMU: upstream master branch + this patchset
- Linux: upstream v5.15
- VM configuration:
- 28 VCPUs
- 512GB RAM
---
Changes since v7:
- Migration property renamed from zero-copy to zero-copy-send
- A few early checks added to help misconfigurations fail earlier
- qio_channel_full*_flags() renamed back to qio_channel_full*()
- multifd_send_sync_main() reverted back to not receiving a flag,
so it always syncs zero-copy when enabled.
- Improved code quality in a few places
Changes since v6:
- Remove io_writev_zero_copy(), making use of the new io_writev() flags
to achieve the same results.
- Rename io_flush_zero_copy() to io_flush()
- Previous patch #2 became too small, so it was squashed into previous
patch #3 (now patch #2)
Changes since v5:
- flush_zero_copy now returns -1 on failure, 0 on success, and 1 when none
of the processed writes were able to use zerocopy in the kernel.
- qio_channel_socket_poll() removed, using qio_channel_wait() instead
- ENOBUFS is now processed inside qio_channel_socket_writev_flags()
- Most zerocopy parameter validation moved to migrate_params_check(),
leaving only feature test to socket_outgoing_migration() callback
- Naming went from *zerocopy to *zero_copy or *zero-copy, due to QAPI/QMP
preferences
- Improved docs
Changes since v4:
- 3 patches were split into 6
- Flush is used for syncing after each iteration, instead of only at the end
- If zerocopy is not available, fail in connect instead of failing on write
- 'multifd-zerocopy' property renamed to 'zerocopy'
- Fail migrations that don't support zerocopy, if it's enabled.
- Instead of checking for zerocopy at each write, save the flags in
MultiFDSendParams->write_flags and use them on write
- Reorganized flag usage in QIOChannelSocket
- A lot of typos fixed
- More doc on buffer restrictions
Changes since v3:
- QIOChannel interface names changed from io_async_{writev,flush} to
io_{writev,flush}_zerocopy
- Instead of falling back in case zerocopy is not implemented, return an
error and abort the operation.
- Flush now waits as long as needed, or returns an error in case anything
goes wrong, aborting the operation.
- Zerocopy is now conditional in multifd, being set by parameter
multifd-zerocopy
- Moved zerocopy_flush from multifd_save_cleanup() to multifd_send_sync_main(),
so migration can abort if the flush goes wrong.
- Several other small improvements
Changes since v2:
- Patch #1: One more fallback
- Patch #2: Fall back to sync if it fails to lock buffer memory in MSG_ZEROCOPY send.
Changes since v1:
- Reimplemented the patchset using async_write + async_flush approach.
- Implemented a flush to be able to tell whether all data was written.
Leonardo Bras (5):
QIOChannel: Add flags on io_writev and introduce io_flush callback
QIOChannelSocket: Implement io_writev zero copy flag & io_flush for
CONFIG_LINUX
migration: Add zero-copy-send parameter for QMP/HMP for Linux
migration: Add migrate_use_tls() helper
multifd: Implement zero copy write in multifd migration
(multifd-zero-copy)
qapi/migration.json | 24 ++++++
include/io/channel-socket.h | 2 +
include/io/channel.h | 38 +++++++++-
migration/migration.h | 6 ++
migration/multifd.h | 4 +-
chardev/char-io.c | 2 +-
hw/remote/mpqemu-link.c | 2 +-
io/channel-buffer.c | 1 +
io/channel-command.c | 1 +
io/channel-file.c | 1 +
io/channel-socket.c | 110 +++++++++++++++++++++++++++-
io/channel-tls.c | 1 +
io/channel-websock.c | 1 +
io/channel.c | 53 +++++++++++---
migration/channel.c | 3 +-
migration/migration.c | 52 ++++++++++++-
migration/multifd.c | 46 +++++++++---
migration/ram.c | 29 ++++++--
migration/rdma.c | 1 +
migration/socket.c | 6 ++
monitor/hmp-cmds.c | 6 ++
scsi/pr-manager-helper.c | 2 +-
tests/unit/test-io-channel-socket.c | 1 +
23 files changed, 353 insertions(+), 39 deletions(-)
--
2.34.1
* [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-01 6:28 [PATCH v8 0/5] MSG_ZEROCOPY + multifd Leonardo Bras
@ 2022-02-01 6:28 ` Leonardo Bras
2022-02-01 9:35 ` Daniel P. Berrangé
` (2 more replies)
2022-02-01 6:29 ` [PATCH v8 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
` (3 subsequent siblings)
4 siblings, 3 replies; 24+ messages in thread
From: Leonardo Bras @ 2022-02-01 6:28 UTC (permalink / raw)
To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
Markus Armbruster, Fam Zheng, Peter Xu
Cc: Leonardo Bras, qemu-devel, qemu-block
Add flags to io_writev and introduce io_flush as an optional callback to
QIOChannelClass, allowing the implementation of zero copy writes by
subclasses.
How to use them:
- Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
- Wait write completion with qio_channel_flush().
Notes:
As some zero copy write implementations work asynchronously, it's
recommended to keep the write buffer untouched until the return of
qio_channel_flush(), to avoid the risk of sending an updated buffer
instead of the buffer's state at the time of the write.
As the io_flush callback is optional, if a subclass does not implement it,
then io_flush will return 0 without changing anything.
Also, some functions like qio_channel_writev_full_all() were adapted to
receive a flags parameter. That allows code to be shared between the zero
copy and non-zero copy writev paths, and makes it easier to implement new
flags.
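For illustration, a minimal caller sketch of the intended pattern, assuming
QEMU's include/io/channel.h and a connected channel that supports zero copy
(error handling shortened):

  static int send_zero_copy(QIOChannel *ioc, struct iovec *iov,
                            size_t niov, Error **errp)
  {
      /* Queue the buffer; with this flag the send may complete
       * asynchronously, so the buffer must not be modified yet. */
      if (qio_channel_writev_full_all(ioc, iov, niov, NULL, 0,
                                      QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
                                      errp) < 0) {
          return -1;
      }

      /* Block until every queued zero copy write is sent; after this
       * returns, the buffer may be reused or freed. */
      return qio_channel_flush(ioc, errp);
  }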
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
include/io/channel.h | 38 ++++++++++++++++++++-
chardev/char-io.c | 2 +-
hw/remote/mpqemu-link.c | 2 +-
io/channel-buffer.c | 1 +
io/channel-command.c | 1 +
io/channel-file.c | 1 +
io/channel-socket.c | 2 ++
io/channel-tls.c | 1 +
io/channel-websock.c | 1 +
io/channel.c | 53 +++++++++++++++++++++++------
migration/rdma.c | 1 +
scsi/pr-manager-helper.c | 2 +-
tests/unit/test-io-channel-socket.c | 1 +
13 files changed, 92 insertions(+), 14 deletions(-)
diff --git a/include/io/channel.h b/include/io/channel.h
index 88988979f8..c680ee7480 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
#define QIO_CHANNEL_ERR_BLOCK -2
+#define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
+
typedef enum QIOChannelFeature QIOChannelFeature;
enum QIOChannelFeature {
QIO_CHANNEL_FEATURE_FD_PASS,
QIO_CHANNEL_FEATURE_SHUTDOWN,
QIO_CHANNEL_FEATURE_LISTEN,
+ QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
};
@@ -104,6 +107,7 @@ struct QIOChannelClass {
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp);
ssize_t (*io_readv)(QIOChannel *ioc,
const struct iovec *iov,
@@ -136,6 +140,8 @@ struct QIOChannelClass {
IOHandler *io_read,
IOHandler *io_write,
void *opaque);
+ int (*io_flush)(QIOChannel *ioc,
+ Error **errp);
};
/* General I/O handling functions */
@@ -228,6 +234,7 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
* @niov: the length of the @iov array
* @fds: an array of file handles to send
* @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
* @errp: pointer to a NULL-initialized error object
*
* Write data to the IO channel, reading it from the
@@ -260,6 +267,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp);
/**
@@ -837,6 +845,7 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
* @niov: the length of the @iov array
* @fds: an array of file handles to send
* @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
* @errp: pointer to a NULL-initialized error object
*
*
@@ -846,6 +855,14 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
* to be written, yielding from the current coroutine
* if required.
*
+ * If QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is passed in flags,
+ * instead of waiting for all requested data to be written,
+ * this function will wait until it's all queued for writing.
+ * In this case, if the buffer gets changed between queueing and
+ * sending, the updated buffer will be sent. If this is not a
+ * desired behavior, it's suggested to call qio_channel_flush()
+ * before reusing the buffer.
+ *
* Returns: 0 if all bytes were written, or -1 on error
*/
@@ -853,6 +870,25 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
const struct iovec *iov,
size_t niov,
int *fds, size_t nfds,
- Error **errp);
+ int flags, Error **errp);
+
+/**
+ * qio_channel_flush:
+ * @ioc: the channel object
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Will block until every packet queued with
+ * qio_channel_writev_full() + QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
+ * is sent, or return in case of any error.
+ *
+ * If not implemented, acts as a no-op, and returns 0.
+ *
+ * Returns -1 if any error is found,
+ * 1 if every send failed to use zero copy.
+ * 0 otherwise.
+ */
+
+int qio_channel_flush(QIOChannel *ioc,
+ Error **errp);
#endif /* QIO_CHANNEL_H */
diff --git a/chardev/char-io.c b/chardev/char-io.c
index 8ced184160..4451128cba 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
ret = qio_channel_writev_full(
ioc, &iov, 1,
- fds, nfds, NULL);
+ fds, nfds, 0, NULL);
if (ret == QIO_CHANNEL_ERR_BLOCK) {
if (offset) {
return offset;
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index 7e841820e5..e8f556bd27 100644
--- a/hw/remote/mpqemu-link.c
+++ b/hw/remote/mpqemu-link.c
@@ -69,7 +69,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp)
}
if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
- fds, nfds, errp)) {
+ fds, nfds, 0, errp)) {
ret = true;
} else {
trace_mpqemu_send_io_error(msg->cmd, msg->size, nfds);
diff --git a/io/channel-buffer.c b/io/channel-buffer.c
index baa4e2b089..bf52011be2 100644
--- a/io/channel-buffer.c
+++ b/io/channel-buffer.c
@@ -81,6 +81,7 @@ static ssize_t qio_channel_buffer_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelBuffer *bioc = QIO_CHANNEL_BUFFER(ioc);
diff --git a/io/channel-command.c b/io/channel-command.c
index 338da73ade..54560464ae 100644
--- a/io/channel-command.c
+++ b/io/channel-command.c
@@ -258,6 +258,7 @@ static ssize_t qio_channel_command_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelCommand *cioc = QIO_CHANNEL_COMMAND(ioc);
diff --git a/io/channel-file.c b/io/channel-file.c
index d7cf6d278f..ef6807a6be 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -114,6 +114,7 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 459922c874..b2d254ef8d 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -525,6 +525,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
@@ -620,6 +621,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
diff --git a/io/channel-tls.c b/io/channel-tls.c
index 2ae1b92fc0..4ce890a538 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -301,6 +301,7 @@ static ssize_t qio_channel_tls_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
diff --git a/io/channel-websock.c b/io/channel-websock.c
index 70889bb54d..035dd6075b 100644
--- a/io/channel-websock.c
+++ b/io/channel-websock.c
@@ -1127,6 +1127,7 @@ static ssize_t qio_channel_websock_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelWebsock *wioc = QIO_CHANNEL_WEBSOCK(ioc);
diff --git a/io/channel.c b/io/channel.c
index e8b019dc36..b8b99fdc4c 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -72,18 +72,32 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
- if ((fds || nfds) &&
- !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
+ if (fds || nfds) {
+ if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
+ error_setg_errno(errp, EINVAL,
+ "Channel does not support file descriptor passing");
+ return -1;
+ }
+ if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+ error_setg_errno(errp, EINVAL,
+ "Zero Copy does not support file descriptor passing");
+ return -1;
+ }
+ }
+
+ if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
+ !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
error_setg_errno(errp, EINVAL,
- "Channel does not support file descriptor passing");
+ "Requested Zero Copy feature is not available");
return -1;
}
- return klass->io_writev(ioc, iov, niov, fds, nfds, errp);
+ return klass->io_writev(ioc, iov, niov, fds, nfds, flags, errp);
}
@@ -217,14 +231,14 @@ int qio_channel_writev_all(QIOChannel *ioc,
size_t niov,
Error **errp)
{
- return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, errp);
+ return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, 0, errp);
}
int qio_channel_writev_full_all(QIOChannel *ioc,
const struct iovec *iov,
size_t niov,
int *fds, size_t nfds,
- Error **errp)
+ int flags, Error **errp)
{
int ret = -1;
struct iovec *local_iov = g_new(struct iovec, niov);
@@ -235,10 +249,16 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
iov, niov,
0, iov_size(iov, niov));
+ if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+ assert(fds == NULL && nfds == 0);
+ }
+
while (nlocal_iov > 0) {
ssize_t len;
- len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds, nfds,
- errp);
+
+ len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds,
+ nfds, flags, errp);
+
if (len == QIO_CHANNEL_ERR_BLOCK) {
if (qemu_in_coroutine()) {
qio_channel_yield(ioc, G_IO_OUT);
@@ -277,7 +297,7 @@ ssize_t qio_channel_writev(QIOChannel *ioc,
size_t niov,
Error **errp)
{
- return qio_channel_writev_full(ioc, iov, niov, NULL, 0, errp);
+ return qio_channel_writev_full(ioc, iov, niov, NULL, 0, 0, errp);
}
@@ -297,7 +317,7 @@ ssize_t qio_channel_write(QIOChannel *ioc,
Error **errp)
{
struct iovec iov = { .iov_base = (char *)buf, .iov_len = buflen };
- return qio_channel_writev_full(ioc, &iov, 1, NULL, 0, errp);
+ return qio_channel_writev_full(ioc, &iov, 1, NULL, 0, 0, errp);
}
@@ -473,6 +493,19 @@ off_t qio_channel_io_seek(QIOChannel *ioc,
return klass->io_seek(ioc, offset, whence, errp);
}
+int qio_channel_flush(QIOChannel *ioc,
+ Error **errp)
+{
+ QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+ if (!klass->io_flush ||
+ !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+ return 0;
+ }
+
+ return klass->io_flush(ioc, errp);
+}
+
static void qio_channel_restart_read(void *opaque)
{
diff --git a/migration/rdma.c b/migration/rdma.c
index c7c7a38487..ed2a2a013b 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2833,6 +2833,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
size_t niov,
int *fds,
size_t nfds,
+ int flags,
Error **errp)
{
QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
index 451c7631b7..3be52a98d5 100644
--- a/scsi/pr-manager-helper.c
+++ b/scsi/pr-manager-helper.c
@@ -77,7 +77,7 @@ static int pr_manager_helper_write(PRManagerHelper *pr_mgr,
iov.iov_base = (void *)buf;
iov.iov_len = sz;
n_written = qio_channel_writev_full(QIO_CHANNEL(pr_mgr->ioc), &iov, 1,
- nfds ? &fd : NULL, nfds, errp);
+ nfds ? &fd : NULL, nfds, 0, errp);
if (n_written <= 0) {
assert(n_written != QIO_CHANNEL_ERR_BLOCK);
diff --git a/tests/unit/test-io-channel-socket.c b/tests/unit/test-io-channel-socket.c
index c49eec1f03..6713886d02 100644
--- a/tests/unit/test-io-channel-socket.c
+++ b/tests/unit/test-io-channel-socket.c
@@ -444,6 +444,7 @@ static void test_io_channel_unix_fd_pass(void)
G_N_ELEMENTS(iosend),
fdsend,
G_N_ELEMENTS(fdsend),
+ 0,
&error_abort);
qio_channel_readv_full(dst,
--
2.34.1
* [PATCH v8 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
2022-02-01 6:28 [PATCH v8 0/5] MSG_ZEROCOPY + multifd Leonardo Bras
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
@ 2022-02-01 6:29 ` Leonardo Bras
2022-02-18 16:38 ` Juan Quintela
2022-02-01 6:29 ` [PATCH v8 3/5] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
` (2 subsequent siblings)
4 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2022-02-01 6:29 UTC (permalink / raw)
To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
Markus Armbruster, Fam Zheng, Peter Xu
Cc: Leonardo Bras, qemu-devel, qemu-block
For CONFIG_LINUX, implement the new zero copy flag and the optional callback
io_flush on QIOChannelSocket, but enable it only when the MSG_ZEROCOPY
feature is available in the host kernel, which is checked in
qio_channel_socket_connect_sync().
qio_channel_socket_flush() was implemented by counting how many times
sendmsg(..., MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue to find out how many of them finished sending.
Flush loops until those counters match, or until some error occurs.
Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
some caution is necessary to avoid overwriting any buffer before it's sent.
If something like this happens, a newer version of the buffer may be sent instead.
- If this is a problem, it's recommended to call qio_channel_flush() before freeing
or reusing the buffer.
2: Locked memory
- When using MSG_ZEROCOPY, the buffer memory is locked after being queued, and
unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, this may require
a larger amount of locked memory than is usually available to a non-root user.
- If the required amount of locked memory is not available, writev_zero_copy
will return an error, which can abort an operation like migration.
- Because of this, when user code wants to add zero copy as a feature, it
requires a mechanism to disable it, so the feature can still be accessible to less
privileged users.
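For reference, a hypothetical way for user code to sanity-check the locked
memory limit before enabling the feature (an illustrative sketch only, not
part of this series; can_lock_memory() is a made-up helper name):

  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/resource.h>

  /* Returns true if at least 'needed' bytes of memory may be locked. */
  static bool can_lock_memory(size_t needed)
  {
      struct rlimit lim;

      if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
          return false;
      }

      return lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur >= needed;
  }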
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
include/io/channel-socket.h | 2 +
io/channel-socket.c | 108 ++++++++++++++++++++++++++++++++++--
2 files changed, 106 insertions(+), 4 deletions(-)
diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..513c428fe4 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@ struct QIOChannelSocket {
socklen_t localAddrLen;
struct sockaddr_storage remoteAddr;
socklen_t remoteAddrLen;
+ ssize_t zero_copy_queued;
+ ssize_t zero_copy_sent;
};
diff --git a/io/channel-socket.c b/io/channel-socket.c
index b2d254ef8d..155a0a2ada 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -26,6 +26,10 @@
#include "io/channel-watch.h"
#include "trace.h"
#include "qapi/clone-visitor.h"
+#ifdef CONFIG_LINUX
+#include <linux/errqueue.h>
+#include <bits/socket.h>
+#endif
#define SOCKET_MAX_FDS 16
@@ -55,6 +59,8 @@ qio_channel_socket_new(void)
sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
sioc->fd = -1;
+ sioc->zero_copy_queued = 0;
+ sioc->zero_copy_sent = 0;
ioc = QIO_CHANNEL(sioc);
qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -154,6 +160,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
return -1;
}
+#ifdef CONFIG_LINUX
+ int ret, v = 1;
+ ret = qemu_setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+ if (ret == 0) {
+ /* Zero copy available on host */
+ qio_channel_set_feature(QIO_CHANNEL(ioc),
+ QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+ }
+#endif
+
return 0;
}
@@ -534,6 +550,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
size_t fdsize = sizeof(int) * nfds;
struct cmsghdr *cmsg;
+ int sflags = 0;
memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
@@ -558,15 +575,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
memcpy(CMSG_DATA(cmsg), fds, fdsize);
}
+ if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+ sflags = MSG_ZEROCOPY;
+ }
+
retry:
- ret = sendmsg(sioc->fd, &msg, 0);
+ ret = sendmsg(sioc->fd, &msg, sflags);
if (ret <= 0) {
- if (errno == EAGAIN) {
+ switch (errno) {
+ case EAGAIN:
return QIO_CHANNEL_ERR_BLOCK;
- }
- if (errno == EINTR) {
+ case EINTR:
goto retry;
+ case ENOBUFS:
+ if (sflags & MSG_ZEROCOPY) {
+ error_setg_errno(errp, errno,
+ "Process can't lock enough memory for using MSG_ZEROCOPY");
+ return -1;
+ }
+ break;
}
+
error_setg_errno(errp, errno,
"Unable to write to socket");
return -1;
@@ -660,6 +689,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
}
#endif /* WIN32 */
+
+#ifdef CONFIG_LINUX
+static int qio_channel_socket_flush(QIOChannel *ioc,
+ Error **errp)
+{
+ QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+ struct msghdr msg = {};
+ struct sock_extended_err *serr;
+ struct cmsghdr *cm;
+ char control[CMSG_SPACE(sizeof(*serr))];
+ int received;
+ int ret = 1;
+
+ msg.msg_control = control;
+ msg.msg_controllen = sizeof(control);
+ memset(control, 0, sizeof(control));
+
+ while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
+ received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
+ if (received < 0) {
+ switch (errno) {
+ case EAGAIN:
+ /* Nothing on errqueue, wait until something is available */
+ qio_channel_wait(ioc, G_IO_ERR);
+ continue;
+ case EINTR:
+ continue;
+ default:
+ error_setg_errno(errp, errno,
+ "Unable to read errqueue");
+ return -1;
+ }
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (cm->cmsg_level != SOL_IP &&
+ cm->cmsg_type != IP_RECVERR) {
+ error_setg_errno(errp, EPROTOTYPE,
+ "Wrong cmsg in errqueue");
+ return -1;
+ }
+
+ serr = (void *) CMSG_DATA(cm);
+ if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
+ error_setg_errno(errp, serr->ee_errno,
+ "Error on socket");
+ return -1;
+ }
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ error_setg_errno(errp, serr->ee_origin,
+ "Error not from zero copy");
+ return -1;
+ }
+
+ /* No errors, count successfully finished sendmsg()*/
+ sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
+
+ /* If any sendmsg() succeeded using zero copy, return 0 at the end */
+ if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
+ ret = 0;
+ }
+ }
+
+ return ret;
+}
+
+#endif /* CONFIG_LINUX */
+
static int
qio_channel_socket_set_blocking(QIOChannel *ioc,
bool enabled,
@@ -790,6 +887,9 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
ioc_klass->io_set_delay = qio_channel_socket_set_delay;
ioc_klass->io_create_watch = qio_channel_socket_create_watch;
ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+#ifdef CONFIG_LINUX
+ ioc_klass->io_flush = qio_channel_socket_flush;
+#endif
}
static const TypeInfo qio_channel_socket_info = {
--
2.34.1
* [PATCH v8 3/5] migration: Add zero-copy-send parameter for QMP/HMP for Linux
2022-02-01 6:28 [PATCH v8 0/5] MSG_ZEROCOPY + multifd Leonardo Bras
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
2022-02-01 6:29 ` [PATCH v8 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
@ 2022-02-01 6:29 ` Leonardo Bras
2022-02-18 16:39 ` Juan Quintela
2022-02-01 6:29 ` [PATCH v8 4/5] migration: Add migrate_use_tls() helper Leonardo Bras
2022-02-01 6:29 ` [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
4 siblings, 1 reply; 24+ messages in thread
From: Leonardo Bras @ 2022-02-01 6:29 UTC (permalink / raw)
To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
Markus Armbruster, Fam Zheng, Peter Xu
Cc: Leonardo Bras, qemu-devel, qemu-block
Add a property that allows zero-copy migration of memory pages
on the sending side, and also include a helper function
migrate_use_zero_copy_send() to check whether it's enabled.
No code is introduced to actually do the migration yet, but this allows
future implementations to enable/disable the feature.
On non-Linux builds this parameter is compiled out.
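For illustration, setting the parameter over QMP would look roughly like
this (a sketch following the usual migrate-set-parameters conventions):

  -> { "execute": "migrate-set-parameters",
       "arguments": { "zero-copy-send": true } }
  <- { "return": {} }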
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
qapi/migration.json | 24 ++++++++++++++++++++++++
migration/migration.h | 5 +++++
migration/migration.c | 32 ++++++++++++++++++++++++++++++++
migration/socket.c | 5 +++++
monitor/hmp-cmds.c | 6 ++++++
5 files changed, 72 insertions(+)
diff --git a/qapi/migration.json b/qapi/migration.json
index 5975a0e104..5b4753b5de 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -741,6 +741,13 @@
# will consume more CPU.
# Defaults to 1. (Since 5.0)
#
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+# When true, enables a zero-copy mechanism for sending memory
+# pages, if host supports it.
+# Requires that QEMU be permitted to use locked memory for guest
+# RAM pages.
+# Defaults to false. (Since 7.1)
+#
# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
# aliases for the purpose of dirty bitmap migration. Such
# aliases may for example be the corresponding names on the
@@ -780,6 +787,7 @@
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
'multifd-zlib-level' ,'multifd-zstd-level',
+ { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
'block-bitmap-mapping' ] }
##
@@ -906,6 +914,13 @@
# will consume more CPU.
# Defaults to 1. (Since 5.0)
#
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+# When true, enables a zero-copy mechanism for sending memory
+# pages, if host supports it.
+# Requires that QEMU be permitted to use locked memory for guest
+# RAM pages.
+# Defaults to false. (Since 7.1)
+#
# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
# aliases for the purpose of dirty bitmap migration. Such
# aliases may for example be the corresponding names on the
@@ -960,6 +975,7 @@
'*multifd-compression': 'MultiFDCompression',
'*multifd-zlib-level': 'uint8',
'*multifd-zstd-level': 'uint8',
+ '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
##
@@ -1106,6 +1122,13 @@
# will consume more CPU.
# Defaults to 1. (Since 5.0)
#
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+# When true, enables a zero-copy mechanism for sending memory
+# pages, if host supports it.
+# Requires that QEMU be permitted to use locked memory for guest
+# RAM pages.
+# Defaults to false. (Since 7.1)
+#
# @block-bitmap-mapping: Maps block nodes and bitmaps on them to
# aliases for the purpose of dirty bitmap migration. Such
# aliases may for example be the corresponding names on the
@@ -1158,6 +1181,7 @@
'*multifd-compression': 'MultiFDCompression',
'*multifd-zlib-level': 'uint8',
'*multifd-zstd-level': 'uint8',
+ '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
##
diff --git a/migration/migration.h b/migration/migration.h
index 8130b703eb..4cbc901ea0 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -339,6 +339,11 @@ MultiFDCompression migrate_multifd_compression(void);
int migrate_multifd_zlib_level(void);
int migrate_multifd_zstd_level(void);
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy_send(void);
+#else
+#define migrate_use_zero_copy_send() (false)
+#endif
int migrate_use_xbzrle(void);
uint64_t migrate_xbzrle_cache_size(void);
bool migrate_colo_enabled(void);
diff --git a/migration/migration.c b/migration/migration.c
index bcc385b94b..1b3230a97b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -893,6 +893,10 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->multifd_zlib_level = s->parameters.multifd_zlib_level;
params->has_multifd_zstd_level = true;
params->multifd_zstd_level = s->parameters.multifd_zstd_level;
+#ifdef CONFIG_LINUX
+ params->has_zero_copy_send = true;
+ params->zero_copy_send = s->parameters.zero_copy_send;
+#endif
params->has_xbzrle_cache_size = true;
params->xbzrle_cache_size = s->parameters.xbzrle_cache_size;
params->has_max_postcopy_bandwidth = true;
@@ -1549,6 +1553,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_multifd_compression) {
dest->multifd_compression = params->multifd_compression;
}
+#ifdef CONFIG_LINUX
+ if (params->has_zero_copy_send) {
+ dest->zero_copy_send = params->zero_copy_send;
+ }
+#endif
if (params->has_xbzrle_cache_size) {
dest->xbzrle_cache_size = params->xbzrle_cache_size;
}
@@ -1661,6 +1670,11 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_multifd_compression) {
s->parameters.multifd_compression = params->multifd_compression;
}
+#ifdef CONFIG_LINUX
+ if (params->has_zero_copy_send) {
+ s->parameters.zero_copy_send = params->zero_copy_send;
+ }
+#endif
if (params->has_xbzrle_cache_size) {
s->parameters.xbzrle_cache_size = params->xbzrle_cache_size;
xbzrle_cache_resize(params->xbzrle_cache_size, errp);
@@ -2551,6 +2565,17 @@ int migrate_multifd_zstd_level(void)
return s->parameters.multifd_zstd_level;
}
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy_send(void)
+{
+ MigrationState *s;
+
+ s = migrate_get_current();
+
+ return s->parameters.zero_copy_send;
+}
+#endif
+
int migrate_use_xbzrle(void)
{
MigrationState *s;
@@ -4194,6 +4219,10 @@ static Property migration_properties[] = {
DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
parameters.multifd_zstd_level,
DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
+#ifdef CONFIG_LINUX
+ DEFINE_PROP_BOOL("zero_copy_send", MigrationState,
+ parameters.zero_copy_send, false),
+#endif
DEFINE_PROP_SIZE("xbzrle-cache-size", MigrationState,
parameters.xbzrle_cache_size,
DEFAULT_MIGRATE_XBZRLE_CACHE_SIZE),
@@ -4291,6 +4320,9 @@ static void migration_instance_init(Object *obj)
params->has_multifd_compression = true;
params->has_multifd_zlib_level = true;
params->has_multifd_zstd_level = true;
+#ifdef CONFIG_LINUX
+ params->has_zero_copy_send = true;
+#endif
params->has_xbzrle_cache_size = true;
params->has_max_postcopy_bandwidth = true;
params->has_max_cpu_throttle = true;
diff --git a/migration/socket.c b/migration/socket.c
index 05705a32d8..f586f983b7 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -77,6 +77,11 @@ static void socket_outgoing_migration(QIOTask *task,
} else {
trace_migration_socket_outgoing_connected(data->hostname);
}
+
+ if (migrate_use_zero_copy_send()) {
+ error_setg(&err, "Zero copy send not available in migration");
+ }
+
migration_channel_connect(data->s, sioc, data->hostname, err);
object_unref(OBJECT(sioc));
}
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 8c384dc1b2..abd566d8cb 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1309,6 +1309,12 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_multifd_zstd_level = true;
visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
break;
+#ifdef CONFIG_LINUX
+ case MIGRATION_PARAMETER_ZERO_COPY_SEND:
+ p->has_zero_copy_send = true;
+ visit_type_bool(v, param, &p->zero_copy_send, &err);
+ break;
+#endif
case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE:
p->has_xbzrle_cache_size = true;
if (!visit_type_size(v, param, &cache_size, &err)) {
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v8 4/5] migration: Add migrate_use_tls() helper
2022-02-01 6:28 [PATCH v8 0/5] MSG_ZEROCOPY + multifd Leonardo Bras
` (2 preceding siblings ...)
2022-02-01 6:29 ` [PATCH v8 3/5] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
@ 2022-02-01 6:29 ` Leonardo Bras
2022-02-01 6:29 ` [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
4 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras @ 2022-02-01 6:29 UTC (permalink / raw)
To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
Markus Armbruster, Fam Zheng, Peter Xu
Cc: Leonardo Bras, qemu-devel, qemu-block
A lot of places check parameters.tls_creds in order to evaluate whether TLS
is in use, and sometimes call migrate_get_current() just for that test.
Add a new helper function migrate_use_tls() in order to simplify testing
for TLS usage.
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
migration/migration.h | 1 +
migration/channel.c | 3 +--
migration/migration.c | 9 +++++++++
migration/multifd.c | 5 +----
4 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index 4cbc901ea0..debacb2251 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -344,6 +344,7 @@ bool migrate_use_zero_copy_send(void);
#else
#define migrate_use_zero_copy_send() (false)
#endif
+int migrate_use_tls(void);
int migrate_use_xbzrle(void);
uint64_t migrate_xbzrle_cache_size(void);
bool migrate_colo_enabled(void);
diff --git a/migration/channel.c b/migration/channel.c
index c4fc000a1a..086b5c0d8b 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -38,8 +38,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
trace_migration_set_incoming_channel(
ioc, object_get_typename(OBJECT(ioc)));
- if (s->parameters.tls_creds &&
- *s->parameters.tls_creds &&
+ if (migrate_use_tls() &&
!object_dynamic_cast(OBJECT(ioc),
TYPE_QIO_CHANNEL_TLS)) {
migration_tls_channel_process_incoming(s, ioc, &local_err);
diff --git a/migration/migration.c b/migration/migration.c
index 1b3230a97b..3e0a25bb5b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2576,6 +2576,15 @@ bool migrate_use_zero_copy_send(void)
}
#endif
+int migrate_use_tls(void)
+{
+ MigrationState *s;
+
+ s = migrate_get_current();
+
+ return s->parameters.tls_creds && *s->parameters.tls_creds;
+}
+
int migrate_use_xbzrle(void)
{
MigrationState *s;
diff --git a/migration/multifd.c b/migration/multifd.c
index 76b57a7177..43998ad117 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -784,14 +784,11 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
QIOChannel *ioc,
Error *error)
{
- MigrationState *s = migrate_get_current();
-
trace_multifd_set_outgoing_channel(
ioc, object_get_typename(OBJECT(ioc)), p->tls_hostname, error);
if (!error) {
- if (s->parameters.tls_creds &&
- *s->parameters.tls_creds &&
+ if (migrate_use_tls() &&
!object_dynamic_cast(OBJECT(ioc),
TYPE_QIO_CHANNEL_TLS)) {
multifd_tls_channel_connect(p, ioc, &error);
--
2.34.1
* [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-01 6:28 [PATCH v8 0/5] MSG_ZEROCOPY + multifd Leonardo Bras
` (3 preceding siblings ...)
2022-02-01 6:29 ` [PATCH v8 4/5] migration: Add migrate_use_tls() helper Leonardo Bras
@ 2022-02-01 6:29 ` Leonardo Bras
2022-02-08 2:22 ` Peter Xu
2022-02-18 16:57 ` Juan Quintela
4 siblings, 2 replies; 24+ messages in thread
From: Leonardo Bras @ 2022-02-01 6:29 UTC (permalink / raw)
To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
Markus Armbruster, Fam Zheng, Peter Xu
Cc: Leonardo Bras, qemu-devel, qemu-block
Implement zero copy send on nocomp_send_write(), by making use of the
QIOChannel writev + flags & flush interface.
Change multifd_send_sync_main() so flush_zero_copy() can be called
after each iteration, in order to make sure all dirty pages are sent before
a new iteration is started. It will also flush at the beginning and at the
end of migration.
Also make it return -1 if flush_zero_copy() fails, in order to cancel
the migration process and avoid resuming the guest on the target host
without it having received all of the current RAM.
This works fine for RAM migration because RAM pages are not usually freed,
and there is no problem with a page's contents changing between
writev_zero_copy() and the actual sending of the buffer: such a change
dirties the page and causes it to be re-sent in a later iteration anyway.
A lot of locked memory may be needed in order to use multifd migration
with zero-copy enabled, so a way to disable the feature is necessary for
low-privileged users trying to perform multifd migrations.
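Putting the series together, a zero-copy multifd migration could be driven
over QMP roughly as follows (a sketch; host, port and channel count are
placeholders):

  -> { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "multifd", "state": true } ] } }
  -> { "execute": "migrate-set-parameters",
       "arguments": { "multifd-channels": 8, "zero-copy-send": true } }
  -> { "execute": "migrate",
       "arguments": { "uri": "tcp:<dest-host>:<port>" } }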
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
migration/multifd.h | 4 +++-
migration/migration.c | 11 ++++++++++-
migration/multifd.c | 41 +++++++++++++++++++++++++++++++++++------
migration/ram.c | 29 ++++++++++++++++++++++-------
migration/socket.c | 5 +++--
5 files changed, 73 insertions(+), 17 deletions(-)
diff --git a/migration/multifd.h b/migration/multifd.h
index 4dda900a0b..7ec688fb4f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -22,7 +22,7 @@ int multifd_load_cleanup(Error **errp);
bool multifd_recv_all_channels_created(void);
bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
void multifd_recv_sync_main(void);
-void multifd_send_sync_main(QEMUFile *f);
+int multifd_send_sync_main(QEMUFile *f);
int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
/* Multifd Compression flags */
@@ -96,6 +96,8 @@ typedef struct {
uint32_t packet_len;
/* pointer to the packet */
MultiFDPacket_t *packet;
+ /* multifd flags for sending ram */
+ int write_flags;
/* multifd flags for each packet */
uint32_t flags;
/* size of the next packet that contains pages */
diff --git a/migration/migration.c b/migration/migration.c
index 3e0a25bb5b..1450fd0370 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
return false;
}
-
+#ifdef CONFIG_LINUX
+ if (params->zero_copy_send &&
+ (!migrate_use_multifd() ||
+ params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
+ (params->tls_creds && *params->tls_creds))) {
+ error_setg(errp,
+ "Zero copy only available for non-compressed non-TLS multifd migration");
+ return false;
+ }
+#endif
return true;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index 43998ad117..2d68b9cf4f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
multifd_send_state = NULL;
}
-void multifd_send_sync_main(QEMUFile *f)
+int multifd_send_sync_main(QEMUFile *f)
{
int i;
+ bool flush_zero_copy;
if (!migrate_use_multifd()) {
- return;
+ return 0;
}
if (multifd_send_state->pages->num) {
if (multifd_send_pages(f) < 0) {
error_report("%s: multifd_send_pages fail", __func__);
- return;
+ return 0;
}
}
+
+ /*
+ * When using zero-copy, it's necessary to flush after each iteration to
+ * make sure pages from earlier iterations don't end up replacing newer
+ * pages.
+ */
+ flush_zero_copy = migrate_use_zero_copy_send();
+
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
if (p->quit) {
error_report("%s: channel %d has already quit", __func__, i);
qemu_mutex_unlock(&p->mutex);
- return;
+ return 0;
}
p->packet_num = multifd_send_state->packet_num++;
@@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
ram_counters.transferred += p->packet_len;
qemu_mutex_unlock(&p->mutex);
qemu_sem_post(&p->sem);
+
+ if (flush_zero_copy) {
+ int ret;
+ Error *err = NULL;
+
+ ret = qio_channel_flush(p->c, &err);
+ if (ret < 0) {
+ error_report_err(err);
+ return -1;
+ }
+ }
}
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
qemu_sem_wait(&p->sem_sync);
}
trace_multifd_send_sync_main(multifd_send_state->packet_num);
+
+ return 0;
}
static void *multifd_send_thread(void *opaque)
@@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
p->iov[0].iov_len = p->packet_len;
p->iov[0].iov_base = p->packet;
- ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
- &local_err);
+ ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
+ 0, p->write_flags, &local_err);
if (ret != 0) {
break;
}
@@ -910,6 +932,13 @@ int multifd_save_setup(Error **errp)
/* We need one extra place for the packet header */
p->iov = g_new0(struct iovec, page_count + 1);
p->normal = g_new0(ram_addr_t, page_count);
+
+ if (migrate_use_zero_copy_send()) {
+ p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+ } else {
+ p->write_flags = 0;
+ }
+
socket_send_channel_create(multifd_new_send_channel_async, p);
}
diff --git a/migration/ram.c b/migration/ram.c
index 91ca743ac8..d8c215e43a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2902,6 +2902,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
{
RAMState **rsp = opaque;
RAMBlock *block;
+ int ret;
if (compress_threads_save_setup()) {
return -1;
@@ -2936,7 +2937,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
ram_control_before_iterate(f, RAM_CONTROL_SETUP);
ram_control_after_iterate(f, RAM_CONTROL_SETUP);
- multifd_send_sync_main(f);
+ ret = multifd_send_sync_main(f);
+ if (ret < 0) {
+ return ret;
+ }
+
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
qemu_fflush(f);
@@ -3045,7 +3050,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
out:
if (ret >= 0
&& migration_is_setup_or_active(migrate_get_current()->state)) {
- multifd_send_sync_main(rs->f);
+ ret = multifd_send_sync_main(rs->f);
+ if (ret < 0) {
+ return ret;
+ }
+
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
qemu_fflush(f);
ram_transferred_add(8);
@@ -3105,13 +3114,19 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
ram_control_after_iterate(f, RAM_CONTROL_FINISH);
}
- if (ret >= 0) {
- multifd_send_sync_main(rs->f);
- qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
- qemu_fflush(f);
+ if (ret < 0) {
+ return ret;
}
- return ret;
+ ret = multifd_send_sync_main(rs->f);
+ if (ret < 0) {
+ return ret;
+ }
+
+ qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+ qemu_fflush(f);
+
+ return 0;
}
static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
diff --git a/migration/socket.c b/migration/socket.c
index f586f983b7..c49431425e 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -78,8 +78,9 @@ static void socket_outgoing_migration(QIOTask *task,
trace_migration_socket_outgoing_connected(data->hostname);
}
- if (migrate_use_zero_copy_send()) {
- error_setg(&err, "Zero copy send not available in migration");
+ if (migrate_use_zero_copy_send() &&
+ !qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+ error_setg(&err, "Zero copy send feature not detected in host kernel");
}
migration_channel_connect(data->s, sioc, data->hostname, err);
--
2.34.1
* Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
@ 2022-02-01 9:35 ` Daniel P. Berrangé
2022-02-01 17:25 ` Leonardo Bras Soares Passos
2022-02-07 12:49 ` Peter Xu
2022-02-18 16:36 ` Juan Quintela
2 siblings, 1 reply; 24+ messages in thread
From: Daniel P. Berrangé @ 2022-02-01 9:35 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Dr. David Alan Gilbert, Peter Xu,
Markus Armbruster, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> Add flags to io_writev and introduce io_flush as optional callback to
> QIOChannelClass, allowing the implementation of zero copy writes by
> subclasses.
>
> How to use them:
> - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> - Wait write completion with qio_channel_flush().
>
> Notes:
> As some zero copy write implementations work asynchronously, it's
> recommended to keep the write buffer untouched until the return of
> qio_channel_flush(), to avoid the risk of sending an updated buffer
> instead of the buffer state during write.
>
> As io_flush callback is optional, if a subclass does not implement it, then:
> - io_flush will return 0 without changing anything.
>
> Also, some functions like qio_channel_writev_full_all() were adapted to
> receive a flag parameter. That allows shared code between zero copy and
> non-zero copy writev, and also an easier implementation on new flags.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
> include/io/channel.h | 38 ++++++++++++++++++++-
> chardev/char-io.c | 2 +-
> hw/remote/mpqemu-link.c | 2 +-
> io/channel-buffer.c | 1 +
> io/channel-command.c | 1 +
> io/channel-file.c | 1 +
> io/channel-socket.c | 2 ++
> io/channel-tls.c | 1 +
> io/channel-websock.c | 1 +
> io/channel.c | 53 +++++++++++++++++++++++------
> migration/rdma.c | 1 +
> scsi/pr-manager-helper.c | 2 +-
> tests/unit/test-io-channel-socket.c | 1 +
> 13 files changed, 92 insertions(+), 14 deletions(-)
>
> diff --git a/io/channel.c b/io/channel.c
> index e8b019dc36..b8b99fdc4c 100644
> --- a/io/channel.c
> +++ b/io/channel.c
> @@ -72,18 +72,32 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> size_t niov,
> int *fds,
> size_t nfds,
> + int flags,
> Error **errp)
> {
> QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
>
> - if ((fds || nfds) &&
> - !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> + if (fds || nfds) {
> + if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> + error_setg_errno(errp, EINVAL,
> + "Channel does not support file descriptor passing");
> + return -1;
> + }
> + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> + error_setg_errno(errp, EINVAL,
> + "Zero Copy does not support file descriptor passing");
> + return -1;
> + }
Here you gracefully reject FD passing when zero copy is requested,
which is good.
> + }
> +
> @@ -235,10 +249,16 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
> iov, niov,
> 0, iov_size(iov, niov));
>
> + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> + assert(fds == NULL && nfds == 0);
> + }
But here you abort QEMU if FD passing is requested when zero copy
is set.
AFAICT, if you just delete this assert, the code to gracefully
report errors will do the right thing.
Without the assert:
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-01 9:35 ` Daniel P. Berrangé
@ 2022-02-01 17:25 ` Leonardo Bras Soares Passos
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-01 17:25 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Dr. David Alan Gilbert, Peter Xu,
Markus Armbruster, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Hello Daniel, thanks for reviewing!
On Tue, Feb 1, 2022 at 6:35 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy write implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> > include/io/channel.h | 38 ++++++++++++++++++++-
> > chardev/char-io.c | 2 +-
> > hw/remote/mpqemu-link.c | 2 +-
> > io/channel-buffer.c | 1 +
> > io/channel-command.c | 1 +
> > io/channel-file.c | 1 +
> > io/channel-socket.c | 2 ++
> > io/channel-tls.c | 1 +
> > io/channel-websock.c | 1 +
> > io/channel.c | 53 +++++++++++++++++++++++------
> > migration/rdma.c | 1 +
> > scsi/pr-manager-helper.c | 2 +-
> > tests/unit/test-io-channel-socket.c | 1 +
> > 13 files changed, 92 insertions(+), 14 deletions(-)
> >
> > diff --git a/io/channel.c b/io/channel.c
> > index e8b019dc36..b8b99fdc4c 100644
> > --- a/io/channel.c
> > +++ b/io/channel.c
> > @@ -72,18 +72,32 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> > size_t niov,
> > int *fds,
> > size_t nfds,
> > + int flags,
> > Error **errp)
> > {
> > QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
> >
> > - if ((fds || nfds) &&
> > - !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> > + if (fds || nfds) {
> > + if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
> > + error_setg_errno(errp, EINVAL,
> > + "Channel does not support file descriptor passing");
> > + return -1;
> > + }
> > + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > + error_setg_errno(errp, EINVAL,
> > + "Zero Copy does not support file descriptor passing");
> > + return -1;
> > + }
>
> Here you gracefully reject FD passing when zero copy is requested
> which is good.
>
> > + }
> > +
>
> > @@ -235,10 +249,16 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
> > iov, niov,
> > 0, iov_size(iov, niov));
> >
> > + if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > + assert(fds == NULL && nfds == 0);
> > + }
>
> But here you abort QEMU if FD passing is requested when zero copy
> is set.
>
> AFAICT, if you just delete this assert, the code to gracefully
> report errors will do the right thing.
Yeah, that's right. This test is unnecessary since qio_channel_writev_full()
will be called and will return an error if fds + zerocopy happens.
Good catch!
>
> Without the assert:
>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>
Thanks!
I will wait for more feedback on other patches before sending the v9,
but it should not take too long this time.
Best regards,
Leo
* Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
2022-02-01 9:35 ` Daniel P. Berrangé
@ 2022-02-07 12:49 ` Peter Xu
2022-02-07 20:50 ` Leonardo Bras Soares Passos
2022-02-18 16:36 ` Juan Quintela
2 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-02-07 12:49 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Markus Armbruster,
Daniel P. Berrangé, Dr. David Alan Gilbert, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> Add flags to io_writev and introduce io_flush as optional callback to
> QIOChannelClass, allowing the implementation of zero copy writes by
> subclasses.
>
> How to use them:
> - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> - Wait write completion with qio_channel_flush().
>
> Notes:
> As some zero copy write implementations work asynchronously, it's
> recommended to keep the write buffer untouched until the return of
> qio_channel_flush(), to avoid the risk of sending an updated buffer
> instead of the buffer state during write.
>
> As io_flush callback is optional, if a subclass does not implement it, then:
> - io_flush will return 0 without changing anything.
>
> Also, some functions like qio_channel_writev_full_all() were adapted to
> receive a flag parameter. That allows shared code between zero copy and
> non-zero copy writev, and also an easier implementation on new flags.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
With Dan's comment addressed on removing the redundant assertion:
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-07 12:49 ` Peter Xu
@ 2022-02-07 20:50 ` Leonardo Bras Soares Passos
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-07 20:50 UTC (permalink / raw)
To: Peter Xu
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Markus Armbruster,
Daniel P. Berrangé, Dr. David Alan Gilbert, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
Hello Peter,
On Mon, Feb 7, 2022 at 9:50 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy write implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
>
> With Dan's comment addressed on removing the redundant assertion:
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
Thank you for reviewing!
I think patch 5/5 is now the only one still missing a review before I
send the next version.
Could you and/or Daniel help me with that? Just to check whether I am
missing anything?
Best regards,
Leo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-01 6:29 ` [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
@ 2022-02-08 2:22 ` Peter Xu
2022-02-08 2:49 ` Leonardo Bras Soares Passos
2022-02-18 16:57 ` Juan Quintela
1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-02-08 2:22 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Markus Armbruster,
Daniel P. Berrangé, Dr. David Alan Gilbert, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> -void multifd_send_sync_main(QEMUFile *f)
> +int multifd_send_sync_main(QEMUFile *f)
> {
> int i;
> + bool flush_zero_copy;
>
> if (!migrate_use_multifd()) {
> - return;
> + return 0;
> }
> if (multifd_send_state->pages->num) {
> if (multifd_send_pages(f) < 0) {
> error_report("%s: multifd_send_pages fail", __func__);
> - return;
> + return 0;
I've not checked what it used to do when multifd_send_pages() failed, but... should
it return -1 rather than 0, now that there is a return code?
> }
> }
> +
> + /*
> + * When using zero-copy, it's necessary to flush after each iteration to
> + * make sure pages from earlier iterations don't end up replacing newer
> + * pages.
> + */
> + flush_zero_copy = migrate_use_zero_copy_send();
> +
> for (i = 0; i < migrate_multifd_channels(); i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> if (p->quit) {
> error_report("%s: channel %d has already quit", __func__, i);
> qemu_mutex_unlock(&p->mutex);
> - return;
> + return 0;
Same question here.
> }
The rest looks good. Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-08 2:22 ` Peter Xu
@ 2022-02-08 2:49 ` Leonardo Bras Soares Passos
2022-02-08 3:05 ` Peter Xu
2022-02-18 17:36 ` Juan Quintela
0 siblings, 2 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-08 2:49 UTC (permalink / raw)
To: Peter Xu
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Markus Armbruster,
Daniel P. Berrangé, Dr. David Alan Gilbert, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
Hello Peter, thanks for reviewing!
On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f)
> > {
> > int i;
> > + bool flush_zero_copy;
> >
> > if (!migrate_use_multifd()) {
> > - return;
> > + return 0;
> > }
> > if (multifd_send_state->pages->num) {
> > if (multifd_send_pages(f) < 0) {
> > error_report("%s: multifd_send_pages fail", __func__);
> > - return;
> > + return 0;
>
> I've not checked what it used to do when multifd_send_pages() failed, but... should
> it return -1 rather than 0, now that there is a return code?
Yeah, that makes sense.
The point here is that I was trying not to modify much of the current behavior.
I mean, multifd_send_sync_main() would previously return void, so any
other errors would not matter to the caller of this function, which
would continue to run as if nothing happened.
Now, if it fails with flush_zero_copy, the operation needs to be aborted.
Maybe I should make it different:
- On any error, return -1.
- Create/use a specific error code in the case of a failing
flush_zero_copy, so I can test the return value for it in the caller
function and return early.
Or alternatively, the other errors could also return early, but since
this will change how the code currently works, I would probably need
another patch for that change. (so it can be easily reverted if
needed)
What do you think is better?
> > }
> > }
> > +
> > + /*
> > + * When using zero-copy, it's necessary to flush after each iteration to
> > + * make sure pages from earlier iterations don't end up replacing newer
> > + * pages.
> > + */
> > + flush_zero_copy = migrate_use_zero_copy_send();
> > +
> > for (i = 0; i < migrate_multifd_channels(); i++) {
> > MultiFDSendParams *p = &multifd_send_state->params[i];
> >
> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> > if (p->quit) {
> > error_report("%s: channel %d has already quit", __func__, i);
> > qemu_mutex_unlock(&p->mutex);
> > - return;
> > + return 0;
>
> Same question here.
Please see above,
>
> > }
>
> The rest looks good. Thanks,
Thank you!
Best regards,
Leo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-08 2:49 ` Leonardo Bras Soares Passos
@ 2022-02-08 3:05 ` Peter Xu
2022-02-18 17:36 ` Juan Quintela
1 sibling, 0 replies; 24+ messages in thread
From: Peter Xu @ 2022-02-08 3:05 UTC (permalink / raw)
To: Leonardo Bras Soares Passos, Juan Quintela
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Markus Armbruster,
Daniel P. Berrangé, Dr. David Alan Gilbert, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
On Mon, Feb 07, 2022 at 11:49:38PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter, thanks for reviewing!
>
> On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> > > -void multifd_send_sync_main(QEMUFile *f)
> > > +int multifd_send_sync_main(QEMUFile *f)
> > > {
> > > int i;
> > > + bool flush_zero_copy;
> > >
> > > if (!migrate_use_multifd()) {
> > > - return;
> > > + return 0;
> > > }
> > > if (multifd_send_state->pages->num) {
> > > if (multifd_send_pages(f) < 0) {
> > > error_report("%s: multifd_send_pages fail", __func__);
> > > - return;
> > > + return 0;
> >
> > I've not checked what it used to do when multifd_send_pages() failed, but... should
> > it return -1 rather than 0, now that there is a return code?
>
> Yeah, that makes sense.
> The point here is that I was trying not to modify much of the current behavior.
>
> I mean, multifd_send_sync_main() would previously return void, so any
> other errors would not matter to the caller of this function, which
> would continue to run as if nothing happened.
>
> Now, if it fails with flush_zero_copy, the operation needs to be aborted.
Right, so the way I understand it, we'll fail anyway, and this allows us to fail
(probably) sooner.
>
> Maybe I should make it different:
> - On any error, return -1.
> - Create/use a specific error code in the case of a failing
> flush_zero_copy, so I can test the return value for it in the caller
> function and return early.
>
> Or alternatively, the other errors could also return early, but since
> this will change how the code currently works, I would probably need
> another patch for that change. (so it can be easily reverted if
> needed)
Yeah, adding a patch before this one should work too.
>
> What do you think is better?
I just don't see how it could continue if e.g. multifd_send_pages() failed.
The other thing is that returning zero looks weird when there's obviously an
error. Normally we could allow that, but better with a comment explaining why.
For this case it's more natural to me if we could just fail early.
Juan?
--
Peter Xu
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
2022-02-01 9:35 ` Daniel P. Berrangé
2022-02-07 12:49 ` Peter Xu
@ 2022-02-18 16:36 ` Juan Quintela
2022-02-21 16:41 ` Leonardo Bras Soares Passos
2 siblings, 1 reply; 24+ messages in thread
From: Juan Quintela @ 2022-02-18 16:36 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Leonardo Bras <leobras@redhat.com> wrote:
> Add flags to io_writev and introduce io_flush as optional callback to
> QIOChannelClass, allowing the implementation of zero copy writes by
> subclasses.
>
> How to use them:
> - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> - Wait write completion with qio_channel_flush().
>
> Notes:
> As some zero copy write implementations work asynchronously, it's
> recommended to keep the write buffer untouched until the return of
> qio_channel_flush(), to avoid the risk of sending an updated buffer
> instead of the buffer state during write.
>
> As io_flush callback is optional, if a subclass does not implement it, then:
> - io_flush will return 0 without changing anything.
>
> Also, some functions like qio_channel_writev_full_all() were adapted to
> receive a flag parameter. That allows shared code between zero copy and
> non-zero copy writev, and also an easier implementation on new flags.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
As everybody pointed out about the missing assertion...
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
2022-02-01 6:29 ` [PATCH v8 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
@ 2022-02-18 16:38 ` Juan Quintela
0 siblings, 0 replies; 24+ messages in thread
From: Juan Quintela @ 2022-02-18 16:38 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Leonardo Bras <leobras@redhat.com> wrote:
> For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> io_flush on QIOChannelSocket, but enable it only when the MSG_ZEROCOPY
> feature is available in the host kernel, which is checked in
> qio_channel_socket_connect_sync().
>
> qio_channel_socket_flush() was implemented by counting how many times
> sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> socket's error queue, in order to find how many of them finished sending.
> Flush will loop until those counters are the same, or until some error occurs.
>
> Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> 1: Buffer
> - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
> some caution is necessary to avoid overwriting any buffer before it's sent.
> If something like this happens, a newer version of the buffer may be sent instead.
> - If this is a problem, it's recommended to call qio_channel_flush() before freeing
> or re-using the buffer.
>
> 2: Locked memory
> - When using MSG_ZEROCOPY, the buffer memory will be locked after being queued, and
> unlocked after it's sent.
> - Depending on the size of each buffer, and how often it's sent, it may require
> a larger amount of locked memory than is usually available to non-root users.
> - If the required amount of locked memory is not available, writev_zero_copy
> will return an error, which can abort an operation like migration.
> - Because of this, when user code wants to add zero copy as a feature, it
> requires a mechanism to disable it, so it can still be accessible to less
> privileged users.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
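The counting scheme described above maps roughly onto a loop like the one
below (a sketch built on Linux's MSG_ZEROCOPY completion API; the
zero_copy_queued/zero_copy_sent field names are assumptions, and real code
would also validate the cmsg origin and block in qio_channel_wait() instead
of spinning):

    /* Drain MSG_ZEROCOPY completion notifications from the error queue
     * until every queued sendmsg() has been acknowledged by the kernel.
     * Needs <sys/socket.h>, <errno.h> and <linux/errqueue.h>. */
    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
        char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };

        if (recvmsg(sioc->fd, &msg, MSG_ERRQUEUE) < 0) {
            if (errno == EAGAIN || errno == EINTR) {
                continue;       /* nothing completed yet */
            }
            return -1;          /* a real socket error */
        }

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
        /* ee_info..ee_data is the inclusive range of completed sends */
        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
    }
    return 0;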
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 3/5] migration: Add zero-copy-send parameter for QMP/HMP for Linux
2022-02-01 6:29 ` [PATCH v8 3/5] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
@ 2022-02-18 16:39 ` Juan Quintela
0 siblings, 0 replies; 24+ messages in thread
From: Juan Quintela @ 2022-02-18 16:39 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Leonardo Bras <leobras@redhat.com> wrote:
> Add a property that allows zero-copy migration of memory pages
> on the sending side, and also include a helper function
> migrate_use_zero_copy_send() to check if it's enabled.
>
> No code is introduced to actually do the migration, but it allows
> future implementations to enable/disable this feature.
>
> On non-Linux builds this parameter is compiled out.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
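As a usage note, the property would be set before the migration starts,
e.g. via QMP (a sketch; the member name follows this patch's zero-copy-send):

    { "execute": "migrate-set-parameters",
      "arguments": { "zero-copy-send": true } }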
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-01 6:29 ` [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
2022-02-08 2:22 ` Peter Xu
@ 2022-02-18 16:57 ` Juan Quintela
2022-02-21 19:41 ` Leonardo Bras Soares Passos
2022-03-01 3:57 ` Peter Xu
1 sibling, 2 replies; 24+ messages in thread
From: Juan Quintela @ 2022-02-18 16:57 UTC (permalink / raw)
To: Leonardo Bras
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Leonardo Bras <leobras@redhat.com> wrote:
> Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> writev + flags & flush interface.
>
> Change multifd_send_sync_main() so flush_zero_copy() can be called
> after each iteration in order to make sure all dirty pages are sent before
> a new iteration is started. It will also flush at the beginning and at the
> end of migration.
>
> Also make it return -1 if flush_zero_copy() fails, in order to cancel
> the migration process, and avoid resuming the guest in the target host
> without receiving all current RAM.
>
> This will work fine on RAM migration because the RAM pages are not usually freed,
> and there is no problem with changing the page content between writev_zero_copy() and
> the actual sending of the buffer, because this change will dirty the page and
> cause it to be re-sent in a later iteration anyway.
>
> A lot of locked memory may be needed in order to use multid migration
^^^^^^
multifd.
I can fix it on the commit.
> @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> return false;
> }
> -
> +#ifdef CONFIG_LINUX
> + if (params->zero_copy_send &&
> + (!migrate_use_multifd() ||
> + params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> + (params->tls_creds && *params->tls_creds))) {
> + error_setg(errp,
> + "Zero copy only available for non-compressed non-TLS multifd migration");
> + return false;
> + }
> +#endif
> return true;
> }
Test is long, but it is exactly what we need. Good.
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 43998ad117..2d68b9cf4f 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> multifd_send_state = NULL;
> }
>
> -void multifd_send_sync_main(QEMUFile *f)
> +int multifd_send_sync_main(QEMUFile *f)
> {
> int i;
> + bool flush_zero_copy;
>
> if (!migrate_use_multifd()) {
> - return;
> + return 0;
> }
> if (multifd_send_state->pages->num) {
> if (multifd_send_pages(f) < 0) {
> error_report("%s: multifd_send_pages fail", __func__);
> - return;
> + return 0;
> }
> }
> +
> + /*
> + * When using zero-copy, it's necessary to flush after each iteration to
> + * make sure pages from earlier iterations don't end up replacing newer
> + * pages.
> + */
> + flush_zero_copy = migrate_use_zero_copy_send();
> +
> for (i = 0; i < migrate_multifd_channels(); i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
>
> @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> if (p->quit) {
> error_report("%s: channel %d has already quit", __func__, i);
> qemu_mutex_unlock(&p->mutex);
> - return;
> + return 0;
> }
>
> p->packet_num = multifd_send_state->packet_num++;
> @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> ram_counters.transferred += p->packet_len;
> qemu_mutex_unlock(&p->mutex);
> qemu_sem_post(&p->sem);
> +
> + if (flush_zero_copy) {
> + int ret;
> + Error *err = NULL;
> +
> + ret = qio_channel_flush(p->c, &err);
> + if (ret < 0) {
> + error_report_err(err);
> + return -1;
> + }
> + }
> }
> for (i = 0; i < migrate_multifd_channels(); i++) {
> MultiFDSendParams *p = &multifd_send_state->params[i];
> @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
> qemu_sem_wait(&p->sem_sync);
> }
> trace_multifd_send_sync_main(multifd_send_state->packet_num);
> +
> + return 0;
> }
We are leaving pages in flight for potentially a lot of time. I *think*
that we can sync shorter than that.
> static void *multifd_send_thread(void *opaque)
> @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
> p->iov[0].iov_len = p->packet_len;
> p->iov[0].iov_base = p->packet;
>
> - ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> - &local_err);
> + ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> + 0, p->write_flags, &local_err);
> if (ret != 0) {
> break;
> }
I still think that it would be better to have a sync here in each
thread. I don't know the size, but once every couple megabytes of RAM
or so.
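Something along these lines inside the send loop, say (a sketch only; the
threshold and the bytes_since_flush counter are invented for illustration):

    /* Periodically flush pending zero-copy sends, per channel. */
    #define ZC_FLUSH_EVERY (4 * 1024 * 1024)    /* example threshold */

    bytes_since_flush += p->packet_len;         /* real code would count pages too */
    if (migrate_use_zero_copy_send() && bytes_since_flush >= ZC_FLUSH_EVERY) {
        if (qio_channel_flush(p->c, &local_err) < 0) {
            error_report_err(local_err);
            break;                              /* give up on this channel */
        }
        bytes_since_flush = 0;
    }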
I did a change on:
commit d48c3a044537689866fe44e65d24c7d39a68868a
Author: Juan Quintela <quintela@redhat.com>
Date: Fri Nov 19 15:35:58 2021 +0100
multifd: Use a single writev on the send side
Until now, we wrote the packet header with write(), and the rest of the
pages with writev(). Just increase the size of the iovec and do a
single writev().
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
And now we need to "preserve" this header until we do the sync,
otherwise we are overwriting it with other things.
What testing have you done after this commit?
Notice that it is not _complicated_ to fix; I will try to come up with
some idea on Monday, but basically it is having an array of buffers for
each thread, and storing them until we call a sync().
Later, Juan.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-08 2:49 ` Leonardo Bras Soares Passos
2022-02-08 3:05 ` Peter Xu
@ 2022-02-18 17:36 ` Juan Quintela
2022-02-21 19:47 ` Leonardo Bras Soares Passos
1 sibling, 1 reply; 24+ messages in thread
From: Juan Quintela @ 2022-02-18 17:36 UTC (permalink / raw)
To: Leonardo Bras Soares Passos
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Leonardo Bras Soares Passos <leobras@redhat.com> wrote:
> Hello Peter, thanks for reviewing!
>
> On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
>> > -void multifd_send_sync_main(QEMUFile *f)
>> > +int multifd_send_sync_main(QEMUFile *f)
>> > {
>> > int i;
>> > + bool flush_zero_copy;
>> >
>> > if (!migrate_use_multifd()) {
>> > - return;
>> > + return 0;
>> > }
>> > if (multifd_send_state->pages->num) {
>> > if (multifd_send_pages(f) < 0) {
>> > error_report("%s: multifd_send_pages fail", __func__);
>> > - return;
>> > + return 0;
>>
>> I've not checked what it used to do when multifd_send_pages() failed, but... should
>> it return -1 rather than 0, now that there is a return code?
>
> Yeah, that makes sense.
> The point here is that I was trying not to modify much of the current behavior.
if (qatomic_read(&multifd_send_state->exiting)) {
return -1;
}
if (p->quit) {
error_report("%s: channel %d has already quit!", __func__, i);
qemu_mutex_unlock(&p->mutex);
return -1;
}
These are the only two cases where the current code can return an
error. In the first case we are exiting; we are already in the middle of
finishing, so we don't really care.
In the second one, we have already quit, and the error is already quite big.
But I agree with both comments:
- we need to improve the error paths
- Leonardo's changes don't affect what is already there.
> I mean, multifd_send_sync_main() would previously return void, so any
> other errors would not matter to the caller of this function, which
> would continue to run as if nothing happened.
>
> Now, if it fails with flush_zero_copy, the operation needs to be aborted.
>
> Maybe I should make it different:
> - On any error, return -1.
> - Create/use a specific error code in the case of a failing
> flush_zero_copy, so I can test the return value for it in the caller
> function and return early.
We need to add the check. It doesn't matter whether the problem is zero_copy
or an existing one; we are under a minor catastrophe and migration has
to be aborted.
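i.e. every caller of multifd_send_sync_main() would grow a check along
these lines (a sketch):

    if (multifd_send_sync_main(f) < 0) {
        return -1;      /* abort instead of carrying on after a failed sync */
    }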
> Or alternatively, the other errors could also return early, but since
> this will change how the code currently works, I would probably need
> another patch for that change. (so it can be easily reverted if
> needed)
>
> What do you think is better?
>
>
>> > }
>> > }
>> > +
>> > + /*
>> > + * When using zero-copy, it's necessary to flush after each iteration to
>> > + * make sure pages from earlier iterations don't end up replacing newer
>> > + * pages.
>> > + */
>> > + flush_zero_copy = migrate_use_zero_copy_send();
>> > +
>> > for (i = 0; i < migrate_multifd_channels(); i++) {
>> > MultiFDSendParams *p = &multifd_send_state->params[i];
>> >
>> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
>> > if (p->quit) {
>> > error_report("%s: channel %d has already quit", __func__, i);
>> > qemu_mutex_unlock(&p->mutex);
>> > - return;
>> > + return 0;
>>
>> Same question here.
>
> Please see above,
>
>>
>> > }
>>
>> The rest looks good. Thanks,
Later, Juan.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback
2022-02-18 16:36 ` Juan Quintela
@ 2022-02-21 16:41 ` Leonardo Bras Soares Passos
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-21 16:41 UTC (permalink / raw)
To: Juan Quintela
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Thanks for reviewing, Juan!
On Fri, Feb 18, 2022 at 1:36 PM Juan Quintela <quintela@redhat.com> wrote:
>
> Leonardo Bras <leobras@redhat.com> wrote:
> > Add flags to io_writev and introduce io_flush as optional callback to
> > QIOChannelClass, allowing the implementation of zero copy writes by
> > subclasses.
> >
> > How to use them:
> > - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> > - Wait write completion with qio_channel_flush().
> >
> > Notes:
> > As some zero copy write implementations work asynchronously, it's
> > recommended to keep the write buffer untouched until the return of
> > qio_channel_flush(), to avoid the risk of sending an updated buffer
> > instead of the buffer state during write.
> >
> > As io_flush callback is optional, if a subclass does not implement it, then:
> > - io_flush will return 0 without changing anything.
> >
> > Also, some functions like qio_channel_writev_full_all() were adapted to
> > receive a flag parameter. That allows shared code between zero copy and
> > non-zero copy writev, and also an easier implementation on new flags.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
>
> As everybody pointed out about the missing assertion...
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-18 16:57 ` Juan Quintela
@ 2022-02-21 19:41 ` Leonardo Bras Soares Passos
2022-02-22 4:09 ` Leonardo Bras Soares Passos
2022-03-01 3:57 ` Peter Xu
1 sibling, 1 reply; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-21 19:41 UTC (permalink / raw)
To: Juan Quintela
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
Hello Juan, thanks for the feedback!
On Fri, Feb 18, 2022 at 1:57 PM Juan Quintela <quintela@redhat.com> wrote:
>
> Leonardo Bras <leobras@redhat.com> wrote:
> > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > writev + flags & flush interface.
> >
> > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > after each iteration in order to make sure all dirty pages are sent before
> > a new iteration is started. It will also flush at the beginning and at the
> > end of migration.
> >
> > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > the migration process, and avoid resuming the guest in the target host
> > without receiving all current RAM.
> >
> > This will work fine on RAM migration because the RAM pages are not usually freed,
> > and there is no problem with changing the page content between writev_zero_copy() and
> > the actual sending of the buffer, because this change will dirty the page and
> > cause it to be re-sent in a later iteration anyway.
> >
> > A lot of locked memory may be needed in order to use multid migration
> ^^^^^^
> multifd.
>
> I can fix it on the commit.
No worries, fixed for v9.
>
>
> > @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> > error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> > return false;
> > }
> > -
> > +#ifdef CONFIG_LINUX
> > + if (params->zero_copy_send &&
> > + (!migrate_use_multifd() ||
> > + params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> > + (params->tls_creds && *params->tls_creds))) {
> > + error_setg(errp,
> > + "Zero copy only available for non-compressed non-TLS multifd migration");
> > + return false;
> > + }
> > +#endif
> > return true;
> > }
>
> Test is long, but it is exactly what we need. Good.
Thanks!
>
>
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 43998ad117..2d68b9cf4f 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> > multifd_send_state = NULL;
> > }
> >
> > -void multifd_send_sync_main(QEMUFile *f)
> > +int multifd_send_sync_main(QEMUFile *f)
> > {
> > int i;
> > + bool flush_zero_copy;
> >
> > if (!migrate_use_multifd()) {
> > - return;
> > + return 0;
> > }
> > if (multifd_send_state->pages->num) {
> > if (multifd_send_pages(f) < 0) {
> > error_report("%s: multifd_send_pages fail", __func__);
> > - return;
> > + return 0;
> > }
> > }
> > +
> > + /*
> > + * When using zero-copy, it's necessary to flush after each iteration to
> > + * make sure pages from earlier iterations don't end up replacing newer
> > + * pages.
> > + */
> > + flush_zero_copy = migrate_use_zero_copy_send();
> > +
> > for (i = 0; i < migrate_multifd_channels(); i++) {
> > MultiFDSendParams *p = &multifd_send_state->params[i];
> >
> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> > if (p->quit) {
> > error_report("%s: channel %d has already quit", __func__, i);
> > qemu_mutex_unlock(&p->mutex);
> > - return;
> > + return 0;
> > }
> >
> > p->packet_num = multifd_send_state->packet_num++;
> > @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> > ram_counters.transferred += p->packet_len;
> > qemu_mutex_unlock(&p->mutex);
> > qemu_sem_post(&p->sem);
> > +
> > + if (flush_zero_copy) {
> > + int ret;
> > + Error *err = NULL;
> > +
> > + ret = qio_channel_flush(p->c, &err);
> > + if (ret < 0) {
> > + error_report_err(err);
> > + return -1;
> > + }
> > + }
> > }
> > for (i = 0; i < migrate_multifd_channels(); i++) {
> > MultiFDSendParams *p = &multifd_send_state->params[i];
> > @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
> > qemu_sem_wait(&p->sem_sync);
> > }
> > trace_multifd_send_sync_main(multifd_send_state->packet_num);
> > +
> > + return 0;
> > }
>
> We are leaving pages in flight for potentially a lot of time. I *think*
> that we can sync shorter than that.
>
> > static void *multifd_send_thread(void *opaque)
> > @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
> > p->iov[0].iov_len = p->packet_len;
> > p->iov[0].iov_base = p->packet;
> >
> > - ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > - &local_err);
> > + ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> > + 0, p->write_flags, &local_err);
> > if (ret != 0) {
> > break;
> > }
>
> I still think that it would be better to have a sync here in each
> thread. I don't know the size, but once every couple megabytes of RAM
> or so.
This seems like a good idea, since the first iteration may take a while,
and it may take a long time to fail if something goes wrong with
zerocopy at the start of iteration 1.
On the other hand, flushing takes some time, and doing it a lot may
take away some of the performance improvements.
Maybe we could move the flushing to a thread of its own, if it
becomes a problem.
>
> I did a change on:
>
> commit d48c3a044537689866fe44e65d24c7d39a68868a
> Author: Juan Quintela <quintela@redhat.com>
> Date: Fri Nov 19 15:35:58 2021 +0100
>
> multifd: Use a single writev on the send side
>
> Until now, we wrote the packet header with write(), and the rest of the
> pages with writev(). Just increase the size of the iovec and do a
> single writev().
>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
> And now we need to "preserve" this header until we do the sync,
> otherwise we are overwriting it with other things.
Yeah, that is a problem I faced on non-multifd migration, and a reason
why I chose to implement it directly in multifd.
>
> What testing have you done after this commit?
Not much, but this will probably become an issue with bigger guests.
>
> Notice that it is not _complicated_ to fix; I will try to come up with
> some idea on Monday, but basically it is having an array of buffers for
> each thread, and storing them until we call a sync().
That will probably work, we can use MultiFDRecvParams->packet as this
array, right?
We can start with a somewhat small buffer size and grow if it ever gets full.
(Full means a sync + g_try_malloc0 + g_free)
By my calculations, it should take 1.8kB per element on a 4k PAGESIZE.
What do you think?
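Roughly this shape, I imagine (a sketch; the size and all names here are
made up for illustration):

    /* Per-thread array of packet headers that stay allocated until the
     * next flush, so queued zero-copy sends never see a reused header. */
    #define HDR_SLOTS 64

    typedef struct {
        MultiFDPacket_t *hdr[HDR_SLOTS];    /* headers still queued in-kernel */
        unsigned used;                      /* flush and recycle when full */
    } ZeroCopyHdrRing;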
Best regards,
Leo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-18 17:36 ` Juan Quintela
@ 2022-02-21 19:47 ` Leonardo Bras Soares Passos
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-21 19:47 UTC (permalink / raw)
To: Juan Quintela
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
On Fri, Feb 18, 2022 at 2:36 PM Juan Quintela <quintela@redhat.com> wrote:
>
> Leonardo Bras Soares Passos <leobras@redhat.com> wrote:
> > Hello Peter, thanks for reviewing!
> >
> > On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> >> > -void multifd_send_sync_main(QEMUFile *f)
> >> > +int multifd_send_sync_main(QEMUFile *f)
> >> > {
> >> > int i;
> >> > + bool flush_zero_copy;
> >> >
> >> > if (!migrate_use_multifd()) {
> >> > - return;
> >> > + return 0;
> >> > }
> >> > if (multifd_send_state->pages->num) {
> >> > if (multifd_send_pages(f) < 0) {
> >> > error_report("%s: multifd_send_pages fail", __func__);
> >> > - return;
> >> > + return 0;
> >>
> >> I've not checked what it used to do when multifd_send_pages() failed, but... should
> >> it return -1 rather than 0, now that there is a return code?
> >
> > Yeah, that makes sense.
> > The point here is that I was trying not to modify much of the current behavior.
>
> if (qatomic_read(&multifd_send_state->exiting)) {
> return -1;
> }
>
> if (p->quit) {
> error_report("%s: channel %d has already quit!", __func__, i);
> qemu_mutex_unlock(&p->mutex);
> return -1;
> }
>
> These are the only two cases where the current code can return an
> error. In the first case we are exiting; we are already in the middle of
> finishing, so we don't really care.
> In the second one, we have already quit, and the error is already quite big.
>
> But I agree with both comments:
> - we need to improve the error paths
> - Leonardo's changes don't affect what is already there.
>
>
> > I mean, multifd_send_sync_main() would previously return void, so any
> > other errors would not matter to the caller of this function, which
> > would continue to run as if nothing happened.
> >
> > Now, if it fails with flush_zero_copy, the operation needs to be aborted.
> >
> > Maybe I should make it different:
> > - On any error, return -1.
> > - Create/use a specific error code in the case of a failing
> > flush_zero_copy, so I can test the return value for it in the caller
> > function and return early.
>
> We need to add the check. It doesn't matter whether the problem is zero_copy
> or an existing one; we are under a minor catastrophe and migration has
> to be aborted.
Ok, I will fix that so we can abort in case of any error.
Maybe it's better to do that in a separate patch, before 5/5, right?
>
> > Or alternatively, the other errors could also return early, but since
> > this will change how the code currently works, I would probably need
> > another patch for that change. (so it can be easily reverted if
> > needed)
> >
> > What do you think is better?
> >
> >
> >> > }
> >> > }
> >> > +
> >> > + /*
> >> > + * When using zero-copy, it's necessary to flush after each iteration to
> >> > + * make sure pages from earlier iterations don't end up replacing newer
> >> > + * pages.
> >> > + */
> >> > + flush_zero_copy = migrate_use_zero_copy_send();
> >> > +
> >> > for (i = 0; i < migrate_multifd_channels(); i++) {
> >> > MultiFDSendParams *p = &multifd_send_state->params[i];
> >> >
> >> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> >> > if (p->quit) {
> >> > error_report("%s: channel %d has already quit", __func__, i);
> >> > qemu_mutex_unlock(&p->mutex);
> >> > - return;
> >> > + return 0;
> >>
> >> Same question here.
> >
> > Please see above,
> >
> >>
> >> > }
> >>
> >> The rest looks good. Thanks,
>
> Later, Juan.
>
Thanks for the feedback!
Best regards,
Leo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-21 19:41 ` Leonardo Bras Soares Passos
@ 2022-02-22 4:09 ` Leonardo Bras Soares Passos
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-02-22 4:09 UTC (permalink / raw)
To: Juan Quintela
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé, Peter Xu,
Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
Fam Zheng, Eric Blake
On Mon, Feb 21, 2022 at 4:41 PM Leonardo Bras Soares Passos
<leobras@redhat.com> wrote:
>
> Hello Juan, thanks for the feedback!
>
> On Fri, Feb 18, 2022 at 1:57 PM Juan Quintela <quintela@redhat.com> wrote:
> >
> > Leonardo Bras <leobras@redhat.com> wrote:
> > > Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> > > writev + flags & flush interface.
> > >
> > > Change multifd_send_sync_main() so flush_zero_copy() can be called
> > > after each iteration in order to make sure all dirty pages are sent before
> > > a new iteration is started. It will also flush at the beginning and at the
> > > end of migration.
> > >
> > > Also make it return -1 if flush_zero_copy() fails, in order to cancel
> > > the migration process, and avoid resuming the guest in the target host
> > > without receiving all current RAM.
> > >
> > > This will work fine on RAM migration because the RAM pages are not usually freed,
> > > and there is no problem with changing the page content between writev_zero_copy() and
> > > the actual sending of the buffer, because this change will dirty the page and
> > > cause it to be re-sent in a later iteration anyway.
> > >
> > > A lot of locked memory may be needed in order to use multid migration
> > ^^^^^^
> > multifd.
> >
> > I can fix it on the commit.
>
> No worries, fixed for v9.
>
> >
> >
> > > @@ -1479,7 +1479,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
> > > error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
> > > return false;
> > > }
> > > -
> > > +#ifdef CONFIG_LINUX
> > > + if (params->zero_copy_send &&
> > > + (!migrate_use_multifd() ||
> > > + params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> > > + (params->tls_creds && *params->tls_creds))) {
> > > + error_setg(errp,
> > > + "Zero copy only available for non-compressed non-TLS multifd migration");
> > > + return false;
> > > + }
> > > +#endif
> > > return true;
> > > }
> >
> > Test is long, but it is exactly what we need. Good.
>
> Thanks!
>
>
> >
> >
> > >
> > > diff --git a/migration/multifd.c b/migration/multifd.c
> > > index 43998ad117..2d68b9cf4f 100644
> > > --- a/migration/multifd.c
> > > +++ b/migration/multifd.c
> > > @@ -568,19 +568,28 @@ void multifd_save_cleanup(void)
> > > multifd_send_state = NULL;
> > > }
> > >
> > > -void multifd_send_sync_main(QEMUFile *f)
> > > +int multifd_send_sync_main(QEMUFile *f)
> > > {
> > > int i;
> > > + bool flush_zero_copy;
> > >
> > > if (!migrate_use_multifd()) {
> > > - return;
> > > + return 0;
> > > }
> > > if (multifd_send_state->pages->num) {
> > > if (multifd_send_pages(f) < 0) {
> > > error_report("%s: multifd_send_pages fail", __func__);
> > > - return;
> > > + return 0;
> > > }
> > > }
> > > +
> > > + /*
> > > + * When using zero-copy, it's necessary to flush after each iteration to
> > > + * make sure pages from earlier iterations don't end up replacing newer
> > > + * pages.
> > > + */
> > > + flush_zero_copy = migrate_use_zero_copy_send();
> > > +
> > > for (i = 0; i < migrate_multifd_channels(); i++) {
> > > MultiFDSendParams *p = &multifd_send_state->params[i];
> > >
> > > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
> > > if (p->quit) {
> > > error_report("%s: channel %d has already quit", __func__, i);
> > > qemu_mutex_unlock(&p->mutex);
> > > - return;
> > > + return 0;
> > > }
> > >
> > > p->packet_num = multifd_send_state->packet_num++;
> > > @@ -602,6 +611,17 @@ void multifd_send_sync_main(QEMUFile *f)
> > > ram_counters.transferred += p->packet_len;
> > > qemu_mutex_unlock(&p->mutex);
> > > qemu_sem_post(&p->sem);
> > > +
> > > + if (flush_zero_copy) {
> > > + int ret;
> > > + Error *err = NULL;
> > > +
> > > + ret = qio_channel_flush(p->c, &err);
> > > + if (ret < 0) {
> > > + error_report_err(err);
> > > + return -1;
> > > + }
> > > + }
> > > }
> > > for (i = 0; i < migrate_multifd_channels(); i++) {
> > > MultiFDSendParams *p = &multifd_send_state->params[i];
> > > @@ -610,6 +630,8 @@ void multifd_send_sync_main(QEMUFile *f)
> > > qemu_sem_wait(&p->sem_sync);
> > > }
> > > trace_multifd_send_sync_main(multifd_send_state->packet_num);
> > > +
> > > + return 0;
> > > }
> >
> > We are leaving pages in flight for potentially a lot of time. I *think*
> > that we can sync shorter than that.
> >
> > > static void *multifd_send_thread(void *opaque)
> > > @@ -668,8 +690,8 @@ static void *multifd_send_thread(void *opaque)
> > > p->iov[0].iov_len = p->packet_len;
> > > p->iov[0].iov_base = p->packet;
> > >
> > > - ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
> > > - &local_err);
> > > + ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> > > + 0, p->write_flags, &local_err);
> > > if (ret != 0) {
> > > break;
> > > }
> >
> > I still think that it would be better to have a sync here in each
> > thread. I don't know the size, but once every couple megabytes of RAM
> > or so.
>
> This seems like a good idea, since the first iteration may take a while,
> and it may take a long time to fail if something goes wrong with
> zerocopy at the start of iteration 1.
>
> On the other hand, flushing takes some time, and doing it a lot may
> take away some of the performance improvements.
>
> Maybe we could move the flushing to a thread of its own, if it
> becomes a problem.
>
>
> >
> > I did a change on:
> >
> > commit d48c3a044537689866fe44e65d24c7d39a68868a
> > Author: Juan Quintela <quintela@redhat.com>
> > Date: Fri Nov 19 15:35:58 2021 +0100
> >
> > multifd: Use a single writev on the send side
> >
> > Until now, we wrote the packet header with write(), and the rest of the
> > pages with writev(). Just increase the size of the iovec and do a
> > single writev().
> >
> > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >
> > And now we need to "preserve" this header until we do the sync,
> > otherwise we are overwriting it with other things.
>
> Yeah, that is a problem I faced on non-multifd migration, and a reason
> why I chose to implement it directly in multifd.
>
> >
> > What testing have you done after this commit?
>
> Not much, but this will probably become an issue with bigger guests.
>
> >
> > Notice that it is not _complicated_ to fix; I will try to come up with
> > some idea on Monday, but basically it is having an array of buffers for
> > each thread, and storing them until we call a sync().
>
> That will probably work, we can use MultiFDRecvParams->packet as this
> array, right?
> We can start with a somewhat small buffer size and grow if it ever gets full.
> (Full means a sync + g_try_malloc0 + g_free)
Or maybe use the array size as a size limiter before flushing, so it
will always flush after BUFFSIZE headers are sent.
>
> By my calculations, it should take 1.8kB per element on a 4k PAGESIZE.
>
> What do you think?
>
> Best regards,
> Leo
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-02-18 16:57 ` Juan Quintela
2022-02-21 19:41 ` Leonardo Bras Soares Passos
@ 2022-03-01 3:57 ` Peter Xu
2022-03-07 14:20 ` Leonardo Bras Soares Passos
1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-03-01 3:57 UTC (permalink / raw)
To: Juan Quintela
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
qemu-devel, Markus Armbruster, Daniel P. Berrangé,
Dr. David Alan Gilbert, Leonardo Bras, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
On Fri, Feb 18, 2022 at 05:57:13PM +0100, Juan Quintela wrote:
> I did a change on:
>
> commit d48c3a044537689866fe44e65d24c7d39a68868a
> Author: Juan Quintela <quintela@redhat.com>
> Date: Fri Nov 19 15:35:58 2021 +0100
>
> multifd: Use a single writev on the send side
>
> Until now, we wrote the packet header with write(), and the rest of the
> pages with writev(). Just increase the size of the iovec and do a
> single writev().
>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>
> And now we need to "preserve" this header until we do the sync,
> otherwise we are overwriting it with other things.
>
> What testing have you done after this commit?
>
> Notice that it is not _complicated_ to fix; I will try to come up with
> some idea on Monday, but basically it is having an array of buffers for
> each thread, and storing them until we call a sync().
Or can we conditionally merge the two write()s? IMHO the array of buffers
idea sounds too complicated, and I'm not extremely sure whether it'll pay
off in the end. We could keep the two write()s with ZEROCOPY enabled, and use
the merged version otherwise.
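Concretely, something like the sketch below (it assumes iov[0] is the packet
header, as in that commit, and that the header must not be zero-copied
because it is reused per packet):

    if (migrate_use_zero_copy_send()) {
        /* copying write for the reused header, zero copy for the pages */
        ret = qio_channel_writev_all(p->c, &p->iov[0], 1, &local_err);
        if (ret == 0) {
            ret = qio_channel_writev_full_all(p->c, &p->iov[1],
                                              p->iovs_num - 1, NULL, 0,
                                              p->write_flags, &local_err);
        }
    } else {
        ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
                                          NULL, 0, p->write_flags, &local_err);
    }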
Btw, are there any performance measurements for the above commit d48c3a044537?
I had a feeling that the single write() may not help that much, because for
multifd the bottleneck should be on the NIC, not on the processor.
IOW, we could find that the major time used does not fall into the
user<->kernel switches (which is where the extra overhead of the write()
syscall lies, IIUC), but that we simply blocked on one of the write()s because the
socket write buffer is full... So we could have saved some CPU cycles by
merging the calls, but performance-wise we may not gain much.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
2022-03-01 3:57 ` Peter Xu
@ 2022-03-07 14:20 ` Leonardo Bras Soares Passos
0 siblings, 0 replies; 24+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-03-07 14:20 UTC (permalink / raw)
To: Peter Xu
Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
Juan Quintela, qemu-devel, Markus Armbruster,
Daniel P. Berrangé, Dr. David Alan Gilbert, Paolo Bonzini,
Marc-André Lureau, Fam Zheng, Eric Blake
On Tue, Mar 1, 2022 at 12:57 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Feb 18, 2022 at 05:57:13PM +0100, Juan Quintela wrote:
> > I did a change on:
> >
> > commit d48c3a044537689866fe44e65d24c7d39a68868a
> > Author: Juan Quintela <quintela@redhat.com>
> > Date: Fri Nov 19 15:35:58 2021 +0100
> >
> > multifd: Use a single writev on the send side
> >
> > Until now, we wrote the packet header with write(), and the rest of the
> > pages with writev(). Just increase the size of the iovec and do a
> > single writev().
> >
> > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >
> > And now we need to "preserve" this header until we do the sync,
> > otherwise we are overwriting it with other things.
> >
> > What testing have you done after this commit?
> >
> > Notice that it is not _complicated_ to fix; I will try to come up with
> > some idea on Monday, but basically it is having an array of buffers for
> > each thread, and storing them until we call a sync().
>
> Or can we conditionally merge the two write()s? IMHO the array of buffers
> idea sounds too complicated, and I'm not extremely sure whether it'll pay
> off in the end. We could keep the two write()s with ZEROCOPY enabled, and use
> the merged version otherwise.
I think that's a great idea!
It would optimize the non-zerocopy version while letting us have a
simpler zerocopy implementation.
The array of buffers implementation would either require us to keep a
'large' amount of memory for the headers, or have flushes happening
too often.
>
> Btw, are there any performance measurements for the above commit d48c3a044537?
> I had a feeling that the single write() may not help that much, because for
> multifd the bottleneck should be on the NIC, not on the processor.
I am quite curious about those numbers too.
>
> IOW, we could find that the major time used does not fall into the
> user<->kernel switches (which is where the extra overhead of the write()
> syscall lies, IIUC), but that we simply blocked on one of the write()s because the
> socket write buffer is full... So we could have saved some CPU cycles by
> merging the calls, but performance-wise we may not gain much.
>
> Thanks,
>
> --
> Peter Xu
>
Thanks Peter!
^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread
Thread overview: 24+ messages
2022-02-01 6:28 [PATCH v8 0/5] MSG_ZEROCOPY + multifd Leonardo Bras
2022-02-01 6:28 ` [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
2022-02-01 9:35 ` Daniel P. Berrangé
2022-02-01 17:25 ` Leonardo Bras Soares Passos
2022-02-07 12:49 ` Peter Xu
2022-02-07 20:50 ` Leonardo Bras Soares Passos
2022-02-18 16:36 ` Juan Quintela
2022-02-21 16:41 ` Leonardo Bras Soares Passos
2022-02-01 6:29 ` [PATCH v8 2/5] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
2022-02-18 16:38 ` Juan Quintela
2022-02-01 6:29 ` [PATCH v8 3/5] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
2022-02-18 16:39 ` Juan Quintela
2022-02-01 6:29 ` [PATCH v8 4/5] migration: Add migrate_use_tls() helper Leonardo Bras
2022-02-01 6:29 ` [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
2022-02-08 2:22 ` Peter Xu
2022-02-08 2:49 ` Leonardo Bras Soares Passos
2022-02-08 3:05 ` Peter Xu
2022-02-18 17:36 ` Juan Quintela
2022-02-21 19:47 ` Leonardo Bras Soares Passos
2022-02-18 16:57 ` Juan Quintela
2022-02-21 19:41 ` Leonardo Bras Soares Passos
2022-02-22 4:09 ` Leonardo Bras Soares Passos
2022-03-01 3:57 ` Peter Xu
2022-03-07 14:20 ` Leonardo Bras Soares Passos