qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 00/18] migration/mapped-ram: Add direct-io support
@ 2024-05-23 19:05 Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
                   ` (17 more replies)
  0 siblings, 18 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

Thank you all for the comments in the previous version. I believe we
managed to remove much of the complexity of the fdset handling.

Major changes in this v2 are:

- The rework of fdset to be less weird when removing fds. Now we
  always remove after qmp_remove_fd() and never remove after the
  monitor closes. Also removed the ->removed flag.

- Properly check for direct-io during params_check. I'm not going with
  the meson.build approach because that's kind of novel for migration
  paramters and I don't want to make direct-io a special case. I also
  couldn't figure out a way of selecting on CONFIG_O_DIRECT that would
  not end up removing the direct_io fields from everywhere, forcing us
  to check the CONFIG all over the place.

New in this v2 are two options to make the usage of the feature more
uniform and require less special cases in the management layer:

- Precopy direct-io. When not using multifd, enable and disable
  O_DIRECT around the parts of the code known to be aligned.

- Implemented direct-io for the incoming side.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1302882857

v1:
https://lore.kernel.org/r/20240426142042.14573-1-farosas@suse.de

Hi everyone, here's the rest of the migration "mapped-ram" feature
that didn't get merged for 9.0. This series adds support for direct
I/O, the missing piece to get the desired performance improvements.

There's 3 parts to this:

1- The plumbing for the new "direct-io" migration parameter. With this
   we can already use direct-io with the file transport + multifd +
   mapped-ram. Patches 1-3.

Due to the alignment requirements of O_DIRECT and the fact that
multifd runs the channels in parallel with the migration thread, we
must open the migration file two times, one with O_DIRECT set and
another with it clear.

If the user is not passing in a file name which QEMU can open at will,
we must then require that the user pass the two file descriptors with
the flags already properly set. We'll use the already existing fdset +
QMP add-fd infrastructure for this.

2- Changes to the fdset infrastructure to support O_DIRECT. We need
   those to be able to select from the user-provided fdset the file
   descriptor that contains the O_DIRECT flag. Patches 4-5.

3- Some fdset validation to make sure the two-fds requirement is being
   met. Patches 6-7.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1269352083

Fabiano Rosas (17):
  migration: Fix file migration with fdset
  tests/qtest/migration: Fix file migration offset check
  tests/qtest/migration: Add a precopy file test with fdset
  monitor: Introduce monitor_fdset_*free
  monitor: Stop removing non-duplicated fds
  monitor: Simplify fdset and fd removal
  monitor: Report errors from monitor_fdset_dup_fd_add
  io: Stop using qemu_open_old in channel-file
  migration: Add direct-io parameter
  migration/multifd: Add direct-io support
  tests/qtest/migration: Add tests for file migration with direct-io
  monitor: fdset: Match against O_DIRECT
  migration: Add documentation for fdset with multifd + file
  tests/qtest/migration: Add a test for mapped-ram with passing of fds
  io/channel-file: Add direct-io support
  migration: Add direct-io helpers
  migration/ram: Add direct-io support to precopy file migration

Peter Xu (1):
  monitor: Drop monitor_fdset_dup_fd_add()

 docs/devel/migration/main.rst       |  24 ++-
 docs/devel/migration/mapped-ram.rst |   6 +-
 include/io/channel-file.h           |   8 +
 include/monitor/monitor.h           |   3 +-
 include/qemu/osdep.h                |   2 +
 io/channel-file.c                   |  33 +++-
 migration/file.c                    |  42 +++-
 migration/file.h                    |   1 -
 migration/migration-hmp-cmds.c      |  11 ++
 migration/migration.c               |  54 +++++
 migration/migration.h               |   2 +
 migration/options.c                 |  28 +++
 migration/options.h                 |   1 +
 migration/qemu-file.c               |  29 +++
 migration/qemu-file.h               |   2 +-
 migration/ram.c                     |  40 +++-
 monitor/fds.c                       |  89 ++++-----
 monitor/hmp.c                       |   2 -
 monitor/monitor-internal.h          |   1 -
 monitor/qmp.c                       |   2 -
 qapi/migration.json                 |  21 +-
 stubs/fdset.c                       |   7 +-
 tests/qtest/migration-helpers.c     |  44 +++++
 tests/qtest/migration-helpers.h     |   8 +
 tests/qtest/migration-test.c        | 296 +++++++++++++++++++++++++---
 util/osdep.c                        |  34 ++--
 26 files changed, 652 insertions(+), 138 deletions(-)


base-commit: 7e1c0047015ffbd408e1aa4a5ec1abe4751dbf7e
-- 
2.35.3



^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-24 10:51   ` Prasad Pandit
                     ` (2 more replies)
  2024-05-23 19:05 ` [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check Fabiano Rosas
                   ` (16 subsequent siblings)
  17 siblings, 3 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

When the "file:" migration support was added we missed the special
case in the qemu_open_old implementation that allows for a particular
file name format to be used to refer to a set of file descriptors that
have been previously provided to QEMU via the add-fd QMP command.

When using this fdset feature, we should not truncate the migration
file because being given an fd means that the management layer is in
control of the file and will likely already have some data written to
it. This is further indicated by the presence of the 'offset'
argument, which indicates the start of the region where QEMU is
allowed to write.

Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
call, which will take the offset into consideration.

Fixes: 385f510df5 ("migration: file URI offset")
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/file.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index ab18ba505a..ba5b5c44ff 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -84,12 +84,19 @@ void file_start_outgoing_migration(MigrationState *s,
 
     trace_migration_file_outgoing(filename);
 
-    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
-                                     0600, errp);
+    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY, 0600, errp);
     if (!fioc) {
         return;
     }
 
+    if (ftruncate(fioc->fd, offset)) {
+        error_setg_errno(errp, errno,
+                         "failed to truncate migration file to offset %" PRIx64,
+                         offset);
+        object_unref(OBJECT(fioc));
+        return;
+    }
+
     outgoing_args.fname = g_strdup(filename);
 
     ioc = QIO_CHANNEL(fioc);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 16:14   ` Peter Xu
  2024-06-03 10:21   ` Daniel P. Berrangé
  2024-05-23 19:05 ` [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset Fabiano Rosas
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

When doing file migration, QEMU accepts an offset that should be
skipped when writing the migration stream to the file. The purpose of
the offset is to allow the management layer to put its own metadata at
the start of the file.

We have tests for this in migration-test, but only testing that the
migration stream starts at the correct offset and not that it actually
leaves the data intact. Unsurprisingly, there's been a bug in that
area that the tests didn't catch.

Fix the tests to write some data to the offset region and check that
it's actually there after the migration.

While here, switch to using g_get_file_contents() which is more
portable than mmap().

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 79 ++++++++++++++++++++++--------------
 1 file changed, 48 insertions(+), 31 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b7e3406471..ec905543cf 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -67,6 +67,7 @@ static QTestMigrationState dst_state;
 #define QEMU_VM_FILE_MAGIC 0x5145564d
 #define FILE_TEST_FILENAME "migfile"
 #define FILE_TEST_OFFSET 0x1000
+#define FILE_TEST_MARKER 'X'
 #define QEMU_ENV_SRC "QTEST_QEMU_BINARY_SRC"
 #define QEMU_ENV_DST "QTEST_QEMU_BINARY_DST"
 
@@ -1716,10 +1717,43 @@ finish:
     test_migrate_end(from, to, args->result == MIG_TEST_SUCCEED);
 }
 
+static void file_dirty_offset_region(void)
+{
+    g_autofree char *path = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+    size_t size = FILE_TEST_OFFSET;
+    g_autofree char *data = g_new0(char, size);
+
+    memset(data, FILE_TEST_MARKER, size);
+    g_assert(g_file_set_contents(path, data, size, NULL));
+}
+
+static void file_check_offset_region(void)
+{
+    g_autofree char *path = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+    size_t size = FILE_TEST_OFFSET;
+    g_autofree char *expected = g_new0(char, size);
+    g_autofree char *actual = NULL;
+    uint64_t *stream_start;
+
+    /*
+     * Ensure the skipped offset region's data has not been touched
+     * and the migration stream starts at the right place.
+     */
+
+    memset(expected, FILE_TEST_MARKER, size);
+
+    g_assert(g_file_get_contents(path, &actual, NULL, NULL));
+    g_assert(!memcmp(actual, expected, size));
+
+    stream_start = (uint64_t *)(actual + size);
+    g_assert_cmpint(cpu_to_be64(*stream_start) >> 32, ==, QEMU_VM_FILE_MAGIC);
+}
+
 static void test_file_common(MigrateCommon *args, bool stop_src)
 {
     QTestState *from, *to;
     void *data_hook = NULL;
+    bool check_offset = false;
 
     if (test_migrate_start(&from, &to, args->listen_uri, &args->start)) {
         return;
@@ -1732,6 +1766,16 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
      */
     g_assert_false(args->live);
 
+    if (g_strrstr(args->connect_uri, "offset=")) {
+        check_offset = true;
+        /*
+         * This comes before the start_hook because it's equivalent to
+         * a management application creating the file and writing to
+         * it so hooks should expect the file to be already present.
+         */
+        file_dirty_offset_region();
+    }
+
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
     }
@@ -1766,6 +1810,10 @@ static void test_file_common(MigrateCommon *args, bool stop_src)
 
     wait_for_serial("dest_serial");
 
+    if (check_offset) {
+        file_check_offset_region();
+    }
+
 finish:
     if (args->finish_hook) {
         args->finish_hook(from, to, data_hook);
@@ -1965,36 +2013,6 @@ static void test_precopy_file(void)
     test_file_common(&args, true);
 }
 
-static void file_offset_finish_hook(QTestState *from, QTestState *to,
-                                    void *opaque)
-{
-#if defined(__linux__)
-    g_autofree char *path = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
-    size_t size = FILE_TEST_OFFSET + sizeof(QEMU_VM_FILE_MAGIC);
-    uintptr_t *addr, *p;
-    int fd;
-
-    fd = open(path, O_RDONLY);
-    g_assert(fd != -1);
-    addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
-    g_assert(addr != MAP_FAILED);
-
-    /*
-     * Ensure the skipped offset contains zeros and the migration
-     * stream starts at the right place.
-     */
-    p = addr;
-    while (p < addr + FILE_TEST_OFFSET / sizeof(uintptr_t)) {
-        g_assert(*p == 0);
-        p++;
-    }
-    g_assert_cmpint(cpu_to_be64(*p) >> 32, ==, QEMU_VM_FILE_MAGIC);
-
-    munmap(addr, size);
-    close(fd);
-#endif
-}
-
 static void test_precopy_file_offset(void)
 {
     g_autofree char *uri = g_strdup_printf("file:%s/%s,offset=%d", tmpfs,
@@ -2003,7 +2021,6 @@ static void test_precopy_file_offset(void)
     MigrateCommon args = {
         .connect_uri = uri,
         .listen_uri = "defer",
-        .finish_hook = file_offset_finish_hook,
     };
 
     test_file_common(&args, false);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 16:18   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 04/18] monitor: Drop monitor_fdset_dup_fd_add() Fabiano Rosas
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

Add a test for file migration using fdset. The passing of fds is more
complex than using a file path. This is also the scenario where it's
most important we ensure that the initial migration stream offset is
respected because the fdset interface is the one used by the
management layer when providing a non empty migration file.

Note that fd passing is not available on Windows, so anything that
uses add-fd needs to exclude that platform.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 44 ++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index ec905543cf..5bb53f4839 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2013,6 +2013,46 @@ static void test_precopy_file(void)
     test_file_common(&args, true);
 }
 
+#ifndef _WIN32
+static void fdset_add_fds(QTestState *qts, const char *file, int flags,
+                          int num_fds)
+{
+    for (int i = 0; i < num_fds; i++) {
+        int fd;
+
+        fd = open(file, flags, 0660);
+        assert(fd != -1);
+
+        qtest_qmp_fds_assert_success(qts, &fd, 1, "{'execute': 'add-fd', "
+                                     "'arguments': {'fdset-id': 1}}");
+        close(fd);
+    }
+}
+
+static void *file_offset_fdset_start_hook(QTestState *from, QTestState *to)
+{
+    g_autofree char *file = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+
+    fdset_add_fds(from, file, O_WRONLY, 1);
+    fdset_add_fds(to, file, O_RDONLY, 1);
+
+    return NULL;
+}
+
+static void test_precopy_file_offset_fdset(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:/dev/fdset/1,offset=%d",
+                                           FILE_TEST_OFFSET);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = file_offset_fdset_start_hook,
+    };
+
+    test_file_common(&args, false);
+}
+#endif
+
 static void test_precopy_file_offset(void)
 {
     g_autofree char *uri = g_strdup_printf("file:%s/%s,offset=%d", tmpfs,
@@ -3521,6 +3561,10 @@ int main(int argc, char **argv)
                        test_precopy_file);
     migration_test_add("/migration/precopy/file/offset",
                        test_precopy_file_offset);
+#ifndef _WIN32
+    migration_test_add("/migration/precopy/file/offset/fdset",
+                       test_precopy_file_offset_fdset);
+#endif
     migration_test_add("/migration/precopy/file/offset/bad",
                        test_precopy_file_offset_bad);
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 04/18] monitor: Drop monitor_fdset_dup_fd_add()
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (2 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-06-03 10:26   ` Daniel P. Berrangé
  2024-05-23 19:05 ` [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free Fabiano Rosas
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Dr . David Alan Gilbert, Philippe Mathieu-Daudé,
	Paolo Bonzini

From: Peter Xu <peterx@redhat.com>

This function is not needed, one remove function should already work.
Clean it up.

Here the code doesn't really care about whether we need to keep that dupfd
around if close() failed: when that happens something got very wrong,
keeping the dup_fd around the fdsets may not help that situation so far.

Cc: Dr. David Alan Gilbert <dave@treblig.org>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
[add missing return statement, removal during traversal is not safe]
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/monitor/monitor.h |  1 -
 monitor/fds.c             | 28 ++++++----------------------
 stubs/fdset.c             |  5 -----
 util/osdep.c              | 15 +--------------
 4 files changed, 7 insertions(+), 42 deletions(-)

diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
index 965f5d5450..fd9b3f538c 100644
--- a/include/monitor/monitor.h
+++ b/include/monitor/monitor.h
@@ -53,7 +53,6 @@ AddfdInfo *monitor_fdset_add_fd(int fd, bool has_fdset_id, int64_t fdset_id,
                                 const char *opaque, Error **errp);
 int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags);
 void monitor_fdset_dup_fd_remove(int dup_fd);
-int64_t monitor_fdset_dup_fd_find(int dup_fd);
 
 void monitor_register_hmp(const char *name, bool info,
                           void (*cmd)(Monitor *mon, const QDict *qdict));
diff --git a/monitor/fds.c b/monitor/fds.c
index d86c2c674c..fb9f58c056 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -458,7 +458,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
 #endif
 }
 
-static int64_t monitor_fdset_dup_fd_find_remove(int dup_fd, bool remove)
+void monitor_fdset_dup_fd_remove(int dup_fd)
 {
     MonFdset *mon_fdset;
     MonFdsetFd *mon_fdset_fd_dup;
@@ -467,31 +467,15 @@ static int64_t monitor_fdset_dup_fd_find_remove(int dup_fd, bool remove)
     QLIST_FOREACH(mon_fdset, &mon_fdsets, next) {
         QLIST_FOREACH(mon_fdset_fd_dup, &mon_fdset->dup_fds, next) {
             if (mon_fdset_fd_dup->fd == dup_fd) {
-                if (remove) {
-                    QLIST_REMOVE(mon_fdset_fd_dup, next);
-                    g_free(mon_fdset_fd_dup);
-                    if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
-                        monitor_fdset_cleanup(mon_fdset);
-                    }
-                    return -1;
-                } else {
-                    return mon_fdset->id;
+                QLIST_REMOVE(mon_fdset_fd_dup, next);
+                g_free(mon_fdset_fd_dup);
+                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
+                    monitor_fdset_cleanup(mon_fdset);
                 }
+                return;
             }
         }
     }
-
-    return -1;
-}
-
-int64_t monitor_fdset_dup_fd_find(int dup_fd)
-{
-    return monitor_fdset_dup_fd_find_remove(dup_fd, false);
-}
-
-void monitor_fdset_dup_fd_remove(int dup_fd)
-{
-    monitor_fdset_dup_fd_find_remove(dup_fd, true);
 }
 
 int monitor_fd_param(Monitor *mon, const char *fdname, Error **errp)
diff --git a/stubs/fdset.c b/stubs/fdset.c
index d7c39a28ac..389e368a29 100644
--- a/stubs/fdset.c
+++ b/stubs/fdset.c
@@ -9,11 +9,6 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
     return -1;
 }
 
-int64_t monitor_fdset_dup_fd_find(int dup_fd)
-{
-    return -1;
-}
-
 void monitor_fdset_dup_fd_remove(int dupfd)
 {
 }
diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..2d9749d060 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -393,21 +393,8 @@ int qemu_open_old(const char *name, int flags, ...)
 
 int qemu_close(int fd)
 {
-    int64_t fdset_id;
-
     /* Close fd that was dup'd from an fdset */
-    fdset_id = monitor_fdset_dup_fd_find(fd);
-    if (fdset_id != -1) {
-        int ret;
-
-        ret = close(fd);
-        if (ret == 0) {
-            monitor_fdset_dup_fd_remove(fd);
-        }
-
-        return ret;
-    }
-
+    monitor_fdset_dup_fd_remove(fd);
     return close(fd);
 }
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (3 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 04/18] monitor: Drop monitor_fdset_dup_fd_add() Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 20:03   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 06/18] monitor: Stop removing non-duplicated fds Fabiano Rosas
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

Introduce two new functions to remove and free no longer used fds and
fdsets.

We need those to decouple the remove/free routines from
monitor_fdset_cleanup() which will go away in the next patches.

The new functions:

- monitor_fdset_free() will be used when a monitor connection closes
  and when an fd is removed to cleanup any fdset that is now empty.

- monitor_fdset_fd_free() will be used to remove one or more fds that
  have been explicitly targeted by qmp_remove_fd().

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 monitor/fds.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/monitor/fds.c b/monitor/fds.c
index fb9f58c056..101e21720d 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -167,6 +167,22 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
     return -1;
 }
 
+static void monitor_fdset_free(MonFdset *mon_fdset)
+{
+    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
+        QLIST_REMOVE(mon_fdset, next);
+        g_free(mon_fdset);
+    }
+}
+
+static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
+{
+    close(mon_fdset_fd->fd);
+    g_free(mon_fdset_fd->opaque);
+    QLIST_REMOVE(mon_fdset_fd, next);
+    g_free(mon_fdset_fd);
+}
+
 static void monitor_fdset_cleanup(MonFdset *mon_fdset)
 {
     MonFdsetFd *mon_fdset_fd;
@@ -176,17 +192,11 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
         if ((mon_fdset_fd->removed ||
                 (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
                 runstate_is_running()) {
-            close(mon_fdset_fd->fd);
-            g_free(mon_fdset_fd->opaque);
-            QLIST_REMOVE(mon_fdset_fd, next);
-            g_free(mon_fdset_fd);
+            monitor_fdset_fd_free(mon_fdset_fd);
         }
     }
 
-    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
-        QLIST_REMOVE(mon_fdset, next);
-        g_free(mon_fdset);
-    }
+    monitor_fdset_free(mon_fdset);
 }
 
 void monitor_fdsets_cleanup(void)
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (4 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 21:05   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 07/18] monitor: Simplify fdset and fd removal Fabiano Rosas
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Dr. David Alan Gilbert

We've been up until now cleaning up any file descriptors that have
been passed into QEMU and never duplicated[1,2]. A file descriptor
without duplicates indicates that no part of QEMU has made use of
it. This approach is starting to show some cracks now that we're
starting to consume fds from the migration code:

- Doing cleanup every time the last monitor connection closes works to
  reap unused fds, but also has the side effect of forcing the
  management layer to pass the file descriptors again in case of a
  disconnect/re-connect, if that happened to be the only monitor
  connection.

  Another side effect is that removing an fd with qmp_remove_fd() is
  effectively delayed until the last monitor connection closes.

  The reliance on mon_refcount is also problematic because it's racy.

- Checking runstate_is_running() skips the cleanup unless the VM is
  running and avoids premature cleanup of the fds, but also has the
  side effect of blocking the legitimate removal of an fd via
  qmp_remove_fd() if the VM happens to be in another state.

  This affects qmp_remove_fd() and qmp_query_fdsets() in particular
  because requesting a removal at a bad time (guest stopped) might
  cause an fd to never be removed, or to be removed at a much later
  point in time, causing the query command to continue showing the
  supposedly removed fd/fdset.

Note that file descriptors that *have* been duplicated are owned by
the code that uses them and will be removed after qemu_close() is
called. Therefore we've decided that the best course of action to
avoid the undesired side-effects is to stop managing non-duplicated
file descriptors.

1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
2- ebe52b592d ("monitor: Prevent removing fd from set during init")

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 monitor/fds.c              | 15 ++++++++-------
 monitor/hmp.c              |  2 --
 monitor/monitor-internal.h |  1 -
 monitor/qmp.c              |  2 --
 4 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/monitor/fds.c b/monitor/fds.c
index 101e21720d..f7b98814fa 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
 
 static void monitor_fdset_free(MonFdset *mon_fdset)
 {
+    /*
+     * Only remove an empty fdset. The fds are owned by the user and
+     * should have been removed with qmp_remove_fd(). The dup_fds are
+     * owned by QEMU and should have been removed with qemu_close().
+     */
     if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
         QLIST_REMOVE(mon_fdset, next);
         g_free(mon_fdset);
@@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
     MonFdsetFd *mon_fdset_fd_next;
 
     QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
-        if ((mon_fdset_fd->removed ||
-                (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
-                runstate_is_running()) {
+        if (mon_fdset_fd->removed) {
             monitor_fdset_fd_free(mon_fdset_fd);
         }
     }
@@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
 
     QEMU_LOCK_GUARD(&mon_fdsets_lock);
     QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
-        monitor_fdset_cleanup(mon_fdset);
+        monitor_fdset_free(mon_fdset);
     }
 }
 
@@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
             if (mon_fdset_fd_dup->fd == dup_fd) {
                 QLIST_REMOVE(mon_fdset_fd_dup, next);
                 g_free(mon_fdset_fd_dup);
-                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
-                    monitor_fdset_cleanup(mon_fdset);
-                }
+                monitor_fdset_free(mon_fdset);
                 return;
             }
         }
diff --git a/monitor/hmp.c b/monitor/hmp.c
index 69c1b7e98a..460e8832f6 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
             monitor_resume(mon);
         }
         qemu_mutex_unlock(&mon->mon_lock);
-        mon_refcount++;
         break;
 
     case CHR_EVENT_CLOSED:
-        mon_refcount--;
         monitor_fdsets_cleanup();
         break;
 
diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
index 252de85681..cb628f681d 100644
--- a/monitor/monitor-internal.h
+++ b/monitor/monitor-internal.h
@@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
 extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
 extern QemuMutex monitor_lock;
 extern MonitorList mon_list;
-extern int mon_refcount;
 
 extern HMPCommand hmp_cmds[];
 
diff --git a/monitor/qmp.c b/monitor/qmp.c
index a239945e8d..5e538f34c0 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
         data = qmp_greeting(mon);
         qmp_send_response(mon, data);
         qobject_unref(data);
-        mon_refcount++;
         break;
     case CHR_EVENT_CLOSED:
         /*
@@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
         json_message_parser_destroy(&mon->parser);
         json_message_parser_init(&mon->parser, handle_qmp_command,
                                  mon, NULL);
-        mon_refcount--;
         monitor_fdsets_cleanup();
         break;
     case CHR_EVENT_BREAK:
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 07/18] monitor: Simplify fdset and fd removal
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (5 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 06/18] monitor: Stop removing non-duplicated fds Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-31 15:58   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add Fabiano Rosas
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

Remove fds right away instead of setting the ->removed flag. We don't
need the extra complexity of having a cleanup function reap the
removed entries at a later time.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 monitor/fds.c | 27 ++++++---------------------
 1 file changed, 6 insertions(+), 21 deletions(-)

diff --git a/monitor/fds.c b/monitor/fds.c
index f7b98814fa..fa723e1883 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -43,7 +43,6 @@ struct mon_fd_t {
 typedef struct MonFdsetFd MonFdsetFd;
 struct MonFdsetFd {
     int fd;
-    bool removed;
     char *opaque;
     QLIST_ENTRY(MonFdsetFd) next;
 };
@@ -188,20 +187,6 @@ static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
     g_free(mon_fdset_fd);
 }
 
-static void monitor_fdset_cleanup(MonFdset *mon_fdset)
-{
-    MonFdsetFd *mon_fdset_fd;
-    MonFdsetFd *mon_fdset_fd_next;
-
-    QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
-        if (mon_fdset_fd->removed) {
-            monitor_fdset_fd_free(mon_fdset_fd);
-        }
-    }
-
-    monitor_fdset_free(mon_fdset);
-}
-
 void monitor_fdsets_cleanup(void)
 {
     MonFdset *mon_fdset;
@@ -276,7 +261,7 @@ void qmp_get_win32_socket(const char *infos, const char *fdname, Error **errp)
 void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp)
 {
     MonFdset *mon_fdset;
-    MonFdsetFd *mon_fdset_fd;
+    MonFdsetFd *mon_fdset_fd, *mon_fdset_fd_next;
     char fd_str[60];
 
     QEMU_LOCK_GUARD(&mon_fdsets_lock);
@@ -284,21 +269,22 @@ void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp)
         if (mon_fdset->id != fdset_id) {
             continue;
         }
-        QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) {
+        QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next,
+                           mon_fdset_fd_next) {
             if (has_fd) {
                 if (mon_fdset_fd->fd != fd) {
                     continue;
                 }
-                mon_fdset_fd->removed = true;
+                monitor_fdset_fd_free(mon_fdset_fd);
                 break;
             } else {
-                mon_fdset_fd->removed = true;
+                monitor_fdset_fd_free(mon_fdset_fd);
             }
         }
         if (has_fd && !mon_fdset_fd) {
             goto error;
         }
-        monitor_fdset_cleanup(mon_fdset);
+        monitor_fdset_free(mon_fdset);
         return;
     }
 
@@ -408,7 +394,6 @@ AddfdInfo *monitor_fdset_add_fd(int fd, bool has_fdset_id, int64_t fdset_id,
 
     mon_fdset_fd = g_malloc0(sizeof(*mon_fdset_fd));
     mon_fdset_fd->fd = fd;
-    mon_fdset_fd->removed = false;
     mon_fdset_fd->opaque = g_strdup(opaque);
     QLIST_INSERT_HEAD(&mon_fdset->fds, mon_fdset_fd, next);
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (6 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 07/18] monitor: Simplify fdset and fd removal Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 21:08   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file Fabiano Rosas
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Paolo Bonzini

I'm keeping the EACCES because callers expect to be able to look at
errno.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/monitor/monitor.h |  2 +-
 monitor/fds.c             | 10 +++++++++-
 stubs/fdset.c             |  2 +-
 util/osdep.c              | 10 +---------
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/monitor/monitor.h b/include/monitor/monitor.h
index fd9b3f538c..c3740ec616 100644
--- a/include/monitor/monitor.h
+++ b/include/monitor/monitor.h
@@ -51,7 +51,7 @@ int monitor_read_password(MonitorHMP *mon, ReadLineFunc *readline_func,
 
 AddfdInfo *monitor_fdset_add_fd(int fd, bool has_fdset_id, int64_t fdset_id,
                                 const char *opaque, Error **errp);
-int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags);
+int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags, Error **errp);
 void monitor_fdset_dup_fd_remove(int dup_fd);
 
 void monitor_register_hmp(const char *name, bool info,
diff --git a/monitor/fds.c b/monitor/fds.c
index fa723e1883..d93d2e695b 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -404,9 +404,10 @@ AddfdInfo *monitor_fdset_add_fd(int fd, bool has_fdset_id, int64_t fdset_id,
     return fdinfo;
 }
 
-int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
+int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags, Error **errp)
 {
 #ifdef _WIN32
+    error_setg(errp, "Platform does not support fd passing (fdset)");
     return -ENOENT;
 #else
     MonFdset *mon_fdset;
@@ -426,6 +427,8 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
         QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) {
             mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL);
             if (mon_fd_flags == -1) {
+                error_setg(errp, "Failed to read file status flags for fd=%d",
+                           mon_fdset_fd->fd);
                 return -1;
             }
 
@@ -437,11 +440,15 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
 
         if (fd == -1) {
             errno = EACCES;
+            error_setg(errp,
+                       "Failed to find file descriptor with matching flags=0x%x",
+                       flags);
             return -1;
         }
 
         dup_fd = qemu_dup_flags(fd, flags);
         if (dup_fd == -1) {
+            error_setg(errp, "Failed to dup() given file descriptor fd=%d", fd);
             return -1;
         }
 
@@ -451,6 +458,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
         return dup_fd;
     }
 
+    error_setg(errp, "Failed to find fdset /dev/fdset/%" PRId64, fdset_id);
     errno = ENOENT;
     return -1;
 #endif
diff --git a/stubs/fdset.c b/stubs/fdset.c
index 389e368a29..2950fd91fd 100644
--- a/stubs/fdset.c
+++ b/stubs/fdset.c
@@ -3,7 +3,7 @@
 #include "monitor/monitor.h"
 #include "../monitor/monitor-internal.h"
 
-int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
+int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags, Error **errp)
 {
     errno = ENOSYS;
     return -1;
diff --git a/util/osdep.c b/util/osdep.c
index 2d9749d060..0cb8f98bf6 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -305,7 +305,6 @@ qemu_open_internal(const char *name, int flags, mode_t mode, Error **errp)
     /* Attempt dup of fd from fd set */
     if (strstart(name, "/dev/fdset/", &fdset_id_str)) {
         int64_t fdset_id;
-        int dupfd;
 
         fdset_id = qemu_parse_fdset(fdset_id_str);
         if (fdset_id == -1) {
@@ -314,14 +313,7 @@ qemu_open_internal(const char *name, int flags, mode_t mode, Error **errp)
             return -1;
         }
 
-        dupfd = monitor_fdset_dup_fd_add(fdset_id, flags);
-        if (dupfd == -1) {
-            error_setg_errno(errp, errno, "Could not dup FD for %s flags %x",
-                             name, flags);
-            return -1;
-        }
-
-        return dupfd;
+        return monitor_fdset_dup_fd_add(fdset_id, flags, errp);
     }
 #endif
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (7 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 21:10   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 10/18] migration: Add direct-io parameter Fabiano Rosas
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

We want to make use of the Error object to report fdset errors from
qemu_open_internal() and passing the error pointer to qemu_open_old()
would require changing all callers. Move the file channel to the new
API instead.

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 io/channel-file.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/io/channel-file.c b/io/channel-file.c
index 6436cfb6ae..2ea8d08360 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -68,11 +68,13 @@ qio_channel_file_new_path(const char *path,
 
     ioc = QIO_CHANNEL_FILE(object_new(TYPE_QIO_CHANNEL_FILE));
 
-    ioc->fd = qemu_open_old(path, flags, mode);
+    if (flags & O_CREAT) {
+        ioc->fd = qemu_create(path, flags & ~O_CREAT, mode, errp);
+    } else {
+        ioc->fd = qemu_open(path, flags, errp);
+    }
     if (ioc->fd < 0) {
         object_unref(OBJECT(ioc));
-        error_setg_errno(errp, errno,
-                         "Unable to open %s", path);
         return NULL;
     }
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 10/18] migration: Add direct-io parameter
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (8 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 21:12   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 11/18] migration/multifd: Add direct-io support Fabiano Rosas
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Eric Blake

Add the direct-io migration parameter that tells the migration code to
use O_DIRECT when opening the migration stream file whenever possible.

This is currently only used with the mapped-ram migration that has a
clear window guaranteed to perform aligned writes.

Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/qemu/osdep.h           |  2 ++
 migration/migration-hmp-cmds.c | 11 +++++++++++
 migration/options.c            | 28 ++++++++++++++++++++++++++++
 migration/options.h            |  1 +
 qapi/migration.json            | 21 ++++++++++++++++++---
 util/osdep.c                   |  9 +++++++++
 6 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index f61edcfdc2..191916f38e 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -612,6 +612,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
 bool qemu_has_ofd_lock(void);
 #endif
 
+bool qemu_has_direct_io(void);
+
 #if defined(__HAIKU__) && defined(__i386__)
 #define FMT_pid "%ld"
 #elif defined(WIN64)
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 9f0e8029e0..7d608d26e1 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -351,6 +351,13 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %s\n",
             MigrationParameter_str(MIGRATION_PARAMETER_MODE),
             qapi_enum_lookup(&MigMode_lookup, params->mode));
+
+        if (params->has_direct_io) {
+            monitor_printf(mon, "%s: %s\n",
+                           MigrationParameter_str(
+                               MIGRATION_PARAMETER_DIRECT_IO),
+                           params->direct_io ? "on" : "off");
+        }
     }
 
     qapi_free_MigrationParameters(params);
@@ -624,6 +631,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_mode = true;
         visit_type_MigMode(v, param, &p->mode, &err);
         break;
+    case MIGRATION_PARAMETER_DIRECT_IO:
+        p->has_direct_io = true;
+        visit_type_bool(v, param, &p->direct_io, &err);
+        break;
     default:
         assert(0);
     }
diff --git a/migration/options.c b/migration/options.c
index 5ab5b6d85d..6f614761b8 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -702,6 +702,18 @@ bool migrate_cpu_throttle_tailslow(void)
     return s->parameters.cpu_throttle_tailslow;
 }
 
+bool migrate_direct_io(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    /* For now O_DIRECT is only supported with mapped-ram */
+    if (!s->capabilities[MIGRATION_CAPABILITY_MAPPED_RAM]) {
+        return false;
+    }
+
+    return s->parameters.direct_io;
+}
+
 uint64_t migrate_downtime_limit(void)
 {
     MigrationState *s = migrate_get_current();
@@ -905,6 +917,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->mode = s->parameters.mode;
     params->has_zero_page_detection = true;
     params->zero_page_detection = s->parameters.zero_page_detection;
+    params->has_direct_io = true;
+    params->direct_io = s->parameters.direct_io;
 
     return params;
 }
@@ -937,6 +951,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_vcpu_dirty_limit = true;
     params->has_mode = true;
     params->has_zero_page_detection = true;
+    params->has_direct_io = true;
 }
 
 /*
@@ -1110,6 +1125,11 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
         return false;
     }
 
+    if (params->has_direct_io && params->direct_io && !qemu_has_direct_io()) {
+        error_setg(errp, "No build-time support for direct-io");
+        return false;
+    }
+
     return true;
 }
 
@@ -1216,6 +1236,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_zero_page_detection) {
         dest->zero_page_detection = params->zero_page_detection;
     }
+
+    if (params->has_direct_io) {
+        dest->direct_io = params->direct_io;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1341,6 +1365,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_zero_page_detection) {
         s->parameters.zero_page_detection = params->zero_page_detection;
     }
+
+    if (params->has_direct_io) {
+        s->parameters.direct_io = params->direct_io;
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/migration/options.h b/migration/options.h
index 4b21cc2669..a2397026db 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -69,6 +69,7 @@ uint32_t migrate_checkpoint_delay(void);
 uint8_t migrate_cpu_throttle_increment(void);
 uint8_t migrate_cpu_throttle_initial(void);
 bool migrate_cpu_throttle_tailslow(void);
+bool migrate_direct_io(void);
 uint64_t migrate_downtime_limit(void);
 uint8_t migrate_max_cpu_throttle(void);
 uint64_t migrate_max_bandwidth(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index a351fd3714..7efbaf1e25 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -812,6 +812,10 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @direct-io: Open migration files with O_DIRECT when possible.  This
+#     only has effect if the @mapped-ram capability is enabled. (since
+#     9.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -836,7 +840,8 @@
            { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
            'vcpu-dirty-limit',
            'mode',
-           'zero-page-detection'] }
+           'zero-page-detection',
+           'direct-io'] }
 
 ##
 # @MigrateSetParameters:
@@ -982,6 +987,10 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @direct-io: Open migration files with O_DIRECT when possible.  This
+#     only has effect if the @mapped-ram capability is enabled. (since
+#     9.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1021,7 +1030,8 @@
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
-            '*zero-page-detection': 'ZeroPageDetection'} }
+            '*zero-page-detection': 'ZeroPageDetection',
+            '*direct-io': 'bool' } }
 
 ##
 # @migrate-set-parameters:
@@ -1181,6 +1191,10 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @direct-io: Open migration files with O_DIRECT when possible.  This
+#     only has effect if the @mapped-ram capability is enabled. (since
+#     9.1)
+#
 # Features:
 #
 # @unstable: Members @x-checkpoint-delay and
@@ -1217,7 +1231,8 @@
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
-            '*zero-page-detection': 'ZeroPageDetection'} }
+            '*zero-page-detection': 'ZeroPageDetection',
+            '*direct-io': 'bool' } }
 
 ##
 # @query-migrate-parameters:
diff --git a/util/osdep.c b/util/osdep.c
index 0cb8f98bf6..4900251d2e 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -277,6 +277,15 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
 }
 #endif
 
+bool qemu_has_direct_io(void)
+{
+#ifdef O_DIRECT
+    return true;
+#else
+    return false;
+#endif
+}
+
 static int qemu_open_cloexec(const char *name, int flags, mode_t mode)
 {
     int ret;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 11/18] migration/multifd: Add direct-io support
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (9 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 10/18] migration: Add direct-io parameter Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 21:35   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 12/18] tests/qtest/migration: Add tests for file migration with direct-io Fabiano Rosas
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

When multifd is used along with mapped-ram, we can take benefit of a
filesystem that supports the O_DIRECT flag and perform direct I/O in
the multifd threads. This brings a significant performance improvement
because direct-io writes bypass the page cache which would otherwise
be thrashed by the multifd data which is unlikely to be needed again
in a short period of time.

To be able to use a multifd channel opened with O_DIRECT, we must
ensure that a certain aligment is used. Filesystems usually require a
block-size alignment for direct I/O. The way to achieve this is by
enabling the mapped-ram feature, which already aligns its I/O properly
(see MAPPED_RAM_FILE_OFFSET_ALIGNMENT at ram.c).

By setting O_DIRECT on the multifd channels, all writes to the same
file descriptor need to be aligned as well, even the ones that come
from outside multifd, such as the QEMUFile I/O from the main migration
code. This makes it impossible to use the same file descriptor for the
QEMUFile and for the multifd channels. The various flags and metadata
written by the main migration code will always be unaligned by virtue
of their small size. To workaround this issue, we'll require a second
file descriptor to be used exclusively for direct I/O.

The second file descriptor can be obtained by QEMU by re-opening the
migration file (already possible), or by being provided by the user or
management application (support to be added in future patches).

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/file.c      | 31 ++++++++++++++++++++++++++-----
 migration/file.h      |  1 -
 migration/migration.c | 23 +++++++++++++++++++++++
 3 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/migration/file.c b/migration/file.c
index ba5b5c44ff..ac4d206492 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -50,12 +50,31 @@ void file_cleanup_outgoing_migration(void)
     outgoing_args.fname = NULL;
 }
 
+static void file_enable_direct_io(int *flags)
+{
+#ifdef O_DIRECT
+    if (migrate_direct_io()) {
+        *flags |= O_DIRECT;
+    }
+#else
+    /* it should have been rejected when setting the parameter */
+    g_assert_not_reached();
+#endif
+}
+
 bool file_send_channel_create(gpointer opaque, Error **errp)
 {
     QIOChannelFile *ioc;
     int flags = O_WRONLY;
     bool ret = true;
 
+    /*
+     * Attempt to enable O_DIRECT for the secondary channels. These
+     * are used for sending ram pages and writes should be guaranteed
+     * to be aligned to at least page size.
+     */
+    file_enable_direct_io(&flags);
+
     ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, errp);
     if (!ioc) {
         ret = false;
@@ -116,21 +135,23 @@ static gboolean file_accept_incoming_migration(QIOChannel *ioc,
     return G_SOURCE_REMOVE;
 }
 
-void file_create_incoming_channels(QIOChannel *ioc, Error **errp)
+static void file_create_incoming_channels(QIOChannel *ioc, char *filename,
+                                          Error **errp)
 {
-    int i, fd, channels = 1;
+    int i, channels = 1;
     g_autofree QIOChannel **iocs = NULL;
+    int flags = O_RDONLY;
 
     if (migrate_multifd()) {
         channels += migrate_multifd_channels();
+        file_enable_direct_io(&flags);
     }
 
     iocs = g_new0(QIOChannel *, channels);
-    fd = QIO_CHANNEL_FILE(ioc)->fd;
     iocs[0] = ioc;
 
     for (i = 1; i < channels; i++) {
-        QIOChannelFile *fioc = qio_channel_file_new_dupfd(fd, errp);
+        QIOChannelFile *fioc = qio_channel_file_new_path(filename, flags, 0, errp);
 
         if (!fioc) {
             while (i) {
@@ -170,7 +191,7 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
         return;
     }
 
-    file_create_incoming_channels(QIO_CHANNEL(fioc), errp);
+    file_create_incoming_channels(QIO_CHANNEL(fioc), filename, errp);
 }
 
 int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
diff --git a/migration/file.h b/migration/file.h
index 7699c04677..9f71e87f74 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -20,7 +20,6 @@ void file_start_outgoing_migration(MigrationState *s,
 int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
 void file_cleanup_outgoing_migration(void);
 bool file_send_channel_create(gpointer opaque, Error **errp);
-void file_create_incoming_channels(QIOChannel *ioc, Error **errp);
 int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
                             int niov, RAMBlock *block, Error **errp);
 int multifd_file_recv_data(MultiFDRecvParams *p, Error **errp);
diff --git a/migration/migration.c b/migration/migration.c
index e1b269624c..e03c80b3aa 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -155,6 +155,16 @@ static bool migration_needs_seekable_channel(void)
     return migrate_mapped_ram();
 }
 
+static bool migration_needs_extra_fds(void)
+{
+    /*
+     * When doing direct-io, multifd requires two different,
+     * non-duplicated file descriptors so we can use one of them for
+     * unaligned IO.
+     */
+    return migrate_multifd() && migrate_direct_io();
+}
+
 static bool transport_supports_seeking(MigrationAddress *addr)
 {
     if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
@@ -164,6 +174,12 @@ static bool transport_supports_seeking(MigrationAddress *addr)
     return false;
 }
 
+static bool transport_supports_extra_fds(MigrationAddress *addr)
+{
+    /* file: works because QEMU can open it multiple times */
+    return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
+}
+
 static bool
 migration_channels_and_transport_compatible(MigrationAddress *addr,
                                             Error **errp)
@@ -180,6 +196,13 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
         return false;
     }
 
+    if (migration_needs_extra_fds() &&
+        !transport_supports_extra_fds(addr)) {
+        error_setg(errp,
+                   "Migration requires a transport that allows for extra fds (e.g. file)");
+        return false;
+    }
+
     return true;
 }
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 12/18] tests/qtest/migration: Add tests for file migration with direct-io
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (10 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 11/18] migration/multifd: Add direct-io support Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT Fabiano Rosas
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

The tests are only allowed to run in systems that know about the
O_DIRECT flag and in filesystems which support it.

Note: this also brings back migrate_set_parameter_bool() which went
away when we removed the compression tests. I copied it verbatim.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
- updated the aligment comment

- added an inline version of probe function so were able to call it
  without any ifdef

- simplified the _start function that sets the capability

- used qemu_try_memalign instead of aligned_alloc
---
 tests/qtest/migration-helpers.c | 44 +++++++++++++++++++++++
 tests/qtest/migration-helpers.h |  8 +++++
 tests/qtest/migration-test.c    | 62 +++++++++++++++++++++++++++++++++
 3 files changed, 114 insertions(+)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index ce6d6615b5..0ac49ceb54 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -18,6 +18,7 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qlist.h"
 #include "qemu/cutils.h"
+#include "qemu/memalign.h"
 
 #include "migration-helpers.h"
 
@@ -473,3 +474,46 @@ void migration_test_add(const char *path, void (*fn)(void))
     qtest_add_data_func_full(path, test, migration_test_wrapper,
                              migration_test_destroy);
 }
+
+#ifdef O_DIRECT
+/*
+ * Probe for O_DIRECT support on the filesystem. Since this is used
+ * for tests, be conservative, if anything fails, assume it's
+ * unsupported.
+ */
+bool probe_o_direct_support(const char *tmpfs)
+{
+    g_autofree char *filename = g_strdup_printf("%s/probe-o-direct", tmpfs);
+    int fd, flags = O_CREAT | O_RDWR | O_TRUNC | O_DIRECT;
+    void *buf;
+    ssize_t ret, len;
+    uint64_t offset;
+
+    fd = open(filename, flags, 0660);
+    if (fd < 0) {
+        unlink(filename);
+        return false;
+    }
+
+    /*
+     * Using 1MB alignment as conservative choice to satisfy any
+     * plausible architecture default page size, and/or filesystem
+     * alignment restrictions.
+     */
+    len = 0x100000;
+    offset = 0x100000;
+
+    buf = qemu_try_memalign(len, len);
+    g_assert(buf);
+
+    ret = pwrite(fd, buf, len, offset);
+    unlink(filename);
+    g_free(buf);
+
+    if (ret < 0) {
+        return false;
+    }
+
+    return true;
+}
+#endif
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index 1339835698..50095fca4a 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -54,5 +54,13 @@ char *find_common_machine_version(const char *mtype, const char *var1,
                                   const char *var2);
 char *resolve_machine_version(const char *alias, const char *var1,
                               const char *var2);
+#ifdef O_DIRECT
+bool probe_o_direct_support(const char *tmpfs);
+#else
+static inline bool probe_o_direct_support(const char *tmpfs)
+{
+    return false;
+}
+#endif
 void migration_test_add(const char *path, void (*fn)(void));
 #endif /* MIGRATION_HELPERS_H */
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5bb53f4839..37017ff2d5 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -428,6 +428,38 @@ static void migrate_set_parameter_str(QTestState *who, const char *parameter,
     migrate_check_parameter_str(who, parameter, value);
 }
 
+static long long migrate_get_parameter_bool(QTestState *who,
+                                           const char *parameter)
+{
+    QDict *rsp;
+    int result;
+
+    rsp = qtest_qmp_assert_success_ref(
+        who, "{ 'execute': 'query-migrate-parameters' }");
+    result = qdict_get_bool(rsp, parameter);
+    qobject_unref(rsp);
+    return !!result;
+}
+
+static void migrate_check_parameter_bool(QTestState *who, const char *parameter,
+                                        int value)
+{
+    int result;
+
+    result = migrate_get_parameter_bool(who, parameter);
+    g_assert_cmpint(result, ==, value);
+}
+
+static void migrate_set_parameter_bool(QTestState *who, const char *parameter,
+                                      int value)
+{
+    qtest_qmp_assert_success(who,
+                             "{ 'execute': 'migrate-set-parameters',"
+                             "'arguments': { %s: %i } }",
+                             parameter, value);
+    migrate_check_parameter_bool(who, parameter, value);
+}
+
 static void migrate_ensure_non_converge(QTestState *who)
 {
     /* Can't converge with 1ms downtime + 3 mbs bandwidth limit */
@@ -2178,6 +2210,33 @@ static void test_multifd_file_mapped_ram(void)
     test_file_common(&args, true);
 }
 
+static void *multifd_mapped_ram_dio_start(QTestState *from, QTestState *to)
+{
+    migrate_multifd_mapped_ram_start(from, to);
+
+    migrate_set_parameter_bool(from, "direct-io", true);
+    migrate_set_parameter_bool(to, "direct-io", true);
+
+    return NULL;
+}
+
+static void test_multifd_file_mapped_ram_dio(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = multifd_mapped_ram_dio_start,
+    };
+
+    if (!probe_o_direct_support(tmpfs)) {
+        g_test_skip("Filesystem does not support O_DIRECT");
+        return;
+    }
+
+    test_file_common(&args, true);
+}
 
 static void test_precopy_tcp_plain(void)
 {
@@ -3586,6 +3645,9 @@ int main(int argc, char **argv)
     migration_test_add("/migration/multifd/file/mapped-ram/live",
                        test_multifd_file_mapped_ram_live);
 
+    migration_test_add("/migration/multifd/file/mapped-ram/dio",
+                       test_multifd_file_mapped_ram_dio);
+
 #ifdef CONFIG_GNUTLS
     migration_test_add("/migration/precopy/unix/tls/psk",
                        test_precopy_unix_tls_psk);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (11 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 12/18] tests/qtest/migration: Add tests for file migration with direct-io Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-30 21:41   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file Fabiano Rosas
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

We're about to enable the use of O_DIRECT in the migration code and
due to the alignment restrictions imposed by filesystems we need to
make sure the flag is only used when doing aligned IO.

The migration will do parallel IO to different regions of a file, so
we need to use more than one file descriptor. Those cannot be obtained
by duplicating (dup()) since duplicated file descriptors share the
file status flags, including O_DIRECT. If one migration channel does
unaligned IO while another sets O_DIRECT to do aligned IO, the
filesystem would fail the unaligned operation.

The add-fd QMP command along with the fdset code are specifically
designed to allow the user to pass a set of file descriptors with
different access flags into QEMU to be later fetched by code that
needs to alternate between those flags when doing IO.

Extend the fdset matching to behave the same with the O_DIRECT flag.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 monitor/fds.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/monitor/fds.c b/monitor/fds.c
index d93d2e695b..a6580e215e 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -419,6 +419,11 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags, Error **errp)
         int fd = -1;
         int dup_fd;
         int mon_fd_flags;
+        int mask = O_ACCMODE;
+
+#ifdef O_DIRECT
+        mask |= O_DIRECT;
+#endif
 
         if (mon_fdset->id != fdset_id) {
             continue;
@@ -432,7 +437,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags, Error **errp)
                 return -1;
             }
 
-            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
+            if ((flags & mask) == (mon_fd_flags & mask)) {
                 fd = mon_fdset_fd->fd;
                 break;
             }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (12 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-06-04 20:46   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds Fabiano Rosas
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

With the last few changes to the fdset infrastructure, we now allow
multifd to use an fdset when migrating to a file. This is useful for
the scenario where the management layer wants to have control over the
migration file.

By receiving the file descriptors directly, QEMU can delegate some
high level operating system operations to the management layer (such
as mandatory access control). The management layer might also want to
add its own headers before the migration stream.

Document the "file:/dev/fdset/#" syntax for the multifd migration with
mapped-ram. The requirements for the fdset mechanism are:

- the fdset must contain two fds that are not duplicates between
  themselves;

- if direct-io is to be used, exactly one of the fds must have the
  O_DIRECT flag set;

- the file must be opened with WRONLY on the migration source side;

- the file must be opened with RDONLY on the migration destination
  side.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 docs/devel/migration/main.rst       | 24 +++++++++++++++++++-----
 docs/devel/migration/mapped-ram.rst |  6 +++++-
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
index 495cdcb112..784c899dca 100644
--- a/docs/devel/migration/main.rst
+++ b/docs/devel/migration/main.rst
@@ -47,11 +47,25 @@ over any transport.
   QEMU interference. Note that QEMU does not flush cached file
   data/metadata at the end of migration.
 
-In addition, support is included for migration using RDMA, which
-transports the page data using ``RDMA``, where the hardware takes care of
-transporting the pages, and the load on the CPU is much lower.  While the
-internals of RDMA migration are a bit different, this isn't really visible
-outside the RAM migration code.
+  The file migration also supports using a file that has already been
+  opened. A set of file descriptors is passed to QEMU via an "fdset"
+  (see add-fd QMP command documentation). This method allows a
+  management application to have control over the migration file
+  opening operation. There are, however, strict requirements to this
+  interface if the multifd capability is enabled:
+
+    - the fdset must contain two file descriptors that are not
+      duplicates between themselves;
+    - if the direct-io capability is to be used, exactly one of the
+      file descriptors must have the O_DIRECT flag set;
+    - the file must be opened with WRONLY on the migration source side
+      and RDONLY on the migration destination side.
+
+- rdma migration: support is included for migration using RDMA, which
+  transports the page data using ``RDMA``, where the hardware takes
+  care of transporting the pages, and the load on the CPU is much
+  lower.  While the internals of RDMA migration are a bit different,
+  this isn't really visible outside the RAM migration code.
 
 All these migration protocols use the same infrastructure to
 save/restore state devices.  This infrastructure is shared with the
diff --git a/docs/devel/migration/mapped-ram.rst b/docs/devel/migration/mapped-ram.rst
index fa4cefd9fc..e6505511f0 100644
--- a/docs/devel/migration/mapped-ram.rst
+++ b/docs/devel/migration/mapped-ram.rst
@@ -16,7 +16,7 @@ location in the file, rather than constantly being added to a
 sequential stream. Having the pages at fixed offsets also allows the
 usage of O_DIRECT for save/restore of the migration stream as the
 pages are ensured to be written respecting O_DIRECT alignment
-restrictions (direct-io support not yet implemented).
+restrictions.
 
 Usage
 -----
@@ -35,6 +35,10 @@ Use a ``file:`` URL for migration:
 Mapped-ram migration is best done non-live, i.e. by stopping the VM on
 the source side before migrating.
 
+For best performance enable the ``direct-io`` capability as well:
+
+    ``migrate_set_capability direct-io on``
+
 Use-cases
 ---------
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (13 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-06-04 20:51   ` Peter Xu
  2024-05-23 19:05 ` [PATCH v2 16/18] io/channel-file: Add direct-io support Fabiano Rosas
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

Add a multifd test for mapped-ram with passing of fds into QEMU. This
is how libvirt will consume the feature.

There are a couple of details to the fdset mechanism:

- multifd needs two distinct file descriptors (not duplicated with
  dup()) so it can enable O_DIRECT only on the channels that do
  aligned IO. The dup() system call creates file descriptors that
  share status flags, of which O_DIRECT is one.

- the open() access mode flags used for the fds passed into QEMU need
  to match the flags QEMU uses to open the file. Currently O_WRONLY
  for src and O_RDONLY for dst.

Note that fdset code goes under _WIN32 because fd passing is not
supported on Windows.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 tests/qtest/migration-test.c | 101 +++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 37017ff2d5..5ced3b90c9 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2047,11 +2047,18 @@ static void test_precopy_file(void)
 
 #ifndef _WIN32
 static void fdset_add_fds(QTestState *qts, const char *file, int flags,
-                          int num_fds)
+                          int num_fds, bool direct_io)
 {
     for (int i = 0; i < num_fds; i++) {
         int fd;
 
+#ifdef O_DIRECT
+        /* only secondary channels can use direct-io */
+        if (direct_io && i != 0) {
+            flags |= O_DIRECT;
+        }
+#endif
+
         fd = open(file, flags, 0660);
         assert(fd != -1);
 
@@ -2065,8 +2072,8 @@ static void *file_offset_fdset_start_hook(QTestState *from, QTestState *to)
 {
     g_autofree char *file = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
 
-    fdset_add_fds(from, file, O_WRONLY, 1);
-    fdset_add_fds(to, file, O_RDONLY, 1);
+    fdset_add_fds(from, file, O_WRONLY, 1, false);
+    fdset_add_fds(to, file, O_RDONLY, 1, false);
 
     return NULL;
 }
@@ -2238,6 +2245,87 @@ static void test_multifd_file_mapped_ram_dio(void)
     test_file_common(&args, true);
 }
 
+#ifndef _WIN32
+static void multifd_mapped_ram_fdset_end(QTestState *from, QTestState *to,
+                                         void *opaque)
+{
+    QDict *resp;
+    QList *fdsets;
+
+    /*
+     * Remove the fdsets after migration, otherwise a second migration
+     * would fail due fdset reuse.
+     */
+    qtest_qmp_assert_success(from, "{'execute': 'remove-fd', "
+                             "'arguments': { 'fdset-id': 1}}");
+
+    resp = qtest_qmp(from, "{'execute': 'query-fdsets', "
+                     "'arguments': {}}");
+    g_assert(qdict_haskey(resp, "return"));
+    fdsets = qdict_get_qlist(resp, "return");
+    g_assert(fdsets && qlist_empty(fdsets));
+}
+
+static void *multifd_mapped_ram_fdset_dio(QTestState *from, QTestState *to)
+{
+    g_autofree char *file = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+
+    fdset_add_fds(from, file, O_WRONLY, 2, true);
+    fdset_add_fds(to, file, O_RDONLY, 2, true);
+
+    migrate_multifd_mapped_ram_start(from, to);
+    migrate_set_parameter_bool(from, "direct-io", true);
+    migrate_set_parameter_bool(to, "direct-io", true);
+
+    return NULL;
+}
+
+static void *multifd_mapped_ram_fdset(QTestState *from, QTestState *to)
+{
+    g_autofree char *file = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+
+    fdset_add_fds(from, file, O_WRONLY, 2, false);
+    fdset_add_fds(to, file, O_RDONLY, 2, false);
+
+    migrate_multifd_mapped_ram_start(from, to);
+
+    return NULL;
+}
+
+static void test_multifd_file_mapped_ram_fdset(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:/dev/fdset/1,offset=%d",
+                                           FILE_TEST_OFFSET);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = multifd_mapped_ram_fdset,
+        .finish_hook = multifd_mapped_ram_fdset_end,
+    };
+
+    test_file_common(&args, true);
+}
+
+static void test_multifd_file_mapped_ram_fdset_dio(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:/dev/fdset/1,offset=%d",
+                                           FILE_TEST_OFFSET);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = multifd_mapped_ram_fdset_dio,
+        .finish_hook = multifd_mapped_ram_fdset_end,
+    };
+
+    if (!probe_o_direct_support(tmpfs)) {
+        g_test_skip("Filesystem does not support O_DIRECT");
+        return;
+    }
+
+    test_file_common(&args, true);
+}
+#endif /* !_WIN32 */
+
 static void test_precopy_tcp_plain(void)
 {
     MigrateCommon args = {
@@ -3648,6 +3736,13 @@ int main(int argc, char **argv)
     migration_test_add("/migration/multifd/file/mapped-ram/dio",
                        test_multifd_file_mapped_ram_dio);
 
+#ifndef _WIN32
+    qtest_add_func("/migration/multifd/file/mapped-ram/fdset",
+                   test_multifd_file_mapped_ram_fdset);
+    qtest_add_func("/migration/multifd/file/mapped-ram/fdset/dio",
+                   test_multifd_file_mapped_ram_fdset_dio);
+#endif
+
 #ifdef CONFIG_GNUTLS
     migration_test_add("/migration/precopy/unix/tls/psk",
                        test_precopy_unix_tls_psk);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 16/18] io/channel-file: Add direct-io support
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (14 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-06-03 10:32   ` Daniel P. Berrangé
  2024-05-23 19:05 ` [PATCH v2 17/18] migration: Add direct-io helpers Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration Fabiano Rosas
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

Add support for setting/clearing the O_DIRECT flag on a file
descriptor. This will be used for enabling O_DIRECT in the main
migration channel when multifd is not in use.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 include/io/channel-file.h |  8 ++++++++
 io/channel-file.c         | 25 +++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/io/channel-file.h b/include/io/channel-file.h
index d373a4e44d..ecb4450e8e 100644
--- a/include/io/channel-file.h
+++ b/include/io/channel-file.h
@@ -107,4 +107,12 @@ qio_channel_file_new_path(const char *path,
                           mode_t mode,
                           Error **errp);
 
+/**
+ * qio_channel_file_set_direct_io:
+ * @ioc: the QIOChannel object
+ * @enabled: the desired state of the O_DIRECT flag
+ * @errp: pointer to initialized error object
+ */
+void qio_channel_file_set_direct_io(QIOChannel *ioc, bool enabled,
+                                    Error **errp);
 #endif /* QIO_CHANNEL_FILE_H */
diff --git a/io/channel-file.c b/io/channel-file.c
index 2ea8d08360..a89cd3a6d5 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -231,6 +231,31 @@ static int qio_channel_file_set_blocking(QIOChannel *ioc,
 #endif
 }
 
+void qio_channel_file_set_direct_io(QIOChannel *ioc, bool enabled, Error **errp)
+{
+#ifdef O_DIRECT
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+    int flags = fcntl(fioc->fd, F_GETFL);
+
+    if (flags == -1) {
+        error_setg_errno(errp, errno, "Unable to read file descriptor flags");
+        return;
+    }
+
+    if (enabled) {
+        flags |= O_DIRECT;
+    } else {
+        flags &= ~O_DIRECT;
+    }
+
+    if (fcntl(fioc->fd, F_SETFL, flags) == -1) {
+        error_setg_errno(errp, errno, "Unable to set file descriptor flags");
+        return;
+    }
+#else
+    error_setg(errp, "System does not support O_DIRECT");
+#endif
+}
 
 static off_t qio_channel_file_seek(QIOChannel *ioc,
                                    off_t offset,
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 17/18] migration: Add direct-io helpers
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (15 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 16/18] io/channel-file: Add direct-io support Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-05-23 19:05 ` [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration Fabiano Rosas
  17 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

We support using O_DIRECT with multifd by isolating the main migration
channel which does unaligned IO from the multifd channels that do
aligned IO. When multifd is not enabled, we can still use O_DIRECT, if
we judiciously enable/disable the flag to avoid the unaligned IO.

Add helpers to enable/disable direct-io around the aligned parts.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 31 +++++++++++++++++++++++++++++++
 migration/migration.h |  2 ++
 migration/qemu-file.c | 29 +++++++++++++++++++++++++++++
 migration/qemu-file.h |  2 +-
 4 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index e03c80b3aa..be128f95da 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -420,6 +420,37 @@ static void migrate_generate_event(int new_state)
     }
 }
 
+bool migration_direct_io_start(QEMUFile *file, Error **errp)
+{
+    if (!migrate_direct_io() || migrate_multifd()) {
+        return true;
+    }
+
+    /* flush any potentially unaligned IO before setting O_DIRECT */
+    qemu_fflush(file);
+
+    if (!qemu_file_set_direct_io(file, true)) {
+        error_setg(errp, "Failed to enable direct-io");
+        return false;
+    }
+
+    return true;
+}
+
+bool migration_direct_io_finish(QEMUFile *file, Error **errp)
+{
+    if (!migrate_direct_io() || migrate_multifd()) {
+        return true;
+    }
+
+    if (!qemu_file_set_direct_io(file, false)) {
+        error_setg(errp, "Failed to disable direct-io");
+        return false;
+    }
+
+    return true;
+}
+
 /*
  * Send a message on the return channel back to the source
  * of the migration.
diff --git a/migration/migration.h b/migration/migration.h
index 6af01362d4..4d808563b5 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -536,4 +536,6 @@ int migration_rp_wait(MigrationState *s);
  */
 void migration_rp_kick(MigrationState *s);
 
+bool migration_direct_io_start(QEMUFile *file, Error **errp);
+bool migration_direct_io_finish(QEMUFile *file, Error **errp);
 #endif
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 9ccbbb0099..4796be918f 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -52,6 +52,11 @@ struct QEMUFile {
 
     int last_error;
     Error *last_error_obj;
+    /*
+     * Whether O_DIRECT is in effect. Used to catch code attempting
+     * unaligned IO.
+     */
+    bool dio;
 };
 
 /*
@@ -281,6 +286,9 @@ int qemu_fflush(QEMUFile *f)
     }
     if (f->iovcnt > 0) {
         Error *local_error = NULL;
+
+        assert(!f->dio);
+
         if (qio_channel_writev_all(f->ioc,
                                    f->iov, f->iovcnt,
                                    &local_error) < 0) {
@@ -429,6 +437,8 @@ void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, size_t size,
         return;
     }
 
+    assert(!f->dio);
+
     add_to_iovec(f, buf, size, may_free);
 }
 
@@ -440,6 +450,8 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
         return;
     }
 
+    assert(!f->dio);
+
     while (size > 0) {
         l = IO_BUF_SIZE - f->buf_index;
         if (l > size) {
@@ -558,6 +570,8 @@ off_t qemu_get_offset(QEMUFile *f)
 
 void qemu_put_byte(QEMUFile *f, int v)
 {
+    assert(!f->dio);
+
     if (f->last_error) {
         return;
     }
@@ -865,3 +879,18 @@ int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size)
 
     return 0;
 }
+
+bool qemu_file_set_direct_io(QEMUFile *f, bool enabled)
+{
+    Error *local_err = NULL;
+
+    qio_channel_file_set_direct_io(f->ioc, enabled, &local_err);
+    if (local_err) {
+        qemu_file_set_error_obj(f, -EINVAL, local_err);
+        return false;
+    }
+
+    f->dio = enabled;
+
+    return true;
+}
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 11c2120edd..118853b21c 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -79,5 +79,5 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
                           off_t pos);
 
 QIOChannel *qemu_file_get_ioc(QEMUFile *file);
-
+bool qemu_file_set_direct_io(QEMUFile *f, bool enabled);
 #endif
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
                   ` (16 preceding siblings ...)
  2024-05-23 19:05 ` [PATCH v2 17/18] migration: Add direct-io helpers Fabiano Rosas
@ 2024-05-23 19:05 ` Fabiano Rosas
  2024-06-04 20:56   ` Peter Xu
  17 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-23 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: berrange, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

We've recently added support for direct-io with multifd, which brings
performance benefits, but creates a non-uniform user interface by
coupling direct-io with the multifd capability. This means that users
cannot keep the direct-io flag enabled while disabling multifd.

Libvirt in particular already has support for direct-io and parallel
migration separately from each other, so it would be a regression to
now require both options together. It's relatively simple for QEMU to
add support for direct-io migration without multifd, so let's do this
in order to keep both options decoupled.

We cannot simply enable the O_DIRECT flag, however, because not all IO
performed by the migration thread satisfies the alignment requirements
of O_DIRECT. There are many small read & writes that add headers and
synchronization flags to the stream, which at the moment are required
to always be present.

Fortunately, due to fixed-ram migration there is a discernible moment
where only RAM pages are written to the migration file. Enable
direct-io during that moment.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/ram.c              | 40 ++++++++++++++++++++++++++++--------
 tests/qtest/migration-test.c | 30 +++++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06..5183d1f97c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3131,6 +3131,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     int i;
     int64_t t0;
     int done = 0;
+    Error **errp = NULL;
 
     /*
      * We'll take this lock a little bit long, but it's okay for two reasons.
@@ -3154,6 +3155,10 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
                 goto out;
             }
 
+            if (!migration_direct_io_start(f, errp)) {
+                return -errno;
+            }
+
             t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
             i = 0;
             while ((ret = migration_rate_exceeded(f)) == 0 ||
@@ -3194,6 +3199,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
                 }
                 i++;
             }
+            if (!migration_direct_io_finish(f, errp)) {
+                return -errno;
+            }
         }
     }
 
@@ -3242,7 +3250,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 {
     RAMState **temp = opaque;
     RAMState *rs = *temp;
-    int ret = 0;
+    int ret = 0, pages;
+    Error **errp = NULL;
 
     rs->last_stage = !migration_in_colo_state();
 
@@ -3257,25 +3266,30 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
             return ret;
         }
 
+        if (!migration_direct_io_start(f, errp)) {
+            return -errno;
+        }
+
         /* try transferring iterative blocks of memory */
 
         /* flush all remaining blocks regardless of rate limiting */
         qemu_mutex_lock(&rs->bitmap_mutex);
         while (true) {
-            int pages;
-
             pages = ram_find_and_save_block(rs);
-            /* no more blocks to sent */
-            if (pages == 0) {
+            if (pages <= 0) {
                 break;
             }
-            if (pages < 0) {
-                qemu_mutex_unlock(&rs->bitmap_mutex);
-                return pages;
-            }
         }
         qemu_mutex_unlock(&rs->bitmap_mutex);
 
+        if (!migration_direct_io_finish(f, errp)) {
+            return -errno;
+        }
+
+        if (pages < 0) {
+            return pages;
+        }
+
         ret = rdma_registration_stop(f, RAM_CONTROL_FINISH);
         if (ret < 0) {
             qemu_file_set_error(f, ret);
@@ -3920,6 +3934,10 @@ static bool read_ramblock_mapped_ram(QEMUFile *f, RAMBlock *block,
     void *host;
     size_t read, unread, size;
 
+    if (!migration_direct_io_start(f, errp)) {
+        return false;
+    }
+
     for (set_bit_idx = find_first_bit(bitmap, num_pages);
          set_bit_idx < num_pages;
          set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
@@ -3955,6 +3973,10 @@ static bool read_ramblock_mapped_ram(QEMUFile *f, RAMBlock *block,
         }
     }
 
+    if (!migration_direct_io_finish(f, errp)) {
+        return false;
+    }
+
     return true;
 
 err:
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5ced3b90c9..8c6a122c20 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2245,6 +2245,34 @@ static void test_multifd_file_mapped_ram_dio(void)
     test_file_common(&args, true);
 }
 
+static void *mapped_ram_dio_start(QTestState *from, QTestState *to)
+{
+    migrate_mapped_ram_start(from, to);
+
+    migrate_set_parameter_bool(from, "direct-io", true);
+    migrate_set_parameter_bool(to, "direct-io", true);
+
+    return NULL;
+}
+
+static void test_precopy_file_mapped_ram_dio(void)
+{
+    g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+                                           FILE_TEST_FILENAME);
+    MigrateCommon args = {
+        .connect_uri = uri,
+        .listen_uri = "defer",
+        .start_hook = mapped_ram_dio_start,
+    };
+
+    if (!probe_o_direct_support(tmpfs)) {
+        g_test_skip("Filesystem does not support O_DIRECT");
+        return;
+    }
+
+    test_file_common(&args, true);
+}
+
 #ifndef _WIN32
 static void multifd_mapped_ram_fdset_end(QTestState *from, QTestState *to,
                                          void *opaque)
@@ -3735,6 +3763,8 @@ int main(int argc, char **argv)
 
     migration_test_add("/migration/multifd/file/mapped-ram/dio",
                        test_multifd_file_mapped_ram_dio);
+    migration_test_add("/migration/precopy/file/mapped-ram/dio",
+                       test_precopy_file_mapped_ram_dio);
 
 #ifndef _WIN32
     qtest_add_func("/migration/multifd/file/mapped-ram/fdset",
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
@ 2024-05-24 10:51   ` Prasad Pandit
  2024-05-24 12:30     ` Fabiano Rosas
  2024-05-30 16:11   ` Peter Xu
  2024-06-03 10:20   ` Daniel P. Berrangé
  2 siblings, 1 reply; 60+ messages in thread
From: Prasad Pandit @ 2024-05-24 10:51 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Peter Xu, Claudio Fontana,
	Jim Fehlig

On Fri, 24 May 2024 at 00:38, Fabiano Rosas <farosas@suse.de> wrote:
> This is further indicated by the presence of the 'offset'
> argument, which indicates the start of the region where QEMU is
> allowed to write.
>
> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
> call, which will take the offset into consideration.
>
> +    if (ftruncate(fioc->fd, offset)) {
> +        error_setg_errno(errp, errno,
> +                         "failed to truncate migration file to offset %" PRIx64,
> +                         offset);
> +        object_unref(OBJECT(fioc));
> +        return;
> +    }
> +

* Should 'offset' be checked for > zero while ftruncating? Else it'll
be same as O_TRUNC. Otherwise it looks fine.

Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-24 10:51   ` Prasad Pandit
@ 2024-05-24 12:30     ` Fabiano Rosas
  2024-05-25  6:16       ` Prasad Pandit
  0 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-24 12:30 UTC (permalink / raw)
  To: Prasad Pandit
  Cc: qemu-devel, berrange, armbru, Peter Xu, Claudio Fontana,
	Jim Fehlig

Prasad Pandit <ppandit@redhat.com> writes:

> On Fri, 24 May 2024 at 00:38, Fabiano Rosas <farosas@suse.de> wrote:
>> This is further indicated by the presence of the 'offset'
>> argument, which indicates the start of the region where QEMU is
>> allowed to write.
>>
>> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
>> call, which will take the offset into consideration.
>>
>> +    if (ftruncate(fioc->fd, offset)) {
>> +        error_setg_errno(errp, errno,
>> +                         "failed to truncate migration file to offset %" PRIx64,
>> +                         offset);
>> +        object_unref(OBJECT(fioc));
>> +        return;
>> +    }
>> +
>
> * Should 'offset' be checked for > zero while ftruncating? Else it'll
> be same as O_TRUNC. Otherwise it looks fine.

That's the point. If offset==0 we truncate all the way, if not, we
truncate to the offset.

>
> Reviewed-by: Prasad Pandit <pjp@fedoraproject.org>

Thanks!


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-24 12:30     ` Fabiano Rosas
@ 2024-05-25  6:16       ` Prasad Pandit
  0 siblings, 0 replies; 60+ messages in thread
From: Prasad Pandit @ 2024-05-25  6:16 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Peter Xu, Claudio Fontana,
	Jim Fehlig

On Fri, 24 May 2024 at 18:00, Fabiano Rosas <farosas@suse.de> wrote:
> That's the point. If offset==0 we truncate all the way, if not, we truncate to the offset.

* Yes, I was wondering if the migration file has some data, but still
'offset' ends up being zero(0). If that's unlikely to happen, then we
are good.

Thank you.
---
  - Prasad



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
  2024-05-24 10:51   ` Prasad Pandit
@ 2024-05-30 16:11   ` Peter Xu
  2024-05-31 14:58     ` Fabiano Rosas
  2024-06-03 10:20   ` Daniel P. Berrangé
  2 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-05-30 16:11 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:31PM -0300, Fabiano Rosas wrote:
> When the "file:" migration support was added we missed the special
> case in the qemu_open_old implementation that allows for a particular
> file name format to be used to refer to a set of file descriptors that
> have been previously provided to QEMU via the add-fd QMP command.
> 
> When using this fdset feature, we should not truncate the migration
> file because being given an fd means that the management layer is in
> control of the file and will likely already have some data written to
> it. This is further indicated by the presence of the 'offset'
> argument, which indicates the start of the region where QEMU is
> allowed to write.
> 
> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
> call, which will take the offset into consideration.
> 
> Fixes: 385f510df5 ("migration: file URI offset")
> Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

Right below the change, did we forget to free the ioc if
qio_channel_io_seek() failed?

> ---
>  migration/file.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index ab18ba505a..ba5b5c44ff 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -84,12 +84,19 @@ void file_start_outgoing_migration(MigrationState *s,
>  
>      trace_migration_file_outgoing(filename);
>  
> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> -                                     0600, errp);
> +    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY, 0600, errp);
>      if (!fioc) {
>          return;
>      }
>  
> +    if (ftruncate(fioc->fd, offset)) {
> +        error_setg_errno(errp, errno,
> +                         "failed to truncate migration file to offset %" PRIx64,
> +                         offset);
> +        object_unref(OBJECT(fioc));
> +        return;
> +    }
> +
>      outgoing_args.fname = g_strdup(filename);
>  
>      ioc = QIO_CHANNEL(fioc);
> -- 
> 2.35.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check
  2024-05-23 19:05 ` [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check Fabiano Rosas
@ 2024-05-30 16:14   ` Peter Xu
  2024-06-03 10:21   ` Daniel P. Berrangé
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-30 16:14 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Thu, May 23, 2024 at 04:05:32PM -0300, Fabiano Rosas wrote:
> When doing file migration, QEMU accepts an offset that should be
> skipped when writing the migration stream to the file. The purpose of
> the offset is to allow the management layer to put its own metadata at
> the start of the file.
> 
> We have tests for this in migration-test, but only testing that the
> migration stream starts at the correct offset and not that it actually
> leaves the data intact. Unsurprisingly, there's been a bug in that
> area that the tests didn't catch.
> 
> Fix the tests to write some data to the offset region and check that
> it's actually there after the migration.
> 
> While here, switch to using g_get_file_contents() which is more
> portable than mmap().
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset
  2024-05-23 19:05 ` [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset Fabiano Rosas
@ 2024-05-30 16:18   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-30 16:18 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Thu, May 23, 2024 at 04:05:33PM -0300, Fabiano Rosas wrote:
> Add a test for file migration using fdset. The passing of fds is more
> complex than using a file path. This is also the scenario where it's
> most important we ensure that the initial migration stream offset is
> respected because the fdset interface is the one used by the
> management layer when providing a non empty migration file.
> 
> Note that fd passing is not available on Windows, so anything that
> uses add-fd needs to exclude that platform.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free
  2024-05-23 19:05 ` [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free Fabiano Rosas
@ 2024-05-30 20:03   ` Peter Xu
  2024-05-31 15:01     ` Fabiano Rosas
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-05-30 20:03 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:35PM -0300, Fabiano Rosas wrote:
> Introduce two new functions to remove and free no longer used fds and
> fdsets.
> 
> We need those to decouple the remove/free routines from
> monitor_fdset_cleanup() which will go away in the next patches.
> 
> The new functions:
> 
> - monitor_fdset_free() will be used when a monitor connection closes
>   and when an fd is removed to cleanup any fdset that is now empty.
> 
> - monitor_fdset_fd_free() will be used to remove one or more fds that
>   have been explicitly targeted by qmp_remove_fd().
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

One nitpick below.

> ---
>  monitor/fds.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/monitor/fds.c b/monitor/fds.c
> index fb9f58c056..101e21720d 100644
> --- a/monitor/fds.c
> +++ b/monitor/fds.c
> @@ -167,6 +167,22 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>      return -1;
>  }
>  
> +static void monitor_fdset_free(MonFdset *mon_fdset)
> +{
> +    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> +        QLIST_REMOVE(mon_fdset, next);
> +        g_free(mon_fdset);
> +    }
> +}

Would monitor_fdset_free_if_empty() (or similar) slightly better?

static void monitor_fdset_free(MonFdset *mon_fdset)
{
    QLIST_REMOVE(mon_fdset, next);
    g_free(mon_fdset);
}

static void monitor_fdset_free_if_empty(MonFdset *mon_fdset)
{
    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
        monitor_fdset_free(mon_fdset);
    }
}

> +
> +static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
> +{
> +    close(mon_fdset_fd->fd);
> +    g_free(mon_fdset_fd->opaque);
> +    QLIST_REMOVE(mon_fdset_fd, next);
> +    g_free(mon_fdset_fd);
> +}
> +
>  static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>  {
>      MonFdsetFd *mon_fdset_fd;
> @@ -176,17 +192,11 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>          if ((mon_fdset_fd->removed ||
>                  (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>                  runstate_is_running()) {
> -            close(mon_fdset_fd->fd);
> -            g_free(mon_fdset_fd->opaque);
> -            QLIST_REMOVE(mon_fdset_fd, next);
> -            g_free(mon_fdset_fd);
> +            monitor_fdset_fd_free(mon_fdset_fd);
>          }
>      }
>  
> -    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> -        QLIST_REMOVE(mon_fdset, next);
> -        g_free(mon_fdset);
> -    }
> +    monitor_fdset_free(mon_fdset);
>  }
>  
>  void monitor_fdsets_cleanup(void)
> -- 
> 2.35.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
  2024-05-23 19:05 ` [PATCH v2 06/18] monitor: Stop removing non-duplicated fds Fabiano Rosas
@ 2024-05-30 21:05   ` Peter Xu
  2024-05-31 15:25     ` Fabiano Rosas
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-05-30 21:05 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Dr. David Alan Gilbert

On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
> We've been up until now cleaning up any file descriptors that have
> been passed into QEMU and never duplicated[1,2]. A file descriptor
> without duplicates indicates that no part of QEMU has made use of
> it. This approach is starting to show some cracks now that we're
> starting to consume fds from the migration code:
> 
> - Doing cleanup every time the last monitor connection closes works to
>   reap unused fds, but also has the side effect of forcing the
>   management layer to pass the file descriptors again in case of a
>   disconnect/re-connect, if that happened to be the only monitor
>   connection.
> 
>   Another side effect is that removing an fd with qmp_remove_fd() is
>   effectively delayed until the last monitor connection closes.
> 
>   The reliance on mon_refcount is also problematic because it's racy.
> 
> - Checking runstate_is_running() skips the cleanup unless the VM is
>   running and avoids premature cleanup of the fds, but also has the
>   side effect of blocking the legitimate removal of an fd via
>   qmp_remove_fd() if the VM happens to be in another state.
> 
>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
>   because requesting a removal at a bad time (guest stopped) might
>   cause an fd to never be removed, or to be removed at a much later
>   point in time, causing the query command to continue showing the
>   supposedly removed fd/fdset.
> 
> Note that file descriptors that *have* been duplicated are owned by
> the code that uses them and will be removed after qemu_close() is
> called. Therefore we've decided that the best course of action to
> avoid the undesired side-effects is to stop managing non-duplicated
> file descriptors.
> 
> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  monitor/fds.c              | 15 ++++++++-------
>  monitor/hmp.c              |  2 --
>  monitor/monitor-internal.h |  1 -
>  monitor/qmp.c              |  2 --
>  4 files changed, 8 insertions(+), 12 deletions(-)
> 
> diff --git a/monitor/fds.c b/monitor/fds.c
> index 101e21720d..f7b98814fa 100644
> --- a/monitor/fds.c
> +++ b/monitor/fds.c
> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>  
>  static void monitor_fdset_free(MonFdset *mon_fdset)
>  {
> +    /*
> +     * Only remove an empty fdset. The fds are owned by the user and
> +     * should have been removed with qmp_remove_fd(). The dup_fds are
> +     * owned by QEMU and should have been removed with qemu_close().
> +     */
>      if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>          QLIST_REMOVE(mon_fdset, next);
>          g_free(mon_fdset);
> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>      MonFdsetFd *mon_fdset_fd_next;
>  
>      QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
> -        if ((mon_fdset_fd->removed ||
> -                (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
> -                runstate_is_running()) {
> +        if (mon_fdset_fd->removed) {

I don't know the code well, but I like it.

>              monitor_fdset_fd_free(mon_fdset_fd);
>          }
>      }
> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
>  
>      QEMU_LOCK_GUARD(&mon_fdsets_lock);
>      QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
> -        monitor_fdset_cleanup(mon_fdset);
> +        monitor_fdset_free(mon_fdset);
>      }
>  }
>  
> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
>              if (mon_fdset_fd_dup->fd == dup_fd) {
>                  QLIST_REMOVE(mon_fdset_fd_dup, next);
>                  g_free(mon_fdset_fd_dup);
> -                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
> -                    monitor_fdset_cleanup(mon_fdset);
> -                }
> +                monitor_fdset_free(mon_fdset);

This and above changes are not crystal clear to me where the _cleanup()
does extra check "removed" and clean those fds.

I think it'll make it easier for me to understand if this one and the next
are squashed together.  But maybe it's simply because I didn't fully
understand.

>                  return;
>              }
>          }
> diff --git a/monitor/hmp.c b/monitor/hmp.c
> index 69c1b7e98a..460e8832f6 100644
> --- a/monitor/hmp.c
> +++ b/monitor/hmp.c
> @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
>              monitor_resume(mon);
>          }
>          qemu_mutex_unlock(&mon->mon_lock);
> -        mon_refcount++;
>          break;
>  
>      case CHR_EVENT_CLOSED:
> -        mon_refcount--;
>          monitor_fdsets_cleanup();
>          break;
>  
> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
> index 252de85681..cb628f681d 100644
> --- a/monitor/monitor-internal.h
> +++ b/monitor/monitor-internal.h
> @@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
>  extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
>  extern QemuMutex monitor_lock;
>  extern MonitorList mon_list;
> -extern int mon_refcount;
>  
>  extern HMPCommand hmp_cmds[];
>  
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index a239945e8d..5e538f34c0 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
>          data = qmp_greeting(mon);
>          qmp_send_response(mon, data);
>          qobject_unref(data);
> -        mon_refcount++;
>          break;
>      case CHR_EVENT_CLOSED:
>          /*
> @@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
>          json_message_parser_destroy(&mon->parser);
>          json_message_parser_init(&mon->parser, handle_qmp_command,
>                                   mon, NULL);
> -        mon_refcount--;
>          monitor_fdsets_cleanup();
>          break;
>      case CHR_EVENT_BREAK:

I like this too when mon_refcount can be dropped.  It turns out I like this
patch and the next a lot, and I hope nothing will break.

Aside, you seem to have forgot removal of the "int mon_refcount" in
monitor.c.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add
  2024-05-23 19:05 ` [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add Fabiano Rosas
@ 2024-05-30 21:08   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-30 21:08 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Paolo Bonzini

On Thu, May 23, 2024 at 04:05:38PM -0300, Fabiano Rosas wrote:
> I'm keeping the EACCES because callers expect to be able to look at
> errno.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file
  2024-05-23 19:05 ` [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file Fabiano Rosas
@ 2024-05-30 21:10   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-30 21:10 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:39PM -0300, Fabiano Rosas wrote:
> We want to make use of the Error object to report fdset errors from
> qemu_open_internal() and passing the error pointer to qemu_open_old()
> would require changing all callers. Move the file channel to the new
> API instead.
> 
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 10/18] migration: Add direct-io parameter
  2024-05-23 19:05 ` [PATCH v2 10/18] migration: Add direct-io parameter Fabiano Rosas
@ 2024-05-30 21:12   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-30 21:12 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Eric Blake

On Thu, May 23, 2024 at 04:05:40PM -0300, Fabiano Rosas wrote:
> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
> 
> This is currently only used with the mapped-ram migration that has a
> clear window guaranteed to perform aligned writes.
> 
> Acked-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 11/18] migration/multifd: Add direct-io support
  2024-05-23 19:05 ` [PATCH v2 11/18] migration/multifd: Add direct-io support Fabiano Rosas
@ 2024-05-30 21:35   ` Peter Xu
  2024-05-31 15:27     ` Fabiano Rosas
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-05-30 21:35 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:41PM -0300, Fabiano Rosas wrote:
> When multifd is used along with mapped-ram, we can take benefit of a
> filesystem that supports the O_DIRECT flag and perform direct I/O in
> the multifd threads. This brings a significant performance improvement
> because direct-io writes bypass the page cache which would otherwise
> be thrashed by the multifd data which is unlikely to be needed again
> in a short period of time.
> 
> To be able to use a multifd channel opened with O_DIRECT, we must
> ensure that a certain aligment is used. Filesystems usually require a
> block-size alignment for direct I/O. The way to achieve this is by
> enabling the mapped-ram feature, which already aligns its I/O properly
> (see MAPPED_RAM_FILE_OFFSET_ALIGNMENT at ram.c).
> 
> By setting O_DIRECT on the multifd channels, all writes to the same
> file descriptor need to be aligned as well, even the ones that come
> from outside multifd, such as the QEMUFile I/O from the main migration
> code. This makes it impossible to use the same file descriptor for the
> QEMUFile and for the multifd channels. The various flags and metadata
> written by the main migration code will always be unaligned by virtue
> of their small size. To workaround this issue, we'll require a second
> file descriptor to be used exclusively for direct I/O.
> 
> The second file descriptor can be obtained by QEMU by re-opening the
> migration file (already possible), or by being provided by the user or
> management application (support to be added in future patches).
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/file.c      | 31 ++++++++++++++++++++++++++-----
>  migration/file.h      |  1 -
>  migration/migration.c | 23 +++++++++++++++++++++++
>  3 files changed, 49 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/file.c b/migration/file.c
> index ba5b5c44ff..ac4d206492 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -50,12 +50,31 @@ void file_cleanup_outgoing_migration(void)
>      outgoing_args.fname = NULL;
>  }
>  
> +static void file_enable_direct_io(int *flags)
> +{
> +#ifdef O_DIRECT
> +    if (migrate_direct_io()) {
> +        *flags |= O_DIRECT;
> +    }
> +#else
> +    /* it should have been rejected when setting the parameter */
> +    g_assert_not_reached();
> +#endif
> +}
> +
>  bool file_send_channel_create(gpointer opaque, Error **errp)
>  {
>      QIOChannelFile *ioc;
>      int flags = O_WRONLY;
>      bool ret = true;
>  
> +    /*
> +     * Attempt to enable O_DIRECT for the secondary channels. These
> +     * are used for sending ram pages and writes should be guaranteed
> +     * to be aligned to at least page size.
> +     */
> +    file_enable_direct_io(&flags);

Call this only if enabled?  That looks clearer, IMHO:

       if (migrate_direct_io()) {
           file_enable_direct_io(&flags);
       }

Then:

static void file_enable_direct_io(int *flags)
{
#ifdef O_DIRECT
    *flags |= O_DIRECT;
#else
    /* it should have been rejected when setting the parameter */
    g_assert_not_reached();
#endif
}

If you remember we have similar multifd calls, and I hoped all multifd
functions are only invoked when multifd is enabled first.  Same thing.

> +
>      ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, errp);
>      if (!ioc) {
>          ret = false;
> @@ -116,21 +135,23 @@ static gboolean file_accept_incoming_migration(QIOChannel *ioc,
>      return G_SOURCE_REMOVE;
>  }
>  
> -void file_create_incoming_channels(QIOChannel *ioc, Error **errp)
> +static void file_create_incoming_channels(QIOChannel *ioc, char *filename,
> +                                          Error **errp)
>  {
> -    int i, fd, channels = 1;
> +    int i, channels = 1;
>      g_autofree QIOChannel **iocs = NULL;
> +    int flags = O_RDONLY;
>  
>      if (migrate_multifd()) {
>          channels += migrate_multifd_channels();
> +        file_enable_direct_io(&flags);

Same here.

Other than that looks good.

Thanks,

>      }
>  
>      iocs = g_new0(QIOChannel *, channels);
> -    fd = QIO_CHANNEL_FILE(ioc)->fd;
>      iocs[0] = ioc;
>  
>      for (i = 1; i < channels; i++) {
> -        QIOChannelFile *fioc = qio_channel_file_new_dupfd(fd, errp);
> +        QIOChannelFile *fioc = qio_channel_file_new_path(filename, flags, 0, errp);
>  
>          if (!fioc) {
>              while (i) {
> @@ -170,7 +191,7 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
>          return;
>      }
>  
> -    file_create_incoming_channels(QIO_CHANNEL(fioc), errp);
> +    file_create_incoming_channels(QIO_CHANNEL(fioc), filename, errp);
>  }
>  
>  int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
> diff --git a/migration/file.h b/migration/file.h
> index 7699c04677..9f71e87f74 100644
> --- a/migration/file.h
> +++ b/migration/file.h
> @@ -20,7 +20,6 @@ void file_start_outgoing_migration(MigrationState *s,
>  int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
>  void file_cleanup_outgoing_migration(void);
>  bool file_send_channel_create(gpointer opaque, Error **errp);
> -void file_create_incoming_channels(QIOChannel *ioc, Error **errp);
>  int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
>                              int niov, RAMBlock *block, Error **errp);
>  int multifd_file_recv_data(MultiFDRecvParams *p, Error **errp);
> diff --git a/migration/migration.c b/migration/migration.c
> index e1b269624c..e03c80b3aa 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -155,6 +155,16 @@ static bool migration_needs_seekable_channel(void)
>      return migrate_mapped_ram();
>  }
>  
> +static bool migration_needs_extra_fds(void)
> +{
> +    /*
> +     * When doing direct-io, multifd requires two different,
> +     * non-duplicated file descriptors so we can use one of them for
> +     * unaligned IO.
> +     */
> +    return migrate_multifd() && migrate_direct_io();
> +}
> +
>  static bool transport_supports_seeking(MigrationAddress *addr)
>  {
>      if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
> @@ -164,6 +174,12 @@ static bool transport_supports_seeking(MigrationAddress *addr)
>      return false;
>  }
>  
> +static bool transport_supports_extra_fds(MigrationAddress *addr)
> +{
> +    /* file: works because QEMU can open it multiple times */
> +    return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
> +}
> +
>  static bool
>  migration_channels_and_transport_compatible(MigrationAddress *addr,
>                                              Error **errp)
> @@ -180,6 +196,13 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
>          return false;
>      }
>  
> +    if (migration_needs_extra_fds() &&
> +        !transport_supports_extra_fds(addr)) {
> +        error_setg(errp,
> +                   "Migration requires a transport that allows for extra fds (e.g. file)");
> +        return false;
> +    }
> +
>      return true;
>  }
>  
> -- 
> 2.35.3
> 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT
  2024-05-23 19:05 ` [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT Fabiano Rosas
@ 2024-05-30 21:41   ` Peter Xu
  2024-05-31 15:42     ` Fabiano Rosas
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-05-30 21:41 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote:
> We're about to enable the use of O_DIRECT in the migration code and
> due to the alignment restrictions imposed by filesystems we need to
> make sure the flag is only used when doing aligned IO.
> 
> The migration will do parallel IO to different regions of a file, so
> we need to use more than one file descriptor. Those cannot be obtained
> by duplicating (dup()) since duplicated file descriptors share the
> file status flags, including O_DIRECT. If one migration channel does
> unaligned IO while another sets O_DIRECT to do aligned IO, the
> filesystem would fail the unaligned operation.
> 
> The add-fd QMP command along with the fdset code are specifically
> designed to allow the user to pass a set of file descriptors with
> different access flags into QEMU to be later fetched by code that
> needs to alternate between those flags when doing IO.
> 
> Extend the fdset matching to behave the same with the O_DIRECT flag.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

One thing I am confused but totally irrelevant to this specific change.

I wonder why do we need dupfds at all, and why client needs to remove-fd at
all.

It's about what would go wrong if qmp client only add-fd, then if it's
consumed by anything, it gets moved from "fds" list to "dupfds" list.  The
thing is I don't expect the client should pass over any fd that will not be
consumed.  Then if it's always consumed, why bother dup() at all, and why
bother an explicit remove-fd.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-30 16:11   ` Peter Xu
@ 2024-05-31 14:58     ` Fabiano Rosas
  0 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-31 14:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:31PM -0300, Fabiano Rosas wrote:
>> When the "file:" migration support was added we missed the special
>> case in the qemu_open_old implementation that allows for a particular
>> file name format to be used to refer to a set of file descriptors that
>> have been previously provided to QEMU via the add-fd QMP command.
>> 
>> When using this fdset feature, we should not truncate the migration
>> file because being given an fd means that the management layer is in
>> control of the file and will likely already have some data written to
>> it. This is further indicated by the presence of the 'offset'
>> argument, which indicates the start of the region where QEMU is
>> allowed to write.
>> 
>> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
>> call, which will take the offset into consideration.
>> 
>> Fixes: 385f510df5 ("migration: file URI offset")
>> Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks

>
> Right below the change, did we forget to free the ioc if
> qio_channel_io_seek() failed?
>

Yep, looks like it. I'll fix it.

>> ---
>>  migration/file.c | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>> 
>> diff --git a/migration/file.c b/migration/file.c
>> index ab18ba505a..ba5b5c44ff 100644
>> --- a/migration/file.c
>> +++ b/migration/file.c
>> @@ -84,12 +84,19 @@ void file_start_outgoing_migration(MigrationState *s,
>>  
>>      trace_migration_file_outgoing(filename);
>>  
>> -    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
>> -                                     0600, errp);
>> +    fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY, 0600, errp);
>>      if (!fioc) {
>>          return;
>>      }
>>  
>> +    if (ftruncate(fioc->fd, offset)) {
>> +        error_setg_errno(errp, errno,
>> +                         "failed to truncate migration file to offset %" PRIx64,
>> +                         offset);
>> +        object_unref(OBJECT(fioc));
>> +        return;
>> +    }
>> +
>>      outgoing_args.fname = g_strdup(filename);
>>  
>>      ioc = QIO_CHANNEL(fioc);
>> -- 
>> 2.35.3
>> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free
  2024-05-30 20:03   ` Peter Xu
@ 2024-05-31 15:01     ` Fabiano Rosas
  0 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-31 15:01 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:35PM -0300, Fabiano Rosas wrote:
>> Introduce two new functions to remove and free no longer used fds and
>> fdsets.
>> 
>> We need those to decouple the remove/free routines from
>> monitor_fdset_cleanup() which will go away in the next patches.
>> 
>> The new functions:
>> 
>> - monitor_fdset_free() will be used when a monitor connection closes
>>   and when an fd is removed to cleanup any fdset that is now empty.
>> 
>> - monitor_fdset_fd_free() will be used to remove one or more fds that
>>   have been explicitly targeted by qmp_remove_fd().
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One nitpick below.
>
>> ---
>>  monitor/fds.c | 26 ++++++++++++++++++--------
>>  1 file changed, 18 insertions(+), 8 deletions(-)
>> 
>> diff --git a/monitor/fds.c b/monitor/fds.c
>> index fb9f58c056..101e21720d 100644
>> --- a/monitor/fds.c
>> +++ b/monitor/fds.c
>> @@ -167,6 +167,22 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>>      return -1;
>>  }
>>  
>> +static void monitor_fdset_free(MonFdset *mon_fdset)
>> +{
>> +    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>> +        QLIST_REMOVE(mon_fdset, next);
>> +        g_free(mon_fdset);
>> +    }
>> +}
>
> Would monitor_fdset_free_if_empty() (or similar) slightly better?

Yes, I'll use that.

>
> static void monitor_fdset_free(MonFdset *mon_fdset)
> {
>     QLIST_REMOVE(mon_fdset, next);
>     g_free(mon_fdset);
> }
>
> static void monitor_fdset_free_if_empty(MonFdset *mon_fdset)
> {
>     if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>         monitor_fdset_free(mon_fdset);
>     }
> }
>
>> +
>> +static void monitor_fdset_fd_free(MonFdsetFd *mon_fdset_fd)
>> +{
>> +    close(mon_fdset_fd->fd);
>> +    g_free(mon_fdset_fd->opaque);
>> +    QLIST_REMOVE(mon_fdset_fd, next);
>> +    g_free(mon_fdset_fd);
>> +}
>> +
>>  static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>>  {
>>      MonFdsetFd *mon_fdset_fd;
>> @@ -176,17 +192,11 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>>          if ((mon_fdset_fd->removed ||
>>                  (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>>                  runstate_is_running()) {
>> -            close(mon_fdset_fd->fd);
>> -            g_free(mon_fdset_fd->opaque);
>> -            QLIST_REMOVE(mon_fdset_fd, next);
>> -            g_free(mon_fdset_fd);
>> +            monitor_fdset_fd_free(mon_fdset_fd);
>>          }
>>      }
>>  
>> -    if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>> -        QLIST_REMOVE(mon_fdset, next);
>> -        g_free(mon_fdset);
>> -    }
>> +    monitor_fdset_free(mon_fdset);
>>  }
>>  
>>  void monitor_fdsets_cleanup(void)
>> -- 
>> 2.35.3
>> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
  2024-05-30 21:05   ` Peter Xu
@ 2024-05-31 15:25     ` Fabiano Rosas
  2024-05-31 15:56       ` Peter Xu
  2024-06-04 23:40       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-31 15:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Dr. David Alan Gilbert

Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
>> We've been up until now cleaning up any file descriptors that have
>> been passed into QEMU and never duplicated[1,2]. A file descriptor
>> without duplicates indicates that no part of QEMU has made use of
>> it. This approach is starting to show some cracks now that we're
>> starting to consume fds from the migration code:
>> 
>> - Doing cleanup every time the last monitor connection closes works to
>>   reap unused fds, but also has the side effect of forcing the
>>   management layer to pass the file descriptors again in case of a
>>   disconnect/re-connect, if that happened to be the only monitor
>>   connection.
>> 
>>   Another side effect is that removing an fd with qmp_remove_fd() is
>>   effectively delayed until the last monitor connection closes.
>> 
>>   The reliance on mon_refcount is also problematic because it's racy.
>> 
>> - Checking runstate_is_running() skips the cleanup unless the VM is
>>   running and avoids premature cleanup of the fds, but also has the
>>   side effect of blocking the legitimate removal of an fd via
>>   qmp_remove_fd() if the VM happens to be in another state.
>> 
>>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
>>   because requesting a removal at a bad time (guest stopped) might
>>   cause an fd to never be removed, or to be removed at a much later
>>   point in time, causing the query command to continue showing the
>>   supposedly removed fd/fdset.
>> 
>> Note that file descriptors that *have* been duplicated are owned by
>> the code that uses them and will be removed after qemu_close() is
>> called. Therefore we've decided that the best course of action to
>> avoid the undesired side-effects is to stop managing non-duplicated
>> file descriptors.
>> 
>> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
>> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  monitor/fds.c              | 15 ++++++++-------
>>  monitor/hmp.c              |  2 --
>>  monitor/monitor-internal.h |  1 -
>>  monitor/qmp.c              |  2 --
>>  4 files changed, 8 insertions(+), 12 deletions(-)
>> 
>> diff --git a/monitor/fds.c b/monitor/fds.c
>> index 101e21720d..f7b98814fa 100644
>> --- a/monitor/fds.c
>> +++ b/monitor/fds.c
>> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>>  
>>  static void monitor_fdset_free(MonFdset *mon_fdset)
>>  {
>> +    /*
>> +     * Only remove an empty fdset. The fds are owned by the user and
>> +     * should have been removed with qmp_remove_fd(). The dup_fds are
>> +     * owned by QEMU and should have been removed with qemu_close().
>> +     */
>>      if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>>          QLIST_REMOVE(mon_fdset, next);
>>          g_free(mon_fdset);
>> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>>      MonFdsetFd *mon_fdset_fd_next;
>>  
>>      QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
>> -        if ((mon_fdset_fd->removed ||
>> -                (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>> -                runstate_is_running()) {
>> +        if (mon_fdset_fd->removed) {
>
> I don't know the code well, but I like it.
>
>>              monitor_fdset_fd_free(mon_fdset_fd);
>>          }
>>      }
>> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
>>  
>>      QEMU_LOCK_GUARD(&mon_fdsets_lock);
>>      QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
>> -        monitor_fdset_cleanup(mon_fdset);
>> +        monitor_fdset_free(mon_fdset);
>>      }
>>  }
>>  
>> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
>>              if (mon_fdset_fd_dup->fd == dup_fd) {
>>                  QLIST_REMOVE(mon_fdset_fd_dup, next);
>>                  g_free(mon_fdset_fd_dup);
>> -                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
>> -                    monitor_fdset_cleanup(mon_fdset);
>> -                }
>> +                monitor_fdset_free(mon_fdset);
>
> This and above changes are not crystal clear to me where the _cleanup()
> does extra check "removed" and clean those fds.
>
> I think it'll make it easier for me to understand if this one and the next
> are squashed together.  But maybe it's simply because I didn't fully
> understand.

monitor_fdsets_cleanup() was doing three things previously:

1- Remove the removed=true fds. This is weird, but ok.

2- Remove fds from an fdset that has an empty dup_fds list, but only if
the guest is not running and the monitor is closed. This is problematic.

3- Remove/free fdsets that have become empty due to the above
removals. This is ok.

This patch covers (2), because that is the only change that has a
complex reasoning behind it and we need to document that without
conflating it with the rest of the changes.

Since (3) is still a reasonable thing to do, this patch merely keeps it
around, but using the helpers introduced in the previous patch.

The next patch removed the need for (1), making the removal immediate
instead of delayed. It has it's own much less complex reasoning, which
is: "we don't need to wait to remove the fd".

I hope this clarifies a bit. I would prefer if we kept this and the next
patch separate to avoid having a single patch that does too many
things. I hope to be as explicit as possible with the reason behind
these changes to avoid putting future people in the situation that we
are in now, i.e. having to guess at the reasons behind this code.

If you still think we should squash or if you have more questions, let
me know.

>>                  return;
>>              }
>>          }
>> diff --git a/monitor/hmp.c b/monitor/hmp.c
>> index 69c1b7e98a..460e8832f6 100644
>> --- a/monitor/hmp.c
>> +++ b/monitor/hmp.c
>> @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
>>              monitor_resume(mon);
>>          }
>>          qemu_mutex_unlock(&mon->mon_lock);
>> -        mon_refcount++;
>>          break;
>>  
>>      case CHR_EVENT_CLOSED:
>> -        mon_refcount--;
>>          monitor_fdsets_cleanup();
>>          break;
>>  
>> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
>> index 252de85681..cb628f681d 100644
>> --- a/monitor/monitor-internal.h
>> +++ b/monitor/monitor-internal.h
>> @@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
>>  extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
>>  extern QemuMutex monitor_lock;
>>  extern MonitorList mon_list;
>> -extern int mon_refcount;
>>  
>>  extern HMPCommand hmp_cmds[];
>>  
>> diff --git a/monitor/qmp.c b/monitor/qmp.c
>> index a239945e8d..5e538f34c0 100644
>> --- a/monitor/qmp.c
>> +++ b/monitor/qmp.c
>> @@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
>>          data = qmp_greeting(mon);
>>          qmp_send_response(mon, data);
>>          qobject_unref(data);
>> -        mon_refcount++;
>>          break;
>>      case CHR_EVENT_CLOSED:
>>          /*
>> @@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
>>          json_message_parser_destroy(&mon->parser);
>>          json_message_parser_init(&mon->parser, handle_qmp_command,
>>                                   mon, NULL);
>> -        mon_refcount--;
>>          monitor_fdsets_cleanup();
>>          break;
>>      case CHR_EVENT_BREAK:
>
> I like this too when mon_refcount can be dropped.  It turns out I like this
> patch and the next a lot, and I hope nothing will break.
>
> Aside, you seem to have forgot removal of the "int mon_refcount" in
> monitor.c.

Yes, I'll fix that. Thanks.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 11/18] migration/multifd: Add direct-io support
  2024-05-30 21:35   ` Peter Xu
@ 2024-05-31 15:27     ` Fabiano Rosas
  0 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-31 15:27 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:41PM -0300, Fabiano Rosas wrote:
>> When multifd is used along with mapped-ram, we can take benefit of a
>> filesystem that supports the O_DIRECT flag and perform direct I/O in
>> the multifd threads. This brings a significant performance improvement
>> because direct-io writes bypass the page cache which would otherwise
>> be thrashed by the multifd data which is unlikely to be needed again
>> in a short period of time.
>> 
>> To be able to use a multifd channel opened with O_DIRECT, we must
>> ensure that a certain aligment is used. Filesystems usually require a
>> block-size alignment for direct I/O. The way to achieve this is by
>> enabling the mapped-ram feature, which already aligns its I/O properly
>> (see MAPPED_RAM_FILE_OFFSET_ALIGNMENT at ram.c).
>> 
>> By setting O_DIRECT on the multifd channels, all writes to the same
>> file descriptor need to be aligned as well, even the ones that come
>> from outside multifd, such as the QEMUFile I/O from the main migration
>> code. This makes it impossible to use the same file descriptor for the
>> QEMUFile and for the multifd channels. The various flags and metadata
>> written by the main migration code will always be unaligned by virtue
>> of their small size. To workaround this issue, we'll require a second
>> file descriptor to be used exclusively for direct I/O.
>> 
>> The second file descriptor can be obtained by QEMU by re-opening the
>> migration file (already possible), or by being provided by the user or
>> management application (support to be added in future patches).
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/file.c      | 31 ++++++++++++++++++++++++++-----
>>  migration/file.h      |  1 -
>>  migration/migration.c | 23 +++++++++++++++++++++++
>>  3 files changed, 49 insertions(+), 6 deletions(-)
>> 
>> diff --git a/migration/file.c b/migration/file.c
>> index ba5b5c44ff..ac4d206492 100644
>> --- a/migration/file.c
>> +++ b/migration/file.c
>> @@ -50,12 +50,31 @@ void file_cleanup_outgoing_migration(void)
>>      outgoing_args.fname = NULL;
>>  }
>>  
>> +static void file_enable_direct_io(int *flags)
>> +{
>> +#ifdef O_DIRECT
>> +    if (migrate_direct_io()) {
>> +        *flags |= O_DIRECT;
>> +    }
>> +#else
>> +    /* it should have been rejected when setting the parameter */
>> +    g_assert_not_reached();
>> +#endif
>> +}
>> +
>>  bool file_send_channel_create(gpointer opaque, Error **errp)
>>  {
>>      QIOChannelFile *ioc;
>>      int flags = O_WRONLY;
>>      bool ret = true;
>>  
>> +    /*
>> +     * Attempt to enable O_DIRECT for the secondary channels. These
>> +     * are used for sending ram pages and writes should be guaranteed
>> +     * to be aligned to at least page size.
>> +     */
>> +    file_enable_direct_io(&flags);
>
> Call this only if enabled?  That looks clearer, IMHO:
>
>        if (migrate_direct_io()) {
>            file_enable_direct_io(&flags);
>        }

Sure



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT
  2024-05-30 21:41   ` Peter Xu
@ 2024-05-31 15:42     ` Fabiano Rosas
  2024-05-31 15:58       ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-05-31 15:42 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote:
>> We're about to enable the use of O_DIRECT in the migration code and
>> due to the alignment restrictions imposed by filesystems we need to
>> make sure the flag is only used when doing aligned IO.
>> 
>> The migration will do parallel IO to different regions of a file, so
>> we need to use more than one file descriptor. Those cannot be obtained
>> by duplicating (dup()) since duplicated file descriptors share the
>> file status flags, including O_DIRECT. If one migration channel does
>> unaligned IO while another sets O_DIRECT to do aligned IO, the
>> filesystem would fail the unaligned operation.
>> 
>> The add-fd QMP command along with the fdset code are specifically
>> designed to allow the user to pass a set of file descriptors with
>> different access flags into QEMU to be later fetched by code that
>> needs to alternate between those flags when doing IO.
>> 
>> Extend the fdset matching to behave the same with the O_DIRECT flag.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> One thing I am confused but totally irrelevant to this specific change.
>
> I wonder why do we need dupfds at all, and why client needs to remove-fd at
> all.

This answer was lost to time.

>
> It's about what would go wrong if qmp client only add-fd, then if it's
> consumed by anything, it gets moved from "fds" list to "dupfds" list.  The
> thing is I don't expect the client should pass over any fd that will not be
> consumed.  Then if it's always consumed, why bother dup() at all, and why
> bother an explicit remove-fd.

With the lack of documentation, I can only imagine the code was
initially written to be more flexible, but we ended up never needing the
extra flexibility. Maybe we could change that transparently in the
future and deprecate qmp_remove_fd(). I'm thinking, even for the
mapped-ram work, we might not need libvirt to call that function ever.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
  2024-05-31 15:25     ` Fabiano Rosas
@ 2024-05-31 15:56       ` Peter Xu
  2024-06-04 23:40       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-31 15:56 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Dr. David Alan Gilbert

On Fri, May 31, 2024 at 12:25:52PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
> >> We've been up until now cleaning up any file descriptors that have
> >> been passed into QEMU and never duplicated[1,2]. A file descriptor
> >> without duplicates indicates that no part of QEMU has made use of
> >> it. This approach is starting to show some cracks now that we're
> >> starting to consume fds from the migration code:
> >> 
> >> - Doing cleanup every time the last monitor connection closes works to
> >>   reap unused fds, but also has the side effect of forcing the
> >>   management layer to pass the file descriptors again in case of a
> >>   disconnect/re-connect, if that happened to be the only monitor
> >>   connection.
> >> 
> >>   Another side effect is that removing an fd with qmp_remove_fd() is
> >>   effectively delayed until the last monitor connection closes.
> >> 
> >>   The reliance on mon_refcount is also problematic because it's racy.
> >> 
> >> - Checking runstate_is_running() skips the cleanup unless the VM is
> >>   running and avoids premature cleanup of the fds, but also has the
> >>   side effect of blocking the legitimate removal of an fd via
> >>   qmp_remove_fd() if the VM happens to be in another state.
> >> 
> >>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
> >>   because requesting a removal at a bad time (guest stopped) might
> >>   cause an fd to never be removed, or to be removed at a much later
> >>   point in time, causing the query command to continue showing the
> >>   supposedly removed fd/fdset.
> >> 
> >> Note that file descriptors that *have* been duplicated are owned by
> >> the code that uses them and will be removed after qemu_close() is
> >> called. Therefore we've decided that the best course of action to
> >> avoid the undesired side-effects is to stop managing non-duplicated
> >> file descriptors.
> >> 
> >> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
> >> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >>  monitor/fds.c              | 15 ++++++++-------
> >>  monitor/hmp.c              |  2 --
> >>  monitor/monitor-internal.h |  1 -
> >>  monitor/qmp.c              |  2 --
> >>  4 files changed, 8 insertions(+), 12 deletions(-)
> >> 
> >> diff --git a/monitor/fds.c b/monitor/fds.c
> >> index 101e21720d..f7b98814fa 100644
> >> --- a/monitor/fds.c
> >> +++ b/monitor/fds.c
> >> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
> >>  
> >>  static void monitor_fdset_free(MonFdset *mon_fdset)
> >>  {
> >> +    /*
> >> +     * Only remove an empty fdset. The fds are owned by the user and
> >> +     * should have been removed with qmp_remove_fd(). The dup_fds are
> >> +     * owned by QEMU and should have been removed with qemu_close().
> >> +     */
> >>      if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> >>          QLIST_REMOVE(mon_fdset, next);
> >>          g_free(mon_fdset);
> >> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
> >>      MonFdsetFd *mon_fdset_fd_next;
> >>  
> >>      QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
> >> -        if ((mon_fdset_fd->removed ||
> >> -                (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
> >> -                runstate_is_running()) {
> >> +        if (mon_fdset_fd->removed) {
> >
> > I don't know the code well, but I like it.
> >
> >>              monitor_fdset_fd_free(mon_fdset_fd);
> >>          }
> >>      }
> >> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
> >>  
> >>      QEMU_LOCK_GUARD(&mon_fdsets_lock);
> >>      QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
> >> -        monitor_fdset_cleanup(mon_fdset);
> >> +        monitor_fdset_free(mon_fdset);
> >>      }
> >>  }
> >>  
> >> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
> >>              if (mon_fdset_fd_dup->fd == dup_fd) {
> >>                  QLIST_REMOVE(mon_fdset_fd_dup, next);
> >>                  g_free(mon_fdset_fd_dup);
> >> -                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
> >> -                    monitor_fdset_cleanup(mon_fdset);
> >> -                }
> >> +                monitor_fdset_free(mon_fdset);
> >
> > This and above changes are not crystal clear to me where the _cleanup()
> > does extra check "removed" and clean those fds.
> >
> > I think it'll make it easier for me to understand if this one and the next
> > are squashed together.  But maybe it's simply because I didn't fully
> > understand.
> 
> monitor_fdsets_cleanup() was doing three things previously:
> 
> 1- Remove the removed=true fds. This is weird, but ok.
> 
> 2- Remove fds from an fdset that has an empty dup_fds list, but only if
> the guest is not running and the monitor is closed. This is problematic.
> 
> 3- Remove/free fdsets that have become empty due to the above
> removals. This is ok.
> 
> This patch covers (2), because that is the only change that has a
> complex reasoning behind it and we need to document that without
> conflating it with the rest of the changes.
> 
> Since (3) is still a reasonable thing to do, this patch merely keeps it
> around, but using the helpers introduced in the previous patch.
> 
> The next patch removed the need for (1), making the removal immediate
> instead of delayed. It has it's own much less complex reasoning, which
> is: "we don't need to wait to remove the fd".
> 
> I hope this clarifies a bit. I would prefer if we kept this and the next
> patch separate to avoid having a single patch that does too many
> things. I hope to be as explicit as possible with the reason behind
> these changes to avoid putting future people in the situation that we
> are in now, i.e. having to guess at the reasons behind this code.
> 
> If you still think we should squash or if you have more questions, let
> me know.

Thanks.  Mind adding something into the commit message for monitor newbies
(like myself)?

I hope whoever more familiar with monitor can look, but otherwise let's see
what will break and then we have someone to discuss with.

For what it worth, I still want to ack this:

Reviewed-by: Peter Xu <peterx@redhat.com>

> 
> >>                  return;
> >>              }
> >>          }
> >> diff --git a/monitor/hmp.c b/monitor/hmp.c
> >> index 69c1b7e98a..460e8832f6 100644
> >> --- a/monitor/hmp.c
> >> +++ b/monitor/hmp.c
> >> @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
> >>              monitor_resume(mon);
> >>          }
> >>          qemu_mutex_unlock(&mon->mon_lock);
> >> -        mon_refcount++;
> >>          break;
> >>  
> >>      case CHR_EVENT_CLOSED:
> >> -        mon_refcount--;
> >>          monitor_fdsets_cleanup();
> >>          break;
> >>  
> >> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
> >> index 252de85681..cb628f681d 100644
> >> --- a/monitor/monitor-internal.h
> >> +++ b/monitor/monitor-internal.h
> >> @@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
> >>  extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
> >>  extern QemuMutex monitor_lock;
> >>  extern MonitorList mon_list;
> >> -extern int mon_refcount;
> >>  
> >>  extern HMPCommand hmp_cmds[];
> >>  
> >> diff --git a/monitor/qmp.c b/monitor/qmp.c
> >> index a239945e8d..5e538f34c0 100644
> >> --- a/monitor/qmp.c
> >> +++ b/monitor/qmp.c
> >> @@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
> >>          data = qmp_greeting(mon);
> >>          qmp_send_response(mon, data);
> >>          qobject_unref(data);
> >> -        mon_refcount++;
> >>          break;
> >>      case CHR_EVENT_CLOSED:
> >>          /*
> >> @@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
> >>          json_message_parser_destroy(&mon->parser);
> >>          json_message_parser_init(&mon->parser, handle_qmp_command,
> >>                                   mon, NULL);
> >> -        mon_refcount--;
> >>          monitor_fdsets_cleanup();
> >>          break;
> >>      case CHR_EVENT_BREAK:
> >
> > I like this too when mon_refcount can be dropped.  It turns out I like this
> > patch and the next a lot, and I hope nothing will break.
> >
> > Aside, you seem to have forgot removal of the "int mon_refcount" in
> > monitor.c.
> 
> Yes, I'll fix that. Thanks.
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT
  2024-05-31 15:42     ` Fabiano Rosas
@ 2024-05-31 15:58       ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-31 15:58 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Fri, May 31, 2024 at 12:42:05PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, May 23, 2024 at 04:05:43PM -0300, Fabiano Rosas wrote:
> >> We're about to enable the use of O_DIRECT in the migration code and
> >> due to the alignment restrictions imposed by filesystems we need to
> >> make sure the flag is only used when doing aligned IO.
> >> 
> >> The migration will do parallel IO to different regions of a file, so
> >> we need to use more than one file descriptor. Those cannot be obtained
> >> by duplicating (dup()) since duplicated file descriptors share the
> >> file status flags, including O_DIRECT. If one migration channel does
> >> unaligned IO while another sets O_DIRECT to do aligned IO, the
> >> filesystem would fail the unaligned operation.
> >> 
> >> The add-fd QMP command along with the fdset code are specifically
> >> designed to allow the user to pass a set of file descriptors with
> >> different access flags into QEMU to be later fetched by code that
> >> needs to alternate between those flags when doing IO.
> >> 
> >> Extend the fdset matching to behave the same with the O_DIRECT flag.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> >
> > One thing I am confused but totally irrelevant to this specific change.
> >
> > I wonder why do we need dupfds at all, and why client needs to remove-fd at
> > all.
> 
> This answer was lost to time.
> 
> >
> > It's about what would go wrong if qmp client only add-fd, then if it's
> > consumed by anything, it gets moved from "fds" list to "dupfds" list.  The
> > thing is I don't expect the client should pass over any fd that will not be
> > consumed.  Then if it's always consumed, why bother dup() at all, and why
> > bother an explicit remove-fd.
> 
> With the lack of documentation, I can only imagine the code was
> initially written to be more flexible, but we ended up never needing the
> extra flexibility. Maybe we could change that transparently in the
> future and deprecate qmp_remove_fd(). I'm thinking, even for the
> mapped-ram work, we might not need libvirt to call that function ever.

Good to know I'm not the only one thinking that.

It's okay - after your cleanup it's much better at least to me.  The only
thing to avoid these, AFAIU, is more careful design level reviews in the
future, and more documents state the facts and keep in history (and
obviously why remove-fd was needed was not documented).  Now we carry them.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 07/18] monitor: Simplify fdset and fd removal
  2024-05-23 19:05 ` [PATCH v2 07/18] monitor: Simplify fdset and fd removal Fabiano Rosas
@ 2024-05-31 15:58   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-05-31 15:58 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:37PM -0300, Fabiano Rosas wrote:
> Remove fds right away instead of setting the ->removed flag. We don't
> need the extra complexity of having a cleanup function reap the
> removed entries at a later time.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 01/18] migration: Fix file migration with fdset
  2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
  2024-05-24 10:51   ` Prasad Pandit
  2024-05-30 16:11   ` Peter Xu
@ 2024-06-03 10:20   ` Daniel P. Berrangé
  2 siblings, 0 replies; 60+ messages in thread
From: Daniel P. Berrangé @ 2024-06-03 10:20 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:31PM -0300, Fabiano Rosas wrote:
> When the "file:" migration support was added we missed the special
> case in the qemu_open_old implementation that allows for a particular
> file name format to be used to refer to a set of file descriptors that
> have been previously provided to QEMU via the add-fd QMP command.
> 
> When using this fdset feature, we should not truncate the migration
> file because being given an fd means that the management layer is in
> control of the file and will likely already have some data written to
> it. This is further indicated by the presence of the 'offset'
> argument, which indicates the start of the region where QEMU is
> allowed to write.
> 
> Fix the issue by replacing the O_TRUNC flag on open by an ftruncate
> call, which will take the offset into consideration.
> 
> Fixes: 385f510df5 ("migration: file URI offset")
> Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/file.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check
  2024-05-23 19:05 ` [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check Fabiano Rosas
  2024-05-30 16:14   ` Peter Xu
@ 2024-06-03 10:21   ` Daniel P. Berrangé
  1 sibling, 0 replies; 60+ messages in thread
From: Daniel P. Berrangé @ 2024-06-03 10:21 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Thu, May 23, 2024 at 04:05:32PM -0300, Fabiano Rosas wrote:
> When doing file migration, QEMU accepts an offset that should be
> skipped when writing the migration stream to the file. The purpose of
> the offset is to allow the management layer to put its own metadata at
> the start of the file.
> 
> We have tests for this in migration-test, but only testing that the
> migration stream starts at the correct offset and not that it actually
> leaves the data intact. Unsurprisingly, there's been a bug in that
> area that the tests didn't catch.
> 
> Fix the tests to write some data to the offset region and check that
> it's actually there after the migration.
> 
> While here, switch to using g_get_file_contents() which is more
> portable than mmap().
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  tests/qtest/migration-test.c | 79 ++++++++++++++++++++++--------------
>  1 file changed, 48 insertions(+), 31 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 04/18] monitor: Drop monitor_fdset_dup_fd_add()
  2024-05-23 19:05 ` [PATCH v2 04/18] monitor: Drop monitor_fdset_dup_fd_add() Fabiano Rosas
@ 2024-06-03 10:26   ` Daniel P. Berrangé
  0 siblings, 0 replies; 60+ messages in thread
From: Daniel P. Berrangé @ 2024-06-03 10:26 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, armbru, Peter Xu, Claudio Fontana, Jim Fehlig,
	Dr . David Alan Gilbert, Philippe Mathieu-Daudé,
	Paolo Bonzini

Incorrect $SUBJECT - it claims to be removing monitor_fdset_dup_fd_add
but actually removes monitor_fdset_dup_fd_find.

On Thu, May 23, 2024 at 04:05:34PM -0300, Fabiano Rosas wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> This function is not needed, one remove function should already work.
> Clean it up.
> 
> Here the code doesn't really care about whether we need to keep that dupfd
> around if close() failed: when that happens something got very wrong,
> keeping the dup_fd around the fdsets may not help that situation so far.
> 
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> [add missing return statement, removal during traversal is not safe]
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/monitor/monitor.h |  1 -
>  monitor/fds.c             | 28 ++++++----------------------
>  stubs/fdset.c             |  5 -----
>  util/osdep.c              | 15 +--------------
>  4 files changed, 7 insertions(+), 42 deletions(-)

With $SUBJECT fixed

 Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 16/18] io/channel-file: Add direct-io support
  2024-05-23 19:05 ` [PATCH v2 16/18] io/channel-file: Add direct-io support Fabiano Rosas
@ 2024-06-03 10:32   ` Daniel P. Berrangé
  0 siblings, 0 replies; 60+ messages in thread
From: Daniel P. Berrangé @ 2024-06-03 10:32 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, armbru, Peter Xu, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:46PM -0300, Fabiano Rosas wrote:
> Add support for setting/clearing the O_DIRECT flag on a file
> descriptor. This will be used for enabling O_DIRECT in the main
> migration channel when multifd is not in use.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  include/io/channel-file.h |  8 ++++++++
>  io/channel-file.c         | 25 +++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/include/io/channel-file.h b/include/io/channel-file.h
> index d373a4e44d..ecb4450e8e 100644
> --- a/include/io/channel-file.h
> +++ b/include/io/channel-file.h
> @@ -107,4 +107,12 @@ qio_channel_file_new_path(const char *path,
>                            mode_t mode,
>                            Error **errp);
>  
> +/**
> + * qio_channel_file_set_direct_io:
> + * @ioc: the QIOChannel object
> + * @enabled: the desired state of the O_DIRECT flag
> + * @errp: pointer to initialized error object
> + */
> +void qio_channel_file_set_direct_io(QIOChannel *ioc, bool enabled,
> +                                    Error **errp);

This should return 'int' rather than void, giving -1 on error,
0 on success, so callers don't deference the errp parameter
to check status.

>  #endif /* QIO_CHANNEL_FILE_H */
> diff --git a/io/channel-file.c b/io/channel-file.c
> index 2ea8d08360..a89cd3a6d5 100644
> --- a/io/channel-file.c
> +++ b/io/channel-file.c
> @@ -231,6 +231,31 @@ static int qio_channel_file_set_blocking(QIOChannel *ioc,
>  #endif
>  }
>  
> +void qio_channel_file_set_direct_io(QIOChannel *ioc, bool enabled, Error **errp)
> +{
> +#ifdef O_DIRECT
> +    QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
> +    int flags = fcntl(fioc->fd, F_GETFL);
> +
> +    if (flags == -1) {
> +        error_setg_errno(errp, errno, "Unable to read file descriptor flags");
> +        return;
> +    }
> +
> +    if (enabled) {
> +        flags |= O_DIRECT;
> +    } else {
> +        flags &= ~O_DIRECT;
> +    }
> +
> +    if (fcntl(fioc->fd, F_SETFL, flags) == -1) {
> +        error_setg_errno(errp, errno, "Unable to set file descriptor flags");
> +        return;
> +    }
> +#else
> +    error_setg(errp, "System does not support O_DIRECT");
> +#endif
> +}
>  
>  static off_t qio_channel_file_seek(QIOChannel *ioc,
>                                     off_t offset,
> -- 
> 2.35.3
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file
  2024-05-23 19:05 ` [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file Fabiano Rosas
@ 2024-06-04 20:46   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-06-04 20:46 UTC (permalink / raw)
  To: Fabiano Rosas; +Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig

On Thu, May 23, 2024 at 04:05:44PM -0300, Fabiano Rosas wrote:
> With the last few changes to the fdset infrastructure, we now allow
> multifd to use an fdset when migrating to a file. This is useful for
> the scenario where the management layer wants to have control over the
> migration file.
> 
> By receiving the file descriptors directly, QEMU can delegate some
> high level operating system operations to the management layer (such
> as mandatory access control). The management layer might also want to
> add its own headers before the migration stream.
> 
> Document the "file:/dev/fdset/#" syntax for the multifd migration with
> mapped-ram. The requirements for the fdset mechanism are:
> 
> - the fdset must contain two fds that are not duplicates between
>   themselves;
> 
> - if direct-io is to be used, exactly one of the fds must have the
>   O_DIRECT flag set;
> 
> - the file must be opened with WRONLY on the migration source side;
> 
> - the file must be opened with RDONLY on the migration destination
>   side.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  docs/devel/migration/main.rst       | 24 +++++++++++++++++++-----
>  docs/devel/migration/mapped-ram.rst |  6 +++++-
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/devel/migration/main.rst b/docs/devel/migration/main.rst
> index 495cdcb112..784c899dca 100644
> --- a/docs/devel/migration/main.rst
> +++ b/docs/devel/migration/main.rst
> @@ -47,11 +47,25 @@ over any transport.
>    QEMU interference. Note that QEMU does not flush cached file
>    data/metadata at the end of migration.
>  
> -In addition, support is included for migration using RDMA, which
> -transports the page data using ``RDMA``, where the hardware takes care of
> -transporting the pages, and the load on the CPU is much lower.  While the
> -internals of RDMA migration are a bit different, this isn't really visible
> -outside the RAM migration code.
> +  The file migration also supports using a file that has already been
> +  opened. A set of file descriptors is passed to QEMU via an "fdset"
> +  (see add-fd QMP command documentation). This method allows a
> +  management application to have control over the migration file
> +  opening operation. There are, however, strict requirements to this
> +  interface if the multifd capability is enabled:
> +
> +    - the fdset must contain two file descriptors that are not
> +      duplicates between themselves;
> +    - if the direct-io capability is to be used, exactly one of the
> +      file descriptors must have the O_DIRECT flag set;
> +    - the file must be opened with WRONLY on the migration source side
> +      and RDONLY on the migration destination side.
> +
> +- rdma migration: support is included for migration using RDMA, which
> +  transports the page data using ``RDMA``, where the hardware takes
> +  care of transporting the pages, and the load on the CPU is much
> +  lower.  While the internals of RDMA migration are a bit different,
> +  this isn't really visible outside the RAM migration code.
>  
>  All these migration protocols use the same infrastructure to
>  save/restore state devices.  This infrastructure is shared with the
> diff --git a/docs/devel/migration/mapped-ram.rst b/docs/devel/migration/mapped-ram.rst
> index fa4cefd9fc..e6505511f0 100644
> --- a/docs/devel/migration/mapped-ram.rst
> +++ b/docs/devel/migration/mapped-ram.rst
> @@ -16,7 +16,7 @@ location in the file, rather than constantly being added to a
>  sequential stream. Having the pages at fixed offsets also allows the
>  usage of O_DIRECT for save/restore of the migration stream as the
>  pages are ensured to be written respecting O_DIRECT alignment
> -restrictions (direct-io support not yet implemented).
> +restrictions.
>  
>  Usage
>  -----
> @@ -35,6 +35,10 @@ Use a ``file:`` URL for migration:
>  Mapped-ram migration is best done non-live, i.e. by stopping the VM on
>  the source side before migrating.
>  
> +For best performance enable the ``direct-io`` capability as well:

Parameter?

> +
> +    ``migrate_set_capability direct-io on``

migrate_set_parameters?

If with that fixed:

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds
  2024-05-23 19:05 ` [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds Fabiano Rosas
@ 2024-06-04 20:51   ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2024-06-04 20:51 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Thu, May 23, 2024 at 04:05:45PM -0300, Fabiano Rosas wrote:
> Add a multifd test for mapped-ram with passing of fds into QEMU. This
> is how libvirt will consume the feature.
> 
> There are a couple of details to the fdset mechanism:
> 
> - multifd needs two distinct file descriptors (not duplicated with
>   dup()) so it can enable O_DIRECT only on the channels that do
>   aligned IO. The dup() system call creates file descriptors that
>   share status flags, of which O_DIRECT is one.
> 
> - the open() access mode flags used for the fds passed into QEMU need
>   to match the flags QEMU uses to open the file. Currently O_WRONLY
>   for src and O_RDONLY for dst.
> 
> Note that fdset code goes under _WIN32 because fd passing is not
> supported on Windows.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-05-23 19:05 ` [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration Fabiano Rosas
@ 2024-06-04 20:56   ` Peter Xu
  2024-06-07 18:42     ` Fabiano Rosas
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-06-04 20:56 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
> We've recently added support for direct-io with multifd, which brings
> performance benefits, but creates a non-uniform user interface by
> coupling direct-io with the multifd capability. This means that users
> cannot keep the direct-io flag enabled while disabling multifd.
> 
> Libvirt in particular already has support for direct-io and parallel
> migration separately from each other, so it would be a regression to
> now require both options together. It's relatively simple for QEMU to
> add support for direct-io migration without multifd, so let's do this
> in order to keep both options decoupled.
> 
> We cannot simply enable the O_DIRECT flag, however, because not all IO
> performed by the migration thread satisfies the alignment requirements
> of O_DIRECT. There are many small read & writes that add headers and
> synchronization flags to the stream, which at the moment are required
> to always be present.
> 
> Fortunately, due to fixed-ram migration there is a discernible moment
> where only RAM pages are written to the migration file. Enable
> direct-io during that moment.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

Is anyone going to consume this?  How's the performance?

It doesn't look super fast to me if we need to enable/disable dio in each
loop.. then it's a matter of whether we should bother, or would it be
easier that we simply require multifd when direct-io=on.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
  2024-05-31 15:25     ` Fabiano Rosas
  2024-05-31 15:56       ` Peter Xu
@ 2024-06-04 23:40       ` Dr. David Alan Gilbert
  2024-06-05 12:31         ` Fabiano Rosas
  1 sibling, 1 reply; 60+ messages in thread
From: Dr. David Alan Gilbert @ 2024-06-04 23:40 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, qemu-devel, berrange, armbru, Claudio Fontana,
	Jim Fehlig

* Fabiano Rosas (farosas@suse.de) wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
> >> We've been up until now cleaning up any file descriptors that have
> >> been passed into QEMU and never duplicated[1,2]. A file descriptor
> >> without duplicates indicates that no part of QEMU has made use of
> >> it. This approach is starting to show some cracks now that we're
> >> starting to consume fds from the migration code:
> >> 
> >> - Doing cleanup every time the last monitor connection closes works to
> >>   reap unused fds, but also has the side effect of forcing the
> >>   management layer to pass the file descriptors again in case of a
> >>   disconnect/re-connect, if that happened to be the only monitor
> >>   connection.
> >> 
> >>   Another side effect is that removing an fd with qmp_remove_fd() is
> >>   effectively delayed until the last monitor connection closes.
> >> 
> >>   The reliance on mon_refcount is also problematic because it's racy.
> >> 
> >> - Checking runstate_is_running() skips the cleanup unless the VM is
> >>   running and avoids premature cleanup of the fds, but also has the
> >>   side effect of blocking the legitimate removal of an fd via
> >>   qmp_remove_fd() if the VM happens to be in another state.
> >> 
> >>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
> >>   because requesting a removal at a bad time (guest stopped) might
> >>   cause an fd to never be removed, or to be removed at a much later
> >>   point in time, causing the query command to continue showing the
> >>   supposedly removed fd/fdset.
> >> 
> >> Note that file descriptors that *have* been duplicated are owned by
> >> the code that uses them and will be removed after qemu_close() is
> >> called. Therefore we've decided that the best course of action to
> >> avoid the undesired side-effects is to stop managing non-duplicated
> >> file descriptors.
> >> 
> >> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
> >> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >>  monitor/fds.c              | 15 ++++++++-------
> >>  monitor/hmp.c              |  2 --
> >>  monitor/monitor-internal.h |  1 -
> >>  monitor/qmp.c              |  2 --
> >>  4 files changed, 8 insertions(+), 12 deletions(-)
> >> 
> >> diff --git a/monitor/fds.c b/monitor/fds.c
> >> index 101e21720d..f7b98814fa 100644
> >> --- a/monitor/fds.c
> >> +++ b/monitor/fds.c
> >> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
> >>  
> >>  static void monitor_fdset_free(MonFdset *mon_fdset)
> >>  {
> >> +    /*
> >> +     * Only remove an empty fdset. The fds are owned by the user and
> >> +     * should have been removed with qmp_remove_fd(). The dup_fds are
> >> +     * owned by QEMU and should have been removed with qemu_close().
> >> +     */
> >>      if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
> >>          QLIST_REMOVE(mon_fdset, next);
> >>          g_free(mon_fdset);
> >> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
> >>      MonFdsetFd *mon_fdset_fd_next;
> >>  
> >>      QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
> >> -        if ((mon_fdset_fd->removed ||
> >> -                (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
> >> -                runstate_is_running()) {
> >> +        if (mon_fdset_fd->removed) {
> >
> > I don't know the code well, but I like it.
> >
> >>              monitor_fdset_fd_free(mon_fdset_fd);
> >>          }
> >>      }
> >> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
> >>  
> >>      QEMU_LOCK_GUARD(&mon_fdsets_lock);
> >>      QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
> >> -        monitor_fdset_cleanup(mon_fdset);
> >> +        monitor_fdset_free(mon_fdset);
> >>      }
> >>  }
> >>  
> >> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
> >>              if (mon_fdset_fd_dup->fd == dup_fd) {
> >>                  QLIST_REMOVE(mon_fdset_fd_dup, next);
> >>                  g_free(mon_fdset_fd_dup);
> >> -                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
> >> -                    monitor_fdset_cleanup(mon_fdset);
> >> -                }
> >> +                monitor_fdset_free(mon_fdset);
> >
> > This and above changes are not crystal clear to me where the _cleanup()
> > does extra check "removed" and clean those fds.
> >
> > I think it'll make it easier for me to understand if this one and the next
> > are squashed together.  But maybe it's simply because I didn't fully
> > understand.
> 
> monitor_fdsets_cleanup() was doing three things previously:
> 
> 1- Remove the removed=true fds. This is weird, but ok.
> 
> 2- Remove fds from an fdset that has an empty dup_fds list, but only if
> the guest is not running and the monitor is closed. This is problematic.
> 
> 3- Remove/free fdsets that have become empty due to the above
> removals. This is ok.
> 
> This patch covers (2), because that is the only change that has a
> complex reasoning behind it and we need to document that without
> conflating it with the rest of the changes.

The ebe52b592d you reference, makes me think that the only reason for the
'is not running' was as a way to detect when init had finished; and there
are certainly better ways to do that (now?).

However, the efb87c1697 talks about cleaning up non-dup's on all monitors
closed, to stop getting left-over fd's that were added but then not used
(because a disconnect happened between adding and being used).
In your world when do these get cleaned up?

Dave

> Since (3) is still a reasonable thing to do, this patch merely keeps it
> around, but using the helpers introduced in the previous patch.
> 
> The next patch removed the need for (1), making the removal immediate
> instead of delayed. It has it's own much less complex reasoning, which
> is: "we don't need to wait to remove the fd".
> 
> I hope this clarifies a bit. I would prefer if we kept this and the next
> patch separate to avoid having a single patch that does too many
> things. I hope to be as explicit as possible with the reason behind
> these changes to avoid putting future people in the situation that we
> are in now, i.e. having to guess at the reasons behind this code.
> 
> If you still think we should squash or if you have more questions, let
> me know.
> 
> >>                  return;
> >>              }
> >>          }
> >> diff --git a/monitor/hmp.c b/monitor/hmp.c
> >> index 69c1b7e98a..460e8832f6 100644
> >> --- a/monitor/hmp.c
> >> +++ b/monitor/hmp.c
> >> @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
> >>              monitor_resume(mon);
> >>          }
> >>          qemu_mutex_unlock(&mon->mon_lock);
> >> -        mon_refcount++;
> >>          break;
> >>  
> >>      case CHR_EVENT_CLOSED:
> >> -        mon_refcount--;
> >>          monitor_fdsets_cleanup();
> >>          break;
> >>  
> >> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
> >> index 252de85681..cb628f681d 100644
> >> --- a/monitor/monitor-internal.h
> >> +++ b/monitor/monitor-internal.h
> >> @@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
> >>  extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
> >>  extern QemuMutex monitor_lock;
> >>  extern MonitorList mon_list;
> >> -extern int mon_refcount;
> >>  
> >>  extern HMPCommand hmp_cmds[];
> >>  
> >> diff --git a/monitor/qmp.c b/monitor/qmp.c
> >> index a239945e8d..5e538f34c0 100644
> >> --- a/monitor/qmp.c
> >> +++ b/monitor/qmp.c
> >> @@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
> >>          data = qmp_greeting(mon);
> >>          qmp_send_response(mon, data);
> >>          qobject_unref(data);
> >> -        mon_refcount++;
> >>          break;
> >>      case CHR_EVENT_CLOSED:
> >>          /*
> >> @@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
> >>          json_message_parser_destroy(&mon->parser);
> >>          json_message_parser_init(&mon->parser, handle_qmp_command,
> >>                                   mon, NULL);
> >> -        mon_refcount--;
> >>          monitor_fdsets_cleanup();
> >>          break;
> >>      case CHR_EVENT_BREAK:
> >
> > I like this too when mon_refcount can be dropped.  It turns out I like this
> > patch and the next a lot, and I hope nothing will break.
> >
> > Aside, you seem to have forgot removal of the "int mon_refcount" in
> > monitor.c.
> 
> Yes, I'll fix that. Thanks.
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 06/18] monitor: Stop removing non-duplicated fds
  2024-06-04 23:40       ` Dr. David Alan Gilbert
@ 2024-06-05 12:31         ` Fabiano Rosas
  0 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-06-05 12:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, berrange, armbru, Claudio Fontana,
	Jim Fehlig

"Dr. David Alan Gilbert" <dave@treblig.org> writes:

> * Fabiano Rosas (farosas@suse.de) wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Thu, May 23, 2024 at 04:05:36PM -0300, Fabiano Rosas wrote:
>> >> We've been up until now cleaning up any file descriptors that have
>> >> been passed into QEMU and never duplicated[1,2]. A file descriptor
>> >> without duplicates indicates that no part of QEMU has made use of
>> >> it. This approach is starting to show some cracks now that we're
>> >> starting to consume fds from the migration code:
>> >> 
>> >> - Doing cleanup every time the last monitor connection closes works to
>> >>   reap unused fds, but also has the side effect of forcing the
>> >>   management layer to pass the file descriptors again in case of a
>> >>   disconnect/re-connect, if that happened to be the only monitor
>> >>   connection.
>> >> 
>> >>   Another side effect is that removing an fd with qmp_remove_fd() is
>> >>   effectively delayed until the last monitor connection closes.
>> >> 
>> >>   The reliance on mon_refcount is also problematic because it's racy.
>> >> 
>> >> - Checking runstate_is_running() skips the cleanup unless the VM is
>> >>   running and avoids premature cleanup of the fds, but also has the
>> >>   side effect of blocking the legitimate removal of an fd via
>> >>   qmp_remove_fd() if the VM happens to be in another state.
>> >> 
>> >>   This affects qmp_remove_fd() and qmp_query_fdsets() in particular
>> >>   because requesting a removal at a bad time (guest stopped) might
>> >>   cause an fd to never be removed, or to be removed at a much later
>> >>   point in time, causing the query command to continue showing the
>> >>   supposedly removed fd/fdset.
>> >> 
>> >> Note that file descriptors that *have* been duplicated are owned by
>> >> the code that uses them and will be removed after qemu_close() is
>> >> called. Therefore we've decided that the best course of action to
>> >> avoid the undesired side-effects is to stop managing non-duplicated
>> >> file descriptors.
>> >> 
>> >> 1- efb87c1697 ("monitor: Clean up fd sets on monitor disconnect")
>> >> 2- ebe52b592d ("monitor: Prevent removing fd from set during init")
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> ---
>> >>  monitor/fds.c              | 15 ++++++++-------
>> >>  monitor/hmp.c              |  2 --
>> >>  monitor/monitor-internal.h |  1 -
>> >>  monitor/qmp.c              |  2 --
>> >>  4 files changed, 8 insertions(+), 12 deletions(-)
>> >> 
>> >> diff --git a/monitor/fds.c b/monitor/fds.c
>> >> index 101e21720d..f7b98814fa 100644
>> >> --- a/monitor/fds.c
>> >> +++ b/monitor/fds.c
>> >> @@ -169,6 +169,11 @@ int monitor_get_fd(Monitor *mon, const char *fdname, Error **errp)
>> >>  
>> >>  static void monitor_fdset_free(MonFdset *mon_fdset)
>> >>  {
>> >> +    /*
>> >> +     * Only remove an empty fdset. The fds are owned by the user and
>> >> +     * should have been removed with qmp_remove_fd(). The dup_fds are
>> >> +     * owned by QEMU and should have been removed with qemu_close().
>> >> +     */
>> >>      if (QLIST_EMPTY(&mon_fdset->fds) && QLIST_EMPTY(&mon_fdset->dup_fds)) {
>> >>          QLIST_REMOVE(mon_fdset, next);
>> >>          g_free(mon_fdset);
>> >> @@ -189,9 +194,7 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
>> >>      MonFdsetFd *mon_fdset_fd_next;
>> >>  
>> >>      QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
>> >> -        if ((mon_fdset_fd->removed ||
>> >> -                (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
>> >> -                runstate_is_running()) {
>> >> +        if (mon_fdset_fd->removed) {
>> >
>> > I don't know the code well, but I like it.
>> >
>> >>              monitor_fdset_fd_free(mon_fdset_fd);
>> >>          }
>> >>      }
>> >> @@ -206,7 +209,7 @@ void monitor_fdsets_cleanup(void)
>> >>  
>> >>      QEMU_LOCK_GUARD(&mon_fdsets_lock);
>> >>      QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) {
>> >> -        monitor_fdset_cleanup(mon_fdset);
>> >> +        monitor_fdset_free(mon_fdset);
>> >>      }
>> >>  }
>> >>  
>> >> @@ -479,9 +482,7 @@ void monitor_fdset_dup_fd_remove(int dup_fd)
>> >>              if (mon_fdset_fd_dup->fd == dup_fd) {
>> >>                  QLIST_REMOVE(mon_fdset_fd_dup, next);
>> >>                  g_free(mon_fdset_fd_dup);
>> >> -                if (QLIST_EMPTY(&mon_fdset->dup_fds)) {
>> >> -                    monitor_fdset_cleanup(mon_fdset);
>> >> -                }
>> >> +                monitor_fdset_free(mon_fdset);
>> >
>> > This and above changes are not crystal clear to me where the _cleanup()
>> > does extra check "removed" and clean those fds.
>> >
>> > I think it'll make it easier for me to understand if this one and the next
>> > are squashed together.  But maybe it's simply because I didn't fully
>> > understand.
>> 
>> monitor_fdsets_cleanup() was doing three things previously:
>> 
>> 1- Remove the removed=true fds. This is weird, but ok.
>> 
>> 2- Remove fds from an fdset that has an empty dup_fds list, but only if
>> the guest is not running and the monitor is closed. This is problematic.
>> 
>> 3- Remove/free fdsets that have become empty due to the above
>> removals. This is ok.
>> 
>> This patch covers (2), because that is the only change that has a
>> complex reasoning behind it and we need to document that without
>> conflating it with the rest of the changes.
>
> The ebe52b592d you reference, makes me think that the only reason for the
> 'is not running' was as a way to detect when init had finished; and there
> are certainly better ways to do that (now?).

I agree with the perception, however I can't determine what "init" means
in this scenario. It's not clear from the original change at exactly
which point we're safe to assume fds have been consumed from some
subsystem (block probably). This also clashes with the (new) usage we're
attempting here for migration because the migration code would almost
certainly only consume the fds after this point.

>
> However, the efb87c1697 talks about cleaning up non-dup's on all monitors
> closed, to stop getting left-over fd's that were added but then not used
> (because a disconnect happened between adding and being used).
> In your world when do these get cleaned up?

They stay around until after the reconnection. Then either get removed
via qmp_remove_fd() or when a subsystem dups them and eventually calls
qemu_close().

The initial assumption seems to have been overly conservative, there
will always be a monitor connection around, even if it disconnects
briefly.

Per Daniel's suggestion we're considering a management layer bug if it
passes fds that QEMU never needs to consume. So QEMU will not attempt
any cleanup of perceived unused fds since they could be needed at a
later point (e.g. after qmp_migrate).

>
> Dave
>
>> Since (3) is still a reasonable thing to do, this patch merely keeps it
>> around, but using the helpers introduced in the previous patch.
>> 
>> The next patch removed the need for (1), making the removal immediate
>> instead of delayed. It has it's own much less complex reasoning, which
>> is: "we don't need to wait to remove the fd".
>> 
>> I hope this clarifies a bit. I would prefer if we kept this and the next
>> patch separate to avoid having a single patch that does too many
>> things. I hope to be as explicit as possible with the reason behind
>> these changes to avoid putting future people in the situation that we
>> are in now, i.e. having to guess at the reasons behind this code.
>> 
>> If you still think we should squash or if you have more questions, let
>> me know.
>> 
>> >>                  return;
>> >>              }
>> >>          }
>> >> diff --git a/monitor/hmp.c b/monitor/hmp.c
>> >> index 69c1b7e98a..460e8832f6 100644
>> >> --- a/monitor/hmp.c
>> >> +++ b/monitor/hmp.c
>> >> @@ -1437,11 +1437,9 @@ static void monitor_event(void *opaque, QEMUChrEvent event)
>> >>              monitor_resume(mon);
>> >>          }
>> >>          qemu_mutex_unlock(&mon->mon_lock);
>> >> -        mon_refcount++;
>> >>          break;
>> >>  
>> >>      case CHR_EVENT_CLOSED:
>> >> -        mon_refcount--;
>> >>          monitor_fdsets_cleanup();
>> >>          break;
>> >>  
>> >> diff --git a/monitor/monitor-internal.h b/monitor/monitor-internal.h
>> >> index 252de85681..cb628f681d 100644
>> >> --- a/monitor/monitor-internal.h
>> >> +++ b/monitor/monitor-internal.h
>> >> @@ -168,7 +168,6 @@ extern bool qmp_dispatcher_co_shutdown;
>> >>  extern QmpCommandList qmp_commands, qmp_cap_negotiation_commands;
>> >>  extern QemuMutex monitor_lock;
>> >>  extern MonitorList mon_list;
>> >> -extern int mon_refcount;
>> >>  
>> >>  extern HMPCommand hmp_cmds[];
>> >>  
>> >> diff --git a/monitor/qmp.c b/monitor/qmp.c
>> >> index a239945e8d..5e538f34c0 100644
>> >> --- a/monitor/qmp.c
>> >> +++ b/monitor/qmp.c
>> >> @@ -466,7 +466,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
>> >>          data = qmp_greeting(mon);
>> >>          qmp_send_response(mon, data);
>> >>          qobject_unref(data);
>> >> -        mon_refcount++;
>> >>          break;
>> >>      case CHR_EVENT_CLOSED:
>> >>          /*
>> >> @@ -479,7 +478,6 @@ static void monitor_qmp_event(void *opaque, QEMUChrEvent event)
>> >>          json_message_parser_destroy(&mon->parser);
>> >>          json_message_parser_init(&mon->parser, handle_qmp_command,
>> >>                                   mon, NULL);
>> >> -        mon_refcount--;
>> >>          monitor_fdsets_cleanup();
>> >>          break;
>> >>      case CHR_EVENT_BREAK:
>> >
>> > I like this too when mon_refcount can be dropped.  It turns out I like this
>> > patch and the next a lot, and I hope nothing will break.
>> >
>> > Aside, you seem to have forgot removal of the "int mon_refcount" in
>> > monitor.c.
>> 
>> Yes, I'll fix that. Thanks.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-04 20:56   ` Peter Xu
@ 2024-06-07 18:42     ` Fabiano Rosas
  2024-06-07 20:39       ` Jim Fehlig
  2024-06-10 16:09       ` Peter Xu
  0 siblings, 2 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-06-07 18:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

Peter Xu <peterx@redhat.com> writes:

> On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>> We've recently added support for direct-io with multifd, which brings
>> performance benefits, but creates a non-uniform user interface by
>> coupling direct-io with the multifd capability. This means that users
>> cannot keep the direct-io flag enabled while disabling multifd.
>> 
>> Libvirt in particular already has support for direct-io and parallel
>> migration separately from each other, so it would be a regression to
>> now require both options together. It's relatively simple for QEMU to
>> add support for direct-io migration without multifd, so let's do this
>> in order to keep both options decoupled.
>> 
>> We cannot simply enable the O_DIRECT flag, however, because not all IO
>> performed by the migration thread satisfies the alignment requirements
>> of O_DIRECT. There are many small read & writes that add headers and
>> synchronization flags to the stream, which at the moment are required
>> to always be present.
>> 
>> Fortunately, due to fixed-ram migration there is a discernible moment
>> where only RAM pages are written to the migration file. Enable
>> direct-io during that moment.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> Is anyone going to consume this?  How's the performance?

I don't think we have a pre-determined consumer for this. This came up
in an internal discussion about making the interface simpler for libvirt
and in a thread on the libvirt mailing list[1] about using O_DIRECT to
keep the snapshot data out of the caches to avoid impacting the rest of
the system. (I could have described this better in the commit message,
sorry).

Quoting Daniel:

  "Note the reason for using O_DIRECT is *not* to make saving / restoring
   the guest VM faster. Rather it is to ensure that saving/restoring a VM
   does not trash the host I/O / buffer cache, which will negatively impact
   performance of all the *other* concurrently running VMs."

1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de

About performance, a quick test on a stopped 30G guest, shows
mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
direct-io=off.

>
> It doesn't look super fast to me if we need to enable/disable dio in each
> loop.. then it's a matter of whether we should bother, or would it be
> easier that we simply require multifd when direct-io=on.

AIUI, the issue here that users are already allowed to specify in
libvirt the equivalent to direct-io and multifd independent of each
other (bypass-cache, parallel). To start requiring both together now in
some situations would be a regression. I confess I don't know libvirt
code to know whether this can be worked around somehow, but as I said,
it's a relatively simple change from the QEMU side.

Another option which would be for libvirt to keep using multifd, but
make it 1 channel only if --parallel is not specified. That might be
enough to solve the interface issues. Of course, it's a different code
altogether than the usual precopy code that gets executed when
multifd=off, I don't know whether that could be an issue somehow.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-07 18:42     ` Fabiano Rosas
@ 2024-06-07 20:39       ` Jim Fehlig
  2024-06-10 16:09       ` Peter Xu
  1 sibling, 0 replies; 60+ messages in thread
From: Jim Fehlig @ 2024-06-07 20:39 UTC (permalink / raw)
  To: Fabiano Rosas, Peter Xu
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Thomas Huth,
	Laurent Vivier, Paolo Bonzini

On 6/7/24 12:42 PM, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>>> We've recently added support for direct-io with multifd, which brings
>>> performance benefits, but creates a non-uniform user interface by
>>> coupling direct-io with the multifd capability. This means that users
>>> cannot keep the direct-io flag enabled while disabling multifd.
>>>
>>> Libvirt in particular already has support for direct-io and parallel
>>> migration separately from each other, so it would be a regression to
>>> now require both options together. It's relatively simple for QEMU to
>>> add support for direct-io migration without multifd, so let's do this
>>> in order to keep both options decoupled.
>>>
>>> We cannot simply enable the O_DIRECT flag, however, because not all IO
>>> performed by the migration thread satisfies the alignment requirements
>>> of O_DIRECT. There are many small read & writes that add headers and
>>> synchronization flags to the stream, which at the moment are required
>>> to always be present.
>>>
>>> Fortunately, due to fixed-ram migration there is a discernible moment
>>> where only RAM pages are written to the migration file. Enable
>>> direct-io during that moment.
>>>
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>
>> Is anyone going to consume this?  How's the performance?
> 
> I don't think we have a pre-determined consumer for this. This came up
> in an internal discussion about making the interface simpler for libvirt
> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
> keep the snapshot data out of the caches to avoid impacting the rest of
> the system. (I could have described this better in the commit message,
> sorry).
> 
> Quoting Daniel:
> 
>    "Note the reason for using O_DIRECT is *not* to make saving / restoring
>     the guest VM faster. Rather it is to ensure that saving/restoring a VM
>     does not trash the host I/O / buffer cache, which will negatively impact
>     performance of all the *other* concurrently running VMs."
> 
> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
> 
> About performance, a quick test on a stopped 30G guest, shows
> mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
> direct-io=off.
> 
>>
>> It doesn't look super fast to me if we need to enable/disable dio in each
>> loop.. then it's a matter of whether we should bother, or would it be
>> easier that we simply require multifd when direct-io=on.
> 
> AIUI, the issue here that users are already allowed to specify in
> libvirt the equivalent to direct-io and multifd independent of each
> other (bypass-cache, parallel). To start requiring both together now in
> some situations would be a regression. I confess I don't know libvirt
> code to know whether this can be worked around somehow, but as I said,
> it's a relatively simple change from the QEMU side.

Currently, libvirt does not support --parallel with virDomainSave* and 
virDomainRestore* APIs. I'll work on that after getting support for mapped-ram 
merged. --parallel is supported in virDomainMigrate* APIs, but obviously those 
APIs don't accept --bypass-cache.

Regards,
Jim

> 
> Another option which would be for libvirt to keep using multifd, but
> make it 1 channel only if --parallel is not specified. That might be
> enough to solve the interface issues. Of course, it's a different code
> altogether than the usual precopy code that gets executed when
> multifd=off, I don't know whether that could be an issue somehow.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-07 18:42     ` Fabiano Rosas
  2024-06-07 20:39       ` Jim Fehlig
@ 2024-06-10 16:09       ` Peter Xu
  2024-06-10 17:45         ` Fabiano Rosas
  1 sibling, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-06-10 16:09 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Fri, Jun 07, 2024 at 03:42:35PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
> >> We've recently added support for direct-io with multifd, which brings
> >> performance benefits, but creates a non-uniform user interface by
> >> coupling direct-io with the multifd capability. This means that users
> >> cannot keep the direct-io flag enabled while disabling multifd.
> >> 
> >> Libvirt in particular already has support for direct-io and parallel
> >> migration separately from each other, so it would be a regression to
> >> now require both options together. It's relatively simple for QEMU to
> >> add support for direct-io migration without multifd, so let's do this
> >> in order to keep both options decoupled.
> >> 
> >> We cannot simply enable the O_DIRECT flag, however, because not all IO
> >> performed by the migration thread satisfies the alignment requirements
> >> of O_DIRECT. There are many small read & writes that add headers and
> >> synchronization flags to the stream, which at the moment are required
> >> to always be present.
> >> 
> >> Fortunately, due to fixed-ram migration there is a discernible moment
> >> where only RAM pages are written to the migration file. Enable
> >> direct-io during that moment.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > Is anyone going to consume this?  How's the performance?
> 
> I don't think we have a pre-determined consumer for this. This came up
> in an internal discussion about making the interface simpler for libvirt
> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
> keep the snapshot data out of the caches to avoid impacting the rest of
> the system. (I could have described this better in the commit message,
> sorry).
> 
> Quoting Daniel:
> 
>   "Note the reason for using O_DIRECT is *not* to make saving / restoring
>    the guest VM faster. Rather it is to ensure that saving/restoring a VM
>    does not trash the host I/O / buffer cache, which will negatively impact
>    performance of all the *other* concurrently running VMs."
> 
> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
> 
> About performance, a quick test on a stopped 30G guest, shows
> mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
> direct-io=off.

Yes, this makes sense.

> 
> >
> > It doesn't look super fast to me if we need to enable/disable dio in each
> > loop.. then it's a matter of whether we should bother, or would it be
> > easier that we simply require multifd when direct-io=on.
> 
> AIUI, the issue here that users are already allowed to specify in
> libvirt the equivalent to direct-io and multifd independent of each
> other (bypass-cache, parallel). To start requiring both together now in
> some situations would be a regression. I confess I don't know libvirt
> code to know whether this can be worked around somehow, but as I said,
> it's a relatively simple change from the QEMU side.

Firstly, I definitely want to already avoid all the calls to either
migration_direct_io_start() or *_finish(), now we already need to
explicitly call them in three paths, and that's not intuitive and less
readable, just like the hard coded rdma codes.

I also worry we may overlook the complexity here, and pinning buffers
definitely need more thoughts on its own.  It's easier to digest when using
multifd and when QEMU only pins guest pages just like tcp-zerocopy does,
which are naturally host page size aligned, and also guaranteed to not be
freed (while reused / modified is fine here, as dirty tracking guarantees a
new page will be migrated soon again).

IMHO here the "not be freed / modified" is even more important than
"alignment": the latter is about perf, the former is about correctness.
When we do directio on random buffers, AFAIU we don't want to have the
buffer modified before flushed to disk, and that's IMHO not easy to
guarantee.

E.g., I don't think this guarantees a flush on the buffer usages:

  migration_direct_io_start()
    /* flush any potentially unaligned IO before setting O_DIRECT */
    qemu_fflush(file);

qemu_fflush() internally does writev(), and that "flush" is about "flushing
qemufile iov[] to fd", not "flushing buffers to disk".  I think it means
if we do qemu_fflush() then we modify QEMUFile.buf[IO_BUF_SIZE] we're
doomed: we will never know whether dio has happened, and which version of
buffer will be sent; I don't think it's guaranteed it will always be the
old version of the buffer.

However the issue is, QEMUFile defines qemu_fflush() as: after call, the
buf[] can be reused!  It suggests breaking things I guess in dio context.

IIUC currently mapped-ram is ok because mapped-ram is just special that it
doesn't have page headers, so it doesn't use the buf[] during iterations;
while for zeropage it uses file_bmap bitmap and that's separate too and
does not generate any byte on the wire either.

xbzrle could use that buf[], but maybe mapped-ram doesn't work anyway with
xbzrle.

Everything is just very not obvious and tricky to me.  This still looks
pretty dangerous to me.  Would migration_direct_io_finish() guarantee
something like a fdatasync()?  If so it looks safer, but still within the
start() and finish() if someone calls qemu_fflush() and reuse the buffer we
can still get hard to debug issues (as the outcome would be that we saw
corrupted migration files).

> 
> Another option which would be for libvirt to keep using multifd, but
> make it 1 channel only if --parallel is not specified. That might be
> enough to solve the interface issues. Of course, it's a different code
> altogether than the usual precopy code that gets executed when
> multifd=off, I don't know whether that could be an issue somehow.

Would there be any comment from Libvirt side?  This sounds like a good
solution if my above concern is real; as long as we always stick dio with
guest pages we'll be all fine.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-10 16:09       ` Peter Xu
@ 2024-06-10 17:45         ` Fabiano Rosas
  2024-06-10 19:02           ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-06-10 17:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

Peter Xu <peterx@redhat.com> writes:

> On Fri, Jun 07, 2024 at 03:42:35PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Thu, May 23, 2024 at 04:05:48PM -0300, Fabiano Rosas wrote:
>> >> We've recently added support for direct-io with multifd, which brings
>> >> performance benefits, but creates a non-uniform user interface by
>> >> coupling direct-io with the multifd capability. This means that users
>> >> cannot keep the direct-io flag enabled while disabling multifd.
>> >> 
>> >> Libvirt in particular already has support for direct-io and parallel
>> >> migration separately from each other, so it would be a regression to
>> >> now require both options together. It's relatively simple for QEMU to
>> >> add support for direct-io migration without multifd, so let's do this
>> >> in order to keep both options decoupled.
>> >> 
>> >> We cannot simply enable the O_DIRECT flag, however, because not all IO
>> >> performed by the migration thread satisfies the alignment requirements
>> >> of O_DIRECT. There are many small read & writes that add headers and
>> >> synchronization flags to the stream, which at the moment are required
>> >> to always be present.
>> >> 
>> >> Fortunately, due to fixed-ram migration there is a discernible moment
>> >> where only RAM pages are written to the migration file. Enable
>> >> direct-io during that moment.
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > Is anyone going to consume this?  How's the performance?
>> 
>> I don't think we have a pre-determined consumer for this. This came up
>> in an internal discussion about making the interface simpler for libvirt
>> and in a thread on the libvirt mailing list[1] about using O_DIRECT to
>> keep the snapshot data out of the caches to avoid impacting the rest of
>> the system. (I could have described this better in the commit message,
>> sorry).
>> 
>> Quoting Daniel:
>> 
>>   "Note the reason for using O_DIRECT is *not* to make saving / restoring
>>    the guest VM faster. Rather it is to ensure that saving/restoring a VM
>>    does not trash the host I/O / buffer cache, which will negatively impact
>>    performance of all the *other* concurrently running VMs."
>> 
>> 1- https://lore.kernel.org/r/87sez86ztq.fsf@suse.de
>> 
>> About performance, a quick test on a stopped 30G guest, shows
>> mapped-ram=on direct-io=on it's 12% slower than mapped-ram=on
>> direct-io=off.
>
> Yes, this makes sense.
>
>> 
>> >
>> > It doesn't look super fast to me if we need to enable/disable dio in each
>> > loop.. then it's a matter of whether we should bother, or would it be
>> > easier that we simply require multifd when direct-io=on.
>> 
>> AIUI, the issue here that users are already allowed to specify in
>> libvirt the equivalent to direct-io and multifd independent of each
>> other (bypass-cache, parallel). To start requiring both together now in
>> some situations would be a regression. I confess I don't know libvirt
>> code to know whether this can be worked around somehow, but as I said,
>> it's a relatively simple change from the QEMU side.
>
> Firstly, I definitely want to already avoid all the calls to either
> migration_direct_io_start() or *_finish(), now we already need to
> explicitly call them in three paths, and that's not intuitive and less
> readable, just like the hard coded rdma codes.

Right, but that's just a side-effect of how the code is structured and
the fact that writes to the stream happen in small chunks. Setting
O_DIRECT needs to happen around aligned IO. We could move the calls
further down into qemu_put_buffer_at(), but that would be four fcntl()
calls for every page.

A tangent:
 one thing that occured to me now is that we may be able to restrict
 calls to qemu_fflush() to internal code like add_to_iovec() and maybe
 use that function to gather the correct amount of data before writing,
 making sure it disables O_DIRECT in case alignment is about to be
 broken?

>
> I also worry we may overlook the complexity here, and pinning buffers
> definitely need more thoughts on its own.  It's easier to digest when using
> multifd and when QEMU only pins guest pages just like tcp-zerocopy does,
> which are naturally host page size aligned, and also guaranteed to not be
> freed (while reused / modified is fine here, as dirty tracking guarantees a
> new page will be migrated soon again).

I don't get this at all, sorry. What is different from multifd here?
We're writing on the same HVA as the one that would be given to multifd
(if it were enabled) and dirty tracking is working the same.

> IMHO here the "not be freed / modified" is even more important than
> "alignment": the latter is about perf, the former is about correctness.
> When we do directio on random buffers, AFAIU we don't want to have the
> buffer modified before flushed to disk, and that's IMHO not easy to
> guarantee.
>
> E.g., I don't think this guarantees a flush on the buffer usages:
>
>   migration_direct_io_start()
>     /* flush any potentially unaligned IO before setting O_DIRECT */
>     qemu_fflush(file);
>
> qemu_fflush() internally does writev(), and that "flush" is about "flushing
> qemufile iov[] to fd", not "flushing buffers to disk".  I think it means
> if we do qemu_fflush() then we modify QEMUFile.buf[IO_BUF_SIZE] we're
> doomed: we will never know whether dio has happened, and which version of
> buffer will be sent; I don't think it's guaranteed it will always be the
> old version of the buffer.
>
> However the issue is, QEMUFile defines qemu_fflush() as: after call, the
> buf[] can be reused!  It suggests breaking things I guess in dio context.

I think you're mixing the usage of qemu_put_byte()/qemu_put_buffer()
with the usage of qemu_put_buffer_at(). The former two use the
QEMUFile.buf without O_DIRECT and the latter writes directly to the fd
at the page offset. So there's no issue in reusing buf before writes
have reached the disk. All writes going through buf are serialized and
all writes going through qio_channel_pwrite() go to a different offset.

I included all of these assert(!f->dio) to ensure that we don't use the
two APIs incorrectly. Mainly that we don't try to write to buf while
O_DIRECT is set.

>
> IIUC currently mapped-ram is ok because mapped-ram is just special that it
> doesn't have page headers, so it doesn't use the buf[] during iterations;
> while for zeropage it uses file_bmap bitmap and that's separate too and
> does not generate any byte on the wire either.

Right. This is all mapped-ram. I'm not proposing to enable O_DIRECT for
any migration.

>
> xbzrle could use that buf[], but maybe mapped-ram doesn't work anyway with
> xbzrle.
>
> Everything is just very not obvious and tricky to me.  This still looks
> pretty dangerous to me.  Would migration_direct_io_finish() guarantee
> something like a fdatasync()?  If so it looks safer, but still within the
> start() and finish() if someone calls qemu_fflush() and reuse the buffer we
> can still get hard to debug issues (as the outcome would be that we saw
> corrupted migration files).
>
>> 
>> Another option which would be for libvirt to keep using multifd, but
>> make it 1 channel only if --parallel is not specified. That might be
>> enough to solve the interface issues. Of course, it's a different code
>> altogether than the usual precopy code that gets executed when
>> multifd=off, I don't know whether that could be an issue somehow.
>
> Would there be any comment from Libvirt side?  This sounds like a good
> solution if my above concern is real; as long as we always stick dio with
> guest pages we'll be all fine.
>
> Thanks,


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-10 17:45         ` Fabiano Rosas
@ 2024-06-10 19:02           ` Peter Xu
  2024-06-10 19:07             ` Daniel P. Berrangé
  2024-06-10 20:12             ` Fabiano Rosas
  0 siblings, 2 replies; 60+ messages in thread
From: Peter Xu @ 2024-06-10 19:02 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> >> AIUI, the issue here that users are already allowed to specify in
> >> libvirt the equivalent to direct-io and multifd independent of each
> >> other (bypass-cache, parallel). To start requiring both together now in
> >> some situations would be a regression. I confess I don't know libvirt
> >> code to know whether this can be worked around somehow, but as I said,
> >> it's a relatively simple change from the QEMU side.
> >
> > Firstly, I definitely want to already avoid all the calls to either
> > migration_direct_io_start() or *_finish(), now we already need to
> > explicitly call them in three paths, and that's not intuitive and less
> > readable, just like the hard coded rdma codes.
> 
> Right, but that's just a side-effect of how the code is structured and
> the fact that writes to the stream happen in small chunks. Setting
> O_DIRECT needs to happen around aligned IO. We could move the calls
> further down into qemu_put_buffer_at(), but that would be four fcntl()
> calls for every page.

Hmm.. why we need four fcntl()s instead of two?

> 
> A tangent:
>  one thing that occured to me now is that we may be able to restrict
>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>  use that function to gather the correct amount of data before writing,
>  making sure it disables O_DIRECT in case alignment is about to be
>  broken?

IIUC dio doesn't require alignment if we don't care about perf?  I meant it
should be legal to write(fd, buffer, 5) even if O_DIRECT?

I just noticed the asserts you added in previous patch, I think that's
better indeed, but still I'm wondering whether we can avoid enabling it on
qemufile.

It makes me feel slightly nervous when introducing dio to QEMUFile rather
than iochannels - the API design of QEMUFile seems to easily encourage
breaking things in dio worlds with a default and static buffering. And if
we're going to blacklist most of the API anyway except the new one for
mapped-ram, I start to wonder too why bother on top of QEMUFile anyway.

IIRC you also mentioned in the previous doc patch so that libvirt should
always pass in two fds anyway to the fdset if dio is enabled.  I wonder
whether it's also true for multifd=off and directio=on, then would it be
possible to use the dio for guest pages with one fd, while keeping the
normal stream to use !dio with the other fd.  I'm not sure whether it's
easy to avoid qemufile in the dio fd, even if not looks like we may avoid
frequent fcntl()s?

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-10 19:02           ` Peter Xu
@ 2024-06-10 19:07             ` Daniel P. Berrangé
  2024-06-10 20:12             ` Fabiano Rosas
  1 sibling, 0 replies; 60+ messages in thread
From: Daniel P. Berrangé @ 2024-06-10 19:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fabiano Rosas, qemu-devel, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Mon, Jun 10, 2024 at 03:02:10PM -0400, Peter Xu wrote:
> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> > >> AIUI, the issue here that users are already allowed to specify in
> > >> libvirt the equivalent to direct-io and multifd independent of each
> > >> other (bypass-cache, parallel). To start requiring both together now in
> > >> some situations would be a regression. I confess I don't know libvirt
> > >> code to know whether this can be worked around somehow, but as I said,
> > >> it's a relatively simple change from the QEMU side.
> > >
> > > Firstly, I definitely want to already avoid all the calls to either
> > > migration_direct_io_start() or *_finish(), now we already need to
> > > explicitly call them in three paths, and that's not intuitive and less
> > > readable, just like the hard coded rdma codes.
> > 
> > Right, but that's just a side-effect of how the code is structured and
> > the fact that writes to the stream happen in small chunks. Setting
> > O_DIRECT needs to happen around aligned IO. We could move the calls
> > further down into qemu_put_buffer_at(), but that would be four fcntl()
> > calls for every page.
> 
> Hmm.. why we need four fcntl()s instead of two?
> 
> > 
> > A tangent:
> >  one thing that occured to me now is that we may be able to restrict
> >  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
> >  use that function to gather the correct amount of data before writing,
> >  making sure it disables O_DIRECT in case alignment is about to be
> >  broken?
> 
> IIUC dio doesn't require alignment if we don't care about perf?  I meant it
> should be legal to write(fd, buffer, 5) even if O_DIRECT?

No, we must assume  that O_DIRECT requires alignment both of the userspace
memory buffers, and the file offset on disk:

[quote man(open)]
  O_DIRECT
       The O_DIRECT flag may impose alignment restrictions  on  the  length
       and  address  of user-space buffers and the file offset of I/Os.  In
       Linux alignment restrictions vary by filesystem and  kernel  version
       and  might  be absent entirely.  The handling of misaligned O_DIRECT
       I/Os also varies; they can either fail with EINVAL or fall  back  to
       buffered I/O.
[/quote]

Given QEMU's code base, it is only safe for us to use O_DIRECT with RAM
blocks where we have predictable in-memory alignment, and have defined
a good on-disk offset alignment too.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-10 19:02           ` Peter Xu
  2024-06-10 19:07             ` Daniel P. Berrangé
@ 2024-06-10 20:12             ` Fabiano Rosas
  2024-06-12 18:08               ` Fabiano Rosas
  1 sibling, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-06-10 20:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

Peter Xu <peterx@redhat.com> writes:

> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
>> >> AIUI, the issue here that users are already allowed to specify in
>> >> libvirt the equivalent to direct-io and multifd independent of each
>> >> other (bypass-cache, parallel). To start requiring both together now in
>> >> some situations would be a regression. I confess I don't know libvirt
>> >> code to know whether this can be worked around somehow, but as I said,
>> >> it's a relatively simple change from the QEMU side.
>> >
>> > Firstly, I definitely want to already avoid all the calls to either
>> > migration_direct_io_start() or *_finish(), now we already need to
>> > explicitly call them in three paths, and that's not intuitive and less
>> > readable, just like the hard coded rdma codes.
>> 
>> Right, but that's just a side-effect of how the code is structured and
>> the fact that writes to the stream happen in small chunks. Setting
>> O_DIRECT needs to happen around aligned IO. We could move the calls
>> further down into qemu_put_buffer_at(), but that would be four fcntl()
>> calls for every page.
>
> Hmm.. why we need four fcntl()s instead of two?

Because we need to first get the flags before flipping the O_DIRECT
bit. And we do this once to enable and once to disable.

    int flags = fcntl(fioc->fd, F_GETFL);
    if (enabled) {
        flags |= O_DIRECT;
    } else {
        flags &= ~O_DIRECT;
    }
    fcntl(fioc->fd, F_SETFL, flags);

>> 
>> A tangent:
>>  one thing that occured to me now is that we may be able to restrict
>>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>>  use that function to gather the correct amount of data before writing,
>>  making sure it disables O_DIRECT in case alignment is about to be
>>  broken?
>
> IIUC dio doesn't require alignment if we don't care about perf?  I meant it
> should be legal to write(fd, buffer, 5) even if O_DIRECT?

No, we may get an -EINVAL. See Daniel's reply.

>
> I just noticed the asserts you added in previous patch, I think that's
> better indeed, but still I'm wondering whether we can avoid enabling it on
> qemufile.
>
> It makes me feel slightly nervous when introducing dio to QEMUFile rather
> than iochannels - the API design of QEMUFile seems to easily encourage
> breaking things in dio worlds with a default and static buffering. And if
> we're going to blacklist most of the API anyway except the new one for
> mapped-ram, I start to wonder too why bother on top of QEMUFile anyway.
>
> IIRC you also mentioned in the previous doc patch so that libvirt should
> always pass in two fds anyway to the fdset if dio is enabled.  I wonder
> whether it's also true for multifd=off and directio=on, then would it be
> possible to use the dio for guest pages with one fd, while keeping the
> normal stream to use !dio with the other fd.  I'm not sure whether it's
> easy to avoid qemufile in the dio fd, even if not looks like we may avoid
> frequent fcntl()s?

Hm, sounds like a good idea. We'd need a place to put that new ioc
though. Either QEMUFile.direct_ioc and then make use of it in
qemu_put_buffer_at() or a more transparent QIOChannelFile.direct_fd that
gets set somewhere during file_start_outgoing_migration(). Let me try to
come up with something.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-10 20:12             ` Fabiano Rosas
@ 2024-06-12 18:08               ` Fabiano Rosas
  2024-06-12 18:15                 ` Daniel P. Berrangé
  0 siblings, 1 reply; 60+ messages in thread
From: Fabiano Rosas @ 2024-06-12 18:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, berrange, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
>>> >> AIUI, the issue here that users are already allowed to specify in
>>> >> libvirt the equivalent to direct-io and multifd independent of each
>>> >> other (bypass-cache, parallel). To start requiring both together now in
>>> >> some situations would be a regression. I confess I don't know libvirt
>>> >> code to know whether this can be worked around somehow, but as I said,
>>> >> it's a relatively simple change from the QEMU side.
>>> >
>>> > Firstly, I definitely want to already avoid all the calls to either
>>> > migration_direct_io_start() or *_finish(), now we already need to
>>> > explicitly call them in three paths, and that's not intuitive and less
>>> > readable, just like the hard coded rdma codes.
>>> 
>>> Right, but that's just a side-effect of how the code is structured and
>>> the fact that writes to the stream happen in small chunks. Setting
>>> O_DIRECT needs to happen around aligned IO. We could move the calls
>>> further down into qemu_put_buffer_at(), but that would be four fcntl()
>>> calls for every page.
>>
>> Hmm.. why we need four fcntl()s instead of two?
>
> Because we need to first get the flags before flipping the O_DIRECT
> bit. And we do this once to enable and once to disable.
>
>     int flags = fcntl(fioc->fd, F_GETFL);
>     if (enabled) {
>         flags |= O_DIRECT;
>     } else {
>         flags &= ~O_DIRECT;
>     }
>     fcntl(fioc->fd, F_SETFL, flags);
>
>>> 
>>> A tangent:
>>>  one thing that occured to me now is that we may be able to restrict
>>>  calls to qemu_fflush() to internal code like add_to_iovec() and maybe
>>>  use that function to gather the correct amount of data before writing,
>>>  making sure it disables O_DIRECT in case alignment is about to be
>>>  broken?
>>
>> IIUC dio doesn't require alignment if we don't care about perf?  I meant it
>> should be legal to write(fd, buffer, 5) even if O_DIRECT?
>
> No, we may get an -EINVAL. See Daniel's reply.
>
>>
>> I just noticed the asserts you added in previous patch, I think that's
>> better indeed, but still I'm wondering whether we can avoid enabling it on
>> qemufile.
>>
>> It makes me feel slightly nervous when introducing dio to QEMUFile rather
>> than iochannels - the API design of QEMUFile seems to easily encourage
>> breaking things in dio worlds with a default and static buffering. And if
>> we're going to blacklist most of the API anyway except the new one for
>> mapped-ram, I start to wonder too why bother on top of QEMUFile anyway.
>>
>> IIRC you also mentioned in the previous doc patch so that libvirt should
>> always pass in two fds anyway to the fdset if dio is enabled.  I wonder
>> whether it's also true for multifd=off and directio=on, then would it be
>> possible to use the dio for guest pages with one fd, while keeping the
>> normal stream to use !dio with the other fd.  I'm not sure whether it's
>> easy to avoid qemufile in the dio fd, even if not looks like we may avoid
>> frequent fcntl()s?
>
> Hm, sounds like a good idea. We'd need a place to put that new ioc
> though. Either QEMUFile.direct_ioc and then make use of it in
> qemu_put_buffer_at() or a more transparent QIOChannelFile.direct_fd that
> gets set somewhere during file_start_outgoing_migration(). Let me try to
> come up with something.

I looked into this and it's cumbersome:

- We'd need to check migrate_direct_io() several times, once to get the
  second fd and during every IO to know to use the fd.

- Even getting the second fd is not straight forward, we need to create
  a new ioc for it with qio_channel_new_path(). But QEMUFile is generic
  code, so we'd probably need to call this channel-file specific
  function from migration_channel_connect().

- With the new ioc, do we put it in QEMUFile, or do we take the fd only?
  Or maybe an ioc with two fds? Or a new QIOChannelDirect? All options
  look bad to me.

So I suggest we proceed proceed with the 1 multifd channel approach,
passing 2 fds into QEMU just like we do for the n channels. Is that ok
from libvirt's perspective? I assume libvirt users are mostly interested
in _enabling_ parallelism with --parallel, instead of _avoiding_ it with
the ommision of the option, so main thread + 1 channel should not be a
bad thing.

Choosing to use 1 multifd channel now is also a gentler introduction for
when we finally move all of the vmstate migration into multifd (I've
been looking into this, but don't hold your breaths).


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-12 18:08               ` Fabiano Rosas
@ 2024-06-12 18:15                 ` Daniel P. Berrangé
  2024-06-12 18:27                   ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Daniel P. Berrangé @ 2024-06-12 18:15 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, qemu-devel, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Wed, Jun 12, 2024 at 03:08:02PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas <farosas@suse.de> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> On Mon, Jun 10, 2024 at 02:45:53PM -0300, Fabiano Rosas wrote:
> I looked into this and it's cumbersome:
> 
> - We'd need to check migrate_direct_io() several times, once to get the
>   second fd and during every IO to know to use the fd.
> 
> - Even getting the second fd is not straight forward, we need to create
>   a new ioc for it with qio_channel_new_path(). But QEMUFile is generic
>   code, so we'd probably need to call this channel-file specific
>   function from migration_channel_connect().
> 
> - With the new ioc, do we put it in QEMUFile, or do we take the fd only?
>   Or maybe an ioc with two fds? Or a new QIOChannelDirect? All options
>   look bad to me.
> 
> So I suggest we proceed proceed with the 1 multifd channel approach,
> passing 2 fds into QEMU just like we do for the n channels. Is that ok
> from libvirt's perspective? I assume libvirt users are mostly interested
> in _enabling_ parallelism with --parallel, instead of _avoiding_ it with
> the ommision of the option, so main thread + 1 channel should not be a
> bad thing.

IIUC, with the "fixed-ram" feature, the on-disk format of a saved VM
should end up the same whether we're using traditional migration, or
multifd migration. Use of multifd is simply an optimization that lets
us write RAM in parallel to the file, with direct-io further optimizing.

There's also a clear break with libvirt between the existing on-disk
format libvirt uses, and the new fixed-ram format. So we have no backwards
compatibilty concerns added from multifd, beyond what we already have to
figure out when deciding on use of 'fixed-ram'. 

Thus I believe there is no downside to always using multifd for save
images with fixed-ram, even if we only want nchannels=1.


> Choosing to use 1 multifd channel now is also a gentler introduction for
> when we finally move all of the vmstate migration into multifd (I've
> been looking into this, but don't hold your breaths).

Yes, future proofing is a good idea.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-12 18:15                 ` Daniel P. Berrangé
@ 2024-06-12 18:27                   ` Peter Xu
  2024-06-12 18:44                     ` Fabiano Rosas
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2024-06-12 18:27 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Fabiano Rosas, qemu-devel, armbru, Claudio Fontana, Jim Fehlig,
	Thomas Huth, Laurent Vivier, Paolo Bonzini

On Wed, Jun 12, 2024 at 07:15:19PM +0100, Daniel P. Berrangé wrote:
> IIUC, with the "fixed-ram" feature, the on-disk format of a saved VM
> should end up the same whether we're using traditional migration, or
> multifd migration. Use of multifd is simply an optimization that lets
> us write RAM in parallel to the file, with direct-io further optimizing.
> 
> There's also a clear break with libvirt between the existing on-disk
> format libvirt uses, and the new fixed-ram format. So we have no backwards
> compatibilty concerns added from multifd, beyond what we already have to
> figure out when deciding on use of 'fixed-ram'. 
> 
> Thus I believe there is no downside to always using multifd for save
> images with fixed-ram, even if we only want nchannels=1.

That sounds good.

Just to double check with all of us: so we allow mapped-ram to be used in
whatever case when !dio, however we restrict dio only when with multifd=on,
am I right?

I'd personally like that, and it also pretty much matches with what we have
with tcp zerocopy send too. After all they're really similar stuff to me on
pinning implications and locked_vm restrictions.  It's just that the target
of the data movement is different here, either to NIC, or to/from a file.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration
  2024-06-12 18:27                   ` Peter Xu
@ 2024-06-12 18:44                     ` Fabiano Rosas
  0 siblings, 0 replies; 60+ messages in thread
From: Fabiano Rosas @ 2024-06-12 18:44 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: qemu-devel, armbru, Claudio Fontana, Jim Fehlig, Thomas Huth,
	Laurent Vivier, Paolo Bonzini

Peter Xu <peterx@redhat.com> writes:

> On Wed, Jun 12, 2024 at 07:15:19PM +0100, Daniel P. Berrangé wrote:
>> IIUC, with the "fixed-ram" feature, the on-disk format of a saved VM
>> should end up the same whether we're using traditional migration, or
>> multifd migration. Use of multifd is simply an optimization that lets
>> us write RAM in parallel to the file, with direct-io further optimizing.
>> 
>> There's also a clear break with libvirt between the existing on-disk
>> format libvirt uses, and the new fixed-ram format. So we have no backwards
>> compatibilty concerns added from multifd, beyond what we already have to
>> figure out when deciding on use of 'fixed-ram'. 
>> 
>> Thus I believe there is no downside to always using multifd for save
>> images with fixed-ram, even if we only want nchannels=1.
>
> That sounds good.
>
> Just to double check with all of us: so we allow mapped-ram to be used in
> whatever case when !dio, however we restrict dio only when with multifd=on,
> am I right?

Yes. The restricting part is not yet in place. I'll add a multifd check
to migrate_direct_io():

bool migrate_direct_io(void)
{
    MigrationState *s = migrate_get_current();

    return s->parameters.direct_io &&
        s->capabilities[MIGRATION_CAPABILITY_MAPPED_RAM] &&
        s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
}


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2024-06-12 18:45 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-23 19:05 [PATCH v2 00/18] migration/mapped-ram: Add direct-io support Fabiano Rosas
2024-05-23 19:05 ` [PATCH v2 01/18] migration: Fix file migration with fdset Fabiano Rosas
2024-05-24 10:51   ` Prasad Pandit
2024-05-24 12:30     ` Fabiano Rosas
2024-05-25  6:16       ` Prasad Pandit
2024-05-30 16:11   ` Peter Xu
2024-05-31 14:58     ` Fabiano Rosas
2024-06-03 10:20   ` Daniel P. Berrangé
2024-05-23 19:05 ` [PATCH v2 02/18] tests/qtest/migration: Fix file migration offset check Fabiano Rosas
2024-05-30 16:14   ` Peter Xu
2024-06-03 10:21   ` Daniel P. Berrangé
2024-05-23 19:05 ` [PATCH v2 03/18] tests/qtest/migration: Add a precopy file test with fdset Fabiano Rosas
2024-05-30 16:18   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 04/18] monitor: Drop monitor_fdset_dup_fd_add() Fabiano Rosas
2024-06-03 10:26   ` Daniel P. Berrangé
2024-05-23 19:05 ` [PATCH v2 05/18] monitor: Introduce monitor_fdset_*free Fabiano Rosas
2024-05-30 20:03   ` Peter Xu
2024-05-31 15:01     ` Fabiano Rosas
2024-05-23 19:05 ` [PATCH v2 06/18] monitor: Stop removing non-duplicated fds Fabiano Rosas
2024-05-30 21:05   ` Peter Xu
2024-05-31 15:25     ` Fabiano Rosas
2024-05-31 15:56       ` Peter Xu
2024-06-04 23:40       ` Dr. David Alan Gilbert
2024-06-05 12:31         ` Fabiano Rosas
2024-05-23 19:05 ` [PATCH v2 07/18] monitor: Simplify fdset and fd removal Fabiano Rosas
2024-05-31 15:58   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 08/18] monitor: Report errors from monitor_fdset_dup_fd_add Fabiano Rosas
2024-05-30 21:08   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 09/18] io: Stop using qemu_open_old in channel-file Fabiano Rosas
2024-05-30 21:10   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 10/18] migration: Add direct-io parameter Fabiano Rosas
2024-05-30 21:12   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 11/18] migration/multifd: Add direct-io support Fabiano Rosas
2024-05-30 21:35   ` Peter Xu
2024-05-31 15:27     ` Fabiano Rosas
2024-05-23 19:05 ` [PATCH v2 12/18] tests/qtest/migration: Add tests for file migration with direct-io Fabiano Rosas
2024-05-23 19:05 ` [PATCH v2 13/18] monitor: fdset: Match against O_DIRECT Fabiano Rosas
2024-05-30 21:41   ` Peter Xu
2024-05-31 15:42     ` Fabiano Rosas
2024-05-31 15:58       ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 14/18] migration: Add documentation for fdset with multifd + file Fabiano Rosas
2024-06-04 20:46   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 15/18] tests/qtest/migration: Add a test for mapped-ram with passing of fds Fabiano Rosas
2024-06-04 20:51   ` Peter Xu
2024-05-23 19:05 ` [PATCH v2 16/18] io/channel-file: Add direct-io support Fabiano Rosas
2024-06-03 10:32   ` Daniel P. Berrangé
2024-05-23 19:05 ` [PATCH v2 17/18] migration: Add direct-io helpers Fabiano Rosas
2024-05-23 19:05 ` [PATCH v2 18/18] migration/ram: Add direct-io support to precopy file migration Fabiano Rosas
2024-06-04 20:56   ` Peter Xu
2024-06-07 18:42     ` Fabiano Rosas
2024-06-07 20:39       ` Jim Fehlig
2024-06-10 16:09       ` Peter Xu
2024-06-10 17:45         ` Fabiano Rosas
2024-06-10 19:02           ` Peter Xu
2024-06-10 19:07             ` Daniel P. Berrangé
2024-06-10 20:12             ` Fabiano Rosas
2024-06-12 18:08               ` Fabiano Rosas
2024-06-12 18:15                 ` Daniel P. Berrangé
2024-06-12 18:27                   ` Peter Xu
2024-06-12 18:44                     ` Fabiano Rosas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).