* [RFC PATCH v3 01/30] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-10 8:49 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface Fabiano Rosas
` (29 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
From: Nikolay Borisov <nborisov@suse.com>
Add a generic QIOChannel feature SEEKABLE, which will be used by the
qemu_file* APIs. For the time being it is only implemented for file
channels.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
include/io/channel.h | 1 +
io/channel-file.c | 8 ++++++++
2 files changed, 9 insertions(+)
diff --git a/include/io/channel.h b/include/io/channel.h
index 5f9dbaab65..fcb19fd672 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -44,6 +44,7 @@ enum QIOChannelFeature {
QIO_CHANNEL_FEATURE_LISTEN,
QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
QIO_CHANNEL_FEATURE_READ_MSG_PEEK,
+ QIO_CHANNEL_FEATURE_SEEKABLE,
};
diff --git a/io/channel-file.c b/io/channel-file.c
index 4a12c61886..f91bf6db1c 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -36,6 +36,10 @@ qio_channel_file_new_fd(int fd)
ioc->fd = fd;
+ if (lseek(fd, 0, SEEK_CUR) != (off_t)-1) {
+ qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+ }
+
trace_qio_channel_file_new_fd(ioc, fd);
return ioc;
@@ -60,6 +64,10 @@ qio_channel_file_new_path(const char *path,
return NULL;
}
+ if (lseek(ioc->fd, 0, SEEK_CUR) != (off_t)-1) {
+ qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_SEEKABLE);
+ }
+
trace_qio_channel_file_new_path(ioc, path, flags, mode, ioc->fd);
return ioc;
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
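The probe above relies on lseek() failing (with ESPIPE) on pipes and
sockets; consumers then test the bit with qio_channel_has_feature()
before attempting positioned I/O. A minimal sketch, assuming the QEMU
tree's io/channel-file.h and error APIs (open_seekable() itself is an
invented helper for illustration):

    #include "qemu/osdep.h"
    #include "io/channel-file.h"
    #include "qapi/error.h"

    static QIOChannel *open_seekable(const char *path, Error **errp)
    {
        /* On a regular file lseek(fd, 0, SEEK_CUR) succeeds, so the
         * constructor above sets QIO_CHANNEL_FEATURE_SEEKABLE. */
        QIOChannelFile *fioc = qio_channel_file_new_path(path,
                                                         O_RDWR | O_CREAT,
                                                         0600, errp);
        if (!fioc) {
            return NULL;
        }

        if (!qio_channel_has_feature(QIO_CHANNEL(fioc),
                                     QIO_CHANNEL_FEATURE_SEEKABLE)) {
            error_setg(errp, "%s is not seekable (pipe or FIFO?)", path);
            object_unref(OBJECT(fioc));
            return NULL;
        }
        return QIO_CHANNEL(fioc);
    }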
* Re: [RFC PATCH v3 01/30] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
2023-11-27 20:25 ` [RFC PATCH v3 01/30] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
@ 2024-01-10 8:49 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-10 8:49 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:43PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> Add a generic QIOChannel feature SEEKABLE which would be used by the
> qemu_file* apis. For the time being this will be only implemented for
> file channels.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
2023-11-27 20:25 ` [RFC PATCH v3 01/30] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-10 9:07 ` Daniel P. Berrangé
2024-01-11 6:59 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
` (28 subsequent siblings)
30 siblings, 2 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
From: Nikolay Borisov <nborisov@suse.com>
Introduce basic pwritev/preadv support in the generic channel layer. A
specific implementation for the file channel will follow, since it is
required to support migration streams with a fixed location for each
RAM page.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- fixed naming: s/pwritev_full/pwritev
---
include/io/channel.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
io/channel.c | 58 +++++++++++++++++++++++++++++++
2 files changed, 140 insertions(+)
diff --git a/include/io/channel.h b/include/io/channel.h
index fcb19fd672..7986c49c71 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -131,6 +131,16 @@ struct QIOChannelClass {
Error **errp);
/* Optional callbacks */
+ ssize_t (*io_pwritev)(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ off_t offset,
+ Error **errp);
+ ssize_t (*io_preadv)(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ off_t offset,
+ Error **errp);
int (*io_shutdown)(QIOChannel *ioc,
QIOChannelShutdown how,
Error **errp);
@@ -529,6 +539,78 @@ void qio_channel_set_follow_coroutine_ctx(QIOChannel *ioc, bool enabled);
int qio_channel_close(QIOChannel *ioc,
Error **errp);
+/**
+ * qio_channel_pwritev
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves like qio_channel_writev_full, except that it does not
+ * support sending of file handles and the write begins at the
+ * passed @offset.
+ *
+ */
+ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_pwrite
+ * @ioc: the channel object
+ * @buf: the memory region to write data from
+ * @buflen: the number of bytes in @buf to write
+ * @offset: offset in the channel where writes should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
+ off_t offset, Error **errp);
+
+/**
+ * qio_channel_preadv
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data into
+ * @niov: the length of the @iov array
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ * Behaves like qio_channel_readv_full, except that it does not
+ * support receiving of file handles and the read begins at the
+ * passed @offset.
+ *
+ */
+ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp);
+
+/**
+ * qio_channel_pread
+ * @ioc: the channel object
+ * @buf: the memory region to read data into
+ * @buflen: the number of bytes to read into @buf
+ * @offset: offset in the channel where reads should begin
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Not all implementations will support this facility, so may report
+ * an error. To avoid errors, the caller may check for the feature
+ * flag QIO_CHANNEL_FEATURE_SEEKABLE prior to calling this method.
+ *
+ */
+ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
+ off_t offset, Error **errp);
+
/**
* qio_channel_shutdown:
* @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 86c5834510..a1f12f8e90 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -454,6 +454,64 @@ GSource *qio_channel_add_watch_source(QIOChannel *ioc,
}
+ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp)
+{
+ QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+ if (!klass->io_pwritev) {
+ error_setg(errp, "Channel does not support pwritev");
+ return -1;
+ }
+
+ if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+ error_setg_errno(errp, EINVAL, "Requested channel is not seekable");
+ return -1;
+ }
+
+ return klass->io_pwritev(ioc, iov, niov, offset, errp);
+}
+
+ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
+ off_t offset, Error **errp)
+{
+ struct iovec iov = {
+ .iov_base = buf,
+ .iov_len = buflen
+ };
+
+ return qio_channel_pwritev(ioc, &iov, 1, offset, errp);
+}
+
+ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp)
+{
+ QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+ if (!klass->io_preadv) {
+ error_setg(errp, "Channel does not support preadv");
+ return -1;
+ }
+
+ if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SEEKABLE)) {
+ error_setg_errno(errp, EINVAL, "Requested channel is not seekable");
+ return -1;
+ }
+
+ return klass->io_preadv(ioc, iov, niov, offset, errp);
+}
+
+ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
+ off_t offset, Error **errp)
+{
+ struct iovec iov = {
+ .iov_base = buf,
+ .iov_len = buflen
+ };
+
+ return qio_channel_preadv(ioc, &iov, 1, offset, errp);
+}
+
int qio_channel_shutdown(QIOChannel *ioc,
QIOChannelShutdown how,
Error **errp)
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
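To make the intended call pattern concrete, here is a hedged sketch
using only the signatures added above; roundtrip_page() and the slot
offsets are invented, and the real callers arrive later in the series:

    /* Write one page at its fixed slot in the file, then read it back.
     * Both calls fail with EINVAL unless the channel advertises
     * QIO_CHANNEL_FEATURE_SEEKABLE, per the checks in io/channel.c. */
    static int roundtrip_page(QIOChannel *ioc, uint8_t *page,
                              size_t page_size, off_t slot, Error **errp)
    {
        ssize_t ret = qio_channel_pwrite(ioc, (char *)page, page_size,
                                         slot, errp);
        if (ret < 0) {
            return -1;
        }

        /* Positioned I/O does not move the channel's current offset. */
        ret = qio_channel_pread(ioc, (char *)page, page_size, slot, errp);
        return ret < 0 ? -1 : 0;
    }

Note that, like pread(2)/pwrite(2), these may return short counts
without setting an error; callers that need the full length must check
the return value (this comes up again in the review of patch 05).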
* Re: [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface
2023-11-27 20:25 ` [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface Fabiano Rosas
@ 2024-01-10 9:07 ` Daniel P. Berrangé
2024-01-11 6:59 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-10 9:07 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:44PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> Introduce basic pwritev/preadv support in the generic channel layer.
> Specific implementation will follow for the file channel as this is
> required in order to support migration streams with fixed location of
> each ram page.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - fixed naming: s/pwritev_full/pwritev
> ---
> include/io/channel.h | 82 ++++++++++++++++++++++++++++++++++++++++++++
> io/channel.c | 58 +++++++++++++++++++++++++++++++
> 2 files changed, 140 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface
2023-11-27 20:25 ` [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface Fabiano Rosas
2024-01-10 9:07 ` Daniel P. Berrangé
@ 2024-01-11 6:59 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-11 6:59 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:44PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> Introduce basic pwritev/preadv support in the generic channel layer.
> Specific implementation will follow for the file channel as this is
> required in order to support migration streams with fixed location of
> each ram page.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
2023-11-27 20:25 ` [RFC PATCH v3 01/30] io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file Fabiano Rosas
2023-11-27 20:25 ` [RFC PATCH v3 02/30] io: Add generic pwritev/preadv interface Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-10 9:08 ` Daniel P. Berrangé
2024-01-11 7:04 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 04/30] io: fsync before closing a file channel Fabiano Rosas
` (27 subsequent siblings)
30 siblings, 2 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
From: Nikolay Borisov <nborisov@suse.com>
The upcoming 'fixed-ram' feature will require qemu to write data to
(and restore from) specific offsets of the migration file.
Add a minimal implementation of pwritev/preadv and expose them via the
io_pwritev and io_preadv interfaces.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- check CONFIG_PREADV to avoid breaking Windows
---
io/channel-file.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/io/channel-file.c b/io/channel-file.c
index f91bf6db1c..a6ad7770c6 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -146,6 +146,58 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
return ret;
}
+#ifdef CONFIG_PREADV
+static ssize_t qio_channel_file_preadv(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ off_t offset,
+ Error **errp)
+{
+ QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+ ssize_t ret;
+
+ retry:
+ ret = preadv(fioc->fd, iov, niov, offset);
+ if (ret < 0) {
+ if (errno == EAGAIN) {
+ return QIO_CHANNEL_ERR_BLOCK;
+ }
+ if (errno == EINTR) {
+ goto retry;
+ }
+
+ error_setg_errno(errp, errno, "Unable to read from file");
+ return -1;
+ }
+
+ return ret;
+}
+
+static ssize_t qio_channel_file_pwritev(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov,
+ off_t offset,
+ Error **errp)
+{
+ QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+ ssize_t ret;
+
+ retry:
+ ret = pwritev(fioc->fd, iov, niov, offset);
+ if (ret <= 0) {
+ if (errno == EAGAIN) {
+ return QIO_CHANNEL_ERR_BLOCK;
+ }
+ if (errno == EINTR) {
+ goto retry;
+ }
+ error_setg_errno(errp, errno, "Unable to write to file");
+ return -1;
+ }
+ return ret;
+}
+#endif /* CONFIG_PREADV */
+
static int qio_channel_file_set_blocking(QIOChannel *ioc,
bool enabled,
Error **errp)
@@ -231,6 +283,10 @@ static void qio_channel_file_class_init(ObjectClass *klass,
ioc_klass->io_writev = qio_channel_file_writev;
ioc_klass->io_readv = qio_channel_file_readv;
ioc_klass->io_set_blocking = qio_channel_file_set_blocking;
+#ifdef CONFIG_PREADV
+ ioc_klass->io_pwritev = qio_channel_file_pwritev;
+ ioc_klass->io_preadv = qio_channel_file_preadv;
+#endif
ioc_klass->io_seek = qio_channel_file_seek;
ioc_klass->io_close = qio_channel_file_close;
ioc_klass->io_create_watch = qio_channel_file_create_watch;
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
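For reference, a sketch of what the vectored interface buys over a
single-buffer pwrite: several buffers go out in one syscall at the
given offset. The names here are invented; only qio_channel_pwritev()
comes from the series:

    static ssize_t write_header_and_page(QIOChannel *ioc,
                                         void *hdr, size_t hdr_len,
                                         void *page, size_t page_len,
                                         off_t offset, Error **errp)
    {
        /* Gathered into a single pwritev(); EINTR is retried inside
         * qio_channel_file_pwritev() above. */
        struct iovec iov[2] = {
            { .iov_base = hdr,  .iov_len = hdr_len  },
            { .iov_base = page, .iov_len = page_len },
        };

        return qio_channel_pwritev(ioc, iov, 2, offset, errp);
    }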
* Re: [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile
2023-11-27 20:25 ` [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
@ 2024-01-10 9:08 ` Daniel P. Berrangé
2024-01-11 7:04 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-10 9:08 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:45PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> The upcoming 'fixed-ram' feature will require qemu to write data to
> (and restore from) specific offsets of the migration file.
>
> Add a minimal implementation of pwritev/preadv and expose them via the
> io_pwritev and io_preadv interfaces.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - check CONFIG_PREADV to avoid breaking Windows
> ---
> io/channel-file.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile
2023-11-27 20:25 ` [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
2024-01-10 9:08 ` Daniel P. Berrangé
@ 2024-01-11 7:04 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-11 7:04 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:45PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> The upcoming 'fixed-ram' feature will require qemu to write data to
> (and restore from) specific offsets of the migration file.
>
> Add a minimal implementation of pwritev/preadv and expose them via the
> io_pwritev and io_preadv interfaces.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 04/30] io: fsync before closing a file channel
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (2 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 03/30] io: implement io_pwritev/preadv for QIOChannelFile Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-10 9:04 ` Daniel P. Berrangé
2024-01-11 8:44 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
` (26 subsequent siblings)
30 siblings, 2 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Make sure the data is flushed to disk before closing file
channels. This will ensure data is on disk at the end of a migration
to file.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
io/channel-file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/io/channel-file.c b/io/channel-file.c
index a6ad7770c6..d4706fa592 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -242,6 +242,11 @@ static int qio_channel_file_close(QIOChannel *ioc,
{
QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
+ if (qemu_fdatasync(fioc->fd) < 0) {
+ error_setg_errno(errp, errno,
+ "Unable to synchronize file data with storage device");
+ return -1;
+ }
if (qemu_close(fioc->fd) < 0) {
error_setg_errno(errp, errno,
"Unable to close file");
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2023-11-27 20:25 ` [RFC PATCH v3 04/30] io: fsync before closing a file channel Fabiano Rosas
@ 2024-01-10 9:04 ` Daniel P. Berrangé
2024-01-11 8:44 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-10 9:04 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:46PM -0300, Fabiano Rosas wrote:
> Make sure the data is flushed to disk before closing file
> channels. This will ensure data is on disk at the end of a migration
> to file.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> io/channel-file.c | 5 +++++
> 1 file changed, 5 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Acked-by: Daniel P. Berrangé <berrange@redhat.com>
>
> diff --git a/io/channel-file.c b/io/channel-file.c
> index a6ad7770c6..d4706fa592 100644
> --- a/io/channel-file.c
> +++ b/io/channel-file.c
> @@ -242,6 +242,11 @@ static int qio_channel_file_close(QIOChannel *ioc,
> {
> QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
>
> + if (qemu_fdatasync(fioc->fd) < 0) {
> + error_setg_errno(errp, errno,
> + "Unable to synchronize file data with storage device");
> + return -1;
> + }
> if (qemu_close(fioc->fd) < 0) {
> error_setg_errno(errp, errno,
> "Unable to close file");
> --
> 2.35.3
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2023-11-27 20:25 ` [RFC PATCH v3 04/30] io: fsync before closing a file channel Fabiano Rosas
2024-01-10 9:04 ` Daniel P. Berrangé
@ 2024-01-11 8:44 ` Peter Xu
2024-01-11 18:46 ` Fabiano Rosas
1 sibling, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-11 8:44 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:46PM -0300, Fabiano Rosas wrote:
> Make sure the data is flushed to disk before closing file
> channels. This will ensure data is on disk at the end of a migration
> to file.
Looks reasonable, but just two (possibly naive) questions:
(1) Does this apply to all io channel users, or only migration?
(2) Why metadata doesn't matter (v.s. fsync(), when CONFIG_FDATASYNC=y)?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-11 8:44 ` Peter Xu
@ 2024-01-11 18:46 ` Fabiano Rosas
2024-01-12 0:01 ` Peter Xu
2024-01-15 8:57 ` Peter Xu
0 siblings, 2 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-11 18:46 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:46PM -0300, Fabiano Rosas wrote:
>> Make sure the data is flushed to disk before closing file
>> channels. This will ensure data is on disk at the end of a migration
>> to file.
>
> Looks reasonable, but just two (possibly naive) questions:
>
> (1) Does this apply to all io channel users, or only migration?
All file channel users.
>
> (2) Why metadata doesn't matter (v.s. fsync(), when CONFIG_FDATASYNC=y)?
Syncing the inode information is not critical; it's mostly timestamp
information (see inode(7)). And fdatasync makes sure to sync any
metadata that is relevant for retrieving the data.
>
> Thanks,
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-11 18:46 ` Fabiano Rosas
@ 2024-01-12 0:01 ` Peter Xu
2024-01-12 10:40 ` Daniel P. Berrangé
2024-01-15 8:57 ` Peter Xu
1 sibling, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-12 0:01 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Thu, Jan 11, 2024 at 03:46:02PM -0300, Fabiano Rosas wrote:
> > (1) Does this apply to all io channel users, or only migration?
>
> All file channel users.
I meant the whole idea of flushing on close: will there be iochannel
users that prefer not to do so? It's a matter of where best to put
this.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-12 0:01 ` Peter Xu
@ 2024-01-12 10:40 ` Daniel P. Berrangé
2024-01-15 3:38 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-12 10:40 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Fri, Jan 12, 2024 at 08:01:36AM +0800, Peter Xu wrote:
> On Thu, Jan 11, 2024 at 03:46:02PM -0300, Fabiano Rosas wrote:
> > > (1) Does this apply to all io channel users, or only migration?
> >
> > All file channel users.
>
> I meant the whole idea of flushing on close, on whether there will be
> iochannel users that will prefer not do so? It's a matter of where to put
> this best.
IMHO, all users of QIOChannelFile will benefit from flushing, to ensure
integrity of their saved file upon host crash.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-12 10:40 ` Daniel P. Berrangé
@ 2024-01-15 3:38 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 3:38 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Fri, Jan 12, 2024 at 10:40:28AM +0000, Daniel P. Berrangé wrote:
> On Fri, Jan 12, 2024 at 08:01:36AM +0800, Peter Xu wrote:
> > On Thu, Jan 11, 2024 at 03:46:02PM -0300, Fabiano Rosas wrote:
> > > > (1) Does this apply to all io channel users, or only migration?
> > >
> > > All file channel users.
> >
> > I meant the whole idea of flushing on close, on whether there will be
> > iochannel users that will prefer not do so? It's a matter of where to put
> > this best.
>
> IMHO, all users of QIOChannelFile will benefit from flushing, to ensure
> integrity of their saved file upon host crash.
Thanks. It might then be nice to also mention in the commit message
that this is for all purposes (currently it only mentions migration),
and some description of why it serves all purposes would be even nicer.
For example, I think it applies to all users as long as we don't have a
use case for frequent open/close of iochannels that would require a
fast close(), so all use cases will want the sync to happen.
Then we could optionally also document the implied behavior above
QIOChannelClass.io_close().
When looking at that header, I noticed we already have io_flush() for
zerocopy. I'm wondering whether we should also do that for close() when
zerocopy is enabled. This may mean that file channels can also
implement io_flush(), and then we call io_flush() in close() if it
exists.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-11 18:46 ` Fabiano Rosas
2024-01-12 0:01 ` Peter Xu
@ 2024-01-15 8:57 ` Peter Xu
2024-01-15 9:03 ` Daniel P. Berrangé
1 sibling, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-15 8:57 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Thu, Jan 11, 2024 at 03:46:02PM -0300, Fabiano Rosas wrote:
> > (2) Why metadata doesn't matter (v.s. fsync(), when CONFIG_FDATASYNC=y)?
>
> Syncing the inode information is not critical, it's mostly timestamp
> information (man inode). And fdatasync makes sure to sync any metadata
> that would be relevant for the retrieval of the data.
I forgot to reply to this one in my previous reply...
Timestamps are all fine to be stale. What about the file size? That's
also listed in "man inode" as metadata, but I'm not sure the data will
be fully valid if e.g. the size was enlarged but not flushed along with
the page cache.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-15 8:57 ` Peter Xu
@ 2024-01-15 9:03 ` Daniel P. Berrangé
2024-01-15 9:31 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-15 9:03 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Jan 15, 2024 at 04:57:42PM +0800, Peter Xu wrote:
> On Thu, Jan 11, 2024 at 03:46:02PM -0300, Fabiano Rosas wrote:
> > > (2) Why metadata doesn't matter (v.s. fsync(), when CONFIG_FDATASYNC=y)?
> >
> > Syncing the inode information is not critical, it's mostly timestamp
> > information (man inode). And fdatasync makes sure to sync any metadata
> > that would be relevant for the retrieval of the data.
>
> I forgot to reply to this one in the previous reply..
>
> Timestamps look all fine to be old. What about file size? That's also in
> "man inode" as metadata, but I'm not sure whether data will be fully valid
> if e.g. size enlarged but not flushed along with the page caches.
If the size wasn't updated, then syncing of the data would be pointless.
The man page confirms that size is synced:
[quote]
fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a subsequent
data retrieval to be correctly handled. For example, changes to
st_atime or st_mtime (respectively, time of last access and time of
last modification; see inode(7)) do not require flushing because they
are not necessary for a subsequent data read to be handled correctly.
On the other hand, a change to the file size (st_size, as made by say
ftruncate(2)), would require a metadata flush.
[/quote]
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 04/30] io: fsync before closing a file channel
2024-01-15 9:03 ` Daniel P. Berrangé
@ 2024-01-15 9:31 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 9:31 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Jan 15, 2024 at 09:03:20AM +0000, Daniel P. Berrangé wrote:
> On Mon, Jan 15, 2024 at 04:57:42PM +0800, Peter Xu wrote:
> > On Thu, Jan 11, 2024 at 03:46:02PM -0300, Fabiano Rosas wrote:
> > > > (2) Why metadata doesn't matter (v.s. fsync(), when CONFIG_FDATASYNC=y)?
> > >
> > > Syncing the inode information is not critical, it's mostly timestamp
> > > information (man inode). And fdatasync makes sure to sync any metadata
> > > that would be relevant for the retrieval of the data.
> >
> > I forgot to reply to this one in the previous reply..
> >
> > Timestamps look all fine to be old. What about file size? That's also in
> > "man inode" as metadata, but I'm not sure whether data will be fully valid
> > if e.g. size enlarged but not flushed along with the page caches.
>
> If the size wasn't updated, then syncing of the data would be pointless.
> The man page confirms that size is synced:
>
> [quote]
> fdatasync() is similar to fsync(), but does not flush modified metadata
> unless that metadata is needed in order to allow a subsequent data re‐
> trieval to be correctly handled. For example, changes to st_atime or
> st_mtime (respectively, time of last access and time of last modifica‐
> tion; see inode(7)) do not require flushing because they are not neces‐
> sary for a subsequent data read to be handled correctly. On the other
> hand, a change to the file size (st_size, as made by say ftruncate(2)),
> would require a metadata flush.
> [/quote]
I should have read more carefully, sorry for the noise.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
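The semantics Daniel quotes can be condensed into a plain-POSIX sketch
(illustrative only, not QEMU code; per Peter's parenthetical above, the
patch's qemu_fdatasync() uses fdatasync() when CONFIG_FDATASYNC is set
and falls back to fsync() otherwise):

    #include <unistd.h>

    /* Why fdatasync() suffices before close: it flushes the data pages
     * plus any metadata needed to read them back, notably st_size after
     * the file grew, while it may skip st_atime/st_mtime. fsync() would
     * flush the timestamps too, at extra cost. */
    static int write_and_sync(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len) { /* enlarges st_size */
            return -1;
        }
        return fdatasync(fd);
    }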
* [RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (3 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 04/30] io: fsync before closing a file channel Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-11 9:57 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
` (25 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
From: Nikolay Borisov <nborisov@suse.com>
Add utility methods that will be needed when implementing the
'fixed-ram' migration capability.
qemu_file_is_seekable
qemu_put_buffer_at
qemu_get_buffer_at
qemu_set_offset
qemu_get_offset
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
include/migration/qemu-file-types.h | 2 +
migration/qemu-file.c | 82 +++++++++++++++++++++++++++++
migration/qemu-file.h | 6 +++
3 files changed, 90 insertions(+)
diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
index 9ba163f333..adec5abc07 100644
--- a/include/migration/qemu-file-types.h
+++ b/include/migration/qemu-file-types.h
@@ -50,6 +50,8 @@ unsigned int qemu_get_be16(QEMUFile *f);
unsigned int qemu_get_be32(QEMUFile *f);
uint64_t qemu_get_be64(QEMUFile *f);
+bool qemu_file_is_seekable(QEMUFile *f);
+
static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
{
qemu_put_be64(f, *pv);
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 94231ff295..faf6427b91 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -33,6 +33,7 @@
#include "options.h"
#include "qapi/error.h"
#include "rdma.h"
+#include "io/channel-file.h"
#define IO_BUF_SIZE 32768
#define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
@@ -255,6 +256,10 @@ static void qemu_iovec_release_ram(QEMUFile *f)
memset(f->may_free, 0, sizeof(f->may_free));
}
+bool qemu_file_is_seekable(QEMUFile *f)
+{
+ return qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SEEKABLE);
+}
/**
* Flushes QEMUFile buffer
@@ -447,6 +452,83 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
}
}
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+ off_t pos)
+{
+ Error *err = NULL;
+
+ if (f->last_error) {
+ return;
+ }
+
+ qemu_fflush(f);
+ qio_channel_pwrite(f->ioc, (char *)buf, buflen, pos, &err);
+
+ if (err) {
+ qemu_file_set_error_obj(f, -EIO, err);
+ } else {
+ stat64_add(&mig_stats.qemu_file_transferred, buflen);
+ }
+
+ return;
+}
+
+
+size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+ off_t pos)
+{
+ Error *err = NULL;
+ ssize_t ret;
+
+ if (f->last_error) {
+ return 0;
+ }
+
+ ret = qio_channel_pread(f->ioc, (char *)buf, buflen, pos, &err);
+ if (ret == -1 || err) {
+ goto error;
+ }
+
+ return (size_t)ret;
+
+ error:
+ qemu_file_set_error_obj(f, -EIO, err);
+ return 0;
+}
+
+void qemu_set_offset(QEMUFile *f, off_t off, int whence)
+{
+ Error *err = NULL;
+ off_t ret;
+
+ qemu_fflush(f);
+
+ if (!qemu_file_is_writable(f)) {
+ f->buf_index = 0;
+ f->buf_size = 0;
+ }
+
+ ret = qio_channel_io_seek(f->ioc, off, whence, &err);
+ if (ret == (off_t)-1) {
+ qemu_file_set_error_obj(f, -EIO, err);
+ }
+}
+
+off_t qemu_get_offset(QEMUFile *f)
+{
+ Error *err = NULL;
+ off_t ret;
+
+ qemu_fflush(f);
+
+ ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);
+ if (ret == (off_t)-1) {
+ qemu_file_set_error_obj(f, -EIO, err);
+ }
+ return ret;
+}
+
+
void qemu_put_byte(QEMUFile *f, int v)
{
if (f->last_error) {
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 8aec9fabf7..32fd4a34fd 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -75,6 +75,12 @@ QEMUFile *qemu_file_get_return_path(QEMUFile *f);
int qemu_fflush(QEMUFile *f);
void qemu_file_set_blocking(QEMUFile *f, bool block);
int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
+void qemu_set_offset(QEMUFile *f, off_t off, int whence);
+off_t qemu_get_offset(QEMUFile *f);
+void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+ off_t pos);
+size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
+ off_t pos);
QIOChannel *qemu_file_get_ioc(QEMUFile *file);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
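A sketch of how the fixed-ram code later in the series is expected to
combine these helpers (block_offset, page_off and region_end are
invented names for illustration):

    if (qemu_file_is_seekable(f)) {
        /* Positioned write: the page lands at its fixed slot without
         * disturbing the stream's current position. */
        qemu_put_buffer_at(f, page, page_size, block_offset + page_off);
    } else {
        /* Fall back to the usual sequential stream. */
        qemu_put_buffer(f, page, page_size);
    }

    /* Once a RAMBlock is done, jump the sequential stream past the
     * region reserved for its pages. */
    qemu_set_offset(f, region_end, SEEK_SET);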
* Re: [RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels
2023-11-27 20:25 ` [RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
@ 2024-01-11 9:57 ` Peter Xu
2024-01-11 18:49 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-11 9:57 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:47PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> Add utility methods that will be needed when implementing 'fixed-ram'
> migration capability.
>
> qemu_file_is_seekable
> qemu_put_buffer_at
> qemu_get_buffer_at
> qemu_set_offset
> qemu_get_offset
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
> include/migration/qemu-file-types.h | 2 +
> migration/qemu-file.c | 82 +++++++++++++++++++++++++++++
> migration/qemu-file.h | 6 +++
> 3 files changed, 90 insertions(+)
>
> diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
> index 9ba163f333..adec5abc07 100644
> --- a/include/migration/qemu-file-types.h
> +++ b/include/migration/qemu-file-types.h
> @@ -50,6 +50,8 @@ unsigned int qemu_get_be16(QEMUFile *f);
> unsigned int qemu_get_be32(QEMUFile *f);
> uint64_t qemu_get_be64(QEMUFile *f);
>
> +bool qemu_file_is_seekable(QEMUFile *f);
> +
> static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
> {
> qemu_put_be64(f, *pv);
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 94231ff295..faf6427b91 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -33,6 +33,7 @@
> #include "options.h"
> #include "qapi/error.h"
> #include "rdma.h"
> +#include "io/channel-file.h"
>
> #define IO_BUF_SIZE 32768
> #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
> @@ -255,6 +256,10 @@ static void qemu_iovec_release_ram(QEMUFile *f)
> memset(f->may_free, 0, sizeof(f->may_free));
> }
>
> +bool qemu_file_is_seekable(QEMUFile *f)
> +{
> + return qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SEEKABLE);
> +}
>
> /**
> * Flushes QEMUFile buffer
> @@ -447,6 +452,83 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
> }
> }
>
> +void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
> + off_t pos)
> +{
> + Error *err = NULL;
> +
> + if (f->last_error) {
> + return;
> + }
> +
> + qemu_fflush(f);
> + qio_channel_pwrite(f->ioc, (char *)buf, buflen, pos, &err);
Partial writes won't set err. Do we want to check the retval here too
and fail properly if a partial write is detected?
> +
> + if (err) {
> + qemu_file_set_error_obj(f, -EIO, err);
> + } else {
> + stat64_add(&mig_stats.qemu_file_transferred, buflen);
buflen is only accurate with the above check, IIUC.
> + }
> +
> + return;
> +}
> +
> +
> +size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
> + off_t pos)
> +{
> + Error *err = NULL;
> + ssize_t ret;
> +
> + if (f->last_error) {
> + return 0;
> + }
> +
> + ret = qio_channel_pread(f->ioc, (char *)buf, buflen, pos, &err);
Same question here.
> + if (ret == -1 || err) {
> + goto error;
> + }
> +
> + return (size_t)ret;
> +
> + error:
> + qemu_file_set_error_obj(f, -EIO, err);
> + return 0;
> +}
> +
> +void qemu_set_offset(QEMUFile *f, off_t off, int whence)
> +{
> + Error *err = NULL;
> + off_t ret;
> +
> + qemu_fflush(f);
> +
> + if (!qemu_file_is_writable(f)) {
> + f->buf_index = 0;
> + f->buf_size = 0;
> + }
There's a qemu_file_is_writable() check here after all; should the
qemu_fflush() go under the condition too?
    if (qemu_file_is_writable(f)) {
        qemu_fflush(f);
    } else {
        /* Drop any cached buffers; will trigger a re-fill later */
        f->buf_index = 0;
        f->buf_size = 0;
    }
> +
> + ret = qio_channel_io_seek(f->ioc, off, whence, &err);
> + if (ret == (off_t)-1) {
> + qemu_file_set_error_obj(f, -EIO, err);
> + }
> +}
> +
> +off_t qemu_get_offset(QEMUFile *f)
> +{
> + Error *err = NULL;
> + off_t ret;
> +
> + qemu_fflush(f);
> +
> + ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);
> + if (ret == (off_t)-1) {
> + qemu_file_set_error_obj(f, -EIO, err);
> + }
> + return ret;
> +}
> +
> +
> void qemu_put_byte(QEMUFile *f, int v)
> {
> if (f->last_error) {
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 8aec9fabf7..32fd4a34fd 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -75,6 +75,12 @@ QEMUFile *qemu_file_get_return_path(QEMUFile *f);
> int qemu_fflush(QEMUFile *f);
> void qemu_file_set_blocking(QEMUFile *f, bool block);
> int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
> +void qemu_set_offset(QEMUFile *f, off_t off, int whence);
> +off_t qemu_get_offset(QEMUFile *f);
> +void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
> + off_t pos);
> +size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
> + off_t pos);
>
> QIOChannel *qemu_file_get_ioc(QEMUFile *file);
>
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels
2024-01-11 9:57 ` Peter Xu
@ 2024-01-11 18:49 ` Fabiano Rosas
0 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-11 18:49 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:47PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>>
>> Add utility methods that will be needed when implementing 'fixed-ram'
>> migration capability.
>>
>> qemu_file_is_seekable
>> qemu_put_buffer_at
>> qemu_get_buffer_at
>> qemu_set_offset
>> qemu_get_offset
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>> ---
>> include/migration/qemu-file-types.h | 2 +
>> migration/qemu-file.c | 82 +++++++++++++++++++++++++++++
>> migration/qemu-file.h | 6 +++
>> 3 files changed, 90 insertions(+)
>>
>> diff --git a/include/migration/qemu-file-types.h b/include/migration/qemu-file-types.h
>> index 9ba163f333..adec5abc07 100644
>> --- a/include/migration/qemu-file-types.h
>> +++ b/include/migration/qemu-file-types.h
>> @@ -50,6 +50,8 @@ unsigned int qemu_get_be16(QEMUFile *f);
>> unsigned int qemu_get_be32(QEMUFile *f);
>> uint64_t qemu_get_be64(QEMUFile *f);
>>
>> +bool qemu_file_is_seekable(QEMUFile *f);
>> +
>> static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
>> {
>> qemu_put_be64(f, *pv);
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 94231ff295..faf6427b91 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -33,6 +33,7 @@
>> #include "options.h"
>> #include "qapi/error.h"
>> #include "rdma.h"
>> +#include "io/channel-file.h"
>>
>> #define IO_BUF_SIZE 32768
>> #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
>> @@ -255,6 +256,10 @@ static void qemu_iovec_release_ram(QEMUFile *f)
>> memset(f->may_free, 0, sizeof(f->may_free));
>> }
>>
>> +bool qemu_file_is_seekable(QEMUFile *f)
>> +{
>> + return qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_SEEKABLE);
>> +}
>>
>> /**
>> * Flushes QEMUFile buffer
>> @@ -447,6 +452,83 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, size_t size)
>> }
>> }
>>
>> +void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
>> + off_t pos)
>> +{
>> + Error *err = NULL;
>> +
>> + if (f->last_error) {
>> + return;
>> + }
>> +
>> + qemu_fflush(f);
>> + qio_channel_pwrite(f->ioc, (char *)buf, buflen, pos, &err);
>
> Partial writes won't set err. Do we want to check the retval here too and
> fail properly if detected partial writes?
>
Yep.
>> +
>> + if (err) {
>> + qemu_file_set_error_obj(f, -EIO, err);
>> + } else {
>> + stat64_add(&mig_stats.qemu_file_transferred, buflen);
>
> buflen is only accurate if with above, iiuc.
>
>> + }
>> +
>> + return;
>> +}
>> +
>> +
>> +size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
>> + off_t pos)
>> +{
>> + Error *err = NULL;
>> + ssize_t ret;
>> +
>> + if (f->last_error) {
>> + return 0;
>> + }
>> +
>> + ret = qio_channel_pread(f->ioc, (char *)buf, buflen, pos, &err);
>
> Same question here.
>
>> + if (ret == -1 || err) {
>> + goto error;
>> + }
>> +
>> + return (size_t)ret;
>> +
>> + error:
>> + qemu_file_set_error_obj(f, -EIO, err);
>> + return 0;
>> +}
>> +
>> +void qemu_set_offset(QEMUFile *f, off_t off, int whence)
>> +{
>> + Error *err = NULL;
>> + off_t ret;
>> +
>> + qemu_fflush(f);
>> +
>> + if (!qemu_file_is_writable(f)) {
>> + f->buf_index = 0;
>> + f->buf_size = 0;
>> + }
>
> There's the qemu_file_is_writable() check after all, then put qemu_fflush()
> into condition too?
>
> if (qemu_file_is_writable(f)) {
> qemu_fflush(f);
> } else {
> /* Drop all the cached buffers if existed; will trigger a re-fill later */
> f->buf_index = 0;
> f->buf_size = 0;
> }
>
Could be. I'll change it.
>> +
>> + ret = qio_channel_io_seek(f->ioc, off, whence, &err);
>> + if (ret == (off_t)-1) {
>> + qemu_file_set_error_obj(f, -EIO, err);
>> + }
>> +}
>> +
>> +off_t qemu_get_offset(QEMUFile *f)
>> +{
>> + Error *err = NULL;
>> + off_t ret;
>> +
>> + qemu_fflush(f);
>> +
>> + ret = qio_channel_io_seek(f->ioc, 0, SEEK_CUR, &err);
>> + if (ret == (off_t)-1) {
>> + qemu_file_set_error_obj(f, -EIO, err);
>> + }
>> + return ret;
>> +}
>> +
>> +
>> void qemu_put_byte(QEMUFile *f, int v)
>> {
>> if (f->last_error) {
>> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
>> index 8aec9fabf7..32fd4a34fd 100644
>> --- a/migration/qemu-file.h
>> +++ b/migration/qemu-file.h
>> @@ -75,6 +75,12 @@ QEMUFile *qemu_file_get_return_path(QEMUFile *f);
>> int qemu_fflush(QEMUFile *f);
>> void qemu_file_set_blocking(QEMUFile *f, bool block);
>> int qemu_file_get_to_fd(QEMUFile *f, int fd, size_t size);
>> +void qemu_set_offset(QEMUFile *f, off_t off, int whence);
>> +off_t qemu_get_offset(QEMUFile *f);
>> +void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
>> + off_t pos);
>> +size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
>> + off_t pos);
>>
>> QIOChannel *qemu_file_get_ioc(QEMUFile *file);
>>
>> --
>> 2.35.3
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
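For reference, one possible shape of the fix agreed above, folding in
both of Peter's points (a sketch against the hunk as posted;
qemu_file_set_error() is the existing helper in migration/qemu-file.c,
the rest mirrors the patch):

    void qemu_put_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
                            off_t pos)
    {
        Error *err = NULL;
        ssize_t ret;

        if (f->last_error) {
            return;
        }

        qemu_fflush(f);
        ret = qio_channel_pwrite(f->ioc, (char *)buf, buflen, pos, &err);

        if (err) {
            qemu_file_set_error_obj(f, -EIO, err);
            return;
        }

        if ((size_t)ret != buflen) {
            /* Partial write: don't account bytes that never went out. */
            qemu_file_set_error(f, -EIO);
            return;
        }

        stat64_add(&mig_stats.qemu_file_transferred, buflen);
    }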
* [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (4 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 05/30] migration/qemu-file: add utility methods for working with seekable channels Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2023-12-22 10:35 ` Markus Armbruster
2024-01-11 10:43 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check Fabiano Rosas
` (24 subsequent siblings)
30 siblings, 2 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Eric Blake
Add a new migration capability 'fixed-ram'.
The core of the feature is to ensure that each RAM page has a specific
offset in the resulting migration stream. The reasons why we'd want
such behavior are:
- The resulting file will have a bounded size, since pages which are
dirtied multiple times will always go to a fixed location in the
file, rather than constantly being added to a sequential
stream. This eliminates cases where a VM with, say, 1G of RAM can
result in a migration file that's 10s of GBs, provided that the
workload constantly redirties memory.
- It paves the way to implement O_DIRECT-enabled save/restore of the
migration stream as the pages are ensured to be written at aligned
offsets.
- It allows the usage of multifd so we can write RAM pages to the
migration file in parallel.
For now, enabling the capability has no effect. The next couple of
patches implement the core functionality.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- mentioned seeking on docs
---
docs/devel/migration.rst | 21 +++++++++++++++++++++
migration/options.c | 34 ++++++++++++++++++++++++++++++++++
migration/options.h | 1 +
migration/savevm.c | 1 +
qapi/migration.json | 6 +++++-
5 files changed, 62 insertions(+), 1 deletion(-)
diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index ec55089b25..eeb4fec31f 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -572,6 +572,27 @@ Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.
+Fixed-ram format
+----------------
+
+When the ``fixed-ram`` capability is enabled, a slightly different
+stream format is used for the RAM section. Instead of having a
+sequential stream of pages that follow the RAMBlock headers, the dirty
+pages for a RAMBlock follow its header. This ensures that each RAM
+page has a fixed offset in the resulting migration file.
+
+The ``fixed-ram`` capability must be enabled in both source and
+destination with:
+
+ ``migrate_set_capability fixed-ram on``
+
+Since pages are written to their relative offsets and out of order
+(due to the memory dirtying patterns), streaming channels such as
+sockets are not supported. A seekable channel such as a file is
+required. This can be verified in the QIOChannel by the presence of
+the QIO_CHANNEL_FEATURE_SEEKABLE feature. In more practical terms,
+this migration format requires the ``file:`` URI when migrating.
+
Return path
-----------
diff --git a/migration/options.c b/migration/options.c
index 8d8ec73ad9..775428a8a5 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -204,6 +204,7 @@ Property migration_properties[] = {
DEFINE_PROP_MIG_CAP("x-switchover-ack",
MIGRATION_CAPABILITY_SWITCHOVER_ACK),
DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
+ DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
DEFINE_PROP_END_OF_LIST(),
};
@@ -263,6 +264,13 @@ bool migrate_events(void)
return s->capabilities[MIGRATION_CAPABILITY_EVENTS];
}
+bool migrate_fixed_ram(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM];
+}
+
bool migrate_ignore_shared(void)
{
MigrationState *s = migrate_get_current();
@@ -645,6 +653,32 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
}
}
+ if (new_caps[MIGRATION_CAPABILITY_FIXED_RAM]) {
+ if (new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
+ error_setg(errp,
+ "Fixed-ram migration is incompatible with multifd");
+ return false;
+ }
+
+ if (new_caps[MIGRATION_CAPABILITY_XBZRLE]) {
+ error_setg(errp,
+ "Fixed-ram migration is incompatible with xbzrle");
+ return false;
+ }
+
+ if (new_caps[MIGRATION_CAPABILITY_COMPRESS]) {
+ error_setg(errp,
+ "Fixed-ram migration is incompatible with compression");
+ return false;
+ }
+
+ if (new_caps[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
+ error_setg(errp,
+ "Fixed-ram migration is incompatible with postcopy ram");
+ return false;
+ }
+ }
+
return true;
}
diff --git a/migration/options.h b/migration/options.h
index 246c160aee..8680a10b79 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -31,6 +31,7 @@ bool migrate_compress(void);
bool migrate_dirty_bitmaps(void);
bool migrate_dirty_limit(void);
bool migrate_events(void);
+bool migrate_fixed_ram(void);
bool migrate_ignore_shared(void);
bool migrate_late_block_activate(void);
bool migrate_multifd(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index eec5503a42..48c37bd198 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -245,6 +245,7 @@ static bool should_validate_capability(int capability)
/* Validate only new capabilities to keep compatibility. */
switch (capability) {
case MIGRATION_CAPABILITY_X_IGNORE_SHARED:
+ case MIGRATION_CAPABILITY_FIXED_RAM:
return true;
default:
return false;
diff --git a/qapi/migration.json b/qapi/migration.json
index eb2f883513..3b93e13743 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -531,6 +531,10 @@
# and can result in more stable read performance. Requires KVM
# with accelerator property "dirty-ring-size" set. (Since 8.1)
#
+# @fixed-ram: Migrate using fixed offsets for each RAM page. Requires
+# a migration URI that supports seeking, such as a file. (since
+# 8.2)
+#
# Features:
#
# @deprecated: Member @block is deprecated. Use blockdev-mirror with
@@ -555,7 +559,7 @@
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
- 'dirty-limit'] }
+ 'dirty-limit', 'fixed-ram'] }
##
# @MigrationCapabilityStatus:
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
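For completeness, the QMP equivalent of the HMP command shown in the
docs above; the capability name comes from this patch, and the file:
URI is the transport the series targets (the path is illustrative):

    { "execute": "migrate-set-capabilities",
      "arguments": { "capabilities": [
          { "capability": "fixed-ram", "state": true } ] } }

    { "execute": "migrate",
      "arguments": { "uri": "file:/var/lib/vm-state/guest.mig" } }

As the docs note, the capability must be set on both the source and the
destination before migrating.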
* Re: [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability
2023-11-27 20:25 ` [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
@ 2023-12-22 10:35 ` Markus Armbruster
2024-01-11 10:43 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Markus Armbruster @ 2023-12-22 10:35 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Eric Blake
Fabiano Rosas <farosas@suse.de> writes:
> Add a new migration capability 'fixed-ram'.
>
> The core of the feature is to ensure that each RAM page has a specific
> offset in the resulting migration stream. The reasons why we'd want
> such behavior are:
>
> - The resulting file will have a bounded size, since pages which are
> dirtied multiple times will always go to a fixed location in the
> file, rather than constantly being added to a sequential
> stream. This eliminates cases where a VM with, say, 1G of RAM can
> result in a migration file that's 10s of GBs, provided that the
> workload constantly redirties memory.
>
> - It paves the way to implement O_DIRECT-enabled save/restore of the
> migration stream as the pages are ensured to be written at aligned
> offsets.
>
> - It allows the usage of multifd so we can write RAM pages to the
> migration file in parallel.
>
> For now, enabling the capability has no effect. The next couple of
> patches implement the core functionality.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - mentioned seeking on docs
> ---
> docs/devel/migration.rst | 21 +++++++++++++++++++++
> migration/options.c | 34 ++++++++++++++++++++++++++++++++++
> migration/options.h | 1 +
> migration/savevm.c | 1 +
> qapi/migration.json | 6 +++++-
> 5 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index ec55089b25..eeb4fec31f 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -572,6 +572,27 @@ Others (especially either older devices or system devices which for
> some reason don't have a bus concept) make use of the ``instance id``
> for otherwise identically named devices.
>
> +Fixed-ram format
> +----------------
> +
> +When the ``fixed-ram`` capability is enabled, a slightly different
> +stream format is used for the RAM section. Instead of having a
> +sequential stream of pages that follow the RAMBlock headers, the dirty
> +pages for a RAMBlock follow its header. This ensures that each RAM
> +page has a fixed offset in the resulting migration file.
> +
> +The ``fixed-ram`` capability must be enabled in both source and
> +destination with:
> +
> + ``migrate_set_capability fixed-ram on``
> +
> +Since pages are written to their relative offsets and out of order
> +(due to the memory dirtying patterns), streaming channels such as
> +sockets are not supported. A seekable channel such as a file is
> +required. This can be verified in the QIOChannel by the presence of
> +the QIO_CHANNEL_FEATURE_SEEKABLE. In more practical terms, this
> +migration format requires the ``file:`` URI when migrating.
> +
> Return path
> -----------
>
[...]
> diff --git a/qapi/migration.json b/qapi/migration.json
> index eb2f883513..3b93e13743 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -531,6 +531,10 @@
> # and can result in more stable read performance. Requires KVM
> # with accelerator property "dirty-ring-size" set. (Since 8.1)
> #
> +# @fixed-ram: Migrate using fixed offsets for each RAM page. Requires
Offsets in what?
Clear enough from commit message and doc update, but the doc comment
needs to make sense on its own.
> +# a migration URI that supports seeking, such as a file. (since
> +# 8.2)
9.0
> +#
> # Features:
> #
> # @deprecated: Member @block is deprecated. Use blockdev-mirror with
> @@ -555,7 +559,7 @@
> { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> 'validate-uuid', 'background-snapshot',
> 'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> - 'dirty-limit'] }
> + 'dirty-limit', 'fixed-ram'] }
>
> ##
> # @MigrationCapabilityStatus:
^ permalink raw reply [flat|nested] 95+ messages in thread
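[The fixed-offset property described in the doc hunk above is easy to see in
isolation. The following is a standalone illustration in plain C -- not QEMU
code, and all constants are made up: a page's position in the file is a pure
function of the ramblock's data region start and the page's offset within
the ramblock, so a page that is dirtied many times is simply overwritten in
place.]
#include <stdio.h>
#include <stdint.h>
#define TARGET_PAGE_BITS 12   /* assume 4 KiB target pages */
/* Where a given RAM page lands in the migration file. */
static uint64_t page_file_offset(uint64_t pages_offset, uint64_t ram_offset)
{
    return pages_offset + ram_offset;
}
int main(void)
{
    uint64_t pages_offset = 0x100000;             /* made-up region start */
    uint64_t page5 = 5ULL << TARGET_PAGE_BITS;
    /* Dirtying page 5 twice writes to the same file offset both times,
     * which is what keeps the file size bounded. */
    printf("page 5 -> 0x%llx\n",
           (unsigned long long)page_file_offset(pages_offset, page5));
    return 0;
}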
* Re: [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability
2023-11-27 20:25 ` [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
2023-12-22 10:35 ` Markus Armbruster
@ 2024-01-11 10:43 ` Peter Xu
1 sibling, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-11 10:43 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Eric Blake
On Mon, Nov 27, 2023 at 05:25:48PM -0300, Fabiano Rosas wrote:
> Add a new migration capability 'fixed-ram'.
>
> The core of the feature is to ensure that each RAM page has a specific
> offset in the resulting migration stream. The reasons why we'd want
> such behavior are:
>
> - The resulting file will have a bounded size, since pages which are
> dirtied multiple times will always go to a fixed location in the
> file, rather than constantly being added to a sequential
> stream. This eliminates cases where a VM with, say, 1G of RAM can
> result in a migration file that's 10s of GBs, provided that the
> workload constantly redirties memory.
>
> - It paves the way to implement O_DIRECT-enabled save/restore of the
> migration stream as the pages are ensured to be written at aligned
> offsets.
>
> - It allows the usage of multifd so we can write RAM pages to the
> migration file in parallel.
>
> For now, enabling the capability has no effect. The next couple of
> patches implement the core functionality.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - mentioned seeking on docs
> ---
> docs/devel/migration.rst | 21 +++++++++++++++++++++
> migration/options.c | 34 ++++++++++++++++++++++++++++++++++
> migration/options.h | 1 +
> migration/savevm.c | 1 +
> qapi/migration.json | 6 +++++-
> 5 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> index ec55089b25..eeb4fec31f 100644
> --- a/docs/devel/migration.rst
> +++ b/docs/devel/migration.rst
> @@ -572,6 +572,27 @@ Others (especially either older devices or system devices which for
> some reason don't have a bus concept) make use of the ``instance id``
> for otherwise identically named devices.
>
> +Fixed-ram format
> +----------------
> +
> +When the ``fixed-ram`` capability is enabled, a slightly different
> +stream format is used for the RAM section. Instead of having a
> +sequential stream of pages that follow the RAMBlock headers, the dirty
> +pages for a RAMBlock follow its header. This ensures that each RAM
> +page has a fixed offset in the resulting migration file.
> +
> +The ``fixed-ram`` capability must be enabled in both source and
> +destination with:
> +
> + ``migrate_set_capability fixed-ram on``
> +
> +Since pages are written to their relative offsets and out of order
> +(due to the memory dirtying patterns), streaming channels such as
> +sockets are not supported. A seekable channel such as a file is
> +required. This can be verified in the QIOChannel by the presence of
> +the QIO_CHANNEL_FEATURE_SEEKABLE feature. In more practical terms, this
> +migration format requires the ``file:`` URI when migrating.
After the doc cleanup that I just posted, fixed-ram can have its own file
now.
Could you move the nice ascii art from patch 8 commit message to here?
More doc is always good. The commit message can get lost very soon, doc
will be more persistent.
Also, can we provide more information on this feature in the doc so that
users know when it should be used, and how?
For example, IIUC it only applies to the case where the user wants to stop
the VM right after snapshotting it into a file, right? We'd better be
clear on this, as this is quite a special use of migration anyway. While at
it, we should also mention that it's always suggested to stop the VM first
before doing such a migration?
> +
> Return path
> -----------
>
> diff --git a/migration/options.c b/migration/options.c
> index 8d8ec73ad9..775428a8a5 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -204,6 +204,7 @@ Property migration_properties[] = {
> DEFINE_PROP_MIG_CAP("x-switchover-ack",
> MIGRATION_CAPABILITY_SWITCHOVER_ACK),
> DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
> + DEFINE_PROP_MIG_CAP("x-fixed-ram", MIGRATION_CAPABILITY_FIXED_RAM),
Let's drop "x-"? I am thinking we should drop all x-, it can break some
scripts but iiuc shouldn't be more than that. Definitely another story..
> DEFINE_PROP_END_OF_LIST(),
> };
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (5 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 06/30] migration/ram: Introduce 'fixed-ram' migration capability Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-15 9:01 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration Fabiano Rosas
` (23 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
The fixed-ram migration format needs a channel that supports seeking
to be able to write each page to an arbitrary offset in the migration
stream.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
- avoided overwriting errp in compatibility check
---
migration/migration.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/migration/migration.c b/migration/migration.c
index 28a34c9068..897ed1db67 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -135,10 +135,26 @@ static bool transport_supports_multi_channels(SocketAddress *saddr)
saddr->type == SOCKET_ADDRESS_TYPE_VSOCK;
}
+static bool migration_needs_seekable_channel(void)
+{
+ return migrate_fixed_ram();
+}
+
+static bool transport_supports_seeking(MigrationAddress *addr)
+{
+ return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
+}
+
static bool
migration_channels_and_transport_compatible(MigrationAddress *addr,
Error **errp)
{
+ if (migration_needs_seekable_channel() &&
+ !transport_supports_seeking(addr)) {
+ error_setg(errp, "Migration requires seekable transport (e.g. file)");
+ return false;
+ }
+
if (migration_needs_multiple_sockets() &&
(addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) &&
!transport_supports_multi_channels(&addr->u.socket)) {
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check
2023-11-27 20:25 ` [RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check Fabiano Rosas
@ 2024-01-15 9:01 ` Peter Xu
2024-01-23 19:07 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-15 9:01 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:49PM -0300, Fabiano Rosas wrote:
> The fixed-ram migration format needs a channel that supports seeking
> to be able to write each page to an arbitrary offset in the migration
> stream.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
> - avoided overwriting errp in compatibility check
> ---
> migration/migration.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 28a34c9068..897ed1db67 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -135,10 +135,26 @@ static bool transport_supports_multi_channels(SocketAddress *saddr)
> saddr->type == SOCKET_ADDRESS_TYPE_VSOCK;
> }
>
> +static bool migration_needs_seekable_channel(void)
> +{
> + return migrate_fixed_ram();
> +}
> +
> +static bool transport_supports_seeking(MigrationAddress *addr)
> +{
> + return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
> +}
What about TYPE_FD? Is it going to be supported later?
> +
> static bool
> migration_channels_and_transport_compatible(MigrationAddress *addr,
> Error **errp)
> {
> + if (migration_needs_seekable_channel() &&
> + !transport_supports_seeking(addr)) {
> + error_setg(errp, "Migration requires seekable transport (e.g. file)");
> + return false;
> + }
> +
> if (migration_needs_multiple_sockets() &&
> (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) &&
> !transport_supports_multi_channels(&addr->u.socket)) {
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check
2024-01-15 9:01 ` Peter Xu
@ 2024-01-23 19:07 ` Fabiano Rosas
0 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-23 19:07 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:49PM -0300, Fabiano Rosas wrote:
>> The fixed-ram migration format needs a channel that supports seeking
>> to be able to write each page to an arbitrary offset in the migration
>> stream.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>> ---
>> - avoided overwriting errp in compatibility check
>> ---
>> migration/migration.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 28a34c9068..897ed1db67 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -135,10 +135,26 @@ static bool transport_supports_multi_channels(SocketAddress *saddr)
>> saddr->type == SOCKET_ADDRESS_TYPE_VSOCK;
>> }
>>
>> +static bool migration_needs_seekable_channel(void)
>> +{
>> + return migrate_fixed_ram();
>> +}
>> +
>> +static bool transport_supports_seeking(MigrationAddress *addr)
>> +{
>> + return addr->transport == MIGRATION_ADDRESS_TYPE_FILE;
>> +}
>
> What about TYPE_FD? Is it going to be supported later?
>
Sorry, I missed this one.
Yes, and thanks for asking because I just remembered I have code for
this lost in a git stash somewhere. I'll include it in v4.
>> +
>> static bool
>> migration_channels_and_transport_compatible(MigrationAddress *addr,
>> Error **errp)
>> {
>> + if (migration_needs_seekable_channel() &&
>> + !transport_supports_seeking(addr)) {
>> + error_setg(errp, "Migration requires seekable transport (e.g. file)");
>> + return false;
>> + }
>> +
>> if (migration_needs_multiple_sockets() &&
>> (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) &&
>> !transport_supports_multi_channels(&addr->u.socket)) {
>> --
>> 2.35.3
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
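[Since the fd: support Fabiano mentions is not in this thread, the following
is only a sketch of the direction being discussed, with an illustrative name
and approach rather than the stashed code: an fd transport could be accepted
for fixed-ram when the fd refers to a regular file, since pipes and sockets
cannot seek.]
#include <stdbool.h>
#include <sys/stat.h>
/* Hypothetical helper: probe whether a passed-in fd can support the
 * seeking that fixed-ram needs. */
static bool fd_transport_supports_seeking(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 && S_ISREG(st.st_mode);
}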
* [RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (6 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 07/30] migration: Add fixed-ram URI compatibility check Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-15 9:28 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 09/30] migration/ram: Add incoming " Fabiano Rosas
` (22 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
David Hildenbrand, Philippe Mathieu-Daudé
From: Nikolay Borisov <nborisov@suse.com>
Implement the outgoing migration side for the 'fixed-ram' capability.
A bitmap is introduced to track which pages have been written in the
migration file. Pages are written at a fixed location for every
ramblock. Zero pages are ignored as they'd already be zero on the
destination.
The migration stream is altered to put the dirty pages for a ramblock
after its header instead of having a sequential stream of pages that
follow the ramblock headers. Since all pages have a fixed location,
RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
Without fixed-ram (current): With fixed-ram (new):
--------------------- --------------------------------
| ramblock 1 header | | ramblock 1 header |
--------------------- --------------------------------
| ramblock 2 header | | ramblock 1 fixed-ram header |
--------------------- --------------------------------
| ... | | padding to next 1MB boundary |
--------------------- | ... |
| ramblock n header | --------------------------------
--------------------- | ramblock 1 pages |
| RAM_SAVE_FLAG_EOS | | ... |
--------------------- --------------------------------
| stream of pages | | ramblock 2 header |
| (iter 1) | --------------------------------
| ... | | ramblock 2 fixed-ram header |
--------------------- --------------------------------
| RAM_SAVE_FLAG_EOS | | padding to next 1MB boundary |
--------------------- | ... |
| stream of pages | --------------------------------
| (iter 2) | | ramblock 2 pages |
| ... | | ... |
--------------------- --------------------------------
| ... | | ... |
--------------------- --------------------------------
| RAM_SAVE_FLAG_EOS |
--------------------------------
| ... |
--------------------------------
where:
- ramblock header: the generic information for a ramblock, such as
idstr, used_len, etc.
- ramblock fixed-ram header: the new information added by this
feature: bitmap of pages written, bitmap size and offset of pages
in the migration file.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- used a macro for alignment value
- documented alignment assumptions
- moved shadow_bmap debug code to multifd patch
- did NOT use used_length for bmap, it breaks dirty page tracking somehow
- uncommented the capability enabling
- accounted for the bitmap size with ram_transferred_add()
---
include/exec/ramblock.h | 8 +++
migration/ram.c | 121 +++++++++++++++++++++++++++++++++++++---
2 files changed, 120 insertions(+), 9 deletions(-)
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 69c6a53902..e0e3f16852 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -44,6 +44,14 @@ struct RAMBlock {
size_t page_size;
/* dirty bitmap used during migration */
unsigned long *bmap;
+ /* shadow dirty bitmap used when migrating to a file */
+ unsigned long *shadow_bmap;
+ /*
+ * offset in the file where pages belonging to this ramblock are saved,
+ * used only during migration to a file.
+ */
+ off_t bitmap_offset;
+ uint64_t pages_offset;
/* bitmap of already received pages in postcopy */
unsigned long *receivedmap;
diff --git a/migration/ram.c b/migration/ram.c
index 8c7886ab79..4a0ab8105f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -94,6 +94,18 @@
#define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
/* We can't use any flag that is bigger than 0x200 */
+/*
+ * fixed-ram migration supports O_DIRECT, so we need to make sure the
+ * userspace buffer, the IO operation size and the file offset are
+ * aligned according to the underlying device's block size. The first
+ * two are already aligned to page size, but we need to add padding to
+ * the file to align the offset. We cannot read the block size
+ * dynamically because the migration file can be moved between
+ * different systems, so use 1M to cover most block sizes and to keep
+ * the file offset aligned at page size as well.
+ */
+#define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
+
XBZRLECacheStats xbzrle_counters;
/* used by the search for pages to send */
@@ -1127,12 +1139,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
return 0;
}
+ stat64_add(&mig_stats.zero_pages, 1);
+
+ if (migrate_fixed_ram()) {
+ /* zero pages are not transferred with fixed-ram */
+ clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
+ return 1;
+ }
+
len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
qemu_put_byte(file, 0);
len += 1;
ram_release_page(pss->block->idstr, offset);
-
- stat64_add(&mig_stats.zero_pages, 1);
ram_transferred_add(len);
/*
@@ -1190,14 +1208,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
{
QEMUFile *file = pss->pss_channel;
- ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
- offset | RAM_SAVE_FLAG_PAGE));
- if (async) {
- qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
- migrate_release_ram() &&
- migration_in_postcopy());
+ if (migrate_fixed_ram()) {
+ qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
+ block->pages_offset + offset);
+ set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
} else {
- qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+ ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
+ offset | RAM_SAVE_FLAG_PAGE));
+ if (async) {
+ qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
+ migrate_release_ram() &&
+ migration_in_postcopy());
+ } else {
+ qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
+ }
}
ram_transferred_add(TARGET_PAGE_SIZE);
stat64_add(&mig_stats.normal_pages, 1);
@@ -2413,6 +2437,8 @@ static void ram_save_cleanup(void *opaque)
block->clear_bmap = NULL;
g_free(block->bmap);
block->bmap = NULL;
+ g_free(block->shadow_bmap);
+ block->shadow_bmap = NULL;
}
xbzrle_cleanup();
@@ -2780,6 +2806,7 @@ static void ram_list_init_bitmaps(void)
*/
block->bmap = bitmap_new(pages);
bitmap_set(block->bmap, 0, pages);
+ block->shadow_bmap = bitmap_new(pages);
block->clear_bmap_shift = shift;
block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
}
@@ -2917,6 +2944,58 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
}
}
+#define FIXED_RAM_HDR_VERSION 1
+struct FixedRamHeader {
+ uint32_t version;
+ /*
+ * The target's page size, so we know how many pages are in the
+ * bitmap.
+ */
+ uint64_t page_size;
+ /*
+ * The offset in the migration file where the pages bitmap is
+ * found.
+ */
+ uint64_t bitmap_offset;
+ /*
+ * The offset in the migration file where the actual pages (data)
+ * are found.
+ */
+ uint64_t pages_offset;
+ /* end of v1 */
+} QEMU_PACKED;
+typedef struct FixedRamHeader FixedRamHeader;
+
+static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
+{
+ g_autofree FixedRamHeader *header;
+ size_t header_size, bitmap_size;
+ long num_pages;
+
+ header = g_new0(FixedRamHeader, 1);
+ header_size = sizeof(FixedRamHeader);
+
+ num_pages = block->used_length >> TARGET_PAGE_BITS;
+ bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+ /*
+ * Save the file offsets of where the bitmap and the pages should
+ * go as they are written at the end of migration and during the
+ * iterative phase, respectively.
+ */
+ block->bitmap_offset = qemu_get_offset(file) + header_size;
+ block->pages_offset = ROUND_UP(block->bitmap_offset +
+ bitmap_size,
+ FIXED_RAM_FILE_OFFSET_ALIGNMENT);
+
+ header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
+ header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);
+ header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
+ header->pages_offset = cpu_to_be64(block->pages_offset);
+
+ qemu_put_buffer(file, (uint8_t *) header, header_size);
+}
+
/*
* Each of ram_save_setup, ram_save_iterate and ram_save_complete has
* long-running RCU critical section. When rcu-reclaims in the code
@@ -2966,6 +3045,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
if (migrate_ignore_shared()) {
qemu_put_be64(f, block->mr->addr);
}
+
+ if (migrate_fixed_ram()) {
+ fixed_ram_insert_header(f, block);
+ /* prepare offset for next ramblock */
+ qemu_set_offset(f, block->pages_offset + block->used_length,
+ SEEK_SET);
+ }
}
}
@@ -2999,6 +3085,19 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
return qemu_fflush(f);
}
+static void ram_save_shadow_bmap(QEMUFile *f)
+{
+ RAMBlock *block;
+
+ RAMBLOCK_FOREACH_MIGRATABLE(block) {
+ long num_pages = block->used_length >> TARGET_PAGE_BITS;
+ long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+ qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
+ block->bitmap_offset);
+ ram_transferred_add(bitmap_size);
+ }
+}
+
/**
* ram_save_iterate: iterative stage for migration
*
@@ -3188,6 +3287,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
return ret;
}
+ if (migrate_fixed_ram()) {
+ ram_save_shadow_bmap(f);
+ }
+
if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
}
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
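[To make the layout arithmetic in fixed_ram_insert_header() concrete, here
is a standalone sketch in plain C with made-up input values -- a
hypothetical 4 GiB ramblock with 4 KiB pages and 64-bit longs, with the
header landing at file offset 0x1200:]
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#define ALIGNMENT 0x100000ULL                        /* 1 MiB, as in the patch */
#define ROUND_UP(x, a) (((x) + (a) - 1) / (a) * (a))
int main(void)
{
    uint64_t header_off  = 0x1200;                   /* made up */
    uint64_t header_size = 28;                       /* packed v1 header */
    uint64_t num_pages   = (4ULL << 30) >> 12;       /* 4 GiB of 4 KiB pages */
    uint64_t bitmap_size = (num_pages + 63) / 64 * 8;    /* one bit per page */
    uint64_t bitmap_off = header_off + header_size;
    uint64_t pages_off  = ROUND_UP(bitmap_off + bitmap_size, ALIGNMENT);
    /* Prints: bitmap at 0x121c, pages at 0x100000 -- i.e. the page data
     * starts at the next 1 MiB boundary, as in the diagram above. */
    printf("bitmap at 0x%" PRIx64 ", pages at 0x%" PRIx64 "\n",
           bitmap_off, pages_off);
    return 0;
}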
* Re: [RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration
2023-11-27 20:25 ` [RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration Fabiano Rosas
@ 2024-01-15 9:28 ` Peter Xu
2024-01-15 14:50 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-15 9:28 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
David Hildenbrand, Philippe Mathieu-Daudé
On Mon, Nov 27, 2023 at 05:25:50PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> Implement the outgoing migration side for the 'fixed-ram' capability.
>
> A bitmap is introduced to track which pages have been written in the
> migration file. Pages are written at a fixed location for every
> ramblock. Zero pages are ignored as they'd already be zero on the
> destination.
>
> The migration stream is altered to put the dirty pages for a ramblock
> after its header instead of having a sequential stream of pages that
> follow the ramblock headers. Since all pages have a fixed location,
> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
>
> Without fixed-ram (current): With fixed-ram (new):
>
> --------------------- --------------------------------
> | ramblock 1 header | | ramblock 1 header |
> --------------------- --------------------------------
> | ramblock 2 header | | ramblock 1 fixed-ram header |
> --------------------- --------------------------------
> | ... | | padding to next 1MB boundary |
> --------------------- | ... |
> | ramblock n header | --------------------------------
> --------------------- | ramblock 1 pages |
> | RAM_SAVE_FLAG_EOS | | ... |
> --------------------- --------------------------------
> | stream of pages | | ramblock 2 header |
> | (iter 1) | --------------------------------
> | ... | | ramblock 2 fixed-ram header |
> --------------------- --------------------------------
> | RAM_SAVE_FLAG_EOS | | padding to next 1MB boundary |
> --------------------- | ... |
> | stream of pages | --------------------------------
> | (iter 2) | | ramblock 2 pages |
> | ... | | ... |
> --------------------- --------------------------------
> | ... | | ... |
> --------------------- --------------------------------
> | RAM_SAVE_FLAG_EOS |
> --------------------------------
> | ... |
> --------------------------------
>
> where:
> - ramblock header: the generic information for a ramblock, such as
> idstr, used_len, etc.
>
> - ramblock fixed-ram header: the new information added by this
> feature: bitmap of pages written, bitmap size and offset of pages
> in the migration file.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - used a macro for alignment value
> - documented alignment assumptions
> - moved shadow_bmap debug code to multifd patch
> - did NOT use used_length for bmap, it breaks dirty page tracking somehow
> - uncommented the capability enabling
> - accounted for the bitmap size with ram_transferred_add()
> ---
> include/exec/ramblock.h | 8 +++
> migration/ram.c | 121 +++++++++++++++++++++++++++++++++++++---
> 2 files changed, 120 insertions(+), 9 deletions(-)
>
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 69c6a53902..e0e3f16852 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -44,6 +44,14 @@ struct RAMBlock {
> size_t page_size;
> /* dirty bitmap used during migration */
> unsigned long *bmap;
> + /* shadow dirty bitmap used when migrating to a file */
> + unsigned long *shadow_bmap;
What is a "shadow dirty bitmap"? It's pretty unclear to me.
AFAICT it's actually a "page present" bitmap, while taking zero pages as
"not present", no?
> + /*
> + * offset in the file where pages belonging to this ramblock are saved,
> + * used only during migration to a file.
> + */
> + off_t bitmap_offset;
> + uint64_t pages_offset;
Let's have a section to put fixed-ram data?
/*
* Below fields are only used by fixed-ram migration.
*/
...
> /* bitmap of already received pages in postcopy */
> unsigned long *receivedmap;
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 8c7886ab79..4a0ab8105f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -94,6 +94,18 @@
> #define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
> /* We can't use any flag that is bigger than 0x200 */
>
> +/*
> + * fixed-ram migration supports O_DIRECT, so we need to make sure the
> + * userspace buffer, the IO operation size and the file offset are
> + * aligned according to the underlying device's block size. The first
> + * two are already aligned to page size, but we need to add padding to
> + * the file to align the offset. We cannot read the block size
> + * dynamically because the migration file can be moved between
> + * different systems, so use 1M to cover most block sizes and to keep
> + * the file offset aligned at page size as well.
> + */
> +#define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
> +
> XBZRLECacheStats xbzrle_counters;
>
> /* used by the search for pages to send */
> @@ -1127,12 +1139,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
> return 0;
> }
>
> + stat64_add(&mig_stats.zero_pages, 1);
> +
> + if (migrate_fixed_ram()) {
> + /* zero pages are not transferred with fixed-ram */
> + clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
> + return 1;
> + }
> +
> len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
> qemu_put_byte(file, 0);
> len += 1;
> ram_release_page(pss->block->idstr, offset);
> -
> - stat64_add(&mig_stats.zero_pages, 1);
> ram_transferred_add(len);
>
> /*
> @@ -1190,14 +1208,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
> {
> QEMUFile *file = pss->pss_channel;
>
> - ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> - offset | RAM_SAVE_FLAG_PAGE));
> - if (async) {
> - qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> - migrate_release_ram() &&
> - migration_in_postcopy());
> + if (migrate_fixed_ram()) {
> + qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
> + block->pages_offset + offset);
> + set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
> } else {
> - qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> + ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
> + offset | RAM_SAVE_FLAG_PAGE));
> + if (async) {
> + qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
> + migrate_release_ram() &&
> + migration_in_postcopy());
> + } else {
> + qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
> + }
> }
> ram_transferred_add(TARGET_PAGE_SIZE);
> stat64_add(&mig_stats.normal_pages, 1);
> @@ -2413,6 +2437,8 @@ static void ram_save_cleanup(void *opaque)
> block->clear_bmap = NULL;
> g_free(block->bmap);
> block->bmap = NULL;
> + g_free(block->shadow_bmap);
> + block->shadow_bmap = NULL;
> }
>
> xbzrle_cleanup();
> @@ -2780,6 +2806,7 @@ static void ram_list_init_bitmaps(void)
> */
> block->bmap = bitmap_new(pages);
> bitmap_set(block->bmap, 0, pages);
> + block->shadow_bmap = bitmap_new(pages);
We can avoid creating this bitmap if !fixed-ram.
> block->clear_bmap_shift = shift;
> block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
> }
> @@ -2917,6 +2944,58 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
> }
> }
>
> +#define FIXED_RAM_HDR_VERSION 1
> +struct FixedRamHeader {
> + uint32_t version;
> + /*
> + * The target's page size, so we know how many pages are in the
> + * bitmap.
> + */
> + uint64_t page_size;
> + /*
> + * The offset in the migration file where the pages bitmap is
> + * found.
s/found/stored/?
> + */
> + uint64_t bitmap_offset;
> + /*
> + * The offset in the migration file where the actual pages (data)
> + * are found.
same?
> + */
> + uint64_t pages_offset;
> + /* end of v1 */
I think we can drop this.
> +} QEMU_PACKED;
> +typedef struct FixedRamHeader FixedRamHeader;
> +
> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
> +{
> + g_autofree FixedRamHeader *header;
Let's either inline the g_new0() or initialize it to NULL? Just in case.
> + size_t header_size, bitmap_size;
> + long num_pages;
> +
> + header = g_new0(FixedRamHeader, 1);
> + header_size = sizeof(FixedRamHeader);
> +
> + num_pages = block->used_length >> TARGET_PAGE_BITS;
> + bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> + /*
> + * Save the file offsets of where the bitmap and the pages should
> + * go as they are written at the end of migration and during the
> + * iterative phase, respectively.
> + */
> + block->bitmap_offset = qemu_get_offset(file) + header_size;
> + block->pages_offset = ROUND_UP(block->bitmap_offset +
> + bitmap_size,
> + FIXED_RAM_FILE_OFFSET_ALIGNMENT);
> +
> + header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
> + header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);
> + header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
> + header->pages_offset = cpu_to_be64(block->pages_offset);
> +
> + qemu_put_buffer(file, (uint8_t *) header, header_size);
> +}
> +
> /*
> * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
> * long-running RCU critical section. When rcu-reclaims in the code
> @@ -2966,6 +3045,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> if (migrate_ignore_shared()) {
> qemu_put_be64(f, block->mr->addr);
> }
> +
> + if (migrate_fixed_ram()) {
> + fixed_ram_insert_header(f, block);
> + /* prepare offset for next ramblock */
> + qemu_set_offset(f, block->pages_offset + block->used_length,
> + SEEK_SET);
How about moving this line into fixed_ram_insert_header()? Perhaps also
rename to fixed_ram_setup_ramblock()?
> + }
> }
> }
>
> @@ -2999,6 +3085,19 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> return qemu_fflush(f);
> }
>
> +static void ram_save_shadow_bmap(QEMUFile *f)
[may need a rename after we decide a better name for the bitmap; "shadow"
is probably not the one..]
> +{
> + RAMBlock *block;
> +
> + RAMBLOCK_FOREACH_MIGRATABLE(block) {
> + long num_pages = block->used_length >> TARGET_PAGE_BITS;
> + long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> + qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
> + block->bitmap_offset);
We may want to check for IO errors, either here, or (if too frequent) maybe
once and for all right before the final completion of migration? If the
latter, we may want to keep a comment around here explaining the error conditions.
> + ram_transferred_add(bitmap_size);
> + }
> +}
> +
> /**
> * ram_save_iterate: iterative stage for migration
> *
> @@ -3188,6 +3287,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
> return ret;
> }
>
> + if (migrate_fixed_ram()) {
> + ram_save_shadow_bmap(f);
> + }
> +
> if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
> qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> }
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
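[A minimal shape for the "check once before completion" option mentioned
above might look like the sketch below. It assumes qemu_file_get_error()
as the way to observe accumulated QEMUFile I/O failures; this is an
illustration of the idea, not the eventual patch.]
/* Sketch: write all bitmaps, then do a single error check so that any
 * accumulated I/O failure on the QEMUFile fails the migration. */
static int ram_save_bitmaps_and_check(QEMUFile *f)
{
    RAMBlock *block;
    RAMBLOCK_FOREACH_MIGRATABLE(block) {
        long num_pages = block->used_length >> TARGET_PAGE_BITS;
        long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
        qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
                           block->bitmap_offset);
        ram_transferred_add(bitmap_size);
    }
    /* One check covers all of the writes above. */
    return qemu_file_get_error(f);
}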
* Re: [RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration
2024-01-15 9:28 ` Peter Xu
@ 2024-01-15 14:50 ` Fabiano Rosas
0 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-15 14:50 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov, Paolo Bonzini,
David Hildenbrand, Philippe Mathieu-Daudé
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:50PM -0300, Fabiano Rosas wrote:
>> From: Nikolay Borisov <nborisov@suse.com>
>>
>> Implement the outgoing migration side for the 'fixed-ram' capability.
>>
>> A bitmap is introduced to track which pages have been written in the
>> migration file. Pages are written at a fixed location for every
>> ramblock. Zero pages are ignored as they'd already be zero on the
>> destination.
>>
>> The migration stream is altered to put the dirty pages for a ramblock
>> after its header instead of having a sequential stream of pages that
>> follow the ramblock headers. Since all pages have a fixed location,
>> RAM_SAVE_FLAG_EOS is no longer generated on every migration iteration.
>>
>> Without fixed-ram (current): With fixed-ram (new):
>>
>> --------------------- --------------------------------
>> | ramblock 1 header | | ramblock 1 header |
>> --------------------- --------------------------------
>> | ramblock 2 header | | ramblock 1 fixed-ram header |
>> --------------------- --------------------------------
>> | ... | | padding to next 1MB boundary |
>> --------------------- | ... |
>> | ramblock n header | --------------------------------
>> --------------------- | ramblock 1 pages |
>> | RAM_SAVE_FLAG_EOS | | ... |
>> --------------------- --------------------------------
>> | stream of pages | | ramblock 2 header |
>> | (iter 1) | --------------------------------
>> | ... | | ramblock 2 fixed-ram header |
>> --------------------- --------------------------------
>> | RAM_SAVE_FLAG_EOS | | padding to next 1MB boundary |
>> --------------------- | ... |
>> | stream of pages | --------------------------------
>> | (iter 2) | | ramblock 2 pages |
>> | ... | | ... |
>> --------------------- --------------------------------
>> | ... | | ... |
>> --------------------- --------------------------------
>> | RAM_SAVE_FLAG_EOS |
>> --------------------------------
>> | ... |
>> --------------------------------
>>
>> where:
>> - ramblock header: the generic information for a ramblock, such as
>> idstr, used_len, etc.
>>
>> - ramblock fixed-ram header: the new information added by this
>> feature: bitmap of pages written, bitmap size and offset of pages
>> in the migration file.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> - used a macro for alignment value
>> - documented alignment assumptions
>> - moved shadow_bmap debug code to multifd patch
>> - did NOT use used_length for bmap, it breaks dirty page tracking somehow
>> - uncommented the capability enabling
>> - accounted for the bitmap size with ram_transferred_add()
>> ---
>> include/exec/ramblock.h | 8 +++
>> migration/ram.c | 121 +++++++++++++++++++++++++++++++++++++---
>> 2 files changed, 120 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>> index 69c6a53902..e0e3f16852 100644
>> --- a/include/exec/ramblock.h
>> +++ b/include/exec/ramblock.h
>> @@ -44,6 +44,14 @@ struct RAMBlock {
>> size_t page_size;
>> /* dirty bitmap used during migration */
>> unsigned long *bmap;
>> + /* shadow dirty bitmap used when migrating to a file */
>> + unsigned long *shadow_bmap;
>
> What is a "shadow dirty bitmap"? It's pretty unclear to me.
>
> AFAICT it's actually a "page present" bitmap, while taking zero pages as
> "not present", no?
>
Yes, something like that. It's the bitmap of pages written to the
migration file.
>> + /*
>> + * offset in the file where pages belonging to this ramblock are saved,
>> + * used only during migration to a file.
>> + */
>> + off_t bitmap_offset;
>> + uint64_t pages_offset;
>
> Let's have a section to put fixed-ram data?
>
> /*
> * Below fields are only used by fixed-ram migration.
> */
> ...
>
>> /* bitmap of already received pages in postcopy */
>> unsigned long *receivedmap;
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 8c7886ab79..4a0ab8105f 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -94,6 +94,18 @@
>> #define RAM_SAVE_FLAG_MULTIFD_FLUSH 0x200
>> /* We can't use any flag that is bigger than 0x200 */
>>
>> +/*
>> + * fixed-ram migration supports O_DIRECT, so we need to make sure the
>> + * userspace buffer, the IO operation size and the file offset are
>> + * aligned according to the underlying device's block size. The first
>> + * two are already aligned to page size, but we need to add padding to
>> + * the file to align the offset. We cannot read the block size
>> + * dynamically because the migration file can be moved between
>> + * different systems, so use 1M to cover most block sizes and to keep
>> + * the file offset aligned at page size as well.
>> + */
>> +#define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
>> +
>> XBZRLECacheStats xbzrle_counters;
>>
>> /* used by the search for pages to send */
>> @@ -1127,12 +1139,18 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
>> return 0;
>> }
>>
>> + stat64_add(&mig_stats.zero_pages, 1);
>> +
>> + if (migrate_fixed_ram()) {
>> + /* zero pages are not transferred with fixed-ram */
>> + clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
>> + return 1;
>> + }
>> +
>> len += save_page_header(pss, file, pss->block, offset | RAM_SAVE_FLAG_ZERO);
>> qemu_put_byte(file, 0);
>> len += 1;
>> ram_release_page(pss->block->idstr, offset);
>> -
>> - stat64_add(&mig_stats.zero_pages, 1);
>> ram_transferred_add(len);
>>
>> /*
>> @@ -1190,14 +1208,20 @@ static int save_normal_page(PageSearchStatus *pss, RAMBlock *block,
>> {
>> QEMUFile *file = pss->pss_channel;
>>
>> - ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
>> - offset | RAM_SAVE_FLAG_PAGE));
>> - if (async) {
>> - qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
>> - migrate_release_ram() &&
>> - migration_in_postcopy());
>> + if (migrate_fixed_ram()) {
>> + qemu_put_buffer_at(file, buf, TARGET_PAGE_SIZE,
>> + block->pages_offset + offset);
>> + set_bit(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
>> } else {
>> - qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
>> + ram_transferred_add(save_page_header(pss, pss->pss_channel, block,
>> + offset | RAM_SAVE_FLAG_PAGE));
>> + if (async) {
>> + qemu_put_buffer_async(file, buf, TARGET_PAGE_SIZE,
>> + migrate_release_ram() &&
>> + migration_in_postcopy());
>> + } else {
>> + qemu_put_buffer(file, buf, TARGET_PAGE_SIZE);
>> + }
>> }
>> ram_transferred_add(TARGET_PAGE_SIZE);
>> stat64_add(&mig_stats.normal_pages, 1);
>> @@ -2413,6 +2437,8 @@ static void ram_save_cleanup(void *opaque)
>> block->clear_bmap = NULL;
>> g_free(block->bmap);
>> block->bmap = NULL;
>> + g_free(block->shadow_bmap);
>> + block->shadow_bmap = NULL;
>> }
>>
>> xbzrle_cleanup();
>> @@ -2780,6 +2806,7 @@ static void ram_list_init_bitmaps(void)
>> */
>> block->bmap = bitmap_new(pages);
>> bitmap_set(block->bmap, 0, pages);
>> + block->shadow_bmap = bitmap_new(pages);
>
> We can avoid creating this bitmap if !fixed-ram.
>
ack
>> block->clear_bmap_shift = shift;
>> block->clear_bmap = bitmap_new(clear_bmap_size(pages, shift));
>> }
>> @@ -2917,6 +2944,58 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
>> }
>> }
>>
>> +#define FIXED_RAM_HDR_VERSION 1
>> +struct FixedRamHeader {
>> + uint32_t version;
>> + /*
>> + * The target's page size, so we know how many pages are in the
>> + * bitmap.
>> + */
>> + uint64_t page_size;
>> + /*
>> + * The offset in the migration file where the pages bitmap is
>> + * found.
>
> s/found/stored/?
>
>> + */
>> + uint64_t bitmap_offset;
>> + /*
>> + * The offset in the migration file where the actual pages (data)
>> + * are found.
>
> same?
>
>> + */
>> + uint64_t pages_offset;
>> + /* end of v1 */
>
> I think we can drop this.
>
>> +} QEMU_PACKED;
>> +typedef struct FixedRamHeader FixedRamHeader;
>> +
>> +static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>> +{
>> + g_autofree FixedRamHeader *header;
>
> Let's either inline the g_new0() or initialize it to NULL? Just in case.
Ouch, that's just wrong. I'll fix it.
>> + size_t header_size, bitmap_size;
>> + long num_pages;
>> +
>> + header = g_new0(FixedRamHeader, 1);
>> + header_size = sizeof(FixedRamHeader);
>> +
>> + num_pages = block->used_length >> TARGET_PAGE_BITS;
>> + bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>> +
>> + /*
>> + * Save the file offsets of where the bitmap and the pages should
>> + * go as they are written at the end of migration and during the
>> + * iterative phase, respectively.
>> + */
>> + block->bitmap_offset = qemu_get_offset(file) + header_size;
>> + block->pages_offset = ROUND_UP(block->bitmap_offset +
>> + bitmap_size,
>> + FIXED_RAM_FILE_OFFSET_ALIGNMENT);
>> +
>> + header->version = cpu_to_be32(FIXED_RAM_HDR_VERSION);
>> + header->page_size = cpu_to_be64(TARGET_PAGE_SIZE);
>> + header->bitmap_offset = cpu_to_be64(block->bitmap_offset);
>> + header->pages_offset = cpu_to_be64(block->pages_offset);
>> +
>> + qemu_put_buffer(file, (uint8_t *) header, header_size);
>> +}
>> +
>> /*
>> * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>> * long-running RCU critical section. When rcu-reclaims in the code
>> @@ -2966,6 +3045,13 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>> if (migrate_ignore_shared()) {
>> qemu_put_be64(f, block->mr->addr);
>> }
>> +
>> + if (migrate_fixed_ram()) {
>> + fixed_ram_insert_header(f, block);
>> + /* prepare offset for next ramblock */
>> + qemu_set_offset(f, block->pages_offset + block->used_length,
>> + SEEK_SET);
>
> How about moving this line into fixed_ram_insert_header()? Perhaps also
> rename to fixed_ram_setup_ramblock()?
>
Ok.
>> + }
>> }
>> }
>>
>> @@ -2999,6 +3085,19 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>> return qemu_fflush(f);
>> }
>>
>> +static void ram_save_shadow_bmap(QEMUFile *f)
>
> [may need a rename after we decide a better name for the bitmap; "shadow"
> is probably not the one..]
>
>> +{
>> + RAMBlock *block;
>> +
>> + RAMBLOCK_FOREACH_MIGRATABLE(block) {
>> + long num_pages = block->used_length >> TARGET_PAGE_BITS;
>> + long bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>> + qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
>> + block->bitmap_offset);
>
> We may want to check for IO errors, either here, or (if too frequent) maybe
> once and for all right before the final completion of migration? If the
> latter, we may want to keep a comment around here explaining the error conditions.
>
Ok.
>> + ram_transferred_add(bitmap_size);
>> + }
>> +}
>> +
>> /**
>> * ram_save_iterate: iterative stage for migration
>> *
>> @@ -3188,6 +3287,10 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>> return ret;
>> }
>>
>> + if (migrate_fixed_ram()) {
>> + ram_save_shadow_bmap(f);
>> + }
>> +
>> if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
>> qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>> }
>> --
>> 2.35.3
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 09/30] migration/ram: Add incoming 'fixed-ram' migration
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (7 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 08/30] migration/ram: Add outgoing 'fixed-ram' migration Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-15 9:49 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 10/30] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
` (21 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
Add the necessary code to parse the format changes for the 'fixed-ram'
capability.
One of the more notable changes in behavior is that in the 'fixed-ram'
case ram pages are restored in one go rather than constantly looping
through the migration stream.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- added sanity check for pages_offset alignment
- s/parsing/reading
- used Error
- fixed buffer size computation, now allowing an arbitrary limit
- fixed dereference of pointer to packed struct member in endianness
conversion
---
migration/ram.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)
diff --git a/migration/ram.c b/migration/ram.c
index 4a0ab8105f..08604222f2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -106,6 +106,12 @@
*/
#define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
+/*
+ * When doing fixed-ram migration, this is the amount we read from the
+ * pages region in the migration file at a time.
+ */
+#define FIXED_RAM_LOAD_BUF_SIZE 0x100000
+
XBZRLECacheStats xbzrle_counters;
/* used by the search for pages to send */
@@ -2996,6 +3002,35 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
qemu_put_buffer(file, (uint8_t *) header, header_size);
}
+static bool fixed_ram_read_header(QEMUFile *file, FixedRamHeader *header,
+ Error **errp)
+{
+ size_t ret, header_size = sizeof(FixedRamHeader);
+
+ ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
+ if (ret != header_size) {
+ error_setg(errp, "Could not read whole fixed-ram migration header "
+ "(expected %zd, got %zd bytes)", header_size, ret);
+ return false;
+ }
+
+ /* migration stream is big-endian */
+ header->version = be32_to_cpu(header->version);
+
+ if (header->version > FIXED_RAM_HDR_VERSION) {
+ error_setg(errp, "Migration fixed-ram capability version mismatch "
+ "(expected %d, got %d)", FIXED_RAM_HDR_VERSION,
+ header->version);
+ return false;
+ }
+
+ header->page_size = be64_to_cpu(header->page_size);
+ header->bitmap_offset = be64_to_cpu(header->bitmap_offset);
+ header->pages_offset = be64_to_cpu(header->pages_offset);
+
+ return true;
+}
+
/*
* Each of ram_save_setup, ram_save_iterate and ram_save_complete has
* long-running RCU critical section. When rcu-reclaims in the code
@@ -3892,6 +3927,80 @@ void colo_flush_ram_cache(void)
trace_colo_flush_ram_cache_end();
}
+static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
+ long num_pages, unsigned long *bitmap)
+{
+ unsigned long set_bit_idx, clear_bit_idx;
+ ram_addr_t offset;
+ void *host;
+ size_t read, unread, size, buf_size = FIXED_RAM_LOAD_BUF_SIZE;
+
+ for (set_bit_idx = find_first_bit(bitmap, num_pages);
+ set_bit_idx < num_pages;
+ set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
+
+ clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
+
+ unread = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
+ offset = set_bit_idx << TARGET_PAGE_BITS;
+
+ while (unread > 0) {
+ host = host_from_ram_block_offset(block, offset);
+ size = MIN(unread, buf_size);
+
+ read = qemu_get_buffer_at(f, host, size,
+ block->pages_offset + offset);
+ offset += read;
+ unread -= read;
+ }
+ }
+}
+
+static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
+ ram_addr_t length, Error **errp)
+{
+ g_autofree unsigned long *bitmap = NULL;
+ FixedRamHeader header;
+ size_t bitmap_size;
+ long num_pages;
+
+ if (!fixed_ram_read_header(f, &header, errp)) {
+ return -EINVAL;
+ }
+
+ block->pages_offset = header.pages_offset;
+
+ /*
+ * Check the alignment of the file region that contains pages. We
+ * don't enforce FIXED_RAM_FILE_OFFSET_ALIGNMENT to allow that
+ * value to change in the future. Do only a sanity check with page
+ * size alignment.
+ */
+ if (!QEMU_IS_ALIGNED(block->pages_offset, TARGET_PAGE_SIZE)) {
+ error_setg(errp,
+ "Error reading ramblock %s pages, region has bad alignment",
+ block->idstr);
+ return -EINVAL;
+ }
+
+ num_pages = length / header.page_size;
+ bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
+
+ bitmap = g_malloc0(bitmap_size);
+ if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
+ header.bitmap_offset) != bitmap_size) {
+ error_setg(errp, "Error reading dirty bitmap");
+ return -EINVAL;
+ }
+
+ read_ramblock_fixed_ram(f, block, num_pages, bitmap);
+
+ /* Skip pages array */
+ qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
+
+ return 0;
+}
+
static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
{
int ret = 0;
@@ -3900,6 +4009,16 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
assert(block);
+ if (migrate_fixed_ram()) {
+ Error *local_err = NULL;
+
+ ret = parse_ramblock_fixed_ram(f, block, length, &local_err);
+ if (local_err) {
+ error_report_err(local_err);
+ }
+ return ret;
+ }
+
if (!qemu_ram_is_migratable(block)) {
error_report("block %s should not be migrated !", block->idstr);
return -EINVAL;
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
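[The run-based walk that read_ramblock_fixed_ram() does over the bitmap can
be tried in isolation. Below is a toy standalone program in plain C;
find_next() stands in for QEMU's find_next_bit()/find_next_zero_bit():]
#include <stdint.h>
#include <stdio.h>
/* Return the first index >= from whose bit equals want, or nbits if none. */
static int find_next(uint64_t map, int nbits, int from, int want)
{
    for (int i = from; i < nbits; i++) {
        if ((int)((map >> i) & 1) == want) {
            return i;
        }
    }
    return nbits;
}
int main(void)
{
    uint64_t bitmap = 0x1CD;   /* pages 0, 2, 3, 6, 7 and 8 are present */
    int nbits = 10;
    /* Each run of set bits becomes one contiguous read from the file. */
    for (int set = find_next(bitmap, nbits, 0, 1); set < nbits;) {
        int clear = find_next(bitmap, nbits, set + 1, 0);
        printf("read pages [%d, %d) as one chunk\n", set, clear);
        set = find_next(bitmap, nbits, clear + 1, 1);
    }
    return 0;
}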
* Re: [RFC PATCH v3 09/30] migration/ram: Add incoming 'fixed-ram' migration
2023-11-27 20:25 ` [RFC PATCH v3 09/30] migration/ram: Add incoming " Fabiano Rosas
@ 2024-01-15 9:49 ` Peter Xu
2024-01-15 16:43 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-15 9:49 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
On Mon, Nov 27, 2023 at 05:25:51PM -0300, Fabiano Rosas wrote:
> Add the necessary code to parse the format changes for the 'fixed-ram'
> capability.
>
> One of the more notable changes in behavior is that in the 'fixed-ram'
> case ram pages are restored in one go rather than constantly looping
> through the migration stream.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - added sanity check for pages_offset alignment
> - s/parsing/reading
> - used Error
> - fixed buffer size computation, now allowing an arbitrary limit
> - fixed dereference of pointer to packed struct member in endianness
> conversion
> ---
> migration/ram.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 119 insertions(+)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 4a0ab8105f..08604222f2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -106,6 +106,12 @@
> */
> #define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
>
> +/*
> + * When doing fixed-ram migration, this is the amount we read from the
> + * pages region in the migration file at a time.
> + */
> +#define FIXED_RAM_LOAD_BUF_SIZE 0x100000
> +
> XBZRLECacheStats xbzrle_counters;
>
> /* used by the search for pages to send */
> @@ -2996,6 +3002,35 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
> qemu_put_buffer(file, (uint8_t *) header, header_size);
> }
>
> +static bool fixed_ram_read_header(QEMUFile *file, FixedRamHeader *header,
> + Error **errp)
> +{
> + size_t ret, header_size = sizeof(FixedRamHeader);
> +
> + ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
> + if (ret != header_size) {
> + error_setg(errp, "Could not read whole fixed-ram migration header "
> + "(expected %zd, got %zd bytes)", header_size, ret);
> + return false;
> + }
> +
> + /* migration stream is big-endian */
> + header->version = be32_to_cpu(header->version);
> +
> + if (header->version > FIXED_RAM_HDR_VERSION) {
> + error_setg(errp, "Migration fixed-ram capability version mismatch "
> + "(expected %d, got %d)", FIXED_RAM_HDR_VERSION,
> + header->version);
> + return false;
> + }
> +
> + header->page_size = be64_to_cpu(header->page_size);
> + header->bitmap_offset = be64_to_cpu(header->bitmap_offset);
> + header->pages_offset = be64_to_cpu(header->pages_offset);
> +
> + return true;
> +}
> +
> /*
> * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
> * long-running RCU critical section. When rcu-reclaims in the code
> @@ -3892,6 +3927,80 @@ void colo_flush_ram_cache(void)
> trace_colo_flush_ram_cache_end();
> }
>
> +static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
> + long num_pages, unsigned long *bitmap)
> +{
> + unsigned long set_bit_idx, clear_bit_idx;
> + ram_addr_t offset;
> + void *host;
> + size_t read, unread, size, buf_size = FIXED_RAM_LOAD_BUF_SIZE;
> +
> + for (set_bit_idx = find_first_bit(bitmap, num_pages);
> + set_bit_idx < num_pages;
> + set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
> +
> + clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
> +
> + unread = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
> + offset = set_bit_idx << TARGET_PAGE_BITS;
> +
> + while (unread > 0) {
> + host = host_from_ram_block_offset(block, offset);
> + size = MIN(unread, buf_size);
Use the macro directly? buf_size can be dropped then.
> +
> + read = qemu_get_buffer_at(f, host, size,
> + block->pages_offset + offset);
Error detection missing? qemu_get_buffer_at() returns 0 on error, and then
this loop never terminates.
> + offset += read;
> + unread -= read;
> + }
> + }
> +}
> +
> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
> + ram_addr_t length, Error **errp)
For new code, let's start using a boolean retval when Error** exists?
> +{
> + g_autofree unsigned long *bitmap = NULL;
> + FixedRamHeader header;
> + size_t bitmap_size;
> + long num_pages;
> +
> + if (!fixed_ram_read_header(f, &header, errp)) {
> + return -EINVAL;
> + }
> +
> + block->pages_offset = header.pages_offset;
> +
> + /*
> + * Check the alignment of the file region that contains pages. We
> + * don't enforce FIXED_RAM_FILE_OFFSET_ALIGNMENT to allow that
> + * value to change in the future. Do only a sanity check with page
> + * size alignment.
> + */
> + if (!QEMU_IS_ALIGNED(block->pages_offset, TARGET_PAGE_SIZE)) {
> + error_setg(errp,
> + "Error reading ramblock %s pages, region has bad alignment",
> + block->idstr);
> + return -EINVAL;
> + }
> +
> + num_pages = length / header.page_size;
> + bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
> +
> + bitmap = g_malloc0(bitmap_size);
> + if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
> + header.bitmap_offset) != bitmap_size) {
> + error_setg(errp, "Error reading dirty bitmap");
> + return -EINVAL;
> + }
> +
> + read_ramblock_fixed_ram(f, block, num_pages, bitmap);
Detect error and fail properly?
> +
> + /* Skip pages array */
> + qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
> +
> + return 0;
> +}
> +
> static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
> {
> int ret = 0;
> @@ -3900,6 +4009,16 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>
> assert(block);
>
> + if (migrate_fixed_ram()) {
> + Error *local_err = NULL;
> +
> + ret = parse_ramblock_fixed_ram(f, block, length, &local_err);
> + if (local_err) {
> + error_report_err(local_err);
> + }
> + return ret;
We can optionally add one pre-requisite patch to convert parse_ramblock()
to return boolean too. I remember it was done somewhere before, but maybe
not merged.
> + }
> +
> if (!qemu_ram_is_migratable(block)) {
> error_report("block %s should not be migrated !", block->idstr);
> return -EINVAL;
> --
> 2.35.3
>
--
Peter Xu
* Re: [RFC PATCH v3 09/30] migration/ram: Add incoming 'fixed-ram' migration
2024-01-15 9:49 ` Peter Xu
@ 2024-01-15 16:43 ` Fabiano Rosas
0 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-15 16:43 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana, Nikolay Borisov
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:51PM -0300, Fabiano Rosas wrote:
>> Add the necessary code to parse the format changes for the 'fixed-ram'
>> capability.
>>
>> One of the more notable changes in behavior is that in the 'fixed-ram'
>> case ram pages are restored in one go rather than constantly looping
>> through the migration stream.
>>
>> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> - added sanity check for pages_offset alignment
>> - s/parsing/reading
>> - used Error
>> - fixed buffer size computation, now allowing an arbitrary limit
>> - fixed dereference of pointer to packed struct member in endianness
>> conversion
>> ---
>> migration/ram.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 119 insertions(+)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 4a0ab8105f..08604222f2 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -106,6 +106,12 @@
>> */
>> #define FIXED_RAM_FILE_OFFSET_ALIGNMENT 0x100000
>>
>> +/*
>> + * When doing fixed-ram migration, this is the amount we read from the
>> + * pages region in the migration file at a time.
>> + */
>> +#define FIXED_RAM_LOAD_BUF_SIZE 0x100000
>> +
>> XBZRLECacheStats xbzrle_counters;
>>
>> /* used by the search for pages to send */
>> @@ -2996,6 +3002,35 @@ static void fixed_ram_insert_header(QEMUFile *file, RAMBlock *block)
>> qemu_put_buffer(file, (uint8_t *) header, header_size);
>> }
>>
>> +static bool fixed_ram_read_header(QEMUFile *file, FixedRamHeader *header,
>> + Error **errp)
>> +{
>> + size_t ret, header_size = sizeof(FixedRamHeader);
>> +
>> + ret = qemu_get_buffer(file, (uint8_t *)header, header_size);
>> + if (ret != header_size) {
>> + error_setg(errp, "Could not read whole fixed-ram migration header "
>> + "(expected %zd, got %zd bytes)", header_size, ret);
>> + return false;
>> + }
>> +
>> + /* migration stream is big-endian */
>> + header->version = be32_to_cpu(header->version);
>> +
>> + if (header->version > FIXED_RAM_HDR_VERSION) {
>> + error_setg(errp, "Migration fixed-ram capability version mismatch "
>> + "(expected %d, got %d)", FIXED_RAM_HDR_VERSION,
>> + header->version);
>> + return false;
>> + }
>> +
>> + header->page_size = be64_to_cpu(header->page_size);
>> + header->bitmap_offset = be64_to_cpu(header->bitmap_offset);
>> + header->pages_offset = be64_to_cpu(header->pages_offset);
>> +
>> + return true;
>> +}
>> +
>> /*
>> * Each of ram_save_setup, ram_save_iterate and ram_save_complete has
>> * long-running RCU critical section. When rcu-reclaims in the code
>> @@ -3892,6 +3927,80 @@ void colo_flush_ram_cache(void)
>> trace_colo_flush_ram_cache_end();
>> }
>>
>> +static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
>> + long num_pages, unsigned long *bitmap)
>> +{
>> + unsigned long set_bit_idx, clear_bit_idx;
>> + ram_addr_t offset;
>> + void *host;
>> + size_t read, unread, size, buf_size = FIXED_RAM_LOAD_BUF_SIZE;
>> +
>> + for (set_bit_idx = find_first_bit(bitmap, num_pages);
>> + set_bit_idx < num_pages;
>> + set_bit_idx = find_next_bit(bitmap, num_pages, clear_bit_idx + 1)) {
>> +
>> + clear_bit_idx = find_next_zero_bit(bitmap, num_pages, set_bit_idx + 1);
>> +
>> + unread = TARGET_PAGE_SIZE * (clear_bit_idx - set_bit_idx);
>> + offset = set_bit_idx << TARGET_PAGE_BITS;
>> +
>> + while (unread > 0) {
>> + host = host_from_ram_block_offset(block, offset);
>> + size = MIN(unread, buf_size);
>
> Use the macro directly? buf_size can be dropped then.
>
Ok. We only need it later when multifd support is added to this
function.
>> +
>> + read = qemu_get_buffer_at(f, host, size,
>> + block->pages_offset + offset);
>
> Error detection missing? qemu_get_buffer_at() returns 0 on error, so this
> would loop forever.
>
Ah right, I was expecting we'd have a direction on how to improve the
qemu-file error handling before I sent this version, and I ended up
forgetting to do something about it.
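Something like the below should do it (untested sketch, assuming we
convert read_ramblock_fixed_ram() to return bool and take an Error **errp):
    read = qemu_get_buffer_at(f, host, size,
                              block->pages_offset + offset);
    if (!read) {
        /* qemu_get_buffer_at() returns 0 on error */
        error_setg(errp, "zero-sized read from migration file");
        return false;
    }
    offset += read;
    unread -= read;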
>> + offset += read;
>> + unread -= read;
>> + }
>> + }
>> +}
>> +
>> +static int parse_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
>> + ram_addr_t length, Error **errp)
>
> For new code, shall we start using a boolean retval when an Error** exists?
>
Yep.
>> +{
>> + g_autofree unsigned long *bitmap = NULL;
>> + FixedRamHeader header;
>> + size_t bitmap_size;
>> + long num_pages;
>> +
>> + if (!fixed_ram_read_header(f, &header, errp)) {
>> + return -EINVAL;
>> + }
>> +
>> + block->pages_offset = header.pages_offset;
>> +
>> + /*
>> + * Check the alignment of the file region that contains pages. We
>> + * don't enforce FIXED_RAM_FILE_OFFSET_ALIGNMENT to allow that
>> + * value to change in the future. Do only a sanity check with page
>> + * size alignment.
>> + */
>> + if (!QEMU_IS_ALIGNED(block->pages_offset, TARGET_PAGE_SIZE)) {
>> + error_setg(errp,
>> + "Error reading ramblock %s pages, region has bad alignment",
>> + block->idstr);
>> + return -EINVAL;
>> + }
>> +
>> + num_pages = length / header.page_size;
>> + bitmap_size = BITS_TO_LONGS(num_pages) * sizeof(unsigned long);
>> +
>> + bitmap = g_malloc0(bitmap_size);
>> + if (qemu_get_buffer_at(f, (uint8_t *)bitmap, bitmap_size,
>> + header.bitmap_offset) != bitmap_size) {
>> + error_setg(errp, "Error reading dirty bitmap");
>> + return -EINVAL;
>> + }
>> +
>> + read_ramblock_fixed_ram(f, block, num_pages, bitmap);
>
> Detect error and fail properly?
>
Ok.
>> +
>> + /* Skip pages array */
>> + qemu_set_offset(f, block->pages_offset + length, SEEK_SET);
>> +
>> + return 0;
>> +}
>> +
>> static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>> {
>> int ret = 0;
>> @@ -3900,6 +4009,16 @@ static int parse_ramblock(QEMUFile *f, RAMBlock *block, ram_addr_t length)
>>
>> assert(block);
>>
>> + if (migrate_fixed_ram()) {
>> + Error *local_err = NULL;
>> +
>> + ret = parse_ramblock_fixed_ram(f, block, length, &local_err);
>> + if (local_err) {
>> + error_report_err(local_err);
>> + }
>> + return ret;
>
> We can optionally add one pre-requisite patch to convert parse_ramblock()
> to return boolean too. I remember it was done somewhere before, but maybe
> not merged.
>
I don't think we changed the return type. There was only a refactoring
at commit 2f5ced5b. I'll change it to a boolean if possible.
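For reference, the call site in parse_ramblock() would then look
something like this (sketch based on the hunk quoted above):
    if (migrate_fixed_ram()) {
        Error *local_err = NULL;
        if (!parse_ramblock_fixed_ram(f, block, length, &local_err)) {
            error_report_err(local_err);
            return -EINVAL;
        }
        return 0;
    }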
>> + }
>> +
>> if (!qemu_ram_is_migratable(block)) {
>> error_report("block %s should not be migrated !", block->idstr);
>> return -EINVAL;
>> --
>> 2.35.3
>>
* [RFC PATCH v3 10/30] tests/qtest: migration-test: Add tests for fixed-ram file-based migration
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (8 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 09/30] migration/ram: Add incoming " Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-15 10:01 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets Fabiano Rosas
` (20 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Nikolay Borisov, Thomas Huth, Laurent Vivier,
Paolo Bonzini
From: Nikolay Borisov <nborisov@suse.com>
Add basic tests for 'fixed-ram' migration.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
tests/qtest/migration-test.c | 39 ++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 0fbaa6a90f..96a6217af0 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2135,6 +2135,14 @@ static void *test_mode_reboot_start(QTestState *from, QTestState *to)
return NULL;
}
+static void *migrate_fixed_ram_start(QTestState *from, QTestState *to)
+{
+ migrate_set_capability(from, "fixed-ram", true);
+ migrate_set_capability(to, "fixed-ram", true);
+
+ return NULL;
+}
+
static void test_mode_reboot(void)
{
g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
@@ -2149,6 +2157,32 @@ static void test_mode_reboot(void)
test_file_common(&args, true);
}
+static void test_precopy_file_fixed_ram_live(void)
+{
+ g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+ FILE_TEST_FILENAME);
+ MigrateCommon args = {
+ .connect_uri = uri,
+ .listen_uri = "defer",
+ .start_hook = migrate_fixed_ram_start,
+ };
+
+ test_file_common(&args, false);
+}
+
+static void test_precopy_file_fixed_ram(void)
+{
+ g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+ FILE_TEST_FILENAME);
+ MigrateCommon args = {
+ .connect_uri = uri,
+ .listen_uri = "defer",
+ .start_hook = migrate_fixed_ram_start,
+ };
+
+ test_file_common(&args, true);
+}
+
static void test_precopy_tcp_plain(void)
{
MigrateCommon args = {
@@ -3392,6 +3426,11 @@ int main(int argc, char **argv)
qtest_add_func("/migration/mode/reboot", test_mode_reboot);
}
+ qtest_add_func("/migration/precopy/file/fixed-ram",
+ test_precopy_file_fixed_ram);
+ qtest_add_func("/migration/precopy/file/fixed-ram/live",
+ test_precopy_file_fixed_ram_live);
+
#ifdef CONFIG_GNUTLS
qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
--
2.35.3
* Re: [RFC PATCH v3 10/30] tests/qtest: migration-test: Add tests for fixed-ram file-based migration
2023-11-27 20:25 ` [RFC PATCH v3 10/30] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
@ 2024-01-15 10:01 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 10:01 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Leonardo Bras, Claudio Fontana,
Nikolay Borisov, Thomas Huth, Laurent Vivier, Paolo Bonzini
On Mon, Nov 27, 2023 at 05:25:52PM -0300, Fabiano Rosas wrote:
> From: Nikolay Borisov <nborisov@suse.com>
>
> Add basic tests for 'fixed-ram' migration.
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (9 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 10/30] tests/qtest: migration-test: Add tests for fixed-ram file-based migration Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-15 11:51 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 12/30] migration/multifd: Allow QIOTask error reporting without an object Fabiano Rosas
` (19 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
For the upcoming support for the new 'fixed-ram' migration stream
format, we cannot use multifd packets because each write into the
ramblock section in the migration file is expected to contain only the
guest pages. They are written at their respective offsets relative to
the ramblock section header.
There is no space for the packet information and the expected gains
from the new approach come partly from being able to write the pages
sequentially without extraneous data in between.
The new format also doesn't need the packets and all necessary
information can be taken from the standard migration headers with some
(future) changes to multifd code.
Use the presence of the fixed-ram capability to decide whether to send
packets. For now this has no effect as fixed-ram cannot yet be enabled
with multifd.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- moved more of the packet code under use_packets
---
migration/multifd.c | 138 +++++++++++++++++++++++++++-----------------
migration/options.c | 5 ++
migration/options.h | 1 +
3 files changed, 91 insertions(+), 53 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index ec58c58082..9625640d61 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -654,18 +654,22 @@ static void *multifd_send_thread(void *opaque)
Error *local_err = NULL;
int ret = 0;
bool use_zero_copy_send = migrate_zero_copy_send();
+ bool use_packets = migrate_multifd_packets();
thread = migration_threads_add(p->name, qemu_get_thread_id());
trace_multifd_send_thread_start(p->id);
rcu_register_thread();
- if (multifd_send_initial_packet(p, &local_err) < 0) {
- ret = -1;
- goto out;
+ if (use_packets) {
+ if (multifd_send_initial_packet(p, &local_err) < 0) {
+ ret = -1;
+ goto out;
+ }
+
+ /* initial packet */
+ p->num_packets = 1;
}
- /* initial packet */
- p->num_packets = 1;
while (true) {
qemu_sem_post(&multifd_send_state->channels_ready);
@@ -677,11 +681,10 @@ static void *multifd_send_thread(void *opaque)
qemu_mutex_lock(&p->mutex);
if (p->pending_job) {
- uint64_t packet_num = p->packet_num;
uint32_t flags;
p->normal_num = 0;
- if (use_zero_copy_send) {
+ if (!use_packets || use_zero_copy_send) {
p->iovs_num = 0;
} else {
p->iovs_num = 1;
@@ -699,16 +702,20 @@ static void *multifd_send_thread(void *opaque)
break;
}
}
- multifd_send_fill_packet(p);
+
+ if (use_packets) {
+ multifd_send_fill_packet(p);
+ p->num_packets++;
+ }
+
flags = p->flags;
p->flags = 0;
- p->num_packets++;
p->total_normal_pages += p->normal_num;
p->pages->num = 0;
p->pages->block = NULL;
qemu_mutex_unlock(&p->mutex);
- trace_multifd_send(p->id, packet_num, p->normal_num, flags,
+ trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
p->next_packet_size);
if (use_zero_copy_send) {
@@ -718,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
if (ret != 0) {
break;
}
- } else {
+ } else if (use_packets) {
/* Send header using the same writev call */
p->iov[0].iov_len = p->packet_len;
p->iov[0].iov_base = p->packet;
@@ -904,6 +911,7 @@ int multifd_save_setup(Error **errp)
{
int thread_count;
uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+ bool use_packets = migrate_multifd_packets();
uint8_t i;
if (!migrate_multifd()) {
@@ -928,14 +936,20 @@ int multifd_save_setup(Error **errp)
p->pending_job = 0;
p->id = i;
p->pages = multifd_pages_init(page_count);
- p->packet_len = sizeof(MultiFDPacket_t)
- + sizeof(uint64_t) * page_count;
- p->packet = g_malloc0(p->packet_len);
- p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
- p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+ if (use_packets) {
+ p->packet_len = sizeof(MultiFDPacket_t)
+ + sizeof(uint64_t) * page_count;
+ p->packet = g_malloc0(p->packet_len);
+ p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
+ p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+ /* We need one extra place for the packet header */
+ p->iov = g_new0(struct iovec, page_count + 1);
+ } else {
+ p->iov = g_new0(struct iovec, page_count);
+ }
p->name = g_strdup_printf("multifdsend_%d", i);
- /* We need one extra place for the packet header */
- p->iov = g_new0(struct iovec, page_count + 1);
p->normal = g_new0(ram_addr_t, page_count);
p->page_size = qemu_target_page_size();
p->page_count = page_count;
@@ -1067,7 +1081,7 @@ void multifd_recv_sync_main(void)
{
int i;
- if (!migrate_multifd()) {
+ if (!migrate_multifd() || !migrate_multifd_packets()) {
return;
}
for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -1094,38 +1108,44 @@ static void *multifd_recv_thread(void *opaque)
{
MultiFDRecvParams *p = opaque;
Error *local_err = NULL;
+ bool use_packets = migrate_multifd_packets();
int ret;
trace_multifd_recv_thread_start(p->id);
rcu_register_thread();
while (true) {
- uint32_t flags;
+ uint32_t flags = 0;
+ p->normal_num = 0;
if (p->quit) {
break;
}
- ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
- p->packet_len, &local_err);
- if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
- break;
- }
+ if (use_packets) {
+ ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
+ p->packet_len, &local_err);
+ if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
+ break;
+ }
+
+ qemu_mutex_lock(&p->mutex);
+ ret = multifd_recv_unfill_packet(p, &local_err);
+ if (ret) {
+ qemu_mutex_unlock(&p->mutex);
+ break;
+ }
+ p->num_packets++;
+
+ flags = p->flags;
+ /* recv methods don't know how to handle the SYNC flag */
+ p->flags &= ~MULTIFD_FLAG_SYNC;
+ trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
+ p->next_packet_size);
- qemu_mutex_lock(&p->mutex);
- ret = multifd_recv_unfill_packet(p, &local_err);
- if (ret) {
- qemu_mutex_unlock(&p->mutex);
- break;
+ p->total_normal_pages += p->normal_num;
}
- flags = p->flags;
- /* recv methods don't know how to handle the SYNC flag */
- p->flags &= ~MULTIFD_FLAG_SYNC;
- trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
- p->next_packet_size);
- p->num_packets++;
- p->total_normal_pages += p->normal_num;
qemu_mutex_unlock(&p->mutex);
if (p->normal_num) {
@@ -1135,7 +1155,7 @@ static void *multifd_recv_thread(void *opaque)
}
}
- if (flags & MULTIFD_FLAG_SYNC) {
+ if (use_packets && (flags & MULTIFD_FLAG_SYNC)) {
qemu_sem_post(&multifd_recv_state->sem_sync);
qemu_sem_wait(&p->sem_sync);
}
@@ -1159,6 +1179,7 @@ int multifd_load_setup(Error **errp)
{
int thread_count;
uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+ bool use_packets = migrate_multifd_packets();
uint8_t i;
/*
@@ -1183,9 +1204,12 @@ int multifd_load_setup(Error **errp)
qemu_sem_init(&p->sem_sync, 0);
p->quit = false;
p->id = i;
- p->packet_len = sizeof(MultiFDPacket_t)
- + sizeof(uint64_t) * page_count;
- p->packet = g_malloc0(p->packet_len);
+
+ if (use_packets) {
+ p->packet_len = sizeof(MultiFDPacket_t)
+ + sizeof(uint64_t) * page_count;
+ p->packet = g_malloc0(p->packet_len);
+ }
p->name = g_strdup_printf("multifdrecv_%d", i);
p->iov = g_new0(struct iovec, page_count);
p->normal = g_new0(ram_addr_t, page_count);
@@ -1231,18 +1255,27 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
{
MultiFDRecvParams *p;
Error *local_err = NULL;
- int id;
+ bool use_packets = migrate_multifd_packets();
+ int id, num_packets = 0;
- id = multifd_recv_initial_packet(ioc, &local_err);
- if (id < 0) {
- multifd_recv_terminate_threads(local_err);
- error_propagate_prepend(errp, local_err,
- "failed to receive packet"
- " via multifd channel %d: ",
- qatomic_read(&multifd_recv_state->count));
- return;
+ if (use_packets) {
+ id = multifd_recv_initial_packet(ioc, &local_err);
+ if (id < 0) {
+ multifd_recv_terminate_threads(local_err);
+ error_propagate_prepend(errp, local_err,
+ "failed to receive packet"
+ " via multifd channel %d: ",
+ qatomic_read(&multifd_recv_state->count));
+ return;
+ }
+ trace_multifd_recv_new_channel(id);
+
+ /* initial packet */
+ num_packets = 1;
+ } else {
+ /* next patch gives this a meaningful value */
+ id = 0;
}
- trace_multifd_recv_new_channel(id);
p = &multifd_recv_state->params[id];
if (p->c != NULL) {
@@ -1253,9 +1286,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
return;
}
p->c = ioc;
+ p->num_packets = num_packets;
object_ref(OBJECT(ioc));
- /* initial packet */
- p->num_packets = 1;
p->running = true;
qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
diff --git a/migration/options.c b/migration/options.c
index 775428a8a5..10730b13ba 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -385,6 +385,11 @@ bool migrate_multifd_flush_after_each_section(void)
return s->multifd_flush_after_each_section;
}
+bool migrate_multifd_packets(void)
+{
+ return !migrate_fixed_ram();
+}
+
bool migrate_postcopy(void)
{
return migrate_postcopy_ram() || migrate_dirty_bitmaps();
diff --git a/migration/options.h b/migration/options.h
index 8680a10b79..8a19d6939c 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -56,6 +56,7 @@ bool migrate_zero_copy_send(void);
*/
bool migrate_multifd_flush_after_each_section(void);
+bool migrate_multifd_packets(void);
bool migrate_postcopy(void);
bool migrate_rdma(void);
bool migrate_tls(void);
--
2.35.3
* Re: [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets
2023-11-27 20:25 ` [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets Fabiano Rosas
@ 2024-01-15 11:51 ` Peter Xu
2024-01-15 18:39 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-15 11:51 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:53PM -0300, Fabiano Rosas wrote:
> For the upcoming support for the new 'fixed-ram' migration stream
> format, we cannot use multifd packets because each write into the
> ramblock section in the migration file is expected to contain only the
> guest pages. They are written at their respective offsets relative to
> the ramblock section header.
>
> There is no space for the packet information and the expected gains
> from the new approach come partly from being able to write the pages
> sequentially without extraneous data in between.
>
> The new format also doesn't need the packets and all necessary
> information can be taken from the standard migration headers with some
> (future) changes to multifd code.
>
> Use the presence of the fixed-ram capability to decide whether to send
> packets. For now this has no effect as fixed-ram cannot yet be enabled
> with multifd.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - moved more of the packet code under use_packets
> ---
> migration/multifd.c | 138 +++++++++++++++++++++++++++-----------------
> migration/options.c | 5 ++
> migration/options.h | 1 +
> 3 files changed, 91 insertions(+), 53 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index ec58c58082..9625640d61 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -654,18 +654,22 @@ static void *multifd_send_thread(void *opaque)
> Error *local_err = NULL;
> int ret = 0;
> bool use_zero_copy_send = migrate_zero_copy_send();
> + bool use_packets = migrate_multifd_packets();
>
> thread = migration_threads_add(p->name, qemu_get_thread_id());
>
> trace_multifd_send_thread_start(p->id);
> rcu_register_thread();
>
> - if (multifd_send_initial_packet(p, &local_err) < 0) {
> - ret = -1;
> - goto out;
> + if (use_packets) {
> + if (multifd_send_initial_packet(p, &local_err) < 0) {
> + ret = -1;
> + goto out;
> + }
> +
> + /* initial packet */
> + p->num_packets = 1;
> }
> - /* initial packet */
> - p->num_packets = 1;
>
> while (true) {
> qemu_sem_post(&multifd_send_state->channels_ready);
> @@ -677,11 +681,10 @@ static void *multifd_send_thread(void *opaque)
> qemu_mutex_lock(&p->mutex);
>
> if (p->pending_job) {
> - uint64_t packet_num = p->packet_num;
> uint32_t flags;
> p->normal_num = 0;
>
> - if (use_zero_copy_send) {
> + if (!use_packets || use_zero_copy_send) {
> p->iovs_num = 0;
> } else {
> p->iovs_num = 1;
> @@ -699,16 +702,20 @@ static void *multifd_send_thread(void *opaque)
> break;
> }
> }
> - multifd_send_fill_packet(p);
> +
> + if (use_packets) {
> + multifd_send_fill_packet(p);
> + p->num_packets++;
> + }
> +
> flags = p->flags;
> p->flags = 0;
> - p->num_packets++;
> p->total_normal_pages += p->normal_num;
> p->pages->num = 0;
> p->pages->block = NULL;
> qemu_mutex_unlock(&p->mutex);
>
> - trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> + trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
> p->next_packet_size);
>
> if (use_zero_copy_send) {
> @@ -718,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
> if (ret != 0) {
> break;
> }
> - } else {
> + } else if (use_packets) {
> /* Send header using the same writev call */
> p->iov[0].iov_len = p->packet_len;
> p->iov[0].iov_base = p->packet;
> @@ -904,6 +911,7 @@ int multifd_save_setup(Error **errp)
> {
> int thread_count;
> uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
> + bool use_packets = migrate_multifd_packets();
> uint8_t i;
>
> if (!migrate_multifd()) {
> @@ -928,14 +936,20 @@ int multifd_save_setup(Error **errp)
> p->pending_job = 0;
> p->id = i;
> p->pages = multifd_pages_init(page_count);
> - p->packet_len = sizeof(MultiFDPacket_t)
> - + sizeof(uint64_t) * page_count;
> - p->packet = g_malloc0(p->packet_len);
> - p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> - p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> +
> + if (use_packets) {
> + p->packet_len = sizeof(MultiFDPacket_t)
> + + sizeof(uint64_t) * page_count;
> + p->packet = g_malloc0(p->packet_len);
> + p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> + p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> +
> + /* We need one extra place for the packet header */
> + p->iov = g_new0(struct iovec, page_count + 1);
> + } else {
> + p->iov = g_new0(struct iovec, page_count);
> + }
> p->name = g_strdup_printf("multifdsend_%d", i);
> - /* We need one extra place for the packet header */
> - p->iov = g_new0(struct iovec, page_count + 1);
> p->normal = g_new0(ram_addr_t, page_count);
> p->page_size = qemu_target_page_size();
> p->page_count = page_count;
> @@ -1067,7 +1081,7 @@ void multifd_recv_sync_main(void)
> {
> int i;
>
> - if (!migrate_multifd()) {
> + if (!migrate_multifd() || !migrate_multifd_packets()) {
> return;
> }
> for (i = 0; i < migrate_multifd_channels(); i++) {
This noops the recv sync when use_packets=0 (fixed-ram), makes sense.
How about multifd_send_sync_main()? Should we do the same?
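I.e., presumably the same early return at the top (untested sketch,
assuming multifd_send_sync_main() keeps returning 0 for success):
    if (!migrate_multifd() || !migrate_multifd_packets()) {
        return 0;
    }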
> @@ -1094,38 +1108,44 @@ static void *multifd_recv_thread(void *opaque)
> {
> MultiFDRecvParams *p = opaque;
> Error *local_err = NULL;
> + bool use_packets = migrate_multifd_packets();
> int ret;
>
> trace_multifd_recv_thread_start(p->id);
> rcu_register_thread();
>
> while (true) {
> - uint32_t flags;
> + uint32_t flags = 0;
> + p->normal_num = 0;
>
> if (p->quit) {
> break;
> }
>
> - ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
> - p->packet_len, &local_err);
> - if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
> - break;
> - }
> + if (use_packets) {
> + ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
> + p->packet_len, &local_err);
> + if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
> + break;
> + }
> +
> + qemu_mutex_lock(&p->mutex);
> + ret = multifd_recv_unfill_packet(p, &local_err);
> + if (ret) {
> + qemu_mutex_unlock(&p->mutex);
> + break;
> + }
> + p->num_packets++;
> +
> + flags = p->flags;
> + /* recv methods don't know how to handle the SYNC flag */
> + p->flags &= ~MULTIFD_FLAG_SYNC;
> + trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
> + p->next_packet_size);
>
> - qemu_mutex_lock(&p->mutex);
> - ret = multifd_recv_unfill_packet(p, &local_err);
> - if (ret) {
> - qemu_mutex_unlock(&p->mutex);
> - break;
> + p->total_normal_pages += p->normal_num;
> }
>
> - flags = p->flags;
> - /* recv methods don't know how to handle the SYNC flag */
> - p->flags &= ~MULTIFD_FLAG_SYNC;
> - trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
> - p->next_packet_size);
> - p->num_packets++;
> - p->total_normal_pages += p->normal_num;
> qemu_mutex_unlock(&p->mutex);
>
> if (p->normal_num) {
> @@ -1135,7 +1155,7 @@ static void *multifd_recv_thread(void *opaque)
> }
> }
>
> - if (flags & MULTIFD_FLAG_SYNC) {
> + if (use_packets && (flags & MULTIFD_FLAG_SYNC)) {
> qemu_sem_post(&multifd_recv_state->sem_sync);
> qemu_sem_wait(&p->sem_sync);
> }
> @@ -1159,6 +1179,7 @@ int multifd_load_setup(Error **errp)
> {
> int thread_count;
> uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
> + bool use_packets = migrate_multifd_packets();
> uint8_t i;
>
> /*
> @@ -1183,9 +1204,12 @@ int multifd_load_setup(Error **errp)
> qemu_sem_init(&p->sem_sync, 0);
> p->quit = false;
> p->id = i;
> - p->packet_len = sizeof(MultiFDPacket_t)
> - + sizeof(uint64_t) * page_count;
> - p->packet = g_malloc0(p->packet_len);
> +
> + if (use_packets) {
> + p->packet_len = sizeof(MultiFDPacket_t)
> + + sizeof(uint64_t) * page_count;
> + p->packet = g_malloc0(p->packet_len);
> + }
> p->name = g_strdup_printf("multifdrecv_%d", i);
> p->iov = g_new0(struct iovec, page_count);
> p->normal = g_new0(ram_addr_t, page_count);
> @@ -1231,18 +1255,27 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
> {
> MultiFDRecvParams *p;
> Error *local_err = NULL;
> - int id;
> + bool use_packets = migrate_multifd_packets();
> + int id, num_packets = 0;
>
> - id = multifd_recv_initial_packet(ioc, &local_err);
> - if (id < 0) {
> - multifd_recv_terminate_threads(local_err);
> - error_propagate_prepend(errp, local_err,
> - "failed to receive packet"
> - " via multifd channel %d: ",
> - qatomic_read(&multifd_recv_state->count));
> - return;
> + if (use_packets) {
> + id = multifd_recv_initial_packet(ioc, &local_err);
> + if (id < 0) {
> + multifd_recv_terminate_threads(local_err);
> + error_propagate_prepend(errp, local_err,
> + "failed to receive packet"
> + " via multifd channel %d: ",
> + qatomic_read(&multifd_recv_state->count));
> + return;
> + }
> + trace_multifd_recv_new_channel(id);
> +
> + /* initial packet */
> + num_packets = 1;
> + } else {
> + /* next patch gives this a meaningful value */
> + id = 0;
> }
> - trace_multifd_recv_new_channel(id);
>
> p = &multifd_recv_state->params[id];
> if (p->c != NULL) {
> @@ -1253,9 +1286,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
> return;
> }
> p->c = ioc;
> + p->num_packets = num_packets;
> object_ref(OBJECT(ioc));
> - /* initial packet */
> - p->num_packets = 1;
>
> p->running = true;
> qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
> diff --git a/migration/options.c b/migration/options.c
> index 775428a8a5..10730b13ba 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -385,6 +385,11 @@ bool migrate_multifd_flush_after_each_section(void)
> return s->multifd_flush_after_each_section;
> }
>
> +bool migrate_multifd_packets(void)
Maybe multifd_use_packets()? Dropping the migrate_ prefix as this is not a
global API but multifd-only. Then if multifd_packets() reads too weird and
unclear, "add" makes it clear.
> +{
> + return !migrate_fixed_ram();
> +}
> +
> bool migrate_postcopy(void)
> {
> return migrate_postcopy_ram() || migrate_dirty_bitmaps();
> diff --git a/migration/options.h b/migration/options.h
> index 8680a10b79..8a19d6939c 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -56,6 +56,7 @@ bool migrate_zero_copy_send(void);
> */
>
> bool migrate_multifd_flush_after_each_section(void);
> +bool migrate_multifd_packets(void);
> bool migrate_postcopy(void);
> bool migrate_rdma(void);
> bool migrate_tls(void);
> --
> 2.35.3
>
--
Peter Xu
* Re: [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets
2024-01-15 11:51 ` Peter Xu
@ 2024-01-15 18:39 ` Fabiano Rosas
2024-01-15 23:01 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-15 18:39 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:53PM -0300, Fabiano Rosas wrote:
>> For the upcoming support for the new 'fixed-ram' migration stream
>> format, we cannot use multifd packets because each write into the
>> ramblock section in the migration file is expected to contain only the
>> guest pages. They are written at their respective offsets relative to
>> the ramblock section header.
>>
>> There is no space for the packet information and the expected gains
>> from the new approach come partly from being able to write the pages
>> sequentially without extraneous data in between.
>>
>> The new format also doesn't need the packets and all necessary
>> information can be taken from the standard migration headers with some
>> (future) changes to multifd code.
>>
>> Use the presence of the fixed-ram capability to decide whether to send
>> packets. For now this has no effect as fixed-ram cannot yet be enabled
>> with multifd.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> - moved more of the packet code under use_packets
>> ---
>> migration/multifd.c | 138 +++++++++++++++++++++++++++-----------------
>> migration/options.c | 5 ++
>> migration/options.h | 1 +
>> 3 files changed, 91 insertions(+), 53 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index ec58c58082..9625640d61 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -654,18 +654,22 @@ static void *multifd_send_thread(void *opaque)
>> Error *local_err = NULL;
>> int ret = 0;
>> bool use_zero_copy_send = migrate_zero_copy_send();
>> + bool use_packets = migrate_multifd_packets();
>>
>> thread = migration_threads_add(p->name, qemu_get_thread_id());
>>
>> trace_multifd_send_thread_start(p->id);
>> rcu_register_thread();
>>
>> - if (multifd_send_initial_packet(p, &local_err) < 0) {
>> - ret = -1;
>> - goto out;
>> + if (use_packets) {
>> + if (multifd_send_initial_packet(p, &local_err) < 0) {
>> + ret = -1;
>> + goto out;
>> + }
>> +
>> + /* initial packet */
>> + p->num_packets = 1;
>> }
>> - /* initial packet */
>> - p->num_packets = 1;
>>
>> while (true) {
>> qemu_sem_post(&multifd_send_state->channels_ready);
>> @@ -677,11 +681,10 @@ static void *multifd_send_thread(void *opaque)
>> qemu_mutex_lock(&p->mutex);
>>
>> if (p->pending_job) {
>> - uint64_t packet_num = p->packet_num;
>> uint32_t flags;
>> p->normal_num = 0;
>>
>> - if (use_zero_copy_send) {
>> + if (!use_packets || use_zero_copy_send) {
>> p->iovs_num = 0;
>> } else {
>> p->iovs_num = 1;
>> @@ -699,16 +702,20 @@ static void *multifd_send_thread(void *opaque)
>> break;
>> }
>> }
>> - multifd_send_fill_packet(p);
>> +
>> + if (use_packets) {
>> + multifd_send_fill_packet(p);
>> + p->num_packets++;
>> + }
>> +
>> flags = p->flags;
>> p->flags = 0;
>> - p->num_packets++;
>> p->total_normal_pages += p->normal_num;
>> p->pages->num = 0;
>> p->pages->block = NULL;
>> qemu_mutex_unlock(&p->mutex);
>>
>> - trace_multifd_send(p->id, packet_num, p->normal_num, flags,
>> + trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
>> p->next_packet_size);
>>
>> if (use_zero_copy_send) {
>> @@ -718,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
>> if (ret != 0) {
>> break;
>> }
>> - } else {
>> + } else if (use_packets) {
>> /* Send header using the same writev call */
>> p->iov[0].iov_len = p->packet_len;
>> p->iov[0].iov_base = p->packet;
>> @@ -904,6 +911,7 @@ int multifd_save_setup(Error **errp)
>> {
>> int thread_count;
>> uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
>> + bool use_packets = migrate_multifd_packets();
>> uint8_t i;
>>
>> if (!migrate_multifd()) {
>> @@ -928,14 +936,20 @@ int multifd_save_setup(Error **errp)
>> p->pending_job = 0;
>> p->id = i;
>> p->pages = multifd_pages_init(page_count);
>> - p->packet_len = sizeof(MultiFDPacket_t)
>> - + sizeof(uint64_t) * page_count;
>> - p->packet = g_malloc0(p->packet_len);
>> - p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>> - p->packet->version = cpu_to_be32(MULTIFD_VERSION);
>> +
>> + if (use_packets) {
>> + p->packet_len = sizeof(MultiFDPacket_t)
>> + + sizeof(uint64_t) * page_count;
>> + p->packet = g_malloc0(p->packet_len);
>> + p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
>> + p->packet->version = cpu_to_be32(MULTIFD_VERSION);
>> +
>> + /* We need one extra place for the packet header */
>> + p->iov = g_new0(struct iovec, page_count + 1);
>> + } else {
>> + p->iov = g_new0(struct iovec, page_count);
>> + }
>> p->name = g_strdup_printf("multifdsend_%d", i);
>> - /* We need one extra place for the packet header */
>> - p->iov = g_new0(struct iovec, page_count + 1);
>> p->normal = g_new0(ram_addr_t, page_count);
>> p->page_size = qemu_target_page_size();
>> p->page_count = page_count;
>> @@ -1067,7 +1081,7 @@ void multifd_recv_sync_main(void)
>> {
>> int i;
>>
>> - if (!migrate_multifd()) {
>> + if (!migrate_multifd() || !migrate_multifd_packets()) {
>> return;
>> }
>> for (i = 0; i < migrate_multifd_channels(); i++) {
>
> This noops the recv sync when use_packets=0 (fixed-ram), makes sense.
>
> How about multifd_send_sync_main()? Should we do the same?
>
It seems it got lost during rebase.
>> @@ -1094,38 +1108,44 @@ static void *multifd_recv_thread(void *opaque)
>> {
>> MultiFDRecvParams *p = opaque;
>> Error *local_err = NULL;
>> + bool use_packets = migrate_multifd_packets();
>> int ret;
>>
>> trace_multifd_recv_thread_start(p->id);
>> rcu_register_thread();
>>
>> while (true) {
>> - uint32_t flags;
>> + uint32_t flags = 0;
>> + p->normal_num = 0;
>>
>> if (p->quit) {
>> break;
>> }
>>
>> - ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>> - p->packet_len, &local_err);
>> - if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>> - break;
>> - }
>> + if (use_packets) {
>> + ret = qio_channel_read_all_eof(p->c, (void *)p->packet,
>> + p->packet_len, &local_err);
>> + if (ret == 0 || ret == -1) { /* 0: EOF -1: Error */
>> + break;
>> + }
>> +
>> + qemu_mutex_lock(&p->mutex);
>> + ret = multifd_recv_unfill_packet(p, &local_err);
>> + if (ret) {
>> + qemu_mutex_unlock(&p->mutex);
>> + break;
>> + }
>> + p->num_packets++;
>> +
>> + flags = p->flags;
>> + /* recv methods don't know how to handle the SYNC flag */
>> + p->flags &= ~MULTIFD_FLAG_SYNC;
>> + trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
>> + p->next_packet_size);
>>
>> - qemu_mutex_lock(&p->mutex);
>> - ret = multifd_recv_unfill_packet(p, &local_err);
>> - if (ret) {
>> - qemu_mutex_unlock(&p->mutex);
>> - break;
>> + p->total_normal_pages += p->normal_num;
>> }
>>
>> - flags = p->flags;
>> - /* recv methods don't know how to handle the SYNC flag */
>> - p->flags &= ~MULTIFD_FLAG_SYNC;
>> - trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
>> - p->next_packet_size);
>> - p->num_packets++;
>> - p->total_normal_pages += p->normal_num;
>> qemu_mutex_unlock(&p->mutex);
>>
>> if (p->normal_num) {
>> @@ -1135,7 +1155,7 @@ static void *multifd_recv_thread(void *opaque)
>> }
>> }
>>
>> - if (flags & MULTIFD_FLAG_SYNC) {
>> + if (use_packets && (flags & MULTIFD_FLAG_SYNC)) {
>> qemu_sem_post(&multifd_recv_state->sem_sync);
>> qemu_sem_wait(&p->sem_sync);
>> }
>> @@ -1159,6 +1179,7 @@ int multifd_load_setup(Error **errp)
>> {
>> int thread_count;
>> uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
>> + bool use_packets = migrate_multifd_packets();
>> uint8_t i;
>>
>> /*
>> @@ -1183,9 +1204,12 @@ int multifd_load_setup(Error **errp)
>> qemu_sem_init(&p->sem_sync, 0);
>> p->quit = false;
>> p->id = i;
>> - p->packet_len = sizeof(MultiFDPacket_t)
>> - + sizeof(uint64_t) * page_count;
>> - p->packet = g_malloc0(p->packet_len);
>> +
>> + if (use_packets) {
>> + p->packet_len = sizeof(MultiFDPacket_t)
>> + + sizeof(uint64_t) * page_count;
>> + p->packet = g_malloc0(p->packet_len);
>> + }
>> p->name = g_strdup_printf("multifdrecv_%d", i);
>> p->iov = g_new0(struct iovec, page_count);
>> p->normal = g_new0(ram_addr_t, page_count);
>> @@ -1231,18 +1255,27 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
>> {
>> MultiFDRecvParams *p;
>> Error *local_err = NULL;
>> - int id;
>> + bool use_packets = migrate_multifd_packets();
>> + int id, num_packets = 0;
>>
>> - id = multifd_recv_initial_packet(ioc, &local_err);
>> - if (id < 0) {
>> - multifd_recv_terminate_threads(local_err);
>> - error_propagate_prepend(errp, local_err,
>> - "failed to receive packet"
>> - " via multifd channel %d: ",
>> - qatomic_read(&multifd_recv_state->count));
>> - return;
>> + if (use_packets) {
>> + id = multifd_recv_initial_packet(ioc, &local_err);
>> + if (id < 0) {
>> + multifd_recv_terminate_threads(local_err);
>> + error_propagate_prepend(errp, local_err,
>> + "failed to receive packet"
>> + " via multifd channel %d: ",
>> + qatomic_read(&multifd_recv_state->count));
>> + return;
>> + }
>> + trace_multifd_recv_new_channel(id);
>> +
>> + /* initial packet */
>> + num_packets = 1;
>> + } else {
>> + /* next patch gives this a meaningful value */
>> + id = 0;
>> }
>> - trace_multifd_recv_new_channel(id);
>>
>> p = &multifd_recv_state->params[id];
>> if (p->c != NULL) {
>> @@ -1253,9 +1286,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
>> return;
>> }
>> p->c = ioc;
>> + p->num_packets = num_packets;
>> object_ref(OBJECT(ioc));
>> - /* initial packet */
>> - p->num_packets = 1;
>>
>> p->running = true;
>> qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
>> diff --git a/migration/options.c b/migration/options.c
>> index 775428a8a5..10730b13ba 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -385,6 +385,11 @@ bool migrate_multifd_flush_after_each_section(void)
>> return s->multifd_flush_after_each_section;
>> }
>>
>> +bool migrate_multifd_packets(void)
>
> Maybe multifd_use_packets()? Dropping the migrate_ prefix as this is not a
> global API but multifd-only. Then if multifd_packets() reads too weird and
> unclear, "add" makes it clear.
>
We removed all the instances of migrate_use_* from the migration code
recently. I'm not sure we should reintroduce them; it seems like a step
back.
We're setting 'use_packets = migrate_multifd_packets()' in most places,
so I guess 'use_packets = multifd_packets()' wouldn't be too bad.
>> +{
>> + return !migrate_fixed_ram();
>> +}
>> +
>> bool migrate_postcopy(void)
>> {
>> return migrate_postcopy_ram() || migrate_dirty_bitmaps();
>> diff --git a/migration/options.h b/migration/options.h
>> index 8680a10b79..8a19d6939c 100644
>> --- a/migration/options.h
>> +++ b/migration/options.h
>> @@ -56,6 +56,7 @@ bool migrate_zero_copy_send(void);
>> */
>>
>> bool migrate_multifd_flush_after_each_section(void);
>> +bool migrate_multifd_packets(void);
>> bool migrate_postcopy(void);
>> bool migrate_rdma(void);
>> bool migrate_tls(void);
>> --
>> 2.35.3
>>
* Re: [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets
2024-01-15 18:39 ` Fabiano Rosas
@ 2024-01-15 23:01 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 23:01 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Jan 15, 2024 at 03:39:29PM -0300, Fabiano Rosas wrote:
> > Maybe multifd_use_packets()? Dropping the migrate_ prefix as this is not a
> > global API but multifd-only. Then if multifd_packets() reads too weird and
> > unclear, "add" makes it clear.
>
> We removed all the instances of migrate_use_* from the migration code
> recently. I'm not sure we should reintroduce them; it seems like a step
> back.
>
> We're setting 'use_packets = migrate_multifd_packets()' in most places,
> so I guess 'use_packets = multifd_packets()' wouldn't be too bad.
I actually prefer keeping "_use_" all over the place because it's
clearer to me. :) And I don't see much benefit in saving three chars. Try
"git grep _use_ | wc -l" in both QEMU and Linux; it reports 275 and 4680
respectively.
But yeah that's trivial, multifd_packets() is still okay.
--
Peter Xu
* [RFC PATCH v3 12/30] migration/multifd: Allow QIOTask error reporting without an object
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (10 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 11/30] migration/multifd: Allow multifd without packets Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-15 12:06 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
` (18 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
The only way for the channel backend to report an error to the multifd
core during creation is by setting the QIOTask error. We must allow
the channel backend to set the error even if the QIOChannel has failed
to be created, which means the QIOTask source object would be NULL.
At multifd_new_send_channel_async() move the QOM casting of the
channel until after we have checked for the QIOTask error.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
context: When doing multifd + file, it's possible that we fail to open
the file. I'll use the empty QIOTask to report the error back to
multifd.
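A minimal sketch of that reporting path, relying on the NULL source
object this patch enables (err being the Error from the failed open):
    QIOTask *task = qio_task_new(NULL, multifd_new_send_channel_async,
                                 opaque, NULL);
    qio_task_set_error(task, err);
    qio_task_complete(task);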
---
migration/multifd.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 9625640d61..123ff0dec0 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -865,8 +865,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
return true;
}
-static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
- QIOChannel *ioc, Error *err)
+static void multifd_new_send_channel_cleanup(MultiFDSendParams *p, Error *err)
{
migrate_set_error(migrate_get_current(), err);
/* Error happen, we need to tell who pay attention to me */
@@ -878,20 +877,20 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
* its status.
*/
p->quit = true;
- object_unref(OBJECT(ioc));
error_free(err);
}
static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
{
MultiFDSendParams *p = opaque;
- QIOChannel *ioc = QIO_CHANNEL(qio_task_get_source(task));
+ Object *obj = qio_task_get_source(task);
Error *local_err = NULL;
trace_multifd_new_send_channel_async(p->id);
if (!qio_task_propagate_error(task, &local_err)) {
- p->c = ioc;
- qio_channel_set_delay(p->c, false);
+ QIOChannel *ioc = QIO_CHANNEL(obj);
+
+ qio_channel_set_delay(ioc, false);
p->running = true;
if (multifd_channel_connect(p, ioc, &local_err)) {
return;
@@ -899,7 +898,8 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
}
trace_multifd_new_send_channel_async_error(p->id, local_err);
- multifd_new_send_channel_cleanup(p, ioc, local_err);
+ multifd_new_send_channel_cleanup(p, local_err);
+ object_unref(obj);
}
static void multifd_new_send_channel_create(gpointer opaque)
--
2.35.3
* Re: [RFC PATCH v3 12/30] migration/multifd: Allow QIOTask error reporting without an object
2023-11-27 20:25 ` [RFC PATCH v3 12/30] migration/multifd: Allow QIOTask error reporting without an object Fabiano Rosas
@ 2024-01-15 12:06 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 12:06 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:54PM -0300, Fabiano Rosas wrote:
> The only way for the channel backend to report an error to the multifd
> core during creation is by setting the QIOTask error. We must allow
> the channel backend to set the error even if the QIOChannel has failed
> to be created, which means the QIOTask source object would be NULL.
>
> At multifd_new_send_channel_async() move the QOM casting of the
> channel until after we have checked for the QIOTask error.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> context: When doing multifd + file, it's possible that we fail to open
> the file. I'll use the empty QIOTask to report the error back to
> multifd.
The "context" can be slightly reworded and put into the commit message too.
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
* [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (11 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 12/30] migration/multifd: Allow QIOTask error reporting without an object Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-16 4:05 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 14/30] migration/multifd: Add incoming " Fabiano Rosas
` (17 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Allow multifd to open file-backed channels. This will be used when
enabling the fixed-ram migration stream format which expects a
seekable transport.
The QIOChannel read and write methods will use the preadv/pwritev
versions which don't update the file offset at each call so we can
reuse the fd without re-opening for every channel.
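As an illustration (not code from this patch; the names are made up),
each write carries its own absolute offset and leaves the shared fd's
file position untouched:
    struct iovec iov = { .iov_base = page, .iov_len = page_size };
    /* absolute offset per call; the fd's file position is not advanced */
    pwritev(fd, &iov, 1, pages_offset + page_offset);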
Note that this is just setup code and multifd cannot yet make use of
the file channels.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- open multifd channels with O_WRONLY and no mode
- stop cancelling migration and propagate error via qio_task
---
migration/file.c | 47 +++++++++++++++++++++++++++++++++++++++++--
migration/file.h | 5 +++++
migration/multifd.c | 14 +++++++++++--
migration/options.c | 7 +++++++
migration/options.h | 1 +
migration/qemu-file.h | 1 -
6 files changed, 70 insertions(+), 5 deletions(-)
diff --git a/migration/file.c b/migration/file.c
index 5d4975f43e..67d6f42da7 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -17,6 +17,10 @@
#define OFFSET_OPTION ",offset="
+static struct FileOutgoingArgs {
+ char *fname;
+} outgoing_args;
+
/* Remove the offset option from @filespec and return it in @offsetp. */
int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
@@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
return 0;
}
+static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
+{
+ /* noop */
+}
+
+int file_send_channel_destroy(QIOChannel *ioc)
+{
+ if (ioc) {
+ qio_channel_close(ioc, NULL);
+ object_unref(OBJECT(ioc));
+ }
+ g_free(outgoing_args.fname);
+ outgoing_args.fname = NULL;
+
+ return 0;
+}
+
+void file_send_channel_create(QIOTaskFunc f, void *data)
+{
+ QIOChannelFile *ioc;
+ QIOTask *task;
+ Error *err = NULL;
+ int flags = O_WRONLY;
+
+ ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
+
+ task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
+ if (!ioc) {
+ qio_task_set_error(task, err);
+ return;
+ }
+
+ qio_task_run_in_thread(task, qio_channel_file_connect_worker,
+ (gpointer)data, NULL, NULL);
+}
+
void file_start_outgoing_migration(MigrationState *s,
FileMigrationArgs *file_args, Error **errp)
{
@@ -43,15 +83,18 @@ void file_start_outgoing_migration(MigrationState *s,
g_autofree char *filename = g_strdup(file_args->filename);
uint64_t offset = file_args->offset;
QIOChannel *ioc;
+ int flags = O_CREAT | O_TRUNC | O_WRONLY;
+ mode_t mode = 0660;
trace_migration_file_outgoing(filename);
- fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
- 0600, errp);
+ fioc = qio_channel_file_new_path(filename, flags, mode, errp);
if (!fioc) {
return;
}
+ outgoing_args.fname = g_strdup(filename);
+
ioc = QIO_CHANNEL(fioc);
if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
return;
diff --git a/migration/file.h b/migration/file.h
index 37d6a08bfc..511019b319 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -9,10 +9,15 @@
#define QEMU_MIGRATION_FILE_H
#include "qapi/qapi-types-migration.h"
+#include "io/task.h"
+#include "channel.h"
void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp);
void file_start_outgoing_migration(MigrationState *s,
FileMigrationArgs *file_args, Error **errp);
int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
+
+void file_send_channel_create(QIOTaskFunc f, void *data);
+int file_send_channel_destroy(QIOChannel *ioc);
#endif
diff --git a/migration/multifd.c b/migration/multifd.c
index 123ff0dec0..427740aab6 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -17,6 +17,7 @@
#include "exec/ramblock.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
+#include "file.h"
#include "ram.h"
#include "migration.h"
#include "migration-stats.h"
@@ -28,6 +29,7 @@
#include "threadinfo.h"
#include "options.h"
#include "qemu/yank.h"
+#include "io/channel-file.h"
#include "io/channel-socket.h"
#include "yank_functions.h"
@@ -511,7 +513,11 @@ static void multifd_send_terminate_threads(Error *err)
static int multifd_send_channel_destroy(QIOChannel *send)
{
- return socket_send_channel_destroy(send);
+ if (migrate_to_file()) {
+ return file_send_channel_destroy(send);
+ } else {
+ return socket_send_channel_destroy(send);
+ }
}
void multifd_save_cleanup(void)
@@ -904,7 +910,11 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
static void multifd_new_send_channel_create(gpointer opaque)
{
- socket_send_channel_create(multifd_new_send_channel_async, opaque);
+ if (migrate_to_file()) {
+ file_send_channel_create(multifd_new_send_channel_async, opaque);
+ } else {
+ socket_send_channel_create(multifd_new_send_channel_async, opaque);
+ }
}
int multifd_save_setup(Error **errp)
diff --git a/migration/options.c b/migration/options.c
index 10730b13ba..f671e24758 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -409,6 +409,13 @@ bool migrate_tls(void)
return s->parameters.tls_creds && *s->parameters.tls_creds;
}
+bool migrate_to_file(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return qemu_file_is_seekable(s->to_dst_file);
+}
+
typedef enum WriteTrackingSupport {
WT_SUPPORT_UNKNOWN = 0,
WT_SUPPORT_ABSENT,
diff --git a/migration/options.h b/migration/options.h
index 8a19d6939c..84628a76e8 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -60,6 +60,7 @@ bool migrate_multifd_packets(void);
bool migrate_postcopy(void);
bool migrate_rdma(void);
bool migrate_tls(void);
+bool migrate_to_file(void);
/* capabilities helpers */
diff --git a/migration/qemu-file.h b/migration/qemu-file.h
index 32fd4a34fd..78ea21ab98 100644
--- a/migration/qemu-file.h
+++ b/migration/qemu-file.h
@@ -83,5 +83,4 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
off_t pos);
QIOChannel *qemu_file_get_ioc(QEMUFile *file);
-
#endif
--
2.35.3
* Re: [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2023-11-27 20:25 ` [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
@ 2024-01-16 4:05 ` Peter Xu
2024-01-16 7:25 ` Peter Xu
2024-01-16 13:37 ` Fabiano Rosas
0 siblings, 2 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-16 4:05 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:55PM -0300, Fabiano Rosas wrote:
> Allow multifd to open file-backed channels. This will be used when
> enabling the fixed-ram migration stream format which expects a
> seekable transport.
>
> The QIOChannel read and write methods will use the preadv/pwritev
> versions which don't update the file offset at each call so we can
> reuse the fd without re-opening for every channel.
>
> Note that this is just setup code and multifd cannot yet make use of
> the file channels.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - open multifd channels with O_WRONLY and no mode
> - stop cancelling migration and propagate error via qio_task
> ---
> migration/file.c | 47 +++++++++++++++++++++++++++++++++++++++++--
> migration/file.h | 5 +++++
> migration/multifd.c | 14 +++++++++++--
> migration/options.c | 7 +++++++
> migration/options.h | 1 +
> migration/qemu-file.h | 1 -
> 6 files changed, 70 insertions(+), 5 deletions(-)
>
> diff --git a/migration/file.c b/migration/file.c
> index 5d4975f43e..67d6f42da7 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -17,6 +17,10 @@
>
> #define OFFSET_OPTION ",offset="
>
> +static struct FileOutgoingArgs {
> + char *fname;
> +} outgoing_args;
> +
> /* Remove the offset option from @filespec and return it in @offsetp. */
>
> int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> @@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> return 0;
> }
>
> +static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
> +{
> + /* noop */
> +}
> +
> +int file_send_channel_destroy(QIOChannel *ioc)
> +{
> + if (ioc) {
> + qio_channel_close(ioc, NULL);
> + object_unref(OBJECT(ioc));
> + }
> + g_free(outgoing_args.fname);
> + outgoing_args.fname = NULL;
> +
> + return 0;
> +}
> +
> +void file_send_channel_create(QIOTaskFunc f, void *data)
> +{
> + QIOChannelFile *ioc;
> + QIOTask *task;
> + Error *err = NULL;
> + int flags = O_WRONLY;
> +
> + ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
> +
> + task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
> + if (!ioc) {
> + qio_task_set_error(task, err);
> + return;
> + }
> +
> + qio_task_run_in_thread(task, qio_channel_file_connect_worker,
> + (gpointer)data, NULL, NULL);
This is pretty weird. This invokes a thread, but it'll run a noop. It
seems meaningless to me.
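Afaict the end result is roughly equivalent to just calling (a sketch; the
only real difference is in which context the completion function ends up
running):

qio_task_complete(task); /* invokes f, i.e. multifd_new_send_channel_async */

except that it pays for a thread to get there.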
I assume you wanted to keep using the same async model as the socket typed
multifd, but I don't think that works anyway, because file open blocks at
qio_channel_file_new_path() so it's sync anyway.
AFAICT we still share the code, as long as the file path properly invokes
multifd_channel_connect() after the iochannel is setup.
> +}
> +
> void file_start_outgoing_migration(MigrationState *s,
> FileMigrationArgs *file_args, Error **errp)
> {
> @@ -43,15 +83,18 @@ void file_start_outgoing_migration(MigrationState *s,
> g_autofree char *filename = g_strdup(file_args->filename);
> uint64_t offset = file_args->offset;
> QIOChannel *ioc;
> + int flags = O_CREAT | O_TRUNC | O_WRONLY;
> + mode_t mode = 0660;
>
> trace_migration_file_outgoing(filename);
>
> - fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> - 0600, errp);
> + fioc = qio_channel_file_new_path(filename, flags, mode, errp);
> if (!fioc) {
> return;
> }
>
> + outgoing_args.fname = g_strdup(filename);
> +
> ioc = QIO_CHANNEL(fioc);
> if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> return;
> diff --git a/migration/file.h b/migration/file.h
> index 37d6a08bfc..511019b319 100644
> --- a/migration/file.h
> +++ b/migration/file.h
> @@ -9,10 +9,15 @@
> #define QEMU_MIGRATION_FILE_H
>
> #include "qapi/qapi-types-migration.h"
> +#include "io/task.h"
> +#include "channel.h"
>
> void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp);
>
> void file_start_outgoing_migration(MigrationState *s,
> FileMigrationArgs *file_args, Error **errp);
> int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
> +
> +void file_send_channel_create(QIOTaskFunc f, void *data);
> +int file_send_channel_destroy(QIOChannel *ioc);
> #endif
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 123ff0dec0..427740aab6 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -17,6 +17,7 @@
> #include "exec/ramblock.h"
> #include "qemu/error-report.h"
> #include "qapi/error.h"
> +#include "file.h"
> #include "ram.h"
> #include "migration.h"
> #include "migration-stats.h"
> @@ -28,6 +29,7 @@
> #include "threadinfo.h"
> #include "options.h"
> #include "qemu/yank.h"
> +#include "io/channel-file.h"
> #include "io/channel-socket.h"
> #include "yank_functions.h"
>
> @@ -511,7 +513,11 @@ static void multifd_send_terminate_threads(Error *err)
>
> static int multifd_send_channel_destroy(QIOChannel *send)
> {
> - return socket_send_channel_destroy(send);
> + if (migrate_to_file()) {
> + return file_send_channel_destroy(send);
> + } else {
> + return socket_send_channel_destroy(send);
> + }
> }
>
> void multifd_save_cleanup(void)
> @@ -904,7 +910,11 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>
> static void multifd_new_send_channel_create(gpointer opaque)
> {
> - socket_send_channel_create(multifd_new_send_channel_async, opaque);
> + if (migrate_to_file()) {
> + file_send_channel_create(multifd_new_send_channel_async, opaque);
> + } else {
> + socket_send_channel_create(multifd_new_send_channel_async, opaque);
> + }
> }
>
> int multifd_save_setup(Error **errp)
> diff --git a/migration/options.c b/migration/options.c
> index 10730b13ba..f671e24758 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -409,6 +409,13 @@ bool migrate_tls(void)
> return s->parameters.tls_creds && *s->parameters.tls_creds;
> }
>
> +bool migrate_to_file(void)
> +{
> + MigrationState *s = migrate_get_current();
> +
> + return qemu_file_is_seekable(s->to_dst_file);
> +}
Would this migrate_to_file() == migrate_multifd_packets()?
Maybe we can keep using the other one and drop migrate_to_file?
> +
> typedef enum WriteTrackingSupport {
> WT_SUPPORT_UNKNOWN = 0,
> WT_SUPPORT_ABSENT,
> diff --git a/migration/options.h b/migration/options.h
> index 8a19d6939c..84628a76e8 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -60,6 +60,7 @@ bool migrate_multifd_packets(void);
> bool migrate_postcopy(void);
> bool migrate_rdma(void);
> bool migrate_tls(void);
> +bool migrate_to_file(void);
>
> /* capabilities helpers */
>
> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> index 32fd4a34fd..78ea21ab98 100644
> --- a/migration/qemu-file.h
> +++ b/migration/qemu-file.h
> @@ -83,5 +83,4 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
> off_t pos);
>
> QIOChannel *qemu_file_get_ioc(QEMUFile *file);
> -
> #endif
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2024-01-16 4:05 ` Peter Xu
@ 2024-01-16 7:25 ` Peter Xu
2024-01-16 13:37 ` Fabiano Rosas
1 sibling, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-16 7:25 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Tue, Jan 16, 2024 at 12:05:57PM +0800, Peter Xu wrote:
> On Mon, Nov 27, 2023 at 05:25:55PM -0300, Fabiano Rosas wrote:
> > Allow multifd to open file-backed channels. This will be used when
> > enabling the fixed-ram migration stream format which expects a
> > seekable transport.
> >
> > The QIOChannel read and write methods will use the preadv/pwritev
> > versions which don't update the file offset at each call so we can
> > reuse the fd without re-opening for every channel.
> >
> > Note that this is just setup code and multifd cannot yet make use of
> > the file channels.
> >
> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
> > ---
> > - open multifd channels with O_WRONLY and no mode
> > - stop cancelling migration and propagate error via qio_task
> > ---
> > migration/file.c | 47 +++++++++++++++++++++++++++++++++++++++++--
> > migration/file.h | 5 +++++
> > migration/multifd.c | 14 +++++++++++--
> > migration/options.c | 7 +++++++
> > migration/options.h | 1 +
> > migration/qemu-file.h | 1 -
> > 6 files changed, 70 insertions(+), 5 deletions(-)
> >
> > diff --git a/migration/file.c b/migration/file.c
> > index 5d4975f43e..67d6f42da7 100644
> > --- a/migration/file.c
> > +++ b/migration/file.c
> > @@ -17,6 +17,10 @@
> >
> > #define OFFSET_OPTION ",offset="
> >
> > +static struct FileOutgoingArgs {
> > + char *fname;
> > +} outgoing_args;
> > +
> > /* Remove the offset option from @filespec and return it in @offsetp. */
> >
> > int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> > @@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> > return 0;
> > }
> >
> > +static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
> > +{
> > + /* noop */
> > +}
> > +
> > +int file_send_channel_destroy(QIOChannel *ioc)
> > +{
> > + if (ioc) {
> > + qio_channel_close(ioc, NULL);
> > + object_unref(OBJECT(ioc));
> > + }
> > + g_free(outgoing_args.fname);
> > + outgoing_args.fname = NULL;
> > +
> > + return 0;
> > +}
> > +
> > +void file_send_channel_create(QIOTaskFunc f, void *data)
> > +{
> > + QIOChannelFile *ioc;
> > + QIOTask *task;
> > + Error *err = NULL;
> > + int flags = O_WRONLY;
> > +
> > + ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
> > +
> > + task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
> > + if (!ioc) {
> > + qio_task_set_error(task, err);
> > + return;
> > + }
> > +
> > + qio_task_run_in_thread(task, qio_channel_file_connect_worker,
> > + (gpointer)data, NULL, NULL);
>
> This is pretty weird. This invokes a thread, but it'll run a noop. It
> seems meaningless to me.
>
> I assume you wanted to keep using the same async model as the socket typed
> multifd, but I don't think that works anyway, because file open blocks at
> qio_channel_file_new_path() so it's sync anyway.
>
> AFAICT we still share the code, as long as the file path properly invokes
> multifd_channel_connect() after the iochannel is setup.
>
> > +}
> > +
> > void file_start_outgoing_migration(MigrationState *s,
> > FileMigrationArgs *file_args, Error **errp)
> > {
> > @@ -43,15 +83,18 @@ void file_start_outgoing_migration(MigrationState *s,
> > g_autofree char *filename = g_strdup(file_args->filename);
> > uint64_t offset = file_args->offset;
> > QIOChannel *ioc;
> > + int flags = O_CREAT | O_TRUNC | O_WRONLY;
> > + mode_t mode = 0660;
> >
> > trace_migration_file_outgoing(filename);
> >
> > - fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
> > - 0600, errp);
> > + fioc = qio_channel_file_new_path(filename, flags, mode, errp);
> > if (!fioc) {
> > return;
> > }
> >
> > + outgoing_args.fname = g_strdup(filename);
> > +
> > ioc = QIO_CHANNEL(fioc);
> > if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> > return;
> > diff --git a/migration/file.h b/migration/file.h
> > index 37d6a08bfc..511019b319 100644
> > --- a/migration/file.h
> > +++ b/migration/file.h
> > @@ -9,10 +9,15 @@
> > #define QEMU_MIGRATION_FILE_H
> >
> > #include "qapi/qapi-types-migration.h"
> > +#include "io/task.h"
> > +#include "channel.h"
> >
> > void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp);
> >
> > void file_start_outgoing_migration(MigrationState *s,
> > FileMigrationArgs *file_args, Error **errp);
> > int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
> > +
> > +void file_send_channel_create(QIOTaskFunc f, void *data);
> > +int file_send_channel_destroy(QIOChannel *ioc);
> > #endif
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 123ff0dec0..427740aab6 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -17,6 +17,7 @@
> > #include "exec/ramblock.h"
> > #include "qemu/error-report.h"
> > #include "qapi/error.h"
> > +#include "file.h"
> > #include "ram.h"
> > #include "migration.h"
> > #include "migration-stats.h"
> > @@ -28,6 +29,7 @@
> > #include "threadinfo.h"
> > #include "options.h"
> > #include "qemu/yank.h"
> > +#include "io/channel-file.h"
> > #include "io/channel-socket.h"
> > #include "yank_functions.h"
> >
> > @@ -511,7 +513,11 @@ static void multifd_send_terminate_threads(Error *err)
> >
> > static int multifd_send_channel_destroy(QIOChannel *send)
> > {
> > - return socket_send_channel_destroy(send);
> > + if (migrate_to_file()) {
> > + return file_send_channel_destroy(send);
> > + } else {
> > + return socket_send_channel_destroy(send);
> > + }
> > }
> >
> > void multifd_save_cleanup(void)
> > @@ -904,7 +910,11 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
> >
> > static void multifd_new_send_channel_create(gpointer opaque)
> > {
> > - socket_send_channel_create(multifd_new_send_channel_async, opaque);
> > + if (migrate_to_file()) {
> > + file_send_channel_create(multifd_new_send_channel_async, opaque);
> > + } else {
> > + socket_send_channel_create(multifd_new_send_channel_async, opaque);
> > + }
> > }
> >
> > int multifd_save_setup(Error **errp)
> > diff --git a/migration/options.c b/migration/options.c
> > index 10730b13ba..f671e24758 100644
> > --- a/migration/options.c
> > +++ b/migration/options.c
> > @@ -409,6 +409,13 @@ bool migrate_tls(void)
> > return s->parameters.tls_creds && *s->parameters.tls_creds;
> > }
> >
> > +bool migrate_to_file(void)
> > +{
> > + MigrationState *s = migrate_get_current();
> > +
> > + return qemu_file_is_seekable(s->to_dst_file);
> > +}
>
> Would this migrate_to_file() == migrate_multifd_packets()?
>
> Maybe we can keep using the other one and drop migrate_to_file?
Or perhaps the other way round: migrate_to_file() is a migration-global
helper, so it's applicable even without multifd.
>
> > +
> > typedef enum WriteTrackingSupport {
> > WT_SUPPORT_UNKNOWN = 0,
> > WT_SUPPORT_ABSENT,
> > diff --git a/migration/options.h b/migration/options.h
> > index 8a19d6939c..84628a76e8 100644
> > --- a/migration/options.h
> > +++ b/migration/options.h
> > @@ -60,6 +60,7 @@ bool migrate_multifd_packets(void);
> > bool migrate_postcopy(void);
> > bool migrate_rdma(void);
> > bool migrate_tls(void);
> > +bool migrate_to_file(void);
> >
> > /* capabilities helpers */
> >
> > diff --git a/migration/qemu-file.h b/migration/qemu-file.h
> > index 32fd4a34fd..78ea21ab98 100644
> > --- a/migration/qemu-file.h
> > +++ b/migration/qemu-file.h
> > @@ -83,5 +83,4 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
> > off_t pos);
> >
> > QIOChannel *qemu_file_get_ioc(QEMUFile *file);
> > -
> > #endif
> > --
> > 2.35.3
> >
>
> --
> Peter Xu
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2024-01-16 4:05 ` Peter Xu
2024-01-16 7:25 ` Peter Xu
@ 2024-01-16 13:37 ` Fabiano Rosas
2024-01-17 8:28 ` Peter Xu
1 sibling, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-16 13:37 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:55PM -0300, Fabiano Rosas wrote:
>> Allow multifd to open file-backed channels. This will be used when
>> enabling the fixed-ram migration stream format which expects a
>> seekable transport.
>>
>> The QIOChannel read and write methods will use the preadv/pwritev
>> versions which don't update the file offset at each call so we can
>> reuse the fd without re-opening for every channel.
>>
>> Note that this is just setup code and multifd cannot yet make use of
>> the file channels.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> - open multifd channels with O_WRONLY and no mode
>> - stop cancelling migration and propagate error via qio_task
>> ---
>> migration/file.c | 47 +++++++++++++++++++++++++++++++++++++++++--
>> migration/file.h | 5 +++++
>> migration/multifd.c | 14 +++++++++++--
>> migration/options.c | 7 +++++++
>> migration/options.h | 1 +
>> migration/qemu-file.h | 1 -
>> 6 files changed, 70 insertions(+), 5 deletions(-)
>>
>> diff --git a/migration/file.c b/migration/file.c
>> index 5d4975f43e..67d6f42da7 100644
>> --- a/migration/file.c
>> +++ b/migration/file.c
>> @@ -17,6 +17,10 @@
>>
>> #define OFFSET_OPTION ",offset="
>>
>> +static struct FileOutgoingArgs {
>> + char *fname;
>> +} outgoing_args;
>> +
>> /* Remove the offset option from @filespec and return it in @offsetp. */
>>
>> int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
>> @@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
>> return 0;
>> }
>>
>> +static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
>> +{
>> + /* noop */
>> +}
>> +
>> +int file_send_channel_destroy(QIOChannel *ioc)
>> +{
>> + if (ioc) {
>> + qio_channel_close(ioc, NULL);
>> + object_unref(OBJECT(ioc));
>> + }
>> + g_free(outgoing_args.fname);
>> + outgoing_args.fname = NULL;
>> +
>> + return 0;
>> +}
>> +
>> +void file_send_channel_create(QIOTaskFunc f, void *data)
>> +{
>> + QIOChannelFile *ioc;
>> + QIOTask *task;
>> + Error *err = NULL;
>> + int flags = O_WRONLY;
>> +
>> + ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
>> +
>> + task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
>> + if (!ioc) {
>> + qio_task_set_error(task, err);
>> + return;
>> + }
>> +
>> + qio_task_run_in_thread(task, qio_channel_file_connect_worker,
>> + (gpointer)data, NULL, NULL);
>
> This is pretty weird. This invokes a thread, but it'll run a noop. It
> seems meaningless to me.
>
That's QIOTask weirdness, isn't it? It will run the worker in the thread,
but it also schedules the completion function as a glib event. So that's
when multifd_new_send_channel_async() will run. The crucial aspect here
is that it gets dispatched by glib on the main loop. I'm just keeping
the model, except that I don't have work to do during the "connection"
phase.
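To spell out my reading of it (a sketch annotating the hunk above; the
comments are mine, summarizing what io/task.c does, not taken from it):

/* the worker runs in a freshly spawned thread */
static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
{
    /* noop: qio_channel_file_new_path() already opened the fd */
}

/*
 * Spawns the worker thread; when the worker returns, the QIOTaskFunc
 * given to qio_task_new() -- here multifd_new_send_channel_async() --
 * is scheduled as a glib event and dispatched from the main loop.
 */
qio_task_run_in_thread(task, qio_channel_file_connect_worker,
                       (gpointer)data, NULL, NULL);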
> I assume you wanted to keep using the same async model as the socket typed
> multifd, but I don't think that works anyway, because file open blocks at
> qio_channel_file_new_path() so it's sync anyway.
It's async regarding multifd_channel_connect(). The connections will be
happening while multifd_save_setup() continues execution, IIUC.
>
> AFAICT we still share the code, as long as the file path properly invokes
> multifd_channel_connect() after the iochannel is setup.
>
I don't see the point in moving any of that logic into the URI
implementation. We already have the TLS handshake code which can also
call multifd_channel_connect() and that is a mess. IMO we should be
keeping the interface between multifd and the frontends as boilerplate
as possible.
>> +}
>> +
>> void file_start_outgoing_migration(MigrationState *s,
>> FileMigrationArgs *file_args, Error **errp)
>> {
>> @@ -43,15 +83,18 @@ void file_start_outgoing_migration(MigrationState *s,
>> g_autofree char *filename = g_strdup(file_args->filename);
>> uint64_t offset = file_args->offset;
>> QIOChannel *ioc;
>> + int flags = O_CREAT | O_TRUNC | O_WRONLY;
>> + mode_t mode = 0660;
>>
>> trace_migration_file_outgoing(filename);
>>
>> - fioc = qio_channel_file_new_path(filename, O_CREAT | O_WRONLY | O_TRUNC,
>> - 0600, errp);
>> + fioc = qio_channel_file_new_path(filename, flags, mode, errp);
>> if (!fioc) {
>> return;
>> }
>>
>> + outgoing_args.fname = g_strdup(filename);
>> +
>> ioc = QIO_CHANNEL(fioc);
>> if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
>> return;
>> diff --git a/migration/file.h b/migration/file.h
>> index 37d6a08bfc..511019b319 100644
>> --- a/migration/file.h
>> +++ b/migration/file.h
>> @@ -9,10 +9,15 @@
>> #define QEMU_MIGRATION_FILE_H
>>
>> #include "qapi/qapi-types-migration.h"
>> +#include "io/task.h"
>> +#include "channel.h"
>>
>> void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp);
>>
>> void file_start_outgoing_migration(MigrationState *s,
>> FileMigrationArgs *file_args, Error **errp);
>> int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp);
>> +
>> +void file_send_channel_create(QIOTaskFunc f, void *data);
>> +int file_send_channel_destroy(QIOChannel *ioc);
>> #endif
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 123ff0dec0..427740aab6 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -17,6 +17,7 @@
>> #include "exec/ramblock.h"
>> #include "qemu/error-report.h"
>> #include "qapi/error.h"
>> +#include "file.h"
>> #include "ram.h"
>> #include "migration.h"
>> #include "migration-stats.h"
>> @@ -28,6 +29,7 @@
>> #include "threadinfo.h"
>> #include "options.h"
>> #include "qemu/yank.h"
>> +#include "io/channel-file.h"
>> #include "io/channel-socket.h"
>> #include "yank_functions.h"
>>
>> @@ -511,7 +513,11 @@ static void multifd_send_terminate_threads(Error *err)
>>
>> static int multifd_send_channel_destroy(QIOChannel *send)
>> {
>> - return socket_send_channel_destroy(send);
>> + if (migrate_to_file()) {
>> + return file_send_channel_destroy(send);
>> + } else {
>> + return socket_send_channel_destroy(send);
>> + }
>> }
>>
>> void multifd_save_cleanup(void)
>> @@ -904,7 +910,11 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>>
>> static void multifd_new_send_channel_create(gpointer opaque)
>> {
>> - socket_send_channel_create(multifd_new_send_channel_async, opaque);
>> + if (migrate_to_file()) {
>> + file_send_channel_create(multifd_new_send_channel_async, opaque);
>> + } else {
>> + socket_send_channel_create(multifd_new_send_channel_async, opaque);
>> + }
>> }
>>
>> int multifd_save_setup(Error **errp)
>> diff --git a/migration/options.c b/migration/options.c
>> index 10730b13ba..f671e24758 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -409,6 +409,13 @@ bool migrate_tls(void)
>> return s->parameters.tls_creds && *s->parameters.tls_creds;
>> }
>>
>> +bool migrate_to_file(void)
>> +{
>> + MigrationState *s = migrate_get_current();
>> +
>> + return qemu_file_is_seekable(s->to_dst_file);
>> +}
>
> Would this migrate_to_file() == migrate_multifd_packets()?
>
> Maybe we can keep using the other one and drop migrate_to_file?
>
Possibly the other way around as you mention. I'll take a look.
>> +
>> typedef enum WriteTrackingSupport {
>> WT_SUPPORT_UNKNOWN = 0,
>> WT_SUPPORT_ABSENT,
>> diff --git a/migration/options.h b/migration/options.h
>> index 8a19d6939c..84628a76e8 100644
>> --- a/migration/options.h
>> +++ b/migration/options.h
>> @@ -60,6 +60,7 @@ bool migrate_multifd_packets(void);
>> bool migrate_postcopy(void);
>> bool migrate_rdma(void);
>> bool migrate_tls(void);
>> +bool migrate_to_file(void);
>>
>> /* capabilities helpers */
>>
>> diff --git a/migration/qemu-file.h b/migration/qemu-file.h
>> index 32fd4a34fd..78ea21ab98 100644
>> --- a/migration/qemu-file.h
>> +++ b/migration/qemu-file.h
>> @@ -83,5 +83,4 @@ size_t qemu_get_buffer_at(QEMUFile *f, const uint8_t *buf, size_t buflen,
>> off_t pos);
>>
>> QIOChannel *qemu_file_get_ioc(QEMUFile *file);
>> -
>> #endif
>> --
>> 2.35.3
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2024-01-16 13:37 ` Fabiano Rosas
@ 2024-01-17 8:28 ` Peter Xu
2024-01-17 17:34 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-17 8:28 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Tue, Jan 16, 2024 at 10:37:48AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Mon, Nov 27, 2023 at 05:25:55PM -0300, Fabiano Rosas wrote:
> >> Allow multifd to open file-backed channels. This will be used when
> >> enabling the fixed-ram migration stream format which expects a
> >> seekable transport.
> >>
> >> The QIOChannel read and write methods will use the preadv/pwritev
> >> versions which don't update the file offset at each call so we can
> >> reuse the fd without re-opening for every channel.
> >>
> >> Note that this is just setup code and multifd cannot yet make use of
> >> the file channels.
> >>
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >> - open multifd channels with O_WRONLY and no mode
> >> - stop cancelling migration and propagate error via qio_task
> >> ---
> >> migration/file.c | 47 +++++++++++++++++++++++++++++++++++++++++--
> >> migration/file.h | 5 +++++
> >> migration/multifd.c | 14 +++++++++++--
> >> migration/options.c | 7 +++++++
> >> migration/options.h | 1 +
> >> migration/qemu-file.h | 1 -
> >> 6 files changed, 70 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/migration/file.c b/migration/file.c
> >> index 5d4975f43e..67d6f42da7 100644
> >> --- a/migration/file.c
> >> +++ b/migration/file.c
> >> @@ -17,6 +17,10 @@
> >>
> >> #define OFFSET_OPTION ",offset="
> >>
> >> +static struct FileOutgoingArgs {
> >> + char *fname;
> >> +} outgoing_args;
> >> +
> >> /* Remove the offset option from @filespec and return it in @offsetp. */
> >>
> >> int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> >> @@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
> >> return 0;
> >> }
> >>
> >> +static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
> >> +{
> >> + /* noop */
> >> +}
> >> +
> >> +int file_send_channel_destroy(QIOChannel *ioc)
> >> +{
> >> + if (ioc) {
> >> + qio_channel_close(ioc, NULL);
> >> + object_unref(OBJECT(ioc));
> >> + }
> >> + g_free(outgoing_args.fname);
> >> + outgoing_args.fname = NULL;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +void file_send_channel_create(QIOTaskFunc f, void *data)
> >> +{
> >> + QIOChannelFile *ioc;
> >> + QIOTask *task;
> >> + Error *err = NULL;
> >> + int flags = O_WRONLY;
> >> +
> >> + ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
> >> +
> >> + task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
> >> + if (!ioc) {
> >> + qio_task_set_error(task, err);
> >> + return;
> >> + }
> >> +
> >> + qio_task_run_in_thread(task, qio_channel_file_connect_worker,
> >> + (gpointer)data, NULL, NULL);
> >
> > This is pretty weird. This invokes a thread, but it'll run a noop. It
> > seems meaningless to me.
> >
>
> That's QIOTask weirdness, isn't it? It will run the worker in the thread,
> but it also schedules the completion function as a glib event. So that's
> when multifd_new_send_channel_async() will run. The crucial aspect here
> is that it gets dispatched by glib on the main loop. I'm just keeping
> the model, except that I don't have work to do during the "connection"
> phase.
The question is why do we need that if "file:" can be done synchronously.
Please see below.
>
> > I assume you wanted to keep using the same async model as the socket typed
> > multifd, but I don't think that works anyway, because file open blocks at
> > qio_channel_file_new_path() so it's sync anyway.
>
> It's async regarding multifd_channel_connect(). The connections will be
> happening while multifd_save_setup() continues execution, IIUC.
Yes. But I'm wondering whether we can start to simplify at least the
"file:" for this process. We all know that we _may_ have created too many
threads each doing very light work, which might not be needed. We haven't
yet resolved the "how to kill a thread during this process if migration
cancels while one thread is blocked in a syscall" issue. We'll need
to start recording tids for every thread, and that'll be a mess for sure
when there're tons of threads.
>
> >
> > AFAICT we still share the code, as long as the file path properly invokes
> > multifd_channel_connect() after the iochannel is setup.
> >
>
> I don't see the point in moving any of that logic into the URI
> implementation. We already have the TLS handshake code which can also
> call multifd_channel_connect() and that is a mess. IMO we should be
> keeping the interface between multifd and the frontends as boilerplate
> as possible.
Hmm, I don't think it's a mess? At least multifd_channel_connect(). AFAICT
multifd_channel_connect() can be called in any context.
multifd_channel_connect() always creates yet another thread, whether it's
for the tls handshake or for one of the multifd send threads.
Here this series already treats file/socket differently:
static void multifd_new_send_channel_create(gpointer opaque)
{
if (migrate_to_file()) {
file_send_channel_create(multifd_new_send_channel_async, opaque);
} else {
socket_send_channel_create(multifd_new_send_channel_async, opaque);
}
}
What I am thinking is it could be much simpler if
multifd_new_send_channel_create() can create the multifd channels
synchronously here, then directly call multifd_channel_connect(), which
will then create threads for whatever purpose is needed.
When TLS is not enabled, I'd expect that with that change and with a "file:"
URI, after multifd_save_setup() completes, all send threads will be created
already.
I think multifd_new_send_channel_create() can already take
"MultiFDSendParams *p" as parameter, then:
static void multifd_new_send_channel_create(MultiFDSendParams *p)
{
if (migrate_to_file()) {
file_send_channel_create(p);
} else {
socket_send_channel_create(multifd_new_send_channel_async, p);
}
}
Where file_send_channel_create() can call multifd_channel_connect()
directly once the ioc is created.
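As a rough sketch of that (untested; error propagation elided and the
exact multifd_channel_connect() signature would need checking):

static void file_send_channel_create(MultiFDSendParams *p)
{
    QIOChannelFile *fioc;
    Error *err = NULL;

    /* the open() blocks anyway, so just do it synchronously here */
    fioc = qio_channel_file_new_path(outgoing_args.fname, O_WRONLY, 0, &err);
    if (!fioc) {
        /* propagate err and fail the channel setup */
        return;
    }

    /* ioc is ready: connect directly, which spawns the send thread */
    multifd_channel_connect(p, QIO_CHANNEL(fioc), NULL);
}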
Would that work for us, and much cleaner?
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2024-01-17 8:28 ` Peter Xu
@ 2024-01-17 17:34 ` Fabiano Rosas
2024-01-18 7:11 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-17 17:34 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Tue, Jan 16, 2024 at 10:37:48AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 05:25:55PM -0300, Fabiano Rosas wrote:
>> >> Allow multifd to open file-backed channels. This will be used when
>> >> enabling the fixed-ram migration stream format which expects a
>> >> seekable transport.
>> >>
>> >> The QIOChannel read and write methods will use the preadv/pwritev
>> >> versions which don't update the file offset at each call so we can
>> >> reuse the fd without re-opening for every channel.
>> >>
>> >> Note that this is just setup code and multifd cannot yet make use of
>> >> the file channels.
>> >>
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> ---
>> >> - open multifd channels with O_WRONLY and no mode
>> >> - stop cancelling migration and propagate error via qio_task
>> >> ---
>> >> migration/file.c | 47 +++++++++++++++++++++++++++++++++++++++++--
>> >> migration/file.h | 5 +++++
>> >> migration/multifd.c | 14 +++++++++++--
>> >> migration/options.c | 7 +++++++
>> >> migration/options.h | 1 +
>> >> migration/qemu-file.h | 1 -
>> >> 6 files changed, 70 insertions(+), 5 deletions(-)
>> >>
>> >> diff --git a/migration/file.c b/migration/file.c
>> >> index 5d4975f43e..67d6f42da7 100644
>> >> --- a/migration/file.c
>> >> +++ b/migration/file.c
>> >> @@ -17,6 +17,10 @@
>> >>
>> >> #define OFFSET_OPTION ",offset="
>> >>
>> >> +static struct FileOutgoingArgs {
>> >> + char *fname;
>> >> +} outgoing_args;
>> >> +
>> >> /* Remove the offset option from @filespec and return it in @offsetp. */
>> >>
>> >> int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
>> >> @@ -36,6 +40,42 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
>> >> return 0;
>> >> }
>> >>
>> >> +static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
>> >> +{
>> >> + /* noop */
>> >> +}
>> >> +
>> >> +int file_send_channel_destroy(QIOChannel *ioc)
>> >> +{
>> >> + if (ioc) {
>> >> + qio_channel_close(ioc, NULL);
>> >> + object_unref(OBJECT(ioc));
>> >> + }
>> >> + g_free(outgoing_args.fname);
>> >> + outgoing_args.fname = NULL;
>> >> +
>> >> + return 0;
>> >> +}
>> >> +
>> >> +void file_send_channel_create(QIOTaskFunc f, void *data)
>> >> +{
>> >> + QIOChannelFile *ioc;
>> >> + QIOTask *task;
>> >> + Error *err = NULL;
>> >> + int flags = O_WRONLY;
>> >> +
>> >> + ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
>> >> +
>> >> + task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
>> >> + if (!ioc) {
>> >> + qio_task_set_error(task, err);
>> >> + return;
>> >> + }
>> >> +
>> >> + qio_task_run_in_thread(task, qio_channel_file_connect_worker,
>> >> + (gpointer)data, NULL, NULL);
>> >
>> > This is pretty weird. This invokes a thread, but it'll run a noop. It
>> > seems meaningless to me.
>> >
>>
>> That's QIOTask weirdness, isn't it? It will run the worker in the thread,
>> but it also schedules the completion function as a glib event. So that's
>> when multifd_new_send_channel_async() will run. The crucial aspect here
>> is that it gets dispatched by glib on the main loop. I'm just keeping
>> the model, except that I don't have work to do during the "connection"
>> phase.
>
> The question is why do we need that if "file:" can be done synchronously.
I guess I tend to avoid changing existing patterns when adding a new
feature. But you're right, we don't really need this.
> Please see below.
>
>>
>> > I assume you wanted to keep using the same async model as the socket typed
>> > multifd, but I don't think that works anyway, because file open blocks at
>> > qio_channel_file_new_path() so it's sync anyway.
>>
>> It's async regarding multifd_channel_connect(). The connections will be
>> happening while multifd_save_setup() continues execution, IIUC.
>
> Yes. But I'm wondering whether we can start to simplify at least the
> "file:" for this process. We all know that we _may_ have created too many
> threads each doing very light work, which might not be needed. We haven't
> yet resolved the "how to kill a thread during this process if migration
> cancels while one thread is blocked in a syscall" issue. We'll need
> to start recording tids for every thread, and that'll be a mess for sure
> when there're tons of threads.
>
>>
>> >
>> > AFAICT we still share the code, as long as the file path properly invokes
>> > multifd_channel_connect() after the iochannel is setup.
>> >
>>
>> I don't see the point in moving any of that logic into the URI
>> implementation. We already have the TLS handshake code which can also
>> call multifd_channel_connect() and that is a mess. IMO we should be
>> keeping the interface between multifd and the frontends as boilerplate
>> as possible.
>
> Hmm, I don't think it's a mess? At least multifd_channel_connect(). AFAICT
> multifd_channel_connect() can be called in any context.
Well this sequence:
multifd_new_send_channel_async() -> multifd_channel_connect() ->
multifd_tls_channel_connect() -> new thread ->
multifd_tls_handshake_thread() -> new task ->
multifd_tls_outgoing_handshake() -> multifd_channel_connect()
...is not what I would call intuitive. Specifically with
multifd_channel_connect() being called more times than there are multifd
channels.
This would be "not a mess" IMO:
for (i = 0; i < migrate_multifd_channels(); i++) {
multifd_tls_channel_connect();
multifd_channel_connect() ->
qemu_thread_create(..., multifd_send_thread);
}
> multifd_channel_connect() always creates yet another thread, whether it's
> for the tls handshake or for one of the multifd send threads.
>
> Here this series already treats file/socket differently:
>
> static void multifd_new_send_channel_create(gpointer opaque)
> {
> if (migrate_to_file()) {
> file_send_channel_create(multifd_new_send_channel_async, opaque);
> } else {
> socket_send_channel_create(multifd_new_send_channel_async, opaque);
> }
> }
>
> What I am thinking is it could be much simpler if
> multifd_new_send_channel_create() can create the multifd channels
> synchronously here, then directly call multifd_channel_connect(), which
> will then create threads for whatever purpose is needed.
>
> When TLS is not enabled, I'd expect that with that change and with a "file:"
> URI, after multifd_save_setup() completes, all send threads will be created
> already.
>
> I think multifd_new_send_channel_create() can already take
> "MultiFDSendParams *p" as parameter, then:
>
> static void multifd_new_send_channel_create(MultiFDSendParams *p)
> {
> if (migrate_to_file()) {
> file_send_channel_create(p);
> } else {
> socket_send_channel_create(multifd_new_send_channel_async, p);
> }
> }
>
> Where file_send_channel_create() can call multifd_channel_connect()
> directly once the ioc is created.
>
> Would that work for us, and much cleaner?
Looks cleaner indeed, let me give it a try.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support
2024-01-17 17:34 ` Fabiano Rosas
@ 2024-01-18 7:11 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-18 7:11 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Wed, Jan 17, 2024 at 02:34:18PM -0300, Fabiano Rosas wrote:
> Well this sequence:
>
> multifd_new_send_channel_async() -> multifd_channel_connect() ->
> multifd_tls_channel_connect() -> new thread ->
> multifd_tls_handshake_thread() -> new task ->
> multifd_tls_outgoing_handshake() -> multifd_channel_connect()
>
> ...is not what I would call intuitive. Specifically with
> multifd_channel_connect() being called more times than there are multifd
> channels.
>
> This would be "not a mess" IMO:
>
> for (i = 0; i < migrate_multifd_channels(); i++) {
> multifd_tls_channel_connect();
> multifd_channel_connect() ->
> qemu_thread_create(..., multifd_send_thread);
> }
Ah, I see what you meant now, yes I agree. Let's see whether we can have a
simple procedure for file first, then look into making the socket
path closer to the file one.
TLS could be another story; I'm guessing Dan could have good reasons to do
it like that, but we can rethink after we settle the file-specific paths.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 14/30] migration/multifd: Add incoming QIOChannelFile support
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (12 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 13/30] migration/multifd: Add outgoing QIOChannelFile support Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-16 6:29 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
` (16 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
On the receiving side we don't need to differentiate between main
channel and threads, so whichever channel is defined first gets to be
the main one. And since there are no packets, use the atomic channel
count to index into the params array.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- stop setting offset in secondary channels
- check for packets when peeking
---
migration/file.c | 36 ++++++++++++++++++++++++++++--------
migration/migration.c | 3 ++-
migration/multifd.c | 3 +--
migration/multifd.h | 1 +
4 files changed, 32 insertions(+), 11 deletions(-)
diff --git a/migration/file.c b/migration/file.c
index 67d6f42da7..62ba994109 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -7,12 +7,14 @@
#include "qemu/osdep.h"
#include "qemu/cutils.h"
+#include "qemu/error-report.h"
#include "qapi/error.h"
#include "channel.h"
#include "file.h"
#include "migration.h"
#include "io/channel-file.h"
#include "io/channel-util.h"
+#include "options.h"
#include "trace.h"
#define OFFSET_OPTION ",offset="
@@ -117,22 +119,40 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
g_autofree char *filename = g_strdup(file_args->filename);
QIOChannelFile *fioc = NULL;
uint64_t offset = file_args->offset;
- QIOChannel *ioc;
+ int channels = 1;
+ int i = 0, fd;
trace_migration_file_incoming(filename);
fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
if (!fioc) {
+ goto out;
+ }
+
+ if (offset &&
+ qio_channel_io_seek(QIO_CHANNEL(fioc), offset, SEEK_SET, errp) < 0) {
return;
}
- ioc = QIO_CHANNEL(fioc);
- if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
+ if (migrate_multifd()) {
+ channels += migrate_multifd_channels();
+ }
+
+ fd = fioc->fd;
+
+ do {
+ QIOChannel *ioc = QIO_CHANNEL(fioc);
+
+ qio_channel_set_name(ioc, "migration-file-incoming");
+ qio_channel_add_watch_full(ioc, G_IO_IN,
+ file_accept_incoming_migration,
+ NULL, NULL,
+ g_main_context_get_thread_default());
+ } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
+
+out:
+ if (!fioc) {
+ error_setg(errp, "Error creating migration incoming channel");
return;
}
- qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
- qio_channel_add_watch_full(ioc, G_IO_IN,
- file_accept_incoming_migration,
- NULL, NULL,
- g_main_context_get_thread_default());
}
diff --git a/migration/migration.c b/migration/migration.c
index 897ed1db67..16689171ab 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -838,7 +838,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
uint32_t channel_magic = 0;
int ret = 0;
- if (migrate_multifd() && !migrate_postcopy_ram() &&
+ if (migrate_multifd() && migrate_multifd_packets() &&
+ !migrate_postcopy_ram() &&
qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_READ_MSG_PEEK)) {
/*
* With multiple channels, it is possible that we receive channels
diff --git a/migration/multifd.c b/migration/multifd.c
index 427740aab6..3476fac49f 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1283,8 +1283,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
/* initial packet */
num_packets = 1;
} else {
- /* next patch gives this a meaningful value */
- id = 0;
+ id = qatomic_read(&multifd_recv_state->count);
}
p = &multifd_recv_state->params[id];
diff --git a/migration/multifd.h b/migration/multifd.h
index a835643b48..a112ec7ac6 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -18,6 +18,7 @@ void multifd_save_cleanup(void);
int multifd_load_setup(Error **errp);
void multifd_load_cleanup(void);
void multifd_load_shutdown(void);
+bool multifd_recv_first_channel(void);
bool multifd_recv_all_channels_created(void);
void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
void multifd_recv_sync_main(void);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 14/30] migration/multifd: Add incoming QIOChannelFile support
2023-11-27 20:25 ` [RFC PATCH v3 14/30] migration/multifd: Add incoming " Fabiano Rosas
@ 2024-01-16 6:29 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-16 6:29 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:56PM -0300, Fabiano Rosas wrote:
> On the receiving side we don't need to differentiate between main
> channel and threads, so whichever channel is defined first gets to be
> the main one. And since there are no packets, use the atomic channel
> count to index into the params array.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - stop setting offset in secondary channels
> - check for packets when peeking
> ---
> migration/file.c | 36 ++++++++++++++++++++++++++++--------
> migration/migration.c | 3 ++-
> migration/multifd.c | 3 +--
> migration/multifd.h | 1 +
> 4 files changed, 32 insertions(+), 11 deletions(-)
>
> diff --git a/migration/file.c b/migration/file.c
> index 67d6f42da7..62ba994109 100644
> --- a/migration/file.c
> +++ b/migration/file.c
> @@ -7,12 +7,14 @@
>
> #include "qemu/osdep.h"
> #include "qemu/cutils.h"
> +#include "qemu/error-report.h"
> #include "qapi/error.h"
> #include "channel.h"
> #include "file.h"
> #include "migration.h"
> #include "io/channel-file.h"
> #include "io/channel-util.h"
> +#include "options.h"
> #include "trace.h"
>
> #define OFFSET_OPTION ",offset="
> @@ -117,22 +119,40 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
> g_autofree char *filename = g_strdup(file_args->filename);
> QIOChannelFile *fioc = NULL;
> uint64_t offset = file_args->offset;
> - QIOChannel *ioc;
> + int channels = 1;
> + int i = 0, fd;
>
> trace_migration_file_incoming(filename);
>
> fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
> if (!fioc) {
> + goto out;
Shouldn't there be a "return" here? Won't "goto out" try to error_setg() again
and crash?
It looks like that label can be dropped.
> + }
> +
> + if (offset &&
> + qio_channel_io_seek(QIO_CHANNEL(fioc), offset, SEEK_SET, errp) < 0) {
> return;
> }
>
> - ioc = QIO_CHANNEL(fioc);
> - if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
> + if (migrate_multifd()) {
> + channels += migrate_multifd_channels();
> + }
> +
> + fd = fioc->fd;
> +
> + do {
> + QIOChannel *ioc = QIO_CHANNEL(fioc);
> +
> + qio_channel_set_name(ioc, "migration-file-incoming");
> + qio_channel_add_watch_full(ioc, G_IO_IN,
> + file_accept_incoming_migration,
> + NULL, NULL,
> + g_main_context_get_thread_default());
> + } while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
> +
> +out:
> + if (!fioc) {
> + error_setg(errp, "Error creating migration incoming channel");
> return;
> }
> - qio_channel_set_name(QIO_CHANNEL(ioc), "migration-file-incoming");
> - qio_channel_add_watch_full(ioc, G_IO_IN,
> - file_accept_incoming_migration,
> - NULL, NULL,
> - g_main_context_get_thread_default());
> }
> diff --git a/migration/migration.c b/migration/migration.c
> index 897ed1db67..16689171ab 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -838,7 +838,8 @@ void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
> uint32_t channel_magic = 0;
> int ret = 0;
>
> - if (migrate_multifd() && !migrate_postcopy_ram() &&
> + if (migrate_multifd() && migrate_multifd_packets() &&
> + !migrate_postcopy_ram() &&
> qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_READ_MSG_PEEK)) {
> /*
> * With multiple channels, it is possible that we receive channels
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 427740aab6..3476fac49f 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1283,8 +1283,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
> /* initial packet */
> num_packets = 1;
> } else {
> - /* next patch gives this a meaningful value */
> - id = 0;
> + id = qatomic_read(&multifd_recv_state->count);
> }
>
> p = &multifd_recv_state->params[id];
> diff --git a/migration/multifd.h b/migration/multifd.h
> index a835643b48..a112ec7ac6 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -18,6 +18,7 @@ void multifd_save_cleanup(void);
> int multifd_load_setup(Error **errp);
> void multifd_load_cleanup(void);
> void multifd_load_shutdown(void);
> +bool multifd_recv_first_channel(void);
This can be dropped?
> bool multifd_recv_all_channels_created(void);
> void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> void multifd_recv_sync_main(void);
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (13 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 14/30] migration/multifd: Add incoming " Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-16 6:58 ` Peter Xu
2024-01-17 12:39 ` Daniel P. Berrangé
2023-11-27 20:25 ` [RFC PATCH v3 16/30] multifd: Rename MultiFDSendParams::data to compress_data Fabiano Rosas
` (15 subsequent siblings)
30 siblings, 2 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
For the upcoming support for fixed-ram migration with multifd, we need
to be able to accept an iovec array with non-contiguous data.
Add a pwritev and preadv version that splits the array into contiguous
segments before writing. With that we can have the ram code continue
to add pages in any order and the multifd code continue to send large
arrays for reading and writing.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- split the API that was merged into a single function
- use uintptr_t for compatibility with 32-bit
---
include/io/channel.h | 26 ++++++++++++++++
io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 96 insertions(+)
diff --git a/include/io/channel.h b/include/io/channel.h
index 7986c49c71..25383db5aa 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
size_t niov, off_t offset, Error **errp);
+/**
+ * qio_channel_pwritev_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file where to write the data
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns: 0 if all bytes were written, or -1 on error
+ */
+int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp);
+
/**
* qio_channel_pwrite
* @ioc: the channel object
@@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
size_t niov, off_t offset, Error **errp);
+/**
+ * qio_channel_preadv_all:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to read data to
+ * @niov: the length of the @iov array
+ * @offset: the iovec offset in the file from where to read the data
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Returns: 0 if all bytes were read, or -1 on error
+ */
+int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp);
+
/**
* qio_channel_pread
* @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index a1f12f8e90..2f1745d052 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
return klass->io_pwritev(ioc, iov, niov, offset, errp);
}
+static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
+ const struct iovec *iov,
+ size_t niov, off_t offset,
+ bool is_write, Error **errp)
+{
+ ssize_t ret = -1;
+ int i, slice_idx, slice_num;
+ uintptr_t base, next, file_offset;
+ size_t len;
+
+ slice_idx = 0;
+ slice_num = 1;
+
+ /*
+ * If the iov array doesn't have contiguous elements, we need to
+ * split it in slices because we only have one (file) 'offset' for
+ * the whole iov. Do this here so callers don't need to break the
+ * iov array themselves.
+ */
+ for (i = 0; i < niov; i++, slice_num++) {
+ base = (uintptr_t) iov[i].iov_base;
+
+ if (i != niov - 1) {
+ len = iov[i].iov_len;
+ next = (uintptr_t) iov[i + 1].iov_base;
+
+ if (base + len == next) {
+ continue;
+ }
+ }
+
+ /*
+ * Use the offset of the first element of the segment that
+ * we're sending.
+ */
+ file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
+
+ if (is_write) {
+ ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
+ file_offset, errp);
+ } else {
+ ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
+ file_offset, errp);
+ }
+
+ if (ret < 0) {
+ break;
+ }
+
+ slice_idx += slice_num;
+ slice_num = 0;
+ }
+
+ return (ret < 0) ? -1 : 0;
+}
+
+int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp)
+{
+ return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+ offset, true, errp);
+}
+
ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
off_t offset, Error **errp)
{
@@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
return klass->io_preadv(ioc, iov, niov, offset, errp);
}
+int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
+ size_t niov, off_t offset, Error **errp)
+{
+ return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
+ offset, false, errp);
+}
+
ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
off_t offset, Error **errp)
{
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2023-11-27 20:25 ` [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
@ 2024-01-16 6:58 ` Peter Xu
2024-01-16 18:15 ` Fabiano Rosas
2024-01-17 12:39 ` Daniel P. Berrangé
1 sibling, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-16 6:58 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
> For the upcoming support for fixed-ram migration with multifd, we need
> to be able to accept an iovec array with non-contiguous data.
>
> Add a pwritev and preadv version that splits the array into contiguous
> segments before writing. With that we can have the ram code continue
> to add pages in any order and the multifd code continue to send large
> arrays for reading and writing.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - split the API that was merged into a single function
> - use uintptr_t for compatibility with 32-bit
> ---
> include/io/channel.h | 26 ++++++++++++++++
> io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 96 insertions(+)
>
> diff --git a/include/io/channel.h b/include/io/channel.h
> index 7986c49c71..25383db5aa 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
> ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> size_t niov, off_t offset, Error **errp);
>
> +/**
> + * qio_channel_pwritev_all:
> + * @ioc: the channel object
> + * @iov: the array of memory regions to write data from
> + * @niov: the length of the @iov array
> + * @offset: the iovec offset in the file where to write the data
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Returns: 0 if all bytes were written, or -1 on error
> + */
> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp);
> +
> /**
> * qio_channel_pwrite
> * @ioc: the channel object
> @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> size_t niov, off_t offset, Error **errp);
>
> +/**
> + * qio_channel_preadv_all:
> + * @ioc: the channel object
> + * @iov: the array of memory regions to read data to
> + * @niov: the length of the @iov array
> + * @offset: the iovec offset in the file from where to read the data
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Returns: 0 if all bytes were read, or -1 on error
> + */
> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp);
> +
> /**
> * qio_channel_pread
> * @ioc: the channel object
> diff --git a/io/channel.c b/io/channel.c
> index a1f12f8e90..2f1745d052 100644
> --- a/io/channel.c
> +++ b/io/channel.c
> @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> return klass->io_pwritev(ioc, iov, niov, offset, errp);
> }
>
> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
> + const struct iovec *iov,
> + size_t niov, off_t offset,
> + bool is_write, Error **errp)
> +{
> + ssize_t ret = -1;
> + int i, slice_idx, slice_num;
> + uintptr_t base, next, file_offset;
> + size_t len;
> +
> + slice_idx = 0;
> + slice_num = 1;
> +
> + /*
> + * If the iov array doesn't have contiguous elements, we need to
> + * split it in slices because we only have one (file) 'offset' for
> + * the whole iov. Do this here so callers don't need to break the
> + * iov array themselves.
> + */
> + for (i = 0; i < niov; i++, slice_num++) {
> + base = (uintptr_t) iov[i].iov_base;
> +
> + if (i != niov - 1) {
> + len = iov[i].iov_len;
> + next = (uintptr_t) iov[i + 1].iov_base;
> +
> + if (base + len == next) {
> + continue;
> + }
> + }
> +
> + /*
> + * Use the offset of the first element of the segment that
> + * we're sending.
> + */
> + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
> +
> + if (is_write) {
> + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
> + file_offset, errp);
> + } else {
> + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
> + file_offset, errp);
> + }
> +
> + if (ret < 0) {
> + break;
> + }
> +
> + slice_idx += slice_num;
> + slice_num = 0;
> + }
> +
> + return (ret < 0) ? -1 : 0;
> +}
> +
> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp)
> +{
> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> + offset, true, errp);
> +}
I'm not sure how Dan thinks about this, but I don't think this is pretty..
With this implementation, iochannels' preadv/pwritev become completely
incompatible with the same-named syscalls on most OSes, afaiu.
The definition of 'offset' in the current iochannel preadv/pwritev is hard
to understand.. if I read it right it'll later be set to:
/*
* If we subtract the host page now, we don't need to
* pass it into qio_channel_pwritev_all() below.
*/
write_base = p->pages->block->pages_offset -
(uintptr_t)p->pages->block->host;
And I cannot easily tell what it is.. besides being an unsigned int.
IIUC it's also based on the assumption that the host address of each iov
entry is linear to its offset in the file, but that may not hold for
future iochannel users of an interface named pwritev/preadv. So it's
error prone.
Would it be possible to keep using the offset array (p->pages->offset[x])?
We have it already anyway, right? Wouldn't that be clearer?
It doesn't need to be called pwritev/preadv; it could take two arrays: the
host address array and another array of offsets into the file. It could
still do the range merge, plus another sanity check that the offsets are
also contiguous (which should be true in our case).
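Something like the following, just as a rough sketch (hypothetical name
and signature; it ignores short writes and assumes the offsets array is
passed in the same order as the iov entries):

static int qio_channel_pwritev_offsets(QIOChannel *ioc,
                                       const struct iovec *iov,
                                       const off_t *offsets,
                                       size_t niov, Error **errp)
{
    size_t i = 0;

    while (i < niov) {
        off_t end = offsets[i] + iov[i].iov_len;
        size_t j = i + 1;

        /* merge entries contiguous both in memory and in the file */
        while (j < niov &&
               (uintptr_t)iov[j].iov_base ==
                   (uintptr_t)iov[j - 1].iov_base + iov[j - 1].iov_len &&
               offsets[j] == end) {
            end += iov[j].iov_len;
            j++;
        }

        if (qio_channel_pwritev(ioc, &iov[i], j - i, offsets[i], errp) < 0) {
            return -1;
        }
        i = j;
    }
    return 0;
}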
> +
> ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> off_t offset, Error **errp)
> {
> @@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> return klass->io_preadv(ioc, iov, niov, offset, errp);
> }
>
> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp)
> +{
> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> + offset, false, errp);
> +}
> +
> ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
> off_t offset, Error **errp)
> {
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-16 6:58 ` Peter Xu
@ 2024-01-16 18:15 ` Fabiano Rosas
2024-01-17 9:48 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-16 18:15 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
>> For the upcoming support to fixed-ram migration with multifd, we need
>> to be able to accept an iovec array with non-contiguous data.
>>
>> Add a pwritev and preadv version that splits the array into contiguous
>> segments before writing. With that we can have the ram code continue
>> to add pages in any order and the multifd code continue to send large
>> arrays for reading and writing.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> - split the API that was merged into a single function
>> - use uintptr_t for compatibility with 32-bit
>> ---
>> include/io/channel.h | 26 ++++++++++++++++
>> io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 96 insertions(+)
>>
>> diff --git a/include/io/channel.h b/include/io/channel.h
>> index 7986c49c71..25383db5aa 100644
>> --- a/include/io/channel.h
>> +++ b/include/io/channel.h
>> @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
>> ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> size_t niov, off_t offset, Error **errp);
>>
>> +/**
>> + * qio_channel_pwritev_all:
>> + * @ioc: the channel object
>> + * @iov: the array of memory regions to write data from
>> + * @niov: the length of the @iov array
>> + * @offset: the iovec offset in the file where to write the data
>> + * @errp: pointer to a NULL-initialized error object
>> + *
>> + * Returns: 0 if all bytes were written, or -1 on error
>> + */
>> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> + size_t niov, off_t offset, Error **errp);
>> +
>> /**
>> * qio_channel_pwrite
>> * @ioc: the channel object
>> @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
>> ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
>> size_t niov, off_t offset, Error **errp);
>>
>> +/**
>> + * qio_channel_preadv_all:
>> + * @ioc: the channel object
>> + * @iov: the array of memory regions to read data to
>> + * @niov: the length of the @iov array
>> + * @offset: the iovec offset in the file from where to read the data
>> + * @errp: pointer to a NULL-initialized error object
>> + *
>> + * Returns: 0 if all bytes were read, or -1 on error
>> + */
>> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
>> + size_t niov, off_t offset, Error **errp);
>> +
>> /**
>> * qio_channel_pread
>> * @ioc: the channel object
>> diff --git a/io/channel.c b/io/channel.c
>> index a1f12f8e90..2f1745d052 100644
>> --- a/io/channel.c
>> +++ b/io/channel.c
>> @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> return klass->io_pwritev(ioc, iov, niov, offset, errp);
>> }
>>
>> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
>> + const struct iovec *iov,
>> + size_t niov, off_t offset,
>> + bool is_write, Error **errp)
>> +{
>> + ssize_t ret = -1;
>> + int i, slice_idx, slice_num;
>> + uintptr_t base, next, file_offset;
>> + size_t len;
>> +
>> + slice_idx = 0;
>> + slice_num = 1;
>> +
>> + /*
>> + * If the iov array doesn't have contiguous elements, we need to
>> + * split it in slices because we only have one (file) 'offset' for
>> + * the whole iov. Do this here so callers don't need to break the
>> + * iov array themselves.
>> + */
>> + for (i = 0; i < niov; i++, slice_num++) {
>> + base = (uintptr_t) iov[i].iov_base;
>> +
>> + if (i != niov - 1) {
>> + len = iov[i].iov_len;
>> + next = (uintptr_t) iov[i + 1].iov_base;
>> +
>> + if (base + len == next) {
>> + continue;
>> + }
>> + }
>> +
>> + /*
>> + * Use the offset of the first element of the segment that
>> + * we're sending.
>> + */
>> + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
>> +
>> + if (is_write) {
>> + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
>> + file_offset, errp);
>> + } else {
>> + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
>> + file_offset, errp);
>> + }
>> +
>> + if (ret < 0) {
>> + break;
>> + }
>> +
>> + slice_idx += slice_num;
>> + slice_num = 0;
>> + }
>> +
>> + return (ret < 0) ? -1 : 0;
>> +}
>> +
>> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> + size_t niov, off_t offset, Error **errp)
>> +{
>> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
>> + offset, true, errp);
>> +}
>
> I'm not sure how Dan thinks about this, but I don't think this is pretty..
>
> With this implementation, iochannels' preadv/pwritev become completely
> incompatible with the same-named syscalls on most OSes, afaiu.
This is internal QEMU code. I hope no one is expecting qio_channel_foo()
to behave like some OS's foo() system call. We cannot guarantee that
compatibility save for the simplest of wrappers.
>
> The definition of 'offset' in the current iochannel preadv/pwritev is hard
> to understand.. if I read it right it'll later be set to:
>
> /*
> * If we subtract the host page now, we don't need to
> * pass it into qio_channel_pwritev_all() below.
> */
> write_base = p->pages->block->pages_offset -
> (uintptr_t)p->pages->block->host;
>
> And I cannot easily tell what it is.. besides being an unsigned int.
This description was unfortunately dropped along the way:
"Since iovs can be non contiguous, we'd need a separate array on the
side to carry an extra file offset for each of them, so I'm relying on
the fact that iovs are all within a same host page and passing in an
encoded offset that takes the host page into account."
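To make the encoding concrete, the algebra is (just restating the
snippet you quoted above):

    write_base  = pages_offset - (uintptr_t)block->host
    file_offset = write_base + (uintptr_t)iov_base
                = pages_offset + (iov_base - block->host)

so the unsigned wraparound from the subtraction cancels when iov_base is
added back, leaving the block's base file offset plus the buffer's offset
within the block. That only holds while every iov_base lies inside that
block's host range, hence the restriction.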
> IIUC it's also based on the assumption that the host address of each iov
> entry is linear to its offset in the file, but that may not hold for
> future iochannel users of an interface named pwritev/preadv. So it's
> error prone.
Yes, but it's also our choice whether to make this a generic API. We may
have good reasons to consider a migration-specific function here.
> Would it be possible to keep using the offset array (p->pages->offset[x])?
> We have it already anyway, right? Wouldn't that be clearer?
>
We'd have to make a copy of the array because p->pages is expected to
change while the IO happens. And while we already have a copy in
p->normal, my intention for multifd was to eliminate p->normal in the
future, so it would be nice if we could avoid it.
Also, we cannot use p->pages->offset alone because we still need the
pages_offset, i.e. the file offset where that ramblock's pages begin.
So that means also adding that to each element of the new array.
It would probably be overall clearer and less wasteful to pass in the
host page address instead of an array of offsets. I don't see an issue
with restricting the iovs to the same host page. The migration code is
the only user for this code and AFAIK we don't have plans to change that
invariant.
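As a rough sketch of that idea (hypothetical name and signature, nothing
that exists today):

/*
 * Every iov_base must lie within [host, host + block length); the
 * implementation would compute each slice's file offset as
 * pages_offset + ((uintptr_t)iov_base - (uintptr_t)host), keeping the
 * wraparound arithmetic out of the callers.
 */
int qio_channel_pwritev_block(QIOChannel *ioc, const struct iovec *iov,
                              size_t niov, void *host, off_t pages_offset,
                              Error **errp);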
> It doesn't need to be called pwritev/preadv; it could take two arrays: the
> host address array and another array of offsets into the file. It could
> still do the range merge, plus another sanity check that the offsets are
> also contiguous (which should be true in our case).
>
>> +
>> ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
>> off_t offset, Error **errp)
>> {
>> @@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
>> return klass->io_preadv(ioc, iov, niov, offset, errp);
>> }
>>
>> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
>> + size_t niov, off_t offset, Error **errp)
>> +{
>> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
>> + offset, false, errp);
>> +}
>> +
>> ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
>> off_t offset, Error **errp)
>> {
>> --
>> 2.35.3
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-16 18:15 ` Fabiano Rosas
@ 2024-01-17 9:48 ` Peter Xu
2024-01-17 18:06 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-17 9:48 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Tue, Jan 16, 2024 at 03:15:50PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
> >> For the upcoming support to fixed-ram migration with multifd, we need
> >> to be able to accept an iovec array with non-contiguous data.
> >>
> >> Add a pwritev and preadv version that splits the array into contiguous
> >> segments before writing. With that we can have the ram code continue
> >> to add pages in any order and the multifd code continue to send large
> >> arrays for reading and writing.
> >>
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >> - split the API that was merged into a single function
> >> - use uintptr_t for compatibility with 32-bit
> >> ---
> >> include/io/channel.h | 26 ++++++++++++++++
> >> io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 96 insertions(+)
> >>
> >> diff --git a/include/io/channel.h b/include/io/channel.h
> >> index 7986c49c71..25383db5aa 100644
> >> --- a/include/io/channel.h
> >> +++ b/include/io/channel.h
> >> @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
> >> ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> >> size_t niov, off_t offset, Error **errp);
> >>
> >> +/**
> >> + * qio_channel_pwritev_all:
> >> + * @ioc: the channel object
> >> + * @iov: the array of memory regions to write data from
> >> + * @niov: the length of the @iov array
> >> + * @offset: the iovec offset in the file where to write the data
> >> + * @errp: pointer to a NULL-initialized error object
> >> + *
> >> + * Returns: 0 if all bytes were written, or -1 on error
> >> + */
> >> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> >> + size_t niov, off_t offset, Error **errp);
> >> +
> >> /**
> >> * qio_channel_pwrite
> >> * @ioc: the channel object
> >> @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> >> ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> >> size_t niov, off_t offset, Error **errp);
> >>
> >> +/**
> >> + * qio_channel_preadv_all:
> >> + * @ioc: the channel object
> >> + * @iov: the array of memory regions to read data to
> >> + * @niov: the length of the @iov array
> >> + * @offset: the iovec offset in the file from where to read the data
> >> + * @errp: pointer to a NULL-initialized error object
> >> + *
> >> + * Returns: 0 if all bytes were read, or -1 on error
> >> + */
> >> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> >> + size_t niov, off_t offset, Error **errp);
> >> +
> >> /**
> >> * qio_channel_pread
> >> * @ioc: the channel object
> >> diff --git a/io/channel.c b/io/channel.c
> >> index a1f12f8e90..2f1745d052 100644
> >> --- a/io/channel.c
> >> +++ b/io/channel.c
> >> @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> >> return klass->io_pwritev(ioc, iov, niov, offset, errp);
> >> }
> >>
> >> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
> >> + const struct iovec *iov,
> >> + size_t niov, off_t offset,
> >> + bool is_write, Error **errp)
> >> +{
> >> + ssize_t ret = -1;
> >> + int i, slice_idx, slice_num;
> >> + uintptr_t base, next, file_offset;
> >> + size_t len;
> >> +
> >> + slice_idx = 0;
> >> + slice_num = 1;
> >> +
> >> + /*
> >> + * If the iov array doesn't have contiguous elements, we need to
> >> + * split it in slices because we only have one (file) 'offset' for
> >> + * the whole iov. Do this here so callers don't need to break the
> >> + * iov array themselves.
> >> + */
> >> + for (i = 0; i < niov; i++, slice_num++) {
> >> + base = (uintptr_t) iov[i].iov_base;
> >> +
> >> + if (i != niov - 1) {
> >> + len = iov[i].iov_len;
> >> + next = (uintptr_t) iov[i + 1].iov_base;
> >> +
> >> + if (base + len == next) {
> >> + continue;
> >> + }
> >> + }
> >> +
> >> + /*
> >> + * Use the offset of the first element of the segment that
> >> + * we're sending.
> >> + */
> >> + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
> >> +
> >> + if (is_write) {
> >> + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
> >> + file_offset, errp);
> >> + } else {
> >> + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
> >> + file_offset, errp);
> >> + }
> >> +
> >> + if (ret < 0) {
> >> + break;
> >> + }
> >> +
> >> + slice_idx += slice_num;
> >> + slice_num = 0;
> >> + }
> >> +
> >> + return (ret < 0) ? -1 : 0;
> >> +}
> >> +
> >> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> >> + size_t niov, off_t offset, Error **errp)
> >> +{
> >> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> >> + offset, true, errp);
> >> +}
> >
> > I'm not sure how Dan thinks about this, but I don't think this is pretty..
> >
> > With this implementation, iochannels' preadv/pwritev become completely
> > incompatible with the same-named syscalls on most OSes, afaiu.
>
> This is internal QEMU code. I hope no one is expecting qio_channel_foo()
> to behave like some OS's foo() system call. We cannot guarantee that
> compatibility save for the simplest of wrappers.
I was expecting that when I started to read. :)
https://man.freebsd.org/cgi/man.cgi?query=pwritev
https://linux.die.net/man/2/pwritev
It's not "some OSes", it's mostly all. I can understand you prefer such
approach, but even if so, shall we still try to avoid using pwritev/preadv
as the names?
>
> >
> > The definition of 'offset' in the current iochannel preadv/pwritev is hard
> > to understand.. if I read it right it'll later be set to:
> >
> > /*
> > * If we subtract the host page now, we don't need to
> > * pass it into qio_channel_pwritev_all() below.
> > */
> > write_base = p->pages->block->pages_offset -
> > (uintptr_t)p->pages->block->host;
> >
> > And I cannot easily tell what it is.. besides being an unsigned int.
>
> This description was unfortunately dropped along the way:
>
> "Since iovs can be non contiguous, we'd need a separate array on the
> side to carry an extra file offset for each of them, so I'm relying on
> the fact that iovs are all within a same host page and passing in an
> encoded offset that takes the host page into account."
>
> > IIUC it's also based on the assumption that the host address of each iov
> > entry is linear to its offset in the file, but that may not hold for
> > future iochannel users of an interface named pwritev/preadv. So it's
> > error prone.
>
> Yes, but it's also our choice whether to make this a generic API. We may
> have good reasons to consider a migration-specific function here.
>
> > Would it be possible to keep using the offset array (p->pages->offset[x])?
> > We have it already anyway, right? Wouldn't that be clearer?
> >
>
> We'd have to make a copy of the array because p->pages is expected to
> change while the IO happens.
Hmm, I don't see why p->pages can change. IIUC p->pages will stay put
at least until all IO syscalls are completed; then the next call to, e.g.,
multifd_send_pages() will swap it with multifd_send_state->pages. But I
think I get your point, given the below.
> And while we already have a copy in
> p->normal, my intention for multifd was to eliminate p->normal in the
> future, so it would be nice if we could avoid it.
>
> Also, we cannot use p->pages->offset alone because we still need the
> pages_offset, i.e. the file offset where that ramblock's pages begin.
> So that means also adding that to each element of the new array.
>
> It would probably be overall clearer and less wasteful to pass in the
> host page address instead of an array of offsets. I don't see an issue
> with restricting the iovs to the same host page. The migration code is
> the only user for this code and AFAIK we don't have plans to change that
> invariant.
So I think I get your point now. The only concern (besides naming..) is
that I still want to avoid an interface that contains a field as hard to
understand as write_base.
How about this?
/**
* multifd_write_ramblock_iov: Write IO vector (of ramblock) to channel
*
* @ioc: The iochannel to write to. The IOC must implement the
* pwritev/preadv interface.
* @iov: The IO vector to write. All addresses must be within the
* ramblock host address range.
* @iov_len: The IO vector size
* @ramblock: The ramblock that covers all buffers in this IO vector
*/
int multifd_write_ramblock_iov(ioc, iov, iov_len, ramblock);
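Just to sketch how I'd imagine the body, illustrative only (an Error
**errp is added since the underlying call needs one, pages_offset is the
field this series adds to the ramblock, and short-write handling is
omitted):

int multifd_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
                               size_t iov_len, RAMBlock *block, Error **errp)
{
    size_t i = 0;

    while (i < iov_len) {
        size_t n = 1;

        /* merge iovs that are contiguous in the block's host mapping */
        while (i + n < iov_len &&
               (uintptr_t)iov[i + n].iov_base ==
                   (uintptr_t)iov[i + n - 1].iov_base +
                       iov[i + n - 1].iov_len) {
            n++;
        }

        /* derive the file offset from the block itself */
        off_t off = block->pages_offset +
                    ((uintptr_t)iov[i].iov_base - (uintptr_t)block->host);

        if (qio_channel_pwritev(ioc, &iov[i], n, off, errp) < 0) {
            return -1;
        }
        i += n;
    }
    return 0;
}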
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-17 9:48 ` Peter Xu
@ 2024-01-17 18:06 ` Fabiano Rosas
2024-01-18 7:44 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-17 18:06 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Tue, Jan 16, 2024 at 03:15:50PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
>> >> For the upcoming support to fixed-ram migration with multifd, we need
>> >> to be able to accept an iovec array with non-contiguous data.
>> >>
>> >> Add a pwritev and preadv version that splits the array into contiguous
>> >> segments before writing. With that we can have the ram code continue
>> >> to add pages in any order and the multifd code continue to send large
>> >> arrays for reading and writing.
>> >>
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> ---
>> >> - split the API that was merged into a single function
>> >> - use uintptr_t for compatibility with 32-bit
>> >> ---
>> >> include/io/channel.h | 26 ++++++++++++++++
>> >> io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
>> >> 2 files changed, 96 insertions(+)
>> >>
>> >> diff --git a/include/io/channel.h b/include/io/channel.h
>> >> index 7986c49c71..25383db5aa 100644
>> >> --- a/include/io/channel.h
>> >> +++ b/include/io/channel.h
>> >> @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
>> >> ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> >> size_t niov, off_t offset, Error **errp);
>> >>
>> >> +/**
>> >> + * qio_channel_pwritev_all:
>> >> + * @ioc: the channel object
>> >> + * @iov: the array of memory regions to write data from
>> >> + * @niov: the length of the @iov array
>> >> + * @offset: the iovec offset in the file where to write the data
>> >> + * @errp: pointer to a NULL-initialized error object
>> >> + *
>> >> + * Returns: 0 if all bytes were written, or -1 on error
>> >> + */
>> >> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> >> + size_t niov, off_t offset, Error **errp);
>> >> +
>> >> /**
>> >> * qio_channel_pwrite
>> >> * @ioc: the channel object
>> >> @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
>> >> ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
>> >> size_t niov, off_t offset, Error **errp);
>> >>
>> >> +/**
>> >> + * qio_channel_preadv_all:
>> >> + * @ioc: the channel object
>> >> + * @iov: the array of memory regions to read data to
>> >> + * @niov: the length of the @iov array
>> >> + * @offset: the iovec offset in the file from where to read the data
>> >> + * @errp: pointer to a NULL-initialized error object
>> >> + *
>> >> + * Returns: 0 if all bytes were read, or -1 on error
>> >> + */
>> >> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
>> >> + size_t niov, off_t offset, Error **errp);
>> >> +
>> >> /**
>> >> * qio_channel_pread
>> >> * @ioc: the channel object
>> >> diff --git a/io/channel.c b/io/channel.c
>> >> index a1f12f8e90..2f1745d052 100644
>> >> --- a/io/channel.c
>> >> +++ b/io/channel.c
>> >> @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> >> return klass->io_pwritev(ioc, iov, niov, offset, errp);
>> >> }
>> >>
>> >> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
>> >> + const struct iovec *iov,
>> >> + size_t niov, off_t offset,
>> >> + bool is_write, Error **errp)
>> >> +{
>> >> + ssize_t ret = -1;
>> >> + int i, slice_idx, slice_num;
>> >> + uintptr_t base, next, file_offset;
>> >> + size_t len;
>> >> +
>> >> + slice_idx = 0;
>> >> + slice_num = 1;
>> >> +
>> >> + /*
>> >> + * If the iov array doesn't have contiguous elements, we need to
>> >> + * split it in slices because we only have one (file) 'offset' for
>> >> + * the whole iov. Do this here so callers don't need to break the
>> >> + * iov array themselves.
>> >> + */
>> >> + for (i = 0; i < niov; i++, slice_num++) {
>> >> + base = (uintptr_t) iov[i].iov_base;
>> >> +
>> >> + if (i != niov - 1) {
>> >> + len = iov[i].iov_len;
>> >> + next = (uintptr_t) iov[i + 1].iov_base;
>> >> +
>> >> + if (base + len == next) {
>> >> + continue;
>> >> + }
>> >> + }
>> >> +
>> >> + /*
>> >> + * Use the offset of the first element of the segment that
>> >> + * we're sending.
>> >> + */
>> >> + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
>> >> +
>> >> + if (is_write) {
>> >> + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
>> >> + file_offset, errp);
>> >> + } else {
>> >> + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
>> >> + file_offset, errp);
>> >> + }
>> >> +
>> >> + if (ret < 0) {
>> >> + break;
>> >> + }
>> >> +
>> >> + slice_idx += slice_num;
>> >> + slice_num = 0;
>> >> + }
>> >> +
>> >> + return (ret < 0) ? -1 : 0;
>> >> +}
>> >> +
>> >> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> >> + size_t niov, off_t offset, Error **errp)
>> >> +{
>> >> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
>> >> + offset, true, errp);
>> >> +}
>> >
>> > I'm not sure how Dan thinks about this, but I don't think this is pretty..
>> >
>> > With this implementation, iochannels' preadv/pwritev become completely
>> > incompatible with the same-named syscalls on most OSes, afaiu.
>>
>> This is internal QEMU code. I hope no one is expecting qio_channel_foo()
>> to behave like some OS's foo() system call. We cannot guarantee that
>> compatibility save for the simplest of wrappers.
>
> I was expecting that when I started to read. :)
>
> https://man.freebsd.org/cgi/man.cgi?query=pwritev
> https://linux.die.net/man/2/pwritev
>
> It's not "some OSes", it's mostly all.
What I mean is no one would ever replace a call to pwritev() with
qio_channel_pwritev() and expect the same behavior. We're not writing a
libc.
> I can understand you prefer such
> an approach, but even if so, shall we still try to avoid using pwritev/preadv
> as the names?
>
Yes, it's probably better to avoid those if we're going to be doing any
extra operations.
>>
>> >
>> > The definition of 'offset' in the current iochannel preadv/pwritev is hard
>> > to understand.. if I read it right it'll later be set to:
>> >
>> > /*
>> > * If we subtract the host page now, we don't need to
>> > * pass it into qio_channel_pwritev_all() below.
>> > */
>> > write_base = p->pages->block->pages_offset -
>> > (uintptr_t)p->pages->block->host;
>> >
>> > And I cannot easily tell what it is.. besides being an unsigned int.
>>
>> This description was unfortunately dropped along the way:
>>
>> "Since iovs can be non contiguous, we'd need a separate array on the
>> side to carry an extra file offset for each of them, so I'm relying on
>> the fact that iovs are all within a same host page and passing in an
>> encoded offset that takes the host page into account."
>>
>> > IIUC it's also based on the assumption that the host address of each iov
>> > entry is linear to its offset in the file, but that may not hold for
>> > future iochannel users of an interface named pwritev/preadv. So it's
>> > error prone.
>>
>> Yes, but it's also our choice whether to make this a generic API. We may
>> have good reasons to consider a migration-specific function here.
>>
>> > Would it be possible to keep using the offset array (p->pages->offset[x])?
>> > We have it already anyway, right? Wouldn't that be clearer?
>> >
>>
>> We'd have to make a copy of the array because p->pages is expected to
>> change while the IO happens.
>
> Hmm, I don't see why p->pages can change. IIUC p->pages will stay put
> at least until all IO syscalls are completed; then the next call to, e.g.,
> multifd_send_pages() will swap it with multifd_send_state->pages. But I
> think I get your point, given the below.
Oh no, you're right. Because of p->pending_job. And thinking about
p->pending_job, wouldn't a trylock do the same job while being more
explicit?
next_channel %= migrate_multifd_channels();
for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
p = &multifd_send_state->params[i];
if (qemu_mutex_trylock(&p->mutex) == 0) {
if (p->quit) {
error_report("%s: channel %d has already quit!", __func__, i);
qemu_mutex_unlock(&p->mutex);
return -1;
}
next_channel = (i + 1) % migrate_multifd_channels();
break;
} else {
/* channel still busy, try the next one */
}
}
multifd_send_state->pages = p->pages;
p->pages = pages;
qemu_mutex_unlock(&p->mutex);
>> And while we already have a copy in
>> p->normal, my intention for multifd was to eliminate p->normal in the
>> future, so it would be nice if we could avoid it.
>>
>> Also, we cannot use p->pages->offset alone because we still need the
>> pages_offset, i.e. the file offset where that ramblocks's pages begin.
>> So that means also adding that to each element of the new array.
>>
>> It would probably be overall clearer and less wasteful to pass in the
>> host page address instead of an array of offsets. I don't see an issue
>> with restricting the iovs to the same host page. The migration code is
>> the only user for this code and AFAIK we don't have plans to change that
>> invariant.
>
> So I think I get your point now. The only concern (besides naming..) is
> that I still want to avoid an interface that contains a field as hard to
> understand as write_base.
>
> How about this?
>
> /**
> * multifd_write_ramblock_iov: Write IO vector (of ramblock) to channel
> *
> * @ioc: The iochannel to write to. The IOC must implement the
> * pwritev/preadv interface.
> * @iov: The IO vector to write. All addresses must be within the
> * ramblock host address range.
> * @iov_len: The IO vector size
> * @ramblock: The ramblock that covers all buffers in this IO vector
> */
> int multifd_write_ramblock_iov(ioc, iov, iov_len, ramblock);
Ok, then I can take block->pages_offset and block->host from the
ramblock. I think I prefer something like this; that way we can be
explicit about the migration assumptions.
Thanks!
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-17 18:06 ` Fabiano Rosas
@ 2024-01-18 7:44 ` Peter Xu
2024-01-18 12:47 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-18 7:44 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Wed, Jan 17, 2024 at 03:06:15PM -0300, Fabiano Rosas wrote:
> Oh no, you're right. Because of p->pending_job. And thinking about
> p->pending_job, wouldn't a trylock do the same job while being more
> explicit?
>
> next_channel %= migrate_multifd_channels();
> for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
> p = &multifd_send_state->params[i];
>
> if (qemu_mutex_trylock(&p->mutex) == 0) {
> if (p->quit) {
> error_report("%s: channel %d has already quit!", __func__, i);
> qemu_mutex_unlock(&p->mutex);
> return -1;
> }
> next_channel = (i + 1) % migrate_multifd_channels();
> break;
> } else {
> /* channel still busy, try the next one */
> }
> }
> multifd_send_state->pages = p->pages;
> p->pages = pages;
> qemu_mutex_unlock(&p->mutex);
We probably can't for now; multifd_send_thread() will unlock the mutex
before the iochannel write()s, while the write()s will need those fields.
> Ok, then I can take block->pages_offset and block->host from the
> ramblock. I think I prefer something like this; that way we can be
> explicit about the migration assumptions.
I'm glad we reached an initial consensus. Yes, let's put that in
migration/; I don't expect this code to be used by other iochannel users.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-18 7:44 ` Peter Xu
@ 2024-01-18 12:47 ` Fabiano Rosas
2024-01-19 0:22 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-18 12:47 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Wed, Jan 17, 2024 at 03:06:15PM -0300, Fabiano Rosas wrote:
>> Oh no, you're right. Because of p->pending_job. And thinking about
>> p->pending_job, wouldn't a trylock do the same job while being more
>> explicit?
>>
>> next_channel %= migrate_multifd_channels();
>> for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
>> p = &multifd_send_state->params[i];
>>
>> if (qemu_mutex_trylock(&p->mutex) == 0) {
>> if (p->quit) {
>> error_report("%s: channel %d has already quit!", __func__, i);
>> qemu_mutex_unlock(&p->mutex);
>> return -1;
>> }
>> next_channel = (i + 1) % migrate_multifd_channels();
>> break;
>> } else {
>> /* channel still busy, try the next one */
>> }
>> }
>> multifd_send_state->pages = p->pages;
>> p->pages = pages;
>> qemu_mutex_unlock(&p->mutex);
>
> We probably can't for now; multifd_send_thread() will unlock the mutex
> before the iochannel write()s, while the write()s will need those fields.
Right, but we'd change that code to do the IO with the lock held. If no
one is blocking, it should be ok to hold the lock. Anyway, food for
thought.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-18 12:47 ` Fabiano Rosas
@ 2024-01-19 0:22 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-19 0:22 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Thu, Jan 18, 2024 at 09:47:18AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Wed, Jan 17, 2024 at 03:06:15PM -0300, Fabiano Rosas wrote:
> >> Oh no, you're right. Because of p->pending_job. And thinking about
> >> p->pending_job, wouldn't a trylock do the same job while being more
> >> explicit?
> >>
> >> next_channel %= migrate_multifd_channels();
> >> for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
> >> p = &multifd_send_state->params[i];
> >>
> >> if (qemu_mutex_trylock(&p->mutex) == 0) {
> >> if (p->quit) {
> >> error_report("%s: channel %d has already quit!", __func__, i);
> >> qemu_mutex_unlock(&p->mutex);
> >> return -1;
> >> }
> >> next_channel = (i + 1) % migrate_multifd_channels();
> >> break;
> >> } else {
> >> /* channel still busy, try the next one */
> >> }
> >> }
> >> multifd_send_state->pages = p->pages;
> >> p->pages = pages;
> >> qemu_mutex_unlock(&p->mutex);
> >
> > We probably can't for now; multifd_send_thread() will unlock the mutex
> > before the iochannel write()s, while the write()s will need those fields.
>
> Right, but we'd change that code to do the IO with the lock held. If no
> one is blocking, it should be ok to hold the lock. Anyway, food for
> thought.
I see what you meant. Sounds possible.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2023-11-27 20:25 ` [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
2024-01-16 6:58 ` Peter Xu
@ 2024-01-17 12:39 ` Daniel P. Berrangé
2024-01-17 14:27 ` Daniel P. Berrangé
1 sibling, 1 reply; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-17 12:39 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
> For the upcoming support to fixed-ram migration with multifd, we need
> to be able to accept an iovec array with non-contiguous data.
>
> Add a pwritev and preadv version that splits the array into contiguous
> segments before writing. With that we can have the ram code continue
> to add pages in any order and the multifd code continue to send large
> arrays for reading and writing.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - split the API that was merged into a single function
> - use uintptr_t for compatibility with 32-bit
> ---
> include/io/channel.h | 26 ++++++++++++++++
> io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 96 insertions(+)
>
> diff --git a/include/io/channel.h b/include/io/channel.h
> index 7986c49c71..25383db5aa 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
> ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> size_t niov, off_t offset, Error **errp);
>
> +/**
> + * qio_channel_pwritev_all:
> + * @ioc: the channel object
> + * @iov: the array of memory regions to write data from
> + * @niov: the length of the @iov array
> + * @offset: the iovec offset in the file where to write the data
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Returns: 0 if all bytes were written, or -1 on error
> + */
> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp);
> +
> /**
> * qio_channel_pwrite
> * @ioc: the channel object
> @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> size_t niov, off_t offset, Error **errp);
>
> +/**
> + * qio_channel_preadv_all:
> + * @ioc: the channel object
> + * @iov: the array of memory regions to read data to
> + * @niov: the length of the @iov array
> + * @offset: the iovec offset in the file from where to read the data
> + * @errp: pointer to a NULL-initialized error object
> + *
> + * Returns: 0 if all bytes were read, or -1 on error
> + */
> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp);
> +
> /**
> * qio_channel_pread
> * @ioc: the channel object
> diff --git a/io/channel.c b/io/channel.c
> index a1f12f8e90..2f1745d052 100644
> --- a/io/channel.c
> +++ b/io/channel.c
> @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> return klass->io_pwritev(ioc, iov, niov, offset, errp);
> }
>
> +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
> + const struct iovec *iov,
> + size_t niov, off_t offset,
> + bool is_write, Error **errp)
> +{
> + ssize_t ret = -1;
> + int i, slice_idx, slice_num;
> + uintptr_t base, next, file_offset;
> + size_t len;
> +
> + slice_idx = 0;
> + slice_num = 1;
> +
> + /*
> + * If the iov array doesn't have contiguous elements, we need to
> + * split it in slices because we only have one (file) 'offset' for
> + * the whole iov. Do this here so callers don't need to break the
> + * iov array themselves.
> + */
> + for (i = 0; i < niov; i++, slice_num++) {
> + base = (uintptr_t) iov[i].iov_base;
> +
> + if (i != niov - 1) {
> + len = iov[i].iov_len;
> + next = (uintptr_t) iov[i + 1].iov_base;
> +
> + if (base + len == next) {
> + continue;
> + }
> + }
> +
> + /*
> + * Use the offset of the first element of the segment that
> + * we're sending.
> + */
> + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
> +
> + if (is_write) {
> + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
> + file_offset, errp);
> + } else {
> + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
> + file_offset, errp);
> + }
iov_base is a pointer into RAM, so it could potentially be any
64-bit value.
We're deriving file_offset from this pointer value plus a
user-supplied offset, and then using it as an offset on disk.
First this could result in 64-bit overflow when 'offset' is
added to 'iov_base', and second this could result in a file
that's 16 Exabytes in size (with holes of course).
I don't get how this is supposed to work, or be used ?
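To put made-up numbers on it, purely for illustration:

    iov_base    = (void *)0x7f1200000000;       /* a typical heap address */
    offset      = 0x1000;
    file_offset = offset + (uintptr_t)iov_base; /* 0x7f1200001000, ~127 TiB */

Unless the caller deliberately encodes 'offset' so the host address
cancels out (which seems to be what the migration caller intends with the
write_base computation discussed elsewhere in the thread), the on-disk
offset is effectively the buffer's virtual address.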
> +
> + if (ret < 0) {
> + break;
> + }
> +
> + slice_idx += slice_num;
> + slice_num = 0;
> + }
> +
> + return (ret < 0) ? -1 : 0;
> +}
> +
> +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp)
> +{
> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> + offset, true, errp);
> +}
> +
> ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> off_t offset, Error **errp)
> {
> @@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> return klass->io_preadv(ioc, iov, niov, offset, errp);
> }
>
> +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> + size_t niov, off_t offset, Error **errp)
> +{
> + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> + offset, false, errp);
> +}
> +
> ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
> off_t offset, Error **errp)
> {
> --
> 2.35.3
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-17 12:39 ` Daniel P. Berrangé
@ 2024-01-17 14:27 ` Daniel P. Berrangé
2024-01-17 18:09 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-17 14:27 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel, armbru, Juan Quintela, Peter Xu,
Leonardo Bras, Claudio Fontana
On Wed, Jan 17, 2024 at 12:39:26PM +0000, Daniel P. Berrangé wrote:
> On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
> > For the upcoming support to fixed-ram migration with multifd, we need
> > to be able to accept an iovec array with non-contiguous data.
> >
> > Add a pwritev and preadv version that splits the array into contiguous
> > segments before writing. With that we can have the ram code continue
> > to add pages in any order and the multifd code continue to send large
> > arrays for reading and writing.
> >
> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
> > ---
> > - split the API that was merged into a single function
> > - use uintptr_t for compatibility with 32-bit
> > ---
> > include/io/channel.h | 26 ++++++++++++++++
> > io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 96 insertions(+)
> >
> > diff --git a/include/io/channel.h b/include/io/channel.h
> > index 7986c49c71..25383db5aa 100644
> > --- a/include/io/channel.h
> > +++ b/include/io/channel.h
> > @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
> > ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> > size_t niov, off_t offset, Error **errp);
> >
> > +/**
> > + * qio_channel_pwritev_all:
> > + * @ioc: the channel object
> > + * @iov: the array of memory regions to write data from
> > + * @niov: the length of the @iov array
> > + * @offset: the iovec offset in the file where to write the data
> > + * @errp: pointer to a NULL-initialized error object
> > + *
> > + * Returns: 0 if all bytes were written, or -1 on error
> > + */
> > +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> > + size_t niov, off_t offset, Error **errp);
> > +
> > /**
> > * qio_channel_pwrite
> > * @ioc: the channel object
> > @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> > ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> > size_t niov, off_t offset, Error **errp);
> >
> > +/**
> > + * qio_channel_preadv_all:
> > + * @ioc: the channel object
> > + * @iov: the array of memory regions to read data to
> > + * @niov: the length of the @iov array
> > + * @offset: the iovec offset in the file from where to read the data
> > + * @errp: pointer to a NULL-initialized error object
> > + *
> > + * Returns: 0 if all bytes were read, or -1 on error
> > + */
> > +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> > + size_t niov, off_t offset, Error **errp);
> > +
> > /**
> > * qio_channel_pread
> > * @ioc: the channel object
> > diff --git a/io/channel.c b/io/channel.c
> > index a1f12f8e90..2f1745d052 100644
> > --- a/io/channel.c
> > +++ b/io/channel.c
> > @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
> > return klass->io_pwritev(ioc, iov, niov, offset, errp);
> > }
> >
> > +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
> > + const struct iovec *iov,
> > + size_t niov, off_t offset,
> > + bool is_write, Error **errp)
> > +{
> > + ssize_t ret = -1;
> > + int i, slice_idx, slice_num;
> > + uintptr_t base, next, file_offset;
> > + size_t len;
> > +
> > + slice_idx = 0;
> > + slice_num = 1;
> > +
> > + /*
> > + * If the iov array doesn't have contiguous elements, we need to
> > + * split it in slices because we only have one (file) 'offset' for
> > + * the whole iov. Do this here so callers don't need to break the
> > + * iov array themselves.
> > + */
> > + for (i = 0; i < niov; i++, slice_num++) {
> > + base = (uintptr_t) iov[i].iov_base;
> > +
> > + if (i != niov - 1) {
> > + len = iov[i].iov_len;
> > + next = (uintptr_t) iov[i + 1].iov_base;
> > +
> > + if (base + len == next) {
> > + continue;
> > + }
> > + }
> > +
> > + /*
> > + * Use the offset of the first element of the segment that
> > + * we're sending.
> > + */
> > + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
> > +
> > + if (is_write) {
> > + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
> > + file_offset, errp);
> > + } else {
> > + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
> > + file_offset, errp);
> > + }
>
> iov_base is a pointer into RAM, so it could potentially be any
> 64-bit value.
>
> We're deriving file_offset from this pointer value plus a
> user-supplied offset, and then using it as an offset on disk.
> First this could result in 64-bit overflow when 'offset' is
> added to 'iov_base', and second this could result in a file
> that's 16 Exabytes in size (with holes of course).
>
> I don't get how this is supposed to work, or be used ?
I feel like this whole method might become clearer if we separated
out the logic for merging memory-adjacent iovecs.
How about adding an 'iov_collapse' method in iov.h / iov.c to do
the merging and then let the actual I/O code be simpler?
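Rough sketch of what I mean (hypothetical signature, not an existing
iov.h API; it rewrites the array in place, so callers would need a
scratch copy if the original must be preserved):

size_t iov_collapse(struct iovec *iov, size_t niov)
{
    size_t out = 0;

    for (size_t i = 0; i < niov; i++) {
        if (out &&
            (uintptr_t)iov[out - 1].iov_base + iov[out - 1].iov_len ==
                (uintptr_t)iov[i].iov_base) {
            /* contiguous in memory: extend the previous entry */
            iov[out - 1].iov_len += iov[i].iov_len;
        } else {
            iov[out++] = iov[i];
        }
    }

    return out;    /* new number of entries */
}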
>
> > +
> > + if (ret < 0) {
> > + break;
> > + }
> > +
> > + slice_idx += slice_num;
> > + slice_num = 0;
> > + }
> > +
> > + return (ret < 0) ? -1 : 0;
> > +}
> > +
> > +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
> > + size_t niov, off_t offset, Error **errp)
> > +{
> > + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> > + offset, true, errp);
> > +}
> > +
> > ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
> > off_t offset, Error **errp)
> > {
> > @@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
> > return klass->io_preadv(ioc, iov, niov, offset, errp);
> > }
> >
> > +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
> > + size_t niov, off_t offset, Error **errp)
> > +{
> > + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
> > + offset, false, errp);
> > +}
> > +
> > ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
> > off_t offset, Error **errp)
> > {
> > --
> > 2.35.3
> >
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec
2024-01-17 14:27 ` Daniel P. Berrangé
@ 2024-01-17 18:09 ` Fabiano Rosas
0 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-17 18:09 UTC (permalink / raw)
To: Daniel P. Berrangé, qemu-devel, armbru, Juan Quintela,
Peter Xu, Leonardo Bras, Claudio Fontana
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Wed, Jan 17, 2024 at 12:39:26PM +0000, Daniel P. Berrangé wrote:
>> On Mon, Nov 27, 2023 at 05:25:57PM -0300, Fabiano Rosas wrote:
>> > For the upcoming support to fixed-ram migration with multifd, we need
>> > to be able to accept an iovec array with non-contiguous data.
>> >
>> > Add a pwritev and preadv version that splits the array into contiguous
>> > segments before writing. With that we can have the ram code continue
>> > to add pages in any order and the multifd code continue to send large
>> > arrays for reading and writing.
>> >
>> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> > ---
>> > - split the API that was merged into a single function
>> > - use uintptr_t for compatibility with 32-bit
>> > ---
>> > include/io/channel.h | 26 ++++++++++++++++
>> > io/channel.c | 70 ++++++++++++++++++++++++++++++++++++++++++++
>> > 2 files changed, 96 insertions(+)
>> >
>> > diff --git a/include/io/channel.h b/include/io/channel.h
>> > index 7986c49c71..25383db5aa 100644
>> > --- a/include/io/channel.h
>> > +++ b/include/io/channel.h
>> > @@ -559,6 +559,19 @@ int qio_channel_close(QIOChannel *ioc,
>> > ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> > size_t niov, off_t offset, Error **errp);
>> >
>> > +/**
>> > + * qio_channel_pwritev_all:
>> > + * @ioc: the channel object
>> > + * @iov: the array of memory regions to write data from
>> > + * @niov: the length of the @iov array
>> > + * @offset: the iovec offset in the file where to write the data
>> > + * @errp: pointer to a NULL-initialized error object
>> > + *
>> > + * Returns: 0 if all bytes were written, or -1 on error
>> > + */
>> > +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> > + size_t niov, off_t offset, Error **errp);
>> > +
>> > /**
>> > * qio_channel_pwrite
>> > * @ioc: the channel object
>> > @@ -595,6 +608,19 @@ ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
>> > ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
>> > size_t niov, off_t offset, Error **errp);
>> >
>> > +/**
>> > + * qio_channel_preadv_all:
>> > + * @ioc: the channel object
>> > + * @iov: the array of memory regions to read data to
>> > + * @niov: the length of the @iov array
>> > + * @offset: the iovec offset in the file from where to read the data
>> > + * @errp: pointer to a NULL-initialized error object
>> > + *
>> > + * Returns: 0 if all bytes were read, or -1 on error
>> > + */
>> > +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
>> > + size_t niov, off_t offset, Error **errp);
>> > +
>> > /**
>> > * qio_channel_pread
>> > * @ioc: the channel object
>> > diff --git a/io/channel.c b/io/channel.c
>> > index a1f12f8e90..2f1745d052 100644
>> > --- a/io/channel.c
>> > +++ b/io/channel.c
>> > @@ -472,6 +472,69 @@ ssize_t qio_channel_pwritev(QIOChannel *ioc, const struct iovec *iov,
>> > return klass->io_pwritev(ioc, iov, niov, offset, errp);
>> > }
>> >
>> > +static int qio_channel_preadv_pwritev_contiguous(QIOChannel *ioc,
>> > + const struct iovec *iov,
>> > + size_t niov, off_t offset,
>> > + bool is_write, Error **errp)
>> > +{
>> > + ssize_t ret = -1;
>> > + int i, slice_idx, slice_num;
>> > + uintptr_t base, next, file_offset;
>> > + size_t len;
>> > +
>> > + slice_idx = 0;
>> > + slice_num = 1;
>> > +
>> > + /*
>> > + * If the iov array doesn't have contiguous elements, we need to
>> > + * split it in slices because we only have one (file) 'offset' for
>> > + * the whole iov. Do this here so callers don't need to break the
>> > + * iov array themselves.
>> > + */
>> > + for (i = 0; i < niov; i++, slice_num++) {
>> > + base = (uintptr_t) iov[i].iov_base;
>> > +
>> > + if (i != niov - 1) {
>> > + len = iov[i].iov_len;
>> > + next = (uintptr_t) iov[i + 1].iov_base;
>> > +
>> > + if (base + len == next) {
>> > + continue;
>> > + }
>> > + }
>> > +
>> > + /*
>> > + * Use the offset of the first element of the segment that
>> > + * we're sending.
>> > + */
>> > + file_offset = offset + (uintptr_t) iov[slice_idx].iov_base;
>> > +
>> > + if (is_write) {
>> > + ret = qio_channel_pwritev(ioc, &iov[slice_idx], slice_num,
>> > + file_offset, errp);
>> > + } else {
>> > + ret = qio_channel_preadv(ioc, &iov[slice_idx], slice_num,
>> > + file_offset, errp);
>> > + }
>>
>> iov_base is a pointer into RAM, so it could potentially be any
>> 64-bit value.
>>
>> We're deriving file_offset from this pointer value plus a
>> user-supplied offset, and then using it as an offset on disk.
>> First this could result in 64-bit overflow when 'offset' is
>> added to 'iov_base', and second this could result in a file
>> that's 16 Exabytes in size (with holes of course).
>>
>> I don't get how this is supposed to work, or be used ?
>
> I feel like this whole method might become clearer if we separated
> out the logic for merging memory-adjacent iovecs.
>
> How about adding an 'iov_collapse' method in iov.h / iov.c to do
> the merging and then let the actual I/O code be simpler?
I think if we add a migration-specific wrapper like the one we've been
discussing with Peter earlier in the thread (in this same message), that
would be enough to keep the migration assumptions contained and avoid
polluting the IO code with any of this logic.
>>
>> > +
>> > + if (ret < 0) {
>> > + break;
>> > + }
>> > +
>> > + slice_idx += slice_num;
>> > + slice_num = 0;
>> > + }
>> > +
>> > + return (ret < 0) ? -1 : 0;
>> > +}
>
>
>
>> > +
>> > +int qio_channel_pwritev_all(QIOChannel *ioc, const struct iovec *iov,
>> > + size_t niov, off_t offset, Error **errp)
>> > +{
>> > + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
>> > + offset, true, errp);
>> > +}
>> > +
>> > ssize_t qio_channel_pwrite(QIOChannel *ioc, char *buf, size_t buflen,
>> > off_t offset, Error **errp)
>> > {
>> > @@ -501,6 +564,13 @@ ssize_t qio_channel_preadv(QIOChannel *ioc, const struct iovec *iov,
>> > return klass->io_preadv(ioc, iov, niov, offset, errp);
>> > }
>> >
>> > +int qio_channel_preadv_all(QIOChannel *ioc, const struct iovec *iov,
>> > + size_t niov, off_t offset, Error **errp)
>> > +{
>> > + return qio_channel_preadv_pwritev_contiguous(ioc, iov, niov,
>> > + offset, false, errp);
>> > +}
>> > +
>> > ssize_t qio_channel_pread(QIOChannel *ioc, char *buf, size_t buflen,
>> > off_t offset, Error **errp)
>> > {
>> > --
>> > 2.35.3
>> >
>>
>> With regards,
>> Daniel
>> --
>> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
>> |: https://libvirt.org -o- https://fstop138.berrange.com :|
>> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>>
>>
>
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 16/30] multifd: Rename MultiFDSendParams::data to compress_data
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (14 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 15/30] io: Add a pwritev/preadv version that takes a discontiguous iovec Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-16 7:03 ` Peter Xu
2023-11-27 20:25 ` [RFC PATCH v3 17/30] migration/multifd: Decouple recv method from pages Fabiano Rosas
` (14 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Use a more specific name for the compression data so we can use the
generic name in the multifd core code.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd-zlib.c | 20 ++++++++++----------
migration/multifd-zstd.c | 20 ++++++++++----------
migration/multifd.h | 4 ++--
3 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 37ce48621e..fd94e79dd9 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -69,7 +69,7 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
err_msg = "out of memory for buf";
goto err_free_zbuff;
}
- p->data = z;
+ p->compress_data = z;
return 0;
err_free_zbuff:
@@ -92,15 +92,15 @@ err_free_z:
*/
static void zlib_send_cleanup(MultiFDSendParams *p, Error **errp)
{
- struct zlib_data *z = p->data;
+ struct zlib_data *z = p->compress_data;
deflateEnd(&z->zs);
g_free(z->zbuff);
z->zbuff = NULL;
g_free(z->buf);
z->buf = NULL;
- g_free(p->data);
- p->data = NULL;
+ g_free(p->compress_data);
+ p->compress_data = NULL;
}
/**
@@ -116,7 +116,7 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error **errp)
*/
static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
{
- struct zlib_data *z = p->data;
+ struct zlib_data *z = p->compress_data;
z_stream *zs = &z->zs;
uint32_t out_size = 0;
int ret;
@@ -189,7 +189,7 @@ static int zlib_recv_setup(MultiFDRecvParams *p, Error **errp)
struct zlib_data *z = g_new0(struct zlib_data, 1);
z_stream *zs = &z->zs;
- p->data = z;
+ p->compress_data = z;
zs->zalloc = Z_NULL;
zs->zfree = Z_NULL;
zs->opaque = Z_NULL;
@@ -219,13 +219,13 @@ static int zlib_recv_setup(MultiFDRecvParams *p, Error **errp)
*/
static void zlib_recv_cleanup(MultiFDRecvParams *p)
{
- struct zlib_data *z = p->data;
+ struct zlib_data *z = p->compress_data;
inflateEnd(&z->zs);
g_free(z->zbuff);
z->zbuff = NULL;
- g_free(p->data);
- p->data = NULL;
+ g_free(p->compress_data);
+ p->compress_data = NULL;
}
/**
@@ -241,7 +241,7 @@ static void zlib_recv_cleanup(MultiFDRecvParams *p)
*/
static int zlib_recv_pages(MultiFDRecvParams *p, Error **errp)
{
- struct zlib_data *z = p->data;
+ struct zlib_data *z = p->compress_data;
z_stream *zs = &z->zs;
uint32_t in_size = p->next_packet_size;
/* we measure the change of total_out */
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index b471daadcd..238eebbf4b 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -52,7 +52,7 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
struct zstd_data *z = g_new0(struct zstd_data, 1);
int res;
- p->data = z;
+ p->compress_data = z;
z->zcs = ZSTD_createCStream();
if (!z->zcs) {
g_free(z);
@@ -90,14 +90,14 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
*/
static void zstd_send_cleanup(MultiFDSendParams *p, Error **errp)
{
- struct zstd_data *z = p->data;
+ struct zstd_data *z = p->compress_data;
ZSTD_freeCStream(z->zcs);
z->zcs = NULL;
g_free(z->zbuff);
z->zbuff = NULL;
- g_free(p->data);
- p->data = NULL;
+ g_free(p->compress_data);
+ p->compress_data = NULL;
}
/**
@@ -113,7 +113,7 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error **errp)
*/
static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
{
- struct zstd_data *z = p->data;
+ struct zstd_data *z = p->compress_data;
int ret;
uint32_t i;
@@ -178,7 +178,7 @@ static int zstd_recv_setup(MultiFDRecvParams *p, Error **errp)
struct zstd_data *z = g_new0(struct zstd_data, 1);
int ret;
- p->data = z;
+ p->compress_data = z;
z->zds = ZSTD_createDStream();
if (!z->zds) {
g_free(z);
@@ -216,14 +216,14 @@ static int zstd_recv_setup(MultiFDRecvParams *p, Error **errp)
*/
static void zstd_recv_cleanup(MultiFDRecvParams *p)
{
- struct zstd_data *z = p->data;
+ struct zstd_data *z = p->compress_data;
ZSTD_freeDStream(z->zds);
z->zds = NULL;
g_free(z->zbuff);
z->zbuff = NULL;
- g_free(p->data);
- p->data = NULL;
+ g_free(p->compress_data);
+ p->compress_data = NULL;
}
/**
@@ -243,7 +243,7 @@ static int zstd_recv_pages(MultiFDRecvParams *p, Error **errp)
uint32_t out_size = 0;
uint32_t expected_size = p->normal_num * p->page_size;
uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
- struct zstd_data *z = p->data;
+ struct zstd_data *z = p->compress_data;
int ret;
int i;
diff --git a/migration/multifd.h b/migration/multifd.h
index a112ec7ac6..744b52762f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -132,7 +132,7 @@ typedef struct {
/* num of non zero pages */
uint32_t normal_num;
/* used for compression methods */
- void *data;
+ void *compress_data;
} MultiFDSendParams;
typedef struct {
@@ -189,7 +189,7 @@ typedef struct {
/* num of non zero pages */
uint32_t normal_num;
/* used for de-compression methods */
- void *data;
+ void *compress_data;
} MultiFDRecvParams;
typedef struct {
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 17/30] migration/multifd: Decouple recv method from pages
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (15 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 16/30] multifd: Rename MultiFDSendParams::data to compress_data Fabiano Rosas
@ 2023-11-27 20:25 ` Fabiano Rosas
2024-01-16 7:23 ` Peter Xu
2023-11-27 20:26 ` [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets Fabiano Rosas
` (13 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:25 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Next patch will abstract the type of data being received by the
channels, so do some cleanup now to remove references to pages and
dependency on 'normal_num'.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 12 +++++++-----
migration/multifd.h | 5 ++---
4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index fd94e79dd9..e019d2d74e 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -314,7 +314,7 @@ static MultiFDMethods multifd_zlib_ops = {
.send_prepare = zlib_send_prepare,
.recv_setup = zlib_recv_setup,
.recv_cleanup = zlib_recv_cleanup,
- .recv_pages = zlib_recv_pages
+ .recv_data = zlib_recv_pages
};
static void multifd_zlib_register(void)
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 238eebbf4b..0b8414df5b 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -305,7 +305,7 @@ static MultiFDMethods multifd_zstd_ops = {
.send_prepare = zstd_send_prepare,
.recv_setup = zstd_recv_setup,
.recv_cleanup = zstd_recv_cleanup,
- .recv_pages = zstd_recv_pages
+ .recv_data = zstd_recv_pages
};
static void multifd_zstd_register(void)
diff --git a/migration/multifd.c b/migration/multifd.c
index 3476fac49f..c1381bdc21 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -130,7 +130,7 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
}
/**
- * nocomp_recv_pages: read the data from the channel into actual pages
+ * nocomp_recv_data: read the data from the channel
*
* For no compression we just need to read things into the correct place.
*
@@ -139,7 +139,7 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
* @p: Params for the channel that we are using
* @errp: pointer to an error
*/
-static int nocomp_recv_pages(MultiFDRecvParams *p, Error **errp)
+static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
{
uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
@@ -161,7 +161,7 @@ static MultiFDMethods multifd_nocomp_ops = {
.send_prepare = nocomp_send_prepare,
.recv_setup = nocomp_recv_setup,
.recv_cleanup = nocomp_recv_cleanup,
- .recv_pages = nocomp_recv_pages
+ .recv_data = nocomp_recv_data
};
static MultiFDMethods *multifd_ops[MULTIFD_COMPRESSION__MAX] = {
@@ -1126,6 +1126,7 @@ static void *multifd_recv_thread(void *opaque)
while (true) {
uint32_t flags = 0;
+ bool has_data = false;
p->normal_num = 0;
if (p->quit) {
@@ -1154,12 +1155,13 @@ static void *multifd_recv_thread(void *opaque)
p->next_packet_size);
p->total_normal_pages += p->normal_num;
+ has_data = !!p->normal_num;
}
qemu_mutex_unlock(&p->mutex);
- if (p->normal_num) {
- ret = multifd_recv_state->ops->recv_pages(p, &local_err);
+ if (has_data) {
+ ret = multifd_recv_state->ops->recv_data(p, &local_err);
if (ret != 0) {
break;
}
diff --git a/migration/multifd.h b/migration/multifd.h
index 744b52762f..406d42dbae 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -203,11 +203,10 @@ typedef struct {
int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
/* Cleanup for receiving side */
void (*recv_cleanup)(MultiFDRecvParams *p);
- /* Read all pages */
- int (*recv_pages)(MultiFDRecvParams *p, Error **errp);
+ /* Read all data */
+ int (*recv_data)(MultiFDRecvParams *p, Error **errp);
} MultiFDMethods;
void multifd_register_ops(int method, MultiFDMethods *ops);
#endif
-
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 17/30] migration/multifd: Decouple recv method from pages
2023-11-27 20:25 ` [RFC PATCH v3 17/30] migration/multifd: Decouple recv method from pages Fabiano Rosas
@ 2024-01-16 7:23 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-16 7:23 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:59PM -0300, Fabiano Rosas wrote:
> Next patch will abstract the type of data being received by the
> channels, so do some cleanup now to remove references to pages and
> dependency on 'normal_num'.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (16 preceding siblings ...)
2023-11-27 20:25 ` [RFC PATCH v3 17/30] migration/multifd: Decouple recv method from pages Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2024-01-16 8:10 ` Peter Xu
2023-11-27 20:26 ` [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
` (12 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Currently multifd does not need to have knowledge of pages on the
receiving side because all the information needed is within the
packets that come in the stream.
We're about to add support for fixed-ram migration, which cannot use
packets because it expects the ramblock section in the migration file
to contain only the guest pages data.
Add a data structure to transfer pages between the ram migration code
and the multifd receiving threads.
We don't want to reuse MultiFDPages_t for two reasons:
a) multifd threads don't really need to know about the data they're
receiving.
b) the receiving side has to be stopped to load the pages, which means
we can experiment with larger granularities than page size when
transferring data.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- stopped using MultiFDPages_t and added a new structure which can
take offset + size
---
migration/multifd.c | 122 ++++++++++++++++++++++++++++++++++++++++++--
migration/multifd.h | 20 ++++++++
2 files changed, 138 insertions(+), 4 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index c1381bdc21..7dfab2367a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -142,17 +142,36 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
{
uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+ ERRP_GUARD();
if (flags != MULTIFD_FLAG_NOCOMP) {
error_setg(errp, "multifd %u: flags received %x flags expected %x",
p->id, flags, MULTIFD_FLAG_NOCOMP);
return -1;
}
- for (int i = 0; i < p->normal_num; i++) {
- p->iov[i].iov_base = p->host + p->normal[i];
- p->iov[i].iov_len = p->page_size;
+
+ if (!migrate_multifd_packets()) {
+ MultiFDRecvData *data = p->data;
+ size_t ret;
+
+ ret = qio_channel_pread(p->c, (char *) data->opaque,
+ data->size, data->file_offset, errp);
+ if (ret != data->size) {
+ error_prepend(errp,
+ "multifd recv (%u): read 0x%zx, expected 0x%zx",
+ p->id, ret, data->size);
+ return -1;
+ }
+
+ return 0;
+ } else {
+ for (int i = 0; i < p->normal_num; i++) {
+ p->iov[i].iov_base = p->host + p->normal[i];
+ p->iov[i].iov_len = p->page_size;
+ }
+
+ return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
}
- return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
}
static MultiFDMethods multifd_nocomp_ops = {
@@ -989,6 +1008,7 @@ int multifd_save_setup(Error **errp)
struct {
MultiFDRecvParams *params;
+ MultiFDRecvData *data;
/* number of created threads */
int count;
/* syncs main thread and channels */
@@ -999,6 +1019,49 @@ struct {
MultiFDMethods *ops;
} *multifd_recv_state;
+int multifd_recv(void)
+{
+ int i;
+ static int next_recv_channel;
+ MultiFDRecvParams *p = NULL;
+ MultiFDRecvData *data = multifd_recv_state->data;
+
+ /*
+ * next_channel can remain from a previous migration that was
+ * using more channels, so ensure it doesn't overflow if the
+ * limit is lower now.
+ */
+ next_recv_channel %= migrate_multifd_channels();
+ for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
+ p = &multifd_recv_state->params[i];
+
+ qemu_mutex_lock(&p->mutex);
+ if (p->quit) {
+ error_report("%s: channel %d has already quit!", __func__, i);
+ qemu_mutex_unlock(&p->mutex);
+ return -1;
+ }
+ if (!p->pending_job) {
+ p->pending_job++;
+ next_recv_channel = (i + 1) % migrate_multifd_channels();
+ break;
+ }
+ qemu_mutex_unlock(&p->mutex);
+ }
+ assert(p->data->size == 0);
+ multifd_recv_state->data = p->data;
+ p->data = data;
+ qemu_mutex_unlock(&p->mutex);
+ qemu_sem_post(&p->sem);
+
+ return 1;
+}
+
+MultiFDRecvData *multifd_get_recv_data(void)
+{
+ return multifd_recv_state->data;
+}
+
static void multifd_recv_terminate_threads(Error *err)
{
int i;
@@ -1020,6 +1083,7 @@ static void multifd_recv_terminate_threads(Error *err)
qemu_mutex_lock(&p->mutex);
p->quit = true;
+ qemu_sem_post(&p->sem);
/*
* We could arrive here for two reasons:
* - normal quit, i.e. everything went fine, just finished
@@ -1069,6 +1133,7 @@ void multifd_load_cleanup(void)
p->c = NULL;
qemu_mutex_destroy(&p->mutex);
qemu_sem_destroy(&p->sem_sync);
+ qemu_sem_destroy(&p->sem);
g_free(p->name);
p->name = NULL;
p->packet_len = 0;
@@ -1083,6 +1148,8 @@ void multifd_load_cleanup(void)
qemu_sem_destroy(&multifd_recv_state->sem_sync);
g_free(multifd_recv_state->params);
multifd_recv_state->params = NULL;
+ g_free(multifd_recv_state->data);
+ multifd_recv_state->data = NULL;
g_free(multifd_recv_state);
multifd_recv_state = NULL;
}
@@ -1094,6 +1161,21 @@ void multifd_recv_sync_main(void)
if (!migrate_multifd() || !migrate_multifd_packets()) {
return;
}
+
+ if (!migrate_multifd_packets()) {
+ for (i = 0; i < migrate_multifd_channels(); i++) {
+ MultiFDRecvParams *p = &multifd_recv_state->params[i];
+
+ qemu_sem_post(&p->sem);
+ qemu_sem_wait(&p->sem_sync);
+
+ qemu_mutex_lock(&p->mutex);
+ assert(!p->pending_job || p->quit);
+ qemu_mutex_unlock(&p->mutex);
+ }
+ return;
+ }
+
for (i = 0; i < migrate_multifd_channels(); i++) {
MultiFDRecvParams *p = &multifd_recv_state->params[i];
@@ -1156,6 +1238,18 @@ static void *multifd_recv_thread(void *opaque)
p->total_normal_pages += p->normal_num;
has_data = !!p->normal_num;
+ } else {
+ /*
+ * No packets, so we need to wait for the vmstate code to
+ * give us work.
+ */
+ qemu_sem_wait(&p->sem);
+ qemu_mutex_lock(&p->mutex);
+ if (!p->pending_job) {
+ qemu_mutex_unlock(&p->mutex);
+ break;
+ }
+ has_data = !!p->data->size;
}
qemu_mutex_unlock(&p->mutex);
@@ -1171,6 +1265,17 @@ static void *multifd_recv_thread(void *opaque)
qemu_sem_post(&multifd_recv_state->sem_sync);
qemu_sem_wait(&p->sem_sync);
}
+
+ if (!use_packets) {
+ qemu_mutex_lock(&p->mutex);
+ p->data->size = 0;
+ p->pending_job--;
+ qemu_mutex_unlock(&p->mutex);
+ }
+ }
+
+ if (!use_packets) {
+ qemu_sem_post(&p->sem_sync);
}
if (local_err) {
@@ -1205,6 +1310,10 @@ int multifd_load_setup(Error **errp)
thread_count = migrate_multifd_channels();
multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
+
+ multifd_recv_state->data = g_new0(MultiFDRecvData, 1);
+ multifd_recv_state->data->size = 0;
+
qatomic_set(&multifd_recv_state->count, 0);
qemu_sem_init(&multifd_recv_state->sem_sync, 0);
multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
@@ -1214,9 +1323,14 @@ int multifd_load_setup(Error **errp)
qemu_mutex_init(&p->mutex);
qemu_sem_init(&p->sem_sync, 0);
+ qemu_sem_init(&p->sem, 0);
p->quit = false;
+ p->pending_job = 0;
p->id = i;
+ p->data = g_new0(MultiFDRecvData, 1);
+ p->data->size = 0;
+
if (use_packets) {
p->packet_len = sizeof(MultiFDPacket_t)
+ sizeof(uint64_t) * page_count;
diff --git a/migration/multifd.h b/migration/multifd.h
index 406d42dbae..abaf16c3f2 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -13,6 +13,8 @@
#ifndef QEMU_MIGRATION_MULTIFD_H
#define QEMU_MIGRATION_MULTIFD_H
+typedef struct MultiFDRecvData MultiFDRecvData;
+
int multifd_save_setup(Error **errp);
void multifd_save_cleanup(void);
int multifd_load_setup(Error **errp);
@@ -24,6 +26,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
void multifd_recv_sync_main(void);
int multifd_send_sync_main(QEMUFile *f);
int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
+int multifd_recv(void);
+MultiFDRecvData *multifd_get_recv_data(void);
/* Multifd Compression flags */
#define MULTIFD_FLAG_SYNC (1 << 0)
@@ -66,6 +70,13 @@ typedef struct {
RAMBlock *block;
} MultiFDPages_t;
+struct MultiFDRecvData {
+ void *opaque;
+ size_t size;
+ /* for preadv */
+ off_t file_offset;
+};
+
typedef struct {
/* Fields are only written at creating/deletion time */
/* No lock required for them, they are read only */
@@ -156,6 +167,8 @@ typedef struct {
/* syncs main thread and channels */
QemuSemaphore sem_sync;
+ /* sem where to wait for more work */
+ QemuSemaphore sem;
/* this mutex protects the following parameters */
QemuMutex mutex;
@@ -167,6 +180,13 @@ typedef struct {
uint32_t flags;
/* global number of generated multifd packets */
uint64_t packet_num;
+ int pending_job;
+ /*
+ * The owner of 'data' depends of 'pending_job' value:
+ * pending_job == 0 -> migration_thread can use it.
+ * pending_job != 0 -> multifd_channel can use it.
+ */
+ MultiFDRecvData *data;
/* thread local variables. No locking required */
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets
2023-11-27 20:26 ` [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets Fabiano Rosas
@ 2024-01-16 8:10 ` Peter Xu
2024-01-16 20:25 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-16 8:10 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:26:00PM -0300, Fabiano Rosas wrote:
> Currently multifd does not need to have knowledge of pages on the
> receiving side because all the information needed is within the
> packets that come in the stream.
>
> We're about to add support for fixed-ram migration, which cannot use
> packets because it expects the ramblock section in the migration file
> to contain only the guest pages data.
>
> Add a data structure to transfer pages between the ram migration code
> and the multifd receiving threads.
>
> We don't want to reuse MultiFDPages_t for two reasons:
>
> a) multifd threads don't really need to know about the data they're
> receiving.
>
> b) the receiving side has to be stopped to load the pages, which means
> we can experiment with larger granularities than page size when
> transferring data.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> - stopped using MultiFDPages_t and added a new structure which can
> take offset + size
> ---
> migration/multifd.c | 122 ++++++++++++++++++++++++++++++++++++++++++--
> migration/multifd.h | 20 ++++++++
> 2 files changed, 138 insertions(+), 4 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index c1381bdc21..7dfab2367a 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -142,17 +142,36 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
> static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
> {
> uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> + ERRP_GUARD();
>
> if (flags != MULTIFD_FLAG_NOCOMP) {
> error_setg(errp, "multifd %u: flags received %x flags expected %x",
> p->id, flags, MULTIFD_FLAG_NOCOMP);
> return -1;
> }
> - for (int i = 0; i < p->normal_num; i++) {
> - p->iov[i].iov_base = p->host + p->normal[i];
> - p->iov[i].iov_len = p->page_size;
> +
> + if (!migrate_multifd_packets()) {
> + MultiFDRecvData *data = p->data;
> + size_t ret;
> +
> + ret = qio_channel_pread(p->c, (char *) data->opaque,
> + data->size, data->file_offset, errp);
> + if (ret != data->size) {
> + error_prepend(errp,
> + "multifd recv (%u): read 0x%zx, expected 0x%zx",
> + p->id, ret, data->size);
> + return -1;
> + }
> +
> + return 0;
> + } else {
> + for (int i = 0; i < p->normal_num; i++) {
> + p->iov[i].iov_base = p->host + p->normal[i];
> + p->iov[i].iov_len = p->page_size;
> + }
> +
> + return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
> }
I guess you managed to squash the file loads into "no compression" handler
of multifd, but IMHO it's not as clean.
Firstly, to do so, we'd better make sure multifd-compression is not
enabled anywhere together with fixed-ram. I didn't yet see such protection
in the series. I think if it happens we should expect crashes because
they'll go into zlib/zstd paths for the file.
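Something like the below in the migration parameter checks would cover it
(just a sketch; I'm assuming a migrate_fixed_ram() style accessor here,
whatever name the capability ends up with):

    if (migrate_fixed_ram() &&
        migrate_multifd_compression() != MULTIFD_COMPRESSION_NONE) {
        error_setg(errp, "fixed-ram migration is incompatible with "
                   "multifd compression");
        return false;
    }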
IMHO the only model fixed-ram can share with multifd is the task management
part, mutexes, semaphores, etc.. IIRC I used to mention that it'll be nice
if we have simply a pool of threads so we can enqueue tasks. If that's too
far away, would something like below be closer to that? What I'm thinking:
- patch 1: rename MultiFDMethods to MultiFDCompressMethods, this can
replace the other patch to do s/recv_pages/recv_data/
- patch 2: introduce MultiFDMethods (on top of MultiFDCompressMethods),
refactor the current code to provide the socket version of MultiFDMethods.
- patch 3: add the fixed-ram "file" version of MultiFDMethods
MultiFDCompressMethods doesn't need to be used at all for "file" version of
MultiFDMethods.
Would that work?
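To illustrate what I mean, roughly this shape (a sketch only; the names are
illustrative, not actual code from the series):

    typedef struct {
        int (*send_setup)(MultiFDSendParams *p, Error **errp);
        int (*send_prepare)(MultiFDSendParams *p, Error **errp);
        void (*send_cleanup)(MultiFDSendParams *p, Error **errp);
        int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
        int (*recv_data)(MultiFDRecvParams *p, Error **errp);
        void (*recv_cleanup)(MultiFDRecvParams *p);
    } MultiFDCompressMethods;

    typedef struct {
        /* how the payload actually moves: socket stream vs. pread/pwrite */
        int (*send)(MultiFDSendParams *p, Error **errp);
        int (*recv)(MultiFDRecvParams *p, Error **errp);
        /* chained compressor; NULL for the fixed-ram "file" version */
        MultiFDCompressMethods *compress;
    } MultiFDMethods;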
> - return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
> }
>
> static MultiFDMethods multifd_nocomp_ops = {
> @@ -989,6 +1008,7 @@ int multifd_save_setup(Error **errp)
>
> struct {
> MultiFDRecvParams *params;
> + MultiFDRecvData *data;
(If the above works, maybe we can split MultiFDRecvParams into two chunks,
one commonly used for both, one only for sockets?)
> /* number of created threads */
> int count;
> /* syncs main thread and channels */
> @@ -999,6 +1019,49 @@ struct {
> MultiFDMethods *ops;
> } *multifd_recv_state;
>
> +int multifd_recv(void)
> +{
> + int i;
> + static int next_recv_channel;
> + MultiFDRecvParams *p = NULL;
> + MultiFDRecvData *data = multifd_recv_state->data;
> +
> + /*
> + * next_channel can remain from a previous migration that was
> + * using more channels, so ensure it doesn't overflow if the
> + * limit is lower now.
> + */
> + next_recv_channel %= migrate_multifd_channels();
> + for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
> + p = &multifd_recv_state->params[i];
> +
> + qemu_mutex_lock(&p->mutex);
> + if (p->quit) {
> + error_report("%s: channel %d has already quit!", __func__, i);
> + qemu_mutex_unlock(&p->mutex);
> + return -1;
> + }
> + if (!p->pending_job) {
> + p->pending_job++;
> + next_recv_channel = (i + 1) % migrate_multifd_channels();
> + break;
> + }
> + qemu_mutex_unlock(&p->mutex);
> + }
> + assert(p->data->size == 0);
> + multifd_recv_state->data = p->data;
> + p->data = data;
> + qemu_mutex_unlock(&p->mutex);
> + qemu_sem_post(&p->sem);
> +
> + return 1;
> +}
PS: so if we have the pool model we can already mostly merge the above
code with multifd_send_pages(), because this will be a common helper to
enqueue a task to a pool, no matter whether it's for writing (to
file/socket) or reading (only from file).
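E.g. (sketch only; MultiFDChannel/MultiFDTask are hypothetical stand-ins
for whatever common chunk the send/recv params get split into):

    static int multifd_enqueue(MultiFDChannel *channels, int nchannels,
                               MultiFDTask *task)
    {
        static int next;

        next %= nchannels;
        for (int i = next; ; i = (i + 1) % nchannels) {
            MultiFDChannel *c = &channels[i];

            qemu_mutex_lock(&c->mutex);
            if (c->quit) {
                qemu_mutex_unlock(&c->mutex);
                return -1;
            }
            if (!c->pending_job) {
                c->pending_job++;
                c->task = task;   /* ownership passes to the channel */
                next = (i + 1) % nchannels;
                qemu_mutex_unlock(&c->mutex);
                qemu_sem_post(&c->sem);   /* wake the worker thread */
                return 0;
            }
            qemu_mutex_unlock(&c->mutex);
        }
    }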
> +
> +MultiFDRecvData *multifd_get_recv_data(void)
> +{
> + return multifd_recv_state->data;
> +}
> +
> static void multifd_recv_terminate_threads(Error *err)
> {
> int i;
> @@ -1020,6 +1083,7 @@ static void multifd_recv_terminate_threads(Error *err)
>
> qemu_mutex_lock(&p->mutex);
> p->quit = true;
> + qemu_sem_post(&p->sem);
> /*
> * We could arrive here for two reasons:
> * - normal quit, i.e. everything went fine, just finished
> @@ -1069,6 +1133,7 @@ void multifd_load_cleanup(void)
> p->c = NULL;
> qemu_mutex_destroy(&p->mutex);
> qemu_sem_destroy(&p->sem_sync);
> + qemu_sem_destroy(&p->sem);
> g_free(p->name);
> p->name = NULL;
> p->packet_len = 0;
> @@ -1083,6 +1148,8 @@ void multifd_load_cleanup(void)
> qemu_sem_destroy(&multifd_recv_state->sem_sync);
> g_free(multifd_recv_state->params);
> multifd_recv_state->params = NULL;
> + g_free(multifd_recv_state->data);
> + multifd_recv_state->data = NULL;
> g_free(multifd_recv_state);
> multifd_recv_state = NULL;
> }
> @@ -1094,6 +1161,21 @@ void multifd_recv_sync_main(void)
> if (!migrate_multifd() || !migrate_multifd_packets()) {
[1]
> return;
> }
> +
> + if (!migrate_multifd_packets()) {
Hmm, isn't this checked already above at [1]? Could this path ever trigger
then? Maybe we need to drop the one at [1]?
IIUC what you wanted to do here is to rely on the last RAM_SAVE_FLAG_EOS in
the image file to do a full flush to make sure all pages are loaded.
You may want to be careful about the side effect of the
flush_after_each_section parameter:
case RAM_SAVE_FLAG_EOS:
/* normal exit */
if (migrate_multifd() &&
migrate_multifd_flush_after_each_section()) {
multifd_recv_sync_main();
}
You may want to always flush for file?
> + for (i = 0; i < migrate_multifd_channels(); i++) {
> + MultiFDRecvParams *p = &multifd_recv_state->params[i];
> +
> + qemu_sem_post(&p->sem);
> + qemu_sem_wait(&p->sem_sync);
> +
> + qemu_mutex_lock(&p->mutex);
> + assert(!p->pending_job || p->quit);
> + qemu_mutex_unlock(&p->mutex);
> + }
> + return;
Btw, how does this kick off all the recv threads? Is it because you did a
sem_post(&sem) with p->pending_job==false this time?
Maybe it's clearer to just set p->quit (or a global quit knob) somewhere?
That'll make it clear that this is a one-shot thing, only needed at the
end of the file incoming migration.
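I.e. something like (a sketch; the recv thread would then check p->quit
right after qemu_sem_wait() and post sem_sync on its way out):

    /* one-shot teardown at the end of the file load */
    for (int i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDRecvParams *p = &multifd_recv_state->params[i];

        qemu_mutex_lock(&p->mutex);
        p->quit = true;
        qemu_mutex_unlock(&p->mutex);
        qemu_sem_post(&p->sem);      /* wake the thread so it sees quit */
        qemu_sem_wait(&p->sem_sync); /* wait for it to acknowledge */
    }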
> + }
> +
> for (i = 0; i < migrate_multifd_channels(); i++) {
> MultiFDRecvParams *p = &multifd_recv_state->params[i];
>
> @@ -1156,6 +1238,18 @@ static void *multifd_recv_thread(void *opaque)
>
> p->total_normal_pages += p->normal_num;
> has_data = !!p->normal_num;
> + } else {
> + /*
> + * No packets, so we need to wait for the vmstate code to
> + * give us work.
> + */
> + qemu_sem_wait(&p->sem);
> + qemu_mutex_lock(&p->mutex);
> + if (!p->pending_job) {
> + qemu_mutex_unlock(&p->mutex);
> + break;
> + }
> + has_data = !!p->data->size;
> }
>
> qemu_mutex_unlock(&p->mutex);
> @@ -1171,6 +1265,17 @@ static void *multifd_recv_thread(void *opaque)
> qemu_sem_post(&multifd_recv_state->sem_sync);
> qemu_sem_wait(&p->sem_sync);
> }
> +
> + if (!use_packets) {
> + qemu_mutex_lock(&p->mutex);
> + p->data->size = 0;
> + p->pending_job--;
> + qemu_mutex_unlock(&p->mutex);
> + }
> + }
> +
> + if (!use_packets) {
> + qemu_sem_post(&p->sem_sync);
Currently sem_sync is only posted with the MULTIFD_FLAG_SYNC flag. We'd
better be careful about reusing it.
Maybe add some comment above recv_state->sem_sync?
/*
* For sockets: this is posted once for each MULTIFD_FLAG_SYNC flag.
*
* For files: this is only posted at the end of the file load to mark
* completion of the load process.
*/
> }
>
> if (local_err) {
> @@ -1205,6 +1310,10 @@ int multifd_load_setup(Error **errp)
> thread_count = migrate_multifd_channels();
> multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
> multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
> +
> + multifd_recv_state->data = g_new0(MultiFDRecvData, 1);
> + multifd_recv_state->data->size = 0;
> +
> qatomic_set(&multifd_recv_state->count, 0);
> qemu_sem_init(&multifd_recv_state->sem_sync, 0);
> multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
> @@ -1214,9 +1323,14 @@ int multifd_load_setup(Error **errp)
>
> qemu_mutex_init(&p->mutex);
> qemu_sem_init(&p->sem_sync, 0);
> + qemu_sem_init(&p->sem, 0);
> p->quit = false;
> + p->pending_job = 0;
> p->id = i;
>
> + p->data = g_new0(MultiFDRecvData, 1);
> + p->data->size = 0;
> +
> if (use_packets) {
> p->packet_len = sizeof(MultiFDPacket_t)
> + sizeof(uint64_t) * page_count;
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 406d42dbae..abaf16c3f2 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -13,6 +13,8 @@
> #ifndef QEMU_MIGRATION_MULTIFD_H
> #define QEMU_MIGRATION_MULTIFD_H
>
> +typedef struct MultiFDRecvData MultiFDRecvData;
> +
> int multifd_save_setup(Error **errp);
> void multifd_save_cleanup(void);
> int multifd_load_setup(Error **errp);
> @@ -24,6 +26,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> void multifd_recv_sync_main(void);
> int multifd_send_sync_main(QEMUFile *f);
> int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
> +int multifd_recv(void);
> +MultiFDRecvData *multifd_get_recv_data(void);
>
> /* Multifd Compression flags */
> #define MULTIFD_FLAG_SYNC (1 << 0)
> @@ -66,6 +70,13 @@ typedef struct {
> RAMBlock *block;
> } MultiFDPages_t;
>
> +struct MultiFDRecvData {
> + void *opaque;
> + size_t size;
> + /* for preadv */
> + off_t file_offset;
> +};
> +
> typedef struct {
> /* Fields are only written at creating/deletion time */
> /* No lock required for them, they are read only */
> @@ -156,6 +167,8 @@ typedef struct {
>
> /* syncs main thread and channels */
> QemuSemaphore sem_sync;
> + /* sem where to wait for more work */
> + QemuSemaphore sem;
>
> /* this mutex protects the following parameters */
> QemuMutex mutex;
> @@ -167,6 +180,13 @@ typedef struct {
> uint32_t flags;
> /* global number of generated multifd packets */
> uint64_t packet_num;
> + int pending_job;
> + /*
> + * The owner of 'data' depends of 'pending_job' value:
> + * pending_job == 0 -> migration_thread can use it.
> + * pending_job != 0 -> multifd_channel can use it.
> + */
> + MultiFDRecvData *data;
Right after the main thread assigns a chunk of memory to load for a recv
thread, the main thread's job is done, afaict. I don't see how a race
could happen here.
I'm not sure, but I _think_ if we rely on p->quit or something similar to
quit all recv threads, then this can be dropped?
>
> /* thread local variables. No locking required */
>
> --
> 2.35.3
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets
2024-01-16 8:10 ` Peter Xu
@ 2024-01-16 20:25 ` Fabiano Rosas
2024-01-19 0:20 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-16 20:25 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:26:00PM -0300, Fabiano Rosas wrote:
>> Currently multifd does not need to have knowledge of pages on the
>> receiving side because all the information needed is within the
>> packets that come in the stream.
>>
>> We're about to add support for fixed-ram migration, which cannot use
>> packets because it expects the ramblock section in the migration file
>> to contain only the guest pages data.
>>
>> Add a data structure to transfer pages between the ram migration code
>> and the multifd receiving threads.
>>
>> We don't want to reuse MultiFDPages_t for two reasons:
>>
>> a) multifd threads don't really need to know about the data they're
>> receiving.
>>
>> b) the receiving side has to be stopped to load the pages, which means
>> we can experiment with larger granularities than page size when
>> transferring data.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> - stopped using MultiFDPages_t and added a new structure which can
>> take offset + size
>> ---
>> migration/multifd.c | 122 ++++++++++++++++++++++++++++++++++++++++++--
>> migration/multifd.h | 20 ++++++++
>> 2 files changed, 138 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index c1381bdc21..7dfab2367a 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -142,17 +142,36 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
>> static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
>> {
>> uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
>> + ERRP_GUARD();
>>
>> if (flags != MULTIFD_FLAG_NOCOMP) {
>> error_setg(errp, "multifd %u: flags received %x flags expected %x",
>> p->id, flags, MULTIFD_FLAG_NOCOMP);
>> return -1;
>> }
>> - for (int i = 0; i < p->normal_num; i++) {
>> - p->iov[i].iov_base = p->host + p->normal[i];
>> - p->iov[i].iov_len = p->page_size;
>> +
>> + if (!migrate_multifd_packets()) {
>> + MultiFDRecvData *data = p->data;
>> + size_t ret;
>> +
>> + ret = qio_channel_pread(p->c, (char *) data->opaque,
>> + data->size, data->file_offset, errp);
>> + if (ret != data->size) {
>> + error_prepend(errp,
>> + "multifd recv (%u): read 0x%zx, expected 0x%zx",
>> + p->id, ret, data->size);
>> + return -1;
>> + }
>> +
>> + return 0;
>> + } else {
>> + for (int i = 0; i < p->normal_num; i++) {
>> + p->iov[i].iov_base = p->host + p->normal[i];
>> + p->iov[i].iov_len = p->page_size;
>> + }
>> +
>> + return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
>> }
>
> I guess you managed to squash the file loads into "no compression" handler
> of multifd, but IMHO it's not as clean.
>
> Firstly, to do so, we'd better make sure multifd-compression is not
> enabled anywhere together with fixed-ram. I didn't yet see such protection
> in the series. I think if it happens we should expect crashes because
> they'll go into zlib/zstd paths for the file.
Yes, we need some checks around this.
>
> IMHO the only model fixed-ram can share with multifd is the task management
> part, mutexes, semaphores, etc..
AFAIU, that's what multifd *is*. Compression would just be another
client of this task management code. This "nocomp" thing always felt off
to me.
> IIRC I used to mention that it'll be nice
> if we have simply a pool of threads so we can enqueue tasks.
Right, I don't disagree. However I don't think we can just show up with
a thread pool and start moving stuff into it. I think the safest way to
do this is to:
1- Adapt multifd so that the client code is solely responsible for the
data being sent. No data knowledge by the multifd thread.
With this, nothing should need to touch multifd threads code
anymore. New clients just define their methods and prepare the data
as they please.
2- Move everything that is left into multifd. Zero pages, postcopy, etc.
With 1 and 2 we'd have a pretty good picture of what kinds of operations
we need to do and what are the requirements for the thread
infrastructure.
3- Try to use existing abstractions within QEMU to replace
multifd. Write from scratch if none are suitable.
What do you think? We could put an action plan in place and start
picking at it. My main concern is about what sorts of hidden assumptions
are present in the current code that we'd start discovering if we just
replaced it with something new.
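For (1), what I picture is the thread only ever seeing an opaque payload
plus a method to act on it, something like (hypothetical names, just a
sketch):

    typedef struct {
        void *opaque;   /* owned and interpreted by the client only */
        /* called from the multifd thread to move the payload */
        int (*handle)(QIOChannel *c, void *opaque, Error **errp);
    } MultiFDJob;

The thread loop would then reduce to dequeue + job->handle() + signal
completion, with no knowledge of pages, packets or compression.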
> If that's too
> far away, would something like below closer to that? What I'm thinking:
>
> - patch 1: rename MultiFDMethods to MultiFDCompressMethods, this can
> replace the other patch to do s/recv_pages/recv_data/
>
> - patch 2: introduce MultiFDMethods (on top of MultiFDCompressMethods),
> refactor the current code to provide the socket version of MultiFDMethods.
>
> - patch 3: add the fixed-ram "file" version of MultiFDMethods
We also have zero page moving to multifd and compression accelerators
being developed. We need to take those into account. We might need an
ops structure that accounts for the current "phases" (setup, prepare,
recv, cleanup)[1], but within those also allow for composing arbitrary
data transformations.
(1)- there's no equivalent to send_prepare on dst and no equivalent to
recv_pages on src. We might need to add a recv_prepare and a send_pages
hook. The fixed-ram migration for instance would benefit from being able
to choose a different IO function to send data.
I'll send the patches moving zero pages to multifd once I find some
time, but another question I had was whether we should add zero page
detection as a new phase: setup - zero page detection - prepare - send -
cleanup.
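Roughly what I mean by the ops structure with the extra hooks (again
hypothetical, just to show the symmetry):

    typedef struct {
        /* source side */
        int (*send_setup)(MultiFDSendParams *p, Error **errp);
        int (*send_prepare)(MultiFDSendParams *p, Error **errp);
        int (*send_pages)(MultiFDSendParams *p, Error **errp);   /* new */
        void (*send_cleanup)(MultiFDSendParams *p, Error **errp);
        /* destination side */
        int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
        int (*recv_prepare)(MultiFDRecvParams *p, Error **errp); /* new */
        int (*recv_data)(MultiFDRecvParams *p, Error **errp);
        void (*recv_cleanup)(MultiFDRecvParams *p);
    } MultiFDMethods;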
> MultiFDCompressMethods doesn't need to be used at all for "file" version of
> MultiFDMethods.
>
> Would that work?
We definitely need _something_ to help us stop adding code to the middle
of multifd_send_thread every time there's a new feature.
>
>> - return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
>> }
>>
>> static MultiFDMethods multifd_nocomp_ops = {
>> @@ -989,6 +1008,7 @@ int multifd_save_setup(Error **errp)
>>
>> struct {
>> MultiFDRecvParams *params;
>> + MultiFDRecvData *data;
>
> (If the above works, maybe we can split MultiFDRecvParams into two chunks,
> one commonly used for both, one only for sockets?)
>
If we assume the use of packets in multifd is coupled to the socket
channel usage then yes. However, I suspect that what we might want is a
streaming migration vs. non-streaming migration abstraction. Because we
can still use packets with file migration after all.
>> /* number of created threads */
>> int count;
>> /* syncs main thread and channels */
>> @@ -999,6 +1019,49 @@ struct {
>> MultiFDMethods *ops;
>> } *multifd_recv_state;
>>
>> +int multifd_recv(void)
>> +{
>> + int i;
>> + static int next_recv_channel;
>> + MultiFDRecvParams *p = NULL;
>> + MultiFDRecvData *data = multifd_recv_state->data;
>> +
>> + /*
>> + * next_channel can remain from a previous migration that was
>> + * using more channels, so ensure it doesn't overflow if the
>> + * limit is lower now.
>> + */
>> + next_recv_channel %= migrate_multifd_channels();
>> + for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
>> + p = &multifd_recv_state->params[i];
>> +
>> + qemu_mutex_lock(&p->mutex);
>> + if (p->quit) {
>> + error_report("%s: channel %d has already quit!", __func__, i);
>> + qemu_mutex_unlock(&p->mutex);
>> + return -1;
>> + }
>> + if (!p->pending_job) {
>> + p->pending_job++;
>> + next_recv_channel = (i + 1) % migrate_multifd_channels();
>> + break;
>> + }
>> + qemu_mutex_unlock(&p->mutex);
>> + }
>> + assert(p->data->size == 0);
>> + multifd_recv_state->data = p->data;
>> + p->data = data;
>> + qemu_mutex_unlock(&p->mutex);
>> + qemu_sem_post(&p->sem);
>> +
>> + return 1;
>> +}
>
> PS: so if we have the pool model we can already mostly merge the above
> code with multifd_send_pages(), because this will be a common helper to
> enqueue a task to a pool, no matter whether it's for writing (to
> file/socket) or reading (only from file).
>
>> +
>> +MultiFDRecvData *multifd_get_recv_data(void)
>> +{
>> + return multifd_recv_state->data;
>> +}
>> +
>> static void multifd_recv_terminate_threads(Error *err)
>> {
>> int i;
>> @@ -1020,6 +1083,7 @@ static void multifd_recv_terminate_threads(Error *err)
>>
>> qemu_mutex_lock(&p->mutex);
>> p->quit = true;
>> + qemu_sem_post(&p->sem);
>> /*
>> * We could arrive here for two reasons:
>> * - normal quit, i.e. everything went fine, just finished
>> @@ -1069,6 +1133,7 @@ void multifd_load_cleanup(void)
>> p->c = NULL;
>> qemu_mutex_destroy(&p->mutex);
>> qemu_sem_destroy(&p->sem_sync);
>> + qemu_sem_destroy(&p->sem);
>> g_free(p->name);
>> p->name = NULL;
>> p->packet_len = 0;
>> @@ -1083,6 +1148,8 @@ void multifd_load_cleanup(void)
>> qemu_sem_destroy(&multifd_recv_state->sem_sync);
>> g_free(multifd_recv_state->params);
>> multifd_recv_state->params = NULL;
>> + g_free(multifd_recv_state->data);
>> + multifd_recv_state->data = NULL;
>> g_free(multifd_recv_state);
>> multifd_recv_state = NULL;
>> }
>> @@ -1094,6 +1161,21 @@ void multifd_recv_sync_main(void)
>> if (!migrate_multifd() || !migrate_multifd_packets()) {
>
> [1]
>
>> return;
>> }
>> +
>> + if (!migrate_multifd_packets()) {
>
> Hmm, isn't this checked already above at [1]? Could this path ever trigger
> then? Maybe we need to drop the one at [1]?
That was a rebase mistake.
>
> IIUC what you wanted to do here is to rely on the last RAM_SAVE_FLAG_EOS in
> the image file to do a full flush to make sure all pages are loaded.
>
Bear with me if I take it slow with everything below here. The practical
effect of changing any of these is to cause the threads to go out of sync,
which results in a time-dependent bug where memory is not properly
migrated and just comes up as an assert in check_guests_ram() in the
tests. It's hard to reproduce and has taken me whole weeks to debug
before.
> You may want to be careful about the side effect of the
> flush_after_each_section parameter:
>
> case RAM_SAVE_FLAG_EOS:
> /* normal exit */
> if (migrate_multifd() &&
> migrate_multifd_flush_after_each_section()) {
> multifd_recv_sync_main();
> }
>
> You may want to always flush for file?
The next patch restricts the setting of flush_after_each_section.
>
>> + for (i = 0; i < migrate_multifd_channels(); i++) {
>> + MultiFDRecvParams *p = &multifd_recv_state->params[i];
>> +
>> + qemu_sem_post(&p->sem);
>> + qemu_sem_wait(&p->sem_sync);
>> +
>> + qemu_mutex_lock(&p->mutex);
>> + assert(!p->pending_job || p->quit);
>> + qemu_mutex_unlock(&p->mutex);
>> + }
>> + return;
>
> Btw, how does this kick off all the recv threads? Is it because you did a
> sem_post(&sem) with p->pending_job==false this time?
Yes, when the last piece of memory is received, the thread will loop
around and hang at qemu_sem_wait(&p->sem):
} else {
/*
* No packets, so we need to wait for the vmstate code to
* give us work.
*/
qemu_sem_wait(&p->sem);
qemu_mutex_lock(&p->mutex);
if (!p->pending_job) {
qemu_mutex_unlock(&p->mutex);
break;
}
has_data = !!p->data->size;
}
So here we post p->sem one last time so the thread can see
p->pending_job == false and proceed to signal that it has finished:
if (!use_packets) {
qemu_sem_post(&p->sem_sync);
}
> Maybe it's clearer to just set p->quit (or a global quit knob) somewhere?
> That'll make it clear that this is a one-shot thing, only needed at the
> end of the file incoming migration.
>
Maybe I'm not following you, but the thread needs to check
p->pending_job before it knows there's no more work. And it can only do
that if the migration thread releases p->sem. Do you mean setting
p->quit on the thread instead of posting sem_sync? That's racy, I think.
>> + }
>> +
>> for (i = 0; i < migrate_multifd_channels(); i++) {
>> MultiFDRecvParams *p = &multifd_recv_state->params[i];
>>
>> @@ -1156,6 +1238,18 @@ static void *multifd_recv_thread(void *opaque)
>>
>> p->total_normal_pages += p->normal_num;
>> has_data = !!p->normal_num;
>> + } else {
>> + /*
>> + * No packets, so we need to wait for the vmstate code to
>> + * give us work.
>> + */
>> + qemu_sem_wait(&p->sem);
>> + qemu_mutex_lock(&p->mutex);
>> + if (!p->pending_job) {
>> + qemu_mutex_unlock(&p->mutex);
>> + break;
>> + }
>> + has_data = !!p->data->size;
>> }
>>
>> qemu_mutex_unlock(&p->mutex);
>> @@ -1171,6 +1265,17 @@ static void *multifd_recv_thread(void *opaque)
>> qemu_sem_post(&multifd_recv_state->sem_sync);
>> qemu_sem_wait(&p->sem_sync);
>> }
>> +
>> + if (!use_packets) {
>> + qemu_mutex_lock(&p->mutex);
>> + p->data->size = 0;
>> + p->pending_job--;
>> + qemu_mutex_unlock(&p->mutex);
>> + }
>> + }
>> +
>> + if (!use_packets) {
>> + qemu_sem_post(&p->sem_sync);
>
> Currently sem_sync is only posted with the MULTIFD_FLAG_SYNC flag. We'd
> better be careful about reusing it.
>
> Maybe add some comment above recv_state->sem_sync?
>
> /*
> * For sockets: this is posted once for each MULTIFD_FLAG_SYNC flag.
> *
> * For files: this is only posted at the end of the file load to mark
> * completion of the load process.
> */
>
Sure. I would rename it to sem_done if I could, but we already went
through that.
>> }
>>
>> if (local_err) {
>> @@ -1205,6 +1310,10 @@ int multifd_load_setup(Error **errp)
>> thread_count = migrate_multifd_channels();
>> multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
>> multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
>> +
>> + multifd_recv_state->data = g_new0(MultiFDRecvData, 1);
>> + multifd_recv_state->data->size = 0;
>> +
>> qatomic_set(&multifd_recv_state->count, 0);
>> qemu_sem_init(&multifd_recv_state->sem_sync, 0);
>> multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
>> @@ -1214,9 +1323,14 @@ int multifd_load_setup(Error **errp)
>>
>> qemu_mutex_init(&p->mutex);
>> qemu_sem_init(&p->sem_sync, 0);
>> + qemu_sem_init(&p->sem, 0);
>> p->quit = false;
>> + p->pending_job = 0;
>> p->id = i;
>>
>> + p->data = g_new0(MultiFDRecvData, 1);
>> + p->data->size = 0;
>> +
>> if (use_packets) {
>> p->packet_len = sizeof(MultiFDPacket_t)
>> + sizeof(uint64_t) * page_count;
>> diff --git a/migration/multifd.h b/migration/multifd.h
>> index 406d42dbae..abaf16c3f2 100644
>> --- a/migration/multifd.h
>> +++ b/migration/multifd.h
>> @@ -13,6 +13,8 @@
>> #ifndef QEMU_MIGRATION_MULTIFD_H
>> #define QEMU_MIGRATION_MULTIFD_H
>>
>> +typedef struct MultiFDRecvData MultiFDRecvData;
>> +
>> int multifd_save_setup(Error **errp);
>> void multifd_save_cleanup(void);
>> int multifd_load_setup(Error **errp);
>> @@ -24,6 +26,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>> void multifd_recv_sync_main(void);
>> int multifd_send_sync_main(QEMUFile *f);
>> int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
>> +int multifd_recv(void);
>> +MultiFDRecvData *multifd_get_recv_data(void);
>>
>> /* Multifd Compression flags */
>> #define MULTIFD_FLAG_SYNC (1 << 0)
>> @@ -66,6 +70,13 @@ typedef struct {
>> RAMBlock *block;
>> } MultiFDPages_t;
>>
>> +struct MultiFDRecvData {
>> + void *opaque;
>> + size_t size;
>> + /* for preadv */
>> + off_t file_offset;
>> +};
>> +
>> typedef struct {
>> /* Fields are only written at creating/deletion time */
>> /* No lock required for them, they are read only */
>> @@ -156,6 +167,8 @@ typedef struct {
>>
>> /* syncs main thread and channels */
>> QemuSemaphore sem_sync;
>> + /* sem where to wait for more work */
>> + QemuSemaphore sem;
>>
>> /* this mutex protects the following parameters */
>> QemuMutex mutex;
>> @@ -167,6 +180,13 @@ typedef struct {
>> uint32_t flags;
>> /* global number of generated multifd packets */
>> uint64_t packet_num;
>> + int pending_job;
>> + /*
>> + * The owner of 'data' depends of 'pending_job' value:
>> + * pending_job == 0 -> migration_thread can use it.
>> + * pending_job != 0 -> multifd_channel can use it.
>> + */
>> + MultiFDRecvData *data;
>
> Right after the main thread assigns a chunk of memory to load for a recv
> thread, the main thread's job is done, afaict. I don't see how a race
> could happen here.
>
> I'm not sure, but I _think_ if we rely on p->quit or something similar to
> quit all recv threads, then this can be dropped?
>
We still need to know whether a channel is in use so we can skip to the
next.
>>
>> /* thread local variables. No locking required */
>>
>> --
>> 2.35.3
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets
2024-01-16 20:25 ` Fabiano Rosas
@ 2024-01-19 0:20 ` Peter Xu
2024-01-19 12:57 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-19 0:20 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Tue, Jan 16, 2024 at 05:25:03PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Mon, Nov 27, 2023 at 05:26:00PM -0300, Fabiano Rosas wrote:
> >> Currently multifd does not need to have knowledge of pages on the
> >> receiving side because all the information needed is within the
> >> packets that come in the stream.
> >>
> >> We're about to add support for fixed-ram migration, which cannot use
> >> packets because it expects the ramblock section in the migration file
> >> to contain only the guest pages data.
> >>
> >> Add a data structure to transfer pages between the ram migration code
> >> and the multifd receiving threads.
> >>
> >> We don't want to reuse MultiFDPages_t for two reasons:
> >>
> >> a) multifd threads don't really need to know about the data they're
> >> receiving.
> >>
> >> b) the receiving side has to be stopped to load the pages, which means
> >> we can experiment with larger granularities than page size when
> >> transferring data.
> >>
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >> ---
> >> - stopped using MultiFDPages_t and added a new structure which can
> >> take offset + size
> >> ---
> >> migration/multifd.c | 122 ++++++++++++++++++++++++++++++++++++++++++--
> >> migration/multifd.h | 20 ++++++++
> >> 2 files changed, 138 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/migration/multifd.c b/migration/multifd.c
> >> index c1381bdc21..7dfab2367a 100644
> >> --- a/migration/multifd.c
> >> +++ b/migration/multifd.c
> >> @@ -142,17 +142,36 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
> >> static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
> >> {
> >> uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> >> + ERRP_GUARD();
> >>
> >> if (flags != MULTIFD_FLAG_NOCOMP) {
> >> error_setg(errp, "multifd %u: flags received %x flags expected %x",
> >> p->id, flags, MULTIFD_FLAG_NOCOMP);
> >> return -1;
> >> }
> >> - for (int i = 0; i < p->normal_num; i++) {
> >> - p->iov[i].iov_base = p->host + p->normal[i];
> >> - p->iov[i].iov_len = p->page_size;
> >> +
> >> + if (!migrate_multifd_packets()) {
> >> + MultiFDRecvData *data = p->data;
> >> + size_t ret;
> >> +
> >> + ret = qio_channel_pread(p->c, (char *) data->opaque,
> >> + data->size, data->file_offset, errp);
> >> + if (ret != data->size) {
> >> + error_prepend(errp,
> >> + "multifd recv (%u): read 0x%zx, expected 0x%zx",
> >> + p->id, ret, data->size);
> >> + return -1;
> >> + }
> >> +
> >> + return 0;
> >> + } else {
> >> + for (int i = 0; i < p->normal_num; i++) {
> >> + p->iov[i].iov_base = p->host + p->normal[i];
> >> + p->iov[i].iov_len = p->page_size;
> >> + }
> >> +
> >> + return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
> >> }
> >
> > I guess you managed to squash the file loads into "no compression" handler
> > of multifd, but IMHO it's not as clean.
> >
> > Firstly, to do so, we'd better make sure multifd-compression is not
> > enabled anywhere together with fixed-ram. I didn't yet see such protection
> > in the series. I think if it happens we should expect crashes because
> > they'll go into zlib/zstd paths for the file.
>
> Yes, we need some checks around this.
>
> >
> > IMHO the only model fixed-ram can share with multifd is the task management
> > part, mutexes, semaphores, etc..
>
> AFAIU, that's what multifd *is*. Compression would just be another
> client of this task management code. This "nocomp" thing always felt off
> to me.
>
> > IIRC I used to mention that it'll be nice
> > if we have simply a pool of threads so we can enqueue tasks.
>
> Right, I don't disagree. However I don't think we can just show up with
> a thread pool and start moving stuff into it. I think the safest way to
> do this is to:
>
> 1- Adapt multifd so that the client code is solely responsible for the
> data being sent. No data knowledge by the multifd thread.
>
> With this, nothing should need to touch multifd threads code
> anymore. New clients just define their methods and prepare the data
> as they please.
>
> 2- Move everything that is left into multifd. Zero pages, postcopy, etc.
>
> With 1 and 2 we'd have a pretty good picture of what kinds of operations
> we need to do and what are the requirements for the thread
> infrastructure.
>
> 3- Try to use existing abstractions within QEMU to replace
> multifd. Write from scratch if none are suitable.
>
> What do you think? We could put an action plan in place and start
> picking at it. My main concern is about what sorts of hidden assumptions
> are present in the current code that we'd start discovering if we just
> replaced it with something new.
Your plan sounds good. Generalization (3) can happen even before (2), IMHO.
I suppose you already have the wiki account working now; would you please
add an entry to the todo page with all these thoughts?
https://wiki.qemu.org/ToDo/LiveMigration
You can also mention that you plan to look into it if you're taking the
lead, so people know it's in progress.
It can be under "cleanups" I assume.
>
> > If that's too
> > far away, would something like below be closer to that? What I'm thinking:
> >
> > - patch 1: rename MultiFDMethods to MultiFDCompressMethods, this can
> > replace the other patch to do s/recv_pages/recv_data/
> >
> > - patch 2: introduce MultiFDMethods (on top of MultiFDCompressMethods),
> > refactor the current code to provide the socket version of MultiFDMethods.
> >
> > - patch 3: add the fixed-ram "file" version of MultiFDMethods
>
> We also have zero page moving to multifd and compression accelerators
> being developed. We need to take those into account. We might need an
> ops structure that accounts for the current "phases" (setup, prepare,
> recv, cleanup)[1], but within those also allow for composing arbitrary
> data transformations.
>
> (1)- there's no equivalent to send_prepare on dst and no equivalent to
> recv_pages on src. We might need to add a recv_prepare and a send_pages
> hook. The fixed-ram migration for instance would benefit from being able
> to choose a different IO function to send data.
>
> I'll send the patches moving zero pages to multifd once I find some
> time, but another question I had was whether we should add zero page
> detection as a new phase: setup - zero page detection - prepare - send -
> cleanup.
As you know I haven't yet followed those threads; I only have a rough
memory of the zero page movement, but that may be obsolete and I'll need
to read what you have. I agree all of these are multifd-relevant, and we
should consider them.
Now the question is, even if we have a good thread model for multifd,
whether file operations should be put into the compression layer (what you
already did in this patch) or moved out of it. Libvirt supports compression
on images, so I assume file operations shouldn't need to rely on
compression; from that pov I think maybe it's better we leave all the
compression stuff alone from the file: perspective.
>
> > MultiFDCompressMethods doesn't need to be used at all for "file" version of
> > MultiFDMethods.
> >
> > Would that work?
>
> We definitely need _something_ to help us stop adding code to the middle
> of multifd_send_thread every time there's a new feature.
>
> >
> >> - return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
> >> }
> >>
> >> static MultiFDMethods multifd_nocomp_ops = {
> >> @@ -989,6 +1008,7 @@ int multifd_save_setup(Error **errp)
> >>
> >> struct {
> >> MultiFDRecvParams *params;
> >> + MultiFDRecvData *data;
> >
> > (If the above works, maybe we can split MultiFDRecvParams into two chunks,
> > one commonly used for both, one only for sockets?)
> >
>
> If we assume the use of packets in multifd is coupled to the socket
> channel usage then yes. However, I suspect that what we might want is a
> streaming migration vs. non-streaming migration abstraction. Because we
> can still use packets with file migration after all.
>
> >> /* number of created threads */
> >> int count;
> >> /* syncs main thread and channels */
> >> @@ -999,6 +1019,49 @@ struct {
> >> MultiFDMethods *ops;
> >> } *multifd_recv_state;
> >>
> >> +int multifd_recv(void)
> >> +{
> >> + int i;
> >> + static int next_recv_channel;
> >> + MultiFDRecvParams *p = NULL;
> >> + MultiFDRecvData *data = multifd_recv_state->data;
> >> +
> >> + /*
> >> + * next_channel can remain from a previous migration that was
> >> + * using more channels, so ensure it doesn't overflow if the
> >> + * limit is lower now.
> >> + */
> >> + next_recv_channel %= migrate_multifd_channels();
> >> + for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
> >> + p = &multifd_recv_state->params[i];
> >> +
> >> + qemu_mutex_lock(&p->mutex);
> >> + if (p->quit) {
> >> + error_report("%s: channel %d has already quit!", __func__, i);
> >> + qemu_mutex_unlock(&p->mutex);
> >> + return -1;
> >> + }
> >> + if (!p->pending_job) {
> >> + p->pending_job++;
> >> + next_recv_channel = (i + 1) % migrate_multifd_channels();
> >> + break;
> >> + }
> >> + qemu_mutex_unlock(&p->mutex);
> >> + }
> >> + assert(p->data->size == 0);
> >> + multifd_recv_state->data = p->data;
> >> + p->data = data;
> >> + qemu_mutex_unlock(&p->mutex);
> >> + qemu_sem_post(&p->sem);
> >> +
> >> + return 1;
> >> +}
> >
> > PS: so if we have the pool model we can already mostly merge the above
> > code with multifd_send_pages(), because this will be a common helper to
> > enqueue a task to a pool, no matter whether it's for writing (to
> > file/socket) or reading (only from file).
> >
> >> +
> >> +MultiFDRecvData *multifd_get_recv_data(void)
> >> +{
> >> + return multifd_recv_state->data;
> >> +}
> >> +
> >> static void multifd_recv_terminate_threads(Error *err)
> >> {
> >> int i;
> >> @@ -1020,6 +1083,7 @@ static void multifd_recv_terminate_threads(Error *err)
> >>
> >> qemu_mutex_lock(&p->mutex);
> >> p->quit = true;
> >> + qemu_sem_post(&p->sem);
> >> /*
> >> * We could arrive here for two reasons:
> >> * - normal quit, i.e. everything went fine, just finished
> >> @@ -1069,6 +1133,7 @@ void multifd_load_cleanup(void)
> >> p->c = NULL;
> >> qemu_mutex_destroy(&p->mutex);
> >> qemu_sem_destroy(&p->sem_sync);
> >> + qemu_sem_destroy(&p->sem);
> >> g_free(p->name);
> >> p->name = NULL;
> >> p->packet_len = 0;
> >> @@ -1083,6 +1148,8 @@ void multifd_load_cleanup(void)
> >> qemu_sem_destroy(&multifd_recv_state->sem_sync);
> >> g_free(multifd_recv_state->params);
> >> multifd_recv_state->params = NULL;
> >> + g_free(multifd_recv_state->data);
> >> + multifd_recv_state->data = NULL;
> >> g_free(multifd_recv_state);
> >> multifd_recv_state = NULL;
> >> }
> >> @@ -1094,6 +1161,21 @@ void multifd_recv_sync_main(void)
> >> if (!migrate_multifd() || !migrate_multifd_packets()) {
> >
> > [1]
> >
> >> return;
> >> }
> >> +
> >> + if (!migrate_multifd_packets()) {
> >
> > Hmm, isn't this checked already above at [1]? Could this path ever trigger
> > then? Maybe we need to drop the one at [1]?
>
> That was a rebase mistake.
>
> >
> > IIUC what you wanted to do here is relying on the last RAM_SAVE_FLAG_EOS in
> > the image file to do a full flush to make sure all pages are loaded.
> >
>
> Bear with me if I take it slow with everything below here. The practical
> effect of changing any of these is causing the threads to go out of sync,
> which results in a time-dependent bug where memory is not properly
> migrated, which just comes up as an assert in check_guests_ram() in the
> tests. It gets hard to reproduce and has taken me whole weeks to debug
> before.
Per-iteration sync_main is needed for file, but only on the sender side,
IIUC.
Actually it's not needed at all for your use case of moving the VM to a
file, because there we could simply stop the VM first. But to stay
compatible with Libvirt's live snapshot use case on Windows, we assume the
VM can keep running, and then yes, sync_main is needed at least on src,
because otherwise the same page can be queued more than once on different
threads, and it's not guaranteed that the latest version of the page will
always land last.
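To illustrate the hazard (a made-up interleaving):

    iteration N:   page P, old contents, queued on channel 0
    iteration N+1: page P, new contents, queued on channel 1
    channel 1 happens to write first, channel 0 writes last

With fixed-ram both writes land at the same file offset, so whichever
channel writes last wins, and in this ordering the stale copy of P ends up
in the file; the per-iteration sync on src is what rules that ordering out.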
That said, I don't think it's needed on the recver side? For both use
cases (either "move to file" or "take a snapshot"), the loader is the same
process that reads data from the file and relaunches the VM. In that case
there's no need to sync.
For socket-based multifd, the sync_main on the recver side is only
triggered when a RAM_SAVE_FLAG_MULTIFD_FLUSH packet is received, on the 9.0
machine type. And you'll also notice you don't even have that for file: URI
multifd migrations, right?
When you said you hit a bug, did you have the sender-side sync_main in
place, or did you miss both? I would expect the bug was triggered because
you missed the sync_main on src, not on dest. For dest, IMHO we only need
a last-phase sync to make sure all RAM is loaded before we relaunch the VM.
>
> > You may want to be careful on the side effect of flush_after_each_section
> > parameter:
> >
> > case RAM_SAVE_FLAG_EOS:
> >     /* normal exit */
> >     if (migrate_multifd() &&
> >         migrate_multifd_flush_after_each_section()) {
> >         multifd_recv_sync_main();
> >     }
> >
> > You may want to flush always for file?
>
> Next patch restricts the setting of flush_after_each_section.
>
> >
> >> + for (i = 0; i < migrate_multifd_channels(); i++) {
> >> + MultiFDRecvParams *p = &multifd_recv_state->params[i];
> >> +
> >> + qemu_sem_post(&p->sem);
> >> + qemu_sem_wait(&p->sem_sync);
> >> +
> >> + qemu_mutex_lock(&p->mutex);
> >> + assert(!p->pending_job || p->quit);
> >> + qemu_mutex_unlock(&p->mutex);
> >> + }
> >> + return;
> >
> > Btw, how does this kick off all the recv threads? Is it because you did a
> > sem_post(&sem) with p->pending_job==false this time?
>
> Yes, when the last piece of memory is received, the thread will loop
> around and hang at qemu_sem_wait(&p->sem):
>
> } else {
>     /*
>      * No packets, so we need to wait for the vmstate code to
>      * give us work.
>      */
>     qemu_sem_wait(&p->sem);
>     qemu_mutex_lock(&p->mutex);
>     if (!p->pending_job) {
>         qemu_mutex_unlock(&p->mutex);
>         break;
>     }
>     has_data = !!p->data->size;
> }
>
> So here we release the p->sem one last time so the thread can see
> p->pending_job = false and proceed to inform it has finished:
>
> if (!use_packets) {
>     qemu_sem_post(&p->sem_sync);
> }
(see below)
>
> > Maybe it's clearer to just set p->quit (or a global quit knob) somewhere?
> > That'll be clear that this is a one-shot thing, only needed at the end of
> > the file incoming migration.
> >
>
> Maybe I'm not following you, but the thread needs to check
> p->pending_job before it knows there's no more work. And it can only do
> that if the migration thread releases p->sem. Do you mean setting
> p->quit on the thread instead of posting sem_sync? That's racy I think.
I want to make the quit event not rely on pending_job. I think we should
allow pending_job==false, in which case the thread should just sleep again.
That would also match the recver side with the sender; I remember we just
reworked the latter to reference the global "exiting" flag:
multifd_send_thread():
    ...
    qemu_sem_wait(&p->sem);
    if (qatomic_read(&multifd_send_state->exiting)) {
        break;
    }
    ...
Something like that.
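On the recv side that could become something like this (only a sketch; the
"exiting" flag in multifd_recv_state doesn't exist yet, it would mirror the
sender's):

multifd_recv_thread():
    ...
    qemu_sem_wait(&p->sem);
    if (qatomic_read(&multifd_recv_state->exiting)) {
        break;
    }
    qemu_mutex_lock(&p->mutex);
    has_data = !!p->data->size;
    qemu_mutex_unlock(&p->mutex);
    ...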
>
> >> + }
> >> +
> >> for (i = 0; i < migrate_multifd_channels(); i++) {
> >> MultiFDRecvParams *p = &multifd_recv_state->params[i];
> >>
> >> @@ -1156,6 +1238,18 @@ static void *multifd_recv_thread(void *opaque)
> >>
> >> p->total_normal_pages += p->normal_num;
> >> has_data = !!p->normal_num;
> >> + } else {
> >> + /*
> >> + * No packets, so we need to wait for the vmstate code to
> >> + * give us work.
> >> + */
> >> + qemu_sem_wait(&p->sem);
> >> + qemu_mutex_lock(&p->mutex);
> >> + if (!p->pending_job) {
> >> + qemu_mutex_unlock(&p->mutex);
> >> + break;
> >> + }
> >> + has_data = !!p->data->size;
> >> }
> >>
> >> qemu_mutex_unlock(&p->mutex);
> >> @@ -1171,6 +1265,17 @@ static void *multifd_recv_thread(void *opaque)
> >> qemu_sem_post(&multifd_recv_state->sem_sync);
> >> qemu_sem_wait(&p->sem_sync);
> >> }
> >> +
> >> + if (!use_packets) {
> >> + qemu_mutex_lock(&p->mutex);
> >> + p->data->size = 0;
> >> + p->pending_job--;
> >> + qemu_mutex_unlock(&p->mutex);
> >> + }
> >> + }
> >> +
> >> + if (!use_packets) {
> >> + qemu_sem_post(&p->sem_sync);
> >
> > Currently sem_sync is only posted with MULTIFD_FLAG_SYNC flag. We'd better
> > be careful on reusing it.
> >
> > Maybe add some comment above recv_state->sem_sync?
> >
> > /*
> >  * For sockets: this is posted once for each MULTIFD_FLAG_SYNC flag.
> >  *
> >  * For files: this is only posted at the end of the file load to mark
> >  * completion of the load process.
> >  */
> >
>
> Sure. I would rename it to sem_done if I could, but we already went
> through that.
>
> >> }
> >>
> >> if (local_err) {
> >> @@ -1205,6 +1310,10 @@ int multifd_load_setup(Error **errp)
> >> thread_count = migrate_multifd_channels();
> >> multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
> >> multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
> >> +
> >> + multifd_recv_state->data = g_new0(MultiFDRecvData, 1);
> >> + multifd_recv_state->data->size = 0;
> >> +
> >> qatomic_set(&multifd_recv_state->count, 0);
> >> qemu_sem_init(&multifd_recv_state->sem_sync, 0);
> >> multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
> >> @@ -1214,9 +1323,14 @@ int multifd_load_setup(Error **errp)
> >>
> >> qemu_mutex_init(&p->mutex);
> >> qemu_sem_init(&p->sem_sync, 0);
> >> + qemu_sem_init(&p->sem, 0);
> >> p->quit = false;
> >> + p->pending_job = 0;
> >> p->id = i;
> >>
> >> + p->data = g_new0(MultiFDRecvData, 1);
> >> + p->data->size = 0;
> >> +
> >> if (use_packets) {
> >> p->packet_len = sizeof(MultiFDPacket_t)
> >> + sizeof(uint64_t) * page_count;
> >> diff --git a/migration/multifd.h b/migration/multifd.h
> >> index 406d42dbae..abaf16c3f2 100644
> >> --- a/migration/multifd.h
> >> +++ b/migration/multifd.h
> >> @@ -13,6 +13,8 @@
> >> #ifndef QEMU_MIGRATION_MULTIFD_H
> >> #define QEMU_MIGRATION_MULTIFD_H
> >>
> >> +typedef struct MultiFDRecvData MultiFDRecvData;
> >> +
> >> int multifd_save_setup(Error **errp);
> >> void multifd_save_cleanup(void);
> >> int multifd_load_setup(Error **errp);
> >> @@ -24,6 +26,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> >> void multifd_recv_sync_main(void);
> >> int multifd_send_sync_main(QEMUFile *f);
> >> int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
> >> +int multifd_recv(void);
> >> +MultiFDRecvData *multifd_get_recv_data(void);
> >>
> >> /* Multifd Compression flags */
> >> #define MULTIFD_FLAG_SYNC (1 << 0)
> >> @@ -66,6 +70,13 @@ typedef struct {
> >> RAMBlock *block;
> >> } MultiFDPages_t;
> >>
> >> +struct MultiFDRecvData {
> >> + void *opaque;
> >> + size_t size;
> >> + /* for preadv */
> >> + off_t file_offset;
> >> +};
> >> +
> >> typedef struct {
> >> /* Fields are only written at creating/deletion time */
> >> /* No lock required for them, they are read only */
> >> @@ -156,6 +167,8 @@ typedef struct {
> >>
> >> /* syncs main thread and channels */
> >> QemuSemaphore sem_sync;
> >> + /* sem where to wait for more work */
> >> + QemuSemaphore sem;
> >>
> >> /* this mutex protects the following parameters */
> >> QemuMutex mutex;
> >> @@ -167,6 +180,13 @@ typedef struct {
> >> uint32_t flags;
> >> /* global number of generated multifd packets */
> >> uint64_t packet_num;
> >> + int pending_job;
> >> + /*
> >> + * The owner of 'data' depends of 'pending_job' value:
> >> + * pending_job == 0 -> migration_thread can use it.
> >> + * pending_job != 0 -> multifd_channel can use it.
> >> + */
> >> + MultiFDRecvData *data;
> >
> > Right after the main thread assigns a chunk of memory to load for a recv
> > thread, the main thread job done, afaict. I don't see how a race could
> > happen here.
> >
> > I'm not sure, but I _think_ if we rely on p->quit or something similar to
> > quit all recv threads, then this can be dropped?
> >
>
> We still need to know whether a channel is in use so we can skip to the
> next.
Oh, yes.
>
> >>
> >> /* thread local variables. No locking required */
> >>
> >> --
> >> 2.35.3
> >>
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets
2024-01-19 0:20 ` Peter Xu
@ 2024-01-19 12:57 ` Fabiano Rosas
0 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-19 12:57 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Tue, Jan 16, 2024 at 05:25:03PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 05:26:00PM -0300, Fabiano Rosas wrote:
>> >> Currently multifd does not need to have knowledge of pages on the
>> >> receiving side because all the information needed is within the
>> >> packets that come in the stream.
>> >>
>> >> We're about to add support to fixed-ram migration, which cannot use
>> >> packets because it expects the ramblock section in the migration file
>> >> to contain only the guest pages data.
>> >>
>> >> Add a data structure to transfer pages between the ram migration code
>> >> and the multifd receiving threads.
>> >>
>> >> We don't want to reuse MultiFDPages_t for two reasons:
>> >>
>> >> a) multifd threads don't really need to know about the data they're
>> >> receiving.
>> >>
>> >> b) the receiving side has to be stopped to load the pages, which means
>> >> we can experiment with larger granularities than page size when
>> >> transferring data.
>> >>
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >> ---
>> >> - stopped using MultiFDPages_t and added a new structure which can
>> >> take offset + size
>> >> ---
>> >> migration/multifd.c | 122 ++++++++++++++++++++++++++++++++++++++++++--
>> >> migration/multifd.h | 20 ++++++++
>> >> 2 files changed, 138 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/migration/multifd.c b/migration/multifd.c
>> >> index c1381bdc21..7dfab2367a 100644
>> >> --- a/migration/multifd.c
>> >> +++ b/migration/multifd.c
>> >> @@ -142,17 +142,36 @@ static void nocomp_recv_cleanup(MultiFDRecvParams *p)
>> >> static int nocomp_recv_data(MultiFDRecvParams *p, Error **errp)
>> >> {
>> >> uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
>> >> + ERRP_GUARD();
>> >>
>> >> if (flags != MULTIFD_FLAG_NOCOMP) {
>> >> error_setg(errp, "multifd %u: flags received %x flags expected %x",
>> >> p->id, flags, MULTIFD_FLAG_NOCOMP);
>> >> return -1;
>> >> }
>> >> - for (int i = 0; i < p->normal_num; i++) {
>> >> - p->iov[i].iov_base = p->host + p->normal[i];
>> >> - p->iov[i].iov_len = p->page_size;
>> >> +
>> >> + if (!migrate_multifd_packets()) {
>> >> + MultiFDRecvData *data = p->data;
>> >> + size_t ret;
>> >> +
>> >> + ret = qio_channel_pread(p->c, (char *) data->opaque,
>> >> + data->size, data->file_offset, errp);
>> >> + if (ret != data->size) {
>> >> + error_prepend(errp,
>> >> + "multifd recv (%u): read 0x%zx, expected 0x%zx",
>> >> + p->id, ret, data->size);
>> >> + return -1;
>> >> + }
>> >> +
>> >> + return 0;
>> >> + } else {
>> >> + for (int i = 0; i < p->normal_num; i++) {
>> >> + p->iov[i].iov_base = p->host + p->normal[i];
>> >> + p->iov[i].iov_len = p->page_size;
>> >> + }
>> >> +
>> >> + return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
>> >> }
>> >
>> > I guess you managed to squash the file loads into "no compression" handler
>> > of multifd, but IMHO it's not as clean.
>> >
>> > Firstly, if to do so, we'd better make sure multifd-compression is not
>> > enabled anywhere together with fixed-ram. I didn't yet see such protection
>> > in the series. I think if it happens we should expect crashes because
>> > they'll go into zlib/zstd paths for the file.
>>
>> Yes, we need some checks around this.
>>
>> >
>> > IMHO the only model fixed-ram can share with multifd is the task management
>> > part, mutexes, semaphores, etc..
>>
>> AFAIU, that's what multifd *is*. Compression would just be another
>> client of this task management code. This "nocomp" thing always felt off
>> to me.
>>
>> > IIRC I used to mention that it'll be nice
>> > if we have simply a pool of threads so we can enqueue tasks.
>>
>> Right, I don't disagree. However I don't think we can just show up with
>> a thread pool and start moving stuff into it. I think the safest way to
>> do this is to:
>>
>> 1- Adapt multifd so that the client code is the sole responsible for the
>> data being sent. No data knowledge by the multifd thread.
>>
>> With this, nothing should need to touch multifd threads code
>> anymore. New clients just define their methods and prepare the data
>> as they please.
>>
>> 2- Move everything that is left into multifd. Zero pages, postcopy, etc.
>>
>> With 1 and 2 we'd have a pretty good picture of what kinds of operations
>> we need to do and what are the requirements for the thread
>> infrastructure.
>>
>> 3- Try to use existing abstractions within QEMU to replace
>> multifd. Write from scratch if none are suitable.
>>
>> What do you think? We could put an action plan in place and start
>> picking at it. My main concern is about what sorts of hidden assumptions
>> are present in the current code that we'd start discovering if we just
>> replaced it with something new.
>
> You plan sounds good. Generalization (3) can happen even before (2), IMHO.
>
> I suppose you already have the wiki account working now, would you please
> add an entry into the todo page, with all these thoughts?
>
> https://wiki.qemu.org/ToDo/LiveMigration
>
> You can also mention you plan to look into it if you're taking the lead,
> then people know it's in progress.
>
> It can be under "cleanups" I assume.
>
>>
>> > If that's too
>> > far away, would something like below closer to that? What I'm thinking:
>> >
>> > - patch 1: rename MultiFDMethods to MultiFDCompressMethods, this can
>> > replace the other patch to do s/recv_pages/recv_data/
>> >
>> > - patch 2: introduce MultiFDMethods (on top of MultiFDCompressMethods),
>> > refactor the current code to provide the socket version of MultiFDMethods.
>> >
>> > - patch 3: add the fixed-ram "file" version of MultiFDMethods
>>
>> We also have zero page moving to multifd and compression accelerators
>> being developed. We need to take those into account. We might need an
>> ops structure that accounts for the current "phases" (setup, prepare,
>> recv, cleanup)[1], but within those also allow for composing arbitrary
>> data transformations.
>>
>> (1)- there's no equivalent to send_prepare on dst and no equivalent to
>> recv_pages on src. We might need to add a recv_prepare and a send_pages
>> hook. The fixed-ram migration for instance would benefit from being able
>> to choose a different IO function to send data.
>>
>> I'll send the patches moving zero pages to multifd once I find some
>> time, but another question I had was whether we should add zero page
>> detection as a new phase: setup - zero page detection - prepare - send -
>> cleanup.
>
> As you know I haven't followed those threads yet; I only have a rough
> memory of the zero page movement, which may be obsolete, so I'll need to
> read what you have. I agree all of these are multifd-relevant, and we
> should consider them.
>
> Now the question is, even if we do get a good thread model for multifd,
> whether file operations should be put into the compression layer (what you
> already did in this patch) or moved out of it. Libvirt supports compression
> on images, so I assume file operations shouldn't need to rely on
> compression; from that pov I think it may be better to leave all the
> compression stuff alone from the file: perspective.
>
I'm thinking we could maybe divide multifd into phases (very similar to
what we have), but fold compression into one or more of those phases. So
we'd have:

send_prepare:
    if compression:
        ops->compression_prepare()
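In C that could look something like the following rough sketch
(hypothetical names: compression_prepare and fill_packet_iov don't exist
today):

static int multifd_send_prepare(MultiFDSendParams *p, Error **errp)
{
    if (migrate_multifd_compression() != MULTIFD_COMPRESSION_NONE) {
        return multifd_send_state->ops->compression_prepare(p, errp);
    }

    /* nocomp: just point p->iov at the raw pages */
    fill_packet_iov(p);
    return 0;
}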
I'm working on something similar for zero pages. It's currently a 3-patch
series from Juan; probably easier if we discuss it there.
>> > MultiFDCompressMethods doesn't need to be used at all for "file" version of
>> > MultiFDMethods.
>> >
>> > Would that work?
>>
>> We definitely need _something_ to help us stop adding code to the middle
>> of multifd_send_thread every time there's a new feature.
>>
>> >
>> >> - return qio_channel_readv_all(p->c, p->iov, p->normal_num, errp);
>> >> }
>> >>
>> >> static MultiFDMethods multifd_nocomp_ops = {
>> >> @@ -989,6 +1008,7 @@ int multifd_save_setup(Error **errp)
>> >>
>> >> struct {
>> >> MultiFDRecvParams *params;
>> >> + MultiFDRecvData *data;
>> >
>> > (If above would work, maybe we can split MultiFDRecvParams into two chunks,
>> > one commonly used for both, one only for sockets?)
>> >
>>
>> If we assume the use of packets in multifd is coupled to the socket
>> channel usage then yes. However, I suspect that what we might want is a
>> streaming migration vs. non-streaming migration abstraction. Because we
>> can still use packets with file migration after all.
>>
>> >> /* number of created threads */
>> >> int count;
>> >> /* syncs main thread and channels */
>> >> @@ -999,6 +1019,49 @@ struct {
>> >> MultiFDMethods *ops;
>> >> } *multifd_recv_state;
>> >>
>> >> +int multifd_recv(void)
>> >> +{
>> >> + int i;
>> >> + static int next_recv_channel;
>> >> + MultiFDRecvParams *p = NULL;
>> >> + MultiFDRecvData *data = multifd_recv_state->data;
>> >> +
>> >> + /*
>> >> + * next_channel can remain from a previous migration that was
>> >> + * using more channels, so ensure it doesn't overflow if the
>> >> + * limit is lower now.
>> >> + */
>> >> + next_recv_channel %= migrate_multifd_channels();
>> >> + for (i = next_recv_channel;; i = (i + 1) % migrate_multifd_channels()) {
>> >> + p = &multifd_recv_state->params[i];
>> >> +
>> >> + qemu_mutex_lock(&p->mutex);
>> >> + if (p->quit) {
>> >> + error_report("%s: channel %d has already quit!", __func__, i);
>> >> + qemu_mutex_unlock(&p->mutex);
>> >> + return -1;
>> >> + }
>> >> + if (!p->pending_job) {
>> >> + p->pending_job++;
>> >> + next_recv_channel = (i + 1) % migrate_multifd_channels();
>> >> + break;
>> >> + }
>> >> + qemu_mutex_unlock(&p->mutex);
>> >> + }
>> >> + assert(p->data->size == 0);
>> >> + multifd_recv_state->data = p->data;
>> >> + p->data = data;
>> >> + qemu_mutex_unlock(&p->mutex);
>> >> + qemu_sem_post(&p->sem);
>> >> +
>> >> + return 1;
>> >> +}
>> >
>> > PS: so if we have the pool model we can already mostly merge above code
>> > with multifd_send_pages().. because this will be a common helper to enqueue
>> > a task to a pool, no matter it's for writing (to file/socket) or reading
>> > (only from file).
>> >
>> >> +
>> >> +MultiFDRecvData *multifd_get_recv_data(void)
>> >> +{
>> >> + return multifd_recv_state->data;
>> >> +}
>> >> +
>> >> static void multifd_recv_terminate_threads(Error *err)
>> >> {
>> >> int i;
>> >> @@ -1020,6 +1083,7 @@ static void multifd_recv_terminate_threads(Error *err)
>> >>
>> >> qemu_mutex_lock(&p->mutex);
>> >> p->quit = true;
>> >> + qemu_sem_post(&p->sem);
>> >> /*
>> >> * We could arrive here for two reasons:
>> >> * - normal quit, i.e. everything went fine, just finished
>> >> @@ -1069,6 +1133,7 @@ void multifd_load_cleanup(void)
>> >> p->c = NULL;
>> >> qemu_mutex_destroy(&p->mutex);
>> >> qemu_sem_destroy(&p->sem_sync);
>> >> + qemu_sem_destroy(&p->sem);
>> >> g_free(p->name);
>> >> p->name = NULL;
>> >> p->packet_len = 0;
>> >> @@ -1083,6 +1148,8 @@ void multifd_load_cleanup(void)
>> >> qemu_sem_destroy(&multifd_recv_state->sem_sync);
>> >> g_free(multifd_recv_state->params);
>> >> multifd_recv_state->params = NULL;
>> >> + g_free(multifd_recv_state->data);
>> >> + multifd_recv_state->data = NULL;
>> >> g_free(multifd_recv_state);
>> >> multifd_recv_state = NULL;
>> >> }
>> >> @@ -1094,6 +1161,21 @@ void multifd_recv_sync_main(void)
>> >> if (!migrate_multifd() || !migrate_multifd_packets()) {
>> >
>> > [1]
>> >
>> >> return;
>> >> }
>> >> +
>> >> + if (!migrate_multifd_packets()) {
>> >
>> > Hmm, isn't this checked already above at [1]? Could this path ever trigger
>> > then? Maybe we need to drop the one at [1]?
>>
>> That was a rebase mistake.
>>
>> >
>> > IIUC what you wanted to do here is relying on the last RAM_SAVE_FLAG_EOS in
>> > the image file to do a full flush to make sure all pages are loaded.
>> >
>>
>> Bear with me if I take it slow with everything below here. The practical
>> effect of changing any of these is causing the threads to go out of sync,
>> which results in a time-dependent bug where memory is not properly
>> migrated, which just comes up as an assert in check_guests_ram() in the
>> tests. It gets hard to reproduce and has taken me whole weeks to debug
>> before.
>
> Per-iteration sync_main is needed for file, but only on the sender side,
> IIUC.
>
> Actually it's not needed at all for your use case of moving the VM to a
> file, because there we could simply stop the VM first. But to stay
> compatible with Libvirt's live snapshot use case on Windows, we assume the
> VM can keep running, and then yes, sync_main is needed at least on src,
> because otherwise the same page can be queued more than once on different
> threads, and it's not guaranteed that the latest version of the page will
> always land last.
>
> That said, I don't think it's needed on the recver side? For both use
> cases (either "move to file" or "take a snapshot"), the loader is the same
> process that reads data from the file and relaunches the VM. In that case
> there's no need to sync.
You're right.
> For socket-based multifd, the sync_main on the recver side is only
> triggered when a RAM_SAVE_FLAG_MULTIFD_FLUSH packet is received, on the 9.0
> machine type. And you'll also notice you don't even have that for file: URI
> multifd migrations, right?
Right, because socket migration is a stream, so if done live the channels
need to sync. File migration takes everything from the file as-is.
> When you said you hit a bug, did you have the sender-side sync_main in
> place, or did you miss both? I would expect the bug was triggered because
> you missed the sync_main on src, not on dest. For dest, IMHO we only need
> a last-phase sync to make sure all RAM is loaded before we relaunch the VM.
You might be right.
I hit several different bugs: synchronization issues, dirty bitmap
issues, threads finishing too soon, etc. They all have the same symptom:
one or more pages get "corrupted". So I cannot speak about any specific
one right now.
On that topic, I have played a bit with adding more information to the
file, such as start and end of ramblocks, or some greppable markers. Since
it is a file, it's much more amenable to these kinds of debugging helpers
than the socket stream.
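Something as simple as this (illustrative only, not part of this series)
already makes each ramblock's region findable with grep or strings directly
on the migration file:

static void fixed_ram_put_marker(QEMUFile *f, RAMBlock *block)
{
    char marker[128];
    int len = snprintf(marker, sizeof(marker),
                       "QEMU-RAMBLOCK-BEGIN:%s", block->idstr);

    /* goes into the stream right before the block's pages region */
    qemu_put_buffer(f, (uint8_t *)marker, len);
}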
>
>>
>> > You may want to be careful on the side effect of flush_after_each_section
>> > parameter:
>> >
>> > case RAM_SAVE_FLAG_EOS:
>> >     /* normal exit */
>> >     if (migrate_multifd() &&
>> >         migrate_multifd_flush_after_each_section()) {
>> >         multifd_recv_sync_main();
>> >     }
>> >
>> > You may want to flush always for file?
>>
>> Next patch restricts the setting of flush_after_each_section.
>>
>> >
>> >> + for (i = 0; i < migrate_multifd_channels(); i++) {
>> >> + MultiFDRecvParams *p = &multifd_recv_state->params[i];
>> >> +
>> >> + qemu_sem_post(&p->sem);
>> >> + qemu_sem_wait(&p->sem_sync);
>> >> +
>> >> + qemu_mutex_lock(&p->mutex);
>> >> + assert(!p->pending_job || p->quit);
>> >> + qemu_mutex_unlock(&p->mutex);
>> >> + }
>> >> + return;
>> >
>> > Btw, how does this kick off all the recv threads? Is it because you did a
>> > sem_post(&sem) with p->pending_job==false this time?
>>
>> Yes, when the last piece of memory is received, the thread will loop
>> around and hang at qemu_sem_wait(&p->sem):
>>
>> } else {
>>     /*
>>      * No packets, so we need to wait for the vmstate code to
>>      * give us work.
>>      */
>>     qemu_sem_wait(&p->sem);
>>     qemu_mutex_lock(&p->mutex);
>>     if (!p->pending_job) {
>>         qemu_mutex_unlock(&p->mutex);
>>         break;
>>     }
>>     has_data = !!p->data->size;
>> }
>>
>> So here we release the p->sem one last time so the thread can see
>> p->pending_job = false and proceed to inform it has finished:
>>
>> if (!use_packets) {
>>     qemu_sem_post(&p->sem_sync);
>> }
>
> (see below)
>
>>
>> > Maybe it's clearer to just set p->quit (or a global quit knob) somewhere?
>> > That'll be clear that this is a one-shot thing, only needed at the end of
>> > the file incoming migration.
>> >
>>
>> Maybe I'm not following you, but the thread needs to check
>> p->pending_job before it knows there's no more work. And it can only do
>> that if the migration thread releases p->sem. Do you mean setting
>> p->quit on the thread instead of posting sem_sync? That's racy I think.
>
> I want to make the quit event not rely on pending_job. I think we should
> allow pending_job==false, in which case the thread should just sleep again.
Hmm, this could be a good thing, because then we'd have only one source of
the "quit event" and we'd know all threads would wait until they see that
quit.
> That would also match the recver side with the sender; I remember we just
> reworked the latter to reference the global "exiting" flag:
Almost. The sender still has the ability to just exit. So even though it
respects the 'exiting' flag, it might have already returned at the end of
the function.
>
> multifd_send_thread():
>     ...
>     qemu_sem_wait(&p->sem);
>     if (qatomic_read(&multifd_send_state->exiting)) {
>         break;
>     }
>     ...
>
> Something like that.
>
>>
>> >> + }
>> >> +
>> >> for (i = 0; i < migrate_multifd_channels(); i++) {
>> >> MultiFDRecvParams *p = &multifd_recv_state->params[i];
>> >>
>> >> @@ -1156,6 +1238,18 @@ static void *multifd_recv_thread(void *opaque)
>> >>
>> >> p->total_normal_pages += p->normal_num;
>> >> has_data = !!p->normal_num;
>> >> + } else {
>> >> + /*
>> >> + * No packets, so we need to wait for the vmstate code to
>> >> + * give us work.
>> >> + */
>> >> + qemu_sem_wait(&p->sem);
>> >> + qemu_mutex_lock(&p->mutex);
>> >> + if (!p->pending_job) {
>> >> + qemu_mutex_unlock(&p->mutex);
>> >> + break;
>> >> + }
>> >> + has_data = !!p->data->size;
>> >> }
>> >>
>> >> qemu_mutex_unlock(&p->mutex);
>> >> @@ -1171,6 +1265,17 @@ static void *multifd_recv_thread(void *opaque)
>> >> qemu_sem_post(&multifd_recv_state->sem_sync);
>> >> qemu_sem_wait(&p->sem_sync);
>> >> }
>> >> +
>> >> + if (!use_packets) {
>> >> + qemu_mutex_lock(&p->mutex);
>> >> + p->data->size = 0;
>> >> + p->pending_job--;
>> >> + qemu_mutex_unlock(&p->mutex);
>> >> + }
>> >> + }
>> >> +
>> >> + if (!use_packets) {
>> >> + qemu_sem_post(&p->sem_sync);
>> >
>> > Currently sem_sync is only posted with MULTIFD_FLAG_SYNC flag. We'd better
>> > be careful on reusing it.
>> >
>> > Maybe add some comment above recv_state->sem_sync?
>> >
>> > /*
>> >  * For sockets: this is posted once for each MULTIFD_FLAG_SYNC flag.
>> >  *
>> >  * For files: this is only posted at the end of the file load to mark
>> >  * completion of the load process.
>> >  */
>> >
>>
>> Sure. I would rename it to sem_done if I could, but we already went
>> through that.
>>
>> >> }
>> >>
>> >> if (local_err) {
>> >> @@ -1205,6 +1310,10 @@ int multifd_load_setup(Error **errp)
>> >> thread_count = migrate_multifd_channels();
>> >> multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
>> >> multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
>> >> +
>> >> + multifd_recv_state->data = g_new0(MultiFDRecvData, 1);
>> >> + multifd_recv_state->data->size = 0;
>> >> +
>> >> qatomic_set(&multifd_recv_state->count, 0);
>> >> qemu_sem_init(&multifd_recv_state->sem_sync, 0);
>> >> multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
>> >> @@ -1214,9 +1323,14 @@ int multifd_load_setup(Error **errp)
>> >>
>> >> qemu_mutex_init(&p->mutex);
>> >> qemu_sem_init(&p->sem_sync, 0);
>> >> + qemu_sem_init(&p->sem, 0);
>> >> p->quit = false;
>> >> + p->pending_job = 0;
>> >> p->id = i;
>> >>
>> >> + p->data = g_new0(MultiFDRecvData, 1);
>> >> + p->data->size = 0;
>> >> +
>> >> if (use_packets) {
>> >> p->packet_len = sizeof(MultiFDPacket_t)
>> >> + sizeof(uint64_t) * page_count;
>> >> diff --git a/migration/multifd.h b/migration/multifd.h
>> >> index 406d42dbae..abaf16c3f2 100644
>> >> --- a/migration/multifd.h
>> >> +++ b/migration/multifd.h
>> >> @@ -13,6 +13,8 @@
>> >> #ifndef QEMU_MIGRATION_MULTIFD_H
>> >> #define QEMU_MIGRATION_MULTIFD_H
>> >>
>> >> +typedef struct MultiFDRecvData MultiFDRecvData;
>> >> +
>> >> int multifd_save_setup(Error **errp);
>> >> void multifd_save_cleanup(void);
>> >> int multifd_load_setup(Error **errp);
>> >> @@ -24,6 +26,8 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>> >> void multifd_recv_sync_main(void);
>> >> int multifd_send_sync_main(QEMUFile *f);
>> >> int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
>> >> +int multifd_recv(void);
>> >> +MultiFDRecvData *multifd_get_recv_data(void);
>> >>
>> >> /* Multifd Compression flags */
>> >> #define MULTIFD_FLAG_SYNC (1 << 0)
>> >> @@ -66,6 +70,13 @@ typedef struct {
>> >> RAMBlock *block;
>> >> } MultiFDPages_t;
>> >>
>> >> +struct MultiFDRecvData {
>> >> + void *opaque;
>> >> + size_t size;
>> >> + /* for preadv */
>> >> + off_t file_offset;
>> >> +};
>> >> +
>> >> typedef struct {
>> >> /* Fields are only written at creating/deletion time */
>> >> /* No lock required for them, they are read only */
>> >> @@ -156,6 +167,8 @@ typedef struct {
>> >>
>> >> /* syncs main thread and channels */
>> >> QemuSemaphore sem_sync;
>> >> + /* sem where to wait for more work */
>> >> + QemuSemaphore sem;
>> >>
>> >> /* this mutex protects the following parameters */
>> >> QemuMutex mutex;
>> >> @@ -167,6 +180,13 @@ typedef struct {
>> >> uint32_t flags;
>> >> /* global number of generated multifd packets */
>> >> uint64_t packet_num;
>> >> + int pending_job;
>> >> + /*
>> >> + * The owner of 'data' depends of 'pending_job' value:
>> >> + * pending_job == 0 -> migration_thread can use it.
>> >> + * pending_job != 0 -> multifd_channel can use it.
>> >> + */
>> >> + MultiFDRecvData *data;
>> >
>> > Right after the main thread assigns a chunk of memory to load for a recv
>> > thread, the main thread job done, afaict. I don't see how a race could
>> > happen here.
>> >
>> > I'm not sure, but I _think_ if we rely on p->quit or something similar to
>> > quit all recv threads, then this can be dropped?
>> >
>>
>> We still need to know whether a channel is in use so we can skip to the
>> next.
>
> Oh, yes.
>
>>
>> >>
>> >> /* thread local variables. No locking required */
>> >>
>> >> --
>> >> 2.35.3
>> >>
>>
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (17 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 18/30] migration/multifd: Allow receiving pages without packets Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2024-01-16 8:23 ` Peter Xu
2023-11-27 20:26 ` [RFC PATCH v3 20/30] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
` (11 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Some functionalities of multifd are incompatible with the 'fixed-ram'
migration format.
The MULTIFD_FLUSH flag in particular is not used because in fixed-ram
there is no synchronicity between migration source and destination, so
there is no need for a sync packet. In fact, fixed-ram disables
packets in multifd as a whole.
However, we still need to sync the migration thread with the multifd
channels at key moments:
- between iterations, to avoid a slow channel being overrun by a fast
channel in the subsequent iteration;
- at ram_save_complete, to make sure all data has been transferred
before finishing migration;
Make sure RAM_SAVE_FLAG_MULTIFD_FLUSH is only emitted for fixed-ram at
those key moments.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/ram.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 08604222f2..ad6abd1761 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1363,7 +1363,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
pss->page = 0;
pss->block = QLIST_NEXT_RCU(pss->block, next);
if (!pss->block) {
- if (migrate_multifd() &&
+ if (migrate_multifd() && !migrate_fixed_ram() &&
!migrate_multifd_flush_after_each_section()) {
QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
int ret = multifd_send_sync_main(f);
@@ -3112,7 +3112,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
return ret;
}
- if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
+ if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
+ && !migrate_fixed_ram()) {
qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
}
@@ -3242,8 +3243,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
out:
if (ret >= 0
&& migration_is_setup_or_active(migrate_get_current()->state)) {
- if (migrate_multifd() && migrate_multifd_flush_after_each_section()) {
- ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
+ if (migrate_multifd() &&
+ (migrate_multifd_flush_after_each_section() ||
+ migrate_fixed_ram())) {
+ ret = multifd_send_sync_main(
+ rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
if (ret < 0) {
return ret;
}
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration
2023-11-27 20:26 ` [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
@ 2024-01-16 8:23 ` Peter Xu
2024-01-17 18:13 ` Fabiano Rosas
0 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-16 8:23 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:26:01PM -0300, Fabiano Rosas wrote:
> Some functionalities of multifd are incompatible with the 'fixed-ram'
> migration format.
>
> The MULTIFD_FLUSH flag in particular is not used because in fixed-ram
> there is no synchronicity between migration source and destination, so
> there is no need for a sync packet. In fact, fixed-ram disables
> packets in multifd as a whole.
>
> However, we still need to sync the migration thread with the multifd
> channels at key moments:
>
> - between iterations, to avoid a slow channel being overrun by a fast
> channel in the subsequent iteration;
>
> - at ram_save_complete, to make sure all data has been transferred
> before finishing migration;
[1]
>
> Make sure RAM_SAVE_FLAG_MULTIFD_FLUSH is only emitted for fixed-ram at
> those key moments.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/ram.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index 08604222f2..ad6abd1761 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1363,7 +1363,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
> pss->page = 0;
> pss->block = QLIST_NEXT_RCU(pss->block, next);
> if (!pss->block) {
> - if (migrate_multifd() &&
> + if (migrate_multifd() && !migrate_fixed_ram() &&
> !migrate_multifd_flush_after_each_section()) {
> QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
> int ret = multifd_send_sync_main(f);
> @@ -3112,7 +3112,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> return ret;
> }
>
> - if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
> + if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
> + && !migrate_fixed_ram()) {
> qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> }
>
> @@ -3242,8 +3243,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
> out:
> if (ret >= 0
> && migration_is_setup_or_active(migrate_get_current()->state)) {
> - if (migrate_multifd() && migrate_multifd_flush_after_each_section()) {
> - ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
> + if (migrate_multifd() &&
> + (migrate_multifd_flush_after_each_section() ||
> + migrate_fixed_ram())) {
> + ret = multifd_send_sync_main(
> + rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
Why do you want this one? ram_save_iterate() can be called tens of times
per second, IIUC.
There's one more? ram_save_complete():
    if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
        qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
    }
IIUC that's the one you referred to at [1] above, not sure why you modified
the code in ram_save_iterate() instead.
> if (ret < 0) {
> return ret;
> }
> --
> 2.35.3
>
Since file migration adds a whole new code path to
multifd_send_sync_main(), I'm now wondering whether we should just provide
multifd_file_sync_threads(), put the file sync there, and call it
explicitly, like:
    if (migrate_multifd()) {
        if (migrate_is_file()) {
            multifd_file_sync_threads();
        } else if (migrate_multifd_flush_after_each_section()) {
            multifd_send_sync_main();
        }
    }
It'll be much clearer that file goes down its own path, and we won't need
to squint at those fat if clauses. The diff should be of similar size.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration
2024-01-16 8:23 ` Peter Xu
@ 2024-01-17 18:13 ` Fabiano Rosas
2024-01-19 1:33 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-17 18:13 UTC (permalink / raw)
To: Peter Xu
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:26:01PM -0300, Fabiano Rosas wrote:
>> Some functionalities of multifd are incompatible with the 'fixed-ram'
>> migration format.
>>
>> The MULTIFD_FLUSH flag in particular is not used because in fixed-ram
>> there is no synchronicity between migration source and destination, so
>> there is no need for a sync packet. In fact, fixed-ram disables
>> packets in multifd as a whole.
>>
>> However, we still need to sync the migration thread with the multifd
>> channels at key moments:
>>
>> - between iterations, to avoid a slow channel being overrun by a fast
>> channel in the subsequent iteration;
>>
>> - at ram_save_complete, to make sure all data has been transferred
>> before finishing migration;
>
> [1]
>
>>
>> Make sure RAM_SAVE_FLAG_MULTIFD_FLUSH is only emitted for fixed-ram at
>> those key moments.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> migration/ram.c | 12 ++++++++----
>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 08604222f2..ad6abd1761 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1363,7 +1363,7 @@ static int find_dirty_block(RAMState *rs, PageSearchStatus *pss)
>> pss->page = 0;
>> pss->block = QLIST_NEXT_RCU(pss->block, next);
>> if (!pss->block) {
>> - if (migrate_multifd() &&
>> + if (migrate_multifd() && !migrate_fixed_ram() &&
>> !migrate_multifd_flush_after_each_section()) {
>> QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
>> int ret = multifd_send_sync_main(f);
>> @@ -3112,7 +3112,8 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>> return ret;
>> }
>>
>> - if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
>> + if (migrate_multifd() && !migrate_multifd_flush_after_each_section()
>> + && !migrate_fixed_ram()) {
>> qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>> }
>>
>> @@ -3242,8 +3243,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>> out:
>> if (ret >= 0
>> && migration_is_setup_or_active(migrate_get_current()->state)) {
>> - if (migrate_multifd() && migrate_multifd_flush_after_each_section()) {
>> - ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
>> + if (migrate_multifd() &&
>> + (migrate_multifd_flush_after_each_section() ||
>> + migrate_fixed_ram())) {
>> + ret = multifd_send_sync_main(
>> + rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
>
> Why do you want this one? ram_save_iterate() can be called tens of times
> per second, IIUC.
>
AIUI, this is a requirement for live migration, so that we're not
sending the new version of the page while the old version is still in
transit.
> There's one more? ram_save_complete():
>
>     if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
>         qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
>     }
>
> IIUC that's the one you referred to at [1] above, not sure why you modified
> the code in ram_save_iterate() instead.
>
I mentioned it in the commit message as well:
" - between iterations, to avoid a slow channel being overrun by a fast
channel in the subsequent iteration;"
>> if (ret < 0) {
>> return ret;
>> }
>> --
>> 2.35.3
>>
>
> Since file migration adds a whole new code path to
> multifd_send_sync_main(), I'm now wondering whether we should just provide
> multifd_file_sync_threads(), put the file sync there, and call it
> explicitly, like:
>
>     if (migrate_multifd()) {
>         if (migrate_is_file()) {
>             multifd_file_sync_threads();
>         } else if (migrate_multifd_flush_after_each_section()) {
>             multifd_send_sync_main();
>         }
>     }
>
> It'll be much clearer that file goes down its own path, and we won't need
> to squint at those fat if clauses. The diff should be of similar size.
Hm, it could be a good idea indeed. Let me experiment with it.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration
2024-01-17 18:13 ` Fabiano Rosas
@ 2024-01-19 1:33 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-19 1:33 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Wed, Jan 17, 2024 at 03:13:20PM -0300, Fabiano Rosas wrote:
> >> @@ -3242,8 +3243,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
> >> out:
> >> if (ret >= 0
> >> && migration_is_setup_or_active(migrate_get_current()->state)) {
> >> - if (migrate_multifd() && migrate_multifd_flush_after_each_section()) {
> >> - ret = multifd_send_sync_main(rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
> >> + if (migrate_multifd() &&
> >> + (migrate_multifd_flush_after_each_section() ||
> >> + migrate_fixed_ram())) {
> >> + ret = multifd_send_sync_main(
> >> + rs->pss[RAM_CHANNEL_PRECOPY].pss_channel);
> >
> > Why do you want this one? ram_save_iterate() can be called tens of times
> > per second, IIUC.
> >
>
> AIUI, this is a requirement for live migration, so that we're not
> sending the new version of the page while the old version is still in
> transit.
>
> > There's one more? ram_save_complete():
> >
> >     if (migrate_multifd() && !migrate_multifd_flush_after_each_section()) {
> >         qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
> >     }
> >
> > IIUC that's the one you referred to at [1] above, not sure why you modified
> > the code in ram_save_iterate() instead.
> >
>
> I mentioned it in the commit message as well:
>
> " - between iterations, to avoid a slow channel being overrun by a fast
> channel in the subsequent iteration;"
IMHO you only need to flush all threads in find_dirty_block(). That's when
the "real iteration" happens (IOW, if the "real iteration" is defined as a
full walk across all of guest memory, rather than a single call to
ram_save_iterate()).
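Concretely, what I mean is keeping only the sync you already have in
find_dirty_block(), without the !migrate_fixed_ram() check this patch adds
there (sketch):

    if (!pss->block) {
        /* completed one full pass over all ramblocks */
        if (migrate_multifd() &&
            !migrate_multifd_flush_after_each_section()) {
            QEMUFile *f = rs->pss[RAM_CHANNEL_PRECOPY].pss_channel;
            int ret = multifd_send_sync_main(f);
            ...
        }
        ...
    }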
Multifd used to do too many flushes; that's why we added the new
migrate_multifd_flush_after_each_section() parameter, and it's a bit of a
mess, as you can see.
To be super safe, you can also flush at ram_save_complete(), but I doubt
its necessity, and the same question applies to generic multifd: the
multifd_send_sync_main() in ram_save_complete() may be redundant, AFAIU.
However, we can leave that for later so as not to add even more
dependencies to fixed-ram. If it turns out to be justified, we can then
remove that sync_main for both socket and file.
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 20/30] migration/multifd: Support outgoing fixed-ram stream format
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (18 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 19/30] migration/ram: Ignore multifd flush when doing fixed-ram migration Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 21/30] migration/multifd: Support incoming " Fabiano Rosas
` (10 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
The new fixed-ram stream format uses a file transport and puts ram
pages in the migration file at their respective offsets; this can be
done in parallel by using the pwritev system call, which takes iovecs
and an offset.
Add support for enabling the new format along with multifd, to make use
of the threading and page handling already in place.
This requires multifd to stop sending headers and to leave the stream
format to the fixed-ram code. When it comes time to write the data, we
need to call a version of qio_channel_write that can take an offset.
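The pwritev(2) property relied on here (plain POSIX, for illustration only)
is that all iovecs are written contiguously starting at the single given
offset, without moving a file position shared with other threads:

#include <sys/uio.h>

/* write a batch of pages at their fixed location in the file */
static ssize_t write_pages_at(int fd, struct iovec *iov, int iovcnt,
                              off_t offset)
{
    return pwritev(fd, iov, iovcnt, offset);
}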
Usage on HMP is:
(qemu) stop
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate file:migfile
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- altered to call a separate qio_channel function for fixed-ram
---
include/qemu/bitops.h | 13 +++++++
migration/migration.c | 19 ++++++----
migration/multifd.c | 81 ++++++++++++++++++++++++++++++++++++-------
migration/options.c | 6 ----
migration/ram.c | 17 +++++++--
migration/ram.h | 1 +
6 files changed, 110 insertions(+), 27 deletions(-)
diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index cb3526d1f4..2c0a2fe751 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -67,6 +67,19 @@ static inline void clear_bit(long nr, unsigned long *addr)
*p &= ~mask;
}
+/**
+ * clear_bit_atomic - Clears a bit in memory atomically
+ * @nr: Bit to clear
+ * @addr: Address to start counting from
+ */
+static inline void clear_bit_atomic(long nr, unsigned long *addr)
+{
+ unsigned long mask = BIT_MASK(nr);
+ unsigned long *p = addr + BIT_WORD(nr);
+
+ return qatomic_and(p, ~mask);
+}
+
/**
* change_bit - Toggle a bit in memory
* @nr: Bit to change
diff --git a/migration/migration.c b/migration/migration.c
index 16689171ab..cc707b0223 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -128,11 +128,19 @@ static bool migration_needs_multiple_sockets(void)
return migrate_multifd() || migrate_postcopy_preempt();
}
-static bool transport_supports_multi_channels(SocketAddress *saddr)
+static bool transport_supports_multi_channels(MigrationAddress *addr)
{
- return saddr->type == SOCKET_ADDRESS_TYPE_INET ||
- saddr->type == SOCKET_ADDRESS_TYPE_UNIX ||
- saddr->type == SOCKET_ADDRESS_TYPE_VSOCK;
+ if (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) {
+ SocketAddress *saddr = &addr->u.socket;
+
+ return (saddr->type == SOCKET_ADDRESS_TYPE_INET ||
+ saddr->type == SOCKET_ADDRESS_TYPE_UNIX ||
+ saddr->type == SOCKET_ADDRESS_TYPE_VSOCK);
+ } else if (addr->transport == MIGRATION_ADDRESS_TYPE_FILE) {
+ return migrate_fixed_ram();
+ } else {
+ return false;
+ }
}
static bool migration_needs_seekable_channel(void)
@@ -156,8 +164,7 @@ migration_channels_and_transport_compatible(MigrationAddress *addr,
}
if (migration_needs_multiple_sockets() &&
- (addr->transport == MIGRATION_ADDRESS_TYPE_SOCKET) &&
- !transport_supports_multi_channels(&addr->u.socket)) {
+ !transport_supports_multi_channels(addr)) {
error_setg(errp, "Migration requires multi-channel URIs (e.g. tcp)");
return false;
}
diff --git a/migration/multifd.c b/migration/multifd.c
index 7dfab2367a..8eae7de4de 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -278,6 +278,17 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
g_free(pages);
}
+static void multifd_set_file_bitmap(MultiFDSendParams *p)
+{
+ MultiFDPages_t *pages = p->pages;
+
+ assert(pages->block);
+
+ for (int i = 0; i < p->normal_num; i++) {
+ ramblock_set_shadow_bmap_atomic(pages->block, pages->offset[i]);
+ }
+}
+
static void multifd_send_fill_packet(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
@@ -624,6 +635,34 @@ int multifd_send_sync_main(QEMUFile *f)
}
}
+ if (!migrate_multifd_packets()) {
+ /*
+ * There's no sync packet to send. Just make sure the sending
+ * above has finished.
+ */
+ for (i = 0; i < migrate_multifd_channels(); i++) {
+ qemu_sem_wait(&multifd_send_state->channels_ready);
+ }
+
+ /* sanity check and release the channels */
+ for (i = 0; i < migrate_multifd_channels(); i++) {
+ MultiFDSendParams *p = &multifd_send_state->params[i];
+
+ qemu_mutex_lock(&p->mutex);
+ if (p->quit) {
+ error_report("%s: channel %d has already quit!", __func__, i);
+ qemu_mutex_unlock(&p->mutex);
+ return -1;
+ }
+ assert(!p->pending_job);
+ qemu_mutex_unlock(&p->mutex);
+
+ qemu_sem_post(&p->sem);
+ }
+
+ return 0;
+ }
+
/*
* When using zero-copy, it's necessary to flush the pages before any of
* the pages can be sent again, so we'll make sure the new version of the
@@ -707,6 +746,8 @@ static void *multifd_send_thread(void *opaque)
if (p->pending_job) {
uint32_t flags;
+ uintptr_t write_base;
+
p->normal_num = 0;
if (!use_packets || use_zero_copy_send) {
@@ -731,6 +772,15 @@ static void *multifd_send_thread(void *opaque)
if (use_packets) {
multifd_send_fill_packet(p);
p->num_packets++;
+ } else {
+ multifd_set_file_bitmap(p);
+
+ /*
+ * If we subtract the host page now, we don't need to
+ * pass it into qio_channel_pwritev_all() below.
+ */
+ write_base = p->pages->block->pages_offset -
+ (uintptr_t)p->pages->block->host;
}
flags = p->flags;
@@ -743,21 +793,28 @@ static void *multifd_send_thread(void *opaque)
trace_multifd_send(p->id, p->packet_num, p->normal_num, flags,
p->next_packet_size);
- if (use_zero_copy_send) {
- /* Send header first, without zerocopy */
- ret = qio_channel_write_all(p->c, (void *)p->packet,
- p->packet_len, &local_err);
- if (ret != 0) {
- break;
+ if (use_packets) {
+ if (use_zero_copy_send) {
+ /* Send header first, without zerocopy */
+ ret = qio_channel_write_all(p->c, (void *)p->packet,
+ p->packet_len, &local_err);
+ if (ret != 0) {
+ break;
+ }
+ } else {
+ /* Send header using the same writev call */
+ p->iov[0].iov_len = p->packet_len;
+ p->iov[0].iov_base = p->packet;
}
- } else if (use_packets) {
- /* Send header using the same writev call */
- p->iov[0].iov_len = p->packet_len;
- p->iov[0].iov_base = p->packet;
+
+ ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
+ NULL, 0, p->write_flags,
+ &local_err);
+ } else {
+ ret = qio_channel_pwritev_all(p->c, p->iov, p->iovs_num,
+ write_base, &local_err);
}
- ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
- 0, p->write_flags, &local_err);
if (ret != 0) {
break;
}
diff --git a/migration/options.c b/migration/options.c
index f671e24758..7f23881f51 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -666,12 +666,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
}
if (new_caps[MIGRATION_CAPABILITY_FIXED_RAM]) {
- if (new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
- error_setg(errp,
- "Fixed-ram migration is incompatible with multifd");
- return false;
- }
-
if (new_caps[MIGRATION_CAPABILITY_XBZRLE]) {
error_setg(errp,
"Fixed-ram migration is incompatible with xbzrle");
diff --git a/migration/ram.c b/migration/ram.c
index ad6abd1761..385fe431bf 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1149,7 +1149,7 @@ static int save_zero_page(RAMState *rs, PageSearchStatus *pss,
if (migrate_fixed_ram()) {
/* zero pages are not transferred with fixed-ram */
- clear_bit(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
+ clear_bit_atomic(offset >> TARGET_PAGE_BITS, pss->block->shadow_bmap);
return 1;
}
@@ -2443,8 +2443,6 @@ static void ram_save_cleanup(void *opaque)
block->clear_bmap = NULL;
g_free(block->bmap);
block->bmap = NULL;
- g_free(block->shadow_bmap);
- block->shadow_bmap = NULL;
}
xbzrle_cleanup();
@@ -3131,9 +3129,22 @@ static void ram_save_shadow_bmap(QEMUFile *f)
qemu_put_buffer_at(f, (uint8_t *)block->shadow_bmap, bitmap_size,
block->bitmap_offset);
ram_transferred_add(bitmap_size);
+
+ /*
+ * Free the bitmap here to catch any synchronization issues
+ * with multifd channels. No channels should be sending pages
+ * after we've written the bitmap to file.
+ */
+ g_free(block->shadow_bmap);
+ block->shadow_bmap = NULL;
}
}
+void ramblock_set_shadow_bmap_atomic(RAMBlock *block, ram_addr_t offset)
+{
+ set_bit_atomic(offset >> TARGET_PAGE_BITS, block->shadow_bmap);
+}
+
/**
* ram_save_iterate: iterative stage for migration
*
diff --git a/migration/ram.h b/migration/ram.h
index 9b937a446b..a65120de0d 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -75,6 +75,7 @@ bool ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb, Error **errp);
bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
void postcopy_preempt_shutdown_file(MigrationState *s);
void *postcopy_preempt_thread(void *opaque);
+void ramblock_set_shadow_bmap_atomic(RAMBlock *block, ram_addr_t offset);
/* ram cache */
int colo_init_ram_cache(void);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 21/30] migration/multifd: Support incoming fixed-ram stream format
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (19 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 20/30] migration/multifd: Support outgoing fixed-ram stream format Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 22/30] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
` (9 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
For the incoming fixed-ram migration we need to read the ramblock
headers, get the pages bitmap and send the host address of each
non-zero page to the multifd channel thread for writing.
To read from the migration file we need a preadv function that can
read into the iovs in segments of contiguous pages because (as in the
writing case) the file offset applies to the entire iovec.
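For illustration, here is a minimal C sketch of that read pattern. It
assumes the fixed-ram property that a page's file offset mirrors its
offset within the block; the function and parameter names are made up
and are not the series' actual helpers:
#define _GNU_SOURCE     /* preadv() */
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
static ssize_t preadv_contiguous_runs(int fd, const struct iovec *iov,
                                      int iovcnt, uintptr_t host_base,
                                      off_t file_base)
{
    ssize_t total = 0;
    int i = 0;
    while (i < iovcnt) {
        /* file offset of this run, derived from its host address */
        off_t off = file_base + ((uintptr_t)iov[i].iov_base - host_base);
        size_t run_len = iov[i].iov_len;
        int j = i + 1;
        ssize_t ret;
        /* extend the run while the next buffer is contiguous */
        while (j < iovcnt &&
               iov[j].iov_base == (char *)iov[j - 1].iov_base +
                                  iov[j - 1].iov_len) {
            run_len += iov[j].iov_len;
            j++;
        }
        /* one preadv() per run, since the offset covers the whole iovec */
        ret = preadv(fd, &iov[i], j - i, off);
        if (ret < 0 || (size_t)ret != run_len) {
            return -1; /* short reads are not retried in this sketch */
        }
        total += ret;
        i = j;
    }
    return total;
}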
Usage on HMP is:
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_parameter max-bandwidth 0
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate_incoming file:migfile
(qemu) info status
(qemu) c
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/ram.c | 34 +++++++++++++++++++++++++++++++---
1 file changed, 31 insertions(+), 3 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 385fe431bf..f5173755f0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -111,6 +111,7 @@
* pages region in the migration file at a time.
*/
#define FIXED_RAM_LOAD_BUF_SIZE 0x100000
+#define FIXED_RAM_MULTIFD_LOAD_BUF_SIZE 0x100000
XBZRLECacheStats xbzrle_counters;
@@ -3942,13 +3943,36 @@ void colo_flush_ram_cache(void)
trace_colo_flush_ram_cache_end();
}
+static size_t ram_load_multifd_pages(RAMBlock *block, ram_addr_t start_offset,
+ size_t size)
+{
+ MultiFDRecvData *data = multifd_get_recv_data();
+
+ /*
+ * Pointing the opaque directly to the host buffer, no
+ * preprocessing needed.
+ */
+ data->opaque = block->host + start_offset;
+
+ data->file_offset = block->pages_offset + start_offset;
+ data->size = size;
+
+ if (multifd_recv() < 0) {
+ return -1;
+ }
+
+ return size;
+}
+
static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
long num_pages, unsigned long *bitmap)
{
unsigned long set_bit_idx, clear_bit_idx;
ram_addr_t offset;
void *host;
- size_t read, unread, size, buf_size = FIXED_RAM_LOAD_BUF_SIZE;
+ size_t read, unread, size;
+ size_t buf_size = (migrate_multifd() ? FIXED_RAM_MULTIFD_LOAD_BUF_SIZE :
+ FIXED_RAM_LOAD_BUF_SIZE);
for (set_bit_idx = find_first_bit(bitmap, num_pages);
set_bit_idx < num_pages;
@@ -3963,8 +3987,12 @@ static void read_ramblock_fixed_ram(QEMUFile *f, RAMBlock *block,
host = host_from_ram_block_offset(block, offset);
size = MIN(unread, buf_size);
- read = qemu_get_buffer_at(f, host, size,
- block->pages_offset + offset);
+ if (migrate_multifd()) {
+ read = ram_load_multifd_pages(block, offset, size);
+ } else {
+ read = qemu_get_buffer_at(f, host, size,
+ block->pages_offset + offset);
+ }
offset += read;
unread -= read;
}
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 22/30] tests/qtest: Add a multifd + fixed-ram migration test
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (20 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 21/30] migration/multifd: Support incoming " Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 23/30] migration: Add direct-io parameter Fabiano Rosas
` (8 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
tests/qtest/migration-test.c | 45 ++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 96a6217af0..5c5725687c 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2183,6 +2183,46 @@ static void test_precopy_file_fixed_ram(void)
test_file_common(&args, true);
}
+static void *migrate_multifd_fixed_ram_start(QTestState *from, QTestState *to)
+{
+ migrate_fixed_ram_start(from, to);
+
+ migrate_set_parameter_int(from, "multifd-channels", 4);
+ migrate_set_parameter_int(to, "multifd-channels", 4);
+
+ migrate_set_capability(from, "multifd", true);
+ migrate_set_capability(to, "multifd", true);
+
+ return NULL;
+}
+
+static void test_multifd_file_fixed_ram_live(void)
+{
+ g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+ FILE_TEST_FILENAME);
+ MigrateCommon args = {
+ .connect_uri = uri,
+ .listen_uri = "defer",
+ .start_hook = migrate_multifd_fixed_ram_start,
+ };
+
+ test_file_common(&args, false);
+}
+
+static void test_multifd_file_fixed_ram(void)
+{
+ g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+ FILE_TEST_FILENAME);
+ MigrateCommon args = {
+ .connect_uri = uri,
+ .listen_uri = "defer",
+ .start_hook = migrate_multifd_fixed_ram_start,
+ };
+
+ test_file_common(&args, true);
+}
+
+
static void test_precopy_tcp_plain(void)
{
MigrateCommon args = {
@@ -3431,6 +3471,11 @@ int main(int argc, char **argv)
qtest_add_func("/migration/precopy/file/fixed-ram/live",
test_precopy_file_fixed_ram_live);
+ qtest_add_func("/migration/multifd/file/fixed-ram",
+ test_multifd_file_fixed_ram);
+ qtest_add_func("/migration/multifd/file/fixed-ram/live",
+ test_multifd_file_fixed_ram_live);
+
#ifdef CONFIG_GNUTLS
qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 23/30] migration: Add direct-io parameter
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (21 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 22/30] tests/qtest: Add a multifd + fixed-ram migration test Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-12-22 10:38 ` Markus Armbruster
2023-11-27 20:26 ` [RFC PATCH v3 24/30] tests/qtest: Add a test for migration with direct-io and multifd Fabiano Rosas
` (7 subsequent siblings)
30 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Eric Blake
Add the direct-io migration parameter that tells the migration code to
use O_DIRECT when opening the migration stream file whenever possible.
This is currently only used with fixed-ram migration for the multifd
channels that transfer the RAM pages. Those channels only transfer the
pages and are guaranteed to perform aligned writes.
However the parameter could be made to affect other types of
file-based migrations in the future.
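For illustration, a hypothetical HMP invocation (mirroring the usage
examples elsewhere in this series; the parameter is a boolean taking
on/off):
(qemu) migrate_set_capability fixed-ram on
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 8
(qemu) migrate_set_parameter direct-io on
(qemu) migrate file:migfile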
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- json formatting
- added checks for O_DIRECT support
---
include/qemu/osdep.h | 2 ++
migration/file.c | 22 ++++++++++++++++++++--
migration/migration-hmp-cmds.c | 11 +++++++++++
migration/options.c | 30 ++++++++++++++++++++++++++++++
migration/options.h | 1 +
qapi/migration.json | 18 +++++++++++++++---
util/osdep.c | 9 +++++++++
7 files changed, 88 insertions(+), 5 deletions(-)
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 475a1c62ff..ea5d29ab9b 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -597,6 +597,8 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive);
bool qemu_has_ofd_lock(void);
#endif
+bool qemu_has_direct_io(void);
+
#if defined(__HAIKU__) && defined(__i386__)
#define FMT_pid "%ld"
#elif defined(WIN64)
diff --git a/migration/file.c b/migration/file.c
index 62ba994109..fc5c1a45f4 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -61,12 +61,30 @@ int file_send_channel_destroy(QIOChannel *ioc)
void file_send_channel_create(QIOTaskFunc f, void *data)
{
- QIOChannelFile *ioc;
+ QIOChannelFile *ioc = NULL;
QIOTask *task;
Error *err = NULL;
int flags = O_WRONLY;
- ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
+ if (migrate_direct_io()) {
+#ifdef O_DIRECT
+ /*
+ * Enable O_DIRECT for the secondary channels. These are used
+ * for sending ram pages and writes should be guaranteed to be
+ * aligned to at least page size.
+ */
+ flags |= O_DIRECT;
+#else
+ error_setg(&err, "System does not support O_DIRECT");
+ error_append_hint(&err,
+ "Try disabling direct-io migration capability\n");
+ /* errors are propagated through the qio_task below */
+#endif
+ }
+
+ if (!err) {
+ ioc = qio_channel_file_new_path(outgoing_args.fname, flags, 0, &err);
+ }
task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
if (!ioc) {
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 86ae832176..5ad6b2788d 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -392,6 +392,13 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
monitor_printf(mon, "%s: %s\n",
MigrationParameter_str(MIGRATION_PARAMETER_MODE),
qapi_enum_lookup(&MigMode_lookup, params->mode));
+
+ if (params->has_direct_io) {
+ monitor_printf(mon, "%s: %s\n",
+ MigrationParameter_str(
+ MIGRATION_PARAMETER_DIRECT_IO),
+ params->direct_io ? "on" : "off");
+ }
}
qapi_free_MigrationParameters(params);
@@ -679,6 +686,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_mode = true;
visit_type_MigMode(v, param, &p->mode, &err);
break;
+ case MIGRATION_PARAMETER_DIRECT_IO:
+ p->has_direct_io = true;
+ visit_type_bool(v, param, &p->direct_io, &err);
+ break;
default:
assert(0);
}
diff --git a/migration/options.c b/migration/options.c
index 7f23881f51..6c100dff7a 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -835,6 +835,22 @@ int migrate_decompress_threads(void)
return s->parameters.decompress_threads;
}
+bool migrate_direct_io(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ /* For now O_DIRECT is only supported with fixed-ram */
+ if (!s->capabilities[MIGRATION_CAPABILITY_FIXED_RAM]) {
+ return false;
+ }
+
+ if (s->parameters.has_direct_io) {
+ return s->parameters.direct_io;
+ }
+
+ return false;
+}
+
uint64_t migrate_downtime_limit(void)
{
MigrationState *s = migrate_get_current();
@@ -1052,6 +1068,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->has_mode = true;
params->mode = s->parameters.mode;
+ if (s->parameters.has_direct_io) {
+ params->has_direct_io = true;
+ params->direct_io = s->parameters.direct_io;
+ }
+
return params;
}
@@ -1087,6 +1108,7 @@ void migrate_params_init(MigrationParameters *params)
params->has_x_vcpu_dirty_limit_period = true;
params->has_vcpu_dirty_limit = true;
params->has_mode = true;
+ params->has_direct_io = qemu_has_direct_io();
}
/*
@@ -1388,6 +1410,10 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_mode) {
dest->mode = params->mode;
}
+
+ if (params->has_direct_io) {
+ dest->direct_io = params->direct_io;
+ }
}
static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1532,6 +1558,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_mode) {
s->parameters.mode = params->mode;
}
+
+ if (params->has_direct_io) {
+ s->parameters.direct_io = params->direct_io;
+ }
}
void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/migration/options.h b/migration/options.h
index 84628a76e8..9fbbf30168 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -81,6 +81,7 @@ uint8_t migrate_cpu_throttle_increment(void);
uint8_t migrate_cpu_throttle_initial(void);
bool migrate_cpu_throttle_tailslow(void);
int migrate_decompress_threads(void);
+bool migrate_direct_io(void);
uint64_t migrate_downtime_limit(void);
uint8_t migrate_max_cpu_throttle(void);
uint64_t migrate_max_bandwidth(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 3b93e13743..1d38619842 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -878,6 +878,9 @@
# @mode: Migration mode. See description in @MigMode. Default is 'normal'.
# (Since 8.2)
#
+# @direct-io: Open migration files with O_DIRECT when possible. This
+# requires that the 'fixed-ram' capability is enabled. (since 9.0)
+#
# Features:
#
# @deprecated: Member @block-incremental is deprecated. Use
@@ -911,7 +914,8 @@
'block-bitmap-mapping',
{ 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
'vcpu-dirty-limit',
- 'mode'] }
+ 'mode',
+ 'direct-io'] }
##
# @MigrateSetParameters:
@@ -1066,6 +1070,9 @@
# @mode: Migration mode. See description in @MigMode. Default is 'normal'.
# (Since 8.2)
#
+# @direct-io: Open migration files with O_DIRECT when possible. This
+# requires that the 'fixed-ram' capability is enabled. (since 9.0)
+#
# Features:
#
# @deprecated: Member @block-incremental is deprecated. Use
@@ -1119,7 +1126,8 @@
'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
'features': [ 'unstable' ] },
'*vcpu-dirty-limit': 'uint64',
- '*mode': 'MigMode'} }
+ '*mode': 'MigMode',
+ '*direct-io': 'bool' } }
##
# @migrate-set-parameters:
@@ -1294,6 +1302,9 @@
# @mode: Migration mode. See description in @MigMode. Default is 'normal'.
# (Since 8.2)
#
+# @direct-io: Open migration files with O_DIRECT when possible. This
+# requires that the 'fixed-ram' capability is enabled. (since 9.0)
+#
# Features:
#
# @deprecated: Member @block-incremental is deprecated. Use
@@ -1344,7 +1355,8 @@
'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
'features': [ 'unstable' ] },
'*vcpu-dirty-limit': 'uint64',
- '*mode': 'MigMode'} }
+ '*mode': 'MigMode',
+ '*direct-io': 'bool' } }
##
# @query-migrate-parameters:
diff --git a/util/osdep.c b/util/osdep.c
index e996c4744a..d0227a60ab 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -277,6 +277,15 @@ int qemu_lock_fd_test(int fd, int64_t start, int64_t len, bool exclusive)
}
#endif
+bool qemu_has_direct_io(void)
+{
+#ifdef O_DIRECT
+ return true;
+#else
+ return false;
+#endif
+}
+
static int qemu_open_cloexec(const char *name, int flags, mode_t mode)
{
int ret;
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 23/30] migration: Add direct-io parameter
2023-11-27 20:26 ` [RFC PATCH v3 23/30] migration: Add direct-io parameter Fabiano Rosas
@ 2023-12-22 10:38 ` Markus Armbruster
0 siblings, 0 replies; 95+ messages in thread
From: Markus Armbruster @ 2023-12-22 10:38 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Eric Blake
Fabiano Rosas <farosas@suse.de> writes:
> Add the direct-io migration parameter that tells the migration code to
> use O_DIRECT when opening the migration stream file whenever possible.
Why is that useful?
> This is currently only used with fixed-ram migration for the multifd
> channels that transfer the RAM pages. Those channels only transfer the
> pages and are guaranteed to perform aligned writes.
>
> However the parameter could be made to affect other types of
> file-based migrations in the future.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
[...]
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 3b93e13743..1d38619842 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -878,6 +878,9 @@
> # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> # (Since 8.2)
> #
> +# @direct-io: Open migration files with O_DIRECT when possible. This
> +# requires that the 'fixed-ram' capability is enabled. (since 9.0)
Two spaces between sentences for consistency, please.
> +#
> # Features:
> #
> # @deprecated: Member @block-incremental is deprecated. Use
> @@ -911,7 +914,8 @@
> 'block-bitmap-mapping',
> { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
> 'vcpu-dirty-limit',
> - 'mode'] }
> + 'mode',
> + 'direct-io'] }
>
> ##
> # @MigrateSetParameters:
> @@ -1066,6 +1070,9 @@
> # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> # (Since 8.2)
> #
> +# @direct-io: Open migration files with O_DIRECT when possible. This
> +# requires that the 'fixed-ram' capability is enabled. (since 9.0)
> +#
> # Features:
> #
> # @deprecated: Member @block-incremental is deprecated. Use
> @@ -1119,7 +1126,8 @@
> '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
> 'features': [ 'unstable' ] },
> '*vcpu-dirty-limit': 'uint64',
> - '*mode': 'MigMode'} }
> + '*mode': 'MigMode',
> + '*direct-io': 'bool' } }
>
> ##
> # @migrate-set-parameters:
> @@ -1294,6 +1302,9 @@
> # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> # (Since 8.2)
> #
> +# @direct-io: Open migration files with O_DIRECT when possible. This
> +# requires that the 'fixed-ram' capability is enabled. (since 9.0)
> +#
> # Features:
> #
> # @deprecated: Member @block-incremental is deprecated. Use
> @@ -1344,7 +1355,8 @@
> '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
> 'features': [ 'unstable' ] },
> '*vcpu-dirty-limit': 'uint64',
> - '*mode': 'MigMode'} }
> + '*mode': 'MigMode',
> + '*direct-io': 'bool' } }
>
> ##
> # @query-migrate-parameters:
[...]
^ permalink raw reply [flat|nested] 95+ messages in thread
* [RFC PATCH v3 24/30] tests/qtest: Add a test for migration with direct-io and multifd
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (22 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 23/30] migration: Add direct-io parameter Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 25/30] monitor: Honor QMP request for fd removal immediately Fabiano Rosas
` (6 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini
The test is only allowed to run on systems that know about O_DIRECT
and on filesystems which support it.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
- added ifdefs for O_DIRECT and a probing function
---
tests/qtest/migration-helpers.c | 39 +++++++++++++++++++++++++++++++++
tests/qtest/migration-helpers.h | 1 +
tests/qtest/migration-test.c | 35 +++++++++++++++++++++++++++++
3 files changed, 75 insertions(+)
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 24fb7b3525..02b92f0cb6 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -292,3 +292,42 @@ char *resolve_machine_version(const char *alias, const char *var1,
return find_common_machine_version(machine_name, var1, var2);
}
+
+#ifdef O_DIRECT
+/*
+ * Probe for O_DIRECT support on the filesystem. Since this is used
+ * for tests, be conservative, if anything fails, assume it's
+ * unsupported.
+ */
+bool probe_o_direct_support(const char *tmpfs)
+{
+ g_autofree char *filename = g_strdup_printf("%s/probe-o-direct", tmpfs);
+ int fd, flags = O_CREAT | O_RDWR | O_DIRECT;
+ void *buf;
+ ssize_t ret, len;
+ uint64_t offset;
+
+ fd = open(filename, flags, 0660);
+ if (fd < 0) {
+ unlink(filename);
+ return false;
+ }
+
+ /*
+ * Assuming 4k should be enough to satisfy O_DIRECT alignment
+ * requirements. The migration code uses 1M to be conservative.
+ */
+ len = 0x100000;
+ offset = 0x100000;
+
+ buf = g_malloc0(len);
+ ret = pwrite(fd, buf, len, offset);
+ unlink(filename);
+
+ if (ret < 0) {
+ return false;
+ }
+
+ return true;
+}
+#endif
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index e31dc85cc7..15df009d35 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -47,4 +47,5 @@ char *find_common_machine_version(const char *mtype, const char *var1,
const char *var2);
char *resolve_machine_version(const char *alias, const char *var1,
const char *var2);
+bool probe_o_direct_support(const char *tmpfs);
#endif /* MIGRATION_HELPERS_H */
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5c5725687c..192b8ec993 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2222,6 +2222,36 @@ static void test_multifd_file_fixed_ram(void)
test_file_common(&args, true);
}
+#ifdef O_DIRECT
+static void *migrate_multifd_fixed_ram_dio_start(QTestState *from,
+ QTestState *to)
+{
+ migrate_multifd_fixed_ram_start(from, to);
+
+ migrate_set_parameter_bool(from, "direct-io", true);
+ migrate_set_parameter_bool(to, "direct-io", true);
+
+ return NULL;
+}
+
+static void test_multifd_file_fixed_ram_dio(void)
+{
+ g_autofree char *uri = g_strdup_printf("file:%s/%s", tmpfs,
+ FILE_TEST_FILENAME);
+ MigrateCommon args = {
+ .connect_uri = uri,
+ .listen_uri = "defer",
+ .start_hook = migrate_multifd_fixed_ram_dio_start,
+ };
+
+ if (!probe_o_direct_support(tmpfs)) {
+ g_test_skip("Filesystem does not support O_DIRECT");
+ return;
+ }
+
+ test_file_common(&args, true);
+}
+#endif
static void test_precopy_tcp_plain(void)
{
@@ -3476,6 +3506,11 @@ int main(int argc, char **argv)
qtest_add_func("/migration/multifd/file/fixed-ram/live",
test_multifd_file_fixed_ram_live);
+#ifdef O_DIRECT
+ qtest_add_func("/migration/multifd/file/fixed-ram/dio",
+ test_multifd_file_fixed_ram_dio);
+#endif
+
#ifdef CONFIG_GNUTLS
qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 25/30] monitor: Honor QMP request for fd removal immediately
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (23 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 24/30] tests/qtest: Add a test for migration with direct-io and multifd Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 26/30] monitor: Extract fdset fd flags comparison into a function Fabiano Rosas
` (5 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
We're currently only removing an fd from the fdset if the VM is
running. This causes a QMP call to "remove-fd" to not actually remove
the fd if the VM happens to be stopped.
While the fd would eventually be removed when monitor_fdset_cleanup()
is called again, the user request should be honored and the fd
actually removed. Calling remove-fd + query-fdsets shows a recently
removed fd still present.
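For illustration, a QMP reproducer while the VM is stopped (the fdset
id and fd number are arbitrary); before this patch, the last command
would still list the removed fd:
{ "execute": "stop" }
{ "execute": "remove-fd", "arguments": { "fdset-id": 1, "fd": 42 } }
{ "execute": "query-fdsets" }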
The runstate_is_running() check was introduced by commit ebe52b592d
("monitor: Prevent removing fd from set during init"), whose shortlog
indicates that it was trying to avoid removing a yet-unduplicated fd
too early.
I don't see why an fd explicitly removed with qmp_remove_fd() should
be gated on runstate_is_running(). I'm assuming this was a mistake
when adding the parentheses around the expression.
Move the runstate_is_running() check to apply only to the
QLIST_EMPTY(dup_fds) side of the expression and ignore it when
mon_fdset_fd->removed has been explicitly set.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
monitor/fds.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/monitor/fds.c b/monitor/fds.c
index d86c2c674c..4ec3b7eea9 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -173,9 +173,9 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset)
MonFdsetFd *mon_fdset_fd_next;
QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) {
- if ((mon_fdset_fd->removed ||
- (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0)) &&
- runstate_is_running()) {
+ if (mon_fdset_fd->removed ||
+ (QLIST_EMPTY(&mon_fdset->dup_fds) && mon_refcount == 0 &&
+ runstate_is_running())) {
close(mon_fdset_fd->fd);
g_free(mon_fdset_fd->opaque);
QLIST_REMOVE(mon_fdset_fd, next);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 26/30] monitor: Extract fdset fd flags comparison into a function
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (24 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 25/30] monitor: Honor QMP request for fd removal immediately Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 27/30] monitor: fdset: Match against O_DIRECT Fabiano Rosas
` (4 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
We're about to add one more condition to the flags comparison that
requires an ifdef. Move the code into a separate function now to make
it cleaner after the next patch.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
monitor/fds.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/monitor/fds.c b/monitor/fds.c
index 4ec3b7eea9..9a28e4b72b 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -406,6 +406,19 @@ AddfdInfo *monitor_fdset_add_fd(int fd, bool has_fdset_id, int64_t fdset_id,
return fdinfo;
}
+#ifndef _WIN32
+static bool monitor_fdset_flags_match(int flags, int fd_flags)
+{
+ bool match = false;
+
+ if ((flags & O_ACCMODE) == (fd_flags & O_ACCMODE)) {
+ match = true;
+ }
+
+ return match;
+}
+#endif
+
int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
{
#ifdef _WIN32
@@ -431,7 +444,7 @@ int monitor_fdset_dup_fd_add(int64_t fdset_id, int flags)
return -1;
}
- if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
+ if (monitor_fdset_flags_match(flags, mon_fd_flags)) {
fd = mon_fdset_fd->fd;
break;
}
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 27/30] monitor: fdset: Match against O_DIRECT
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (25 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 26/30] monitor: Extract fdset fd flags comparison into a function Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 28/30] docs/devel/migration.rst: Document the file transport Fabiano Rosas
` (3 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
We're about to enable the use of O_DIRECT in the migration code and
due to the alignment restrictions imposed by filesystems we need to
make sure the flag is only used when doing aligned IO.
The migration will do parallel IO to different regions of a file, so
we need to use more than one file descriptor. Those cannot be obtained
by duplicating (dup()) since duplicated file descriptors share the
file status flags, including O_DIRECT. If one migration channel does
unaligned IO while another sets O_DIRECT to do aligned IO, the
filesystem would fail the unaligned operation.
The add-fd QMP command along with the fdset code are specifically
designed to allow the user to pass a set of file descriptors with
different access flags into QEMU to be later fetched by code that
needs to alternate between those flags when doing IO.
Extend the fdset matching function to behave the same with the
O_DIRECT flag.
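For illustration, a standalone C demo of the dup() semantics described
above; the file name is arbitrary and on Linux O_DIRECT needs
_GNU_SOURCE:
#define _GNU_SOURCE     /* O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    int fd = open("probe.bin", O_CREAT | O_WRONLY, 0600);
    int dupfd = dup(fd);
    /* Set O_DIRECT through one descriptor... */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_DIRECT);
    /* ...and it is visible through the duplicate as well, because
     * both descriptors share one open file description and thus the
     * file status flags. This prints "yes". */
    printf("duplicate has O_DIRECT: %s\n",
           fcntl(dupfd, F_GETFL) & O_DIRECT ? "yes" : "no");
    close(dupfd);
    close(fd);
    unlink("probe.bin");
    return 0;
}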
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
monitor/fds.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/monitor/fds.c b/monitor/fds.c
index 9a28e4b72b..42bf3eb982 100644
--- a/monitor/fds.c
+++ b/monitor/fds.c
@@ -413,6 +413,12 @@ static bool monitor_fdset_flags_match(int flags, int fd_flags)
if ((flags & O_ACCMODE) == (fd_flags & O_ACCMODE)) {
match = true;
+
+#ifdef O_DIRECT
+ if ((flags & O_DIRECT) != (fd_flags & O_DIRECT)) {
+ match = false;
+ }
+#endif
}
return match;
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 28/30] docs/devel/migration.rst: Document the file transport
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (26 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 27/30] monitor: fdset: Match against O_DIRECT Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 29/30] migration: Add support for fdset with multifd + file Fabiano Rosas
` (2 subsequent siblings)
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
When adding the support for file migration with the file: transport,
we missed adding documentation for it.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
docs/devel/migration.rst | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index eeb4fec31f..1488e5b2f9 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -41,6 +41,10 @@ over any transport.
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
passed to QEMU. QEMU doesn't care how this file descriptor is opened.
+- file migration: do the migration using a file that is passed to QEMU
+ by path. A file offset option is supported to allow a management
+ application to add its own metadata to the start of the file without
+ QEMU interference.
In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 29/30] migration: Add support for fdset with multifd + file
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (27 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 28/30] docs/devel/migration.rst: Document the file transport Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2023-11-27 20:26 ` [RFC PATCH v3 30/30] tests/qtest: Add a test for fixed-ram with passing of fds Fabiano Rosas
2024-01-11 10:50 ` [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Peter Xu
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana
Allow multifd to use an fdset when migrating to a file. This is useful
for the scenario where the management layer wants to have control over
the migration file.
By receiving the file descriptors directly, QEMU can delegate some
high level operating system operations to the management layer (such
as mandatory access control). The management layer might also want to
add its own headers before the migration stream.
Enable the "file:/dev/fdset/#" syntax for the multifd migration with
fixed-ram. The requirements for the fdset mechanism are:
On the migration source side:
- the fdset must contain two fds that are not duplicates between
themselves;
- if direct-io is to be used, exactly one of the fds must have the
O_DIRECT flag set;
- the file must be opened with WRONLY both times.
On the migration destination side:
- the fdset must contain one fd;
- the file must be opened with RDONLY.
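For illustration, the expected QMP flow on the source side then looks
roughly like this (the file descriptors themselves are passed
out-of-band, e.g. via SCM_RIGHTS on a UNIX monitor socket; the fdset
id is arbitrary):
{ "execute": "add-fd", "arguments": { "fdset-id": 1 } }    (plain fd)
{ "execute": "add-fd", "arguments": { "fdset-id": 1 } }    (fd opened with O_DIRECT)
{ "execute": "migrate", "arguments": { "uri": "file:/dev/fdset/1" } }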
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
docs/devel/migration.rst | 18 +++++++
migration/file.c | 100 ++++++++++++++++++++++++++++++++++++---
2 files changed, 112 insertions(+), 6 deletions(-)
diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
index 1488e5b2f9..096ef27ed7 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration.rst
@@ -46,6 +46,24 @@ over any transport.
application to add its own metadata to the start of the file without
QEMU interference.
+ The file migration also supports using a file that has already been
+ opened. A set of file descriptors is passed to QEMU via an "fdset"
+ (see add-fd QMP command documentation). This method allows a
+ management application to have control over the migration file
+ opening operation. There are, however, strict requirements to this
+ interface:
+
+ On the migration source side:
+ - the fdset must contain two file descriptors that are not
+ duplicates between themselves;
+ - if the direct-io capability is to be used, exactly one of the
+ file descriptors must have the O_DIRECT flag set;
+ - the file must be opened with WRONLY both times.
+
+ On the migration destination side:
+ - the fdset must contain one file descriptor;
+ - the file must be opened with RDONLY.
+
In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower. While the
diff --git a/migration/file.c b/migration/file.c
index fc5c1a45f4..4b06335a8c 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -9,11 +9,13 @@
#include "qemu/cutils.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
+#include "qapi/qapi-commands-misc.h"
#include "channel.h"
#include "file.h"
#include "migration.h"
#include "io/channel-file.h"
#include "io/channel-util.h"
+#include "monitor/monitor.h"
#include "options.h"
#include "trace.h"
@@ -21,6 +23,7 @@
static struct FileOutgoingArgs {
char *fname;
+ int64_t fdset_id;
} outgoing_args;
/* Remove the offset option from @filespec and return it in @offsetp. */
@@ -42,6 +45,84 @@ int file_parse_offset(char *filespec, uint64_t *offsetp, Error **errp)
return 0;
}
+/*
+ * If the open flags and file status flags from the file descriptors
+ * in the fdset don't match what QEMU expects, errno gets set to
+ * EACCES. Let's provide a more user-friendly message.
+ */
+static void file_fdset_error(int flags, Error **errp)
+{
+ ERRP_GUARD();
+
+ if (errno == EACCES) {
+ /* ditch the previous error */
+ error_free(*errp);
+ *errp = NULL;
+
+ error_setg(errp, "Fdset is missing a file descriptor with flags: 0x%x",
+ flags);
+ }
+}
+
+static void file_remove_fdset(void)
+{
+ if (outgoing_args.fdset_id != -1) {
+ qmp_remove_fd(outgoing_args.fdset_id, false, -1, NULL);
+ outgoing_args.fdset_id = -1;
+ }
+}
+
+/*
+ * Due to the behavior of the dup() system call, we need the fdset to
+ * have two non-duplicate fds so we can enable direct IO in the
+ * secondary channels without affecting the main channel.
+ */
+static bool file_parse_fdset(const char *filename, int64_t *fdset_id,
+ Error **errp)
+{
+ FdsetInfoList *fds_info;
+ FdsetFdInfoList *fd_info;
+ const char *fdset_id_str;
+ int nfds = 0;
+
+ *fdset_id = -1;
+
+ if (!strstart(filename, "/dev/fdset/", &fdset_id_str)) {
+ return true;
+ }
+
+ if (!migrate_multifd()) {
+ error_setg(errp, "fdset is only supported with multifd");
+ return false;
+ }
+
+ *fdset_id = qemu_parse_fd(fdset_id_str);
+
+ for (fds_info = qmp_query_fdsets(NULL); fds_info;
+ fds_info = fds_info->next) {
+
+ if (*fdset_id != fds_info->value->fdset_id) {
+ continue;
+ }
+
+ for (fd_info = fds_info->value->fds; fd_info; fd_info = fd_info->next) {
+ if (nfds++ > 2) {
+ break;
+ }
+ }
+ }
+
+ if (nfds != 2) {
+ error_setg(errp, "Outgoing migration needs two fds in the fdset, "
+ "got %d", nfds);
+ qmp_remove_fd(*fdset_id, false, -1, NULL);
+ *fdset_id = -1;
+ return false;
+ }
+
+ return true;
+}
+
static void qio_channel_file_connect_worker(QIOTask *task, gpointer opaque)
{
/* noop */
@@ -56,6 +137,7 @@ int file_send_channel_destroy(QIOChannel *ioc)
g_free(outgoing_args.fname);
outgoing_args.fname = NULL;
+ file_remove_fdset();
return 0;
}
@@ -88,6 +170,7 @@ void file_send_channel_create(QIOTaskFunc f, void *data)
task = qio_task_new(OBJECT(ioc), f, (gpointer)data, NULL);
if (!ioc) {
+ file_fdset_error(flags, &err);
qio_task_set_error(task, err);
return;
}
@@ -108,13 +191,18 @@ void file_start_outgoing_migration(MigrationState *s,
trace_migration_file_outgoing(filename);
- fioc = qio_channel_file_new_path(filename, flags, mode, errp);
- if (!fioc) {
+ if (!file_parse_fdset(filename, &outgoing_args.fdset_id, errp)) {
return;
}
outgoing_args.fname = g_strdup(filename);
+ fioc = qio_channel_file_new_path(filename, flags, mode, errp);
+ if (!fioc) {
+ file_fdset_error(flags, errp);
+ return;
+ }
+
ioc = QIO_CHANNEL(fioc);
if (offset && qio_channel_io_seek(ioc, offset, SEEK_SET, errp) < 0) {
return;
@@ -138,13 +226,14 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
QIOChannelFile *fioc = NULL;
uint64_t offset = file_args->offset;
int channels = 1;
- int i = 0, fd;
+ int i = 0, fd, flags = O_RDONLY;
trace_migration_file_incoming(filename);
- fioc = qio_channel_file_new_path(filename, O_RDONLY, 0, errp);
+ fioc = qio_channel_file_new_path(filename, flags, 0, errp);
if (!fioc) {
- goto out;
+ file_fdset_error(flags, errp);
+ return;
}
if (offset &&
@@ -168,7 +257,6 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
g_main_context_get_thread_default());
} while (++i < channels && (fioc = qio_channel_file_new_fd(fd)));
-out:
if (!fioc) {
error_setg(errp, "Error creating migration incoming channel");
return;
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* [RFC PATCH v3 30/30] tests/qtest: Add a test for fixed-ram with passing of fds
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (28 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 29/30] migration: Add support for fdset with multifd + file Fabiano Rosas
@ 2023-11-27 20:26 ` Fabiano Rosas
2024-01-11 10:50 ` [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Peter Xu
30 siblings, 0 replies; 95+ messages in thread
From: Fabiano Rosas @ 2023-11-27 20:26 UTC (permalink / raw)
To: qemu-devel
Cc: berrange, armbru, Juan Quintela, Peter Xu, Leonardo Bras,
Claudio Fontana, Thomas Huth, Laurent Vivier, Paolo Bonzini
Add a multifd test for fixed-ram with passing of fds into QEMU. This
is how libvirt will consume the feature.
There are a couple of details to the fdset mechanism:
- multifd needs two distinct file descriptors (not duplicated with
dup()) on the outgoing side so it can enable O_DIRECT only on the
channels that write with alignment. The dup() system call creates
file descriptors that share status flags, of which O_DIRECT is one.
the incoming side doesn't set O_DIRECT, so it can dup() fds and
therefore can receive only one in the fdset.
- the open() access mode flags used for the fds passed into QEMU need
to match the flags QEMU uses to open the file. Currently O_WRONLY
for src and O_RDONLY for dst.
O_DIRECT is not supported on all systems/filesystems, so run the fdset
test without O_DIRECT if that's the case. The migration code should
still work in that scenario.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
tests/qtest/migration-helpers.c | 7 ++-
tests/qtest/migration-test.c | 87 +++++++++++++++++++++++++++++++++
2 files changed, 92 insertions(+), 2 deletions(-)
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 02b92f0cb6..3013094800 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -302,7 +302,7 @@ char *resolve_machine_version(const char *alias, const char *var1,
bool probe_o_direct_support(const char *tmpfs)
{
g_autofree char *filename = g_strdup_printf("%s/probe-o-direct", tmpfs);
- int fd, flags = O_CREAT | O_RDWR | O_DIRECT;
+ int fd, flags = O_CREAT | O_RDWR | O_TRUNC | O_DIRECT;
void *buf;
ssize_t ret, len;
uint64_t offset;
@@ -320,9 +320,12 @@ bool probe_o_direct_support(const char *tmpfs)
len = 0x100000;
offset = 0x100000;
- buf = g_malloc0(len);
+ buf = aligned_alloc(len, len);
+ g_assert(buf);
+
ret = pwrite(fd, buf, len, offset);
unlink(filename);
+ g_free(buf);
if (ret < 0) {
return false;
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 192b8ec993..bb2dd805fc 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2251,8 +2251,90 @@ static void test_multifd_file_fixed_ram_dio(void)
test_file_common(&args, true);
}
+
+static void migrate_multifd_fixed_ram_fdset_dio_end(QTestState *from,
+ QTestState *to,
+ void *opaque)
+{
+ QDict *resp;
+ QList *fdsets;
+
+ /*
+ * Check that we removed the fdsets after migration, otherwise a
+ * second migration would fail due to too many fdsets.
+ */
+
+ resp = qtest_qmp(from, "{'execute': 'query-fdsets', "
+ "'arguments': {}}");
+ g_assert(qdict_haskey(resp, "return"));
+ fdsets = qdict_get_qlist(resp, "return");
+ g_assert(fdsets && qlist_empty(fdsets));
+}
+#endif /* O_DIRECT */
+
+#ifndef _WIN32
+static void *migrate_multifd_fixed_ram_fdset(QTestState *from, QTestState *to)
+{
+ g_autofree char *file = g_strdup_printf("%s/%s", tmpfs, FILE_TEST_FILENAME);
+ int fds[3];
+ int src_flags = O_CREAT | O_WRONLY;
+ int dst_flags = O_CREAT | O_RDONLY;
+
+ /* main outgoing channel: no O_DIRECT */
+ fds[0] = open(file, src_flags, 0660);
+ assert(fds[0] != -1);
+
+#ifdef O_DIRECT
+ src_flags |= O_DIRECT;
#endif
+ /* secondary outgoing channels */
+ fds[1] = open(file, src_flags, 0660);
+ assert(fds[1] != -1);
+
+ qtest_qmp_fds_assert_success(from, &fds[0], 1, "{'execute': 'add-fd', "
+ "'arguments': {'fdset-id': 1}}");
+
+ qtest_qmp_fds_assert_success(from, &fds[1], 1, "{'execute': 'add-fd', "
+ "'arguments': {'fdset-id': 1}}");
+
+ /* incoming channel */
+ fds[2] = open(file, dst_flags, 0660);
+ assert(fds[2] != -1);
+
+ qtest_qmp_fds_assert_success(to, &fds[2], 1, "{'execute': 'add-fd', "
+ "'arguments': {'fdset-id': 1}}");
+
+#ifdef O_DIRECT
+ migrate_multifd_fixed_ram_dio_start(from, to);
+#else
+ migrate_multifd_fixed_ram_start(from, to);
+#endif
+
+ return NULL;
+}
+
+static void test_multifd_file_fixed_ram_fdset(void)
+{
+ g_autofree char *uri = g_strdup_printf("file:/dev/fdset/1,offset=0x100");
+ MigrateCommon args = {
+ .connect_uri = uri,
+ .listen_uri = "defer",
+ .start_hook = migrate_multifd_fixed_ram_fdset,
+#ifdef O_DIRECT
+ .finish_hook = migrate_multifd_fixed_ram_fdset_dio_end,
+#endif
+ };
+
+ if (!probe_o_direct_support(tmpfs)) {
+ g_test_skip("Filesystem does not support O_DIRECT");
+ return;
+ }
+
+ test_file_common(&args, true);
+}
+#endif /* _WIN32 */
+
static void test_precopy_tcp_plain(void)
{
MigrateCommon args = {
@@ -3511,6 +3593,11 @@ int main(int argc, char **argv)
test_multifd_file_fixed_ram_dio);
#endif
+#ifndef _WIN32
+ qtest_add_func("/migration/multifd/file/fixed-ram/fdset",
+ test_multifd_file_fixed_ram_fdset);
+#endif
+
#ifdef CONFIG_GNUTLS
qtest_add_func("/migration/precopy/unix/tls/psk",
test_precopy_unix_tls_psk);
--
2.35.3
^ permalink raw reply related [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2023-11-27 20:25 [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Fabiano Rosas
` (29 preceding siblings ...)
2023-11-27 20:26 ` [RFC PATCH v3 30/30] tests/qtest: Add a test for fixed-ram with passing of fds Fabiano Rosas
@ 2024-01-11 10:50 ` Peter Xu
2024-01-11 18:38 ` Fabiano Rosas
30 siblings, 1 reply; 95+ messages in thread
From: Peter Xu @ 2024-01-11 10:50 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Juan Quintela, Leonardo Bras,
Claudio Fontana
On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
> Hi,
>
> In this v3:
>
> Added support for the "file:/dev/fdset/" syntax to receive multiple
> file descriptors. This allows the management layer to open the
> migration file beforehand and pass the file descriptors to QEMU. We
> need more than one fd to be able to use O_DIRECT concurrently with
> unaligned writes.
>
> Dropped the auto-pause capability. That discussion was kind of
> stuck. We can revisit optimizations for non-live scenarios once the
> series is more mature/merged.
>
> Changed the multifd incoming side to use a more generic data structure
> instead of MultiFDPages_t. This allows multifd to restore the ram
> using larger chunks.
>
> The rest are minor changes, I have noted them in the patches
> themselves.
Fabiano,
Could you always keep a section around in the cover letter (and also in the
upcoming doc file fixed-ram.rst) on the benefits of this feature?
Please bear with me - I may start asking silly questions.
I thought it was about "keeping the snapshot file small". But then,
when I was thinking about the use case, IIUC fixed-ram migration should
always suggest that the user stop the VM before migration starts, and
if the VM is stopped the resulting image shouldn't be large either.
Or is it about performance only? What did I miss?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2024-01-11 10:50 ` [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram Peter Xu
@ 2024-01-11 18:38 ` Fabiano Rosas
2024-01-15 6:22 ` Peter Xu
0 siblings, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-11 18:38 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, berrange, armbru, Leonardo Bras, Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
>> Hi,
>>
>> In this v3:
>>
>> Added support for the "file:/dev/fdset/" syntax to receive multiple
>> file descriptors. This allows the management layer to open the
>> migration file beforehand and pass the file descriptors to QEMU. We
>> need more than one fd to be able to use O_DIRECT concurrently with
>> unaligned writes.
>>
>> Dropped the auto-pause capability. That discussion was kind of
>> stuck. We can revisit optimizations for non-live scenarios once the
>> series is more mature/merged.
>>
>> Changed the multifd incoming side to use a more generic data structure
>> instead of MultiFDPages_t. This allows multifd to restore the ram
>> using larger chunks.
>>
>> The rest are minor changes, I have noted them in the patches
>> themselves.
>
> Fabiano,
>
> Could you always keep a section around in the cover letter (and also in the
> upcoming doc file fixed-ram.rst) on the benefits of this feature?
>
> Please bear with me - I may start asking silly questions.
>
That's fine. Ask away!
> I thought it was about "keeping the snapshot file small". But then,
> when I was thinking about the use case, IIUC fixed-ram migration should
> always suggest that the user stop the VM before migration starts, and
> if the VM is stopped the resulting image shouldn't be large either.
>
> Or is it about performance only? What did I miss?
Performance is the main benefit because fixed-ram enables the use of
multifd for file migration which would otherwise not be
parallelizable. Using multifd has been the direction for a while, as
you know, so it makes sense.
A fast file migration is desirable because it could be used for
snapshots with a stopped VM and also to replace the "exec:cat" hack
(this last one I found out about recently, Juan mentioned it in this
thread: https://lore.kernel.org/r/87cyx5ty26.fsf@secure.mitica).
The size aspect is just an interesting property, not necessarily a
reason. It's about having the file bounded by the RAM size, so a running
guest would not produce a continuously growing file. This is in contrast
with previous experiments (libvirt code) which used a proxy to put
multifd-produced data into a file.
I'll add this^ information in a more organized manner to the docs and
cover letter. Let me know what else I need to clarify.
Some notes about fixed-ram by itself:
This series also enables fixed-ram without multifd, which would only
benefit from the size property. That is not part of our end goal
which is to have multifd + fixed-ram, but I kept it nonetheless because
it helps to debug/reason about the fixed-ram format without conflating
matters with multifd.
Fixed-ram without multifd also allows the file migration to benefit
from direct IO because the data portion of the file (pages) will be
written with alignment. This version of the series does not yet support
it, but I have a simple patch for the next version.
I also had a - perhaps naive - idea that we could merge the io code +
fixed-ram first, to expedite things and later bring in the multifd and
directio enhancements, but the review process ended up not being that
modular.
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2024-01-11 18:38 ` Fabiano Rosas
@ 2024-01-15 6:22 ` Peter Xu
2024-01-15 8:11 ` Daniel P. Berrangé
2024-01-15 19:45 ` Fabiano Rosas
0 siblings, 2 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 6:22 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Leonardo Bras, Claudio Fontana
On Thu, Jan 11, 2024 at 03:38:31PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
> >> Hi,
> >>
> >> In this v3:
> >>
> >> Added support for the "file:/dev/fdset/" syntax to receive multiple
> >> file descriptors. This allows the management layer to open the
> >> migration file beforehand and pass the file descriptors to QEMU. We
> >> need more than one fd to be able to use O_DIRECT concurrently with
> >> unaligned writes.
> >>
> >> Dropped the auto-pause capability. That discussion was kind of
> >> stuck. We can revisit optimizations for non-live scenarios once the
> >> series is more mature/merged.
> >>
> >> Changed the multifd incoming side to use a more generic data structure
> >> instead of MultiFDPages_t. This allows multifd to restore the ram
> >> using larger chunks.
> >>
> >> The rest are minor changes, I have noted them in the patches
> >> themselves.
> >
> > Fabiano,
> >
> > Could you always keep a section around in the cover letter (and also in the
> > upcoming doc file fixed-ram.rst) on the benefits of this feature?
> >
> > Please bear with me - I may start asking silly questions.
> >
>
> That's fine. Ask away!
>
> > I thought it was about "keeping the snapshot file small". But then,
> > when I was thinking about the use case, IIUC fixed-ram migration should
> > always suggest that the user stop the VM before migration starts, and
> > if the VM is stopped the resulting image shouldn't be large either.
> >
> > Or is it about performance only? What did I miss?
>
> Performance is the main benefit because fixed-ram enables the use of
> multifd for file migration which would otherwise not be
> parallelizable. Using multifd has been the direction for a while, as
> you know, so it makes sense.
>
> A fast file migration is desirable because it could be used for
> snapshots with a stopped VM and also to replace the "exec:cat" hack
> (this last one I found out about recently, Juan mentioned it in this
> thread: https://lore.kernel.org/r/87cyx5ty26.fsf@secure.mitica).
I dug through the history again, and started to remember the "live"
migration case for fixed-ram. IIUC that is what Dan mentioned in the
email below regarding the "virDomainSnapshotXXX" use case:
https://lore.kernel.org/all/ZD7MRGQ+4QsDBtKR@redhat.com/
So IIUC "stopped VM" is not always the use case?
If you agree with this, we need to document these two use cases clearly in
the doc update:
- "Migrate a VM to file, then destroy the VM"
It should be suggested to stop the VM first before triggering such
migration in this use case in the documents.
- "Take a live snapshot of the VM"
It'll be ideal if there is a portable interface to synchronously track
dirtying of guest pages, but we don't...
So fixed-ram seems to be the solution for such a portable solution for
taking live snapshot across-platforms as long as async dirty tracking
is still supported on that OS (aka KVM_GET_DIRTY_LOG). If async
tracking is not supported, snapshot cannot be done live on the OS then,
and one needs to use "snapshot-save".
For this one, IMHO it would be good to mention (from the QEMU
perspective) the existence of background-snapshot, even though libvirt
didn't support it for some reason. Currently background-snapshot lacks
the multi-threading feature (and O_DIRECT), though, so it may be less
performant than fixed-ram. However, with all those features in place I
believe it would be even more performant. Please consider mentioning
this in some detail.
>
> The size aspect is just an interesting property, not necessarily a
> reason.
See above on the 2nd "live" use case of fixed-ram. I think in that
case size still matters, because that one cannot stop the VM vcpus.
> It's about having the file bounded by the RAM size, so a running
> guest would not produce a continuously growing file. This is in contrast
> with previous experiments (libvirt code) which used a proxy to put
> multifd-produced data into a file.
>
> I'll add this^ information in a more organized manner to the docs and
> cover letter. Let me know what else I need to clarify.
Thanks.
>
> Some notes about fixed-ram by itself:
>
> This series also enables fixed-ram without multifd, which would only
> benefit from the size property. That is not part of our end goal
> which is to have multifd + fixed-ram, but I kept it nonetheless because
> it helps to debug/reason about the fixed-ram format without conflating
> matters with multifd.
Yes, makes sense.
>
> Fixed-ram without multifd also allows the file migration to benefit
> from direct IO because the data portion of the file (pages) will be
> written with alignment. This version of the series does not yet support
> it, but I have a simple patch for the next version.
>
> I also had a - perhaps naive - idea that we could merge the io code +
> fixed-ram first, to expedite things and later bring in the multifd and
> directio enhancements, but the review process ended up not being that
> modular.
What's the review process issue you're talking about?
If you can split the series, that'll definitely help me with merging.
IIRC there's complexity in passing the O_DIRECT fds around, and I'm not
sure whether that chunk can be put last, and similarly whether the
multifd bits can be split out.
One thing I just noticed is that fixed-ram seems to always be preferred
for "file:" migrations. Then can we already imply fixed-ram for "file"
URIs? I'm even wondering whether we can make it the default and drop the
fixed-ram capability: fixed-ram won't work with anything besides file,
and file won't make sense if not using offsets / fixed-ram. There's at
least one problem: we have already released 8.2 with "file:", so it
could break users who are already using "file:" there. I'm wondering
whether that would be worthwhile, considering we could then drop the
(seemingly redundant) capability. What do you think?
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2024-01-15 6:22 ` Peter Xu
@ 2024-01-15 8:11 ` Daniel P. Berrangé
2024-01-15 8:41 ` Peter Xu
2024-01-15 19:45 ` Fabiano Rosas
1 sibling, 1 reply; 95+ messages in thread
From: Daniel P. Berrangé @ 2024-01-15 8:11 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, qemu-devel, armbru, Leonardo Bras, Claudio Fontana
On Mon, Jan 15, 2024 at 02:22:47PM +0800, Peter Xu wrote:
> On Thu, Jan 11, 2024 at 03:38:31PM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> >
> > > On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
> > >> Hi,
> > >>
> > >> In this v3:
> > >>
> > >> Added support for the "file:/dev/fdset/" syntax to receive multiple
> > >> file descriptors. This allows the management layer to open the
> > >> migration file beforehand and pass the file descriptors to QEMU. We
> > >> need more than one fd to be able to use O_DIRECT concurrently with
> > >> unaligned writes.
> > >>
> > >> Dropped the auto-pause capability. That discussion was kind of
> > >> stuck. We can revisit optimizations for non-live scenarios once the
> > >> series is more mature/merged.
> > >>
> > >> Changed the multifd incoming side to use a more generic data structure
> > >> instead of MultiFDPages_t. This allows multifd to restore the ram
> > >> using larger chunks.
> > >>
> > >> The rest are minor changes, I have noted them in the patches
> > >> themselves.
> > >
> > > Fabiano,
> > >
> > > Could you always keep a section around in the cover letter (and also in the
> > > upcoming doc file fixed-ram.rst) on the benefits of this feature?
> > >
> > > Please bear with me - I may start asking silly questions.
> > >
> >
> > That's fine. Ask away!
> >
> > > I thought it was about "keeping the snapshot file small". But then,
> > > when I was thinking about the use case, IIUC fixed-ram migration should
> > > always suggest that the user stop the VM before migration starts, and
> > > if the VM is stopped the resulting image shouldn't be large either.
> > >
> > > Or is it about performance only? What did I miss?
> >
> > Performance is the main benefit because fixed-ram enables the use of
> > multifd for file migration which would otherwise not be
> > parallelizable. Using multifd has been the direction for a while, as
> > you know, so it makes sense.
> >
> > A fast file migration is desirable because it could be used for
> > snapshots with a stopped VM and also to replace the "exec:cat" hack
> > (this last one I found out about recently, Juan mentioned it in this
> > thread: https://lore.kernel.org/r/87cyx5ty26.fsf@secure.mitica).
>
> I dug through the history again, and started to remember the "live"
> migration case for fixed-ram. IIUC that is what Dan mentioned in the email
> below regarding the "virDomainSnapshotXXX" use case:
>
> https://lore.kernel.org/all/ZD7MRGQ+4QsDBtKR@redhat.com/
>
> So IIUC "stopped VM" is not always the use case?
>
> If you agree with this, we need to document these two use cases clearly in
> the doc update:
>
> - "Migrate a VM to file, then destroy the VM"
>
> The documents should suggest stopping the VM before triggering such a
> migration in this use case.
>
> - "Take a live snapshot of the VM"
>
> It would be ideal if there were a portable interface to synchronously
> track dirtying of guest pages, but we don't have one...
>
> So fixed-ram seems to be the portable solution for taking live snapshots
> across platforms, as long as async dirty tracking is supported on that OS
> (aka KVM_GET_DIRTY_LOG). If async tracking is not supported, the snapshot
> cannot be done live on that OS, and one needs to use "snapshot-save".
>
> For this one, IMHO it would be good to mention (from the QEMU perspective)
> the existence of background-snapshot, even though libvirt didn't support
> it for some reason. Currently background-snapshot lacks multi-threading
> (and O_DIRECT), though, so it may be less performant than fixed-ram.
> However, with all those features in place, I believe it would be even
> more performant. Please consider mentioning this in some detail.
>
> >
> > The size aspect is just an interesting property, not necessarily a
> > reason.
>
> See above on the 2nd "live" use case of fixed-ram. I think in that case
> size still matters, because that one cannot stop the VM vcpus.
>
> > It's about having the file bounded by the RAM size. So a running
> > guest would not produce a continuously growing file. This is in contrast
> > with previous experiments (in libvirt code) that used a proxy to put
> > multifd-produced data into a file.
> >
> > I'll add this^ information in a more organized manner to the docs and
> > cover letter. Let me know what else I need to clarify.
>
> Thanks.
>
> >
> > Some notes about fixed-ram by itself:
> >
> > This series also enables fixed-ram without multifd, which would only
> > benefit from the size property. That is not part of our end goal
> > which is to have multifd + fixed-ram, but I kept it nonetheless because
> > it helps to debug/reason about the fixed-ram format without conflating
> > matters with multifd.
>
> Yes, makes sense.
>
> >
> > Fixed-ram without multifd also allows the file migration to benefit from
> > direct I/O, because the data portion of the file (pages) will be
> > written with alignment. This version of the series does not yet support
> > it, but I have a simple patch for the next version.
> >
> > I also had a - perhaps naive - idea that we could merge the io code +
> > fixed-ram first, to expedite things and later bring in the multifd and
> > directio enhancements, but the review process ended up not being that
> > modular.
>
> What's the review process issue you're talking about?
>
> If you can split the series, that will definitely help me with merging. IIRC
> there's complexity in passing the O_DIRECT fds around, and I'm not sure
> whether that chunk can be put last, similar to splitting out the multifd bits.
>
> One thing I just noticed is that fixed-ram always seems to be preferred for
> "file:" migrations. So can we already imply fixed-ram for "file:" URIs?
>
> I'm even thinking about whether we can make it the default and drop the
> fixed-ram capability: fixed-ram won't work with anything besides file, and
> file won't make sense without offsets / fixed-ram. There's at least one
> problem: we have already released 8.2 with "file:", so it could break users
> already using "file:" there. I'm wondering whether that would be worthwhile
> if it lets us drop the (seemingly redundant..) capability. What do you think?
The 'fd' protocol should support 'fixed-ram' too if passed a seekable
FD.
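For illustration, the kind of probe that tells a seekable FD apart from a
pipe or socket is just a no-op lseek() - a minimal POSIX sketch, not the
actual QEMU helper:

    #include <stdbool.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* lseek() fails with ESPIPE on pipes and sockets but succeeds on
     * regular files, so a zero-offset seek cheaply probes seekability. */
    static bool fd_is_seekable(int fd)
    {
        return lseek(fd, 0, SEEK_CUR) != (off_t) -1;
    }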
The 'file' protocol should be able to create save images compatible with
older QEMU too IMHO.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2024-01-15 8:11 ` Daniel P. Berrangé
@ 2024-01-15 8:41 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 8:41 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, armbru, Leonardo Bras, Claudio Fontana
On Mon, Jan 15, 2024 at 08:11:40AM +0000, Daniel P. Berrangé wrote:
> On Mon, Jan 15, 2024 at 02:22:47PM +0800, Peter Xu wrote:
> > On Thu, Jan 11, 2024 at 03:38:31PM -0300, Fabiano Rosas wrote:
> > > Peter Xu <peterx@redhat.com> writes:
> > >
> > > > On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
> > > >> Hi,
> > > >>
> > > >> In this v3:
> > > >>
> > > >> Added support for the "file:/dev/fdset/" syntax to receive multiple
> > > >> file descriptors. This allows the management layer to open the
> > > >> migration file beforehand and pass the file descriptors to QEMU. We
> > > >> need more than one fd to be able to use O_DIRECT concurrently with
> > > >> unaligned writes.
> > > >>
> > > >> Dropped the auto-pause capability. That discussion was kind of
> > > >> stuck. We can revisit optimizations for non-live scenarios once the
> > > >> series is more mature/merged.
> > > >>
> > > >> Changed the multifd incoming side to use a more generic data structure
> > > >> instead of MultiFDPages_t. This allows multifd to restore the ram
> > > >> using larger chunks.
> > > >>
> > > >> The rest are minor changes, I have noted them in the patches
> > > >> themselves.
> > > >
> > > > Fabiano,
> > > >
> > > > Could you always keep a section around in the cover letter (and also in the
> > > > upcoming doc file fixed-ram.rst) on the benefits of this feature?
> > > >
> > > > Please bear with me - I may start by asking silly questions.
> > > >
> > >
> > > That's fine. Ask away!
> > >
> > > > I thought it was about "keeping the snapshot file small". But then when I
> > > > was thinking about the use case, IIUC fixed-ram migration should always
> > > > suggest that the user stop the VM before migration starts, and if the VM
> > > > is stopped the resulting image shouldn't be large either.
> > > >
> > > > Or is it about performance only? Where did I miss?
> > >
> > > Performance is the main benefit because fixed-ram enables the use of
> > > multifd for file migration which would otherwise not be
> > > parallelizable. Using multifd has been the direction for a while, as you
> > > know, so it makes sense.
> > >
> > > A fast file migration is desirable because it could be used for
> > > snapshots with a stopped vm and also to replace the "exec:cat" hack
> > > (this last one I found out about recently, Juan mentioned it in this
> > > thread: https://lore.kernel.org/r/87cyx5ty26.fsf@secure.mitica).
> >
> > I dug through the history again, and started to remember the "live"
> > migration case for fixed-ram. IIUC that is what Dan mentioned in the email
> > below regarding the "virDomainSnapshotXXX" use case:
> >
> > https://lore.kernel.org/all/ZD7MRGQ+4QsDBtKR@redhat.com/
> >
> > So IIUC "stopped VM" is not always the use case?
> >
> > If you agree with this, we need to document these two use cases clearly in
> > the doc update:
> >
> > - "Migrate a VM to file, then destroy the VM"
> >
> > The documents should suggest stopping the VM before triggering such a
> > migration in this use case.
> >
> > - "Take a live snapshot of the VM"
> >
> > It would be ideal if there were a portable interface to synchronously
> > track dirtying of guest pages, but we don't have one...
> >
> > So fixed-ram seems to be the portable solution for taking live snapshots
> > across platforms, as long as async dirty tracking is supported on that OS
> > (aka KVM_GET_DIRTY_LOG). If async tracking is not supported, the snapshot
> > cannot be done live on that OS, and one needs to use "snapshot-save".
> >
> > For this one, IMHO it would be good to mention (from the QEMU perspective)
> > the existence of background-snapshot, even though libvirt didn't support
> > it for some reason. Currently background-snapshot lacks multi-threading
> > (and O_DIRECT), though, so it may be less performant than fixed-ram.
> > However, with all those features in place, I believe it would be even
> > more performant. Please consider mentioning this in some detail.
> >
> > >
> > > The size aspect is just an interesting property, not necessarily a
> > > reason.
> >
> > See above on the 2nd "live" use case of fixed-ram. I think in that case
> > size still matters, because that one cannot stop the VM vcpus.
> >
> > > It's about having the file bounded by the RAM size. So a running
> > > guest would not produce a continuously growing file. This is in contrast
> > > with previous experiments (in libvirt code) that used a proxy to put
> > > multifd-produced data into a file.
> > >
> > > I'll add this^ information in a more organized manner to the docs and
> > > cover letter. Let me know what else I need to clarify.
> >
> > Thanks.
> >
> > >
> > > Some notes about fixed-ram by itself:
> > >
> > > This series also enables fixed-ram without multifd, which would only
> > > benefit from the size property. That is not part of our end goal
> > > which is to have multifd + fixed-ram, but I kept it nonetheless because
> > > it helps to debug/reason about the fixed-ram format without conflating
> > > matters with multifd.
> >
> > Yes, makes sense.
> >
> > >
> > > Fixed-ram without multifd also allows the file migration to benefit from
> > > direct I/O, because the data portion of the file (pages) will be
> > > written with alignment. This version of the series does not yet support
> > > it, but I have a simple patch for the next version.
> > >
> > > I also had a - perhaps naive - idea that we could merge the io code +
> > > fixed-ram first, to expedite things and later bring in the multifd and
> > > directio enhancements, but the review process ended up not being that
> > > modular.
> >
> > What's the review process issue you're talking about?
> >
> > If you can split the series, that will definitely help me with merging. IIRC
> > there's complexity in passing the O_DIRECT fds around, and I'm not sure
> > whether that chunk can be put last, similar to splitting out the multifd bits.
> >
> > One thing I just noticed is that fixed-ram always seems to be preferred for
> > "file:" migrations. So can we already imply fixed-ram for "file:" URIs?
> >
> > I'm even thinking about whether we can make it the default and drop the
> > fixed-ram capability: fixed-ram won't work with anything besides file, and
> > file won't make sense without offsets / fixed-ram. There's at least one
> > problem: we have already released 8.2 with "file:", so it could break users
> > already using "file:" there. I'm wondering whether that would be worthwhile
> > if it lets us drop the (seemingly redundant..) capability. What do you think?
>
> The 'fd' protocol should support 'fixed-ram' too if passed a seekable
> FD.
Ah ok, then the cap is still needed.
>
> The 'file' protocol should be able to create save images compatible with
> older QEMU too IMHO.
This is less of a concern, IMHO, but indeed if we have the cap anyway then
it makes sense to do so.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2024-01-15 6:22 ` Peter Xu
2024-01-15 8:11 ` Daniel P. Berrangé
@ 2024-01-15 19:45 ` Fabiano Rosas
2024-01-15 23:20 ` Peter Xu
1 sibling, 1 reply; 95+ messages in thread
From: Fabiano Rosas @ 2024-01-15 19:45 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, berrange, armbru, Leonardo Bras, Claudio Fontana
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jan 11, 2024 at 03:38:31PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
>> >> Hi,
>> >>
>> >> In this v3:
>> >>
>> >> Added support for the "file:/dev/fdset/" syntax to receive multiple
>> >> file descriptors. This allows the management layer to open the
>> >> migration file beforehand and pass the file descriptors to QEMU. We
>> >> need more than one fd to be able to use O_DIRECT concurrently with
>> >> unaligned writes.
>> >>
>> >> Dropped the auto-pause capability. That discussion was kind of
>> >> stuck. We can revisit optimizations for non-live scenarios once the
>> >> series is more mature/merged.
>> >>
>> >> Changed the multifd incoming side to use a more generic data structure
>> >> instead of MultiFDPages_t. This allows multifd to restore the ram
>> >> using larger chunks.
>> >>
>> >> The rest are minor changes, I have noted them in the patches
>> >> themselves.
>> >
>> > Fabiano,
>> >
>> > Could you always keep a section around in the cover letter (and also in the
>> > upcoming doc file fixed-ram.rst) on the benefits of this feature?
>> >
>> > Please bear with me - I may start by asking silly questions.
>> >
>>
>> That's fine. Ask away!
>>
>> > I thought it was about "keeping the snapshot file small". But then when I
>> > was thinking about the use case, IIUC fixed-ram migration should always
>> > suggest that the user stop the VM before migration starts, and if the VM
>> > is stopped the resulting image shouldn't be large either.
>> >
>> > Or is it about performance only? Where did I miss?
>>
>> Performance is the main benefit because fixed-ram enables the use of
>> multifd for file migration which would otherwise not be
>> parallelizable. Using multifd has been the direction for a while, as you
>> know, so it makes sense.
>>
>> A fast file migration is desirable because it could be used for
>> snapshots with a stopped vm and also to replace the "exec:cat" hack
>> (this last one I found out about recently, Juan mentioned it in this
>> thread: https://lore.kernel.org/r/87cyx5ty26.fsf@secure.mitica).
>
> I dug through the history again, and started to remember the "live"
> migration case for fixed-ram. IIUC that is what Dan mentioned in the email
> below regarding the "virDomainSnapshotXXX" use case:
>
> https://lore.kernel.org/all/ZD7MRGQ+4QsDBtKR@redhat.com/
>
> So IIUC "stopped VM" is not always the use case?
>
> If you agree with this, we need to document these two use cases clearly in
> the doc update:
>
> - "Migrate a VM to file, then destroy the VM"
>
> The documents should suggest stopping the VM before triggering such a
> migration in this use case.
>
> - "Take a live snapshot of the VM"
>
> It would be ideal if there were a portable interface to synchronously
> track dirtying of guest pages, but we don't have one...
>
> So fixed-ram seems to be the portable solution for taking live snapshots
> across platforms, as long as async dirty tracking is supported on that OS
> (aka KVM_GET_DIRTY_LOG). If async tracking is not supported, the snapshot
> cannot be done live on that OS, and one needs to use "snapshot-save".
>
> For this one, IMHO it would be good to mention (from the QEMU perspective)
> the existence of background-snapshot, even though libvirt didn't support
> it for some reason. Currently background-snapshot lacks multi-threading
> (and O_DIRECT), though, so it may be less performant than fixed-ram.
> However, with all those features in place, I believe it would be even
> more performant. Please consider mentioning this in some detail.
>
I'll include these in some form in the docs update.
>>
>> The size aspect is just an interesting property, not necessarily a
>> reason.
>
> See above on the 2nd "live" use case of fixed-ram. I think in that case
> size still matters, because that one cannot stop the VM vcpus.
>
>> It's about having the file bounded by the RAM size. So a running
>> guest would not produce a continuously growing file. This is in contrast
>> with previous experiments (in libvirt code) that used a proxy to put
>> multifd-produced data into a file.
>>
>> I'll add this^ information in a more organized manner to the docs and
>> cover letter. Let me know what else I need to clarify.
>
> Thanks.
>
>>
>> Some notes about fixed-ram by itself:
>>
>> This series also enables fixed-ram without multifd, which would only
>> benefit from the size property. That is not part of our end goal
>> which is to have multifd + fixed-ram, but I kept it nonetheless because
>> it helps to debug/reason about the fixed-ram format without conflating
>> matters with multifd.
>
> Yes, makes sense.
>
>>
>> Fixed-ram without multifd also allows the file migration to benefit from
>> direct I/O, because the data portion of the file (pages) will be
>> written with alignment. This version of the series does not yet support
>> it, but I have a simple patch for the next version.
>>
>> I also had a - perhaps naive - idea that we could merge the io code +
>> fixed-ram first, to expedite things and later bring in the multifd and
>> directio enhancements, but the review process ended up not being that
>> modular.
>
> What's the review process issue you're talking about?
No issue per se. I'm just mentioning that I split the series in a
certain way and no one seemed to notice. =)
Basically everything up until patch 10/30 is one chunk that is mostly
separate from multifd support (patches 11-22/30) and direct-io + fdset
(23-30/30).
>
> If you can split the series, that will definitely help me with merging. IIRC
> there's complexity in passing the O_DIRECT fds around, and I'm not sure
> whether that chunk can be put last, similar to splitting out the multifd bits.
>
The logical sequence for merging in my view would be:
1 - file: support - Steven already did that
2 - file: + fixed-ram
2a- file: + fixed-ram + direct-io (optional, I will send a patch in v4)
3 - file: + fixed-ram + multifd
4 - file: + fixed-ram + multifd + direct-io (here we get the full perf. benefits)
5 - file:/dev/fdset + fixed-ram + multifd + direct-io (here we can go
enable libvirt support)
> One thing I just noticed is that fixed-ram always seems to be preferred for
> "file:" migrations. So can we already imply fixed-ram for "file:" URIs?
>
The file URI alone is enough to replace the exec:cat trick. We'll need it
once we deprecate exec:, so that we can still debug the stream.
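For instance, with HMP the two would look roughly like this (the path is
just an example):

    (qemu) migrate "exec:cat > /tmp/vm.state"   # the exec:cat trick
    (qemu) migrate file:/tmp/vm.state           # plain file URI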
> I'm even thinking about whether we can make it the default and drop the
> fixed-ram capability: fixed-ram won't work with anything besides file, and
> file won't make sense without offsets / fixed-ram. There's at least one
> problem: we have already released 8.2 with "file:", so it could break users
> already using "file:" there. I'm wondering whether that would be worthwhile
> if it lets us drop the (seemingly redundant..) capability. What do you think?
^ permalink raw reply [flat|nested] 95+ messages in thread
* Re: [RFC PATCH v3 00/30] migration: File based migration with multifd and fixed-ram
2024-01-15 19:45 ` Fabiano Rosas
@ 2024-01-15 23:20 ` Peter Xu
0 siblings, 0 replies; 95+ messages in thread
From: Peter Xu @ 2024-01-15 23:20 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, berrange, armbru, Leonardo Bras, Claudio Fontana
On Mon, Jan 15, 2024 at 04:45:15PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Jan 11, 2024 at 03:38:31PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Mon, Nov 27, 2023 at 05:25:42PM -0300, Fabiano Rosas wrote:
> >> >> Hi,
> >> >>
> >> >> In this v3:
> >> >>
> >> >> Added support for the "file:/dev/fdset/" syntax to receive multiple
> >> >> file descriptors. This allows the management layer to open the
> >> >> migration file beforehand and pass the file descriptors to QEMU. We
> >> >> need more than one fd to be able to use O_DIRECT concurrently with
> >> >> unaligned writes.
> >> >>
> >> >> Dropped the auto-pause capability. That discussion was kind of
> >> >> stuck. We can revisit optimizations for non-live scenarios once the
> >> >> series is more mature/merged.
> >> >>
> >> >> Changed the multifd incoming side to use a more generic data structure
> >> >> instead of MultiFDPages_t. This allows multifd to restore the ram
> >> >> using larger chunks.
> >> >>
> >> >> The rest are minor changes, I have noted them in the patches
> >> >> themselves.
> >> >
> >> > Fabiano,
> >> >
> >> > Could you always keep a section around in the cover letter (and also in the
> >> > upcoming doc file fixed-ram.rst) on the benefits of this feature?
> >> >
> >> > Please bear with me - I may start by asking silly questions.
> >> >
> >>
> >> That's fine. Ask away!
> >>
> >> > I thought it was about "keeping the snapshot file small". But then when I
> >> > was thinking about the use case, IIUC fixed-ram migration should always
> >> > suggest that the user stop the VM before migration starts, and if the VM
> >> > is stopped the resulting image shouldn't be large either.
> >> >
> >> > Or is it about performance only? Where did I miss?
> >>
> >> Performance is the main benefit because fixed-ram enables the use of
> >> multifd for file migration which would otherwise not be
> >> parallelizable. Using multifd has been the direction for a while, as you
> >> know, so it makes sense.
> >>
> >> A fast file migration is desirable because it could be used for
> >> snapshots with a stopped vm and also to replace the "exec:cat" hack
> >> (this last one I found out about recently, Juan mentioned it in this
> >> thread: https://lore.kernel.org/r/87cyx5ty26.fsf@secure.mitica).
> >
> > I dug through the history again, and started to remember the "live"
> > migration case for fixed-ram. IIUC that is what Dan mentioned in the email
> > below regarding the "virDomainSnapshotXXX" use case:
> >
> > https://lore.kernel.org/all/ZD7MRGQ+4QsDBtKR@redhat.com/
> >
> > So IIUC "stopped VM" is not always the use case?
> >
> > If you agree with this, we need to document these two use cases clearly in
> > the doc update:
> >
> > - "Migrate a VM to file, then destroy the VM"
> >
> > The documents should suggest stopping the VM before triggering such a
> > migration in this use case.
> >
> > - "Take a live snapshot of the VM"
> >
> > It would be ideal if there were a portable interface to synchronously
> > track dirtying of guest pages, but we don't have one...
> >
> > So fixed-ram seems to be the portable solution for taking live snapshots
> > across platforms, as long as async dirty tracking is supported on that OS
> > (aka KVM_GET_DIRTY_LOG). If async tracking is not supported, the snapshot
> > cannot be done live on that OS, and one needs to use "snapshot-save".
> >
> > For this one, IMHO it would be good to mention (from the QEMU perspective)
> > the existence of background-snapshot, even though libvirt didn't support
> > it for some reason. Currently background-snapshot lacks multi-threading
> > (and O_DIRECT), though, so it may be less performant than fixed-ram.
> > However, with all those features in place, I believe it would be even
> > more performant. Please consider mentioning this in some detail.
> >
>
> I'll include these in some form in the docs update.
Thanks.
Fixed-ram will also need a separate file after the doc series is applied.
I'll try to prepare a pull this week, so both fixed-ram and cpr will
hopefully have a place to hold their own files under docs/devel/migration/.
PS: just in case it doesn't land that soon, feel free to fetch the
migration-next branch of my github.com/peterx/qemu repo; I only put things
there if they pass at least one round of CI, so the content should be
relatively stable, even though not fully guaranteed.
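For example, something like:

    $ git remote add peterx https://github.com/peterx/qemu
    $ git fetch peterx migration-next
    $ git checkout -b migration-next peterx/migration-next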
>
> >>
> >> The size aspect is just an interesting property, not necessarily a
> >> reason.
> >
> > See above on the 2nd "live" use case of fixed-ram. I think in that case
> > size still matters, because that one cannot stop the VM vcpus.
> >
> >> It's about having the file bounded by the RAM size. So a running
> >> guest would not produce a continuously growing file. This is in contrast
> >> with previous experiments (in libvirt code) that used a proxy to put
> >> multifd-produced data into a file.
> >>
> >> I'll add this^ information in a more organized manner to the docs and
> >> cover letter. Let me know what else I need to clarify.
> >
> > Thanks.
> >
> >>
> >> Some notes about fixed-ram by itself:
> >>
> >> This series also enables fixed-ram without multifd, which would only
> >> benefit from the size property. That is not part of our end goal
> >> which is to have multifd + fixed-ram, but I kept it nonetheless because
> >> it helps to debug/reason about the fixed-ram format without conflating
> >> matters with multifd.
> >
> > Yes, makes sense.
> >
> >>
> >> Fixed-ram without multifd also allows the file migration to benefit from
> >> direct I/O, because the data portion of the file (pages) will be
> >> written with alignment. This version of the series does not yet support
> >> it, but I have a simple patch for the next version.
> >>
> >> I also had a - perhaps naive - idea that we could merge the io code +
> >> fixed-ram first, to expedite things and later bring in the multifd and
> >> directio enhancements, but the review process ended up not being that
> >> modular.
> >
> > What's the review process issue you're talking about?
>
> No issue per se. I'm just mentioning that I split the series in a
> certain way and no one seemed to notice. =)
Oh :)
>
> Basically everything up until patch 10/30 is one chunk that is mostly
> separate from multifd support (patches 11-22/30) and direct-io + fdset
> (23-30/30).
You can describe these in the cover letter. Personally I can always merge
the initial M out of N patches when they're ready; there will be quite a
few iochannel ones in the first batch, though, so I'll check with Dan on
how the 1st chunk should go in once it reaches the ready stage.
>
> >
> > If you can split the series, that will definitely help me with merging. IIRC
> > there's complexity in passing the O_DIRECT fds around, and I'm not sure
> > whether that chunk can be put last, similar to splitting out the multifd bits.
> >
>
> The logical sequence for merging in my view would be:
>
> 1 - file: support - Steven already did that
> 2 - file: + fixed-ram
> 2a- file: + fixed-ram + direct-io (optional, I will send a patch in v4)
> 3 - file: + fixed-ram + multifd
> 4 - file: + fixed-ram + multifd + direct-io (here we get the full perf. benefits)
> 5 - file:/dev/fdset + fixed-ram + multifd + direct-io (here we can go
> enable libvirt support)
Sounds good.
Such planning is IMHO fine to put into a TODO section of
devel/migration/fixed-ram.rst if you want, especially since you already
plan to post separate series. You can start with the .rst file containing
the whole design; we can merge it first. You then remove the TODOs as the
patchsets go in.
Your call on how to do it.
>
> > One thing I just noticed is that fixed-ram always seems to be preferred for
> > "file:" migrations. So can we already imply fixed-ram for "file:" URIs?
> >
>
> The file URI alone is enough to replace the exec:cat trick. We'll need it
> once we deprecate exec:, so that we can still debug the stream.
I didn't follow up much on Juan's previous plan to deprecate exec. But
yeah, anyway, let's start with that cap.
>
> > I'm even thinking about whether we can make it the default and drop the
> > fixed-ram capability: fixed-ram won't work with anything besides file, and
> > file won't make sense without offsets / fixed-ram. There's at least one
> > problem: we have already released 8.2 with "file:", so it could break users
> > already using "file:" there. I'm wondering whether that would be worthwhile
> > if it lets us drop the (seemingly redundant..) capability. What do you think?
>
--
Peter Xu
^ permalink raw reply [flat|nested] 95+ messages in thread