* [RFC PATCH 1/7] migration/multifd: Reduce access to p->pages
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-06-21 14:42 ` Peter Xu
2024-06-20 21:21 ` [RFC PATCH 2/7] migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov Fabiano Rosas
` (6 subsequent siblings)
7 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
I'm about to replace the p->pages pointer with an opaque pointer, so
do a cleanup now to reduce direct accesses to p->pages, which makes the
next diffs cleaner.
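The pattern is to take a local MultiFDPages_t reference once at the top
of each function and use that for the remaining accesses. For instance
(an illustrative sketch of what the hunks below do, not an additional
change):

static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
{
    MultiFDPages_t *pages = p->pages;

    for (int i = 0; i < pages->normal_num; i++) {
        /* compress pages->block->host + pages->offset[i], as before */
    }
}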
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd-qpl.c | 8 +++++---
migration/multifd-uadk.c | 9 +++++----
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 13 +++++++------
5 files changed, 19 insertions(+), 15 deletions(-)
diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index 9265098ee7..f8c84c52cf 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -404,13 +404,14 @@ retry:
static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
{
QplData *qpl = p->compress_data;
+ MultiFDPages_t *pages = p->pages;
uint32_t size = p->page_size;
qpl_job *job = qpl->sw_job;
uint8_t *zbuf = qpl->zbuf;
uint8_t *buf;
- for (int i = 0; i < p->pages->normal_num; i++) {
- buf = p->pages->block->host + p->pages->offset[i];
+ for (int i = 0; i < pages->normal_num; i++) {
+ buf = pages->block->host + pages->offset[i];
multifd_qpl_prepare_comp_job(job, buf, zbuf, size);
if (qpl_execute_job(job) == QPL_STS_OK) {
multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
@@ -498,6 +499,7 @@ static void multifd_qpl_compress_pages(MultiFDSendParams *p)
static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
{
QplData *qpl = p->compress_data;
+ MultiFDPages_t *pages = p->pages;
uint32_t len = 0;
if (!multifd_send_prepare_common(p)) {
@@ -505,7 +507,7 @@ static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
}
/* The first IOV is used to store the compressed page lengths */
- len = p->pages->normal_num * sizeof(uint32_t);
+ len = pages->normal_num * sizeof(uint32_t);
multifd_qpl_fill_iov(p, (uint8_t *) qpl->zlen, len);
if (qpl->hw_avail) {
multifd_qpl_compress_pages(p);
diff --git a/migration/multifd-uadk.c b/migration/multifd-uadk.c
index d12353fb21..b8ba3cd9c1 100644
--- a/migration/multifd-uadk.c
+++ b/migration/multifd-uadk.c
@@ -174,19 +174,20 @@ static int multifd_uadk_send_prepare(MultiFDSendParams *p, Error **errp)
uint32_t hdr_size;
uint8_t *buf = uadk_data->buf;
int ret = 0;
+ MultiFDPages_t *pages = p->pages;
if (!multifd_send_prepare_common(p)) {
goto out;
}
- hdr_size = p->pages->normal_num * sizeof(uint32_t);
+ hdr_size = pages->normal_num * sizeof(uint32_t);
/* prepare the header that stores the lengths of all compressed data */
prepare_next_iov(p, uadk_data->buf_hdr, hdr_size);
- for (int i = 0; i < p->pages->normal_num; i++) {
+ for (int i = 0; i < pages->normal_num; i++) {
struct wd_comp_req creq = {
.op_type = WD_DIR_COMPRESS,
- .src = p->pages->block->host + p->pages->offset[i],
+ .src = pages->block->host + pages->offset[i],
.src_len = p->page_size,
.dst = buf,
/* Set dst_len to double the src in case compressed out >= page_size */
@@ -214,7 +215,7 @@ static int multifd_uadk_send_prepare(MultiFDSendParams *p, Error **errp)
*/
if (!uadk_data->handle || creq.dst_len >= p->page_size) {
uadk_data->buf_hdr[i] = cpu_to_be32(p->page_size);
- prepare_next_iov(p, p->pages->block->host + p->pages->offset[i],
+ prepare_next_iov(p, pages->block->host + pages->offset[i],
p->page_size);
buf += p->page_size;
}
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 2ced69487e..65f8aba5c8 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -147,7 +147,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
* with compression. zlib does not guarantee that this is safe,
* therefore copy the page before calling deflate().
*/
- memcpy(z->buf, p->pages->block->host + pages->offset[i], p->page_size);
+ memcpy(z->buf, pages->block->host + pages->offset[i], p->page_size);
zs->avail_in = p->page_size;
zs->next_in = z->buf;
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index ca17b7e310..cb6075a9a5 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -138,7 +138,7 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
if (i == pages->normal_num - 1) {
flush = ZSTD_e_flush;
}
- z->in.src = p->pages->block->host + pages->offset[i];
+ z->in.src = pages->block->host + pages->offset[i];
z->in.size = p->page_size;
z->in.pos = 0;
diff --git a/migration/multifd.c b/migration/multifd.c
index d82885fdbb..506f42e124 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -114,11 +114,11 @@ static void multifd_set_file_bitmap(MultiFDSendParams *p)
assert(pages->block);
- for (int i = 0; i < p->pages->normal_num; i++) {
+ for (int i = 0; i < pages->normal_num; i++) {
ramblock_set_file_bmap_atomic(pages->block, pages->offset[i], true);
}
- for (int i = p->pages->normal_num; i < p->pages->num; i++) {
+ for (int i = pages->normal_num; i < pages->num; i++) {
ramblock_set_file_bmap_atomic(pages->block, pages->offset[i], false);
}
}
@@ -417,7 +417,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
int i;
packet->flags = cpu_to_be32(p->flags);
- packet->pages_alloc = cpu_to_be32(p->pages->allocated);
+ packet->pages_alloc = cpu_to_be32(pages->allocated);
packet->normal_pages = cpu_to_be32(pages->normal_num);
packet->zero_pages = cpu_to_be32(zero_num);
packet->next_packet_size = cpu_to_be32(p->next_packet_size);
@@ -953,7 +953,7 @@ static void *multifd_send_thread(void *opaque)
if (migrate_mapped_ram()) {
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
- p->pages->block, &local_err);
+ pages->block, &local_err);
} else {
ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
NULL, 0, p->write_flags,
@@ -969,7 +969,7 @@ static void *multifd_send_thread(void *opaque)
stat64_add(&mig_stats.normal_pages, pages->normal_num);
stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
- multifd_pages_reset(p->pages);
+ multifd_pages_reset(pages);
p->next_packet_size = 0;
/*
@@ -1684,9 +1684,10 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
bool multifd_send_prepare_common(MultiFDSendParams *p)
{
+ MultiFDPages_t *pages = p->pages;
multifd_send_zero_page_detect(p);
- if (!p->pages->normal_num) {
+ if (!pages->normal_num) {
p->next_packet_size = 0;
return false;
}
--
2.35.3
* [RFC PATCH 2/7] migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 1/7] migration/multifd: Reduce access to p->pages Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 3/7] migration/multifd: Replace p->pages with an opaque pointer Fabiano Rosas
` (5 subsequent siblings)
7 siblings, 0 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
We want to stop dereferencing 'pages' so it can be replaced by an
opaque pointer in the next patches.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/file.c | 3 ++-
migration/file.h | 2 +-
migration/multifd.c | 2 +-
3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/migration/file.c b/migration/file.c
index ab18ba505a..07e8648c7d 100644
--- a/migration/file.c
+++ b/migration/file.c
@@ -167,12 +167,13 @@ void file_start_incoming_migration(FileMigrationArgs *file_args, Error **errp)
}
int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
- int niov, RAMBlock *block, Error **errp)
+ int niov, MultiFDPages_t *pages, Error **errp)
{
ssize_t ret = 0;
int i, slice_idx, slice_num;
uintptr_t base, next, offset;
size_t len;
+ RAMBlock *block = pages->block;
slice_idx = 0;
slice_num = 1;
diff --git a/migration/file.h b/migration/file.h
index 7699c04677..d735b623b0 100644
--- a/migration/file.h
+++ b/migration/file.h
@@ -22,6 +22,6 @@ void file_cleanup_outgoing_migration(void);
bool file_send_channel_create(gpointer opaque, Error **errp);
void file_create_incoming_channels(QIOChannel *ioc, Error **errp);
int file_write_ramblock_iov(QIOChannel *ioc, const struct iovec *iov,
- int niov, RAMBlock *block, Error **errp);
+ int niov, MultiFDPages_t *pages, Error **errp);
int multifd_file_recv_data(MultiFDRecvParams *p, Error **errp);
#endif
diff --git a/migration/multifd.c b/migration/multifd.c
index 506f42e124..58340d9d95 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -953,7 +953,7 @@ static void *multifd_send_thread(void *opaque)
if (migrate_mapped_ram()) {
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
- pages->block, &local_err);
+ pages, &local_err);
} else {
ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
NULL, 0, p->write_flags,
--
2.35.3
* [RFC PATCH 3/7] migration/multifd: Replace p->pages with an opaque pointer
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 1/7] migration/multifd: Reduce access to p->pages Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 2/7] migration/multifd: Pass in MultiFDPages_t to file_write_ramblock_iov Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 4/7] migration/multifd: Move pages accounting into multifd_send_zero_page_detect() Fabiano Rosas
` (4 subsequent siblings)
7 siblings, 0 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
We want multifd to be able to handle more types of data than just ram
pages. To start decoupling multifd from pages, replace p->pages
(MultiFDPages_t) with a new type, MultiFDSendData, that holds an opaque
pointer.
The general idea here is to isolate the functions that *need* to handle
MultiFDPages_t so they can move to multifd-ram.c in the future, while
multifd.c keeps only the core functions that handle
MultiFDSendData/MultiFDRecvData.
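In practice the clients now reach the pages through the opaque pointer,
i.e. the access pattern in the hunks below becomes (an illustrative
sketch only; some_send_prepare() is a made-up stand-in for the various
send_prepare hooks touched below):

static void some_send_prepare(MultiFDSendParams *p)
{
    MultiFDPages_t *pages = p->data->opaque;

    for (int i = 0; i < pages->normal_num; i++) {
        /* operate on pages->block->host + pages->offset[i], as before */
    }
}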
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd-zero-page.c | 2 +-
migration/multifd-zlib.c | 2 +-
migration/multifd-zstd.c | 2 +-
migration/multifd.c | 57 +++++++++++++++++++++--------------
migration/multifd.h | 13 ++++----
5 files changed, 44 insertions(+), 32 deletions(-)
diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
index e1b8370f88..1ad77462a4 100644
--- a/migration/multifd-zero-page.c
+++ b/migration/multifd-zero-page.c
@@ -46,7 +46,7 @@ static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
*/
void multifd_send_zero_page_detect(MultiFDSendParams *p)
{
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
RAMBlock *rb = pages->block;
int i = 0;
int j = pages->num - 1;
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 65f8aba5c8..e75e04d2c7 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -123,7 +123,7 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error **errp)
*/
static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
{
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
struct zlib_data *z = p->compress_data;
z_stream *zs = &z->zs;
uint32_t out_size = 0;
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index cb6075a9a5..1ba572a882 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -119,7 +119,7 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error **errp)
*/
static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
{
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
struct zstd_data *z = p->compress_data;
int ret;
uint32_t i;
diff --git a/migration/multifd.c b/migration/multifd.c
index 58340d9d95..55b31c4515 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -49,8 +49,7 @@ typedef struct {
struct {
MultiFDSendParams *params;
- /* array of pages to sent */
- MultiFDPages_t *pages;
+ MultiFDSendData *data;
/*
* Global number of generated multifd packets.
*
@@ -110,7 +109,7 @@ void multifd_send_channel_created(void)
static void multifd_set_file_bitmap(MultiFDSendParams *p)
{
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
assert(pages->block);
@@ -164,7 +163,7 @@ static void nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
static void multifd_send_prepare_iovs(MultiFDSendParams *p)
{
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
for (int i = 0; i < pages->normal_num; i++) {
p->iov[p->iovs_num].iov_base = pages->block->host + pages->offset[i];
@@ -411,7 +410,7 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
void multifd_send_fill_packet(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
uint64_t packet_num;
uint32_t zero_num = pages->num - pages->normal_num;
int i;
@@ -591,7 +590,8 @@ static bool multifd_send_pages(void)
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
- MultiFDPages_t *pages = multifd_send_state->pages;
+ MultiFDPages_t *channel_pages;
+ MultiFDSendData *data = multifd_send_state->data;
if (multifd_send_should_exit()) {
return false;
@@ -626,11 +626,14 @@ static bool multifd_send_pages(void)
* qatomic_store_release() in multifd_send_thread().
*/
smp_mb_acquire();
- assert(!p->pages->num);
- multifd_send_state->pages = p->pages;
- p->pages = pages;
+
+ channel_pages = p->data->opaque;
+ assert(!channel_pages->num);
+
+ multifd_send_state->data = p->data;
+ p->data = data;
/*
- * Making sure p->pages is setup before marking pending_job=true. Pairs
+ * Making sure p->data is setup before marking pending_job=true. Pairs
* with the qatomic_load_acquire() in multifd_send_thread().
*/
qatomic_store_release(&p->pending_job, true);
@@ -660,7 +663,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
MultiFDPages_t *pages;
retry:
- pages = multifd_send_state->pages;
+ pages = multifd_send_state->data->opaque;
/* If the queue is empty, we can already enqueue now */
if (multifd_queue_empty(pages)) {
@@ -790,8 +793,10 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
qemu_sem_destroy(&p->sem_sync);
g_free(p->name);
p->name = NULL;
- multifd_pages_clear(p->pages);
- p->pages = NULL;
+ multifd_pages_clear(p->data->opaque);
+ p->data->opaque = NULL;
+ g_free(p->data);
+ p->data = NULL;
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
@@ -808,8 +813,10 @@ static void multifd_send_cleanup_state(void)
qemu_sem_destroy(&multifd_send_state->channels_ready);
g_free(multifd_send_state->params);
multifd_send_state->params = NULL;
- multifd_pages_clear(multifd_send_state->pages);
- multifd_send_state->pages = NULL;
+ multifd_pages_clear(multifd_send_state->data->opaque);
+ multifd_send_state->data->opaque = NULL;
+ g_free(multifd_send_state->data);
+ multifd_send_state->data = NULL;
g_free(multifd_send_state);
multifd_send_state = NULL;
}
@@ -858,11 +865,13 @@ int multifd_send_sync_main(void)
{
int i;
bool flush_zero_copy;
+ MultiFDPages_t *pages;
if (!migrate_multifd()) {
return 0;
}
- if (multifd_send_state->pages->num) {
+ pages = multifd_send_state->data->opaque;
+ if (pages->num) {
if (!multifd_send_pages()) {
error_report("%s: multifd_send_pages fail", __func__);
return -1;
@@ -937,11 +946,11 @@ static void *multifd_send_thread(void *opaque)
}
/*
- * Read pending_job flag before p->pages. Pairs with the
+ * Read pending_job flag before p->data. Pairs with the
* qatomic_store_release() in multifd_send_pages().
*/
if (qatomic_load_acquire(&p->pending_job)) {
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
p->iovs_num = 0;
assert(pages->num);
@@ -953,7 +962,7 @@ static void *multifd_send_thread(void *opaque)
if (migrate_mapped_ram()) {
ret = file_write_ramblock_iov(p->c, p->iov, p->iovs_num,
- pages, &local_err);
+ p->data->opaque, &local_err);
} else {
ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
NULL, 0, p->write_flags,
@@ -973,7 +982,7 @@ static void *multifd_send_thread(void *opaque)
p->next_packet_size = 0;
/*
- * Making sure p->pages is published before saying "we're
+ * Making sure p->data is published before saying "we're
* free". Pairs with the smp_mb_acquire() in
* multifd_send_pages().
*/
@@ -1169,7 +1178,8 @@ bool multifd_send_setup(void)
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
- multifd_send_state->pages = multifd_pages_init(page_count);
+ multifd_send_state->data = g_new0(MultiFDSendData, 1);
+ multifd_send_state->data->opaque = multifd_pages_init(page_count);
qemu_sem_init(&multifd_send_state->channels_created, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
qatomic_set(&multifd_send_state->exiting, 0);
@@ -1181,7 +1191,8 @@ bool multifd_send_setup(void)
qemu_sem_init(&p->sem, 0);
qemu_sem_init(&p->sem_sync, 0);
p->id = i;
- p->pages = multifd_pages_init(page_count);
+ p->data = g_new0(MultiFDSendData, 1);
+ p->data->opaque = multifd_pages_init(page_count);
if (use_packets) {
p->packet_len = sizeof(MultiFDPacket_t)
@@ -1684,7 +1695,7 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
bool multifd_send_prepare_common(MultiFDSendParams *p)
{
- MultiFDPages_t *pages = p->pages;
+ MultiFDPages_t *pages = p->data->opaque;
multifd_send_zero_page_detect(p);
if (!pages->normal_num) {
diff --git a/migration/multifd.h b/migration/multifd.h
index 0ecd6f47d7..2029bfd80a 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -16,6 +16,7 @@
#include "ram.h"
typedef struct MultiFDRecvData MultiFDRecvData;
+typedef struct MultiFDSendData MultiFDSendData;
bool multifd_send_setup(void);
void multifd_send_shutdown(void);
@@ -89,6 +90,11 @@ struct MultiFDRecvData {
off_t file_offset;
};
+struct MultiFDSendData {
+ void *opaque;
+ size_t size;
+};
+
typedef struct {
/* Fields are only written at creating/deletion time */
/* No lock required for them, they are read only */
@@ -131,12 +137,7 @@ typedef struct {
*/
bool pending_job;
bool pending_sync;
- /* array of pages to sent.
- * The owner of 'pages' depends of 'pending_job' value:
- * pending_job == 0 -> migration_thread can use it.
- * pending_job != 0 -> multifd_channel can use it.
- */
- MultiFDPages_t *pages;
+ MultiFDSendData *data;
/* thread local variables. No locking required */
--
2.35.3
* [RFC PATCH 4/7] migration/multifd: Move pages accounting into multifd_send_zero_page_detect()
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
` (2 preceding siblings ...)
2024-06-20 21:21 ` [RFC PATCH 3/7] migration/multifd: Replace p->pages with an opaque pointer Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 5/7] migration/multifd: Isolate ram pages packet data Fabiano Rosas
` (3 subsequent siblings)
7 siblings, 0 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
All references to pages are being removed from the multifd worker
threads in order to allow multifd to deal with different payload
types.
multifd_send_zero_page_detect() is called by all multifd migration
paths that deal with pages and is the last spot where the zero and
normal page counts are adjusted. Move the pages accounting into that
function.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd-zero-page.c | 4 ++++
migration/multifd.c | 2 --
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
index 1ad77462a4..63a1c24ba8 100644
--- a/migration/multifd-zero-page.c
+++ b/migration/multifd-zero-page.c
@@ -14,6 +14,7 @@
#include "qemu/cutils.h"
#include "exec/ramblock.h"
#include "migration.h"
+#include "migration-stats.h"
#include "multifd.h"
#include "options.h"
#include "ram.h"
@@ -74,6 +75,9 @@ void multifd_send_zero_page_detect(MultiFDSendParams *p)
}
pages->normal_num = i;
+
+ stat64_add(&mig_stats.normal_pages, pages->normal_num);
+ stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
}
void multifd_recv_zero_page_process(MultiFDRecvParams *p)
diff --git a/migration/multifd.c b/migration/multifd.c
index 55b31c4515..c4a952576d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -975,8 +975,6 @@ static void *multifd_send_thread(void *opaque)
stat64_add(&mig_stats.multifd_bytes,
p->next_packet_size + p->packet_len);
- stat64_add(&mig_stats.normal_pages, pages->normal_num);
- stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
multifd_pages_reset(pages);
p->next_packet_size = 0;
--
2.35.3
* [RFC PATCH 5/7] migration/multifd: Isolate ram pages packet data
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
` (3 preceding siblings ...)
2024-06-20 21:21 ` [RFC PATCH 4/7] migration/multifd: Move pages accounting into multifd_send_zero_page_detect() Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-07-19 14:40 ` Fabiano Rosas
2024-06-20 21:21 ` [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters Fabiano Rosas
` (2 subsequent siblings)
7 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
While we cannot yet disentangle the multifd packet from page data, we
can make the code a bit cleaner by setting the page-related fields in
a separate function.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 104 +++++++++++++++++++++++++++++---------------
1 file changed, 68 insertions(+), 36 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index c4a952576d..6fe339b378 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -407,65 +407,64 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
g_free(pages);
}
-void multifd_send_fill_packet(MultiFDSendParams *p)
+static void multifd_ram_fill_packet(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
MultiFDPages_t *pages = p->data->opaque;
- uint64_t packet_num;
uint32_t zero_num = pages->num - pages->normal_num;
- int i;
- packet->flags = cpu_to_be32(p->flags);
packet->pages_alloc = cpu_to_be32(pages->allocated);
packet->normal_pages = cpu_to_be32(pages->normal_num);
packet->zero_pages = cpu_to_be32(zero_num);
- packet->next_packet_size = cpu_to_be32(p->next_packet_size);
-
- packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
- packet->packet_num = cpu_to_be64(packet_num);
if (pages->block) {
strncpy(packet->ramblock, pages->block->idstr, 256);
}
- for (i = 0; i < pages->num; i++) {
+ for (int i = 0; i < pages->num; i++) {
/* there are architectures where ram_addr_t is 32 bit */
uint64_t temp = pages->offset[i];
packet->offset[i] = cpu_to_be64(temp);
}
- p->packets_sent++;
p->total_normal_pages += pages->normal_num;
p->total_zero_pages += zero_num;
+}
- trace_multifd_send(p->id, packet_num, pages->normal_num, zero_num,
+void multifd_send_fill_packet(MultiFDSendParams *p)
+{
+ MultiFDPacket_t *packet = p->packet;
+ uint64_t packet_num;
+
+ memset(packet, 0, p->packet_len);
+
+ packet->magic = cpu_to_be32(MULTIFD_MAGIC);
+ packet->version = cpu_to_be32(MULTIFD_VERSION);
+
+ packet->flags = cpu_to_be32(p->flags);
+ packet->next_packet_size = cpu_to_be32(p->next_packet_size);
+
+ packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
+ packet->packet_num = cpu_to_be64(packet_num);
+
+ p->packets_sent++;
+
+ if (p->data) {
+ multifd_ram_fill_packet(p);
+ }
+
+ trace_multifd_send(p->id, packet_num,
+ be32_to_cpu(packet->normal_pages),
+ be32_to_cpu(packet->zero_pages),
p->flags, p->next_packet_size);
}
-static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+static int multifd_ram_unfill_packet(MultiFDRecvParams *p, Error **errp)
{
MultiFDPacket_t *packet = p->packet;
int i;
- packet->magic = be32_to_cpu(packet->magic);
- if (packet->magic != MULTIFD_MAGIC) {
- error_setg(errp, "multifd: received packet "
- "magic %x and expected magic %x",
- packet->magic, MULTIFD_MAGIC);
- return -1;
- }
-
- packet->version = be32_to_cpu(packet->version);
- if (packet->version != MULTIFD_VERSION) {
- error_setg(errp, "multifd: received packet "
- "version %u and expected version %u",
- packet->version, MULTIFD_VERSION);
- return -1;
- }
-
- p->flags = be32_to_cpu(packet->flags);
-
packet->pages_alloc = be32_to_cpu(packet->pages_alloc);
/*
* If we received a packet that is 100 times bigger than expected
@@ -494,15 +493,9 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return -1;
}
- p->next_packet_size = be32_to_cpu(packet->next_packet_size);
- p->packet_num = be64_to_cpu(packet->packet_num);
- p->packets_recved++;
p->total_normal_pages += p->normal_num;
p->total_zero_pages += p->zero_num;
- trace_multifd_recv(p->id, p->packet_num, p->normal_num, p->zero_num,
- p->flags, p->next_packet_size);
-
if (p->normal_num == 0 && p->zero_num == 0) {
return 0;
}
@@ -544,6 +537,45 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
return 0;
}
+static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
+{
+ MultiFDPacket_t *packet = p->packet;
+ int ret = 0;
+
+ packet->magic = be32_to_cpu(packet->magic);
+ if (packet->magic != MULTIFD_MAGIC) {
+ error_setg(errp, "multifd: received packet "
+ "magic %x and expected magic %x",
+ packet->magic, MULTIFD_MAGIC);
+ return -1;
+ }
+
+ packet->version = be32_to_cpu(packet->version);
+ if (packet->version != MULTIFD_VERSION) {
+ error_setg(errp, "multifd: received packet "
+ "version %u and expected version %u",
+ packet->version, MULTIFD_VERSION);
+ return -1;
+ }
+
+ p->flags = be32_to_cpu(packet->flags);
+ p->next_packet_size = be32_to_cpu(packet->next_packet_size);
+ p->packet_num = be64_to_cpu(packet->packet_num);
+ p->packets_recved++;
+
+ if (p->flags & MULTIFD_FLAG_SYNC) {
+ p->normal_num = 0;
+ p->zero_num = 0;
+ } else {
+ ret = multifd_ram_unfill_packet(p, errp);
+ }
+
+ trace_multifd_recv(p->id, p->packet_num, p->normal_num, p->zero_num,
+ p->flags, p->next_packet_size);
+
+ return ret;
+}
+
static bool multifd_send_should_exit(void)
{
return qatomic_read(&multifd_send_state->exiting);
--
2.35.3
* Re: [RFC PATCH 5/7] migration/multifd: Isolate ram pages packet data
2024-06-20 21:21 ` [RFC PATCH 5/7] migration/multifd: Isolate ram pages packet data Fabiano Rosas
@ 2024-07-19 14:40 ` Fabiano Rosas
0 siblings, 0 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-19 14:40 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
Fabiano Rosas <farosas@suse.de> writes:
> While we cannot yet disentangle the multifd packet from page data, we
> can make the code a bit cleaner by setting the page-related fields in
> a separate function.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/multifd.c | 104 +++++++++++++++++++++++++++++---------------
> 1 file changed, 68 insertions(+), 36 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index c4a952576d..6fe339b378 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -407,65 +407,64 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
> g_free(pages);
> }
>
> -void multifd_send_fill_packet(MultiFDSendParams *p)
> +static void multifd_ram_fill_packet(MultiFDSendParams *p)
> {
> MultiFDPacket_t *packet = p->packet;
> MultiFDPages_t *pages = p->data->opaque;
> - uint64_t packet_num;
> uint32_t zero_num = pages->num - pages->normal_num;
> - int i;
>
> - packet->flags = cpu_to_be32(p->flags);
> packet->pages_alloc = cpu_to_be32(pages->allocated);
> packet->normal_pages = cpu_to_be32(pages->normal_num);
> packet->zero_pages = cpu_to_be32(zero_num);
> - packet->next_packet_size = cpu_to_be32(p->next_packet_size);
> -
> - packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
> - packet->packet_num = cpu_to_be64(packet_num);
>
> if (pages->block) {
> strncpy(packet->ramblock, pages->block->idstr, 256);
> }
>
> - for (i = 0; i < pages->num; i++) {
> + for (int i = 0; i < pages->num; i++) {
> /* there are architectures where ram_addr_t is 32 bit */
> uint64_t temp = pages->offset[i];
>
> packet->offset[i] = cpu_to_be64(temp);
> }
>
> - p->packets_sent++;
> p->total_normal_pages += pages->normal_num;
> p->total_zero_pages += zero_num;
> +}
>
> - trace_multifd_send(p->id, packet_num, pages->normal_num, zero_num,
> +void multifd_send_fill_packet(MultiFDSendParams *p)
> +{
> + MultiFDPacket_t *packet = p->packet;
> + uint64_t packet_num;
> +
> + memset(packet, 0, p->packet_len);
> +
> + packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> + packet->version = cpu_to_be32(MULTIFD_VERSION);
> +
> + packet->flags = cpu_to_be32(p->flags);
> + packet->next_packet_size = cpu_to_be32(p->next_packet_size);
> +
> + packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
> + packet->packet_num = cpu_to_be64(packet_num);
> +
> + p->packets_sent++;
> +
> + if (p->data) {
This needs to be !(p->flags & MULTIFD_FLAG_SYNC). In v2 I'll add it in a
separate patch to make it clear:
-->8--
From e0dd1e0f10b6adb5d419ff68c1ef3b76d2fcf1d4 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Fri, 19 Jul 2024 11:28:33 -0300
Subject: [PATCH] migration/multifd: Don't send ram data during SYNC
Skip saving and loading any ram data in the packet in the case of a
SYNC. This fixes a shortcoming of the current code which requires a
reset of the MultiFDPages_t fields right after the previous
pending_job finishes, otherwise the very next job might be a SYNC and
multifd_send_fill_packet() will put the stale values in the packet.
By not calling multifd_ram_fill_packet(), we can stop resetting
MultiFDPages_t in the multifd core and leave that to the client code.
Actually moving the reset function is not yet done because
pages->num==0 is used by the client code to determine whether the
MultiFDPages_t needs to be flushed. The subsequent patches will
replace that with a generic flag that is not dependent on
MultiFDPages_t.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 3809890082..6e6e62d352 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -431,6 +431,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
{
MultiFDPacket_t *packet = p->packet;
uint64_t packet_num;
+ bool sync_packet = p->flags & MULTIFD_FLAG_SYNC;
memset(packet, 0, p->packet_len);
@@ -445,7 +446,9 @@ void multifd_send_fill_packet(MultiFDSendParams *p)
p->packets_sent++;
- multifd_ram_fill_packet(p);
+ if (!sync_packet) {
+ multifd_ram_fill_packet(p);
+ }
trace_multifd_send(p->id, packet_num,
be32_to_cpu(packet->normal_pages),
@@ -556,7 +559,12 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
p->packet_num = be64_to_cpu(packet->packet_num);
p->packets_recved++;
- ret = multifd_ram_unfill_packet(p, errp);
+ if (p->flags & MULTIFD_FLAG_SYNC) {
+ p->normal_num = 0;
+ p->zero_num = 0;
+ } else {
+ ret = multifd_ram_unfill_packet(p, errp);
+ }
trace_multifd_recv(p->id, p->packet_num, p->normal_num, p->zero_num,
p->flags, p->next_packet_size);
--
2.35.3
* [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
` (4 preceding siblings ...)
2024-06-20 21:21 ` [RFC PATCH 5/7] migration/multifd: Isolate ram pages packet data Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-06-27 3:27 ` Wang, Lei
2024-06-20 21:21 ` [RFC PATCH 7/7] migration/multifd: Hide multifd slots implementation Fabiano Rosas
2024-06-21 14:44 ` [RFC PATCH 0/7] migration/multifd: Introduce storage slots Maciej S. Szmigiero
7 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
Multifd currently has a simple scheduling mechanism that distributes
work to the various channels by providing the client (producer) with a
memory slot and swapping that slot with a free slot from the next idle
channel (consumer). Or graphically:
[] <-- multifd_send_state->pages
[][][][] <-- channels' p->pages pointers
1) client fills the empty slot with data:
[a]
[][][][]
2) multifd_send_pages() finds an idle channel and swaps the pointers:
[a]
[][][][]
^idle
[]
[a][][][]
3) client can immediately fill new slot with more data:
[b]
[a][][][]
4) channel processes the data, the channel slot is now free to use
again:
[b]
[][][][]
This works just fine, except that it doesn't allow different types of
payloads to be processed at the same time in different channels,
i.e. the data type of multifd_send_state->pages needs to be the same
as p->pages. For each new data type different from MultiFDPages_t that
is to be handled, this logic needs to be duplicated by adding new
fields to multifd_send_state and to the channels.
The core of the issue here is that we're using the channel parameters
(MultiFDSendParams) to hold the storage space on behalf of the multifd
client (currently ram.c). This is cumbersome because it forces us to
change multifd_send_pages() to check the data type being handled
before deciding which field to use.
One way to solve this is to detach the storage space from the multifd
channel and put it somewhere else, in control of the multifd
client. That way, multifd_send_pages() can operate on an opaque
pointer without needing to be adapted to each new data type. Implement
this logic with a new "slots" abstraction:
struct MultiFDSendData {
void *opaque;
size_t size;
}
struct MultiFDSlots {
MultiFDSendData **free; <-- what used to be p->pages
MultiFDSendData *active; <-- what used to be multifd_send_state->pages
};
Each multifd client now gets one set of slots to use. The slots are
passed into multifd_send_pages() (renamed to multifd_send). The
channels now only hold a pointer to the generic MultiFDSendData, and
after it's processed that reference can be dropped.
Or graphically:
1) client fills the active slot with data. Channels point to nothing
at this point:
[a] <-- active slot
[][][][] <-- free slots, one per-channel
[][][][] <-- channels' p->data pointers
2) multifd_send() swaps the pointers inside the client slot. Channels
still point to nothing:
[]
[a][][][]
[][][][]
3) multifd_send() finds an idle channel and updates its pointer:
[]
[a][][][]
[a][][][]
^idle
4) a second client calls multifd_send(), but with its own slots:
[] [b]
[a][][][] [][][][]
[a][][][]
5) multifd_send() does steps 2 and 3 again:
[] []
[a][][][] [][b][][]
[a][b][][]
^idle
6) The channels continue processing the data and lose/acquire the
references as multifd_send() updates them. The free lists of each
client are not affected.
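In code, the swap in steps 2) and 3) boils down to something like this
(a simplified sketch of the multifd_send() hunk below; locking, error
paths and the memory barriers are omitted, and find_idle_channel()
merely stands in for the existing round-robin search over the channels):

static bool multifd_send(MultiFDSlots *slots)
{
    MultiFDSendParams *p = find_idle_channel(); /* placeholder */
    MultiFDSendData *active_slot = slots->active;

    /* the client gets this channel's free slot to fill next */
    slots->active = slots->free[p->id];

    /* the channel gets the slot the client just filled */
    p->data = active_slot;

    /*
     * By the next time we get here the channel will have consumed it,
     * so it can already go back on this client's free list.
     */
    slots->free[p->id] = active_slot;

    qatomic_store_release(&p->pending_job, true);
    qemu_sem_post(&p->sem);

    return true;
}

The channels only ever borrow a reference; both slots remain owned by
the client's MultiFDSlots.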
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 119 +++++++++++++++++++++++++++++++-------------
migration/multifd.h | 17 +++++++
migration/ram.c | 1 +
3 files changed, 102 insertions(+), 35 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index 6fe339b378..f22a1c2e84 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -97,6 +97,30 @@ struct {
MultiFDMethods *ops;
} *multifd_recv_state;
+MultiFDSlots *multifd_allocate_slots(void *(*alloc_fn)(void),
+ void (*reset_fn)(void *),
+ void (*cleanup_fn)(void *))
+{
+ int thread_count = migrate_multifd_channels();
+ MultiFDSlots *slots = g_new0(MultiFDSlots, 1);
+
+ slots->active = g_new0(MultiFDSendData, 1);
+ slots->free = g_new0(MultiFDSendData *, thread_count);
+
+ slots->active->opaque = alloc_fn();
+ slots->active->reset = reset_fn;
+ slots->active->cleanup = cleanup_fn;
+
+ for (int i = 0; i < thread_count; i++) {
+ slots->free[i] = g_new0(MultiFDSendData, 1);
+ slots->free[i]->opaque = alloc_fn();
+ slots->free[i]->reset = reset_fn;
+ slots->free[i]->cleanup = cleanup_fn;
+ }
+
+ return slots;
+}
+
static bool multifd_use_packets(void)
{
return !migrate_mapped_ram();
@@ -313,8 +337,10 @@ void multifd_register_ops(int method, MultiFDMethods *ops)
}
/* Reset a MultiFDPages_t* object for the next use */
-static void multifd_pages_reset(MultiFDPages_t *pages)
+static void multifd_pages_reset(void *opaque)
{
+ MultiFDPages_t *pages = opaque;
+
/*
* We don't need to touch offset[] array, because it will be
* overwritten later when reused.
@@ -388,8 +414,9 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
return msg.id;
}
-static MultiFDPages_t *multifd_pages_init(uint32_t n)
+static void *multifd_pages_init(void)
{
+ uint32_t n = MULTIFD_PACKET_SIZE / qemu_target_page_size();
MultiFDPages_t *pages = g_new0(MultiFDPages_t, 1);
pages->allocated = n;
@@ -398,13 +425,24 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
return pages;
}
-static void multifd_pages_clear(MultiFDPages_t *pages)
+static void multifd_pages_clear(void *opaque)
{
+ MultiFDPages_t *pages = opaque;
+
multifd_pages_reset(pages);
pages->allocated = 0;
g_free(pages->offset);
pages->offset = NULL;
- g_free(pages);
+}
+
+/* TODO: move these to multifd-ram.c */
+MultiFDSlots *multifd_ram_send_slots;
+
+void multifd_ram_save_setup(void)
+{
+ multifd_ram_send_slots = multifd_allocate_slots(multifd_pages_init,
+ multifd_pages_reset,
+ multifd_pages_clear);
}
static void multifd_ram_fill_packet(MultiFDSendParams *p)
@@ -617,13 +655,12 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
*
* Returns true if succeed, false otherwise.
*/
-static bool multifd_send_pages(void)
+static bool multifd_send(MultiFDSlots *slots)
{
int i;
static int next_channel;
MultiFDSendParams *p = NULL; /* make happy gcc */
- MultiFDPages_t *channel_pages;
- MultiFDSendData *data = multifd_send_state->data;
+ MultiFDSendData *active_slot;
if (multifd_send_should_exit()) {
return false;
@@ -659,11 +696,24 @@ static bool multifd_send_pages(void)
*/
smp_mb_acquire();
- channel_pages = p->data->opaque;
- assert(!channel_pages->num);
+ assert(!slots->free[p->id]->size);
+
+ /*
+ * Swap the slots. The client gets a free slot to fill up for the
+ * next iteration and the channel gets the active slot for
+ * processing.
+ */
+ active_slot = slots->active;
+ slots->active = slots->free[p->id];
+ p->data = active_slot;
+
+ /*
+ * By the next time we arrive here, the channel will certainly
+ * have consumed the active slot. Put it back on the free list
+ * now.
+ */
+ slots->free[p->id] = active_slot;
- multifd_send_state->data = p->data;
- p->data = data;
/*
* Making sure p->data is setup before marking pending_job=true. Pairs
* with the qatomic_load_acquire() in multifd_send_thread().
@@ -687,6 +737,7 @@ static inline bool multifd_queue_full(MultiFDPages_t *pages)
static inline void multifd_enqueue(MultiFDPages_t *pages, ram_addr_t offset)
{
pages->offset[pages->num++] = offset;
+ multifd_ram_send_slots->active->size += qemu_target_page_size();
}
/* Returns true if enqueue successful, false otherwise */
@@ -695,7 +746,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
MultiFDPages_t *pages;
retry:
- pages = multifd_send_state->data->opaque;
+ pages = multifd_ram_send_slots->active->opaque;
/* If the queue is empty, we can already enqueue now */
if (multifd_queue_empty(pages)) {
@@ -713,7 +764,7 @@ retry:
* After flush, always retry.
*/
if (pages->block != block || multifd_queue_full(pages)) {
- if (!multifd_send_pages()) {
+ if (!multifd_send(multifd_ram_send_slots)) {
return false;
}
goto retry;
@@ -825,10 +876,12 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
qemu_sem_destroy(&p->sem_sync);
g_free(p->name);
p->name = NULL;
- multifd_pages_clear(p->data->opaque);
- p->data->opaque = NULL;
- g_free(p->data);
- p->data = NULL;
+ if (p->data) {
+ p->data->cleanup(p->data->opaque);
+ p->data->opaque = NULL;
+ /* p->data was not allocated by us, just clear the pointer */
+ p->data = NULL;
+ }
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
@@ -845,10 +898,6 @@ static void multifd_send_cleanup_state(void)
qemu_sem_destroy(&multifd_send_state->channels_ready);
g_free(multifd_send_state->params);
multifd_send_state->params = NULL;
- multifd_pages_clear(multifd_send_state->data->opaque);
- multifd_send_state->data->opaque = NULL;
- g_free(multifd_send_state->data);
- multifd_send_state->data = NULL;
g_free(multifd_send_state);
multifd_send_state = NULL;
}
@@ -897,14 +946,13 @@ int multifd_send_sync_main(void)
{
int i;
bool flush_zero_copy;
- MultiFDPages_t *pages;
if (!migrate_multifd()) {
return 0;
}
- pages = multifd_send_state->data->opaque;
- if (pages->num) {
- if (!multifd_send_pages()) {
+
+ if (multifd_ram_send_slots->active->size) {
+ if (!multifd_send(multifd_ram_send_slots)) {
error_report("%s: multifd_send_pages fail", __func__);
return -1;
}
@@ -979,13 +1027,11 @@ static void *multifd_send_thread(void *opaque)
/*
* Read pending_job flag before p->data. Pairs with the
- * qatomic_store_release() in multifd_send_pages().
+ * qatomic_store_release() in multifd_send().
*/
if (qatomic_load_acquire(&p->pending_job)) {
- MultiFDPages_t *pages = p->data->opaque;
-
p->iovs_num = 0;
- assert(pages->num);
+ assert(p->data->size);
ret = multifd_send_state->ops->send_prepare(p, &local_err);
if (ret != 0) {
@@ -1008,13 +1054,20 @@ static void *multifd_send_thread(void *opaque)
stat64_add(&mig_stats.multifd_bytes,
p->next_packet_size + p->packet_len);
- multifd_pages_reset(pages);
p->next_packet_size = 0;
+ /*
+ * The data has now been sent. Since multifd_send()
+ * already put this slot on the free list, reset the
+ * entire slot before releasing the barrier below.
+ */
+ p->data->size = 0;
+ p->data->reset(p->data->opaque);
+
/*
* Making sure p->data is published before saying "we're
* free". Pairs with the smp_mb_acquire() in
- * multifd_send_pages().
+ * multifd_send().
*/
qatomic_store_release(&p->pending_job, false);
} else {
@@ -1208,8 +1261,6 @@ bool multifd_send_setup(void)
thread_count = migrate_multifd_channels();
multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
- multifd_send_state->data = g_new0(MultiFDSendData, 1);
- multifd_send_state->data->opaque = multifd_pages_init(page_count);
qemu_sem_init(&multifd_send_state->channels_created, 0);
qemu_sem_init(&multifd_send_state->channels_ready, 0);
qatomic_set(&multifd_send_state->exiting, 0);
@@ -1221,8 +1272,6 @@ bool multifd_send_setup(void)
qemu_sem_init(&p->sem, 0);
qemu_sem_init(&p->sem_sync, 0);
p->id = i;
- p->data = g_new0(MultiFDSendData, 1);
- p->data->opaque = multifd_pages_init(page_count);
if (use_packets) {
p->packet_len = sizeof(MultiFDPacket_t)
diff --git a/migration/multifd.h b/migration/multifd.h
index 2029bfd80a..5230729077 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -17,6 +17,10 @@
typedef struct MultiFDRecvData MultiFDRecvData;
typedef struct MultiFDSendData MultiFDSendData;
+typedef struct MultiFDSlots MultiFDSlots;
+
+typedef void *(multifd_data_alloc_cb)(void);
+typedef void (multifd_data_cleanup_cb)(void *);
bool multifd_send_setup(void);
void multifd_send_shutdown(void);
@@ -93,8 +97,21 @@ struct MultiFDRecvData {
struct MultiFDSendData {
void *opaque;
size_t size;
+ /* reset the slot for reuse after successful transfer */
+ void (*reset)(void *);
+ void (*cleanup)(void *);
};
+struct MultiFDSlots {
+ MultiFDSendData **free;
+ MultiFDSendData *active;
+};
+
+MultiFDSlots *multifd_allocate_slots(void *(*alloc_fn)(void),
+ void (*reset_fn)(void *),
+ void (*cleanup_fn)(void *));
+void multifd_ram_save_setup(void);
+
typedef struct {
/* Fields are only written at creating/deletion time */
/* No lock required for them, they are read only */
diff --git a/migration/ram.c b/migration/ram.c
index ceea586b06..c33a9dcf3f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3058,6 +3058,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
migration_ops = g_malloc0(sizeof(MigrationOps));
if (migrate_multifd()) {
+ multifd_ram_save_setup();
migration_ops->ram_save_target_page = ram_save_target_page_multifd;
} else {
migration_ops->ram_save_target_page = ram_save_target_page_legacy;
--
2.35.3
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-06-20 21:21 ` [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters Fabiano Rosas
@ 2024-06-27 3:27 ` Wang, Lei
2024-06-27 14:40 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Wang, Lei @ 2024-06-27 3:27 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
On 6/21/2024 5:21, Fabiano Rosas wrote:
> Multifd currently has a simple scheduling mechanism that distributes
> work to the various channels by providing the client (producer) with a
> memory slot and swapping that slot with a free slot from the next idle
> channel (consumer). Or graphically:
>
> [] <-- multifd_send_state->pages
> [][][][] <-- channels' p->pages pointers
>
> 1) client fills the empty slot with data:
> [a]
> [][][][]
>
> 2) multifd_send_pages() finds an idle channel and swaps the pointers:
> [a]
> [][][][]
> ^idle
>
> []
> [a][][][]
>
> 3) client can immediately fill new slot with more data:
> [b]
> [a][][][]
>
> 4) channel processes the data, the channel slot is now free to use
> again:
> [b]
> [][][][]
>
> This works just fine, except that it doesn't allow different types of
> payloads to be processed at the same time in different channels,
> i.e. the data type of multifd_send_state->pages needs to be the same
> as p->pages. For each new data type different from MultiFDPages_t that
> is to be handled, this logic needs to be duplicated by adding new
> fields to multifd_send_state and to the channels.
>
> The core of the issue here is that we're using the channel parameters
> (MultiFDSendParams) to hold the storage space on behalf of the multifd
> client (currently ram.c). This is cumbersome because it forces us to
> change multifd_send_pages() to check the data type being handled
> before deciding which field to use.
>
> One way to solve this is to detach the storage space from the multifd
> channel and put it somewhere else, in control of the multifd
> client. That way, multifd_send_pages() can operate on an opaque
> pointer without needing to be adapted to each new data type. Implement
> this logic with a new "slots" abstraction:
>
> struct MultiFDSendData {
> void *opaque;
> size_t size;
> }
>
> struct MultiFDSlots {
> MultiFDSendData **free; <-- what used to be p->pages
> MultiFDSendData *active; <-- what used to be multifd_send_state->pages
> };
>
> Each multifd client now gets one set of slots to use. The slots are
> passed into multifd_send_pages() (renamed to multifd_send). The
> channels now only hold a pointer to the generic MultiFDSendData, and
> after it's processed that reference can be dropped.
>
> Or graphically:
>
> 1) client fills the active slot with data. Channels point to nothing
> at this point:
> [a] <-- active slot
> [][][][] <-- free slots, one per-channel
>
> [][][][] <-- channels' p->data pointers
>
> 2) multifd_send() swaps the pointers inside the client slot. Channels
> still point to nothing:
> []
> [a][][][]
>
> [][][][]
>
> 3) multifd_send() finds an idle channel and updates its pointer:
It seems the action "finds an idle channel" is in step 2 rather than step 3,
which means the free slot is selected based on the id of the channel found, am I
understanding correctly?
> []
> [a][][][]
>
> [a][][][]
> ^idle
>
> 4) a second client calls multifd_send(), but with its own slots:
> [] [b]
> [a][][][] [][][][]
>
> [a][][][]
>
> 5) multifd_send() does steps 2 and 3 again:
> [] []
> [a][][][] [][b][][]
>
> [a][b][][]
> ^idle
>
> 6) The channels continue processing the data and lose/acquire the
> references as multifd_send() updates them. The free lists of each
> client are not affected.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/multifd.c | 119 +++++++++++++++++++++++++++++++-------------
> migration/multifd.h | 17 +++++++
> migration/ram.c | 1 +
> 3 files changed, 102 insertions(+), 35 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 6fe339b378..f22a1c2e84 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -97,6 +97,30 @@ struct {
> MultiFDMethods *ops;
> } *multifd_recv_state;
>
> +MultiFDSlots *multifd_allocate_slots(void *(*alloc_fn)(void),
> + void (*reset_fn)(void *),
> + void (*cleanup_fn)(void *))
> +{
> + int thread_count = migrate_multifd_channels();
> + MultiFDSlots *slots = g_new0(MultiFDSlots, 1);
> +
> + slots->active = g_new0(MultiFDSendData, 1);
> + slots->free = g_new0(MultiFDSendData *, thread_count);
> +
> + slots->active->opaque = alloc_fn();
> + slots->active->reset = reset_fn;
> + slots->active->cleanup = cleanup_fn;
> +
> + for (int i = 0; i < thread_count; i++) {
> + slots->free[i] = g_new0(MultiFDSendData, 1);
> + slots->free[i]->opaque = alloc_fn();
> + slots->free[i]->reset = reset_fn;
> + slots->free[i]->cleanup = cleanup_fn;
> + }
> +
> + return slots;
> +}
> +
> static bool multifd_use_packets(void)
> {
> return !migrate_mapped_ram();
> @@ -313,8 +337,10 @@ void multifd_register_ops(int method, MultiFDMethods *ops)
> }
>
> /* Reset a MultiFDPages_t* object for the next use */
> -static void multifd_pages_reset(MultiFDPages_t *pages)
> +static void multifd_pages_reset(void *opaque)
> {
> + MultiFDPages_t *pages = opaque;
> +
> /*
> * We don't need to touch offset[] array, because it will be
> * overwritten later when reused.
> @@ -388,8 +414,9 @@ static int multifd_recv_initial_packet(QIOChannel *c, Error **errp)
> return msg.id;
> }
>
> -static MultiFDPages_t *multifd_pages_init(uint32_t n)
> +static void *multifd_pages_init(void)
> {
> + uint32_t n = MULTIFD_PACKET_SIZE / qemu_target_page_size();
> MultiFDPages_t *pages = g_new0(MultiFDPages_t, 1);
>
> pages->allocated = n;
> @@ -398,13 +425,24 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
> return pages;
> }
>
> -static void multifd_pages_clear(MultiFDPages_t *pages)
> +static void multifd_pages_clear(void *opaque)
> {
> + MultiFDPages_t *pages = opaque;
> +
> multifd_pages_reset(pages);
> pages->allocated = 0;
> g_free(pages->offset);
> pages->offset = NULL;
> - g_free(pages);
> +}
> +
> +/* TODO: move these to multifd-ram.c */
> +MultiFDSlots *multifd_ram_send_slots;
> +
> +void multifd_ram_save_setup(void)
> +{
> + multifd_ram_send_slots = multifd_allocate_slots(multifd_pages_init,
> + multifd_pages_reset,
> + multifd_pages_clear);
> }
>
> static void multifd_ram_fill_packet(MultiFDSendParams *p)
> @@ -617,13 +655,12 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
> *
> * Returns true if succeed, false otherwise.
> */
> -static bool multifd_send_pages(void)
> +static bool multifd_send(MultiFDSlots *slots)
> {
> int i;
> static int next_channel;
> MultiFDSendParams *p = NULL; /* make happy gcc */
> - MultiFDPages_t *channel_pages;
> - MultiFDSendData *data = multifd_send_state->data;
> + MultiFDSendData *active_slot;
>
> if (multifd_send_should_exit()) {
> return false;
> @@ -659,11 +696,24 @@ static bool multifd_send_pages(void)
> */
> smp_mb_acquire();
>
> - channel_pages = p->data->opaque;
> - assert(!channel_pages->num);
> + assert(!slots->free[p->id]->size);
> +
> + /*
> + * Swap the slots. The client gets a free slot to fill up for the
> + * next iteration and the channel gets the active slot for
> + * processing.
> + */
> + active_slot = slots->active;
> + slots->active = slots->free[p->id];
> + p->data = active_slot;
> +
> + /*
> + * By the next time we arrive here, the channel will certainly
> + * have consumed the active slot. Put it back on the free list
> + * now.
> + */
> + slots->free[p->id] = active_slot;
>
> - multifd_send_state->data = p->data;
> - p->data = data;
> /*
> * Making sure p->data is setup before marking pending_job=true. Pairs
> * with the qatomic_load_acquire() in multifd_send_thread().
> @@ -687,6 +737,7 @@ static inline bool multifd_queue_full(MultiFDPages_t *pages)
> static inline void multifd_enqueue(MultiFDPages_t *pages, ram_addr_t offset)
> {
> pages->offset[pages->num++] = offset;
> + multifd_ram_send_slots->active->size += qemu_target_page_size();
> }
>
> /* Returns true if enqueue successful, false otherwise */
> @@ -695,7 +746,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> MultiFDPages_t *pages;
>
> retry:
> - pages = multifd_send_state->data->opaque;
> + pages = multifd_ram_send_slots->active->opaque;
>
> /* If the queue is empty, we can already enqueue now */
> if (multifd_queue_empty(pages)) {
> @@ -713,7 +764,7 @@ retry:
> * After flush, always retry.
> */
> if (pages->block != block || multifd_queue_full(pages)) {
> - if (!multifd_send_pages()) {
> + if (!multifd_send(multifd_ram_send_slots)) {
> return false;
> }
> goto retry;
> @@ -825,10 +876,12 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> qemu_sem_destroy(&p->sem_sync);
> g_free(p->name);
> p->name = NULL;
> - multifd_pages_clear(p->data->opaque);
> - p->data->opaque = NULL;
> - g_free(p->data);
> - p->data = NULL;
> + if (p->data) {
> + p->data->cleanup(p->data->opaque);
> + p->data->opaque = NULL;
> + /* p->data was not allocated by us, just clear the pointer */
> + p->data = NULL;
> + }
> p->packet_len = 0;
> g_free(p->packet);
> p->packet = NULL;
> @@ -845,10 +898,6 @@ static void multifd_send_cleanup_state(void)
> qemu_sem_destroy(&multifd_send_state->channels_ready);
> g_free(multifd_send_state->params);
> multifd_send_state->params = NULL;
> - multifd_pages_clear(multifd_send_state->data->opaque);
> - multifd_send_state->data->opaque = NULL;
> - g_free(multifd_send_state->data);
> - multifd_send_state->data = NULL;
> g_free(multifd_send_state);
> multifd_send_state = NULL;
> }
> @@ -897,14 +946,13 @@ int multifd_send_sync_main(void)
> {
> int i;
> bool flush_zero_copy;
> - MultiFDPages_t *pages;
>
> if (!migrate_multifd()) {
> return 0;
> }
> - pages = multifd_send_state->data->opaque;
> - if (pages->num) {
> - if (!multifd_send_pages()) {
> +
> + if (multifd_ram_send_slots->active->size) {
> + if (!multifd_send(multifd_ram_send_slots)) {
> error_report("%s: multifd_send_pages fail", __func__);
> return -1;
> }
> @@ -979,13 +1027,11 @@ static void *multifd_send_thread(void *opaque)
>
> /*
> * Read pending_job flag before p->data. Pairs with the
> - * qatomic_store_release() in multifd_send_pages().
> + * qatomic_store_release() in multifd_send().
> */
> if (qatomic_load_acquire(&p->pending_job)) {
> - MultiFDPages_t *pages = p->data->opaque;
> -
> p->iovs_num = 0;
> - assert(pages->num);
> + assert(p->data->size);
>
> ret = multifd_send_state->ops->send_prepare(p, &local_err);
> if (ret != 0) {
> @@ -1008,13 +1054,20 @@ static void *multifd_send_thread(void *opaque)
> stat64_add(&mig_stats.multifd_bytes,
> p->next_packet_size + p->packet_len);
>
> - multifd_pages_reset(pages);
> p->next_packet_size = 0;
>
> + /*
> + * The data has now been sent. Since multifd_send()
> + * already put this slot on the free list, reset the
> + * entire slot before releasing the barrier below.
> + */
> + p->data->size = 0;
> + p->data->reset(p->data->opaque);
> +
> /*
> * Making sure p->data is published before saying "we're
> * free". Pairs with the smp_mb_acquire() in
> - * multifd_send_pages().
> + * multifd_send().
> */
> qatomic_store_release(&p->pending_job, false);
> } else {
> @@ -1208,8 +1261,6 @@ bool multifd_send_setup(void)
> thread_count = migrate_multifd_channels();
> multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
> multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> - multifd_send_state->data = g_new0(MultiFDSendData, 1);
> - multifd_send_state->data->opaque = multifd_pages_init(page_count);
> qemu_sem_init(&multifd_send_state->channels_created, 0);
> qemu_sem_init(&multifd_send_state->channels_ready, 0);
> qatomic_set(&multifd_send_state->exiting, 0);
> @@ -1221,8 +1272,6 @@ bool multifd_send_setup(void)
> qemu_sem_init(&p->sem, 0);
> qemu_sem_init(&p->sem_sync, 0);
> p->id = i;
> - p->data = g_new0(MultiFDSendData, 1);
> - p->data->opaque = multifd_pages_init(page_count);
>
> if (use_packets) {
> p->packet_len = sizeof(MultiFDPacket_t)
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 2029bfd80a..5230729077 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -17,6 +17,10 @@
>
> typedef struct MultiFDRecvData MultiFDRecvData;
> typedef struct MultiFDSendData MultiFDSendData;
> +typedef struct MultiFDSlots MultiFDSlots;
> +
> +typedef void *(multifd_data_alloc_cb)(void);
> +typedef void (multifd_data_cleanup_cb)(void *);
>
> bool multifd_send_setup(void);
> void multifd_send_shutdown(void);
> @@ -93,8 +97,21 @@ struct MultiFDRecvData {
> struct MultiFDSendData {
> void *opaque;
> size_t size;
> + /* reset the slot for reuse after successful transfer */
> + void (*reset)(void *);
> + void (*cleanup)(void *);
> };
>
> +struct MultiFDSlots {
> + MultiFDSendData **free;
> + MultiFDSendData *active;
> +};
> +
> +MultiFDSlots *multifd_allocate_slots(void *(*alloc_fn)(void),
> + void (*reset_fn)(void *),
> + void (*cleanup_fn)(void *));
> +void multifd_ram_save_setup(void);
> +
> typedef struct {
> /* Fields are only written at creating/deletion time */
> /* No lock required for them, they are read only */
> diff --git a/migration/ram.c b/migration/ram.c
> index ceea586b06..c33a9dcf3f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3058,6 +3058,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque, Error **errp)
> migration_ops = g_malloc0(sizeof(MigrationOps));
>
> if (migrate_multifd()) {
> + multifd_ram_save_setup();
> migration_ops->ram_save_target_page = ram_save_target_page_multifd;
> } else {
> migration_ops->ram_save_target_page = ram_save_target_page_legacy;
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-06-27 3:27 ` Wang, Lei
@ 2024-06-27 14:40 ` Peter Xu
2024-06-27 15:17 ` Peter Xu
2024-07-10 16:10 ` Fabiano Rosas
0 siblings, 2 replies; 46+ messages in thread
From: Peter Xu @ 2024-06-27 14:40 UTC (permalink / raw)
To: Wang, Lei; +Cc: Fabiano Rosas, qemu-devel, Maciej S . Szmigiero
On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> > Or graphically:
> >
> > 1) client fills the active slot with data. Channels point to nothing
> > at this point:
> > [a] <-- active slot
> > [][][][] <-- free slots, one per-channel
> >
> > [][][][] <-- channels' p->data pointers
> >
> > 2) multifd_send() swaps the pointers inside the client slot. Channels
> > still point to nothing:
> > []
> > [a][][][]
> >
> > [][][][]
> >
> > 3) multifd_send() finds an idle channel and updates its pointer:
>
> It seems the action "finds an idle channel" is in step 2 rather than step 3,
> which means the free slot is selected based on the id of the channel found, am I
> understanding correctly?
I think you're right.
Actually I also feel like the description here is ambiguous, even though I
think I get what Fabiano wanted to say.
Picking the free slot should be the first part of steps 2+3; what Fabiano
really wanted to suggest here is that we move the free buffer array from
the multifd channels into the callers, so the caller can pass in whatever
data to send.
So I think maybe it's cleaner to write it like this in code (note: I
didn't really change the code, just some ordering and comments):
===8<===
@@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
*/
active_slot = slots->active;
slots->active = slots->free[p->id];
- p->data = active_slot;
-
- /*
- * By the next time we arrive here, the channel will certainly
- * have consumed the active slot. Put it back on the free list
- * now.
- */
slots->free[p->id] = active_slot;
+ /* Assign the current active slot to the chosen thread */
+ p->data = active_slot;
===8<===
The comment I removed is slightly misleading to me too, because right now
active_slot contains data that hasn't yet been delivered to multifd, so
we're "putting it back on the free list" not because it's free, but
because we know it won't get used until the multifd send thread consumes
it (before that the thread will be busy, and we won't touch that buffer in
upcoming send()s).
And then, looking at this again, I think maybe it's a slight overkill,
and maybe we can still keep the "opaque data" managed by multifd. One
reason might be that I don't expect the "opaque data" payload to keep
growing at all: it should really be either RAM or device state, as I
commented elsewhere in a relevant thread; after all, it's a thread model
only for migration purposes, to move vmstates..
Keeping it managed by the multifd threads should involve fewer changes
than this series, but it could look like this:
typedef enum {
MULTIFD_PAYLOAD_RAM = 0,
MULTIFD_PAYLOAD_DEVICE_STATE = 1,
} MultifdPayloadType;
typedef enum {
MultiFDPages_t ram_payload;
MultifdDeviceState_t device_payload;
} MultifdPayload;
struct MultiFDSendData {
MultifdPayloadType type;
MultifdPayload data;
};
Then the "enum" makes sure the payload only consumes only the max of both
types; a side benefit to save some memory.
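For illustration, the packet-fill path on the send thread could then
dispatch on the type; a rough sketch, with made-up helper names, and
reading MultifdPayload as a union of the two payloads:

    switch (p->data->type) {
    case MULTIFD_PAYLOAD_RAM:
        /* pack the page offsets, as the ram code does today */
        multifd_fill_ram_packet(p, &p->data->data.ram_payload);
        break;
    case MULTIFD_PAYLOAD_DEVICE_STATE:
        /* pack idstr/instance_id plus the opaque state buffer */
        multifd_fill_device_state_packet(p, &p->data->data.device_payload);
        break;
    default:
        g_assert_not_reached();
    }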
I think we need to make sure MultifdDeviceState_t is generic enough so
that it will work for mostly everything (especially normal VMSDs). In this
case the VFIO series should be good, as that is currently defined as:
typedef struct {
MultiFDPacketHdr_t hdr;
char idstr[256] QEMU_NONSTRING;
uint32_t instance_id;
/* size of the next packet that contains the actual data */
uint32_t next_packet_size;
} __attribute__((packed)) MultiFDPacketDeviceState_t;
IIUC that is exactly what we need with idstr+instance_id, so as to nail
down exactly where the "opaque device state" should go, then load it with
a buffer-based loader when it's ready (starting from VFIO, to get rid of
qemufile). For VMSDs in the future, if ever possible, that should be a
modified version of vmstate_load() which may take buffers instead of
qemufiles.
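For example, the destination-side hook could have a shape like this
(purely illustrative, not an existing QEMU API):

    /*
     * Hypothetical SaveVMHandlers hook: consume an opaque device-state
     * buffer that was routed here via idstr + instance_id, without
     * going through a QEMUFile.
     */
    int (*load_state_buffer)(void *opaque, char *buf, size_t len,
                             Error **errp);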
To Maciej: please see whether above makes sense to you, and if you also
agree please consider that with your VFIO work.
Thanks,
>
> > []
> > [a][][][]
> >
> > [a][][][]
> > ^idle
--
Peter Xu
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-06-27 14:40 ` Peter Xu
@ 2024-06-27 15:17 ` Peter Xu
2024-07-10 16:10 ` Fabiano Rosas
1 sibling, 0 replies; 46+ messages in thread
From: Peter Xu @ 2024-06-27 15:17 UTC (permalink / raw)
To: Wang, Lei; +Cc: Fabiano Rosas, qemu-devel, Maciej S . Szmigiero
On Thu, Jun 27, 2024 at 10:40:11AM -0400, Peter Xu wrote:
> On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> > > Or graphically:
> > >
> > > 1) client fills the active slot with data. Channels point to nothing
> > > at this point:
> > > [a] <-- active slot
> > > [][][][] <-- free slots, one per-channel
> > >
> > > [][][][] <-- channels' p->data pointers
> > >
> > > 2) multifd_send() swaps the pointers inside the client slot. Channels
> > > still point to nothing:
> > > []
> > > [a][][][]
> > >
> > > [][][][]
> > >
> > > 3) multifd_send() finds an idle channel and updates its pointer:
> >
> > It seems the action "finds an idle channel" is in step 2 rather than step 3,
> > which means the free slot is selected based on the id of the channel found, am I
> > understanding correctly?
>
> I think you're right.
>
> Actually I also feel like the desription here is ambiguous, even though I
> think I get what Fabiano wanted to say.
>
> The free slot should be the first step of step 2+3, here what Fabiano
> really wanted to suggest is we move the free buffer array from multifd
> channels into the callers, then the caller can pass in whatever data to
> send.
>
> So I think maybe it's cleaner to write it as this in code (note: I didn't
> really change the code, just some ordering and comments):
>
> ===8<===
> @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
> */
> active_slot = slots->active;
> slots->active = slots->free[p->id];
> - p->data = active_slot;
> -
> - /*
> - * By the next time we arrive here, the channel will certainly
> - * have consumed the active slot. Put it back on the free list
> - * now.
> - */
> slots->free[p->id] = active_slot;
>
> + /* Assign the current active slot to the chosen thread */
> + p->data = active_slot;
> ===8<===
>
> The comment I removed is slightly misleading to me too, because right now
> active_slot contains the data hasn't yet been delivered to multifd, so
> we're "putting it back to free list" not because of it's free, but because
> we know it won't get used until the multifd send thread consumes it
> (because before that the thread will be busy, and we won't use the buffer
> if so in upcoming send()s).
>
> And then when I'm looking at this again, I think maybe it's a slight
> overkill, and maybe we can still keep the "opaque data" managed by multifd.
> One reason might be that I don't expect the "opaque data" payload keep
> growing at all: it should really be either RAM or device state as I
> commented elsewhere in a relevant thread, after all it's a thread model
> only for migration purpose to move vmstates..
>
> Putting it managed by multifd thread should involve less change than this
> series, but it could look like this:
>
> typedef enum {
> MULTIFD_PAYLOAD_RAM = 0,
> MULTIFD_PAYLOAD_DEVICE_STATE = 1,
> } MultifdPayloadType;
>
> typedef enum {
> MultiFDPages_t ram_payload;
> MultifdDeviceState_t device_payload;
> } MultifdPayload;
PS: please conditionally read "enum" as "union" throughout the previous
email of mine, sorry.
[I'll leave it to readers to decide when to do the replacement..]
>
> struct MultiFDSendData {
> MultifdPayloadType type;
> MultifdPayload data;
> };
>
> Then the "enum" makes sure the payload only consumes only the max of both
> types; a side benefit to save some memory.
>
> I think we need to make sure MultifdDeviceState_t is generic enough so that
> it will work for mostly everything (especially normal VMSDs). In this case
> the VFIO series should be good as that was currently defined as:
>
> typedef struct {
> MultiFDPacketHdr_t hdr;
>
> char idstr[256] QEMU_NONSTRING;
> uint32_t instance_id;
>
> /* size of the next packet that contains the actual data */
> uint32_t next_packet_size;
> } __attribute__((packed)) MultiFDPacketDeviceState_t;
>
> IIUC that was what we need exactly with idstr+instance_id, so as to nail
> exactly at where should the "opaque device state" go to, then load it with
> a buffer-based loader when it's ready (starting from VFIO, to get rid of
> qemufile). For VMSDs in the future if ever possible, that should be a
> modified version of vmstate_load() where it may take buffers not qemufiles.
>
> To Maciej: please see whether above makes sense to you, and if you also
> agree please consider that with your VFIO work.
>
> Thanks,
>
> >
> > > []
> > > [a][][][]
> > >
> > > [a][][][]
> > > ^idle
>
> --
> Peter Xu
--
Peter Xu
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-06-27 14:40 ` Peter Xu
2024-06-27 15:17 ` Peter Xu
@ 2024-07-10 16:10 ` Fabiano Rosas
2024-07-10 19:10 ` Peter Xu
1 sibling, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-10 16:10 UTC (permalink / raw)
To: Peter Xu, Wang, Lei; +Cc: qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
>> > Or graphically:
>> >
>> > 1) client fills the active slot with data. Channels point to nothing
>> > at this point:
>> > [a] <-- active slot
>> > [][][][] <-- free slots, one per-channel
>> >
>> > [][][][] <-- channels' p->data pointers
>> >
>> > 2) multifd_send() swaps the pointers inside the client slot. Channels
>> > still point to nothing:
>> > []
>> > [a][][][]
>> >
>> > [][][][]
>> >
>> > 3) multifd_send() finds an idle channel and updates its pointer:
>>
>> It seems the action "finds an idle channel" is in step 2 rather than step 3,
>> which means the free slot is selected based on the id of the channel found, am I
>> understanding correctly?
>
> I think you're right.
>
> Actually I also feel like the desription here is ambiguous, even though I
> think I get what Fabiano wanted to say.
>
> The free slot should be the first step of step 2+3, here what Fabiano
> really wanted to suggest is we move the free buffer array from multifd
> channels into the callers, then the caller can pass in whatever data to
> send.
>
> So I think maybe it's cleaner to write it as this in code (note: I didn't
> really change the code, just some ordering and comments):
>
> ===8<===
> @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
> */
> active_slot = slots->active;
> slots->active = slots->free[p->id];
> - p->data = active_slot;
> -
> - /*
> - * By the next time we arrive here, the channel will certainly
> - * have consumed the active slot. Put it back on the free list
> - * now.
> - */
> slots->free[p->id] = active_slot;
>
> + /* Assign the current active slot to the chosen thread */
> + p->data = active_slot;
> ===8<===
>
> The comment I removed is slightly misleading to me too, because right now
> active_slot contains the data hasn't yet been delivered to multifd, so
> we're "putting it back to free list" not because of it's free, but because
> we know it won't get used until the multifd send thread consumes it
> (because before that the thread will be busy, and we won't use the buffer
> if so in upcoming send()s).
>
> And then when I'm looking at this again, I think maybe it's a slight
> overkill, and maybe we can still keep the "opaque data" managed by multifd.
> One reason might be that I don't expect the "opaque data" payload keep
> growing at all: it should really be either RAM or device state as I
> commented elsewhere in a relevant thread, after all it's a thread model
> only for migration purpose to move vmstates..
Some amount of flexibility needs to be baked in. For instance, what
about the handshake procedure? Don't we want to use multifd threads to
put some information on the wire for that as well?
> Putting it managed by multifd thread should involve less change than this
> series, but it could look like this:
>
> typedef enum {
> MULTIFD_PAYLOAD_RAM = 0,
> MULTIFD_PAYLOAD_DEVICE_STATE = 1,
> } MultifdPayloadType;
>
> typedef enum {
> MultiFDPages_t ram_payload;
> MultifdDeviceState_t device_payload;
> } MultifdPayload;
>
> struct MultiFDSendData {
> MultifdPayloadType type;
> MultifdPayload data;
> };
Is that a union up there? So you want to simply allocate in multifd the
max amount of memory between the two types of payload? But then we'll
need a memset(p->data, 0, ...) at every round of sending to avoid giving
stale data from one client to another. That doesn't work with the
current ram migration because it wants p->pages to remain active across
several calls of multifd_queue_page().
>
> Then the "enum" makes sure the payload only consumes only the max of both
> types; a side benefit to save some memory.
>
> I think we need to make sure MultifdDeviceState_t is generic enough so that
> it will work for mostly everything (especially normal VMSDs). In this case
> the VFIO series should be good as that was currently defined as:
>
> typedef struct {
> MultiFDPacketHdr_t hdr;
>
> char idstr[256] QEMU_NONSTRING;
> uint32_t instance_id;
>
> /* size of the next packet that contains the actual data */
> uint32_t next_packet_size;
> } __attribute__((packed)) MultiFDPacketDeviceState_t;
This is the packet, a different thing. Not sure if your paragraph above
means to talk about that or really MultifdDeviceState, which is what is
exchanged between the multifd threads and the client code.
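To make the distinction concrete, the in-memory object handed between the
client code and the multifd threads would be something like this (a
hypothetical sketch, not an actual definition from the VFIO series):

    typedef struct {
        char *idstr;          /* which device this state belongs to */
        uint32_t instance_id;
        uint8_t *buf;         /* serialized device state */
        size_t buf_len;
    } MultifdDeviceState_t;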
>
> IIUC that was what we need exactly with idstr+instance_id, so as to nail
> exactly at where should the "opaque device state" go to, then load it with
> a buffer-based loader when it's ready (starting from VFIO, to get rid of
> qemufile). For VMSDs in the future if ever possible, that should be a
> modified version of vmstate_load() where it may take buffers not qemufiles.
>
> To Maciej: please see whether above makes sense to you, and if you also
> agree please consider that with your VFIO work.
>
> Thanks,
>
>>
>> > []
>> > [a][][][]
>> >
>> > [a][][][]
>> > ^idle
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-10 16:10 ` Fabiano Rosas
@ 2024-07-10 19:10 ` Peter Xu
2024-07-10 20:16 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-10 19:10 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> >> > Or graphically:
> >> >
> >> > 1) client fills the active slot with data. Channels point to nothing
> >> > at this point:
> >> > [a] <-- active slot
> >> > [][][][] <-- free slots, one per-channel
> >> >
> >> > [][][][] <-- channels' p->data pointers
> >> >
> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
> >> > still point to nothing:
> >> > []
> >> > [a][][][]
> >> >
> >> > [][][][]
> >> >
> >> > 3) multifd_send() finds an idle channel and updates its pointer:
> >>
> >> It seems the action "finds an idle channel" is in step 2 rather than step 3,
> >> which means the free slot is selected based on the id of the channel found, am I
> >> understanding correctly?
> >
> > I think you're right.
> >
> > Actually I also feel like the desription here is ambiguous, even though I
> > think I get what Fabiano wanted to say.
> >
> > The free slot should be the first step of step 2+3, here what Fabiano
> > really wanted to suggest is we move the free buffer array from multifd
> > channels into the callers, then the caller can pass in whatever data to
> > send.
> >
> > So I think maybe it's cleaner to write it as this in code (note: I didn't
> > really change the code, just some ordering and comments):
> >
> > ===8<===
> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
> > */
> > active_slot = slots->active;
> > slots->active = slots->free[p->id];
> > - p->data = active_slot;
> > -
> > - /*
> > - * By the next time we arrive here, the channel will certainly
> > - * have consumed the active slot. Put it back on the free list
> > - * now.
> > - */
> > slots->free[p->id] = active_slot;
> >
> > + /* Assign the current active slot to the chosen thread */
> > + p->data = active_slot;
> > ===8<===
> >
> > The comment I removed is slightly misleading to me too, because right now
> > active_slot contains the data hasn't yet been delivered to multifd, so
> > we're "putting it back to free list" not because of it's free, but because
> > we know it won't get used until the multifd send thread consumes it
> > (because before that the thread will be busy, and we won't use the buffer
> > if so in upcoming send()s).
> >
> > And then when I'm looking at this again, I think maybe it's a slight
> > overkill, and maybe we can still keep the "opaque data" managed by multifd.
> > One reason might be that I don't expect the "opaque data" payload keep
> > growing at all: it should really be either RAM or device state as I
> > commented elsewhere in a relevant thread, after all it's a thread model
> > only for migration purpose to move vmstates..
>
> Some amount of flexibility needs to be baked in. For instance, what
> about the handshake procedure? Don't we want to use multifd threads to
> put some information on the wire for that as well?
Is this an orthogonal question?
What I meant above is it looks fine to me to keep "device state" in
multifd.c, as long as it is not only about VFIO.
What you were saying seems to be about how to identify that this is a
device state; then I just hope VFIO shares the same flag with any future
device that would also like to send its state via multifd, like:
#define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
Then set it in MultiFDPacket_t.flags. The dest qemu should route that
packet to the device vmsd / save_entry for parsing.
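On the destination side the routing could then be a simple check on the
packet flags; a sketch, with made-up helper names:

    if (be32_to_cpu(packet->flags) & MULTIFD_FLAG_DEVICE_STATE) {
        /* route to the device's vmsd/save_entry via idstr + instance_id */
        ret = multifd_recv_device_state_packet(p, &local_err);
    } else {
        /* the existing ram handling */
        ret = multifd_recv_ram_packet(p, &local_err);
    }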
>
> > Putting it managed by multifd thread should involve less change than this
> > series, but it could look like this:
> >
> > typedef enum {
> > MULTIFD_PAYLOAD_RAM = 0,
> > MULTIFD_PAYLOAD_DEVICE_STATE = 1,
> > } MultifdPayloadType;
> >
> > typedef enum {
> > MultiFDPages_t ram_payload;
> > MultifdDeviceState_t device_payload;
> > } MultifdPayload;
> >
> > struct MultiFDSendData {
> > MultifdPayloadType type;
> > MultifdPayload data;
> > };
>
> Is that an union up there? So you want to simply allocate in multifd the
Yes.
> max amount of memory between the two types of payload? But then we'll
Yes.
> need a memset(p->data, 0, ...) at every round of sending to avoid giving
> stale data from one client to another. That doesn't work with the
I think as long as whoever enqueues always sets up the fields, we don't
need to do a memset. I am not sure it's a major concern to always set all
the relevant fields on the multifd enqueue side; it sounds like the thing
we should be doing anyway.
> current ram migration because it wants p->pages to remain active across
> several calls of multifd_queue_page().
I don't think I followed here.
What I meant: QEMU maintains SendData[8]; now a bunch of pages arrives,
and it enqueues "pages" into a free slot at index 2 (setting type=pages);
then, until thread 2 has finished sending that bunch of pages, SendData[2]
will always represent those pages without being used by anything else.
What did I miss?
>
> >
> > Then the "enum" makes sure the payload only consumes only the max of both
> > types; a side benefit to save some memory.
> >
> > I think we need to make sure MultifdDeviceState_t is generic enough so that
> > it will work for mostly everything (especially normal VMSDs). In this case
> > the VFIO series should be good as that was currently defined as:
> >
> > typedef struct {
> > MultiFDPacketHdr_t hdr;
> >
> > char idstr[256] QEMU_NONSTRING;
> > uint32_t instance_id;
> >
> > /* size of the next packet that contains the actual data */
> > uint32_t next_packet_size;
> > } __attribute__((packed)) MultiFDPacketDeviceState_t;
>
> This is the packet, a different thing. Not sure if your paragraph above
> means to talk about that or really MultifdDeviceState, which is what is
> exchanged between the multifd threads and the client code.
I meant the wire protocol looks great from that POV. We may need a similar
thing for the type==device_state slots just to be generic.
>
> >
> > IIUC that was what we need exactly with idstr+instance_id, so as to nail
> > exactly at where should the "opaque device state" go to, then load it with
> > a buffer-based loader when it's ready (starting from VFIO, to get rid of
> > qemufile). For VMSDs in the future if ever possible, that should be a
> > modified version of vmstate_load() where it may take buffers not qemufiles.
> >
> > To Maciej: please see whether above makes sense to you, and if you also
> > agree please consider that with your VFIO work.
> >
> > Thanks,
> >
> >>
> >> > []
> >> > [a][][][]
> >> >
> >> > [a][][][]
> >> > ^idle
>
--
Peter Xu
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-10 19:10 ` Peter Xu
@ 2024-07-10 20:16 ` Fabiano Rosas
2024-07-10 21:55 ` Peter Xu
2024-07-16 20:10 ` Maciej S. Szmigiero
0 siblings, 2 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-10 20:16 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
>> >> > Or graphically:
>> >> >
>> >> > 1) client fills the active slot with data. Channels point to nothing
>> >> > at this point:
>> >> > [a] <-- active slot
>> >> > [][][][] <-- free slots, one per-channel
>> >> >
>> >> > [][][][] <-- channels' p->data pointers
>> >> >
>> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
>> >> > still point to nothing:
>> >> > []
>> >> > [a][][][]
>> >> >
>> >> > [][][][]
>> >> >
>> >> > 3) multifd_send() finds an idle channel and updates its pointer:
>> >>
>> >> It seems the action "finds an idle channel" is in step 2 rather than step 3,
>> >> which means the free slot is selected based on the id of the channel found, am I
>> >> understanding correctly?
>> >
>> > I think you're right.
>> >
>> > Actually I also feel like the desription here is ambiguous, even though I
>> > think I get what Fabiano wanted to say.
>> >
>> > The free slot should be the first step of step 2+3, here what Fabiano
>> > really wanted to suggest is we move the free buffer array from multifd
>> > channels into the callers, then the caller can pass in whatever data to
>> > send.
>> >
>> > So I think maybe it's cleaner to write it as this in code (note: I didn't
>> > really change the code, just some ordering and comments):
>> >
>> > ===8<===
>> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
>> > */
>> > active_slot = slots->active;
>> > slots->active = slots->free[p->id];
>> > - p->data = active_slot;
>> > -
>> > - /*
>> > - * By the next time we arrive here, the channel will certainly
>> > - * have consumed the active slot. Put it back on the free list
>> > - * now.
>> > - */
>> > slots->free[p->id] = active_slot;
>> >
>> > + /* Assign the current active slot to the chosen thread */
>> > + p->data = active_slot;
>> > ===8<===
>> >
>> > The comment I removed is slightly misleading to me too, because right now
>> > active_slot contains the data hasn't yet been delivered to multifd, so
>> > we're "putting it back to free list" not because of it's free, but because
>> > we know it won't get used until the multifd send thread consumes it
>> > (because before that the thread will be busy, and we won't use the buffer
>> > if so in upcoming send()s).
>> >
>> > And then when I'm looking at this again, I think maybe it's a slight
>> > overkill, and maybe we can still keep the "opaque data" managed by multifd.
>> > One reason might be that I don't expect the "opaque data" payload keep
>> > growing at all: it should really be either RAM or device state as I
>> > commented elsewhere in a relevant thread, after all it's a thread model
>> > only for migration purpose to move vmstates..
>>
>> Some amount of flexibility needs to be baked in. For instance, what
>> about the handshake procedure? Don't we want to use multifd threads to
>> put some information on the wire for that as well?
>
> Is this an orthogonal question?
I don't think so. You say the payload data should be either RAM or
device state. I'm asking what other types of data do we want the multifd
channel to transmit and suggesting we need to allow room for the
addition of that, whatever it is. One thing that comes to mind that is
neither RAM nor device state is some form of handshake or capabilities
negotiation.
>
> What I meant above is it looks fine to me to keep "device state" in
> multifd.c, as long as it is not only about VFIO.
>
> What you were saying seems to be about how to identify this is a device
> state, then I just hope VFIO shares the same flag with any future device
> that would also like to send its state via multifd, like:
>
> #define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
>
> Then set it in MultiFDPacket_t.flags. The dest qemu should route that
> packet to the device vmsd / save_entry for parsing.
Sure, that part I agree with, no issue here.
>
>>
>> > Putting it managed by multifd thread should involve less change than this
>> > series, but it could look like this:
>> >
>> > typedef enum {
>> > MULTIFD_PAYLOAD_RAM = 0,
>> > MULTIFD_PAYLOAD_DEVICE_STATE = 1,
>> > } MultifdPayloadType;
>> >
>> > typedef enum {
>> > MultiFDPages_t ram_payload;
>> > MultifdDeviceState_t device_payload;
>> > } MultifdPayload;
>> >
>> > struct MultiFDSendData {
>> > MultifdPayloadType type;
>> > MultifdPayload data;
>> > };
>>
>> Is that an union up there? So you want to simply allocate in multifd the
>
> Yes.
>
>> max amount of memory between the two types of payload? But then we'll
>
> Yes.
>
>> need a memset(p->data, 0, ...) at every round of sending to avoid giving
>> stale data from one client to another. That doesn't work with the
>
> I think as long as the one to enqueue will always setup the fields, we
> don't need to do memset. I am not sure if it's a major concern to always
> set all the relevant fields in the multifd enqueue threads. It sounds like
> the thing we should always better do.
Well, writing to a region of memory that was "owned" by another multifd
client and already has a bunch of data there is somewhat prone to
bugs. Just forget to set something and now things start to behave
weirdly. I guess that's just the price of having a union. I'm not
against that, but I would maybe prefer to have each client hold its own
data and not have to think about anything else. Much of this feeling
comes from how the RAM code currently works (more on that below).
>
>> current ram migration because it wants p->pages to remain active across
>> several calls of multifd_queue_page().
>
> I don't think I followed here.
>
> What I meant: QEMU maintains SendData[8], now a bunch of pages arrives, it
> enqueues "pages" into a free slot index 2 (set type=pages), then before
> thread 2 finished sending the bunch of pages, SendData[2] will always
> represent those pages without being used by anything else. What did I miss?
>
You're missing multifd_send_state->pages and the fact that it holds
unsent data on behalf of the client. At every call to
multifd_queue_pages(), the RAM code expects the previously filled pages
structure to be there. Since we intend to have more than one multifd
client, now the other client (say device state) might run; it will take
that slot and fill it with its own stuff (or rather fill p->send_data
and multifd_send_pages() switches the pointer). Next call to
multifd_queue_pages(), it will take multifd_send_state->pages and
there'll be garbage there.
The code is not: take a free slot from the next idle channel and fill it
with data.
It is: take from multifd_send_state the active slot which *might* have
previously been consumed by the last channel and (continue to) fill it
with data.
"might", because successive calls to multifd_queue_page() don't need to
call multifd_send_page() to flush to the channel.
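In other words, the current shape of the ram path is roughly this (a
simplified paraphrase, not the verbatim code):

    bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
    {
        MultiFDPages_t *pages;

    retry:
        /* the client-side cache that persists across calls */
        pages = multifd_send_state->pages;

        /* if the queue is empty, we can enqueue right away */
        if (multifd_queue_empty(pages)) {
            multifd_enqueue(pages, offset);
            return true;
        }

        /* flush only when the block changes or the batch is full */
        if (pages->block != block || multifd_queue_full(pages)) {
            if (!multifd_send_pages()) {
                return false;
            }
            goto retry;
        }

        multifd_enqueue(pages, offset);
        return true;
    }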
>>
>> >
>> > Then the "enum" makes sure the payload only consumes only the max of both
>> > types; a side benefit to save some memory.
>> >
>> > I think we need to make sure MultifdDeviceState_t is generic enough so that
>> > it will work for mostly everything (especially normal VMSDs). In this case
>> > the VFIO series should be good as that was currently defined as:
>> >
>> > typedef struct {
>> > MultiFDPacketHdr_t hdr;
>> >
>> > char idstr[256] QEMU_NONSTRING;
>> > uint32_t instance_id;
>> >
>> > /* size of the next packet that contains the actual data */
>> > uint32_t next_packet_size;
>> > } __attribute__((packed)) MultiFDPacketDeviceState_t;
>>
>> This is the packet, a different thing. Not sure if your paragraph above
>> means to talk about that or really MultifdDeviceState, which is what is
>> exchanged between the multifd threads and the client code.
>
> I meant the wire protocol looks great from that POV. We may need similar
> thing for the type==device_state slots just to be generic.
>
>>
>> >
>> > IIUC that was what we need exactly with idstr+instance_id, so as to nail
>> > exactly at where should the "opaque device state" go to, then load it with
>> > a buffer-based loader when it's ready (starting from VFIO, to get rid of
>> > qemufile). For VMSDs in the future if ever possible, that should be a
>> > modified version of vmstate_load() where it may take buffers not qemufiles.
>> >
>> > To Maciej: please see whether above makes sense to you, and if you also
>> > agree please consider that with your VFIO work.
>> >
>> > Thanks,
>> >
>> >>
>> >> > []
>> >> > [a][][][]
>> >> >
>> >> > [a][][][]
>> >> > ^idle
>>
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-10 20:16 ` Fabiano Rosas
@ 2024-07-10 21:55 ` Peter Xu
2024-07-11 14:12 ` Fabiano Rosas
2024-07-16 20:10 ` Maciej S. Szmigiero
1 sibling, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-10 21:55 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Wed, Jul 10, 2024 at 05:16:36PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
> >> >> > Or graphically:
> >> >> >
> >> >> > 1) client fills the active slot with data. Channels point to nothing
> >> >> > at this point:
> >> >> > [a] <-- active slot
> >> >> > [][][][] <-- free slots, one per-channel
> >> >> >
> >> >> > [][][][] <-- channels' p->data pointers
> >> >> >
> >> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
> >> >> > still point to nothing:
> >> >> > []
> >> >> > [a][][][]
> >> >> >
> >> >> > [][][][]
> >> >> >
> >> >> > 3) multifd_send() finds an idle channel and updates its pointer:
> >> >>
> >> >> It seems the action "finds an idle channel" is in step 2 rather than step 3,
> >> >> which means the free slot is selected based on the id of the channel found, am I
> >> >> understanding correctly?
> >> >
> >> > I think you're right.
> >> >
> >> > Actually I also feel like the desription here is ambiguous, even though I
> >> > think I get what Fabiano wanted to say.
> >> >
> >> > The free slot should be the first step of step 2+3, here what Fabiano
> >> > really wanted to suggest is we move the free buffer array from multifd
> >> > channels into the callers, then the caller can pass in whatever data to
> >> > send.
> >> >
> >> > So I think maybe it's cleaner to write it as this in code (note: I didn't
> >> > really change the code, just some ordering and comments):
> >> >
> >> > ===8<===
> >> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
> >> > */
> >> > active_slot = slots->active;
> >> > slots->active = slots->free[p->id];
> >> > - p->data = active_slot;
> >> > -
> >> > - /*
> >> > - * By the next time we arrive here, the channel will certainly
> >> > - * have consumed the active slot. Put it back on the free list
> >> > - * now.
> >> > - */
> >> > slots->free[p->id] = active_slot;
> >> >
> >> > + /* Assign the current active slot to the chosen thread */
> >> > + p->data = active_slot;
> >> > ===8<===
> >> >
> >> > The comment I removed is slightly misleading to me too, because right now
> >> > active_slot contains the data hasn't yet been delivered to multifd, so
> >> > we're "putting it back to free list" not because of it's free, but because
> >> > we know it won't get used until the multifd send thread consumes it
> >> > (because before that the thread will be busy, and we won't use the buffer
> >> > if so in upcoming send()s).
> >> >
> >> > And then when I'm looking at this again, I think maybe it's a slight
> >> > overkill, and maybe we can still keep the "opaque data" managed by multifd.
> >> > One reason might be that I don't expect the "opaque data" payload keep
> >> > growing at all: it should really be either RAM or device state as I
> >> > commented elsewhere in a relevant thread, after all it's a thread model
> >> > only for migration purpose to move vmstates..
> >>
> >> Some amount of flexibility needs to be baked in. For instance, what
> >> about the handshake procedure? Don't we want to use multifd threads to
> >> put some information on the wire for that as well?
> >
> > Is this an orthogonal question?
>
> I don't think so. You say the payload data should be either RAM or
> device state. I'm asking what other types of data do we want the multifd
> channel to transmit and suggesting we need to allow room for the
> addition of that, whatever it is. One thing that comes to mind that is
> neither RAM or device state is some form of handshake or capabilities
> negotiation.
Indeed, what I thought was that the multifd payload should be either ram
or device state, nothing else. Worst case we can add one more into the
union, but I can't think of one.
I wonder why the handshake needs to be done per-thread. I was naturally
thinking the handshake should happen sequentially, talking over everything
including multifd. IMO multifd has these threads mostly for the sake of
performance. I sometimes think we have some tiny places where we
"over-engineered" multifd, e.g. attaching the ZLIB/ZSTD/... flags to each
packet header even though they should never change; that is the kind of
thing we could put into the handshake too, and after the handshake we
should assume both sides and all threads are in sync. There's no need to
worry about the compressor per-packet or per-channel. It could be a global
thing done upfront, even if Libvirt didn't guarantee those.
>
> >
> > What I meant above is it looks fine to me to keep "device state" in
> > multifd.c, as long as it is not only about VFIO.
> >
> > What you were saying seems to be about how to identify this is a device
> > state, then I just hope VFIO shares the same flag with any future device
> > that would also like to send its state via multifd, like:
> >
> > #define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
> >
> > Then set it in MultiFDPacket_t.flags. The dest qemu should route that
> > packet to the device vmsd / save_entry for parsing.
>
> Sure, that part I agree with, no issue here.
>
> >
> >>
> >> > Putting it managed by multifd thread should involve less change than this
> >> > series, but it could look like this:
> >> >
> >> > typedef enum {
> >> > MULTIFD_PAYLOAD_RAM = 0,
> >> > MULTIFD_PAYLOAD_DEVICE_STATE = 1,
> >> > } MultifdPayloadType;
> >> >
> >> > typedef enum {
> >> > MultiFDPages_t ram_payload;
> >> > MultifdDeviceState_t device_payload;
> >> > } MultifdPayload;
> >> >
> >> > struct MultiFDSendData {
> >> > MultifdPayloadType type;
> >> > MultifdPayload data;
> >> > };
> >>
> >> Is that an union up there? So you want to simply allocate in multifd the
> >
> > Yes.
> >
> >> max amount of memory between the two types of payload? But then we'll
> >
> > Yes.
> >
> >> need a memset(p->data, 0, ...) at every round of sending to avoid giving
> >> stale data from one client to another. That doesn't work with the
> >
> > I think as long as the one to enqueue will always setup the fields, we
> > don't need to do memset. I am not sure if it's a major concern to always
> > set all the relevant fields in the multifd enqueue threads. It sounds like
> > the thing we should always better do.
>
> Well, writing to a region of memory that was "owned" by another multifd
> client and already has a bunch of data there is somewhat prone to
> bugs. Just forget to set something and now things start to behave
> weirdly. I guess that's just the price of having an union. I'm not
> against that, but I would maybe prefer to have each client hold its own
> data and not have to think about anything else. Much of this feeling
> comes from how the RAM code currently works (more on that below).
IIUC not using a union is fine so far. However, what if in the future we
extend device state to vcpus? In that case the vcpus will need their own
array of SendData[], even though they would share the same structure with
what VFIO would use.
With a union, and if we can settle on it being either PAGES or DEV_STATE,
then when vcpu state is involved the vcpus will simply enqueue the same
DEV_STATE.
One thing to mention is that with a union we'll probably need to get rid
of multifd_send_state->pages already. The object can't be a global cache
(so far it's N+1, N being n_multifd_channels, while "1" is the extra
buffer, as only RAM uses it). In the union world we'll need to allocate
M+N SendData, where N is still n_multifd_channels and M is the number of
users. In VFIO's case, VFIO allocates the cached SendData and uses that to
enqueue; right after enqueueing it gets a free one back by switching it
with another one in multifd's array[N]. Same for RAM. Then there'll be N+2
SendData, and VFIO/RAM need to free their own SendData on cleanup (multifd
owns only the N per-thread ones).
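In code the enqueue swap could look roughly like this (a sketch;
pick_idle_channel() stands in for the existing idle-channel search loop):

    /*
     * The caller passes in its own filled SendData cache and gets back
     * the channel's spare one, which becomes the caller's new cache.
     */
    MultiFDSendData *multifd_enqueue_data(MultiFDSendData *mine)
    {
        MultiFDSendParams *p = pick_idle_channel();  /* hypothetical */
        MultiFDSendData *spare = p->data;

        p->data = mine;
        qatomic_store_release(&p->pending_job, true);
        qemu_sem_post(&p->sem);

        return spare;
    }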
>
> >
> >> current ram migration because it wants p->pages to remain active across
> >> several calls of multifd_queue_page().
> >
> > I don't think I followed here.
> >
> > What I meant: QEMU maintains SendData[8], now a bunch of pages arrives, it
> > enqueues "pages" into a free slot index 2 (set type=pages), then before
> > thread 2 finished sending the bunch of pages, SendData[2] will always
> > represent those pages without being used by anything else. What did I miss?
> >
>
> You're missing multifd_send_state->pages and the fact that it holds
> unsent data on behalf of the client. At every call to
> multifd_queue_pages(), the RAM code expects the previously filled pages
> structure to be there. Since we intend to have more than one multifd
> client, now the other client (say device state) might run, it will take
> that slot and fill it with it's own stuff (or rather fill p->send_data
> and multifd_send_pages() switches the pointer). Next call to
> multifd_queue_pages(), it will take multifd_send_state->pages and
> there'll be garbage there.
Please see above to check whether that answers this too; in general I
think we need to drop multifd_send_state->pages and let the SendData
caches be managed by the users, so instead of multifd managing N+1
SendData it only manages N, leaving the rest to the possible users of
multifd. It also means it's a must that any multifd enqueue user first
provides a valid SendData cache object.
>
> The code is not: take a free slot from the next idle channel and fill it
> with data.
>
> It is: take from multifd_send_state the active slot which *might* have
> previously been consumed by the last channel and (continue to) fill it
> with data.
>
> "might", because successive calls to multifd_queue_page() don't need to
> call multifd_send_page() to flush to the channel.
>
> >>
> >> >
> >> > Then the "enum" makes sure the payload only consumes only the max of both
> >> > types; a side benefit to save some memory.
> >> >
> >> > I think we need to make sure MultifdDeviceState_t is generic enough so that
> >> > it will work for mostly everything (especially normal VMSDs). In this case
> >> > the VFIO series should be good as that was currently defined as:
> >> >
> >> > typedef struct {
> >> > MultiFDPacketHdr_t hdr;
> >> >
> >> > char idstr[256] QEMU_NONSTRING;
> >> > uint32_t instance_id;
> >> >
> >> > /* size of the next packet that contains the actual data */
> >> > uint32_t next_packet_size;
> >> > } __attribute__((packed)) MultiFDPacketDeviceState_t;
> >>
> >> This is the packet, a different thing. Not sure if your paragraph above
> >> means to talk about that or really MultifdDeviceState, which is what is
> >> exchanged between the multifd threads and the client code.
> >
> > I meant the wire protocol looks great from that POV. We may need similar
> > thing for the type==device_state slots just to be generic.
> >
> >>
> >> >
> >> > IIUC that was what we need exactly with idstr+instance_id, so as to nail
> >> > exactly at where should the "opaque device state" go to, then load it with
> >> > a buffer-based loader when it's ready (starting from VFIO, to get rid of
> >> > qemufile). For VMSDs in the future if ever possible, that should be a
> >> > modified version of vmstate_load() where it may take buffers not qemufiles.
> >> >
> >> > To Maciej: please see whether above makes sense to you, and if you also
> >> > agree please consider that with your VFIO work.
> >> >
> >> > Thanks,
> >> >
> >> >>
> >> >> > []
> >> >> > [a][][][]
> >> >> >
> >> >> > [a][][][]
> >> >> > ^idle
> >>
>
--
Peter Xu
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-10 21:55 ` Peter Xu
@ 2024-07-11 14:12 ` Fabiano Rosas
2024-07-11 16:11 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-11 14:12 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Wed, Jul 10, 2024 at 05:16:36PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >>
>> >> > On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
>> >> >> > Or graphically:
>> >> >> >
>> >> >> > 1) client fills the active slot with data. Channels point to nothing
>> >> >> > at this point:
>> >> >> > [a] <-- active slot
>> >> >> > [][][][] <-- free slots, one per-channel
>> >> >> >
>> >> >> > [][][][] <-- channels' p->data pointers
>> >> >> >
>> >> >> > 2) multifd_send() swaps the pointers inside the client slot. Channels
>> >> >> > still point to nothing:
>> >> >> > []
>> >> >> > [a][][][]
>> >> >> >
>> >> >> > [][][][]
>> >> >> >
>> >> >> > 3) multifd_send() finds an idle channel and updates its pointer:
>> >> >>
>> >> >> It seems the action "finds an idle channel" is in step 2 rather than step 3,
>> >> >> which means the free slot is selected based on the id of the channel found, am I
>> >> >> understanding correctly?
>> >> >
>> >> > I think you're right.
>> >> >
>> >> > Actually I also feel like the desription here is ambiguous, even though I
>> >> > think I get what Fabiano wanted to say.
>> >> >
>> >> > The free slot should be the first step of step 2+3, here what Fabiano
>> >> > really wanted to suggest is we move the free buffer array from multifd
>> >> > channels into the callers, then the caller can pass in whatever data to
>> >> > send.
>> >> >
>> >> > So I think maybe it's cleaner to write it as this in code (note: I didn't
>> >> > really change the code, just some ordering and comments):
>> >> >
>> >> > ===8<===
>> >> > @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
>> >> > */
>> >> > active_slot = slots->active;
>> >> > slots->active = slots->free[p->id];
>> >> > - p->data = active_slot;
>> >> > -
>> >> > - /*
>> >> > - * By the next time we arrive here, the channel will certainly
>> >> > - * have consumed the active slot. Put it back on the free list
>> >> > - * now.
>> >> > - */
>> >> > slots->free[p->id] = active_slot;
>> >> >
>> >> > + /* Assign the current active slot to the chosen thread */
>> >> > + p->data = active_slot;
>> >> > ===8<===
>> >> >
>> >> > The comment I removed is slightly misleading to me too, because right now
>> >> > active_slot contains the data hasn't yet been delivered to multifd, so
>> >> > we're "putting it back to free list" not because of it's free, but because
>> >> > we know it won't get used until the multifd send thread consumes it
>> >> > (because before that the thread will be busy, and we won't use the buffer
>> >> > if so in upcoming send()s).
>> >> >
>> >> > And then when I'm looking at this again, I think maybe it's a slight
>> >> > overkill, and maybe we can still keep the "opaque data" managed by multifd.
>> >> > One reason might be that I don't expect the "opaque data" payload keep
>> >> > growing at all: it should really be either RAM or device state as I
>> >> > commented elsewhere in a relevant thread, after all it's a thread model
>> >> > only for migration purpose to move vmstates..
>> >>
>> >> Some amount of flexibility needs to be baked in. For instance, what
>> >> about the handshake procedure? Don't we want to use multifd threads to
>> >> put some information on the wire for that as well?
>> >
>> > Is this an orthogonal question?
>>
>> I don't think so. You say the payload data should be either RAM or
>> device state. I'm asking what other types of data do we want the multifd
>> channel to transmit and suggesting we need to allow room for the
>> addition of that, whatever it is. One thing that comes to mind that is
>> neither RAM or device state is some form of handshake or capabilities
>> negotiation.
>
> Indeed what I thought was multifd payload should be either ram or device,
> nothing else. The worst case is we can add one more into the union, but I
> can't think of.
What about the QEMUFile traffic? There's an iov in there. I have been
thinking of replacing some of qemu-file.c guts with calls to
multifd. Instead of several qemu_put_byte() we could construct an iov
and give it to multifd for transfering, call multifd_sync at the end and
get rid of the QEMUFile entirely. I don't have that completely laid out
at the moment, but I think it should be possible. I get concerned about
making assumptions on the types of data we're ever going to want to
transmit. I bet someone thought in the past that multifd would never be
used for anything other than ram.
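Very roughly something like this, where multifd_queue_iov() doesn't exist
today and the variables are just for illustration:

    struct iovec iov[] = {
        { .iov_base = &section_header, .iov_len = sizeof(section_header) },
        { .iov_base = state_buf,       .iov_len = state_len },
    };

    multifd_queue_iov(iov, ARRAY_SIZE(iov));  /* hypothetical API */
    multifd_send_sync_main();                 /* flush + sync at the end */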
>
> I wonder why handshake needs to be done per-thread. I was naturally
> thinking the handshake should happen sequentially, talking over everything
> including multifd.
Well, it would still be thread-based, just that it would be 1 thread and
it would not be managed by multifd. I don't see the point. We could make
everything multifd-based. Any piece of data that needs to reach the
other side of the migration could be sent through multifd, no?
Also, when you say "per-thread", that's the model we're trying to get
away from. There should be nothing "per-thread"; the threads just
consume the data produced by the clients. Anything "per-thread" that is
not strictly related to the thread model should go away. For instance,
p->page_size, p->page_count, p->write_flags, p->flags, etc. None of
these should be in MultiFDSendParams. That thing should be (say)
MultifdChannelState and contain only the semaphores and control flags
for the threads.
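Something along these lines (the exact field selection is only
illustrative):

    typedef struct {
        /* thread identity and lifecycle */
        uint8_t id;
        char *name;
        QemuThread thread;
        bool running;

        /* synchronization with the client side */
        QemuSemaphore sem;
        QemuSemaphore sem_sync;
        bool pending_job;

        /* payload handed over by a client, opaque to the thread model */
        MultiFDSendData *data;
    } MultifdChannelState;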
It would be nice if we could once and for all have a model that can
dispatch data transfers without having to fiddle with threading all the
time. Any time someone wants to do something different in the migration
code, there goes a random qemu_thread_create() flying around.
> IMO multifd to have these threads are mostly for the
> sake of performance. I sometimes think we have some tiny places where we
> "over-engineered" multifd, e.g. on attaching ZLIB/ZSTD/... flags on each
> packet header, even if they should never change, and that is the part of
> thing we can put into handshake too, and after handshake we should assume
> both sides and all threads are in sync. There's no need to worry
> compressor per-packet, per-channel. It could be a global thing and done
> upfront, even if Libvirt didn't guarantee those.
Yep, I agree. And if any client needs such granularity, it can add a
sub-header of its own.
>>
>> >
>> > What I meant above is it looks fine to me to keep "device state" in
>> > multifd.c, as long as it is not only about VFIO.
>> >
>> > What you were saying seems to be about how to identify this is a device
>> > state, then I just hope VFIO shares the same flag with any future device
>> > that would also like to send its state via multifd, like:
>> >
>> > #define MULTIFD_FLAG_DEVICE_STATE (32 << 1)
>> >
>> > Then set it in MultiFDPacket_t.flags. The dest qemu should route that
>> > packet to the device vmsd / save_entry for parsing.
>>
>> Sure, that part I agree with, no issue here.
>>
>> >
>> >>
>> >> > Putting it managed by multifd thread should involve less change than this
>> >> > series, but it could look like this:
>> >> >
>> >> > typedef enum {
>> >> > MULTIFD_PAYLOAD_RAM = 0,
>> >> > MULTIFD_PAYLOAD_DEVICE_STATE = 1,
>> >> > } MultifdPayloadType;
>> >> >
>> >> > typedef enum {
>> >> > MultiFDPages_t ram_payload;
>> >> > MultifdDeviceState_t device_payload;
>> >> > } MultifdPayload;
>> >> >
>> >> > struct MultiFDSendData {
>> >> > MultifdPayloadType type;
>> >> > MultifdPayload data;
>> >> > };
>> >>
>> >> Is that an union up there? So you want to simply allocate in multifd the
>> >
>> > Yes.
>> >
>> >> max amount of memory between the two types of payload? But then we'll
>> >
>> > Yes.
>> >
>> >> need a memset(p->data, 0, ...) at every round of sending to avoid giving
>> >> stale data from one client to another. That doesn't work with the
>> >
>> > I think as long as the one to enqueue will always setup the fields, we
>> > don't need to do memset. I am not sure if it's a major concern to always
>> > set all the relevant fields in the multifd enqueue threads. It sounds like
>> > the thing we should always better do.
>>
>> Well, writing to a region of memory that was "owned" by another multifd
>> client and already has a bunch of data there is somewhat prone to
>> bugs. Just forget to set something and now things start to behave
>> weirdly. I guess that's just the price of having an union. I'm not
>> against that, but I would maybe prefer to have each client hold its own
>> data and not have to think about anything else. Much of this feeling
>> comes from how the RAM code currently works (more on that below).
>
> IIUC so far not using union is fine. However in the future what if we
> extend device states to vcpus? In that case the vcpu will need to have its
> own array of SendData[], even though it will share the same structure with
> what VFIO would use.
>
> If with an union, and if we can settle it'll be either PAGES or DEV_STATE,
> then when vcpu state is involved, vcpu will simply enqueue the same
> DEV_STATE.
Yeah, the union is advantageous from that aspect.
> One thing to mention is that when with an union we may probably need to get
> rid of multifd_send_state->pages already.
Hehe, please don't do this like "oh, by the way...". This is a major
pain point. I've been complaining about that "holding of client data"
since the first time I read that code. So if you're going to propose
something, it needs to account for that.
> The object can't be a global
> cache (in which case so far it's N+1, N being n_multifd_channels, while "1"
> is the extra buffer as only RAM uses it). In the union world we'll need to
> allocate M+N SendData, where N is still the n_multifd_channels, and M is
> the number of users, in VFIO's case, VFIO allocates the cached SendData and
> use that to enqueue, right after enqueue it'll get a free one by switching
> it with another one in the multifd's array[N]. Same to RAM. Then there'll
> be N+2 SendData and VFIO/RAM needs to free their own SendData when cleanup
> (multifd owns the N per-thread only).
>
At first sight, that seems to work. It's similar to this series, but
you're moving the free slots back into the channels. Should I keep
SendData as an actual separate array instead of multiple p->data?
Let me know, I'll write some code and see what it looks like.
>>
>> >
>> >> current ram migration because it wants p->pages to remain active across
>> >> several calls of multifd_queue_page().
>> >
>> > I don't think I followed here.
>> >
>> > What I meant: QEMU maintains SendData[8], now a bunch of pages arrives, it
>> > enqueues "pages" into a free slot index 2 (set type=pages), then before
>> > thread 2 finished sending the bunch of pages, SendData[2] will always
>> > represent those pages without being used by anything else. What did I miss?
>> >
>>
>> You're missing multifd_send_state->pages and the fact that it holds
>> unsent data on behalf of the client. At every call to
>> multifd_queue_pages(), the RAM code expects the previously filled pages
>> structure to be there. Since we intend to have more than one multifd
>> client, now the other client (say device state) might run, it will take
>> that slot and fill it with it's own stuff (or rather fill p->send_data
>> and multifd_send_pages() switches the pointer). Next call to
>> multifd_queue_pages(), it will take multifd_send_state->pages and
>> there'll be garbage there.
>
> Please see above to see whether that may answer here too; in general I
> think we need to drop multifd_send_state->pages, but let SendData caches be
> managed by the users, so instead of multifd managing N+1 SendData, it only
> manages N, leaving the rest to possible users of multifd. It also means
> it's a must any multifd enqueue user will first provide a valid cache
> object of SendData.
>
>>
>> The code is not: take a free slot from the next idle channel and fill it
>> with data.
>>
>> It is: take from multifd_send_state the active slot which *might* have
>> previously been consumed by the last channel and (continue to) fill it
>> with data.
>>
>> "might", because successive calls to multifd_queue_page() don't need to
>> call multifd_send_page() to flush to the channel.
>>
>> >>
>> >> >
>> >> > Then the "enum" makes sure the payload only consumes only the max of both
>> >> > types; a side benefit to save some memory.
>> >> >
>> >> > I think we need to make sure MultifdDeviceState_t is generic enough so that
>> >> > it will work for mostly everything (especially normal VMSDs). In this case
>> >> > the VFIO series should be good as that was currently defined as:
>> >> >
>> >> > typedef struct {
>> >> > MultiFDPacketHdr_t hdr;
>> >> >
>> >> > char idstr[256] QEMU_NONSTRING;
>> >> > uint32_t instance_id;
>> >> >
>> >> > /* size of the next packet that contains the actual data */
>> >> > uint32_t next_packet_size;
>> >> > } __attribute__((packed)) MultiFDPacketDeviceState_t;
>> >>
>> >> This is the packet, a different thing. Not sure if your paragraph above
>> >> means to talk about that or really MultifdDeviceState, which is what is
>> >> exchanged between the multifd threads and the client code.
>> >
>> > I meant the wire protocol looks great from that POV. We may need similar
>> > thing for the type==device_state slots just to be generic.
>> >
>> >>
>> >> >
>> >> > IIUC that was what we need exactly with idstr+instance_id, so as to nail
>> >> > exactly at where should the "opaque device state" go to, then load it with
>> >> > a buffer-based loader when it's ready (starting from VFIO, to get rid of
>> >> > qemufile). For VMSDs in the future if ever possible, that should be a
>> >> > modified version of vmstate_load() where it may take buffers not qemufiles.
>> >> >
>> >> > To Maciej: please see whether above makes sense to you, and if you also
>> >> > agree please consider that with your VFIO work.
>> >> >
>> >> > Thanks,
>> >> >
>> >> >>
>> >> >> > []
>> >> >> > [a][][][]
>> >> >> >
>> >> >> > [a][][][]
>> >> >> > ^idle
>> >>
>>
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 14:12 ` Fabiano Rosas
@ 2024-07-11 16:11 ` Peter Xu
2024-07-11 19:37 ` Fabiano Rosas
2024-07-18 19:39 ` Fabiano Rosas
0 siblings, 2 replies; 46+ messages in thread
From: Peter Xu @ 2024-07-11 16:11 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Thu, Jul 11, 2024 at 11:12:09AM -0300, Fabiano Rosas wrote:
> What about the QEMUFile traffic? There's an iov in there. I have been
> thinking of replacing some of qemu-file.c guts with calls to
> multifd. Instead of several qemu_put_byte() we could construct an iov
> and give it to multifd for transfering, call multifd_sync at the end and
> get rid of the QEMUFile entirely. I don't have that completely laid out
> at the moment, but I think it should be possible. I get concerned about
> making assumptions on the types of data we're ever going to want to
> transmit. I bet someone thought in the past that multifd would never be
> used for anything other than ram.
Hold on a bit.. there are two things I want to clarify with you.
Firstly, qemu_put_byte() has buffering on f->buf[]. Directly changing them
to iochannels may regress performance. I never checked, but I would assume
some buffering will be needed for small chunks of data even with iochannels.
Secondly, why does multifd have anything to do with this? What you're
talking about sounds more like the qemufile->iochannel rework to me, and
IIUC that doesn't yet involve multifd. Many such conversions will still
operate on the main channel, which is not the multifd channels.
What matters is what you have in mind to put over the multifd channels.
>
> >
> > I wonder why handshake needs to be done per-thread. I was naturally
> > thinking the handshake should happen sequentially, talking over everything
> > including multifd.
>
> Well, it would still be thread based. Just that it would be 1 thread and
> it would not be managed by multifd. I don't see the point. We could make
> everything be multifd-based. Any piece of data that needs to reach the
> other side of the migration could be sent through multifd, no?
Hmm.... yes we can. But what do we gain from it, if we know it'll be a few
MBs in total? There ain't a lot of huge stuff to move, it seems to me.
>
> Also, when you say "per-thread", that's the model we're trying to get
> away from. There should be nothing "per-thread", the threads just
> consume the data produced by the clients. Anything "per-thread" that is
> not strictly related to the thread model should go away. For instance,
> p->page_size, p->page_count, p->write_flags, p->flags, etc. None of
> these should be in MultiFDSendParams. That thing should be (say)
> MultifdChannelState and contain only the semaphores and control flags
> for the threads.
>
> It would be nice if we could once and for all have a model that can
> dispatch data transfers without having to fiddle with threading all the
> time. Any time someone wants to do something different in the migration
> code, there it goes a random qemu_create_thread() flying around.
That's exactly what I want to avoid. Not all things will need a thread,
only performance relevant ones.
So now we have multifd threads; they're for IO throughput: if we want to
push a fast NIC, that's the only way to go. Anything that wants to push that
NIC should use multifd.
Then it turns out we want more concurrency; it's about VFIO save()/load() of
the kernel drivers, and it can block. Same for other devices that can
take time to save()/load(), if that can happen concurrently in the future. I
think that's the reason why I suggested the VFIO solution provide a
generic concept of a thread pool, so it serves a generic purpose and can be
reused in the future.
I hope that'll stop anyone else in migration from creating yet another
thread randomly; I definitely don't like that either. I would _suspect_ the
next one to come as such is TDX.. I remember that at least in the very
initial proposal years ago, TDX migration involved its own "channel" to
migrate; migration.c may not even know where that channel is. We'll see.
[...]
> > One thing to mention is that when with an union we may probably need to get
> > rid of multifd_send_state->pages already.
>
> Hehe, please don't do this like "oh, by the way...". This is a major
> pain point. I've been complaining about that "holding of client data"
> since the fist time I read that code. So if you're going to propose
> something, it needs to account for that.
The client puts something into a buffer (SendData), then it delivers it to
multifd (which silently switches the buffer). Once enqueued, the client
assumes the buffer is sent and reusable again.
It looks like a pretty common pattern to me; what is the concern within the procedure?
What's the "holding of client data" issue?
>
> > The object can't be a global
> > cache (in which case so far it's N+1, N being n_multifd_channels, while "1"
> > is the extra buffer as only RAM uses it). In the union world we'll need to
> > allocate M+N SendData, where N is still the n_multifd_channels, and M is
> > the number of users, in VFIO's case, VFIO allocates the cached SendData and
> > use that to enqueue, right after enqueue it'll get a free one by switching
> > it with another one in the multifd's array[N]. Same to RAM. Then there'll
> > be N+2 SendData and VFIO/RAM needs to free their own SendData when cleanup
> > (multifd owns the N per-thread only).
> >
>
> At first sight, that seems to work. It's similar to this series, but
> you're moving the free slots back into the channels. Should I keep
> SendData as an actual separate array instead of multiple p->data?
I don't know.. they look similar to me so far, as long as multifd is
managing the N buffers while the clients manage one each. There
should be a helper to allocate/free the generic multifd buffers (SendData
in this case), so that everyone uses it.
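For example, such a helper could be as small as this (a sketch, assuming the
union-based MultiFDSendData discussed above; the function names are
placeholders):

    static MultiFDSendData *multifd_send_data_alloc(void)
    {
        /* the union member already sizes itself to the largest payload */
        return g_new0(MultiFDSendData, 1);
    }

    static void multifd_send_data_free(MultiFDSendData *data)
    {
        g_free(data);
    }

multifd would call the allocator N times for its per-channel slots and each
client once for its own cache, which gives the M+N accounting above.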
>
> Let me know, I'll write some code and see what it looks like.
I think Maciej has been working on this too since your absence; I saw he
decided to base his work on top of yours and he's preparing the new
version. I hope you two won't conflict or duplicate the work. Might be
good to ask / wait and see how far Maciej has gotten.
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 16:11 ` Peter Xu
@ 2024-07-11 19:37 ` Fabiano Rosas
2024-07-11 20:27 ` Peter Xu
2024-07-18 19:39 ` Fabiano Rosas
1 sibling, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-11 19:37 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 11, 2024 at 11:12:09AM -0300, Fabiano Rosas wrote:
>> What about the QEMUFile traffic? There's an iov in there. I have been
>> thinking of replacing some of qemu-file.c guts with calls to
>> multifd. Instead of several qemu_put_byte() we could construct an iov
>> and give it to multifd for transfering, call multifd_sync at the end and
>> get rid of the QEMUFile entirely. I don't have that completely laid out
>> at the moment, but I think it should be possible. I get concerned about
>> making assumptions on the types of data we're ever going to want to
>> transmit. I bet someone thought in the past that multifd would never be
>> used for anything other than ram.
>
> Hold on a bit.. there're two things I want to clarity with you.
>
> Firstly, qemu_put_byte() has buffering on f->buf[]. Directly changing them
> to iochannels may regress performance. I never checked, but I would assume
> some buffering will be needed for small chunk of data even with iochannels.
Right, but there's an extra memcpy to do that. Not sure how those
balance out.
We also don't flush the iov at once, so f->buf seems redundant to
me. But of course, if we touch any of that we must ensure we're not
dropping any major optimization.
> Secondly, why multifd has things to do with this? What you're talking
> about is more like the rework of qemufile->iochannel thing to me, and IIUC
> that doesn't yet involve multifd. For many of such conversions, it'll
> still be operating on the main channel, which is not the multifd channels.
> What matters might be about what's in your mind to be put over multifd
> channels there.
>
Any piece of code that fills an iov with data is a candidate for sending
that data through multifd. From this perspective, multifd is just a
way to give work to an iochannel. We don't *need* to use it, but it
might be simple enough that the benefit of ditching
QEMUFile can be reached without too much rework.
Say we provision multifd threads early and leave them waiting for any
part of the migration code to send some data. We could have n-1 threads
idle waiting for the bulk of the data and use a single thread for any
early traffic that does not need to be parallel.
I'm not suggesting we do any of this right away or even that this is the
correct way to go, I'm just letting you know some of my ideas and why I
think ram + device state might not be the only data we put through
multifd.
>>
>> >
>> > I wonder why handshake needs to be done per-thread. I was naturally
>> > thinking the handshake should happen sequentially, talking over everything
>> > including multifd.
>>
>> Well, it would still be thread based. Just that it would be 1 thread and
>> it would not be managed by multifd. I don't see the point. We could make
>> everything be multifd-based. Any piece of data that needs to reach the
>> other side of the migration could be sent through multifd, no?
>
> Hmm.... yes we can. But what do we gain from it, if we know it'll be a few
> MBs in total? There ain't a lot of huge stuff to move, it seems to me.
Well it depends on what the alternative is. If we're going to create a
thread to send small chunks of data anyway, we could use the multifd
threads instead.
>
>>
>> Also, when you say "per-thread", that's the model we're trying to get
>> away from. There should be nothing "per-thread", the threads just
>> consume the data produced by the clients. Anything "per-thread" that is
>> not strictly related to the thread model should go away. For instance,
>> p->page_size, p->page_count, p->write_flags, p->flags, etc. None of
>> these should be in MultiFDSendParams. That thing should be (say)
>> MultifdChannelState and contain only the semaphores and control flags
>> for the threads.
>>
>> It would be nice if we could once and for all have a model that can
>> dispatch data transfers without having to fiddle with threading all the
>> time. Any time someone wants to do something different in the migration
>> code, there it goes a random qemu_create_thread() flying around.
>
> That's exactly what I want to avoid. Not all things will need a thread,
> only performance relevant ones.
>
> So now we have multifd threads, they're for IO throughputs: if we want to
> push a fast NIC, that's the only way to go. Anything wants to push that
> NIC, should use multifd.
>
> Then it turns out we want more concurrency, it's about VFIO save()/load()
> of the kenrel drivers and it can block. Same to other devices that can
> take time to save()/load() if it can happen concurrently in the future. I
> think that's the reason why I suggested the VFIO solution to provide a
> generic concept of thread pool so it services a generic purpose, and can be
> reused in the future.
Just to be clear, do you want a thread-pool to replace multifd? Or would
that be only used for concurrency on the producer side?
> I hope that'll stop anyone else on migration to create yet another thread
> randomly, and I definitely don't like that either. I would _suspect_ the
> next one to come as such is TDX.. I remember at least in the very initial
> proposal years ago, TDX migration involves its own "channel" to migrate,
> migration.c may not even know where is that channel. We'll see.
>
It would be good if we could standardize on a single structure for data
transfer. Today it's kind of a mess: QEMUFile with its several data
sizes, iovs, MultiFDPages_t, whatever comes out of this series in
p->data, etc. That makes it difficult to change the underlying model
without rewriting a bunch of logic.
> [...]
>
>> > One thing to mention is that when with an union we may probably need to get
>> > rid of multifd_send_state->pages already.
>>
>> Hehe, please don't do this like "oh, by the way...". This is a major
>> pain point. I've been complaining about that "holding of client data"
>> since the fist time I read that code. So if you're going to propose
>> something, it needs to account for that.
>
> The client puts something into a buffer (SendData), then it delivers it to
> multifd (who silently switches the buffer). After enqueued, the client
> assumes the buffer is sent and reusable again.
>
> It looks pretty common to me, what is the concern within the procedure?
> What's the "holding of client data" issue?
>
No concern, it's just that you didn't mention any of this when you
suggested the union, so I thought you simply ignored it.
"holding of the client data" is what we're discussing here: The fact
that multifd keeps multifd_send_state->pages around for the next call to
multifd_queue_pages() to reach.
Anyway, your M+N suggestion seems to be enough to address this, let's
see.
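The pattern being referred to is roughly this (paraphrased from memory as a
sketch, with the block-change handling omitted; not the verbatim code):

    bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
    {
        MultiFDPages_t *pages = multifd_send_state->pages;

        /* flush (and swap) the shared buffer only when it is full */
        if (pages->num == pages->allocated) {
            if (!multifd_send_pages()) {
                return false;
            }
            pages = multifd_send_state->pages;
        }

        /* keep filling the current buffer; it stays around across calls */
        pages->block = block;
        pages->offset[pages->num++] = offset;
        return true;
    }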
>>
>> > The object can't be a global
>> > cache (in which case so far it's N+1, N being n_multifd_channels, while "1"
>> > is the extra buffer as only RAM uses it). In the union world we'll need to
>> > allocate M+N SendData, where N is still the n_multifd_channels, and M is
>> > the number of users, in VFIO's case, VFIO allocates the cached SendData and
>> > use that to enqueue, right after enqueue it'll get a free one by switching
>> > it with another one in the multifd's array[N]. Same to RAM. Then there'll
>> > be N+2 SendData and VFIO/RAM needs to free their own SendData when cleanup
>> > (multifd owns the N per-thread only).
>> >
>>
>> At first sight, that seems to work. It's similar to this series, but
>> you're moving the free slots back into the channels. Should I keep
>> SendData as an actual separate array instead of multiple p->data?
>
> I don't know.. they look similar to me yet so far, as long as multifd is
> managing the N buffers, while the clients will manage one for each. There
> should have a helper to allocate/free the generic multifd buffers (SendData
> in this case) so everyone should be using that.
>
>>
>> Let me know, I'll write some code and see what it looks like.
>
> I think Maciej is working on this too since your absence, as I saw he
> decided to base his work on top of yours and he's preparing the new
> version. I hope you two won't conflict or duplicates the work. Might be
> good to ask / wait and see how far Maciej has been going.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 19:37 ` Fabiano Rosas
@ 2024-07-11 20:27 ` Peter Xu
2024-07-11 21:12 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-11 20:27 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Thu, Jul 11, 2024 at 04:37:34PM -0300, Fabiano Rosas wrote:
[...]
> We also don't flush the iov at once, so f->buf seems redundant to
> me. But of course, if we touch any of that we must ensure we're not
> dropping any major optimization.
Yes, some tests on that would be more persuasive when the time comes.
Per my limited experience in the past few years: memcpy on chips nowadays
is pretty cheap. You'll see very soon one more example of that when you
start to look at the qatzip series: that series decided to do one more
memcpy for all guest pages, to make it a larger chunk of buffer instead of
submitting the compression tasks in 4k chunks (while I thought 4k wasn't
too small itself).
That may be more involved so may not be a great example (e.g. the
compression algo can be special in this case where it just likes larger
buffers), but it's not uncommon that I see people trade things with memcpy,
especially small buffers.
[...]
> Any piece of code that fills an iov with data is prone to be able to
> send that data through multifd. From this perspective, multifd is just a
> way to give work to an iochannel. We don't *need* to use it, but it
> might be simple enough to the point that the benefit of ditching
> QEMUFile can be reached without too much rework.
>
> Say we provision multifd threads early and leave them waiting for any
> part of the migration code to send some data. We could have n-1 threads
> idle waiting for the bulk of the data and use a single thread for any
> early traffic that does not need to be parallel.
>
> I'm not suggesting we do any of this right away or even that this is the
> correct way to go, I'm just letting you know some of my ideas and why I
> think ram + device state might not be the only data we put through
> multifd.
We can wait and see whether that can be of any use in the future; even if
so, we still have a chance to add more types to the union, I think. But
again, I don't expect it.
My gut feeling: we shouldn't bother putting any (1) non-huge-chunk, or (2)
non-IO, data onto multifd. Again, I would ask "why not the main channel",
otherwise.
[...]
> Just to be clear, do you want a thread-pool to replace multifd? Or would
> that be only used for concurrency on the producer side?
Not replace multifd. It's just that I was imagining multifd threads only
manage IO stuff, nothing else.
I was indeed wondering whether we could reuse the multifd threads, but then I
found there's a risk of mangling these two concepts: when we do more than IO
in multifd threads (e.g., talking to the VFIO kernel driver to fetch data,
which can block), we risk blocking IO even when we could push more, so the
NICs may go idle. There's also the complexity where a job fetches data
from the VFIO kernel driver and wants to enqueue again; it means a multifd
task can enqueue to itself, and circular enqueue can be challenging: imagine 8
concurrent tasks (with a total of 8 multifd threads) trying to enqueue at
the same time; they starve themselves to death. Things like that. Then I
figured the remaining jobs are really fn(void*) type of things; they
deserve their own pool of threads.
So the VFIO threads (which used to be per-device) become migration worker
threads; we need them for both src/dst: on dst there's still pending work
to apply the continuous VFIO data back to the kernel driver, and that can't
be done by the multifd threads either, for a similar reason. Then those
dest-side worker threads can also do load() not only for VFIO but also for
other device states if we add more.
So to summarize, we'll have:
- 1 main thread (send / recv)
- N multifd threads (IOs only)
- M worker threads (jobs only)
Of course, postcopy not involved.. How's that sound?
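For illustration, the "jobs only" pool could be little more than a glib
GThreadPool fed with fn(void*) items (a sketch; MigrationJob and the function
names here are made up):

    typedef struct {
        void (*fn)(void *opaque);   /* e.g. one VFIO save()/load() step */
        void *opaque;
    } MigrationJob;

    static GThreadPool *workers;

    static void migration_worker(gpointer data, gpointer user_data)
    {
        MigrationJob *job = data;

        job->fn(job->opaque);
        g_free(job);
    }

    static void migration_workers_init(int m_workers)
    {
        workers = g_thread_pool_new(migration_worker, NULL, m_workers,
                                    FALSE, NULL);
    }

    static void migration_worker_submit(void (*fn)(void *), void *opaque)
    {
        MigrationJob *job = g_new0(MigrationJob, 1);

        job->fn = fn;
        job->opaque = opaque;
        g_thread_pool_push(workers, job, NULL);
    }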
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 20:27 ` Peter Xu
@ 2024-07-11 21:12 ` Fabiano Rosas
2024-07-11 22:06 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-11 21:12 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 11, 2024 at 04:37:34PM -0300, Fabiano Rosas wrote:
>
> [...]
>
>> We also don't flush the iov at once, so f->buf seems redundant to
>> me. But of course, if we touch any of that we must ensure we're not
>> dropping any major optimization.
>
> Yes some tests over that would be more persuasive when it comes.
>
> Per my limited experience in the past few years: memcpy on chips nowadays
> is pretty cheap. You'll see very soon one more example of that when you
> start to look at the qatzip series: that series decided to do one more
> memcpy for all guest pages, to make it a larger chunk of buffer instead of
> submitting the compression tasks in 4k chunks (while I thought 4k wasn't
> too small itself).
>
> That may be more involved so may not be a great example (e.g. the
> compression algo can be special in this case where it just likes larger
> buffers), but it's not uncommon that I see people trade things with memcpy,
> especially small buffers.
>
> [...]
>
>> Any piece of code that fills an iov with data is prone to be able to
>> send that data through multifd. From this perspective, multifd is just a
>> way to give work to an iochannel. We don't *need* to use it, but it
>> might be simple enough to the point that the benefit of ditching
>> QEMUFile can be reached without too much rework.
>>
>> Say we provision multifd threads early and leave them waiting for any
>> part of the migration code to send some data. We could have n-1 threads
>> idle waiting for the bulk of the data and use a single thread for any
>> early traffic that does not need to be parallel.
>>
>> I'm not suggesting we do any of this right away or even that this is the
>> correct way to go, I'm just letting you know some of my ideas and why I
>> think ram + device state might not be the only data we put through
>> multifd.
>
> We can wait and see whether that can be of any use in the future, even if
> so, we still have chance to add more types into the union, I think. But
> again, I don't expect.
>
> My gut feeling: we shouldn't bother putting any (1) non-huge-chunk, or (2)
> non-IO, data onto multifd. Again, I would ask "why not the main channel",
> otherwise.
>
> [...]
>
>> Just to be clear, do you want a thread-pool to replace multifd? Or would
>> that be only used for concurrency on the producer side?
>
> Not replace multifd. It's just that I was imagining multifd threads only
> manage IO stuff, nothing else.
>
> I was indeed thinking whether we can reuse multifd threads, but then I
> found there's risk mangling these two concepts, as: when we do more than IO
> in multifd threads (e.g., talking to VFIO kernel fetching data which can
> block), we have risk of blocking IO even if we can push more so the NICs
> can be idle again. There's also the complexity where the job fetches data
> from VFIO kernel and want to enqueue again, it means an multifd task can
> enqueue to itself, and circular enqueue can be challenging: imagine 8
> concurrent tasks (with a total of 8 multifd threads) trying to enqueue at
> the same time; they hunger themselves to death. Things like that. Then I
> figured the rest jobs are really fn(void*) type of things; they should
> deserve their own pool of threads.
>
> So the VFIO threads (used to be per-device) becomes migration worker
> threads, we need them for both src/dst: on dst there's still pending work
> to apply the continuous VFIO data back to the kernel driver, and that can't
> be done by multifd thread too due to similar same reason. Then those dest
> side worker threads can also do load() not only for VFIO but also other
> device states if we can add more.
>
> So to summary, we'll have:
>
> - 1 main thread (send / recv)
> - N multifd threads (IOs only)
> - M worker threads (jobs only)
>
> Of course, postcopy not involved.. How's that sound?
Looks good. There's a better divide between producer and consumer this
way. I think it will help when designing new features.
One observation is that we'll still have two different entities doing IO
(multifd threads and the migration thread), which I would prefer were
using common code at a higher level than the iochannel.
One thing that I tried to look into for mapped-ram was whether we could
set up iouring in the migration code, but got entirely discouraged by
the migration thread doing IO at random points. And of course, you've
seen what we had to do with direct-io. That was in part due to having
the migration thread in parallel doing its small writes at undetermined
points in time.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 21:12 ` Fabiano Rosas
@ 2024-07-11 22:06 ` Peter Xu
2024-07-12 12:44 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-11 22:06 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Thu, Jul 11, 2024 at 06:12:44PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Jul 11, 2024 at 04:37:34PM -0300, Fabiano Rosas wrote:
> >
> > [...]
> >
> >> We also don't flush the iov at once, so f->buf seems redundant to
> >> me. But of course, if we touch any of that we must ensure we're not
> >> dropping any major optimization.
> >
> > Yes some tests over that would be more persuasive when it comes.
> >
> > Per my limited experience in the past few years: memcpy on chips nowadays
> > is pretty cheap. You'll see very soon one more example of that when you
> > start to look at the qatzip series: that series decided to do one more
> > memcpy for all guest pages, to make it a larger chunk of buffer instead of
> > submitting the compression tasks in 4k chunks (while I thought 4k wasn't
> > too small itself).
> >
> > That may be more involved so may not be a great example (e.g. the
> > compression algo can be special in this case where it just likes larger
> > buffers), but it's not uncommon that I see people trade things with memcpy,
> > especially small buffers.
> >
> > [...]
> >
> >> Any piece of code that fills an iov with data is prone to be able to
> >> send that data through multifd. From this perspective, multifd is just a
> >> way to give work to an iochannel. We don't *need* to use it, but it
> >> might be simple enough to the point that the benefit of ditching
> >> QEMUFile can be reached without too much rework.
> >>
> >> Say we provision multifd threads early and leave them waiting for any
> >> part of the migration code to send some data. We could have n-1 threads
> >> idle waiting for the bulk of the data and use a single thread for any
> >> early traffic that does not need to be parallel.
> >>
> >> I'm not suggesting we do any of this right away or even that this is the
> >> correct way to go, I'm just letting you know some of my ideas and why I
> >> think ram + device state might not be the only data we put through
> >> multifd.
> >
> > We can wait and see whether that can be of any use in the future, even if
> > so, we still have chance to add more types into the union, I think. But
> > again, I don't expect.
> >
> > My gut feeling: we shouldn't bother putting any (1) non-huge-chunk, or (2)
> > non-IO, data onto multifd. Again, I would ask "why not the main channel",
> > otherwise.
> >
> > [...]
> >
> >> Just to be clear, do you want a thread-pool to replace multifd? Or would
> >> that be only used for concurrency on the producer side?
> >
> > Not replace multifd. It's just that I was imagining multifd threads only
> > manage IO stuff, nothing else.
> >
> > I was indeed thinking whether we can reuse multifd threads, but then I
> > found there's risk mangling these two concepts, as: when we do more than IO
> > in multifd threads (e.g., talking to VFIO kernel fetching data which can
> > block), we have risk of blocking IO even if we can push more so the NICs
> > can be idle again. There's also the complexity where the job fetches data
> > from VFIO kernel and want to enqueue again, it means an multifd task can
> > enqueue to itself, and circular enqueue can be challenging: imagine 8
> > concurrent tasks (with a total of 8 multifd threads) trying to enqueue at
> > the same time; they hunger themselves to death. Things like that. Then I
> > figured the rest jobs are really fn(void*) type of things; they should
> > deserve their own pool of threads.
> >
> > So the VFIO threads (used to be per-device) becomes migration worker
> > threads, we need them for both src/dst: on dst there's still pending work
> > to apply the continuous VFIO data back to the kernel driver, and that can't
> > be done by multifd thread too due to similar same reason. Then those dest
> > side worker threads can also do load() not only for VFIO but also other
> > device states if we can add more.
> >
> > So to summary, we'll have:
> >
> > - 1 main thread (send / recv)
> > - N multifd threads (IOs only)
> > - M worker threads (jobs only)
> >
> > Of course, postcopy not involved.. How's that sound?
>
> Looks good. There's a better divide between producer and consumer this
> way. I think it will help when designing new features.
>
> One observation is that we'll still have two different entities doing IO
> (multifd threads and the migration thread), which I would prefer were
> using a common code at a higher level than the iochannel.
At least for the main channel, probably yes. I think Dan has had the idea
of adding a buffering layer over iochannels, then replacing qemufiles with
that. The multifd channels look OK so far to use as raw channels.
>
> One thing that I tried to look into for mapped-ram was whether we could
> set up iouring in the migration code, but got entirely discouraged by
> the migration thread doing IO at random points. And of course, you've
> seen what we had to do with direct-io. That was in part due to having
> the migration thread in parallel doing it's small writes at undetermined
> points in time.
On the locked_vm part: probably yes, we'd better try to avoid using page
pinning if possible. Putting VMs into containers looks like an increasingly
important scenario nowadays, which means such a feature may not always be
usable there.
For the rest: I really don't know much about iouring, but I remember it is
normally fast only in a poll model with an interrupt-less context? Not sure
whether it suits us here, as I guess we should avoid consuming cpu
resources for no good reason, and polling for perf falls into that
category, I think. Even without it, kubevirt already has an issue with
multifd eating cpus, and people observe multifd threads causing vcpu
threads to be throttled, interrupting guest workloads; they're currently
put in the same container. I'm also not sure how much it'll help compared
to when we have the multi-threading ready. I suspect not that much.
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 22:06 ` Peter Xu
@ 2024-07-12 12:44 ` Fabiano Rosas
2024-07-12 15:37 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-12 12:44 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 11, 2024 at 06:12:44PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Jul 11, 2024 at 04:37:34PM -0300, Fabiano Rosas wrote:
>> >
>> > [...]
>> >
>> >> We also don't flush the iov at once, so f->buf seems redundant to
>> >> me. But of course, if we touch any of that we must ensure we're not
>> >> dropping any major optimization.
>> >
>> > Yes some tests over that would be more persuasive when it comes.
>> >
>> > Per my limited experience in the past few years: memcpy on chips nowadays
>> > is pretty cheap. You'll see very soon one more example of that when you
>> > start to look at the qatzip series: that series decided to do one more
>> > memcpy for all guest pages, to make it a larger chunk of buffer instead of
>> > submitting the compression tasks in 4k chunks (while I thought 4k wasn't
>> > too small itself).
>> >
>> > That may be more involved so may not be a great example (e.g. the
>> > compression algo can be special in this case where it just likes larger
>> > buffers), but it's not uncommon that I see people trade things with memcpy,
>> > especially small buffers.
>> >
>> > [...]
>> >
>> >> Any piece of code that fills an iov with data is prone to be able to
>> >> send that data through multifd. From this perspective, multifd is just a
>> >> way to give work to an iochannel. We don't *need* to use it, but it
>> >> might be simple enough to the point that the benefit of ditching
>> >> QEMUFile can be reached without too much rework.
>> >>
>> >> Say we provision multifd threads early and leave them waiting for any
>> >> part of the migration code to send some data. We could have n-1 threads
>> >> idle waiting for the bulk of the data and use a single thread for any
>> >> early traffic that does not need to be parallel.
>> >>
>> >> I'm not suggesting we do any of this right away or even that this is the
>> >> correct way to go, I'm just letting you know some of my ideas and why I
>> >> think ram + device state might not be the only data we put through
>> >> multifd.
>> >
>> > We can wait and see whether that can be of any use in the future, even if
>> > so, we still have chance to add more types into the union, I think. But
>> > again, I don't expect.
>> >
>> > My gut feeling: we shouldn't bother putting any (1) non-huge-chunk, or (2)
>> > non-IO, data onto multifd. Again, I would ask "why not the main channel",
>> > otherwise.
>> >
>> > [...]
>> >
>> >> Just to be clear, do you want a thread-pool to replace multifd? Or would
>> >> that be only used for concurrency on the producer side?
>> >
>> > Not replace multifd. It's just that I was imagining multifd threads only
>> > manage IO stuff, nothing else.
>> >
>> > I was indeed thinking whether we can reuse multifd threads, but then I
>> > found there's risk mangling these two concepts, as: when we do more than IO
>> > in multifd threads (e.g., talking to VFIO kernel fetching data which can
>> > block), we have risk of blocking IO even if we can push more so the NICs
>> > can be idle again. There's also the complexity where the job fetches data
>> > from VFIO kernel and want to enqueue again, it means an multifd task can
>> > enqueue to itself, and circular enqueue can be challenging: imagine 8
>> > concurrent tasks (with a total of 8 multifd threads) trying to enqueue at
>> > the same time; they hunger themselves to death. Things like that. Then I
>> > figured the rest jobs are really fn(void*) type of things; they should
>> > deserve their own pool of threads.
>> >
>> > So the VFIO threads (used to be per-device) becomes migration worker
>> > threads, we need them for both src/dst: on dst there's still pending work
>> > to apply the continuous VFIO data back to the kernel driver, and that can't
>> > be done by multifd thread too due to similar same reason. Then those dest
>> > side worker threads can also do load() not only for VFIO but also other
>> > device states if we can add more.
>> >
>> > So to summary, we'll have:
>> >
>> > - 1 main thread (send / recv)
>> > - N multifd threads (IOs only)
>> > - M worker threads (jobs only)
>> >
>> > Of course, postcopy not involved.. How's that sound?
>>
>> Looks good. There's a better divide between producer and consumer this
>> way. I think it will help when designing new features.
>>
>> One observation is that we'll still have two different entities doing IO
>> (multifd threads and the migration thread), which I would prefer were
>> using a common code at a higher level than the iochannel.
>
> At least for the main channel probably yes. I think Dan has had the idea
> of adding the buffering layer over iochannels, then replace qemufiles with
> that. Multifd channels looks ok so far to use as raw channels.
>
>>
>> One thing that I tried to look into for mapped-ram was whether we could
>> set up iouring in the migration code, but got entirely discouraged by
>> the migration thread doing IO at random points. And of course, you've
>> seen what we had to do with direct-io. That was in part due to having
>> the migration thread in parallel doing it's small writes at undetermined
>> points in time.
>
> On the locked_vm part: probably yes, we'd better try to avoid using page
> pinning if possible. It just looks like it becomes a more important
> scenario nowadays to put VMs into containers, it means then such feature
> may not be always usable there.
>
> For the rest: I really don't know much on iouring, but I remember it can be
> fast normally only in a poll model with interrupt-less context?
I'm not sure. I mainly thought about using it to save syscalls on the
write side. But as I said, I didn't look further into it.
> Not sure
> whether it suites here for us, as I guess we should avoid consuming cpu
> resourcess with no good reason, and polling for perf falls into that
> category, I think. Even without it, kubevirt now already has issue on
> multifd eating cpus, and people observe multifd threads causing vcpu
> threads to be throttled, interrupting guest workloads; they're currently
> put in the same container. I also not sure how much it'll help comparing
> to when we have the multi-threading ready. I suspect not that much.
Do you have a reference for that kubevirt issue I could look at? It
may be interesting to investigate further. Where's the throttling coming
from? And doesn't less vcpu time imply less dirtying and therefore
faster convergence?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-12 12:44 ` Fabiano Rosas
@ 2024-07-12 15:37 ` Peter Xu
0 siblings, 0 replies; 46+ messages in thread
From: Peter Xu @ 2024-07-12 15:37 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Fri, Jul 12, 2024 at 09:44:02AM -0300, Fabiano Rosas wrote:
> Do you have a reference for that kubevirt issue I could look at? It
> maybe interesting to investigate further. Where's the throttling coming
> from? And doesn't less vcpu time imply less dirtying and therefore
> faster convergence?
Sorry, I don't have a link on hand.. sometimes it's not about convergence,
it's about unintentionally impacting the guest workload too much, which is
not wanted, especially on a public cloud.
It's understandable to me since they're under the same cgroup, with
throttled cpu resources applied to the QEMU+Libvirt processes as a whole,
probably based on N_VCPUS with some tiny extra room for other stuff.
For example, I remember they also hit other threads contending with the vcpu
threads, like the block layer thread pools.
locked_vm is a separate issue here, as kubevirt probably needs to figure out
a way to say "these are mgmt threads, and those are vcpu threads", because
mgmt threads can sometimes take quite some cpu resources and that's not
avoidable. Page pinning will be another story, as in many cases pinning
should not be required, except for VFIO, zerocopy and other special stuff.
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-11 16:11 ` Peter Xu
2024-07-11 19:37 ` Fabiano Rosas
@ 2024-07-18 19:39 ` Fabiano Rosas
2024-07-18 21:12 ` Peter Xu
1 sibling, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-18 19:39 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 11, 2024 at 11:12:09AM -0300, Fabiano Rosas wrote:
>> What about the QEMUFile traffic? There's an iov in there. I have been
>> thinking of replacing some of qemu-file.c guts with calls to
>> multifd. Instead of several qemu_put_byte() we could construct an iov
>> and give it to multifd for transfering, call multifd_sync at the end and
>> get rid of the QEMUFile entirely. I don't have that completely laid out
>> at the moment, but I think it should be possible. I get concerned about
>> making assumptions on the types of data we're ever going to want to
>> transmit. I bet someone thought in the past that multifd would never be
>> used for anything other than ram.
>
> Hold on a bit.. there're two things I want to clarity with you.
>
> Firstly, qemu_put_byte() has buffering on f->buf[]. Directly changing them
> to iochannels may regress performance. I never checked, but I would assume
> some buffering will be needed for small chunk of data even with iochannels.
>
> Secondly, why multifd has things to do with this? What you're talking
> about is more like the rework of qemufile->iochannel thing to me, and IIUC
> that doesn't yet involve multifd. For many of such conversions, it'll
> still be operating on the main channel, which is not the multifd channels.
> What matters might be about what's in your mind to be put over multifd
> channels there.
>
>>
>> >
>> > I wonder why handshake needs to be done per-thread. I was naturally
>> > thinking the handshake should happen sequentially, talking over everything
>> > including multifd.
>>
>> Well, it would still be thread based. Just that it would be 1 thread and
>> it would not be managed by multifd. I don't see the point. We could make
>> everything be multifd-based. Any piece of data that needs to reach the
>> other side of the migration could be sent through multifd, no?
>
> Hmm.... yes we can. But what do we gain from it, if we know it'll be a few
> MBs in total? There ain't a lot of huge stuff to move, it seems to me.
>
>>
>> Also, when you say "per-thread", that's the model we're trying to get
>> away from. There should be nothing "per-thread", the threads just
>> consume the data produced by the clients. Anything "per-thread" that is
>> not strictly related to the thread model should go away. For instance,
>> p->page_size, p->page_count, p->write_flags, p->flags, etc. None of
>> these should be in MultiFDSendParams. That thing should be (say)
>> MultifdChannelState and contain only the semaphores and control flags
>> for the threads.
>>
>> It would be nice if we could once and for all have a model that can
>> dispatch data transfers without having to fiddle with threading all the
>> time. Any time someone wants to do something different in the migration
>> code, there it goes a random qemu_create_thread() flying around.
>
> That's exactly what I want to avoid. Not all things will need a thread,
> only performance relevant ones.
>
> So now we have multifd threads, they're for IO throughputs: if we want to
> push a fast NIC, that's the only way to go. Anything wants to push that
> NIC, should use multifd.
>
> Then it turns out we want more concurrency, it's about VFIO save()/load()
> of the kenrel drivers and it can block. Same to other devices that can
> take time to save()/load() if it can happen concurrently in the future. I
> think that's the reason why I suggested the VFIO solution to provide a
> generic concept of thread pool so it services a generic purpose, and can be
> reused in the future.
>
> I hope that'll stop anyone else on migration to create yet another thread
> randomly, and I definitely don't like that either. I would _suspect_ the
> next one to come as such is TDX.. I remember at least in the very initial
> proposal years ago, TDX migration involves its own "channel" to migrate,
> migration.c may not even know where is that channel. We'll see.
>
> [...]
>
>> > One thing to mention is that when with an union we may probably need to get
>> > rid of multifd_send_state->pages already.
>>
>> Hehe, please don't do this like "oh, by the way...". This is a major
>> pain point. I've been complaining about that "holding of client data"
>> since the fist time I read that code. So if you're going to propose
>> something, it needs to account for that.
>
> The client puts something into a buffer (SendData), then it delivers it to
> multifd (who silently switches the buffer). After enqueued, the client
> assumes the buffer is sent and reusable again.
>
> It looks pretty common to me, what is the concern within the procedure?
> What's the "holding of client data" issue?
>
v2 is ready, but unfortunately this approach doesn't work. When client A
takes the payload, it fills it with its data, which may include
allocating memory. MultiFDPages_t does that for its offset[] array. This means
we need a round of free/malloc at every packet sent, for every client
and every allocation they decide to do.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-18 19:39 ` Fabiano Rosas
@ 2024-07-18 21:12 ` Peter Xu
2024-07-18 21:27 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-18 21:12 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> v2 is ready, but unfortunately this approach doesn't work. When client A
> takes the payload, it fills it with it's data, which may include
> allocating memory. MultiFDPages_t does that for the offset. This means
> we need a round of free/malloc at every packet sent. For every client
> and every allocation they decide to do.
Shouldn't be a blocker? E.g. one option is:
    /* Allocate both the pages + offset[] in one chunk */
    MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
                                      sizeof(ram_addr_t) * n);
    pages->allocated = n;
    pages->offset = (ram_addr_t *)&pages[1];
Or.. we can also make offset[] dynamic size, if that looks less tricky:
    typedef struct {
        /* number of used pages */
        uint32_t num;
        /* number of normal pages */
        uint32_t normal_num;
        /* number of allocated pages */
        uint32_t allocated;
        RAMBlock *block;
        /* offset of each page */
        ram_addr_t offset[0];
    } MultiFDPages_t;
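With the zero-sized array, one allocation then covers the struct plus the
offsets, e.g. (sketch of how option 2 would be allocated):

    MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
                                      n * sizeof(ram_addr_t));
    pages->allocated = n;
    /* pages->offset[0..n-1] is usable directly, no separate allocation */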
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-18 21:12 ` Peter Xu
@ 2024-07-18 21:27 ` Fabiano Rosas
2024-07-18 21:52 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-18 21:27 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
>> v2 is ready, but unfortunately this approach doesn't work. When client A
>> takes the payload, it fills it with it's data, which may include
>> allocating memory. MultiFDPages_t does that for the offset. This means
>> we need a round of free/malloc at every packet sent. For every client
>> and every allocation they decide to do.
>
> Shouldn't be a blocker? E.g. one option is:
>
> /* Allocate both the pages + offset[] */
> MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> sizeof(ram_addr_t) * n, 1);
> pages->allocated = n;
> pages->offset = &pages[1];
>
> Or.. we can also make offset[] dynamic size, if that looks less tricky:
>
> typedef struct {
> /* number of used pages */
> uint32_t num;
> /* number of normal pages */
> uint32_t normal_num;
> /* number of allocated pages */
> uint32_t allocated;
> RAMBlock *block;
> /* offset of each page */
> ram_addr_t offset[0];
> } MultiFDPages_t;
I think you missed the point. If we hold a pointer inside the payload,
we lose the reference when the other client takes the structure and puts
its own data there. So we'll need to alloc/free every time we send a
packet.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-18 21:27 ` Fabiano Rosas
@ 2024-07-18 21:52 ` Peter Xu
2024-07-18 22:32 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-18 21:52 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> >> v2 is ready, but unfortunately this approach doesn't work. When client A
> >> takes the payload, it fills it with it's data, which may include
> >> allocating memory. MultiFDPages_t does that for the offset. This means
> >> we need a round of free/malloc at every packet sent. For every client
> >> and every allocation they decide to do.
> >
> > Shouldn't be a blocker? E.g. one option is:
> >
> > /* Allocate both the pages + offset[] */
> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> > sizeof(ram_addr_t) * n, 1);
> > pages->allocated = n;
> > pages->offset = &pages[1];
> >
> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
> >
> > typedef struct {
> > /* number of used pages */
> > uint32_t num;
> > /* number of normal pages */
> > uint32_t normal_num;
> > /* number of allocated pages */
> > uint32_t allocated;
> > RAMBlock *block;
> > /* offset of each page */
> > ram_addr_t offset[0];
> > } MultiFDPages_t;
>
> I think you missed the point. If we hold a pointer inside the payload,
> we lose the reference when the other client takes the structure and puts
> its own data there. So we'll need to alloc/free everytime we send a
> packet.
For option 1: when the buffer switch happens, MultiFDPages_t will switch as
a whole, including its offset[], because its offset[] always belongs to this
MultiFDPages_t. So yes, we want to lose that *offset reference together
with MultiFDPages_t here, so the offset[] always belongs to one single
MultiFDPages_t object for its lifetime.
For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
but make it part of the struct (MultiFDPages_t.offset[]). Logically it's
the same as option 1 but maybe slightly cleaner. We just need to make it
sized 0 so as to be dynamic in size.
Hmm.. is that the case?
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-18 21:52 ` Peter Xu
@ 2024-07-18 22:32 ` Fabiano Rosas
2024-07-19 14:04 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-18 22:32 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
>> >> v2 is ready, but unfortunately this approach doesn't work. When client A
>> >> takes the payload, it fills it with it's data, which may include
>> >> allocating memory. MultiFDPages_t does that for the offset. This means
>> >> we need a round of free/malloc at every packet sent. For every client
>> >> and every allocation they decide to do.
>> >
>> > Shouldn't be a blocker? E.g. one option is:
>> >
>> > /* Allocate both the pages + offset[] */
>> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
>> > sizeof(ram_addr_t) * n, 1);
>> > pages->allocated = n;
>> > pages->offset = &pages[1];
>> >
>> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
>> >
>> > typedef struct {
>> > /* number of used pages */
>> > uint32_t num;
>> > /* number of normal pages */
>> > uint32_t normal_num;
>> > /* number of allocated pages */
>> > uint32_t allocated;
>> > RAMBlock *block;
>> > /* offset of each page */
>> > ram_addr_t offset[0];
>> > } MultiFDPages_t;
>>
>> I think you missed the point. If we hold a pointer inside the payload,
>> we lose the reference when the other client takes the structure and puts
>> its own data there. So we'll need to alloc/free everytime we send a
>> packet.
>
> For option 1: when the buffer switch happens, MultiFDPages_t will switch as
> a whole, including its offset[], because its offset[] always belong to this
> MultiFDPages_t. So yes, we want to lose that *offset reference together
> with MultiFDPages_t here, so the offset[] always belongs to one single
> MultiFDPages_t object for its lifetime.
MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
individually:
    struct MultiFDSendData {
        MultiFDPayloadType type;
        union {
            MultiFDPages_t ram_payload;
        } u;
    };
(and even if it did, then we'd lose the pointer to ram_payload anyway -
or require multiple free/alloc)
>
> For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
> but make it part of the struct (MultiFDPages_t.offset[]). Logically it's
> the same as option 1 but maybe slight cleaner. We just need to make it
> sized 0 so as to be dynamic in size.
Seems like an undefined behavior magnet. If I sent this as the first
version, you'd NACK me right away.
Besides, it's an unnecessary restriction to impose on the client
code. And like above, we don't allocate the struct directly; it's part
of MultiFDSendData, which is an advantage of using the union.
I think we've reached the point where I'd like to hear more concrete
reasons for not going with the current proposal, other than the
simplicity argument you already put forward. I like the union idea, but OTOH we
already have a working solution right here.
>
> Hmm.. is it the case?
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-18 22:32 ` Fabiano Rosas
@ 2024-07-19 14:04 ` Peter Xu
2024-07-19 16:54 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-19 14:04 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Thu, Jul 18, 2024 at 07:32:05PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> >> >> v2 is ready, but unfortunately this approach doesn't work. When client A
> >> >> takes the payload, it fills it with it's data, which may include
> >> >> allocating memory. MultiFDPages_t does that for the offset. This means
> >> >> we need a round of free/malloc at every packet sent. For every client
> >> >> and every allocation they decide to do.
> >> >
> >> > Shouldn't be a blocker? E.g. one option is:
> >> >
> >> > /* Allocate both the pages + offset[] */
> >> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> >> > sizeof(ram_addr_t) * n, 1);
> >> > pages->allocated = n;
> >> > pages->offset = &pages[1];
> >> >
> >> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
> >> >
> >> > typedef struct {
> >> > /* number of used pages */
> >> > uint32_t num;
> >> > /* number of normal pages */
> >> > uint32_t normal_num;
> >> > /* number of allocated pages */
> >> > uint32_t allocated;
> >> > RAMBlock *block;
> >> > /* offset of each page */
> >> > ram_addr_t offset[0];
> >> > } MultiFDPages_t;
> >>
> >> I think you missed the point. If we hold a pointer inside the payload,
> >> we lose the reference when the other client takes the structure and puts
> >> its own data there. So we'll need to alloc/free everytime we send a
> >> packet.
> >
> > For option 1: when the buffer switch happens, MultiFDPages_t will switch as
> > a whole, including its offset[], because its offset[] always belong to this
> > MultiFDPages_t. So yes, we want to lose that *offset reference together
> > with MultiFDPages_t here, so the offset[] always belongs to one single
> > MultiFDPages_t object for its lifetime.
>
> MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
> individually:
>
> struct MultiFDSendData {
> MultiFDPayloadType type;
> union {
> MultiFDPages_t ram_payload;
> } u;
> };
>
> (and even if it did, then we'd lose the pointer to ram_payload anyway -
> or require multiple free/alloc)
IMHO it's the same.
The core idea is we allocate a buffer to put MultiFDSendData which may
contain either Pages_t or DeviceState_t, and the size of the buffer should
be MAX(A, B).
>
> >
> > For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
> > but make it part of the struct (MultiFDPages_t.offset[]). Logically it's
> > the same as option 1 but maybe slight cleaner. We just need to make it
> > sized 0 so as to be dynamic in size.
>
> Seems like an undefined behavior magnet. If I sent this as the first
> version, you'd NACK me right away.
>
> Besides, it's an unnecessary restriction to impose in the client
> code. And like above, we don't allocate the struct directly, it's part
> of MultiFDSendData, that's an advantage of using the union.
>
> I think we've reached the point where I'd like to hear more concrete
> reasons for not going with the current proposal, except for the
> simplicity argument you already put. I like the union idea, but OTOH we
> already have a working solution right here.
I think the issue with the current proposal is that each client will need to
allocate (N+1)*buffer, so the more users there are, the more buffers we'll need (M
users, then M*(N+1)*buffer). Currently it seems to me we will have at least 3 users:
RAM, VFIO, and some other VMSD devices TBD in the mid-to-long term;
the latter two will share the same DeviceState_t. Maybe vDPA as well at
some point? Then 4.
I'd agree with this approach only if multifd is flexible enough to not even
know what's the buffers, but it's not the case, and we seem only care about
two:
if (type==RAM)
...
else
assert(type==DEVICE);
...
In this case I think it's easier we have multifd manage all the buffers
(after all, it knows them well...). Then the consumption is not
M*(N+1)*buffer, but (M+N)*buffer.
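(To put rough numbers on it, purely as an illustration: with M = 3 payload
types and N = 8 channels, per-client pools would mean 3 * (8 + 1) = 27
buffers, while multifd-owned buffers would mean 3 + 8 = 11.)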
Perhaps push your tree somewhere so we can have a quick look? I'm totally
lost when you said I'll nack it.. so maybe I didn't really get what you
meant. Codes may clarify that.
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-19 14:04 ` Peter Xu
@ 2024-07-19 16:54 ` Fabiano Rosas
2024-07-19 17:58 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-19 16:54 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Thu, Jul 18, 2024 at 07:32:05PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >>
>> >> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
>> >> >> v2 is ready, but unfortunately this approach doesn't work. When client A
>> >> >> takes the payload, it fills it with it's data, which may include
>> >> >> allocating memory. MultiFDPages_t does that for the offset. This means
>> >> >> we need a round of free/malloc at every packet sent. For every client
>> >> >> and every allocation they decide to do.
>> >> >
>> >> > Shouldn't be a blocker? E.g. one option is:
>> >> >
>> >> > /* Allocate both the pages + offset[] */
>> >> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
>> >> > sizeof(ram_addr_t) * n, 1);
>> >> > pages->allocated = n;
>> >> > pages->offset = &pages[1];
>> >> >
>> >> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
>> >> >
>> >> > typedef struct {
>> >> > /* number of used pages */
>> >> > uint32_t num;
>> >> > /* number of normal pages */
>> >> > uint32_t normal_num;
>> >> > /* number of allocated pages */
>> >> > uint32_t allocated;
>> >> > RAMBlock *block;
>> >> > /* offset of each page */
>> >> > ram_addr_t offset[0];
>> >> > } MultiFDPages_t;
>> >>
>> >> I think you missed the point. If we hold a pointer inside the payload,
>> >> we lose the reference when the other client takes the structure and puts
>> >> its own data there. So we'll need to alloc/free everytime we send a
>> >> packet.
>> >
>> > For option 1: when the buffer switch happens, MultiFDPages_t will switch as
>> > a whole, including its offset[], because its offset[] always belong to this
>> > MultiFDPages_t. So yes, we want to lose that *offset reference together
>> > with MultiFDPages_t here, so the offset[] always belongs to one single
>> > MultiFDPages_t object for its lifetime.
>>
>> MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
>> individually:
>>
>> struct MultiFDSendData {
>> MultiFDPayloadType type;
>> union {
>> MultiFDPages_t ram_payload;
>> } u;
>> };
>>
>> (and even if it did, then we'd lose the pointer to ram_payload anyway -
>> or require multiple free/alloc)
>
> IMHO it's the same.
>
> The core idea is we allocate a buffer to put MultiFDSendData which may
> contain either Pages_t or DeviceState_t, and the size of the buffer should
> be MAX(A, B).
>
Right, but with your zero-length array proposals we need to have a
separate allocation for MultiFDPages_t because to expand the array we
need to include the number of pages.
Also, don't think only about MultiFDPages_t. With this approach we
cannot have pointers to memory allocated by the client at all anywhere
inside the union. Every pointer needs to have another reference
somewhere else to ensure we don't leak it. That's an unnecessary
restriction.
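A small sketch of the failure mode (using the same field names as the
example further below; purely illustrative):
    MultiFDPages_t *pages = &p->data->u.ram;
    /* client-owned allocation, referenced only from inside the union */
    pages->offset = g_new0(ram_addr_t, page_count);
    /*
     * If a device-state client later takes p->data and fills u.device,
     * the only reference to offset[] is overwritten and the memory leaks,
     * unless we free and re-allocate it around every packet.
     */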
>>
>> >
>> > For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
>> > but make it part of the struct (MultiFDPages_t.offset[]). Logically it's
>> > the same as option 1 but maybe slight cleaner. We just need to make it
>> > sized 0 so as to be dynamic in size.
>>
>> Seems like an undefined behavior magnet. If I sent this as the first
>> version, you'd NACK me right away.
>>
>> Besides, it's an unnecessary restriction to impose in the client
>> code. And like above, we don't allocate the struct directly, it's part
>> of MultiFDSendData, that's an advantage of using the union.
>>
>> I think we've reached the point where I'd like to hear more concrete
>> reasons for not going with the current proposal, except for the
>> simplicity argument you already put. I like the union idea, but OTOH we
>> already have a working solution right here.
>
> I think the issue with current proposal is each client will need to
> allocate (N+1)*buffer, so more user using it the more buffers we'll need (M
> users, then M*(N+1)*buffer). Currently it seems to me we will have 3 users
> at least: RAM, VFIO, and some other VMSD devices TBD in mid-long futures;
> the latter two will share the same DeviceState_t. Maybe vDPA as well at
> some point? Then 4.
You used the opposite argument earlier in this thread to argue in favor
of the union: We'll only have 2 clients. I'm confused.
Although, granted, this RFC does use more memory.
> I'd agree with this approach only if multifd is flexible enough to not even
> know what's the buffers, but it's not the case, and we seem only care about
> two:
>
> if (type==RAM)
> ...
> else
> assert(type==DEVICE);
> ...
I don't understand: "not even know what's the buffers" is exactly what
this series is about. It doesn't have any such conditional on "type".
>
> In this case I think it's easier we have multifd manage all the buffers
> (after all, it knows them well...). Then the consumption is not
> M*(N+1)*buffer, but (M+N)*buffer.
Fine. As I said, I like the union approach. It's just that it doesn't
work if the client wants to have a pointer in there.
Again, this is client data that multifd holds, it's not multifd
data. MultiFDPages_t or DeviceState_t have nothing to do with
multifd. It should be ok to have:
DeviceState_t *devstate = &p->data->u.device;
devstate->foo = g_new0(...);
devstate->bar = g_new0(...);
just like we have:
MultiFDPages_t *pages = &p->data->u.ram;
pages->offset = g_new0(ram_addr_t, page_count);
>
> Perhaps push your tree somewhere so we can have a quick look?
https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple
> I'm totally
> lost when you said I'll nack it.. so maybe I didn't really get what you
> meant. Codes may clarify that.
I'm conjecturing that any contributor adding a zero-length array (a[0])
would probably be given a hard time on the mailing list. There's 10
instances of it in the code base. The proper way to grow an array is to
use a flexible array (a[]) instead.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-19 16:54 ` Fabiano Rosas
@ 2024-07-19 17:58 ` Peter Xu
2024-07-19 21:30 ` Fabiano Rosas
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-19 17:58 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
On Fri, Jul 19, 2024 at 01:54:37PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Thu, Jul 18, 2024 at 07:32:05PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >>
> >> > On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu <peterx@redhat.com> writes:
> >> >>
> >> >> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
> >> >> >> v2 is ready, but unfortunately this approach doesn't work. When client A
> >> >> >> takes the payload, it fills it with it's data, which may include
> >> >> >> allocating memory. MultiFDPages_t does that for the offset. This means
> >> >> >> we need a round of free/malloc at every packet sent. For every client
> >> >> >> and every allocation they decide to do.
> >> >> >
> >> >> > Shouldn't be a blocker? E.g. one option is:
> >> >> >
> >> >> > /* Allocate both the pages + offset[] */
> >> >> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
> >> >> > sizeof(ram_addr_t) * n, 1);
> >> >> > pages->allocated = n;
> >> >> > pages->offset = &pages[1];
> >> >> >
> >> >> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
> >> >> >
> >> >> > typedef struct {
> >> >> > /* number of used pages */
> >> >> > uint32_t num;
> >> >> > /* number of normal pages */
> >> >> > uint32_t normal_num;
> >> >> > /* number of allocated pages */
> >> >> > uint32_t allocated;
> >> >> > RAMBlock *block;
> >> >> > /* offset of each page */
> >> >> > ram_addr_t offset[0];
> >> >> > } MultiFDPages_t;
> >> >>
> >> >> I think you missed the point. If we hold a pointer inside the payload,
> >> >> we lose the reference when the other client takes the structure and puts
> >> >> its own data there. So we'll need to alloc/free everytime we send a
> >> >> packet.
> >> >
> >> > For option 1: when the buffer switch happens, MultiFDPages_t will switch as
> >> > a whole, including its offset[], because its offset[] always belong to this
> >> > MultiFDPages_t. So yes, we want to lose that *offset reference together
> >> > with MultiFDPages_t here, so the offset[] always belongs to one single
> >> > MultiFDPages_t object for its lifetime.
> >>
> >> MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
> >> individually:
> >>
> >> struct MultiFDSendData {
> >> MultiFDPayloadType type;
> >> union {
> >> MultiFDPages_t ram_payload;
> >> } u;
> >> };
> >>
> >> (and even if it did, then we'd lose the pointer to ram_payload anyway -
> >> or require multiple free/alloc)
> >
> > IMHO it's the same.
> >
> > The core idea is we allocate a buffer to put MultiFDSendData which may
> > contain either Pages_t or DeviceState_t, and the size of the buffer should
> > be MAX(A, B).
> >
>
> Right, but with your zero-length array proposals we need to have a
> separate allocation for MultiFDPages_t because to expand the array we
> need to include the number of pages.
We need to fetch the max size we need and allocate one object that covers all
the sizes we need. I sincerely don't understand why it's an issue..
>
> Also, don't think only about MultiFDPages_t. With this approach we
> cannot have pointers to memory allocated by the client at all anywhere
> inside the union. Every pointer needs to have another reference
> somewhere else to ensure we don't leak it. That's an unnecessary
> restriction.
So even if there can be multiple pointers we can definitely play the same
trick: allocate objects A+B+C+D in the same chunk and let A->b point
to B, A->c point to C, and so on.
Before that, my question is do we really need that.
For device states, AFAIU it'll always be an opaque buffer.. VFIO needs
that, vDPA probably the same, and for VMSDs it'll be a temp buffer to put
the VMSD dump.
For multifd, I used offset[0] just to make sure things like "dynamic sized
multifd buffers" will easily work without much changes. Or even we could
have this, afaict:
#define MULTIFD_PAGES_PER_PACKET (128)
typedef struct {
/* number of used pages */
uint32_t num;
/* number of normal pages */
uint32_t normal_num;
/* number of allocated pages */
uint32_t allocated;
RAMBlock *block;
/* offset of each page */
ram_addr_t offset[MULTIFD_PAGES_PER_PACKET];
} MultiFDPages_t;
It might change perf on a few archs where psize is not 4K, but I don't see
it as a huge deal, personally.
Then everything will have no pointers, and it can be even slightly faster
because we use 64B cachelines in most systems nowadays, and one indirect
pointer may always need a load on a new cacheline otherwise..
This whole cacheline thing is trivial. What worries me is that you worry too
much about flexibility that we may never need.
And even with that flexibility I don't understand why you don't like
allocating an object that's larger than how the union is defined: I really
don't see it as a problem.. It'll need care on alloc/free, true, but it
should be pretty manageable in this case to me.
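To make the alloc/free part concrete, a minimal sketch of what I mean (the
helper name and MultiFDDeviceState_t here are for illustration only):
    /* sketch: one allocation, sized for the largest payload we may carry */
    static MultiFDSendData *multifd_send_data_alloc(uint32_t page_count)
    {
        size_t payload = MAX(sizeof(MultiFDPages_t) +
                             sizeof(ram_addr_t) * page_count,
                             sizeof(MultiFDDeviceState_t));

        return g_malloc0(offsetof(MultiFDSendData, u) + payload);
    }
Freeing stays a single g_free(), whatever payload type ended up using it.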
>
> >>
> >> >
> >> > For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
> >> > but make it part of the struct (MultiFDPages_t.offset[]). Logically it's
> >> > the same as option 1 but maybe slight cleaner. We just need to make it
> >> > sized 0 so as to be dynamic in size.
> >>
> >> Seems like an undefined behavior magnet. If I sent this as the first
> >> version, you'd NACK me right away.
> >>
> >> Besides, it's an unnecessary restriction to impose in the client
> >> code. And like above, we don't allocate the struct directly, it's part
> >> of MultiFDSendData, that's an advantage of using the union.
> >>
> >> I think we've reached the point where I'd like to hear more concrete
> >> reasons for not going with the current proposal, except for the
> >> simplicity argument you already put. I like the union idea, but OTOH we
> >> already have a working solution right here.
> >
> > I think the issue with current proposal is each client will need to
> > allocate (N+1)*buffer, so more user using it the more buffers we'll need (M
> > users, then M*(N+1)*buffer). Currently it seems to me we will have 3 users
> > at least: RAM, VFIO, and some other VMSD devices TBD in mid-long futures;
> > the latter two will share the same DeviceState_t. Maybe vDPA as well at
> > some point? Then 4.
>
> You used the opposite argument earlier in this thread to argue in favor
> of the union: We'll only have 2 clients. I'm confused.
Maybe I meant "2 types of clients"? VDPA will also use the same device
state buffer.
>
> Although, granted, this RFC does use more memory.
IMHO it's also easier to understand, where any user always has a free
SendData buffer to manipulate, and multifd always has one buffer for each
channel (free or busy). Compare that to each client needing to allocate
N buffers, where we're actually leaking at least the "number of multifd channels"
into the client, which may not be wanted.
IOW, I wonder whether you're happy with below to drop the union idea:
struct MultiFDSendData {
MultiFDPayloadType type;
MultiFDPages_t ram_payload;
MultiFDDeviceState_t device_payload;
};
Then we keep the "(M+N)" usage model, but don't use union and simply forget
about the memory consumption (similar to your original memory consumption
with this, but will be better as long as anything else joins, e.g. vDPA,
because then vDPA will at least share that same buffer with VFIO).
Do you think you would accept this?
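(Size-wise, just as an illustration: with the union, each object's payload
part costs MAX(sizeof(MultiFDPages_t), sizeof(MultiFDDeviceState_t)); with
the struct above it costs the sum of the two, for each of the M+N objects.)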
>
> > I'd agree with this approach only if multifd is flexible enough to not even
> > know what's the buffers, but it's not the case, and we seem only care about
> > two:
> >
> > if (type==RAM)
> > ...
> > else
> > assert(type==DEVICE);
> > ...
>
> I don't understand: "not even know what's the buffers" is exactly what
> this series is about. It doesn't have any such conditional on "type".
>
> >
> > In this case I think it's easier we have multifd manage all the buffers
> > (after all, it knows them well...). Then the consumption is not
> > M*(N+1)*buffer, but (M+N)*buffer.
>
> Fine. As I said, I like the union approach. It's just that it doesn't
> work if the client wants to have a pointer in there.
>
> Again, this is client data that multifd holds, it's not multifd
> data. MultiFDPages_t or DeviceState_t have nothing to do with
> multifd. It should be ok to have:
>
> DeviceState_t *devstate = &p->data->u.device;
> devstate->foo = g_new0(...);
> devstate->bar = g_new0(...);
>
> just like we have:
>
> MultiFDPages_t *pages = &p->data->u.ram;
> pages->offset = g_new0(ram_addr_t, page_count);
>
> >
> > Perhaps push your tree somewhere so we can have a quick look?
>
> https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple
>
> > I'm totally
> > lost when you said I'll nack it.. so maybe I didn't really get what you
> > meant. Codes may clarify that.
>
> I'm conjecturing that any contributor adding a zero-length array (a[0])
> would probably be given a hard time on the mailing list. There's 10
> instances of it in the code base. The proper way to grow an array is to
> use a flexible array (a[]) instead.
I'm not familiar with flexible array. What's the difference between:
struct {
int a[];
};
v.s.
struct {
int a[0];
};
?
If that works for you, it should work for me.
Or if you really hate the union / zero-sized array thing, it'll still be
nice to me to drop the union but keep using M+N objects model.
Thanks.
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-19 17:58 ` Peter Xu
@ 2024-07-19 21:30 ` Fabiano Rosas
0 siblings, 0 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-07-19 21:30 UTC (permalink / raw)
To: Peter Xu; +Cc: Wang, Lei, qemu-devel, Maciej S . Szmigiero
Peter Xu <peterx@redhat.com> writes:
> On Fri, Jul 19, 2024 at 01:54:37PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Thu, Jul 18, 2024 at 07:32:05PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >>
>> >> > On Thu, Jul 18, 2024 at 06:27:32PM -0300, Fabiano Rosas wrote:
>> >> >> Peter Xu <peterx@redhat.com> writes:
>> >> >>
>> >> >> > On Thu, Jul 18, 2024 at 04:39:00PM -0300, Fabiano Rosas wrote:
>> >> >> >> v2 is ready, but unfortunately this approach doesn't work. When client A
>> >> >> >> takes the payload, it fills it with it's data, which may include
>> >> >> >> allocating memory. MultiFDPages_t does that for the offset. This means
>> >> >> >> we need a round of free/malloc at every packet sent. For every client
>> >> >> >> and every allocation they decide to do.
>> >> >> >
>> >> >> > Shouldn't be a blocker? E.g. one option is:
>> >> >> >
>> >> >> > /* Allocate both the pages + offset[] */
>> >> >> > MultiFDPages_t *pages = g_malloc0(sizeof(MultiFDPages_t) +
>> >> >> > sizeof(ram_addr_t) * n, 1);
>> >> >> > pages->allocated = n;
>> >> >> > pages->offset = &pages[1];
>> >> >> >
>> >> >> > Or.. we can also make offset[] dynamic size, if that looks less tricky:
>> >> >> >
>> >> >> > typedef struct {
>> >> >> > /* number of used pages */
>> >> >> > uint32_t num;
>> >> >> > /* number of normal pages */
>> >> >> > uint32_t normal_num;
>> >> >> > /* number of allocated pages */
>> >> >> > uint32_t allocated;
>> >> >> > RAMBlock *block;
>> >> >> > /* offset of each page */
>> >> >> > ram_addr_t offset[0];
>> >> >> > } MultiFDPages_t;
>> >> >>
>> >> >> I think you missed the point. If we hold a pointer inside the payload,
>> >> >> we lose the reference when the other client takes the structure and puts
>> >> >> its own data there. So we'll need to alloc/free everytime we send a
>> >> >> packet.
>> >> >
>> >> > For option 1: when the buffer switch happens, MultiFDPages_t will switch as
>> >> > a whole, including its offset[], because its offset[] always belong to this
>> >> > MultiFDPages_t. So yes, we want to lose that *offset reference together
>> >> > with MultiFDPages_t here, so the offset[] always belongs to one single
>> >> > MultiFDPages_t object for its lifetime.
>> >>
>> >> MultiFDPages_t is part of MultiFDSendData, it doesn't get allocated
>> >> individually:
>> >>
>> >> struct MultiFDSendData {
>> >> MultiFDPayloadType type;
>> >> union {
>> >> MultiFDPages_t ram_payload;
>> >> } u;
>> >> };
>> >>
>> >> (and even if it did, then we'd lose the pointer to ram_payload anyway -
>> >> or require multiple free/alloc)
>> >
>> > IMHO it's the same.
>> >
>> > The core idea is we allocate a buffer to put MultiFDSendData which may
>> > contain either Pages_t or DeviceState_t, and the size of the buffer should
>> > be MAX(A, B).
>> >
>>
>> Right, but with your zero-length array proposals we need to have a
>> separate allocation for MultiFDPages_t because to expand the array we
>> need to include the number of pages.
>
> We need to fetch the max size we need and allocate one object covers all
> the sizes we need. I sincerely don't understand why it's an issue..
>
What you describe is this:
p->data = g_malloc(sizeof(MultiFDPayloadType) +
max(sizeof(MultiFDPages_t) + sizeof(ram_addr_t) * page_count,
sizeof(MultiFDDevice_t)));
This pushes payload-specific information into multifd_send_setup(),
which is against what we've been doing, namely isolating payload
information from the multifd core code.
>>
>> Also, don't think only about MultiFDPages_t. With this approach we
>> cannot have pointers to memory allocated by the client at all anywhere
>> inside the union. Every pointer needs to have another reference
>> somewhere else to ensure we don't leak it. That's an unnecessary
>> restriction.
>
> So even if there can be multiple pointers we can definitely play the same
> trick that we allocate object A+B+C+D in the same chunk and let A->b points
> to B, A->c points to C, and so on.
>
> Before that, my question is do we really need that.
>
> For device states, AFAIU it'll always be an opaque buffer.. VFIO needs
> that, vDPA probably the same, and for VMSDs it'll be a temp buffer to put
> the VMSD dump.
>
> For multifd, I used offset[0] just to make sure things like "dynamic sized
> multifd buffers" will easily work without much changes. Or even we could
> have this, afaict:
>
> #define MULTIFD_PAGES_PER_PACKET (128)
>
> typedef struct {
> /* number of used pages */
> uint32_t num;
> /* number of normal pages */
> uint32_t normal_num;
> /* number of allocated pages */
> uint32_t allocated;
> RAMBlock *block;
> /* offset of each page */
> ram_addr_t offset[MULTIFD_PAGES_PER_PACKET];
> } MultiFDPages_t;
I think this is off the table; we're looking into allowing the multifd
packet size to change, and the page size can change as well.
Also future payload types might need to add dynamically allocated data
to the payload. Although it might be ok if we have a rule that
everything in the payload needs to be static, it just seems unnecessary.
>
> It might change perf on a few archs where psize is not 4K, but I don't see
> it a huge deal, personally.
>
> Then everything will have no pointers, and it can be even slightly faster
> because we use 64B cachelines in most systems nowadays, and one indirect
> pointer may always need a load on a new cacheline otherwise..
>
> This whole cacheline thing is trivial. What I worried that you worry too
> much on that flexibility that we may never need.
>
> And even with that flexibility I don't understand why you don't like
> allocating an object that's larger than how the union is defined: I really
The object being larger than the union is not the point. The point is we
don't know what size that array will have, because that's
client-specific data. Say RAM code wants an array of X size, vmstate
might need another of Y size, etc.
> don't see it a problem.. It'll need care on alloc/free, true, but it
> should be pretty manageable in this case to me.
>
>>
>> >>
>> >> >
>> >> > For option 2: I meant MultiFDPages_t will have no offset[] pointer anymore,
>> >> > but make it part of the struct (MultiFDPages_t.offset[]). Logically it's
>> >> > the same as option 1 but maybe slight cleaner. We just need to make it
>> >> > sized 0 so as to be dynamic in size.
>> >>
>> >> Seems like an undefined behavior magnet. If I sent this as the first
>> >> version, you'd NACK me right away.
>> >>
>> >> Besides, it's an unnecessary restriction to impose in the client
>> >> code. And like above, we don't allocate the struct directly, it's part
>> >> of MultiFDSendData, that's an advantage of using the union.
>> >>
>> >> I think we've reached the point where I'd like to hear more concrete
>> >> reasons for not going with the current proposal, except for the
>> >> simplicity argument you already put. I like the union idea, but OTOH we
>> >> already have a working solution right here.
>> >
>> > I think the issue with current proposal is each client will need to
>> > allocate (N+1)*buffer, so more user using it the more buffers we'll need (M
>> > users, then M*(N+1)*buffer). Currently it seems to me we will have 3 users
>> > at least: RAM, VFIO, and some other VMSD devices TBD in mid-long futures;
>> > the latter two will share the same DeviceState_t. Maybe vDPA as well at
>> > some point? Then 4.
>>
>> You used the opposite argument earlier in this thread to argue in favor
>> of the union: We'll only have 2 clients. I'm confused.
>
> Maybe I meant "2 types of clients"? VDPA will also use the same device
> state buffer.
>
>>
>> Although, granted, this RFC does use more memory.
>
> IMHO it's also easier to understand, where any user always has a free
> SendData buffer to manipulate, and multifd always has one buffer for each
> channel (free or busy).
>
> That is compared to each client needs to allocate
> N buffers and we're actually at least leaking "number of multifd channels"
> into the client which may not be wanted.
That's a stretch. multifd-channels is a migration parameter, it's fine
if any code has access to it.
>
> IOW, I wonder whether you're happy with below to drop the union idea:
>
> struct MultiFDSendData {
> MultiFDPayloadType type;
> MultiFDPages_t ram_payload;
> MultiFDDeviceState_t device_payload;
> };
>
> Then we keep the "(M+N)" usage model, but don't use union and simply forget
> about the memory consumption (similar to your original memory consumption
> with this, but will be better as long as anything else joins, e.g. vDPA,
> because then vDPA will at least share that same buffer with VFIO).
>
> Do you think you would accept this?
I'm not sure. I'll try some alternatives first. Maybe the a[] approach
is not so bad. More below...
>
>>
>> > I'd agree with this approach only if multifd is flexible enough to not even
>> > know what's the buffers, but it's not the case, and we seem only care about
>> > two:
>> >
>> > if (type==RAM)
>> > ...
>> > else
>> > assert(type==DEVICE);
>> > ...
>>
>> I don't understand: "not even know what's the buffers" is exactly what
>> this series is about. It doesn't have any such conditional on "type".
>>
>> >
>> > In this case I think it's easier we have multifd manage all the buffers
>> > (after all, it knows them well...). Then the consumption is not
>> > M*(N+1)*buffer, but (M+N)*buffer.
>>
>> Fine. As I said, I like the union approach. It's just that it doesn't
>> work if the client wants to have a pointer in there.
>>
>> Again, this is client data that multifd holds, it's not multifd
>> data. MultiFDPages_t or DeviceState_t have nothing to do with
>> multifd. It should be ok to have:
>>
>> DeviceState_t *devstate = &p->data->u.device;
>> devstate->foo = g_new0(...);
>> devstate->bar = g_new0(...);
>>
>> just like we have:
>>
>> MultiFDPages_t *pages = &p->data->u.ram;
>> pages->offset = g_new0(ram_addr_t, page_count);
>>
>> >
>> > Perhaps push your tree somewhere so we can have a quick look?
>>
>> https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple
>>
>> > I'm totally
>> > lost when you said I'll nack it.. so maybe I didn't really get what you
>> > meant. Codes may clarify that.
>>
>> I'm conjecturing that any contributor adding a zero-length array (a[0])
>> would probably be given a hard time on the mailing list. There's 10
>> instances of it in the code base. The proper way to grow an array is to
>> use a flexible array (a[]) instead.
>
> I'm not familiar with flexible array. What's the difference between:
>
> struct {
> int a[];
> };
>
> v.s.
>
> struct {
> int a[0];
> };
Both are ways of making a dynamically sized structure. We allocate them
with:
s = malloc(sizeof(struct) + n * sizeof(int));
a[0] is the older way and full of issues:
- sizeof might return 0 or 1 depending on compiler extensions
- the compiler can't tell when the array is not at the end of the struct
- I'm not sure putting this inside a union is well defined. gcc docs
mention:
"Declaring zero-length arrays in other contexts, including as interior
members of structure objects or as non-member objects, is
discouraged. Accessing elements of zero-length arrays declared in such
contexts is undefined and may be diagnosed." --
https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
- the kernel has deprecated this usage entirely:
"...this led to other problems ... like not being able to detect when
such an array is accidentally being used _not_ at the end of a
structure (which could happen directly, or when such a struct was in
unions, structs of structs, etc)." --
https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays
a[] is the modern way and doesn't have the issues above, but it's still
not a perfect solution for us:
- note above how allocation doesn't reference "a" directly, which means
we need to know "n" at the time of allocating p->data. This requires
the sizeof mess I mentioned at the beginning.
- the array needs to be at the end of the structure, so we can only have
one of them per payload.
- I have no idea how this would work if more than one payload needed an
array like that.
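As a minimal sketch of the a[] pattern under discussion (struct and names
below are illustrative, not the actual definitions):
    typedef struct {
        uint32_t num;
        uint32_t allocated;
        RAMBlock *block;
        ram_addr_t offset[];    /* flexible array member, must come last */
    } PagesExample;

    /* sizeof(PagesExample) does not include offset[], so the element count
     * must already be known when p->data gets allocated: */
    PagesExample *example = g_malloc0(sizeof(*example) +
                                      page_count * sizeof(ram_addr_t));
Note also that in standard C such a struct (or any union containing it)
can't in turn be a member of another struct or an array element, which
matters for embedding it in MultiFDSendData.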
I'll give this one a shot and see how it looks. I'm getting tired of
looking at this code.
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-10 20:16 ` Fabiano Rosas
2024-07-10 21:55 ` Peter Xu
@ 2024-07-16 20:10 ` Maciej S. Szmigiero
2024-07-17 19:00 ` Peter Xu
1 sibling, 1 reply; 46+ messages in thread
From: Maciej S. Szmigiero @ 2024-07-16 20:10 UTC (permalink / raw)
To: Fabiano Rosas, Peter Xu; +Cc: Wang, Lei, qemu-devel, Avihai Horon
On 10.07.2024 22:16, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
>> On Wed, Jul 10, 2024 at 01:10:37PM -0300, Fabiano Rosas wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Thu, Jun 27, 2024 at 11:27:08AM +0800, Wang, Lei wrote:
>>>>>> Or graphically:
>>>>>>
>>>>>> 1) client fills the active slot with data. Channels point to nothing
>>>>>> at this point:
>>>>>> [a] <-- active slot
>>>>>> [][][][] <-- free slots, one per-channel
>>>>>>
>>>>>> [][][][] <-- channels' p->data pointers
>>>>>>
>>>>>> 2) multifd_send() swaps the pointers inside the client slot. Channels
>>>>>> still point to nothing:
>>>>>> []
>>>>>> [a][][][]
>>>>>>
>>>>>> [][][][]
>>>>>>
>>>>>> 3) multifd_send() finds an idle channel and updates its pointer:
>>>>>
>>>>> It seems the action "finds an idle channel" is in step 2 rather than step 3,
>>>>> which means the free slot is selected based on the id of the channel found, am I
>>>>> understanding correctly?
>>>>
>>>> I think you're right.
>>>>
>>>> Actually I also feel like the description here is ambiguous, even though I
>>>> think I get what Fabiano wanted to say.
>>>>
>>>> The free slot should be the first step of step 2+3, here what Fabiano
>>>> really wanted to suggest is we move the free buffer array from multifd
>>>> channels into the callers, then the caller can pass in whatever data to
>>>> send.
>>>>
>>>> So I think maybe it's cleaner to write it as this in code (note: I didn't
>>>> really change the code, just some ordering and comments):
>>>>
>>>> ===8<===
>>>> @@ -710,15 +710,11 @@ static bool multifd_send(MultiFDSlots *slots)
>>>> */
>>>> active_slot = slots->active;
>>>> slots->active = slots->free[p->id];
>>>> - p->data = active_slot;
>>>> -
>>>> - /*
>>>> - * By the next time we arrive here, the channel will certainly
>>>> - * have consumed the active slot. Put it back on the free list
>>>> - * now.
>>>> - */
>>>> slots->free[p->id] = active_slot;
>>>>
>>>> + /* Assign the current active slot to the chosen thread */
>>>> + p->data = active_slot;
>>>> ===8<===
>>>>
>>>> The comment I removed is slightly misleading to me too, because right now
>>>> active_slot contains the data hasn't yet been delivered to multifd, so
>>>> we're "putting it back to free list" not because of it's free, but because
>>>> we know it won't get used until the multifd send thread consumes it
>>>> (because before that the thread will be busy, and we won't use the buffer
>>>> if so in upcoming send()s).
>>>>
>>>> And then when I'm looking at this again, I think maybe it's a slight
>>>> overkill, and maybe we can still keep the "opaque data" managed by multifd.
>>>> One reason might be that I don't expect the "opaque data" payload keep
>>>> growing at all: it should really be either RAM or device state as I
>>>> commented elsewhere in a relevant thread, after all it's a thread model
>>>> only for migration purpose to move vmstates..
>>>
>>> Some amount of flexibility needs to be baked in. For instance, what
>>> about the handshake procedure? Don't we want to use multifd threads to
>>> put some information on the wire for that as well?
>>
>> Is this an orthogonal question?
>
> I don't think so. You say the payload data should be either RAM or
> device state. I'm asking what other types of data do we want the multifd
> channel to transmit and suggesting we need to allow room for the
> addition of that, whatever it is. One thing that comes to mind that is
> neither RAM or device state is some form of handshake or capabilities
> negotiation.
The RFC version of my multifd device state transfer patch set introduced
a new migration channel header (by Avihai) for clean and extensible
migration channel handshaking, but people didn't like it, so it was removed in v1.
Thanks,
Maciej
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-16 20:10 ` Maciej S. Szmigiero
@ 2024-07-17 19:00 ` Peter Xu
2024-07-17 21:07 ` Maciej S. Szmigiero
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-07-17 19:00 UTC (permalink / raw)
To: Maciej S. Szmigiero; +Cc: Fabiano Rosas, Wang, Lei, qemu-devel, Avihai Horon
On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote:
> > > > > The comment I removed is slightly misleading to me too, because right now
> > > > > active_slot contains the data hasn't yet been delivered to multifd, so
> > > > > we're "putting it back to free list" not because of it's free, but because
> > > > > we know it won't get used until the multifd send thread consumes it
> > > > > (because before that the thread will be busy, and we won't use the buffer
> > > > > if so in upcoming send()s).
> > > > >
> > > > > And then when I'm looking at this again, I think maybe it's a slight
> > > > > overkill, and maybe we can still keep the "opaque data" managed by multifd.
> > > > > One reason might be that I don't expect the "opaque data" payload keep
> > > > > growing at all: it should really be either RAM or device state as I
> > > > > commented elsewhere in a relevant thread, after all it's a thread model
> > > > > only for migration purpose to move vmstates..
> > > >
> > > > Some amount of flexibility needs to be baked in. For instance, what
> > > > about the handshake procedure? Don't we want to use multifd threads to
> > > > put some information on the wire for that as well?
> > >
> > > Is this an orthogonal question?
> >
> > I don't think so. You say the payload data should be either RAM or
> > device state. I'm asking what other types of data do we want the multifd
> > channel to transmit and suggesting we need to allow room for the
> > addition of that, whatever it is. One thing that comes to mind that is
> > neither RAM or device state is some form of handshake or capabilities
> > negotiation.
>
> The RFC version of my multifd device state transfer patch set introduced
> a new migration channel header (by Avihai) for clean and extensible
> migration channel handshaking but people didn't like so it was removed in v1.
Hmm, I'm not sure this is relevant to the context of discussion here, but I
confess I didn't notice the per-channel header thing in the previous RFC
series. Link is here:
https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigiero@oracle.com
Maciej, if you want, you can split that out of the series. So far it looks
like a good thing with/without how VFIO tackles it.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-17 19:00 ` Peter Xu
@ 2024-07-17 21:07 ` Maciej S. Szmigiero
2024-07-17 21:30 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Maciej S. Szmigiero @ 2024-07-17 21:07 UTC (permalink / raw)
To: Peter Xu
Cc: Fabiano Rosas, Wang, Lei, qemu-devel, Avihai Horon,
Daniel P. Berrangé
On 17.07.2024 21:00, Peter Xu wrote:
> On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote:
>>>>>> The comment I removed is slightly misleading to me too, because right now
>>>>>> active_slot contains the data hasn't yet been delivered to multifd, so
>>>>>> we're "putting it back to free list" not because of it's free, but because
>>>>>> we know it won't get used until the multifd send thread consumes it
>>>>>> (because before that the thread will be busy, and we won't use the buffer
>>>>>> if so in upcoming send()s).
>>>>>>
>>>>>> And then when I'm looking at this again, I think maybe it's a slight
>>>>>> overkill, and maybe we can still keep the "opaque data" managed by multifd.
>>>>>> One reason might be that I don't expect the "opaque data" payload keep
>>>>>> growing at all: it should really be either RAM or device state as I
>>>>>> commented elsewhere in a relevant thread, after all it's a thread model
>>>>>> only for migration purpose to move vmstates..
>>>>>
>>>>> Some amount of flexibility needs to be baked in. For instance, what
>>>>> about the handshake procedure? Don't we want to use multifd threads to
>>>>> put some information on the wire for that as well?
>>>>
>>>> Is this an orthogonal question?
>>>
>>> I don't think so. You say the payload data should be either RAM or
>>> device state. I'm asking what other types of data do we want the multifd
>>> channel to transmit and suggesting we need to allow room for the
>>> addition of that, whatever it is. One thing that comes to mind that is
>>> neither RAM or device state is some form of handshake or capabilities
>>> negotiation.
>>
>> The RFC version of my multifd device state transfer patch set introduced
>> a new migration channel header (by Avihai) for clean and extensible
>> migration channel handshaking but people didn't like so it was removed in v1.
>
> Hmm, I'm not sure this is relevant to the context of discussion here, but I
> confess I didn't notice the per-channel header thing in the previous RFC
> series. Link is here:
>
> https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigiero@oracle.com
The channel header patches were dropped because Daniel didn't like them:
https://lore.kernel.org/qemu-devel/Zh-KF72Fe9oV6tfT@redhat.com/
https://lore.kernel.org/qemu-devel/Zh_6W8u3H4FmGS49@redhat.com/
> Maciej, if you want, you can split that out of the series. So far it looks
> like a good thing with/without how VFIO tackles it.
Unfortunately, these channel header patches from Avihai obviously impact the wire
protocol and are a bit intermingled with the rest of the device state
transfer patch set, so it would be good to know upfront whether there is
some consensus to (re)introduce this new channel header (CCed Daniel, too).
> Thanks,
>
Thanks,
Maciej
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters
2024-07-17 21:07 ` Maciej S. Szmigiero
@ 2024-07-17 21:30 ` Peter Xu
0 siblings, 0 replies; 46+ messages in thread
From: Peter Xu @ 2024-07-17 21:30 UTC (permalink / raw)
To: Maciej S. Szmigiero
Cc: Fabiano Rosas, Wang, Lei, qemu-devel, Avihai Horon,
Daniel P. Berrangé
On Wed, Jul 17, 2024 at 11:07:17PM +0200, Maciej S. Szmigiero wrote:
> On 17.07.2024 21:00, Peter Xu wrote:
> > On Tue, Jul 16, 2024 at 10:10:25PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > The comment I removed is slightly misleading to me too, because right now
> > > > > > > active_slot contains the data hasn't yet been delivered to multifd, so
> > > > > > > we're "putting it back to free list" not because of it's free, but because
> > > > > > > we know it won't get used until the multifd send thread consumes it
> > > > > > > (because before that the thread will be busy, and we won't use the buffer
> > > > > > > if so in upcoming send()s).
> > > > > > >
> > > > > > > And then when I'm looking at this again, I think maybe it's a slight
> > > > > > > overkill, and maybe we can still keep the "opaque data" managed by multifd.
> > > > > > > One reason might be that I don't expect the "opaque data" payload keep
> > > > > > > growing at all: it should really be either RAM or device state as I
> > > > > > > commented elsewhere in a relevant thread, after all it's a thread model
> > > > > > > only for migration purpose to move vmstates..
> > > > > >
> > > > > > Some amount of flexibility needs to be baked in. For instance, what
> > > > > > about the handshake procedure? Don't we want to use multifd threads to
> > > > > > put some information on the wire for that as well?
> > > > >
> > > > > Is this an orthogonal question?
> > > >
> > > > I don't think so. You say the payload data should be either RAM or
> > > > device state. I'm asking what other types of data do we want the multifd
> > > > channel to transmit and suggesting we need to allow room for the
> > > > addition of that, whatever it is. One thing that comes to mind that is
> > > > neither RAM or device state is some form of handshake or capabilities
> > > > negotiation.
> > >
> > > The RFC version of my multifd device state transfer patch set introduced
> > > a new migration channel header (by Avihai) for clean and extensible
> > > migration channel handshaking but people didn't like so it was removed in v1.
> >
> > Hmm, I'm not sure this is relevant to the context of discussion here, but I
> > confess I didn't notice the per-channel header thing in the previous RFC
> > series. Link is here:
> >
> > https://lore.kernel.org/r/636cec92eb801f13ba893de79d4872f5d8342097.1713269378.git.maciej.szmigiero@oracle.com
>
> The channel header patches were dropped because Daniel didn't like them:
> https://lore.kernel.org/qemu-devel/Zh-KF72Fe9oV6tfT@redhat.com/
> https://lore.kernel.org/qemu-devel/Zh_6W8u3H4FmGS49@redhat.com/
Ah I missed that too when I quickly went over the old series, sorry.
I think what Dan meant was that we'd better do that with the handshake
work, which should cover more than this. I've no problem with that.
It's just that sooner or later, we should provide something more solid than
commit 6720c2b327 ("migration: check magic value for deciding the mapping
of channels").
>
> > Maciej, if you want, you can split that out of the series. So far it looks
> > like a good thing with/without how VFIO tackles it.
>
> Unfortunately, these Avihai's channel header patches obviously impact wire
> protocol and are a bit of intermingled with the rest of the device state
> transfer patch set so it would be good to know upfront whether there is
> some consensus to (re)introduce this new channel header (CCed Daniel, too).
When I mentioned posting it separately, it'll still not be relevant to the
VFIO series. IOW, I think below is definitely not needed (and I think we're
on the same page now to reuse multifd threads as generic channels, so
there's no issue now):
https://lore.kernel.org/qemu-devel/027695db92ace07d2d6ee66da05f8e85959fd46a.1713269378.git.maciej.szmigiero@oracle.com/
So I assume we should leave that for later for whoever refactors the
handshake process.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 46+ messages in thread
* [RFC PATCH 7/7] migration/multifd: Hide multifd slots implementation
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
` (5 preceding siblings ...)
2024-06-20 21:21 ` [RFC PATCH 6/7] migration/multifd: Move payload storage out of the channel parameters Fabiano Rosas
@ 2024-06-20 21:21 ` Fabiano Rosas
2024-06-21 14:44 ` [RFC PATCH 0/7] migration/multifd: Introduce storage slots Maciej S. Szmigiero
7 siblings, 0 replies; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-20 21:21 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Maciej S . Szmigiero
The only two things the multifd client needs to access are the active
slot and the active slot size:
The active slot itself is obviously needed because it's where the data
is put.
The slot size is needed only by the ram pages code, because it does
not fill the data slot and send it in one go; instead it fills the
slot partially at each call of multifd_queue_page(), so the size is
needed to differentiate an empty slot (free or recently consumed) from
one that is partially full.
Hide the MultiFDSlots implementation so the client is not tempted to
make use of the free list. That field is there simply because we need
the client to carry a handle to that memory, it's not supposed to be
accessed directly.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
migration/multifd.c | 26 +++++++++++++++++++++++---
migration/multifd.h | 8 +++-----
2 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/migration/multifd.c b/migration/multifd.c
index f22a1c2e84..9fb719eb0d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -38,6 +38,11 @@
#define MULTIFD_MAGIC 0x11223344U
#define MULTIFD_VERSION 1
+struct MultiFDSlots {
+ MultiFDSendData **free;
+ MultiFDSendData *active;
+};
+
typedef struct {
uint32_t magic;
uint32_t version;
@@ -737,7 +742,22 @@ static inline bool multifd_queue_full(MultiFDPages_t *pages)
static inline void multifd_enqueue(MultiFDPages_t *pages, ram_addr_t offset)
{
pages->offset[pages->num++] = offset;
- multifd_ram_send_slots->active->size += qemu_target_page_size();
+ multifd_set_slot_size(multifd_ram_send_slots, qemu_target_page_size());
+}
+
+void *multifd_get_active_slot(MultiFDSlots *multifd_ram_send_slots)
+{
+ return multifd_ram_send_slots->active->opaque;
+}
+
+void multifd_set_slot_size(MultiFDSlots *multifd_ram_send_slots, size_t size)
+{
+ multifd_ram_send_slots->active->size += size;
+}
+
+bool multifd_slot_has_data(MultiFDSlots *multifd_ram_send_slots)
+{
+ return !!multifd_ram_send_slots->active->size;
}
/* Returns true if enqueue successful, false otherwise */
@@ -746,7 +766,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
MultiFDPages_t *pages;
retry:
- pages = multifd_ram_send_slots->active->opaque;
+ pages = multifd_get_active_slot(multifd_ram_send_slots);
/* If the queue is empty, we can already enqueue now */
if (multifd_queue_empty(pages)) {
@@ -951,7 +971,7 @@ int multifd_send_sync_main(void)
return 0;
}
- if (multifd_ram_send_slots->active->size) {
+ if (multifd_slot_has_data(multifd_ram_send_slots)) {
if (!multifd_send(multifd_ram_send_slots)) {
error_report("%s: multifd_send_pages fail", __func__);
return -1;
diff --git a/migration/multifd.h b/migration/multifd.h
index 5230729077..8f99fe2652 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -102,15 +102,13 @@ struct MultiFDSendData {
void (*cleanup)(void *);
};
-struct MultiFDSlots {
- MultiFDSendData **free;
- MultiFDSendData *active;
-};
-
MultiFDSlots *multifd_allocate_slots(void *(*alloc_fn)(void),
void (*reset_fn)(void *),
void (*cleanup_fn)(void *));
void multifd_ram_save_setup(void);
+void *multifd_get_active_slot(MultiFDSlots *multifd_ram_send_slots);
+void multifd_set_slot_size(MultiFDSlots *multifd_ram_send_slots, size_t size);
+bool multifd_slot_has_data(MultiFDSlots *multifd_ram_send_slots);
typedef struct {
/* Fields are only written at creating/deletion time */
--
2.35.3
^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-20 21:21 [RFC PATCH 0/7] migration/multifd: Introduce storage slots Fabiano Rosas
` (6 preceding siblings ...)
2024-06-20 21:21 ` [RFC PATCH 7/7] migration/multifd: Hide multifd slots implementation Fabiano Rosas
@ 2024-06-21 14:44 ` Maciej S. Szmigiero
2024-06-21 15:04 ` Fabiano Rosas
7 siblings, 1 reply; 46+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-21 14:44 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Peter Xu, qemu-devel
Hi Fabiano,
On 20.06.2024 23:21, Fabiano Rosas wrote:
> Hi folks,
>
> First of all, apologies for the roughness of the series. I'm off for
> the next couple of weeks and wanted to put something together early
> for your consideration.
>
> This series is a refactoring (based on an earlier, off-list
> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> the multifd core. If we're going to add support for more data types to
> multifd, we first need to clean that up.
>
> This time around this work was prompted by Maciej's series[1]. I see
> you're having to add a bunch of is_device_state checks to work around
> the rigidity of the code.
>
> Aside from the VFIO work, there is also the intent (coming back from
> Juan's ideas) to make multifd the default code path for migration,
> which will have to include the vmstate migration and anything else we
> put on the stream via QEMUFile.
>
> I have long since been bothered by having 'pages' sprinkled all over
> the code, so I might be coming at this with a bit of a narrow focus,
> but I believe in order to support more types of payloads in multifd,
> we need to first allow the scheduling at multifd_send_pages() to be
> independent of MultiFDPages_t. So here it is. Let me know what you
> think.
Thanks for the patch set, I quickly glanced at these patches and they
definitely make sense to me.
I guess its latest version could be found in the repo at [2] since
that's where the CI run mentioned below took it from?
> (as I said, I'll be off for a couple of weeks, so feel free to
> incorporate any of this code if it's useful. Or to ignore it
> completely).
I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
feature freeze in about a month, correct?
> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028
>
> 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
> 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com
[2]: https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple
Thanks,
Maciej
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-21 14:44 ` [RFC PATCH 0/7] migration/multifd: Introduce storage slots Maciej S. Szmigiero
@ 2024-06-21 15:04 ` Fabiano Rosas
2024-06-21 15:31 ` Maciej S. Szmigiero
0 siblings, 1 reply; 46+ messages in thread
From: Fabiano Rosas @ 2024-06-21 15:04 UTC (permalink / raw)
To: Maciej S. Szmigiero; +Cc: Peter Xu, qemu-devel
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> Hi Fabiano,
>
> On 20.06.2024 23:21, Fabiano Rosas wrote:
>> Hi folks,
>>
>> First of all, apologies for the roughness of the series. I'm off for
>> the next couple of weeks and wanted to put something together early
>> for your consideration.
>>
>> This series is a refactoring (based on an earlier, off-list
>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>> the multifd core. If we're going to add support for more data types to
>> multifd, we first need to clean that up.
>>
>> This time around this work was prompted by Maciej's series[1]. I see
>> you're having to add a bunch of is_device_state checks to work around
>> the rigidity of the code.
>>
>> Aside from the VFIO work, there is also the intent (coming back from
>> Juan's ideas) to make multifd the default code path for migration,
>> which will have to include the vmstate migration and anything else we
>> put on the stream via QEMUFile.
>>
>> I have long since been bothered by having 'pages' sprinkled all over
>> the code, so I might be coming at this with a bit of a narrow focus,
>> but I believe in order to support more types of payloads in multifd,
>> we need to first allow the scheduling at multifd_send_pages() to be
>> independent of MultiFDPages_t. So here it is. Let me know what you
>> think.
>
> Thanks for the patch set, I quickly glanced at these patches and they
> definitely make sense to me.
>
> I guess its latest version could be found in the repo at [2] since
> that's where the CI run mentioned below took it from?
Yes
>
>> (as I said, I'll be off for a couple of weeks, so feel free to
>> incorporate any of this code if it's useful. Or to ignore it
>> completely).
>
> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> feature freeze in about a month, correct?
>
For general code improvements like this I'm not thinking about QEMU
releases at all. But this series is not super complex, so I could
imagine we merging it in time for 9.1 if we reach an agreement.
Are you thinking your series might miss the target? Or have concerns
over the stability of the refactoring? We can within reason merge code
based on the current framework and improve things on top, we already did
something similar when merging zero-page support. I don't have an issue
with that.
>> CI run: https://gitlab.com/farosas/qemu/-/pipelines/1340992028
>>
>> 0- https://github.com/farosas/qemu/commits/multifd-packet-cleanups/
>> 1- https://lore.kernel.org/r/cover.1718717584.git.maciej.szmigiero@oracle.com
>
> [2]: https://gitlab.com/farosas/qemu/-/commits/multifd-pages-decouple
>
> Thanks,
> Maciej
^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-21 15:04 ` Fabiano Rosas
@ 2024-06-21 15:31 ` Maciej S. Szmigiero
2024-06-21 15:56 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-21 15:31 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: Peter Xu, qemu-devel
On 21.06.2024 17:04, Fabiano Rosas wrote:
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>
>> Hi Fabiano,
>>
>> On 20.06.2024 23:21, Fabiano Rosas wrote:
>>> Hi folks,
>>>
>>> First of all, apologies for the roughness of the series. I'm off for
>>> the next couple of weeks and wanted to put something together early
>>> for your consideration.
>>>
>>> This series is a refactoring (based on an earlier, off-list
>>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>>> the multifd core. If we're going to add support for more data types to
>>> multifd, we first need to clean that up.
>>>
>>> This time around this work was prompted by Maciej's series[1]. I see
>>> you're having to add a bunch of is_device_state checks to work around
>>> the rigidity of the code.
>>>
>>> Aside from the VFIO work, there is also the intent (coming back from
>>> Juan's ideas) to make multifd the default code path for migration,
>>> which will have to include the vmstate migration and anything else we
>>> put on the stream via QEMUFile.
>>>
>>> I have long since been bothered by having 'pages' sprinkled all over
>>> the code, so I might be coming at this with a bit of a narrow focus,
>>> but I believe in order to support more types of payloads in multifd,
>>> we need to first allow the scheduling at multifd_send_pages() to be
>>> independent of MultiFDPages_t. So here it is. Let me know what you
>>> think.
>>
>> Thanks for the patch set, I quickly glanced at these patches and they
>> definitely make sense to me.
>>
(..)
>>> (as I said, I'll be off for a couple of weeks, so feel free to
>>> incorporate any of this code if it's useful. Or to ignore it
>>> completely).
>>
>> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
>> feature freeze in about a month, correct?
>>
>
> For general code improvements like this I'm not thinking about QEMU
> releases at all. But this series is not super complex, so I could
> imagine we merging it in time for 9.1 if we reach an agreement.
>
> Are you thinking your series might miss the target? Or have concerns
> over the stability of the refactoring? We can within reason merge code
> based on the current framework and improve things on top, we already did
> something similar when merging zero-page support. I don't have an issue
> with that.
The reason I asked whether you are targeting 9.1 is that my
patch set is definitely targeting that release.

At the same time, my patch set will need to be rebased/refactored on top
of this patch set if it is supposed to be merged for 9.1 too.

If this patch set gets merged quickly, that's not really a problem.

On the other hand, if one or more further iterations are needed AND you are
not available in the coming weeks to work on them, then there's a
question of whether we will make the required deadline.
Thanks,
Maciej
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-21 15:31 ` Maciej S. Szmigiero
@ 2024-06-21 15:56 ` Peter Xu
2024-06-21 17:40 ` Maciej S. Szmigiero
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-06-21 15:56 UTC (permalink / raw)
To: Maciej S. Szmigiero; +Cc: Fabiano Rosas, qemu-devel
On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
> On 21.06.2024 17:04, Fabiano Rosas wrote:
> > "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> >
> > > Hi Fabiano,
> > >
> > > On 20.06.2024 23:21, Fabiano Rosas wrote:
> > > > Hi folks,
> > > >
> > > > First of all, apologies for the roughness of the series. I'm off for
> > > > the next couple of weeks and wanted to put something together early
> > > > for your consideration.
> > > >
> > > > This series is a refactoring (based on an earlier, off-list
> > > > attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> > > > the multifd core. If we're going to add support for more data types to
> > > > multifd, we first need to clean that up.
> > > >
> > > > This time around this work was prompted by Maciej's series[1]. I see
> > > > you're having to add a bunch of is_device_state checks to work around
> > > > the rigidity of the code.
> > > >
> > > > Aside from the VFIO work, there is also the intent (coming back from
> > > > Juan's ideas) to make multifd the default code path for migration,
> > > > which will have to include the vmstate migration and anything else we
> > > > put on the stream via QEMUFile.
> > > >
> > > > I have long since been bothered by having 'pages' sprinkled all over
> > > > the code, so I might be coming at this with a bit of a narrow focus,
> > > > but I believe in order to support more types of payloads in multifd,
> > > > we need to first allow the scheduling at multifd_send_pages() to be
> > > > independent of MultiFDPages_t. So here it is. Let me know what you
> > > > think.
> > >
> > > Thanks for the patch set, I quickly glanced at these patches and they
> > > definitely make sense to me.
> > >
> (..)
> > > > (as I said, I'll be off for a couple of weeks, so feel free to
> > > > incorporate any of this code if it's useful. Or to ignore it
> > > > completely).
> > >
> > > I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> > > feature freeze in about a month, correct?
> > >
> >
> > For general code improvements like this I'm not thinking about QEMU
> > releases at all. But this series is not super complex, so I could
> > imagine we merging it in time for 9.1 if we reach an agreement.
> >
> > Are you thinking your series might miss the target? Or have concerns
> > over the stability of the refactoring? We can within reason merge code
> > based on the current framework and improve things on top, we already did
> > something similar when merging zero-page support. I don't have an issue
> > with that.
>
> The reason that I asked whether you are targeting 9.1 is because my
> patch set is definitely targeting that release.
>
> At the same time my patch set will need to be rebased/refactored on top
> of this patch set if it is supposed to be merged for 9.1 too.
>
> If this patch set gets merged quickly that's not really a problem.
>
> On the other hand, if another iteration(s) is/are needed AND you are
> not available in the coming weeks to work on them then there's a
> question whether we will make the required deadline.
I think it's a bit rushed to merge the VFIO series in this release. I'm not
sure it has enough time to be properly reviewed, reposted, retested, etc.

I've already started looking at it, and so far my doubts are not only about
the device_state thing, where I agree with Fabiano and would prefer to
avoid it, but I'm also thinking of any possible way to at least make the
worker threads generic too: a direct impact could be vDPA in the near
future if anyone cared, while I don't want modules to create threads
randomly during migration.

Meanwhile I'm also wondering whether "the thread needs to dump all
data, and during iteration we can't do that" is a good reason not to
support this during iterations.

I didn't reply yet because I don't think I've thought everything through,
but I'll get there.

So I'm not saying that the design is problematic, but IMHO it's just not
mature enough to assume it will land in 9.1, considering it's still a
large series, and the first non-RFC version was posted just two days ago.
Thanks,
--
Peter Xu
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-21 15:56 ` Peter Xu
@ 2024-06-21 17:40 ` Maciej S. Szmigiero
2024-06-21 20:54 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-21 17:40 UTC (permalink / raw)
To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel
On 21.06.2024 17:56, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
>> On 21.06.2024 17:04, Fabiano Rosas wrote:
>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>
>>>> Hi Fabiano,
>>>>
>>>> On 20.06.2024 23:21, Fabiano Rosas wrote:
>>>>> Hi folks,
>>>>>
>>>>> First of all, apologies for the roughness of the series. I'm off for
>>>>> the next couple of weeks and wanted to put something together early
>>>>> for your consideration.
>>>>>
>>>>> This series is a refactoring (based on an earlier, off-list
>>>>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>>>>> the multifd core. If we're going to add support for more data types to
>>>>> multifd, we first need to clean that up.
>>>>>
>>>>> This time around this work was prompted by Maciej's series[1]. I see
>>>>> you're having to add a bunch of is_device_state checks to work around
>>>>> the rigidity of the code.
>>>>>
>>>>> Aside from the VFIO work, there is also the intent (coming back from
>>>>> Juan's ideas) to make multifd the default code path for migration,
>>>>> which will have to include the vmstate migration and anything else we
>>>>> put on the stream via QEMUFile.
>>>>>
>>>>> I have long since been bothered by having 'pages' sprinkled all over
>>>>> the code, so I might be coming at this with a bit of a narrow focus,
>>>>> but I believe in order to support more types of payloads in multifd,
>>>>> we need to first allow the scheduling at multifd_send_pages() to be
>>>>> independent of MultiFDPages_t. So here it is. Let me know what you
>>>>> think.
>>>>
>>>> Thanks for the patch set, I quickly glanced at these patches and they
>>>> definitely make sense to me.
>>>>
>> (..)
>>>>> (as I said, I'll be off for a couple of weeks, so feel free to
>>>>> incorporate any of this code if it's useful. Or to ignore it
>>>>> completely).
>>>>
>>>> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
>>>> feature freeze in about a month, correct?
>>>>
>>>
>>> For general code improvements like this I'm not thinking about QEMU
>>> releases at all. But this series is not super complex, so I could
>>> imagine we merging it in time for 9.1 if we reach an agreement.
>>>
>>> Are you thinking your series might miss the target? Or have concerns
>>> over the stability of the refactoring? We can within reason merge code
>>> based on the current framework and improve things on top, we already did
>>> something similar when merging zero-page support. I don't have an issue
>>> with that.
>>
>> The reason that I asked whether you are targeting 9.1 is because my
>> patch set is definitely targeting that release.
>>
>> At the same time my patch set will need to be rebased/refactored on top
>> of this patch set if it is supposed to be merged for 9.1 too.
>>
>> If this patch set gets merged quickly that's not really a problem.
>>
>> On the other hand, if another iteration(s) is/are needed AND you are
>> not available in the coming weeks to work on them then there's a
>> question whether we will make the required deadline.
>
> I think it's a bit rush to merge the vfio series in this release. I'm not
> sure it has enough time to be properly reviewed, reposted, retested, etc.
>
> I've already started looking at it, and so far I think I have doubt not
> only on agreement with Fabiano on the device_state thing which I prefer to
> avoid, but also I'm thinking of any possible way to at least make the
> worker threads generic too: a direct impact could be vDPA in the near
> future if anyone cared, while I don't want modules to create threads
> randomly during migration.
>
> Meanwhile I'm also thinking whether that "the thread needs to dump all
> data, and during iteration we can't do that" is the good reason to not
> support that during iterations.
>
> I didn't yet reply because I don't think I think all things through, but
> I'll get there.
>
> So I'm not saying that the design is problematic, but IMHO it's just not
> mature enough to assume it will land in 9.1, considering it's still a large
> one, and the first non-rfc version just posted two days ago.
The RFC version was posted more than 2 months ago.

It has received some review comments from multiple people,
all of which were addressed in this patch set version.

I have not received any further comments during these 2 months, so I thought
the overall design was considered okay - if anything, there might be minor
code comments/issues, but these can easily be improved/fixed in the 5 weeks
remaining until the soft code freeze for 9.1.

If anything, I think that the VM live phase (non-downtime) transfer
functionality should be deferred until 9.2 because:

* It wasn't a part of the RFC, so even if implemented today it would get much
less testing overall,

* It's orthogonal to the switchover-time device state transfer functionality
introduced by this patch set and could be added on top of it without
changing the wire protocol for switchover-time device state transfers,

* It doesn't impact the switchover downtime, so in this case 9.1 would
already contain everything necessary to improve it.
Thanks,
Maciej
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-21 17:40 ` Maciej S. Szmigiero
@ 2024-06-21 20:54 ` Peter Xu
2024-06-23 20:25 ` Maciej S. Szmigiero
0 siblings, 1 reply; 46+ messages in thread
From: Peter Xu @ 2024-06-21 20:54 UTC (permalink / raw)
To: Maciej S. Szmigiero; +Cc: Fabiano Rosas, qemu-devel
On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote:
> On 21.06.2024 17:56, Peter Xu wrote:
> > On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
> > > On 21.06.2024 17:04, Fabiano Rosas wrote:
> > > > "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
> > > >
> > > > > Hi Fabiano,
> > > > >
> > > > > On 20.06.2024 23:21, Fabiano Rosas wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > First of all, apologies for the roughness of the series. I'm off for
> > > > > > the next couple of weeks and wanted to put something together early
> > > > > > for your consideration.
> > > > > >
> > > > > > This series is a refactoring (based on an earlier, off-list
> > > > > > attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
> > > > > > the multifd core. If we're going to add support for more data types to
> > > > > > multifd, we first need to clean that up.
> > > > > >
> > > > > > This time around this work was prompted by Maciej's series[1]. I see
> > > > > > you're having to add a bunch of is_device_state checks to work around
> > > > > > the rigidity of the code.
> > > > > >
> > > > > > Aside from the VFIO work, there is also the intent (coming back from
> > > > > > Juan's ideas) to make multifd the default code path for migration,
> > > > > > which will have to include the vmstate migration and anything else we
> > > > > > put on the stream via QEMUFile.
> > > > > >
> > > > > > I have long since been bothered by having 'pages' sprinkled all over
> > > > > > the code, so I might be coming at this with a bit of a narrow focus,
> > > > > > but I believe in order to support more types of payloads in multifd,
> > > > > > we need to first allow the scheduling at multifd_send_pages() to be
> > > > > > independent of MultiFDPages_t. So here it is. Let me know what you
> > > > > > think.
> > > > >
> > > > > Thanks for the patch set, I quickly glanced at these patches and they
> > > > > definitely make sense to me.
> > > > >
> > > (..)
> > > > > > (as I said, I'll be off for a couple of weeks, so feel free to
> > > > > > incorporate any of this code if it's useful. Or to ignore it
> > > > > > completely).
> > > > >
> > > > > I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
> > > > > feature freeze in about a month, correct?
> > > > >
> > > >
> > > > For general code improvements like this I'm not thinking about QEMU
> > > > releases at all. But this series is not super complex, so I could
> > > > imagine we merging it in time for 9.1 if we reach an agreement.
> > > >
> > > > Are you thinking your series might miss the target? Or have concerns
> > > > over the stability of the refactoring? We can within reason merge code
> > > > based on the current framework and improve things on top, we already did
> > > > something similar when merging zero-page support. I don't have an issue
> > > > with that.
> > >
> > > The reason that I asked whether you are targeting 9.1 is because my
> > > patch set is definitely targeting that release.
> > >
> > > At the same time my patch set will need to be rebased/refactored on top
> > > of this patch set if it is supposed to be merged for 9.1 too.
> > >
> > > If this patch set gets merged quickly that's not really a problem.
> > >
> > > On the other hand, if another iteration(s) is/are needed AND you are
> > > not available in the coming weeks to work on them then there's a
> > > question whether we will make the required deadline.
> >
> > I think it's a bit rush to merge the vfio series in this release. I'm not
> > sure it has enough time to be properly reviewed, reposted, retested, etc.
> >
> > I've already started looking at it, and so far I think I have doubt not
> > only on agreement with Fabiano on the device_state thing which I prefer to
> > avoid, but also I'm thinking of any possible way to at least make the
> > worker threads generic too: a direct impact could be vDPA in the near
> > future if anyone cared, while I don't want modules to create threads
> > randomly during migration.
> >
> > Meanwhile I'm also thinking whether that "the thread needs to dump all
> > data, and during iteration we can't do that" is the good reason to not
> > support that during iterations.
> >
> > I didn't yet reply because I don't think I think all things through, but
> > I'll get there.
> >
> > So I'm not saying that the design is problematic, but IMHO it's just not
> > mature enough to assume it will land in 9.1, considering it's still a large
> > one, and the first non-rfc version just posted two days ago.
>
>
> The RFC version was posted more than 2 months ago.
>
> It has received some review comments from multiple people,
> all of which were addressed in this patch set version.
I thought it was mostly me who reviewed it, am I right? Or is there
another thread where such a discussion happened and the design review was
properly done and reached an agreement?

IMHO that is also not how an RFC works.

It doesn't work like "if the RFC didn't get NACKed, a maintainer should merge
v1 when someone posts it". Instead, an RFC should only mean these things, at
least to me: "(1) please review this at a high level, things can drastically
change; (2) please don't merge this, because it is not for merging but for
getting comments."

Beyond that, it doesn't imply anything about what happens after the RFC series.
>
> I have not received any further comments during these 2 months, so I thought
> the overall design is considered okay - if anything, there might be minor
> code comments/issues but these can easily be improved/fixed in the 5 weeks
> remaining to the soft code freeze for 9.1.
The latest email in that thread (assuming this one is what you're referring
to) is:

https://lore.kernel.org/qemu-devel/f67fcc88-aaf6-43f7-9287-90cbd7495ba1@nvidia.com/#t

I haven't heard anything from Avihai yet, nor do I think we reached a
complete agreement on the whole design.
>
>
> If anything, I think that the VM live phase (non-downtime) transfers
> functionality should be deferred until 9.2 because:
> * It wasn't a part of the RFC so even if implemented today would get much
> less testing overall,
IMO it really depends on what was proposed. Anyone who sends patches should
definitely test whatever the patchset provides. If that patchset includes
the iteration changes, then those need to be tested by the submitter.
>
> * It's orthogonal to the switchover time device state transfer functionality
> introduced by this patch set and could be added on top of that without
> changing the wire protocol for switchover time device state transfers,
AFAICT it will affect the wire protocol? If the dest QEMU is an older
version that only contains your patchset, then the threads will only be
created at the switchover phase, and it won't be ready to take whatever
data is sent from a newer QEMU which may allow migrating VFIO iteration
data and may expect that VFIO data to be passed over multifd channels
even before the switchover.

It can only be compatible at least when the threads are created before
iteration starts on the dest side, and I didn't yet check the packet
headers and other stuff.

I think that can be a sweet spot where maybe you can land this series
sooner while also getting it ready for anyone who wants to further extend
it to the iteration phase easily. I'm not sure whether that will be easily
feasible by just moving the thread creation earlier.
>
> * It doesn't impact the switchover downtime so in this case 9.1 would
> already contain all what's necessary to improve it.
Yes it won't, but IMHO that's not an issue.
Since Fabiano is going on a short break soon, I think I'll be the only one
reviewing it. I'll try my best, but I still can't guarantee it will be able
to land in 9.1, and this is not the only thing I'll need to do next week...

I appreciate a lot that you worked out VFIO on top of multifd, because IMHO
that's really the right direction. However, even with that, I don't think
the whole design is fully settled yet, not to mention the details. And that
implies it may miss 9.1.
Thanks,
--
Peter Xu
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-21 20:54 ` Peter Xu
@ 2024-06-23 20:25 ` Maciej S. Szmigiero
2024-06-23 20:45 ` Peter Xu
0 siblings, 1 reply; 46+ messages in thread
From: Maciej S. Szmigiero @ 2024-06-23 20:25 UTC (permalink / raw)
To: Peter Xu; +Cc: Fabiano Rosas, qemu-devel
On 21.06.2024 22:54, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:40:01PM +0200, Maciej S. Szmigiero wrote:
>> On 21.06.2024 17:56, Peter Xu wrote:
>>> On Fri, Jun 21, 2024 at 05:31:54PM +0200, Maciej S. Szmigiero wrote:
>>>> On 21.06.2024 17:04, Fabiano Rosas wrote:
>>>>> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> writes:
>>>>>
>>>>>> Hi Fabiano,
>>>>>>
>>>>>> On 20.06.2024 23:21, Fabiano Rosas wrote:
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> First of all, apologies for the roughness of the series. I'm off for
>>>>>>> the next couple of weeks and wanted to put something together early
>>>>>>> for your consideration.
>>>>>>>
>>>>>>> This series is a refactoring (based on an earlier, off-list
>>>>>>> attempt[0]), aimed to remove the usage of the MultiFDPages_t type in
>>>>>>> the multifd core. If we're going to add support for more data types to
>>>>>>> multifd, we first need to clean that up.
>>>>>>>
>>>>>>> This time around this work was prompted by Maciej's series[1]. I see
>>>>>>> you're having to add a bunch of is_device_state checks to work around
>>>>>>> the rigidity of the code.
>>>>>>>
>>>>>>> Aside from the VFIO work, there is also the intent (coming back from
>>>>>>> Juan's ideas) to make multifd the default code path for migration,
>>>>>>> which will have to include the vmstate migration and anything else we
>>>>>>> put on the stream via QEMUFile.
>>>>>>>
>>>>>>> I have long since been bothered by having 'pages' sprinkled all over
>>>>>>> the code, so I might be coming at this with a bit of a narrow focus,
>>>>>>> but I believe in order to support more types of payloads in multifd,
>>>>>>> we need to first allow the scheduling at multifd_send_pages() to be
>>>>>>> independent of MultiFDPages_t. So here it is. Let me know what you
>>>>>>> think.
>>>>>>
>>>>>> Thanks for the patch set, I quickly glanced at these patches and they
>>>>>> definitely make sense to me.
>>>>>>
>>>> (..)
>>>>>>> (as I said, I'll be off for a couple of weeks, so feel free to
>>>>>>> incorporate any of this code if it's useful. Or to ignore it
>>>>>>> completely).
>>>>>>
>>>>>> I guess you are targeting QEMU 9.2 rather than 9.1 since 9.1 has
>>>>>> feature freeze in about a month, correct?
>>>>>>
>>>>>
>>>>> For general code improvements like this I'm not thinking about QEMU
>>>>> releases at all. But this series is not super complex, so I could
>>>>> imagine we merging it in time for 9.1 if we reach an agreement.
>>>>>
>>>>> Are you thinking your series might miss the target? Or have concerns
>>>>> over the stability of the refactoring? We can within reason merge code
>>>>> based on the current framework and improve things on top, we already did
>>>>> something similar when merging zero-page support. I don't have an issue
>>>>> with that.
>>>>
>>>> The reason that I asked whether you are targeting 9.1 is because my
>>>> patch set is definitely targeting that release.
>>>>
>>>> At the same time my patch set will need to be rebased/refactored on top
>>>> of this patch set if it is supposed to be merged for 9.1 too.
>>>>
>>>> If this patch set gets merged quickly that's not really a problem.
>>>>
>>>> On the other hand, if another iteration(s) is/are needed AND you are
>>>> not available in the coming weeks to work on them then there's a
>>>> question whether we will make the required deadline.
>>>
>>> I think it's a bit rush to merge the vfio series in this release. I'm not
>>> sure it has enough time to be properly reviewed, reposted, retested, etc.
>>>
>>> I've already started looking at it, and so far I think I have doubt not
>>> only on agreement with Fabiano on the device_state thing which I prefer to
>>> avoid, but also I'm thinking of any possible way to at least make the
>>> worker threads generic too: a direct impact could be vDPA in the near
>>> future if anyone cared, while I don't want modules to create threads
>>> randomly during migration.
>>>
>>> Meanwhile I'm also thinking whether that "the thread needs to dump all
>>> data, and during iteration we can't do that" is the good reason to not
>>> support that during iterations.
>>>
>>> I didn't yet reply because I don't think I think all things through, but
>>> I'll get there.
>>>
>>> So I'm not saying that the design is problematic, but IMHO it's just not
>>> mature enough to assume it will land in 9.1, considering it's still a large
>>> one, and the first non-rfc version just posted two days ago.
>>
>>
>> The RFC version was posted more than 2 months ago.
>>
>> It has received some review comments from multiple people,
>> all of which were addressed in this patch set version.
>
> I thought it was mostly me who reviewed it, am I right? Or do you have
> other thread that has such discussion happening, and the design review has
> properly done and reached an agreement?
Daniel P. Berrangé also submitted a few comments: [1], [2], [3], [4], [5].
In fact, he was the one who first suggested not having a new channel header
wire format or dedicated device state channels.

In addition to that, Avihai was also following our discussions [6], and
he also looked privately at an early (but functioning) draft of these
patches well before the RFC was even publicly posted.
> IMHO that is also not how RFC works.
>
> It doesn't work like "if RFC didn't got NACKed, a maintainer should merge
> v1 when someone posts it". Instead RFC should only mean these at least to
> me: "(1) please review this from high level, things can drastically change;
> (2) please don't merge this, because it is not for merging but for getting
> comments."
>
> Beyond, it doesn't imply anything for what happens after the RFC series.
That "RFC" marking on v0 was meant to signify it as a draft not suitable
to be merged immediately.
Much like marking a {pull,merge} request a draft on Git{Hub,Lab}.
docs/devel/submitting-a-patch.rst even says that:
> "RFC" means "Request For Comments" and is a statement that you don't
> intend for your patchset to be applied to master, but would like some
> review on it anyway.
>>
>> I have not received any further comments during these 2 months, so I thought
>> the overall design is considered okay - if anything, there might be minor
>> code comments/issues but these can easily be improved/fixed in the 5 weeks
>> remaining to the soft code freeze for 9.1.
>
> The latest email in that thread (assuming this one is what you're referring
> to) is:
>
> https://lore.kernel.org/qemu-devel/f67fcc88-aaf6-43f7-9287-90cbd7495ba1@nvidia.com/#t
>
> I didn't hear anything from Avihai yet, neither did I think we reached an
> complete agreement on the whole design.
So what, then, is necessary to reach a "complete agreement on the whole design"?
Do you think organizing a brainstorming session somewhere (Zoom, etc.) would
help with that?

Although there is always a risk that such a "10,000 foot" design turns out
to have significantly worse performance than this one - how a (sensible)
design will perform in real-world testing is rather hard to predict in
advance.

The current design took a while to make sure we don't leave any possible
performance (downtime improvement) on the table by mistake.
>>
>>
>> If anything, I think that the VM live phase (non-downtime) transfers
>> functionality should be deferred until 9.2 because:
>> * It wasn't a part of the RFC so even if implemented today would get much
>> less testing overall,
>
> IMO it really depends on what was proposed. Anyone who send patches should
> definitely test whatever the patchset provides. If that patchset includes
> the iteration changes then that needs to be tested by the submitter.
I agree that anything proposed needs to be well tested before being
submitted.
>>
>> * It's orthogonal to the switchover time device state transfer functionality
>> introduced by this patch set and could be added on top of that without
>> changing the wire protocol for switchover time device state transfers,
>
> AFAICT it will affect the wire protocol? If the dest QEMU contains your
> patcheset to be the old version of QEMU, then the threads will only be
> created at the switchover phase, and it won't be ready to take whatever
> data sent from a new QEMU which may allow migrating VFIO iteration data,
> who may expect these VFIO data to be passed over via multifd channels
> even before the switchover.
>
> It can only be compatible at least when the threads are created before
> iteration starts on dest side, and I didn't yet check the packet headers
> and other stuffs.
>
> I think that can be a sweet spot where maybe you can land this series
> sooner, but it also gets ready for anyone who wants to further extend that
> to iteration phase easily. Not sure whether it'll be easily feasible by
> just moving the thread creations earlier.
If someone is migrating data to an older QEMU version that does not
contain this patch set, the source must have the "x-migration-multifd-transfer"
VFIO device property set to false (the default value) for wire format
compatibility.
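For illustration, a rough sketch of how that could look on the source QEMU
command line (the PCI address is a placeholder and the exact option spelling
is an assumption based on the property name used in this series):

  # keep the wire format compatible with an older destination QEMU
  # (placeholder host address; boolean property as discussed above)
  -device vfio-pci,host=0000:01:00.0,x-migration-multifd-transfer=off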
The same will go for VM live phase data - it will need to have some
additional setting which needs to be disabled for the wire format
to be compatible with older QEMU versions that do not understand
such data transfer over multifd channels.

On the other hand, as I wrote in the cover letter, there's nothing
stopping a QEMU device driver from requiring different handling
(loading, etc.) for VM live phase data than for post-switchover data.

So loading such VM live phase data will need a new handler anyway.
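As a purely hypothetical illustration of what such a split could look like
(the struct and both hook names below are invented for this sketch; they
exist neither in QEMU nor in the posted series):

  #include <stddef.h>

  /* Invented for illustration: separate load-side hooks per phase. */
  typedef struct HypotheticalDeviceLoadOps {
      /* switchover-time buffers: the device is stopped, apply directly */
      int (*load_switchover_state)(void *opaque, const void *buf, size_t len);
      /* live-phase buffers: the device is still running, so the driver
       * may need to stage or merge the data before applying it */
      int (*load_live_state)(void *opaque, const void *buf, size_t len);
  } HypotheticalDeviceLoadOps;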
>
>>
>> * It doesn't impact the switchover downtime so in this case 9.1 would
>> already contain all what's necessary to improve it.
>
> Yes it won't, but IMHO that's not an issue.
>
> Since Fabiano is going on a short break soon, I think I'll be the only one
> review it. I'll try my best, but still I can't guarantee it will be able
> to land in 9.1, and this is not the only thing I'll need to do next week..
>
> I appreciated a lot you worked out VFIO on top of multifd, because IMHO
> that's really the right direction. However even with that, I don't think
> the whole design is yet fully settled, not to mention the details. And that
> implies it may miss 9.1.
I appreciate your work and review, Peter - it would be nice if we could
at least make some progress on this subject for 9.1.
> Thanks,
>
Thanks,
Maciej
[1]: https://lore.kernel.org/qemu-devel/Zh-KF72Fe9oV6tfT@redhat.com/
[2]: https://lore.kernel.org/qemu-devel/Zh_6W8u3H4FmGS49@redhat.com/
[3]: https://lore.kernel.org/qemu-devel/ZiD4aLSre6qubuHr@redhat.com/
[4]: https://lore.kernel.org/qemu-devel/ZiJCSZvsekaO8dzO@redhat.com/
[5]: https://lore.kernel.org/qemu-devel/ZiJFU_QrOHVvwe4W@redhat.com/
[6]: https://lore.kernel.org/qemu-devel/7e855ccb-d5af-490f-94ab-61141fa30ba8@nvidia.com/
* Re: [RFC PATCH 0/7] migration/multifd: Introduce storage slots
2024-06-23 20:25 ` Maciej S. Szmigiero
@ 2024-06-23 20:45 ` Peter Xu
0 siblings, 0 replies; 46+ messages in thread
From: Peter Xu @ 2024-06-23 20:45 UTC (permalink / raw)
To: Maciej S. Szmigiero; +Cc: Fabiano Rosas, qemu-devel
On Sun, Jun 23, 2024 at 10:25:05PM +0200, Maciej S. Szmigiero wrote:
> > I appreciated a lot you worked out VFIO on top of multifd, because IMHO
> > that's really the right direction. However even with that, I don't think
> > the whole design is yet fully settled, not to mention the details. And that
> > implies it may miss 9.1.
>
> I appreciate your work and review Peter - it would be nice if we could
> at least make some progress on this subject for 9.1.
Let's try and see. I didn't mean it won't hit 9.1 at all, it just still
feels challenging, but who knows! I just don't want to give you that
impression and then, right before soft freeze, start to say "it's not ready".

I think one reason it'll be more of a challenge is that Fabiano will be
away for weeks, and since you seem to agree on his RFC as a start, it
might be a good idea to wait for him to come back and see how that goes
from there even if we reach some consensus earlier; it'll just be
challenging either way.

I also left my (slightly lengthy) comment on the high-level design of that
series here (I know you'll see it, but just to keep a record of this
discussion and link them together):

https://lore.kernel.org/r/ZniFH14DT6ycjbrL@x1n

Let's discuss there; let me know if I got something wrong, and also, if
you're targeting landing part of the series, you can try to prioritize /
provision what can already be landed more easily.
I actually have a wild idea: I don't know whether "switchover phase"
hooks are even needed. I mean, currently save_live_iterate() already allows
"returning 1 means all data is ready". Logically, in the switchover phase
the migration core could simply call save_live_iterate() until it returns 1.
Then switchover hooks would not need to exist at all. I didn't check the
details, but if that works it'll simplify all register_savevm_live()
users, and for VFIO it may also mean the iteration phase is more important,
as once that is resolved it naturally works for the switchover phase. Feel
free to ignore this idea because it may be too wild and is definitely not
helpful for landing things fast; I only mention it in case it brings some
thoughts.
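To make the idea concrete, here is a tiny self-contained toy model (not
QEMU code; the names and the loop are illustrative only) of what
"switchover is just draining save_live_iterate() until it returns 1"
could look like:

  #include <stdio.h>

  /* Toy stand-in for a device with pending migration state. */
  typedef struct ToyDevice {
      int chunks_left;
  } ToyDevice;

  /* Mirrors the assumed save_live_iterate() return convention:
   * negative = error, 0 = more data pending, 1 = all data sent. */
  static int toy_save_live_iterate(ToyDevice *dev)
  {
      if (dev->chunks_left > 0) {
          dev->chunks_left--;
          printf("sent one chunk, %d left\n", dev->chunks_left);
      }
      return dev->chunks_left > 0 ? 0 : 1;
  }

  /* Hypothetical switchover: no dedicated completion hook, just keep
   * calling the iterate hook until it reports that all data is sent. */
  static int toy_switchover(ToyDevice *dev)
  {
      int ret;

      do {
          ret = toy_save_live_iterate(dev);
      } while (ret == 0);

      return ret < 0 ? ret : 0;
  }

  int main(void)
  {
      ToyDevice dev = { .chunks_left = 3 };

      return toy_switchover(&dev);
  }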
Thanks,
--
Peter Xu