* [PATCH 1/3] multifd: bugfix for migration using compression methods
2024-12-18 9:14 [PATCH 0/3] bugfixes for migration using compression methods Yuan Liu
@ 2024-12-18 9:14 ` Yuan Liu
2024-12-18 17:03 ` Peter Xu
2024-12-18 9:14 ` [PATCH 2/3] multifd: bugfix for incorrect migration data with QPL compression Yuan Liu
` (3 subsequent siblings)
4 siblings, 1 reply; 11+ messages in thread
From: Yuan Liu @ 2024-12-18 9:14 UTC (permalink / raw)
To: peterx, farosas; +Cc: qemu-devel, yuan1.liu, jason.zeng, yichen.wang
When compression is enabled on the migration channel and
the pages processed are all zero pages, these pages will
not be sent and updated on the target side, resulting in
incorrect memory data on the source and target sides.
The root cause is that all compression methods call
multifd_send_prepare_common to determine whether to compress
dirty pages, but multifd_send_prepare_common does not update
the IOV of MultiFDPacket_t when all dirty pages are zero pages.
The solution is to always update the IOV of MultiFDPacket_t
regardless of whether the dirty pages are all zero pages.
Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Jason Zeng <jason.zeng@intel.com>
---
migration/multifd-nocomp.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
index 55191152f9..2e4aaac285 100644
--- a/migration/multifd-nocomp.c
+++ b/migration/multifd-nocomp.c
@@ -362,6 +362,7 @@ int multifd_ram_flush_and_sync(void)
bool multifd_send_prepare_common(MultiFDSendParams *p)
{
MultiFDPages_t *pages = &p->data->u.ram;
+ multifd_send_prepare_header(p);
multifd_send_zero_page_detect(p);
if (!pages->normal_num) {
@@ -369,8 +370,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
return false;
}
- multifd_send_prepare_header(p);
-
return true;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH 1/3] multifd: bugfix for migration using compression methods
2024-12-18 9:14 ` [PATCH 1/3] multifd: bugfix " Yuan Liu
@ 2024-12-18 17:03 ` Peter Xu
2024-12-19 8:42 ` Liu, Yuan1
0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2024-12-18 17:03 UTC (permalink / raw)
To: Yuan Liu; +Cc: farosas, qemu-devel, jason.zeng, yichen.wang
On Wed, Dec 18, 2024 at 05:14:11PM +0800, Yuan Liu wrote:
> When compression is enabled on the migration channel and
> the pages processed are all zero pages, these pages will
> not be sent and updated on the target side, resulting in
> incorrect memory data on the source and target sides.
>
> The root cause is that all compression methods call
> multifd_send_prepare_common to determine whether to compress
> dirty pages, but multifd_send_prepare_common does not update
> the IOV of MultiFDPacket_t when all dirty pages are zero pages.
>
> The solution is to always update the IOV of MultiFDPacket_t
> regardless of whether the dirty pages are all zero pages.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Jason Zeng <jason.zeng@intel.com>
Ouch.. thanks for digging this out.
Reviewed-by: Peter Xu <peterx@redhat.com>
Is this the correct Fixes tag (and copy stable for 9.0+)?
Fixes: 303e6f54f9 ("migration/multifd: Implement zero page transmission on the multifd thread.")
> ---
> migration/multifd-nocomp.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> index 55191152f9..2e4aaac285 100644
> --- a/migration/multifd-nocomp.c
> +++ b/migration/multifd-nocomp.c
> @@ -362,6 +362,7 @@ int multifd_ram_flush_and_sync(void)
> bool multifd_send_prepare_common(MultiFDSendParams *p)
> {
> MultiFDPages_t *pages = &p->data->u.ram;
> + multifd_send_prepare_header(p);
> multifd_send_zero_page_detect(p);
>
> if (!pages->normal_num) {
> @@ -369,8 +370,6 @@ bool multifd_send_prepare_common(MultiFDSendParams *p)
> return false;
> }
>
> - multifd_send_prepare_header(p);
> -
> return true;
> }
>
> --
> 2.43.0
>
--
Peter Xu
^ permalink raw reply [flat|nested] 11+ messages in thread* RE: [PATCH 1/3] multifd: bugfix for migration using compression methods
2024-12-18 17:03 ` Peter Xu
@ 2024-12-19 8:42 ` Liu, Yuan1
0 siblings, 0 replies; 11+ messages in thread
From: Liu, Yuan1 @ 2024-12-19 8:42 UTC (permalink / raw)
To: Peter Xu
Cc: farosas@suse.de, qemu-devel@nongnu.org, Zeng, Jason, Wang, Yichen
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, December 19, 2024 1:03 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; Zeng, Jason
> <jason.zeng@intel.com>; Wang, Yichen <yichen.wang@bytedance.com>
> Subject: Re: [PATCH 1/3] multifd: bugfix for migration using compression
> methods
>
> On Wed, Dec 18, 2024 at 05:14:11PM +0800, Yuan Liu wrote:
> > When compression is enabled on the migration channel and
> > the pages processed are all zero pages, these pages will
> > not be sent and updated on the target side, resulting in
> > incorrect memory data on the source and target sides.
> >
> > The root cause is that all compression methods call
> > multifd_send_prepare_common to determine whether to compress
> > dirty pages, but multifd_send_prepare_common does not update
> > the IOV of MultiFDPacket_t when all dirty pages are zero pages.
> >
> > The solution is to always update the IOV of MultiFDPacket_t
> > regardless of whether the dirty pages are all zero pages.
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Jason Zeng <jason.zeng@intel.com>
>
> Ouch.. thanks for digging this out.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> Is this the correct Fixes tag (and copy stable for 9.0+)?
>
> Fixes: 303e6f54f9 ("migration/multifd: Implement zero page transmission on
> the multifd thread.")
Yes, this patch is used to fix the tag containing 303e6f54f9, and I see it already exists in v9.0.0 and later versions.
By the way, The three bugfix patches are based on mainline and the commit id is 8032c78e55 (origin/staging, origin/master, origin/HEAD) Merge tag 'firmware-20241216-pull-request' of https://gitlab.com/kraxel/qemu into staging
> > ---
> > migration/multifd-nocomp.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/migration/multifd-nocomp.c b/migration/multifd-nocomp.c
> > index 55191152f9..2e4aaac285 100644
> > --- a/migration/multifd-nocomp.c
> > +++ b/migration/multifd-nocomp.c
> > @@ -362,6 +362,7 @@ int multifd_ram_flush_and_sync(void)
> > bool multifd_send_prepare_common(MultiFDSendParams *p)
> > {
> > MultiFDPages_t *pages = &p->data->u.ram;
> > + multifd_send_prepare_header(p);
> > multifd_send_zero_page_detect(p);
> >
> > if (!pages->normal_num) {
> > @@ -369,8 +370,6 @@ bool multifd_send_prepare_common(MultiFDSendParams
> *p)
> > return false;
> > }
> >
> > - multifd_send_prepare_header(p);
> > -
> > return true;
> > }
> >
> > --
> > 2.43.0
> >
>
> --
> Peter Xu
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 2/3] multifd: bugfix for incorrect migration data with QPL compression
2024-12-18 9:14 [PATCH 0/3] bugfixes for migration using compression methods Yuan Liu
2024-12-18 9:14 ` [PATCH 1/3] multifd: bugfix " Yuan Liu
@ 2024-12-18 9:14 ` Yuan Liu
2024-12-18 17:08 ` Peter Xu
2024-12-18 9:14 ` [PATCH 3/3] multifd: bugfix for incorrect migration data with qatzip compression Yuan Liu
` (2 subsequent siblings)
4 siblings, 1 reply; 11+ messages in thread
From: Yuan Liu @ 2024-12-18 9:14 UTC (permalink / raw)
To: peterx, farosas; +Cc: qemu-devel, yuan1.liu, jason.zeng, yichen.wang
When QPL compression is enabled on the migration channel and the same
dirty page changes from a normal page to a zero page in the iterative
memory copy, the dirty page will not be updated to a zero page again
on the target side, resulting in incorrect memory data on the source
and target sides.
The root cause is that the target side does not record the normal pages
to the receivedmap.
The solution is to add ramblock_recv_bitmap_set_offset in target side
to record the normal pages.
Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Jason Zeng <jason.zeng@intel.com>
---
migration/multifd-qpl.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index bbe466617f..88e2344af2 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -679,6 +679,7 @@ static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp)
qpl->zlen[i] = be32_to_cpu(qpl->zlen[i]);
assert(qpl->zlen[i] <= multifd_ram_page_size());
zbuf_len += qpl->zlen[i];
+ ramblock_recv_bitmap_set_offset(p->block, p->normal[i]);
}
/* read compressed pages */
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH 2/3] multifd: bugfix for incorrect migration data with QPL compression
2024-12-18 9:14 ` [PATCH 2/3] multifd: bugfix for incorrect migration data with QPL compression Yuan Liu
@ 2024-12-18 17:08 ` Peter Xu
0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2024-12-18 17:08 UTC (permalink / raw)
To: Yuan Liu; +Cc: farosas, qemu-devel, jason.zeng, yichen.wang
On Wed, Dec 18, 2024 at 05:14:12PM +0800, Yuan Liu wrote:
> When QPL compression is enabled on the migration channel and the same
> dirty page changes from a normal page to a zero page in the iterative
> memory copy, the dirty page will not be updated to a zero page again
> on the target side, resulting in incorrect memory data on the source
> and target sides.
>
> The root cause is that the target side does not record the normal pages
> to the receivedmap.
>
> The solution is to add ramblock_recv_bitmap_set_offset in target side
> to record the normal pages.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Jason Zeng <jason.zeng@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 3/3] multifd: bugfix for incorrect migration data with qatzip compression
2024-12-18 9:14 [PATCH 0/3] bugfixes for migration using compression methods Yuan Liu
2024-12-18 9:14 ` [PATCH 1/3] multifd: bugfix " Yuan Liu
2024-12-18 9:14 ` [PATCH 2/3] multifd: bugfix for incorrect migration data with QPL compression Yuan Liu
@ 2024-12-18 9:14 ` Yuan Liu
2024-12-18 17:08 ` Peter Xu
2024-12-18 17:12 ` [PATCH 0/3] bugfixes for migration using compression methods Peter Xu
2025-01-12 13:12 ` Michael Tokarev
4 siblings, 1 reply; 11+ messages in thread
From: Yuan Liu @ 2024-12-18 9:14 UTC (permalink / raw)
To: peterx, farosas; +Cc: qemu-devel, yuan1.liu, jason.zeng, yichen.wang
When QPL compression is enabled on the migration channel and the same
dirty page changes from a normal page to a zero page in the iterative
memory copy, the dirty page will not be updated to a zero page again
on the target side, resulting in incorrect memory data on the source
and target sides.
The root cause is that the target side does not record the normal pages
to the receivedmap.
The solution is to add ramblock_recv_bitmap_set_offset in target side
to record the normal pages.
Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Jason Zeng <jason.zeng@intel.com>
---
migration/multifd-qatzip.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/migration/multifd-qatzip.c b/migration/multifd-qatzip.c
index 7b68397625..6a0e989fae 100644
--- a/migration/multifd-qatzip.c
+++ b/migration/multifd-qatzip.c
@@ -373,6 +373,7 @@ static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
/* Copy each page to its appropriate location. */
for (int i = 0; i < p->normal_num; i++) {
memcpy(p->host + p->normal[i], q->out_buf + page_size * i, page_size);
+ ramblock_recv_bitmap_set_offset(p->block, p->normal[i]);
}
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [PATCH 3/3] multifd: bugfix for incorrect migration data with qatzip compression
2024-12-18 9:14 ` [PATCH 3/3] multifd: bugfix for incorrect migration data with qatzip compression Yuan Liu
@ 2024-12-18 17:08 ` Peter Xu
0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2024-12-18 17:08 UTC (permalink / raw)
To: Yuan Liu; +Cc: farosas, qemu-devel, jason.zeng, yichen.wang
On Wed, Dec 18, 2024 at 05:14:13PM +0800, Yuan Liu wrote:
> When QPL compression is enabled on the migration channel and the same
> dirty page changes from a normal page to a zero page in the iterative
> memory copy, the dirty page will not be updated to a zero page again
> on the target side, resulting in incorrect memory data on the source
> and target sides.
>
> The root cause is that the target side does not record the normal pages
> to the receivedmap.
>
> The solution is to add ramblock_recv_bitmap_set_offset in target side
> to record the normal pages.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Jason Zeng <jason.zeng@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
--
Peter Xu
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 0/3] bugfixes for migration using compression methods
2024-12-18 9:14 [PATCH 0/3] bugfixes for migration using compression methods Yuan Liu
` (2 preceding siblings ...)
2024-12-18 9:14 ` [PATCH 3/3] multifd: bugfix for incorrect migration data with qatzip compression Yuan Liu
@ 2024-12-18 17:12 ` Peter Xu
2025-01-12 13:12 ` Michael Tokarev
4 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2024-12-18 17:12 UTC (permalink / raw)
To: Yuan Liu; +Cc: farosas, qemu-devel, jason.zeng, yichen.wang, Shameer Kolothum
On Wed, Dec 18, 2024 at 05:14:10PM +0800, Yuan Liu wrote:
> This set of patches is used to fix the bugs of incorrect migration
> memory data when compression is enabled.
>
> The method to reproduce this bug is as follows
> 1. Run "stress-ng --class memory --all 1" in the source side, the
> stress-ng tool comes from https://github.com/ColinIanKing/stress-ng.git
>
> 2. Enable the multifd compression methods and start migration
> e.g. migrate_set_parameter multifd-compression qpl
>
> 3. The guest kernel will crash automatically or crash at shutdown after
> the migration is complete
>
> The root cause of the bugs and the solutions are described in detail in
> the patch.
>
> My verification method as follows
> 1. Start the VM and run the stess-ng test command on the source side.
> 2. Start the VM with "-S" parameter on the target side, it is
> used to pause the vCPUs after migration.
> 3. After the migration is successful, use the dump-guest-memory command
> to export the memory data of the source and target VMs respectively.
> 4. Use "cmp -l source_memory target_memory" to verify memory data.
This looks like a good idea to test memory intergrity. I wonder if we can
do that in some, or all, of our migration qtests.
I didn't check the latter 2 patches but I assume they can also have a
proper Fixes tag.
The other thing is uadk seems also broken from that regard.. we could add
one patch for it, but the testing may be challenging for any of us. In all
case, I copy Shameer.
--
Peter Xu
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [PATCH 0/3] bugfixes for migration using compression methods
2024-12-18 9:14 [PATCH 0/3] bugfixes for migration using compression methods Yuan Liu
` (3 preceding siblings ...)
2024-12-18 17:12 ` [PATCH 0/3] bugfixes for migration using compression methods Peter Xu
@ 2025-01-12 13:12 ` Michael Tokarev
2025-01-13 0:58 ` Liu, Yuan1
4 siblings, 1 reply; 11+ messages in thread
From: Michael Tokarev @ 2025-01-12 13:12 UTC (permalink / raw)
To: Yuan Liu, peterx, farosas
Cc: qemu-devel, jason.zeng, yichen.wang, qemu-stable
18.12.2024 12:14, Yuan Liu wrote:
> This set of patches is used to fix the bugs of incorrect migration
> memory data when compression is enabled.
>
> The method to reproduce this bug is as follows
> 1. Run "stress-ng --class memory --all 1" in the source side, the
> stress-ng tool comes from https://github.com/ColinIanKing/stress-ng.git
>
> 2. Enable the multifd compression methods and start migration
> e.g. migrate_set_parameter multifd-compression qpl
>
> 3. The guest kernel will crash automatically or crash at shutdown after
> the migration is complete
>
> The root cause of the bugs and the solutions are described in detail in
> the patch.
>
> My verification method as follows
> 1. Start the VM and run the stess-ng test command on the source side.
> 2. Start the VM with "-S" parameter on the target side, it is
> used to pause the vCPUs after migration.
> 3. After the migration is successful, use the dump-guest-memory command
> to export the memory data of the source and target VMs respectively.
> 4. Use "cmp -l source_memory target_memory" to verify memory data.
>
> Yuan Liu (3):
> multifd: bugfix for migration using compression methods
> multifd: bugfix for incorrect migration data with QPL compression
> multifd: bugfix for incorrect migration data with qatzip compression
Should just the first patch be applied to qemu-stable branches, or all 3?
The first one has been Cc'd qemu-stable, but the other two hasn't?
Thanks,
/mjt
^ permalink raw reply [flat|nested] 11+ messages in thread* RE: [PATCH 0/3] bugfixes for migration using compression methods
2025-01-12 13:12 ` Michael Tokarev
@ 2025-01-13 0:58 ` Liu, Yuan1
0 siblings, 0 replies; 11+ messages in thread
From: Liu, Yuan1 @ 2025-01-13 0:58 UTC (permalink / raw)
To: Michael Tokarev, peterx@redhat.com, farosas@suse.de
Cc: qemu-devel@nongnu.org, Zeng, Jason, Wang, Yichen, qemu-stable
> -----Original Message-----
> From: Michael Tokarev <mjt@tls.msk.ru>
> Sent: Sunday, January 12, 2025 9:13 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com; farosas@suse.de
> Cc: qemu-devel@nongnu.org; Zeng, Jason <jason.zeng@intel.com>; Wang,
> Yichen <yichen.wang@bytedance.com>; qemu-stable <qemu-stable@nongnu.org>
> Subject: Re: [PATCH 0/3] bugfixes for migration using compression methods
>
> 18.12.2024 12:14, Yuan Liu wrote:
> > This set of patches is used to fix the bugs of incorrect migration
> > memory data when compression is enabled.
> >
> > The method to reproduce this bug is as follows
> > 1. Run "stress-ng --class memory --all 1" in the source side, the
> > stress-ng tool comes from https://github.com/ColinIanKing/stress-ng.git
> >
> > 2. Enable the multifd compression methods and start migration
> > e.g. migrate_set_parameter multifd-compression qpl
> >
> > 3. The guest kernel will crash automatically or crash at shutdown after
> > the migration is complete
> >
> > The root cause of the bugs and the solutions are described in detail in
> > the patch.
> >
> > My verification method as follows
> > 1. Start the VM and run the stess-ng test command on the source side.
> > 2. Start the VM with "-S" parameter on the target side, it is
> > used to pause the vCPUs after migration.
> > 3. After the migration is successful, use the dump-guest-memory command
> > to export the memory data of the source and target VMs respectively.
> > 4. Use "cmp -l source_memory target_memory" to verify memory data.
> >
> > Yuan Liu (3):
> > multifd: bugfix for migration using compression methods
> > multifd: bugfix for incorrect migration data with QPL compression
> > multifd: bugfix for incorrect migration data with qatzip compression
>
> Should just the first patch be applied to qemu-stable branches, or all 3?
> The first one has been Cc'd qemu-stable, but the other two hasn't?
I think all three patches should be applied, they solve three different bugs.
>
> Thanks,
>
> /mjt
^ permalink raw reply [flat|nested] 11+ messages in thread