qemu-devel.nongnu.org archive mirror
* [PATCH] migration: Fix transition to COLO state from precopy
@ 2025-11-04  1:36 Li Zhijian via
  2025-11-04  1:49 ` Zhijian Li (Fujitsu)
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Li Zhijian via @ 2025-11-04  1:36 UTC (permalink / raw)
  To: peterx, farosas; +Cc: zhangckid, zhanghailiang, qemu-devel, Li Zhijian

Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
state before completed during migration, which broke the original transition
to COLO. The migration flow for precopy has changed to:
active -> pre-switchover -> device -> completed.

This patch updates the transition state to ensure that the Pre-COLO
state corresponds to DEVICE state correctly.

Fixes: 4881411136 ("migration: Always set DEVICE state")
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
---
 migration/migration.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index a63b46bbef..6ec7f3cec8 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
         goto fail;
     }
 
-    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
+    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
         /* COLO does not support postcopy */
-        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
+        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
                           MIGRATION_STATUS_COLO);
     } else {
         migration_completion_end(s);
-- 
2.44.0
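
For readers following the state machine, the effect of the fix can be
illustrated with a toy model. This is a minimal sketch, not QEMU code:
ToyStatus, toy_set_state() and toy_complete() are hypothetical stand-ins for
the MIGRATION_STATUS_* values, migrate_set_state() and migration_completion(),
and the flow is simplified to the sequence given in the commit message.

```c
#include <stdbool.h>

/* Toy stand-ins for the MIGRATION_STATUS_* values named in the patch. */
typedef enum {
    TOY_ACTIVE,
    TOY_PRE_SWITCHOVER,
    TOY_DEVICE,
    TOY_COLO,
    TOY_COMPLETED,
} ToyStatus;

/*
 * Hypothetical helper: move *state to new_state only if it currently
 * equals old_state, and report whether the move happened.
 */
bool toy_set_state(ToyStatus *state, ToyStatus old_state,
                   ToyStatus new_state)
{
    if (*state != old_state) {
        return false;
    }
    *state = new_state;
    return true;
}

/*
 * Walk the precopy flow from the commit message, then attempt the COLO
 * transition using expected_src as the state the check looks for.
 */
ToyStatus toy_complete(bool colo_enabled, ToyStatus expected_src)
{
    ToyStatus s = TOY_ACTIVE;

    toy_set_state(&s, TOY_ACTIVE, TOY_PRE_SWITCHOVER);
    toy_set_state(&s, TOY_PRE_SWITCHOVER, TOY_DEVICE);

    if (colo_enabled && s == expected_src) {
        toy_set_state(&s, expected_src, TOY_COLO);
    } else {
        /* Stand-in for the normal completion path: device -> completed. */
        toy_set_state(&s, TOY_DEVICE, TOY_COMPLETED);
    }
    return s;
}
```

With TOY_ACTIVE as the expected source state (the old check),
toy_complete(true, TOY_ACTIVE) ends in TOY_COMPLETED and COLO is never
entered; with TOY_DEVICE (the patched check) it ends in TOY_COLO.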




* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-04  1:36 [PATCH] migration: Fix transition to COLO state from precopy Li Zhijian via
@ 2025-11-04  1:49 ` Zhijian Li (Fujitsu)
  2025-11-04  2:40 ` Zhang Chen
  2025-11-05 20:58 ` Peter Xu
  2 siblings, 0 replies; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-11-04  1:49 UTC (permalink / raw)
  To: peterx@redhat.com, farosas@suse.de
  Cc: zhangckid@gmail.com, zhanghailiang@xfusion.com,
	qemu-devel@nongnu.org

FYI,

Sharing my local COLO test steps and scripts; both VMs run on the same host.



S1: ./primary.sh
S2: ./secondary.sh
S3: cat secondary-cmd.json | nc localhost 55555
S4: cat primary-cmd.json | nc localhost 25555

At this point, both the primary and secondary VMs have entered the COLO state.
We can then trigger a failover:

(Primary takeover) S5_1: killall -9 secondary; sleep 1; cat primary-failover.json | nc localhost 25555
or
(Secondary takeover) S5_2: killall -9 primary; sleep 1; cat secondary-failover.json | nc localhost 55555
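
The command files piped into nc in S3 and S4 are newline-separated QMP
commands, and both sides enable the x-colo capability before the migration
starts. As a rough sketch of how a tool could generate that line instead of
keeping it in a file, assuming a hypothetical helper named
format_colo_capability() (not part of QEMU):

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of QEMU): format the QMP command that
 * toggles the x-colo capability, matching the line used in the
 * secondary-cmd.json and primary-cmd.json files of this message.
 * Returns the snprintf() result, so a value >= size means the buffer
 * was too small and the output was truncated.
 */
int format_colo_capability(char *buf, size_t size, bool enable)
{
    return snprintf(buf, size,
                    "{\"execute\": \"migrate-set-capabilities\", "
                    "\"arguments\": {\"capabilities\": [ "
                    "{\"capability\": \"x-colo\", \"state\": %s} ] } }\n",
                    enable ? "true" : "false");
}
```

The caller still has to send {"execute":"qmp_capabilities"} first, as the
command files do, before QMP accepts any other command.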

=========scripts=============
# cat primary.sh
cmd="./primary -enable-kvm -cpu qemu64,kvmclock=on -m 4096 -smp 1 -device piix3-usb-uhci -device usb-tablet -name primary -netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,id=e0,netdev=hn0 -chardev socket,id=mirror0,host=0.0.0.0,port=9003,server=on,wait=off -chardev socket,id=compare1,host=0.0.0.0,port=9004,server=on,wait=on -chardev socket,id=compare0,host=127.0.0.1,port=9001,server=on,wait=off -chardev socket,id=compare0-0,host=127.0.0.1,port=9001 -chardev socket,id=compare_out,host=127.0.0.1,port=9005,server=on,wait=off -chardev socket,id=compare_out0,host=127.0.0.1,port=9005 -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0 -object filter-redirector,netdev=hn0,id=redire0,queue=rx,indev=compare_out -object filter-redirector,netdev=hn0,id=redire1,queue=rx,outdev=compare0 -object iothread,id=iothread1 -object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0,iothread=iothread1 -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,children.0.file.filename=/home/lizhijian/images/colo/primary/primary.qcow2,children.0.driver=qcow2 -nographic -monitor telnet:127.0.0.1:15555,server,nowait -qmp telnet:127.0.0.1:25555,server,nowait -S"

echo $cmd
exec $cmd

# cat secondary.sh
cmd="./secondary -enable-kvm -cpu qemu64,kvmclock=on -m 4096 -smp 1 -qmp telnet:127.0.0.1:55555,server,nowait -device piix3-usb-uhci -device usb-tablet -name secondary -netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,id=e0,netdev=hn0 -chardev socket,id=red0,host=127.0.0.1,port=9003,reconnect-ms=1 -chardev socket,id=red1,host=127.0.0.1,port=9004,reconnect-ms=1 -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0 -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1 -object filter-rewriter,id=rew0,netdev=hn0,queue=all -drive if=none,id=parent0,file.filename=/home/lizhijian/images/colo/secondary/primary.qcow2,driver=qcow2 -drive if=none,id=childs0,driver=replication,mode=secondary,file.driver=qcow2,top-id=colo-disk0,file.file.filename=/home/lizhijian/images/colo/secondary/secondary-active.qcow2,file.backing.driver=qcow2,file.backing.file.filename=/home/lizhijian/images/colo/secondary/secondary-hidden.qcow2,file.backing.backing=parent0 -drive if=ide,id=colo-disk0,driver=quorum,read-pattern=fifo,vote-threshold=1,children.0=childs0 -incoming tcp:0.0.0.0:9998 -nographic -monitor telnet:127.0.0.1:55554,server,nowait"

echo $cmd
exec $cmd

# cat secondary-cmd.json
{"execute":"qmp_capabilities"}
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [ {"capability": "x-colo", "state": true} ] } }
{"execute": "nbd-server-start", "arguments": {"addr": {"type": "inet", "data": {"host": "0.0.0.0", "port": "9999"} } } }
{"execute": "nbd-server-add", "arguments": {"device": "parent0", "writable": true } }
{'execute': 'trace-event-set-state', 'arguments': {'name': 'colo*', 'enable': true} }

# cat primary-cmd.json
{"execute":"qmp_capabilities"}
{'execute': 'trace-event-set-state', 'arguments': {'name': 'colo*', 'enable': true} }
{'execute': 'trace-event-set-state', 'arguments': {'name': 'migrat*', 'enable': true} }
{"execute": "human-monitor-command", "arguments": {"command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0"}}
{"execute": "x-blockdev-change", "arguments":{"parent": "colo-disk0", "node": "replication0" } }
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [ {"capability": "x-colo", "state": true } ] } }
{"execute": "migrate", "arguments": {"uri": "tcp:127.0.0.2:9998" } }

# cat primary-failover.json
{"execute":"qmp_capabilities"}
{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "child": "children.1"} }
{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_del replication0" } }
{"execute": "object-del", "arguments":{ "id": "comp0" } }
{"execute": "object-del", "arguments":{ "id": "iothread1" } }
{"execute": "object-del", "arguments":{ "id": "m0" } }
{"execute": "object-del", "arguments":{ "id": "redire0" } }
{"execute": "object-del", "arguments":{ "id": "redire1" } }
{"execute": "x-colo-lost-heartbeat" }

# cat secondary-failover.json
{"execute":"qmp_capabilities"}
{"execute": "nbd-server-stop"}
{"execute": "x-colo-lost-heartbeat"}

{"execute": "object-del", "arguments":{ "id": "f2" } }
{"execute": "object-del", "arguments":{ "id": "f1" } }
{"execute": "chardev-remove", "arguments":{ "id": "red1" } }
{"execute": "chardev-remove", "arguments":{ "id": "red0" } }

{"execute": "chardev-add", "arguments":{ "id": "mirror0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "0.0.0.0", "port": "9003" } }, "server": true } } } }
{"execute": "chardev-add", "arguments":{ "id": "compare1", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "0.0.0.0", "port": "9004" } }, "server": true } } } }
{"execute": "chardev-add", "arguments":{ "id": "compare0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9001" } }, "server": true } } } }
{"execute": "chardev-add", "arguments":{ "id": "compare0-0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9001" } }, "server": false } } } }
{"execute": "chardev-add", "arguments":{ "id": "compare_out", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9005" } }, "server": true } } } }
{"execute": "chardev-add", "arguments":{ "id": "compare_out0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9005" } }, "server": false } } } }


On 04/11/2025 09:36, Li Zhijian wrote:
> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> state before completed during migration, which broke the original transition
> to COLO. The migration flow for precopy has changed to:
> active -> pre-switchover -> device -> completed.
> 
> This patch updates the transition state to ensure that the Pre-COLO
> state corresponds to DEVICE state correctly.
> 
> Fixes: 4881411136 ("migration: Always set DEVICE state")
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>   migration/migration.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index a63b46bbef..6ec7f3cec8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
>           goto fail;
>       }
>   
> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
>           /* COLO does not support postcopy */
> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
>                             MIGRATION_STATUS_COLO);
>       } else {
>           migration_completion_end(s);



* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-04  1:36 [PATCH] migration: Fix transition to COLO state from precopy Li Zhijian via
  2025-11-04  1:49 ` Zhijian Li (Fujitsu)
@ 2025-11-04  2:40 ` Zhang Chen
  2025-11-05 20:58 ` Peter Xu
  2 siblings, 0 replies; 8+ messages in thread
From: Zhang Chen @ 2025-11-04  2:40 UTC (permalink / raw)
  To: Li Zhijian; +Cc: peterx, farosas, zhanghailiang, qemu-devel

On Tue, Nov 4, 2025 at 9:34 AM Li Zhijian <lizhijian@fujitsu.com> wrote:
>
> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> state before completed during migration, which broke the original transition
> to COLO. The migration flow for precopy has changed to:
> active -> pre-switchover -> device -> completed.
>
> This patch updates the transition state to ensure that the Pre-COLO
> state corresponds to DEVICE state correctly.
>
> Fixes: 4881411136 ("migration: Always set DEVICE state")
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>

LGTM~ Thanks Zhijian.

Reviewed-by: Zhang Chen <zhangckid@gmail.com>
Tested-by: Zhang Chen <zhangckid@gmail.com>

> ---
>  migration/migration.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index a63b46bbef..6ec7f3cec8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
>          goto fail;
>      }
>
> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
>          /* COLO does not support postcopy */
> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
>                            MIGRATION_STATUS_COLO);
>      } else {
>          migration_completion_end(s);
> --
> 2.44.0
>



* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-04  1:36 [PATCH] migration: Fix transition to COLO state from precopy Li Zhijian via
  2025-11-04  1:49 ` Zhijian Li (Fujitsu)
  2025-11-04  2:40 ` Zhang Chen
@ 2025-11-05 20:58 ` Peter Xu
  2025-11-06  1:09   ` Zhijian Li (Fujitsu)
  2 siblings, 1 reply; 8+ messages in thread
From: Peter Xu @ 2025-11-05 20:58 UTC (permalink / raw)
  To: Li Zhijian, zhangckid, zhanghailiang
  Cc: farosas, zhangckid, zhanghailiang, qemu-devel

On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> state before completed during migration, which broke the original transition
> to COLO. The migration flow for precopy has changed to:
> active -> pre-switchover -> device -> completed.
> 
> This patch updates the transition state to ensure that the Pre-COLO
> state corresponds to DEVICE state correctly.
> 
> Fixes: 4881411136 ("migration: Always set DEVICE state")
> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> ---
>  migration/migration.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index a63b46bbef..6ec7f3cec8 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
>          goto fail;
>      }
>  
> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
>          /* COLO does not support postcopy */
> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
>                            MIGRATION_STATUS_COLO);
>      } else {
>          migration_completion_end(s);

Thanks a lot for fixing it, Zhijian.  It means I already broke COLO in
10.0/10.1.

Hailiang/Chen, do you know of anyone still using COLO, especially in
enterprise?  I don't expect any individuals to be using it.  It definitely
complicates the migration logic all over the place.  Fabiano and I have
discussed removing legacy code a few times, and COLO was always on the list.

We discussed obsoleting RDMA too; that was when Huawei developers tried to
re-implement the whole of RDMA using rsocket, which didn't land only because
of a perf regression.  Meanwhile, Zhijian also provided a unit test, which
we now rely on to at least not break RDMA.

If we do not have known users, I sincerely want to discuss with you the
obsoletion and removal of COLO from the QEMU codebase.  Do you think that is
feasible?

Zhijian, do you have any input here?

Thanks,

-- 
Peter Xu




* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-05 20:58 ` Peter Xu
@ 2025-11-06  1:09   ` Zhijian Li (Fujitsu)
  2025-11-06  3:21     ` Zhang Chen
  0 siblings, 1 reply; 8+ messages in thread
From: Zhijian Li (Fujitsu) @ 2025-11-06  1:09 UTC (permalink / raw)
  To: Peter Xu, zhangckid@gmail.com, zhanghailiang@xfusion.com
  Cc: farosas@suse.de, qemu-devel@nongnu.org



On 06/11/2025 04:58, Peter Xu wrote:
> On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
>> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
>> state before completed during migration, which broke the original transition
>> to COLO. The migration flow for precopy has changed to:
>> active -> pre-switchover -> device -> completed.
>>
>> This patch updates the transition state to ensure that the Pre-COLO
>> state corresponds to DEVICE state correctly.
>>
>> Fixes: 4881411136 ("migration: Always set DEVICE state")
>> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
>> ---
>>   migration/migration.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index a63b46bbef..6ec7f3cec8 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
>>           goto fail;
>>       }
>>   
>> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
>> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
>>           /* COLO does not support postcopy */
>> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
>> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
>>                             MIGRATION_STATUS_COLO);
>>       } else {
>>           migration_completion_end(s);
> 
> Thanks a lot for fixing it, Zhijian.  It means I broke COLO already for
> 10.0/10.1..
> 
> Hailiang/Chen, do you still know anyone who is using COLO, especially in
> enterprise?  I don't expect any individual using it.. It definitely
> complicates migration logics all over the places.  Fabiano and I discussed
> a few times on removing legacy code and COLO was always in the list.
> 
> We used to discuss RDMA obsoletion too, that's when Huawei developers at
> least tried to re-implement the whole RDMA using rsocket, that didn't land
> only because of a perf regression.  Meanwhile, Zhijian also provided an
> unit test, which we rely on recently to not break RDMA at the minimum.
> 
> If we do not have known users, I sincerely want to discuss with you on
> obsoletion and removal of COLO from qemu codebase.  Do you see feasible?
> 
> Zhijian, do you have any input here?


If we don't have any known users, I personally have no objection to removing COLO.

From my previous understanding, its use cases are rather limited, and the
checkpointing overhead is significant.  Moreover, with the continuous
development of Cloud Native over the past decade, service-based FT/HA
solutions have become very mature, which shrinks the use cases for VM-based
FT solutions even further.

I think it's worth keeping if we have:

- Active users who depend on it.
- A unit test for the COLO framework.

Thanks
Zhijian



> 
> Thanks,
> 


* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-06  1:09   ` Zhijian Li (Fujitsu)
@ 2025-11-06  3:21     ` Zhang Chen
  2025-11-06  3:24       ` Zhang Chen
  2025-11-06 22:07       ` Peter Xu
  0 siblings, 2 replies; 8+ messages in thread
From: Zhang Chen @ 2025-11-06  3:21 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Peter Xu, zhanghailiang@xfusion.com, farosas@suse.de,
	qemu-devel@nongnu.org, Lukas Straub

On Thu, Nov 6, 2025 at 9:10 AM Zhijian Li (Fujitsu)
<lizhijian@fujitsu.com> wrote:
>
>
>
> On 06/11/2025 04:58, Peter Xu wrote:
> > On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> >> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> >> state before completed during migration, which broke the original transition
> >> to COLO. The migration flow for precopy has changed to:
> >> active -> pre-switchover -> device -> completed.
> >>
> >> This patch updates the transition state to ensure that the Pre-COLO
> >> state corresponds to DEVICE state correctly.
> >>
> >> Fixes: 4881411136 ("migration: Always set DEVICE state")
> >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> >> ---
> >>   migration/migration.c | 4 ++--
> >>   1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index a63b46bbef..6ec7f3cec8 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
> >>           goto fail;
> >>       }
> >>
> >> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> >> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
> >>           /* COLO does not support postcopy */
> >> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> >> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
> >>                             MIGRATION_STATUS_COLO);
> >>       } else {
> >>           migration_completion_end(s);
> >
> > Thanks a lot for fixing it, Zhijian.  It means I broke COLO already for
> > 10.0/10.1..
> >
> > Hailiang/Chen, do you still know anyone who is using COLO, especially in
> > enterprise?  I don't expect any individual using it.. It definitely
> > complicates migration logics all over the places.  Fabiano and I discussed
> > a few times on removing legacy code and COLO was always in the list.
> >
> > We used to discuss RDMA obsoletion too, that's when Huawei developers at
> > least tried to re-implement the whole RDMA using rsocket, that didn't land
> > only because of a perf regression.  Meanwhile, Zhijian also provided an
> > unit test, which we rely on recently to not break RDMA at the minimum.
> >
> > If we do not have known users, I sincerely want to discuss with you on
> > obsoletion and removal of COLO from qemu codebase.  Do you see feasible?
> >
> > Zhijian, do you have any input here?
>
>
> If we don't have any known users, I personally have no objection to removing COLO.
>
>  From my previous understanding, its use cases are rather limited, and the checkpointing overhead is significant.
> Moreover, with the continuous development of Cloud Native over the past decade, service-based
> FT/HA solutions have become very mature, which shrinks the use cases for VM-based FT solutions even further.
>
> I think it's worth keeping if we have:
>
> - Active users who depend on it.
> - A unit test for the COLO framework.
>
> Thanks
> Zhijian
>
>

Adding Lukas to CC.

From a technical point of view, I agree with Zhijian's comments. We can
probably do this gradually.
On my side, I know of some local companies that build their HA/FT products
on COLO. I think most of them have already forked the upstream QEMU code
into a private, internally maintained repo.
This may cause some upgrade issues for them in the future.

Another part is the Pacemaker integration of COLO, which Lukas covered;
I don't know the status of its users.
Maybe Lukas can comment?

As for the implementation, COLO is not only the migration code (which is
the core of COLO); it also includes the network and block replication parts
that work alongside it.
If we remove the migration-related code, we need to consider how to handle
the other parts: could the network part become a general QEMU netfilter?
What about block replication?

For the COLO framework unit test, I think we would need to add some "#if
defined(qtest)" in the migration code for testing (the COLO proxy/netfilter
already has independent qtests).

Thanks
Chen





>
> >
> > Thanks,
> >



* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-06  3:21     ` Zhang Chen
@ 2025-11-06  3:24       ` Zhang Chen
  2025-11-06 22:07       ` Peter Xu
  1 sibling, 0 replies; 8+ messages in thread
From: Zhang Chen @ 2025-11-06  3:24 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: Peter Xu, zhanghailiang@xfusion.com, farosas@suse.de,
	qemu-devel@nongnu.org, Lukas Straub

On Thu, Nov 6, 2025 at 11:21 AM Zhang Chen <zhangckid@gmail.com> wrote:
>
> On Thu, Nov 6, 2025 at 9:10 AM Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com> wrote:
> >
> >
> >
> > On 06/11/2025 04:58, Peter Xu wrote:
> > > On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> > >> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> > >> state before completed during migration, which broke the original transition
> > >> to COLO. The migration flow for precopy has changed to:
> > >> active -> pre-switchover -> device -> completed.
> > >>
> > >> This patch updates the transition state to ensure that the Pre-COLO
> > >> state corresponds to DEVICE state correctly.
> > >>
> > >> Fixes: 4881411136 ("migration: Always set DEVICE state")
> > >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> > >> ---
> > >>   migration/migration.c | 4 ++--
> > >>   1 file changed, 2 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/migration/migration.c b/migration/migration.c
> > >> index a63b46bbef..6ec7f3cec8 100644
> > >> --- a/migration/migration.c
> > >> +++ b/migration/migration.c
> > >> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
> > >>           goto fail;
> > >>       }
> > >>
> > >> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> > >> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
> > >>           /* COLO does not support postcopy */
> > >> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> > >> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
> > >>                             MIGRATION_STATUS_COLO);
> > >>       } else {
> > >>           migration_completion_end(s);
> > >
> > > Thanks a lot for fixing it, Zhijian.  It means I broke COLO already for
> > > 10.0/10.1..
> > >
> > > Hailiang/Chen, do you still know anyone who is using COLO, especially in
> > > enterprise?  I don't expect any individual using it.. It definitely
> > > complicates migration logics all over the places.  Fabiano and I discussed
> > > a few times on removing legacy code and COLO was always in the list.
> > >
> > > We used to discuss RDMA obsoletion too, that's when Huawei developers at
> > > least tried to re-implement the whole RDMA using rsocket, that didn't land
> > > only because of a perf regression.  Meanwhile, Zhijian also provided an
> > > unit test, which we rely on recently to not break RDMA at the minimum.
> > >
> > > If we do not have known users, I sincerely want to discuss with you on
> > > obsoletion and removal of COLO from qemu codebase.  Do you see feasible?
> > >
> > > Zhijian, do you have any input here?
> >
> >
> > If we don't have any known users, I personally have no objection to removing COLO.
> >
> >  From my previous understanding, its use cases are rather limited, and the checkpointing overhead is significant.
> > Moreover, with the continuous development of Cloud Native over the past decade, service-based
> > FT/HA solutions have become very mature, which shrinks the use cases for VM-based FT solutions even further.
> >
> > I think it's worth keeping if we have:
> >
> > - Active users who depend on it.
> > - A unit test for the COLO framework.
> >
> > Thanks
> > Zhijian
> >
> >
>
> Add CC Lukas.
>
> From technical point, I agree Zhijian's comments. We can probably do
> this gradually.
> In my side, I know some local companies build thier HA/FT product based on COLO.
> In this case, I think most of them already forked QEMU upstream code
> to a private repo for internal mantained.
> It may caused some upgrade issues in the future.
>
> And another part is Lukas covered pacemaker project integrated COLO,
> and I don't know users status for pacemaker.
> Maybe Lukas can input some comments?
>
> For the implementation, COLO not only have migration part of code(it
> is the core of COLO), it also including network and block replication
> for co-working.
> If we remove migration related code need to consider how to handle
> other parts, network maybe change to general QEMU netfilter?  block
> replication ?
>
> For the COLO framework unit test,  I think it need to add some "#if
> defined(qtest)" in migration code for testing(COLO proxy/netfilter
> already have independent qtest).
>
> Thanks
> Chen
>
>

Adding the Pacemaker/Corosync-related details for COLO:
https://wiki.qemu.org/Features/COLO/Managed_HOWTO


>
>
>
> >
> > >
> > > Thanks,
> > >



* Re: [PATCH] migration: Fix transition to COLO state from precopy
  2025-11-06  3:21     ` Zhang Chen
  2025-11-06  3:24       ` Zhang Chen
@ 2025-11-06 22:07       ` Peter Xu
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Xu @ 2025-11-06 22:07 UTC (permalink / raw)
  To: Zhang Chen
  Cc: Zhijian Li (Fujitsu), zhanghailiang@xfusion.com, farosas@suse.de,
	qemu-devel@nongnu.org, Lukas Straub, Jason Wang

On Thu, Nov 06, 2025 at 11:21:56AM +0800, Zhang Chen wrote:
> On Thu, Nov 6, 2025 at 9:10 AM Zhijian Li (Fujitsu)
> <lizhijian@fujitsu.com> wrote:
> >
> >
> >
> > On 06/11/2025 04:58, Peter Xu wrote:
> > > On Tue, Nov 04, 2025 at 09:36:06AM +0800, Li Zhijian wrote:
> > >> Commit 4881411136 ("migration: Always set DEVICE state") set a new DEVICE
> > >> state before completed during migration, which broke the original transition
> > >> to COLO. The migration flow for precopy has changed to:
> > >> active -> pre-switchover -> device -> completed.
> > >>
> > >> This patch updates the transition state to ensure that the Pre-COLO
> > >> state corresponds to DEVICE state correctly.
> > >>
> > >> Fixes: 4881411136 ("migration: Always set DEVICE state")
> > >> Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
> > >> ---
> > >>   migration/migration.c | 4 ++--
> > >>   1 file changed, 2 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/migration/migration.c b/migration/migration.c
> > >> index a63b46bbef..6ec7f3cec8 100644
> > >> --- a/migration/migration.c
> > >> +++ b/migration/migration.c
> > >> @@ -3095,9 +3095,9 @@ static void migration_completion(MigrationState *s)
> > >>           goto fail;
> > >>       }
> > >>
> > >> -    if (migrate_colo() && s->state == MIGRATION_STATUS_ACTIVE) {
> > >> +    if (migrate_colo() && s->state == MIGRATION_STATUS_DEVICE) {
> > >>           /* COLO does not support postcopy */
> > >> -        migrate_set_state(&s->state, MIGRATION_STATUS_ACTIVE,
> > >> +        migrate_set_state(&s->state, MIGRATION_STATUS_DEVICE,
> > >>                             MIGRATION_STATUS_COLO);
> > >>       } else {
> > >>           migration_completion_end(s);
> > >
> > > Thanks a lot for fixing it, Zhijian.  It means I broke COLO already for
> > > 10.0/10.1..
> > >
> > > Hailiang/Chen, do you still know anyone who is using COLO, especially in
> > > enterprise?  I don't expect any individual using it.. It definitely
> > > complicates migration logics all over the places.  Fabiano and I discussed
> > > a few times on removing legacy code and COLO was always in the list.
> > >
> > > We used to discuss RDMA obsoletion too, that's when Huawei developers at
> > > least tried to re-implement the whole RDMA using rsocket, that didn't land
> > > only because of a perf regression.  Meanwhile, Zhijian also provided an
> > > unit test, which we rely on recently to not break RDMA at the minimum.
> > >
> > > If we do not have known users, I sincerely want to discuss with you on
> > > obsoletion and removal of COLO from qemu codebase.  Do you see feasible?
> > >
> > > Zhijian, do you have any input here?
> >
> >
> > If we don't have any known users, I personally have no objection to removing COLO.
> >
> >  From my previous understanding, its use cases are rather limited, and the checkpointing overhead is significant.
> > Moreover, with the continuous development of Cloud Native over the past decade, service-based
> > FT/HA solutions have become very mature, which shrinks the use cases for VM-based FT solutions even further.
> >
> > I think it's worth keeping if we have:
> >
> > - Active users who depend on it.
> > - A unit test for the COLO framework.
> >
> > Thanks
> > Zhijian
> >
> >
> 
> Add CC Lukas.
> 
> From technical point, I agree Zhijian's comments. We can probably do
> this gradually.

Thanks both for the inputs so far.

> In my side, I know some local companies build thier HA/FT product based on COLO.
> In this case, I think most of them already forked QEMU upstream code
> to a private repo for internal mantained.
> It may caused some upgrade issues in the future.

If this might be an issue for them, please share this discussion with them
and see whether they want to get involved (if that would make their workflow
easier).  In general, whoever still rebases onto upstream (even if
infrequently) should benefit from some upstream involvement, so that things
do not get totally out of control in their production systems.

> 
> And another part is Lukas covered pacemaker project integrated COLO,
> and I don't know users status for pacemaker.
> Maybe Lukas can input some comments?
> 
> For the implementation, COLO not only have migration part of code(it
> is the core of COLO), it also including network and block replication
> for co-working.
> If we remove migration related code need to consider how to handle
> other parts, network maybe change to general QEMU netfilter?  block
> replication ?

This is a great question.  We can talk about that once a deprecation
decision is made.  IMHO you guys know better than me, so suggestions
are welcome.

From the migration POV, we don't necessarily need to remove anything outside
migration; removing COLO inside migration itself would already be a great
reduction of our maintenance burden.  But I would still definitely like to
sync with other subsystems if we decide that.  Let me copy Jason now,
just for awareness.  I should have done that already but I forgot, sorry.

> 
> For the COLO framework unit test,  I think it need to add some "#if
> defined(qtest)" in migration code for testing(COLO proxy/netfilter
> already have independent qtest).

That is another question, to be answered only if the initial decision is to
keep COLO.  So let's first focus on the first question: whether we should
deprecate COLO.

Thanks,

-- 
Peter Xu



