All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lukas Straub <lukasstraub2@web.de>
To: Arun Menon <armenon@redhat.com>
Cc: qemu-devel@nongnu.org, Peter Xu <peterx@redhat.com>,
	Fabiano Rosas <farosas@suse.de>, Zhang Chen <zhangckid@gmail.com>,
	Hailiang Zhang <zhanghailiang@xfusion.com>,
	Li Zhijian <lizhijian@fujitsu.com>
Subject: Re: [PATCH v11 13/21] migration-test: Add COLO migration unit test
Date: Thu, 19 Mar 2026 23:40:31 +0100	[thread overview]
Message-ID: <20260319232607.6d79f980@penguin> (raw)
In-Reply-To: <abBYAVLFQ9YDIKpr@fedora>

[-- Attachment #1: Type: text/plain, Size: 6100 bytes --]

On Tue, 10 Mar 2026 23:12:25 +0530
Arun Menon <armenon@redhat.com> wrote:

> Hi Lukas,
> 
> On Tue, Mar 10, 2026 at 04:29:54PM +0100, Lukas Straub wrote:
> > On Tue, 10 Mar 2026 20:17:57 +0530
> > Arun Menon <armenon@redhat.com> wrote:
> >   
> > > Hi Lukas,
> > > 
> > > On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote:  
> > > > Add a COLO migration test for COLO migration and failover.
> > > > 
> > > > Reviewed-by: Fabiano Rosas <farosas@suse.de>
> > > > Tested-by: Fabiano Rosas <farosas@suse.de>
> > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > > > [...]
> > > >    
> > > 
> > > I was running the qtests locally, and I encountered a timeout error.
> > > 
> > > Command run: mkdir -p build ; cd build ; make check-qtest-x86_64;
> > > 
> > > Following is the output:
> > > ======
> > > [...]
> > > 
> > > Summary of Failures:
> > > 67/67 qtest+qtest-x86_64 - qemu:qtest-x86_64/migration-test         TIMEOUT        480.05s   killed by signal 15 SIGTERM
> > > Ok:                64
> > > Fail:              0
> > > Skipped:           2
> > > Timeout:           1
> > > ======
> > > 
> > > It seems that the test runner is stuck waiting for some input.
> > > Following is the stack trace
> > > [...]
> > > ======
> > > gstack 128276
> > > Thread 2 (Thread 0x7fdd090716c0 (LWP 128279) "call_rcu"):
> > > #0  0x00007fdd0921434d in syscall () from /lib64/libc.so.6
> > > #1  0x0000557fd604563a in qemu_futex_wait (f=0x557fd60a0190 <rcu_call_ready_event>, val=4294967295) at /home/arun/workdir/new/devel/upstream/qemu-priv/include/qemu/futex.h:47
> > > #2  0x0000557fd604584e in qemu_event_wait (ev=0x557fd60a0190 <rcu_call_ready_event>) at ../util/event.c:162
> > > #3  0x0000557fd6045fde in call_rcu_thread (opaque=0x0) at ../util/rcu.c:304
> > > #4  0x0000557fd600e8fb in qemu_thread_start (args=0x557fd6beec70) at ../util/qemu-thread-posix.c:414
> > > #5  0x00007fdd09193464 in start_thread () from /lib64/libc.so.6
> > > #6  0x00007fdd092165ac in __clone3 () from /lib64/libc.so.6
> > > 
> > > Thread 1 (Thread 0x7fdd09073240 (LWP 128276) "migration-test"):
> > > #0  0x00007fdd0919b982 in __syscall_cancel_arch () from /lib64/libc.so.6
> > > #1  0x00007fdd0918fc3c in __internal_syscall_cancel () from /lib64/libc.so.6
> > > #2  0x00007fdd091dfb62 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
> > > #3  0x00007fdd091ebb37 in nanosleep () from /lib64/libc.so.6
> > > #4  0x00007fdd0921613a in usleep () from /lib64/libc.so.6
> > > #5  0x0000557fd5fd99cd in wait_for_serial (side=0x557fd6065f08 "dest_serial") at ../tests/qtest/migration/framework.c:82
> > > #6  0x0000557fd5fe5865 in test_colo_common (args=0x557fd6bfdf50, failover_during_checkpoint=false, primary_failover=true) at ../tests/qtest/migration/colo-tests.c:66  
> > 
> > 60    migrate_qmp(from, to, args->connect_uri, NULL, "{}");
> > 61
> > 62    wait_for_migration_status(from, "colo", NULL);
> > 63    wait_for_resume(to, get_dst());
> > 64
> > 65    wait_for_serial("src_serial");
> > 66    wait_for_serial("dest_serial");
> > 
> > Interesting, so the secondary guest is stuck/crahsed after entering
> > colo state despite having resumed.
> > 
> > It works fine here on master. And before the merge I have looped the
> > colo tests for a whole day on my machine without any failures.
> > 
> > How often does this happen? What is the commit you are on, host, ASAN,
> > MSAN, UBSAN, configure options? With kvm or without?  
> 
> I have started the run on a fedora-43 VM. I am on commit:
> de61484ec39f418e5c0d4603017695f9ffb8fe24 master branch.

Okay, so the problem is that nested virtualization slows down kvm page
faults (e.g. dirtying the dirty bitmap) so much that it takes longer
for the guest to dirty the memory than the interval between two colo
checkpoints at which point the dirty bitmap is cleared again. And to
make matters worse the secondary qemu process is starved by the primary
qemu for some reason.

So the following happens:
1. After a checkpoint primary dirties all it's memory pretty quickly
and afterwards the dirty mem + serial out loops quickly. So it is very
likely the guest is in the middle of dirtying the memory at any given
time.
2. The colo checkpointing is actually working flawlessly it only seems
stuck because the migraion-test is stuck and not echoing the qmp events.
The secondary receives the primary checkpoint (whose guest is in the
middle of dirtying memory) and both sides resume. The secondary is
starved and spends a lot of time in the dirty mem phase.
3. Another checkpoint happens, before the secondary guest is finished
dirtying memory and has a chance sending the serial out signal. Goto 1.

Can you try increasing the checkpoint interval the following and see
what works for you:
migrate_set_parameter_int(from, "x-checkpoint-delay", 300);

In my tests, raising it to 1000 worked fine. Though for a doubly nested VM
I had to raise it to 5000. E.g. It takes up to 5 seconds to dirty 100Mb
memory in a doubly nested VM.

Regards,
Lukas Straub

> 
> The configuration command I used was:
> ../configure --target-list=x86_64-softmmu --enable-debug --enable-trace-backends=log --enable-slirp
> 
> Targets and accelerators
>   KVM support                     : YES
> 
> > 
> > Can you try the following and show me the log:
> > (cd build && QTEST_QEMU_IMG=./qemu-img QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_LOG=- tests/qtest/migration-test --full -p /x86_64/migration/colo/plain)
> >   
> 
> I have attached the log file : qemu.log to this email. It is stuck at
> the last line : 
> {"timestamp": {"seconds": 1773164370, "microseconds": 609791}, "event": "RESUME"}
> 
> It should be some configuration miss from my end.
> Thank you for looking into it.
> 
> [...]
> > > 
> > > 
> > > Is there something that I am missing? Can you please look into this? 
> > > 
> > > 
> > > Regards,
> > > Arun Menon
> > >   
> >   
> 
> Regards,
> Arun Menon
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2026-03-19 22:41 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-02 11:43 [PATCH v11 00/21] migration: Add COLO multifd support and COLO migration unit test Lukas Straub
2026-03-02 11:43 ` [PATCH v11 01/21] MAINTAINERS: Add myself as maintainer for COLO migration framework Lukas Straub
2026-03-02 11:43 ` [PATCH v11 02/21] MAINTAINERS: Remove Hailiang Zhang from " Lukas Straub
2026-03-10 11:33   ` Lukas Straub
2026-03-02 11:43 ` [PATCH v11 03/21] colo: Setup ram cache in normal migration path Lukas Straub
2026-03-02 11:43 ` [PATCH v11 04/21] colo: Replace migration_incoming_colo_enabled() with migrate_colo() Lukas Straub
2026-03-02 11:43 ` [PATCH v11 05/21] colo: Remove ENABLE_COLO savevm command and mark it as deprecated Lukas Straub
2026-03-02 11:43 ` [PATCH v11 06/21] ram: Remove colo special-casing Lukas Straub
2026-03-02 11:43 ` [PATCH v11 07/21] multifd: Move ram state receive into multifd_ram_state_recv() Lukas Straub
2026-03-02 11:43 ` [PATCH v11 08/21] multifd: Add COLO support Lukas Straub
2026-03-02 11:43 ` [PATCH v11 09/21] Call colo_release_ram_cache() after multifd threads terminate Lukas Straub
2026-03-02 11:45 ` [PATCH v11 10/21] colo: Fix crash during device vmstate load Lukas Straub
2026-03-02 11:45 ` [PATCH v11 11/21] colo: Hold the BQL while sending ram state Lukas Straub
2026-03-02 11:45 ` [PATCH v11 12/21] colo: Do not hold the BQL while receiving " Lukas Straub
2026-03-02 11:45 ` [PATCH v11 13/21] migration-test: Add COLO migration unit test Lukas Straub
2026-03-10 14:47   ` Arun Menon
2026-03-10 15:29     ` Lukas Straub
2026-03-10 17:42       ` Arun Menon
2026-03-19 22:40         ` Lukas Straub [this message]
2026-03-02 11:45 ` [PATCH v11 14/21] Convert colo main documentation to restructuredText Lukas Straub
2026-03-02 11:45 ` [PATCH v11 15/21] qemu-colo.rst: Miscellaneous changes Lukas Straub
2026-03-02 11:45 ` [PATCH v11 16/21] qemu-colo.rst: Add my copyright Lukas Straub
2026-03-02 11:45 ` [PATCH v11 17/21] qemu-colo.rst: Simplify the block replication setup Lukas Straub
2026-03-02 11:45 ` [PATCH v11 18/21] multifd: Fix hang if send thread errors during sync Lukas Straub
2026-03-02 11:45 ` [PATCH v11 19/21] colo: Use file lock in primary_vm_do_failover() Lukas Straub
2026-03-02 20:06   ` Peter Xu
2026-03-02 11:45 ` [PATCH v11 20/21] migration: Keep s->rp_state.from_dst_file open until migration ends Lukas Straub
2026-03-02 20:07   ` Peter Xu
2026-03-02 11:45 ` [PATCH v11 21/21] colo: Reuse the return path from migration on primary and secondary side Lukas Straub
2026-03-02 20:11   ` Peter Xu
2026-03-06 12:53 ` [PATCH v11 00/21] migration: Add COLO multifd support and COLO migration unit test Lukas Straub
2026-03-06 13:07   ` Fabiano Rosas
  -- strict thread matches above, loose matches on Subject: below --
2026-03-02 10:19 Lukas Straub
2026-03-02 10:19 ` [PATCH v11 13/21] migration-test: Add " Lukas Straub
2026-03-02 10:13 [PATCH v11 00/21] migration: Add COLO multifd support and " Lukas Straub
2026-03-02 10:13 ` [PATCH v11 13/21] migration-test: Add " Lukas Straub
2026-03-01 21:37 [PATCH v11 00/21] migration: Add COLO multifd support and " Lukas Straub
2026-03-01 21:37 ` [PATCH v11 13/21] migration-test: Add " Lukas Straub
2026-03-01 19:16 [PATCH v11 00/21] migration: Add COLO multifd support and " Lukas Straub
2026-03-01 19:16 ` [PATCH v11 13/21] migration-test: Add " Lukas Straub

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260319232607.6d79f980@penguin \
    --to=lukasstraub2@web.de \
    --cc=armenon@redhat.com \
    --cc=farosas@suse.de \
    --cc=lizhijian@fujitsu.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=zhangckid@gmail.com \
    --cc=zhanghailiang@xfusion.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.