qemu-devel.nongnu.org archive mirror
From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	"Hyman Huang" <yong.huang@smartx.com>,
	qemu-devel@nongnu.org, "Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Subject: Re: [PATCH RFC 10/10] tests/migration-tests: Add test case for responsive CPU throttle
Date: Fri, 13 Sep 2024 12:09:00 -0300	[thread overview]
Message-ID: <87ikuz1tgz.fsf@suse.de> (raw)
In-Reply-To: <ZuRTgbDhEJ7c-dcE@x1n>

Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 12, 2024 at 07:52:48PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas <farosas@suse.de> writes:
>> 
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> >> On Thu, Sep 12, 2024 at 09:13:16AM +0100, Peter Maydell wrote:
>> >>> On Wed, 11 Sept 2024 at 22:26, Fabiano Rosas <farosas@suse.de> wrote:
>> >>> > I don't think we're discussing total CI time at this point, so the math
>> >>> > doesn't really add up. We're not looking into making the CI finish
>> >>> > faster. We're looking into making migration-test finish faster. That
>> >>> > would reduce timeouts in CI, speed-up make check and reduce the chance
>> >>> > of random race conditions* affecting other people/staging runs.
>> >>> 
>> >>> Right. The reason migration-test appears on my radar is because
>> >>> it is very frequently the thing that shows up as "this sometimes
>> >>> just fails or just times out and if you hit retry it goes away
>> >>> again". That might not be migration-test's fault specifically,
>> >>> because those retries tend to be certain CI configs (s390,
>> >>> the i686-tci one), and I have some theories about what might be
>> >>> causing it (e.g. build system runs 4 migration-tests in parallel,
>> >>> which means 8 QEMU processes which is too many for the number
>> >>> of host CPUs). But right now I look at CI job failures and my reaction
>> >>> is "oh, it's the migration-test failing yet again" :-(
>> >>> 
>> >>> For some examples from this week:
>> >>> 
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7802183144
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373  <--------[1]
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152  <--------[2]
>> >>> https://gitlab.com/qemu-project/qemu/-/jobs/7786579155
>> >>
>> >> Ah right, the TIMEOUT is unfortunate, especially when tests can run
>> >> in parallel.  It indeed sounds like there's no good way to solve it
>> >> for good.  I also don't see how speeding up / reducing the migration
>> >> tests would help, as that's (to some degree) the same as tuning the
>> >> timeout value up.  With fewer tests the run fits into the 480s
>> >> window, but then maybe it looks so quick that we wonder whether we
>> >> should shrink the timeout to e.g. 90s, and then it can time out again
>> >> on a busy host with less capacity for concurrency.
>> >>
>> >> But indeed there're two ERRORs ([1,2] above)..  I collected some more info
>> >> here before the log expires:
>> >>
>> >> =================================8<================================
>> >>
>> >> *** /i386/migration/multifd/tcp/plain/cancel, qtest-i386 on s390 host
>> >>
>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7799842373
>> >>
>> >> 101/953 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          144.32s   killed by signal 6 SIGABRT
>> >>>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh PYTHON=/home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=144 QTEST_QEMU_BINARY=./qemu-system-i386 /home/gitlab-runner/builds/zEr9wY_L/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >> stderr:
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >> (test program exited with status code -6)
>> >> TAP parsing error: Too few tests run (expected 53, got 39)
>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>
>> >> # Start of plain tests
>> >> # Running /i386/migration/multifd/tcp/plain/cancel
>> >> # Using machine type: pc-i440fx-9.2
>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/src_serial -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> >> # starting QEMU: exec ./qemu-system-i386 -qtest unix:/tmp/qtest-3273509.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3273509.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pc-i440fx-9.2, -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-4112T2/dest_serial -incoming defer -drive if=none,id=d0,file=/tmp/migration-test-4112T2/bootsect,format=raw -device ide-hd,drive=d0,secs=1,cyls=1,heads=1    2>/dev/null -accel qtest
>> >> ----------------------------------- stderr -----------------------------------
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>
>> >> *** /ppc64/migration/multifd/tcp/plain/cancel, qtest-ppc64 on i686 host
>> >>
>> >> https://gitlab.com/qemu-project/qemu/-/jobs/7786579152
>> >>
>> >> 174/315 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test                       ERROR          381.00s   killed by signal 6 SIGABRT
>> >>>>> PYTHON=/builds/qemu-project/qemu/build/pyvenv/bin/python3 QTEST_QEMU_IMG=./qemu-img G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh QTEST_QEMU_BINARY=./qemu-system-ppc64 MALLOC_PERTURB_=178 QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
>> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> >> stderr:
>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >> (test program exited with status code -6)
>> >> TAP parsing error: Too few tests run (expected 61, got 47)
>> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> >>
>> >> # Start of plain tests
>> >> # Running /ppc64/migration/multifd/tcp/plain/cancel
>> >> # Using machine type: pseries-9.2
>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name source,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/src_serial -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >> # starting QEMU: exec ./qemu-system-ppc64 -qtest unix:/tmp/qtest-40766.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-40766.qmp,id=char0 -mon chardev=char0,mode=control -display none -audio none -accel kvm -accel tcg -machine pseries-9.2,vsmt=8 -name target,debug-threads=on -m 256M -serial file:/tmp/migration-test-H0Z1T2/dest_serial -incoming defer -nodefaults -machine cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken,cap-ccf-assist=off, -bios /tmp/migration-test-H0Z1T2/bootsect    2>/dev/null -accel qtest
>> >> ----------------------------------- stderr -----------------------------------
>> >> qemu-system-ppc64: Cannot read from TLS channel: The TLS connection was non-properly terminated.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> warning: fd: migration to a file is deprecated. Use file: instead.
>> >> ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>> >>
>> >> (test program exited with status code -6)
>> >> =================================8<================================
>> >>
>> >> So.. it's the same test (multifd/tcp/plain/cancel) that is failing on
>> >> different hosts / arches under test.  What is more weird is that the
>> >> two failures are different: the 2nd one throws a TLS error even
>> >> though the test doesn't involve TLS at all.
>> >
>> > I think that's just a parallel test being cancelled prematurely, either
>> > due to the crash or due to the timeout.
>> >
>> >>
>> >> Fabiano, is this the issue you're looking at?
>> >
>> > Yes. I can reproduce locally by running 2 processes in parallel: 1 loop
>> > with make -j$(nproc) check and another loop with tcp/plain/cancel. It
>> > takes ~1h to hit. I've seen crashes with ppc64, s390 and
>> > aarch64.
>> >
>> 
>> Ok, the issue is that after commit 5ef7e26bdb ("migration/multifd: solve
>> zero page causing multiple page faults"), the multifd code started using
>> the rb->receivedmap bitmap, which belongs to the ram code and is
>> initialized and *freed* from the ram SaveVMHandlers.
>> 
>> process_incoming_migration_co()        ...
>>   qemu_loadvm_state()                  multifd_nocomp_recv()
>>     qemu_loadvm_state_cleanup()          ramblock_recv_bitmap_set_offset()
>>       rb->receivedmap = NULL               set_bit_atomic(..., rb->receivedmap)
>>   ...
>>   migration_incoming_state_destroy()
>>     multifd_recv_cleanup()
>>       multifd_recv_terminate_threads(NULL)
>> 
>> Multifd threads are live until migration_incoming_state_destroy(), which
>> is called some time later.
>
> Thanks for the debugging.  Hmm, I would expect loadvm to somehow wait
> until all ram has been received..

Looks like a similar issue to when we didn't have the multifd_send sync
working correctly and the ram code would run ahead and do cleanup.
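
To make the window concrete, here's a minimal stand-alone sketch of the
use-after-free (the names mimic the QEMU ones, but this is illustrative
code, not the actual implementation; in particular the real
set_bit_atomic() path has no NULL guard, it just dereferences the freed
pointer and segfaults):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified stand-in for QEMU's RAMBlock, keeping just receivedmap. */
typedef struct {
    unsigned long *receivedmap;
    size_t nbits;
} RAMBlock;

/* ram_load_setup() allocates the bitmap... */
static void ram_load_setup(RAMBlock *rb, size_t nbits)
{
    rb->nbits = nbits;
    rb->receivedmap = calloc((nbits + BITS_PER_LONG - 1) / BITS_PER_LONG,
                             sizeof(unsigned long));
}

/* ...and qemu_loadvm_state_cleanup() frees it while the multifd recv
 * threads may still be running. */
static void qemu_loadvm_state_cleanup(RAMBlock *rb)
{
    free(rb->receivedmap);
    rb->receivedmap = NULL;
}

/* What a multifd recv thread effectively does per received page.  The
 * NULL check only exists here to make the broken window observable;
 * the real code would crash at this point. */
static int ramblock_recv_bitmap_set(RAMBlock *rb, size_t bit)
{
    if (!rb->receivedmap) {
        return -1;
    }
    rb->receivedmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
    return 0;
}
```

i.e. any page that arrives on a multifd channel between
qemu_loadvm_state_cleanup() and multifd_recv_terminate_threads() hits
freed memory.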

>
> And when looking, I also found we're now ambiguous about the
> cleanups.. probably a separate issue that doesn't yet cause crashes.  I
> mean we don't even have a consistent order in which to run these
> cleanups in precopy / postcopy, as:
>
> We have this:
>
>   qemu_loadvm_state
>     qemu_loadvm_state_cleanup              <------------- [1]
>   migration_bh_schedule(process_incoming_migration_bh, mis);
>   (then in bh...)
>   process_incoming_migration_bh
>     migration_incoming_state_destroy       <------------- [2]
>
> But then:
>   
>   postcopy_ram_listen_thread
>     migration_incoming_state_destroy       <------------- [2]
>     qemu_loadvm_state_cleanup              <------------- [1]
>
> I think it makes more sense to do [2] after [1], so maybe postcopy needs
> changing too..

Yes, ram_load_cleanup doesn't do much:

static int ram_load_cleanup(void *opaque)
{
    RAMBlock *rb;

    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
        qemu_ram_block_writeback(rb);
    }

    xbzrle_load_cleanup();

    return 0;
}

We can probably change the order just fine.

For the bug, I'm now trying to add a helper to init/clean the
receivedmap and call it only from the code that uses it (postcopy,
multifd recv), instead of always initializing/cleaning it along with
ram_load_setup/cleanup.
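
Roughly something along these lines (a sketch only; the helper names
and the idempotent shape are my guesses at this point, not the final
patch):

```c
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified stand-in for QEMU's RAMBlock, keeping just receivedmap. */
typedef struct {
    unsigned long *receivedmap;
    size_t nbits;
} RAMBlock;

/* Called from the setup paths that actually use the bitmap (postcopy,
 * multifd recv) instead of unconditionally from ram_load_setup().
 * Idempotent, so both users can call it in either order. */
static int ramblock_recv_map_init(RAMBlock *rb, size_t nbits)
{
    if (rb->receivedmap) {
        return 0;  /* already initialized by the other user */
    }
    rb->nbits = nbits;
    rb->receivedmap = calloc((nbits + BITS_PER_LONG - 1) / BITS_PER_LONG,
                             sizeof(unsigned long));
    return rb->receivedmap ? 0 : -1;
}

/* Called from the matching teardown, i.e. only after the recv threads
 * have been joined, instead of from qemu_loadvm_state_cleanup(). */
static void ramblock_recv_map_cleanup(RAMBlock *rb)
{
    free(rb->receivedmap);
    rb->receivedmap = NULL;
}
```

That way the bitmap's lifetime is tied to the threads that touch it,
rather than to the generic loadvm cleanup.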

>
>> 
>> >> Peter, do you think it'll be helpful if we temporarily mark this test as
>> >> "slow" too so it's not run in CI (so we still run it ourselves when prepare
>> >> migration PR, with the hope that it can reproduce)?
>> >>
>> >> Thanks,
>> 


Thread overview: 35+ messages
2024-09-09 13:47 [PATCH RFC 00/10] migration: auto-converge refinements for huge VM Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 01/10] migration: Introduce structs for periodic CPU throttle Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 02/10] migration: Refine util functions to support " Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 03/10] qapi/migration: Introduce periodic CPU throttling parameters Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 04/10] qapi/migration: Introduce the iteration-count Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 05/10] migration: Introduce util functions for periodic CPU throttle Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 06/10] migration: Support " Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 07/10] tests/migration-tests: Add test case for periodic throttle Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 08/10] migration: Introduce cpu-responsive-throttle parameter Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 09/10] migration: Support responsive CPU throttle Hyman Huang
2024-09-09 13:47 ` [PATCH RFC 10/10] tests/migration-tests: Add test case for " Hyman Huang
2024-09-09 14:02   ` Peter Maydell
2024-09-09 14:36     ` Peter Xu
2024-09-09 21:54       ` Fabiano Rosas
2024-09-10 21:22         ` Peter Xu
2024-09-10 22:23           ` Fabiano Rosas
2024-09-11 15:59             ` Peter Xu
2024-09-11 19:48               ` Fabiano Rosas
2024-09-11 20:37                 ` Peter Xu
2024-09-11 21:26                   ` Fabiano Rosas
2024-09-12  8:13                     ` Peter Maydell
2024-09-12 13:48                       ` Fabiano Rosas
2024-09-12 14:09                         ` Peter Maydell
2024-09-12 14:28                           ` Fabiano Rosas
2024-09-12 15:09                       ` Peter Xu
2024-09-12 15:14                         ` Peter Maydell
2024-09-13 15:02                           ` Peter Xu
2024-09-12 15:37                         ` Fabiano Rosas
2024-09-12 22:52                           ` Fabiano Rosas
2024-09-13 15:00                             ` Peter Xu
2024-09-13 15:09                               ` Fabiano Rosas [this message]
2024-09-13 15:17                                 ` Fabiano Rosas
2024-09-13 15:38                                   ` Peter Xu
2024-09-13 17:51                                     ` Fabiano Rosas
2024-09-09 14:43     ` Yong Huang
