* Big TCG slowdown when using zstd with aarch64
@ 2023-06-01 21:06 Juan Quintela
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 10:14 ` Daniel P. Berrangé
0 siblings, 2 replies; 11+ messages in thread
From: Juan Quintela @ 2023-06-01 21:06 UTC (permalink / raw)
To: qemu-devel, peter.maydell, Richard Henderson, Daniel Berrange
Hi
Before I continue investigating this further, do you have any clue what
is going on here? I am running qemu-system-aarch64 on an x86_64 host.
$ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/none
TAP version 13
# random seed: R02S3d50a0e874b28727af4b862a3cc4214e
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888203.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888203.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-WT9151/src_serial -cpu max -kernel /tmp/migration-test-WT9151/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888203.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888203.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-WT9151/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-WT9151/bootsect -accel qtest
ok 1 /aarch64/migration/multifd/tcp/plain/none
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of aarch64 tests
1..1
real 0m4.559s
user 0m4.898s
sys 0m1.156s
$ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zlib
TAP version 13
# random seed: R02S014dd197350726bdd95aea37b81d3898
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888278.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888278.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-25U151/src_serial -cpu max -kernel /tmp/migration-test-25U151/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888278.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888278.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-25U151/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-25U151/bootsect -accel qtest
ok 1 /aarch64/migration/multifd/tcp/plain/zlib
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of aarch64 tests
1..1
real 0m1.645s
user 0m3.484s
sys 0m0.512s
$ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zstd
TAP version 13
# random seed: R02Se49afe2ea9d2b76a1eda1fa2bc8d812c
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888353.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888353.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-UILY51/src_serial -cpu max -kernel /tmp/migration-test-UILY51/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888353.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888353.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-UILY51/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-UILY51/bootsect -accel qtest
ok 1 /aarch64/migration/multifd/tcp/plain/zstd
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of aarch64 tests
1..1
real 0m48.022s
user 8m17.306s
sys 0m35.217s
This test is very amenable to compression: we only modify one byte in
each page, so essentially all the pages are the same.
no compression: 4.5 seconds
zlib compression: 1.6 seconds (within what I would expect)
zstd compression: 48 seconds, what is going on here?
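For reference, the guest workload here is roughly equivalent to the C below
(a sketch only; the real bootsect is a tiny assembly kernel and the region
size is illustrative):

/* Sketch of the migration-test guest workload, not the real bootsect code. */
#include <stdint.h>

#define PAGE_SIZE 4096
#define TEST_MEM  (100 * 1024 * 1024)   /* illustrative size */

static void dirty_loop(volatile uint8_t *base)
{
    for (;;) {
        /* Dirty every page, but only change a single byte in each one,
         * so the page contents stay almost identical and very compressible. */
        for (uint64_t off = 0; off < TEST_MEM; off += PAGE_SIZE) {
            base[off]++;
        }
        /* the real guest also writes a byte to the serial port per pass */
    }
}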
As a comparison, these are the times for x86_64 running natively; the values
are much more reasonable.
$ time ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/none
TAP version 13
# random seed: R02S579fbe8739386c3a3336486f2adbfecd
# Start of x86_64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002254.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002254.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-KA6Z51/src_serial -drive file=/tmp/migration-test-KA6Z51/bootsect,format=raw -accel qtest
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002254.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002254.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-KA6Z51/dest_serial -incoming defer -drive file=/tmp/migration-test-KA6Z51/bootsect,format=raw -accel qtest
ok 1 /x86_64/migration/multifd/tcp/plain/none
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of x86_64 tests
1..1
real 0m3.889s
user 0m4.264s
sys 0m1.295s
$ time ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/zlib
TAP version 13
# random seed: R02S968738d716d2c0dc8c8279716ff3dd9a
# Start of x86_64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002385.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002385.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-9JTZ51/src_serial -drive file=/tmp/migration-test-9JTZ51/bootsect,format=raw -accel qtest
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002385.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002385.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-9JTZ51/dest_serial -incoming defer -drive file=/tmp/migration-test-9JTZ51/bootsect,format=raw -accel qtest
ok 1 /x86_64/migration/multifd/tcp/plain/zlib
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of x86_64 tests
1..1
real 0m1.464s
user 0m2.868s
sys 0m0.534s
$ time ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/zstd
TAP version 13
# random seed: R02Sba4a923c284ad824bc82fd488044a5df
# Start of x86_64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3006857.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3006857.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-ALK251/src_serial -drive file=/tmp/migration-test-ALK251/bootsect,format=raw -accel qtest
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3006857.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3006857.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-ALK251/dest_serial -incoming defer -drive file=/tmp/migration-test-ALK251/bootsect,format=raw -accel qtest
ok 1 /x86_64/migration/multifd/tcp/plain/zstd
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of x86_64 tests
1..1
real 0m1.298s
user 0m2.540s
sys 0m0.662s
3.88, 1.46 and 1.29 seconds: what I would have expected.
And if you ask why this is so important: at 48 seconds we are very
near the test timeout limit. If I run 2 or more migration tests at the
same time:
# random seed: R02Sfb0b65ab5484a997057ef94daed7072f
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2754383.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2754383.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-L93051/src_serial -cpu max -kernel /tmp/migration-test-L93051/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2754383.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2754383.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-L93051/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-L93051/bootsect -accel qtest
**
ERROR:../../../../mnt/code/qemu/multifd/tests/qtest/migration-helpers.c:143:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT)
not ok /aarch64/migration/multifd/tcp/plain/zstd - ERROR:../../../../mnt/code/qemu/multifd/tests/qtest/migration-helpers.c:143:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT)
Bail out!
qemu-system-aarch64: multifd_send_pages: channel 0 has already quit!
qemu-system-aarch64: Unable to write to socket: Connection reset by peer
Aborted (core dumped)
real 2m0.928s
user 16m15.671s
sys 1m11.431s
Later, Juan.
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-01 21:06 Big TCG slowdown when using zstd with aarch64 Juan Quintela
@ 2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
` (2 more replies)
2023-06-02 10:14 ` Daniel P. Berrangé
1 sibling, 3 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-06-02 9:10 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel, peter.maydell, Richard Henderson
On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>
> Hi
>
> Before I continue investigating this further, do you have any clue what
> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
>
> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/none
> real 0m4.559s
> user 0m4.898s
> sys 0m1.156s
> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zlib
> real 0m1.645s
> user 0m3.484s
> sys 0m0.512s
> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zstd
> real 0m48.022s
> user 8m17.306s
> sys 0m35.217s
>
>
> This test is very amenable to compression: we only modify one byte in
> each page, so essentially all the pages are the same.
>
> no compression: 4.5 seconds
> zlib compression: 1.6 seconds (within what I would expect)
> zstd compression: 48 seconds, what is going on here?
This is non-deterministic. I've seen *all* three cases complete in approx
1 second each. If I set 'QTEST_LOG=1', then very often the zstd test will
complete in < 1 second.
I notice the multifd tests are not sharing the setup logic with the
precopy tests, so they have not set any migration bandwidth limit.
IOW migration is running at full speed.
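(For comparison, the precopy tests throttle the source before starting the
migration; over QMP that is something along these lines, with a purely
illustrative value:

  { "execute": "migrate-set-parameters",
    "arguments": { "max-bandwidth": 104857600 } }
)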
What is happening is that the migration is running so fast that the guest
workload hasn't had the chance to dirty any memory, so 'none' and 'zlib'
tests only copy about 15-30 MB of data, the rest is still all zeroes.
When it is fast, the zstd test also has a similarly low transfer of data,
but when it is slow it transfers a massive amount more, and goes
through a *huge* number of iterations
eg I see dirty-sync-count over 1000:
{"return": {"expected-downtime": 221243, "status": "active", "setup-time": 1,
  "total-time": 44028,
  "ram": {"total": 291905536, "postcopy-requests": 0, "dirty-sync-count": 1516,
          "multifd-bytes": 24241675, "pages-per-second": 804571,
          "downtime-bytes": 0, "page-size": 4096, "remaining": 82313216,
          "postcopy-bytes": 0, "mbps": 3.7536507936507939,
          "transferred": 25377710, "dirty-sync-missed-zero-copy": 0,
          "precopy-bytes": 1136035, "duplicate": 124866,
          "dirty-pages-rate": 850637, "skipped": 0,
          "normal-bytes": 156904067072, "normal": 38306657}}}
I suspect that the zstd logic takes a little bit longer in setup,
which often allows the guest dirty workload to get ahead of
it, resulting in a huge amount of data to transfer. Every now and
then the compression code gets ahead of the workload and thus most
data is zeros and skipped.
IMHO this feels like just another example of compression being largely
useless. The CPU overhead of compression can't keep up with the guest
dirty workload, making the supposed network bandwidth saving irrelevant.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:10 ` Daniel P. Berrangé
@ 2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:37 ` Daniel P. Berrangé
2023-06-02 9:42 ` Alex Bennée
2023-06-02 9:24 ` Thomas Huth
2023-06-02 9:25 ` Juan Quintela
2 siblings, 2 replies; 11+ messages in thread
From: Peter Maydell @ 2023-06-02 9:22 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: Juan Quintela, qemu-devel, Richard Henderson
On Fri, 2 Jun 2023 at 10:10, Daniel P. Berrangé <berrange@redhat.com> wrote:
> I suspect that the zstd logic takes a little bit longer in setup,
> which often allows the guest dirty workload to get ahead of
> it, resulting in a huge amount of data to transfer. Every now and
> then the compression code gets ahead of the workload and thus most
> data is zeros and skipped.
>
> IMHO this feels like just another example of compression being largely
> useless. The CPU overhead of compression can't keep up with the guest
> dirty workload, making the supposed network bandwidth saving irrelevant.
It seems a bit surprising if compression can't keep up with
a TCG guest workload, though...
-- PMM
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:22 ` Peter Maydell
@ 2023-06-02 9:37 ` Daniel P. Berrangé
2023-06-02 9:42 ` Alex Bennée
1 sibling, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-06-02 9:37 UTC (permalink / raw)
To: Peter Maydell; +Cc: Juan Quintela, qemu-devel, Richard Henderson
On Fri, Jun 02, 2023 at 10:22:28AM +0100, Peter Maydell wrote:
> On Fri, 2 Jun 2023 at 10:10, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > I suspect that the zstd logic takes a little bit longer in setup,
> > which often allows the guest dirty workload to get ahead of
> > it, resulting in a huge amount of data to transfer. Every now and
> > then the compression code gets ahead of the workload and thus most
> > data is zeros and skipped.
> >
> > IMHO this feels like just another example of compression being largely
> > useless. The CPU overhead of compression can't keep up with the guest
> > dirty workload, making the supposed network bandwidth saving irrelevant.
>
> It seems a bit surprising if compression can't keep up with
> a TCG guest workload, though...
The multifd code seems to be getting slower and slower through the
migration. It peaks at 39 mbps, but degrades down to 4 mbps when I
test it.
I doubt that aarch64 is specifically the problem; rather, it is just
affecting timing in a way that exposes some migration issue.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:37 ` Daniel P. Berrangé
@ 2023-06-02 9:42 ` Alex Bennée
1 sibling, 0 replies; 11+ messages in thread
From: Alex Bennée @ 2023-06-02 9:42 UTC (permalink / raw)
To: Peter Maydell
Cc: Daniel P. Berrangé, Juan Quintela, Richard Henderson,
qemu-devel
Peter Maydell <peter.maydell@linaro.org> writes:
> On Fri, 2 Jun 2023 at 10:10, Daniel P. Berrangé <berrange@redhat.com> wrote:
>> I suspect that the zstd logic takes a little bit longer in setup,
>> which often allows the guest dirty workload to get ahead of
>> it, resulting in a huge amount of data to transfer. Every now and
>> then the compression code gets ahead of the workload and thus most
>> data is zeros and skipped.
>>
>> IMHO this feels like just another example of compression being largely
>> useless. The CPU overhead of compression can't keep up with the guest
>> dirty workload, making the supposed network bandwidth saving irrelevant.
>
> It seems a bit surprising if compression can't keep up with
> a TCG guest workload, though...
Actual running code doesn't get much of a look-in in the perf data:
4.17% CPU 0/TCG qemu-system-aarch64 [.] tlb_set_dirty
3.55% CPU 0/TCG qemu-system-aarch64 [.] helper_ldub_mmu
1.58% live_migration qemu-system-aarch64 [.] buffer_zero_avx2
1.35% CPU 0/TCG qemu-system-aarch64 [.] tlb_set_page_full
1.11% multifdsend_2 libc.so.6 [.] __memmove_avx_unaligned_erms
1.07% multifdsend_13 libc.so.6 [.] __memmove_avx_unaligned_erms
1.07% multifdsend_6 libc.so.6 [.] __memmove_avx_unaligned_erms
1.07% multifdsend_8 libc.so.6 [.] __memmove_avx_unaligned_erms
1.06% multifdsend_10 libc.so.6 [.] __memmove_avx_unaligned_erms
1.06% multifdsend_3 libc.so.6 [.] __memmove_avx_unaligned_erms
1.05% multifdsend_7 libc.so.6 [.] __memmove_avx_unaligned_erms
1.04% multifdsend_11 libc.so.6 [.] __memmove_avx_unaligned_erms
1.04% multifdsend_15 libc.so.6 [.] __memmove_avx_unaligned_erms
1.04% multifdsend_9 libc.so.6 [.] __memmove_avx_unaligned_erms
1.03% multifdsend_1 libc.so.6 [.] __memmove_avx_unaligned_erms
1.03% multifdsend_0 libc.so.6 [.] __memmove_avx_unaligned_erms
1.02% multifdsend_4 libc.so.6 [.] __memmove_avx_unaligned_erms
1.02% multifdsend_14 libc.so.6 [.] __memmove_avx_unaligned_erms
1.02% multifdsend_12 libc.so.6 [.] __memmove_avx_unaligned_erms
1.01% multifdsend_5 libc.so.6 [.] __memmove_avx_unaligned_erms
0.96% multifdrecv_3 libc.so.6 [.] __memmove_avx_unaligned_erms
0.94% multifdrecv_13 libc.so.6 [.] __memmove_avx_unaligned_erms
0.94% multifdrecv_2 libc.so.6 [.] __memmove_avx_unaligned_erms
0.93% multifdrecv_15 libc.so.6 [.] __memmove_avx_unaligned_erms
0.93% multifdrecv_10 libc.so.6 [.] __memmove_avx_unaligned_erms
0.93% multifdrecv_12 libc.so.6 [.] __memmove_avx_unaligned_erms
0.92% multifdrecv_0 libc.so.6 [.] __memmove_avx_unaligned_erms
0.92% multifdrecv_1 libc.so.6 [.] __memmove_avx_unaligned_erms
0.92% multifdrecv_8 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_6 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_7 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_4 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_11 libc.so.6 [.] __memmove_avx_unaligned_erms
0.90% multifdrecv_14 libc.so.6 [.] __memmove_avx_unaligned_erms
0.90% multifdrecv_5 libc.so.6 [.] __memmove_avx_unaligned_erms
0.89% multifdrecv_9 libc.so.6 [.] __memmove_avx_unaligned_erms
0.77% CPU 0/TCG qemu-system-aarch64 [.] cpu_physical_memory_get_dirty.constprop.0
0.59% migration-test [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.55% multifdrecv_12 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.54% multifdrecv_4 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.51% multifdrecv_5 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.51% multifdrecv_14 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.49% multifdrecv_2 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.45% multifdrecv_1 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.45% multifdrecv_9 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.42% multifdrecv_10 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.40% multifdrecv_6 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.40% multifdrecv_3 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.40% multifdrecv_8 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.39% multifdrecv_7 libzstd.so.1.5.4 [.] 0x000000000008ec20
>
> -- PMM
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
@ 2023-06-02 9:24 ` Thomas Huth
2023-06-02 9:34 ` Juan Quintela
2023-06-02 9:25 ` Juan Quintela
2 siblings, 1 reply; 11+ messages in thread
From: Thomas Huth @ 2023-06-02 9:24 UTC (permalink / raw)
To: Daniel P. Berrangé, Juan Quintela
Cc: qemu-devel, peter.maydell, Richard Henderson, Peter Xu
On 02/06/2023 11.10, Daniel P. Berrangé wrote:
...
> IMHO this feels like just another example of compression being largely
> useless. The CPU overhead of compression can't keep up with the guest
> dirty workload, making the supposed network bandwidth saving irrelevant.
Has anybody ever shown that there is a benefit in some use cases with
compression? ... if not, we should maybe deprecate this feature and remove
it in a couple of releases if nobody complains. That would mean less code to
maintain, less testing effort, and likely no disadvantages for the users.
Thomas
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:24 ` Thomas Huth
@ 2023-06-02 9:34 ` Juan Quintela
2023-06-02 9:47 ` Thomas Huth
0 siblings, 1 reply; 11+ messages in thread
From: Juan Quintela @ 2023-06-02 9:34 UTC (permalink / raw)
To: Thomas Huth
Cc: Daniel P. Berrangé, qemu-devel, peter.maydell,
Richard Henderson, Peter Xu
Thomas Huth <thuth@redhat.com> wrote:
> On 02/06/2023 11.10, Daniel P. Berrangé wrote:
> ...
>> IMHO this feels like just another example of compression being largely
>> useless. The CPU overhead of compression can't keep up with the guest
>> dirty workload, making the supposed network bandwidth saving irrelevant.
>
> Has anybody ever shown that there is a benefit in some use cases with
> compression?
see my other reply to Daniel.
Basically, nowadays only migration over a WAN. For everything over a LAN, or
between close-enough LANs, bandwidth is so cheap that it makes no sense to
spend CPU on compression.
> ... if not, we should maybe deprecate this feature and
> remove it in a couple of releases if nobody complains. That would mean
> less code to maintain, less testing effort, and likely no
> disadvantages for the users.
For multifd, I don't care; the amount of code for enabling the feature
is trivial and doesn't interfere with anything else:
(fix-tests-old)$ wc -l migration/multifd-z*
326 migration/multifd-zlib.c
317 migration/multifd-zstd.c
643 total
And that is because we need a lot of boilerplate code to define 6
callbacks.
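(From memory, the hook table a compression method has to fill in looks
roughly like this; the exact signatures may differ:

typedef struct {
    /* send side */
    int  (*send_setup)(MultiFDSendParams *p, Error **errp);
    void (*send_cleanup)(MultiFDSendParams *p, Error **errp);
    int  (*send_prepare)(MultiFDSendParams *p, Error **errp);
    /* receive side */
    int  (*recv_setup)(MultiFDRecvParams *p, Error **errp);
    void (*recv_cleanup)(MultiFDRecvParams *p);
    int  (*recv_pages)(MultiFDRecvParams *p, Error **errp);
} MultiFDMethods;
)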
The compression on precopy is a completely different beast:
- It is *VERY* buggy (none of the races there have been fixed)
- It is *VERY* inefficient:
  copy the page to a compression thread
  the thread compresses the page into a different buffer
  go back to the main thread
  copy the compressed page into the migration stream
And we have to reset the compression dictionary for each page, so
we don't get the benefits of compression.
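In rough pseudo-C (the helper names are invented for illustration, this is
not the real ram.c code), every dirty page goes through something like:

/* pseudo-code; note the extra copies and the per-page dictionary reset */
memcpy(thread->inbuf, page, page_size);          /* copy page to the thread    */
deflateReset(&thread->zstream);                  /* dictionary thrown away     */
len = compress_buffer(&thread->zstream,
                      thread->inbuf, thread->outbuf);
hand_back_to_main_thread(thread->outbuf, len);   /* main thread then copies it
                                                    into the migration stream  */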
So I can't wait for the day when we can remove it.
With respect to multifd compression, the Intel AT data engine (whatever
it is called this week) is able to handle the compression by itself,
i.e. without using the host CPU, so this could be a win, but I haven't
had the time to play with it. There are patches to do this on the list,
but they are for the old compression code, not the multifd one. I
asked the submitter to port them to multifd, but haven't heard from him
yet.
Later, Juan.
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:34 ` Juan Quintela
@ 2023-06-02 9:47 ` Thomas Huth
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Huth @ 2023-06-02 9:47 UTC (permalink / raw)
To: quintela
Cc: Daniel P. Berrangé, qemu-devel, peter.maydell,
Richard Henderson, Peter Xu
On 02/06/2023 11.34, Juan Quintela wrote:
...
> The compression on precopy is a completely different beast:
> - It is *VERY* buggy (none of the races there have been fixed)
> - It is *VERY* inefficient:
>   copy the page to a compression thread
>   the thread compresses the page into a different buffer
>   go back to the main thread
>   copy the compressed page into the migration stream
>
> And we have to reset the compression dictionary for each page, so
> we don't get the benefits of compression.
>
> So I can't wait for the day when we can remove it.
So could you maybe write a patch to add it to the docs/about/deprecated.rst
file?
Thomas
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:24 ` Thomas Huth
@ 2023-06-02 9:25 ` Juan Quintela
2 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2023-06-02 9:25 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: qemu-devel, peter.maydell, Richard Henderson
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>>
>> Hi
>>
>> Before I continue investigating this further, do you have any clue what
>> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
>>
>> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/none
>
>
>> real 0m4.559s
>> user 0m4.898s
>> sys 0m1.156s
>
>> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zlib
>
>> real 0m1.645s
>> user 0m3.484s
>> sys 0m0.512s
>> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zstd
>
>> real 0m48.022s
>> user 8m17.306s
>> sys 0m35.217s
>>
>>
>> This test is very amenable to compression: we only modify one byte in
>> each page, so essentially all the pages are the same.
>>
>> no compression: 4.5 seconds
>> zlib compression: 1.6 seconds (within what I would expect)
>> zstd compression: 48 seconds, what is going on here?
>
> This is non-deterministic. I've seen *all* three cases complete in approx
> 1 second each. If I set 'QTEST_LOG=1', then very often the zstd test will
> complete in < 1 second.
Not in my case.
/me goes and checks again.
Lo and behold, today it doesn't fail.
Notice that I am running qemu-system-aarch64 on an x86_64 host.
Yesterday I was unable to reproduce it with KVM x86_64 on an x86_64 host.
But for aarch64 I reproduced it like 20 times in a row; that is why I
decided to send this email.
In all the other cases, it behaves as expected. This is one of the few
cases where compression is way better than no compression. But
remember that compression uses dictionaries, and for each page it sends
the equivalent of:
1st page: create a dictionary, something that represents 1 byte with
value X and TARGET_PAGE_SIZE - 1 zeros.
Next 63 pages in the packet: reuse the previous dictionary 63 times.
I investigated this when I created multifd-zlib because the size of the
packet that described the content of 64 pages was ridiculously small,
something like 4-8 bytes (yes, I don't remember exactly, but it was way,
way less than 1 byte per page).
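You can see the effect with plain zlib outside QEMU. This toy program (an
illustration only, not QEMU's multifd-zlib code) compresses 64 pages that
each differ in a single byte, once keeping the dictionary across the pages
of a packet and once from scratch per page:

/* Build with: cc zlib-demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAGE   4096
#define NPAGES 64

int main(void)
{
    static unsigned char page[PAGE], out[2 * PAGE];
    z_stream zs;
    unsigned long streamed = 0, independent = 0;

    memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_DEFAULT_COMPRESSION);

    for (int i = 0; i < NPAGES; i++) {
        page[0] = (unsigned char)i;       /* one byte differs per page */

        /* multifd-zlib style: one stream, dictionary kept across pages */
        zs.next_in = page;
        zs.avail_in = PAGE;
        zs.next_out = out;
        zs.avail_out = sizeof(out);
        deflate(&zs, Z_SYNC_FLUSH);
        streamed += sizeof(out) - zs.avail_out;

        /* old precopy style: every page compressed from scratch */
        unsigned long n = sizeof(out);
        compress2(out, &n, page, PAGE, Z_DEFAULT_COMPRESSION);
        independent += n;
    }
    deflateEnd(&zs);

    printf("dictionary kept: %lu bytes, per-page: %lu bytes for %d pages\n",
           streamed, independent, NPAGES);
    return 0;
}

The dictionary-carrying total should come out far smaller, which is the
effect described above.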
> I notice the multifd tests are not sharing the setup logic with the
> precopy tests, so they have not set any migration bandwidth limit.
> IOW migration is running at full speed.
Aha.
> What is happening is that the migration is running so fast that the guest
> workload hasn't had the chance to dirty any memory, so 'none' and 'zlib'
> tests only copy about 15-30 MB of data, the rest is still all zeroes.
>
> When it is fast, the zstd test also has a similarly low transfer of data,
> but when it is slow it transfers a massive amount more, and goes
> through a *huge* number of iterations
>
> eg I see dirty-sync-count over 1000:
Aha, will try to print that info.
> {"return": {"expected-downtime": 221243, "status": "active",
> "setup-time": 1, "total-time": 44028, "ram": {"total": 291905536,
> "postcopy-requests": 0, "dirty-sync-count": 1516, "multifd-bytes":
> 24241675, "pages-per-second": 804571, "downtime-bytes": 0,
> "page-size": 4096, "remaining": 82313216, "postcopy-bytes": 0, "mbps":
> 3.7536507936507939, "transferred": 25377710,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 1136035,
> "duplicate": 124866, "dirty-pages-rate": 850637, "skipped": 0,
> "normal-bytes": 156904067072, "normal": 38306657}}}
>
>
> I suspect that the zstd logic takes a little bit longer in setup,
> which often allows the guest dirty workload to get ahead of
> it, resulting in a huge amount of data to transfer. Every now and
> then the compression code gets ahead of the workload and thus most
> data is zeros and skipped.
That makes sense. I think that the other problem I am having these
days is that I am loading my machine a lot (basically running
make check on both branches at the same time), and that makes this much
more likely to happen.
> IMHO this feels like just another example of compression being largely
> useless. The CPU overhead of compression can't keep up with the guest
> dirty workload, making the supposed network bandwidth saving irrelevant.
I will not say that it makes it useless. But I have been saying for quite a
long time that compression and xbzrle only make sense if you are
migrating between datacenters. For anything on the same switch, or
that only needs a couple of hops within the same datacenter, it makes no
sense.
Later, Juan.
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-01 21:06 Big TCG slowdown when using zstd with aarch64 Juan Quintela
2023-06-02 9:10 ` Daniel P. Berrangé
@ 2023-06-02 10:14 ` Daniel P. Berrangé
2023-06-02 10:41 ` Juan Quintela
1 sibling, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-06-02 10:14 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel, peter.maydell, Richard Henderson
On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>
> Hi
>
> Before I continue investigating this further, do you have any clue what
> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
FYI, the trigger for this behaviour appears to be your recent change
to stats accounting in:
commit cbec7eb76879d419e7dbf531ee2506ec0722e825 (HEAD)
Author: Juan Quintela <quintela@redhat.com>
Date: Mon May 15 21:57:09 2023 +0200
migration/multifd: Compute transferred bytes correctly
In the past, we had to put the in the main thread all the operations
related with sizes due to qemu_file not beeing thread safe. As now
all counters are atomic, we can update the counters just after the
do the write. As an aditional bonus, we are able to use the right
value for the compression methods. Right now we were assuming that
there were no compression at all.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <20230515195709.63843-17-quintela@redhat.com>
Before that commit the /aarch64/migration/multifd/tcp/plain/{none,zlib,zstd}
tests all took 21 seconds each.
After that commit the 'none' test takes about 3 seconds, and the zlib/zstd
tests take about 1 second, except when zstd is suddenly very slow.
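For context, the effect of that commit is that each multifd send thread now
adds the bytes it actually wrote (i.e. the compressed size) to the atomic
counters right after its own write, instead of the main thread accounting
for an assumed uncompressed size. Conceptually (a sketch with an
illustrative counter name, not the literal diff):

/* in the multifd send thread, right after the socket write */
size_t sent = p->packet_len + p->next_packet_size;   /* bytes actually written */
qatomic_add(&multifd_bytes_counter, sent);           /* illustrative counter   */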
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 10:14 ` Daniel P. Berrangé
@ 2023-06-02 10:41 ` Juan Quintela
0 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2023-06-02 10:41 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: qemu-devel, peter.maydell, Richard Henderson
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>>
>> Hi
>>
>> Before I continue investigating this further, do you have any clue what
>> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
>
> FYI, the trigger for this behaviour appears to be your recent change
> to stats accounting in:
>
> commit cbec7eb76879d419e7dbf531ee2506ec0722e825 (HEAD)
> Author: Juan Quintela <quintela@redhat.com>
> Date: Mon May 15 21:57:09 2023 +0200
>
> migration/multifd: Compute transferred bytes correctly
>
> In the past, we had to put the in the main thread all the operations
> related with sizes due to qemu_file not beeing thread safe. As now
> all counters are atomic, we can update the counters just after the
> do the write. As an aditional bonus, we are able to use the right
> value for the compression methods. Right now we were assuming that
> there were no compression at all.
>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Message-Id: <20230515195709.63843-17-quintela@redhat.com>
>
>
>
> Before that commit the /aarch64/migration/multifd/tcp/plain/{none,zlib,zstd}
> tests all took 21 seconds each.
>
> After that commit the 'none' test takes about 3 seconds, and the zlib/zstd
> tests take about 1 second, except when zstd is suddenly very slow.
The slowdown was reported by Fiona.
This series removes the slowdown (it is an intermediate state while I
switch from one counter to another):
Subject: [PATCH v2 00/20] Next round of migration atomic counters
But to integrate it I have to fix the RDMA issues that you pointed out
yesterday and get the series reviewed (Hint, Hint).
I will try to get the RDMA bits fixed during the day.
Thanks for the report, Juan.
Thread overview: 11+ messages
2023-06-01 21:06 Big TCG slowdown when using zstd with aarch64 Juan Quintela
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:37 ` Daniel P. Berrangé
2023-06-02 9:42 ` Alex Bennée
2023-06-02 9:24 ` Thomas Huth
2023-06-02 9:34 ` Juan Quintela
2023-06-02 9:47 ` Thomas Huth
2023-06-02 9:25 ` Juan Quintela
2023-06-02 10:14 ` Daniel P. Berrangé
2023-06-02 10:41 ` Juan Quintela