* Big TCG slowdown when using zstd with aarch64
@ 2023-06-01 21:06 Juan Quintela
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 10:14 ` Daniel P. Berrangé
0 siblings, 2 replies; 11+ messages in thread
From: Juan Quintela @ 2023-06-01 21:06 UTC (permalink / raw)
To: qemu-devel, peter.maydell, Richard Henderson, Daniel Berrange
Hi
Before I continue investigating this further, do you have any clue what
is going on here? I am running qemu-system-aarch64 on an x86_64 host.
$ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/none
TAP version 13
# random seed: R02S3d50a0e874b28727af4b862a3cc4214e
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888203.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888203.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-WT9151/src_serial -cpu max -kernel /tmp/migration-test-WT9151/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888203.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888203.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-WT9151/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-WT9151/bootsect -accel qtest
ok 1 /aarch64/migration/multifd/tcp/plain/none
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of aarch64 tests
1..1
real 0m4.559s
user 0m4.898s
sys 0m1.156s
$ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zlib
TAP version 13
# random seed: R02S014dd197350726bdd95aea37b81d3898
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888278.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888278.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-25U151/src_serial -cpu max -kernel /tmp/migration-test-25U151/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888278.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888278.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-25U151/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-25U151/bootsect -accel qtest
ok 1 /aarch64/migration/multifd/tcp/plain/zlib
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of aarch64 tests
1..1
real 0m1.645s
user 0m3.484s
sys 0m0.512s
$ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zstd
TAP version 13
# random seed: R02Se49afe2ea9d2b76a1eda1fa2bc8d812c
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888353.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888353.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-UILY51/src_serial -cpu max -kernel /tmp/migration-test-UILY51/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2888353.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2888353.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-UILY51/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-UILY51/bootsect -accel qtest
ok 1 /aarch64/migration/multifd/tcp/plain/zstd
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of aarch64 tests
1..1
real 0m48.022s
user 8m17.306s
sys 0m35.217s
This test is very amenable to compression: we only modify one byte in
each page, so essentially all the pages are the same.
no compression: 4.5 seconds
zlib compression: 1.6 seconds (within what I would expect)
zstd compression: 48 seconds, what is going on here?
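For reference, the guest workload here is roughly equivalent to the C below
(a sketch only; the real bootsect is a tiny assembly kernel and the region
size is illustrative):

/* Sketch of the migration-test guest workload, not the real bootsect code. */
#include <stdint.h>

#define PAGE_SIZE 4096
#define TEST_MEM  (100 * 1024 * 1024)   /* illustrative size */

static void dirty_loop(volatile uint8_t *base)
{
    for (;;) {
        /* Dirty every page, but only change a single byte in each one,
         * so the page contents stay almost identical and very compressible. */
        for (uint64_t off = 0; off < TEST_MEM; off += PAGE_SIZE) {
            base[off]++;
        }
        /* the real guest also writes a byte to the serial port per pass */
    }
}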
As a comparison, these are the times for x86_64 running natively; the values
are much more reasonable.
$ time ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/none
TAP version 13
# random seed: R02S579fbe8739386c3a3336486f2adbfecd
# Start of x86_64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002254.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002254.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-KA6Z51/src_serial -drive file=/tmp/migration-test-KA6Z51/bootsect,format=raw -accel qtest
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002254.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002254.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-KA6Z51/dest_serial -incoming defer -drive file=/tmp/migration-test-KA6Z51/bootsect,format=raw -accel qtest
ok 1 /x86_64/migration/multifd/tcp/plain/none
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of x86_64 tests
1..1
real 0m3.889s
user 0m4.264s
sys 0m1.295s
$ time ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/zlib
TAP version 13
# random seed: R02S968738d716d2c0dc8c8279716ff3dd9a
# Start of x86_64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002385.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002385.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-9JTZ51/src_serial -drive file=/tmp/migration-test-9JTZ51/bootsect,format=raw -accel qtest
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3002385.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3002385.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-9JTZ51/dest_serial -incoming defer -drive file=/tmp/migration-test-9JTZ51/bootsect,format=raw -accel qtest
ok 1 /x86_64/migration/multifd/tcp/plain/zlib
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of x86_64 tests
1..1
real 0m1.464s
user 0m2.868s
sys 0m0.534s
$ time ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/zstd
TAP version 13
# random seed: R02Sba4a923c284ad824bc82fd488044a5df
# Start of x86_64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3006857.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3006857.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-ALK251/src_serial -drive file=/tmp/migration-test-ALK251/bootsect,format=raw -accel qtest
# starting QEMU: exec ./qemu-system-x86_64 -qtest unix:/tmp/qtest-3006857.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-3006857.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-ALK251/dest_serial -incoming defer -drive file=/tmp/migration-test-ALK251/bootsect,format=raw -accel qtest
ok 1 /x86_64/migration/multifd/tcp/plain/zstd
# End of plain tests
# End of tcp tests
# End of multifd tests
# End of migration tests
# End of x86_64 tests
1..1
real 0m1.298s
user 0m2.540s
sys 0m0.662s
3.88, 1.46 and 1.29 seconds: what I would have expected.
And if you ask why this is so important: at 48 seconds we are very
near the test timeout limit. If I run 2 or more migration tests at the
same time:
# random seed: R02Sfb0b65ab5484a997057ef94daed7072f
# Start of aarch64 tests
# Start of migration tests
# Start of multifd tests
# Start of tcp tests
# Start of plain tests
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2754383.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2754383.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-L93051/src_serial -cpu max -kernel /tmp/migration-test-L93051/bootsect -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest unix:/tmp/qtest-2754383.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-2754383.qmp,id=char0 -mon chardev=char0,mode=control -display none -net none -accel kvm -accel tcg -machine virt,gic-version=max -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-L93051/dest_serial -incoming defer -cpu max -kernel /tmp/migration-test-L93051/bootsect -accel qtest
**
ERROR:../../../../mnt/code/qemu/multifd/tests/qtest/migration-helpers.c:143:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT)
not ok /aarch64/migration/multifd/tcp/plain/zstd - ERROR:../../../../mnt/code/qemu/multifd/tests/qtest/migration-helpers.c:143:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT)
Bail out!
qemu-system-aarch64: multifd_send_pages: channel 0 has already quit!
qemu-system-aarch64: Unable to write to socket: Connection reset by peer
Aborted (core dumped)
real 2m0.928s
user 16m15.671s
sys 1m11.431s
Later, Juan.
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-01 21:06 Big TCG slowdown when using zstd with aarch64 Juan Quintela
@ 2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
` (2 more replies)
2023-06-02 10:14 ` Daniel P. Berrangé
1 sibling, 3 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-06-02 9:10 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel, peter.maydell, Richard Henderson
On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>
> Hi
>
> Before I continue investigating this further, do you have any clue what
> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
>
> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/none
> real 0m4.559s
> user 0m4.898s
> sys 0m1.156s
> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zlib
> real 0m1.645s
> user 0m3.484s
> sys 0m0.512s
> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zstd
> real 0m48.022s
> user 8m17.306s
> sys 0m35.217s
>
>
> This test is very amenable to compression: we only modify one byte in
> each page, so essentially all the pages are the same.
>
> no compression: 4.5 seconds
> zlib compression: 1.6 seconds (within what I would expect)
> zstd compression: 48 seconds, what is going on here?
This is non-deterministic. I've seen *all* three cases complete in approx
1 second each. If I set 'QTEST_LOG=1', then very often the zstd test will
complete in < 1 second.
I notice the multifd tests are not sharing the setup logic with the
precopy tests, so they have not set any migration bandwidth limit.
IOW migration is running at full speed.
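(For comparison, the precopy tests throttle the source before starting the
migration; over QMP that is something along these lines, with a purely
illustrative value:

  { "execute": "migrate-set-parameters",
    "arguments": { "max-bandwidth": 104857600 } }
)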
What is happening is that the migration is running so fast that the guest
workload hasn't had the chance to dirty any memory, so 'none' and 'zlib'
tests only copy about 15-30 MB of data, the rest is still all zeroes.
When it is fast, the zstd test also has a similarly low transfer of data,
but when it is slow it transfers a massive amount more, and goes
through a *huge* number of iterations
eg I see dirty-sync-count over 1000:
{"return": {"expected-downtime": 221243, "status": "active", "setup-time": 1,
  "total-time": 44028,
  "ram": {"total": 291905536, "postcopy-requests": 0, "dirty-sync-count": 1516,
          "multifd-bytes": 24241675, "pages-per-second": 804571,
          "downtime-bytes": 0, "page-size": 4096, "remaining": 82313216,
          "postcopy-bytes": 0, "mbps": 3.7536507936507939,
          "transferred": 25377710, "dirty-sync-missed-zero-copy": 0,
          "precopy-bytes": 1136035, "duplicate": 124866,
          "dirty-pages-rate": 850637, "skipped": 0,
          "normal-bytes": 156904067072, "normal": 38306657}}}
I suspect that the zstd logic takes a little bit longer in setup,
which often allows the guest dirty workload to get ahead of
it, resulting in a huge amount of data to transfer. Every now and
then the compression code gets ahead of the workload and thus most
data is zeros and skipped.
IMHO this feels like just another example of compression being largely
useless. The CPU overhead of compression can't keep up with the guest
dirty workload, making the supposed network bandwidth saving irrelevant.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:10 ` Daniel P. Berrangé
@ 2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:37 ` Daniel P. Berrangé
2023-06-02 9:42 ` Alex Bennée
2023-06-02 9:24 ` Thomas Huth
2023-06-02 9:25 ` Juan Quintela
2 siblings, 2 replies; 11+ messages in thread
From: Peter Maydell @ 2023-06-02 9:22 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: Juan Quintela, qemu-devel, Richard Henderson
On Fri, 2 Jun 2023 at 10:10, Daniel P. Berrangé <berrange@redhat.com> wrote:
> I suspect that the zstd logic takes a little bit longer in setup,
> which often allows the guest dirty workload to get ahead of
> it, resulting in a huge amount of data to transfer. Every now and
> then the compression code gets ahead of the workload and thus most
> data is zeros and skipped.
>
> IMHO this feels like just another example of compression being largely
> useless. The CPU overhead of compression can't keep up with the guest
> dirty workload, making the supposed network bandwidth saving irrelevant.
It seems a bit surprising if compression can't keep up with
a TCG guest workload, though...
-- PMM
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:22 ` Peter Maydell
@ 2023-06-02 9:37 ` Daniel P. Berrangé
2023-06-02 9:42 ` Alex Bennée
1 sibling, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-06-02 9:37 UTC (permalink / raw)
To: Peter Maydell; +Cc: Juan Quintela, qemu-devel, Richard Henderson
On Fri, Jun 02, 2023 at 10:22:28AM +0100, Peter Maydell wrote:
> On Fri, 2 Jun 2023 at 10:10, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > I suspect that the zstd logic takes a little bit longer in setup,
> > which often allows the guest dirty workload to get ahead of
> > it, resulting in a huge amount of data to transfer. Every now and
> > then the compression code gets ahead of the workload and thus most
> > data is zeros and skipped.
> >
> > IMHO this feels like just another example of compression being largely
> > useless. The CPU overhead of compression can't keep up with the guest
> > dirty workload, making the supposed network bandwidth saving irrelevant.
>
> It seems a bit surprising if compression can't keep up with
> a TCG guest workload, though...
The multifd code seems to be getting slower and slower through the
migration. It peaks at 39 mbps, but degrades down to 4 mbps when I
test it.
I doubt that aarch64 is specifically the problem; rather, it is just
affecting timing in a way that exposes some migration issue.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:37 ` Daniel P. Berrangé
@ 2023-06-02 9:42 ` Alex Bennée
1 sibling, 0 replies; 11+ messages in thread
From: Alex Bennée @ 2023-06-02 9:42 UTC (permalink / raw)
To: Peter Maydell
Cc: Daniel P. Berrangé, Juan Quintela, Richard Henderson,
qemu-devel
Peter Maydell <peter.maydell@linaro.org> writes:
> On Fri, 2 Jun 2023 at 10:10, Daniel P. Berrangé <berrange@redhat.com> wrote:
>> I suspect that the zstd logic takes a little bit longer in setup,
>> which often allows the guest dirty workload to get ahead of
>> it, resulting in a huge amount of data to transfer. Every now and
>> then the compression code gets ahead of the workload and thus most
>> data is zeros and skipped.
>>
>> IMHO this feels like just another example of compression being largely
>> useless. The CPU overhead of compression can't keep up with the guest
>> dirty workload, making the supposed network bandwidth saving irrelevant.
>
> It seems a bit surprising if compression can't keep up with
> a TCG guest workload, though...
Actual running code doesn't get much of a look-in in the perf data:
4.17% CPU 0/TCG qemu-system-aarch64 [.] tlb_set_dirty
3.55% CPU 0/TCG qemu-system-aarch64 [.] helper_ldub_mmu
1.58% live_migration qemu-system-aarch64 [.] buffer_zero_avx2
1.35% CPU 0/TCG qemu-system-aarch64 [.] tlb_set_page_full
1.11% multifdsend_2 libc.so.6 [.] __memmove_avx_unaligned_erms
1.07% multifdsend_13 libc.so.6 [.] __memmove_avx_unaligned_erms
1.07% multifdsend_6 libc.so.6 [.] __memmove_avx_unaligned_erms
1.07% multifdsend_8 libc.so.6 [.] __memmove_avx_unaligned_erms
1.06% multifdsend_10 libc.so.6 [.] __memmove_avx_unaligned_erms
1.06% multifdsend_3 libc.so.6 [.] __memmove_avx_unaligned_erms
1.05% multifdsend_7 libc.so.6 [.] __memmove_avx_unaligned_erms
1.04% multifdsend_11 libc.so.6 [.] __memmove_avx_unaligned_erms
1.04% multifdsend_15 libc.so.6 [.] __memmove_avx_unaligned_erms
1.04% multifdsend_9 libc.so.6 [.] __memmove_avx_unaligned_erms
1.03% multifdsend_1 libc.so.6 [.] __memmove_avx_unaligned_erms
1.03% multifdsend_0 libc.so.6 [.] __memmove_avx_unaligned_erms
1.02% multifdsend_4 libc.so.6 [.] __memmove_avx_unaligned_erms
1.02% multifdsend_14 libc.so.6 [.] __memmove_avx_unaligned_erms
1.02% multifdsend_12 libc.so.6 [.] __memmove_avx_unaligned_erms
1.01% multifdsend_5 libc.so.6 [.] __memmove_avx_unaligned_erms
0.96% multifdrecv_3 libc.so.6 [.] __memmove_avx_unaligned_erms
0.94% multifdrecv_13 libc.so.6 [.] __memmove_avx_unaligned_erms
0.94% multifdrecv_2 libc.so.6 [.] __memmove_avx_unaligned_erms
0.93% multifdrecv_15 libc.so.6 [.] __memmove_avx_unaligned_erms
0.93% multifdrecv_10 libc.so.6 [.] __memmove_avx_unaligned_erms
0.93% multifdrecv_12 libc.so.6 [.] __memmove_avx_unaligned_erms
0.92% multifdrecv_0 libc.so.6 [.] __memmove_avx_unaligned_erms
0.92% multifdrecv_1 libc.so.6 [.] __memmove_avx_unaligned_erms
0.92% multifdrecv_8 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_6 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_7 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_4 libc.so.6 [.] __memmove_avx_unaligned_erms
0.91% multifdrecv_11 libc.so.6 [.] __memmove_avx_unaligned_erms
0.90% multifdrecv_14 libc.so.6 [.] __memmove_avx_unaligned_erms
0.90% multifdrecv_5 libc.so.6 [.] __memmove_avx_unaligned_erms
0.89% multifdrecv_9 libc.so.6 [.] __memmove_avx_unaligned_erms
0.77% CPU 0/TCG qemu-system-aarch64 [.] cpu_physical_memory_get_dirty.constprop.0
0.59% migration-test [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.55% multifdrecv_12 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.54% multifdrecv_4 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.51% multifdrecv_5 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.51% multifdrecv_14 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.49% multifdrecv_2 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.45% multifdrecv_1 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.45% multifdrecv_9 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.42% multifdrecv_10 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.40% multifdrecv_6 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.40% multifdrecv_3 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.40% multifdrecv_8 libzstd.so.1.5.4 [.] 0x000000000008ec20
0.39% multifdrecv_7 libzstd.so.1.5.4 [.] 0x000000000008ec20
>
> -- PMM
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
@ 2023-06-02 9:24 ` Thomas Huth
2023-06-02 9:34 ` Juan Quintela
2023-06-02 9:25 ` Juan Quintela
2 siblings, 1 reply; 11+ messages in thread
From: Thomas Huth @ 2023-06-02 9:24 UTC (permalink / raw)
To: Daniel P. Berrangé, Juan Quintela
Cc: qemu-devel, peter.maydell, Richard Henderson, Peter Xu
On 02/06/2023 11.10, Daniel P. Berrangé wrote:
...
> IMHO this feels like just another example of compression being largely
> useless. The CPU overhead of compression can't keep up with the guest
> dirty workload, making the supposed network bandwidth saving irrelevant.
Has anybody ever shown that there is a benefit in some use cases with
compression? ... if not, we should maybe deprecate this feature and remove
it in a couple of releases if nobody complains. That would mean less code to
maintain, less testing effort, and likely no disadvantages for the users.
Thomas
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:24 ` Thomas Huth
@ 2023-06-02 9:34 ` Juan Quintela
2023-06-02 9:47 ` Thomas Huth
0 siblings, 1 reply; 11+ messages in thread
From: Juan Quintela @ 2023-06-02 9:34 UTC (permalink / raw)
To: Thomas Huth
Cc: Daniel P. Berrangé, qemu-devel, peter.maydell,
Richard Henderson, Peter Xu
Thomas Huth <thuth@redhat.com> wrote:
> On 02/06/2023 11.10, Daniel P. Berrangé wrote:
> ...
>> IMHO this feels like just another example of compression being largely
>> useless. The CPU overhead of compression can't keep up with the guest
>> dirty workload, making the supposed network bandwidth saving irrelevant.
>
> Has anybody ever shown that there is a benefit in some use cases with
> compression?
see my other reply to Daniel.
Basically, nowadays only migration over a WAN. For everything over a LAN, or
between close-enough LANs, bandwidth is so cheap that it makes no sense to
spend CPU on compression.
> ... if not, we should maybe deprecate this feature and
> remove it in a couple of releases if nobody complains. That would mean
> less code to maintain, less testing effort, and likely no
> disadvantages for the users.
For multifd, I don't care; the amount of code for enabling the feature
is trivial and doesn't interfere with anything else:
(fix-tests-old)$ wc -l migration/multifd-z*
326 migration/multifd-zlib.c
317 migration/multifd-zstd.c
643 total
And that is because we need a lot of boilerplate code to define 6
callbacks.
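(From memory, the hook table a compression method has to fill in looks
roughly like this; the exact signatures may differ:

typedef struct {
    /* send side */
    int  (*send_setup)(MultiFDSendParams *p, Error **errp);
    void (*send_cleanup)(MultiFDSendParams *p, Error **errp);
    int  (*send_prepare)(MultiFDSendParams *p, Error **errp);
    /* receive side */
    int  (*recv_setup)(MultiFDRecvParams *p, Error **errp);
    void (*recv_cleanup)(MultiFDRecvParams *p);
    int  (*recv_pages)(MultiFDRecvParams *p, Error **errp);
} MultiFDMethods;
)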
The compression on precopy is a completely different beast:
- It is *VERY* buggy (none of the races there have been fixed)
- It is *VERY* inefficient:
  copy the page to a compression thread
  the thread compresses the page into a different buffer
  go back to the main thread
  copy the compressed page into the migration stream
And we have to reset the compression dictionary for each page, so
we don't get the benefits of compression.
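In rough pseudo-C (the helper names are invented for illustration, this is
not the real ram.c code), every dirty page goes through something like:

/* pseudo-code; note the extra copies and the per-page dictionary reset */
memcpy(thread->inbuf, page, page_size);          /* copy page to the thread    */
deflateReset(&thread->zstream);                  /* dictionary thrown away     */
len = compress_buffer(&thread->zstream,
                      thread->inbuf, thread->outbuf);
hand_back_to_main_thread(thread->outbuf, len);   /* main thread then copies it
                                                    into the migration stream  */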
So I can't wait for the day when we can remove it.
With respect to multifd compression, the Intel AT data engine (whatever
it is called this week) is able to handle the compression by itself,
i.e. without using the host CPU, so this could be a win, but I haven't
had the time to play with it. There are patches to do this on the list,
but they are for the old compression code, not the multifd one. I
asked the submitter to port them to multifd, but haven't heard from him
yet.
Later, Juan.
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:34 ` Juan Quintela
@ 2023-06-02 9:47 ` Thomas Huth
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Huth @ 2023-06-02 9:47 UTC (permalink / raw)
To: quintela
Cc: Daniel P. Berrangé, qemu-devel, peter.maydell,
Richard Henderson, Peter Xu
On 02/06/2023 11.34, Juan Quintela wrote:
...
> The compression on precopy is a completely different beast:
> - It is *VERY* buggy (none of the races there have been fixed)
> - It is *VERY* inefficient:
>   copy the page to a compression thread
>   the thread compresses the page into a different buffer
>   go back to the main thread
>   copy the compressed page into the migration stream
>
> And we have to reset the compression dictionary for each page, so
> we don't get the benefits of compression.
>
> So I can't wait for the day when we can remove it.
So could you maybe write a patch to add it to the docs/about/deprecated.rst
file?
Thomas
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:24 ` Thomas Huth
@ 2023-06-02 9:25 ` Juan Quintela
2 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2023-06-02 9:25 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: qemu-devel, peter.maydell, Richard Henderson
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>>
>> Hi
>>
>> Before I continue investigating this further, do you have any clue what
>> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
>>
>> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/none
>
>
>> real 0m4.559s
>> user 0m4.898s
>> sys 0m1.156s
>
>> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zlib
>
>> real 0m1.645s
>> user 0m3.484s
>> sys 0m0.512s
>> $ time ./tests/qtest/migration-test -p /aarch64/migration/multifd/tcp/plain/zstd
>
>> real 0m48.022s
>> user 8m17.306s
>> sys 0m35.217s
>>
>>
>> This test is very amenable to compression: we only modify one byte in
>> each page, so essentially all the pages are the same.
>>
>> no compression: 4.5 seconds
>> zlib compression: 1.6 seconds (within what I would expect)
>> zstd compression: 48 seconds, what is going on here?
>
> This is non-deterministic. I've seen *all* three cases complete in approx
> 1 second each. If I set 'QTEST_LOG=1', then very often the zstd test will
> complete in < 1 second.
Not in my case.
/me goes and checks again.
Lo and behold, today it doesn't fail.
Notice that I am running qemu-system-aarch64 on an x86_64 host.
Yesterday I was unable to reproduce it with KVM x86_64 on an x86_64 host.
But for aarch64 I reproduced it like 20 times in a row; that is why I
decided to send this email.
In all the other cases, it behaves as expected. This is one of the few
cases where compression is way better than no compression. But
remember that compression uses dictionaries, and for each page it sends
the equivalent of:
1st page: create a dictionary, something that represents 1 byte with
value X and TARGET_PAGE_SIZE - 1 zeros.
Next 63 pages in the packet: reuse the previous dictionary 63 times.
I investigated this when I created multifd-zlib because the size of the
packet that described the content of 64 pages was ridiculously small,
something like 4-8 bytes (yes, I don't remember exactly, but it was way,
way less than 1 byte per page).
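You can see the effect with plain zlib outside QEMU. This toy program (an
illustration only, not QEMU's multifd-zlib code) compresses 64 pages that
each differ in a single byte, once keeping the dictionary across the pages
of a packet and once from scratch per page:

/* Build with: cc zlib-demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define PAGE   4096
#define NPAGES 64

int main(void)
{
    static unsigned char page[PAGE], out[2 * PAGE];
    z_stream zs;
    unsigned long streamed = 0, independent = 0;

    memset(&zs, 0, sizeof(zs));
    deflateInit(&zs, Z_DEFAULT_COMPRESSION);

    for (int i = 0; i < NPAGES; i++) {
        page[0] = (unsigned char)i;       /* one byte differs per page */

        /* multifd-zlib style: one stream, dictionary kept across pages */
        zs.next_in = page;
        zs.avail_in = PAGE;
        zs.next_out = out;
        zs.avail_out = sizeof(out);
        deflate(&zs, Z_SYNC_FLUSH);
        streamed += sizeof(out) - zs.avail_out;

        /* old precopy style: every page compressed from scratch */
        unsigned long n = sizeof(out);
        compress2(out, &n, page, PAGE, Z_DEFAULT_COMPRESSION);
        independent += n;
    }
    deflateEnd(&zs);

    printf("dictionary kept: %lu bytes, per-page: %lu bytes for %d pages\n",
           streamed, independent, NPAGES);
    return 0;
}

The dictionary-carrying total should come out far smaller, which is the
effect described above.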
> I notice the multifd tests are not sharing the setup logic with the
> precopy tests, so they have not set any migration bandwidth limit.
> IOW migration is running at full speed.
Aha.
> What is happening is that the migration is running so fast that the guest
> workload hasn't had the chance to dirty any memory, so 'none' and 'zlib'
> tests only copy about 15-30 MB of data, the rest is still all zeroes.
>
> When it is fast, the zstd test also has a similarly low transfer of data,
> but when it is slow it transfers a massive amount more, and goes
> through a *huge* number of iterations
>
> eg I see dirty-sync-count over 1000:
Aha, will try to print that info.
> {"return": {"expected-downtime": 221243, "status": "active",
> "setup-time": 1, "total-time": 44028, "ram": {"total": 291905536,
> "postcopy-requests": 0, "dirty-sync-count": 1516, "multifd-bytes":
> 24241675, "pages-per-second": 804571, "downtime-bytes": 0,
> "page-size": 4096, "remaining": 82313216, "postcopy-bytes": 0, "mbps":
> 3.7536507936507939, "transferred": 25377710,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 1136035,
> "duplicate": 124866, "dirty-pages-rate": 850637, "skipped": 0,
> "normal-bytes": 156904067072, "normal": 38306657}}}
>
>
> I suspect that the zstd logic takes a little bit longer in setup,
> which often allows the guest dirty workload to get ahead of
> it, resulting in a huge amount of data to transfer. Every now and
> then the compression code gets ahead of the workload and thus most
> data is zeros and skipped.
That makes sense. I think that the other problem I am having these
days is that I am loading my machine a lot (basically running
make check on both branches at the same time), and that makes this much
more likely to happen.
> IMHO this feels like just another example of compression being largely
> useless. The CPU overhead of compression can't keep up with the guest
> dirty workload, making the supposed network bandwidth saving irrelevant.
I will not say that it makes it useless. But I have been saying for quite a
long time that compression and xbzrle only make sense if you are
migrating between datacenters. For anything on the same switch, or
that only needs a couple of hops within the same datacenter, it makes no
sense.
Later, Juan.
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-01 21:06 Big TCG slowdown when using zstd with aarch64 Juan Quintela
2023-06-02 9:10 ` Daniel P. Berrangé
@ 2023-06-02 10:14 ` Daniel P. Berrangé
2023-06-02 10:41 ` Juan Quintela
1 sibling, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2023-06-02 10:14 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel, peter.maydell, Richard Henderson
On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>
> Hi
>
> Before I continue investigating this further, do you have any clue what
> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
FYI, the trigger for this behaviour appears to be your recent change
to stats accounting in:
commit cbec7eb76879d419e7dbf531ee2506ec0722e825 (HEAD)
Author: Juan Quintela <quintela@redhat.com>
Date: Mon May 15 21:57:09 2023 +0200
migration/multifd: Compute transferred bytes correctly
In the past, we had to put the in the main thread all the operations
related with sizes due to qemu_file not beeing thread safe. As now
all counters are atomic, we can update the counters just after the
do the write. As an aditional bonus, we are able to use the right
value for the compression methods. Right now we were assuming that
there were no compression at all.
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <20230515195709.63843-17-quintela@redhat.com>
Before that commit the /aarch64/migration/multifd/tcp/plain/{none,zlib,zstd}
tests all took 21 seconds each.
After that commit the 'none' test takes about 3 seconds, and the zlib/zstd
tests take about 1 second, except when zstd is suddenly very slow.
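For context, the effect of that commit is that each multifd send thread now
adds the bytes it actually wrote (i.e. the compressed size) to the atomic
counters right after its own write, instead of the main thread accounting
for an assumed uncompressed size. Conceptually (a sketch with an
illustrative counter name, not the literal diff):

/* in the multifd send thread, right after the socket write */
size_t sent = p->packet_len + p->next_packet_size;   /* bytes actually written */
qatomic_add(&multifd_bytes_counter, sent);           /* illustrative counter   */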
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
* Re: Big TCG slowdown when using zstd with aarch64
2023-06-02 10:14 ` Daniel P. Berrangé
@ 2023-06-02 10:41 ` Juan Quintela
0 siblings, 0 replies; 11+ messages in thread
From: Juan Quintela @ 2023-06-02 10:41 UTC (permalink / raw)
To: Daniel P. Berrangé; +Cc: qemu-devel, peter.maydell, Richard Henderson
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Thu, Jun 01, 2023 at 11:06:42PM +0200, Juan Quintela wrote:
>>
>> Hi
>>
>> Before I continue investigating this further, do you have any clue what
>> is going on here? I am running qemu-system-aarch64 on an x86_64 host.
>
> FYI, the trigger for this behaviour appears to be your recent change
> to stats accounting in:
>
> commit cbec7eb76879d419e7dbf531ee2506ec0722e825 (HEAD)
> Author: Juan Quintela <quintela@redhat.com>
> Date: Mon May 15 21:57:09 2023 +0200
>
> migration/multifd: Compute transferred bytes correctly
>
> In the past, we had to put the in the main thread all the operations
> related with sizes due to qemu_file not beeing thread safe. As now
> all counters are atomic, we can update the counters just after the
> do the write. As an aditional bonus, we are able to use the right
> value for the compression methods. Right now we were assuming that
> there were no compression at all.
>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Message-Id: <20230515195709.63843-17-quintela@redhat.com>
>
>
>
> Before that commit the /aarch64/migration/multifd/tcp/plain/{none,zlib,zstd}
> tests all took 21 seconds each.
>
> After that commit the 'none' test takes about 3 seconds, and the zlib/zstd
> tests take about 1 second, except when zstd is suddenly very slow.
The slowdown was reported by Fiona.
This series removes the slowdown (it is an intermediate state while I
switch from one counter to another):
Subject: [PATCH v2 00/20] Next round of migration atomic counters
But to integrate it I have to fix the RDMA issues that you pointed out
yesterday and get the series reviewed (Hint, Hint).
I will try to get the RDMA bits fixed during the day.
Thanks for the report, Juan.
Thread overview: 11+ messages
2023-06-01 21:06 Big TCG slowdown when using zstd with aarch64 Juan Quintela
2023-06-02 9:10 ` Daniel P. Berrangé
2023-06-02 9:22 ` Peter Maydell
2023-06-02 9:37 ` Daniel P. Berrangé
2023-06-02 9:42 ` Alex Bennée
2023-06-02 9:24 ` Thomas Huth
2023-06-02 9:34 ` Juan Quintela
2023-06-02 9:47 ` Thomas Huth
2023-06-02 9:25 ` Juan Quintela
2023-06-02 10:14 ` Daniel P. Berrangé
2023-06-02 10:41 ` Juan Quintela