* [PATCH 1/4] tests/qtest: Add check-migration
2024-10-17 14:32 [PATCH 0/4] tests/qtest: Move the bulk of migration tests into a separate target Fabiano Rosas
@ 2024-10-17 14:32 ` Fabiano Rosas
2024-10-17 15:09 ` Daniel P. Berrangé
2024-10-17 14:32 ` [PATCH 2/4] docs: Add migration tests documentation Fabiano Rosas
` (2 subsequent siblings)
3 siblings, 1 reply; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-17 14:32 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Peter Maydell, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Philippe Mathieu-Daudé,
Alex Bennée, Thomas Huth, Laurent Vivier
Add two new targets, check-migration and check-migration-quick to
allow dividing migration tests into a quick set and a slow set. With
this it'll be possible to reduce the amount of migration tests that
run by default as part of 'make check'.
Keep under the 'migration-quick' suite only a few tests to serve as
sanity check for every build and move the rest under the 'migration'
suite.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
meson.build | 6 +++---
tests/Makefile.include | 2 ++
tests/qtest/meson.build | 47 +++++++++++++++++++++++++++++++++--------
3 files changed, 43 insertions(+), 12 deletions(-)
diff --git a/meson.build b/meson.build
index 4ea1984fc5..92d38691f9 100644
--- a/meson.build
+++ b/meson.build
@@ -3,9 +3,9 @@ project('qemu', ['c'], meson_version: '>=1.1.0',
'b_staticpic=false', 'stdsplit=false', 'optimization=2', 'b_pie=true'],
version: files('VERSION'))
-add_test_setup('quick', exclude_suites: ['slow', 'thorough'], is_default: true)
-add_test_setup('slow', exclude_suites: ['thorough'], env: ['G_TEST_SLOW=1', 'SPEED=slow'])
-add_test_setup('thorough', env: ['G_TEST_SLOW=1', 'SPEED=thorough'])
+add_test_setup('quick', exclude_suites: ['slow', 'thorough', 'migration'], is_default: true)
+add_test_setup('slow', exclude_suites: ['thorough', 'migration-quick'], env: ['G_TEST_SLOW=1', 'SPEED=slow'])
+add_test_setup('thorough', exclude_suites: ['migration-quick'], env: ['G_TEST_SLOW=1', 'SPEED=thorough'])
meson.add_postconf_script(find_program('scripts/symlink-install-tree.py'))
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 010369bd3a..79c1350bfb 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -11,6 +11,8 @@ check-help:
@echo " $(MAKE) check-qtest Run qtest tests"
@echo " $(MAKE) check-functional Run python-based functional tests"
@echo " $(MAKE) check-functional-TARGET Run functional tests for a given target"
+ @echo " $(MAKE) check-migration-quick Run a small set of migration tests"
+ @echo " $(MAKE) check-migration Run all migration tests"
@echo " $(MAKE) check-unit Run qobject tests"
@echo " $(MAKE) check-qapi-schema Run QAPI schema tests"
@echo " $(MAKE) check-block Run block tests"
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index b207e38696..27a802474a 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -406,14 +406,43 @@ foreach dir : target_dirs
test: executable(test, src, dependencies: deps)
}
endif
- test('qtest-@0@/@1@'.format(target_base, test),
- qtest_executables[test],
- depends: [test_deps, qtest_emulator, emulator_modules],
- env: qtest_env,
- args: ['--tap', '-k'],
- protocol: 'tap',
- timeout: slow_qtests.get(test, 60),
- priority: slow_qtests.get(test, 60),
- suite: ['qtest', 'qtest-' + target_base])
+
+ # The migration-test test runs several slow sub-tests. Add it to
+ # two separate targets, one for executing a few tests
+ # (migration-quick) and another for executing the full set
+ # (migration). This is done to reduce the amount of tests that run
+ # via make check.
+ if test == 'migration-test'
+ foreach opts : [
+ {
+ 'test-args': ['--tap', '-k', '-m', 'slow'],
+ 'test-suite': ['migration']
+ },
+ {
+ 'test-args': ['--tap', '-k', '-m', 'quick'],
+ 'test-suite': ['migration-quick']
+ }]
+
+ test(target_base,
+ qtest_executables[test],
+ depends: [test_deps, qtest_emulator, emulator_modules],
+ env: qtest_env,
+ args: opts['test-args'],
+ protocol: 'tap',
+ timeout: slow_qtests.get(test, 60),
+ priority: slow_qtests.get(test, 60),
+ suite: opts['test-suite'])
+ endforeach
+ else
+ test('qtest-@0@/@1@'.format(target_base, test),
+ qtest_executables[test],
+ depends: [test_deps, qtest_emulator, emulator_modules],
+ env: qtest_env,
+ args: ['--tap', '-k'],
+ protocol: 'tap',
+ timeout: slow_qtests.get(test, 60),
+ priority: slow_qtests.get(test, 60),
+ suite: ['qtest', 'qtest-' + target_base])
+ endif
endforeach
endforeach
--
2.35.3
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/4] tests/qtest: Add check-migration
2024-10-17 14:32 ` [PATCH 1/4] tests/qtest: Add check-migration Fabiano Rosas
@ 2024-10-17 15:09 ` Daniel P. Berrangé
0 siblings, 0 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-17 15:09 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Peter Xu, Peter Maydell, Paolo Bonzini,
Marc-André Lureau, Philippe Mathieu-Daudé,
Alex Bennée, Thomas Huth, Laurent Vivier
On Thu, Oct 17, 2024 at 11:32:08AM -0300, Fabiano Rosas wrote:
> Add two new targets, check-migration and check-migration-quick to
> allow dividing migration tests into a quick set and a slow set. With
> this it'll be possible to reduce the amount of migration tests that
> run by default as part of 'make check'.
>
> Keep under the 'migration-quick' suite only a few tests to serve as
> sanity check for every build and move the rest under the 'migration'
> suite.
I don't think we should need to have separate make targets
for each speed. I would expect users to be able to run
$ make check SPEED=thorough
$ make check-qtest SPEED=thorough
which is how we document doing this for functional tests.
If we want a way to let users more easily run individual
(or a subset of) qtest suites, how about we allow for
some filtering along the lines of:
$ make check-qtest QTESTS=migration-test
as adding top level make-check-<blah> targets for each
different subset maintainers might want isn't scalable.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> meson.build | 6 +++---
> tests/Makefile.include | 2 ++
> tests/qtest/meson.build | 47 +++++++++++++++++++++++++++++++++--------
> 3 files changed, 43 insertions(+), 12 deletions(-)
>
> diff --git a/meson.build b/meson.build
> index 4ea1984fc5..92d38691f9 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -3,9 +3,9 @@ project('qemu', ['c'], meson_version: '>=1.1.0',
> 'b_staticpic=false', 'stdsplit=false', 'optimization=2', 'b_pie=true'],
> version: files('VERSION'))
>
> -add_test_setup('quick', exclude_suites: ['slow', 'thorough'], is_default: true)
> -add_test_setup('slow', exclude_suites: ['thorough'], env: ['G_TEST_SLOW=1', 'SPEED=slow'])
> -add_test_setup('thorough', env: ['G_TEST_SLOW=1', 'SPEED=thorough'])
> +add_test_setup('quick', exclude_suites: ['slow', 'thorough', 'migration'], is_default: true)
> +add_test_setup('slow', exclude_suites: ['thorough', 'migration-quick'], env: ['G_TEST_SLOW=1', 'SPEED=slow'])
> +add_test_setup('thorough', exclude_suites: ['migration-quick'], env: ['G_TEST_SLOW=1', 'SPEED=thorough'])
>
> meson.add_postconf_script(find_program('scripts/symlink-install-tree.py'))
>
> diff --git a/tests/Makefile.include b/tests/Makefile.include
> index 010369bd3a..79c1350bfb 100644
> --- a/tests/Makefile.include
> +++ b/tests/Makefile.include
> @@ -11,6 +11,8 @@ check-help:
> @echo " $(MAKE) check-qtest Run qtest tests"
> @echo " $(MAKE) check-functional Run python-based functional tests"
> @echo " $(MAKE) check-functional-TARGET Run functional tests for a given target"
> + @echo " $(MAKE) check-migration-quick Run a small set of migration tests"
> + @echo " $(MAKE) check-migration Run all migration tests"
> @echo " $(MAKE) check-unit Run qobject tests"
> @echo " $(MAKE) check-qapi-schema Run QAPI schema tests"
> @echo " $(MAKE) check-block Run block tests"
> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> index b207e38696..27a802474a 100644
> --- a/tests/qtest/meson.build
> +++ b/tests/qtest/meson.build
> @@ -406,14 +406,43 @@ foreach dir : target_dirs
> test: executable(test, src, dependencies: deps)
> }
> endif
> - test('qtest-@0@/@1@'.format(target_base, test),
> - qtest_executables[test],
> - depends: [test_deps, qtest_emulator, emulator_modules],
> - env: qtest_env,
> - args: ['--tap', '-k'],
> - protocol: 'tap',
> - timeout: slow_qtests.get(test, 60),
> - priority: slow_qtests.get(test, 60),
> - suite: ['qtest', 'qtest-' + target_base])
> +
> + # The migration-test test runs several slow sub-tests. Add it to
> + # two separate targets, one for executing a few tests
> + # (migration-quick) and another for executing the full set
> + # (migration). This is done to reduce the amount of tests that run
> + # via make check.
> + if test == 'migration-test'
> + foreach opts : [
> + {
> + 'test-args': ['--tap', '-k', '-m', 'slow'],
> + 'test-suite': ['migration']
> + },
> + {
> + 'test-args': ['--tap', '-k', '-m', 'quick'],
> + 'test-suite': ['migration-quick']
> + }]
> +
> + test(target_base,
> + qtest_executables[test],
> + depends: [test_deps, qtest_emulator, emulator_modules],
> + env: qtest_env,
> + args: opts['test-args'],
> + protocol: 'tap',
> + timeout: slow_qtests.get(test, 60),
> + priority: slow_qtests.get(test, 60),
> + suite: opts['test-suite'])
> + endforeach
> + else
> + test('qtest-@0@/@1@'.format(target_base, test),
> + qtest_executables[test],
> + depends: [test_deps, qtest_emulator, emulator_modules],
> + env: qtest_env,
> + args: ['--tap', '-k'],
> + protocol: 'tap',
> + timeout: slow_qtests.get(test, 60),
> + priority: slow_qtests.get(test, 60),
> + suite: ['qtest', 'qtest-' + target_base])
> + endif
> endforeach
> endforeach
> --
> 2.35.3
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/4] docs: Add migration tests documentation
2024-10-17 14:32 [PATCH 0/4] tests/qtest: Move the bulk of migration tests into a separate target Fabiano Rosas
2024-10-17 14:32 ` [PATCH 1/4] tests/qtest: Add check-migration Fabiano Rosas
@ 2024-10-17 14:32 ` Fabiano Rosas
2024-10-17 14:32 ` [PATCH 3/4] tests/qtest/migration: Move tests into g_test_slow() Fabiano Rosas
2024-10-17 14:32 ` [PATCH 4/4] ci: Add check-migration-quick to the clang job Fabiano Rosas
3 siblings, 0 replies; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-17 14:32 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Peter Maydell
Add documentation about how to write, run and debug migration tests.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
docs/devel/testing/index.rst | 1 +
docs/devel/testing/main.rst | 13 ++
docs/devel/testing/migration.rst | 275 +++++++++++++++++++++++++++++++
docs/devel/testing/qtest.rst | 1 +
4 files changed, 290 insertions(+)
create mode 100644 docs/devel/testing/migration.rst
diff --git a/docs/devel/testing/index.rst b/docs/devel/testing/index.rst
index 45eb4a7181..e46a33e9e8 100644
--- a/docs/devel/testing/index.rst
+++ b/docs/devel/testing/index.rst
@@ -9,6 +9,7 @@ testing infrastructure.
main
qtest
+ migration
functional
avocado
acpi-bits
diff --git a/docs/devel/testing/main.rst b/docs/devel/testing/main.rst
index 09725e8ea9..f238926300 100644
--- a/docs/devel/testing/main.rst
+++ b/docs/devel/testing/main.rst
@@ -96,6 +96,19 @@ QTest cases can be executed with
make check-qtest
+Migration
+~~~~~~~~~
+
+Migration tests are part of QTest, but are run independently. Refer
+to :doc:`migration` for more details.
+
+Migration test cases can be executed with
+
+.. code::
+
+ make check-migration
+ make check-migration-quick
+
Writing portable test cases
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Both unit tests and qtests can run on POSIX hosts as well as Windows hosts.
diff --git a/docs/devel/testing/migration.rst b/docs/devel/testing/migration.rst
new file mode 100644
index 0000000000..e877a979bf
--- /dev/null
+++ b/docs/devel/testing/migration.rst
@@ -0,0 +1,275 @@
+.. _migration:
+
+Migration tests
+===============
+
+Migration tests are part of QTest, but have some particularities of
+their own, such as:
+
+- Extended test time due to the need to exercise the iterative phase
+ of migration;
+- Extra requirements on the QEMU binary being used due to
+ :ref:`cross-version migration <cross-version-tests>`;
+- The use of a custom binary for the guest code to test memory
+ integrity (see :ref:`guest-code`).
+
+Invocation
+----------
+
+Migration tests can be ran with:
+
+.. code::
+
+ make check-migration
+ make check-migration-quick
+
+or directly:
+
+.. code::
+
+ # all tests
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test
+
+ # single test
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/bad_dest
+
+ # all tests under /multifd (note no trailing slash)
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -r /x86_64/migration/multifd
+
+for cross-version tests (see :ref:`cross-version-tests`):
+
+.. code::
+
+ # old QEMU -> new QEMU
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_BINARY_SRC=./old/qemu-system-x86_64 ./tests/qtest/migration-test
+ QTEST_QEMU_BINARY_DST=./qemu-system-x86_64 QTEST_QEMU_BINARY=./old/qemu-system-x86_64 ./tests/qtest/migration-test
+
+ # new QEMU -> old QEMU (backwards migration)
+ QTEST_QEMU_BINARY_SRC=./qemu-system-x86_64 QTEST_QEMU_BINARY=./old/qemu-system-x86_64 ./tests/qtest/migration-test
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_BINARY_DST=./old/qemu-system-x86_64 ./tests/qtest/migration-test
+
+ # both _SRC and _DST variants are supported for convenience
+
+.. _cross-version-tests:
+
+Cross-version tests
+~~~~~~~~~~~~~~~~~~~
+
+To detect compatibility issues between different QEMU versions, all
+tests from migration-test can be executed with two different QEMU
+versions. The common machine type between the two versions is used.
+
+To setup cross-version tests, a previous build of QEMU must be kept,
+e.g.:
+
+.. code::
+
+ # build current code
+ mkdir build
+ cd build
+ ../configure; make
+
+ # build previous version
+ cd ../
+ mkdir build-9.1
+ git checkout v9.1.0
+ cd build
+ ../configure; make
+
+To avoid issues with newly added features and new tests, it is highly
+recommended to run the tests from the source directory of *older*
+version being tested.
+
+.. code::
+
+ ./build/qemu-system-x86_64 --version
+ QEMU emulator version 9.1.50
+
+ ./build-9.1/qemu-system-x86_64 --version
+ QEMU emulator version 9.1.0
+
+ cd build-9.1
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 QTEST_QEMU_BINARY_DST=../build/qemu-system-x86_64 ./tests/qtest/migration-test
+
+
+How to write migration tests
+----------------------------
+
+Add a test function (prefixed with ``test_``) that gets registered
+with QTest using the ``migration_test_add*()`` helpers.
+
+The choice between ``migration_test_add()`` and
+``migration_test_add_quick()`` affects whether a test is executed via
+``make check``. The former helper adds a test to the "slow" set, which
+only runs via ``make check-migration``, while the latter allows a test
+to run via ``make check`` or ``make check-migration-quick``.
+
+A new test should preferably not add too much time to ``make
+check``. In general, code that depends on external libraries or pieces
+outside of migration could be tested in the quick set as well as tests
+that are fast (e.g. only tests migration URL and exits). Code that is
+core to the migration framework or changes infrequently should be
+tested in the slow set as well as test that are considerably slower
+(e.g. run several migration iterations).
+
+.. code::
+
+ migration_test_add("/migration/multifd/tcp/plain/cancel", test_multifd_tcp_cancel);
+
+There is no formal grammar for the definition of the test paths, but
+an informal rule is followed for consistency. Usually:
+
+``/migration/<multifd|precopy|postcopy>/<url type>/<test-specific>/``
+
+Bear in mind that the path string affects test execution order and
+filtering when using the ``-r`` flag.
+
+For simpler tests, the test function can setup the test arguments in
+the ``MigrateCommon`` structure and call into a common test
+routine. Currently there are two common test routines:
+
+ - test_precopy_common - for generic precopy migration
+ - test_file_common - for migration using the file: URL
+
+The general structure of a test routine is:
+
+- call ``test_migrate_start()`` to initialize the two QEMU
+ instances. Usually named "from", for the source machine and "to" for
+ the destination machine;
+
+- define the migration duration, (roughly speaking either quick or
+ slow) by altering the convergence parameters with
+ ``migrate_ensure[_non]_converge()``;
+
+- wait for the machines to be in the desired state with the ``wait_for_*``
+ helpers;
+
+- migrate with ``migrate_qmp()/migrate_incoming_qmp()/migrate_qmp_fail()``;
+
+- check that guest memory was not corrupted and clean up the QEMU
+ instances with ``test_migrate_end()``.
+
+If using the common test routines, the ``.start_hook`` and ``.finish_hook``
+callbacks can be used to perform test-specific tasks.
+
+.. _guest-code:
+
+About guest code
+----------------
+
+The tests all use a custom, architecture-specific binary as the guest
+code. This code, known as a-b-kernel or a-b-bootblock, constantly
+iterates over the guest memory, writing a number to the start of each
+guest page, incrementing it as it loops around (i.e. a generation
+count). This allows the tests to catch memory corruption errors that
+occur during migration as every page's first byte must have the same
+value, except at the point where the transition happens.
+
+Whenever guest memory is migrated incorrectly, the test will output
+the address and amount of pages that present a value inconsistent with
+the generation count, e.g.:
+
+.. code::
+
+ Memory content inconsistency at d53000 first_byte = 27 last_byte = 26 current = 27 hit_edge = 1
+ Memory content inconsistency at d54000 first_byte = 27 last_byte = 26 current = 27 hit_edge = 1
+ Memory content inconsistency at d55000 first_byte = 27 last_byte = 26 current = 27 hit_edge = 1
+ and in another 4929 pages
+
+In the scenario above,
+
+``first_byte`` shows that the current generation number is 27, therefore
+all pages should have 27 as their first byte. Since ``hit_edge=1``, that
+means the transition point was found, i.e. the guest was stopped for
+migration while not all pages had yet been updated to the new
+generation count. So 26 is also a valid byte to find in some pages.
+
+The inconsistency here is that ``last_byte``, i.e. the previous
+generation count is smaller than the ``current`` byte, which should not
+be possible. This would indicate a memory layout such as:
+
+.. code::
+
+ 0xb00000 | 27 00 00 ...
+ ...
+ 0xc00000 | 27 00 00 ...
+ ...
+ 0xd00000 | 27 00 00 ...
+ 0x?????? | 26 00 00 ... <-- pages around this addr weren't migrated correctly
+ ...
+ 0xd53000 | 27 00 00 ...
+ 0xd54000 | 27 00 00 ...
+ 0xd55000 | 27 00 00 ...
+ ...
+
+The a-b code is located at ``tests/migration/<arch>``.
+
+Troubleshooting
+---------------
+
+Migration tests usually run as part of make check, which is most
+likely to not have been using the verbose flag, so the first thing to
+check is the test log from meson (``meson-logs/testlog.txt``).
+
+There, look for the last "Running" entry, which will be the current
+test. Notice whether the failing program is one of the QEMU instances
+or the migration-test itself.
+
+E.g.:
+
+.. code::
+
+ # Running /s390x/migration/precopy/unix/plain
+ # Using machine type: s390-ccw-virtio-9.2
+ # starting QEMU: exec ./qemu-system-s390x -qtest ...
+ # starting QEMU: exec ./qemu-system-s390x -qtest ...
+ ----------------------------------- stderr -----------------------------------
+ migration-test: ../tests/qtest/migration-test.c:1712: test_precopy_common: Assertion `0' failed.
+
+ (test program exited with status code -6)
+
+.. code::
+
+ # Running /x86_64/migration/bad_dest
+ # Using machine type: pc-q35-9.2
+ # starting QEMU: exec ./qemu-system-x86_64 -qtest ...
+ # starting QEMU: exec ./qemu-system-x86_64 -qtest ...
+ ----------------------------------- stderr -----------------------------------
+ Broken pipe
+ ../tests/qtest/libqtest.c:205: kill_qemu() detected QEMU death from signal 6 (Aborted) (core dumped)
+
+ (test program exited with status code -6)
+
+The above is usually not enough to determine what happened, so
+re-running the test directly is helpful:
+
+.. code::
+
+ QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/bad_dest
+
+There are also the QTEST_LOG and QTEST_TRACE variables for increased
+logging and tracing.
+
+The QTEST_QEMU_BINARY environment variable can be abused to hook GDB
+or valgrind into the invocation:
+
+.. code::
+
+ QTEST_QEMU_BINARY='gdb -q --ex "set pagination off" --ex "set print thread-events off" \
+ --ex "handle SIGUSR1 noprint" --ex "break <breakpoint>" --ex "run" --ex "quit \$_exitcode" \
+ --args ./qemu-system-x86_64' ./tests/qtest/migration-test -p /x86_64/migration/multifd/file/mapped-ram/fdset/dio
+
+.. code::
+
+ QTEST_QEMU_BINARY='valgrind -q --leak-check=full --show-leak-kinds=definite,indirect \
+ ./qemu-system-x86_64' ./tests/qtest/migration-test -r /x86_64/migration
+
+Whenever a test fails, it will leave behind a temporary
+directory. This is useful for file migrations to inspect the generated
+migration file:
+
+.. code::
+
+ $ file /tmp/migration-test-X496U2/migfile
+ /tmp/migration-test-X496U2/migfile: QEMU suspend to disk image
+ $ hexdump -C /tmp/migration-test-X496U2/migfile | less
diff --git a/docs/devel/testing/qtest.rst b/docs/devel/testing/qtest.rst
index c5b8546b3e..4665c160b6 100644
--- a/docs/devel/testing/qtest.rst
+++ b/docs/devel/testing/qtest.rst
@@ -5,6 +5,7 @@ QTest Device Emulation Testing Framework
.. toctree::
qgraph
+ migration
QTest is a device emulation testing framework. It can be very useful to test
device models; it could also control certain aspects of QEMU (such as virtual
--
2.35.3
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 3/4] tests/qtest/migration: Move tests into g_test_slow()
2024-10-17 14:32 [PATCH 0/4] tests/qtest: Move the bulk of migration tests into a separate target Fabiano Rosas
2024-10-17 14:32 ` [PATCH 1/4] tests/qtest: Add check-migration Fabiano Rosas
2024-10-17 14:32 ` [PATCH 2/4] docs: Add migration tests documentation Fabiano Rosas
@ 2024-10-17 14:32 ` Fabiano Rosas
2024-10-17 14:32 ` [PATCH 4/4] ci: Add check-migration-quick to the clang job Fabiano Rosas
3 siblings, 0 replies; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-17 14:32 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Xu, Peter Maydell
Move most of the tests under g_test_slow() and add a
migration_test_add_quick() helper to mark a test as "quick".
quick - runs with:
make check
make check-migration-quick
make check-migration
make SPEED=slow check
slow - runs with:
make check-migration
make SPEED=slow check
Also add some words on the rationale, which is to keep the bulk of the
migration-test out of make check.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
tests/qtest/migration-helpers.c | 9 +++++
tests/qtest/migration-helpers.h | 16 ++++++++
tests/qtest/migration-test.c | 67 ++++++++++++++++-----------------
3 files changed, 58 insertions(+), 34 deletions(-)
diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index 0025933883..b3a873630f 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -456,6 +456,15 @@ static void migration_test_wrapper(const void *data)
}
void migration_test_add(const char *path, void (*fn)(void))
+{
+ if (!g_test_slow()) {
+ return;
+ }
+
+ migration_test_add_quick(path, fn);
+}
+
+void migration_test_add_quick(const char *path, void (*fn)(void))
{
MigrationTest *test = g_new0(MigrationTest, 1);
diff --git a/tests/qtest/migration-helpers.h b/tests/qtest/migration-helpers.h
index 72dba369fb..59daf5ea62 100644
--- a/tests/qtest/migration-helpers.h
+++ b/tests/qtest/migration-helpers.h
@@ -62,7 +62,23 @@ static inline bool probe_o_direct_support(const char *tmpfs)
return false;
}
#endif
+
+/*
+ * Tests that would only break due to core migration changes should be
+ * added with migration_test_add(). Those are considered 'slow' by
+ * default and run during make check-migration, or when '-m slow' is
+ * added on the cmdline.
+ *
+ * Tests that are quick and simple, as well as those that are
+ * succeptible to changes outside of migration/ (e.g. depend on TLS,
+ * qio_channel, etc), should be added with
+ * migration_test_add_quick(). Those are considered 'quick' and run as
+ * part of make check (i.e. execute in CI and with every developer's
+ * invocation of make check). Avoid adding too much time to those.
+ */
void migration_test_add(const char *path, void (*fn)(void));
+void migration_test_add_quick(const char *path, void (*fn)(void));
+
void migration_event_wait(QTestState *s, const char *target);
#endif /* MIGRATION_HELPERS_H */
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 95e45b5029..4b021984c4 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -43,6 +43,7 @@ static bool uffd_feature_thread_id;
static QTestMigrationState src_state;
static QTestMigrationState dst_state;
+
/*
* An initial 3 MB offset is used as that corresponds
* to ~1 sec of data transfer with our bandwidth setting.
@@ -3785,20 +3786,20 @@ int main(int argc, char **argv)
#endif
if (is_x86) {
- migration_test_add("/migration/precopy/unix/suspend/live",
- test_precopy_unix_suspend_live);
- migration_test_add("/migration/precopy/unix/suspend/notlive",
- test_precopy_unix_suspend_notlive);
+ migration_test_add_quick("/migration/precopy/unix/suspend/live",
+ test_precopy_unix_suspend_live);
+ migration_test_add_quick("/migration/precopy/unix/suspend/notlive",
+ test_precopy_unix_suspend_notlive);
}
if (has_uffd) {
- migration_test_add("/migration/postcopy/plain", test_postcopy);
+ migration_test_add_quick("/migration/postcopy/plain", test_postcopy);
migration_test_add("/migration/postcopy/recovery/plain",
test_postcopy_recovery);
- migration_test_add("/migration/postcopy/preempt/plain",
- test_postcopy_preempt);
- migration_test_add("/migration/postcopy/preempt/recovery/plain",
- test_postcopy_preempt_recovery);
+ migration_test_add_quick("/migration/postcopy/preempt/plain",
+ test_postcopy_preempt);
+ migration_test_add_quick("/migration/postcopy/preempt/recovery/plain",
+ test_postcopy_preempt_recovery);
migration_test_add("/migration/postcopy/recovery/double-failures/handshake",
test_postcopy_recovery_fail_handshake);
migration_test_add("/migration/postcopy/recovery/double-failures/reconnect",
@@ -3809,12 +3810,12 @@ int main(int argc, char **argv)
}
}
- migration_test_add("/migration/precopy/unix/plain",
- test_precopy_unix_plain);
- if (g_test_slow()) {
- migration_test_add("/migration/precopy/unix/xbzrle",
- test_precopy_unix_xbzrle);
- }
+ migration_test_add_quick("/migration/precopy/unix/plain",
+ test_precopy_unix_plain);
+
+ migration_test_add("/migration/precopy/unix/xbzrle",
+ test_precopy_unix_xbzrle);
+
migration_test_add("/migration/precopy/file",
test_precopy_file);
migration_test_add("/migration/precopy/file/offset",
@@ -3881,14 +3882,14 @@ int main(int argc, char **argv)
#endif /* CONFIG_TASN1 */
#endif /* CONFIG_GNUTLS */
- migration_test_add("/migration/precopy/tcp/plain", test_precopy_tcp_plain);
+ migration_test_add_quick("/migration/precopy/tcp/plain", test_precopy_tcp_plain);
migration_test_add("/migration/precopy/tcp/plain/switchover-ack",
test_precopy_tcp_switchover_ack);
#ifdef CONFIG_GNUTLS
- migration_test_add("/migration/precopy/tcp/tls/psk/match",
- test_precopy_tcp_tls_psk_match);
+ migration_test_add_quick("/migration/precopy/tcp/tls/psk/match",
+ test_precopy_tcp_tls_psk_match);
migration_test_add("/migration/precopy/tcp/tls/psk/mismatch",
test_precopy_tcp_tls_psk_mismatch);
#ifdef CONFIG_TASN1
@@ -3930,25 +3931,23 @@ int main(int argc, char **argv)
/*
* See explanation why this test is slow on function definition
*/
- if (g_test_slow()) {
- migration_test_add("/migration/auto_converge",
- test_migrate_auto_converge);
- if (g_str_equal(arch, "x86_64") &&
- has_kvm && kvm_dirty_ring_supported()) {
- migration_test_add("/migration/dirty_limit",
- test_migrate_dirty_limit);
- }
+ migration_test_add("/migration/auto_converge",
+ test_migrate_auto_converge);
+ if (g_str_equal(arch, "x86_64") &&
+ has_kvm && kvm_dirty_ring_supported()) {
+ migration_test_add("/migration/dirty_limit",
+ test_migrate_dirty_limit);
}
- migration_test_add("/migration/multifd/tcp/uri/plain/none",
- test_multifd_tcp_uri_none);
+ migration_test_add_quick("/migration/multifd/tcp/uri/plain/none",
+ test_multifd_tcp_uri_none);
migration_test_add("/migration/multifd/tcp/channels/plain/none",
test_multifd_tcp_channels_none);
migration_test_add("/migration/multifd/tcp/plain/zero-page/legacy",
test_multifd_tcp_zero_page_legacy);
migration_test_add("/migration/multifd/tcp/plain/zero-page/none",
test_multifd_tcp_no_zero_page);
- migration_test_add("/migration/multifd/tcp/plain/cancel",
- test_multifd_tcp_cancel);
+ migration_test_add_quick("/migration/multifd/tcp/plain/cancel",
+ test_multifd_tcp_cancel);
migration_test_add("/migration/multifd/tcp/plain/zlib",
test_multifd_tcp_zlib);
#ifdef CONFIG_ZSTD
@@ -3957,7 +3956,7 @@ int main(int argc, char **argv)
#endif
#ifdef CONFIG_QATZIP
migration_test_add("/migration/multifd/tcp/plain/qatzip",
- test_multifd_tcp_qatzip);
+ test_multifd_tcp_qatzip);
#endif
#ifdef CONFIG_QPL
migration_test_add("/migration/multifd/tcp/plain/qpl",
@@ -3987,9 +3986,9 @@ int main(int argc, char **argv)
#endif /* CONFIG_GNUTLS */
if (g_str_equal(arch, "x86_64") && has_kvm && kvm_dirty_ring_supported()) {
- migration_test_add("/migration/dirty_ring",
- test_precopy_unix_dirty_ring);
- if (qtest_has_machine("pc") && g_test_slow()) {
+ migration_test_add_quick("/migration/dirty_ring",
+ test_precopy_unix_dirty_ring);
+ if (qtest_has_machine("pc")) {
migration_test_add("/migration/vcpu_dirty_limit",
test_vcpu_dirty_limit);
}
--
2.35.3
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-17 14:32 [PATCH 0/4] tests/qtest: Move the bulk of migration tests into a separate target Fabiano Rosas
` (2 preceding siblings ...)
2024-10-17 14:32 ` [PATCH 3/4] tests/qtest/migration: Move tests into g_test_slow() Fabiano Rosas
@ 2024-10-17 14:32 ` Fabiano Rosas
2024-10-17 14:57 ` Daniel P. Berrangé
3 siblings, 1 reply; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-17 14:32 UTC (permalink / raw)
To: qemu-devel
Cc: Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Recent changes to how we invoke the migration tests have
(intentionally) caused them to not be part of the check-qtest target
anymore. Add the check-migration-quick target so we don't lose
migration code testing in this job.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
.gitlab-ci.d/buildtest.yml | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
index 34d3f4e9ab..37247dc5fa 100644
--- a/.gitlab-ci.d/buildtest.yml
+++ b/.gitlab-ci.d/buildtest.yml
@@ -442,7 +442,7 @@ clang-system:
CONFIGURE_ARGS: --cc=clang --cxx=clang++ --enable-ubsan
--extra-cflags=-fno-sanitize-recover=undefined
TARGETS: alpha-softmmu arm-softmmu m68k-softmmu mips64-softmmu s390x-softmmu
- MAKE_CHECK_ARGS: check-qtest check-tcg
+ MAKE_CHECK_ARGS: check-qtest check-tcg check-migration-quick
clang-user:
extends: .native_build_job_template
--
2.35.3
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-17 14:32 ` [PATCH 4/4] ci: Add check-migration-quick to the clang job Fabiano Rosas
@ 2024-10-17 14:57 ` Daniel P. Berrangé
2024-10-17 16:29 ` Fabiano Rosas
0 siblings, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-17 14:57 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> Recent changes to how we invoke the migration tests have
> (intentionally) caused them to not be part of the check-qtest target
> anymore. Add the check-migration-quick target so we don't lose
> migration code testing in this job.
But 'check-migration-quick' is only the subset of migration tests,
'check-migration' is all of the migration tests. So surely this is
a massive regressions in covage in CI pipelines.
Experience shows us that relying on humans to run tests periodically
doesn't work well, and they'll slowly bit rot. Migration maintainers
don't have a way to run this as gating test for every pull request
that merges, and pull requests coming from non-migration maintainers
can still break migration code.
Any tests in tree need to be exercised by CI as the minimum bar
to prevent bit rot from merges.
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> .gitlab-ci.d/buildtest.yml | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
> index 34d3f4e9ab..37247dc5fa 100644
> --- a/.gitlab-ci.d/buildtest.yml
> +++ b/.gitlab-ci.d/buildtest.yml
> @@ -442,7 +442,7 @@ clang-system:
> CONFIGURE_ARGS: --cc=clang --cxx=clang++ --enable-ubsan
> --extra-cflags=-fno-sanitize-recover=undefined
> TARGETS: alpha-softmmu arm-softmmu m68k-softmmu mips64-softmmu s390x-softmmu
> - MAKE_CHECK_ARGS: check-qtest check-tcg
> + MAKE_CHECK_ARGS: check-qtest check-tcg check-migration-quick
>
> clang-user:
> extends: .native_build_job_template
> --
> 2.35.3
>
>
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-17 14:57 ` Daniel P. Berrangé
@ 2024-10-17 16:29 ` Fabiano Rosas
2024-10-18 9:01 ` Daniel P. Berrangé
0 siblings, 1 reply; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-17 16:29 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> Recent changes to how we invoke the migration tests have
>> (intentionally) caused them to not be part of the check-qtest target
>> anymore. Add the check-migration-quick target so we don't lose
>> migration code testing in this job.
>
> But 'check-migration-quick' is only the subset of migration tests,
> 'check-migration' is all of the migration tests. So surely this is
> a massive regressions in covage in CI pipelines.
I'm not sure it is. There are tests there already for all the major
parts of the code: precopy, postcopy, multifd, socket. Besides, we can
tweak migration-quick to cover spots where we think we're losing
coverage.
Since our CI offers nothing in terms of reproducibility or
debuggability, I don't think it's productive to have an increasing
amount of tests running in CI if that means we'll be dealing with
timeouts and intermittent crashes constantly.
>
> Experience shows us that relying on humans to run tests periodically
> doesn't work well, and they'll slowly bit rot. Migration maintainers
> don't have a way to run this as gating test for every pull request
> that merges, and pull requests coming from non-migration maintainers
> can still break migration code.
Right, but migration code would still be tested with migration-quick
which is executed at every make check. Do we really need the full set in
every pull request? We must draw a line somewhere, otherwise make check
will just balloon in duration.
>
> Any tests in tree need to be exercised by CI as the minimum bar
> to prevent bit rot from merges.
>
No disagreement here. But then I'm going to need advice on what to do
when other maintainers ask us to stop writing migration tests because
they take too long. I cannot send contributors away nor merge code
without tests.
Do we need nightly CI runs? Unit tests? Bear in mind there's a resource
allocation issue there. Addressing problems with timeouts/races in our
CI is not something any random person can do.
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> .gitlab-ci.d/buildtest.yml | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/.gitlab-ci.d/buildtest.yml b/.gitlab-ci.d/buildtest.yml
>> index 34d3f4e9ab..37247dc5fa 100644
>> --- a/.gitlab-ci.d/buildtest.yml
>> +++ b/.gitlab-ci.d/buildtest.yml
>> @@ -442,7 +442,7 @@ clang-system:
>> CONFIGURE_ARGS: --cc=clang --cxx=clang++ --enable-ubsan
>> --extra-cflags=-fno-sanitize-recover=undefined
>> TARGETS: alpha-softmmu arm-softmmu m68k-softmmu mips64-softmmu s390x-softmmu
>> - MAKE_CHECK_ARGS: check-qtest check-tcg
>> + MAKE_CHECK_ARGS: check-qtest check-tcg check-migration-quick
>>
>> clang-user:
>> extends: .native_build_job_template
>> --
>> 2.35.3
>>
>>
>
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-17 16:29 ` Fabiano Rosas
@ 2024-10-18 9:01 ` Daniel P. Berrangé
2024-10-18 9:46 ` Peter Maydell
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 9:01 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> >> Recent changes to how we invoke the migration tests have
> >> (intentionally) caused them to not be part of the check-qtest target
> >> anymore. Add the check-migration-quick target so we don't lose
> >> migration code testing in this job.
> >
> > But 'check-migration-quick' is only the subset of migration tests,
> > 'check-migration' is all of the migration tests. So surely this is
> > a massive regressions in covage in CI pipelines.
>
> I'm not sure it is. There are tests there already for all the major
> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> tweak migration-quick to cover spots where we think we're losing
> coverage.
Each of the tests in migration-test were added for a good reason,
generally to address testing gaps where we had functional regressions
in the past. I don't think its a good idea to stop running such tests
in CI as gating on new contributions. Any time we've had optional
tests in QEMU, we've seen repeated regressions in the area in question.
> Since our CI offers nothing in terms of reproducibility or
> debuggability, I don't think it's productive to have an increasing
> amount of tests running in CI if that means we'll be dealing with
> timeouts and intermittent crashes constantly.
Test reliability is a different thing. If a particular test is
flaky, it needs to either be fixed or disabled. Splitting into
a fast & slow grouping doesn't address reliability, just hides
the problem from view.
> > Experience shows us that relying on humans to run tests periodically
> > doesn't work well, and they'll slowly bit rot. Migration maintainers
> > don't have a way to run this as gating test for every pull request
> > that merges, and pull requests coming from non-migration maintainers
> > can still break migration code.
>
> Right, but migration code would still be tested with migration-quick
> which is executed at every make check. Do we really need the full set in
> every pull request? We must draw a line somewhere, otherwise make check
> will just balloon in duration.
Again, the tests all exist because migration code is incredibly
complicated, with a lot of permutations, with a history of being
very bug / regression prone. With that in mind, it is unavoidable
that we're going to have a significant testing overhead for
migration code.
Looking at its execution time right now, I'd say migration test
is pretty good, considering the permutations we have to target.
It gets a bad reputation because historically it has been as
much as x20 slower than it is today, and has also struggled
with reliability. The latter is a reflection of the complexity
of migration and and IMHO actually justifies greater testing,
as long as we put in time to address bugs.
Also we've got one single test program, covering an entire
subsystem in one go, rather than lots of short individual
test programs, so migration unfairly gets blamed for being
slow, when it simply covers alot of functionality in one
program.
> > Any tests in tree need to be exercised by CI as the minimum bar
> > to prevent bit rot from merges.
> >
>
> No disagreement here. But then I'm going to need advice on what to do
> when other maintainers ask us to stop writing migration tests because
> they take too long. I cannot send contributors away nor merge code
> without tests.
In general, I think it is unreasonable for other maintainers to
tell us to stop adding test coverage for migration, and would
push back against such a request.
We should, however, continue to optimize how we add further test
coverage, where practical, overload testing of multiple features
onto a single test case helps.
We've already massively optimized the migration-test compared to
its historical behaviour.
A potentially bigger win could be seen if we change how we exercise
the migration functionality. Since we had the migration qtest that
runs a full migration operation, we've tended to expand testing by
adding new qtest functions. ie we've added a functional test for
everything we want covered. This is nice & simple, but also expensive.
We've ignored unit testing, which I think is a mistake.
If i look at the test list:
# /x86_64/migration/bad_dest
# /x86_64/migration/analyze-script
# /x86_64/migration/validate_uuid
# /x86_64/migration/validate_uuid_error
# /x86_64/migration/validate_uuid_src_not_set
# /x86_64/migration/validate_uuid_dst_not_set
# /x86_64/migration/dirty_ring
# /x86_64/migration/precopy/file
# /x86_64/migration/precopy/unix/plain
# /x86_64/migration/precopy/unix/suspend/live
# /x86_64/migration/precopy/unix/suspend/notlive
# /x86_64/migration/precopy/unix/tls/psk
# /x86_64/migration/precopy/unix/tls/x509/default-host
# /x86_64/migration/precopy/unix/tls/x509/override-host
# /x86_64/migration/precopy/file/offset
# /x86_64/migration/precopy/file/mapped-ram
# /x86_64/migration/precopy/file/offset/fdset
# /x86_64/migration/precopy/file/offset/bad
# /x86_64/migration/precopy/file/mapped-ram/live
# /x86_64/migration/precopy/tcp/plain
# /x86_64/migration/precopy/tcp/plain/switchover-ack
# /x86_64/migration/precopy/tcp/tls/psk/match
# /x86_64/migration/precopy/tcp/tls/psk/mismatch
# /x86_64/migration/precopy/tcp/tls/x509/default-host
# /x86_64/migration/precopy/tcp/tls/x509/override-host
# /x86_64/migration/precopy/tcp/tls/x509/mismatch-host
# /x86_64/migration/precopy/tcp/tls/x509/friendly-client
# /x86_64/migration/precopy/tcp/tls/x509/hostile-client
# /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client
# /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client
# /x86_64/migration/precopy/fd/tcp
# /x86_64/migration/precopy/fd/file
# /x86_64/migration/multifd/file/mapped-ram
# /x86_64/migration/multifd/file/mapped-ram/live
# /x86_64/migration/multifd/file/mapped-ram/dio
# /x86_64/migration/multifd/file/mapped-ram/fdset
# /x86_64/migration/multifd/file/mapped-ram/fdset/dio
# /x86_64/migration/multifd/tcp/uri/plain/none
# /x86_64/migration/multifd/tcp/channels/plain/none
# /x86_64/migration/multifd/tcp/plain/cancel
# /x86_64/migration/multifd/tcp/plain/zlib
# /x86_64/migration/multifd/tcp/plain/zstd
# /x86_64/migration/multifd/tcp/plain/zero-page/legacy
# /x86_64/migration/multifd/tcp/plain/zero-page/none
# /x86_64/migration/multifd/tcp/tls/psk/match
# /x86_64/migration/multifd/tcp/tls/psk/mismatch
# /x86_64/migration/multifd/tcp/tls/x509/default-host
# /x86_64/migration/multifd/tcp/tls/x509/override-host
# /x86_64/migration/multifd/tcp/tls/x509/mismatch-host
# /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client
# /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
# /x86_64/migration/validate_uri/channels/both_set
# /x86_64/migration/validate_uri/channels/none_set
Individually none of those is very slow on its own - 10 are in
the 2-3 second range, 35 are 1-2 secs, and 6 are less than
1 second.
A very large portion of those are validating different ways to
establish migration. Hardly any of them actually need to run
a migration to completion. Even without running to completion
though, we have the overheads of spawning 2 QEMUs.
This feels like something that should be amenable to unit testing.
Might need a little re-factoring of migration code to make it
easier to run a subset of its logic in isolation, but that'd
probably be a win anyway, as such work usually makes code cleaner.
> Do we need nightly CI runs? Unit tests? Bear in mind there's a resource
> allocation issue there. Addressing problems with timeouts/races in our
> CI is not something any random person can do.
In terms of running time, I think migration-test is acceptable as it is
to run in 'make check' by default and doesn't justify dropping test
coverage. We should still look to optimize & move to unit testing more
code, and any reliability issues are something that needs to be addressed
too.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 9:01 ` Daniel P. Berrangé
@ 2024-10-18 9:46 ` Peter Maydell
2024-10-18 10:00 ` Daniel P. Berrangé
2024-10-18 13:51 ` Fabiano Rosas
2024-10-18 15:25 ` Peter Maydell
2 siblings, 1 reply; 23+ messages in thread
From: Peter Maydell @ 2024-10-18 9:46 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> > Daniel P. Berrangé <berrange@redhat.com> writes:
> >
> > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> > >> Recent changes to how we invoke the migration tests have
> > >> (intentionally) caused them to not be part of the check-qtest target
> > >> anymore. Add the check-migration-quick target so we don't lose
> > >> migration code testing in this job.
> > >
> > > But 'check-migration-quick' is only the subset of migration tests,
> > > 'check-migration' is all of the migration tests. So surely this is
> > > a massive regressions in covage in CI pipelines.
> >
> > I'm not sure it is. There are tests there already for all the major
> > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> > tweak migration-quick to cover spots where we think we're losing
> > coverage.
>
> Each of the tests in migration-test were added for a good reason,
> generally to address testing gaps where we had functional regressions
> in the past. I don't think its a good idea to stop running such tests
> in CI as gating on new contributions. Any time we've had optional
> tests in QEMU, we've seen repeated regressions in the area in question.
>
> > Since our CI offers nothing in terms of reproducibility or
> > debuggability, I don't think it's productive to have an increasing
> > amount of tests running in CI if that means we'll be dealing with
> > timeouts and intermittent crashes constantly.
>
> Test reliability is a different thing. If a particular test is
> flaky, it needs to either be fixed or disabled. Splitting into
> a fast & slow grouping doesn't address reliability, just hides
> the problem from view.
A lot of the current reliability issue is timeouts -- sometimes
our CI runners just run really slow (I have seen an example where
between a normal and a slow run on the same commit both the
compile and test times were 10x different...) So any test
that is not a fast-to-complete is much much more likely to
hit its timeout if the runner is running slowly. When I am
doing CI testing for merges "migration test timed out again"
is really really common.
> > No disagreement here. But then I'm going to need advice on what to do
> > when other maintainers ask us to stop writing migration tests because
> > they take too long. I cannot send contributors away nor merge code
> > without tests.
>
> In general, I think it is unreasonable for other maintainers to
> tell us to stop adding test coverage for migration, and would
> push back against such a request.
We do not have infinite CI resources, unfortunately. Migration
is competing with everything else for time on CI. You have to
find a balance between "what do we run every time" and "what
do we only run when specifically testing a migration pullreq".
Similarly, there's a lot of iotests but we don't run all of them
for every block backend for every CI job via "make check".
Long test times for tests run under "make check" are also bad
for individual developers -- if I'm running "make check" to
test a target/arm change I've made I don't really want that
to then spend 15 minutes testing the migration code that
I haven't touched and that is vanishingly unlikely to be
affected by my patches.
thanks
-- PMM
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 9:46 ` Peter Maydell
@ 2024-10-18 10:00 ` Daniel P. Berrangé
2024-10-18 10:09 ` Peter Maydell
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 10:00 UTC (permalink / raw)
To: Peter Maydell
Cc: Fabiano Rosas, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> > >
> > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> > > >> Recent changes to how we invoke the migration tests have
> > > >> (intentionally) caused them to not be part of the check-qtest target
> > > >> anymore. Add the check-migration-quick target so we don't lose
> > > >> migration code testing in this job.
> > > >
> > > > But 'check-migration-quick' is only the subset of migration tests,
> > > > 'check-migration' is all of the migration tests. So surely this is
> > > > a massive regressions in covage in CI pipelines.
> > >
> > > I'm not sure it is. There are tests there already for all the major
> > > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> > > tweak migration-quick to cover spots where we think we're losing
> > > coverage.
> >
> > Each of the tests in migration-test were added for a good reason,
> > generally to address testing gaps where we had functional regressions
> > in the past. I don't think its a good idea to stop running such tests
> > in CI as gating on new contributions. Any time we've had optional
> > tests in QEMU, we've seen repeated regressions in the area in question.
> >
> > > Since our CI offers nothing in terms of reproducibility or
> > > debuggability, I don't think it's productive to have an increasing
> > > amount of tests running in CI if that means we'll be dealing with
> > > timeouts and intermittent crashes constantly.
> >
> > Test reliability is a different thing. If a particular test is
> > flaky, it needs to either be fixed or disabled. Splitting into
> > a fast & slow grouping doesn't address reliability, just hides
> > the problem from view.
>
> A lot of the current reliability issue is timeouts -- sometimes
> our CI runners just run really slow (I have seen an example where
> between a normal and a slow run on the same commit both the
> compile and test times were 10x different...) So any test
> that is not a fast-to-complete is much much more likely to
> hit its timeout if the runner is running slowly. When I am
> doing CI testing for merges "migration test timed out again"
> is really really common.
If its frequently timing out, then we've got the timeouts
wrong, or we have some genuine bugs in there to be fixed.
> > > No disagreement here. But then I'm going to need advice on what to do
> > > when other maintainers ask us to stop writing migration tests because
> > > they take too long. I cannot send contributors away nor merge code
> > > without tests.
> >
> > In general, I think it is unreasonable for other maintainers to
> > tell us to stop adding test coverage for migration, and would
> > push back against such a request.
>
> We do not have infinite CI resources, unfortunately. Migration
> is competing with everything else for time on CI. You have to
> find a balance between "what do we run every time" and "what
> do we only run when specifically testing a migration pullreq".
> Similarly, there's a lot of iotests but we don't run all of them
> for every block backend for every CI job via "make check".
The combos we don't run for iotests are a good source of
regressions too :-(
> Long test times for tests run under "make check" are also bad
> for individual developers -- if I'm running "make check" to
> test a target/arm change I've made I don't really want that
> to then spend 15 minutes testing the migration code that
> I haven't touched and that is vanishingly unlikely to be
> affected by my patches.
Migration-test *used* to take 15 minutes to run, but that was a
very long time ago. A run of it today is around 1m20.
That said, if you are building multiple system emulators, we
run the same test multiple times, and with the number of
targets we have, that will be painful.
That could be a good reason to split the migration-test into
two distinct programs. One program that runs for every target,
and one that is only run once, for some arbitrary "primary"
target ? Or could we make use of glib's g_test_thorough
for this - a primary target runs with "SPEED=through" and
all other targets with normal settings. That would give us
a way to optimize any of the qtests to reduce redundant
testing where appropriate.
If we move alot of testing out into a migration unit test,
this also solves the redundancy problem.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 10:00 ` Daniel P. Berrangé
@ 2024-10-18 10:09 ` Peter Maydell
2024-10-18 10:38 ` Alex Bennée
2024-10-18 13:51 ` Fabiano Rosas
2 siblings, 0 replies; 23+ messages in thread
From: Peter Maydell @ 2024-10-18 10:09 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, 18 Oct 2024 at 11:00, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
> > We do not have infinite CI resources, unfortunately. Migration
> > is competing with everything else for time on CI. You have to
> > find a balance between "what do we run every time" and "what
> > do we only run when specifically testing a migration pullreq".
> > Similarly, there's a lot of iotests but we don't run all of them
> > for every block backend for every CI job via "make check".
>
> The combos we don't run for iotests are a good source of
> regressions too :-(
>
> > Long test times for tests run under "make check" are also bad
> > for individual developers -- if I'm running "make check" to
> > test a target/arm change I've made I don't really want that
> > to then spend 15 minutes testing the migration code that
> > I haven't touched and that is vanishingly unlikely to be
> > affected by my patches.
>
> Migration-test *used* to take 15 minutes to run, but that was a
> very long time ago. A run of it today is around 1m20.
>
> That said, if you are building multiple system emulators, we
> run the same test multiple times, and with the number of
> targets we have, that will be painful.
Yeah. Here's a recent s390 job, and not one that was running
slow: https://gitlab.com/qemu-project/qemu/-/jobs/8112195449
The migration tests it ran were:
95/954 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test
OK 196.95s 50 subtests passed
96/954 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
OK 202.47s 50 subtests passed
99/954 qemu:qtest+qtest-i386 / qtest-i386/migration-test
OK 203.54s 52 subtests passed
107/954 qemu:qtest+qtest-s390x / qtest-s390x/migration-test
OK 193.65s 50 subtests passed
156/954 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test
OK 164.44s 52 subtests passed
So that's 192s (over 3 minutes) average, and over 16 minutes
total spent on migration testing on this one CI job. If
the s390 VM has a noisy-neighbour problem then the
migration tests can hit their 8 minute timeout, implying
40 minutes spent on migration testing alone...
thanks
-- PMM
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 10:00 ` Daniel P. Berrangé
2024-10-18 10:09 ` Peter Maydell
@ 2024-10-18 10:38 ` Alex Bennée
2024-10-18 13:51 ` Fabiano Rosas
2 siblings, 0 replies; 23+ messages in thread
From: Alex Bennée @ 2024-10-18 10:38 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Peter Maydell, Fabiano Rosas, qemu-devel, Peter Xu,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
>> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
>> >
>> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
>> > > Daniel P. Berrangé <berrange@redhat.com> writes:
>> > >
>> > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
<snip>
>
> Migration-test *used* to take 15 minutes to run, but that was a
> very long time ago. A run of it today is around 1m20.
>
> That said, if you are building multiple system emulators, we
> run the same test multiple times, and with the number of
> targets we have, that will be painful.
I think this is the main problem. We run the whole set of tests for
every system emulator we build and I doubt we are actually increasing
any coverage. One suggestion I made previously was to change the test
selection logic so we only run all the tests for the KVM target and one
TCG target. For the other targets we should run the basic tests only to
ensure there is now architecture specific breakage for migration
generally.
> That could be a good reason to split the migration-test into
> two distinct programs. One program that runs for every target,
> and one that is only run once, for some arbitrary "primary"
> target ? Or could we make use of glib's g_test_thorough
> for this - a primary target runs with "SPEED=through" and
> all other targets with normal settings. That would give us
> a way to optimize any of the qtests to reduce redundant
> testing where appropriate.
Does splitting the tests make it easier? Would meson just pick which
arches for the "migrtion-full" experience?
My idea was just to pass the list of configured builds as an environment
variable and without it default to everything. That way running the test
by hand would still have full coverage.
>
>
> If we move alot of testing out into a migration unit test,
> this also solves the redundancy problem.
>
>
> With regards,
> Daniel
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 10:00 ` Daniel P. Berrangé
2024-10-18 10:09 ` Peter Maydell
2024-10-18 10:38 ` Alex Bennée
@ 2024-10-18 13:51 ` Fabiano Rosas
2024-10-18 13:54 ` Daniel P. Berrangé
2 siblings, 1 reply; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-18 13:51 UTC (permalink / raw)
To: Daniel P. Berrangé, Peter Maydell
Cc: qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
>> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
>> >
>> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
>> > > Daniel P. Berrangé <berrange@redhat.com> writes:
>> > >
>> > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> > > >> Recent changes to how we invoke the migration tests have
>> > > >> (intentionally) caused them to not be part of the check-qtest target
>> > > >> anymore. Add the check-migration-quick target so we don't lose
>> > > >> migration code testing in this job.
>> > > >
>> > > > But 'check-migration-quick' is only the subset of migration tests,
>> > > > 'check-migration' is all of the migration tests. So surely this is
>> > > > a massive regressions in covage in CI pipelines.
>> > >
>> > > I'm not sure it is. There are tests there already for all the major
>> > > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
>> > > tweak migration-quick to cover spots where we think we're losing
>> > > coverage.
>> >
>> > Each of the tests in migration-test were added for a good reason,
>> > generally to address testing gaps where we had functional regressions
>> > in the past. I don't think its a good idea to stop running such tests
>> > in CI as gating on new contributions. Any time we've had optional
>> > tests in QEMU, we've seen repeated regressions in the area in question.
>> >
>> > > Since our CI offers nothing in terms of reproducibility or
>> > > debuggability, I don't think it's productive to have an increasing
>> > > amount of tests running in CI if that means we'll be dealing with
>> > > timeouts and intermittent crashes constantly.
>> >
>> > Test reliability is a different thing. If a particular test is
>> > flaky, it needs to either be fixed or disabled. Splitting into
>> > a fast & slow grouping doesn't address reliability, just hides
>> > the problem from view.
>>
>> A lot of the current reliability issue is timeouts -- sometimes
>> our CI runners just run really slow (I have seen an example where
>> between a normal and a slow run on the same commit both the
>> compile and test times were 10x different...) So any test
>> that is not a fast-to-complete is much much more likely to
>> hit its timeout if the runner is running slowly. When I am
>> doing CI testing for merges "migration test timed out again"
>> is really really common.
>
> If its frequently timing out, then we've got the timeouts
> wrong, or we have some genuine bugs in there to be fixed.
>
>> > > No disagreement here. But then I'm going to need advice on what to do
>> > > when other maintainers ask us to stop writing migration tests because
>> > > they take too long. I cannot send contributors away nor merge code
>> > > without tests.
>> >
>> > In general, I think it is unreasonable for other maintainers to
>> > tell us to stop adding test coverage for migration, and would
>> > push back against such a request.
>>
>> We do not have infinite CI resources, unfortunately. Migration
>> is competing with everything else for time on CI. You have to
>> find a balance between "what do we run every time" and "what
>> do we only run when specifically testing a migration pullreq".
>> Similarly, there's a lot of iotests but we don't run all of them
>> for every block backend for every CI job via "make check".
>
> The combos we don't run for iotests are a good source of
> regressions too :-(
>
>> Long test times for tests run under "make check" are also bad
>> for individual developers -- if I'm running "make check" to
>> test a target/arm change I've made I don't really want that
>> to then spend 15 minutes testing the migration code that
>> I haven't touched and that is vanishingly unlikely to be
>> affected by my patches.
>
> Migration-test *used* to take 15 minutes to run, but that was a
> very long time ago. A run of it today is around 1m20.
>
> That said, if you are building multiple system emulators, we
> run the same test multiple times, and with the number of
> targets we have, that will be painful.
>
> That could be a good reason to split the migration-test into
> two distinct programs. One program that runs for every target,
> and one that is only run once, for some arbitrary "primary"
> target ?
What do you mean by distinct programs? It's not the migration-test that
decides on which targets it runs, it's meson.build. We register a test()
for each target, same as with any other qtest. Maybe I misunderstood
you...
> Or could we make use of glib's g_test_thorough
> for this - a primary target runs with "SPEED=through" and
> all other targets with normal settings. That would give us
> a way to optimize any of the qtests to reduce redundant
> testing where appropriate.
This still requires a new make target I think. Otherwise we'd run *all*
thorough tests for a QEMU target and not only migration-test in thorough
mode.
>
>
> If we move alot of testing out into a migration unit test,
> this also solves the redundancy problem.
>
>
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 13:51 ` Fabiano Rosas
@ 2024-10-18 13:54 ` Daniel P. Berrangé
2024-10-18 14:28 ` Fabiano Rosas
0 siblings, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 13:54 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Peter Maydell, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, Oct 18, 2024 at 10:51:03AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
> >> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >> >
> >> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> >> > > Daniel P. Berrangé <berrange@redhat.com> writes:
> >> > >
> >> > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> >> > > >> Recent changes to how we invoke the migration tests have
> >> > > >> (intentionally) caused them to not be part of the check-qtest target
> >> > > >> anymore. Add the check-migration-quick target so we don't lose
> >> > > >> migration code testing in this job.
> >> > > >
> >> > > > But 'check-migration-quick' is only the subset of migration tests,
> >> > > > 'check-migration' is all of the migration tests. So surely this is
> >> > > > a massive regressions in covage in CI pipelines.
> >> > >
> >> > > I'm not sure it is. There are tests there already for all the major
> >> > > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> >> > > tweak migration-quick to cover spots where we think we're losing
> >> > > coverage.
> >> >
> >> > Each of the tests in migration-test were added for a good reason,
> >> > generally to address testing gaps where we had functional regressions
> >> > in the past. I don't think its a good idea to stop running such tests
> >> > in CI as gating on new contributions. Any time we've had optional
> >> > tests in QEMU, we've seen repeated regressions in the area in question.
> >> >
> >> > > Since our CI offers nothing in terms of reproducibility or
> >> > > debuggability, I don't think it's productive to have an increasing
> >> > > amount of tests running in CI if that means we'll be dealing with
> >> > > timeouts and intermittent crashes constantly.
> >> >
> >> > Test reliability is a different thing. If a particular test is
> >> > flaky, it needs to either be fixed or disabled. Splitting into
> >> > a fast & slow grouping doesn't address reliability, just hides
> >> > the problem from view.
> >>
> >> A lot of the current reliability issue is timeouts -- sometimes
> >> our CI runners just run really slow (I have seen an example where
> >> between a normal and a slow run on the same commit both the
> >> compile and test times were 10x different...) So any test
> >> that is not a fast-to-complete is much much more likely to
> >> hit its timeout if the runner is running slowly. When I am
> >> doing CI testing for merges "migration test timed out again"
> >> is really really common.
> >
> > If its frequently timing out, then we've got the timeouts
> > wrong, or we have some genuine bugs in there to be fixed.
> >
> >> > > No disagreement here. But then I'm going to need advice on what to do
> >> > > when other maintainers ask us to stop writing migration tests because
> >> > > they take too long. I cannot send contributors away nor merge code
> >> > > without tests.
> >> >
> >> > In general, I think it is unreasonable for other maintainers to
> >> > tell us to stop adding test coverage for migration, and would
> >> > push back against such a request.
> >>
> >> We do not have infinite CI resources, unfortunately. Migration
> >> is competing with everything else for time on CI. You have to
> >> find a balance between "what do we run every time" and "what
> >> do we only run when specifically testing a migration pullreq".
> >> Similarly, there's a lot of iotests but we don't run all of them
> >> for every block backend for every CI job via "make check".
> >
> > The combos we don't run for iotests are a good source of
> > regressions too :-(
> >
> >> Long test times for tests run under "make check" are also bad
> >> for individual developers -- if I'm running "make check" to
> >> test a target/arm change I've made I don't really want that
> >> to then spend 15 minutes testing the migration code that
> >> I haven't touched and that is vanishingly unlikely to be
> >> affected by my patches.
> >
> > Migration-test *used* to take 15 minutes to run, but that was a
> > very long time ago. A run of it today is around 1m20.
> >
> > That said, if you are building multiple system emulators, we
> > run the same test multiple times, and with the number of
> > targets we have, that will be painful.
> >
> > That could be a good reason to split the migration-test into
> > two distinct programs. One program that runs for every target,
> > and one that is only run once, for some arbitrary "primary"
> > target ?
>
> What do you mean by distinct programs? It's not the migration-test that
> decides on which targets it runs, it's meson.build. We register a test()
> for each target, same as with any other qtest. Maybe I misunderstood
> you...
If we split, we could have meson.build register "migration-smoketest"
for every target while registering "migration-bigtest" for just 1 target.
> > Or could we make use of glib's g_test_thorough
> > for this - a primary target runs with "SPEED=through" and
> > all other targets with normal settings. That would give us
> > a way to optimize any of the qtests to reduce redundant
> > testing where appropriate.
>
> This still requires a new make target I think. Otherwise we'd run *all*
> thorough tests for a QEMU target and not only migration-test in thorough
> mode.
Yes, that's true, having separate programs is probably an easier
option than playing games with "SPEED" settings.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 13:54 ` Daniel P. Berrangé
@ 2024-10-18 14:28 ` Fabiano Rosas
2024-10-18 14:33 ` Daniel P. Berrangé
0 siblings, 1 reply; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-18 14:28 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Peter Maydell, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Fri, Oct 18, 2024 at 10:51:03AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>> > On Fri, Oct 18, 2024 at 10:46:55AM +0100, Peter Maydell wrote:
>> >> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
>> >> >
>> >> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
>> >> > > Daniel P. Berrangé <berrange@redhat.com> writes:
>> >> > >
>> >> > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> >> > > >> Recent changes to how we invoke the migration tests have
>> >> > > >> (intentionally) caused them to not be part of the check-qtest target
>> >> > > >> anymore. Add the check-migration-quick target so we don't lose
>> >> > > >> migration code testing in this job.
>> >> > > >
>> >> > > > But 'check-migration-quick' is only the subset of migration tests,
>> >> > > > 'check-migration' is all of the migration tests. So surely this is
>> >> > > > a massive regressions in covage in CI pipelines.
>> >> > >
>> >> > > I'm not sure it is. There are tests there already for all the major
>> >> > > parts of the code: precopy, postcopy, multifd, socket. Besides, we can
>> >> > > tweak migration-quick to cover spots where we think we're losing
>> >> > > coverage.
>> >> >
>> >> > Each of the tests in migration-test were added for a good reason,
>> >> > generally to address testing gaps where we had functional regressions
>> >> > in the past. I don't think its a good idea to stop running such tests
>> >> > in CI as gating on new contributions. Any time we've had optional
>> >> > tests in QEMU, we've seen repeated regressions in the area in question.
>> >> >
>> >> > > Since our CI offers nothing in terms of reproducibility or
>> >> > > debuggability, I don't think it's productive to have an increasing
>> >> > > amount of tests running in CI if that means we'll be dealing with
>> >> > > timeouts and intermittent crashes constantly.
>> >> >
>> >> > Test reliability is a different thing. If a particular test is
>> >> > flaky, it needs to either be fixed or disabled. Splitting into
>> >> > a fast & slow grouping doesn't address reliability, just hides
>> >> > the problem from view.
>> >>
>> >> A lot of the current reliability issue is timeouts -- sometimes
>> >> our CI runners just run really slow (I have seen an example where
>> >> between a normal and a slow run on the same commit both the
>> >> compile and test times were 10x different...) So any test
>> >> that is not a fast-to-complete is much much more likely to
>> >> hit its timeout if the runner is running slowly. When I am
>> >> doing CI testing for merges "migration test timed out again"
>> >> is really really common.
>> >
>> > If its frequently timing out, then we've got the timeouts
>> > wrong, or we have some genuine bugs in there to be fixed.
>> >
>> >> > > No disagreement here. But then I'm going to need advice on what to do
>> >> > > when other maintainers ask us to stop writing migration tests because
>> >> > > they take too long. I cannot send contributors away nor merge code
>> >> > > without tests.
>> >> >
>> >> > In general, I think it is unreasonable for other maintainers to
>> >> > tell us to stop adding test coverage for migration, and would
>> >> > push back against such a request.
>> >>
>> >> We do not have infinite CI resources, unfortunately. Migration
>> >> is competing with everything else for time on CI. You have to
>> >> find a balance between "what do we run every time" and "what
>> >> do we only run when specifically testing a migration pullreq".
>> >> Similarly, there's a lot of iotests but we don't run all of them
>> >> for every block backend for every CI job via "make check".
>> >
>> > The combos we don't run for iotests are a good source of
>> > regressions too :-(
>> >
>> >> Long test times for tests run under "make check" are also bad
>> >> for individual developers -- if I'm running "make check" to
>> >> test a target/arm change I've made I don't really want that
>> >> to then spend 15 minutes testing the migration code that
>> >> I haven't touched and that is vanishingly unlikely to be
>> >> affected by my patches.
>> >
>> > Migration-test *used* to take 15 minutes to run, but that was a
>> > very long time ago. A run of it today is around 1m20.
>> >
>> > That said, if you are building multiple system emulators, we
>> > run the same test multiple times, and with the number of
>> > targets we have, that will be painful.
>> >
>> > That could be a good reason to split the migration-test into
>> > two distinct programs. One program that runs for every target,
>> > and one that is only run once, for some arbitrary "primary"
>> > target ?
>>
>> What do you mean by distinct programs? It's not the migration-test that
>> decides on which targets it runs, it's meson.build. We register a test()
>> for each target, same as with any other qtest. Maybe I misunderstood
>> you...
>
> If we split, we could have meson.build register "migration-smoketest"
> for every target while registering "migration-bigtest" for just 1 target.
Isn't that a bunch of shuffling code around just to have two different
invocations of migration-test?
There's the possibility of using the gtester path like
/x86_64/migration/smoke and passing '-r' when invoking via meson. That
requires way fewer changes in the C code. It moves the complexity into
meson.build, which might be worse.
Alex's idea of full set for KVM arch + a TCG arch would probably be
trickier in meson.
>
>> > Or could we make use of glib's g_test_thorough
>> > for this - a primary target runs with "SPEED=through" and
>> > all other targets with normal settings. That would give us
>> > a way to optimize any of the qtests to reduce redundant
>> > testing where appropriate.
>>
>> This still requires a new make target I think. Otherwise we'd run *all*
>> thorough tests for a QEMU target and not only migration-test in thorough
>> mode.
>
> Yes, that's true, having separate programs is probably an easier
> option than playing games with "SPEED" settings.
>
>
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 14:28 ` Fabiano Rosas
@ 2024-10-18 14:33 ` Daniel P. Berrangé
0 siblings, 0 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 14:33 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Peter Maydell, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, Oct 18, 2024 at 11:28:53AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Fri, Oct 18, 2024 at 10:51:03AM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >>
> >> > That could be a good reason to split the migration-test into
> >> > two distinct programs. One program that runs for every target,
> >> > and one that is only run once, for some arbitrary "primary"
> >> > target ?
> >>
> >> What do you mean by distinct programs? It's not the migration-test that
> >> decides on which targets it runs, it's meson.build. We register a test()
> >> for each target, same as with any other qtest. Maybe I misunderstood
> >> you...
> >
> > If we split, we could have meson.build register "migration-smoketest"
> > for every target while registering "migration-bigtest" for just 1 target.
>
> Isn't that a bunch of shuffling code around just to have two different
> invocations of migration-test?
Yes, pretty much, but that's not an inherantly bad thing. Migration
is a bit of an outlier in its attempt to test the entire subsystem
from a single test binary.
> There's the possibility of using the gtester path like
> /x86_64/migration/smoke and passing '-r' when invoking via meson. That
> requires way fewer changes in the C code. It moves the complexity into
> meson.build, which might be worse.
>
> Alex's idea of full set for KVM arch + a TCG arch would probably be
> trickier in meson.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 9:01 ` Daniel P. Berrangé
2024-10-18 9:46 ` Peter Maydell
@ 2024-10-18 13:51 ` Fabiano Rosas
2024-10-18 14:21 ` Daniel P. Berrangé
2024-10-18 15:25 ` Peter Maydell
2 siblings, 1 reply; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-18 13:51 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> >> Recent changes to how we invoke the migration tests have
>> >> (intentionally) caused them to not be part of the check-qtest target
>> >> anymore. Add the check-migration-quick target so we don't lose
>> >> migration code testing in this job.
>> >
>> > But 'check-migration-quick' is only the subset of migration tests,
>> > 'check-migration' is all of the migration tests. So surely this is
>> > a massive regressions in covage in CI pipelines.
>>
>> I'm not sure it is. There are tests there already for all the major
>> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
>> tweak migration-quick to cover spots where we think we're losing
>> coverage.
>
> Each of the tests in migration-test were added for a good reason,
> generally to address testing gaps where we had functional regressions
> in the past. I don't think its a good idea to stop running such tests
> in CI as gating on new contributions. Any time we've had optional
> tests in QEMU, we've seen repeated regressions in the area in question.
>
>> Since our CI offers nothing in terms of reproducibility or
>> debuggability, I don't think it's productive to have an increasing
>> amount of tests running in CI if that means we'll be dealing with
>> timeouts and intermittent crashes constantly.
>
> Test reliability is a different thing. If a particular test is
> flaky, it needs to either be fixed or disabled.
The problem is that in this community the idea of "fix" is: wait until
someone with the appropriate skill level and interest stumbles upon the
problem on their own and fix it in anger.
For it to be a proper strategy, we'd need to create an issue in gitlab
referencing the bug, have a proper reproducer and encourage contributors
to work on the issue.
Iff the above was in place, then we could disable the test. Otherwise
the test just sits there disabled. In the past there were even tests
committed that *never* ran.
The situation with multifd/cancel alone is absurd:
- It was disabled in March 2023 and stood there *not testing anything*
while a major refactoring of the test code was happening.
- The test was fixed in June 2023, but not reenabled in fear of getting
flak from the community for breaking CI again (or at least that's the
feeling I got from talking to Juan).
- mapped-ram (which relies entirely on multifd) started being worked on
and I had to enable the test in my own branch to be able to test the
code properly. While disabled, it caught several issues in mapped-ram.
- In October 2023 the test is re-enabled an immediately exposes issues
in the code.
This is how I started working on the migration code. Maybe you can
appreciate why I don't feel confident about this fix or disable
strategy. It has eaten many hours of my work.
> Splitting into
> a fast & slow grouping doesn't address reliability, just hides
> the problem from view.
Right, and is that not the same as FLAKY? What good is keeping a test in
view if the only people that can fix it are the ones that would be
seeing the breakage constantly in their own branches anyway?
>
>> > Experience shows us that relying on humans to run tests periodically
>> > doesn't work well, and they'll slowly bit rot. Migration maintainers
>> > don't have a way to run this as gating test for every pull request
>> > that merges, and pull requests coming from non-migration maintainers
>> > can still break migration code.
>>
>> Right, but migration code would still be tested with migration-quick
>> which is executed at every make check. Do we really need the full set in
>> every pull request? We must draw a line somewhere, otherwise make check
>> will just balloon in duration.
>
> Again, the tests all exist because migration code is incredibly
> complicated, with a lot of permutations, with a history of being
> very bug / regression prone. With that in mind, it is unavoidable
> that we're going to have a significant testing overhead for
> migration code.
Yep, we should be testing way more actually.
>
> Looking at its execution time right now, I'd say migration test
> is pretty good, considering the permutations we have to target.
>
> It gets a bad reputation because historically it has been as
> much as x20 slower than it is today, and has also struggled
> with reliability. The latter is a reflection of the complexity
> of migration and and IMHO actually justifies greater testing,
> as long as we put in time to address bugs.
>
> Also we've got one single test program, covering an entire
> subsystem in one go, rather than lots of short individual
> test programs, so migration unfairly gets blamed for being
> slow, when it simply covers alot of functionality in one
> program.
And still you think it's not worth it having a separate target for
testing migration. FWIW, I also proposed splittling it into multiple
meson tests, which you also rejected. It would be so much easier to move
all of this into a separate target and let those who want nothing do to
with to just ignore it.
>
>> > Any tests in tree need to be exercised by CI as the minimum bar
>> > to prevent bit rot from merges.
>> >
>>
>> No disagreement here. But then I'm going to need advice on what to do
>> when other maintainers ask us to stop writing migration tests because
>> they take too long. I cannot send contributors away nor merge code
>> without tests.
>
> In general, I think it is unreasonable for other maintainers to
> tell us to stop adding test coverage for migration, and would
> push back against such a request.
This might be a larger issue in QEMU. I also heard the same back in 2021
when doing ppc work: "don't add too many tests because the CI buckles
and people get mad". The same with adding too much logging really. We're
hostages to the gitlab CI unfortunately.
>
> We should, however, continue to optimize how we add further test
> coverage, where practical, overload testing of multiple features
> onto a single test case helps.
>
> We've already massively optimized the migration-test compared to
> its historical behaviour.
>
> A potentially bigger win could be seen if we change how we exercise
> the migration functionality. Since we had the migration qtest that
> runs a full migration operation, we've tended to expand testing by
> adding new qtest functions. ie we've added a functional test for
> everything we want covered. This is nice & simple, but also expensive.
> We've ignored unit testing, which I think is a mistake.
>
> If i look at the test list:
>
> # /x86_64/migration/bad_dest
> # /x86_64/migration/analyze-script
> # /x86_64/migration/validate_uuid
> # /x86_64/migration/validate_uuid_error
> # /x86_64/migration/validate_uuid_src_not_set
> # /x86_64/migration/validate_uuid_dst_not_set
> # /x86_64/migration/dirty_ring
> # /x86_64/migration/precopy/file
> # /x86_64/migration/precopy/unix/plain
> # /x86_64/migration/precopy/unix/suspend/live
> # /x86_64/migration/precopy/unix/suspend/notlive
> # /x86_64/migration/precopy/unix/tls/psk
> # /x86_64/migration/precopy/unix/tls/x509/default-host
> # /x86_64/migration/precopy/unix/tls/x509/override-host
> # /x86_64/migration/precopy/file/offset
> # /x86_64/migration/precopy/file/mapped-ram
> # /x86_64/migration/precopy/file/offset/fdset
> # /x86_64/migration/precopy/file/offset/bad
> # /x86_64/migration/precopy/file/mapped-ram/live
> # /x86_64/migration/precopy/tcp/plain
> # /x86_64/migration/precopy/tcp/plain/switchover-ack
> # /x86_64/migration/precopy/tcp/tls/psk/match
> # /x86_64/migration/precopy/tcp/tls/psk/mismatch
> # /x86_64/migration/precopy/tcp/tls/x509/default-host
> # /x86_64/migration/precopy/tcp/tls/x509/override-host
> # /x86_64/migration/precopy/tcp/tls/x509/mismatch-host
> # /x86_64/migration/precopy/tcp/tls/x509/friendly-client
> # /x86_64/migration/precopy/tcp/tls/x509/hostile-client
> # /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client
> # /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client
> # /x86_64/migration/precopy/fd/tcp
> # /x86_64/migration/precopy/fd/file
> # /x86_64/migration/multifd/file/mapped-ram
> # /x86_64/migration/multifd/file/mapped-ram/live
> # /x86_64/migration/multifd/file/mapped-ram/dio
> # /x86_64/migration/multifd/file/mapped-ram/fdset
> # /x86_64/migration/multifd/file/mapped-ram/fdset/dio
> # /x86_64/migration/multifd/tcp/uri/plain/none
> # /x86_64/migration/multifd/tcp/channels/plain/none
> # /x86_64/migration/multifd/tcp/plain/cancel
> # /x86_64/migration/multifd/tcp/plain/zlib
> # /x86_64/migration/multifd/tcp/plain/zstd
> # /x86_64/migration/multifd/tcp/plain/zero-page/legacy
> # /x86_64/migration/multifd/tcp/plain/zero-page/none
> # /x86_64/migration/multifd/tcp/tls/psk/match
> # /x86_64/migration/multifd/tcp/tls/psk/mismatch
> # /x86_64/migration/multifd/tcp/tls/x509/default-host
> # /x86_64/migration/multifd/tcp/tls/x509/override-host
> # /x86_64/migration/multifd/tcp/tls/x509/mismatch-host
> # /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client
> # /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
> # /x86_64/migration/validate_uri/channels/both_set
> # /x86_64/migration/validate_uri/channels/none_set
>
> Individually none of those is very slow on its own - 10 are in
> the 2-3 second range, 35 are 1-2 secs, and 6 are less than
> 1 second.
>
> A very large portion of those are validating different ways to
> establish migration. Hardly any of them actually need to run
> a migration to completion. Even without running to completion
> though, we have the overheads of spawning 2 QEMUs.
>
> This feels like something that should be amenable to unit testing.
> Might need a little re-factoring of migration code to make it
> easier to run a subset of its logic in isolation, but that'd
> probably be a win anyway, as such work usually makes code cleaner.
I'll invest some time in that. I've no idea how we do unit testing in
QEMU.
>
>> Do we need nightly CI runs? Unit tests? Bear in mind there's a resource
>> allocation issue there. Addressing problems with timeouts/races in our
>> CI is not something any random person can do.
>
> In terms of running time, I think migration-test is acceptable as it is
> to run in 'make check' by default and doesn't justify dropping test
> coverage. We should still look to optimize & move to unit testing more
> code, and any reliability issues are something that needs to be addressed
> too.
>
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 13:51 ` Fabiano Rosas
@ 2024-10-18 14:21 ` Daniel P. Berrangé
2024-10-18 14:47 ` Fabiano Rosas
0 siblings, 1 reply; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 14:21 UTC (permalink / raw)
To: Fabiano Rosas
Cc: qemu-devel, Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, Oct 18, 2024 at 10:51:08AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
>
> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >>
> >> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> >> >> Recent changes to how we invoke the migration tests have
> >> >> (intentionally) caused them to not be part of the check-qtest target
> >> >> anymore. Add the check-migration-quick target so we don't lose
> >> >> migration code testing in this job.
> >> >
> >> > But 'check-migration-quick' is only the subset of migration tests,
> >> > 'check-migration' is all of the migration tests. So surely this is
> >> > a massive regressions in covage in CI pipelines.
> >>
> >> I'm not sure it is. There are tests there already for all the major
> >> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> >> tweak migration-quick to cover spots where we think we're losing
> >> coverage.
> >
> > Each of the tests in migration-test were added for a good reason,
> > generally to address testing gaps where we had functional regressions
> > in the past. I don't think its a good idea to stop running such tests
> > in CI as gating on new contributions. Any time we've had optional
> > tests in QEMU, we've seen repeated regressions in the area in question.
> >
> >> Since our CI offers nothing in terms of reproducibility or
> >> debuggability, I don't think it's productive to have an increasing
> >> amount of tests running in CI if that means we'll be dealing with
> >> timeouts and intermittent crashes constantly.
> >
> > Test reliability is a different thing. If a particular test is
> > flaky, it needs to either be fixed or disabled.
>
> The problem is that in this community the idea of "fix" is: wait until
> someone with the appropriate skill level and interest stumbles upon the
> problem on their own and fix it in anger.
>
> For it to be a proper strategy, we'd need to create an issue in gitlab
> referencing the bug, have a proper reproducer and encourage contributors
> to work on the issue.
It is policy that we should be creating issues in gitlab for any
flaky tests. I wouldn't say we've been perfect at that, but we
should be doing it, and that link ought to be linked in the code
if we disable the test there.
> - It was disabled in March 2023 and stood there *not testing anything*
> while a major refactoring of the test code was happening.
>
> - The test was fixed in June 2023, but not reenabled in fear of getting
> flak from the community for breaking CI again (or at least that's the
> feeling I got from talking to Juan).
>
> - mapped-ram (which relies entirely on multifd) started being worked on
> and I had to enable the test in my own branch to be able to test the
> code properly. While disabled, it caught several issues in mapped-ram.
>
> - In October 2023 the test is re-enabled an immediately exposes issues
> in the code.
>
> This is how I started working on the migration code. Maybe you can
> appreciate why I don't feel confident about this fix or disable
> strategy. It has eaten many hours of my work.
The migration subsystem was definitely suffering from insufficient
maintainer resources available historically, which is reflected
in some of the testing problems we've had & largely how I ended
up spending so much time on migration code too.
> > Looking at its execution time right now, I'd say migration test
> > is pretty good, considering the permutations we have to target.
> >
> > It gets a bad reputation because historically it has been as
> > much as x20 slower than it is today, and has also struggled
> > with reliability. The latter is a reflection of the complexity
> > of migration and and IMHO actually justifies greater testing,
> > as long as we put in time to address bugs.
> >
> > Also we've got one single test program, covering an entire
> > subsystem in one go, rather than lots of short individual
> > test programs, so migration unfairly gets blamed for being
> > slow, when it simply covers alot of functionality in one
> > program.
>
> And still you think it's not worth it having a separate target for
> testing migration. FWIW, I also proposed splittling it into multiple
> meson tests, which you also rejected. It would be so much easier to move
> all of this into a separate target and let those who want nothing do to
> with to just ignore it.
In the qtests/meson.build, I see we register separate
suites - a generic "qtest" suite, and a "qtest-$TARGET"
suite. What's missing here is a suite for subsystem
classification, which I guess is more or less what you
proposed here for migration.
How about (in addition to the idea of splitting migration-test
into one part run for all targets, and one part run for just
one target), we generalize this concept to work for any
subsystem tagging
qtest_subsystems = {
'migration-test': ['migration'],
'cdrom-test': ['block'],
'ahci-test': ['block'],
....
}
then when registering tests we could do
suites = ['qtest', 'qtest-' + target_base]
foreach subsys: qtest_subsystems.get(test, [])
suites += ['qtest-' + subsys, 'qtest-' + target_base + '-' + subsys]
endforeach
test(....
suite: suites)
that would give us a way to run just the migration tests, or
just the migration tests on x86_64, etc, likewise for other
subsystems we want to tag, while still keeping 'make check-qtest'
as the full coverage.
> >> > Any tests in tree need to be exercised by CI as the minimum bar
> >> > to prevent bit rot from merges.
> >> >
> >>
> >> No disagreement here. But then I'm going to need advice on what to do
> >> when other maintainers ask us to stop writing migration tests because
> >> they take too long. I cannot send contributors away nor merge code
> >> without tests.
> >
> > In general, I think it is unreasonable for other maintainers to
> > tell us to stop adding test coverage for migration, and would
> > push back against such a request.
>
> This might be a larger issue in QEMU. I also heard the same back in 2021
> when doing ppc work: "don't add too many tests because the CI buckles
> and people get mad". The same with adding too much logging really. We're
> hostages to the gitlab CI unfortunately.
Yep, we need more investment in our CI resources. There were some
ideas discussed at KVM Forum that could help with this, which
should be publicised soon.
> > This feels like something that should be amenable to unit testing.
> > Might need a little re-factoring of migration code to make it
> > easier to run a subset of its logic in isolation, but that'd
> > probably be a win anyway, as such work usually makes code cleaner.
>
> I'll invest some time in that. I've no idea how we do unit testing in
> QEMU.
Mostly starts with the standard glib test program setup, where by
you create a tests/unit/test-<blah>.c file, with functions for
each test case that you register with g_test_add, same as qtest.
The key difference from qtest is that you then just directly
link the test binary to the code to be tested, and call into
it internal QEMU APIs directly. In this case, it would involve
linking to the 'libmigration' meson static library object.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 14:21 ` Daniel P. Berrangé
@ 2024-10-18 14:47 ` Fabiano Rosas
0 siblings, 0 replies; 23+ messages in thread
From: Fabiano Rosas @ 2024-10-18 14:47 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: qemu-devel, Peter Xu, Peter Maydell, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
Daniel P. Berrangé <berrange@redhat.com> writes:
> On Fri, Oct 18, 2024 at 10:51:08AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>>
>> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
>> >> Daniel P. Berrangé <berrange@redhat.com> writes:
>> >>
>> >> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> >> >> Recent changes to how we invoke the migration tests have
>> >> >> (intentionally) caused them to not be part of the check-qtest target
>> >> >> anymore. Add the check-migration-quick target so we don't lose
>> >> >> migration code testing in this job.
>> >> >
>> >> > But 'check-migration-quick' is only the subset of migration tests,
>> >> > 'check-migration' is all of the migration tests. So surely this is
>> >> > a massive regressions in covage in CI pipelines.
>> >>
>> >> I'm not sure it is. There are tests there already for all the major
>> >> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
>> >> tweak migration-quick to cover spots where we think we're losing
>> >> coverage.
>> >
>> > Each of the tests in migration-test were added for a good reason,
>> > generally to address testing gaps where we had functional regressions
>> > in the past. I don't think its a good idea to stop running such tests
>> > in CI as gating on new contributions. Any time we've had optional
>> > tests in QEMU, we've seen repeated regressions in the area in question.
>> >
>> >> Since our CI offers nothing in terms of reproducibility or
>> >> debuggability, I don't think it's productive to have an increasing
>> >> amount of tests running in CI if that means we'll be dealing with
>> >> timeouts and intermittent crashes constantly.
>> >
>> > Test reliability is a different thing. If a particular test is
>> > flaky, it needs to either be fixed or disabled.
>>
>> The problem is that in this community the idea of "fix" is: wait until
>> someone with the appropriate skill level and interest stumbles upon the
>> problem on their own and fix it in anger.
>>
>> For it to be a proper strategy, we'd need to create an issue in gitlab
>> referencing the bug, have a proper reproducer and encourage contributors
>> to work on the issue.
>
> It is policy that we should be creating issues in gitlab for any
> flaky tests. I wouldn't say we've been perfect at that, but we
> should be doing it, and that link ought to be linked in the code
> if we disable the test there.
>
>> - It was disabled in March 2023 and stood there *not testing anything*
>> while a major refactoring of the test code was happening.
>>
>> - The test was fixed in June 2023, but not reenabled in fear of getting
>> flak from the community for breaking CI again (or at least that's the
>> feeling I got from talking to Juan).
>>
>> - mapped-ram (which relies entirely on multifd) started being worked on
>> and I had to enable the test in my own branch to be able to test the
>> code properly. While disabled, it caught several issues in mapped-ram.
>>
>> - In October 2023 the test is re-enabled an immediately exposes issues
>> in the code.
>>
>> This is how I started working on the migration code. Maybe you can
>> appreciate why I don't feel confident about this fix or disable
>> strategy. It has eaten many hours of my work.
>
> The migration subsystem was definitely suffering from insufficient
> maintainer resources available historically, which is reflected
> in some of the testing problems we've had & largely how I ended
> up spending so much time on migration code too.
>
>> > Looking at its execution time right now, I'd say migration test
>> > is pretty good, considering the permutations we have to target.
>> >
>> > It gets a bad reputation because historically it has been as
>> > much as x20 slower than it is today, and has also struggled
>> > with reliability. The latter is a reflection of the complexity
>> > of migration and and IMHO actually justifies greater testing,
>> > as long as we put in time to address bugs.
>> >
>> > Also we've got one single test program, covering an entire
>> > subsystem in one go, rather than lots of short individual
>> > test programs, so migration unfairly gets blamed for being
>> > slow, when it simply covers alot of functionality in one
>> > program.
>>
>> And still you think it's not worth it having a separate target for
>> testing migration. FWIW, I also proposed splittling it into multiple
>> meson tests, which you also rejected. It would be so much easier to move
>> all of this into a separate target and let those who want nothing do to
>> with to just ignore it.
>
> In the qtests/meson.build, I see we register separate
> suites - a generic "qtest" suite, and a "qtest-$TARGET"
> suite. What's missing here is a suite for subsystem
> classification, which I guess is more or less what you
> proposed here for migration.
>
> How about (in addition to the idea of splitting migration-test
> into one part run for all targets, and one part run for just
> one target), we generalize this concept to work for any
> subsystem tagging
>
> qtest_subsystems = {
> 'migration-test': ['migration'],
> 'cdrom-test': ['block'],
> 'ahci-test': ['block'],
> ....
> }
This is interesting because it would allow a test to be considered in
more than one subsystem. There are many tests that invoke migration
themselves.
>
>
> then when registering tests we could do
>
> suites = ['qtest', 'qtest-' + target_base]
> foreach subsys: qtest_subsystems.get(test, [])
> suites += ['qtest-' + subsys, 'qtest-' + target_base + '-' + subsys]
Hopefully meson won't take this as a hint to construct a 1000 characters
long line when invoking the test.
> endforeach
>
> test(....
> suite: suites)
>
> that would give us a way to run just the migration tests, or
> just the migration tests on x86_64, etc, likewise for other
> subsystems we want to tag, while still keeping 'make check-qtest'
> as the full coverage.
>
>> >> > Any tests in tree need to be exercised by CI as the minimum bar
>> >> > to prevent bit rot from merges.
>> >> >
>> >>
>> >> No disagreement here. But then I'm going to need advice on what to do
>> >> when other maintainers ask us to stop writing migration tests because
>> >> they take too long. I cannot send contributors away nor merge code
>> >> without tests.
>> >
>> > In general, I think it is unreasonable for other maintainers to
>> > tell us to stop adding test coverage for migration, and would
>> > push back against such a request.
>>
>> This might be a larger issue in QEMU. I also heard the same back in 2021
>> when doing ppc work: "don't add too many tests because the CI buckles
>> and people get mad". The same with adding too much logging really. We're
>> hostages to the gitlab CI unfortunately.
>
> Yep, we need more investment in our CI resources. There were some
> ideas discussed at KVM Forum that could help with this, which
> should be publicised soon.
Great. Looking forward to those.
>
>
>> > This feels like something that should be amenable to unit testing.
>> > Might need a little re-factoring of migration code to make it
>> > easier to run a subset of its logic in isolation, but that'd
>> > probably be a win anyway, as such work usually makes code cleaner.
>>
>> I'll invest some time in that. I've no idea how we do unit testing in
>> QEMU.
>
> Mostly starts with the standard glib test program setup, where by
> you create a tests/unit/test-<blah>.c file, with functions for
> each test case that you register with g_test_add, same as qtest.
>
> The key difference from qtest is that you then just directly
> link the test binary to the code to be tested, and call into
> it internal QEMU APIs directly. In this case, it would involve
> linking to the 'libmigration' meson static library object.
Thanks for the introduction. I'll pick some self-contained part of
migration and see how far we are from being able to write unit tests.
>
>
> With regards,
> Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 9:01 ` Daniel P. Berrangé
2024-10-18 9:46 ` Peter Maydell
2024-10-18 13:51 ` Fabiano Rosas
@ 2024-10-18 15:25 ` Peter Maydell
2024-10-18 16:12 ` Daniel P. Berrangé
2024-10-21 14:55 ` Daniel P. Berrangé
2 siblings, 2 replies; 23+ messages in thread
From: Peter Maydell @ 2024-10-18 15:25 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Fabiano Rosas, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
> Test reliability is a different thing. If a particular test is
> flaky, it needs to either be fixed or disabled. Splitting into
> a fast & slow grouping doesn't address reliability, just hides
> the problem from view.
On the subject of 'flaky', here's another low-repeatability
intermittent with migration-test that I just ran into in
'make vm-build-openbsd':
▶ 97/916 /ppc64/migration/multifd/tcp/plain/cancel
OK
▶ 96/916 /i386/migration/precopy/tcp/tls/x509/allow-anon-client
OK
▶ 97/916 /ppc64/migration/multifd/tcp/plain/zlib -
ERROR:../src/tests/qtest/migration-helpers.c:322:check_migration_status:
assertion failed (current_status != "failed"): ("failed" != "failed")
FAIL
▶ 97/916
ERROR
▶ 95/916 /aarch64/migration/multifd/tcp/channels/plain/none
OK
97/916 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test
ERROR 134.38s killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
stderr:
warning: fd: migration to a file is deprecated. Use file: instead.
warning: fd: migration to a file is deprecated. Use file: instead.
**
ERROR:../src/tests/qtest/migration-helpers.c:322:check_migration_status:
assertion failed (current_status != "failed"): ("failed" != "failed")
qemu-system-ppc64: Failed to connect to '127.0.0.1:24109': Address
already in use
qemu-system-ppc64: Failed to peek at channel
(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
Probably https://lore.kernel.org/qemu-devel/CAFEAcA8p9BKCrn9EfFXTpHE+5w-_8zhtE_52SpZLuS-+zpF5Gg@mail.gmail.com/
in a slightly different form.
thanks
-- PMM
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 15:25 ` Peter Maydell
@ 2024-10-18 16:12 ` Daniel P. Berrangé
2024-10-21 14:55 ` Daniel P. Berrangé
1 sibling, 0 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-18 16:12 UTC (permalink / raw)
To: Peter Maydell
Cc: Fabiano Rosas, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, Oct 18, 2024 at 04:25:07PM +0100, Peter Maydell wrote:
> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > Test reliability is a different thing. If a particular test is
> > flaky, it needs to either be fixed or disabled. Splitting into
> > a fast & slow grouping doesn't address reliability, just hides
> > the problem from view.
>
> On the subject of 'flaky', here's another low-repeatability
> intermittent with migration-test that I just ran into in
> 'make vm-build-openbsd':
>
> ▶ 97/916 /ppc64/migration/multifd/tcp/plain/cancel
> OK
> ▶ 96/916 /i386/migration/precopy/tcp/tls/x509/allow-anon-client
> OK
> ▶ 97/916 /ppc64/migration/multifd/tcp/plain/zlib -
> ERROR:../src/tests/qtest/migration-helpers.c:322:check_migration_status:
> assertion failed (current_status != "failed"): ("failed" != "failed")
> FAIL
> ▶ 97/916
> ERROR
> ▶ 95/916 /aarch64/migration/multifd/tcp/channels/plain/none
> OK
> 97/916 qemu:qtest+qtest-ppc64 / qtest-ppc64/migration-test
> ERROR 134.38s killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
> stderr:
> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.
> **
> ERROR:../src/tests/qtest/migration-helpers.c:322:check_migration_status:
> assertion failed (current_status != "failed"): ("failed" != "failed")
> qemu-system-ppc64: Failed to connect to '127.0.0.1:24109': Address
> already in use
This is interesting, as I suspect it is a sign of a genuine portability
problem, as there are some subtle sockets binding differences between
Linux and *BSDs IIRC.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job
2024-10-18 15:25 ` Peter Maydell
2024-10-18 16:12 ` Daniel P. Berrangé
@ 2024-10-21 14:55 ` Daniel P. Berrangé
1 sibling, 0 replies; 23+ messages in thread
From: Daniel P. Berrangé @ 2024-10-21 14:55 UTC (permalink / raw)
To: Peter Maydell
Cc: Fabiano Rosas, qemu-devel, Peter Xu, Alex Bennée,
Philippe Mathieu-Daudé, Thomas Huth,
Wainer dos Santos Moschetta
On Fri, Oct 18, 2024 at 04:25:07PM +0100, Peter Maydell wrote:
> On Fri, 18 Oct 2024 at 10:01, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > Test reliability is a different thing. If a particular test is
> > flaky, it needs to either be fixed or disabled. Splitting into
> > a fast & slow grouping doesn't address reliability, just hides
> > the problem from view.
>
> On the subject of 'flaky', here's another low-repeatability
> intermittent with migration-test that I just ran into in
> 'make vm-build-openbsd':
> ――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
> stderr:
> warning: fd: migration to a file is deprecated. Use file: instead.
> warning: fd: migration to a file is deprecated. Use file: instead.
> **
> ERROR:../src/tests/qtest/migration-helpers.c:322:check_migration_status:
> assertion failed (current_status != "failed"): ("failed" != "failed")
> qemu-system-ppc64: Failed to connect to '127.0.0.1:24109': Address
> already in use
> qemu-system-ppc64: Failed to peek at channel
>
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>
> Probably https://lore.kernel.org/qemu-devel/CAFEAcA8p9BKCrn9EfFXTpHE+5w-_8zhtE_52SpZLuS-+zpF5Gg@mail.gmail.com/
I think I've finally found the root cause of this bug. Setting SO_REUSEADDR
on client sockets on OpenBSD causes re-use of ports in TIME_WAIT when
auto-assigning a local bind address for connections. I've sent a patch to
remove this, since it is essentially pointless todo this AFAIK.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 23+ messages in thread