* [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
@ 2024-07-11 21:52 Yichen Wang
2024-07-11 21:52 ` [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
` (11 more replies)
0 siblings, 12 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
v5
* Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
* Rename struct definitions with typedef and CamelCase names;
* Add build and runtime checks about DSA accelerator;
* Address all comments from v4 reviews about typos, licenses, comments,
error reporting, etc.
v4
* Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
* A separate "multifd zero page checking" patchset was split from this
patchset's v3 and got merged into master. v4 re-applied the rest of all
commits on top of that patchset, re-factored and re-tested.
https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
* Some feedback from v3 may have been overlooked.
v3
* Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
* Fix error/warning from checkpatch.pl
* Fix use-after-free bug when multifd-dsa-accel option is not set.
* Handle error from dsa_init and correctly propagate the error.
* Remove unnecessary call to dsa_stop.
* Detect availability of DSA feature at compile time.
* Implement a generic batch_task structure and a DSA-specific one, dsa_batch_task.
* Remove all exit() calls and propagate errors correctly.
* Use bytes instead of page count to configure multifd-packet-size option.
v2
* Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
* Leave Juan's changes in their original form instead of squashing them.
* Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
* Use page count to configure multifd-packet-size option.
* Don't use the FLAKY flag in DSA tests.
* Test if the DSA integration test is set up correctly and skip the test if
not.
* Fixed broken link in the previous patch cover.
* Background:
I posted an RFC about DSA offloading in QEMU:
https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/
This patchset implements DSA offloading of zero page checking in the
multifd live migration code path.
* Overview:
Intel Data Streaming Accelerator (DSA) was introduced in Intel's 4th generation
Xeon servers, aka Sapphire Rapids.
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
One of the things DSA can do is offload memory comparison workloads from the
CPU to the accelerator hardware. This patchset implements a solution to offload
QEMU's zero page checking from the CPU to the DSA accelerator hardware. We gain
two benefits from this change:
1. Reduces CPU usage in the multifd live migration workflow across all use
cases.
2. Reduces migration total time in some use cases.
* Design:
These are the logical steps to perform DSA offloading:
1. Configure DSA accelerators and create user-space-accessible DSA work
queues via the idxd driver.
2. Map the DSA work queue into the user space address space.
3. Fill an in-memory task descriptor to describe the memory operation.
4. Use the dedicated CPU instruction _enqcmd to submit a task descriptor
to the work queue.
5. Poll the task descriptor's completion status field until the task
completes.
6. Check the return status.
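For illustration, here is a minimal sketch of steps 3-6 using the idxd UAPI
structures and the ENQCMD/PAUSE intrinsics. It assumes a work queue portal
already mmap()ed at wq_portal and a compiler/CPU with enqcmd support; the
helper name, the missing retry limit and the missing error handling are only
for the example:

#include <linux/idxd.h>   /* struct dsa_hw_desc, struct dsa_completion_record */
#include <x86intrin.h>    /* _enqcmd, _mm_sfence, _mm_pause */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if 'page' contains only zero bytes. */
static bool __attribute__((target("enqcmd")))
dsa_page_is_zero(void *wq_portal, const void *page, size_t len)
{
    struct dsa_completion_record comp __attribute__((aligned(32))) = { 0 };
    struct dsa_hw_desc desc = {
        .opcode = DSA_OPCODE_COMPVAL,                   /* compare against a pattern */
        .flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV,  /* request a completion record */
        .comp_pattern = 0,                              /* the pattern is all zeros */
        .src_addr = (uintptr_t)page,
        .xfer_size = len,
        .completion_addr = (uintptr_t)&comp,
    };

    _mm_sfence();                          /* make the descriptor globally visible */
    while (_enqcmd(wq_portal, &desc)) {    /* step 4: retry while the WQ is full */
        _mm_pause();
    }
    while (comp.status == DSA_COMP_NONE) { /* step 5: poll the completion record */
        _mm_pause();
    }
    return comp.status == DSA_COMP_SUCCESS && comp.result == 0; /* step 6 */
}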
The memory operation is now done entirely by the accelerator hardware, but
the new workflow introduces overhead: the extra CPU cost of preparing and
submitting the task descriptors, and the extra CPU cost of polling for task
completion. The design is built around minimizing these two overheads.
1. To reduce the overhead of task preparation and submission,
we use batch descriptors. A batch descriptor contains N individual
zero page checking tasks, where the default N is 128 (default packet size
/ page size), and N can be increased by setting the packet size via a new
migration option.
2. Each multifd sender thread prepares and submits batch tasks to the DSA
hardware and waits on a synchronization object for task completion.
Whenever a DSA task is submitted, the task structure is added to a
thread-safe queue. It is safe for multiple multifd sender threads to
submit tasks concurrently.
3. Multiple DSA hardware devices can be used. During multifd initialization,
every sender thread will be assigned a DSA device to work with. We
use a round-robin scheme to evenly distribute the work across all used
DSA devices.
4. A dedicated thread, dsa_completion, performs busy polling for all
DSA task completions. The thread keeps dequeuing DSA tasks from the
thread-safe queue and blocks when there is no outstanding DSA task.
When polling for completion of a DSA task, the thread uses the CPU
instruction _mm_pause between the iterations of the busy loop to save some
CPU power as well as to free up core resources for the sibling hyperthread.
5. The DSA accelerator can encounter errors. The most common error is a
page fault. We have tested letting the device handle page faults, but
performance is poor. Right now, if DSA hits a page fault, we fall back to
the CPU to complete the rest of the work. The CPU fallback is done in
the multifd sender thread (see the sketch after this list).
6. Added a new migration option multifd-dsa-accel to set the DSA device
path. If set, the multifd workflow will leverage the DSA devices for
offloading.
7. Added a new migration option multifd-normal-page-ratio to make
multifd live migration easier to test. Setting a normal page ratio will
make live migration treat a zero page as a normal page and send
the entire payload over the network. This option is useful when we want
to send a large network payload and analyze throughput.
8. Added a new migration option multifd-packet-size. This can increase
the number of pages being zero page checked and sent over the network.
The extra synchronization between the sender threads and the DSA
completion thread is an overhead. Using a large packet size can reduce
that overhead.
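To make item 5 above concrete, here is a minimal sketch of the kind of CPU
fallback described there. The helper name and the result arrays are
hypothetical; the actual fallback lives in the multifd sender path and uses
the per-page DSA completion status:

#include <stdbool.h>
#include <stddef.h>
#include "qemu/cutils.h"   /* buffer_is_zero() */

/*
 * Hypothetical fallback: for every page whose DSA completion did not
 * report success (e.g. because of a page fault), redo the zero check
 * on the CPU. completed_on_dsa[] is assumed to be filled in by the
 * completion thread; is_zero[] holds the final per-page result.
 */
static void zero_check_cpu_fallback(const void **pages, size_t count,
                                    size_t page_size,
                                    const bool *completed_on_dsa,
                                    bool *is_zero)
{
    for (size_t i = 0; i < count; i++) {
        if (!completed_on_dsa[i]) {
            is_zero[i] = buffer_is_zero(pages[i], page_size);
        }
    }
}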
* Performance:
We use two Intel 4th generation Xeon servers for testing.
Architecture: x86_64
CPU(s): 192
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8457C
Stepping: 8
CPU MHz: 2538.624
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
We perform multifd live migration with the setup below:
1. VM has 100GB memory.
2. Use the new migration option multifd-set-normal-page-ratio to control the total
size of the payload sent over the network.
3. Use 8 multifd channels.
4. Use tcp for live migration.
5. Use CPU to perform zero page checking as the baseline.
6. Use one DSA device to offload zero page checking to compare with the baseline.
7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.
CPU usage
|---------------|---------------|---------------|---------------|
| |comm |runtime(msec) |totaltime(msec)|
|---------------|---------------|---------------|---------------|
|Baseline |live_migration |5657.58 | |
| |multifdsend_0 |3931.563 | |
| |multifdsend_1 |4405.273 | |
| |multifdsend_2 |3941.968 | |
| |multifdsend_3 |5032.975 | |
| |multifdsend_4 |4533.865 | |
| |multifdsend_5 |4530.461 | |
| |multifdsend_6 |5171.916 | |
| |multifdsend_7 |4722.769 |41922 |
|---------------|---------------|---------------|---------------|
|DSA |live_migration |6129.168 | |
| |multifdsend_0 |2954.717 | |
| |multifdsend_1 |2766.359 | |
| |multifdsend_2 |2853.519 | |
| |multifdsend_3 |2740.717 | |
| |multifdsend_4 |2824.169 | |
| |multifdsend_5 |2966.908 | |
| |multifdsend_6 |2611.137 | |
| |multifdsend_7 |3114.732 | |
| |dsa_completion |3612.564 |32568 |
|---------------|---------------|---------------|---------------|
Baseline total runtime is calculated by adding up the runtime of all
multifdsend_X and live_migration threads. DSA offloading total runtime is
calculated by adding up the runtime of all multifdsend_X, live_migration,
and dsa_completion threads. 41922 msec vs 32568 msec runtime, which is a
23% saving in total CPU usage.
Latency
|---------------|---------------|---------------|---------------|---------------|---------------|
| |total time |down time |throughput |transferred-ram|total-ram |
|---------------|---------------|---------------|---------------|---------------|---------------|
|Baseline |10343 ms |161 ms |41007.00 mbps |51583797 kb |102400520 kb |
|---------------|---------------|---------------|---------------|-------------------------------|
|DSA offload |9535 ms |135 ms |46554.40 mbps |53947545 kb |102400520 kb |
|---------------|---------------|---------------|---------------|---------------|---------------|
Total time is 8% faster and down time is 16% faster.
B) Scenario 2: 100% (100GB) zero pages on a 100GB VM.
CPU usage
|---------------|---------------|---------------|---------------|
| |comm |runtime(msec) |totaltime(msec)|
|---------------|---------------|---------------|---------------|
|Baseline |live_migration |4860.718 | |
| |multifdsend_0 |748.875 | |
| |multifdsend_1 |898.498 | |
| |multifdsend_2 |787.456 | |
| |multifdsend_3 |764.537 | |
| |multifdsend_4 |785.687 | |
| |multifdsend_5 |756.941 | |
| |multifdsend_6 |774.084 | |
| |multifdsend_7 |782.900 |11154 |
|---------------|---------------|-------------------------------|
|DSA offloading |live_migration |3846.976 | |
| |multifdsend_0 |191.880 | |
| |multifdsend_1 |166.331 | |
| |multifdsend_2 |168.528 | |
| |multifdsend_3 |197.831 | |
| |multifdsend_4 |169.580 | |
| |multifdsend_5 |167.984 | |
| |multifdsend_6 |198.042 | |
| |multifdsend_7 |170.624 | |
| |dsa_completion |3428.669 |8700 |
|---------------|---------------|---------------|---------------|
Baseline total runtime is 11154 msec and DSA offloading total runtime is
8700 msec. That is 22% CPU savings.
Latency
|--------------------------------------------------------------------------------------------|
| |total time |down time |throughput |transferred-ram|total-ram |
|---------------|---------------|---------------|---------------|---------------|------------|
|Baseline |4867 ms |20 ms |1.51 mbps |565 kb |102400520 kb|
|---------------|---------------|---------------|---------------|----------------------------|
|DSA offload |3888 ms |18 ms |1.89 mbps |565 kb |102400520 kb|
|---------------|---------------|---------------|---------------|---------------|------------|
Total time is 20% faster and down time is 10% faster.
* Testing:
1. Added unit tests to cover the added code path in dsa.c.
2. Added integration tests to cover multifd live migration using DSA
offloading.
Hao Xiang (12):
meson: Introduce new instruction set enqcmd to the build system.
util/dsa: Implement DSA device start and stop logic.
util/dsa: Implement DSA task enqueue and dequeue.
util/dsa: Implement DSA task asynchronous completion thread model.
util/dsa: Implement zero page checking in DSA task.
util/dsa: Implement DSA task asynchronous submission and wait for
completion.
migration/multifd: Add new migration option for multifd DSA
offloading.
migration/multifd: Prepare to introduce DSA acceleration on the
multifd path.
migration/multifd: Enable DSA offloading in multifd sender path.
migration/multifd: Add migration option set packet size.
util/dsa: Add unit test coverage for Intel DSA task submission and
completion.
migration/multifd: Add integration tests for multifd with Intel DSA
offloading.
Yichen Wang (1):
util/dsa: Add idxd into linux header copy list.
include/qemu/dsa.h | 176 +++++
meson.build | 14 +
meson_options.txt | 2 +
migration/migration-hmp-cmds.c | 22 +-
migration/migration.c | 2 +-
migration/multifd-zero-page.c | 100 ++-
migration/multifd-zlib.c | 6 +-
migration/multifd-zstd.c | 6 +-
migration/multifd.c | 53 +-
migration/multifd.h | 8 +-
migration/options.c | 85 +++
migration/options.h | 2 +
qapi/migration.json | 49 +-
scripts/meson-buildoptions.sh | 3 +
scripts/update-linux-headers.sh | 2 +-
tests/qtest/migration-test.c | 80 ++-
tests/unit/meson.build | 6 +
tests/unit/test-dsa.c | 503 ++++++++++++++
util/dsa.c | 1082 +++++++++++++++++++++++++++++++
util/meson.build | 3 +
20 files changed, 2177 insertions(+), 27 deletions(-)
create mode 100644 include/qemu/dsa.h
create mode 100644 tests/unit/test-dsa.c
create mode 100644 util/dsa.c
--
Yichen Wang
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-15 15:02 ` Liu, Yuan1
2024-07-11 21:52 ` [PATCH v5 02/13] util/dsa: Add idxd into linux header copy list Yichen Wang
` (10 subsequent siblings)
11 siblings, 1 reply; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
From: Hao Xiang <hao.xiang@linux.dev>
Enable the enqcmd instruction set in the build.
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
meson.build | 14 ++++++++++++++
meson_options.txt | 2 ++
scripts/meson-buildoptions.sh | 3 +++
3 files changed, 19 insertions(+)
diff --git a/meson.build b/meson.build
index 6a93da48e1..af650cfabf 100644
--- a/meson.build
+++ b/meson.build
@@ -2893,6 +2893,20 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
int main(int argc, char *argv[]) { return bar(argv[0]); }
'''), error_message: 'AVX512BW not available').allowed())
+config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd') \
+ .require(have_cpuid_h, error_message: 'cpuid.h not available, cannot enable ENQCMD') \
+ .require(cc.links('''
+ #include <stdint.h>
+ #include <cpuid.h>
+ #include <immintrin.h>
+ static int __attribute__((target("enqcmd"))) bar(void *a) {
+ uint64_t dst[8] = { 0 };
+ uint64_t src[8] = { 0 };
+ return _enqcmd(dst, src);
+ }
+ int main(int argc, char *argv[]) { return bar(argv[argc - 1]); }
+ '''), error_message: 'ENQCMD not available').allowed())
+
# For both AArch64 and AArch32, detect if builtins are available.
config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
#include <arm_neon.h>
diff --git a/meson_options.txt b/meson_options.txt
index 0269fa0f16..4ed820bb8d 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -121,6 +121,8 @@ option('avx2', type: 'feature', value: 'auto',
description: 'AVX2 optimizations')
option('avx512bw', type: 'feature', value: 'auto',
description: 'AVX512BW optimizations')
+option('enqcmd', type: 'feature', value: 'disabled',
+ description: 'ENQCMD optimizations')
option('keyring', type: 'feature', value: 'auto',
description: 'Linux keyring support')
option('libkeyutils', type: 'feature', value: 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index cfadb5ea86..280e117687 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -95,6 +95,7 @@ meson_options_help() {
printf "%s\n" ' auth-pam PAM access control'
printf "%s\n" ' avx2 AVX2 optimizations'
printf "%s\n" ' avx512bw AVX512BW optimizations'
+ printf "%s\n" ' enqcmd ENQCMD optimizations'
printf "%s\n" ' blkio libblkio block device driver'
printf "%s\n" ' bochs bochs image format support'
printf "%s\n" ' bpf eBPF support'
@@ -239,6 +240,8 @@ _meson_option_parse() {
--disable-avx2) printf "%s" -Davx2=disabled ;;
--enable-avx512bw) printf "%s" -Davx512bw=enabled ;;
--disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
+ --enable-enqcmd) printf "%s" -Denqcmd=enabled ;;
+ --disable-enqcmd) printf "%s" -Denqcmd=disabled ;;
--enable-gcov) printf "%s" -Db_coverage=true ;;
--disable-gcov) printf "%s" -Db_coverage=false ;;
--enable-lto) printf "%s" -Db_lto=true ;;
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 02/13] util/dsa: Add idxd into linux header copy list.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
2024-07-11 21:52 ` [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 21:52 ` [PATCH v5 03/13] util/dsa: Implement DSA device start and stop logic Yichen Wang
` (9 subsequent siblings)
11 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
scripts/update-linux-headers.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index c34ac6454e..5aba95d9cb 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -193,7 +193,7 @@ rm -rf "$output/linux-headers/linux"
mkdir -p "$output/linux-headers/linux"
for header in const.h stddef.h kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h \
- vduse.h iommufd.h bits.h; do
+ vduse.h iommufd.h bits.h idxd.h; do
cp "$hdrdir/include/linux/$header" "$output/linux-headers/linux"
done
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 03/13] util/dsa: Implement DSA device start and stop logic.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
2024-07-11 21:52 ` [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
2024-07-11 21:52 ` [PATCH v5 02/13] util/dsa: Add idxd into linux header copy list Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 21:52 ` [PATCH v5 04/13] util/dsa: Implement DSA task enqueue and dequeue Yichen Wang
` (8 subsequent siblings)
11 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Hao Xiang <hao.xiang@linux.dev>
* DSA device open and close.
* DSA group contains multiple DSA devices.
* DSA group configure/start/stop/clean.
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
include/qemu/dsa.h | 86 +++++++++++++
util/dsa.c | 303 +++++++++++++++++++++++++++++++++++++++++++++
util/meson.build | 3 +
3 files changed, 392 insertions(+)
create mode 100644 include/qemu/dsa.h
create mode 100644 util/dsa.c
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
new file mode 100644
index 0000000000..29b60654e9
--- /dev/null
+++ b/include/qemu/dsa.h
@@ -0,0 +1,86 @@
+/*
+ * Interface for using Intel Data Streaming Accelerator to offload certain
+ * background operations.
+ *
+ * Copyright (C) Bytedance Ltd.
+ *
+ * Authors:
+ * Hao Xiang <hao.xiang@bytedance.com>
+ * Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_DSA_H
+#define QEMU_DSA_H
+
+#include "qemu/error-report.h"
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device path from migration parameter.
+ *
+ * @return int Zero if successful, otherwise non zero.
+ */
+int qemu_dsa_init(const char *dsa_parameter, Error **errp);
+
+/**
+ * @brief Start logic to enable using DSA.
+ */
+void qemu_dsa_start(void);
+
+/**
+ * @brief Stop the device group and the completion thread.
+ */
+void qemu_dsa_stop(void);
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ */
+void qemu_dsa_cleanup(void);
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool qemu_dsa_is_running(void);
+
+#else
+
+static inline bool qemu_dsa_is_running(void)
+{
+ return false;
+}
+
+static inline int qemu_dsa_init(const char *dsa_parameter, Error **errp)
+{
+ if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
+ error_setg(errp, "DSA is not supported.");
+ return -1;
+ }
+
+ return 0;
+}
+
+static inline void qemu_dsa_start(void) {}
+
+static inline void qemu_dsa_stop(void) {}
+
+static inline void qemu_dsa_cleanup(void) {}
+
+#endif
+
+#endif
diff --git a/util/dsa.c b/util/dsa.c
new file mode 100644
index 0000000000..cdb0b9dda2
--- /dev/null
+++ b/util/dsa.c
@@ -0,0 +1,303 @@
+/*
+ * Use Intel Data Streaming Accelerator to offload certain background
+ * operations.
+ *
+ * Copyright (C) Bytedance Ltd.
+ *
+ * Authors:
+ * Hao Xiang <hao.xiang@bytedance.com>
+ * Bryan Zhang <bryan.zhang@bytedance.com>
+ * Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/queue.h"
+#include "qemu/memalign.h"
+#include "qemu/lockable.h"
+#include "qemu/cutils.h"
+#include "qemu/dsa.h"
+#include "qemu/bswap.h"
+#include "qemu/error-report.h"
+#include "qemu/rcu.h"
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+#define DSA_WQ_SIZE 4096
+#define MAX_DSA_DEVICES 16
+
+typedef QSIMPLEQ_HEAD(dsa_task_queue, dsa_batch_task) dsa_task_queue;
+
+typedef struct {
+ void *work_queue;
+} QemuDsaDevice;
+
+typedef struct {
+ QemuDsaDevice *dsa_devices;
+ int num_dsa_devices;
+ /* The index of the next DSA device to be used. */
+ uint32_t device_allocator_index;
+ bool running;
+ QemuMutex task_queue_lock;
+ QemuCond task_queue_cond;
+ dsa_task_queue task_queue;
+} QemuDsaDeviceGroup;
+
+uint64_t max_retry_count;
+static QemuDsaDeviceGroup dsa_group;
+
+
+/**
+ * @brief This function opens a DSA device's work queue and
+ * maps the DSA device memory into the current process.
+ *
+ * @param dsa_wq_path A pointer to the DSA device work queue's file path.
+ * @return A pointer to the mapped memory, or MAP_FAILED on failure.
+ */
+static void *
+map_dsa_device(const char *dsa_wq_path)
+{
+ void *dsa_device;
+ int fd;
+
+ fd = open(dsa_wq_path, O_RDWR);
+ if (fd < 0) {
+ error_report("Open %s failed with errno = %d.",
+ dsa_wq_path, errno);
+ return MAP_FAILED;
+ }
+ dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
+ close(fd);
+ if (dsa_device == MAP_FAILED) {
+ error_report("mmap failed with errno = %d.", errno);
+ return MAP_FAILED;
+ }
+ return dsa_device;
+}
+
+/**
+ * @brief Initializes a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device.
+ * @param work_queue A pointer to the DSA work queue.
+ */
+static void
+dsa_device_init(QemuDsaDevice *instance,
+ void *dsa_work_queue)
+{
+ instance->work_queue = dsa_work_queue;
+}
+
+/**
+ * @brief Cleans up a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device to cleanup.
+ */
+static void
+dsa_device_cleanup(QemuDsaDevice *instance)
+{
+ if (instance->work_queue != MAP_FAILED) {
+ munmap(instance->work_queue, DSA_WQ_SIZE);
+ }
+}
+
+/**
+ * @brief Initializes a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param dsa_parameter A list of DSA device paths from the migration
+ * parameter. Multiple device paths are separated by space characters.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+dsa_device_group_init(QemuDsaDeviceGroup *group,
+ const char *dsa_parameter,
+ Error **errp)
+{
+ if (dsa_parameter == NULL || strlen(dsa_parameter) == 0) {
+ return 0;
+ }
+
+ int ret = 0;
+ char *local_dsa_parameter = g_strdup(dsa_parameter);
+ const char *dsa_path[MAX_DSA_DEVICES];
+ int num_dsa_devices = 0;
+ char delim[2] = " ";
+
+ char *current_dsa_path = strtok(local_dsa_parameter, delim);
+
+ while (current_dsa_path != NULL) {
+ dsa_path[num_dsa_devices++] = current_dsa_path;
+ if (num_dsa_devices == MAX_DSA_DEVICES) {
+ break;
+ }
+ current_dsa_path = strtok(NULL, delim);
+ }
+
+ group->dsa_devices =
+ g_new0(QemuDsaDevice, num_dsa_devices);
+ group->num_dsa_devices = num_dsa_devices;
+ group->device_allocator_index = 0;
+
+ group->running = false;
+ qemu_mutex_init(&group->task_queue_lock);
+ qemu_cond_init(&group->task_queue_cond);
+ QSIMPLEQ_INIT(&group->task_queue);
+
+ void *dsa_wq = MAP_FAILED;
+ for (int i = 0; i < num_dsa_devices; i++) {
+ dsa_wq = map_dsa_device(dsa_path[i]);
+ if (dsa_wq == MAP_FAILED) {
+ error_setg(errp, "map_dsa_device failed MAP_FAILED.");
+ ret = -1;
+ goto exit;
+ }
+ dsa_device_init(&dsa_group.dsa_devices[i], dsa_wq);
+ }
+
+exit:
+ g_free(local_dsa_parameter);
+ return ret;
+}
+
+/**
+ * @brief Starts a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_device_group_start(QemuDsaDeviceGroup *group)
+{
+ group->running = true;
+}
+
+/**
+ * @brief Stops a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+__attribute__((unused))
+static void
+dsa_device_group_stop(QemuDsaDeviceGroup *group)
+{
+ group->running = false;
+}
+
+/**
+ * @brief Cleans up a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
+{
+ if (!group->dsa_devices) {
+ return;
+ }
+ for (int i = 0; i < group->num_dsa_devices; i++) {
+ dsa_device_cleanup(&group->dsa_devices[i]);
+ }
+ g_free(group->dsa_devices);
+ group->dsa_devices = NULL;
+
+ qemu_mutex_destroy(&group->task_queue_lock);
+ qemu_cond_destroy(&group->task_queue_cond);
+}
+
+/**
+ * @brief Returns the next available DSA device in the group.
+ *
+ * @param group A pointer to the DSA device group.
+ *
+ * @return struct QemuDsaDevice* A pointer to the next available DSA device
+ * in the group.
+ */
+__attribute__((unused))
+static QemuDsaDevice *
+dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
+{
+ if (group->num_dsa_devices == 0) {
+ return NULL;
+ }
+ uint32_t current = qatomic_fetch_inc(&group->device_allocator_index);
+ current %= group->num_dsa_devices;
+ return &group->dsa_devices[current];
+}
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool qemu_dsa_is_running(void)
+{
+ return false;
+}
+
+static void
+dsa_globals_init(void)
+{
+ max_retry_count = UINT64_MAX;
+}
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device path from migration parameter.
+ *
+ * @return int Zero if successful, otherwise non zero.
+ */
+int qemu_dsa_init(const char *dsa_parameter, Error **errp)
+{
+ dsa_globals_init();
+
+ return dsa_device_group_init(&dsa_group, dsa_parameter, errp);
+}
+
+/**
+ * @brief Start logic to enable using DSA.
+ *
+ */
+void qemu_dsa_start(void)
+{
+ if (dsa_group.num_dsa_devices == 0) {
+ return;
+ }
+ if (dsa_group.running) {
+ return;
+ }
+ dsa_device_group_start(&dsa_group);
+}
+
+/**
+ * @brief Stop the device group and the completion thread.
+ *
+ */
+void qemu_dsa_stop(void)
+{
+ QemuDsaDeviceGroup *group = &dsa_group;
+
+ if (!group->running) {
+ return;
+ }
+}
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ *
+ */
+void qemu_dsa_cleanup(void)
+{
+ qemu_dsa_stop();
+ dsa_device_group_cleanup(&dsa_group);
+}
+
diff --git a/util/meson.build b/util/meson.build
index 5d8bef9891..3360f62923 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -88,6 +88,9 @@ if have_block or have_ga
endif
if have_block
util_ss.add(files('aio-wait.c'))
+ if config_host_data.get('CONFIG_DSA_OPT')
+ util_ss.add(files('dsa.c'))
+ endif
util_ss.add(files('buffer.c'))
util_ss.add(files('bufferiszero.c'))
util_ss.add(files('hbitmap.c'))
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 04/13] util/dsa: Implement DSA task enqueue and dequeue.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (2 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 03/13] util/dsa: Implement DSA device start and stop logic Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 21:52 ` [PATCH v5 05/13] util/dsa: Implement DSA task asynchronous completion thread model Yichen Wang
` (7 subsequent siblings)
11 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
From: Hao Xiang <hao.xiang@linux.dev>
* Use a thread-safe queue for DSA task enqueue/dequeue.
* Implement DSA task submission.
* Implement DSA batch task submission.
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
include/qemu/dsa.h | 46 ++++++++++
util/dsa.c | 222 ++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 246 insertions(+), 22 deletions(-)
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 29b60654e9..9cc836b64c 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -27,6 +27,52 @@
#include <linux/idxd.h>
#include "x86intrin.h"
+typedef enum QemuDsaTaskType {
+ QEMU_DSA_TASK = 0,
+ QEMU_DSA_BATCH_TASK
+} QemuDsaTaskType;
+
+typedef enum QemuDsaTaskStatus {
+ QEMU_DSA_TASK_READY = 0,
+ QEMU_DSA_TASK_PROCESSING,
+ QEMU_DSA_TASK_COMPLETION
+} QemuDsaTaskStatus;
+
+typedef struct {
+ void *work_queue;
+} QemuDsaDevice;
+
+typedef QSIMPLEQ_HEAD(QemuDsaTaskQueue, QemuDsaBatchTask) QemuDsaTaskQueue;
+
+typedef struct {
+ QemuDsaDevice *dsa_devices;
+ int num_dsa_devices;
+ /* The index of the next DSA device to be used. */
+ uint32_t device_allocator_index;
+ bool running;
+ QemuMutex task_queue_lock;
+ QemuCond task_queue_cond;
+ QemuDsaTaskQueue task_queue;
+} QemuDsaDeviceGroup;
+
+typedef void (*qemu_dsa_completion_fn)(void *);
+
+typedef struct QemuDsaBatchTask {
+ struct dsa_hw_desc batch_descriptor;
+ struct dsa_hw_desc *descriptors;
+ struct dsa_completion_record batch_completion __attribute__((aligned(32)));
+ struct dsa_completion_record *completions;
+ QemuDsaDeviceGroup *group;
+ QemuDsaDevice *device;
+ qemu_dsa_completion_fn completion_callback;
+ QemuSemaphore sem_task_complete;
+ QemuDsaTaskType task_type;
+ QemuDsaTaskStatus status;
+ int batch_size;
+ QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
+} QemuDsaBatchTask;
+
+
/**
* @brief Initializes DSA devices.
*
diff --git a/util/dsa.c b/util/dsa.c
index cdb0b9dda2..43280b31cd 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -30,27 +30,11 @@
#include <linux/idxd.h>
#include "x86intrin.h"
-#define DSA_WQ_SIZE 4096
+#define DSA_WQ_PORTAL_SIZE 4096
+#define DSA_WQ_DEPTH 128
#define MAX_DSA_DEVICES 16
-typedef QSIMPLEQ_HEAD(dsa_task_queue, dsa_batch_task) dsa_task_queue;
-
-typedef struct {
- void *work_queue;
-} QemuDsaDevice;
-
-typedef struct {
- QemuDsaDevice *dsa_devices;
- int num_dsa_devices;
- /* The index of the next DSA device to be used. */
- uint32_t device_allocator_index;
- bool running;
- QemuMutex task_queue_lock;
- QemuCond task_queue_cond;
- dsa_task_queue task_queue;
-} QemuDsaDeviceGroup;
-
-uint64_t max_retry_count;
+uint32_t max_retry_count;
static QemuDsaDeviceGroup dsa_group;
@@ -73,7 +57,7 @@ map_dsa_device(const char *dsa_wq_path)
dsa_wq_path, errno);
return MAP_FAILED;
}
- dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
+ dsa_device = mmap(NULL, DSA_WQ_PORTAL_SIZE, PROT_WRITE,
MAP_SHARED | MAP_POPULATE, fd, 0);
close(fd);
if (dsa_device == MAP_FAILED) {
@@ -105,7 +89,7 @@ static void
dsa_device_cleanup(QemuDsaDevice *instance)
{
if (instance->work_queue != MAP_FAILED) {
- munmap(instance->work_queue, DSA_WQ_SIZE);
+ munmap(instance->work_queue, DSA_WQ_PORTAL_SIZE);
}
}
@@ -233,6 +217,198 @@ dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
return &group->dsa_devices[current];
}
+/**
+ * @brief Empties out the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_empty_task_queue(QemuDsaDeviceGroup *group)
+{
+ qemu_mutex_lock(&group->task_queue_lock);
+ QemuDsaTaskQueue *task_queue = &group->task_queue;
+ while (!QSIMPLEQ_EMPTY(task_queue)) {
+ QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
+ }
+ qemu_mutex_unlock(&group->task_queue_lock);
+}
+
+/**
+ * @brief Adds a task to the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param task A pointer to the DSA task to enqueue.
+ *
+ * @return int Zero if successful, otherwise a proper error code.
+ */
+static int
+dsa_task_enqueue(QemuDsaDeviceGroup *group,
+ QemuDsaBatchTask *task)
+{
+ bool notify = false;
+
+ qemu_mutex_lock(&group->task_queue_lock);
+
+ if (!group->running) {
+ error_report("DSA: Tried to queue task to stopped device queue.");
+ qemu_mutex_unlock(&group->task_queue_lock);
+ return -1;
+ }
+
+ /* The queue is empty. This enqueue operation is a 0->1 transition. */
+ if (QSIMPLEQ_EMPTY(&group->task_queue)) {
+ notify = true;
+ }
+
+ QSIMPLEQ_INSERT_TAIL(&group->task_queue, task, entry);
+
+ /* We need to notify the waiter for 0->1 transitions. */
+ if (notify) {
+ qemu_cond_signal(&group->task_queue_cond);
+ }
+
+ qemu_mutex_unlock(&group->task_queue_lock);
+
+ return 0;
+}
+
+/**
+ * @brief Takes a DSA task out of the task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @return QemuDsaBatchTask* The DSA task being dequeued.
+ */
+__attribute__((unused))
+static QemuDsaBatchTask *
+dsa_task_dequeue(QemuDsaDeviceGroup *group)
+{
+ QemuDsaBatchTask *task = NULL;
+
+ qemu_mutex_lock(&group->task_queue_lock);
+
+ while (true) {
+ if (!group->running) {
+ goto exit;
+ }
+ task = QSIMPLEQ_FIRST(&group->task_queue);
+ if (task != NULL) {
+ break;
+ }
+ qemu_cond_wait(&group->task_queue_cond, &group->task_queue_lock);
+ }
+
+ QSIMPLEQ_REMOVE_HEAD(&group->task_queue, entry);
+
+exit:
+ qemu_mutex_unlock(&group->task_queue_lock);
+ return task;
+}
+
+/**
+ * @brief Submits a DSA work item to the device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
+{
+ uint32_t retry = 0;
+
+ _mm_sfence();
+
+ while (true) {
+ if (_enqcmd(wq, descriptor) == 0) {
+ break;
+ }
+ retry++;
+ if (retry > max_retry_count) {
+ error_report("Submit work retry %u times.", retry);
+ return -1;
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * @brief Synchronously submits a DSA work item to the
+ * device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_wi(void *wq, struct dsa_hw_desc *descriptor)
+{
+ return submit_wi_int(wq, descriptor);
+}
+
+/**
+ * @brief Asynchronously submits a DSA work item to the
+ * device work queue.
+ *
+ * @param task A pointer to the task.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_wi_async(QemuDsaBatchTask *task)
+{
+ QemuDsaDeviceGroup *device_group = task->group;
+ QemuDsaDevice *device_instance = task->device;
+ int ret;
+
+ assert(task->task_type == QEMU_DSA_TASK);
+
+ task->status = QEMU_DSA_TASK_PROCESSING;
+
+ ret = submit_wi_int(device_instance->work_queue,
+ &task->descriptors[0]);
+ if (ret != 0) {
+ return ret;
+ }
+
+ return dsa_task_enqueue(device_group, task);
+}
+
+/**
+ * @brief Asynchronously submits a DSA batch work item to the
+ * device work queue.
+ *
+ * @param batch_task A pointer to the batch task.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_batch_wi_async(QemuDsaBatchTask *batch_task)
+{
+ QemuDsaDeviceGroup *device_group = batch_task->group;
+ QemuDsaDevice *device_instance = batch_task->device;
+ int ret;
+
+ assert(batch_task->task_type == QEMU_DSA_BATCH_TASK);
+ assert(batch_task->batch_descriptor.desc_count <= batch_task->batch_size);
+ assert(batch_task->status == QEMU_DSA_TASK_READY);
+
+ batch_task->status = QEMU_DSA_TASK_PROCESSING;
+
+ ret = submit_wi_int(device_instance->work_queue,
+ &batch_task->batch_descriptor);
+ if (ret != 0) {
+ return ret;
+ }
+
+ return dsa_task_enqueue(device_group, batch_task);
+}
+
/**
* @brief Check if DSA is running.
*
@@ -246,7 +422,7 @@ bool qemu_dsa_is_running(void)
static void
dsa_globals_init(void)
{
- max_retry_count = UINT64_MAX;
+ max_retry_count = DSA_WQ_DEPTH;
}
/**
@@ -289,6 +465,8 @@ void qemu_dsa_stop(void)
if (!group->running) {
return;
}
+
+ dsa_empty_task_queue(group);
}
/**
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 05/13] util/dsa: Implement DSA task asynchronous completion thread model.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (3 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 04/13] util/dsa: Implement DSA task enqueue and dequeue Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 21:52 ` [PATCH v5 06/13] util/dsa: Implement zero page checking in DSA task Yichen Wang
` (6 subsequent siblings)
11 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
From: Hao Xiang <hao.xiang@linux.dev>
* Create a dedicated thread for DSA task completion.
* The DSA completion thread runs a loop and polls for completed tasks.
* Start and stop the DSA completion thread during DSA device start/stop.
A user space application can directly submit tasks to the Intel DSA
accelerator by writing to DSA's device memory (mapped in user space).
Once a task is submitted, the device starts processing it and writes
the completion status back to the task. A user space application can
poll the task's completion status to check for completion. This change
uses a dedicated thread to perform DSA task completion checking.
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
include/qemu/dsa.h | 1 +
util/dsa.c | 274 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 274 insertions(+), 1 deletion(-)
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 9cc836b64c..d46a9f42a5 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -69,6 +69,7 @@ typedef struct QemuDsaBatchTask {
QemuDsaTaskType task_type;
QemuDsaTaskStatus status;
int batch_size;
+ bool *results;
QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
} QemuDsaBatchTask;
diff --git a/util/dsa.c b/util/dsa.c
index 43280b31cd..1eb85f37f1 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -33,9 +33,20 @@
#define DSA_WQ_PORTAL_SIZE 4096
#define DSA_WQ_DEPTH 128
#define MAX_DSA_DEVICES 16
+#define DSA_COMPLETION_THREAD "qemu_dsa_completion"
+
+typedef struct {
+ bool stopping;
+ bool running;
+ QemuThread thread;
+ int thread_id;
+ QemuSemaphore sem_init_done;
+ QemuDsaDeviceGroup *group;
+} QemuDsaCompletionThread;
uint32_t max_retry_count;
static QemuDsaDeviceGroup dsa_group;
+static QemuDsaCompletionThread completion_thread;
/**
@@ -409,6 +420,265 @@ submit_batch_wi_async(QemuDsaBatchTask *batch_task)
return dsa_task_enqueue(device_group, batch_task);
}
+/**
+ * @brief Poll for the DSA work item completion.
+ *
+ * @param completion A pointer to the DSA work item completion record.
+ * @param opcode The DSA opcode.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+poll_completion(struct dsa_completion_record *completion,
+ enum dsa_opcode opcode)
+{
+ uint8_t status;
+ uint64_t retry = 0;
+
+ while (true) {
+ /* The DSA operation completes successfully or fails. */
+ status = completion->status;
+ if (status == DSA_COMP_SUCCESS ||
+ status == DSA_COMP_PAGE_FAULT_NOBOF ||
+ status == DSA_COMP_BATCH_PAGE_FAULT ||
+ status == DSA_COMP_BATCH_FAIL) {
+ break;
+ } else if (status != DSA_COMP_NONE) {
+ error_report("DSA opcode %d failed with status = %d.",
+ opcode, status);
+ return 1;
+ }
+ retry++;
+ if (retry > max_retry_count) {
+ error_report("DSA wait for completion retry %lu times.", retry);
+ return 1;
+ }
+ _mm_pause();
+ }
+
+ return 0;
+}
+
+/**
+ * @brief Complete a single DSA task in the batch task.
+ *
+ * @param task A pointer to the batch task structure.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+static int
+poll_task_completion(QemuDsaBatchTask *task)
+{
+ assert(task->task_type == QEMU_DSA_TASK);
+
+ struct dsa_completion_record *completion = &task->completions[0];
+ uint8_t status;
+ int ret;
+
+ ret = poll_completion(completion, task->descriptors[0].opcode);
+ if (ret != 0) {
+ goto exit;
+ }
+
+ status = completion->status;
+ if (status == DSA_COMP_SUCCESS) {
+ task->results[0] = (completion->result == 0);
+ goto exit;
+ }
+
+ assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+exit:
+ return ret;
+}
+
+/**
+ * @brief Poll a batch task status until it completes. If DSA task doesn't
+ * complete properly, use CPU to complete the task.
+ *
+ * @param batch_task A pointer to the DSA batch task.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+static int
+poll_batch_task_completion(QemuDsaBatchTask *batch_task)
+{
+ struct dsa_completion_record *batch_completion =
+ &batch_task->batch_completion;
+ struct dsa_completion_record *completion;
+ uint8_t batch_status;
+ uint8_t status;
+ bool *results = batch_task->results;
+ uint32_t count = batch_task->batch_descriptor.desc_count;
+ int ret;
+
+ ret = poll_completion(batch_completion,
+ batch_task->batch_descriptor.opcode);
+ if (ret != 0) {
+ goto exit;
+ }
+
+ batch_status = batch_completion->status;
+
+ if (batch_status == DSA_COMP_SUCCESS) {
+ if (batch_completion->bytes_completed == count) {
+ /*
+ * Let's skip checking each descriptor's completion status
+ * if the batch descriptor says all succeeded.
+ */
+ for (int i = 0; i < count; i++) {
+ assert(batch_task->completions[i].status == DSA_COMP_SUCCESS);
+ results[i] = (batch_task->completions[i].result == 0);
+ }
+ goto exit;
+ }
+ } else {
+ assert(batch_status == DSA_COMP_BATCH_FAIL ||
+ batch_status == DSA_COMP_BATCH_PAGE_FAULT);
+ }
+
+ for (int i = 0; i < count; i++) {
+
+ completion = &batch_task->completions[i];
+ status = completion->status;
+
+ if (status == DSA_COMP_SUCCESS) {
+ results[i] = (completion->result == 0);
+ continue;
+ }
+
+ assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+ if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
+ error_report("Unexpected DSA completion status = %u.", status);
+ ret = 1;
+ goto exit;
+ }
+ }
+
+exit:
+ return ret;
+}
+
+/**
+ * @brief Handles an asynchronous DSA batch task completion.
+ *
+ * @param task A pointer to the batch buffer zero task structure.
+ */
+static void
+dsa_batch_task_complete(QemuDsaBatchTask *batch_task)
+{
+ batch_task->status = QEMU_DSA_TASK_COMPLETION;
+ batch_task->completion_callback(batch_task);
+}
+
+/**
+ * @brief The function entry point called by a dedicated DSA
+ * work item completion thread.
+ *
+ * @param opaque A pointer to the thread context.
+ *
+ * @return void* Not used.
+ */
+static void *
+dsa_completion_loop(void *opaque)
+{
+ QemuDsaCompletionThread *thread_context =
+ (QemuDsaCompletionThread *)opaque;
+ QemuDsaBatchTask *batch_task;
+ QemuDsaDeviceGroup *group = thread_context->group;
+ int ret;
+
+ rcu_register_thread();
+
+ thread_context->thread_id = qemu_get_thread_id();
+ qemu_sem_post(&thread_context->sem_init_done);
+
+ while (thread_context->running) {
+ batch_task = dsa_task_dequeue(group);
+ assert(batch_task != NULL || !group->running);
+ if (!group->running) {
+ assert(!thread_context->running);
+ break;
+ }
+ if (batch_task->task_type == QEMU_DSA_TASK) {
+ ret = poll_task_completion(batch_task);
+ } else {
+ assert(batch_task->task_type == QEMU_DSA_BATCH_TASK);
+ ret = poll_batch_task_completion(batch_task);
+ }
+
+ if (ret != 0) {
+ goto exit;
+ }
+
+ dsa_batch_task_complete(batch_task);
+ }
+
+exit:
+ if (ret != 0) {
+ error_report("DSA completion thread exited due to internal error.");
+ }
+ rcu_unregister_thread();
+ return NULL;
+}
+
+/**
+ * @brief Initializes a DSA completion thread.
+ *
+ * @param completion_thread A pointer to the completion thread context.
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_completion_thread_init(
+ QemuDsaCompletionThread *completion_thread,
+ QemuDsaDeviceGroup *group)
+{
+ completion_thread->stopping = false;
+ completion_thread->running = true;
+ completion_thread->thread_id = -1;
+ qemu_sem_init(&completion_thread->sem_init_done, 0);
+ completion_thread->group = group;
+
+ qemu_thread_create(&completion_thread->thread,
+ DSA_COMPLETION_THREAD,
+ dsa_completion_loop,
+ completion_thread,
+ QEMU_THREAD_JOINABLE);
+
+ /* Wait for initialization to complete */
+ qemu_sem_wait(&completion_thread->sem_init_done);
+}
+
+/**
+ * @brief Stops the completion thread (and implicitly, the device group).
+ *
+ * @param opaque A pointer to the completion thread.
+ */
+static void dsa_completion_thread_stop(void *opaque)
+{
+ QemuDsaCompletionThread *thread_context =
+ (QemuDsaCompletionThread *)opaque;
+
+ QemuDsaDeviceGroup *group = thread_context->group;
+
+ qemu_mutex_lock(&group->task_queue_lock);
+
+ thread_context->stopping = true;
+ thread_context->running = false;
+
+ /* Prevent the compiler from setting group->running first. */
+ barrier();
+ dsa_device_group_stop(group);
+
+ qemu_cond_signal(&group->task_queue_cond);
+ qemu_mutex_unlock(&group->task_queue_lock);
+
+ qemu_thread_join(&thread_context->thread);
+
+ qemu_sem_destroy(&thread_context->sem_init_done);
+}
+
/**
* @brief Check if DSA is running.
*
@@ -416,7 +686,7 @@ submit_batch_wi_async(QemuDsaBatchTask *batch_task)
*/
bool qemu_dsa_is_running(void)
{
- return false;
+ return completion_thread.running;
}
static void
@@ -452,6 +722,7 @@ void qemu_dsa_start(void)
return;
}
dsa_device_group_start(&dsa_group);
+ dsa_completion_thread_init(&completion_thread, &dsa_group);
}
/**
@@ -466,6 +737,7 @@ void qemu_dsa_stop(void)
return;
}
+ dsa_completion_thread_stop(&completion_thread);
dsa_empty_task_queue(group);
}
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 06/13] util/dsa: Implement zero page checking in DSA task.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (4 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 05/13] util/dsa: Implement DSA task asynchronous completion thread model Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 21:52 ` [PATCH v5 07/13] util/dsa: Implement DSA task asynchronous submission and wait for completion Yichen Wang
` (5 subsequent siblings)
11 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Hao Xiang <hao.xiang@linux.dev>
Create DSA tasks with operation code DSA_OPCODE_COMPVAL.
Here we create two types of DSA tasks: a single DSA task and
a batch DSA task. A batch DSA task reduces task submission overhead
and hence should be the default option. However, due to the way the DSA
hardware works, a DSA batch task must contain at least two individual
tasks. There are times we need to submit a single task, hence a single
DSA task submission is also required; a caller dispatch between the two
is sketched below.
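For illustration only, a hypothetical caller could dispatch between the two
task types roughly like this (using the helpers added in this patch; the
at-least-two-descriptor requirement for batch tasks is why both paths exist):

/* Sketch of a hypothetical caller, not part of this patch. */
static int check_zero_pages_dsa(QemuDsaBatchTask *task,
                                const void **pages, size_t count,
                                size_t page_size)
{
    int ret;

    if (count == 1) {
        /* A DSA batch must contain at least two descriptors. */
        ret = buffer_zero_dsa_async(task, pages[0], page_size);
    } else {
        ret = buffer_zero_dsa_batch_async(task, pages, count, page_size);
    }
    if (ret != 0) {
        return ret;   /* submission failed; the caller falls back to the CPU */
    }

    buffer_zero_dsa_wait(task);   /* signalled by the completion thread */
    return 0;
}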
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
include/qemu/dsa.h | 18 ++++
util/dsa.c | 247 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 244 insertions(+), 21 deletions(-)
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index d46a9f42a5..1b4baf1c80 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -105,6 +105,24 @@ void qemu_dsa_cleanup(void);
*/
bool qemu_dsa_is_running(void);
+/**
+ * @brief Initializes a buffer zero batch task.
+ *
+ * @param task A pointer to the batch task to initialize.
+ * @param results A pointer to an array of zero page checking results.
+ * @param batch_size The number of DSA tasks in the batch.
+ */
+void
+buffer_zero_batch_task_init(QemuDsaBatchTask *task,
+ bool *results, int batch_size);
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
+
#else
static inline bool qemu_dsa_is_running(void)
diff --git a/util/dsa.c b/util/dsa.c
index 1eb85f37f1..f0d8cce210 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -48,6 +48,7 @@ uint32_t max_retry_count;
static QemuDsaDeviceGroup dsa_group;
static QemuDsaCompletionThread completion_thread;
+static void buffer_zero_dsa_completion(void *context);
/**
* @brief This function opens a DSA device's work queue and
@@ -180,7 +181,6 @@ dsa_device_group_start(QemuDsaDeviceGroup *group)
*
* @param group A pointer to the DSA device group.
*/
-__attribute__((unused))
static void
dsa_device_group_stop(QemuDsaDeviceGroup *group)
{
@@ -216,7 +216,6 @@ dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
* @return struct QemuDsaDevice* A pointer to the next available DSA device
* in the group.
*/
-__attribute__((unused))
static QemuDsaDevice *
dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
{
@@ -289,7 +288,6 @@ dsa_task_enqueue(QemuDsaDeviceGroup *group,
* @param group A pointer to the DSA device group.
* @return QemuDsaBatchTask* The DSA task being dequeued.
*/
-__attribute__((unused))
static QemuDsaBatchTask *
dsa_task_dequeue(QemuDsaDeviceGroup *group)
{
@@ -344,22 +342,6 @@ submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
return 0;
}
-/**
- * @brief Synchronously submits a DSA work item to the
- * device work queue.
- *
- * @param wq A pointer to the DSA work queue's device memory.
- * @param descriptor A pointer to the DSA work item descriptor.
- *
- * @return int Zero if successful, non-zero otherwise.
- */
-__attribute__((unused))
-static int
-submit_wi(void *wq, struct dsa_hw_desc *descriptor)
-{
- return submit_wi_int(wq, descriptor);
-}
-
/**
* @brief Asynchronously submits a DSA work item to the
* device work queue.
@@ -368,7 +350,6 @@ submit_wi(void *wq, struct dsa_hw_desc *descriptor)
*
* @return int Zero if successful, non-zero otherwise.
*/
-__attribute__((unused))
static int
submit_wi_async(QemuDsaBatchTask *task)
{
@@ -397,7 +378,6 @@ submit_wi_async(QemuDsaBatchTask *task)
*
* @return int Zero if successful, non-zero otherwise.
*/
-__attribute__((unused))
static int
submit_batch_wi_async(QemuDsaBatchTask *batch_task)
{
@@ -679,6 +659,231 @@ static void dsa_completion_thread_stop(void *opaque)
qemu_sem_destroy(&thread_context->sem_init_done);
}
+/**
+ * @brief Initializes a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param completion A pointer to the DSA task completion record.
+ */
+static void
+buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
+ struct dsa_completion_record *completion)
+{
+ descriptor->opcode = DSA_OPCODE_COMPVAL;
+ descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+ descriptor->comp_pattern = (uint64_t)0;
+ descriptor->completion_addr = (uint64_t)completion;
+}
+
+/**
+ * @brief Initializes a buffer zero batch task.
+ *
+ * @param task A pointer to the batch task to initialize.
+ * @param results A pointer to an array of zero page checking results.
+ * @param batch_size The number of DSA tasks in the batch.
+ */
+void
+buffer_zero_batch_task_init(QemuDsaBatchTask *task,
+ bool *results, int batch_size)
+{
+ int descriptors_size = sizeof(*task->descriptors) * batch_size;
+ memset(task, 0, sizeof(*task));
+
+ task->descriptors =
+ (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
+ memset(task->descriptors, 0, descriptors_size);
+ task->completions = (struct dsa_completion_record *)qemu_memalign(
+ 32, sizeof(*task->completions) * batch_size);
+ task->results = results;
+ task->batch_size = batch_size;
+
+ task->batch_completion.status = DSA_COMP_NONE;
+ task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
+ /* TODO: Ensure that we never send a batch with count <= 1 */
+ task->batch_descriptor.desc_count = 0;
+ task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
+ task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+ task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
+ task->status = QEMU_DSA_TASK_READY;
+ task->group = &dsa_group;
+ task->device = dsa_device_group_get_next_device(&dsa_group);
+
+ for (int i = 0; i < task->batch_size; i++) {
+ buffer_zero_task_init_int(&task->descriptors[i],
+ &task->completions[i]);
+ }
+
+ qemu_sem_init(&task->sem_task_complete, 0);
+ task->completion_callback = buffer_zero_dsa_completion;
+}
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void
+buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
+{
+ qemu_vfree(task->descriptors);
+ qemu_vfree(task->completions);
+ task->results = NULL;
+
+ qemu_sem_destroy(&task->sem_task_complete);
+}
+
+/**
+ * @brief Resets a buffer zero comparison DSA batch task.
+ *
+ * @param task A pointer to the batch task.
+ * @param count The number of DSA tasks this batch task will contain.
+ */
+static void
+buffer_zero_batch_task_reset(QemuDsaBatchTask *task, size_t count)
+{
+ task->batch_completion.status = DSA_COMP_NONE;
+ task->batch_descriptor.desc_count = count;
+ task->task_type = QEMU_DSA_BATCH_TASK;
+ task->status = QEMU_DSA_TASK_READY;
+}
+
+/**
+ * @brief Sets a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param buf A pointer to the memory buffer.
+ * @param len The length of the buffer.
+ */
+static void
+buffer_zero_task_set_int(struct dsa_hw_desc *descriptor,
+ const void *buf,
+ size_t len)
+{
+ struct dsa_completion_record *completion =
+ (struct dsa_completion_record *)descriptor->completion_addr;
+
+ descriptor->xfer_size = len;
+ descriptor->src_addr = (uintptr_t)buf;
+ completion->status = 0;
+ completion->result = 0;
+}
+
+/**
+ * @brief Resets a buffer zero comparison DSA batch task.
+ *
+ * @param task A pointer to the DSA batch task.
+ */
+static void
+buffer_zero_task_reset(QemuDsaBatchTask *task)
+{
+ task->completions[0].status = DSA_COMP_NONE;
+ task->task_type = QEMU_DSA_TASK;
+ task->status = QEMU_DSA_TASK_READY;
+}
+
+/**
+ * @brief Sets a buffer zero comparison DSA task.
+ *
+ * @param task A pointer to the DSA task.
+ * @param buf A pointer to the memory buffer.
+ * @param len The buffer length.
+ */
+static void
+buffer_zero_task_set(QemuDsaBatchTask *task,
+ const void *buf,
+ size_t len)
+{
+ buffer_zero_task_reset(task);
+ buffer_zero_task_set_int(&task->descriptors[0], buf, len);
+}
+
+/**
+ * @brief Sets a buffer zero comparison batch task.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The length of the buffers.
+ */
+static void
+buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
+ const void **buf, size_t count, size_t len)
+{
+ assert(count > 0);
+ assert(count <= batch_task->batch_size);
+
+ buffer_zero_batch_task_reset(batch_task, count);
+ for (int i = 0; i < count; i++) {
+ buffer_zero_task_set_int(&batch_task->descriptors[i], buf[i], len);
+ }
+}
+
+/**
+ * @brief Asynchronously performs a buffer zero DSA operation.
+ *
+ * @param task A pointer to the batch task structure.
+ * @param buf A pointer to the memory buffer.
+ * @param len The length of the memory buffer.
+ *
+ * @return int Zero if successful, otherwise an appropriate error code.
+ */
+__attribute__((unused))
+static int
+buffer_zero_dsa_async(QemuDsaBatchTask *task,
+ const void *buf, size_t len)
+{
+ buffer_zero_task_set(task, buf, len);
+
+ return submit_wi_async(task);
+}
+
+/**
+ * @brief Sends a memory comparison batch task to a DSA device and waits
+ * for completion.
+ *
+ * @param batch_task The batch task to be submitted to DSA device.
+ * @param buf An array of memory buffers to check for zero.
+ * @param count The number of buffers.
+ * @param len The buffer length.
+ */
+__attribute__((unused))
+static int
+buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
+ const void **buf, size_t count, size_t len)
+{
+ assert(count <= batch_task->batch_size);
+ buffer_zero_batch_task_set(batch_task, buf, count, len);
+
+ return submit_batch_wi_async(batch_task);
+}
+
+/**
+ * @brief The completion callback function for a buffer zero
+ * comparison DSA task.
+ *
+ * @param context A pointer to the callback context.
+ */
+static void
+buffer_zero_dsa_completion(void *context)
+{
+ assert(context != NULL);
+
+ QemuDsaBatchTask *task = (QemuDsaBatchTask *)context;
+ qemu_sem_post(&task->sem_task_complete);
+}
+
+/**
+ * @brief Wait for the asynchronous DSA task to complete.
+ *
+ * @param batch_task A pointer to the buffer zero comparison batch task.
+ */
+__attribute__((unused))
+static void
+buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
+{
+ qemu_sem_wait(&batch_task->sem_task_complete);
+}
+
/**
* @brief Check if DSA is running.
*
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 07/13] util/dsa: Implement DSA task asynchronous submission and wait for completion.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (5 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 06/13] util/dsa: Implement zero page checking in DSA task Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 21:52 ` [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
` (4 subsequent siblings)
11 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Hao Xiang <hao.xiang@linux.dev>
* Add a DSA task completion callback.
* The DSA completion thread will call the task's completion callback
on every task/batch task completion.
* Make the DSA submission path wait for completion.
* Implement CPU fallback if DSA is not able to complete the task
(a caller sketch follows below).
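The sketch below is purely illustrative and not part of the patch; it
assumes a QemuDsaBatchTask that has already been set up with
buffer_zero_batch_task_init(), and a hypothetical caller-chosen page
layout, to show how the synchronous helper added here would be used:

static size_t count_zero_pages(QemuDsaBatchTask *task,
                               const void **pages, size_t npages,
                               size_t page_size)
{
    size_t zero_pages = 0;

    /* Submits the batch, waits for completion, applies the CPU fallback. */
    if (buffer_is_zero_dsa_batch_sync(task, pages, npages, page_size)) {
        return 0; /* submission failed; results are not valid */
    }

    for (size_t i = 0; i < npages; i++) {
        if (task->results[i]) {
            zero_pages++; /* pages[i] contains only zero bytes */
        }
    }
    return zero_pages;
}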
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
include/qemu/dsa.h | 14 +++++
util/dsa.c | 125 +++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 135 insertions(+), 4 deletions(-)
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 1b4baf1c80..20bb88d48c 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -123,6 +123,20 @@ buffer_zero_batch_task_init(QemuDsaBatchTask *task,
*/
void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task synchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
+ const void **buf, size_t count, size_t len);
+
#else
static inline bool qemu_dsa_is_running(void)
diff --git a/util/dsa.c b/util/dsa.c
index f0d8cce210..74b9aa1331 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -439,6 +439,42 @@ poll_completion(struct dsa_completion_record *completion,
return 0;
}
+/**
+ * @brief Helper function to use CPU to complete a single
+ * zero page checking task.
+ *
+ * @param completion A pointer to a DSA task completion record.
+ * @param descriptor A pointer to a DSA task descriptor.
+ * @param result A pointer to the result of a zero page checking.
+ */
+static void
+task_cpu_fallback_int(struct dsa_completion_record *completion,
+ struct dsa_hw_desc *descriptor, bool *result)
+{
+ const uint8_t *buf;
+ size_t len;
+
+ if (completion->status == DSA_COMP_SUCCESS) {
+ return;
+ }
+
+ /*
+ * DSA was able to partially complete the operation. Check the
+ * result. If we already know this is not a zero page, we can
+ * return now.
+ */
+ if (completion->bytes_completed != 0 && completion->result != 0) {
+ *result = false;
+ return;
+ }
+
+ /* Fall back to the CPU to complete it. */
+ buf = (const uint8_t *)descriptor->src_addr;
+ len = descriptor->xfer_size;
+ *result = buffer_is_zero(buf + completion->bytes_completed,
+ len - completion->bytes_completed);
+}
+
/**
* @brief Complete a single DSA task in the batch task.
*
@@ -567,7 +603,7 @@ dsa_completion_loop(void *opaque)
(QemuDsaCompletionThread *)opaque;
QemuDsaBatchTask *batch_task;
QemuDsaDeviceGroup *group = thread_context->group;
- int ret;
+ int ret = 0;
rcu_register_thread();
@@ -827,7 +863,6 @@ buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
*
* @return int Zero if successful, otherwise an appropriate error code.
*/
-__attribute__((unused))
static int
buffer_zero_dsa_async(QemuDsaBatchTask *task,
const void *buf, size_t len)
@@ -846,7 +881,6 @@ buffer_zero_dsa_async(QemuDsaBatchTask *task,
* @param count The number of buffers.
* @param len The buffer length.
*/
-__attribute__((unused))
static int
buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
const void **buf, size_t count, size_t len)
@@ -877,13 +911,61 @@ buffer_zero_dsa_completion(void *context)
*
* @param batch_task A pointer to the buffer zero comparison batch task.
*/
-__attribute__((unused))
static void
buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
{
qemu_sem_wait(&batch_task->sem_task_complete);
}
+/**
+ * @brief Use CPU to complete the zero page checking task if DSA
+ * is not able to complete it.
+ *
+ * @param batch_task A pointer to the batch task.
+ */
+static void
+buffer_zero_cpu_fallback(QemuDsaBatchTask *batch_task)
+{
+ if (batch_task->task_type == QEMU_DSA_TASK) {
+ if (batch_task->completions[0].status == DSA_COMP_SUCCESS) {
+ return;
+ }
+ task_cpu_fallback_int(&batch_task->completions[0],
+ &batch_task->descriptors[0],
+ &batch_task->results[0]);
+ } else if (batch_task->task_type == QEMU_DSA_BATCH_TASK) {
+ struct dsa_completion_record *batch_completion =
+ &batch_task->batch_completion;
+ struct dsa_completion_record *completion;
+ uint8_t status;
+ bool *results = batch_task->results;
+ uint32_t count = batch_task->batch_descriptor.desc_count;
+
+ /* DSA is able to complete the entire batch task. */
+ if (batch_completion->status == DSA_COMP_SUCCESS) {
+ assert(count == batch_completion->bytes_completed);
+ return;
+ }
+
+ /*
+ * DSA encounters some error and is not able to complete
+ * the entire batch task. Use CPU fallback.
+ */
+ for (int i = 0; i < count; i++) {
+
+ completion = &batch_task->completions[i];
+ status = completion->status;
+
+ assert(status == DSA_COMP_SUCCESS ||
+ status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+ task_cpu_fallback_int(completion,
+ &batch_task->descriptors[i],
+ &results[i]);
+ }
+ }
+}
+
/**
* @brief Check if DSA is running.
*
@@ -956,3 +1038,38 @@ void qemu_dsa_cleanup(void)
dsa_device_group_cleanup(&dsa_group);
}
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task synchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
+ const void **buf, size_t count, size_t len)
+{
+ assert(batch_task != NULL);
+ assert(len != 0);
+ assert(buf != NULL);
+
+ if (count == 0 || count > batch_task->batch_size) {
+ return -1;
+ }
+
+ if (count == 1) {
+ /* DSA doesn't take batch operation with only 1 task. */
+ buffer_zero_dsa_async(batch_task, buf[0], len);
+ } else {
+ buffer_zero_dsa_batch_async(batch_task, buf, count, len);
+ }
+
+ buffer_zero_dsa_wait(batch_task);
+ buffer_zero_cpu_fallback(batch_task);
+
+ return 0;
+}
+
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (6 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 07/13] util/dsa: Implement DSA task asynchronous submission and wait for completion Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-11 22:00 ` Yichen Wang
2024-07-17 13:30 ` Fabiano Rosas
2024-07-11 21:52 ` [PATCH v5 09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Yichen Wang
` (3 subsequent siblings)
11 siblings, 2 replies; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
From: Hao Xiang <hao.xiang@linux.dev>
Intel DSA offloading is an optional feature that can be turned on when the
proper hardware and software stack is available. To turn on DSA offloading
in multifd live migration, set:
dsa-accel-path="[dsa_dev_path1] [dsa_dev_path2] ... [dsa_dev_pathX]"
This feature is turned off by default.
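As a usage illustration (not part of this patch), assuming a DSA work
queue is already available at /dev/dsa/wq0.0, the feature would be
enabled from the HMP monitor roughly as follows (how a list of several
paths is passed through HMP is still being discussed later in this
thread):

(qemu) migrate_set_parameter zero-page-detection dsa-accel
(qemu) migrate_set_parameter dsa-accel-path /dev/dsa/wq0.0
(qemu) migrate -d tcp:<destination-host>:<port>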
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
migration/migration-hmp-cmds.c | 15 ++++++++++-
migration/options.c | 47 ++++++++++++++++++++++++++++++++++
migration/options.h | 1 +
qapi/migration.json | 32 ++++++++++++++++++++---
4 files changed, 90 insertions(+), 5 deletions(-)
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7d608d26e1..c422db4ecd 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -312,7 +312,16 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
monitor_printf(mon, "%s: '%s'\n",
MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
params->tls_authz);
-
+ if (params->has_dsa_accel_path) {
+ strList *dsa_accel_path = params->dsa_accel_path;
+ monitor_printf(mon, "%s:",
+ MigrationParameter_str(MIGRATION_PARAMETER_DSA_ACCEL_PATH));
+ while (dsa_accel_path) {
+ monitor_printf(mon, " %s", dsa_accel_path->value);
+ dsa_accel_path = dsa_accel_path->next;
+ }
+ monitor_printf(mon, "\n");
+ }
if (params->has_block_bitmap_mapping) {
const BitmapMigrationNodeAliasList *bmnal;
@@ -563,6 +572,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_x_checkpoint_delay = true;
visit_type_uint32(v, param, &p->x_checkpoint_delay, &err);
break;
+ case MIGRATION_PARAMETER_DSA_ACCEL_PATH:
+ p->has_dsa_accel_path = true;
+ visit_type_strList(v, param, &p->dsa_accel_path, &err);
+ break;
case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
p->has_multifd_channels = true;
visit_type_uint8(v, param, &p->multifd_channels, &err);
diff --git a/migration/options.c b/migration/options.c
index 645f55003d..f839493016 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -29,6 +29,7 @@
#include "ram.h"
#include "options.h"
#include "sysemu/kvm.h"
+#include <cpuid.h>
/* Maximum migrate downtime set to 2000 seconds */
#define MAX_MIGRATE_DOWNTIME_SECONDS 2000
@@ -162,6 +163,10 @@ Property migration_properties[] = {
DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
parameters.zero_page_detection,
ZERO_PAGE_DETECTION_MULTIFD),
+ /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
+ /* parameters.dsa_accel_path, qdev_prop_string, char *), */
+ /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
+ /* parameters.dsa_accel_path), */
/* Migration capabilities */
DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -815,6 +820,13 @@ const char *migrate_tls_creds(void)
return s->parameters.tls_creds;
}
+const strList *migrate_dsa_accel_path(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->parameters.dsa_accel_path;
+}
+
const char *migrate_tls_hostname(void)
{
MigrationState *s = migrate_get_current();
@@ -926,6 +938,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->zero_page_detection = s->parameters.zero_page_detection;
params->has_direct_io = true;
params->direct_io = s->parameters.direct_io;
+ params->dsa_accel_path = QAPI_CLONE(strList, s->parameters.dsa_accel_path);
return params;
}
@@ -934,6 +947,7 @@ void migrate_params_init(MigrationParameters *params)
{
params->tls_hostname = g_strdup("");
params->tls_creds = g_strdup("");
+ params->dsa_accel_path = NULL;
/* Set has_* up only for parameter checks */
params->has_throttle_trigger_threshold = true;
@@ -1137,6 +1151,22 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
return false;
}
+ if (params->has_zero_page_detection &&
+ params->zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
+#ifdef CONFIG_DSA_OPT
+ unsigned int eax, ebx, ecx, edx;
+ /* ENQCMD is indicated by bit 29 of ecx in CPUID leaf 7, subleaf 0. */
+ if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
+ !(ecx & (1 << 29))) {
+ error_setg(errp, "DSA acceleration is not supported by CPU");
+ return false;
+ }
+#else
+ error_setg(errp, "DSA acceleration is not enabled");
+ return false;
+#endif
+ }
+
return true;
}
@@ -1247,6 +1277,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_direct_io) {
dest->direct_io = params->direct_io;
}
+
+ if (params->has_dsa_accel_path) {
+ dest->has_dsa_accel_path = true;
+ dest->dsa_accel_path = params->dsa_accel_path;
+ }
}
static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1376,6 +1411,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_direct_io) {
s->parameters.direct_io = params->direct_io;
}
+ if (params->has_dsa_accel_path) {
+ qapi_free_strList(s->parameters.dsa_accel_path);
+ s->parameters.has_dsa_accel_path = true;
+ s->parameters.dsa_accel_path =
+ QAPI_CLONE(strList, params->dsa_accel_path);
+ }
}
void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -1401,6 +1442,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
params->tls_authz->type = QTYPE_QSTRING;
params->tls_authz->u.s = strdup("");
}
+ /* if (params->dsa_accel_path */
+ /* && params->dsa_accel_path->type == QTYPE_QNULL) { */
+ /* qobject_unref(params->dsa_accel_path->u.n); */
+ /* params->dsa_accel_path->type = QTYPE_QLIST; */
+ /* params->dsa_accel_path->u.s = strdup(""); */
+ /* } */
migrate_params_test_apply(params, &tmp);
diff --git a/migration/options.h b/migration/options.h
index a2397026db..78b9e4080b 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -85,6 +85,7 @@ const char *migrate_tls_creds(void);
const char *migrate_tls_hostname(void);
uint64_t migrate_xbzrle_cache_size(void);
ZeroPageDetection migrate_zero_page_detection(void);
+const strList *migrate_dsa_accel_path(void);
/* parameters helpers */
diff --git a/qapi/migration.json b/qapi/migration.json
index 1234bef888..ff41780347 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -619,10 +619,14 @@
# multifd migration is enabled, else in the main migration thread
# as for @legacy.
#
+# @dsa-accel: Perform zero page checking with DSA accelerator
+# offloading in the multifd sender thread if multifd migration is
+# enabled, else in the main migration thread as for @legacy.
+#
# Since: 9.0
##
{ 'enum': 'ZeroPageDetection',
- 'data': [ 'none', 'legacy', 'multifd' ] }
+ 'data': [ 'none', 'legacy', 'multifd', 'dsa-accel' ] }
##
# @BitmapMigrationBitmapAliasTransform:
@@ -825,6 +829,12 @@
# See description in @ZeroPageDetection. Default is 'multifd'.
# (since 9.0)
#
+# @dsa-accel-path: If set, use DSA accelerator offloading for
+# certain memory operations. Enable DSA accelerator offloading for
+# zero page detection by setting @zero-page-detection to dsa-accel.
+# This parameter defines the DSA device path(s) and defaults to an
+# empty list. (since 9.2)
+#
# @direct-io: Open migration files with O_DIRECT when possible. This
# only has effect if the @mapped-ram capability is enabled.
# (Since 9.1)
@@ -843,7 +853,7 @@
'cpu-throttle-initial', 'cpu-throttle-increment',
'cpu-throttle-tailslow',
'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
- 'avail-switchover-bandwidth', 'downtime-limit',
+ 'avail-switchover-bandwidth', 'downtime-limit', 'dsa-accel-path',
{ 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
'multifd-channels',
'xbzrle-cache-size', 'max-postcopy-bandwidth',
@@ -1000,6 +1010,12 @@
# See description in @ZeroPageDetection. Default is 'multifd'.
# (since 9.0)
#
+# @dsa-accel-path: If set, use DSA accelerator offloading for
+# certain memory operations. Enable DSA accelerator offloading for
+# zero page detection by setting @zero-page-detection to dsa-accel.
+# This parameter defines the DSA device path(s) and defaults to an
+# empty list. (since 9.2)
+#
# @direct-io: Open migration files with O_DIRECT when possible. This
# only has effect if the @mapped-ram capability is enabled.
# (Since 9.1)
@@ -1044,7 +1060,8 @@
'*vcpu-dirty-limit': 'uint64',
'*mode': 'MigMode',
'*zero-page-detection': 'ZeroPageDetection',
- '*direct-io': 'bool' } }
+ '*direct-io': 'bool',
+ '*dsa-accel-path': ['str'] } }
##
# @migrate-set-parameters:
@@ -1204,6 +1221,12 @@
# See description in @ZeroPageDetection. Default is 'multifd'.
# (since 9.0)
#
+# @dsa-accel-path: If set, use DSA accelerator offloading for
+# certain memory operations. Enable DSA accelerator offloading for
+# zero page detection by setting @zero-page-detection to dsa-accel.
+# This parameter defines the DSA device path(s) and defaults to an
+# empty list. (since 9.2)
+#
# @direct-io: Open migration files with O_DIRECT when possible. This
# only has effect if the @mapped-ram capability is enabled.
# (Since 9.1)
@@ -1245,7 +1268,8 @@
'*vcpu-dirty-limit': 'uint64',
'*mode': 'MigMode',
'*zero-page-detection': 'ZeroPageDetection',
- '*direct-io': 'bool' } }
+ '*direct-io': 'bool',
+ '*dsa-accel-path': ['str'] } }
##
# @query-migrate-parameters:
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [PATCH v5 09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (7 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
@ 2024-07-11 21:52 ` Yichen Wang
2024-07-17 13:39 ` Fabiano Rosas
2024-07-11 22:49 ` [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Michael S. Tsirkin
` (2 subsequent siblings)
11 siblings, 1 reply; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 21:52 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
From: Hao Xiang <hao.xiang@linux.dev>
1. Refactor the multifd_send_thread function.
2. Introduce the DSA batch task structure in MultiFDSendParams.
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
include/qemu/dsa.h | 41 ++++---
migration/multifd.c | 4 +
migration/multifd.h | 3 +
util/dsa.c | 270 +++++++++++++++++++++++---------------------
4 files changed, 172 insertions(+), 146 deletions(-)
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 20bb88d48c..fd0305a7c7 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -16,6 +16,7 @@
#define QEMU_DSA_H
#include "qemu/error-report.h"
+#include "exec/cpu-common.h"
#include "qemu/thread.h"
#include "qemu/queue.h"
@@ -70,10 +71,11 @@ typedef struct QemuDsaBatchTask {
QemuDsaTaskStatus status;
int batch_size;
bool *results;
+ /* Address of each page in pages */
+ ram_addr_t *addr;
QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
} QemuDsaBatchTask;
-
/**
* @brief Initializes DSA devices.
*
@@ -106,15 +108,13 @@ void qemu_dsa_cleanup(void);
bool qemu_dsa_is_running(void);
/**
- * @brief Initializes a buffer zero batch task.
+ * @brief Initializes a buffer zero DSA batch task.
*
- * @param task A pointer to the batch task to initialize.
- * @param results A pointer to an array of zero page checking results.
- * @param batch_size The number of DSA tasks in the batch.
+ * @param batch_size The number of zero page checking tasks in the batch.
+ * @return A pointer to the initialized DSA batch task.
*/
-void
-buffer_zero_batch_task_init(QemuDsaBatchTask *task,
- bool *results, int batch_size);
+QemuDsaBatchTask *
+buffer_zero_batch_task_init(int batch_size);
/**
* @brief Performs the proper cleanup on a DSA batch task.
@@ -139,6 +139,8 @@ buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
#else
+typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
+
static inline bool qemu_dsa_is_running(void)
{
return false;
@@ -146,19 +148,28 @@ static inline bool qemu_dsa_is_running(void)
static inline int qemu_dsa_init(const char *dsa_parameter, Error **errp)
{
- if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
- error_setg(errp, "DSA is not supported.");
- return -1;
- }
-
- return 0;
+ error_setg(errp, "DSA accelerator is not enabled.");
+ return -1;
}
static inline void qemu_dsa_start(void) {}
static inline void qemu_dsa_stop(void) {}
-static inline void qemu_dsa_cleanup(void) {}
+static inline QemuDsaBatchTask *buffer_zero_batch_task_init(int batch_size)
+{
+ return NULL;
+}
+
+static inline void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task) {}
+
+static inline int
+buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
+ const void **buf, size_t count, size_t len)
+{
+ /* DSA support is compiled out; fail the request. */
+ return -1;
+}
#endif
diff --git a/migration/multifd.c b/migration/multifd.c
index 0b4cbaddfe..6f8edd4b6a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -13,6 +13,7 @@
#include "qemu/osdep.h"
#include "qemu/cutils.h"
#include "qemu/rcu.h"
+#include "qemu/dsa.h"
#include "exec/target_page.h"
#include "sysemu/sysemu.h"
#include "exec/ramblock.h"
@@ -792,6 +793,8 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
p->name = NULL;
multifd_pages_clear(p->pages);
p->pages = NULL;
+ buffer_zero_batch_task_destroy(p->dsa_batch_task);
+ p->dsa_batch_task = NULL;
p->packet_len = 0;
g_free(p->packet);
p->packet = NULL;
@@ -1182,6 +1185,7 @@ bool multifd_send_setup(void)
qemu_sem_init(&p->sem_sync, 0);
p->id = i;
p->pages = multifd_pages_init(page_count);
+ p->dsa_batch_task = buffer_zero_batch_task_init(page_count);
if (use_packets) {
p->packet_len = sizeof(MultiFDPacket_t)
diff --git a/migration/multifd.h b/migration/multifd.h
index 0ecd6f47d7..027f57bf4e 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -14,6 +14,7 @@
#define QEMU_MIGRATION_MULTIFD_H
#include "ram.h"
+#include "qemu/dsa.h"
typedef struct MultiFDRecvData MultiFDRecvData;
@@ -137,6 +138,8 @@ typedef struct {
* pending_job != 0 -> multifd_channel can use it.
*/
MultiFDPages_t *pages;
+ /* Zero page checking batch task */
+ QemuDsaBatchTask *dsa_batch_task;
/* thread local variables. No locking required */
diff --git a/util/dsa.c b/util/dsa.c
index 74b9aa1331..5aba1ae23a 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -696,93 +696,81 @@ static void dsa_completion_thread_stop(void *opaque)
}
/**
- * @brief Initializes a buffer zero comparison DSA task.
+ * @brief Check if DSA is running.
*
- * @param descriptor A pointer to the DSA task descriptor.
- * @param completion A pointer to the DSA task completion record.
+ * @return True if DSA is running, otherwise false.
*/
+bool qemu_dsa_is_running(void)
+{
+ return completion_thread.running;
+}
+
static void
-buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
- struct dsa_completion_record *completion)
+dsa_globals_init(void)
{
- descriptor->opcode = DSA_OPCODE_COMPVAL;
- descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
- descriptor->comp_pattern = (uint64_t)0;
- descriptor->completion_addr = (uint64_t)completion;
+ max_retry_count = DSA_WQ_DEPTH;
}
/**
- * @brief Initializes a buffer zero batch task.
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device path from migration parameter.
*
- * @param task A pointer to the batch task to initialize.
- * @param results A pointer to an array of zero page checking results.
- * @param batch_size The number of DSA tasks in the batch.
+ * @return int Zero if successful, otherwise non-zero.
*/
-void
-buffer_zero_batch_task_init(QemuDsaBatchTask *task,
- bool *results, int batch_size)
+int qemu_dsa_init(const char *dsa_parameter, Error **errp)
{
- int descriptors_size = sizeof(*task->descriptors) * batch_size;
- memset(task, 0, sizeof(*task));
-
- task->descriptors =
- (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
- memset(task->descriptors, 0, descriptors_size);
- task->completions = (struct dsa_completion_record *)qemu_memalign(
- 32, sizeof(*task->completions) * batch_size);
- task->results = results;
- task->batch_size = batch_size;
+ dsa_globals_init();
- task->batch_completion.status = DSA_COMP_NONE;
- task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
- /* TODO: Ensure that we never send a batch with count <= 1 */
- task->batch_descriptor.desc_count = 0;
- task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
- task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
- task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
- task->status = QEMU_DSA_TASK_READY;
- task->group = &dsa_group;
- task->device = dsa_device_group_get_next_device(&dsa_group);
+ return dsa_device_group_init(&dsa_group, dsa_parameter, errp);
+}
- for (int i = 0; i < task->batch_size; i++) {
- buffer_zero_task_init_int(&task->descriptors[i],
- &task->completions[i]);
+/**
+ * @brief Start logic to enable using DSA.
+ *
+ */
+void qemu_dsa_start(void)
+{
+ if (dsa_group.num_dsa_devices == 0) {
+ return;
}
-
- qemu_sem_init(&task->sem_task_complete, 0);
- task->completion_callback = buffer_zero_dsa_completion;
+ if (dsa_group.running) {
+ return;
+ }
+ dsa_device_group_start(&dsa_group);
+ dsa_completion_thread_init(&completion_thread, &dsa_group);
}
/**
- * @brief Performs the proper cleanup on a DSA batch task.
+ * @brief Stop the device group and the completion thread.
*
- * @param task A pointer to the batch task to cleanup.
*/
-void
-buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
+void qemu_dsa_stop(void)
{
- qemu_vfree(task->descriptors);
- qemu_vfree(task->completions);
- task->results = NULL;
+ QemuDsaDeviceGroup *group = &dsa_group;
- qemu_sem_destroy(&task->sem_task_complete);
+ if (!group->running) {
+ return;
+ }
+
+ dsa_completion_thread_stop(&completion_thread);
+ dsa_empty_task_queue(group);
}
/**
- * @brief Resets a buffer zero comparison DSA batch task.
+ * @brief Clean up system resources created for DSA offloading.
*
- * @param task A pointer to the batch task.
- * @param count The number of DSA tasks this batch task will contain.
*/
-static void
-buffer_zero_batch_task_reset(QemuDsaBatchTask *task, size_t count)
+void qemu_dsa_cleanup(void)
{
- task->batch_completion.status = DSA_COMP_NONE;
- task->batch_descriptor.desc_count = count;
- task->task_type = QEMU_DSA_BATCH_TASK;
- task->status = QEMU_DSA_TASK_READY;
+ qemu_dsa_stop();
+ dsa_device_group_cleanup(&dsa_group);
}
+
+/* Buffer zero comparison DSA task implementations */
+/* =============================================== */
+
/**
* @brief Sets a buffer zero comparison DSA task.
*
@@ -817,6 +805,21 @@ buffer_zero_task_reset(QemuDsaBatchTask *task)
task->status = QEMU_DSA_TASK_READY;
}
+/**
+ * @brief Resets a buffer zero comparison DSA batch task.
+ *
+ * @param task A pointer to the batch task.
+ * @param count The number of DSA tasks this batch task will contain.
+ */
+static void
+buffer_zero_batch_task_reset(QemuDsaBatchTask *task, size_t count)
+{
+ task->batch_completion.status = DSA_COMP_NONE;
+ task->batch_descriptor.desc_count = count;
+ task->task_type = QEMU_DSA_BATCH_TASK;
+ task->status = QEMU_DSA_TASK_READY;
+}
+
/**
* @brief Sets a buffer zero comparison DSA task.
*
@@ -923,6 +926,7 @@ buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
*
* @param batch_task A pointer to the batch task.
*/
+
static void
buffer_zero_cpu_fallback(QemuDsaBatchTask *batch_task)
{
@@ -966,78 +970,6 @@ buffer_zero_cpu_fallback(QemuDsaBatchTask *batch_task)
}
}
-/**
- * @brief Check if DSA is running.
- *
- * @return True if DSA is running, otherwise false.
- */
-bool qemu_dsa_is_running(void)
-{
- return completion_thread.running;
-}
-
-static void
-dsa_globals_init(void)
-{
- max_retry_count = DSA_WQ_DEPTH;
-}
-
-/**
- * @brief Initializes DSA devices.
- *
- * @param dsa_parameter A list of DSA device path from migration parameter.
- *
- * @return int Zero if successful, otherwise non zero.
- */
-int qemu_dsa_init(const char *dsa_parameter, Error **errp)
-{
- dsa_globals_init();
-
- return dsa_device_group_init(&dsa_group, dsa_parameter, errp);
-}
-
-/**
- * @brief Start logic to enable using DSA.
- *
- */
-void qemu_dsa_start(void)
-{
- if (dsa_group.num_dsa_devices == 0) {
- return;
- }
- if (dsa_group.running) {
- return;
- }
- dsa_device_group_start(&dsa_group);
- dsa_completion_thread_init(&completion_thread, &dsa_group);
-}
-
-/**
- * @brief Stop the device group and the completion thread.
- *
- */
-void qemu_dsa_stop(void)
-{
- QemuDsaDeviceGroup *group = &dsa_group;
-
- if (!group->running) {
- return;
- }
-
- dsa_completion_thread_stop(&completion_thread);
- dsa_empty_task_queue(group);
-}
-
-/**
- * @brief Clean up system resources created for DSA offloading.
- *
- */
-void qemu_dsa_cleanup(void)
-{
- qemu_dsa_stop();
- dsa_device_group_cleanup(&dsa_group);
-}
-
/**
* @brief Performs buffer zero comparison on a DSA batch task synchronously.
*
@@ -1073,3 +1005,79 @@ buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
return 0;
}
+/**
+ * @brief Initializes a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param completion A pointer to the DSA task completion record.
+ */
+static void
+buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
+ struct dsa_completion_record *completion)
+{
+ descriptor->opcode = DSA_OPCODE_COMPVAL;
+ descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+ descriptor->comp_pattern = (uint64_t)0;
+ descriptor->completion_addr = (uint64_t)completion;
+}
+
+/**
+ * @brief Initializes a buffer zero DSA batch task.
+ *
+ * @param batch_size The number of zero page checking tasks in the batch.
+ * @return A pointer to the initialized DSA batch task.
+ */
+QemuDsaBatchTask *
+buffer_zero_batch_task_init(int batch_size)
+{
+ QemuDsaBatchTask *task = qemu_memalign(64, sizeof(QemuDsaBatchTask));
+ int descriptors_size = sizeof(*task->descriptors) * batch_size;
+
+ memset(task, 0, sizeof(*task));
+ task->addr = g_new0(ram_addr_t, batch_size);
+ task->results = g_new0(bool, batch_size);
+ task->batch_size = batch_size;
+ task->descriptors =
+ (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
+ memset(task->descriptors, 0, descriptors_size);
+ task->completions = (struct dsa_completion_record *)qemu_memalign(
+ 32, sizeof(*task->completions) * batch_size);
+
+ task->batch_completion.status = DSA_COMP_NONE;
+ task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
+ /* TODO: Ensure that we never send a batch with count <= 1 */
+ task->batch_descriptor.desc_count = 0;
+ task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
+ task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+ task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
+ task->status = QEMU_DSA_TASK_READY;
+ task->group = &dsa_group;
+ task->device = dsa_device_group_get_next_device(&dsa_group);
+
+ for (int i = 0; i < task->batch_size; i++) {
+ buffer_zero_task_init_int(&task->descriptors[i],
+ &task->completions[i]);
+ }
+
+ qemu_sem_init(&task->sem_task_complete, 0);
+ task->completion_callback = buffer_zero_dsa_completion;
+
+ return task;
+}
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void
+buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
+{
+ g_free(task->addr);
+ g_free(task->results);
+ qemu_vfree(task->descriptors);
+ qemu_vfree(task->completions);
+ task->results = NULL;
+ qemu_sem_destroy(&task->sem_task_complete);
+ qemu_vfree(task);
+}
--
Yichen Wang
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-11 21:52 ` [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
@ 2024-07-11 22:00 ` Yichen Wang
2024-07-17 0:00 ` Fabiano Rosas
2024-07-17 13:30 ` Fabiano Rosas
1 sibling, 1 reply; 33+ messages in thread
From: Yichen Wang @ 2024-07-11 22:00 UTC (permalink / raw)
To: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang
On Thu, Jul 11, 2024 at 2:53 PM Yichen Wang <yichen.wang@bytedance.com> wrote:
> diff --git a/migration/options.c b/migration/options.c
> index 645f55003d..f839493016 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -29,6 +29,7 @@
> #include "ram.h"
> #include "options.h"
> #include "sysemu/kvm.h"
> +#include <cpuid.h>
>
> /* Maximum migrate downtime set to 2000 seconds */
> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
> @@ -162,6 +163,10 @@ Property migration_properties[] = {
> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
> parameters.zero_page_detection,
> ZERO_PAGE_DETECTION_MULTIFD),
> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
> + /* parameters.dsa_accel_path), */
>
> /* Migration capabilities */
> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
I changed dsa-accel-path to be a ['str'], i.e. strList* in C.
However, I am having a hard time figuring out how to define the proper
property here. I don't know which macro to use and I can't find good
examples... I need some guidance on how to proceed. Basically, I need
to be able to pass something like '-global
migration.dsa-accel-path="/dev/dsa/wq0.0"' on the command line, or
"migrate_set_parameter dsa-accel-path" in the QEMU monitor. I don't
know how to pass a strList there.
Thanks very much!
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (8 preceding siblings ...)
2024-07-11 21:52 ` [PATCH v5 09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Yichen Wang
@ 2024-07-11 22:49 ` Michael S. Tsirkin
2024-07-15 8:29 ` Liu, Yuan1
2024-07-12 10:58 ` Michael S. Tsirkin
2024-07-16 21:47 ` Fabiano Rosas
11 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-07-11 22:49 UTC (permalink / raw)
To: Yichen Wang
Cc: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Cornelia Huck, qemu-devel,
Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang
On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> * Performance:
>
> We use two Intel 4th generation Xeon servers for testing.
>
> Architecture: x86_64
> CPU(s): 192
> Thread(s) per core: 2
> Core(s) per socket: 48
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 143
> Model name: Intel(R) Xeon(R) Platinum 8457C
> Stepping: 8
> CPU MHz: 2538.624
> CPU max MHz: 3800.0000
> CPU min MHz: 800.0000
>
> We perform multifd live migration with below setup:
> 1. VM has 100GB memory.
> 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> size of the payload sent over the network.
> 3. Use 8 multifd channels.
> 4. Use tcp for live migration.
> 4. Use CPU to perform zero page checking as the baseline.
> 5. Use one DSA device to offload zero page checking to compare with the baseline.
> 6. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
>
> A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
>
> CPU usage
>
> |---------------|---------------|---------------|---------------|
> | |comm |runtime(msec) |totaltime(msec)|
> |---------------|---------------|---------------|---------------|
> |Baseline |live_migration |5657.58 | |
> | |multifdsend_0 |3931.563 | |
> | |multifdsend_1 |4405.273 | |
> | |multifdsend_2 |3941.968 | |
> | |multifdsend_3 |5032.975 | |
> | |multifdsend_4 |4533.865 | |
> | |multifdsend_5 |4530.461 | |
> | |multifdsend_6 |5171.916 | |
> | |multifdsend_7 |4722.769 |41922 |
> |---------------|---------------|---------------|---------------|
> |DSA |live_migration |6129.168 | |
> | |multifdsend_0 |2954.717 | |
> | |multifdsend_1 |2766.359 | |
> | |multifdsend_2 |2853.519 | |
> | |multifdsend_3 |2740.717 | |
> | |multifdsend_4 |2824.169 | |
> | |multifdsend_5 |2966.908 | |
> | |multifdsend_6 |2611.137 | |
> | |multifdsend_7 |3114.732 | |
> | |dsa_completion |3612.564 |32568 |
> |---------------|---------------|---------------|---------------|
>
> Baseline total runtime is calculated by adding up all multifdsend_X
> and live_migration threads runtime. DSA offloading total runtime is
> calculated by adding up all multifdsend_X, live_migration and
> dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> that is 23% total CPU usage savings.
Here the DSA was mostly idle.
Sounds good but a question: what if several qemu instances are
migrated in parallel?
Some accelerators tend to basically stall if several tasks
are trying to use them at the same time.
Where is the boundary here?
--
MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (9 preceding siblings ...)
2024-07-11 22:49 ` [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Michael S. Tsirkin
@ 2024-07-12 10:58 ` Michael S. Tsirkin
2024-07-16 21:47 ` Fabiano Rosas
11 siblings, 0 replies; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-07-12 10:58 UTC (permalink / raw)
To: Yichen Wang
Cc: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Cornelia Huck, qemu-devel,
Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang
On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> v5
> * Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
> * Rename struct definitions with typedef and CamelCase names;
> * Add build and runtime checks about DSA accelerator;
> * Address all comments from v4 reviews about typos, licenses, comments,
> error reporting, etc.
>
> v4
> * Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
> * A separate "multifd zero page checking" patchset was split from this
> patchset's v3 and got merged into master. v4 re-applied the rest of all
> commits on top of that patchset, re-factored and re-tested.
> https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
> * There are some feedback from v3 I likely overlooked.
>
> v3
> * Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
> * Fix error/warning from checkpatch.pl
> * Fix use-after-free bug when multifd-dsa-accel option is not set.
> * Handle error from dsa_init and correctly propogate the error.
> * Remove unnecessary call to dsa_stop.
> * Detect availability of DSA feature at compile time.
> * Implement a generic batch_task structure and a DSA specific one dsa_batch_task.
> * Remove all exit() calls and propagate errors correctly.
> * Use bytes instead of page count to configure multifd-packet-size option.
>
> v2
> * Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
> * Leave Juan's changes in their original form instead of squashing them.
> * Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
> * Use page count to configure multifd-packet-size option.
> * Don't use the FLAKY flag in DSA tests.
> * Test if DSA integration test is setup correctly and skip the test if
> * not.
> * Fixed broken link in the previous patch cover.
>
> * Background:
The DSA interface here is extremely low level: it would mean we add a
ton of complex fragile code for any new accelerator.
Please add something high level and simple on top of this.
Off the top of my head:
void start_memcmp(void *a, void *b, int cnt, void *opaque,
void (*callback)(int result, void *a, void *b, int cnt, void *opaque)
);
Do all the batching hacks internally.
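A rough, CPU-only sketch of the proposed shape (purely illustrative; a
real implementation would build batch descriptors internally, submit
them to the accelerator, and complete asynchronously rather than call
memcmp() inline):

void start_memcmp(void *a, void *b, int cnt, void *opaque,
                  void (*callback)(int result, void *a, void *b,
                                   int cnt, void *opaque))
{
    /* Stand-in for the offloaded comparison. */
    int result = memcmp(a, b, cnt);
    callback(result, a, b, cnt, opaque);
}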
> I posted an RFC about DSA offloading in QEMU:
> https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/
>
> This patchset implements the DSA offloading on zero page checking in
> multifd live migration code path.
>
> * Overview:
>
> Intel Data Streaming Accelerator(DSA) is introduced in Intel's 4th generation
> Xeon server, aka Sapphire Rapids.
> https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
> https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
> One of the things DSA can do is to offload memory comparison workload from
> CPU to DSA accelerator hardware. This patchset implements a solution to offload
> QEMU's zero page checking from CPU to DSA accelerator hardware. We gain
> two benefits from this change:
> 1. Reduces CPU usage in multifd live migration workflow across all use
> cases.
> 2. Reduces migration total time in some use cases.
>
> * Design:
>
> These are the logical steps to perform DSA offloading:
> 1. Configure DSA accelerators and create user space openable DSA work
> queues via the idxd driver.
> 2. Map DSA's work queue into a user space address space.
> 3. Fill an in-memory task descriptor to describe the memory operation.
> 4. Use dedicated CPU instruction _enqcmd to queue a task descriptor to
> the work queue.
> 5. Poll the task descriptor's completion status field until the task
> completes.
> 6. Check return status.
>
> The memory operation is now totally done by the accelerator hardware but
> the new workflow introduces overheads. The overhead is the extra cost CPU
> prepares and submits the task descriptors and the extra cost CPU pulls for
> completion. The design is around minimizing these two overheads.
>
> 1. In order to reduce the overhead on task preparation and submission,
> we use batch descriptors. A batch descriptor will contain N individual
> zero page checking tasks where the default N is 128 (default packet size
> / page size) and we can increase N by setting the packet size via a new
> migration option.
> 2. The multifd sender threads prepares and submits batch tasks to DSA
> hardware and it waits on a synchronization object for task completion.
> Whenever a DSA task is submitted, the task structure is added to a
> thread safe queue. It's safe to have multiple multifd sender threads to
> submit tasks concurrently.
> 3. Multiple DSA hardware devices can be used. During multifd initialization,
> every sender thread will be assigned a DSA device to work with. We
> use a round-robin scheme to evenly distribute the work across all used
> DSA devices.
> 4. Use a dedicated thread dsa_completion to perform busy polling for all
> DSA task completions. The thread keeps dequeuing DSA tasks from the
> thread safe queue. The thread blocks when there is no outstanding DSA
> task. When polling for completion of a DSA task, the thread uses the CPU
> instruction _mm_pause between the iterations of a busy loop to save some
> CPU power as well as optimizing core resources for the sibling hyperthread.
> 5. DSA accelerator can encounter errors. The most popular error is a
> page fault. We have tested using devices to handle page faults but
> performance is bad. Right now, if DSA hits a page fault, we fallback to
> use CPU to complete the rest of the work. The CPU fallback is done in
> the multifd sender thread.
> 6. Added a new migration option multifd-dsa-accel to set the DSA device
> path. If set, the multifd workflow will leverage the DSA devices for
> offloading.
> 7. Added a new migration option multifd-normal-page-ratio to make
> multifd live migration easier to test. Setting a normal page ratio will
> make live migration recognize a zero page as a normal page and send
> the entire payload over the network. If we want to send a large network
> payload and analyze throughput, this option is useful.
> 8. Added a new migration option multifd-packet-size. This can increase
> the number of pages being zero page checked and sent over the network.
> The extra synchronization between the sender threads and the dsa
> completion thread is an overhead. Using a large packet size can reduce
> that overhead.
>
> * Performance:
>
> We use two Intel 4th generation Xeon servers for testing.
>
> Architecture: x86_64
> CPU(s): 192
> Thread(s) per core: 2
> Core(s) per socket: 48
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 143
> Model name: Intel(R) Xeon(R) Platinum 8457C
> Stepping: 8
> CPU MHz: 2538.624
> CPU max MHz: 3800.0000
> CPU min MHz: 800.0000
>
> We perform multifd live migration with below setup:
> 1. VM has 100GB memory.
> 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> size of the payload sent over the network.
> 3. Use 8 multifd channels.
> 4. Use tcp for live migration.
> 4. Use CPU to perform zero page checking as the baseline.
> 5. Use one DSA device to offload zero page checking to compare with the baseline.
> 6. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
>
> A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
>
> CPU usage
>
> |---------------|---------------|---------------|---------------|
> | |comm |runtime(msec) |totaltime(msec)|
> |---------------|---------------|---------------|---------------|
> |Baseline |live_migration |5657.58 | |
> | |multifdsend_0 |3931.563 | |
> | |multifdsend_1 |4405.273 | |
> | |multifdsend_2 |3941.968 | |
> | |multifdsend_3 |5032.975 | |
> | |multifdsend_4 |4533.865 | |
> | |multifdsend_5 |4530.461 | |
> | |multifdsend_6 |5171.916 | |
> | |multifdsend_7 |4722.769 |41922 |
> |---------------|---------------|---------------|---------------|
> |DSA |live_migration |6129.168 | |
> | |multifdsend_0 |2954.717 | |
> | |multifdsend_1 |2766.359 | |
> | |multifdsend_2 |2853.519 | |
> | |multifdsend_3 |2740.717 | |
> | |multifdsend_4 |2824.169 | |
> | |multifdsend_5 |2966.908 | |
> | |multifdsend_6 |2611.137 | |
> | |multifdsend_7 |3114.732 | |
> | |dsa_completion |3612.564 |32568 |
> |---------------|---------------|---------------|---------------|
>
> Baseline total runtime is calculated by adding up all multifdsend_X
> and live_migration threads runtime. DSA offloading total runtime is
> calculated by adding up all multifdsend_X, live_migration and
> dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> that is 23% total CPU usage savings.
>
> Latency
> |---------------|---------------|---------------|---------------|---------------|---------------|
> | |total time |down time |throughput |transferred-ram|total-ram |
> |---------------|---------------|---------------|---------------|---------------|---------------|
> |Baseline |10343 ms |161 ms |41007.00 mbps |51583797 kb |102400520 kb |
> |---------------|---------------|---------------|---------------|-------------------------------|
> |DSA offload |9535 ms |135 ms |46554.40 mbps |53947545 kb |102400520 kb |
> |---------------|---------------|---------------|---------------|---------------|---------------|
>
> Total time is 8% faster and down time is 16% faster.
>
> B) Scenario 2: 100% (100GB) zero pages on an 100GB vm.
>
> CPU usage
> |---------------|---------------|---------------|---------------|
> | |comm |runtime(msec) |totaltime(msec)|
> |---------------|---------------|---------------|---------------|
> |Baseline |live_migration |4860.718 | |
> | |multifdsend_0 |748.875 | |
> | |multifdsend_1 |898.498 | |
> | |multifdsend_2 |787.456 | |
> | |multifdsend_3 |764.537 | |
> | |multifdsend_4 |785.687 | |
> | |multifdsend_5 |756.941 | |
> | |multifdsend_6 |774.084 | |
> | |multifdsend_7 |782.900 |11154 |
> |---------------|---------------|-------------------------------|
> |DSA offloading |live_migration |3846.976 | |
> | |multifdsend_0 |191.880 | |
> | |multifdsend_1 |166.331 | |
> | |multifdsend_2 |168.528 | |
> | |multifdsend_3 |197.831 | |
> | |multifdsend_4 |169.580 | |
> | |multifdsend_5 |167.984 | |
> | |multifdsend_6 |198.042 | |
> | |multifdsend_7 |170.624 | |
> | |dsa_completion |3428.669 |8700 |
> |---------------|---------------|---------------|---------------|
>
> Baseline total runtime is 11154 msec and DSA offloading total runtime is
> 8700 msec. That is 22% CPU savings.
>
> Latency
> |--------------------------------------------------------------------------------------------|
> | |total time |down time |throughput |transferred-ram|total-ram |
> |---------------|---------------|---------------|---------------|---------------|------------|
> |Baseline |4867 ms |20 ms |1.51 mbps |565 kb |102400520 kb|
> |---------------|---------------|---------------|---------------|----------------------------|
> |DSA offload |3888 ms |18 ms |1.89 mbps |565 kb |102400520 kb|
> |---------------|---------------|---------------|---------------|---------------|------------|
>
> Total time 20% faster and down time 10% faster.
>
> * Testing:
>
> 1. Added unit tests for cover the added code path in dsa.c
> 2. Added integration tests to cover multifd live migration using DSA
> offloading.
>
> Hao Xiang (12):
> meson: Introduce new instruction set enqcmd to the build system.
> util/dsa: Implement DSA device start and stop logic.
> util/dsa: Implement DSA task enqueue and dequeue.
> util/dsa: Implement DSA task asynchronous completion thread model.
> util/dsa: Implement zero page checking in DSA task.
> util/dsa: Implement DSA task asynchronous submission and wait for
> completion.
> migration/multifd: Add new migration option for multifd DSA
> offloading.
> migration/multifd: Prepare to introduce DSA acceleration on the
> multifd path.
> migration/multifd: Enable DSA offloading in multifd sender path.
> migration/multifd: Add migration option set packet size.
> util/dsa: Add unit test coverage for Intel DSA task submission and
> completion.
> migration/multifd: Add integration tests for multifd with Intel DSA
> offloading.
>
> Yichen Wang (1):
> util/dsa: Add idxd into linux header copy list.
>
> include/qemu/dsa.h | 176 +++++
> meson.build | 14 +
> meson_options.txt | 2 +
> migration/migration-hmp-cmds.c | 22 +-
> migration/migration.c | 2 +-
> migration/multifd-zero-page.c | 100 ++-
> migration/multifd-zlib.c | 6 +-
> migration/multifd-zstd.c | 6 +-
> migration/multifd.c | 53 +-
> migration/multifd.h | 8 +-
> migration/options.c | 85 +++
> migration/options.h | 2 +
> qapi/migration.json | 49 +-
> scripts/meson-buildoptions.sh | 3 +
> scripts/update-linux-headers.sh | 2 +-
> tests/qtest/migration-test.c | 80 ++-
> tests/unit/meson.build | 6 +
> tests/unit/test-dsa.c | 503 ++++++++++++++
> util/dsa.c | 1082 +++++++++++++++++++++++++++++++
> util/meson.build | 3 +
> 20 files changed, 2177 insertions(+), 27 deletions(-)
> create mode 100644 include/qemu/dsa.h
> create mode 100644 tests/unit/test-dsa.c
> create mode 100644 util/dsa.c
>
> --
> Yichen Wang
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-11 22:49 ` [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Michael S. Tsirkin
@ 2024-07-15 8:29 ` Liu, Yuan1
2024-07-15 12:23 ` Michael S. Tsirkin
0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-15 8:29 UTC (permalink / raw)
To: Michael S. Tsirkin, Wang, Yichen
Cc: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Cornelia Huck,
qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, July 12, 2024 6:49 AM
> To: Wang, Yichen <yichen.wang@bytedance.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>;
> Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-Ren
> (Jack) Chuang <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > * Performance:
> >
> > We use two Intel 4th generation Xeon servers for testing.
> >
> > Architecture: x86_64
> > CPU(s): 192
> > Thread(s) per core: 2
> > Core(s) per socket: 48
> > Socket(s): 2
> > NUMA node(s): 2
> > Vendor ID: GenuineIntel
> > CPU family: 6
> > Model: 143
> > Model name: Intel(R) Xeon(R) Platinum 8457C
> > Stepping: 8
> > CPU MHz: 2538.624
> > CPU max MHz: 3800.0000
> > CPU min MHz: 800.0000
> >
> > We perform multifd live migration with below setup:
> > 1. VM has 100GB memory.
> > 2. Use the new migration option multifd-set-normal-page-ratio to control
> the total
> > size of the payload sent over the network.
> > 3. Use 8 multifd channels.
> > 4. Use tcp for live migration.
> > 4. Use CPU to perform zero page checking as the baseline.
> > 5. Use one DSA device to offload zero page checking to compare with the
> baseline.
> > 6. Use "perf sched record" and "perf sched timehist" to analyze CPU
> usage.
> >
> > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> >
> > CPU usage
> >
> > |---------------|---------------|---------------|---------------|
> > | |comm |runtime(msec) |totaltime(msec)|
> > |---------------|---------------|---------------|---------------|
> > |Baseline |live_migration |5657.58 | |
> > | |multifdsend_0 |3931.563 | |
> > | |multifdsend_1 |4405.273 | |
> > | |multifdsend_2 |3941.968 | |
> > | |multifdsend_3 |5032.975 | |
> > | |multifdsend_4 |4533.865 | |
> > | |multifdsend_5 |4530.461 | |
> > | |multifdsend_6 |5171.916 | |
> > | |multifdsend_7 |4722.769 |41922 |
> > |---------------|---------------|---------------|---------------|
> > |DSA |live_migration |6129.168 | |
> > | |multifdsend_0 |2954.717 | |
> > | |multifdsend_1 |2766.359 | |
> > | |multifdsend_2 |2853.519 | |
> > | |multifdsend_3 |2740.717 | |
> > | |multifdsend_4 |2824.169 | |
> > | |multifdsend_5 |2966.908 | |
> > | |multifdsend_6 |2611.137 | |
> > | |multifdsend_7 |3114.732 | |
> > | |dsa_completion |3612.564 |32568 |
> > |---------------|---------------|---------------|---------------|
> >
> > Baseline total runtime is calculated by adding up all multifdsend_X
> > and live_migration threads runtime. DSA offloading total runtime is
> > calculated by adding up all multifdsend_X, live_migration and
> > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > that is 23% total CPU usage savings.
>
>
> Here the DSA was mostly idle.
>
> Sounds good but a question: what if several qemu instances are
> migrated in parallel?
>
> Some accelerators tend to basically stall if several tasks
> are trying to use them at the same time.
>
> Where is the boundary here?
A DSA device can be assigned to multiple QEMU instances.
The DSA resource used by each process is called a work queue. Each DSA
device supports up to 8 work queues, and work queues are classified into
dedicated queues and shared queues.

A dedicated queue can only serve one process. Theoretically, there is no limit
on the number of processes sharing a shared queue, because it is based on
enqcmd + SVM technology.

https://www.kernel.org/doc/html/v5.17/x86/sva.html
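
For reference, here is a minimal sketch of how a process attaches to a shared
work queue through the idxd character device. The device path, portal size and
helper name below are my assumptions based on the upstream idxd documentation,
not code from this series:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Open a work queue configured in shared mode (e.g. via accel-config) and
 * map its 4KiB submission portal; descriptors are later pushed to this
 * address with ENQCMD, and thanks to SVM they can reference plain user
 * virtual addresses. */
static void *map_shared_wq_portal(const char *wq_path)
{
    int fd = open(wq_path, O_RDWR);       /* e.g. "/dev/dsa/wq0.0" */
    void *portal;

    if (fd < 0) {
        perror("open work queue");
        return NULL;
    }
    portal = mmap(NULL, 0x1000, PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);
    return portal == MAP_FAILED ? NULL : portal;
}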
> --
> MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 8:29 ` Liu, Yuan1
@ 2024-07-15 12:23 ` Michael S. Tsirkin
2024-07-15 13:09 ` Liu, Yuan1
0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 12:23 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, July 12, 2024 6:49 AM
> > To: Wang, Yichen <yichen.wang@bytedance.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>;
> > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-Ren
> > (Jack) Chuang <horenchuang@bytedance.com>
> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> > zero page checking in multifd live migration.
> >
> > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > * Performance:
> > >
> > > We use two Intel 4th generation Xeon servers for testing.
> > >
> > > Architecture: x86_64
> > > CPU(s): 192
> > > Thread(s) per core: 2
> > > Core(s) per socket: 48
> > > Socket(s): 2
> > > NUMA node(s): 2
> > > Vendor ID: GenuineIntel
> > > CPU family: 6
> > > Model: 143
> > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > Stepping: 8
> > > CPU MHz: 2538.624
> > > CPU max MHz: 3800.0000
> > > CPU min MHz: 800.0000
> > >
> > > We perform multifd live migration with below setup:
> > > 1. VM has 100GB memory.
> > > 2. Use the new migration option multifd-set-normal-page-ratio to control
> > the total
> > > size of the payload sent over the network.
> > > 3. Use 8 multifd channels.
> > > 4. Use tcp for live migration.
> > > 4. Use CPU to perform zero page checking as the baseline.
> > > 5. Use one DSA device to offload zero page checking to compare with the
> > baseline.
> > > 6. Use "perf sched record" and "perf sched timehist" to analyze CPU
> > usage.
> > >
> > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> > >
> > > CPU usage
> > >
> > > |---------------|---------------|---------------|---------------|
> > > | |comm |runtime(msec) |totaltime(msec)|
> > > |---------------|---------------|---------------|---------------|
> > > |Baseline |live_migration |5657.58 | |
> > > | |multifdsend_0 |3931.563 | |
> > > | |multifdsend_1 |4405.273 | |
> > > | |multifdsend_2 |3941.968 | |
> > > | |multifdsend_3 |5032.975 | |
> > > | |multifdsend_4 |4533.865 | |
> > > | |multifdsend_5 |4530.461 | |
> > > | |multifdsend_6 |5171.916 | |
> > > | |multifdsend_7 |4722.769 |41922 |
> > > |---------------|---------------|---------------|---------------|
> > > |DSA |live_migration |6129.168 | |
> > > | |multifdsend_0 |2954.717 | |
> > > | |multifdsend_1 |2766.359 | |
> > > | |multifdsend_2 |2853.519 | |
> > > | |multifdsend_3 |2740.717 | |
> > > | |multifdsend_4 |2824.169 | |
> > > | |multifdsend_5 |2966.908 | |
> > > | |multifdsend_6 |2611.137 | |
> > > | |multifdsend_7 |3114.732 | |
> > > | |dsa_completion |3612.564 |32568 |
> > > |---------------|---------------|---------------|---------------|
> > >
> > > Baseline total runtime is calculated by adding up all multifdsend_X
> > > and live_migration threads runtime. DSA offloading total runtime is
> > > calculated by adding up all multifdsend_X, live_migration and
> > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > > that is 23% total CPU usage savings.
> >
> >
> > Here the DSA was mostly idle.
> >
> > Sounds good but a question: what if several qemu instances are
> > migrated in parallel?
> >
> > Some accelerators tend to basically stall if several tasks
> > are trying to use them at the same time.
> >
> > Where is the boundary here?
>
> A DSA device can be assigned to multiple Qemu instances.
> The DSA resource used by each process is called a work queue, each DSA
> device can support up to 8 work queues and work queues are classified into
> dedicated queues and shared queues.
>
> A dedicated queue can only serve one process. Theoretically, there is no limit
> on the number of processes in a shared queue, it is based on enqcmd + SVM technology.
>
> https://www.kernel.org/doc/html/v5.17/x86/sva.html
This server has 200 CPUs, which could conceivably migrate around 100 single-CPU
qemu instances with no issue. What happens if you do this with DSA?
> > --
> > MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 12:23 ` Michael S. Tsirkin
@ 2024-07-15 13:09 ` Liu, Yuan1
2024-07-15 14:42 ` Michael S. Tsirkin
0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-15 13:09 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, July 15, 2024 8:24 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Friday, July 12, 2024 6:49 AM
> > > To: Wang, Yichen <yichen.wang@bytedance.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé
> <berrange@redhat.com>;
> > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > > <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-
> Ren
> > > (Jack) Chuang <horenchuang@bytedance.com>
> > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> offload
> > > zero page checking in multifd live migration.
> > >
> > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > * Performance:
> > > >
> > > > We use two Intel 4th generation Xeon servers for testing.
> > > >
> > > > Architecture: x86_64
> > > > CPU(s): 192
> > > > Thread(s) per core: 2
> > > > Core(s) per socket: 48
> > > > Socket(s): 2
> > > > NUMA node(s): 2
> > > > Vendor ID: GenuineIntel
> > > > CPU family: 6
> > > > Model: 143
> > > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > > Stepping: 8
> > > > CPU MHz: 2538.624
> > > > CPU max MHz: 3800.0000
> > > > CPU min MHz: 800.0000
> > > >
> > > > We perform multifd live migration with below setup:
> > > > 1. VM has 100GB memory.
> > > > 2. Use the new migration option multifd-set-normal-page-ratio to
> control
> > > the total
> > > > size of the payload sent over the network.
> > > > 3. Use 8 multifd channels.
> > > > 4. Use tcp for live migration.
> > > > 4. Use CPU to perform zero page checking as the baseline.
> > > > 5. Use one DSA device to offload zero page checking to compare with
> the
> > > baseline.
> > > > 6. Use "perf sched record" and "perf sched timehist" to analyze CPU
> > > usage.
> > > >
> > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> > > >
> > > > CPU usage
> > > >
> > > > |---------------|---------------|---------------|-------------
> --|
> > > > | |comm |runtime(msec) |totaltime(msec)|
> > > > |---------------|---------------|---------------|-------------
> --|
> > > > |Baseline |live_migration |5657.58 | |
> > > > | |multifdsend_0 |3931.563 | |
> > > > | |multifdsend_1 |4405.273 | |
> > > > | |multifdsend_2 |3941.968 | |
> > > > | |multifdsend_3 |5032.975 | |
> > > > | |multifdsend_4 |4533.865 | |
> > > > | |multifdsend_5 |4530.461 | |
> > > > | |multifdsend_6 |5171.916 | |
> > > > | |multifdsend_7 |4722.769 |41922 |
> > > > |---------------|---------------|---------------|-------------
> --|
> > > > |DSA |live_migration |6129.168 | |
> > > > | |multifdsend_0 |2954.717 | |
> > > > | |multifdsend_1 |2766.359 | |
> > > > | |multifdsend_2 |2853.519 | |
> > > > | |multifdsend_3 |2740.717 | |
> > > > | |multifdsend_4 |2824.169 | |
> > > > | |multifdsend_5 |2966.908 | |
> > > > | |multifdsend_6 |2611.137 | |
> > > > | |multifdsend_7 |3114.732 | |
> > > > | |dsa_completion |3612.564 |32568 |
> > > > |---------------|---------------|---------------|-------------
> --|
> > > >
> > > > Baseline total runtime is calculated by adding up all multifdsend_X
> > > > and live_migration threads runtime. DSA offloading total runtime is
> > > > calculated by adding up all multifdsend_X, live_migration and
> > > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > > > that is 23% total CPU usage savings.
> > >
> > >
> > > Here the DSA was mostly idle.
> > >
> > > Sounds good but a question: what if several qemu instances are
> > > migrated in parallel?
> > >
> > > Some accelerators tend to basically stall if several tasks
> > > are trying to use them at the same time.
> > >
> > > Where is the boundary here?
> >
> > A DSA device can be assigned to multiple Qemu instances.
> > The DSA resource used by each process is called a work queue, each DSA
> > device can support up to 8 work queues and work queues are classified
> into
> > dedicated queues and shared queues.
> >
> > A dedicated queue can only serve one process. Theoretically, there is no
> limit
> > on the number of processes in a shared queue, it is based on enqcmd +
> SVM technology.
> >
> > https://www.kernel.org/doc/html/v5.17/x86/sva.html
>
> This server has 200 CPUs which can thinkably migrate around 100 single
> cpu qemu instances with no issue. What happens if you do this with DSA?
First, the DSA work queue needs to be configured in shared mode, and one
queue is enough.

The maximum depth of a DSA hardware work queue is 128, which means that no
more than 128 zero-page detection tasks can be outstanding at once;
otherwise, enqcmd will return an error until the work queue has room again.

For 100 QEMU instances migrated concurrently, I don't have any data on this
yet. I think the 100 zero-page detection tasks can be submitted successfully
to the DSA hardware work queue, but the throughput of DSA's zero-page detection
also needs to be considered. Once the DSA maximum throughput is reached, the
work queue may fill up quickly, and some QEMU instances will be temporarily
unable to submit new tasks to DSA. This is likely to happen in the first round
of migration memory iteration.
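
As an illustration of the submission path (dsa_enqcmd() and the retry budget
below are placeholder assumptions, not the functions in this series), a sender
could retry a bounded number of times while the shared work queue is full and
only then give up:

#include <stdbool.h>
#include <immintrin.h>     /* _mm_pause() */
#include <linux/idxd.h>    /* struct dsa_hw_desc */

#define DSA_SUBMIT_RETRIES 32   /* assumed retry budget */

/* dsa_enqcmd() is an assumed ENQCMD wrapper that returns non-zero when the
 * shared work queue did not accept the descriptor (queue full). */
extern int dsa_enqcmd(volatile void *portal, struct dsa_hw_desc *desc);

static bool dsa_submit_with_retry(volatile void *portal, struct dsa_hw_desc *desc)
{
    for (int i = 0; i < DSA_SUBMIT_RETRIES; i++) {
        if (dsa_enqcmd(portal, desc) == 0) {
            return true;             /* accepted by the work queue */
        }
        _mm_pause();                 /* queue full: back off and retry */
    }
    return false;                    /* caller falls back to CPU checking */
}

If the loop gives up, the page can still be checked on the CPU, so a full
queue slows an instance down rather than failing its migration.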
> > > --
> > > MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 13:09 ` Liu, Yuan1
@ 2024-07-15 14:42 ` Michael S. Tsirkin
2024-07-15 15:23 ` Liu, Yuan1
0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 14:42 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, July 15, 2024 8:24 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > <horenchuang@bytedance.com>
> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> > zero page checking in multifd live migration.
> >
> > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > To: Wang, Yichen <yichen.wang@bytedance.com>
> > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé
> > <berrange@redhat.com>;
> > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > > > <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-
> > Ren
> > > > (Jack) Chuang <horenchuang@bytedance.com>
> > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > offload
> > > > zero page checking in multifd live migration.
> > > >
> > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > * Performance:
> > > > >
> > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > >
> > > > > Architecture: x86_64
> > > > > CPU(s): 192
> > > > > Thread(s) per core: 2
> > > > > Core(s) per socket: 48
> > > > > Socket(s): 2
> > > > > NUMA node(s): 2
> > > > > Vendor ID: GenuineIntel
> > > > > CPU family: 6
> > > > > Model: 143
> > > > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > > > Stepping: 8
> > > > > CPU MHz: 2538.624
> > > > > CPU max MHz: 3800.0000
> > > > > CPU min MHz: 800.0000
> > > > >
> > > > > We perform multifd live migration with below setup:
> > > > > 1. VM has 100GB memory.
> > > > > 2. Use the new migration option multifd-set-normal-page-ratio to
> > control
> > > > the total
> > > > > size of the payload sent over the network.
> > > > > 3. Use 8 multifd channels.
> > > > > 4. Use tcp for live migration.
> > > > > 4. Use CPU to perform zero page checking as the baseline.
> > > > > 5. Use one DSA device to offload zero page checking to compare with
> > the
> > > > baseline.
> > > > > 6. Use "perf sched record" and "perf sched timehist" to analyze CPU
> > > > usage.
> > > > >
> > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> > > > >
> > > > > CPU usage
> > > > >
> > > > > |---------------|---------------|---------------|-------------
> > --|
> > > > > | |comm |runtime(msec) |totaltime(msec)|
> > > > > |---------------|---------------|---------------|-------------
> > --|
> > > > > |Baseline |live_migration |5657.58 | |
> > > > > | |multifdsend_0 |3931.563 | |
> > > > > | |multifdsend_1 |4405.273 | |
> > > > > | |multifdsend_2 |3941.968 | |
> > > > > | |multifdsend_3 |5032.975 | |
> > > > > | |multifdsend_4 |4533.865 | |
> > > > > | |multifdsend_5 |4530.461 | |
> > > > > | |multifdsend_6 |5171.916 | |
> > > > > | |multifdsend_7 |4722.769 |41922 |
> > > > > |---------------|---------------|---------------|-------------
> > --|
> > > > > |DSA |live_migration |6129.168 | |
> > > > > | |multifdsend_0 |2954.717 | |
> > > > > | |multifdsend_1 |2766.359 | |
> > > > > | |multifdsend_2 |2853.519 | |
> > > > > | |multifdsend_3 |2740.717 | |
> > > > > | |multifdsend_4 |2824.169 | |
> > > > > | |multifdsend_5 |2966.908 | |
> > > > > | |multifdsend_6 |2611.137 | |
> > > > > | |multifdsend_7 |3114.732 | |
> > > > > | |dsa_completion |3612.564 |32568 |
> > > > > |---------------|---------------|---------------|-------------
> > --|
> > > > >
> > > > > Baseline total runtime is calculated by adding up all multifdsend_X
> > > > > and live_migration threads runtime. DSA offloading total runtime is
> > > > > calculated by adding up all multifdsend_X, live_migration and
> > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime and
> > > > > that is 23% total CPU usage savings.
> > > >
> > > >
> > > > Here the DSA was mostly idle.
> > > >
> > > > Sounds good but a question: what if several qemu instances are
> > > > migrated in parallel?
> > > >
> > > > Some accelerators tend to basically stall if several tasks
> > > > are trying to use them at the same time.
> > > >
> > > > Where is the boundary here?
> > >
> > > A DSA device can be assigned to multiple Qemu instances.
> > > The DSA resource used by each process is called a work queue, each DSA
> > > device can support up to 8 work queues and work queues are classified
> > into
> > > dedicated queues and shared queues.
> > >
> > > A dedicated queue can only serve one process. Theoretically, there is no
> > limit
> > > on the number of processes in a shared queue, it is based on enqcmd +
> > SVM technology.
> > >
> > > https://www.kernel.org/doc/html/v5.17/x86/sva.html
> >
> > This server has 200 CPUs which can thinkably migrate around 100 single
> > cpu qemu instances with no issue. What happens if you do this with DSA?
>
> First, the DSA work queue needs to be configured in shared mode, and one
> queue is enough.
>
> The maximum depth of the work queue of the DSA hardware is 128, which means
> that the number of zero-page detection tasks submitted cannot exceed 128,
> otherwise, enqcmd will return an error until the work queue is available again
>
> 100 Qemu instances need to be migrated concurrently, I don't have any data on
> this yet, I think the 100 zero-page detection tasks can be successfully submitted
> to the DSA hardware work queue, but the throughput of DSA's zero-page detection also
> needs to be considered. Once the DSA maximum throughput is reached, the work queue
> may be filled up quickly, this will cause some Qemu instances to be temporarily unable
> to submit new tasks to DSA.
The unfortunate reality here would be that there's likely no QoS; this
is purely FIFO, right?
> This is likely to happen in the first round of migration
> memory iteration.
Try testing this and see then?
--
MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system.
2024-07-11 21:52 ` [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
@ 2024-07-15 15:02 ` Liu, Yuan1
2024-09-09 17:55 ` [External] " Yichen Wang
0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-15 15:02 UTC (permalink / raw)
To: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Michael S. Tsirkin, Cornelia Huck, qemu-devel@nongnu.org
Cc: Hao Xiang, Kumar, Shivam, Ho-Ren (Jack) Chuang, Wang, Yichen
> -----Original Message-----
> From: Yichen Wang <yichen.wang@bytedance.com>
> Sent: Friday, July 12, 2024 5:53 AM
> To: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>;
> Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> <armbru@redhat.com>; Michael S. Tsirkin <mst@redhat.com>; Cornelia Huck
> <cohuck@redhat.com>; qemu-devel@nongnu.org
> Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> Subject: [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to
> the build system.
>
> From: Hao Xiang <hao.xiang@linux.dev>
>
> Enable instruction set enqcmd in build.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
> meson.build | 14 ++++++++++++++
> meson_options.txt | 2 ++
> scripts/meson-buildoptions.sh | 3 +++
> 3 files changed, 19 insertions(+)
>
> diff --git a/meson.build b/meson.build
> index 6a93da48e1..af650cfabf 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2893,6 +2893,20 @@ config_host_data.set('CONFIG_AVX512BW_OPT',
> get_option('avx512bw') \
> int main(int argc, char *argv[]) { return bar(argv[0]); }
> '''), error_message: 'AVX512BW not available').allowed())
>
> +config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd') \
> + .require(have_cpuid_h, error_message: 'cpuid.h not available, cannot
> enable ENQCMD') \
> + .require(cc.links('''
> + #include <stdint.h>
> + #include <cpuid.h>
> + #include <immintrin.h>
> + static int __attribute__((target("enqcmd"))) bar(void *a) {
> + uint64_t dst[8] = { 0 };
> + uint64_t src[8] = { 0 };
> + return _enqcmd(dst, src);
> + }
> + int main(int argc, char *argv[]) { return bar(argv[argc - 1]); }
> + '''), error_message: 'ENQCMD not available').allowed())
> +
How about using the cpuid instruction to detect the enqcmd and movdir64b
instructions dynamically?
My reasons are as follows:
1. enqcmd/movdir64b and DSA devices are used together. DSA devices are detected
dynamically, so enqcmd can also be detected dynamically.
Simple code to dynamically detect movdir64b and enqcmd (using
__get_cpuid_count() from GCC's cpuid.h):

#include <stdbool.h>
#include <stdint.h>
#include <cpuid.h>

bool check_dsa_instructions(void)
{
    uint32_t eax, ebx, ecx, edx;
    bool movdir64b_enabled;
    bool enqcmd_enabled;

    /* CPUID leaf 0x07, sub-leaf 0: ECX bit 28 = MOVDIR64B, bit 29 = ENQCMD */
    __get_cpuid_count(0x07, 0x0, &eax, &ebx, &ecx, &edx);
    movdir64b_enabled = (ecx >> 28) & 0x1;
    if (!movdir64b_enabled) {
        return false;
    }
    enqcmd_enabled = (ecx >> 29) & 0x1;
    if (!enqcmd_enabled) {
        return false;
    }
    return true;
}
https://cdrdv2-public.intel.com/819680/architecture-instruction-set-extensions-programming-reference.pdf
2. enqcmd/movdir64b are new instructions; I checked that they are supported
starting with GCC 10. However, users do not need GCC 10 or higher to use the
two instructions.
Simple code to implement enqcmd:
#include <stdint.h>
#include <linux/idxd.h>    /* struct dsa_hw_desc */

/* ENQCMD emitted as raw bytes so older assemblers can build it: submit the
 * 64-byte descriptor 'desc' to the mapped work queue portal 'reg'.  Returns
 * 1 (ZF set) when the queue is full and the submission needs to be retried. */
static inline int enqcmd(volatile void *reg, struct dsa_hw_desc *desc)
{
    uint8_t retry;
    asm volatile (".byte 0xf2, 0x0f, 0x38, 0xf8, 0x02\t\n"
                  "setz %0\t\n"
                  : "=r" (retry) : "a" (reg), "d" (desc));
    return (int)retry;
}
Intel Data Streaming Accelerator User Guide, document 353216
(353216-data-streaming-accelerator-user-guide-002.pdf)
> # For both AArch64 and AArch32, detect if builtins are available.
> config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
> #include <arm_neon.h>
> diff --git a/meson_options.txt b/meson_options.txt
> index 0269fa0f16..4ed820bb8d 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -121,6 +121,8 @@ option('avx2', type: 'feature', value: 'auto',
> description: 'AVX2 optimizations')
> option('avx512bw', type: 'feature', value: 'auto',
> description: 'AVX512BW optimizations')
> +option('enqcmd', type: 'feature', value: 'disabled',
> + description: 'ENQCMD optimizations')
> option('keyring', type: 'feature', value: 'auto',
> description: 'Linux keyring support')
> option('libkeyutils', type: 'feature', value: 'auto',
> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> index cfadb5ea86..280e117687 100644
> --- a/scripts/meson-buildoptions.sh
> +++ b/scripts/meson-buildoptions.sh
> @@ -95,6 +95,7 @@ meson_options_help() {
> printf "%s\n" ' auth-pam PAM access control'
> printf "%s\n" ' avx2 AVX2 optimizations'
> printf "%s\n" ' avx512bw AVX512BW optimizations'
> + printf "%s\n" ' enqcmd ENQCMD optimizations'
> printf "%s\n" ' blkio libblkio block device driver'
> printf "%s\n" ' bochs bochs image format support'
> printf "%s\n" ' bpf eBPF support'
> @@ -239,6 +240,8 @@ _meson_option_parse() {
> --disable-avx2) printf "%s" -Davx2=disabled ;;
> --enable-avx512bw) printf "%s" -Davx512bw=enabled ;;
> --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
> + --enable-enqcmd) printf "%s" -Denqcmd=enabled ;;
> + --disable-enqcmd) printf "%s" -Denqcmd=disabled ;;
> --enable-gcov) printf "%s" -Db_coverage=true ;;
> --disable-gcov) printf "%s" -Db_coverage=false ;;
> --enable-lto) printf "%s" -Db_lto=true ;;
> --
> Yichen Wang
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 14:42 ` Michael S. Tsirkin
@ 2024-07-15 15:23 ` Liu, Yuan1
2024-07-15 15:57 ` Liu, Yuan1
2024-07-15 16:08 ` Michael S. Tsirkin
0 siblings, 2 replies; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-15 15:23 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, July 15, 2024 10:43 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, July 15, 2024 8:24 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > > <pbonzini@redhat.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>;
> > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth
> <thuth@redhat.com>;
> > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> <peterx@redhat.com>;
> > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>;
> Markus
> > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> qemu-
> > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > > <horenchuang@bytedance.com>
> > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> offload
> > > zero page checking in multifd live migration.
> > >
> > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > > To: Wang, Yichen <yichen.wang@bytedance.com>
> > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > > > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé
> > > <berrange@redhat.com>;
> > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster
> > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > > > > <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>;
> Ho-
> > > Ren
> > > > > (Jack) Chuang <horenchuang@bytedance.com>
> > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > > offload
> > > > > zero page checking in multifd live migration.
> > > > >
> > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > > * Performance:
> > > > > >
> > > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > > >
> > > > > > Architecture: x86_64
> > > > > > CPU(s): 192
> > > > > > Thread(s) per core: 2
> > > > > > Core(s) per socket: 48
> > > > > > Socket(s): 2
> > > > > > NUMA node(s): 2
> > > > > > Vendor ID: GenuineIntel
> > > > > > CPU family: 6
> > > > > > Model: 143
> > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > > > > Stepping: 8
> > > > > > CPU MHz: 2538.624
> > > > > > CPU max MHz: 3800.0000
> > > > > > CPU min MHz: 800.0000
> > > > > >
> > > > > > We perform multifd live migration with below setup:
> > > > > > 1. VM has 100GB memory.
> > > > > > 2. Use the new migration option multifd-set-normal-page-ratio to
> > > control
> > > > > the total
> > > > > > size of the payload sent over the network.
> > > > > > 3. Use 8 multifd channels.
> > > > > > 4. Use tcp for live migration.
> > > > > > 4. Use CPU to perform zero page checking as the baseline.
> > > > > > 5. Use one DSA device to offload zero page checking to compare
> with
> > > the
> > > > > baseline.
> > > > > > 6. Use "perf sched record" and "perf sched timehist" to analyze
> CPU
> > > > > usage.
> > > > > >
> > > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> > > > > >
> > > > > > CPU usage
> > > > > >
> > > > > > |---------------|---------------|---------------|-------------
> > > --|
> > > > > > | |comm |runtime(msec) |totaltime(msec)|
> > > > > > |---------------|---------------|---------------|-------------
> > > --|
> > > > > > |Baseline |live_migration |5657.58 | |
> > > > > > | |multifdsend_0 |3931.563 | |
> > > > > > | |multifdsend_1 |4405.273 | |
> > > > > > | |multifdsend_2 |3941.968 | |
> > > > > > | |multifdsend_3 |5032.975 | |
> > > > > > | |multifdsend_4 |4533.865 | |
> > > > > > | |multifdsend_5 |4530.461 | |
> > > > > > | |multifdsend_6 |5171.916 | |
> > > > > > | |multifdsend_7 |4722.769 |41922 |
> > > > > > |---------------|---------------|---------------|-------------
> > > --|
> > > > > > |DSA |live_migration |6129.168 | |
> > > > > > | |multifdsend_0 |2954.717 | |
> > > > > > | |multifdsend_1 |2766.359 | |
> > > > > > | |multifdsend_2 |2853.519 | |
> > > > > > | |multifdsend_3 |2740.717 | |
> > > > > > | |multifdsend_4 |2824.169 | |
> > > > > > | |multifdsend_5 |2966.908 | |
> > > > > > | |multifdsend_6 |2611.137 | |
> > > > > > | |multifdsend_7 |3114.732 | |
> > > > > > | |dsa_completion |3612.564 |32568 |
> > > > > > |---------------|---------------|---------------|-------------
> > > --|
> > > > > >
> > > > > > Baseline total runtime is calculated by adding up all
> multifdsend_X
> > > > > > and live_migration threads runtime. DSA offloading total runtime
> is
> > > > > > calculated by adding up all multifdsend_X, live_migration and
> > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime
> and
> > > > > > that is 23% total CPU usage savings.
> > > > >
> > > > >
> > > > > Here the DSA was mostly idle.
> > > > >
> > > > > Sounds good but a question: what if several qemu instances are
> > > > > migrated in parallel?
> > > > >
> > > > > Some accelerators tend to basically stall if several tasks
> > > > > are trying to use them at the same time.
> > > > >
> > > > > Where is the boundary here?
> > > >
> > > > A DSA device can be assigned to multiple Qemu instances.
> > > > The DSA resource used by each process is called a work queue, each
> DSA
> > > > device can support up to 8 work queues and work queues are
> classified
> > > into
> > > > dedicated queues and shared queues.
> > > >
> > > > A dedicated queue can only serve one process. Theoretically, there
> is no
> > > limit
> > > > on the number of processes in a shared queue, it is based on enqcmd
> +
> > > SVM technology.
> > > >
> > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html
> > >
> > > This server has 200 CPUs which can thinkably migrate around 100 single
> > > cpu qemu instances with no issue. What happens if you do this with
> DSA?
> >
> > First, the DSA work queue needs to be configured in shared mode, and one
> > queue is enough.
> >
> > The maximum depth of the work queue of the DSA hardware is 128, which
> means
> > that the number of zero-page detection tasks submitted cannot exceed
> 128,
> > otherwise, enqcmd will return an error until the work queue is available
> again
> >
> > 100 Qemu instances need to be migrated concurrently, I don't have any
> data on
> > this yet, I think the 100 zero-page detection tasks can be successfully
> submitted
> > to the DSA hardware work queue, but the throughput of DSA's zero-page
> detection also
> > needs to be considered. Once the DSA maximum throughput is reached, the
> work queue
> > may be filled up quickly, this will cause some Qemu instances to be
> temporarily unable
> > to submit new tasks to DSA.
>
> The unfortunate reality here would be that there's likely no QoS, this
> is purely fifo, right?
Yes, this scenario may be FIFO, assuming the number of pages in each task is
the same. DSA hardware consists of multiple work engines that process tasks
concurrently, usually fetching tasks from the work queue in a round-robin way.

DSA supports priority and flow control at work queue granularity.
https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
> > This is likely to happen in the first round of migration
> > memory iteration.
>
> Try testing this and see then?
Yes, I can test based on this patch set. Please review the test scenario:
My server has 192 CPUs, 8 DSA devices, and a 100Gbps NIC.
All 8 DSA devices serve 100 QEMU instances for simultaneous live migration.
Each VM has 1 vCPU and 1GB of memory, with no workload in the VM.

You want to know whether some QEMU instances are stalled because of DSA, right?
> --
> MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 15:23 ` Liu, Yuan1
@ 2024-07-15 15:57 ` Liu, Yuan1
2024-07-15 16:24 ` Michael S. Tsirkin
2024-07-15 16:08 ` Michael S. Tsirkin
1 sibling, 1 reply; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-15 15:57 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Liu, Yuan1
> Sent: Monday, July 15, 2024 11:23 PM
> To: Michael S. Tsirkin <mst@redhat.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> > -----Original Message-----
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, July 15, 2024 10:43 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth
> <thuth@redhat.com>;
> > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> <peterx@redhat.com>;
> > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > <horenchuang@bytedance.com>
> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> > zero page checking in multifd live migration.
> >
> > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Monday, July 15, 2024 8:24 PM
> > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > > > <pbonzini@redhat.com>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>;
> > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth
> > <thuth@redhat.com>;
> > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> > <peterx@redhat.com>;
> > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>;
> > Markus
> > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> > qemu-
> > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > > > <horenchuang@bytedance.com>
> > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > offload
> > > > zero page checking in multifd live migration.
> > > >
> > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > > > -----Original Message-----
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > > > To: Wang, Yichen <yichen.wang@bytedance.com>
> > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > > > > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé
> > > > <berrange@redhat.com>;
> > > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > Armbruster
> > > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > > > > > <yuan1.liu@intel.com>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>;
> > Ho-
> > > > Ren
> > > > > > (Jack) Chuang <horenchuang@bytedance.com>
> > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > > > offload
> > > > > > zero page checking in multifd live migration.
> > > > > >
> > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > > > * Performance:
> > > > > > >
> > > > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > > > >
> > > > > > > Architecture: x86_64
> > > > > > > CPU(s): 192
> > > > > > > Thread(s) per core: 2
> > > > > > > Core(s) per socket: 48
> > > > > > > Socket(s): 2
> > > > > > > NUMA node(s): 2
> > > > > > > Vendor ID: GenuineIntel
> > > > > > > CPU family: 6
> > > > > > > Model: 143
> > > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > > > > > Stepping: 8
> > > > > > > CPU MHz: 2538.624
> > > > > > > CPU max MHz: 3800.0000
> > > > > > > CPU min MHz: 800.0000
> > > > > > >
> > > > > > > We perform multifd live migration with below setup:
> > > > > > > 1. VM has 100GB memory.
> > > > > > > 2. Use the new migration option multifd-set-normal-page-ratio
> to
> > > > control
> > > > > > the total
> > > > > > > size of the payload sent over the network.
> > > > > > > 3. Use 8 multifd channels.
> > > > > > > 4. Use tcp for live migration.
> > > > > > > 4. Use CPU to perform zero page checking as the baseline.
> > > > > > > 5. Use one DSA device to offload zero page checking to compare
> > with
> > > > the
> > > > > > baseline.
> > > > > > > 6. Use "perf sched record" and "perf sched timehist" to
> analyze
> > CPU
> > > > > > usage.
> > > > > > >
> > > > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> > > > > > >
> > > > > > > CPU usage
> > > > > > >
> > > > > > > |---------------|---------------|---------------|-------
> ------
> > > > --|
> > > > > > > | |comm |runtime(msec)
> |totaltime(msec)|
> > > > > > > |---------------|---------------|---------------|-------
> ------
> > > > --|
> > > > > > > |Baseline |live_migration |5657.58 | |
> > > > > > > | |multifdsend_0 |3931.563 | |
> > > > > > > | |multifdsend_1 |4405.273 | |
> > > > > > > | |multifdsend_2 |3941.968 | |
> > > > > > > | |multifdsend_3 |5032.975 | |
> > > > > > > | |multifdsend_4 |4533.865 | |
> > > > > > > | |multifdsend_5 |4530.461 | |
> > > > > > > | |multifdsend_6 |5171.916 | |
> > > > > > > | |multifdsend_7 |4722.769 |41922
> |
> > > > > > > |---------------|---------------|---------------|-------
> ------
> > > > --|
> > > > > > > |DSA |live_migration |6129.168 | |
> > > > > > > | |multifdsend_0 |2954.717 | |
> > > > > > > | |multifdsend_1 |2766.359 | |
> > > > > > > | |multifdsend_2 |2853.519 | |
> > > > > > > | |multifdsend_3 |2740.717 | |
> > > > > > > | |multifdsend_4 |2824.169 | |
> > > > > > > | |multifdsend_5 |2966.908 | |
> > > > > > > | |multifdsend_6 |2611.137 | |
> > > > > > > | |multifdsend_7 |3114.732 | |
> > > > > > > | |dsa_completion |3612.564 |32568
> |
> > > > > > > |---------------|---------------|---------------|-------
> ------
> > > > --|
> > > > > > >
> > > > > > > Baseline total runtime is calculated by adding up all
> > multifdsend_X
> > > > > > > and live_migration threads runtime. DSA offloading total
> runtime
> > is
> > > > > > > calculated by adding up all multifdsend_X, live_migration and
> > > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec
> runtime
> > and
> > > > > > > that is 23% total CPU usage savings.
> > > > > >
> > > > > >
> > > > > > Here the DSA was mostly idle.
> > > > > >
> > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > migrated in parallel?
> > > > > >
> > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > are trying to use them at the same time.
> > > > > >
> > > > > > Where is the boundary here?
If I understand correctly, you are concerned that in some scenarios the
accelerator itself becomes the migration bottleneck and degrades migration
performance.

My understanding is that we should make full use of the accelerator bandwidth,
and once the accelerator is the bottleneck, fall back to zero-page detection
by the CPU.

For example, when enqcmd returns an error, which means the work queue is full,
we can add some retry mechanism or fall back directly to CPU detection.
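
A rough sketch of that fallback idea (buffer_is_zero() is QEMU's existing CPU
zero check; the dsa_* helpers are assumed names, not the actual functions in
this series):

#include <stdbool.h>
#include <stddef.h>
#include <linux/idxd.h>          /* struct dsa_hw_desc */
#include "qemu/cutils.h"         /* buffer_is_zero() */

/* Assumed helpers: build a DSA zero-check descriptor, submit it with bounded
 * retry, and wait on its completion record. */
extern void dsa_prepare_zero_check_desc(struct dsa_hw_desc *desc,
                                        const void *page, size_t len);
extern bool dsa_submit_with_retry(volatile void *portal, struct dsa_hw_desc *desc);
extern bool dsa_wait_zero_check(struct dsa_hw_desc *desc);

static bool page_is_zero(volatile void *portal, const void *page, size_t len)
{
    struct dsa_hw_desc desc;

    dsa_prepare_zero_check_desc(&desc, page, len);
    if (dsa_submit_with_retry(portal, &desc)) {
        return dsa_wait_zero_check(&desc);
    }
    /* Work queue stayed full: do the zero check on the CPU instead. */
    return buffer_is_zero(page, len);
}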
> > > > > A DSA device can be assigned to multiple Qemu instances.
> > > > > The DSA resource used by each process is called a work queue, each
> > DSA
> > > > > device can support up to 8 work queues and work queues are
> > classified
> > > > into
> > > > > dedicated queues and shared queues.
> > > > >
> > > > > A dedicated queue can only serve one process. Theoretically, there
> > is no
> > > > limit
> > > > > on the number of processes in a shared queue, it is based on
> enqcmd
> > +
> > > > SVM technology.
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html
> > > >
> > > > This server has 200 CPUs which can thinkably migrate around 100
> single
> > > > cpu qemu instances with no issue. What happens if you do this with
> > DSA?
> > >
> > > First, the DSA work queue needs to be configured in shared mode, and
> one
> > > queue is enough.
> > >
> > > The maximum depth of the work queue of the DSA hardware is 128, which
> > means
> > > that the number of zero-page detection tasks submitted cannot exceed
> > 128,
> > > otherwise, enqcmd will return an error until the work queue is
> available
> > again
> > >
> > > 100 Qemu instances need to be migrated concurrently, I don't have any
> > data on
> > > this yet, I think the 100 zero-page detection tasks can be
> successfully
> > submitted
> > > to the DSA hardware work queue, but the throughput of DSA's zero-page
> > detection also
> > > needs to be considered. Once the DSA maximum throughput is reached,
> the
> > work queue
> > > may be filled up quickly, this will cause some Qemu instances to be
> > temporarily unable
> > > to submit new tasks to DSA.
> >
> > The unfortunate reality here would be that there's likely no QoS, this
> > is purely fifo, right?
>
> Yes, this scenario may be fifo, assuming that the number of pages each
> task
> is the same, because DSA hardware consists of multiple work engines, they
> can
> process tasks concurrently, usually in a round-robin way to get tasks from
> the
> work queue.
>
> DSA supports priority and flow control based on work queue granularity.
> https://github.com/intel/idxd-
> config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
>
> > > This is likely to happen in the first round of migration
> > > memory iteration.
> >
> > Try testing this and see then?
>
> Yes, I can test based on this patch set. Please review the test scenario
> My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC.
> All 8 DSA devices serve 100 Qemu instances for simultaneous live
> migration.
> Each VM has 1 vCPU, and 1G memory, with no workload in the VM.
>
> You want to know if some Qemu instances are stalled because of DSA, right?
>
> > --
> > MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 15:23 ` Liu, Yuan1
2024-07-15 15:57 ` Liu, Yuan1
@ 2024-07-15 16:08 ` Michael S. Tsirkin
2024-07-16 1:21 ` Liu, Yuan1
1 sibling, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 16:08 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, July 15, 2024 10:43 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > <horenchuang@bytedance.com>
> > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> > zero page checking in multifd live migration.
> >
> > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > > > -----Original Message-----
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Monday, July 15, 2024 8:24 PM
> > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > > > <pbonzini@redhat.com>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>;
> > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth
> > <thuth@redhat.com>;
> > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> > <peterx@redhat.com>;
> > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>;
> > Markus
> > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> > qemu-
> > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > > > <horenchuang@bytedance.com>
> > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > offload
> > > > zero page checking in multifd live migration.
> > > >
> > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > > > -----Original Message-----
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > > > To: Wang, Yichen <yichen.wang@bytedance.com>
> > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > > > > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé
> > > > <berrange@redhat.com>;
> > > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > Armbruster
> > > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > > > > > <yuan1.liu@intel.com>; Kumar, Shivam <shivam.kumar1@nutanix.com>;
> > Ho-
> > > > Ren
> > > > > > (Jack) Chuang <horenchuang@bytedance.com>
> > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > > > offload
> > > > > > zero page checking in multifd live migration.
> > > > > >
> > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > > > * Performance:
> > > > > > >
> > > > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > > > >
> > > > > > > Architecture: x86_64
> > > > > > > CPU(s): 192
> > > > > > > Thread(s) per core: 2
> > > > > > > Core(s) per socket: 48
> > > > > > > Socket(s): 2
> > > > > > > NUMA node(s): 2
> > > > > > > Vendor ID: GenuineIntel
> > > > > > > CPU family: 6
> > > > > > > Model: 143
> > > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > > > > > Stepping: 8
> > > > > > > CPU MHz: 2538.624
> > > > > > > CPU max MHz: 3800.0000
> > > > > > > CPU min MHz: 800.0000
> > > > > > >
> > > > > > > We perform multifd live migration with below setup:
> > > > > > > 1. VM has 100GB memory.
> > > > > > > 2. Use the new migration option multifd-set-normal-page-ratio to
> > > > control
> > > > > > the total
> > > > > > > size of the payload sent over the network.
> > > > > > > 3. Use 8 multifd channels.
> > > > > > > 4. Use tcp for live migration.
> > > > > > > 4. Use CPU to perform zero page checking as the baseline.
> > > > > > > 5. Use one DSA device to offload zero page checking to compare
> > with
> > > > the
> > > > > > baseline.
> > > > > > > 6. Use "perf sched record" and "perf sched timehist" to analyze
> > CPU
> > > > > > usage.
> > > > > > >
> > > > > > > A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
> > > > > > >
> > > > > > > CPU usage
> > > > > > >
> > > > > > > |---------------|---------------|---------------|-------------
> > > > --|
> > > > > > > | |comm |runtime(msec) |totaltime(msec)|
> > > > > > > |---------------|---------------|---------------|-------------
> > > > --|
> > > > > > > |Baseline |live_migration |5657.58 | |
> > > > > > > | |multifdsend_0 |3931.563 | |
> > > > > > > | |multifdsend_1 |4405.273 | |
> > > > > > > | |multifdsend_2 |3941.968 | |
> > > > > > > | |multifdsend_3 |5032.975 | |
> > > > > > > | |multifdsend_4 |4533.865 | |
> > > > > > > | |multifdsend_5 |4530.461 | |
> > > > > > > | |multifdsend_6 |5171.916 | |
> > > > > > > | |multifdsend_7 |4722.769 |41922 |
> > > > > > > |---------------|---------------|---------------|-------------
> > > > --|
> > > > > > > |DSA |live_migration |6129.168 | |
> > > > > > > | |multifdsend_0 |2954.717 | |
> > > > > > > | |multifdsend_1 |2766.359 | |
> > > > > > > | |multifdsend_2 |2853.519 | |
> > > > > > > | |multifdsend_3 |2740.717 | |
> > > > > > > | |multifdsend_4 |2824.169 | |
> > > > > > > | |multifdsend_5 |2966.908 | |
> > > > > > > | |multifdsend_6 |2611.137 | |
> > > > > > > | |multifdsend_7 |3114.732 | |
> > > > > > > | |dsa_completion |3612.564 |32568 |
> > > > > > > |---------------|---------------|---------------|-------------
> > > > --|
> > > > > > >
> > > > > > > Baseline total runtime is calculated by adding up all
> > multifdsend_X
> > > > > > > and live_migration threads runtime. DSA offloading total runtime
> > is
> > > > > > > calculated by adding up all multifdsend_X, live_migration and
> > > > > > > dsa_completion threads runtime. 41922 msec VS 32568 msec runtime
> > and
> > > > > > > that is 23% total CPU usage savings.
> > > > > >
> > > > > >
> > > > > > Here the DSA was mostly idle.
> > > > > >
> > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > migrated in parallel?
> > > > > >
> > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > are trying to use them at the same time.
> > > > > >
> > > > > > Where is the boundary here?
> > > > >
> > > > > A DSA device can be assigned to multiple Qemu instances.
> > > > > The DSA resource used by each process is called a work queue, each
> > DSA
> > > > > device can support up to 8 work queues and work queues are
> > classified
> > > > into
> > > > > dedicated queues and shared queues.
> > > > >
> > > > > A dedicated queue can only serve one process. Theoretically, there
> > is no
> > > > limit
> > > > > on the number of processes in a shared queue, it is based on enqcmd
> > +
> > > > SVM technology.
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html
> > > >
> > > > This server has 200 CPUs which can thinkably migrate around 100 single
> > > > cpu qemu instances with no issue. What happens if you do this with
> > DSA?
> > >
> > > First, the DSA work queue needs to be configured in shared mode, and one
> > > queue is enough.
> > >
> > > The maximum depth of the work queue of the DSA hardware is 128, which
> > means
> > > that the number of zero-page detection tasks submitted cannot exceed
> > 128,
> > > otherwise, enqcmd will return an error until the work queue is available
> > again
> > >
> > > 100 Qemu instances need to be migrated concurrently, I don't have any
> > data on
> > > this yet, I think the 100 zero-page detection tasks can be successfully
> > submitted
> > > to the DSA hardware work queue, but the throughput of DSA's zero-page
> > detection also
> > > needs to be considered. Once the DSA maximum throughput is reached, the
> > work queue
> > > may be filled up quickly, this will cause some Qemu instances to be
> > temporarily unable
> > > to submit new tasks to DSA.
> >
> > The unfortunate reality here would be that there's likely no QoS, this
> > is purely fifo, right?
>
> Yes, this scenario may be FIFO, assuming that the number of pages in each task
> is the same, because the DSA hardware consists of multiple work engines that
> can process tasks concurrently, usually fetching tasks from the work queue in
> a round-robin way.
>
> DSA supports priority and flow control based on work queue granularity.
> https://github.com/intel/idxd-config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
Right but it seems clear there aren't enough work queues for a typical setup.
> > > This is likely to happen in the first round of migration
> > > memory iteration.
> >
> > Try testing this and see then?
>
> Yes, I can test based on this patch set. Please review the test scenario
> My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC.
> All 8 DSA devices serve 100 Qemu instances for simultaneous live migration.
> Each VM has 1 vCPU, and 1G memory, with no workload in the VM.
>
> You want to know if some Qemu instances are stalled because of DSA, right?
And generally, just run the same benchmark you did, compared against the CPU
baseline: worst-case and average numbers would be interesting.
> > --
> > MST
^ permalink raw reply [flat|nested] 33+ messages in thread
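For reference, the user-space access model described above -- a work queue configured via the idxd driver, opened as a character device and submitted to with ENQCMD -- looks roughly like the sketch below. This is illustrative only, not code from this series; the device path is an example and in this series would come from the dsa-accel-path option discussed later in the thread.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
/* Open a shared DSA work queue and map its submission portal.
 * "/dev/dsa/wq0.0" is an example node created by the idxd driver. */
static void *dsa_open_wq_portal(const char *wq_path)
{
    int fd = open(wq_path, O_RDWR);
    void *portal;
    if (fd < 0) {
        perror("open dsa wq");
        return NULL;
    }
    /* The portal is a single write-only page; ENQCMD stores land here. */
    portal = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
    if (portal == MAP_FAILED) {
        perror("mmap dsa portal");
        close(fd);
        return NULL;
    }
    /* Real code would keep fd around so it can unmap and close on cleanup. */
    return portal;
}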
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 15:57 ` Liu, Yuan1
@ 2024-07-15 16:24 ` Michael S. Tsirkin
2024-07-16 1:25 ` Liu, Yuan1
0 siblings, 1 reply; 33+ messages in thread
From: Michael S. Tsirkin @ 2024-07-15 16:24 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
On Mon, Jul 15, 2024 at 03:57:42PM +0000, Liu, Yuan1 wrote:
> > > > > > > > that is 23% total CPU usage savings.
> > > > > > >
> > > > > > >
> > > > > > > Here the DSA was mostly idle.
> > > > > > >
> > > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > > migrated in parallel?
> > > > > > >
> > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > are trying to use them at the same time.
> > > > > > >
> > > > > > > Where is the boundary here?
>
> If I understand correctly, you are concerned that in some scenarios the
> accelerator itself is the migration bottleneck, causing the migration performance
> to be degraded.
>
> My understanding is to make full use of the accelerator bandwidth, and once
> the accelerator is the bottleneck, it will fall back to zero-page detection
> by the CPU.
>
> For example, when the enqcmd command returns an error which means the work queue
> is full, then we can add some retry mechanisms or directly use CPU detection.
How is it handled in your patch? If you just abort migration unless
enqcmd succeeds, then would that not be a bug, where loading the system
leads to migration failures?
--
MST
^ permalink raw reply [flat|nested] 33+ messages in thread
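The fallback described in the reply above -- retry the ENQCMD submission a few times and, if the shared work queue stays full, do the zero-page check on the CPU instead of failing the migration -- could be sketched as below. This is not the series' code: the retry count, the portal/descriptor parameters and the function name are illustrative, and buffer_is_zero() is QEMU's existing CPU-side check.
#include "qemu/osdep.h"
#include <immintrin.h>
#include "qemu/cutils.h"   /* buffer_is_zero() */
#define DSA_SUBMIT_RETRIES 3
/* Returns true if the descriptor was accepted by the DSA work queue
 * (the result arrives later via its completion record); returns false
 * and stores a CPU-computed result in *is_zero when the queue stayed
 * full after a few retries. */
static bool __attribute__((target("enqcmd")))
dsa_try_submit(void *portal, const void *desc,
               const void *page, size_t len, bool *is_zero)
{
    int i;
    for (i = 0; i < DSA_SUBMIT_RETRIES; i++) {
        if (_enqcmd(portal, desc) == 0) {   /* 0: the queue accepted it */
            return true;
        }
    }
    *is_zero = buffer_is_zero(page, len);   /* fall back to the CPU */
    return false;
}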
* RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 16:08 ` Michael S. Tsirkin
@ 2024-07-16 1:21 ` Liu, Yuan1
0 siblings, 0 replies; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-16 1:21 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, July 16, 2024 12:09 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Mon, Jul 15, 2024 at 03:23:13PM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, July 15, 2024 10:43 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > > <pbonzini@redhat.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>;
> > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth
> <thuth@redhat.com>;
> > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> <peterx@redhat.com>;
> > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>;
> Markus
> > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> qemu-
> > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > > <horenchuang@bytedance.com>
> > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> offload
> > > zero page checking in multifd live migration.
> > >
> > > On Mon, Jul 15, 2024 at 01:09:59PM +0000, Liu, Yuan1 wrote:
> > > > > -----Original Message-----
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Monday, July 15, 2024 8:24 PM
> > > > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > > > Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> > > > > <pbonzini@redhat.com>; Marc-André Lureau
> > > <marcandre.lureau@redhat.com>;
> > > > > Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth
> > > <thuth@redhat.com>;
> > > > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> > > <peterx@redhat.com>;
> > > > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>;
> > > Markus
> > > > > Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>;
> > > qemu-
> > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> > > > > <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > > > > <horenchuang@bytedance.com>
> > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to
> > > offload
> > > > > zero page checking in multifd live migration.
> > > > >
> > > > > On Mon, Jul 15, 2024 at 08:29:03AM +0000, Liu, Yuan1 wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, July 12, 2024 6:49 AM
> > > > > > > To: Wang, Yichen <yichen.wang@bytedance.com>
> > > > > > > Cc: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > > > > > > <marcandre.lureau@redhat.com>; Daniel P. Berrangé
> > > > > <berrange@redhat.com>;
> > > > > > > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > > > > > > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano
> Rosas
> > > > > > > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > > Armbruster
> > > > > > > <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> > > > > > > devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1
> > > > > > > <yuan1.liu@intel.com>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>;
> > > Ho-
> > > > > Ren
> > > > > > > (Jack) Chuang <horenchuang@bytedance.com>
> > > > > > > Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator
> to
> > > > > offload
> > > > > > > zero page checking in multifd live migration.
> > > > > > >
> > > > > > > On Thu, Jul 11, 2024 at 02:52:35PM -0700, Yichen Wang wrote:
> > > > > > > > * Performance:
> > > > > > > >
> > > > > > > > We use two Intel 4th generation Xeon servers for testing.
> > > > > > > >
> > > > > > > > Architecture: x86_64
> > > > > > > > CPU(s): 192
> > > > > > > > Thread(s) per core: 2
> > > > > > > > Core(s) per socket: 48
> > > > > > > > Socket(s): 2
> > > > > > > > NUMA node(s): 2
> > > > > > > > Vendor ID: GenuineIntel
> > > > > > > > CPU family: 6
> > > > > > > > Model: 143
> > > > > > > > Model name: Intel(R) Xeon(R) Platinum 8457C
> > > > > > > > Stepping: 8
> > > > > > > > CPU MHz: 2538.624
> > > > > > > > CPU max MHz: 3800.0000
> > > > > > > > CPU min MHz: 800.0000
> > > > > > > >
> > > > > > > > We perform multifd live migration with below setup:
> > > > > > > > 1. VM has 100GB memory.
> > > > > > > > 2. Use the new migration option multifd-set-normal-page-
> ratio to
> > > > > control
> > > > > > > the total
> > > > > > > > size of the payload sent over the network.
> > > > > > > > 3. Use 8 multifd channels.
> > > > > > > > 4. Use tcp for live migration.
> > > > > > > > 4. Use CPU to perform zero page checking as the baseline.
> > > > > > > > 5. Use one DSA device to offload zero page checking to
> compare
> > > with
> > > > > the
> > > > > > > baseline.
> > > > > > > > 6. Use "perf sched record" and "perf sched timehist" to
> analyze
> > > CPU
> > > > > > > usage.
> > > > > > > >
> > > > > > > > A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.
> > > > > > > >
> > > > > > > > CPU usage
> > > > > > > >
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > > |               |comm           |runtime(msec)  |totaltime(msec)|
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > > |Baseline       |live_migration |5657.58        |               |
> > > > > > > > |               |multifdsend_0  |3931.563       |               |
> > > > > > > > |               |multifdsend_1  |4405.273       |               |
> > > > > > > > |               |multifdsend_2  |3941.968       |               |
> > > > > > > > |               |multifdsend_3  |5032.975       |               |
> > > > > > > > |               |multifdsend_4  |4533.865       |               |
> > > > > > > > |               |multifdsend_5  |4530.461       |               |
> > > > > > > > |               |multifdsend_6  |5171.916       |               |
> > > > > > > > |               |multifdsend_7  |4722.769       |41922          |
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > > |DSA            |live_migration |6129.168       |               |
> > > > > > > > |               |multifdsend_0  |2954.717       |               |
> > > > > > > > |               |multifdsend_1  |2766.359       |               |
> > > > > > > > |               |multifdsend_2  |2853.519       |               |
> > > > > > > > |               |multifdsend_3  |2740.717       |               |
> > > > > > > > |               |multifdsend_4  |2824.169       |               |
> > > > > > > > |               |multifdsend_5  |2966.908       |               |
> > > > > > > > |               |multifdsend_6  |2611.137       |               |
> > > > > > > > |               |multifdsend_7  |3114.732       |               |
> > > > > > > > |               |dsa_completion |3612.564       |32568          |
> > > > > > > > |---------------|---------------|---------------|---------------|
> > > > > > > >
> > > > > > > > Baseline total runtime is calculated by adding up all multifdsend_X and
> > > > > > > > live_migration threads' runtime. DSA offloading total runtime is
> > > > > > > > calculated by adding up all multifdsend_X, live_migration and
> > > > > > > > dsa_completion threads' runtime. 41922 msec VS 32568 msec runtime and
> > > > > > > > that is 23% total CPU usage savings.
> > > > > > >
> > > > > > >
> > > > > > > Here the DSA was mostly idle.
> > > > > > >
> > > > > > > Sounds good but a question: what if several qemu instances are
> > > > > > > migrated in parallel?
> > > > > > >
> > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > are trying to use them at the same time.
> > > > > > >
> > > > > > > Where is the boundary here?
> > > > > >
> > > > > > A DSA device can be assigned to multiple Qemu instances.
> > > > > > The DSA resource used by each process is called a work queue,
> each
> > > DSA
> > > > > > device can support up to 8 work queues and work queues are
> > > classified
> > > > > into
> > > > > > dedicated queues and shared queues.
> > > > > >
> > > > > > A dedicated queue can only serve one process. Theoretically,
> there
> > > is no
> > > > > limit
> > > > > > on the number of processes in a shared queue, it is based on
> enqcmd
> > > +
> > > > > SVM technology.
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.17/x86/sva.html
> > > > >
> > > > > This server has 200 CPUs which can thinkably migrate around 100
> single
> > > > > cpu qemu instances with no issue. What happens if you do this with
> > > DSA?
> > > >
> > > > First, the DSA work queue needs to be configured in shared mode, and
> one
> > > > queue is enough.
> > > >
> > > > The maximum depth of the work queue of the DSA hardware is 128,
> which
> > > means
> > > > that the number of zero-page detection tasks submitted cannot exceed
> > > 128,
> > > > otherwise, enqcmd will return an error until the work queue is
> available
> > > again
> > > >
> > > > 100 Qemu instances need to be migrated concurrently, I don't have
> any
> > > data on
> > > > this yet, I think the 100 zero-page detection tasks can be
> successfully
> > > submitted
> > > > to the DSA hardware work queue, but the throughput of DSA's zero-
> page
> > > detection also
> > > > needs to be considered. Once the DSA maximum throughput is reached,
> the
> > > work queue
> > > > may be filled up quickly, this will cause some Qemu instances to be
> > > temporarily unable
> > > > to submit new tasks to DSA.
> > >
> > > The unfortunate reality here would be that there's likely no QoS, this
> > > is purely fifo, right?
> >
> > Yes, this scenario may be FIFO, assuming that the number of pages in each
> > task is the same, because the DSA hardware consists of multiple work engines
> > that can process tasks concurrently, usually fetching tasks from the work
> > queue in a round-robin way.
> >
> > DSA supports priority and flow control based on work queue granularity.
> > https://github.com/intel/idxd-
> config/blob/stable/Documentation/accfg/accel-config-config-wq.txt
>
> Right but it seems clear there aren't enough work queues for a typical
> setup.
>
> > > > This is likely to happen in the first round of migration
> > > > memory iteration.
> > >
> > > Try testing this and see then?
> >
> > Yes, I can test based on this patch set. Please review the test scenario
> > My server has 192 CPUs, and 8 DSA devices, 100Gbps NIC.
> > All 8 DSA devices serve 100 Qemu instances for simultaneous live
> migration.
> > Each VM has 1 vCPU, and 1G memory, with no workload in the VM.
> >
> > You want to know if some Qemu instances are stalled because of DSA,
> right?
>
> And generally, just run the same benchmark you did, compared against the CPU
> baseline: worst-case and average numbers would be interesting.
Sure, I will run a test for this.
> > > --
> > > MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* RE: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-15 16:24 ` Michael S. Tsirkin
@ 2024-07-16 1:25 ` Liu, Yuan1
0 siblings, 0 replies; 33+ messages in thread
From: Liu, Yuan1 @ 2024-07-16 1:25 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Wang, Yichen, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Cornelia Huck, qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, July 16, 2024 12:24 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Marc-André Lureau <marcandre.lureau@redhat.com>;
> Daniel P. Berrangé <berrange@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Cornelia Huck <cohuck@redhat.com>; qemu-
> devel@nongnu.org; Hao Xiang <hao.xiang@linux.dev>; Kumar, Shivam
> <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>
> Subject: Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload
> zero page checking in multifd live migration.
>
> On Mon, Jul 15, 2024 at 03:57:42PM +0000, Liu, Yuan1 wrote:
> > > > > > > > > that is 23% total CPU usage savings.
> > > > > > > >
> > > > > > > >
> > > > > > > > Here the DSA was mostly idle.
> > > > > > > >
> > > > > > > > Sounds good but a question: what if several qemu instances
> are
> > > > > > > > migrated in parallel?
> > > > > > > >
> > > > > > > > Some accelerators tend to basically stall if several tasks
> > > > > > > > are trying to use them at the same time.
> > > > > > > >
> > > > > > > > Where is the boundary here?
> >
> > If I understand correctly, you are concerned that in some scenarios the
> > accelerator itself is the migration bottleneck, causing the migration
> performance
> > to be degraded.
> >
> > My understanding is to make full use of the accelerator bandwidth, and
> once
> > the accelerator is the bottleneck, it will fall back to zero-page
> detection
> > by the CPU.
> >
> > For example, when the enqcmd command returns an error which means the
> work queue
> > is full, then we can add some retry mechanisms or directly use CPU
> detection.
>
>
> How is it handled in your patch? If you just abort migration unless
> enqcmd succeeds, then would that not be a bug, where loading the system
> leads to migration failures?
Sorry about that; I have just started reviewing this patch. What we discussed
before relates only to the DSA device itself and may not reflect this patch's
implementation. I will look carefully into the issue you mentioned. Thank you
for the reminder.
> --
> MST
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration.
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
` (10 preceding siblings ...)
2024-07-12 10:58 ` Michael S. Tsirkin
@ 2024-07-16 21:47 ` Fabiano Rosas
11 siblings, 0 replies; 33+ messages in thread
From: Fabiano Rosas @ 2024-07-16 21:47 UTC (permalink / raw)
To: Yichen Wang, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Markus Armbruster, Michael S. Tsirkin,
Cornelia Huck, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
Yichen Wang <yichen.wang@bytedance.com> writes:
> v5
> * Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
> * Rename struct definitions with typedef and CamelCase names;
> * Add build and runtime checks about DSA accelerator;
> * Address all comments from v4 reviews about typos, licenses, comments,
> error reporting, etc.
Hi,
You forgot to make sure the patches compile without DSA support as
well! =)
Also, please be more explicit about the state of the series; the WIP in the
title is not enough. You can send the whole series as RFC (e.g. PATCH RFC v5)
if it's not ready to merge, or put the RFC tag only on the patches you
need help with. But make sure you have some words in the cover letter
stating what is going on.
Another point: I see you have applied some suggestions from the previous
version, but in some cases did so on top of the existing code. Please try to
avoid that and fix it for the next version. That is, don't add code in one
patch just to remove it in the next; apply the changes/suggestions on the
patch that introduces the code, as much as possible.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-11 22:00 ` Yichen Wang
@ 2024-07-17 0:00 ` Fabiano Rosas
2024-07-17 19:43 ` Fabiano Rosas
2024-07-24 14:50 ` Markus Armbruster
0 siblings, 2 replies; 33+ messages in thread
From: Fabiano Rosas @ 2024-07-17 0:00 UTC (permalink / raw)
To: Yichen Wang, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Markus Armbruster, Michael S. Tsirkin,
Cornelia Huck, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang
Yichen Wang <yichen.wang@bytedance.com> writes:
> On Thu, Jul 11, 2024 at 2:53 PM Yichen Wang <yichen.wang@bytedance.com> wrote:
>
>> diff --git a/migration/options.c b/migration/options.c
>> index 645f55003d..f839493016 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -29,6 +29,7 @@
>> #include "ram.h"
>> #include "options.h"
>> #include "sysemu/kvm.h"
>> +#include <cpuid.h>
>>
>> /* Maximum migrate downtime set to 2000 seconds */
>> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
>> @@ -162,6 +163,10 @@ Property migration_properties[] = {
>> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
>> parameters.zero_page_detection,
>> ZERO_PAGE_DETECTION_MULTIFD),
>> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
>> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
This is mostly correct, I think, you just need to create a field in
MigrationState to keep the length (instead of x). However, I found out
just now that this only works with QMP. Let me ask for other's
opinions...
>> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
>> + /* parameters.dsa_accel_path), */
>>
>> /* Migration capabilities */
>> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>
> I changed the dsa-accel-path to be a ['str'], i.e. strList* in C.
> However, I am having a hard time about how to define the proper
> properties here. I don't know what MACRO to use and I can't find good
> examples... Need some guidance about how to proceed. Basically I will
> need this to pass something like '-global
> migration.dsa-accel-path="/dev/dsa/wq0.0"' in cmdline, or
> "migrate_set_parameter dsa-accel-path" in QEMU CLI. Don't know how to
> pass strList there.
>
> Thanks very much!
@Daniel, @Markus, any idea here?
If I'm reading this commit[1] right, it seems we decided to disallow
passing of arrays without JSON, which affects -global on the
command-line and HMP.
1- b06f8b500d (qdev: Rework array properties based on list visitor,
2023-11-09)
QMP shell:
(QEMU) migrate-set-parameters dsa-accel-path=['a','b']
{"return": {}}
HMP:
(qemu) migrate_set_parameter dsa-accel-path "['a','b']"
qemu-system-x86_64: ../qapi/string-input-visitor.c:343: parse_type_str:
Assertion `siv->lm == LM_NONE' failed.
Any recommendation? I believe all migration parameters so far can be set
via those means, I don't think we can allow only this one to be
QMP-only.
Or am I just missing something?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-11 21:52 ` [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
2024-07-11 22:00 ` Yichen Wang
@ 2024-07-17 13:30 ` Fabiano Rosas
1 sibling, 0 replies; 33+ messages in thread
From: Fabiano Rosas @ 2024-07-17 13:30 UTC (permalink / raw)
To: Yichen Wang, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Markus Armbruster, Michael S. Tsirkin,
Cornelia Huck, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
Yichen Wang <yichen.wang@bytedance.com> writes:
> From: Hao Xiang <hao.xiang@linux.dev>
>
> Intel DSA offloading is an optional feature that turns on if
> proper hardware and software stack is available. To turn on
> DSA offloading in multifd live migration:
>
> dsa-accel-path="[dsa_dev_path1] [dsa_dev_path2] ... [dsa_dev_pathX]"
>
> This feature is turned off by default.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
> migration/migration-hmp-cmds.c | 15 ++++++++++-
> migration/options.c | 47 ++++++++++++++++++++++++++++++++++
> migration/options.h | 1 +
> qapi/migration.json | 32 ++++++++++++++++++++---
> 4 files changed, 90 insertions(+), 5 deletions(-)
>
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 7d608d26e1..c422db4ecd 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -312,7 +312,16 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
> monitor_printf(mon, "%s: '%s'\n",
> MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
> params->tls_authz);
> -
> + if (params->has_dsa_accel_path) {
> + strList *dsa_accel_path = params->dsa_accel_path;
> + monitor_printf(mon, "%s:",
> + MigrationParameter_str(MIGRATION_PARAMETER_DSA_ACCEL_PATH));
> + while (dsa_accel_path) {
> + monitor_printf(mon, " %s", dsa_accel_path->value);
> + dsa_accel_path = dsa_accel_path->next;
> + }
> + monitor_printf(mon, "\n");
> + }
> if (params->has_block_bitmap_mapping) {
> const BitmapMigrationNodeAliasList *bmnal;
>
> @@ -563,6 +572,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
> p->has_x_checkpoint_delay = true;
> visit_type_uint32(v, param, &p->x_checkpoint_delay, &err);
> break;
> + case MIGRATION_PARAMETER_DSA_ACCEL_PATH:
> + p->has_dsa_accel_path = true;
> + visit_type_strList(v, param, &p->dsa_accel_path, &err);
> + break;
> case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
> p->has_multifd_channels = true;
> visit_type_uint8(v, param, &p->multifd_channels, &err);
> diff --git a/migration/options.c b/migration/options.c
> index 645f55003d..f839493016 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -29,6 +29,7 @@
> #include "ram.h"
> #include "options.h"
> #include "sysemu/kvm.h"
> +#include <cpuid.h>
>
> /* Maximum migrate downtime set to 2000 seconds */
> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
> @@ -162,6 +163,10 @@ Property migration_properties[] = {
> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
> parameters.zero_page_detection,
> ZERO_PAGE_DETECTION_MULTIFD),
> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
> + /* parameters.dsa_accel_path), */
>
> /* Migration capabilities */
> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> @@ -815,6 +820,13 @@ const char *migrate_tls_creds(void)
> return s->parameters.tls_creds;
> }
>
> +const strList *migrate_dsa_accel_path(void)
> +{
> + MigrationState *s = migrate_get_current();
> +
> + return s->parameters.dsa_accel_path;
> +}
> +
> const char *migrate_tls_hostname(void)
> {
> MigrationState *s = migrate_get_current();
> @@ -926,6 +938,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
> params->zero_page_detection = s->parameters.zero_page_detection;
> params->has_direct_io = true;
> params->direct_io = s->parameters.direct_io;
> + params->dsa_accel_path = QAPI_CLONE(strList, s->parameters.dsa_accel_path);
>
> return params;
> }
> @@ -934,6 +947,7 @@ void migrate_params_init(MigrationParameters *params)
> {
> params->tls_hostname = g_strdup("");
> params->tls_creds = g_strdup("");
> + params->dsa_accel_path = NULL;
>
> /* Set has_* up only for parameter checks */
> params->has_throttle_trigger_threshold = true;
> @@ -1137,6 +1151,22 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
> return false;
> }
>
> + if (params->has_zero_page_detection &&
> + params->zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
> +#ifdef CONFIG_DSA_OPT
> + unsigned int eax, ebx, ecx, edx;
> + /* ENQCMD is indicated by bit 29 of ecx in CPUID leaf 7, subleaf 0. */
> + if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) ||
> + !(ecx & (1 << 29))) {
> + error_setg(errp, "DSA acceleration is not supported by CPU");
> + return false;
> + }
This should be a function along with the others in dsa.h; then you
wouldn't need the ifdef here.
> +#else
> + error_setg(errp, "DSA acceleration is not enabled");
> + return false;
> +#endif
> + }
> +
> return true;
> }
>
> @@ -1247,6 +1277,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
> if (params->has_direct_io) {
> dest->direct_io = params->direct_io;
> }
> +
> + if (params->has_dsa_accel_path) {
> + dest->has_dsa_accel_path = true;
> + dest->dsa_accel_path = params->dsa_accel_path;
> + }
> }
>
> static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> @@ -1376,6 +1411,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> if (params->has_direct_io) {
> s->parameters.direct_io = params->direct_io;
> }
> + if (params->has_dsa_accel_path) {
> + qapi_free_strList(s->parameters.dsa_accel_path);
> + s->parameters.has_dsa_accel_path = true;
> + s->parameters.dsa_accel_path =
> + QAPI_CLONE(strList, params->dsa_accel_path);
> + }
> }
>
> void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> @@ -1401,6 +1442,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> params->tls_authz->type = QTYPE_QSTRING;
> params->tls_authz->u.s = strdup("");
> }
> + /* if (params->dsa_accel_path */
> + /* && params->dsa_accel_path->type == QTYPE_QNULL) { */
> + /* qobject_unref(params->dsa_accel_path->u.n); */
> + /* params->dsa_accel_path->type = QTYPE_QLIST; */
> + /* params->dsa_accel_path->u.s = strdup(""); */
> + /* } */
>
> migrate_params_test_apply(params, &tmp);
>
> diff --git a/migration/options.h b/migration/options.h
> index a2397026db..78b9e4080b 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -85,6 +85,7 @@ const char *migrate_tls_creds(void);
> const char *migrate_tls_hostname(void);
> uint64_t migrate_xbzrle_cache_size(void);
> ZeroPageDetection migrate_zero_page_detection(void);
> +const strList *migrate_dsa_accel_path(void);
>
> /* parameters helpers */
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 1234bef888..ff41780347 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -619,10 +619,14 @@
> # multifd migration is enabled, else in the main migration thread
> # as for @legacy.
> #
> +# @dsa-accel: Perform zero page checking with the DSA accelerator
> +# offloading in multifd sender thread if multifd migration is
> +# enabled, else in the main migration thread as for @legacy.
> +#
> # Since: 9.0
> ##
> { 'enum': 'ZeroPageDetection',
> - 'data': [ 'none', 'legacy', 'multifd' ] }
> + 'data': [ 'none', 'legacy', 'multifd', 'dsa-accel' ] }
>
> ##
> # @BitmapMigrationBitmapAliasTransform:
> @@ -825,6 +829,12 @@
> # See description in @ZeroPageDetection. Default is 'multifd'.
> # (since 9.0)
> #
> +# @dsa-accel-path: If enabled, use DSA accelerator offloading for
> +# certain memory operations. Enable DSA accelerator for zero
> +# page detection offloading by setting the @zero-page-detection
> +# to dsa-accel. This parameter defines the dsa device path, and
> +# defaults to an empty list. (since 9.2)
> +#
> # @direct-io: Open migration files with O_DIRECT when possible. This
> # only has effect if the @mapped-ram capability is enabled.
> # (Since 9.1)
> @@ -843,7 +853,7 @@
> 'cpu-throttle-initial', 'cpu-throttle-increment',
> 'cpu-throttle-tailslow',
> 'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> - 'avail-switchover-bandwidth', 'downtime-limit',
> + 'avail-switchover-bandwidth', 'downtime-limit', 'dsa-accel-path',
> { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
> 'multifd-channels',
> 'xbzrle-cache-size', 'max-postcopy-bandwidth',
> @@ -1000,6 +1010,12 @@
> # See description in @ZeroPageDetection. Default is 'multifd'.
> # (since 9.0)
> #
> +# @dsa-accel-path: If enabled, use DSA accelerator offloading for
> +# certain memory operations. Enable DSA accelerator for zero
> +# page detection offloading by setting the @zero-page-detection
> +# to dsa-accel. This parameter defines the dsa device path, and
> +# defaults to an empty list. (since 9.2)
> +#
> # @direct-io: Open migration files with O_DIRECT when possible. This
> # only has effect if the @mapped-ram capability is enabled.
> # (Since 9.1)
> @@ -1044,7 +1060,8 @@
> '*vcpu-dirty-limit': 'uint64',
> '*mode': 'MigMode',
> '*zero-page-detection': 'ZeroPageDetection',
> - '*direct-io': 'bool' } }
> + '*direct-io': 'bool',
> + '*dsa-accel-path': ['str'] } }
>
> ##
> # @migrate-set-parameters:
> @@ -1204,6 +1221,12 @@
> # See description in @ZeroPageDetection. Default is 'multifd'.
> # (since 9.0)
> #
> +# @dsa-accel-path: If enabled, use DSA accelerator offloading for
> +# certain memory operations. Enable DSA accelerator for zero
> +# page detection offloading by setting the @zero-page-detection
> +# to dsa-accel. This parameter defines the dsa device path, and
> +# defaults to an empty list. (since 9.2)
> +#
> # @direct-io: Open migration files with O_DIRECT when possible. This
> # only has effect if the @mapped-ram capability is enabled.
> # (Since 9.1)
> @@ -1245,7 +1268,8 @@
> '*vcpu-dirty-limit': 'uint64',
> '*mode': 'MigMode',
> '*zero-page-detection': 'ZeroPageDetection',
> - '*direct-io': 'bool' } }
> + '*direct-io': 'bool',
> + '*dsa-accel-path': ['str'] } }
>
> ##
> # @query-migrate-parameters:
^ permalink raw reply [flat|nested] 33+ messages in thread
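Regarding the review comment above about moving the CPUID check into dsa.h: a minimal sketch of such a helper, based on the check in the diff, is below. The function name and placement are assumptions (not from this series); with the stub variant, migration/options.c could call it unconditionally, without any #ifdef.
/* In a header such as include/qemu/dsa.h (illustrative). */
#include <stdbool.h>
#ifdef CONFIG_DSA_OPT
#include <cpuid.h>
/* ENQCMD support is CPUID.(EAX=7,ECX=0):ECX bit 29, as in the check above. */
static inline bool dsa_is_supported_by_cpu(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    return ecx & (1U << 29);
}
#else
static inline bool dsa_is_supported_by_cpu(void)
{
    return false;
}
#endif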
* Re: [PATCH v5 09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path.
2024-07-11 21:52 ` [PATCH v5 09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Yichen Wang
@ 2024-07-17 13:39 ` Fabiano Rosas
0 siblings, 0 replies; 33+ messages in thread
From: Fabiano Rosas @ 2024-07-17 13:39 UTC (permalink / raw)
To: Yichen Wang, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Markus Armbruster, Michael S. Tsirkin,
Cornelia Huck, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
Yichen Wang
Yichen Wang <yichen.wang@bytedance.com> writes:
> From: Hao Xiang <hao.xiang@linux.dev>
>
> 1. Refactor multifd_send_thread function.
> 2. Introduce the batch task structure in MultiFDSendParams.
This patch needs to be restructured, or maybe even go away. Most of it has to
go where these structures were introduced for the first time, and any multifd
parts have to go into another patch that touches only multifd code.
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-17 0:00 ` Fabiano Rosas
@ 2024-07-17 19:43 ` Fabiano Rosas
2024-07-24 14:50 ` Markus Armbruster
1 sibling, 0 replies; 33+ messages in thread
From: Fabiano Rosas @ 2024-07-17 19:43 UTC (permalink / raw)
To: Yichen Wang, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Markus Armbruster, Michael S. Tsirkin,
Cornelia Huck, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang
Fabiano Rosas <farosas@suse.de> writes:
> Yichen Wang <yichen.wang@bytedance.com> writes:
>
>> On Thu, Jul 11, 2024 at 2:53 PM Yichen Wang <yichen.wang@bytedance.com> wrote:
>>
>>> diff --git a/migration/options.c b/migration/options.c
>>> index 645f55003d..f839493016 100644
>>> --- a/migration/options.c
>>> +++ b/migration/options.c
>>> @@ -29,6 +29,7 @@
>>> #include "ram.h"
>>> #include "options.h"
>>> #include "sysemu/kvm.h"
>>> +#include <cpuid.h>
>>>
>>> /* Maximum migrate downtime set to 2000 seconds */
>>> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
>>> @@ -162,6 +163,10 @@ Property migration_properties[] = {
>>> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
>>> parameters.zero_page_detection,
>>> ZERO_PAGE_DETECTION_MULTIFD),
>>> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
>>> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
>
> This is mostly correct, I think, you just need to create a field in
> MigrationState to keep the length (instead of x). However, I found out
> just now that this only works with QMP. Let me ask for other's
> opinions...
>
>>> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
>>> + /* parameters.dsa_accel_path), */
>>>
>>> /* Migration capabilities */
>>> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>>
>> I changed the dsa-accel-path to be a ['str'], i.e. strList* in C.
>> However, I am having a hard time about how to define the proper
>> properties here. I don't know what MACRO to use and I can't find good
>> examples... Need some guidance about how to proceed. Basically I will
>> need this to pass something like '-global
>> migration.dsa-accel-path="/dev/dsa/wq0.0"' in cmdline, or
>> "migrate_set_parameter dsa-accel-path" in QEMU CLI. Don't know how to
>> pass strList there.
>>
>> Thanks very much!
>
> @Daniel, @Markus, any idea here?
>
> If I'm reading this commit[1] right, it seems we decided to disallow
> passing of arrays without JSON, which affects -global on the
> command-line and HMP.
>
> 1- b06f8b500d (qdev: Rework array properties based on list visitor,
> 2023-11-09)
>
> QMP shell:
> (QEMU) migrate-set-parameters dsa-accel-path=['a','b']
> {"return": {}}
>
> HMP:
> (qemu) migrate_set_parameter dsa-accel-path "['a','b']"
> qemu-system-x86_64: ../qapi/string-input-visitor.c:343: parse_type_str:
> Assertion `siv->lm == LM_NONE' failed.
>
> Any recommendation? I believe all migration parameters so far can be set
> via those means, I don't think we can allow only this one to be
> QMP-only.
>
> Or am I just missing something?
I guess we could just skip adding the property, like Steve did here:
https://lore.kernel.org/r/1719776434-435013-10-git-send-email-steven.sistare@oracle.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-17 0:00 ` Fabiano Rosas
2024-07-17 19:43 ` Fabiano Rosas
@ 2024-07-24 14:50 ` Markus Armbruster
2024-09-06 22:29 ` [External] " Yichen Wang
1 sibling, 1 reply; 33+ messages in thread
From: Markus Armbruster @ 2024-07-24 14:50 UTC (permalink / raw)
To: Fabiano Rosas
Cc: Yichen Wang, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Michael S. Tsirkin, Cornelia Huck,
qemu-devel, Hao Xiang, Liu, Yuan1, Shivam Kumar,
Ho-Ren (Jack) Chuang
Fabiano Rosas <farosas@suse.de> writes:
> Yichen Wang <yichen.wang@bytedance.com> writes:
>
>> On Thu, Jul 11, 2024 at 2:53 PM Yichen Wang <yichen.wang@bytedance.com> wrote:
>>
>>> diff --git a/migration/options.c b/migration/options.c
>>> index 645f55003d..f839493016 100644
>>> --- a/migration/options.c
>>> +++ b/migration/options.c
>>> @@ -29,6 +29,7 @@
>>> #include "ram.h"
>>> #include "options.h"
>>> #include "sysemu/kvm.h"
>>> +#include <cpuid.h>
>>>
>>> /* Maximum migrate downtime set to 2000 seconds */
>>> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
>>> @@ -162,6 +163,10 @@ Property migration_properties[] = {
>>> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
>>> parameters.zero_page_detection,
>>> ZERO_PAGE_DETECTION_MULTIFD),
>>> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
>>> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
>
> This is mostly correct, I think, you just need to create a field in
> MigrationState to keep the length (instead of x). However, I found out
> just now that this only works with QMP. Let me ask for other's
> opinions...
>
>>> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
>>> + /* parameters.dsa_accel_path), */
>>>
>>> /* Migration capabilities */
>>> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>>
>> I changed the dsa-accel-path to be a ['str'], i.e. strList* in C.
>> However, I am having a hard time about how to define the proper
>> properties here. I don't know what MACRO to use and I can't find good
>> examples... Need some guidance about how to proceed. Basically I will
>> need this to pass something like '-global
>> migration.dsa-accel-path="/dev/dsa/wq0.0"' in cmdline, or
>> "migrate_set_parameter dsa-accel-path" in QEMU CLI. Don't know how to
>> pass strList there.
>>
>> Thanks very much!
>
> @Daniel, @Markus, any idea here?
>
> If I'm reading this commit[1] right, it seems we decided to disallow
> passing of arrays without JSON, which affects -global on the
> command-line and HMP.
>
> 1- b06f8b500d (qdev: Rework array properties based on list visitor,
> 2023-11-09)
>
> QMP shell:
> (QEMU) migrate-set-parameters dsa-accel-path=['a','b']
> {"return": {}}
>
> HMP:
> (qemu) migrate_set_parameter dsa-accel-path "['a','b']"
> qemu-system-x86_64: ../qapi/string-input-visitor.c:343: parse_type_str:
> Assertion `siv->lm == LM_NONE' failed.
HMP migrate_set_parameter doesn't support JSON. It uses the string
input visitor to parse the value, which can only do lists of integers.
The string visitors have been thorns in my side since forever.
> Any recommendation? I believe all migration parameters so far can be set
> via those means, I don't think we can allow only this one to be
> QMP-only.
>
> Or am I just missing something?
I don't think the string input visitor can be compatibly extended to
arbitrary lists.
We could replace HMP migrate_set_parameter by migrate_set_parameters.
The new command parses its single argument into a struct
MigrateSetParameters with keyval_parse(),
qobject_input_visitor_new_keyval(), and
visit_type_MigrateSetParameters().
^ permalink raw reply [flat|nested] 33+ messages in thread
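A rough sketch of that keyval-based replacement, just to make the suggestion above concrete (the "params:S" args_type, the command wiring and the error handling details are assumptions, not taken from an existing patch):
#include "qemu/osdep.h"
#include "qemu/keyval.h"
#include "monitor/hmp.h"
#include "qapi/error.h"
#include "qapi/qmp/qdict.h"
#include "qapi/qobject-input-visitor.h"
#include "qapi/qapi-commands-migration.h"
#include "qapi/qapi-visit-migration.h"
void hmp_migrate_set_parameters(Monitor *mon, const QDict *qdict)
{
    /* e.g. "dsa-accel-path.0=/dev/dsa/wq0.0,dsa-accel-path.1=/dev/dsa/wq0.1" */
    const char *arg = qdict_get_str(qdict, "params");
    Error *err = NULL;
    QDict *kv = keyval_parse(arg, NULL, NULL, &err);
    MigrateSetParameters *p = NULL;
    if (kv) {
        Visitor *v = qobject_input_visitor_new_keyval(QOBJECT(kv));
        if (visit_type_MigrateSetParameters(v, NULL, &p, &err)) {
            qmp_migrate_set_parameters(p, &err);
        }
        qapi_free_MigrateSetParameters(p);
        visit_free(v);
        qobject_unref(kv);
    }
    hmp_handle_error(mon, err);
}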
* Re: [External] Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-07-24 14:50 ` Markus Armbruster
@ 2024-09-06 22:29 ` Yichen Wang
2024-09-16 15:15 ` Fabiano Rosas
0 siblings, 1 reply; 33+ messages in thread
From: Yichen Wang @ 2024-09-06 22:29 UTC (permalink / raw)
To: Markus Armbruster
Cc: Fabiano Rosas, Paolo Bonzini, Marc-André Lureau,
Daniel P. Berrangé, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Eric Blake, Michael S. Tsirkin, Cornelia Huck,
qemu-devel, Hao Xiang, Liu, Yuan1, Shivam Kumar,
Ho-Ren (Jack) Chuang
On Wed, Jul 24, 2024 at 7:50 AM Markus Armbruster <armbru@redhat.com> wrote:
>
> Fabiano Rosas <farosas@suse.de> writes:
>
> > Yichen Wang <yichen.wang@bytedance.com> writes:
> >
> >> On Thu, Jul 11, 2024 at 2:53 PM Yichen Wang <yichen.wang@bytedance.com> wrote:
> >>
> >>> diff --git a/migration/options.c b/migration/options.c
> >>> index 645f55003d..f839493016 100644
> >>> --- a/migration/options.c
> >>> +++ b/migration/options.c
> >>> @@ -29,6 +29,7 @@
> >>> #include "ram.h"
> >>> #include "options.h"
> >>> #include "sysemu/kvm.h"
> >>> +#include <cpuid.h>
> >>>
> >>> /* Maximum migrate downtime set to 2000 seconds */
> >>> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
> >>> @@ -162,6 +163,10 @@ Property migration_properties[] = {
> >>> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
> >>> parameters.zero_page_detection,
> >>> ZERO_PAGE_DETECTION_MULTIFD),
> >>> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
> >>> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
> >
> > This is mostly correct, I think, you just need to create a field in
> > MigrationState to keep the length (instead of x). However, I found out
> > just now that this only works with QMP. Let me ask for other's
> > opinions...
> >
> >>> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
> >>> + /* parameters.dsa_accel_path), */
> >>>
> >>> /* Migration capabilities */
> >>> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> >>
> >> I changed the dsa-accel-path to be a ['str'], i.e. strList* in C.
> >> However, I am having a hard time about how to define the proper
> >> properties here. I don't know what MACRO to use and I can't find good
> >> examples... Need some guidance about how to proceed. Basically I will
> >> need this to pass something like '-global
> >> migration.dsa-accel-path="/dev/dsa/wq0.0"' in cmdline, or
> >> "migrate_set_parameter dsa-accel-path" in QEMU CLI. Don't know how to
> >> pass strList there.
> >>
> >> Thanks very much!
> >
> > @Daniel, @Markus, any idea here?
> >
> > If I'm reading this commit[1] right, it seems we decided to disallow
> > passing of arrays without JSON, which affects -global on the
> > command-line and HMP.
> >
> > 1- b06f8b500d (qdev: Rework array properties based on list visitor,
> > 2023-11-09)
> >
> > QMP shell:
> > (QEMU) migrate-set-parameters dsa-accel-path=['a','b']
> > {"return": {}}
> >
> > HMP:
> > (qemu) migrate_set_parameter dsa-accel-path "['a','b']"
> > qemu-system-x86_64: ../qapi/string-input-visitor.c:343: parse_type_str:
> > Assertion `siv->lm == LM_NONE' failed.
>
> HMP migrate_set_parameter doesn't support JSON. It uses the string
> input visitor to parse the value, which can only do lists of integers.
>
> The string visitors have been thorns in my side since forever.
>
> > Any recommendation? I believe all migration parameters so far can be set
> > via those means, I don't think we can allow only this one to be
> > QMP-only.
> >
> > Or am I just missing something?
>
> I don't think the string input visitor can be compatibly extended to
> arbitrary lists.
>
> We could replace HMP migrate_set_parameter by migrate_set_parameters.
> The new command parses its single argument into a struct
> MigrateSetParameters with keyval_parse(),
> qobject_input_visitor_new_keyval(), and
> visit_type_MigrateSetParameters().
>
I tried Fabiano's suggestion and put a uint32_t in the MigrationState data
structure, but I got exactly the same error: "qemu-system-x86_64.dsa:
../../../qapi/string-input-visitor.c:343: parse_type_str: Assertion
`siv->lm == LM_NONE' failed.". Steve's patch treats it more as a read-only
field from HMP, so probably I can't do that. Markus's suggestion seems too
heavy for this patch; I took a quick glance and it doesn't seem easy to do.
So should we revert to the old "str" format instead of strList? Or how
should I proceed here?
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [External] RE: [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system.
2024-07-15 15:02 ` Liu, Yuan1
@ 2024-09-09 17:55 ` Yichen Wang
0 siblings, 0 replies; 33+ messages in thread
From: Yichen Wang @ 2024-09-09 17:55 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas,
Eric Blake, Markus Armbruster, Michael S. Tsirkin, Cornelia Huck,
qemu-devel@nongnu.org, Hao Xiang, Kumar, Shivam,
Ho-Ren (Jack) Chuang
On Mon, Jul 15, 2024 at 8:02 AM Liu, Yuan1 <yuan1.liu@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yichen Wang <yichen.wang@bytedance.com>
> > Sent: Friday, July 12, 2024 5:53 AM
> > To: Paolo Bonzini <pbonzini@redhat.com>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>;
> > Thomas Huth <thuth@redhat.com>; Philippe Mathieu-Daudé
> > <philmd@linaro.org>; Peter Xu <peterx@redhat.com>; Fabiano Rosas
> > <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus Armbruster
> > <armbru@redhat.com>; Michael S. Tsirkin <mst@redhat.com>; Cornelia Huck
> > <cohuck@redhat.com>; qemu-devel@nongnu.org
> > Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> > Kumar, Shivam <shivam.kumar1@nutanix.com>; Ho-Ren (Jack) Chuang
> > <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> > Subject: [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to
> > the build system.
> >
> > From: Hao Xiang <hao.xiang@linux.dev>
> >
> > Enable instruction set enqcmd in build.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> > Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> > ---
> > meson.build | 14 ++++++++++++++
> > meson_options.txt | 2 ++
> > scripts/meson-buildoptions.sh | 3 +++
> > 3 files changed, 19 insertions(+)
> >
> > diff --git a/meson.build b/meson.build
> > index 6a93da48e1..af650cfabf 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -2893,6 +2893,20 @@ config_host_data.set('CONFIG_AVX512BW_OPT',
> > get_option('avx512bw') \
> > int main(int argc, char *argv[]) { return bar(argv[0]); }
> > '''), error_message: 'AVX512BW not available').allowed())
> >
> > +config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd') \
> > + .require(have_cpuid_h, error_message: 'cpuid.h not available, cannot
> > enable ENQCMD') \
> > + .require(cc.links('''
> > + #include <stdint.h>
> > + #include <cpuid.h>
> > + #include <immintrin.h>
> > + static int __attribute__((target("enqcmd"))) bar(void *a) {
> > + uint64_t dst[8] = { 0 };
> > + uint64_t src[8] = { 0 };
> > + return _enqcmd(dst, src);
> > + }
> > + int main(int argc, char *argv[]) { return bar(argv[argc - 1]); }
> > + '''), error_message: 'ENQCMD not available').allowed())
> > +
>
> How about using the CPUID instruction to dynamically detect the enqcmd and
> movdir64b instructions?
>
> My reasons are as follows:
> 1. enqcmd/movdir64b and DSA devices are used together. DSA devices are
> detected dynamically, so enqcmd can also be detected dynamically.
>
> Simple code to dynamically detect movdir64b and enqcmd:
> bool check_dsa_instructions(void) {
> uint32_t eax, ebx, ecx, edx;
> bool movedirb_enabled;
> bool enqcmd_enabled;
>
> cpuid(0x07, 0x0, &eax, &ebx, &ecx, &edx);
> movedirb_enabled = (ecx >> 28) & 0x1;
> if (!movedirb_enabled) {
> return false;
> }
> enqcmd_enabled = (ecx >> 29) & 0x1;
> if (!enqcmd_enabled) {
> return false;
> }
> return true;
> }
> https://cdrdv2-public.intel.com/819680/architecture-instruction-set-extensions-programming-reference.pdf
>
> 2. The enqcmd/movdir64b instructions are new; I checked and they were added
> in GCC 10. However, users do not need GCC 10 or higher to use the two
> instructions.
> Simple code to implement enqcmd
> static inline int enqcmd(volatile void *reg, struct dsa_hw_desc *desc)
> {
> uint8_t retry;
> asm volatile (".byte 0xf2, 0x0f, 0x38, 0xf8, 0x02\t\n"
> "setz %0\t\n":"=r" (retry):"a"(reg), "d"(desc));
> return (int)retry;
> }
> file:///C:/Users/yliu80/Downloads/353216-data-streaming-accelerator-user-guide-002.pdf
>
This is for compile-time detection. So if I am understanding correctly, we
don't need this dynamic detection at meson build time, right? I actually
already have similar code for dynamic detection at runtime, and I will
refine that part with your suggestion above.
> > # For both AArch64 and AArch32, detect if builtins are available.
> > config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
> > #include <arm_neon.h>
> > diff --git a/meson_options.txt b/meson_options.txt
> > index 0269fa0f16..4ed820bb8d 100644
> > --- a/meson_options.txt
> > +++ b/meson_options.txt
> > @@ -121,6 +121,8 @@ option('avx2', type: 'feature', value: 'auto',
> > description: 'AVX2 optimizations')
> > option('avx512bw', type: 'feature', value: 'auto',
> > description: 'AVX512BW optimizations')
> > +option('enqcmd', type: 'feature', value: 'disabled',
> > + description: 'ENQCMD optimizations')
> > option('keyring', type: 'feature', value: 'auto',
> > description: 'Linux keyring support')
> > option('libkeyutils', type: 'feature', value: 'auto',
> > diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> > index cfadb5ea86..280e117687 100644
> > --- a/scripts/meson-buildoptions.sh
> > +++ b/scripts/meson-buildoptions.sh
> > @@ -95,6 +95,7 @@ meson_options_help() {
> > printf "%s\n" ' auth-pam PAM access control'
> > printf "%s\n" ' avx2 AVX2 optimizations'
> > printf "%s\n" ' avx512bw AVX512BW optimizations'
> > + printf "%s\n" ' enqcmd ENQCMD optimizations'
> > printf "%s\n" ' blkio libblkio block device driver'
> > printf "%s\n" ' bochs bochs image format support'
> > printf "%s\n" ' bpf eBPF support'
> > @@ -239,6 +240,8 @@ _meson_option_parse() {
> > --disable-avx2) printf "%s" -Davx2=disabled ;;
> > --enable-avx512bw) printf "%s" -Davx512bw=enabled ;;
> > --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
> > + --enable-enqcmd) printf "%s" -Denqcmd=enabled ;;
> > + --disable-enqcmd) printf "%s" -Denqcmd=disabled ;;
> > --enable-gcov) printf "%s" -Db_coverage=true ;;
> > --disable-gcov) printf "%s" -Db_coverage=false ;;
> > --enable-lto) printf "%s" -Db_lto=true ;;
> > --
> > Yichen Wang
>
^ permalink raw reply [flat|nested] 33+ messages in thread
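One note on the movdir64b bit checked in the detection code above: dedicated work queues are submitted to with MOVDIR64B rather than ENQCMD. A tiny sketch of that path, assuming a mapped dedicated-WQ portal (illustrative only; unlike ENQCMD there is no queue-full status, so the caller has to track how many descriptors are in flight):
#include <immintrin.h>
/* Write one 64-byte descriptor to a dedicated work queue portal. */
static void __attribute__((target("movdir64b")))
dsa_submit_dedicated(void *portal, const void *desc)
{
    _movdir64b(portal, desc);
}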
* Re: [External] Re: [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading.
2024-09-06 22:29 ` [External] " Yichen Wang
@ 2024-09-16 15:15 ` Fabiano Rosas
0 siblings, 0 replies; 33+ messages in thread
From: Fabiano Rosas @ 2024-09-16 15:15 UTC (permalink / raw)
To: Yichen Wang, Markus Armbruster
Cc: Paolo Bonzini, Marc-André Lureau, Daniel P. Berrangé,
Thomas Huth, Philippe Mathieu-Daudé, Peter Xu, Eric Blake,
Michael S. Tsirkin, Cornelia Huck, qemu-devel, Hao Xiang,
Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang
Yichen Wang <yichen.wang@bytedance.com> writes:
> On Wed, Jul 24, 2024 at 7:50 AM Markus Armbruster <armbru@redhat.com> wrote:
>>
>> Fabiano Rosas <farosas@suse.de> writes:
>>
>> > Yichen Wang <yichen.wang@bytedance.com> writes:
>> >
>> >> On Thu, Jul 11, 2024 at 2:53 PM Yichen Wang <yichen.wang@bytedance.com> wrote:
>> >>
>> >>> diff --git a/migration/options.c b/migration/options.c
>> >>> index 645f55003d..f839493016 100644
>> >>> --- a/migration/options.c
>> >>> +++ b/migration/options.c
>> >>> @@ -29,6 +29,7 @@
>> >>> #include "ram.h"
>> >>> #include "options.h"
>> >>> #include "sysemu/kvm.h"
>> >>> +#include <cpuid.h>
>> >>>
>> >>> /* Maximum migrate downtime set to 2000 seconds */
>> >>> #define MAX_MIGRATE_DOWNTIME_SECONDS 2000
>> >>> @@ -162,6 +163,10 @@ Property migration_properties[] = {
>> >>> DEFINE_PROP_ZERO_PAGE_DETECTION("zero-page-detection", MigrationState,
>> >>> parameters.zero_page_detection,
>> >>> ZERO_PAGE_DETECTION_MULTIFD),
>> >>> + /* DEFINE_PROP_ARRAY("dsa-accel-path", MigrationState, x, */
>> >>> + /* parameters.dsa_accel_path, qdev_prop_string, char *), */
>> >
>> > This is mostly correct, I think, you just need to create a field in
>> > MigrationState to keep the length (instead of x). However, I found out
>> > just now that this only works with QMP. Let me ask for other's
>> > opinions...
>> >
>> >>> + /* DEFINE_PROP_STRING("dsa-accel-path", MigrationState, */
>> >>> + /* parameters.dsa_accel_path), */
>> >>>
>> >>> /* Migration capabilities */
>> >>> DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
>> >>
>> >> I changed the dsa-accel-path to be a ['str'], i.e. strList* in C.
>> >> However, I am having a hard time about how to define the proper
>> >> properties here. I don't know what MACRO to use and I can't find good
>> >> examples... Need some guidance about how to proceed. Basically I will
>> >> need this to pass something like '-global
>> >> migration.dsa-accel-path="/dev/dsa/wq0.0"' in cmdline, or
>> >> "migrate_set_parameter dsa-accel-path" in QEMU CLI. Don't know how to
>> >> pass strList there.
>> >>
>> >> Thanks very much!
>> >
>> > @Daniel, @Markus, any idea here?
>> >
>> > If I'm reading this commit[1] right, it seems we decided to disallow
>> > passing of arrays without JSON, which affects -global on the
>> > command-line and HMP.
>> >
>> > 1- b06f8b500d (qdev: Rework array properties based on list visitor,
>> > 2023-11-09)
>> >
>> > QMP shell:
>> > (QEMU) migrate-set-parameters dsa-accel-path=['a','b']
>> > {"return": {}}
>> >
>> > HMP:
>> > (qemu) migrate_set_parameter dsa-accel-path "['a','b']"
>> > qemu-system-x86_64: ../qapi/string-input-visitor.c:343: parse_type_str:
>> > Assertion `siv->lm == LM_NONE' failed.
>>
>> HMP migrate_set_parameter doesn't support JSON. It uses the string
>> input visitor to parse the value, which can only do lists of integers.
>>
>> The string visitors have been thorns in my side since forever.
>>
>> > Any recommendation? I believe all migration parameters so far can be set
>> > via those means, I don't think we can allow only this one to be
>> > QMP-only.
>> >
>> > Or am I just missing something?
>>
>> I don't think the string input visitor can be compatibly extended to
>> arbitrary lists.
>>
>> We could replace HMP migrate_set_parameter by migrate_set_parameters.
>> The new command parses its single argument into a struct
>> MigrateSetParameters with keyval_parse(),
>> qobject_input_visitor_new_keyval(), and
>> visit_type_MigrateSetParameters().
>>
>
> I tried Fabiano's suggestion and put a uint32_t in the MigrationState data
> structure, but I got exactly the same error: "qemu-system-x86_64.dsa:
> ../../../qapi/string-input-visitor.c:343: parse_type_str: Assertion
> `siv->lm == LM_NONE' failed.". Steve's patch treats it more as a read-only
> field from HMP, so probably I can't do that.
What do you mean by read-only field? I thought his usage was the same as
what we want for dsa-accel-path:
(qemu) migrate_set_parameter cpr-exec-command abc def
(qemu) info migrate_parameters
...
cpr-exec-command: abc def
(gdb) p valuestr
$3 = 0x55555766a8d0 "abc def"
(gdb) p *p->cpr_exec_command
$6 = {next = 0x55555823d300, value = 0x55555765f690 "abc"}
(gdb) p *p->cpr_exec_command.next
$7 = {next = 0x55555805be20, value = 0x555557fefc80 "def"}
^ permalink raw reply [flat|nested] 33+ messages in thread
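If that route is taken for dsa-accel-path as well, the splitting Fabiano shows above could look roughly like this sketch (the helper name is illustrative; QAPI_LIST_PREPEND() and g_strsplit() are existing QEMU/GLib facilities):
#include "qemu/osdep.h"
#include "qapi/util.h"
#include "qapi/qapi-builtin-types.h"
/* Split a space-separated HMP argument such as "/dev/dsa/wq0.0 /dev/dsa/wq0.1"
 * into a QAPI strList. */
static strList *strlist_from_string(const char *str)
{
    g_auto(GStrv) words = g_strsplit(str, " ", -1);
    strList *list = NULL;
    int i;
    /* Prepend in reverse so the result keeps the original order. */
    for (i = (int)g_strv_length(words) - 1; i >= 0; i--) {
        if (words[i][0] != '\0') {
            QAPI_LIST_PREPEND(list, g_strdup(words[i]));
        }
    }
    return list;
}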
end of thread, other threads:[~2024-09-16 15:16 UTC | newest]
Thread overview: 33+ messages
2024-07-11 21:52 [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
2024-07-11 21:52 ` [PATCH v5 01/13] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
2024-07-15 15:02 ` Liu, Yuan1
2024-09-09 17:55 ` [External] " Yichen Wang
2024-07-11 21:52 ` [PATCH v5 02/13] util/dsa: Add idxd into linux header copy list Yichen Wang
2024-07-11 21:52 ` [PATCH v5 03/13] util/dsa: Implement DSA device start and stop logic Yichen Wang
2024-07-11 21:52 ` [PATCH v5 04/13] util/dsa: Implement DSA task enqueue and dequeue Yichen Wang
2024-07-11 21:52 ` [PATCH v5 05/13] util/dsa: Implement DSA task asynchronous completion thread model Yichen Wang
2024-07-11 21:52 ` [PATCH v5 06/13] util/dsa: Implement zero page checking in DSA task Yichen Wang
2024-07-11 21:52 ` [PATCH v5 07/13] util/dsa: Implement DSA task asynchronous submission and wait for completion Yichen Wang
2024-07-11 21:52 ` [PATCH v5 08/13] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
2024-07-11 22:00 ` Yichen Wang
2024-07-17 0:00 ` Fabiano Rosas
2024-07-17 19:43 ` Fabiano Rosas
2024-07-24 14:50 ` Markus Armbruster
2024-09-06 22:29 ` [External] " Yichen Wang
2024-09-16 15:15 ` Fabiano Rosas
2024-07-17 13:30 ` Fabiano Rosas
2024-07-11 21:52 ` [PATCH v5 09/13] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Yichen Wang
2024-07-17 13:39 ` Fabiano Rosas
2024-07-11 22:49 ` [PATCH v5 00/13] WIP: Use Intel DSA accelerator to offload zero page checking in multifd live migration Michael S. Tsirkin
2024-07-15 8:29 ` Liu, Yuan1
2024-07-15 12:23 ` Michael S. Tsirkin
2024-07-15 13:09 ` Liu, Yuan1
2024-07-15 14:42 ` Michael S. Tsirkin
2024-07-15 15:23 ` Liu, Yuan1
2024-07-15 15:57 ` Liu, Yuan1
2024-07-15 16:24 ` Michael S. Tsirkin
2024-07-16 1:25 ` Liu, Yuan1
2024-07-15 16:08 ` Michael S. Tsirkin
2024-07-16 1:21 ` Liu, Yuan1
2024-07-12 10:58 ` Michael S. Tsirkin
2024-07-16 21:47 ` Fabiano Rosas