* [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration.
@ 2024-11-14 22:01 Yichen Wang
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

v7
* Rebase on top of f0a5a31c33a8109061c2493e475c8a2f4d022432;
* Fix a bug that crashed QEMU when DSA initialization failed;
* Use a more generalized accel-path to support other accelerators;
* Remove multifd-packet-size from the parameter list;

v6
* Rebase on top of 838fc0a8769d7cc6edfe50451ba4e3368395f5c1;
* Refactor code to have clean history on all commits;
* Add comments on DSA-specific defines about how the values are picked;
* Address all comments from v5 reviews about API defines, questions, etc.;

v5
* Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
* Rename struct definitions with typedef and CamelCase names;
* Add build and runtime checks about DSA accelerator;
* Address all comments from v4 reviews about typos, licenses, comments,
error reporting, etc.

v4
* Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
* A separate "multifd zero page checking" patchset was split from this
patchset's v3 and got merged into master. v4 re-applies the remaining
commits on top of that patchset, refactored and re-tested.
https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
* There is some feedback from v3 that I likely overlooked.

v3
* Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
* Fix error/warning from checkpatch.pl
* Fix use-after-free bug when multifd-dsa-accel option is not set.
* Handle errors from dsa_init and correctly propagate them.
* Remove unnecessary call to dsa_stop.
* Detect availability of DSA feature at compile time.
* Implement a generic batch_task structure and a DSA-specific one, dsa_batch_task.
* Remove all exit() calls and propagate errors correctly.
* Use bytes instead of page count to configure multifd-packet-size option.

v2
* Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
* Leave Juan's changes in their original form instead of squashing them.
* Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
* Use page count to configure multifd-packet-size option.
* Don't use the FLAKY flag in DSA tests.
* Test if the DSA integration test is set up correctly and skip the test
if not.
* Fixed broken link in the previous patch cover.

* Background:

I posted an RFC about DSA offloading in QEMU:
https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/

This patchset implements DSA offloading of zero page checking in the
multifd live migration code path.

* Overview:

Intel Data Streaming Accelerator (DSA) was introduced with Intel's 4th
generation Xeon server, aka Sapphire Rapids.
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
One of the things DSA can do is offload memory comparison workloads
from the CPU to the DSA accelerator hardware. This patchset implements
a solution to offload QEMU's zero page checking from the CPU to the
DSA accelerator hardware. We gain two benefits from this change:
1. Reduced CPU usage in the multifd live migration workflow across all
use cases.
2. Reduced total migration time in some use cases.

* Design:

These are the logical steps to perform DSA offloading:
1. Configure the DSA accelerators and create DSA work queues that user
space can open, via the idxd driver.
2. Map the DSA work queue into the user space address space.
3. Fill an in-memory task descriptor to describe the memory operation.
4. Use the dedicated CPU instruction _enqcmd to queue the task
descriptor to the work queue.
5. Poll the task descriptor's completion status field until the task
completes.
6. Check the return status.
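
The real implementation lands in util/dsa.c later in this series; as a
minimal, self-contained sketch of steps 3-6 (the function name and the
unbounded submit retry are illustrative only, and the code must be
built for the "enqcmd" target, as the series does via
#pragma GCC target("enqcmd")):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <linux/idxd.h>
    #include <x86intrin.h>

    static bool check_buffer_is_zero_dsa(void *wq, const void *buf,
                                         size_t len)
    {
        struct dsa_completion_record completion
            __attribute__((aligned(32))) = { 0 };
        struct dsa_hw_desc desc = {
            .opcode = DSA_OPCODE_COMPVAL,  /* compare buffer to a pattern */
            .flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV,
            .comp_pattern = 0,             /* pattern 0 == zero page check */
            .src_addr = (uintptr_t)buf,
            .xfer_size = len,
            .completion_addr = (uintptr_t)&completion,
        };

        _mm_sfence();                      /* order the descriptor writes */
        while (_enqcmd(wq, &desc)) {
            /* Step 4: work queue is full, retry the submission. */
        }
        /* Step 5: read as volatile so the poll loop is not optimized
         * away. */
        while (*(volatile uint8_t *)&completion.status == DSA_COMP_NONE) {
            _mm_pause();
        }
        /* Step 6: result == 0 means the buffer matched the all-zero
         * pattern. */
        return completion.status == DSA_COMP_SUCCESS &&
               completion.result == 0;
    }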

The memory operation itself is now done entirely by the accelerator
hardware, but the new workflow introduces overhead: the extra CPU cost
of preparing and submitting the task descriptors, and the extra CPU
cost of polling for their completion. The design is built around
minimizing these two overheads.

1. In order to reduce the overhead of task preparation and submission,
we use batch descriptors. A batch descriptor contains N individual
zero page checking tasks, where the default N is 128 (the default
packet size divided by the page size), and N can be increased by
setting the packet size via a new migration option.
2. The multifd sender threads prepare and submit batch tasks to the
DSA hardware and then wait on a synchronization object for task
completion. Whenever a DSA task is submitted, the task structure is
added to a thread-safe queue, and multiple multifd sender threads can
safely submit tasks concurrently (see the sketch after this list).
3. Multiple DSA hardware devices can be used. During multifd
initialization, every sender thread is assigned a DSA device to work
with. We use a round-robin scheme to evenly distribute the work across
all DSA devices in use.
4. A dedicated thread, dsa_completion, performs busy polling for all
DSA task completions. The thread keeps dequeuing DSA tasks from the
thread-safe queue and blocks when there is no outstanding DSA task.
While polling for the completion of a DSA task, the thread executes
the CPU instruction _mm_pause between iterations of the busy loop to
save some CPU power and to free core resources for the sibling
hyperthread.
5. The DSA accelerator can encounter errors. The most common error is
a page fault. We have tested using the device to handle page faults,
but the performance is poor. Right now, if DSA hits a page fault, we
fall back to the CPU to complete the rest of the work. The CPU
fallback is done in the multifd sender thread.
6. Added a new migration option multifd-dsa-accel to set the DSA
device path. If set, the multifd workflow will leverage the DSA
devices for offloading.
7. Added a new migration option multifd-normal-page-ratio to make
multifd live migration easier to test. Setting a normal page ratio
makes live migration treat some zero pages as normal pages and send
the entire payload over the network, which is useful for sending a
large network payload and analyzing throughput.
8. Added a new migration option multifd-packet-size. It increases the
number of pages that are zero page checked and sent over the network.
The extra synchronization between the sender threads and the DSA
completion thread is an overhead, and using a large packet size can
reduce it.
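
To make items 2-4 concrete, here is a rough sender-side sketch built
from the helpers this series introduces in patches 3-7. The wrapper
function and the per-task device pick are illustrative only; in the
series the device is assigned per sender thread at multifd init time,
and the completion thread signals the task's sem_task_complete through
the completion callback:

    /* Pick a device round-robin (item 3), submit the batch
     * descriptor, hand the task to the completion thread through the
     * thread-safe queue (item 2), then block until it has been polled
     * to completion (item 4). */
    static int sender_submit_and_wait(QemuDsaDeviceGroup *group,
                                      QemuDsaBatchTask *task)
    {
        task->device = dsa_device_group_get_next_device(group);
        if (submit_wi_int(task->device->work_queue,
                          &task->batch_descriptor)) {
            return -1;
        }
        if (dsa_task_enqueue(group, task)) { /* wakes dsa_completion */
            return -1;
        }
        qemu_sem_wait(&task->sem_task_complete);
        return 0;
    }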

* Performance:

We use two Intel 4th generation Xeon servers for testing.

Architecture:        x86_64
CPU(s):              192
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8457C
Stepping:            8
CPU MHz:             2538.624
CPU max MHz:         3800.0000
CPU min MHz:         800.0000

We perform multifd live migration with the following setup:
1. The VM has 100GB of memory.
2. Use the new migration option multifd-normal-page-ratio to control
the total size of the payload sent over the network.
3. Use 8 multifd channels.
4. Use TCP for live migration.
5. Use the CPU to perform zero page checking as the baseline.
6. Use one DSA device to offload zero page checking, to compare with
the baseline.
7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.

A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.

	CPU usage

	|---------------|---------------|---------------|---------------|
	|		|comm		|runtime(msec)	|totaltime(msec)|
	|---------------|---------------|---------------|---------------|
	|Baseline	|live_migration	|5657.58	|		|
	|		|multifdsend_0	|3931.563	|		|
	|		|multifdsend_1	|4405.273	|		|
	|		|multifdsend_2	|3941.968	|		|
	|		|multifdsend_3	|5032.975	|		|
	|		|multifdsend_4	|4533.865	|		|
	|		|multifdsend_5	|4530.461	|		|
	|		|multifdsend_6	|5171.916	|		|
	|		|multifdsend_7	|4722.769	|41922		|
	|---------------|---------------|---------------|---------------|
	|DSA		|live_migration	|6129.168	|		|
	|		|multifdsend_0	|2954.717	|		|
	|		|multifdsend_1	|2766.359	|		|
	|		|multifdsend_2	|2853.519	|		|
	|		|multifdsend_3	|2740.717	|		|
	|		|multifdsend_4	|2824.169	|		|
	|		|multifdsend_5	|2966.908	|		|
	|		|multifdsend_6	|2611.137	|		|
	|		|multifdsend_7	|3114.732	|		|
	|		|dsa_completion	|3612.564	|32568		|
	|---------------|---------------|---------------|---------------|

Baseline total runtime is calculated by adding up the runtime of all
multifdsend_X threads and the live_migration thread. DSA offloading
total runtime is calculated by adding up the runtime of all
multifdsend_X threads, the live_migration thread, and the
dsa_completion thread. 41922 msec vs. 32568 msec runtime: roughly 22%
total CPU usage savings.

	Latency
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|Baseline	|10343 ms	|161 ms		|41007.00 mbps	|51583797 kb	|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|DSA offload	|9535 ms	|135 ms		|46554.40 mbps	|53947545 kb	|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|

Total migration time is 8% lower and down time is 16% lower.

B) Scenario 2: 100% (100GB) zero pages on a 100GB VM.

	CPU usage
	|---------------|---------------|---------------|---------------|
	|		|comm		|runtime(msec)	|totaltime(msec)|
	|---------------|---------------|---------------|---------------|
	|Baseline	|live_migration	|4860.718	|		|
	|	 	|multifdsend_0	|748.875	|		|
	|		|multifdsend_1	|898.498	|		|
	|		|multifdsend_2	|787.456	|		|
	|		|multifdsend_3	|764.537	|		|
	|		|multifdsend_4	|785.687	|		|
	|		|multifdsend_5	|756.941	|		|
	|		|multifdsend_6	|774.084	|		|
	|		|multifdsend_7	|782.900	|11154		|
	|---------------|---------------|---------------|---------------|
	|DSA offloading	|live_migration	|3846.976	|		|
	|		|multifdsend_0	|191.880	|		|
	|		|multifdsend_1	|166.331	|		|
	|		|multifdsend_2	|168.528	|		|
	|		|multifdsend_3	|197.831	|		|
	|		|multifdsend_4	|169.580	|		|
	|		|multifdsend_5	|167.984	|		|
	|		|multifdsend_6	|198.042	|		|
	|		|multifdsend_7	|170.624	|		|
	|		|dsa_completion	|3428.669	|8700		|
	|---------------|---------------|---------------|---------------|

Baseline total runtime is 11154 msec and DSA offloading total runtime is
8700 msec. That is 22% CPU savings.

	Latency
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|Baseline	|4867 ms	|20 ms		|1.51 mbps	|565 kb		|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|DSA offload	|3888 ms	|18 ms		|1.89 mbps	|565 kb		|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|

Total migration time is 20% lower and down time is 10% lower.

* Testing:

1. Added unit tests to cover the added code paths in dsa.c.
2. Added integration tests to cover multifd live migration using DSA
offloading.

Hao Xiang (10):
  meson: Introduce new instruction set enqcmd to the build system.
  util/dsa: Implement DSA device start and stop logic.
  util/dsa: Implement DSA task enqueue and dequeue.
  util/dsa: Implement DSA task asynchronous completion thread model.
  util/dsa: Implement zero page checking in DSA task.
  util/dsa: Implement DSA task asynchronous submission and wait for
    completion.
  migration/multifd: Add new migration option for multifd DSA
    offloading.
  migration/multifd: Enable DSA offloading in multifd sender path.
  util/dsa: Add unit test coverage for Intel DSA task submission and
    completion.
  migration/multifd: Add integration tests for multifd with Intel DSA
    offloading.

Yichen Wang (1):
  util/dsa: Add idxd into linux header copy list.

Yuan Liu (1):
  migration/doc: Add DSA zero page detection doc

 .../migration/dsa-zero-page-detection.rst     |  290 +++++
 docs/devel/migration/features.rst             |    1 +
 hmp-commands.hx                               |    2 +-
 include/qemu/dsa.h                            |  188 +++
 meson.build                                   |   14 +
 meson_options.txt                             |    2 +
 migration/migration-hmp-cmds.c                |   19 +-
 migration/multifd-zero-page.c                 |  129 +-
 migration/multifd.c                           |   29 +-
 migration/multifd.h                           |    5 +
 migration/options.c                           |   30 +
 migration/options.h                           |    1 +
 qapi/migration.json                           |   32 +-
 scripts/meson-buildoptions.sh                 |    3 +
 scripts/update-linux-headers.sh               |    2 +-
 tests/qtest/migration-test.c                  |   80 +-
 tests/unit/meson.build                        |    6 +
 tests/unit/test-dsa.c                         |  503 ++++++++
 util/dsa.c                                    | 1112 +++++++++++++++++
 util/meson.build                              |    3 +
 20 files changed, 2427 insertions(+), 24 deletions(-)
 create mode 100644 docs/devel/migration/dsa-zero-page-detection.rst
 create mode 100644 include/qemu/dsa.h
 create mode 100644 tests/unit/test-dsa.c
 create mode 100644 util/dsa.c

-- 
Yichen Wang




* [PATCH v7 01/12] meson: Introduce new instruction set enqcmd to the build system.
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

From: Hao Xiang <hao.xiang@linux.dev>

Enable the enqcmd instruction set in the build.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 meson.build                   | 14 ++++++++++++++
 meson_options.txt             |  2 ++
 scripts/meson-buildoptions.sh |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/meson.build b/meson.build
index e0b880e4e1..fbcb75d161 100644
--- a/meson.build
+++ b/meson.build
@@ -3062,6 +3062,20 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
     int main(int argc, char *argv[]) { return bar(argv[0]); }
   '''), error_message: 'AVX512BW not available').allowed())
 
+config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd') \
+  .require(have_cpuid_h, error_message: 'cpuid.h not available, cannot enable ENQCMD') \
+  .require(cc.links('''
+    #include <stdint.h>
+    #include <cpuid.h>
+    #include <immintrin.h>
+    static int __attribute__((target("enqcmd"))) bar(void *a) {
+      uint64_t dst[8] = { 0 };
+      uint64_t src[8] = { 0 };
+      return _enqcmd(dst, src);
+    }
+    int main(int argc, char *argv[]) { return bar(argv[argc - 1]); }
+  '''), error_message: 'ENQCMD not available').allowed())
+
 # For both AArch64 and AArch32, detect if builtins are available.
 config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
     #include <arm_neon.h>
diff --git a/meson_options.txt b/meson_options.txt
index 5eeaf3eee5..4386e8b1fc 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -125,6 +125,8 @@ option('avx2', type: 'feature', value: 'auto',
        description: 'AVX2 optimizations')
 option('avx512bw', type: 'feature', value: 'auto',
        description: 'AVX512BW optimizations')
+option('enqcmd', type: 'feature', value: 'disabled',
+       description: 'ENQCMD optimizations')
 option('keyring', type: 'feature', value: 'auto',
        description: 'Linux keyring support')
 option('libkeyutils', type: 'feature', value: 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index a8066aab03..ff6c66db1e 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -99,6 +99,7 @@ meson_options_help() {
   printf "%s\n" '  auth-pam        PAM access control'
   printf "%s\n" '  avx2            AVX2 optimizations'
   printf "%s\n" '  avx512bw        AVX512BW optimizations'
+  printf "%s\n" '  enqcmd          ENQCMD optimizations'
   printf "%s\n" '  blkio           libblkio block device driver'
   printf "%s\n" '  bochs           bochs image format support'
   printf "%s\n" '  bpf             eBPF support'
@@ -246,6 +247,8 @@ _meson_option_parse() {
     --disable-avx2) printf "%s" -Davx2=disabled ;;
     --enable-avx512bw) printf "%s" -Davx512bw=enabled ;;
     --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
+    --enable-enqcmd) printf "%s" -Denqcmd=enabled ;;
+    --disable-enqcmd) printf "%s" -Denqcmd=disabled ;;
     --enable-gcov) printf "%s" -Db_coverage=true ;;
     --disable-gcov) printf "%s" -Db_coverage=false ;;
     --enable-lto) printf "%s" -Db_lto=true ;;
-- 
Yichen Wang




* [PATCH v7 02/12] util/dsa: Add idxd into linux header copy list.
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 99a8d9fa4c..9128c7499b 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -200,7 +200,7 @@ rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
 for header in const.h stddef.h kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
               psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h \
-              vduse.h iommufd.h bits.h; do
+              vduse.h iommufd.h bits.h idxd.h; do
     cp "$hdrdir/include/linux/$header" "$output/linux-headers/linux"
 done
 
-- 
Yichen Wang




* [PATCH v7 03/12] util/dsa: Implement DSA device start and stop logic.
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

From: Hao Xiang <hao.xiang@linux.dev>

* DSA device open and close.
* DSA group contains multiple DSA devices.
* DSA group configure/start/stop/clean.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 include/qemu/dsa.h | 103 +++++++++++++++++
 util/dsa.c         | 280 +++++++++++++++++++++++++++++++++++++++++++++
 util/meson.build   |   3 +
 3 files changed, 386 insertions(+)
 create mode 100644 include/qemu/dsa.h
 create mode 100644 util/dsa.c

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
new file mode 100644
index 0000000000..71686af28f
--- /dev/null
+++ b/include/qemu/dsa.h
@@ -0,0 +1,103 @@
+/*
+ * Interface for using Intel Data Streaming Accelerator to offload certain
+ * background operations.
+ *
+ * Copyright (C) Bytedance Ltd.
+ *
+ * Authors:
+ *  Hao Xiang <hao.xiang@bytedance.com>
+ *  Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef QEMU_DSA_H
+#define QEMU_DSA_H
+
+#include "qapi/error.h"
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+typedef struct {
+    void *work_queue;
+} QemuDsaDevice;
+
+typedef QSIMPLEQ_HEAD(QemuDsaTaskQueue, QemuDsaBatchTask) QemuDsaTaskQueue;
+
+typedef struct {
+    QemuDsaDevice *dsa_devices;
+    int num_dsa_devices;
+    /* The index of the next DSA device to be used. */
+    uint32_t device_allocator_index;
+    bool running;
+    QemuMutex task_queue_lock;
+    QemuCond task_queue_cond;
+    QemuDsaTaskQueue task_queue;
+} QemuDsaDeviceGroup;
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device paths from the migration parameter.
+ *
+ * @return int Zero if successful, otherwise non-zero.
+ */
+int qemu_dsa_init(const strList *dsa_parameter, Error **errp);
+
+/**
+ * @brief Start logic to enable using DSA.
+ */
+void qemu_dsa_start(void);
+
+/**
+ * @brief Stop the device group and the completion thread.
+ */
+void qemu_dsa_stop(void);
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ */
+void qemu_dsa_cleanup(void);
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool qemu_dsa_is_running(void);
+
+#else
+
+static inline bool qemu_dsa_is_running(void)
+{
+    return false;
+}
+
+static inline int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
+{
+    if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
+        error_setg(errp, "DSA is not supported.");
+        return -1;
+    }
+
+    return 0;
+}
+
+static inline void qemu_dsa_start(void) {}
+
+static inline void qemu_dsa_stop(void) {}
+
+static inline void qemu_dsa_cleanup(void) {}
+
+#endif
+
+#endif
diff --git a/util/dsa.c b/util/dsa.c
new file mode 100644
index 0000000000..79dab5d62c
--- /dev/null
+++ b/util/dsa.c
@@ -0,0 +1,280 @@
+/*
+ * Use Intel Data Streaming Accelerator to offload certain background
+ * operations.
+ *
+ * Copyright (C) Bytedance Ltd.
+ *
+ * Authors:
+ *  Hao Xiang <hao.xiang@bytedance.com>
+ *  Bryan Zhang <bryan.zhang@bytedance.com>
+ *  Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/queue.h"
+#include "qemu/memalign.h"
+#include "qemu/lockable.h"
+#include "qemu/cutils.h"
+#include "qemu/dsa.h"
+#include "qemu/bswap.h"
+#include "qemu/error-report.h"
+#include "qemu/rcu.h"
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+#define DSA_WQ_PORTAL_SIZE 4096
+#define MAX_DSA_DEVICES 16
+
+uint32_t max_retry_count;
+static QemuDsaDeviceGroup dsa_group;
+
+
+/**
+ * @brief This function opens a DSA device's work queue and
+ *        maps the DSA device memory into the current process.
+ *
+ * @param dsa_wq_path A pointer to the DSA device work queue's file path.
+ * @return A pointer to the mapped memory, or MAP_FAILED on failure.
+ */
+static void *
+map_dsa_device(const char *dsa_wq_path)
+{
+    void *dsa_device;
+    int fd;
+
+    fd = open(dsa_wq_path, O_RDWR);
+    if (fd < 0) {
+        error_report("Open %s failed with errno = %d.",
+                dsa_wq_path, errno);
+        return MAP_FAILED;
+    }
+    dsa_device = mmap(NULL, DSA_WQ_PORTAL_SIZE, PROT_WRITE,
+                      MAP_SHARED | MAP_POPULATE, fd, 0);
+    close(fd);
+    if (dsa_device == MAP_FAILED) {
+        error_report("mmap failed with errno = %d.", errno);
+        return MAP_FAILED;
+    }
+    return dsa_device;
+}
+
+/**
+ * @brief Initializes a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device.
+ * @param work_queue A pointer to the DSA work queue.
+ */
+static void
+dsa_device_init(QemuDsaDevice *instance,
+                void *dsa_work_queue)
+{
+    instance->work_queue = dsa_work_queue;
+}
+
+/**
+ * @brief Cleans up a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device to cleanup.
+ */
+static void
+dsa_device_cleanup(QemuDsaDevice *instance)
+{
+    if (instance->work_queue != MAP_FAILED) {
+        munmap(instance->work_queue, DSA_WQ_PORTAL_SIZE);
+    }
+}
+
+/**
+ * @brief Initializes a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param dsa_parameter A list of DSA device paths passed in via the
+ * migration parameter.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+dsa_device_group_init(QemuDsaDeviceGroup *group,
+                      const strList *dsa_parameter,
+                      Error **errp)
+{
+    if (dsa_parameter == NULL) {
+        error_setg(errp, "dsa device path is not supplied.");
+        return -1;
+    }
+
+    int ret = 0;
+    const char *dsa_path[MAX_DSA_DEVICES];
+    int num_dsa_devices = 0;
+
+    while (dsa_parameter) {
+        dsa_path[num_dsa_devices++] = dsa_parameter->value;
+        if (num_dsa_devices == MAX_DSA_DEVICES) {
+            break;
+        }
+        dsa_parameter = dsa_parameter->next;
+    }
+
+    group->dsa_devices =
+        g_new0(QemuDsaDevice, num_dsa_devices);
+    group->num_dsa_devices = num_dsa_devices;
+    group->device_allocator_index = 0;
+
+    group->running = false;
+    qemu_mutex_init(&group->task_queue_lock);
+    qemu_cond_init(&group->task_queue_cond);
+    QSIMPLEQ_INIT(&group->task_queue);
+
+    void *dsa_wq = MAP_FAILED;
+    for (int i = 0; i < num_dsa_devices; i++) {
+        dsa_wq = map_dsa_device(dsa_path[i]);
+        if (dsa_wq == MAP_FAILED) {
+            error_setg(errp, "map_dsa_device failed MAP_FAILED.");
+            ret = -1;
+        }
+        dsa_device_init(&group->dsa_devices[i], dsa_wq);
+    }
+
+    return ret;
+}
+
+/**
+ * @brief Starts a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_device_group_start(QemuDsaDeviceGroup *group)
+{
+    group->running = true;
+}
+
+/**
+ * @brief Stops a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+__attribute__((unused))
+static void
+dsa_device_group_stop(QemuDsaDeviceGroup *group)
+{
+    group->running = false;
+}
+
+/**
+ * @brief Cleans up a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
+{
+    if (!group->dsa_devices) {
+        return;
+    }
+    for (int i = 0; i < group->num_dsa_devices; i++) {
+        dsa_device_cleanup(&group->dsa_devices[i]);
+    }
+    g_free(group->dsa_devices);
+    group->dsa_devices = NULL;
+
+    qemu_mutex_destroy(&group->task_queue_lock);
+    qemu_cond_destroy(&group->task_queue_cond);
+}
+
+/**
+ * @brief Returns the next available DSA device in the group.
+ *
+ * @param group A pointer to the DSA device group.
+ *
+ * @return struct QemuDsaDevice* A pointer to the next available DSA device
+ *         in the group.
+ */
+__attribute__((unused))
+static QemuDsaDevice *
+dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
+{
+    if (group->num_dsa_devices == 0) {
+        return NULL;
+    }
+    uint32_t current = qatomic_fetch_inc(&group->device_allocator_index);
+    current %= group->num_dsa_devices;
+    return &group->dsa_devices[current];
+}
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool qemu_dsa_is_running(void)
+{
+    return false;
+}
+
+static void
+dsa_globals_init(void)
+{
+    max_retry_count = UINT32_MAX;
+}
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device paths from the migration parameter.
+ *
+ * @return int Zero if successful, otherwise non-zero.
+ */
+int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
+{
+    dsa_globals_init();
+
+    return dsa_device_group_init(&dsa_group, dsa_parameter, errp);
+}
+
+/**
+ * @brief Start logic to enable using DSA.
+ *
+ */
+void qemu_dsa_start(void)
+{
+    if (dsa_group.num_dsa_devices == 0) {
+        return;
+    }
+    if (dsa_group.running) {
+        return;
+    }
+    dsa_device_group_start(&dsa_group);
+}
+
+/**
+ * @brief Stop the device group and the completion thread.
+ *
+ */
+void qemu_dsa_stop(void)
+{
+    QemuDsaDeviceGroup *group = &dsa_group;
+
+    if (!group->running) {
+        return;
+    }
+}
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ *
+ */
+void qemu_dsa_cleanup(void)
+{
+    qemu_dsa_stop();
+    dsa_device_group_cleanup(&dsa_group);
+}
+
diff --git a/util/meson.build b/util/meson.build
index 5d8bef9891..5ec2158f9e 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -123,6 +123,9 @@ if cpu == 'aarch64'
   util_ss.add(files('cpuinfo-aarch64.c'))
 elif cpu in ['x86', 'x86_64']
   util_ss.add(files('cpuinfo-i386.c'))
+  if config_host_data.get('CONFIG_DSA_OPT')
+    util_ss.add(files('dsa.c'))
+  endif
 elif cpu == 'loongarch64'
   util_ss.add(files('cpuinfo-loongarch.c'))
 elif cpu in ['ppc', 'ppc64']
-- 
Yichen Wang




* [PATCH v7 04/12] util/dsa: Implement DSA task enqueue and dequeue.
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

From: Hao Xiang <hao.xiang@linux.dev>

* Use a thread-safe queue for DSA task enqueue/dequeue.
* Implement DSA task submission.
* Implement DSA batch task submission.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 include/qemu/dsa.h |  29 +++++++
 util/dsa.c         | 202 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 230 insertions(+), 1 deletion(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 71686af28f..04ee8924ab 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -27,6 +27,17 @@
 #include <linux/idxd.h>
 #include "x86intrin.h"
 
+typedef enum QemuDsaTaskType {
+    QEMU_DSA_TASK = 0,
+    QEMU_DSA_BATCH_TASK
+} QemuDsaTaskType;
+
+typedef enum QemuDsaTaskStatus {
+    QEMU_DSA_TASK_READY = 0,
+    QEMU_DSA_TASK_PROCESSING,
+    QEMU_DSA_TASK_COMPLETION
+} QemuDsaTaskStatus;
+
 typedef struct {
     void *work_queue;
 } QemuDsaDevice;
@@ -44,6 +55,24 @@ typedef struct {
     QemuDsaTaskQueue task_queue;
 } QemuDsaDeviceGroup;
 
+typedef void (*qemu_dsa_completion_fn)(void *);
+
+typedef struct QemuDsaBatchTask {
+    struct dsa_hw_desc batch_descriptor;
+    struct dsa_hw_desc *descriptors;
+    struct dsa_completion_record batch_completion __attribute__((aligned(32)));
+    struct dsa_completion_record *completions;
+    QemuDsaDeviceGroup *group;
+    QemuDsaDevice *device;
+    qemu_dsa_completion_fn completion_callback;
+    QemuSemaphore sem_task_complete;
+    QemuDsaTaskType task_type;
+    QemuDsaTaskStatus status;
+    int batch_size;
+    QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
+} QemuDsaBatchTask;
+
+
 /**
  * @brief Initializes DSA devices.
  *
diff --git a/util/dsa.c b/util/dsa.c
index 79dab5d62c..b55fa599f0 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -31,6 +31,7 @@
 #include "x86intrin.h"
 
 #define DSA_WQ_PORTAL_SIZE 4096
+#define DSA_WQ_DEPTH 128
 #define MAX_DSA_DEVICES 16
 
 uint32_t max_retry_count;
@@ -210,6 +211,198 @@ dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
     return &group->dsa_devices[current];
 }
 
+/**
+ * @brief Empties out the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_empty_task_queue(QemuDsaDeviceGroup *group)
+{
+    qemu_mutex_lock(&group->task_queue_lock);
+    QemuDsaTaskQueue *task_queue = &group->task_queue;
+    while (!QSIMPLEQ_EMPTY(task_queue)) {
+        QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
+    }
+    qemu_mutex_unlock(&group->task_queue_lock);
+}
+
+/**
+ * @brief Adds a task to the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param task A pointer to the DSA task to enqueue.
+ *
+ * @return int Zero if successful, otherwise a proper error code.
+ */
+static int
+dsa_task_enqueue(QemuDsaDeviceGroup *group,
+                 QemuDsaBatchTask *task)
+{
+    bool notify = false;
+
+    qemu_mutex_lock(&group->task_queue_lock);
+
+    if (!group->running) {
+        error_report("DSA: Tried to queue task to stopped device queue.");
+        qemu_mutex_unlock(&group->task_queue_lock);
+        return -1;
+    }
+
+    /* The queue is empty. This enqueue operation is a 0->1 transition. */
+    if (QSIMPLEQ_EMPTY(&group->task_queue)) {
+        notify = true;
+    }
+
+    QSIMPLEQ_INSERT_TAIL(&group->task_queue, task, entry);
+
+    /* We need to notify the waiter for 0->1 transitions. */
+    if (notify) {
+        qemu_cond_signal(&group->task_queue_cond);
+    }
+
+    qemu_mutex_unlock(&group->task_queue_lock);
+
+    return 0;
+}
+
+/**
+ * @brief Takes a DSA task out of the task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @return QemuDsaBatchTask* The DSA task being dequeued.
+ */
+__attribute__((unused))
+static QemuDsaBatchTask *
+dsa_task_dequeue(QemuDsaDeviceGroup *group)
+{
+    QemuDsaBatchTask *task = NULL;
+
+    qemu_mutex_lock(&group->task_queue_lock);
+
+    while (true) {
+        if (!group->running) {
+            goto exit;
+        }
+        task = QSIMPLEQ_FIRST(&group->task_queue);
+        if (task != NULL) {
+            break;
+        }
+        qemu_cond_wait(&group->task_queue_cond, &group->task_queue_lock);
+    }
+
+    QSIMPLEQ_REMOVE_HEAD(&group->task_queue, entry);
+
+exit:
+    qemu_mutex_unlock(&group->task_queue_lock);
+    return task;
+}
+
+/**
+ * @brief Submits a DSA work item to the device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
+{
+    uint32_t retry = 0;
+
+    _mm_sfence();
+
+    while (true) {
+        if (_enqcmd(wq, descriptor) == 0) {
+            break;
+        }
+        retry++;
+        if (retry > max_retry_count) {
+            error_report("Submit work retry %u times.", retry);
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+/**
+ * @brief Synchronously submits a DSA work item to the
+ *        device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_wi(void *wq, struct dsa_hw_desc *descriptor)
+{
+    return submit_wi_int(wq, descriptor);
+}
+
+/**
+ * @brief Asynchronously submits a DSA work item to the
+ *        device work queue.
+ *
+ * @param task A pointer to the task.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_wi_async(QemuDsaBatchTask *task)
+{
+    QemuDsaDeviceGroup *device_group = task->group;
+    QemuDsaDevice *device_instance = task->device;
+    int ret;
+
+    assert(task->task_type == QEMU_DSA_TASK);
+
+    task->status = QEMU_DSA_TASK_PROCESSING;
+
+    ret = submit_wi_int(device_instance->work_queue,
+                        &task->descriptors[0]);
+    if (ret != 0) {
+        return ret;
+    }
+
+    return dsa_task_enqueue(device_group, task);
+}
+
+/**
+ * @brief Asynchronously submits a DSA batch work item to the
+ *        device work queue.
+ *
+ * @param batch_task A pointer to the batch task.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_batch_wi_async(QemuDsaBatchTask *batch_task)
+{
+    QemuDsaDeviceGroup *device_group = batch_task->group;
+    QemuDsaDevice *device_instance = batch_task->device;
+    int ret;
+
+    assert(batch_task->task_type == QEMU_DSA_BATCH_TASK);
+    assert(batch_task->batch_descriptor.desc_count <= batch_task->batch_size);
+    assert(batch_task->status == QEMU_DSA_TASK_READY);
+
+    batch_task->status = QEMU_DSA_TASK_PROCESSING;
+
+    ret = submit_wi_int(device_instance->work_queue,
+                        &batch_task->batch_descriptor);
+    if (ret != 0) {
+        return ret;
+    }
+
+    return dsa_task_enqueue(device_group, batch_task);
+}
+
 /**
  * @brief Check if DSA is running.
  *
@@ -223,7 +416,12 @@ bool qemu_dsa_is_running(void)
 static void
 dsa_globals_init(void)
 {
-    max_retry_count = UINT32_MAX;
+    /*
+     * This value follows a reference example from Intel. POLL_RETRY_MAX is
+     * defined as 10000, so here we use the max WQ depth * 100 as the max
+     * polling retry count.
+     */
+    max_retry_count = DSA_WQ_DEPTH * 100;
 }
 
 /**
@@ -266,6 +464,8 @@ void qemu_dsa_stop(void)
     if (!group->running) {
         return;
     }
+
+    dsa_empty_task_queue(group);
 }
 
 /**
-- 
Yichen Wang




* [PATCH v7 05/12] util/dsa: Implement DSA task asynchronous completion thread model.
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

From: Hao Xiang <hao.xiang@linux.dev>

* Create a dedicated thread for DSA task completion.
* The DSA completion thread runs a loop and polls for completed tasks.
* Start and stop the DSA completion thread during DSA device start/stop.

A user space application can submit tasks directly to the Intel DSA
accelerator by writing to the DSA device memory (mapped into user
space). Once a task is submitted, the device starts processing it and
writes the completion status back to the task. A user space
application can poll the task's completion status to check for
completion. This change uses a dedicated thread to perform DSA task
completion checking.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 include/qemu/dsa.h |   1 +
 util/dsa.c         | 274 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 274 insertions(+), 1 deletion(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 04ee8924ab..d24567f0be 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -69,6 +69,7 @@ typedef struct QemuDsaBatchTask {
     QemuDsaTaskType task_type;
     QemuDsaTaskStatus status;
     int batch_size;
+    bool *results;
     QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
 } QemuDsaBatchTask;
 
diff --git a/util/dsa.c b/util/dsa.c
index b55fa599f0..c3ca71df86 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -33,9 +33,20 @@
 #define DSA_WQ_PORTAL_SIZE 4096
 #define DSA_WQ_DEPTH 128
 #define MAX_DSA_DEVICES 16
+#define DSA_COMPLETION_THREAD "qemu_dsa_completion"
+
+typedef struct {
+    bool stopping;
+    bool running;
+    QemuThread thread;
+    int thread_id;
+    QemuSemaphore sem_init_done;
+    QemuDsaDeviceGroup *group;
+} QemuDsaCompletionThread;
 
 uint32_t max_retry_count;
 static QemuDsaDeviceGroup dsa_group;
+static QemuDsaCompletionThread completion_thread;
 
 
 /**
@@ -403,6 +414,265 @@ submit_batch_wi_async(QemuDsaBatchTask *batch_task)
     return dsa_task_enqueue(device_group, batch_task);
 }
 
+/**
+ * @brief Poll for the DSA work item completion.
+ *
+ * @param completion A pointer to the DSA work item completion record.
+ * @param opcode The DSA opcode.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+poll_completion(struct dsa_completion_record *completion,
+                enum dsa_opcode opcode)
+{
+    uint8_t status;
+    uint64_t retry = 0;
+
+    while (true) {
+        /* The DSA operation completes successfully or fails. */
+        status = completion->status;
+        if (status == DSA_COMP_SUCCESS ||
+            status == DSA_COMP_PAGE_FAULT_NOBOF ||
+            status == DSA_COMP_BATCH_PAGE_FAULT ||
+            status == DSA_COMP_BATCH_FAIL) {
+            break;
+        } else if (status != DSA_COMP_NONE) {
+            error_report("DSA opcode %d failed with status = %d.",
+                    opcode, status);
+            return 1;
+        }
+        retry++;
+        if (retry > max_retry_count) {
+            error_report("DSA wait for completion retry %lu times.", retry);
+            return 1;
+        }
+        _mm_pause();
+    }
+
+    return 0;
+}
+
+/**
+ * @brief Complete a single DSA task in the batch task.
+ *
+ * @param task A pointer to the batch task structure.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+static int
+poll_task_completion(QemuDsaBatchTask *task)
+{
+    assert(task->task_type == QEMU_DSA_TASK);
+
+    struct dsa_completion_record *completion = &task->completions[0];
+    uint8_t status;
+    int ret;
+
+    ret = poll_completion(completion, task->descriptors[0].opcode);
+    if (ret != 0) {
+        goto exit;
+    }
+
+    status = completion->status;
+    if (status == DSA_COMP_SUCCESS) {
+        task->results[0] = (completion->result == 0);
+        goto exit;
+    }
+
+    assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+exit:
+    return ret;
+}
+
+/**
+ * @brief Poll a batch task status until it completes. If DSA task doesn't
+ *        complete properly, use CPU to complete the task.
+ *
+ * @param batch_task A pointer to the DSA batch task.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+static int
+poll_batch_task_completion(QemuDsaBatchTask *batch_task)
+{
+    struct dsa_completion_record *batch_completion =
+        &batch_task->batch_completion;
+    struct dsa_completion_record *completion;
+    uint8_t batch_status;
+    uint8_t status;
+    bool *results = batch_task->results;
+    uint32_t count = batch_task->batch_descriptor.desc_count;
+    int ret;
+
+    ret = poll_completion(batch_completion,
+                          batch_task->batch_descriptor.opcode);
+    if (ret != 0) {
+        goto exit;
+    }
+
+    batch_status = batch_completion->status;
+
+    if (batch_status == DSA_COMP_SUCCESS) {
+        if (batch_completion->bytes_completed == count) {
+            /*
+             * Let's skip checking each descriptor's completion status
+             * if the batch descriptor says all succeeded.
+             */
+            for (int i = 0; i < count; i++) {
+                assert(batch_task->completions[i].status == DSA_COMP_SUCCESS);
+                results[i] = (batch_task->completions[i].result == 0);
+            }
+            goto exit;
+        }
+    } else {
+        assert(batch_status == DSA_COMP_BATCH_FAIL ||
+            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
+    }
+
+    for (int i = 0; i < count; i++) {
+
+        completion = &batch_task->completions[i];
+        status = completion->status;
+
+        if (status == DSA_COMP_SUCCESS) {
+            results[i] = (completion->result == 0);
+            continue;
+        }
+
+        assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
+            error_report("Unexpected DSA completion status = %u.", status);
+            ret = 1;
+            goto exit;
+        }
+    }
+
+exit:
+    return ret;
+}
+
+/**
+ * @brief Handles an asynchronous DSA batch task completion.
+ *
+ * @param batch_task A pointer to the batch buffer zero task structure.
+ */
+static void
+dsa_batch_task_complete(QemuDsaBatchTask *batch_task)
+{
+    batch_task->status = QEMU_DSA_TASK_COMPLETION;
+    batch_task->completion_callback(batch_task);
+}
+
+/**
+ * @brief The function entry point called by a dedicated DSA
+ *        work item completion thread.
+ *
+ * @param opaque A pointer to the thread context.
+ *
+ * @return void* Not used.
+ */
+static void *
+dsa_completion_loop(void *opaque)
+{
+    QemuDsaCompletionThread *thread_context =
+        (QemuDsaCompletionThread *)opaque;
+    QemuDsaBatchTask *batch_task;
+    QemuDsaDeviceGroup *group = thread_context->group;
+    int ret = 0;
+
+    rcu_register_thread();
+
+    thread_context->thread_id = qemu_get_thread_id();
+    qemu_sem_post(&thread_context->sem_init_done);
+
+    while (thread_context->running) {
+        batch_task = dsa_task_dequeue(group);
+        assert(batch_task != NULL || !group->running);
+        if (!group->running) {
+            assert(!thread_context->running);
+            break;
+        }
+        if (batch_task->task_type == QEMU_DSA_TASK) {
+            ret = poll_task_completion(batch_task);
+        } else {
+            assert(batch_task->task_type == QEMU_DSA_BATCH_TASK);
+            ret = poll_batch_task_completion(batch_task);
+        }
+
+        if (ret != 0) {
+            goto exit;
+        }
+
+        dsa_batch_task_complete(batch_task);
+    }
+
+exit:
+    if (ret != 0) {
+        error_report("DSA completion thread exited due to internal error.");
+    }
+    rcu_unregister_thread();
+    return NULL;
+}
+
+/**
+ * @brief Initializes a DSA completion thread.
+ *
+ * @param completion_thread A pointer to the completion thread context.
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_completion_thread_init(
+    QemuDsaCompletionThread *completion_thread,
+    QemuDsaDeviceGroup *group)
+{
+    completion_thread->stopping = false;
+    completion_thread->running = true;
+    completion_thread->thread_id = -1;
+    qemu_sem_init(&completion_thread->sem_init_done, 0);
+    completion_thread->group = group;
+
+    qemu_thread_create(&completion_thread->thread,
+                       DSA_COMPLETION_THREAD,
+                       dsa_completion_loop,
+                       completion_thread,
+                       QEMU_THREAD_JOINABLE);
+
+    /* Wait for initialization to complete */
+    qemu_sem_wait(&completion_thread->sem_init_done);
+}
+
+/**
+ * @brief Stops the completion thread (and implicitly, the device group).
+ *
+ * @param opaque A pointer to the completion thread.
+ */
+static void dsa_completion_thread_stop(void *opaque)
+{
+    QemuDsaCompletionThread *thread_context =
+        (QemuDsaCompletionThread *)opaque;
+
+    QemuDsaDeviceGroup *group = thread_context->group;
+
+    qemu_mutex_lock(&group->task_queue_lock);
+
+    thread_context->stopping = true;
+    thread_context->running = false;
+
+    /* Prevent the compiler from setting group->running first. */
+    barrier();
+    dsa_device_group_stop(group);
+
+    qemu_cond_signal(&group->task_queue_cond);
+    qemu_mutex_unlock(&group->task_queue_lock);
+
+    qemu_thread_join(&thread_context->thread);
+
+    qemu_sem_destroy(&thread_context->sem_init_done);
+}
+
 /**
  * @brief Check if DSA is running.
  *
@@ -410,7 +680,7 @@ submit_batch_wi_async(QemuDsaBatchTask *batch_task)
  */
 bool qemu_dsa_is_running(void)
 {
-    return false;
+    return completion_thread.running;
 }
 
 static void
@@ -451,6 +721,7 @@ void qemu_dsa_start(void)
         return;
     }
     dsa_device_group_start(&dsa_group);
+    dsa_completion_thread_init(&completion_thread, &dsa_group);
 }
 
 /**
@@ -465,6 +736,7 @@ void qemu_dsa_stop(void)
         return;
     }
 
+    dsa_completion_thread_stop(&completion_thread);
     dsa_empty_task_queue(group);
 }
 
-- 
Yichen Wang




* [PATCH v7 06/12] util/dsa: Implement zero page checking in DSA task.
From: Yichen Wang @ 2024-11-14 22:01 UTC
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

From: Hao Xiang <hao.xiang@linux.dev>

Create a DSA task with operation code DSA_OPCODE_COMPVAL.
Here we create two types of DSA tasks: a single DSA task and
a batch DSA task. The batch DSA task reduces task submission overhead
and hence should be the default option. However, due to the way the
DSA hardware works, a DSA batch task must contain at least two
individual tasks. There are times we need to submit a single task,
hence single DSA task submission is also required.
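
As a hedged sketch of the resulting dispatch (illustrative fragment
only; the actual logic is in the buffer_is_zero_dsa_batch_sync path
added by this series):

    /* A DSA batch descriptor needs at least two tasks, so a count of
     * one must take the single-task path. */
    if (count == 1) {
        buffer_zero_task_set(batch_task, buf[0], len);
        ret = submit_wi_async(batch_task);
    } else {
        buffer_zero_batch_task_set(batch_task, buf, count, len);
        ret = submit_batch_wi_async(batch_task);
    }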

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 include/qemu/dsa.h |  44 ++++++--
 util/dsa.c         | 254 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 269 insertions(+), 29 deletions(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index d24567f0be..cb407b8b49 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -16,6 +16,7 @@
 #define QEMU_DSA_H
 
 #include "qapi/error.h"
+#include "exec/cpu-common.h"
 #include "qemu/thread.h"
 #include "qemu/queue.h"
 
@@ -70,10 +71,11 @@ typedef struct QemuDsaBatchTask {
     QemuDsaTaskStatus status;
     int batch_size;
     bool *results;
+    /* Address of each page in the batch. */
+    ram_addr_t *addr;
     QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
 } QemuDsaBatchTask;
 
-
 /**
  * @brief Initializes DSA devices.
  *
@@ -105,8 +107,26 @@ void qemu_dsa_cleanup(void);
  */
 bool qemu_dsa_is_running(void);
 
+/**
+ * @brief Initializes a buffer zero DSA batch task.
+ *
+ * @param batch_size The number of zero page checking tasks in the batch.
+ * @return A pointer to the zero page checking tasks initialized.
+ */
+QemuDsaBatchTask *
+buffer_zero_batch_task_init(int batch_size);
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
+
 #else
 
+typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
+
 static inline bool qemu_dsa_is_running(void)
 {
     return false;
@@ -114,19 +134,27 @@ static inline bool qemu_dsa_is_running(void)
 
 static inline int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
 {
-    if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
-        error_setg(errp, "DSA is not supported.");
-        return -1;
-    }
-
-    return 0;
+    error_setg(errp, "DSA accelerator is not enabled.");
+    return -1;
 }
 
 static inline void qemu_dsa_start(void) {}
 
 static inline void qemu_dsa_stop(void) {}
 
-static inline void qemu_dsa_cleanup(void) {}
+static inline QemuDsaBatchTask *buffer_zero_batch_task_init(int batch_size)
+{
+    return NULL;
+}
+
+static inline void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task) {}
+
+static inline int
+buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
+                              const void **buf, size_t count, size_t len)
+{
+    return -1;
+}
 
 #endif
 
diff --git a/util/dsa.c b/util/dsa.c
index c3ca71df86..408c163195 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -48,6 +48,7 @@ uint32_t max_retry_count;
 static QemuDsaDeviceGroup dsa_group;
 static QemuDsaCompletionThread completion_thread;
 
+static void buffer_zero_dsa_completion(void *context);
 
 /**
  * @brief This function opens a DSA device's work queue and
@@ -174,7 +175,6 @@ dsa_device_group_start(QemuDsaDeviceGroup *group)
  *
  * @param group A pointer to the DSA device group.
  */
-__attribute__((unused))
 static void
 dsa_device_group_stop(QemuDsaDeviceGroup *group)
 {
@@ -210,7 +210,6 @@ dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
  * @return struct QemuDsaDevice* A pointer to the next available DSA device
  *         in the group.
  */
-__attribute__((unused))
 static QemuDsaDevice *
 dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
 {
@@ -283,7 +282,6 @@ dsa_task_enqueue(QemuDsaDeviceGroup *group,
  * @param group A pointer to the DSA device group.
  * @return QemuDsaBatchTask* The DSA task being dequeued.
  */
-__attribute__((unused))
 static QemuDsaBatchTask *
 dsa_task_dequeue(QemuDsaDeviceGroup *group)
 {
@@ -338,22 +336,6 @@ submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
     return 0;
 }
 
-/**
- * @brief Synchronously submits a DSA work item to the
- *        device work queue.
- *
- * @param wq A pointer to the DSA work queue's device memory.
- * @param descriptor A pointer to the DSA work item descriptor.
- *
- * @return int Zero if successful, non-zero otherwise.
- */
-__attribute__((unused))
-static int
-submit_wi(void *wq, struct dsa_hw_desc *descriptor)
-{
-    return submit_wi_int(wq, descriptor);
-}
-
 /**
  * @brief Asynchronously submits a DSA work item to the
  *        device work queue.
@@ -362,7 +344,6 @@ submit_wi(void *wq, struct dsa_hw_desc *descriptor)
  *
  * @return int Zero if successful, non-zero otherwise.
  */
-__attribute__((unused))
 static int
 submit_wi_async(QemuDsaBatchTask *task)
 {
@@ -391,7 +372,6 @@ submit_wi_async(QemuDsaBatchTask *task)
  *
  * @return int Zero if successful, non-zero otherwise.
  */
-__attribute__((unused))
 static int
 submit_batch_wi_async(QemuDsaBatchTask *batch_task)
 {
@@ -750,3 +730,235 @@ void qemu_dsa_cleanup(void)
     dsa_device_group_cleanup(&dsa_group);
 }
 
+
+/* Buffer zero comparison DSA task implementations */
+/* =============================================== */
+
+/**
+ * @brief Sets a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param buf A pointer to the memory buffer.
+ * @param len The length of the buffer.
+ */
+static void
+buffer_zero_task_set_int(struct dsa_hw_desc *descriptor,
+                         const void *buf,
+                         size_t len)
+{
+    struct dsa_completion_record *completion =
+        (struct dsa_completion_record *)descriptor->completion_addr;
+
+    descriptor->xfer_size = len;
+    descriptor->src_addr = (uintptr_t)buf;
+    completion->status = 0;
+    completion->result = 0;
+}
+
+/**
+ * @brief Resets a buffer zero comparison DSA task.
+ *
+ * @param task A pointer to the DSA task.
+ */
+static void
+buffer_zero_task_reset(QemuDsaBatchTask *task)
+{
+    task->completions[0].status = DSA_COMP_NONE;
+    task->task_type = QEMU_DSA_TASK;
+    task->status = QEMU_DSA_TASK_READY;
+}
+
+/**
+ * @brief Resets a buffer zero comparison DSA batch task.
+ *
+ * @param task A pointer to the batch task.
+ * @param count The number of DSA tasks this batch task will contain.
+ */
+static void
+buffer_zero_batch_task_reset(QemuDsaBatchTask *task, size_t count)
+{
+    task->batch_completion.status = DSA_COMP_NONE;
+    task->batch_descriptor.desc_count = count;
+    task->task_type = QEMU_DSA_BATCH_TASK;
+    task->status = QEMU_DSA_TASK_READY;
+}
+
+/**
+ * @brief Sets a buffer zero comparison DSA task.
+ *
+ * @param task A pointer to the DSA task.
+ * @param buf A pointer to the memory buffer.
+ * @param len The buffer length.
+ */
+static void
+buffer_zero_task_set(QemuDsaBatchTask *task,
+                     const void *buf,
+                     size_t len)
+{
+    buffer_zero_task_reset(task);
+    buffer_zero_task_set_int(&task->descriptors[0], buf, len);
+}
+
+/**
+ * @brief Sets a buffer zero comparison batch task.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The length of the buffers.
+ */
+static void
+buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
+                           const void **buf, size_t count, size_t len)
+{
+    assert(count > 0);
+    assert(count <= batch_task->batch_size);
+
+    buffer_zero_batch_task_reset(batch_task, count);
+    for (int i = 0; i < count; i++) {
+        buffer_zero_task_set_int(&batch_task->descriptors[i], buf[i], len);
+    }
+}
+
+/**
+ * @brief Asynchronously performs a buffer zero DSA operation.
+ *
+ * @param task A pointer to the batch task structure.
+ * @param buf A pointer to the memory buffer.
+ * @param len The length of the memory buffer.
+ *
+ * @return int Zero if successful, otherwise an appropriate error code.
+ */
+__attribute__((unused))
+static int
+buffer_zero_dsa_async(QemuDsaBatchTask *task,
+                      const void *buf, size_t len)
+{
+    buffer_zero_task_set(task, buf, len);
+
+    return submit_wi_async(task);
+}
+
+/**
+ * @brief Sends a memory comparison batch task to a DSA device and
+ *        waits for completion.
+ *
+ * @param batch_task The batch task to be submitted to DSA device.
+ * @param buf An array of memory buffers to check for zero.
+ * @param count The number of buffers.
+ * @param len The buffer length.
+ *
+ * @return int Zero if successful, otherwise non-zero.
+ */
+__attribute__((unused))
+static int
+buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
+                            const void **buf, size_t count, size_t len)
+{
+    assert(count <= batch_task->batch_size);
+    buffer_zero_batch_task_set(batch_task, buf, count, len);
+
+    return submit_batch_wi_async(batch_task);
+}
+
+/**
+ * @brief The completion callback function for buffer zero
+ *        comparison DSA task completion.
+ *
+ * @param context A pointer to the callback context.
+ */
+static void
+buffer_zero_dsa_completion(void *context)
+{
+    assert(context != NULL);
+
+    QemuDsaBatchTask *task = (QemuDsaBatchTask *)context;
+    qemu_sem_post(&task->sem_task_complete);
+}
+
+/**
+ * @brief Wait for the asynchronous DSA task to complete.
+ *
+ * @param batch_task A pointer to the buffer zero comparison batch task.
+ */
+__attribute__((unused))
+static void
+buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
+{
+    qemu_sem_wait(&batch_task->sem_task_complete);
+}
+
+/**
+ * @brief Initializes a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param completion A pointer to the DSA task completion record.
+ */
+static void
+buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
+                          struct dsa_completion_record *completion)
+{
+    descriptor->opcode = DSA_OPCODE_COMPVAL;
+    descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+    descriptor->comp_pattern = (uint64_t)0;
+    descriptor->completion_addr = (uint64_t)completion;
+}
+
+/**
+ * @brief Initializes a buffer zero DSA batch task.
+ *
+ * @param batch_size The number of zero page checking tasks in the batch.
+ * @return A pointer to the initialized batch task.
+ */
+QemuDsaBatchTask *
+buffer_zero_batch_task_init(int batch_size)
+{
+    QemuDsaBatchTask *task = qemu_memalign(64, sizeof(QemuDsaBatchTask));
+    int descriptors_size = sizeof(*task->descriptors) * batch_size;
+
+    memset(task, 0, sizeof(*task));
+    task->addr = g_new0(ram_addr_t, batch_size);
+    task->results = g_new0(bool, batch_size);
+    task->batch_size = batch_size;
+    task->descriptors =
+        (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
+    memset(task->descriptors, 0, descriptors_size);
+    task->completions = (struct dsa_completion_record *)qemu_memalign(
+        32, sizeof(*task->completions) * batch_size);
+
+    task->batch_completion.status = DSA_COMP_NONE;
+    task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
+    /* TODO: Ensure that we never send a batch with count <= 1 */
+    task->batch_descriptor.desc_count = 0;
+    task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
+    task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+    task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
+    task->status = QEMU_DSA_TASK_READY;
+    task->group = &dsa_group;
+    task->device = dsa_device_group_get_next_device(&dsa_group);
+
+    for (int i = 0; i < task->batch_size; i++) {
+        buffer_zero_task_init_int(&task->descriptors[i],
+                                  &task->completions[i]);
+    }
+
+    qemu_sem_init(&task->sem_task_complete, 0);
+    task->completion_callback = buffer_zero_dsa_completion;
+
+    return task;
+}
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void
+buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
+{
+    g_free(task->addr);
+    g_free(task->results);
+    qemu_vfree(task->descriptors);
+    qemu_vfree(task->completions);
+    task->results = NULL;
+    qemu_sem_destroy(&task->sem_task_complete);
+    qemu_vfree(task);
+}
-- 
Yichen Wang



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 07/12] util/dsa: Implement DSA task asynchronous submission and wait for completion.
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (5 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 06/12] util/dsa: Implement zero page checking in DSA task Yichen Wang
@ 2024-11-14 22:01 ` Yichen Wang
  2024-11-25 18:00   ` Fabiano Rosas
  2024-11-14 22:01 ` [PATCH v7 08/12] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Yichen Wang @ 2024-11-14 22:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

From: Hao Xiang <hao.xiang@linux.dev>

* Add a DSA task completion callback.
* The DSA completion thread calls the task's completion callback
on every task/batch task completion.
* Make the DSA submission path wait for completion.
* Implement a CPU fallback if DSA is not able to complete the task.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
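For review convenience, a minimal caller sketch of the synchronous API
added here (illustration only, not part of the patch; it assumes
qemu_dsa_init() and qemu_dsa_start() have already succeeded, and the
batch size of 128 and page size of 4096 are just example values):

    /*
     * Sketch: returns true iff both host pages are zero, using the
     * synchronous batch API. results[i] reports whether bufs[i] was
     * all zeroes once buffer_is_zero_dsa_batch_sync() returns.
     */
    static bool both_pages_zero(const void *page0, const void *page1)
    {
        QemuDsaBatchTask *task = buffer_zero_batch_task_init(128);
        const void *bufs[2] = { page0, page1 };
        bool ret = false;

        if (buffer_is_zero_dsa_batch_sync(task, bufs, 2, 4096) == 0) {
            ret = task->results[0] && task->results[1];
        }
        buffer_zero_batch_task_destroy(task);
        return ret;
    }
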
 include/qemu/dsa.h |  14 +++++
 util/dsa.c         | 125 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index cb407b8b49..8284804a32 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -123,6 +123,20 @@ buffer_zero_batch_task_init(int batch_size);
  */
 void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
 
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task synchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
+                              const void **buf, size_t count, size_t len);
+
 #else
 
 typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
diff --git a/util/dsa.c b/util/dsa.c
index 408c163195..50f53ec24b 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -433,6 +433,42 @@ poll_completion(struct dsa_completion_record *completion,
     return 0;
 }
 
+/**
+ * @brief Helper function to use CPU to complete a single
+ *        zero page checking task.
+ *
+ * @param completion A pointer to a DSA task completion record.
+ * @param descriptor A pointer to a DSA task descriptor.
+ * @param result A pointer to the result of a zero page checking.
+ */
+static void
+task_cpu_fallback_int(struct dsa_completion_record *completion,
+                      struct dsa_hw_desc *descriptor, bool *result)
+{
+    const uint8_t *buf;
+    size_t len;
+
+    if (completion->status == DSA_COMP_SUCCESS) {
+        return;
+    }
+
+    /*
+     * DSA was able to partially complete the operation. Check the
+     * result. If we already know this is not a zero page, we can
+     * return now.
+     */
+    if (completion->bytes_completed != 0 && completion->result != 0) {
+        *result = false;
+        return;
+    }
+
+    /* Let's fallback to use CPU to complete it. */
+    buf = (const uint8_t *)descriptor->src_addr;
+    len = descriptor->xfer_size;
+    *result = buffer_is_zero(buf + completion->bytes_completed,
+                             len - completion->bytes_completed);
+}
+
 /**
  * @brief Complete a single DSA task in the batch task.
  *
@@ -561,7 +597,7 @@ dsa_completion_loop(void *opaque)
         (QemuDsaCompletionThread *)opaque;
     QemuDsaBatchTask *batch_task;
     QemuDsaDeviceGroup *group = thread_context->group;
-    int ret;
+    int ret = 0;
 
     rcu_register_thread();
 
@@ -829,7 +865,6 @@ buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
  *
  * @return int Zero if successful, otherwise an appropriate error code.
  */
-__attribute__((unused))
 static int
 buffer_zero_dsa_async(QemuDsaBatchTask *task,
                       const void *buf, size_t len)
@@ -848,7 +883,6 @@ buffer_zero_dsa_async(QemuDsaBatchTask *task,
  * @param count The number of buffers.
  * @param len The buffer length.
  */
-__attribute__((unused))
 static int
 buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
                             const void **buf, size_t count, size_t len)
@@ -879,13 +913,61 @@ buffer_zero_dsa_completion(void *context)
  *
  * @param batch_task A pointer to the buffer zero comparison batch task.
  */
-__attribute__((unused))
 static void
 buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
 {
     qemu_sem_wait(&batch_task->sem_task_complete);
 }
 
+/**
+ * @brief Use CPU to complete the zero page checking task if DSA
+ *        is not able to complete it.
+ *
+ * @param batch_task A pointer to the batch task.
+ */
+static void
+buffer_zero_cpu_fallback(QemuDsaBatchTask *batch_task)
+{
+    if (batch_task->task_type == QEMU_DSA_TASK) {
+        if (batch_task->completions[0].status == DSA_COMP_SUCCESS) {
+            return;
+        }
+        task_cpu_fallback_int(&batch_task->completions[0],
+                              &batch_task->descriptors[0],
+                              &batch_task->results[0]);
+    } else if (batch_task->task_type == QEMU_DSA_BATCH_TASK) {
+        struct dsa_completion_record *batch_completion =
+            &batch_task->batch_completion;
+        struct dsa_completion_record *completion;
+        uint8_t status;
+        bool *results = batch_task->results;
+        uint32_t count = batch_task->batch_descriptor.desc_count;
+
+        /* DSA was able to complete the entire batch task. */
+        if (batch_completion->status == DSA_COMP_SUCCESS) {
+            assert(count == batch_completion->bytes_completed);
+            return;
+        }
+
+        /*
+         * DSA encounters some error and is not able to complete
+         * the entire batch task. Use CPU fallback.
+         */
+        for (int i = 0; i < count; i++) {
+            completion = &batch_task->completions[i];
+            status = completion->status;
+
+            assert(status == DSA_COMP_SUCCESS ||
+                status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+            task_cpu_fallback_int(completion,
+                                  &batch_task->descriptors[i],
+                                  &results[i]);
+        }
+    }
+}
+
 /**
  * @brief Initializes a buffer zero comparison DSA task.
  *
@@ -962,3 +1044,38 @@ buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
     qemu_sem_destroy(&task->sem_task_complete);
     qemu_vfree(task);
 }
+
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task synchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
+                              const void **buf, size_t count, size_t len)
+{
+    assert(batch_task != NULL);
+    assert(len != 0);
+    assert(buf != NULL);
+
+    if (count == 0 || count > batch_task->batch_size) {
+        return -1;
+    }
+
+    if (count == 1) {
+        /* DSA doesn't take batch operation with only 1 task. */
+        buffer_zero_dsa_async(batch_task, buf[0], len);
+    } else {
+        buffer_zero_dsa_batch_async(batch_task, buf, count, len);
+    }
+
+    buffer_zero_dsa_wait(batch_task);
+    buffer_zero_cpu_fallback(batch_task);
+
+    return 0;
+}
-- 
Yichen Wang



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 08/12] migration/multifd: Add new migration option for multifd DSA offloading.
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (6 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 07/12] util/dsa: Implement DSA task asynchronous submission and wait for completion Yichen Wang
@ 2024-11-14 22:01 ` Yichen Wang
  2024-11-15 14:32   ` Dr. David Alan Gilbert
  2024-11-14 22:01 ` [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path Yichen Wang
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Yichen Wang @ 2024-11-14 22:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

From: Hao Xiang <hao.xiang@linux.dev>

Intel DSA offloading is an optional feature that can be turned on
when the proper hardware and software stack is available. Turn on
DSA offloading in multifd live migration by setting:

zero-page-detection=dsa-accel
accel-path="dsa:<dsa_dev_path1> dsa:<dsa_dev_path2> ..."

This feature is turned off by default.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
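For reference, once applied the feature can be enabled from the HMP
monitor as below; the work queue path is only an example and depends
on how the host's DSA device was configured:

    (qemu) migrate_set_parameter zero-page-detection dsa-accel
    (qemu) migrate_set_parameter accel-path dsa:/dev/dsa/wq4.0

or through QMP:

    { "execute": "migrate-set-parameters",
      "arguments": { "zero-page-detection": "dsa-accel",
                     "accel-path": [ "dsa:/dev/dsa/wq4.0" ] } }
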
 hmp-commands.hx                |  2 +-
 include/qemu/dsa.h             | 13 +++++++++++++
 migration/migration-hmp-cmds.c | 19 ++++++++++++++++++-
 migration/options.c            | 30 ++++++++++++++++++++++++++++++
 migration/options.h            |  1 +
 qapi/migration.json            | 32 ++++++++++++++++++++++++++++----
 util/dsa.c                     | 31 +++++++++++++++++++++++++++++++
 7 files changed, 122 insertions(+), 6 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 06746f0afc..0e04eac7c7 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1009,7 +1009,7 @@ ERST
 
     {
         .name       = "migrate_set_parameter",
-        .args_type  = "parameter:s,value:s",
+        .args_type  = "parameter:s,value:S",
         .params     = "parameter value",
         .help       = "Set the parameter for migration",
         .cmd        = hmp_migrate_set_parameter,
diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 8284804a32..258860bd20 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -100,6 +100,13 @@ void qemu_dsa_stop(void);
  */
 void qemu_dsa_cleanup(void);
 
+/**
+ * @brief Check if DSA is supported.
+ *
+ * @return True if DSA is supported, otherwise false.
+ */
+bool qemu_dsa_is_supported(void);
+
 /**
  * @brief Check if DSA is running.
  *
@@ -141,6 +148,12 @@ buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
 
 typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
 
+static inline bool qemu_dsa_is_supported(void)
+{
+    return false;
+}
+
 static inline bool qemu_dsa_is_running(void)
 {
     return false;
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 20d1a6e219..01c528b80a 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -312,7 +312,16 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: '%s'\n",
             MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
             params->tls_authz);
-
+        if (params->has_accel_path) {
+            strList *accel_path = params->accel_path;
+            monitor_printf(mon, "%s:",
+                MigrationParameter_str(MIGRATION_PARAMETER_ACCEL_PATH));
+            while (accel_path) {
+                monitor_printf(mon, " '%s'", accel_path->value);
+                accel_path = accel_path->next;
+            }
+            monitor_printf(mon, "\n");
+        }
         if (params->has_block_bitmap_mapping) {
             const BitmapMigrationNodeAliasList *bmnal;
 
@@ -563,6 +572,14 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_x_checkpoint_delay = true;
         visit_type_uint32(v, param, &p->x_checkpoint_delay, &err);
         break;
+    case MIGRATION_PARAMETER_ACCEL_PATH: {
+        /* Split the space-separated value into a strList */
+        g_autofree char **strv = g_strsplit(valuestr ? : "", " ", -1);
+        strList **tail = &p->accel_path;
+
+        p->has_accel_path = true;
+        for (int i = 0; strv[i]; i++) {
+            QAPI_LIST_APPEND(tail, strv[i]);
+        }
+        break;
+    }
     case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
         p->has_multifd_channels = true;
         visit_type_uint8(v, param, &p->multifd_channels, &err);
diff --git a/migration/options.c b/migration/options.c
index ad8d6989a8..ca89fdc4f4 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/dsa.h"
 #include "exec/target_page.h"
 #include "qapi/clone-visitor.h"
 #include "qapi/error.h"
@@ -809,6 +810,13 @@ const char *migrate_tls_creds(void)
     return s->parameters.tls_creds;
 }
 
+const strList *migrate_accel_path(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.accel_path;
+}
+
 const char *migrate_tls_hostname(void)
 {
     MigrationState *s = migrate_get_current();
@@ -922,6 +930,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->zero_page_detection = s->parameters.zero_page_detection;
     params->has_direct_io = true;
     params->direct_io = s->parameters.direct_io;
+    params->has_accel_path = true;
+    params->accel_path = QAPI_CLONE(strList, s->parameters.accel_path);
 
     return params;
 }
@@ -930,6 +940,7 @@ void migrate_params_init(MigrationParameters *params)
 {
     params->tls_hostname = g_strdup("");
     params->tls_creds = g_strdup("");
+    params->accel_path = NULL;
 
     /* Set has_* up only for parameter checks */
     params->has_throttle_trigger_threshold = true;
@@ -1142,6 +1153,14 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
         return false;
     }
 
+    if (params->has_zero_page_detection &&
+        params->zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
+        if (!qemu_dsa_is_supported()) {
+            error_setg(errp, "DSA acceleration is not supported.");
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -1255,6 +1274,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_direct_io) {
         dest->direct_io = params->direct_io;
     }
+
+    if (params->has_accel_path) {
+        dest->has_accel_path = true;
+        dest->accel_path = params->accel_path;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1387,6 +1411,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_direct_io) {
         s->parameters.direct_io = params->direct_io;
     }
+    if (params->has_accel_path) {
+        qapi_free_strList(s->parameters.accel_path);
+        s->parameters.has_accel_path = true;
+        s->parameters.accel_path =
+            QAPI_CLONE(strList, params->accel_path);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/migration/options.h b/migration/options.h
index 79084eed0d..3d1e91dc52 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -84,6 +84,7 @@ const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 ZeroPageDetection migrate_zero_page_detection(void);
+const strList *migrate_accel_path(void);
 
 /* parameters helpers */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index a605dc26db..389776065d 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -629,10 +629,14 @@
 #     multifd migration is enabled, else in the main migration thread
 #     as for @legacy.
 #
+# @dsa-accel: Perform zero page checking with DSA accelerator
+#     offloading in the multifd sender thread if multifd migration
+#     is enabled, else in the main migration thread as for @legacy.
+#
 # Since: 9.0
 ##
 { 'enum': 'ZeroPageDetection',
-  'data': [ 'none', 'legacy', 'multifd' ] }
+  'data': [ 'none', 'legacy', 'multifd', 'dsa-accel' ] }
 
 ##
 # @BitmapMigrationBitmapAliasTransform:
@@ -840,6 +844,12 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @accel-path: Specify the accelerator paths to be used in QEMU.
+#     For example, enable the DSA accelerator for zero page detection
+#     offloading by setting @zero-page-detection to dsa-accel and
+#     setting @accel-path to "dsa:<dsa_device path>".  This parameter
+#     defaults to an empty list.  (Since 9.2)
+#
 # @direct-io: Open migration files with O_DIRECT when possible.  This
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
@@ -858,7 +868,7 @@
            'cpu-throttle-initial', 'cpu-throttle-increment',
            'cpu-throttle-tailslow',
            'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
-           'avail-switchover-bandwidth', 'downtime-limit',
+           'avail-switchover-bandwidth', 'downtime-limit', 'accel-path',
            { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
            'multifd-channels',
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
@@ -1021,6 +1031,12 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @accel-path: Specify the accelerator paths to be used in QEMU.
+#     For example, enable the DSA accelerator for zero page detection
+#     offloading by setting @zero-page-detection to dsa-accel and
+#     setting @accel-path to "dsa:<dsa_device path>".  This parameter
+#     defaults to an empty list.  (Since 9.2)
+#
 # @direct-io: Open migration files with O_DIRECT when possible.  This
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
@@ -1066,7 +1082,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*accel-path': [ 'str' ] } }
 
 ##
 # @migrate-set-parameters:
@@ -1231,6 +1248,12 @@
 #     See description in @ZeroPageDetection.  Default is 'multifd'.
 #     (since 9.0)
 #
+# @accel-path: Specify the accelerator paths to be used in QEMU.
+#     For example, enable the DSA accelerator for zero page detection
+#     offloading by setting @zero-page-detection to dsa-accel and
+#     setting @accel-path to "dsa:<dsa_device path>".  This parameter
+#     defaults to an empty list.  (Since 9.2)
+#
 # @direct-io: Open migration files with O_DIRECT when possible.  This
 #     only has effect if the @mapped-ram capability is enabled.
 #     (Since 9.1)
@@ -1273,7 +1296,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*zero-page-detection': 'ZeroPageDetection',
-            '*direct-io': 'bool' } }
+            '*direct-io': 'bool',
+            '*accel-path': [ 'str' ] } }
 
 ##
 # @query-migrate-parameters:
diff --git a/util/dsa.c b/util/dsa.c
index 50f53ec24b..18ed36e354 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -23,6 +23,7 @@
 #include "qemu/bswap.h"
 #include "qemu/error-report.h"
 #include "qemu/rcu.h"
+#include <cpuid.h>
 
 #pragma GCC push_options
 #pragma GCC target("enqcmd")
@@ -689,6 +690,36 @@ static void dsa_completion_thread_stop(void *opaque)
     qemu_sem_destroy(&thread_context->sem_init_done);
 }
 
+/**
+ * @brief Check if DSA is supported.
+ *
+ * @return True if DSA is supported, otherwise false.
+ */
+bool qemu_dsa_is_supported(void)
+{
+    /*
+     * movdir64b is indicated by bit 28 of ecx in CPUID leaf 7, subleaf 0.
+     * enqcmd is indicated by bit 29 of ecx in CPUID leaf 7, subleaf 0.
+     * Doc: https://cdrdv2-public.intel.com/819680/architecture-instruction-\
+     *      set-extensions-programming-reference.pdf
+     */
+    uint32_t eax, ebx, ecx, edx;
+    bool movdir64b_enabled;
+    bool enqcmd_enabled;
+
+    __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
+    movdir64b_enabled = (ecx >> 28) & 0x1;
+    if (!movdir64b_enabled) {
+        return false;
+    }
+    enqcmd_enabled = (ecx >> 29) & 0x1;
+    if (!enqcmd_enabled) {
+        return false;
+    }
+
+    return true;
+}
+
 /**
  * @brief Check if DSA is running.
  *
-- 
Yichen Wang



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path.
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (7 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 08/12] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
@ 2024-11-14 22:01 ` Yichen Wang
  2024-11-21 20:50   ` Fabiano Rosas
  2024-11-14 22:01 ` [PATCH v7 10/12] util/dsa: Add unit test coverage for Intel DSA task submission and completion Yichen Wang
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Yichen Wang @ 2024-11-14 22:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

From: Hao Xiang <hao.xiang@linux.dev>

The multifd sender path gets an array of pages queued by the migration
thread and performs zero page checking on every page in the array.
The pages are classified as either a zero page or a normal page. This
change uses Intel DSA to offload the zero page checking from the CPU
to the DSA accelerator. The sender thread submits a batch of pages to
the DSA hardware and waits for the DSA completion thread to signal
work completion.

Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
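For review convenience, a standalone sketch of the partition invariant
shared by zero_page_detect_dsa() and zero_page_detect_cpu()
(illustration only, not part of the patch; results[i] is true iff page
i is zero):

    static int partition_pages(uint64_t *offsets, bool *results, int num)
    {
        int i = 0, j = num - 1;

        while (i <= j) {
            if (!results[i]) {
                i++;                /* normal page stays on the left */
                continue;
            }
            /* zero page: swap it (and its result) to the right end */
            uint64_t tmp_off = offsets[i];
            bool tmp_res = results[i];

            offsets[i] = offsets[j];
            results[i] = results[j];
            offsets[j] = tmp_off;
            results[j] = tmp_res;
            j--;
        }
        /* offsets[0..i) are normal pages; this becomes normal_num */
        return i;
    }

The DSA path must permute results[] alongside offsets[] so the two
arrays stay in sync; the CPU path needs no results[] since it calls
buffer_is_zero() inline.
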
 migration/multifd-zero-page.c | 129 ++++++++++++++++++++++++++++++----
 migration/multifd.c           |  29 +++++++-
 migration/multifd.h           |   5 ++
 3 files changed, 147 insertions(+), 16 deletions(-)

diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
index f1e988a959..639aed9f6b 100644
--- a/migration/multifd-zero-page.c
+++ b/migration/multifd-zero-page.c
@@ -21,7 +21,9 @@
 
 static bool multifd_zero_page_enabled(void)
 {
-    return migrate_zero_page_detection() == ZERO_PAGE_DETECTION_MULTIFD;
+    ZeroPageDetection cur_method = migrate_zero_page_detection();
+
+    return (cur_method == ZERO_PAGE_DETECTION_MULTIFD ||
+            cur_method == ZERO_PAGE_DETECTION_DSA_ACCEL);
 }
 
 static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
@@ -37,26 +39,49 @@ static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
     pages_offset[b] = temp;
 }
 
+#ifdef CONFIG_DSA_OPT
+
+static void swap_result(bool *results, int a, int b)
+{
+    bool temp;
+
+    if (a == b) {
+        return;
+    }
+
+    temp = results[a];
+    results[a] = results[b];
+    results[b] = temp;
+}
+
 /**
- * multifd_send_zero_page_detect: Perform zero page detection on all pages.
+ * zero_page_detect_dsa: Perform zero page detection using
+ * Intel Data Streaming Accelerator (DSA).
  *
- * Sorts normal pages before zero pages in p->pages->offset and updates
- * p->pages->normal_num.
+ * Sorts normal pages before zero pages in pages->offset and updates
+ * pages->normal_num.
  *
  * @param p A pointer to the send params.
  */
-void multifd_send_zero_page_detect(MultiFDSendParams *p)
+static void zero_page_detect_dsa(MultiFDSendParams *p)
 {
     MultiFDPages_t *pages = &p->data->u.ram;
     RAMBlock *rb = pages->block;
-    int i = 0;
-    int j = pages->num - 1;
+    bool *results = p->dsa_batch_task->results;
 
-    if (!multifd_zero_page_enabled()) {
-        pages->normal_num = pages->num;
-        goto out;
+    for (int i = 0; i < pages->num; i++) {
+        p->dsa_batch_task->addr[i] =
+            (ram_addr_t)(rb->host + pages->offset[i]);
     }
 
+    buffer_is_zero_dsa_batch_sync(p->dsa_batch_task,
+                                  (const void **)p->dsa_batch_task->addr,
+                                  pages->num,
+                                  multifd_ram_page_size());
+
+    int i = 0;
+    int j = pages->num - 1;
+
     /*
      * Sort the page offset array by moving all normal pages to
      * the left and all zero pages to the right of the array.
@@ -64,23 +89,39 @@ void multifd_send_zero_page_detect(MultiFDSendParams *p)
     while (i <= j) {
         uint64_t offset = pages->offset[i];
 
-        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
+        if (!results[i]) {
             i++;
             continue;
         }
 
+        swap_result(results, i, j);
         swap_page_offset(pages->offset, i, j);
         ram_release_page(rb->idstr, offset);
         j--;
     }
 
     pages->normal_num = i;
+}
 
-out:
-    stat64_add(&mig_stats.normal_pages, pages->normal_num);
-    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
+void multifd_dsa_cleanup(void)
+{
+    qemu_dsa_cleanup();
+}
+
+#else
+
+static void zero_page_detect_dsa(MultiFDSendParams *p)
+{
+    g_assert_not_reached();
+}
+
+void multifd_dsa_cleanup(void)
+{
 }
 
+#endif
+
 void multifd_recv_zero_page_process(MultiFDRecvParams *p)
 {
     for (int i = 0; i < p->zero_num; i++) {
@@ -92,3 +133,63 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
         }
     }
 }
+
+/**
+ * zero_page_detect_cpu: Perform zero page detection using CPU.
+ *
+ * Sorts normal pages before zero pages in p->pages->offset and updates
+ * p->pages->normal_num.
+ *
+ * @param p A pointer to the send params.
+ */
+static void zero_page_detect_cpu(MultiFDSendParams *p)
+{
+    MultiFDPages_t *pages = &p->data->u.ram;
+    RAMBlock *rb = pages->block;
+    int i = 0;
+    int j = pages->num - 1;
+
+    /*
+     * Sort the page offset array by moving all normal pages to
+     * the left and all zero pages to the right of the array.
+     */
+    while (i <= j) {
+        uint64_t offset = pages->offset[i];
+
+        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
+            i++;
+            continue;
+        }
+
+        swap_page_offset(pages->offset, i, j);
+        ram_release_page(rb->idstr, offset);
+        j--;
+    }
+
+    pages->normal_num = i;
+}
+
+/**
+ * multifd_send_zero_page_detect: Perform zero page detection on all pages.
+ *
+ * @param p A pointer to the send params.
+ */
+void multifd_send_zero_page_detect(MultiFDSendParams *p)
+{
+    MultiFDPages_t *pages = &p->data->u.ram;
+
+    if (!multifd_zero_page_enabled()) {
+        pages->normal_num = pages->num;
+        goto out;
+    }
+
+    if (qemu_dsa_is_running()) {
+        zero_page_detect_dsa(p);
+    } else {
+        zero_page_detect_cpu(p);
+    }
+
+out:
+    stat64_add(&mig_stats.normal_pages, pages->normal_num);
+    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
+}
diff --git a/migration/multifd.c b/migration/multifd.c
index 4374e14a96..689acceff2 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
 #include "qemu/rcu.h"
+#include "qemu/dsa.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
 #include "exec/ramblock.h"
@@ -462,6 +463,8 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
     p->name = NULL;
     g_free(p->data);
     p->data = NULL;
+    buffer_zero_batch_task_destroy(p->dsa_batch_task);
+    p->dsa_batch_task = NULL;
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
@@ -493,6 +496,8 @@ void multifd_send_shutdown(void)
 
     multifd_send_terminate_threads();
 
+    multifd_dsa_cleanup();
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
         Error *local_err = NULL;
@@ -814,11 +819,31 @@ bool multifd_send_setup(void)
     uint32_t page_count = multifd_ram_page_count();
     bool use_packets = multifd_use_packets();
     uint8_t i;
+    Error *local_err = NULL;
 
     if (!migrate_multifd()) {
         return true;
     }
 
+    if (s &&
+        s->parameters.zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
+        /* Populate the DSA device paths from accel-path */
+        const strList *accel_path = migrate_accel_path();
+        g_autoptr(strList) dsa_parameter = NULL;
+        strList **tail = &dsa_parameter;
+        while (accel_path) {
+            if (strncmp(accel_path->value, "dsa:", 4) == 0) {
+                QAPI_LIST_APPEND(tail, g_strdup(&accel_path->value[4]));
+            }
+            accel_path = accel_path->next;
+        }
+        if (qemu_dsa_init(dsa_parameter, &local_err)) {
+            ret = -1;
+        } else {
+            qemu_dsa_start();
+        }
+    }
+
     thread_count = migrate_multifd_channels();
     multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
     multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
@@ -829,12 +854,12 @@ bool multifd_send_setup(void)
 
     for (i = 0; i < thread_count; i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
-        Error *local_err = NULL;
 
         qemu_sem_init(&p->sem, 0);
         qemu_sem_init(&p->sem_sync, 0);
         p->id = i;
         p->data = multifd_send_data_alloc();
+        p->dsa_batch_task = buffer_zero_batch_task_init(page_count);
 
         if (use_packets) {
             p->packet_len = sizeof(MultiFDPacket_t)
@@ -865,7 +890,6 @@ bool multifd_send_setup(void)
 
     for (i = 0; i < thread_count; i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
-        Error *local_err = NULL;
 
         ret = multifd_send_state->ops->send_setup(p, &local_err);
         if (ret) {
@@ -1047,6 +1071,7 @@ void multifd_recv_cleanup(void)
             qemu_thread_join(&p->thread);
         }
     }
+    multifd_dsa_cleanup();
     for (i = 0; i < migrate_multifd_channels(); i++) {
         multifd_recv_cleanup_channel(&multifd_recv_state->params[i]);
     }
diff --git a/migration/multifd.h b/migration/multifd.h
index 50d58c0c9c..e293ddbc1d 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -15,6 +15,7 @@
 
 #include "exec/target_page.h"
 #include "ram.h"
+#include "qemu/dsa.h"
 
 typedef struct MultiFDRecvData MultiFDRecvData;
 typedef struct MultiFDSendData MultiFDSendData;
@@ -155,6 +156,9 @@ typedef struct {
     bool pending_sync;
     MultiFDSendData *data;
 
+    /* Zero page checking batch task */
+    QemuDsaBatchTask *dsa_batch_task;
+
     /* thread local variables. No locking required */
 
     /* pointer to the packet */
@@ -313,6 +317,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p);
 bool multifd_send_prepare_common(MultiFDSendParams *p);
 void multifd_send_zero_page_detect(MultiFDSendParams *p);
 void multifd_recv_zero_page_process(MultiFDRecvParams *p);
+void multifd_dsa_cleanup(void);
 
 static inline void multifd_send_prepare_header(MultiFDSendParams *p)
 {
-- 
Yichen Wang



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 10/12] util/dsa: Add unit test coverage for Intel DSA task submission and completion.
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (8 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path Yichen Wang
@ 2024-11-14 22:01 ` Yichen Wang
  2024-11-14 22:01 ` [PATCH v7 11/12] migration/multifd: Add integration tests for multifd with Intel DSA offloading Yichen Wang
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 30+ messages in thread
From: Yichen Wang @ 2024-11-14 22:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

From: Hao Xiang <hao.xiang@linux.dev>

* Test DSA start and stop path.
* Test DSA configure and cleanup path.
* Test DSA task submission and completion path.

Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 tests/unit/meson.build |   6 +
 tests/unit/test-dsa.c  | 503 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 509 insertions(+)
 create mode 100644 tests/unit/test-dsa.c

diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index d5248ae51d..394219e903 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -50,6 +50,12 @@ tests = {
   'test-fifo': [],
 }
 
+if config_host_data.get('CONFIG_DSA_OPT')
+  tests += {
+    'test-dsa': [],
+  }
+endif
+
 if have_system or have_tools
   tests += {
     'test-qmp-event': [testqapi],
diff --git a/tests/unit/test-dsa.c b/tests/unit/test-dsa.c
new file mode 100644
index 0000000000..181a547528
--- /dev/null
+++ b/tests/unit/test-dsa.c
@@ -0,0 +1,503 @@
+/*
+ * Test DSA functions.
+ *
+ * Copyright (C) Bytedance Ltd.
+ *
+ * Authors:
+ *  Hao Xiang <hao.xiang@bytedance.com>
+ *  Bryan Zhang <bryan.zhang@bytedance.com>
+ *  Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+
+#include "qemu/cutils.h"
+#include "qemu/memalign.h"
+#include "qemu/dsa.h"
+
+/*
+ * TODO Communicate that DSA must be configured to support this batch size.
+ * TODO Alternatively, poke the DSA device to figure out batch size.
+ */
+#define batch_size 128
+#define page_size 4096
+
+#define oversized_batch_size (batch_size + 1)
+#define num_devices 2
+#define max_buffer_size (64 * 1024)
+
+/* TODO Make these not-hardcoded. */
+static const strList path1[] = {
+    {.value = (char *)"/dev/dsa/wq4.0", .next = NULL}
+};
+static const strList path2[] = {
+    {.value = (char *)"/dev/dsa/wq4.0", .next = (strList*)&path2[1]},
+    {.value = (char *)"/dev/dsa/wq4.1", .next = NULL}
+};
+
+static Error **errp;
+
+static QemuDsaBatchTask *task;
+
+/* A helper for running a single task and checking for correctness. */
+static void do_single_task(void)
+{
+    task = buffer_zero_batch_task_init(batch_size);
+    char buf[page_size];
+    char *ptr = buf;
+
+    buffer_is_zero_dsa_batch_sync(task,
+                                  (const void **)&ptr,
+                                  1,
+                                  page_size);
+    g_assert(task->results[0] == buffer_is_zero(buf, page_size));
+
+    buffer_zero_batch_task_destroy(task);
+}
+
+static void test_single_zero(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    task = buffer_zero_batch_task_init(batch_size);
+
+    char buf[page_size];
+    char *ptr = buf;
+
+    memset(buf, 0x0, page_size);
+    buffer_is_zero_dsa_batch_sync(task,
+                                  (const void **)&ptr,
+                                  1, page_size);
+    g_assert(task->results[0]);
+
+    buffer_zero_batch_task_destroy(task);
+
+    qemu_dsa_cleanup();
+}
+
+static void test_single_zero_async(void)
+{
+    test_single_zero();
+}
+
+static void test_single_nonzero(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    task = buffer_zero_batch_task_init(batch_size);
+
+    char buf[page_size];
+    char *ptr = buf;
+
+    memset(buf, 0x1, page_size);
+    buffer_is_zero_dsa_batch_sync(task,
+                                  (const void **)&ptr,
+                                  1, page_size);
+    g_assert(!task->results[0]);
+
+    buffer_zero_batch_task_destroy(task);
+
+    qemu_dsa_cleanup();
+}
+
+static void test_single_nonzero_async(void)
+{
+    test_single_nonzero();
+}
+
+/* count == 0 should return quickly without calling into DSA. */
+static void test_zero_count_async(void)
+{
+    char buf[page_size];
+    buffer_is_zero_dsa_batch_sync(task,
+                                  (const void **)&buf,
+                                  0,
+                                  page_size);
+}
+
+static void test_null_task_async(void)
+{
+    if (g_test_subprocess()) {
+        g_assert(!qemu_dsa_init(path1, errp));
+
+        char buf[page_size * batch_size];
+        char *addrs[batch_size];
+        for (int i = 0; i < batch_size; i++) {
+            addrs[i] = buf + (page_size * i);
+        }
+
+        buffer_is_zero_dsa_batch_sync(NULL, (const void **)addrs,
+                                      batch_size,
+                                      page_size);
+    } else {
+        g_test_trap_subprocess(NULL, 0, 0);
+        g_test_trap_assert_failed();
+    }
+}
+
+static void test_oversized_batch(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    task = buffer_zero_batch_task_init(batch_size);
+
+    char buf[page_size * oversized_batch_size];
+    char *addrs[batch_size];
+    for (int i = 0; i < oversized_batch_size; i++) {
+        addrs[i] = buf + (page_size * i);
+    }
+
+    int ret = buffer_is_zero_dsa_batch_sync(task,
+                                            (const void **)addrs,
+                                            oversized_batch_size,
+                                            page_size);
+    g_assert(ret != 0);
+
+    buffer_zero_batch_task_destroy(task);
+
+    qemu_dsa_cleanup();
+}
+
+static void test_oversized_batch_async(void)
+{
+    test_oversized_batch();
+}
+
+static void test_zero_len_async(void)
+{
+    if (g_test_subprocess()) {
+        g_assert(!qemu_dsa_init(path1, errp));
+
+        task = buffer_zero_batch_task_init(batch_size);
+
+        char buf[page_size];
+
+        buffer_is_zero_dsa_batch_sync(task,
+                                      (const void **)&buf,
+                                      1,
+                                      0);
+
+        buffer_zero_batch_task_destroy(task);
+    } else {
+        g_test_trap_subprocess(NULL, 0, 0);
+        g_test_trap_assert_failed();
+    }
+}
+
+static void test_null_buf_async(void)
+{
+    if (g_test_subprocess()) {
+        g_assert(!qemu_dsa_init(path1, errp));
+
+        task = buffer_zero_batch_task_init(batch_size);
+
+        buffer_is_zero_dsa_batch_sync(task, NULL, 1, page_size);
+
+        buffer_zero_batch_task_destroy(task);
+    } else {
+        g_test_trap_subprocess(NULL, 0, 0);
+        g_test_trap_assert_failed();
+    }
+}
+
+static void test_batch(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    task = buffer_zero_batch_task_init(batch_size);
+
+    char buf[page_size * batch_size];
+    char *addrs[batch_size];
+    for (int i = 0; i < batch_size; i++) {
+        addrs[i] = buf + (page_size * i);
+    }
+
+    /*
+     * Using whatever is on the stack is somewhat random.
+     * Manually set some pages to zero and some to nonzero.
+     */
+    memset(buf + 0, 0, page_size * 10);
+    memset(buf + (10 * page_size), 0xff, page_size * 10);
+
+    buffer_is_zero_dsa_batch_sync(task,
+                                  (const void **)addrs,
+                                  batch_size,
+                                  page_size);
+
+    bool is_zero;
+    for (int i = 0; i < batch_size; i++) {
+        is_zero = buffer_is_zero((const void *)&buf[page_size * i], page_size);
+        g_assert(task->results[i] == is_zero);
+    }
+
+    buffer_zero_batch_task_destroy(task);
+
+    qemu_dsa_cleanup();
+}
+
+static void test_batch_async(void)
+{
+    test_batch();
+}
+
+static void test_page_fault(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    char *buf[2];
+    int prot = PROT_READ | PROT_WRITE;
+    int flags = MAP_SHARED | MAP_ANON;
+    buf[0] = (char *)mmap(NULL, page_size * batch_size, prot, flags, -1, 0);
+    assert(buf[0] != MAP_FAILED);
+    buf[1] = (char *)malloc(page_size * batch_size);
+    assert(buf[1] != NULL);
+
+    for (int j = 0; j < 2; j++) {
+        task = buffer_zero_batch_task_init(batch_size);
+
+        char *addrs[batch_size];
+        for (int i = 0; i < batch_size; i++) {
+            addrs[i] = buf[j] + (page_size * i);
+        }
+
+        buffer_is_zero_dsa_batch_sync(task,
+                                      (const void **)addrs,
+                                      batch_size,
+                                      page_size);
+
+        bool is_zero;
+        for (int i = 0; i < batch_size; i++) {
+            is_zero = buffer_is_zero((const void *)&buf[j][page_size * i],
+                                      page_size);
+            g_assert(task->results[i] == is_zero);
+        }
+        buffer_zero_batch_task_destroy(task);
+    }
+
+    assert(!munmap(buf[0], page_size * batch_size));
+    free(buf[1]);
+    qemu_dsa_cleanup();
+}
+
+static void test_various_buffer_sizes(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    char *buf = malloc(max_buffer_size * batch_size);
+    char *addrs[batch_size];
+
+    for (int len = 16; len <= max_buffer_size; len *= 2) {
+        task = buffer_zero_batch_task_init(batch_size);
+
+        for (int i = 0; i < batch_size; i++) {
+            addrs[i] = buf + (len * i);
+        }
+
+        buffer_is_zero_dsa_batch_sync(task,
+                                      (const void **)addrs,
+                                      batch_size,
+                                      len);
+
+        bool is_zero;
+        for (int j = 0; j < batch_size; j++) {
+            is_zero = buffer_is_zero((const void *)&buf[len * j], len);
+            g_assert(task->results[j] == is_zero);
+        }
+
+        buffer_zero_batch_task_destroy(task);
+    }
+
+    free(buf);
+
+    qemu_dsa_cleanup();
+}
+
+static void test_various_buffer_sizes_async(void)
+{
+    test_various_buffer_sizes();
+}
+
+static void test_double_start_stop(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    /* Double start */
+    qemu_dsa_start();
+    qemu_dsa_start();
+    g_assert(qemu_dsa_is_running());
+    do_single_task();
+
+    /* Double stop */
+    qemu_dsa_stop();
+    g_assert(!qemu_dsa_is_running());
+    qemu_dsa_stop();
+    g_assert(!qemu_dsa_is_running());
+
+    /* Restart */
+    qemu_dsa_start();
+    g_assert(qemu_dsa_is_running());
+    do_single_task();
+    qemu_dsa_cleanup();
+}
+
+static void test_is_running(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+
+    g_assert(!qemu_dsa_is_running());
+    qemu_dsa_start();
+    g_assert(qemu_dsa_is_running());
+    qemu_dsa_stop();
+    g_assert(!qemu_dsa_is_running());
+    qemu_dsa_cleanup();
+}
+
+static void test_multiple_engines(void)
+{
+    g_assert(!qemu_dsa_init(path2, errp));
+    qemu_dsa_start();
+
+    QemuDsaBatchTask *tasks[num_devices];
+    char bufs[num_devices][page_size * batch_size];
+    char *addrs[num_devices][batch_size];
+
+    /*
+     *  This is a somewhat implementation-specific way
+     *  of testing that the tasks have unique engines
+     *  assigned to them.
+     */
+    tasks[0] = buffer_zero_batch_task_init(batch_size);
+    tasks[1] = buffer_zero_batch_task_init(batch_size);
+    g_assert(tasks[0]->device != tasks[1]->device);
+
+    for (int i = 0; i < num_devices; i++) {
+        for (int j = 0; j < batch_size; j++) {
+            addrs[i][j] = bufs[i] + (page_size * j);
+        }
+
+        buffer_is_zero_dsa_batch_sync(tasks[i],
+                                      (const void **)addrs[i],
+                                      batch_size, page_size);
+
+        bool is_zero;
+        for (int j = 0; j < batch_size; j++) {
+            is_zero = buffer_is_zero((const void *)&bufs[i][page_size * j],
+                                     page_size);
+            g_assert(tasks[i]->results[j] == is_zero);
+        }
+    }
+
+    buffer_zero_batch_task_destroy(tasks[0]);
+    buffer_zero_batch_task_destroy(tasks[1]);
+
+    qemu_dsa_cleanup();
+}
+
+static void test_configure_dsa_twice(void)
+{
+    g_assert(!qemu_dsa_init(path2, errp));
+    g_assert(!qemu_dsa_init(path2, errp));
+    qemu_dsa_start();
+    do_single_task();
+    qemu_dsa_cleanup();
+}
+
+static void test_configure_dsa_bad_path(void)
+{
+    const strList *bad_path = &(strList) {
+        .value = (char *)"/not/a/real/path", .next = NULL
+    };
+    g_assert(qemu_dsa_init(bad_path, errp));
+}
+
+static void test_cleanup_before_configure(void)
+{
+    qemu_dsa_cleanup();
+    g_assert(!qemu_dsa_init(path2, errp));
+}
+
+static void test_configure_dsa_num_devices(void)
+{
+    g_assert(!qemu_dsa_init(path1, errp));
+    qemu_dsa_start();
+
+    do_single_task();
+    qemu_dsa_stop();
+    qemu_dsa_cleanup();
+}
+
+static void test_cleanup_twice(void)
+{
+    g_assert(!qemu_dsa_init(path2, errp));
+    qemu_dsa_cleanup();
+    qemu_dsa_cleanup();
+
+    g_assert(!qemu_dsa_init(path2, errp));
+    qemu_dsa_start();
+    do_single_task();
+    qemu_dsa_cleanup();
+}
+
+static int check_test_setup(void)
+{
+    const strList *path[2] = {path1, path2};
+    for (int i = 0; i < ARRAY_SIZE(path); i++) {
+        if (qemu_dsa_init(path[i], errp)) {
+            return -1;
+        }
+        qemu_dsa_cleanup();
+    }
+    return 0;
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+
+    if (check_test_setup() != 0) {
+        /*
+         * This test requires extra setup. The current
+         * setup is not correct. Just skip this test
+         * for now.
+         */
+        exit(0);
+    }
+
+    if (num_devices > 1) {
+        g_test_add_func("/dsa/multiple_engines", test_multiple_engines);
+    }
+
+    g_test_add_func("/dsa/async/batch", test_batch_async);
+    g_test_add_func("/dsa/async/various_buffer_sizes",
+                    test_various_buffer_sizes_async);
+    g_test_add_func("/dsa/async/null_buf", test_null_buf_async);
+    g_test_add_func("/dsa/async/zero_len", test_zero_len_async);
+    g_test_add_func("/dsa/async/oversized_batch", test_oversized_batch_async);
+    g_test_add_func("/dsa/async/zero_count", test_zero_count_async);
+    g_test_add_func("/dsa/async/single_zero", test_single_zero_async);
+    g_test_add_func("/dsa/async/single_nonzero", test_single_nonzero_async);
+    g_test_add_func("/dsa/async/null_task", test_null_task_async);
+    g_test_add_func("/dsa/async/page_fault", test_page_fault);
+
+    g_test_add_func("/dsa/double_start_stop", test_double_start_stop);
+    g_test_add_func("/dsa/is_running", test_is_running);
+
+    g_test_add_func("/dsa/configure_dsa_twice", test_configure_dsa_twice);
+    g_test_add_func("/dsa/configure_dsa_bad_path", test_configure_dsa_bad_path);
+    g_test_add_func("/dsa/cleanup_before_configure",
+                    test_cleanup_before_configure);
+    g_test_add_func("/dsa/configure_dsa_num_devices",
+                    test_configure_dsa_num_devices);
+    g_test_add_func("/dsa/cleanup_twice", test_cleanup_twice);
+
+    return g_test_run();
+}
-- 
Yichen Wang



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v7 11/12] migration/multifd: Add integration tests for multifd with Intel DSA offloading.
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (9 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 10/12] util/dsa: Add unit test coverage for Intel DSA task submission and completion Yichen Wang
@ 2024-11-14 22:01 ` Yichen Wang
  2024-11-25 18:25   ` Fabiano Rosas
  2024-11-14 22:01 ` [PATCH v7 12/12] migration/doc: Add DSA zero page detection doc Yichen Wang
  2024-11-19 21:31 ` [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Fabiano Rosas
  12 siblings, 1 reply; 30+ messages in thread
From: Yichen Wang @ 2024-11-14 22:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

From: Hao Xiang <hao.xiang@linux.dev>

* Add a test case to start and complete multifd live migration with DSA
offloading enabled.
* Add a test case to start and cancel multifd live migration with DSA
offloading enabled.

Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 tests/qtest/migration-test.c | 80 +++++++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index e6a2803e71..cd551ce70c 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -611,6 +611,12 @@ typedef struct {
     bool suspend_me;
 } MigrateStart;
 
+/*
+ * Configuring and enabling a DSA device requires separate steps.
+ * This test assumes that the configuration has already been done.
+ */
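+/*
+ * As an illustrative sketch only (assuming the wq4.0 queue below lives on
+ * device dsa4, and following the accel-config examples in the documentation
+ * patch of this series), that host-side setup could look like:
+ *
+ *   accel-config config-engine dsa4/engine4.0 -g 0
+ *   accel-config config-wq dsa4/wq4.0 -g 0 -s 128 -p 10 -b 1 -t 128 \
+ *       -m shared -y user -n app1 -d user
+ *   accel-config enable-device dsa4
+ *   accel-config enable-wq dsa4/wq4.0
+ */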
+static const char *dsa_dev_path_p = "['dsa:/dev/dsa/wq4.0']";
+
 /*
  * A hook that runs after the src and dst QEMUs have been
  * created, but before the migration is started. This can
@@ -3262,7 +3268,7 @@ static void test_multifd_tcp_tls_x509_reject_anon_client(void)
  *
  *  And see that it works
  */
-static void test_multifd_tcp_cancel(void)
+static void test_multifd_tcp_cancel_common(bool use_dsa)
 {
     MigrateStart args = {
         .hide_stderr = true,
@@ -3282,6 +3288,11 @@ static void test_multifd_tcp_cancel(void)
     migrate_set_capability(from, "multifd", true);
     migrate_set_capability(to, "multifd", true);
 
+    if (use_dsa) {
+        migrate_set_parameter_str(from, "zero-page-detection", "dsa-accel");
+        migrate_set_parameter_str(from, "accel-path", dsa_dev_path_p);
+    }
+
     /* Start incoming migration from the 1st socket */
     migrate_incoming_qmp(to, "tcp:127.0.0.1:0", "{}");
 
@@ -3340,6 +3351,49 @@ static void test_multifd_tcp_cancel(void)
     test_migrate_end(from, to2, true);
 }
 
+/*
+ * This test does:
+ *  source               target
+ *                       migrate_incoming
+ *     migrate
+ *     migrate_cancel
+ *                       launch another target
+ *     migrate
+ *
+ *  And see that it works
+ */
+static void test_multifd_tcp_cancel(void)
+{
+    test_multifd_tcp_cancel_common(false);
+}
+
+#ifdef CONFIG_DSA_OPT
+
+static void *test_migrate_precopy_tcp_multifd_start_dsa(QTestState *from,
+                                                        QTestState *to)
+{
+    migrate_set_parameter_str(from, "zero-page-detection", "dsa-accel");
+    migrate_set_parameter_str(from, "accel-path", dsa_dev_path_p);
+    return test_migrate_precopy_tcp_multifd_start_common(from, to, "none");
+}
+
+static void test_multifd_tcp_zero_page_dsa(void)
+{
+    MigrateCommon args = {
+        .listen_uri = "defer",
+        .start_hook = test_migrate_precopy_tcp_multifd_start_dsa,
+    };
+
+    test_precopy_common(&args);
+}
+
+static void test_multifd_tcp_cancel_dsa(void)
+{
+    test_multifd_tcp_cancel_common(true);
+}
+
+#endif
+
 static void calc_dirty_rate(QTestState *who, uint64_t calc_time)
 {
     qtest_qmp_assert_success(who,
@@ -3767,6 +3821,20 @@ static bool kvm_dirty_ring_supported(void)
 #endif
 }
 
+#ifdef CONFIG_DSA_OPT
+static const char *dsa_dev_path = "/dev/dsa/wq4.0";
+static int test_dsa_setup(void)
+{
+    int fd;
+    fd = open(dsa_dev_path, O_RDWR);
+    if (fd < 0) {
+        return -1;
+    }
+    close(fd);
+    return 0;
+}
+#endif
+
 int main(int argc, char **argv)
 {
     bool has_kvm, has_tcg;
@@ -3979,6 +4047,16 @@ int main(int argc, char **argv)
                        test_multifd_tcp_zero_page_legacy);
     migration_test_add("/migration/multifd/tcp/plain/zero-page/none",
                        test_multifd_tcp_no_zero_page);
+
+#ifdef CONFIG_DSA_OPT
+    if (g_str_equal(arch, "x86_64") && test_dsa_setup() == 0) {
+        migration_test_add("/migration/multifd/tcp/plain/zero-page/dsa",
+                       test_multifd_tcp_zero_page_dsa);
+        migration_test_add("/migration/multifd/tcp/plain/cancel/dsa",
+                       test_multifd_tcp_cancel_dsa);
+    }
+#endif
+
     migration_test_add("/migration/multifd/tcp/plain/cancel",
                        test_multifd_tcp_cancel);
     migration_test_add("/migration/multifd/tcp/plain/zlib",
-- 
Yichen Wang




* [PATCH v7 12/12] migration/doc: Add DSA zero page detection doc
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (10 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 11/12] migration/multifd: Add integration tests for multifd with Intel DSA offloading Yichen Wang
@ 2024-11-14 22:01 ` Yichen Wang
  2024-11-25 18:28   ` Fabiano Rosas
  2024-11-19 21:31 ` [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Fabiano Rosas
  12 siblings, 1 reply; 30+ messages in thread
From: Yichen Wang @ 2024-11-14 22:01 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

From: Yuan Liu <yuan1.liu@intel.com>

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
 .../migration/dsa-zero-page-detection.rst     | 290 ++++++++++++++++++
 docs/devel/migration/features.rst             |   1 +
 2 files changed, 291 insertions(+)
 create mode 100644 docs/devel/migration/dsa-zero-page-detection.rst

diff --git a/docs/devel/migration/dsa-zero-page-detection.rst b/docs/devel/migration/dsa-zero-page-detection.rst
new file mode 100644
index 0000000000..1279fcdd99
--- /dev/null
+++ b/docs/devel/migration/dsa-zero-page-detection.rst
@@ -0,0 +1,290 @@
+=============================
+DSA-Based Zero Page Detection
+=============================
+Intel Data Streaming Accelerator (``DSA``) was introduced with Intel's 4th
+generation Xeon server, aka Sapphire Rapids (``SPR``). One of the things
+DSA can do is offload memory comparison workloads from the CPU to the DSA
+accelerator hardware.
+
+The main advantages of using DSA to accelerate zero-page detection include:
+
+1. Reduces CPU usage in multifd live migration workflow across all use cases.
+
+2. Reduces migration total time in some use cases.
+
+
+DSA-Based Zero Page Detection Introduction
+==========================================
+
+::
+
+
+  +----------------+       +------------------+
+  | MultiFD Thread |       |accel-config tool |
+  +-+--------+-----+       +--------+---------+
+    |        |                      |
+    |        |  Open DSA            | Setup DSA
+    |        |  Work Queues         | Resources
+    |        |       +-----+-----+  |
+    |        +------>|idxd driver|<-+
+    |                +-----+-----+
+    |                      |
+    |                      |
+    |                +-----+-----+
+    +----------------+DSA Devices|
+      Submit jobs    +-----------+
+      via enqcmd
+
+
+DSA Introduction
+----------------
+Intel Data Streaming Accelerator (DSA) is a high-performance data copy and
+transformation accelerator that is integrated in Intel Xeon processors,
+targeted for optimizing streaming data movement and transformation operations
+common with applications for high-performance storage, networking, persistent
+memory, and various data processing applications.
+
+For more ``DSA`` introduction, please refer to `DSA Introduction
+<https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/data-streaming-accelerator.html>`_
+
+For ``DSA`` specification, please refer to `DSA Specification
+<https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf>`_
+
+For ``DSA`` user guide, please refer to `DSA User Guide
+<https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html>`_
+
+DSA Device Management
+---------------------
+
+The number of ``DSA`` devices varies depending on the Xeon product model.
+On a ``SPR`` server, there can be a maximum of 8 ``DSA`` devices, with up to
+4 devices per socket.
+
+By default, all ``DSA`` devices are disabled and need to be configured and
+enabled by users manually.
+
+Check the number of devices with the following command:
+
+.. code-block:: shell
+
+  #lspci -d 8086:0b25
+  6a:01.0 System peripheral: Intel Corporation Device 0b25
+  6f:01.0 System peripheral: Intel Corporation Device 0b25
+  74:01.0 System peripheral: Intel Corporation Device 0b25
+  79:01.0 System peripheral: Intel Corporation Device 0b25
+  e7:01.0 System peripheral: Intel Corporation Device 0b25
+  ec:01.0 System peripheral: Intel Corporation Device 0b25
+  f1:01.0 System peripheral: Intel Corporation Device 0b25
+  f6:01.0 System peripheral: Intel Corporation Device 0b25
+
+
+DSA Device Configuration And Enabling
+-------------------------------------
+
+The ``accel-config`` tool is used to enable ``DSA`` devices and configure
+``DSA`` hardware resources (work queues and engines). One ``DSA`` device
+has 8 work queues and 4 processing engines; multiple engines can be assigned
+to a work queue via the ``group`` attribute.
+
+For ``accel-config`` installation, please refer to `accel-config installation
+<https://github.com/intel/idxd-config>`_
+
+One example of configuring and enabling a ``DSA`` device:
+
+.. code-block:: shell
+
+  #accel-config config-engine dsa0/engine0.0 -g 0
+  #accel-config config-engine dsa0/engine0.1 -g 0
+  #accel-config config-engine dsa0/engine0.2 -g 0
+  #accel-config config-engine dsa0/engine0.3 -g 0
+  #accel-config config-wq dsa0/wq0.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m shared -y user -n app1 -d user
+  #accel-config enable-device dsa0
+  #accel-config enable-wq dsa0/wq0.0
+
+- The ``DSA`` device index is 0, use ``ls -lh /sys/bus/dsa/devices/dsa*``
+  command to query the ``DSA`` device index.
+
+- 4 engines and 1 work queue are configured in group 0, so that all zero-page
+  detection jobs submitted to this work queue can be processed by all engines
+  simultaneously.
+
+- Set work queue attributes including the work mode, work queue size and so on.
+
+- Enable the ``dsa0`` device and work queue ``dsa0/wq0.0``
+
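+A quick way to verify the result (a sketch; device and queue names will
+vary with the configuration) is to list the enabled devices and check that
+the work queue node has appeared:
+
+.. code-block:: shell
+
+  #accel-config list
+  #ls /dev/dsa
+  wq0.0
+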
+.. note::
+
+   1. The ``DSA`` device driver is the Intel Data Accelerator Driver (idxd);
+      Linux kernel 5.18 or later is recommended.
+
+   2. Only the ``DSA`` shared work queue mode is supported; it requires adding
+      the ``"intel_iommu=on,sm_on"`` parameters to the kernel command line.
+
+For more detailed configuration, please refer to `DSA Configuration Samples
+<https://github.com/intel/idxd-config/tree/stable/Documentation/accfg>`_
+
+
+Performance
+============
+We use two Intel 4th generation Xeon servers for testing.
+
+::
+
+    Architecture:        x86_64
+    CPU(s):              192
+    Thread(s) per core:  2
+    Core(s) per socket:  48
+    Socket(s):           2
+    NUMA node(s):        2
+    Vendor ID:           GenuineIntel
+    CPU family:          6
+    Model:               143
+    Model name:          Intel(R) Xeon(R) Platinum 8457C
+    Stepping:            8
+    CPU MHz:             2538.624
+    CPU max MHz:         3800.0000
+    CPU min MHz:         800.0000
+
+We perform multifd live migration with the setup below:
+
+1. VM has 100GB memory.
+
+2. Use the new migration option multifd-set-normal-page-ratio to control the
+   total size of the payload sent over the network.
+
+3. Use 8 multifd channels.
+
+4. Use tcp for live migration.
+
+5. Use CPU to perform zero page checking as the baseline.
+
+6. Use one DSA device to offload zero page checking to compare with the baseline.
+
+7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
+
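+The CPU usage numbers below were collected with commands of roughly this
+shape (a sketch; the exact options used may have differed):
+
+.. code-block:: shell
+
+  #perf sched record -- sleep 60
+  #perf sched timehist -s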
+
+A) Scenario 1: 50% (50GB) normal pages on a 100GB VM
+-----------------------------------------------------
+
+::
+
+	CPU usage
+
+	|---------------|---------------|---------------|---------------|
+	|		|comm		|runtime(msec)	|totaltime(msec)|
+	|---------------|---------------|---------------|---------------|
+	|Baseline	|live_migration	|5657.58	|		|
+	|		|multifdsend_0	|3931.563	|		|
+	|		|multifdsend_1	|4405.273	|		|
+	|		|multifdsend_2	|3941.968	|		|
+	|		|multifdsend_3	|5032.975	|		|
+	|		|multifdsend_4	|4533.865	|		|
+	|		|multifdsend_5	|4530.461	|		|
+	|		|multifdsend_6	|5171.916	|		|
+	|		|multifdsend_7	|4722.769	|41922		|
+	|---------------|---------------|---------------|---------------|
+	|DSA		|live_migration	|6129.168	|		|
+	|		|multifdsend_0	|2954.717	|		|
+	|		|multifdsend_1	|2766.359	|		|
+	|		|multifdsend_2	|2853.519	|		|
+	|		|multifdsend_3	|2740.717	|		|
+	|		|multifdsend_4	|2824.169	|		|
+	|		|multifdsend_5	|2966.908	|		|
+	|		|multifdsend_6	|2611.137	|		|
+	|		|multifdsend_7	|3114.732	|		|
+	|		|dsa_completion	|3612.564	|32568		|
+	|---------------|---------------|---------------|---------------|
+
+Baseline total runtime is calculated by adding up all multifdsend_X
+and live_migration threads runtime. DSA offloading total runtime is
+calculated by adding up all multifdsend_X, live_migration and
+dsa_completion threads runtime. Comparing 41922 msec vs 32568 msec of
+runtime, that is roughly a 22% total CPU usage saving.
+
+::
+
+	Latency
+	|---------------|---------------|---------------|---------------|---------------|---------------|
+	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
+	|---------------|---------------|---------------|---------------|---------------|---------------|
+	|Baseline	|10343 ms	|161 ms		|41007.00 mbps	|51583797 kb	|102400520 kb	|
+	|---------------|---------------|---------------|---------------|-------------------------------|
+	|DSA offload	|9535 ms	|135 ms		|46554.40 mbps	|53947545 kb	|102400520 kb	|
+	|---------------|---------------|---------------|---------------|---------------|---------------|
+
+Total time is 8% faster and down time is 16% faster.
+
+
+B) Scenario 2: 100% (100GB) zero pages on a 100GB VM
+-----------------------------------------------------
+
+::
+
+	CPU usage
+	|---------------|---------------|---------------|---------------|
+	|		|comm		|runtime(msec)	|totaltime(msec)|
+	|---------------|---------------|---------------|---------------|
+	|Baseline	|live_migration	|4860.718	|		|
+	|	 	|multifdsend_0	|748.875	|		|
+	|		|multifdsend_1	|898.498	|		|
+	|		|multifdsend_2	|787.456	|		|
+	|		|multifdsend_3	|764.537	|		|
+	|		|multifdsend_4	|785.687	|		|
+	|		|multifdsend_5	|756.941	|		|
+	|		|multifdsend_6	|774.084	|		|
+	|		|multifdsend_7	|782.900	|11154		|
+	|---------------|---------------|-------------------------------|
+	|DSA offloading	|live_migration	|3846.976	|		|
+	|		|multifdsend_0	|191.880	|		|
+	|		|multifdsend_1	|166.331	|		|
+	|		|multifdsend_2	|168.528	|		|
+	|		|multifdsend_3	|197.831	|		|
+	|		|multifdsend_4	|169.580	|		|
+	|		|multifdsend_5	|167.984	|		|
+	|		|multifdsend_6	|198.042	|		|
+	|		|multifdsend_7	|170.624	|		|
+	|		|dsa_completion	|3428.669	|8700		|
+	|---------------|---------------|---------------|---------------|
+
+Baseline total runtime is 11154 msec and DSA offloading total runtime is
+8700 msec. That is 22% CPU savings.
+
+::
+
+	Latency
+	|--------------------------------------------------------------------------------------------|
+	|		|total time	|down time	|throughput	|transferred-ram|total-ram   |
+	|---------------|---------------|---------------|---------------|---------------|------------|
+	|Baseline	|4867 ms	|20 ms		|1.51 mbps	|565 kb		|102400520 kb|
+	|---------------|---------------|---------------|---------------|----------------------------|
+	|DSA offload	|3888 ms	|18 ms		|1.89 mbps	|565 kb		|102400520 kb|
+	|---------------|---------------|---------------|---------------|---------------|------------|
+
+Total time is 20% faster and down time is 10% faster.
+
+
+How To Use DSA In Migration
+===========================
+
+The migration parameter ``accel-path`` is used to specify the resource
+allocation for DSA. After the user configures
+``zero-page-detection=dsa-accel``, one or more DSA work queues need to be
+specified for migration.
+
+The following example shows two DSA work queues used for zero page detection:
+
+.. code-block:: shell
+
+   migrate_set_parameter zero-page-detection=dsa-accel
+   migrate_set_parameter accel-path=dsa:/dev/dsa/wq0.0 dsa:/dev/dsa/wq1.0
+
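+The QMP equivalent (a sketch based on the list-typed ``accel-path``
+parameter added by this series) could be:
+
+.. code-block:: json
+
+   { "execute": "migrate-set-parameters",
+     "arguments": { "zero-page-detection": "dsa-accel",
+                    "accel-path": [ "dsa:/dev/dsa/wq0.0",
+                                    "dsa:/dev/dsa/wq1.0" ] } }
+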
+.. note::
+
+  Accessing DSA resources requires the ``sudo`` command or ``root``
+  privileges by default. Administrators can modify the DSA device node
+  ownership so that QEMU can use DSA with the specified user's permissions.
+
+  For example:
+
+  .. code-block:: shell
+
+     #chown -R qemu /dev/dsa
+
diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
index 8f431d52f9..ea2893d80f 100644
--- a/docs/devel/migration/features.rst
+++ b/docs/devel/migration/features.rst
@@ -15,3 +15,4 @@ Migration has plenty of features to support different use cases.
    qpl-compression
    uadk-compression
    qatzip-compression
+   dsa-zero-page-detection
-- 
Yichen Wang




* Re: [PATCH v7 08/12] migration/multifd: Add new migration option for multifd DSA offloading.
  2024-11-14 22:01 ` [PATCH v7 08/12] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
@ 2024-11-15 14:32   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2024-11-15 14:32 UTC (permalink / raw)
  To: Yichen Wang
  Cc: Peter Xu, Fabiano Rosas, Paolo Bonzini, Marc-André Lureau,
	Daniel P. Berrangé, Philippe Mathieu-Daudé, Eric Blake,
	Markus Armbruster, Michael S. Tsirkin, Cornelia Huck, qemu-devel,
	Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang

* Yichen Wang (yichen.wang@bytedance.com) wrote:
> From: Hao Xiang <hao.xiang@linux.dev>
> 
> Intel DSA offloading is an optional feature that turns on if
> proper hardware and software stack is available. To turn on
> DSA offloading in multifd live migration by setting:
> 
> zero-page-detection=dsa-accel
> dsa-accel-path="dsa:<dsa_dev_path1> dsa:[dsa_dev_path2] ..."

  ^^^^
oops, commit message needs updating, but other than that,
for HMP:

Acked-by: Dr. David Alan Gilbert <dave@treblig.org>

Thanks for making the changes,

Dave

> This feature is turned off by default.
> 
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  hmp-commands.hx                |  2 +-
>  include/qemu/dsa.h             | 13 +++++++++++++
>  migration/migration-hmp-cmds.c | 19 ++++++++++++++++++-
>  migration/options.c            | 30 ++++++++++++++++++++++++++++++
>  migration/options.h            |  1 +
>  qapi/migration.json            | 32 ++++++++++++++++++++++++++++----
>  util/dsa.c                     | 31 +++++++++++++++++++++++++++++++
>  7 files changed, 122 insertions(+), 6 deletions(-)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 06746f0afc..0e04eac7c7 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -1009,7 +1009,7 @@ ERST
>  
>      {
>          .name       = "migrate_set_parameter",
> -        .args_type  = "parameter:s,value:s",
> +        .args_type  = "parameter:s,value:S",
>          .params     = "parameter value",
>          .help       = "Set the parameter for migration",
>          .cmd        = hmp_migrate_set_parameter,
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> index 8284804a32..258860bd20 100644
> --- a/include/qemu/dsa.h
> +++ b/include/qemu/dsa.h
> @@ -100,6 +100,13 @@ void qemu_dsa_stop(void);
>   */
>  void qemu_dsa_cleanup(void);
>  
> +/**
> + * @brief Check if DSA is supported.
> + *
> + * @return True if DSA is supported, otherwise false.
> + */
> +bool qemu_dsa_is_supported(void);
> +
>  /**
>   * @brief Check if DSA is running.
>   *
> @@ -141,6 +148,12 @@ buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
>  
>  typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
>  
> +static inline bool qemu_dsa_is_supported(void)
> +{
> +    return false;
> +}
> +
> +
>  static inline bool qemu_dsa_is_running(void)
>  {
>      return false;
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 20d1a6e219..01c528b80a 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -312,7 +312,16 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>          monitor_printf(mon, "%s: '%s'\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
>              params->tls_authz);
> -
> +        if (params->has_accel_path) {
> +            strList *accel_path = params->accel_path;
> +            monitor_printf(mon, "%s:",
> +                MigrationParameter_str(MIGRATION_PARAMETER_ACCEL_PATH));
> +            while (accel_path) {
> +                monitor_printf(mon, " '%s'", accel_path->value);
> +                accel_path = accel_path->next;
> +            }
> +            monitor_printf(mon, "\n");
> +        }
>          if (params->has_block_bitmap_mapping) {
>              const BitmapMigrationNodeAliasList *bmnal;
>  
> @@ -563,6 +572,14 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          p->has_x_checkpoint_delay = true;
>          visit_type_uint32(v, param, &p->x_checkpoint_delay, &err);
>          break;
> +    case MIGRATION_PARAMETER_ACCEL_PATH:
> +        p->has_accel_path = true;
> +        g_autofree char **strv = g_strsplit(valuestr ? : "", " ", -1);
> +        strList **tail = &p->accel_path;
> +        for (int i = 0; strv[i]; i++) {
> +            QAPI_LIST_APPEND(tail, strv[i]);
> +        }
> +        break;
>      case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
>          p->has_multifd_channels = true;
>          visit_type_uint8(v, param, &p->multifd_channels, &err);
> diff --git a/migration/options.c b/migration/options.c
> index ad8d6989a8..ca89fdc4f4 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -13,6 +13,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/error-report.h"
> +#include "qemu/dsa.h"
>  #include "exec/target_page.h"
>  #include "qapi/clone-visitor.h"
>  #include "qapi/error.h"
> @@ -809,6 +810,13 @@ const char *migrate_tls_creds(void)
>      return s->parameters.tls_creds;
>  }
>  
> +const strList *migrate_accel_path(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->parameters.accel_path;
> +}
> +
>  const char *migrate_tls_hostname(void)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -922,6 +930,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>      params->zero_page_detection = s->parameters.zero_page_detection;
>      params->has_direct_io = true;
>      params->direct_io = s->parameters.direct_io;
> +    params->has_accel_path = true;
> +    params->accel_path = QAPI_CLONE(strList, s->parameters.accel_path);
>  
>      return params;
>  }
> @@ -930,6 +940,7 @@ void migrate_params_init(MigrationParameters *params)
>  {
>      params->tls_hostname = g_strdup("");
>      params->tls_creds = g_strdup("");
> +    params->accel_path = NULL;
>  
>      /* Set has_* up only for parameter checks */
>      params->has_throttle_trigger_threshold = true;
> @@ -1142,6 +1153,14 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
>          return false;
>      }
>  
> +    if (params->has_zero_page_detection &&
> +        params->zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
> +        if (!qemu_dsa_is_supported()) {
> +            error_setg(errp, "DSA acceleration is not supported.");
> +            return false;
> +        }
> +    }
> +
>      return true;
>  }
>  
> @@ -1255,6 +1274,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>      if (params->has_direct_io) {
>          dest->direct_io = params->direct_io;
>      }
> +
> +    if (params->has_accel_path) {
> +        dest->has_accel_path = true;
> +        dest->accel_path = params->accel_path;
> +    }
>  }
>  
>  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> @@ -1387,6 +1411,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>      if (params->has_direct_io) {
>          s->parameters.direct_io = params->direct_io;
>      }
> +    if (params->has_accel_path) {
> +        qapi_free_strList(s->parameters.accel_path);
> +        s->parameters.has_accel_path = true;
> +        s->parameters.accel_path =
> +            QAPI_CLONE(strList, params->accel_path);
> +    }
>  }
>  
>  void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> diff --git a/migration/options.h b/migration/options.h
> index 79084eed0d..3d1e91dc52 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -84,6 +84,7 @@ const char *migrate_tls_creds(void);
>  const char *migrate_tls_hostname(void);
>  uint64_t migrate_xbzrle_cache_size(void);
>  ZeroPageDetection migrate_zero_page_detection(void);
> +const strList *migrate_accel_path(void);
>  
>  /* parameters helpers */
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index a605dc26db..389776065d 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -629,10 +629,14 @@
>  #     multifd migration is enabled, else in the main migration thread
>  #     as for @legacy.
>  #
> +# @dsa-accel: Perform zero page checking with the DSA accelerator
> +#     offloading in multifd sender thread if multifd migration is
> +#     enabled, else in the main migration thread as for @legacy.
> +#
>  # Since: 9.0
>  ##
>  { 'enum': 'ZeroPageDetection',
> -  'data': [ 'none', 'legacy', 'multifd' ] }
> +  'data': [ 'none', 'legacy', 'multifd', 'dsa-accel' ] }
>  
>  ##
>  # @BitmapMigrationBitmapAliasTransform:
> @@ -840,6 +844,12 @@
>  #     See description in @ZeroPageDetection.  Default is 'multifd'.
>  #     (since 9.0)
>  #
> +# @accel-path: If enabled, specify the accelerator paths to be
> +#     used in QEMU. For example, enable the DSA accelerator for zero
> +#     page detection offloading by setting @zero-page-detection to
> +#     dsa-accel, and define accel-path as "dsa:<dsa_device path>".
> +#     This parameter defaults to an empty list.  (Since 9.2)
> +#
>  # @direct-io: Open migration files with O_DIRECT when possible.  This
>  #     only has effect if the @mapped-ram capability is enabled.
>  #     (Since 9.1)
> @@ -858,7 +868,7 @@
>             'cpu-throttle-initial', 'cpu-throttle-increment',
>             'cpu-throttle-tailslow',
>             'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> -           'avail-switchover-bandwidth', 'downtime-limit',
> +           'avail-switchover-bandwidth', 'downtime-limit', 'accel-path',
>             { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
>             'multifd-channels',
>             'xbzrle-cache-size', 'max-postcopy-bandwidth',
> @@ -1021,6 +1031,12 @@
>  #     See description in @ZeroPageDetection.  Default is 'multifd'.
>  #     (since 9.0)
>  #
> +# @accel-path: If enabled, specify the accelerator paths to be
> +#     used in QEMU. For example, enable the DSA accelerator for zero
> +#     page detection offloading by setting @zero-page-detection to
> +#     dsa-accel, and define accel-path as "dsa:<dsa_device path>".
> +#     This parameter defaults to an empty list.  (Since 9.2)
> +#
>  # @direct-io: Open migration files with O_DIRECT when possible.  This
>  #     only has effect if the @mapped-ram capability is enabled.
>  #     (Since 9.1)
> @@ -1066,7 +1082,8 @@
>              '*vcpu-dirty-limit': 'uint64',
>              '*mode': 'MigMode',
>              '*zero-page-detection': 'ZeroPageDetection',
> -            '*direct-io': 'bool' } }
> +            '*direct-io': 'bool',
> +            '*accel-path': [ 'str' ] } }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1231,6 +1248,12 @@
>  #     See description in @ZeroPageDetection.  Default is 'multifd'.
>  #     (since 9.0)
>  #
> +# @accel-path: If enabled, specify the accelerator paths to be
> +#     used in QEMU. For example, enable the DSA accelerator for zero
> +#     page detection offloading by setting @zero-page-detection to
> +#     dsa-accel, and define accel-path as "dsa:<dsa_device path>".
> +#     This parameter defaults to an empty list.  (Since 9.2)
> +#
>  # @direct-io: Open migration files with O_DIRECT when possible.  This
>  #     only has effect if the @mapped-ram capability is enabled.
>  #     (Since 9.1)
> @@ -1273,7 +1296,8 @@
>              '*vcpu-dirty-limit': 'uint64',
>              '*mode': 'MigMode',
>              '*zero-page-detection': 'ZeroPageDetection',
> -            '*direct-io': 'bool' } }
> +            '*direct-io': 'bool',
> +            '*accel-path': [ 'str' ] } }
>  
>  ##
>  # @query-migrate-parameters:
> diff --git a/util/dsa.c b/util/dsa.c
> index 50f53ec24b..18ed36e354 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -23,6 +23,7 @@
>  #include "qemu/bswap.h"
>  #include "qemu/error-report.h"
>  #include "qemu/rcu.h"
> +#include <cpuid.h>
>  
>  #pragma GCC push_options
>  #pragma GCC target("enqcmd")
> @@ -689,6 +690,36 @@ static void dsa_completion_thread_stop(void *opaque)
>      qemu_sem_destroy(&thread_context->sem_init_done);
>  }
>  
> +/**
> + * @brief Check if DSA is supported.
> + *
> + * @return True if DSA is supported, otherwise false.
> + */
> +bool qemu_dsa_is_supported(void)
> +{
> +    /*
> +     * movdir64b is indicated by bit 28 of ecx in CPUID leaf 7, subleaf 0.
> +     * enqcmd is indicated by bit 29 of ecx in CPUID leaf 7, subleaf 0.
> +     * Doc: https://cdrdv2-public.intel.com/819680/architecture-instruction-\
> +     *      set-extensions-programming-reference.pdf
> +     */
> +    uint32_t eax, ebx, ecx, edx;
> +    bool movedirb_enabled;
> +    bool enqcmd_enabled;
> +
> +    __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx);
> +    movedirb_enabled = (ecx >> 28) & 0x1;
> +    if (!movedirb_enabled) {
> +        return false;
> +    }
> +    enqcmd_enabled = (ecx >> 29) & 0x1;
> +    if (!enqcmd_enabled) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
>  /**
>   * @brief Check if DSA is running.
>   *
> -- 
> Yichen Wang
> 
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/



* Re: [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration.
  2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
                   ` (11 preceding siblings ...)
  2024-11-14 22:01 ` [PATCH v7 12/12] migration/doc: Add DSA zero page detection doc Yichen Wang
@ 2024-11-19 21:31 ` Fabiano Rosas
  2024-11-26  4:43   ` [External] " Yichen Wang
  12 siblings, 1 reply; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-19 21:31 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> v7
> * Rebase on top of f0a5a31c33a8109061c2493e475c8a2f4d022432;
> * Fix a bug that will crash QEMU when DSA initialization failed;
> * Use a more generalized accel-path to support other accelerators;
> * Remove multifd-packet-size in the parameter list;
>
> v6
> * Rebase on top of 838fc0a8769d7cc6edfe50451ba4e3368395f5c1;
> * Refactor code to have clean history on all commits;
> * Add comments on DSA specific defines about how the value is picked;
> * Address all comments from v5 reviews about api defines, questions, etc.;
>
> v5
> * Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
> * Rename struct definitions with typedef and CamelCase names;
> * Add build and runtime checks about DSA accelerator;
> * Address all comments from v4 reviews about typos, licenses, comments,
> error reporting, etc.
>
> v4
> * Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
> * A separate "multifd zero page checking" patchset was split from this
> patchset's v3 and got merged into master. v4 re-applied the rest of all
> commits on top of that patchset, re-factored and re-tested.
> https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
> * There is some feedback from v3 that I likely overlooked.
>
> v3
> * Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
> * Fix error/warning from checkpatch.pl
> * Fix use-after-free bug when multifd-dsa-accel option is not set.
> * Handle error from dsa_init and correctly propagate the error.
> * Remove unnecessary call to dsa_stop.
> * Detect availability of DSA feature at compile time.
> * Implement a generic batch_task structure and a DSA specific one dsa_batch_task.
> * Remove all exit() calls and propagate errors correctly.
> * Use bytes instead of page count to configure multifd-packet-size option.
>
> v2
> * Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
> * Leave Juan's changes in their original form instead of squashing them.
> * Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
> * Use page count to configure multifd-packet-size option.
> * Don't use the FLAKY flag in DSA tests.
> * Test if DSA integration test is set up correctly and skip the test
>   if not.
> * Fixed broken link in the previous patch cover.
>
> * Background:
>
> I posted an RFC about DSA offloading in QEMU:
> https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/
>
> This patchset implements the DSA offloading on zero page checking in
> multifd live migration code path.
>
> * Overview:
>
> Intel Data Streaming Accelerator(DSA) is introduced in Intel's 4th generation
> Xeon server, aka Sapphire Rapids.
> https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
> https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
> One of the things DSA can do is to offload memory comparison workload from
> CPU to DSA accelerator hardware. This patchset implements a solution to offload
> QEMU's zero page checking from CPU to DSA accelerator hardware. We gain
> two benefits from this change:
> 1. Reduces CPU usage in multifd live migration workflow across all use
> cases.
> 2. Reduces migration total time in some use cases. 
>
> * Design:
>
> These are the logical steps to perform DSA offloading:
> 1. Configure DSA accelerators and create user space openable DSA work
> queues via the idxd driver.
> 2. Map DSA's work queue into a user space address space.
> 3. Fill an in-memory task descriptor to describe the memory operation.
> 4. Use dedicated CPU instruction _enqcmd to queue a task descriptor to
> the work queue.
> 5. Pull the task descriptor's completion status field until the task
> completes.
> 6. Check return status.
>
> The memory operation is now totally done by the accelerator hardware but
> the new workflow introduces overheads. The overhead is the extra cost CPU
> prepares and submits the task descriptors and the extra cost CPU pulls for
> completion. The design is around minimizing these two overheads.
>
> 1. In order to reduce the overhead on task preparation and submission,
> we use batch descriptors. A batch descriptor will contain N individual
> zero page checking tasks where the default N is 128 (default packet size
> / page size) and we can increase N by setting the packet size via a new
> migration option.
> 2. The multifd sender threads prepares and submits batch tasks to DSA
> hardware and it waits on a synchronization object for task completion.
> Whenever a DSA task is submitted, the task structure is added to a
> thread safe queue. It's safe to have multiple multifd sender threads to
> submit tasks concurrently.
> 3. Multiple DSA hardware devices can be used. During multifd initialization,
> every sender thread will be assigned a DSA device to work with. We
> use a round-robin scheme to evenly distribute the work across all used
> DSA devices.
> 4. Use a dedicated thread dsa_completion to perform busy pulling for all
> DSA task completions. The thread keeps dequeuing DSA tasks from the
> thread safe queue. The thread blocks when there is no outstanding DSA
> task. When pulling for completion of a DSA task, the thread uses CPU
> instruction _mm_pause between the iterations of a busy loop to save some
> CPU power as well as optimizing core resources for the other hypercore.
> 5. DSA accelerator can encounter errors. The most popular error is a
> page fault. We have tested using devices to handle page faults but
> performance is bad. Right now, if DSA hits a page fault, we fallback to
> use CPU to complete the rest of the work. The CPU fallback is done in
> the multifd sender thread.
> 6. Added a new migration option multifd-dsa-accel to set the DSA device
> path. If set, the multifd workflow will leverage the DSA devices for
> offloading.
> 7. Added a new migration option multifd-normal-page-ratio to make
> multifd live migration easier to test. Setting a normal page ratio will
> make live migration recognize a zero page as a normal page and send
> the entire payload over the network. If we want to send a large network
> payload and analyze throughput, this option is useful.
> 8. Added a new migration option multifd-packet-size. This can increase
> the number of pages being zero page checked and sent over the network.
> The extra synchronization between the sender threads and the dsa
> completion thread is an overhead. Using a large packet size can reduce
> that overhead.
>
> * Performance:
>
> We use two Intel 4th generation Xeon servers for testing.
>
> Architecture:        x86_64
> CPU(s):              192
> Thread(s) per core:  2
> Core(s) per socket:  48
> Socket(s):           2
> NUMA node(s):        2
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               143
> Model name:          Intel(R) Xeon(R) Platinum 8457C
> Stepping:            8
> CPU MHz:             2538.624
> CPU max MHz:         3800.0000
> CPU min MHz:         800.0000
>
> We perform multifd live migration with below setup:
> 1. VM has 100GB memory. 
> 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> size of the payload sent over the network.
> 3. Use 8 multifd channels.
> 4. Use tcp for live migration.
> 5. Use CPU to perform zero page checking as the baseline.
> 6. Use one DSA device to offload zero page checking to compare with the baseline.
> 7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
>
> A) Scenario 1: 50% (50GB) normal pages on an 100GB vm.
>
> 	CPU usage
>
> 	|---------------|---------------|---------------|---------------|
> 	|		|comm		|runtime(msec)	|totaltime(msec)|
> 	|---------------|---------------|---------------|---------------|
> 	|Baseline	|live_migration	|5657.58	|		|
> 	|		|multifdsend_0	|3931.563	|		|
> 	|		|multifdsend_1	|4405.273	|		|
> 	|		|multifdsend_2	|3941.968	|		|
> 	|		|multifdsend_3	|5032.975	|		|
> 	|		|multifdsend_4	|4533.865	|		|
> 	|		|multifdsend_5	|4530.461	|		|
> 	|		|multifdsend_6	|5171.916	|		|
> 	|		|multifdsend_7	|4722.769	|41922		|
> 	|---------------|---------------|---------------|---------------|
> 	|DSA		|live_migration	|6129.168	|		|
> 	|		|multifdsend_0	|2954.717	|		|
> 	|		|multifdsend_1	|2766.359	|		|
> 	|		|multifdsend_2	|2853.519	|		|
> 	|		|multifdsend_3	|2740.717	|		|
> 	|		|multifdsend_4	|2824.169	|		|
> 	|		|multifdsend_5	|2966.908	|		|
> 	|		|multifdsend_6	|2611.137	|		|
> 	|		|multifdsend_7	|3114.732	|		|
> 	|		|dsa_completion	|3612.564	|32568		|
> 	|---------------|---------------|---------------|---------------|
>
> Baseline total runtime is calculated by adding up all multifdsend_X
> and live_migration threads runtime. DSA offloading total runtime is
> calculated by adding up all multifdsend_X, live_migration and
> dsa_completion threads runtime. Comparing 41922 msec vs 32568 msec of
> runtime, that is roughly a 22% total CPU usage saving.
>
> 	Latency
> 	|---------------|---------------|---------------|---------------|---------------|---------------|
> 	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
> 	|---------------|---------------|---------------|---------------|---------------|---------------|	
> 	|Baseline	|10343 ms	|161 ms		|41007.00 mbps	|51583797 kb	|102400520 kb	|
> 	|---------------|---------------|---------------|---------------|-------------------------------|
> 	|DSA offload	|9535 ms	|135 ms		|46554.40 mbps	|53947545 kb	|102400520 kb	|	
> 	|---------------|---------------|---------------|---------------|---------------|---------------|
>
> Total time is 8% faster and down time is 16% faster.
>
> B) Scenario 2: 100% (100GB) zero pages on an 100GB vm.
>
> 	CPU usage
> 	|---------------|---------------|---------------|---------------|
> 	|		|comm		|runtime(msec)	|totaltime(msec)|
> 	|---------------|---------------|---------------|---------------|
> 	|Baseline	|live_migration	|4860.718	|		|
> 	|	 	|multifdsend_0	|748.875	|		|
> 	|		|multifdsend_1	|898.498	|		|
> 	|		|multifdsend_2	|787.456	|		|
> 	|		|multifdsend_3	|764.537	|		|
> 	|		|multifdsend_4	|785.687	|		|
> 	|		|multifdsend_5	|756.941	|		|
> 	|		|multifdsend_6	|774.084	|		|
> 	|		|multifdsend_7	|782.900	|11154		|
> 	|---------------|---------------|-------------------------------|
> 	|DSA offloading	|live_migration	|3846.976	|		|
> 	|		|multifdsend_0	|191.880	|		|
> 	|		|multifdsend_1	|166.331	|		|
> 	|		|multifdsend_2	|168.528	|		|
> 	|		|multifdsend_3	|197.831	|		|
> 	|		|multifdsend_4	|169.580	|		|
> 	|		|multifdsend_5	|167.984	|		|
> 	|		|multifdsend_6	|198.042	|		|
> 	|		|multifdsend_7	|170.624	|		|
> 	|		|dsa_completion	|3428.669	|8700		|
> 	|---------------|---------------|---------------|---------------|
>
> Baseline total runtime is 11154 msec and DSA offloading total runtime is
> 8700 msec. That is 22% CPU savings.
>
> 	Latency
> 	|--------------------------------------------------------------------------------------------|
> 	|		|total time	|down time	|throughput	|transferred-ram|total-ram   |
> 	|---------------|---------------|---------------|---------------|---------------|------------|	
> 	|Baseline	|4867 ms	|20 ms		|1.51 mbps	|565 kb		|102400520 kb|
> 	|---------------|---------------|---------------|---------------|----------------------------|
> 	|DSA offload	|3888 ms	|18 ms		|1.89 mbps	|565 kb		|102400520 kb|	
> 	|---------------|---------------|---------------|---------------|---------------|------------|
>
> Total time is 20% faster and down time is 10% faster.
>
> * Testing:
>
> 1. Added unit tests for cover the added code path in dsa.c
> 2. Added integration tests to cover multifd live migration using DSA
> offloading.
>
> Hao Xiang (10):
>   meson: Introduce new instruction set enqcmd to the build system.
>   util/dsa: Implement DSA device start and stop logic.
>   util/dsa: Implement DSA task enqueue and dequeue.
>   util/dsa: Implement DSA task asynchronous completion thread model.
>   util/dsa: Implement zero page checking in DSA task.
>   util/dsa: Implement DSA task asynchronous submission and wait for
>     completion.
>   migration/multifd: Add new migration option for multifd DSA
>     offloading.
>   migration/multifd: Enable DSA offloading in multifd sender path.
>   util/dsa: Add unit test coverage for Intel DSA task submission and
>     completion.
>   migration/multifd: Add integration tests for multifd with Intel DSA
>     offloading.
>
> Yichen Wang (1):
>   util/dsa: Add idxd into linux header copy list.
>
> Yuan Liu (1):
>   migration/doc: Add DSA zero page detection doc
>
>  .../migration/dsa-zero-page-detection.rst     |  290 +++++
>  docs/devel/migration/features.rst             |    1 +
>  hmp-commands.hx                               |    2 +-
>  include/qemu/dsa.h                            |  188 +++
>  meson.build                                   |   14 +
>  meson_options.txt                             |    2 +
>  migration/migration-hmp-cmds.c                |   19 +-
>  migration/multifd-zero-page.c                 |  129 +-
>  migration/multifd.c                           |   29 +-
>  migration/multifd.h                           |    5 +
>  migration/options.c                           |   30 +
>  migration/options.h                           |    1 +
>  qapi/migration.json                           |   32 +-
>  scripts/meson-buildoptions.sh                 |    3 +
>  scripts/update-linux-headers.sh               |    2 +-
>  tests/qtest/migration-test.c                  |   80 +-
>  tests/unit/meson.build                        |    6 +
>  tests/unit/test-dsa.c                         |  503 ++++++++
>  util/dsa.c                                    | 1112 +++++++++++++++++
>  util/meson.build                              |    3 +
>  20 files changed, 2427 insertions(+), 24 deletions(-)
>  create mode 100644 docs/devel/migration/dsa-zero-page-detection.rst
>  create mode 100644 include/qemu/dsa.h
>  create mode 100644 tests/unit/test-dsa.c
>  create mode 100644 util/dsa.c

Hi, take a look at make check, there are some tests failing.

Summary of Failures:                                                                                                                                                                           
                                                                                                                                                                                               
 16/474 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp                     ERROR            0.86s   killed by signal 6 SIGABRT
 18/474 qemu:qtest+qtest-ppc64 / qtest-ppc64/test-hmp                       ERROR            0.93s   killed by signal 6 SIGABRT
 20/474 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp                   ERROR            1.30s   killed by signal 6 SIGABRT
 21/474 qemu:qtest+qtest-s390x / qtest-s390x/test-hmp                       ERROR            0.76s   killed by signal 6 SIGABRT
 22/474 qemu:qtest+qtest-riscv64 / qtest-riscv64/test-hmp                   ERROR            0.60s   killed by signal 6 SIGABRT

Looks like a double-free due to glib autofree. Here's one sample:

#0  __GI_abort () at abort.c:49
#1  0x00007ffff5899c87 in __libc_message (action=do_abort, fmt=0x7ffff59c3138 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#2  0x00007ffff58a1d2a in malloc_printerr (str=0x7ffff59c0e0e "free(): invalid pointer") at malloc.c:5347
#3  0x00007ffff58a37d4 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#4  0x00007ffff78c5639 in g_free (mem=0x5555561200f1 <qemu_mutex_unlock_impl+96>) at ../glib/gmem.c:199
#5  0x0000555555bdd527 in g_autoptr_cleanup_generic_gfree (p=0x7fffffffd568) at /usr/include/glib-2.0/glib/glib-autocleanups.h:28
#6  0x0000555555bdfabc in hmp_migrate_set_parameter (mon=0x7fffffffd6f0, qdict=0x555558554560) at ../migration/migration-hmp-cmds.c:577
#7  0x0000555555c1a231 in handle_hmp_command_exec (mon=0x7fffffffd6f0, cmd=0x5555571e7450 <hmp_cmds+4560>, qdict=0x555558554560) at ../monitor/hmp.c:1106
#8  0x0000555555c1a470 in handle_hmp_command (mon=0x7fffffffd6f0, cmdline=0x5555577ec2f6 "xbzrle-cache-size 64k") at ../monitor/hmp.c:1158
#9  0x0000555555c1c40e in qmp_human_monitor_command (command_line=0x5555577ec2e0 "migrate_set_parameter xbzrle-cache-size 64k", has_cpu_index=false, cpu_index=0, errp=0x7fffffffd800)
    at ../monitor/qmp-cmds.c:181
#10 0x00005555560c7eb6 in qmp_marshal_human_monitor_command (args=0x7fffe000ac00, ret=0x7ffff4d25da8, errp=0x7ffff4d25da0) at qapi/qapi-commands-misc.c:347
#11 0x000055555610e7a4 in do_qmp_dispatch_bh (opaque=0x7ffff4d25e40) at ../qapi/qmp-dispatch.c:128
#12 0x000055555613a1b9 in aio_bh_call (bh=0x7fffe0004050) at ../util/async.c:172
#13 0x000055555613a2d5 in aio_bh_poll (ctx=0x5555573df400) at ../util/async.c:219
#14 0x000055555611b8cd in aio_dispatch (ctx=0x5555573df400) at ../util/aio-posix.c:424
#15 0x000055555613a712 in aio_ctx_dispatch (source=0x5555573df400, callback=0x0, user_data=0x0) at ../util/async.c:361
#16 0x00007ffff78bf82b in g_main_dispatch (context=0x5555573e3440) at ../glib/gmain.c:3381
#17 g_main_context_dispatch (context=0x5555573e3440) at ../glib/gmain.c:4099
#18 0x000055555613bdae in glib_pollfds_poll () at ../util/main-loop.c:287
#19 0x000055555613be28 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:310
#20 0x000055555613bf2d in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
#21 0x0000555555bb455c in qemu_main_loop () at ../system/runstate.c:835
#22 0x00005555560594d1 in qemu_default_main () at ../system/main.c:37
#23 0x000055555605950c in main (argc=18, argv=0x7fffffffdc18) at ../system/main.c:48
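
A plausible reading of this backtrace (an editorial note, not a confirmed
root cause): in the hmp_migrate_set_parameter() hunk above, "g_autofree
char **strv" is declared directly under a switch case without an enclosing
block, so the variable is in scope for every later case while its
initialization is skipped whenever another case is taken. The cleanup then
runs g_free() on an uninitialized stack value, which matches the crash on
"xbzrle-cache-size 64k". A minimal sketch of a fix is to brace the case so
the variable's scope (and thus its cleanup) stays local:

    case MIGRATION_PARAMETER_ACCEL_PATH: {
        g_autofree char **strv = g_strsplit(valuestr ? : "", " ", -1);
        strList **tail = &p->accel_path;

        p->has_accel_path = true;
        for (int i = 0; strv[i]; i++) {
            /*
             * Ownership of each string moves into the strList; g_autofree
             * releases only the array itself.
             */
            QAPI_LIST_APPEND(tail, strv[i]);
        }
        break;
    }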



* Re: [PATCH v7 01/12] meson: Introduce new instruction set enqcmd to the build system.
  2024-11-14 22:01 ` [PATCH v7 01/12] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
@ 2024-11-21 13:51   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-21 13:51 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> Enable instruction set enqcmd in build.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH v7 02/12] util/dsa: Add idxd into linux header copy list.
  2024-11-14 22:01 ` [PATCH v7 02/12] util/dsa: Add idxd into linux header copy list Yichen Wang
@ 2024-11-21 13:51   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-21 13:51 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index 99a8d9fa4c..9128c7499b 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -200,7 +200,7 @@ rm -rf "$output/linux-headers/linux"
>  mkdir -p "$output/linux-headers/linux"
>  for header in const.h stddef.h kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
>                psci.h psp-sev.h userfaultfd.h memfd.h mman.h nvme_ioctl.h \
> -              vduse.h iommufd.h bits.h; do
> +              vduse.h iommufd.h bits.h idxd.h; do
>      cp "$hdrdir/include/linux/$header" "$output/linux-headers/linux"
>  done

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH v7 03/12] util/dsa: Implement DSA device start and stop logic.
  2024-11-14 22:01 ` [PATCH v7 03/12] util/dsa: Implement DSA device start and stop logic Yichen Wang
@ 2024-11-21 14:11   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-21 14:11 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> * DSA device open and close.
> * DSA group contains multiple DSA devices.
> * DSA group configure/start/stop/clean.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  include/qemu/dsa.h | 103 +++++++++++++++++
>  util/dsa.c         | 280 +++++++++++++++++++++++++++++++++++++++++++++
>  util/meson.build   |   3 +
>  3 files changed, 386 insertions(+)
>  create mode 100644 include/qemu/dsa.h
>  create mode 100644 util/dsa.c
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> new file mode 100644
> index 0000000000..71686af28f
> --- /dev/null
> +++ b/include/qemu/dsa.h
> @@ -0,0 +1,103 @@
> +/*
> + * Interface for using Intel Data Streaming Accelerator to offload certain
> + * background operations.
> + *
> + * Copyright (C) Bytedance Ltd.
> + *
> + * Authors:
> + *  Hao Xiang <hao.xiang@bytedance.com>
> + *  Yichen Wang <yichen.wang@bytedance.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef QEMU_DSA_H
> +#define QEMU_DSA_H
> +
> +#include "qapi/error.h"
> +#include "qemu/thread.h"
> +#include "qemu/queue.h"
> +
> +#ifdef CONFIG_DSA_OPT
> +
> +#pragma GCC push_options
> +#pragma GCC target("enqcmd")
> +
> +#include <linux/idxd.h>
> +#include "x86intrin.h"
> +
> +typedef struct {
> +    void *work_queue;
> +} QemuDsaDevice;
> +
> +typedef QSIMPLEQ_HEAD(QemuDsaTaskQueue, QemuDsaBatchTask) QemuDsaTaskQueue;
> +
> +typedef struct {
> +    QemuDsaDevice *dsa_devices;
> +    int num_dsa_devices;
> +    /* The index of the next DSA device to be used. */
> +    uint32_t device_allocator_index;
> +    bool running;
> +    QemuMutex task_queue_lock;
> +    QemuCond task_queue_cond;
> +    QemuDsaTaskQueue task_queue;
> +} QemuDsaDeviceGroup;
> +
> +/**
> + * @brief Initializes DSA devices.
> + *
> + * @param dsa_parameter A list of DSA device path from migration parameter.
> + *
> + * @return int Zero if successful, otherwise non zero.
> + */
> +int qemu_dsa_init(const strList *dsa_parameter, Error **errp);
> +
> +/**
> + * @brief Start logic to enable using DSA.
> + */
> +void qemu_dsa_start(void);
> +
> +/**
> + * @brief Stop the device group and the completion thread.
> + */
> +void qemu_dsa_stop(void);
> +
> +/**
> + * @brief Clean up system resources created for DSA offloading.
> + */
> +void qemu_dsa_cleanup(void);
> +
> +/**
> + * @brief Check if DSA is running.
> + *
> + * @return True if DSA is running, otherwise false.
> + */
> +bool qemu_dsa_is_running(void);
> +
> +#else
> +
> +static inline bool qemu_dsa_is_running(void)
> +{
> +    return false;
> +}
> +
> +static inline int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
> +{
> +    if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
> +        error_setg(errp, "DSA is not supported.");
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
> +static inline void qemu_dsa_start(void) {}
> +
> +static inline void qemu_dsa_stop(void) {}
> +
> +static inline void qemu_dsa_cleanup(void) {}
> +
> +#endif
> +
> +#endif
> diff --git a/util/dsa.c b/util/dsa.c
> new file mode 100644
> index 0000000000..79dab5d62c
> --- /dev/null
> +++ b/util/dsa.c
> @@ -0,0 +1,280 @@
> +/*
> + * Use Intel Data Streaming Accelerator to offload certain background
> + * operations.
> + *
> + * Copyright (C) Bytedance Ltd.
> + *
> + * Authors:
> + *  Hao Xiang <hao.xiang@bytedance.com>
> + *  Bryan Zhang <bryan.zhang@bytedance.com>
> + *  Yichen Wang <yichen.wang@bytedance.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/queue.h"
> +#include "qemu/memalign.h"
> +#include "qemu/lockable.h"
> +#include "qemu/cutils.h"
> +#include "qemu/dsa.h"
> +#include "qemu/bswap.h"
> +#include "qemu/error-report.h"
> +#include "qemu/rcu.h"
> +
> +#pragma GCC push_options
> +#pragma GCC target("enqcmd")
> +
> +#include <linux/idxd.h>
> +#include "x86intrin.h"
> +
> +#define DSA_WQ_PORTAL_SIZE 4096
> +#define MAX_DSA_DEVICES 16
> +
> +uint32_t max_retry_count;
> +static QemuDsaDeviceGroup dsa_group;
> +
> +
> +/**
> + * @brief This function opens a DSA device's work queue and
> + *        maps the DSA device memory into the current process.
> + *
> + * @param dsa_wq_path A pointer to the DSA device work queue's file path.
> + * @return A pointer to the mapped memory, or MAP_FAILED on failure.
> + */
> +static void *
> +map_dsa_device(const char *dsa_wq_path)
> +{
> +    void *dsa_device;
> +    int fd;
> +
> +    fd = open(dsa_wq_path, O_RDWR);
> +    if (fd < 0) {
> +        error_report("Open %s failed with errno = %d.",
> +                dsa_wq_path, errno);
> +        return MAP_FAILED;
> +    }
> +    dsa_device = mmap(NULL, DSA_WQ_PORTAL_SIZE, PROT_WRITE,
> +                      MAP_SHARED | MAP_POPULATE, fd, 0);
> +    close(fd);
> +    if (dsa_device == MAP_FAILED) {
> +        error_report("mmap failed with errno = %d.", errno);
> +        return MAP_FAILED;
> +    }
> +    return dsa_device;
> +}
> +
> +/**
> + * @brief Initializes a DSA device structure.
> + *
> + * @param instance A pointer to the DSA device.
> + * @param work_queue A pointer to the DSA work queue.
> + */
> +static void
> +dsa_device_init(QemuDsaDevice *instance,
> +                void *dsa_work_queue)
> +{
> +    instance->work_queue = dsa_work_queue;
> +}
> +
> +/**
> + * @brief Cleans up a DSA device structure.
> + *
> + * @param instance A pointer to the DSA device to cleanup.
> + */
> +static void
> +dsa_device_cleanup(QemuDsaDevice *instance)
> +{
> +    if (instance->work_queue != MAP_FAILED) {
> +        munmap(instance->work_queue, DSA_WQ_PORTAL_SIZE);
> +    }
> +}
> +
> +/**
> + * @brief Initializes a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + * @param dsa_parameter A list of DSA device paths from the migration
> + * parameter. Multiple DSA device paths are separated by space characters.
> + *
> + * @return Zero if successful, non-zero otherwise.
> + */
> +static int
> +dsa_device_group_init(QemuDsaDeviceGroup *group,
> +                      const strList *dsa_parameter,
> +                      Error **errp)
> +{
> +    if (dsa_parameter == NULL) {
> +        error_setg(errp, "dsa device path is not supplied.");
> +        return -1;
> +    }
> +
> +    int ret = 0;
> +    const char *dsa_path[MAX_DSA_DEVICES];
> +    int num_dsa_devices = 0;
> +
> +    while (dsa_parameter) {
> +        dsa_path[num_dsa_devices++] = dsa_parameter->value;
> +        if (num_dsa_devices == MAX_DSA_DEVICES) {
> +            break;
> +        }
> +        dsa_parameter = dsa_parameter->next;
> +    }
> +
> +    group->dsa_devices =
> +        g_new0(QemuDsaDevice, num_dsa_devices);
> +    group->num_dsa_devices = num_dsa_devices;
> +    group->device_allocator_index = 0;
> +
> +    group->running = false;
> +    qemu_mutex_init(&group->task_queue_lock);
> +    qemu_cond_init(&group->task_queue_cond);
> +    QSIMPLEQ_INIT(&group->task_queue);
> +
> +    void *dsa_wq = MAP_FAILED;
> +    for (int i = 0; i < num_dsa_devices; i++) {
> +        dsa_wq = map_dsa_device(dsa_path[i]);
> +        if (dsa_wq == MAP_FAILED) {
> +            error_setg(errp, "map_dsa_device failed MAP_FAILED.");

This will assert if it fails in more than one iteration, errp cannot be
set twice. You'll have to test 'ret' outside of the loop before
returning and set the error there.
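
Something along these lines (untested sketch; the error message wording
is just an example):

    void *dsa_wq = MAP_FAILED;

    for (int i = 0; i < num_dsa_devices; i++) {
        dsa_wq = map_dsa_device(dsa_path[i]);
        if (dsa_wq == MAP_FAILED) {
            ret = -1;
        }
        dsa_device_init(&group->dsa_devices[i], dsa_wq);
    }

    if (ret != 0) {
        error_setg(errp, "Failed to map one or more DSA work queues.");
    }

    return ret;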

> +            ret = -1;
> +        }
> +        dsa_device_init(&group->dsa_devices[i], dsa_wq);
> +    }
> +
> +    return ret;
> +}
> +
> +/**
> + * @brief Starts a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + */
> +static void
> +dsa_device_group_start(QemuDsaDeviceGroup *group)
> +{
> +    group->running = true;
> +}
> +
> +/**
> + * @brief Stops a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + */
> +__attribute__((unused))
> +static void
> +dsa_device_group_stop(QemuDsaDeviceGroup *group)
> +{
> +    group->running = false;
> +}
> +
> +/**
> + * @brief Cleans up a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + */
> +static void
> +dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
> +{
> +    if (!group->dsa_devices) {
> +        return;
> +    }
> +    for (int i = 0; i < group->num_dsa_devices; i++) {
> +        dsa_device_cleanup(&group->dsa_devices[i]);
> +    }
> +    g_free(group->dsa_devices);
> +    group->dsa_devices = NULL;
> +
> +    qemu_mutex_destroy(&group->task_queue_lock);
> +    qemu_cond_destroy(&group->task_queue_cond);
> +}
> +
> +/**
> + * @brief Returns the next available DSA device in the group.
> + *
> + * @param group A pointer to the DSA device group.
> + *
> + * @return struct QemuDsaDevice* A pointer to the next available DSA device
> + *         in the group.
> + */
> +__attribute__((unused))
> +static QemuDsaDevice *
> +dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
> +{
> +    if (group->num_dsa_devices == 0) {
> +        return NULL;
> +    }
> +    uint32_t current = qatomic_fetch_inc(&group->device_allocator_index);
> +    current %= group->num_dsa_devices;
> +    return &group->dsa_devices[current];
> +}
> +
> +/**
> + * @brief Check if DSA is running.
> + *
> + * @return True if DSA is running, otherwise false.
> + */
> +bool qemu_dsa_is_running(void)
> +{
> +    return false;
> +}
> +
> +static void
> +dsa_globals_init(void)
> +{
> +    max_retry_count = UINT32_MAX;
> +}
> +
> +/**
> + * @brief Initializes DSA devices.
> + *
> + * @param dsa_parameter A list of DSA device paths from the migration parameter.
> + *
> + * @return int Zero if successful, otherwise non-zero.
> + */
> +int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
> +{
> +    dsa_globals_init();
> +
> +    return dsa_device_group_init(&dsa_group, dsa_parameter, errp);
> +}
> +
> +/**
> + * @brief Start logic to enable using DSA.
> + *
> + */
> +void qemu_dsa_start(void)
> +{
> +    if (dsa_group.num_dsa_devices == 0) {
> +        return;
> +    }
> +    if (dsa_group.running) {
> +        return;
> +    }
> +    dsa_device_group_start(&dsa_group);
> +}
> +
> +/**
> + * @brief Stop the device group and the completion thread.
> + *
> + */
> +void qemu_dsa_stop(void)
> +{
> +    QemuDsaDeviceGroup *group = &dsa_group;
> +
> +    if (!group->running) {
> +        return;
> +    }
> +}
> +
> +/**
> + * @brief Clean up system resources created for DSA offloading.
> + *
> + */
> +void qemu_dsa_cleanup(void)
> +{
> +    qemu_dsa_stop();
> +    dsa_device_group_cleanup(&dsa_group);
> +}
> +
> diff --git a/util/meson.build b/util/meson.build
> index 5d8bef9891..5ec2158f9e 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -123,6 +123,9 @@ if cpu == 'aarch64'
>    util_ss.add(files('cpuinfo-aarch64.c'))
>  elif cpu in ['x86', 'x86_64']
>    util_ss.add(files('cpuinfo-i386.c'))
> +  if config_host_data.get('CONFIG_DSA_OPT')
> +    util_ss.add(files('dsa.c'))
> +  endif
>  elif cpu == 'loongarch64'
>    util_ss.add(files('cpuinfo-loongarch.c'))
>  elif cpu in ['ppc', 'ppc64']


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path.
  2024-11-14 22:01 ` [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path Yichen Wang
@ 2024-11-21 20:50   ` Fabiano Rosas
  2024-11-26  4:41     ` [External] " Yichen Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-21 20:50 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> Multifd sender path gets an array of pages queued by the migration
> thread. It performs zero page checking on every page in the array.
> The pages are classified as either a zero page or a normal page. This
> change uses Intel DSA to offload the zero page checking from CPU to
> the DSA accelerator. The sender thread submits a batch of pages to DSA
> hardware and waits for the DSA completion thread to signal for work
> completion.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  migration/multifd-zero-page.c | 129 ++++++++++++++++++++++++++++++----
>  migration/multifd.c           |  29 +++++++-
>  migration/multifd.h           |   5 ++
>  3 files changed, 147 insertions(+), 16 deletions(-)
>
> diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
> index f1e988a959..639aed9f6b 100644
> --- a/migration/multifd-zero-page.c
> +++ b/migration/multifd-zero-page.c
> @@ -21,7 +21,9 @@
>  
>  static bool multifd_zero_page_enabled(void)
>  {
> -    return migrate_zero_page_detection() == ZERO_PAGE_DETECTION_MULTIFD;
> +    ZeroPageDetection curMethod = migrate_zero_page_detection();
> +    return (curMethod == ZERO_PAGE_DETECTION_MULTIFD ||
> +            curMethod == ZERO_PAGE_DETECTION_DSA_ACCEL);
>  }
>  
>  static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
> @@ -37,26 +39,49 @@ static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
>      pages_offset[b] = temp;
>  }
>  
> +#ifdef CONFIG_DSA_OPT
> +
> +static void swap_result(bool *results, int a, int b)
> +{
> +    bool temp;
> +
> +    if (a == b) {
> +        return;
> +    }
> +
> +    temp = results[a];
> +    results[a] = results[b];
> +    results[b] = temp;
> +}
> +
>  /**
> - * multifd_send_zero_page_detect: Perform zero page detection on all pages.
> + * zero_page_detect_dsa: Perform zero page detection using
> + * Intel Data Streaming Accelerator (DSA).
>   *
> - * Sorts normal pages before zero pages in p->pages->offset and updates
> - * p->pages->normal_num.
> + * Sorts normal pages before zero pages in pages->offset and updates
> + * pages->normal_num.
>   *
>   * @param p A pointer to the send params.
>   */
> -void multifd_send_zero_page_detect(MultiFDSendParams *p)
> +static void zero_page_detect_dsa(MultiFDSendParams *p)
>  {
>      MultiFDPages_t *pages = &p->data->u.ram;
>      RAMBlock *rb = pages->block;
> -    int i = 0;
> -    int j = pages->num - 1;
> +    bool *results = p->dsa_batch_task->results;
>  
> -    if (!multifd_zero_page_enabled()) {
> -        pages->normal_num = pages->num;
> -        goto out;
> +    for (int i = 0; i < pages->num; i++) {
> +        p->dsa_batch_task->addr[i] =
> +            (ram_addr_t)(rb->host + pages->offset[i]);
>      }
>  
> +    buffer_is_zero_dsa_batch_sync(p->dsa_batch_task,
> +                                  (const void **)p->dsa_batch_task->addr,
> +                                  pages->num,
> +                                  multifd_ram_page_size());
> +
> +    int i = 0;
> +    int j = pages->num - 1;
> +
>      /*
>       * Sort the page offset array by moving all normal pages to
>       * the left and all zero pages to the right of the array.
> @@ -64,23 +89,39 @@ void multifd_send_zero_page_detect(MultiFDSendParams *p)
>      while (i <= j) {
>          uint64_t offset = pages->offset[i];
>  
> -        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
> +        if (!results[i]) {
>              i++;
>              continue;
>          }
>  
> +        swap_result(results, i, j);
>          swap_page_offset(pages->offset, i, j);
>          ram_release_page(rb->idstr, offset);
>          j--;
>      }
>  
>      pages->normal_num = i;
> +}
>  
> -out:
> -    stat64_add(&mig_stats.normal_pages, pages->normal_num);
> -    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
> +void multifd_dsa_cleanup(void)
> +{
> +    qemu_dsa_cleanup();
> +}
> +
> +#else
> +
> +static void zero_page_detect_dsa(MultiFDSendParams *p)
> +{
> +    g_assert_not_reached();
> +}
> +
> +void multifd_dsa_cleanup(void)
> +{
> +    return;
>  }
>  
> +#endif
> +
>  void multifd_recv_zero_page_process(MultiFDRecvParams *p)
>  {
>      for (int i = 0; i < p->zero_num; i++) {
> @@ -92,3 +133,63 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
>          }
>      }
>  }
> +
> +/**
> + * zero_page_detect_cpu: Perform zero page detection using CPU.
> + *
> + * Sorts normal pages before zero pages in p->pages->offset and updates
> + * p->pages->normal_num.
> + *
> + * @param p A pointer to the send params.
> + */
> +static void zero_page_detect_cpu(MultiFDSendParams *p)
> +{
> +    MultiFDPages_t *pages = &p->data->u.ram;
> +    RAMBlock *rb = pages->block;
> +    int i = 0;
> +    int j = pages->num - 1;
> +
> +    /*
> +     * Sort the page offset array by moving all normal pages to
> +     * the left and all zero pages to the right of the array.
> +     */
> +    while (i <= j) {
> +        uint64_t offset = pages->offset[i];
> +
> +        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
> +            i++;
> +            continue;
> +        }
> +
> +        swap_page_offset(pages->offset, i, j);
> +        ram_release_page(rb->idstr, offset);
> +        j--;
> +    }
> +
> +    pages->normal_num = i;
> +}
> +
> +/**
> + * multifd_send_zero_page_detect: Perform zero page detection on all pages.
> + *
> + * @param p A pointer to the send params.
> + */
> +void multifd_send_zero_page_detect(MultiFDSendParams *p)
> +{
> +    MultiFDPages_t *pages = &p->data->u.ram;
> +
> +    if (!multifd_zero_page_enabled()) {
> +        pages->normal_num = pages->num;
> +        goto out;
> +    }
> +
> +    if (qemu_dsa_is_running()) {
> +        zero_page_detect_dsa(p);
> +    } else {
> +        zero_page_detect_cpu(p);
> +    }
> +
> +out:
> +    stat64_add(&mig_stats.normal_pages, pages->normal_num);
> +    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
> +}
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 4374e14a96..689acceff2 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -13,6 +13,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/cutils.h"
>  #include "qemu/rcu.h"
> +#include "qemu/dsa.h"
>  #include "exec/target_page.h"
>  #include "sysemu/sysemu.h"
>  #include "exec/ramblock.h"
> @@ -462,6 +463,8 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
>      p->name = NULL;
>      g_free(p->data);
>      p->data = NULL;
> +    buffer_zero_batch_task_destroy(p->dsa_batch_task);
> +    p->dsa_batch_task = NULL;
>      p->packet_len = 0;
>      g_free(p->packet);
>      p->packet = NULL;
> @@ -493,6 +496,8 @@ void multifd_send_shutdown(void)
>  
>      multifd_send_terminate_threads();
>  
> +    multifd_dsa_cleanup();
> +
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>          Error *local_err = NULL;
> @@ -814,11 +819,31 @@ bool multifd_send_setup(void)
>      uint32_t page_count = multifd_ram_page_count();
>      bool use_packets = multifd_use_packets();
>      uint8_t i;
> +    Error *local_err = NULL;
>  
>      if (!migrate_multifd()) {
>          return true;
>      }
>  
> +    if (s &&
> +        s->parameters.zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
> +        // Populate the dsa device path from accel-path

scripts/checkpatch.pl would have rejected this; QEMU only allows /* */
style comments, not //.

> +        const strList *accel_path = migrate_accel_path();
> +        g_autofree strList *dsa_parameter = g_malloc0(sizeof(strList));
> +        strList **tail = &dsa_parameter;
> +        while (accel_path) {
> +            if (strncmp(accel_path->value, "dsa:", 4) == 0) {
> +                QAPI_LIST_APPEND(tail, &accel_path->value[4]);
> +            }
> +            accel_path = accel_path->next;
> +        }

The parsing of the parameter should be in options.c. In fact, Peter
suggested in v4 to make all of this a multifd_dsa_send_setup() or
multifd_dsa_init(), I think that's a good idea.
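
Roughly (untested sketch, helper name from the earlier suggestion; it
reuses migrate_accel_path(), QAPI_LIST_APPEND() and the "dsa:" prefix
convention from this patch):

    static bool multifd_dsa_send_setup(Error **errp)
    {
        strList *dsa_parameter = NULL;
        strList **tail = &dsa_parameter;
        bool ok;

        for (const strList *p = migrate_accel_path(); p; p = p->next) {
            if (strncmp(p->value, "dsa:", 4) == 0) {
                QAPI_LIST_APPEND(tail, g_strdup(&p->value[4]));
            }
        }

        ok = !qemu_dsa_init(dsa_parameter, errp);
        if (ok) {
            qemu_dsa_start();
        }
        qapi_free_strList(dsa_parameter);
        return ok;
    }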

> +        if (qemu_dsa_init(dsa_parameter, &local_err)) {
> +            ret = -1;

migrate_set_error(s, local_err);
goto err;

> +        } else {
> +            qemu_dsa_start();
> +        }
> +    }
> +
>      thread_count = migrate_multifd_channels();
>      multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
>      multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> @@ -829,12 +854,12 @@ bool multifd_send_setup(void)
>  
>      for (i = 0; i < thread_count; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
> -        Error *local_err = NULL;
>  
>          qemu_sem_init(&p->sem, 0);
>          qemu_sem_init(&p->sem_sync, 0);
>          p->id = i;
>          p->data = multifd_send_data_alloc();
> +        p->dsa_batch_task = buffer_zero_batch_task_init(page_count);
>  
>          if (use_packets) {
>              p->packet_len = sizeof(MultiFDPacket_t)
> @@ -865,7 +890,6 @@ bool multifd_send_setup(void)
>  
>      for (i = 0; i < thread_count; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
> -        Error *local_err = NULL;
>  
>          ret = multifd_send_state->ops->send_setup(p, &local_err);
>          if (ret) {
> @@ -1047,6 +1071,7 @@ void multifd_recv_cleanup(void)
>              qemu_thread_join(&p->thread);
>          }
>      }
> +    multifd_dsa_cleanup();
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          multifd_recv_cleanup_channel(&multifd_recv_state->params[i]);
>      }
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 50d58c0c9c..e293ddbc1d 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -15,6 +15,7 @@
>  
>  #include "exec/target_page.h"
>  #include "ram.h"
> +#include "qemu/dsa.h"
>  
>  typedef struct MultiFDRecvData MultiFDRecvData;
>  typedef struct MultiFDSendData MultiFDSendData;
> @@ -155,6 +156,9 @@ typedef struct {
>      bool pending_sync;
>      MultiFDSendData *data;
>  
> +    /* Zero page checking batch task */
> +    QemuDsaBatchTask *dsa_batch_task;
> +
>      /* thread local variables. No locking required */
>  
>      /* pointer to the packet */
> @@ -313,6 +317,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p);
>  bool multifd_send_prepare_common(MultiFDSendParams *p);
>  void multifd_send_zero_page_detect(MultiFDSendParams *p);
>  void multifd_recv_zero_page_process(MultiFDRecvParams *p);
> +void multifd_dsa_cleanup(void);
>  
>  static inline void multifd_send_prepare_header(MultiFDSendParams *p)
>  {


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 04/12] util/dsa: Implement DSA task enqueue and dequeue.
  2024-11-14 22:01 ` [PATCH v7 04/12] util/dsa: Implement DSA task enqueue and dequeue Yichen Wang
@ 2024-11-21 20:55   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-21 20:55 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> * Use a safe thread queue for DSA task enqueue/dequeue.
> * Implement DSA task submission.
> * Implement DSA batch task submission.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 05/12] util/dsa: Implement DSA task asynchronous completion thread model.
  2024-11-14 22:01 ` [PATCH v7 05/12] util/dsa: Implement DSA task asynchronous completion thread model Yichen Wang
@ 2024-11-21 20:58   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-21 20:58 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> * Create a dedicated thread for DSA task completion.
> * DSA completion thread runs a loop and poll for completed tasks.
> * Start and stop DSA completion thread during DSA device start stop.
>
> A user space application can directly submit tasks to the Intel DSA
> accelerator by writing to DSA's device memory (mapped in user space).
> Once a task is submitted, the device starts processing it and writes
> the completion status back to the task. A user space application can
> poll the task's completion status to check for completion. This change
> uses a dedicated thread to perform DSA task completion checking.
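
For reference, the user-space submission primitive this model builds on
looks roughly like this (untested sketch, not the exact code from this
series; _enqcmd() is the compiler intrinsic for the ENQCMD instruction
and returns non-zero when the shared work queue did not accept the
descriptor):

    static int enqcmd_submit(void *wq_portal, struct dsa_hw_desc *desc)
    {
        uint64_t retry = 0;

        /* Ensure the descriptor is fully written before submitting it. */
        _mm_sfence();
        while (_enqcmd(wq_portal, desc)) {
            if (++retry > max_retry_count) {
                return 1;   /* work queue stayed full, give up */
            }
            _mm_pause();
        }
        return 0;
    }
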
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  include/qemu/dsa.h |   1 +
>  util/dsa.c         | 274 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 274 insertions(+), 1 deletion(-)
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> index 04ee8924ab..d24567f0be 100644
> --- a/include/qemu/dsa.h
> +++ b/include/qemu/dsa.h
> @@ -69,6 +69,7 @@ typedef struct QemuDsaBatchTask {
>      QemuDsaTaskType task_type;
>      QemuDsaTaskStatus status;
>      int batch_size;
> +    bool *results;
>      QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
>  } QemuDsaBatchTask;
>  
> diff --git a/util/dsa.c b/util/dsa.c
> index b55fa599f0..c3ca71df86 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -33,9 +33,20 @@
>  #define DSA_WQ_PORTAL_SIZE 4096
>  #define DSA_WQ_DEPTH 128
>  #define MAX_DSA_DEVICES 16
> +#define DSA_COMPLETION_THREAD "qemu_dsa_completion"
> +
> +typedef struct {
> +    bool stopping;
> +    bool running;
> +    QemuThread thread;
> +    int thread_id;
> +    QemuSemaphore sem_init_done;
> +    QemuDsaDeviceGroup *group;
> +} QemuDsaCompletionThread;
>  
>  uint32_t max_retry_count;
>  static QemuDsaDeviceGroup dsa_group;
> +static QemuDsaCompletionThread completion_thread;
>  
>  
>  /**
> @@ -403,6 +414,265 @@ submit_batch_wi_async(QemuDsaBatchTask *batch_task)
>      return dsa_task_enqueue(device_group, batch_task);
>  }
>  
> +/**
> + * @brief Poll for the DSA work item completion.
> + *
> + * @param completion A pointer to the DSA work item completion record.
> + * @param opcode The DSA opcode.
> + *
> + * @return Zero if successful, non-zero otherwise.
> + */
> +static int
> +poll_completion(struct dsa_completion_record *completion,
> +                enum dsa_opcode opcode)
> +{
> +    uint8_t status;
> +    uint64_t retry = 0;
> +
> +    while (true) {
> +        /* The DSA operation completes successfully or fails. */
> +        status = completion->status;
> +        if (status == DSA_COMP_SUCCESS ||
> +            status == DSA_COMP_PAGE_FAULT_NOBOF ||
> +            status == DSA_COMP_BATCH_PAGE_FAULT ||
> +            status == DSA_COMP_BATCH_FAIL) {
> +            break;
> +        } else if (status != DSA_COMP_NONE) {
> +            error_report("DSA opcode %d failed with status = %d.",
> +                    opcode, status);
> +            return 1;
> +        }
> +        retry++;
> +        if (retry > max_retry_count) {
> +            error_report("DSA wait for completion retry %lu times.", retry);
> +            return 1;
> +        }
> +        _mm_pause();
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * @brief Complete a single DSA task in the batch task.
> + *
> + * @param task A pointer to the batch task structure.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +static int
> +poll_task_completion(QemuDsaBatchTask *task)
> +{
> +    assert(task->task_type == QEMU_DSA_TASK);
> +
> +    struct dsa_completion_record *completion = &task->completions[0];
> +    uint8_t status;
> +    int ret;
> +
> +    ret = poll_completion(completion, task->descriptors[0].opcode);
> +    if (ret != 0) {
> +        goto exit;
> +    }
> +
> +    status = completion->status;
> +    if (status == DSA_COMP_SUCCESS) {
> +        task->results[0] = (completion->result == 0);
> +        goto exit;
> +    }
> +
> +    assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
> +
> +exit:
> +    return ret;
> +}
> +
> +/**
> + * @brief Poll a batch task status until it completes. If DSA task doesn't
> + *        complete properly, use CPU to complete the task.
> + *
> + * @param batch_task A pointer to the DSA batch task.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +static int
> +poll_batch_task_completion(QemuDsaBatchTask *batch_task)
> +{
> +    struct dsa_completion_record *batch_completion =
> +        &batch_task->batch_completion;
> +    struct dsa_completion_record *completion;
> +    uint8_t batch_status;
> +    uint8_t status;
> +    bool *results = batch_task->results;
> +    uint32_t count = batch_task->batch_descriptor.desc_count;
> +    int ret;
> +
> +    ret = poll_completion(batch_completion,
> +                          batch_task->batch_descriptor.opcode);
> +    if (ret != 0) {
> +        goto exit;
> +    }
> +
> +    batch_status = batch_completion->status;
> +
> +    if (batch_status == DSA_COMP_SUCCESS) {
> +        if (batch_completion->bytes_completed == count) {
> +            /*
> +             * Let's skip checking each descriptor's completion status
> +             * if the batch descriptor says all succeeded.
> +             */
> +            for (int i = 0; i < count; i++) {
> +                assert(batch_task->completions[i].status == DSA_COMP_SUCCESS);
> +                results[i] = (batch_task->completions[i].result == 0);
> +            }
> +            goto exit;
> +        }
> +    } else {
> +        assert(batch_status == DSA_COMP_BATCH_FAIL ||
> +            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
> +    }
> +
> +    for (int i = 0; i < count; i++) {
> +
> +        completion = &batch_task->completions[i];
> +        status = completion->status;
> +
> +        if (status == DSA_COMP_SUCCESS) {
> +            results[i] = (completion->result == 0);
> +            continue;
> +        }
> +
> +        assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
> +
> +        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
> +            error_report("Unexpected DSA completion status = %u.", status);

Unreachable with the assert above.
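
Keeping just the runtime check (and dropping the assert) would preserve
the error path, e.g. (sketch, assuming the existing ret/exit handling in
this function):

    if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
        error_report("Unexpected DSA completion status = %u.", status);
        ret = 1;
        goto exit;
    }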

With that fixed:

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 06/12] util/dsa: Implement zero page checking in DSA task.
  2024-11-14 22:01 ` [PATCH v7 06/12] util/dsa: Implement zero page checking in DSA task Yichen Wang
@ 2024-11-25 15:53   ` Fabiano Rosas
  2024-11-26  4:38     ` [External] " Yichen Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-25 15:53 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> Create DSA tasks with operation code DSA_OPCODE_COMPVAL.
> Here we create two types of DSA tasks, a single DSA task and
> a batch DSA task. A batch DSA task reduces task submission overhead
> and hence should be the default option. However, due to the way DSA
> hardware works, a DSA batch task must contain at least two individual
> tasks. There are times we need to submit a single task and hence a
> single DSA task submission is also required.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  include/qemu/dsa.h |  44 ++++++--
>  util/dsa.c         | 254 +++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 269 insertions(+), 29 deletions(-)
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> index d24567f0be..cb407b8b49 100644
> --- a/include/qemu/dsa.h
> +++ b/include/qemu/dsa.h
> @@ -16,6 +16,7 @@
>  #define QEMU_DSA_H
>  
>  #include "qapi/error.h"
> +#include "exec/cpu-common.h"
>  #include "qemu/thread.h"
>  #include "qemu/queue.h"
>  
> @@ -70,10 +71,11 @@ typedef struct QemuDsaBatchTask {
>      QemuDsaTaskStatus status;
>      int batch_size;
>      bool *results;
> +    /* Address of each page in the pages array */
> +    ram_addr_t *addr;
>      QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
>  } QemuDsaBatchTask;
>  
> -
>  /**
>   * @brief Initializes DSA devices.
>   *
> @@ -105,8 +107,26 @@ void qemu_dsa_cleanup(void);
>   */
>  bool qemu_dsa_is_running(void);
>  
> +/**
> + * @brief Initializes a buffer zero DSA batch task.
> + *
> + * @param batch_size The number of zero page checking tasks in the batch.
> + * @return A pointer to the zero page checking tasks initialized.
> + */
> +QemuDsaBatchTask *
> +buffer_zero_batch_task_init(int batch_size);
> +
> +/**
> + * @brief Performs the proper cleanup on a DSA batch task.
> + *
> + * @param task A pointer to the batch task to cleanup.
> + */
> +void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
> +
>  #else
>  
> +typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
> +
>  static inline bool qemu_dsa_is_running(void)
>  {
>      return false;
> @@ -114,19 +134,27 @@ static inline bool qemu_dsa_is_running(void)
>  
>  static inline int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
>  {
> -    if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
> -        error_setg(errp, "DSA is not supported.");
> -        return -1;
> -    }
> -
> -    return 0;
> +    error_setg(errp, "DSA accelerator is not enabled.");
> +    return -1;

This should have been fixed in the patch that introduced this function.

>  }
>  
>  static inline void qemu_dsa_start(void) {}
>  
>  static inline void qemu_dsa_stop(void) {}
>  
> -static inline void qemu_dsa_cleanup(void) {}

Where did this go?

> +static inline QemuDsaBatchTask *buffer_zero_batch_task_init(int batch_size)
> +{
> +    return NULL;
> +}
> +
> +static inline void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task) {}
> +
> +static inline int
> +buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
> +                              const void **buf, size_t count, size_t len)
> +{
> +    return -1;
> +}
>  
>  #endif
>  
> diff --git a/util/dsa.c b/util/dsa.c
> index c3ca71df86..408c163195 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -48,6 +48,7 @@ uint32_t max_retry_count;
>  static QemuDsaDeviceGroup dsa_group;
>  static QemuDsaCompletionThread completion_thread;
>  
> +static void buffer_zero_dsa_completion(void *context);
>  
>  /**
>   * @brief This function opens a DSA device's work queue and
> @@ -174,7 +175,6 @@ dsa_device_group_start(QemuDsaDeviceGroup *group)
>   *
>   * @param group A pointer to the DSA device group.
>   */
> -__attribute__((unused))
>  static void
>  dsa_device_group_stop(QemuDsaDeviceGroup *group)
>  {
> @@ -210,7 +210,6 @@ dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
>   * @return struct QemuDsaDevice* A pointer to the next available DSA device
>   *         in the group.
>   */
> -__attribute__((unused))
>  static QemuDsaDevice *
>  dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
>  {
> @@ -283,7 +282,6 @@ dsa_task_enqueue(QemuDsaDeviceGroup *group,
>   * @param group A pointer to the DSA device group.
>   * @return QemuDsaBatchTask* The DSA task being dequeued.
>   */
> -__attribute__((unused))
>  static QemuDsaBatchTask *
>  dsa_task_dequeue(QemuDsaDeviceGroup *group)
>  {
> @@ -338,22 +336,6 @@ submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
>      return 0;
>  }
>  
> -/**
> - * @brief Synchronously submits a DSA work item to the
> - *        device work queue.
> - *
> - * @param wq A pointer to the DSA work queue's device memory.
> - * @param descriptor A pointer to the DSA work item descriptor.
> - *
> - * @return int Zero if successful, non-zero otherwise.
> - */
> -__attribute__((unused))
> -static int
> -submit_wi(void *wq, struct dsa_hw_desc *descriptor)
> -{
> -    return submit_wi_int(wq, descriptor);
> -}
> -

Why is this being removed?

>  /**
>   * @brief Asynchronously submits a DSA work item to the
>   *        device work queue.
> @@ -362,7 +344,6 @@ submit_wi(void *wq, struct dsa_hw_desc *descriptor)
>   *
>   * @return int Zero if successful, non-zero otherwise.
>   */
> -__attribute__((unused))
>  static int
>  submit_wi_async(QemuDsaBatchTask *task)
>  {
> @@ -391,7 +372,6 @@ submit_wi_async(QemuDsaBatchTask *task)
>   *
>   * @return int Zero if successful, non-zero otherwise.
>   */
> -__attribute__((unused))
>  static int
>  submit_batch_wi_async(QemuDsaBatchTask *batch_task)
>  {
> @@ -750,3 +730,235 @@ void qemu_dsa_cleanup(void)
>      dsa_device_group_cleanup(&dsa_group);
>  }
>  
> +
> +/* Buffer zero comparison DSA task implementations */
> +/* =============================================== */
> +
> +/**
> + * @brief Sets a buffer zero comparison DSA task.
> + *
> + * @param descriptor A pointer to the DSA task descriptor.
> + * @param buf A pointer to the memory buffer.
> + * @param len The length of the buffer.
> + */
> +static void
> +buffer_zero_task_set_int(struct dsa_hw_desc *descriptor,
> +                         const void *buf,
> +                         size_t len)
> +{
> +    struct dsa_completion_record *completion =
> +        (struct dsa_completion_record *)descriptor->completion_addr;
> +
> +    descriptor->xfer_size = len;
> +    descriptor->src_addr = (uintptr_t)buf;
> +    completion->status = 0;
> +    completion->result = 0;
> +}
> +
> +/**
> + * @brief Resets a buffer zero comparison DSA batch task.
> + *
> + * @param task A pointer to the DSA batch task.
> + */
> +static void
> +buffer_zero_task_reset(QemuDsaBatchTask *task)
> +{
> +    task->completions[0].status = DSA_COMP_NONE;
> +    task->task_type = QEMU_DSA_TASK;
> +    task->status = QEMU_DSA_TASK_READY;
> +}
> +
> +/**
> + * @brief Resets a buffer zero comparison DSA batch task.
> + *
> + * @param task A pointer to the batch task.
> + * @param count The number of DSA tasks this batch task will contain.
> + */
> +static void
> +buffer_zero_batch_task_reset(QemuDsaBatchTask *task, size_t count)
> +{
> +    task->batch_completion.status = DSA_COMP_NONE;
> +    task->batch_descriptor.desc_count = count;
> +    task->task_type = QEMU_DSA_BATCH_TASK;
> +    task->status = QEMU_DSA_TASK_READY;
> +}
> +
> +/**
> + * @brief Sets a buffer zero comparison DSA task.
> + *
> + * @param task A pointer to the DSA task.
> + * @param buf A pointer to the memory buffer.
> + * @param len The buffer length.
> + */
> +static void
> +buffer_zero_task_set(QemuDsaBatchTask *task,
> +                     const void *buf,
> +                     size_t len)
> +{
> +    buffer_zero_task_reset(task);
> +    buffer_zero_task_set_int(&task->descriptors[0], buf, len);
> +}
> +
> +/**
> + * @brief Sets a buffer zero comparison batch task.
> + *
> + * @param batch_task A pointer to the batch task.
> + * @param buf An array of memory buffers.
> + * @param count The number of buffers in the array.
> + * @param len The length of the buffers.
> + */
> +static void
> +buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
> +                           const void **buf, size_t count, size_t len)
> +{
> +    assert(count > 0);
> +    assert(count <= batch_task->batch_size);
> +
> +    buffer_zero_batch_task_reset(batch_task, count);
> +    for (int i = 0; i < count; i++) {
> +        buffer_zero_task_set_int(&batch_task->descriptors[i], buf[i], len);
> +    }
> +}
> +
> +/**
> + * @brief Asynchronously performs a buffer zero DSA operation.
> + *
> + * @param task A pointer to the batch task structure.
> + * @param buf A pointer to the memory buffer.
> + * @param len The length of the memory buffer.
> + *
> + * @return int Zero if successful, otherwise an appropriate error code.
> + */
> +__attribute__((unused))
> +static int
> +buffer_zero_dsa_async(QemuDsaBatchTask *task,
> +                      const void *buf, size_t len)
> +{
> +    buffer_zero_task_set(task, buf, len);
> +
> +    return submit_wi_async(task);
> +}
> +
> +/**
> + * @brief Sends a memory comparison batch task to a DSA device and wait
> + *        for completion.
> + *
> + * @param batch_task The batch task to be submitted to DSA device.
> + * @param buf An array of memory buffers to check for zero.
> + * @param count The number of buffers.
> + * @param len The buffer length.
> + */
> +__attribute__((unused))
> +static int
> +buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
> +                            const void **buf, size_t count, size_t len)
> +{
> +    assert(count <= batch_task->batch_size);
> +    buffer_zero_batch_task_set(batch_task, buf, count, len);
> +
> +    return submit_batch_wi_async(batch_task);
> +}
> +
> +/**
> + * @brief The completion callback function for buffer zero
> + *        comparison DSA task completion.
> + *
> + * @param context A pointer to the callback context.
> + */
> +static void
> +buffer_zero_dsa_completion(void *context)
> +{
> +    assert(context != NULL);
> +
> +    QemuDsaBatchTask *task = (QemuDsaBatchTask *)context;
> +    qemu_sem_post(&task->sem_task_complete);
> +}
> +
> +/**
> + * @brief Wait for the asynchronous DSA task to complete.
> + *
> + * @param batch_task A pointer to the buffer zero comparison batch task.
> + */
> +__attribute__((unused))
> +static void
> +buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
> +{
> +    qemu_sem_wait(&batch_task->sem_task_complete);
> +}
> +
> +/**
> + * @brief Initializes a buffer zero comparison DSA task.
> + *
> + * @param descriptor A pointer to the DSA task descriptor.
> + * @param completion A pointer to the DSA task completion record.
> + */
> +static void
> +buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
> +                          struct dsa_completion_record *completion)
> +{
> +    descriptor->opcode = DSA_OPCODE_COMPVAL;
> +    descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
> +    descriptor->comp_pattern = (uint64_t)0;
> +    descriptor->completion_addr = (uint64_t)completion;
> +}
> +
> +/**
> + * @brief Initializes a buffer zero DSA batch task.
> + *
> + * @param batch_size The number of zero page checking tasks in the batch.
> + * @return A pointer to the zero page checking tasks initialized.
> + */
> +QemuDsaBatchTask *
> +buffer_zero_batch_task_init(int batch_size)
> +{
> +    QemuDsaBatchTask *task = qemu_memalign(64, sizeof(QemuDsaBatchTask));
> +    int descriptors_size = sizeof(*task->descriptors) * batch_size;
> +
> +    memset(task, 0, sizeof(*task));
> +    task->addr = g_new0(ram_addr_t, batch_size);
> +    task->results = g_new0(bool, batch_size);
> +    task->batch_size = batch_size;
> +    task->descriptors =
> +        (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
> +    memset(task->descriptors, 0, descriptors_size);
> +    task->completions = (struct dsa_completion_record *)qemu_memalign(
> +        32, sizeof(*task->completions) * batch_size);
> +
> +    task->batch_completion.status = DSA_COMP_NONE;
> +    task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
> +    /* TODO: Ensure that we never send a batch with count <= 1 */
> +    task->batch_descriptor.desc_count = 0;
> +    task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
> +    task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
> +    task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
> +    task->status = QEMU_DSA_TASK_READY;
> +    task->group = &dsa_group;
> +    task->device = dsa_device_group_get_next_device(&dsa_group);
> +
> +    for (int i = 0; i < task->batch_size; i++) {
> +        buffer_zero_task_init_int(&task->descriptors[i],
> +                                  &task->completions[i]);
> +    }
> +
> +    qemu_sem_init(&task->sem_task_complete, 0);
> +    task->completion_callback = buffer_zero_dsa_completion;
> +
> +    return task;
> +}
> +
> +/**
> + * @brief Performs the proper cleanup on a DSA batch task.
> + *
> + * @param task A pointer to the batch task to cleanup.
> + */
> +void
> +buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
> +{
> +    g_free(task->addr);
> +    g_free(task->results);
> +    qemu_vfree(task->descriptors);
> +    qemu_vfree(task->completions);
> +    task->results = NULL;
> +    qemu_sem_destroy(&task->sem_task_complete);
> +    qemu_vfree(task);
> +}


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 07/12] util/dsa: Implement DSA task asynchronous submission and wait for completion.
  2024-11-14 22:01 ` [PATCH v7 07/12] util/dsa: Implement DSA task asynchronous submission and wait for completion Yichen Wang
@ 2024-11-25 18:00   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-25 18:00 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> * Add a DSA task completion callback.
> * DSA completion thread will call the tasks's completion callback
> on every task/batch task completion.
> * DSA submission path to wait for completion.
> * Implement CPU fallback if DSA is not able to complete the task.
>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
>  include/qemu/dsa.h |  14 +++++
>  util/dsa.c         | 125 +++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 135 insertions(+), 4 deletions(-)
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> index cb407b8b49..8284804a32 100644
> --- a/include/qemu/dsa.h
> +++ b/include/qemu/dsa.h
> @@ -123,6 +123,20 @@ buffer_zero_batch_task_init(int batch_size);
>   */
>  void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
>  
> +/**
> + * @brief Performs buffer zero comparison on a DSA batch task synchronously.
> + *
> + * @param batch_task A pointer to the batch task.
> + * @param buf An array of memory buffers.
> + * @param count The number of buffers in the array.
> + * @param len The buffer length.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +int
> +buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
> +                              const void **buf, size_t count, size_t len);
> +
>  #else
>  
>  typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
> diff --git a/util/dsa.c b/util/dsa.c
> index 408c163195..50f53ec24b 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -433,6 +433,42 @@ poll_completion(struct dsa_completion_record *completion,
>      return 0;
>  }
>  
> +/**
> + * @brief Helper function to use CPU to complete a single
> + *        zero page checking task.
> + *
> + * @param completion A pointer to a DSA task completion record.
> + * @param descriptor A pointer to a DSA task descriptor.
> + * @param result A pointer to the result of a zero page checking.
> + */
> +static void
> +task_cpu_fallback_int(struct dsa_completion_record *completion,
> +                      struct dsa_hw_desc *descriptor, bool *result)
> +{
> +    const uint8_t *buf;
> +    size_t len;
> +
> +    if (completion->status == DSA_COMP_SUCCESS) {
> +        return;
> +    }
> +
> +    /*
> +     * DSA was able to partially complete the operation. Check the
> +     * result. If we already know this is not a zero page, we can
> +     * return now.
> +     */
> +    if (completion->bytes_completed != 0 && completion->result != 0) {
> +        *result = false;
> +        return;
> +    }
> +
> +    /* Let's fallback to use CPU to complete it. */
> +    buf = (const uint8_t *)descriptor->src_addr;
> +    len = descriptor->xfer_size;
> +    *result = buffer_is_zero(buf + completion->bytes_completed,
> +                             len - completion->bytes_completed);
> +}
> +
>  /**
>   * @brief Complete a single DSA task in the batch task.
>   *
> @@ -561,7 +597,7 @@ dsa_completion_loop(void *opaque)
>          (QemuDsaCompletionThread *)opaque;
>      QemuDsaBatchTask *batch_task;
>      QemuDsaDeviceGroup *group = thread_context->group;
> -    int ret;
> +    int ret = 0;
>  
>      rcu_register_thread();
>  
> @@ -829,7 +865,6 @@ buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
>   *
>   * @return int Zero if successful, otherwise an appropriate error code.
>   */
> -__attribute__((unused))
>  static int
>  buffer_zero_dsa_async(QemuDsaBatchTask *task,
>                        const void *buf, size_t len)
> @@ -848,7 +883,6 @@ buffer_zero_dsa_async(QemuDsaBatchTask *task,
>   * @param count The number of buffers.
>   * @param len The buffer length.
>   */
> -__attribute__((unused))
>  static int
>  buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
>                              const void **buf, size_t count, size_t len)
> @@ -879,13 +913,61 @@ buffer_zero_dsa_completion(void *context)
>   *
>   * @param batch_task A pointer to the buffer zero comparison batch task.
>   */
> -__attribute__((unused))
>  static void
>  buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
>  {
>      qemu_sem_wait(&batch_task->sem_task_complete);
>  }
>  
> +/**
> + * @brief Use CPU to complete the zero page checking task if DSA
> + *        is not able to complete it.
> + *
> + * @param batch_task A pointer to the batch task.
> + */
> +static void
> +buffer_zero_cpu_fallback(QemuDsaBatchTask *batch_task)
> +{
> +    if (batch_task->task_type == QEMU_DSA_TASK) {
> +        if (batch_task->completions[0].status == DSA_COMP_SUCCESS) {
> +            return;
> +        }
> +        task_cpu_fallback_int(&batch_task->completions[0],
> +                              &batch_task->descriptors[0],
> +                              &batch_task->results[0]);
> +    } else if (batch_task->task_type == QEMU_DSA_BATCH_TASK) {
> +        struct dsa_completion_record *batch_completion =
> +            &batch_task->batch_completion;
> +        struct dsa_completion_record *completion;
> +        uint8_t status;
> +        bool *results = batch_task->results;
> +        uint32_t count = batch_task->batch_descriptor.desc_count;
> +
> +        /* DSA is able to complete the entire batch task. */
> +        if (batch_completion->status == DSA_COMP_SUCCESS) {
> +            assert(count == batch_completion->bytes_completed);
> +            return;
> +        }
> +
> +        /*
> +         * DSA encounters some error and is not able to complete
> +         * the entire batch task. Use CPU fallback.
> +         */
> +        for (int i = 0; i < count; i++) {
> +
> +            completion = &batch_task->completions[i];
> +            status = completion->status;
> +
> +            assert(status == DSA_COMP_SUCCESS ||
> +                status == DSA_COMP_PAGE_FAULT_NOBOF);
> +
> +            task_cpu_fallback_int(completion,
> +                                  &batch_task->descriptors[i],
> +                                  &results[i]);
> +        }
> +    }
> +}
> +
>  /**
>   * @brief Initializes a buffer zero comparison DSA task.
>   *
> @@ -962,3 +1044,38 @@ buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
>      qemu_sem_destroy(&task->sem_task_complete);
>      qemu_vfree(task);
>  }
> +
> +/**
> + * @brief Performs buffer zero comparison on a DSA batch task synchronously.
> + *
> + * @param batch_task A pointer to the batch task.
> + * @param buf An array of memory buffers.
> + * @param count The number of buffers in the array.
> + * @param len The buffer length.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +int
> +buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
> +                              const void **buf, size_t count, size_t len)
> +{
> +    if (count <= 0 || count > batch_task->batch_size) {
> +        return -1;
> +    }
> +
> +    assert(batch_task != NULL);

batch_task is already dereferenced above.
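
I.e. either drop the assert or move it before the first use:

    assert(batch_task != NULL);

    if (count <= 0 || count > batch_task->batch_size) {
        return -1;
    }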

> +    assert(len != 0);
> +    assert(buf != NULL);
> +
> +    if (count == 1) {
> +        /* DSA doesn't take batch operation with only 1 task. */
> +        buffer_zero_dsa_async(batch_task, buf[0], len);
> +    } else {
> +        buffer_zero_dsa_batch_async(batch_task, buf, count, len);
> +    }
> +
> +    buffer_zero_dsa_wait(batch_task);
> +    buffer_zero_cpu_fallback(batch_task);
> +
> +    return 0;
> +}


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 11/12] migration/multifd: Add integration tests for multifd with Intel DSA offloading.
  2024-11-14 22:01 ` [PATCH v7 11/12] migration/multifd: Add integration tests for multifd with Intel DSA offloading Yichen Wang
@ 2024-11-25 18:25   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-25 18:25 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang, Bryan Zhang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Hao Xiang <hao.xiang@linux.dev>
>
> * Add test case to start and complete multifd live migration with DSA
> offloading enabled.
> * Add test case to start and cancel multifd live migration with DSA
> offloading enabled.
>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---

Alright, we should figure out eventually what to do with tests that call
test_migrate_start|end more than once regarding the hooks. I propose we
should stop setting capabilities within hooks and have a separate
mechanism that can take a list of key=value entries as part of
MigrateCommon and set all the capabilities at once. The hooks could
probably be phased out then.
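
Something like this, maybe (all names here are made up):

    static const char * const dsa_settings[] = {
        "zero-page-detection=dsa-accel",
        "accel-path=dsa:/dev/dsa/wq4.0",
        NULL,
    };

    MigrateCommon args = {
        ...
        /* applied on both src and dst before migration starts */
        .settings = dsa_settings,
    };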

This is too much work for now, it would be better if those changes came
from QMP/QEMU first and the tests then followed. Doing it the other way
around will result in a mess of macros.

Reviewed-by: Fabiano Rosas <farosas@suse.de>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v7 12/12] migration/doc: Add DSA zero page detection doc
  2024-11-14 22:01 ` [PATCH v7 12/12] migration/doc: Add DSA zero page detection doc Yichen Wang
@ 2024-11-25 18:28   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-25 18:28 UTC (permalink / raw)
  To: Yichen Wang, Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel
  Cc: Hao Xiang, Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang,
	Yichen Wang

Yichen Wang <yichen.wang@bytedance.com> writes:

> From: Yuan Liu <yuan1.liu@intel.com>
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH v7 06/12] util/dsa: Implement zero page checking in DSA task.
  2024-11-25 15:53   ` Fabiano Rosas
@ 2024-11-26  4:38     ` Yichen Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Yichen Wang @ 2024-11-26  4:38 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel, Hao Xiang,
	Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang, Bryan Zhang

On Mon, Nov 25, 2024 at 7:53 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Yichen Wang <yichen.wang@bytedance.com> writes:
>
> > From: Hao Xiang <hao.xiang@linux.dev>
> >
> > Create DSA task with operation code DSA_OPCODE_COMPVAL.
> > Here we create two types of DSA tasks, a single DSA task and
> > a batch DSA task. Batch DSA task reduces task submission overhead
> > and hence should be the default option. However, due to the way DSA
> > hardware works, a DSA batch task must contain at least two individual
> > tasks. There are times we need to submit a single task and hence a
> > single DSA task submission is also required.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> > Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> > Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> > ---
> >  include/qemu/dsa.h |  44 ++++++--
> >  util/dsa.c         | 254 +++++++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 269 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> > index d24567f0be..cb407b8b49 100644
> > --- a/include/qemu/dsa.h
> > +++ b/include/qemu/dsa.h
> > @@ -16,6 +16,7 @@
> >  #define QEMU_DSA_H
> >
> >  #include "qapi/error.h"
> > +#include "exec/cpu-common.h"
> >  #include "qemu/thread.h"
> >  #include "qemu/queue.h"
> >
> > @@ -70,10 +71,11 @@ typedef struct QemuDsaBatchTask {
> >      QemuDsaTaskStatus status;
> >      int batch_size;
> >      bool *results;
> > +    /* Address of each page in the pages array */
> > +    ram_addr_t *addr;
> >      QSIMPLEQ_ENTRY(QemuDsaBatchTask) entry;
> >  } QemuDsaBatchTask;
> >
> > -
> >  /**
> >   * @brief Initializes DSA devices.
> >   *
> > @@ -105,8 +107,26 @@ void qemu_dsa_cleanup(void);
> >   */
> >  bool qemu_dsa_is_running(void);
> >
> > +/**
> > + * @brief Initializes a buffer zero DSA batch task.
> > + *
> > + * @param batch_size The number of zero page checking tasks in the batch.
> > + * @return A pointer to the zero page checking tasks initialized.
> > + */
> > +QemuDsaBatchTask *
> > +buffer_zero_batch_task_init(int batch_size);
> > +
> > +/**
> > + * @brief Performs the proper cleanup on a DSA batch task.
> > + *
> > + * @param task A pointer to the batch task to cleanup.
> > + */
> > +void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task);
> > +
> >  #else
> >
> > +typedef struct QemuDsaBatchTask {} QemuDsaBatchTask;
> > +
> >  static inline bool qemu_dsa_is_running(void)
> >  {
> >      return false;
> > @@ -114,19 +134,27 @@ static inline bool qemu_dsa_is_running(void)
> >
> >  static inline int qemu_dsa_init(const strList *dsa_parameter, Error **errp)
> >  {
> > -    if (dsa_parameter != NULL && strlen(dsa_parameter) != 0) {
> > -        error_setg(errp, "DSA is not supported.");
> > -        return -1;
> > -    }
> > -
> > -    return 0;
> > +    error_setg(errp, "DSA accelerator is not enabled.");
> > +    return -1;
>
> This should have been fixed in the patch that introduced this function.
>

Will be fixed directly in an earlier patch.

> >  }
> >
> >  static inline void qemu_dsa_start(void) {}
> >
> >  static inline void qemu_dsa_stop(void) {}
> >
> > -static inline void qemu_dsa_cleanup(void) {}
>
> Where did this go?
>

I will add it back in the next patch. Actually, in the non-DSA path this
function is not being called, but I agree it should be back for the sake
of completeness.

> > +static inline QemuDsaBatchTask *buffer_zero_batch_task_init(int batch_size)
> > +{
> > +    return NULL;
> > +}
> > +
> > +static inline void buffer_zero_batch_task_destroy(QemuDsaBatchTask *task) {}
> > +
> > +static inline int
> > +buffer_is_zero_dsa_batch_sync(QemuDsaBatchTask *batch_task,
> > +                              const void **buf, size_t count, size_t len)
> > +{
> > +    return -1;
> > +}
> >
> >  #endif
> >
> > diff --git a/util/dsa.c b/util/dsa.c
> > index c3ca71df86..408c163195 100644
> > --- a/util/dsa.c
> > +++ b/util/dsa.c
> > @@ -48,6 +48,7 @@ uint32_t max_retry_count;
> >  static QemuDsaDeviceGroup dsa_group;
> >  static QemuDsaCompletionThread completion_thread;
> >
> > +static void buffer_zero_dsa_completion(void *context);
> >
> >  /**
> >   * @brief This function opens a DSA device's work queue and
> > @@ -174,7 +175,6 @@ dsa_device_group_start(QemuDsaDeviceGroup *group)
> >   *
> >   * @param group A pointer to the DSA device group.
> >   */
> > -__attribute__((unused))
> >  static void
> >  dsa_device_group_stop(QemuDsaDeviceGroup *group)
> >  {
> > @@ -210,7 +210,6 @@ dsa_device_group_cleanup(QemuDsaDeviceGroup *group)
> >   * @return struct QemuDsaDevice* A pointer to the next available DSA device
> >   *         in the group.
> >   */
> > -__attribute__((unused))
> >  static QemuDsaDevice *
> >  dsa_device_group_get_next_device(QemuDsaDeviceGroup *group)
> >  {
> > @@ -283,7 +282,6 @@ dsa_task_enqueue(QemuDsaDeviceGroup *group,
> >   * @param group A pointer to the DSA device group.
> >   * @return QemuDsaBatchTask* The DSA task being dequeued.
> >   */
> > -__attribute__((unused))
> >  static QemuDsaBatchTask *
> >  dsa_task_dequeue(QemuDsaDeviceGroup *group)
> >  {
> > @@ -338,22 +336,6 @@ submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
> >      return 0;
> >  }
> >
> > -/**
> > - * @brief Synchronously submits a DSA work item to the
> > - *        device work queue.
> > - *
> > - * @param wq A pointer to the DSA work queue's device memory.
> > - * @param descriptor A pointer to the DSA work item descriptor.
> > - *
> > - * @return int Zero if successful, non-zero otherwise.
> > - */
> > -__attribute__((unused))
> > -static int
> > -submit_wi(void *wq, struct dsa_hw_desc *descriptor)
> > -{
> > -    return submit_wi_int(wq, descriptor);
> > -}
> > -
>
> Why is this being removed?
>

This is the same as submit_wi_int(), so I feel like there is no need
for another wrapper.

> >  /**
> >   * @brief Asynchronously submits a DSA work item to the
> >   *        device work queue.
> > @@ -362,7 +344,6 @@ submit_wi(void *wq, struct dsa_hw_desc *descriptor)
> >   *
> >   * @return int Zero if successful, non-zero otherwise.
> >   */
> > -__attribute__((unused))
> >  static int
> >  submit_wi_async(QemuDsaBatchTask *task)
> >  {
> > @@ -391,7 +372,6 @@ submit_wi_async(QemuDsaBatchTask *task)
> >   *
> >   * @return int Zero if successful, non-zero otherwise.
> >   */
> > -__attribute__((unused))
> >  static int
> >  submit_batch_wi_async(QemuDsaBatchTask *batch_task)
> >  {
> > @@ -750,3 +730,235 @@ void qemu_dsa_cleanup(void)
> >      dsa_device_group_cleanup(&dsa_group);
> >  }
> >
> > +
> > +/* Buffer zero comparison DSA task implementations */
> > +/* =============================================== */
> > +
> > +/**
> > + * @brief Sets a buffer zero comparison DSA task.
> > + *
> > + * @param descriptor A pointer to the DSA task descriptor.
> > + * @param buf A pointer to the memory buffer.
> > + * @param len The length of the buffer.
> > + */
> > +static void
> > +buffer_zero_task_set_int(struct dsa_hw_desc *descriptor,
> > +                         const void *buf,
> > +                         size_t len)
> > +{
> > +    struct dsa_completion_record *completion =
> > +        (struct dsa_completion_record *)descriptor->completion_addr;
> > +
> > +    descriptor->xfer_size = len;
> > +    descriptor->src_addr = (uintptr_t)buf;
> > +    completion->status = 0;
> > +    completion->result = 0;
> > +}
> > +
> > +/**
> > + * @brief Resets a buffer zero comparison DSA task.
> > + *
> > + * @param task A pointer to the DSA batch task.
> > + */
> > +static void
> > +buffer_zero_task_reset(QemuDsaBatchTask *task)
> > +{
> > +    task->completions[0].status = DSA_COMP_NONE;
> > +    task->task_type = QEMU_DSA_TASK;
> > +    task->status = QEMU_DSA_TASK_READY;
> > +}
> > +
> > +/**
> > + * @brief Resets a buffer zero comparison DSA batch task.
> > + *
> > + * @param task A pointer to the batch task.
> > + * @param count The number of DSA tasks this batch task will contain.
> > + */
> > +static void
> > +buffer_zero_batch_task_reset(QemuDsaBatchTask *task, size_t count)
> > +{
> > +    task->batch_completion.status = DSA_COMP_NONE;
> > +    task->batch_descriptor.desc_count = count;
> > +    task->task_type = QEMU_DSA_BATCH_TASK;
> > +    task->status = QEMU_DSA_TASK_READY;
> > +}
> > +
> > +/**
> > + * @brief Sets a buffer zero comparison DSA task.
> > + *
> > + * @param task A pointer to the DSA task.
> > + * @param buf A pointer to the memory buffer.
> > + * @param len The buffer length.
> > + */
> > +static void
> > +buffer_zero_task_set(QemuDsaBatchTask *task,
> > +                     const void *buf,
> > +                     size_t len)
> > +{
> > +    buffer_zero_task_reset(task);
> > +    buffer_zero_task_set_int(&task->descriptors[0], buf, len);
> > +}
> > +
> > +/**
> > + * @brief Sets a buffer zero comparison batch task.
> > + *
> > + * @param batch_task A pointer to the batch task.
> > + * @param buf An array of memory buffers.
> > + * @param count The number of buffers in the array.
> > + * @param len The length of the buffers.
> > + */
> > +static void
> > +buffer_zero_batch_task_set(QemuDsaBatchTask *batch_task,
> > +                           const void **buf, size_t count, size_t len)
> > +{
> > +    assert(count > 0);
> > +    assert(count <= batch_task->batch_size);
> > +
> > +    buffer_zero_batch_task_reset(batch_task, count);
> > +    for (int i = 0; i < count; i++) {
> > +        buffer_zero_task_set_int(&batch_task->descriptors[i], buf[i], len);
> > +    }
> > +}
> > +
> > +/**
> > + * @brief Asynchronously performs a buffer zero DSA operation.
> > + *
> > + * @param task A pointer to the batch task structure.
> > + * @param buf A pointer to the memory buffer.
> > + * @param len The length of the memory buffer.
> > + *
> > + * @return int Zero if successful, otherwise an appropriate error code.
> > + */
> > +__attribute__((unused))
> > +static int
> > +buffer_zero_dsa_async(QemuDsaBatchTask *task,
> > +                      const void *buf, size_t len)
> > +{
> > +    buffer_zero_task_set(task, buf, len);
> > +
> > +    return submit_wi_async(task);
> > +}
> > +
> > +/**
> > + * @brief Sends a memory comparison batch task to a DSA device and waits
> > + *        for completion.
> > + *
> > + * @param batch_task The batch task to be submitted to DSA device.
> > + * @param buf An array of memory buffers to check for zero.
> > + * @param count The number of buffers.
> > + * @param len The buffer length.
> > + */
> > +__attribute__((unused))
> > +static int
> > +buffer_zero_dsa_batch_async(QemuDsaBatchTask *batch_task,
> > +                            const void **buf, size_t count, size_t len)
> > +{
> > +    assert(count <= batch_task->batch_size);
> > +    buffer_zero_batch_task_set(batch_task, buf, count, len);
> > +
> > +    return submit_batch_wi_async(batch_task);
> > +}
> > +
> > +/**
> > + * @brief The completion callback function for buffer zero
> > + *        comparison DSA task completion.
> > + *
> > + * @param context A pointer to the callback context.
> > + */
> > +static void
> > +buffer_zero_dsa_completion(void *context)
> > +{
> > +    assert(context != NULL);
> > +
> > +    QemuDsaBatchTask *task = (QemuDsaBatchTask *)context;
> > +    qemu_sem_post(&task->sem_task_complete);
> > +}
> > +
> > +/**
> > + * @brief Waits for the asynchronous DSA task to complete.
> > + *
> > + * @param batch_task A pointer to the buffer zero comparison batch task.
> > + */
> > +__attribute__((unused))
> > +static void
> > +buffer_zero_dsa_wait(QemuDsaBatchTask *batch_task)
> > +{
> > +    qemu_sem_wait(&batch_task->sem_task_complete);
> > +}
> > +
> > +/**
> > + * @brief Initializes a buffer zero comparison DSA task.
> > + *
> > + * @param descriptor A pointer to the DSA task descriptor.
> > + * @param completion A pointer to the DSA task completion record.
> > + */
> > +static void
> > +buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
> > +                          struct dsa_completion_record *completion)
> > +{
> > +    descriptor->opcode = DSA_OPCODE_COMPVAL;
> > +    descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
> > +    descriptor->comp_pattern = (uint64_t)0;
> > +    descriptor->completion_addr = (uint64_t)completion;
> > +}
> > +
> > +/**
> > + * @brief Initializes a buffer zero DSA batch task.
> > + *
> > + * @param batch_size The number of zero page checking tasks in the batch.
> > + * @return A pointer to the initialized batch task.
> > + */
> > +QemuDsaBatchTask *
> > +buffer_zero_batch_task_init(int batch_size)
> > +{
> > +    QemuDsaBatchTask *task = qemu_memalign(64, sizeof(QemuDsaBatchTask));
> > +    int descriptors_size = sizeof(*task->descriptors) * batch_size;
> > +
> > +    memset(task, 0, sizeof(*task));
> > +    task->addr = g_new0(ram_addr_t, batch_size);
> > +    task->results = g_new0(bool, batch_size);
> > +    task->batch_size = batch_size;
> > +    task->descriptors =
> > +        (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
> > +    memset(task->descriptors, 0, descriptors_size);
> > +    task->completions = (struct dsa_completion_record *)qemu_memalign(
> > +        32, sizeof(*task->completions) * batch_size);
> > +
> > +    task->batch_completion.status = DSA_COMP_NONE;
> > +    task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
> > +    /* TODO: Ensure that we never send a batch with count <= 1 */
> > +    task->batch_descriptor.desc_count = 0;
> > +    task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
> > +    task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
> > +    task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
> > +    task->status = QEMU_DSA_TASK_READY;
> > +    task->group = &dsa_group;
> > +    task->device = dsa_device_group_get_next_device(&dsa_group);
> > +
> > +    for (int i = 0; i < task->batch_size; i++) {
> > +        buffer_zero_task_init_int(&task->descriptors[i],
> > +                                  &task->completions[i]);
> > +    }
> > +
> > +    qemu_sem_init(&task->sem_task_complete, 0);
> > +    task->completion_callback = buffer_zero_dsa_completion;
> > +
> > +    return task;
> > +}
> > +
> > +/**
> > + * @brief Performs the proper cleanup on a DSA batch task.
> > + *
> > + * @param task A pointer to the batch task to cleanup.
> > + */
> > +void
> > +buffer_zero_batch_task_destroy(QemuDsaBatchTask *task)
> > +{
> > +    g_free(task->addr);
> > +    g_free(task->results);
> > +    qemu_vfree(task->descriptors);
> > +    qemu_vfree(task->completions);
> > +    task->results = NULL;
> > +    qemu_sem_destroy(&task->sem_task_complete);
> > +    qemu_vfree(task);
> > +}


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path.
  2024-11-21 20:50   ` Fabiano Rosas
@ 2024-11-26  4:41     ` Yichen Wang
  2024-11-26 13:20       ` Fabiano Rosas
  0 siblings, 1 reply; 30+ messages in thread
From: Yichen Wang @ 2024-11-26  4:41 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel, Hao Xiang,
	Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang

On Thu, Nov 21, 2024 at 12:52 PM Fabiano Rosas <farosas@suse.de> wrote:
>
> Yichen Wang <yichen.wang@bytedance.com> writes:
>
> > From: Hao Xiang <hao.xiang@linux.dev>
> >
> > Multifd sender path gets an array of pages queued by the migration
> > thread. It performs zero page checking on every page in the array.
> > The pages are classified as either a zero page or a normal page. This
> > change uses Intel DSA to offload the zero page checking from CPU to
> > the DSA accelerator. The sender thread submits a batch of pages to DSA
> > hardware and waits for the DSA completion thread to signal for work
> > completion.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> > Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> > ---
> >  migration/multifd-zero-page.c | 129 ++++++++++++++++++++++++++++++----
> >  migration/multifd.c           |  29 +++++++-
> >  migration/multifd.h           |   5 ++
> >  3 files changed, 147 insertions(+), 16 deletions(-)
> >
> > diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
> > index f1e988a959..639aed9f6b 100644
> > --- a/migration/multifd-zero-page.c
> > +++ b/migration/multifd-zero-page.c
> > @@ -21,7 +21,9 @@
> >
> >  static bool multifd_zero_page_enabled(void)
> >  {
> > -    return migrate_zero_page_detection() == ZERO_PAGE_DETECTION_MULTIFD;
> > +    ZeroPageDetection curMethod = migrate_zero_page_detection();
> > +    return (curMethod == ZERO_PAGE_DETECTION_MULTIFD ||
> > +            curMethod == ZERO_PAGE_DETECTION_DSA_ACCEL);
> >  }
> >
> >  static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
> > @@ -37,26 +39,49 @@ static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
> >      pages_offset[b] = temp;
> >  }
> >
> > +#ifdef CONFIG_DSA_OPT
> > +
> > +static void swap_result(bool *results, int a, int b)
> > +{
> > +    bool temp;
> > +
> > +    if (a == b) {
> > +        return;
> > +    }
> > +
> > +    temp = results[a];
> > +    results[a] = results[b];
> > +    results[b] = temp;
> > +}
> > +
> >  /**
> > - * multifd_send_zero_page_detect: Perform zero page detection on all pages.
> > + * zero_page_detect_dsa: Perform zero page detection using
> > + * Intel Data Streaming Accelerator (DSA).
> >   *
> > - * Sorts normal pages before zero pages in p->pages->offset and updates
> > - * p->pages->normal_num.
> > + * Sorts normal pages before zero pages in pages->offset and updates
> > + * pages->normal_num.
> >   *
> >   * @param p A pointer to the send params.
> >   */
> > -void multifd_send_zero_page_detect(MultiFDSendParams *p)
> > +static void zero_page_detect_dsa(MultiFDSendParams *p)
> >  {
> >      MultiFDPages_t *pages = &p->data->u.ram;
> >      RAMBlock *rb = pages->block;
> > -    int i = 0;
> > -    int j = pages->num - 1;
> > +    bool *results = p->dsa_batch_task->results;
> >
> > -    if (!multifd_zero_page_enabled()) {
> > -        pages->normal_num = pages->num;
> > -        goto out;
> > +    for (int i = 0; i < pages->num; i++) {
> > +        p->dsa_batch_task->addr[i] =
> > +            (ram_addr_t)(rb->host + pages->offset[i]);
> >      }
> >
> > +    buffer_is_zero_dsa_batch_sync(p->dsa_batch_task,
> > +                                  (const void **)p->dsa_batch_task->addr,
> > +                                  pages->num,
> > +                                  multifd_ram_page_size());
> > +
> > +    int i = 0;
> > +    int j = pages->num - 1;
> > +
> >      /*
> >       * Sort the page offset array by moving all normal pages to
> >       * the left and all zero pages to the right of the array.
> > @@ -64,23 +89,39 @@ void multifd_send_zero_page_detect(MultiFDSendParams *p)
> >      while (i <= j) {
> >          uint64_t offset = pages->offset[i];
> >
> > -        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
> > +        if (!results[i]) {
> >              i++;
> >              continue;
> >          }
> >
> > +        swap_result(results, i, j);
> >          swap_page_offset(pages->offset, i, j);
> >          ram_release_page(rb->idstr, offset);
> >          j--;
> >      }
> >
> >      pages->normal_num = i;
> > +}
> >
> > -out:
> > -    stat64_add(&mig_stats.normal_pages, pages->normal_num);
> > -    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
> > +void multifd_dsa_cleanup(void)
> > +{
> > +    qemu_dsa_cleanup();
> > +}
> > +
> > +#else
> > +
> > +static void zero_page_detect_dsa(MultiFDSendParams *p)
> > +{
> > +    g_assert_not_reached();
> > +}
> > +
> > +void multifd_dsa_cleanup(void)
> > +{
> > +    return;
> >  }
> >
> > +#endif
> > +
> >  void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> >  {
> >      for (int i = 0; i < p->zero_num; i++) {
> > @@ -92,3 +133,63 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> >          }
> >      }
> >  }
> > +
> > +/**
> > + * zero_page_detect_cpu: Perform zero page detection using CPU.
> > + *
> > + * Sorts normal pages before zero pages in p->pages->offset and updates
> > + * p->pages->normal_num.
> > + *
> > + * @param p A pointer to the send params.
> > + */
> > +static void zero_page_detect_cpu(MultiFDSendParams *p)
> > +{
> > +    MultiFDPages_t *pages = &p->data->u.ram;
> > +    RAMBlock *rb = pages->block;
> > +    int i = 0;
> > +    int j = pages->num - 1;
> > +
> > +    /*
> > +     * Sort the page offset array by moving all normal pages to
> > +     * the left and all zero pages to the right of the array.
> > +     */
> > +    while (i <= j) {
> > +        uint64_t offset = pages->offset[i];
> > +
> > +        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
> > +            i++;
> > +            continue;
> > +        }
> > +
> > +        swap_page_offset(pages->offset, i, j);
> > +        ram_release_page(rb->idstr, offset);
> > +        j--;
> > +    }
> > +
> > +    pages->normal_num = i;
> > +}
> > +
> > +/**
> > + * multifd_send_zero_page_detect: Perform zero page detection on all pages.
> > + *
> > + * @param p A pointer to the send params.
> > + */
> > +void multifd_send_zero_page_detect(MultiFDSendParams *p)
> > +{
> > +    MultiFDPages_t *pages = &p->data->u.ram;
> > +
> > +    if (!multifd_zero_page_enabled()) {
> > +        pages->normal_num = pages->num;
> > +        goto out;
> > +    }
> > +
> > +    if (qemu_dsa_is_running()) {
> > +        zero_page_detect_dsa(p);
> > +    } else {
> > +        zero_page_detect_cpu(p);
> > +    }
> > +
> > +out:
> > +    stat64_add(&mig_stats.normal_pages, pages->normal_num);
> > +    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
> > +}
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 4374e14a96..689acceff2 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -13,6 +13,7 @@
> >  #include "qemu/osdep.h"
> >  #include "qemu/cutils.h"
> >  #include "qemu/rcu.h"
> > +#include "qemu/dsa.h"
> >  #include "exec/target_page.h"
> >  #include "sysemu/sysemu.h"
> >  #include "exec/ramblock.h"
> > @@ -462,6 +463,8 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> >      p->name = NULL;
> >      g_free(p->data);
> >      p->data = NULL;
> > +    buffer_zero_batch_task_destroy(p->dsa_batch_task);
> > +    p->dsa_batch_task = NULL;
> >      p->packet_len = 0;
> >      g_free(p->packet);
> >      p->packet = NULL;
> > @@ -493,6 +496,8 @@ void multifd_send_shutdown(void)
> >
> >      multifd_send_terminate_threads();
> >
> > +    multifd_dsa_cleanup();
> > +
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >          Error *local_err = NULL;
> > @@ -814,11 +819,31 @@ bool multifd_send_setup(void)
> >      uint32_t page_count = multifd_ram_page_count();
> >      bool use_packets = multifd_use_packets();
> >      uint8_t i;
> > +    Error *local_err = NULL;
> >
> >      if (!migrate_multifd()) {
> >          return true;
> >      }
> >
> > +    if (s &&
> > +        s->parameters.zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
> > +        // Populate the dsa device path from accel-path
>
> scripts/checkpatch.pl would have rejected this.
>

Sorry. I will make sure to run checkpatch.pl and the unit tests (both
with and without DSA) before running send-email...
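
For the record, that is roughly:

    $ ./scripts/checkpatch.pl <patch files>
    $ make check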

> > +        const strList *accel_path = migrate_accel_path();
> > +        g_autofree strList *dsa_parameter = g_malloc0(sizeof(strList));
> > +        strList **tail = &dsa_parameter;
> > +        while (accel_path) {
> > +            if (strncmp(accel_path->value, "dsa:", 4) == 0) {
> > +                QAPI_LIST_APPEND(tail, &accel_path->value[4]);
> > +            }
> > +            accel_path = accel_path->next;
> > +        }
>
> The parsing of the parameter should be in options.c. In fact, Peter
> suggested in v4 making all of this a multifd_dsa_send_setup() or
> multifd_dsa_init(); I think that's a good idea.
>

Will fix it in the next version.

> > +        if (qemu_dsa_init(dsa_parameter, &local_err)) {
> > +            ret = -1;
>
> migrate_set_error(s, local_err);
> goto err;

Will fix it in the next version. But here we can't goto err, because
the cleanup() function will be called when setup() fails, and it
assumes certain data structures are already in place. If we bail out
earlier, the cleanup() function will complain and fail.

>
> > +        } else {
> > +            qemu_dsa_start();
> > +        }
> > +    }
> > +
> >      thread_count = migrate_multifd_channels();
> >      multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
> >      multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> > @@ -829,12 +854,12 @@ bool multifd_send_setup(void)
> >
> >      for (i = 0; i < thread_count; i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > -        Error *local_err = NULL;
> >
> >          qemu_sem_init(&p->sem, 0);
> >          qemu_sem_init(&p->sem_sync, 0);
> >          p->id = i;
> >          p->data = multifd_send_data_alloc();
> > +        p->dsa_batch_task = buffer_zero_batch_task_init(page_count);
> >
> >          if (use_packets) {
> >              p->packet_len = sizeof(MultiFDPacket_t)
> > @@ -865,7 +890,6 @@ bool multifd_send_setup(void)
> >
> >      for (i = 0; i < thread_count; i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > -        Error *local_err = NULL;
> >
> >          ret = multifd_send_state->ops->send_setup(p, &local_err);
> >          if (ret) {
> > @@ -1047,6 +1071,7 @@ void multifd_recv_cleanup(void)
> >              qemu_thread_join(&p->thread);
> >          }
> >      }
> > +    multifd_dsa_cleanup();
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          multifd_recv_cleanup_channel(&multifd_recv_state->params[i]);
> >      }
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 50d58c0c9c..e293ddbc1d 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -15,6 +15,7 @@
> >
> >  #include "exec/target_page.h"
> >  #include "ram.h"
> > +#include "qemu/dsa.h"
> >
> >  typedef struct MultiFDRecvData MultiFDRecvData;
> >  typedef struct MultiFDSendData MultiFDSendData;
> > @@ -155,6 +156,9 @@ typedef struct {
> >      bool pending_sync;
> >      MultiFDSendData *data;
> >
> > +    /* Zero page checking batch task */
> > +    QemuDsaBatchTask *dsa_batch_task;
> > +
> >      /* thread local variables. No locking required */
> >
> >      /* pointer to the packet */
> > @@ -313,6 +317,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p);
> >  bool multifd_send_prepare_common(MultiFDSendParams *p);
> >  void multifd_send_zero_page_detect(MultiFDSendParams *p);
> >  void multifd_recv_zero_page_process(MultiFDRecvParams *p);
> > +void multifd_dsa_cleanup(void);
> >
> >  static inline void multifd_send_prepare_header(MultiFDSendParams *p)
> >  {


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration.
  2024-11-19 21:31 ` [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Fabiano Rosas
@ 2024-11-26  4:43   ` Yichen Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Yichen Wang @ 2024-11-26  4:43 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel, Hao Xiang,
	Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang

On Tue, Nov 19, 2024 at 1:31 PM Fabiano Rosas <farosas@suse.de> wrote:
>
> Yichen Wang <yichen.wang@bytedance.com> writes:
>
> > v7
> > * Rebase on top of f0a5a31c33a8109061c2493e475c8a2f4d022432;
> > * Fix a bug that will crash QEMU when DSA initialization failed;
> > * Use a more generalized accel-path to support other accelerators;
> > * Remove multifd-packet-size in the parameter list;
> >
> > v6
> > * Rebase on top of 838fc0a8769d7cc6edfe50451ba4e3368395f5c1;
> > * Refactor code to have clean history on all commits;
> > * Add comments on DSA specific defines about how the value is picked;
> > * Address all comments from v5 reviews about api defines, questions, etc.;
> >
> > v5
> > * Rebase on top of 39a032cea23e522268519d89bb738974bc43b6f6.
> > * Rename struct definitions with typedef and CamelCase names;
> > * Add build and runtime checks about DSA accelerator;
> > * Address all comments from v4 reviews about typos, licenses, comments,
> > error reporting, etc.
> >
> > v4
> > * Rebase on top of 85b597413d4370cb168f711192eaef2eb70535ac.
> > * A separate "multifd zero page checking" patchset was split from this
> > patchset's v3 and got merged into master. v4 re-applied the rest of all
> > commits on top of that patchset, re-factored and re-tested.
> > https://lore.kernel.org/all/20240311180015.3359271-1-hao.xiang@linux.dev/
> > * There are some feedback from v3 I likely overlooked.
> >
> > v3
> > * Rebase on top of 7425b6277f12e82952cede1f531bfc689bf77fb1.
> > * Fix error/warning from checkpatch.pl
> > * Fix use-after-free bug when multifd-dsa-accel option is not set.
> > * Handle error from dsa_init and correctly propogate the error.
> > * Remove unnecessary call to dsa_stop.
> > * Detect availability of DSA feature at compile time.
> > * Implement a generic batch_task structure and a DSA specific one dsa_batch_task.
> > * Remove all exit() calls and propagate errors correctly.
> > * Use bytes instead of page count to configure multifd-packet-size option.
> >
> > v2
> > * Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
> > * Leave Juan's changes in their original form instead of squashing them.
> > * Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
> > * Use page count to configure multifd-packet-size option.
> > * Don't use the FLAKY flag in DSA tests.
> > * Test if DSA integration test is setup correctly and skip the test if
> > * not.
> > * Fixed broken link in the previous patch cover.
> >
> > * Background:
> >
> > I posted an RFC about DSA offloading in QEMU:
> > https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/
> >
> > This patchset implements the DSA offloading on zero page checking in
> > multifd live migration code path.
> >
> > * Overview:
> >
> > Intel Data Streaming Accelerator(DSA) is introduced in Intel's 4th generation
> > Xeon server, aka Sapphire Rapids.
> > https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
> > https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
> > One of the things DSA can do is to offload memory comparison workload from
> > CPU to DSA accelerator hardware. This patchset implements a solution to offload
> > QEMU's zero page checking from CPU to DSA accelerator hardware. We gain
> > two benefits from this change:
> > 1. Reduces CPU usage in multifd live migration workflow across all use
> > cases.
> > 2. Reduces migration total time in some use cases.
> >
> > * Design:
> >
> > These are the logical steps to perform DSA offloading:
> > 1. Configure the DSA accelerators and, via the idxd driver, create DSA
> > work queues that can be opened from user space.
> > 2. Map DSA's work queue into a user space address space.
> > 3. Fill an in-memory task descriptor to describe the memory operation.
> > 4. Use the dedicated CPU instruction _enqcmd to queue a task descriptor
> > to the work queue.
> > 5. Poll the task descriptor's completion status field until the task
> > completes.
> > 6. Check the return status.
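
Note for reviewers: steps 4-6 boil down to something like the sketch
below. This is illustrative only, not the exact util/dsa.c code; it
needs -menqcmd to build, struct dsa_hw_desc, struct
dsa_completion_record and the DSA_COMP_* values come from linux/idxd.h,
wq_portal is the mmap'ed work queue portal, and the caller is assumed
to have pointed desc->completion_addr at *comp:

    #include <linux/idxd.h>
    #include <x86intrin.h>

    static int submit_and_wait(void *wq_portal, struct dsa_hw_desc *desc,
                               volatile struct dsa_completion_record *comp)
    {
        comp->status = DSA_COMP_NONE;
        _mm_sfence();                       /* make descriptor writes visible */
        while (_enqcmd(wq_portal, desc)) {  /* step 4: retry while the WQ is busy */
            _mm_pause();
        }
        while (comp->status == DSA_COMP_NONE) {
            _mm_pause();                    /* step 5: busy-poll the status */
        }
        return comp->status == DSA_COMP_SUCCESS ? 0 : -1;  /* step 6 */
    }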
> >
> > The memory operation is now totally done by the accelerator hardware but
> > the new workflow introduces overheads. The overheads are the extra CPU
> > cost of preparing and submitting the task descriptors and the extra CPU
> > cost of polling for completion. The design is centered on minimizing
> > these two overheads.
> >
> > 1. In order to reduce the overhead on task preparation and submission,
> > we use batch descriptors. A batch descriptor will contain N individual
> > zero page checking tasks where the default N is 128 (default packet size
> > / page size) and we can increase N by setting the packet size via a new
> > migration option.
> > 2. The multifd sender threads prepare and submit batch tasks to the DSA
> > hardware and wait on a synchronization object for task completion.
> > Whenever a DSA task is submitted, the task structure is added to a
> > thread-safe queue. It is safe for multiple multifd sender threads to
> > submit tasks concurrently.
> > 3. Multiple DSA hardware devices can be used. During multifd initialization,
> > every sender thread will be assigned a DSA device to work with. We
> > use a round-robin scheme to evenly distribute the work across all used
> > DSA devices.
> > 4. Use a dedicated thread dsa_completion to perform busy polling for all
> > DSA task completions. The thread keeps dequeuing DSA tasks from the
> > thread-safe queue and blocks when there is no outstanding DSA task.
> > When polling for completion of a DSA task, the thread uses the CPU
> > instruction _mm_pause between the iterations of the busy loop to save
> > some CPU power and to free up core resources for the sibling hyperthread.
> > 5. The DSA accelerator can encounter errors. The most common error is a
> > page fault. We have tested having the device handle page faults, but the
> > performance is poor. Right now, if DSA hits a page fault, we fall back
> > to the CPU to complete the rest of the work. The CPU fallback is done in
> > the multifd sender thread (see the sketch after this list).
> > 6. Added a new migration option multifd-dsa-accel to set the DSA device
> > path. If set, the multifd workflow will leverage the DSA devices for
> > offloading.
> > 7. Added a new migration option multifd-normal-page-ratio to make
> > multifd live migration easier to test. Setting a normal page ratio will
> > make live migration recognize a zero page as a normal page and send
> > the entire payload over the network. If we want to send a large network
> > payload and analyze throughput, this option is useful.
> > 8. Added a new migration option multifd-packet-size. This can increase
> > the number of pages being zero page checked and sent over the network.
> > The extra synchronization between the sender threads and the dsa
> > completion thread is an overhead. Using a large packet size can reduce
> > that overhead.
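
And the CPU fallback in item 5 is conceptually just the loop below
(again illustrative, not the exact patch code; DSA_COMP_PAGE_FAULT_NOBOF
and DSA_COMP_SUCCESS are from linux/idxd.h, and the fields are the ones
QemuDsaBatchTask already carries):

    size_t len = multifd_ram_page_size();

    for (int i = 0; i < task->batch_size; i++) {
        struct dsa_completion_record *comp = &task->completions[i];

        if (comp->status == DSA_COMP_SUCCESS) {
            /* COMPVAL against pattern 0: result == 0 means all zero */
            task->results[i] = (comp->result == 0);
        } else if (comp->status == DSA_COMP_PAGE_FAULT_NOBOF) {
            /* the device took a page fault: redo this page on the CPU */
            task->results[i] = buffer_is_zero((void *)task->addr[i], len);
        }
    }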
> >
> > * Performance:
> >
> > We use two Intel 4th generation Xeon servers for testing.
> >
> > Architecture:        x86_64
> > CPU(s):              192
> > Thread(s) per core:  2
> > Core(s) per socket:  48
> > Socket(s):           2
> > NUMA node(s):        2
> > Vendor ID:           GenuineIntel
> > CPU family:          6
> > Model:               143
> > Model name:          Intel(R) Xeon(R) Platinum 8457C
> > Stepping:            8
> > CPU MHz:             2538.624
> > CPU max MHz:         3800.0000
> > CPU min MHz:         800.0000
> >
> > We perform multifd live migration with the setup below:
> > 1. VM has 100GB memory.
> > 2. Use the new migration option multifd-set-normal-page-ratio to control the total
> > size of the payload sent over the network.
> > 3. Use 8 multifd channels.
> > 4. Use tcp for live migration.
> > 5. Use CPU to perform zero page checking as the baseline.
> > 6. Use one DSA device to offload zero page checking to compare with the baseline.
> > 7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.
> >
> > A) Scenario 1: 50% (50GB) normal pages on a 100GB vm.
> >
> >       CPU usage
> >
> >       |---------------|---------------|---------------|---------------|
> >       |               |comm           |runtime(msec)  |totaltime(msec)|
> >       |---------------|---------------|---------------|---------------|
> >       |Baseline       |live_migration |5657.58        |               |
> >       |               |multifdsend_0  |3931.563       |               |
> >       |               |multifdsend_1  |4405.273       |               |
> >       |               |multifdsend_2  |3941.968       |               |
> >       |               |multifdsend_3  |5032.975       |               |
> >       |               |multifdsend_4  |4533.865       |               |
> >       |               |multifdsend_5  |4530.461       |               |
> >       |               |multifdsend_6  |5171.916       |               |
> >       |               |multifdsend_7  |4722.769       |41922          |
> >       |---------------|---------------|---------------|---------------|
> >       |DSA            |live_migration |6129.168       |               |
> >       |               |multifdsend_0  |2954.717       |               |
> >       |               |multifdsend_1  |2766.359       |               |
> >       |               |multifdsend_2  |2853.519       |               |
> >       |               |multifdsend_3  |2740.717       |               |
> >       |               |multifdsend_4  |2824.169       |               |
> >       |               |multifdsend_5  |2966.908       |               |
> >       |               |multifdsend_6  |2611.137       |               |
> >       |               |multifdsend_7  |3114.732       |               |
> >       |               |dsa_completion |3612.564       |32568          |
> >       |---------------|---------------|---------------|---------------|
> >
> > Baseline total runtime is calculated by adding up all multifdsend_X
> > and live_migration threads runtime. DSA offloading total runtime is
> > calculated by adding up all multifdsend_X, live_migration and
> > dsa_completion threads runtime. 41922 msec vs. 32568 msec runtime, and
> > that is a 23% total CPU usage saving.
> >
> >       Latency
> >       |---------------|---------------|---------------|---------------|---------------|---------------|
> >       |               |total time     |down time      |throughput     |transferred-ram|total-ram      |
> >       |---------------|---------------|---------------|---------------|---------------|---------------|
> >       |Baseline       |10343 ms       |161 ms         |41007.00 mbps  |51583797 kb    |102400520 kb   |
> >       |---------------|---------------|---------------|---------------|---------------|---------------|
> >       |DSA offload    |9535 ms        |135 ms         |46554.40 mbps  |53947545 kb    |102400520 kb   |
> >       |---------------|---------------|---------------|---------------|---------------|---------------|
> >
> > Total time is 8% faster and down time is 16% faster.
> >
> > B) Scenario 2: 100% (100GB) zero pages on a 100GB vm.
> >
> >       CPU usage
> >       |---------------|---------------|---------------|---------------|
> >       |               |comm           |runtime(msec)  |totaltime(msec)|
> >       |---------------|---------------|---------------|---------------|
> >       |Baseline       |live_migration |4860.718       |               |
> >       |               |multifdsend_0  |748.875        |               |
> >       |               |multifdsend_1  |898.498        |               |
> >       |               |multifdsend_2  |787.456        |               |
> >       |               |multifdsend_3  |764.537        |               |
> >       |               |multifdsend_4  |785.687        |               |
> >       |               |multifdsend_5  |756.941        |               |
> >       |               |multifdsend_6  |774.084        |               |
> >       |               |multifdsend_7  |782.900        |11154          |
> >       |---------------|---------------|---------------|---------------|
> >       |DSA offloading |live_migration |3846.976       |               |
> >       |               |multifdsend_0  |191.880        |               |
> >       |               |multifdsend_1  |166.331        |               |
> >       |               |multifdsend_2  |168.528        |               |
> >       |               |multifdsend_3  |197.831        |               |
> >       |               |multifdsend_4  |169.580        |               |
> >       |               |multifdsend_5  |167.984        |               |
> >       |               |multifdsend_6  |198.042        |               |
> >       |               |multifdsend_7  |170.624        |               |
> >       |               |dsa_completion |3428.669       |8700           |
> >       |---------------|---------------|---------------|---------------|
> >
> > Baseline total runtime is 11154 msec and DSA offloading total runtime is
> > 8700 msec. That is 22% CPU savings.
> >
> >       Latency
> >       |---------------|---------------|---------------|---------------|---------------|------------|
> >       |               |total time     |down time      |throughput     |transferred-ram|total-ram   |
> >       |---------------|---------------|---------------|---------------|---------------|------------|
> >       |Baseline       |4867 ms        |20 ms          |1.51 mbps      |565 kb         |102400520 kb|
> >       |---------------|---------------|---------------|---------------|---------------|------------|
> >       |DSA offload    |3888 ms        |18 ms          |1.89 mbps      |565 kb         |102400520 kb|
> >       |---------------|---------------|---------------|---------------|---------------|------------|
> >
> > Total time is 20% faster and down time is 10% faster.
> >
> > * Testing:
> >
> > 1. Added unit tests to cover the added code paths in dsa.c.
> > 2. Added integration tests to cover multifd live migration using DSA
> > offloading.
> >
> > Hao Xiang (10):
> >   meson: Introduce new instruction set enqcmd to the build system.
> >   util/dsa: Implement DSA device start and stop logic.
> >   util/dsa: Implement DSA task enqueue and dequeue.
> >   util/dsa: Implement DSA task asynchronous completion thread model.
> >   util/dsa: Implement zero page checking in DSA task.
> >   util/dsa: Implement DSA task asynchronous submission and wait for
> >     completion.
> >   migration/multifd: Add new migration option for multifd DSA
> >     offloading.
> >   migration/multifd: Enable DSA offloading in multifd sender path.
> >   util/dsa: Add unit test coverage for Intel DSA task submission and
> >     completion.
> >   migration/multifd: Add integration tests for multifd with Intel DSA
> >     offloading.
> >
> > Yichen Wang (1):
> >   util/dsa: Add idxd into linux header copy list.
> >
> > Yuan Liu (1):
> >   migration/doc: Add DSA zero page detection doc
> >
> >  .../migration/dsa-zero-page-detection.rst     |  290 +++++
> >  docs/devel/migration/features.rst             |    1 +
> >  hmp-commands.hx                               |    2 +-
> >  include/qemu/dsa.h                            |  188 +++
> >  meson.build                                   |   14 +
> >  meson_options.txt                             |    2 +
> >  migration/migration-hmp-cmds.c                |   19 +-
> >  migration/multifd-zero-page.c                 |  129 +-
> >  migration/multifd.c                           |   29 +-
> >  migration/multifd.h                           |    5 +
> >  migration/options.c                           |   30 +
> >  migration/options.h                           |    1 +
> >  qapi/migration.json                           |   32 +-
> >  scripts/meson-buildoptions.sh                 |    3 +
> >  scripts/update-linux-headers.sh               |    2 +-
> >  tests/qtest/migration-test.c                  |   80 +-
> >  tests/unit/meson.build                        |    6 +
> >  tests/unit/test-dsa.c                         |  503 ++++++++
> >  util/dsa.c                                    | 1112 +++++++++++++++++
> >  util/meson.build                              |    3 +
> >  20 files changed, 2427 insertions(+), 24 deletions(-)
> >  create mode 100644 docs/devel/migration/dsa-zero-page-detection.rst
> >  create mode 100644 include/qemu/dsa.h
> >  create mode 100644 tests/unit/test-dsa.c
> >  create mode 100644 util/dsa.c
>
> Hi, take a look at make check, there are some tests failing.
>
> Summary of Failures:
>
>  16/474 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp                     ERROR            0.86s   killed by signal 6 SIGABRT
>  18/474 qemu:qtest+qtest-ppc64 / qtest-ppc64/test-hmp                       ERROR            0.93s   killed by signal 6 SIGABRT
>  20/474 qemu:qtest+qtest-aarch64 / qtest-aarch64/test-hmp                   ERROR            1.30s   killed by signal 6 SIGABRT
>  21/474 qemu:qtest+qtest-s390x / qtest-s390x/test-hmp                       ERROR            0.76s   killed by signal 6 SIGABRT
>  22/474 qemu:qtest+qtest-riscv64 / qtest-riscv64/test-hmp                   ERROR            0.60s   killed by signal 6 SIGABRT
>
> Looks like a double-free due to glib autofree. Here's one sample:
>
> #0  __GI_abort () at abort.c:49
> #1  0x00007ffff5899c87 in __libc_message (action=do_abort, fmt=0x7ffff59c3138 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
> #2  0x00007ffff58a1d2a in malloc_printerr (str=0x7ffff59c0e0e "free(): invalid pointer") at malloc.c:5347
> #3  0x00007ffff58a37d4 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
> #4  0x00007ffff78c5639 in g_free (mem=0x5555561200f1 <qemu_mutex_unlock_impl+96>) at ../glib/gmem.c:199
> #5  0x0000555555bdd527 in g_autoptr_cleanup_generic_gfree (p=0x7fffffffd568) at /usr/include/glib-2.0/glib/glib-autocleanups.h:28
> #6  0x0000555555bdfabc in hmp_migrate_set_parameter (mon=0x7fffffffd6f0, qdict=0x555558554560) at ../migration/migration-hmp-cmds.c:577
> #7  0x0000555555c1a231 in handle_hmp_command_exec (mon=0x7fffffffd6f0, cmd=0x5555571e7450 <hmp_cmds+4560>, qdict=0x555558554560) at ../monitor/hmp.c:1106
> #8  0x0000555555c1a470 in handle_hmp_command (mon=0x7fffffffd6f0, cmdline=0x5555577ec2f6 "xbzrle-cache-size 64k") at ../monitor/hmp.c:1158
> #9  0x0000555555c1c40e in qmp_human_monitor_command (command_line=0x5555577ec2e0 "migrate_set_parameter xbzrle-cache-size 64k", has_cpu_index=false, cpu_index=0, errp=0x7fffffffd800)
>     at ../monitor/qmp-cmds.c:181
> #10 0x00005555560c7eb6 in qmp_marshal_human_monitor_command (args=0x7fffe000ac00, ret=0x7ffff4d25da8, errp=0x7ffff4d25da0) at qapi/qapi-commands-misc.c:347
> #11 0x000055555610e7a4 in do_qmp_dispatch_bh (opaque=0x7ffff4d25e40) at ../qapi/qmp-dispatch.c:128
> #12 0x000055555613a1b9 in aio_bh_call (bh=0x7fffe0004050) at ../util/async.c:172
> #13 0x000055555613a2d5 in aio_bh_poll (ctx=0x5555573df400) at ../util/async.c:219
> #14 0x000055555611b8cd in aio_dispatch (ctx=0x5555573df400) at ../util/aio-posix.c:424
> #15 0x000055555613a712 in aio_ctx_dispatch (source=0x5555573df400, callback=0x0, user_data=0x0) at ../util/async.c:361
> #16 0x00007ffff78bf82b in g_main_dispatch (context=0x5555573e3440) at ../glib/gmain.c:3381
> #17 g_main_context_dispatch (context=0x5555573e3440) at ../glib/gmain.c:4099
> #18 0x000055555613bdae in glib_pollfds_poll () at ../util/main-loop.c:287
> #19 0x000055555613be28 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:310
> #20 0x000055555613bf2d in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
> #21 0x0000555555bb455c in qemu_main_loop () at ../system/runstate.c:835
> #22 0x00005555560594d1 in qemu_default_main () at ../system/main.c:37
> #23 0x000055555605950c in main (argc=18, argv=0x7fffffffdc18) at ../system/main.c:48

Fixed. Interesting that the g_auto(GStrv) macro won't work inside
switch-case statements. I switched back to a plain char ** with
g_strfreev() to get it working.
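
The distilled problem and workaround look roughly like this (names are
made up for illustration; this is not the actual migration-hmp-cmds.c
hunk):

    #include <glib.h>

    static void set_param(int which, const char *valuestr)
    {
        switch (which) {
        case 0: {
            /* Declaring g_auto(GStrv) strv directly under a case label
             * (with no enclosing block) lets a jump to a later label
             * skip its initializer while the cleanup still runs at the
             * end of the switch scope, i.e. on an uninitialized
             * pointer. A plain pointer freed explicitly sidesteps
             * that. */
            char **strv = g_strsplit(valuestr, ",", -1);
            /* ... consume strv ... */
            g_strfreev(strv);
            break;
        }
        default:
            break;
        }
    }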


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path.
  2024-11-26  4:41     ` [External] " Yichen Wang
@ 2024-11-26 13:20       ` Fabiano Rosas
  2024-12-03  3:43         ` Yichen Wang
  0 siblings, 1 reply; 30+ messages in thread
From: Fabiano Rosas @ 2024-11-26 13:20 UTC (permalink / raw)
  To: Yichen Wang
  Cc: Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel, Hao Xiang,
	Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang

Yichen Wang <yichen.wang@bytedance.com> writes:

> On Thu, Nov 21, 2024 at 12:52 PM Fabiano Rosas <farosas@suse.de> wrote:
>>
>> Yichen Wang <yichen.wang@bytedance.com> writes:
>>
>> > From: Hao Xiang <hao.xiang@linux.dev>
>> >
>> > Multifd sender path gets an array of pages queued by the migration
>> > thread. It performs zero page checking on every page in the array.
>> > The pages are classified as either a zero page or a normal page. This
>> > change uses Intel DSA to offload the zero page checking from CPU to
>> > the DSA accelerator. The sender thread submits a batch of pages to DSA
>> > hardware and waits for the DSA completion thread to signal for work
>> > completion.
>> >
>> > Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
>> > Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
>> > ---
>> >  migration/multifd-zero-page.c | 129 ++++++++++++++++++++++++++++++----
>> >  migration/multifd.c           |  29 +++++++-
>> >  migration/multifd.h           |   5 ++
>> >  3 files changed, 147 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
>> > index f1e988a959..639aed9f6b 100644
>> > --- a/migration/multifd-zero-page.c
>> > +++ b/migration/multifd-zero-page.c
>> > @@ -21,7 +21,9 @@
>> >
>> >  static bool multifd_zero_page_enabled(void)
>> >  {
>> > -    return migrate_zero_page_detection() == ZERO_PAGE_DETECTION_MULTIFD;
>> > +    ZeroPageDetection curMethod = migrate_zero_page_detection();
>> > +    return (curMethod == ZERO_PAGE_DETECTION_MULTIFD ||
>> > +            curMethod == ZERO_PAGE_DETECTION_DSA_ACCEL);
>> >  }
>> >
>> >  static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
>> > @@ -37,26 +39,49 @@ static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
>> >      pages_offset[b] = temp;
>> >  }
>> >
>> > +#ifdef CONFIG_DSA_OPT
>> > +
>> > +static void swap_result(bool *results, int a, int b)
>> > +{
>> > +    bool temp;
>> > +
>> > +    if (a == b) {
>> > +        return;
>> > +    }
>> > +
>> > +    temp = results[a];
>> > +    results[a] = results[b];
>> > +    results[b] = temp;
>> > +}
>> > +
>> >  /**
>> > - * multifd_send_zero_page_detect: Perform zero page detection on all pages.
>> > + * zero_page_detect_dsa: Perform zero page detection using
>> > + * Intel Data Streaming Accelerator (DSA).
>> >   *
>> > - * Sorts normal pages before zero pages in p->pages->offset and updates
>> > - * p->pages->normal_num.
>> > + * Sorts normal pages before zero pages in pages->offset and updates
>> > + * pages->normal_num.
>> >   *
>> >   * @param p A pointer to the send params.
>> >   */
>> > -void multifd_send_zero_page_detect(MultiFDSendParams *p)
>> > +static void zero_page_detect_dsa(MultiFDSendParams *p)
>> >  {
>> >      MultiFDPages_t *pages = &p->data->u.ram;
>> >      RAMBlock *rb = pages->block;
>> > -    int i = 0;
>> > -    int j = pages->num - 1;
>> > +    bool *results = p->dsa_batch_task->results;
>> >
>> > -    if (!multifd_zero_page_enabled()) {
>> > -        pages->normal_num = pages->num;
>> > -        goto out;
>> > +    for (int i = 0; i < pages->num; i++) {
>> > +        p->dsa_batch_task->addr[i] =
>> > +            (ram_addr_t)(rb->host + pages->offset[i]);
>> >      }
>> >
>> > +    buffer_is_zero_dsa_batch_sync(p->dsa_batch_task,
>> > +                                  (const void **)p->dsa_batch_task->addr,
>> > +                                  pages->num,
>> > +                                  multifd_ram_page_size());
>> > +
>> > +    int i = 0;
>> > +    int j = pages->num - 1;
>> > +
>> >      /*
>> >       * Sort the page offset array by moving all normal pages to
>> >       * the left and all zero pages to the right of the array.
>> > @@ -64,23 +89,39 @@ void multifd_send_zero_page_detect(MultiFDSendParams *p)
>> >      while (i <= j) {
>> >          uint64_t offset = pages->offset[i];
>> >
>> > -        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
>> > +        if (!results[i]) {
>> >              i++;
>> >              continue;
>> >          }
>> >
>> > +        swap_result(results, i, j);
>> >          swap_page_offset(pages->offset, i, j);
>> >          ram_release_page(rb->idstr, offset);
>> >          j--;
>> >      }
>> >
>> >      pages->normal_num = i;
>> > +}
>> >
>> > -out:
>> > -    stat64_add(&mig_stats.normal_pages, pages->normal_num);
>> > -    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
>> > +void multifd_dsa_cleanup(void)
>> > +{
>> > +    qemu_dsa_cleanup();
>> > +}
>> > +
>> > +#else
>> > +
>> > +static void zero_page_detect_dsa(MultiFDSendParams *p)
>> > +{
>> > +    g_assert_not_reached();
>> > +}
>> > +
>> > +void multifd_dsa_cleanup(void)
>> > +{
>> > +    return;
>> >  }
>> >
>> > +#endif
>> > +
>> >  void multifd_recv_zero_page_process(MultiFDRecvParams *p)
>> >  {
>> >      for (int i = 0; i < p->zero_num; i++) {
>> > @@ -92,3 +133,63 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
>> >          }
>> >      }
>> >  }
>> > +
>> > +/**
>> > + * zero_page_detect_cpu: Perform zero page detection using CPU.
>> > + *
>> > + * Sorts normal pages before zero pages in p->pages->offset and updates
>> > + * p->pages->normal_num.
>> > + *
>> > + * @param p A pointer to the send params.
>> > + */
>> > +static void zero_page_detect_cpu(MultiFDSendParams *p)
>> > +{
>> > +    MultiFDPages_t *pages = &p->data->u.ram;
>> > +    RAMBlock *rb = pages->block;
>> > +    int i = 0;
>> > +    int j = pages->num - 1;
>> > +
>> > +    /*
>> > +     * Sort the page offset array by moving all normal pages to
>> > +     * the left and all zero pages to the right of the array.
>> > +     */
>> > +    while (i <= j) {
>> > +        uint64_t offset = pages->offset[i];
>> > +
>> > +        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
>> > +            i++;
>> > +            continue;
>> > +        }
>> > +
>> > +        swap_page_offset(pages->offset, i, j);
>> > +        ram_release_page(rb->idstr, offset);
>> > +        j--;
>> > +    }
>> > +
>> > +    pages->normal_num = i;
>> > +}
>> > +
>> > +/**
>> > + * multifd_send_zero_page_detect: Perform zero page detection on all pages.
>> > + *
>> > + * @param p A pointer to the send params.
>> > + */
>> > +void multifd_send_zero_page_detect(MultiFDSendParams *p)
>> > +{
>> > +    MultiFDPages_t *pages = &p->data->u.ram;
>> > +
>> > +    if (!multifd_zero_page_enabled()) {
>> > +        pages->normal_num = pages->num;
>> > +        goto out;
>> > +    }
>> > +
>> > +    if (qemu_dsa_is_running()) {
>> > +        zero_page_detect_dsa(p);
>> > +    } else {
>> > +        zero_page_detect_cpu(p);
>> > +    }
>> > +
>> > +out:
>> > +    stat64_add(&mig_stats.normal_pages, pages->normal_num);
>> > +    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
>> > +}
>> > diff --git a/migration/multifd.c b/migration/multifd.c
>> > index 4374e14a96..689acceff2 100644
>> > --- a/migration/multifd.c
>> > +++ b/migration/multifd.c
>> > @@ -13,6 +13,7 @@
>> >  #include "qemu/osdep.h"
>> >  #include "qemu/cutils.h"
>> >  #include "qemu/rcu.h"
>> > +#include "qemu/dsa.h"
>> >  #include "exec/target_page.h"
>> >  #include "sysemu/sysemu.h"
>> >  #include "exec/ramblock.h"
>> > @@ -462,6 +463,8 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
>> >      p->name = NULL;
>> >      g_free(p->data);
>> >      p->data = NULL;
>> > +    buffer_zero_batch_task_destroy(p->dsa_batch_task);
>> > +    p->dsa_batch_task = NULL;
>> >      p->packet_len = 0;
>> >      g_free(p->packet);
>> >      p->packet = NULL;
>> > @@ -493,6 +496,8 @@ void multifd_send_shutdown(void)
>> >
>> >      multifd_send_terminate_threads();
>> >
>> > +    multifd_dsa_cleanup();
>> > +
>> >      for (i = 0; i < migrate_multifd_channels(); i++) {
>> >          MultiFDSendParams *p = &multifd_send_state->params[i];
>> >          Error *local_err = NULL;
>> > @@ -814,11 +819,31 @@ bool multifd_send_setup(void)
>> >      uint32_t page_count = multifd_ram_page_count();
>> >      bool use_packets = multifd_use_packets();
>> >      uint8_t i;
>> > +    Error *local_err = NULL;
>> >
>> >      if (!migrate_multifd()) {
>> >          return true;
>> >      }
>> >
>> > +    if (s &&
>> > +        s->parameters.zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
>> > +        // Populate the dsa device path from accel-path
>>
>> scripts/checkpatch.pl would have rejected this.
>>
>
> Sorry. I will make sure to run checkpatch.pl and the unit tests (both
> with and without DSA) before running send-email...
>
>> > +        const strList *accel_path = migrate_accel_path();
>> > +        g_autofree strList *dsa_parameter = g_malloc0(sizeof(strList));
>> > +        strList **tail = &dsa_parameter;
>> > +        while (accel_path) {
>> > +            if (strncmp(accel_path->value, "dsa:", 4) == 0) {
>> > +                QAPI_LIST_APPEND(tail, &accel_path->value[4]);
>> > +            }
>> > +            accel_path = accel_path->next;
>> > +        }
>>
>> The parsing of the parameter should be in options.c. In fact, Peter
>> suggested in v4 making all of this a multifd_dsa_send_setup() or
>> multifd_dsa_init(); I think that's a good idea.
>>
>
> Will fix it in the next version.
>
>> > +        if (qemu_dsa_init(dsa_parameter, &local_err)) {
>> > +            ret = -1;
>>
>> migrate_set_error(s, local_err);
>> goto err;
>
> Will fix it in the next version. But here we can't goto err, because
> the cleanup() function will be called when setup() fails, and it
> assumes certain data structures are already in place. If we bail out
> earlier, the cleanup() function will complain and fail.
>

Which data structure? Is that multifd_send_state below? You could move
those before qemu_dsa_init if that's the case.
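
Something along these lines (untested sketch, reusing the names already
in your patch):

    thread_count = migrate_multifd_channels();
    multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
    multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
    /* ... rest of the state allocation ... */

    /* only then bring up DSA, so the common error path can run the
     * normal cleanup */
    if (s &&
        s->parameters.zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
        /* dsa_parameter built from accel-path as before */
        if (qemu_dsa_init(dsa_parameter, &local_err)) {
            migrate_set_error(s, local_err);
            goto err;
        }
        qemu_dsa_start();
    }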

>>
>> > +        } else {
>> > +            qemu_dsa_start();
>> > +        }
>> > +    }
>> > +
>> >      thread_count = migrate_multifd_channels();
>> >      multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
>> >      multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
>> > @@ -829,12 +854,12 @@ bool multifd_send_setup(void)
>> >
>> >      for (i = 0; i < thread_count; i++) {
>> >          MultiFDSendParams *p = &multifd_send_state->params[i];
>> > -        Error *local_err = NULL;
>> >
>> >          qemu_sem_init(&p->sem, 0);
>> >          qemu_sem_init(&p->sem_sync, 0);
>> >          p->id = i;
>> >          p->data = multifd_send_data_alloc();
>> > +        p->dsa_batch_task = buffer_zero_batch_task_init(page_count);
>> >
>> >          if (use_packets) {
>> >              p->packet_len = sizeof(MultiFDPacket_t)
>> > @@ -865,7 +890,6 @@ bool multifd_send_setup(void)
>> >
>> >      for (i = 0; i < thread_count; i++) {
>> >          MultiFDSendParams *p = &multifd_send_state->params[i];
>> > -        Error *local_err = NULL;
>> >
>> >          ret = multifd_send_state->ops->send_setup(p, &local_err);
>> >          if (ret) {
>> > @@ -1047,6 +1071,7 @@ void multifd_recv_cleanup(void)
>> >              qemu_thread_join(&p->thread);
>> >          }
>> >      }
>> > +    multifd_dsa_cleanup();
>> >      for (i = 0; i < migrate_multifd_channels(); i++) {
>> >          multifd_recv_cleanup_channel(&multifd_recv_state->params[i]);
>> >      }
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index 50d58c0c9c..e293ddbc1d 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -15,6 +15,7 @@
>> >
>> >  #include "exec/target_page.h"
>> >  #include "ram.h"
>> > +#include "qemu/dsa.h"
>> >
>> >  typedef struct MultiFDRecvData MultiFDRecvData;
>> >  typedef struct MultiFDSendData MultiFDSendData;
>> > @@ -155,6 +156,9 @@ typedef struct {
>> >      bool pending_sync;
>> >      MultiFDSendData *data;
>> >
>> > +    /* Zero page checking batch task */
>> > +    QemuDsaBatchTask *dsa_batch_task;
>> > +
>> >      /* thread local variables. No locking required */
>> >
>> >      /* pointer to the packet */
>> > @@ -313,6 +317,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p);
>> >  bool multifd_send_prepare_common(MultiFDSendParams *p);
>> >  void multifd_send_zero_page_detect(MultiFDSendParams *p);
>> >  void multifd_recv_zero_page_process(MultiFDRecvParams *p);
>> > +void multifd_dsa_cleanup(void);
>> >
>> >  static inline void multifd_send_prepare_header(MultiFDSendParams *p)
>> >  {


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path.
  2024-11-26 13:20       ` Fabiano Rosas
@ 2024-12-03  3:43         ` Yichen Wang
  0 siblings, 0 replies; 30+ messages in thread
From: Yichen Wang @ 2024-12-03  3:43 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé, Eric Blake, Markus Armbruster,
	Michael S. Tsirkin, Cornelia Huck, qemu-devel, Hao Xiang,
	Liu, Yuan1, Shivam Kumar, Ho-Ren (Jack) Chuang

On Tue, Nov 26, 2024 at 5:23 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Yichen Wang <yichen.wang@bytedance.com> writes:
>
> > On Thu, Nov 21, 2024 at 12:52 PM Fabiano Rosas <farosas@suse.de> wrote:
> >>
> >> Yichen Wang <yichen.wang@bytedance.com> writes:
> >>
> >> > From: Hao Xiang <hao.xiang@linux.dev>
> >> >
> >> > Multifd sender path gets an array of pages queued by the migration
> >> > thread. It performs zero page checking on every page in the array.
> >> > The pages are classified as either a zero page or a normal page. This
> >> > change uses Intel DSA to offload the zero page checking from the CPU to
> >> > the DSA accelerator. The sender thread submits a batch of pages to DSA
> >> > hardware and waits for the DSA completion thread to signal for work
> >> > completion.
> >> >
> >> > Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> >> > Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> >> > ---
> >> >  migration/multifd-zero-page.c | 129 ++++++++++++++++++++++++++++++----
> >> >  migration/multifd.c           |  29 +++++++-
> >> >  migration/multifd.h           |   5 ++
> >> >  3 files changed, 147 insertions(+), 16 deletions(-)
> >> >
> >> > diff --git a/migration/multifd-zero-page.c b/migration/multifd-zero-page.c
> >> > index f1e988a959..639aed9f6b 100644
> >> > --- a/migration/multifd-zero-page.c
> >> > +++ b/migration/multifd-zero-page.c
> >> > @@ -21,7 +21,9 @@
> >> >
> >> >  static bool multifd_zero_page_enabled(void)
> >> >  {
> >> > -    return migrate_zero_page_detection() == ZERO_PAGE_DETECTION_MULTIFD;
> >> > +    ZeroPageDetection curMethod = migrate_zero_page_detection();
> >> > +    return (curMethod == ZERO_PAGE_DETECTION_MULTIFD ||
> >> > +            curMethod == ZERO_PAGE_DETECTION_DSA_ACCEL);
> >> >  }
> >> >
> >> >  static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
> >> > @@ -37,26 +39,49 @@ static void swap_page_offset(ram_addr_t *pages_offset, int a, int b)
> >> >      pages_offset[b] = temp;
> >> >  }
> >> >
> >> > +#ifdef CONFIG_DSA_OPT
> >> > +
> >> > +static void swap_result(bool *results, int a, int b)
> >> > +{
> >> > +    bool temp;
> >> > +
> >> > +    if (a == b) {
> >> > +        return;
> >> > +    }
> >> > +
> >> > +    temp = results[a];
> >> > +    results[a] = results[b];
> >> > +    results[b] = temp;
> >> > +}
> >> > +
> >> >  /**
> >> > - * multifd_send_zero_page_detect: Perform zero page detection on all pages.
> >> > + * zero_page_detect_dsa: Perform zero page detection using
> >> > + * Intel Data Streaming Accelerator (DSA).
> >> >   *
> >> > - * Sorts normal pages before zero pages in p->pages->offset and updates
> >> > - * p->pages->normal_num.
> >> > + * Sorts normal pages before zero pages in pages->offset and updates
> >> > + * pages->normal_num.
> >> >   *
> >> >   * @param p A pointer to the send params.
> >> >   */
> >> > -void multifd_send_zero_page_detect(MultiFDSendParams *p)
> >> > +static void zero_page_detect_dsa(MultiFDSendParams *p)
> >> >  {
> >> >      MultiFDPages_t *pages = &p->data->u.ram;
> >> >      RAMBlock *rb = pages->block;
> >> > -    int i = 0;
> >> > -    int j = pages->num - 1;
> >> > +    bool *results = p->dsa_batch_task->results;
> >> >
> >> > -    if (!multifd_zero_page_enabled()) {
> >> > -        pages->normal_num = pages->num;
> >> > -        goto out;
> >> > +    for (int i = 0; i < pages->num; i++) {
> >> > +        p->dsa_batch_task->addr[i] =
> >> > +            (ram_addr_t)(rb->host + pages->offset[i]);
> >> >      }
> >> >
> >> > +    buffer_is_zero_dsa_batch_sync(p->dsa_batch_task,
> >> > +                                  (const void **)p->dsa_batch_task->addr,
> >> > +                                  pages->num,
> >> > +                                  multifd_ram_page_size());
> >> > +
> >> > +    int i = 0;
> >> > +    int j = pages->num - 1;
> >> > +
> >> >      /*
> >> >       * Sort the page offset array by moving all normal pages to
> >> >       * the left and all zero pages to the right of the array.
> >> > @@ -64,23 +89,39 @@ void multifd_send_zero_page_detect(MultiFDSendParams *p)
> >> >      while (i <= j) {
> >> >          uint64_t offset = pages->offset[i];
> >> >
> >> > -        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
> >> > +        if (!results[i]) {
> >> >              i++;
> >> >              continue;
> >> >          }
> >> >
> >> > +        swap_result(results, i, j);
> >> >          swap_page_offset(pages->offset, i, j);
> >> >          ram_release_page(rb->idstr, offset);
> >> >          j--;
> >> >      }
> >> >
> >> >      pages->normal_num = i;
> >> > +}
> >> >
> >> > -out:
> >> > -    stat64_add(&mig_stats.normal_pages, pages->normal_num);
> >> > -    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
> >> > +void multifd_dsa_cleanup(void)
> >> > +{
> >> > +    qemu_dsa_cleanup();
> >> > +}
> >> > +
> >> > +#else
> >> > +
> >> > +static void zero_page_detect_dsa(MultiFDSendParams *p)
> >> > +{
> >> > +    g_assert_not_reached();
> >> > +}
> >> > +
> >> > +void multifd_dsa_cleanup(void)
> >> > +{
> >> > +    return;
> >> >  }
> >> >
> >> > +#endif
> >> > +
> >> >  void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> >> >  {
> >> >      for (int i = 0; i < p->zero_num; i++) {
> >> > @@ -92,3 +133,63 @@ void multifd_recv_zero_page_process(MultiFDRecvParams *p)
> >> >          }
> >> >      }
> >> >  }
> >> > +
> >> > +/**
> >> > + * zero_page_detect_cpu: Perform zero page detection using CPU.
> >> > + *
> >> > + * Sorts normal pages before zero pages in p->pages->offset and updates
> >> > + * p->pages->normal_num.
> >> > + *
> >> > + * @param p A pointer to the send params.
> >> > + */
> >> > +static void zero_page_detect_cpu(MultiFDSendParams *p)
> >> > +{
> >> > +    MultiFDPages_t *pages = &p->data->u.ram;
> >> > +    RAMBlock *rb = pages->block;
> >> > +    int i = 0;
> >> > +    int j = pages->num - 1;
> >> > +
> >> > +    /*
> >> > +     * Sort the page offset array by moving all normal pages to
> >> > +     * the left and all zero pages to the right of the array.
> >> > +     */
> >> > +    while (i <= j) {
> >> > +        uint64_t offset = pages->offset[i];
> >> > +
> >> > +        if (!buffer_is_zero(rb->host + offset, multifd_ram_page_size())) {
> >> > +            i++;
> >> > +            continue;
> >> > +        }
> >> > +
> >> > +        swap_page_offset(pages->offset, i, j);
> >> > +        ram_release_page(rb->idstr, offset);
> >> > +        j--;
> >> > +    }
> >> > +
> >> > +    pages->normal_num = i;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_send_zero_page_detect: Perform zero page detection on all pages.
> >> > + *
> >> > + * @param p A pointer to the send params.
> >> > + */
> >> > +void multifd_send_zero_page_detect(MultiFDSendParams *p)
> >> > +{
> >> > +    MultiFDPages_t *pages = &p->data->u.ram;
> >> > +
> >> > +    if (!multifd_zero_page_enabled()) {
> >> > +        pages->normal_num = pages->num;
> >> > +        goto out;
> >> > +    }
> >> > +
> >> > +    if (qemu_dsa_is_running()) {
> >> > +        zero_page_detect_dsa(p);
> >> > +    } else {
> >> > +        zero_page_detect_cpu(p);
> >> > +    }
> >> > +
> >> > +out:
> >> > +    stat64_add(&mig_stats.normal_pages, pages->normal_num);
> >> > +    stat64_add(&mig_stats.zero_pages, pages->num - pages->normal_num);
> >> > +}
> >> > diff --git a/migration/multifd.c b/migration/multifd.c
> >> > index 4374e14a96..689acceff2 100644
> >> > --- a/migration/multifd.c
> >> > +++ b/migration/multifd.c
> >> > @@ -13,6 +13,7 @@
> >> >  #include "qemu/osdep.h"
> >> >  #include "qemu/cutils.h"
> >> >  #include "qemu/rcu.h"
> >> > +#include "qemu/dsa.h"
> >> >  #include "exec/target_page.h"
> >> >  #include "sysemu/sysemu.h"
> >> >  #include "exec/ramblock.h"
> >> > @@ -462,6 +463,8 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> >> >      p->name = NULL;
> >> >      g_free(p->data);
> >> >      p->data = NULL;
> >> > +    buffer_zero_batch_task_destroy(p->dsa_batch_task);
> >> > +    p->dsa_batch_task = NULL;
> >> >      p->packet_len = 0;
> >> >      g_free(p->packet);
> >> >      p->packet = NULL;
> >> > @@ -493,6 +496,8 @@ void multifd_send_shutdown(void)
> >> >
> >> >      multifd_send_terminate_threads();
> >> >
> >> > +    multifd_dsa_cleanup();
> >> > +
> >> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >> >          Error *local_err = NULL;
> >> > @@ -814,11 +819,31 @@ bool multifd_send_setup(void)
> >> >      uint32_t page_count = multifd_ram_page_count();
> >> >      bool use_packets = multifd_use_packets();
> >> >      uint8_t i;
> >> > +    Error *local_err = NULL;
> >> >
> >> >      if (!migrate_multifd()) {
> >> >          return true;
> >> >      }
> >> >
> >> > +    if (s &&
> >> > +        s->parameters.zero_page_detection == ZERO_PAGE_DETECTION_DSA_ACCEL) {
> >> > +        // Populate the dsa device path from accel-path
> >>
> >> scripts/checkpatch.pl would have rejected this.
> >>
> >
> > Sorry. I will make sure to run checkpatch.pl and the unit tests (both
> > with and without DSA) before sending the email...
> >
> >> > +        const strList *accel_path = migrate_accel_path();
> >> > +        g_autofree strList *dsa_parameter = g_malloc0(sizeof(strList));
> >> > +        strList **tail = &dsa_parameter;
> >> > +        while (accel_path) {
> >> > +            if (strncmp(accel_path->value, "dsa:", 4) == 0) {
> >> > +                QAPI_LIST_APPEND(tail, &accel_path->value[4]);
> >> > +            }
> >> > +            accel_path = accel_path->next;
> >> > +        }
> >>
> >> The parsing of the parameter should be in options.c. In fact, Peter
> >> suggested in v4 to make all of this a multifd_dsa_send_setup() or
> >> multifd_dsa_init(), I think that's a good idea.
> >>
> >
> > Will fix it in the next version.
> >
> >> > +        if (qemu_dsa_init(dsa_parameter, &local_err)) {
> >> > +            ret = -1;
> >>
> >> migrate_set_error(s, local_err);
> >> goto err;
> >
> > Will fix it in the next version. But here we can't goto err, because
> > the cleanup() function will be called when setup() fails, and it
> > assumes certain data structures are already in place. If we exit
> > early, the cleanup() function will complain and fail.
> >
>
> Which data structure? Is that multifd_send_state below? You could move
> those allocations before qemu_dsa_init if that's the case.
>

Yes, actually all of the code below, including the for-loop, is
expected to run and cannot exit early. For example, inside the
thread_count loop, multifd_new_send_channel_create() also does not
exit early, and the whole data structure is referenced later by the
cleanup() function. I also considered placing the DSA setup after
these original initializations, but "p->dsa_batch_task =
buffer_zero_batch_task_init(page_count);" depends on DSA, so the DSA
setup needs to come first. So I guess this is the only way...
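
To make that concrete: the multifd_dsa_send_setup() helper suggested in
review could wrap the accel-path parsing and qemu_dsa_init() call from
the quoted hunk. A rough sketch, assuming the function name from the
review and the interfaces exactly as quoted in the diff (this is not
the merged code):

/*
 * Sketch of the suggested helper; the parsing body is lifted from
 * the quoted multifd_send_setup() hunk, only the wrapper is new.
 * Cleanup of the temporary list nodes is elided, as in the patch.
 */
static int multifd_dsa_send_setup(Error **errp)
{
    const strList *accel_path = migrate_accel_path();
    strList *dsa_parameter = NULL;
    strList **tail = &dsa_parameter;

    /* Collect every "dsa:<device path>" entry from accel-path. */
    while (accel_path) {
        if (strncmp(accel_path->value, "dsa:", 4) == 0) {
            QAPI_LIST_APPEND(tail, &accel_path->value[4]);
        }
        accel_path = accel_path->next;
    }

    if (qemu_dsa_init(dsa_parameter, errp)) {
        return -1;
    }
    qemu_dsa_start();
    return 0;
}

multifd_send_setup() would then make a single guarded call under the
ZERO_PAGE_DETECTION_DSA_ACCEL check, which also gives one place to add
the migrate_set_error()/goto err handling requested above.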

> >>
> >> > +        } else {
> >> > +            qemu_dsa_start();
> >> > +        }
> >> > +    }
> >> > +
> >> >      thread_count = migrate_multifd_channels();
> >> >      multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
> >> >      multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
> >> > @@ -829,12 +854,12 @@ bool multifd_send_setup(void)
> >> >
> >> >      for (i = 0; i < thread_count; i++) {
> >> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >> > -        Error *local_err = NULL;
> >> >
> >> >          qemu_sem_init(&p->sem, 0);
> >> >          qemu_sem_init(&p->sem_sync, 0);
> >> >          p->id = i;
> >> >          p->data = multifd_send_data_alloc();
> >> > +        p->dsa_batch_task = buffer_zero_batch_task_init(page_count);
> >> >
> >> >          if (use_packets) {
> >> >              p->packet_len = sizeof(MultiFDPacket_t)
> >> > @@ -865,7 +890,6 @@ bool multifd_send_setup(void)
> >> >
> >> >      for (i = 0; i < thread_count; i++) {
> >> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >> > -        Error *local_err = NULL;
> >> >
> >> >          ret = multifd_send_state->ops->send_setup(p, &local_err);
> >> >          if (ret) {
> >> > @@ -1047,6 +1071,7 @@ void multifd_recv_cleanup(void)
> >> >              qemu_thread_join(&p->thread);
> >> >          }
> >> >      }
> >> > +    multifd_dsa_cleanup();
> >> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >> >          multifd_recv_cleanup_channel(&multifd_recv_state->params[i]);
> >> >      }
> >> > diff --git a/migration/multifd.h b/migration/multifd.h
> >> > index 50d58c0c9c..e293ddbc1d 100644
> >> > --- a/migration/multifd.h
> >> > +++ b/migration/multifd.h
> >> > @@ -15,6 +15,7 @@
> >> >
> >> >  #include "exec/target_page.h"
> >> >  #include "ram.h"
> >> > +#include "qemu/dsa.h"
> >> >
> >> >  typedef struct MultiFDRecvData MultiFDRecvData;
> >> >  typedef struct MultiFDSendData MultiFDSendData;
> >> > @@ -155,6 +156,9 @@ typedef struct {
> >> >      bool pending_sync;
> >> >      MultiFDSendData *data;
> >> >
> >> > +    /* Zero page checking batch task */
> >> > +    QemuDsaBatchTask *dsa_batch_task;
> >> > +
> >> >      /* thread local variables. No locking required */
> >> >
> >> >      /* pointer to the packet */
> >> > @@ -313,6 +317,7 @@ void multifd_send_fill_packet(MultiFDSendParams *p);
> >> >  bool multifd_send_prepare_common(MultiFDSendParams *p);
> >> >  void multifd_send_zero_page_detect(MultiFDSendParams *p);
> >> >  void multifd_recv_zero_page_process(MultiFDRecvParams *p);
> >> > +void multifd_dsa_cleanup(void);
> >> >
> >> >  static inline void multifd_send_prepare_header(MultiFDSendParams *p)
> >> >  {


^ permalink raw reply	[flat|nested] 30+ messages in thread
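
As background for the zero_page_detect_cpu() and zero_page_detect_dsa()
hunks quoted above: both rely on the same two-pointer partition over
pages->offset[]. A self-contained sketch of that partition in plain C,
with page_is_zero() standing in for QEMU's buffer_is_zero() and a flat
array standing in for MultiFDPages_t (ram_release_page() is omitted):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for QEMU's buffer_is_zero(). */
static bool page_is_zero(const uint8_t *page, size_t len)
{
    for (size_t k = 0; k < len; k++) {
        if (page[k]) {
            return false;
        }
    }
    return true;
}

/*
 * Partition offset[0..num) so that normal pages come first and zero
 * pages last, as multifd_send_zero_page_detect() does.
 * Returns the number of normal pages (pages->normal_num).
 */
static int partition_zero_pages(const uint8_t *host, uint64_t *offset,
                                int num, size_t page_size)
{
    int i = 0;
    int j = num - 1;

    while (i <= j) {
        if (!page_is_zero(host + offset[i], page_size)) {
            i++;                  /* normal page stays on the left */
            continue;
        }
        /* Zero page: swap it to the right edge of the window. */
        uint64_t tmp = offset[i];
        offset[i] = offset[j];
        offset[j] = tmp;
        j--;
    }
    return i;
}

The DSA variant runs the same loop but consults the precomputed
results[] array from the batch task instead of calling buffer_is_zero()
inline, swapping results[] in step with offset[] so the two arrays stay
aligned.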

end of thread

Thread overview: 30+ messages
2024-11-14 22:01 [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Yichen Wang
2024-11-14 22:01 ` [PATCH v7 01/12] meson: Introduce new instruction set enqcmd to the build system Yichen Wang
2024-11-21 13:51   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 02/12] util/dsa: Add idxd into linux header copy list Yichen Wang
2024-11-21 13:51   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 03/12] util/dsa: Implement DSA device start and stop logic Yichen Wang
2024-11-21 14:11   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 04/12] util/dsa: Implement DSA task enqueue and dequeue Yichen Wang
2024-11-21 20:55   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 05/12] util/dsa: Implement DSA task asynchronous completion thread model Yichen Wang
2024-11-21 20:58   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 06/12] util/dsa: Implement zero page checking in DSA task Yichen Wang
2024-11-25 15:53   ` Fabiano Rosas
2024-11-26  4:38     ` [External] " Yichen Wang
2024-11-14 22:01 ` [PATCH v7 07/12] util/dsa: Implement DSA task asynchronous submission and wait for completion Yichen Wang
2024-11-25 18:00   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 08/12] migration/multifd: Add new migration option for multifd DSA offloading Yichen Wang
2024-11-15 14:32   ` Dr. David Alan Gilbert
2024-11-14 22:01 ` [PATCH v7 09/12] migration/multifd: Enable DSA offloading in multifd sender path Yichen Wang
2024-11-21 20:50   ` Fabiano Rosas
2024-11-26  4:41     ` [External] " Yichen Wang
2024-11-26 13:20       ` Fabiano Rosas
2024-12-03  3:43         ` Yichen Wang
2024-11-14 22:01 ` [PATCH v7 10/12] util/dsa: Add unit test coverage for Intel DSA task submission and completion Yichen Wang
2024-11-14 22:01 ` [PATCH v7 11/12] migration/multifd: Add integration tests for multifd with Intel DSA offloading Yichen Wang
2024-11-25 18:25   ` Fabiano Rosas
2024-11-14 22:01 ` [PATCH v7 12/12] migration/doc: Add DSA zero page detection doc Yichen Wang
2024-11-25 18:28   ` Fabiano Rosas
2024-11-19 21:31 ` [PATCH v7 00/12] Use Intel DSA accelerator to offload zero page checking in multifd live migration Fabiano Rosas
2024-11-26  4:43   ` [External] " Yichen Wang
