* [PATCH 1/4] Introduce new instruction set enqcmd/movdir64b to the build system.
2023-05-29 18:19 [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
@ 2023-05-29 18:19 ` Hao Xiang
2023-05-29 18:19 ` [PATCH 2/4] Add dependency idxd Hao Xiang
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Hao Xiang @ 2023-05-29 18:19 UTC (permalink / raw)
To: pbonzini, quintela, qemu-devel; +Cc: Hao Xiang
1. Enable the enqcmd instruction in the build system.
2. Enable the movdir64b instruction in the build system.
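For reference (not part of this patch), a minimal sketch of how the new
CONFIG_DSA_OPT guard and the enqcmd/movdir64b target pragmas are expected to
be consumed by later patches in this series; the function name is illustrative:

/* Illustrative only: compiled when QEMU is configured with -Denqcmd=true. */
#ifdef CONFIG_DSA_OPT
#pragma GCC push_options
#pragma GCC target("enqcmd")
#pragma GCC target("movdir64b")
#include <x86intrin.h>

/* Submit a 64-byte descriptor to a shared work queue portal. */
static int dsa_submit_sketch(void *wq_portal, const void *desc)
{
    _mm_sfence();                    /* order earlier stores to the descriptor */
    return _enqcmd(wq_portal, desc); /* 0 if accepted, non-zero if the WQ is full */
}

#pragma GCC pop_options
#endif /* CONFIG_DSA_OPT */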
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
meson.build | 3 +++
meson_options.txt | 4 ++++
scripts/meson-buildoptions.sh | 6 ++++++
3 files changed, 13 insertions(+)
diff --git a/meson.build b/meson.build
index 2d48aa1e2e..46f1bb2e34 100644
--- a/meson.build
+++ b/meson.build
@@ -2682,6 +2682,8 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
int main(int argc, char *argv[]) { return bar(argv[0]); }
'''), error_message: 'AVX512BW not available').allowed())
+config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd'))
+
have_pvrdma = get_option('pvrdma') \
.require(rdma.found(), error_message: 'PVRDMA requires OpenFabrics libraries') \
.require(cc.compiles(gnu_source_prefix + '''
@@ -4123,6 +4125,7 @@ summary_info += {'memory allocator': get_option('malloc')}
summary_info += {'avx2 optimization': config_host_data.get('CONFIG_AVX2_OPT')}
summary_info += {'avx512bw optimization': config_host_data.get('CONFIG_AVX512BW_OPT')}
summary_info += {'avx512f optimization': config_host_data.get('CONFIG_AVX512F_OPT')}
+summary_info += {'dsa acceleration': config_host_data.get('CONFIG_DSA_OPT')}
if get_option('gprof')
gprof_info = 'YES (deprecated)'
else
diff --git a/meson_options.txt b/meson_options.txt
index 90237389e2..51097da56c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -117,6 +117,10 @@ option('avx512f', type: 'feature', value: 'disabled',
description: 'AVX512F optimizations')
option('avx512bw', type: 'feature', value: 'auto',
description: 'AVX512BW optimizations')
+option('enqcmd', type: 'boolean', value: false,
+ description: 'ENQCMD optimizations')
+option('movdir64b', type: 'boolean', value: false,
+ description: 'MOVDIR64B optimizations')
option('keyring', type: 'feature', value: 'auto',
description: 'Linux keyring support')
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 5714fd93d9..5ef4ec36f4 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -81,6 +81,8 @@ meson_options_help() {
printf "%s\n" ' avx2 AVX2 optimizations'
printf "%s\n" ' avx512bw AVX512BW optimizations'
printf "%s\n" ' avx512f AVX512F optimizations'
+ printf "%s\n" ' enqcmd ENQCMD optimizations'
+ printf "%s\n" ' movdir64b MOVDIR64B optimizations'
printf "%s\n" ' blkio libblkio block device driver'
printf "%s\n" ' bochs bochs image format support'
printf "%s\n" ' bpf eBPF support'
@@ -221,6 +223,10 @@ _meson_option_parse() {
--disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
--enable-avx512f) printf "%s" -Davx512f=enabled ;;
--disable-avx512f) printf "%s" -Davx512f=disabled ;;
+ --enable-enqcmd) printf "%s" -Denqcmd=true ;;
+ --disable-enqcmd) printf "%s" -Denqcmd=false ;;
+ --enable-movdir64b) printf "%s" -Dmovdir64b=true ;;
+ --disable-movdir64b) printf "%s" -Dmovdir64b=false ;;
--enable-gcov) printf "%s" -Db_coverage=true ;;
--disable-gcov) printf "%s" -Db_coverage=false ;;
--enable-lto) printf "%s" -Db_lto=true ;;
--
2.30.2
* [PATCH 2/4] Add dependency idxd.
2023-05-29 18:19 [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
2023-05-29 18:19 ` [PATCH 1/4] Introduce new instruction set enqcmd/movdir64b to the build system Hao Xiang
@ 2023-05-29 18:19 ` Hao Xiang
2023-05-29 18:20 ` [PATCH 3/4] Implement zero page checking using DSA Hao Xiang
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Hao Xiang @ 2023-05-29 18:19 UTC (permalink / raw)
To: pbonzini, quintela, qemu-devel; +Cc: Hao Xiang
idxd is the Linux device driver for DSA (Intel Data Streaming
Accelerator). The driver has been fully functional since Linux
kernel 5.19. This change adds the driver's UAPI header file, which
is used for userspace development.
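As a small illustration (not part of this patch) of how a userspace consumer
is expected to use this header: the status byte of a completion record
combines a write-back flag with a status code, so it should be masked before
comparison. The helper name below is hypothetical:

/* Illustrative helper built against the header added below. */
#include <linux/idxd.h>
#include <stdbool.h>

static bool dsa_completion_succeeded(const struct dsa_completion_record *comp)
{
    /* Bit 7 (DSA_COMP_STATUS_WRITE) only says the record was written back;
     * the low 7 bits carry the dsa_completion_status code. */
    return (comp->status & DSA_COMP_STATUS_MASK) == DSA_COMP_SUCCESS;
}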
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
linux-headers/linux/idxd.h | 356 +++++++++++++++++++++++++++++++++++++
1 file changed, 356 insertions(+)
create mode 100644 linux-headers/linux/idxd.h
diff --git a/linux-headers/linux/idxd.h b/linux-headers/linux/idxd.h
new file mode 100644
index 0000000000..1d553bedbd
--- /dev/null
+++ b/linux-headers/linux/idxd.h
@@ -0,0 +1,356 @@
+/* SPDX-License-Identifier: LGPL-2.1 WITH Linux-syscall-note */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _USR_IDXD_H_
+#define _USR_IDXD_H_
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+/* Driver command error status */
+enum idxd_scmd_stat {
+ IDXD_SCMD_DEV_ENABLED = 0x80000010,
+ IDXD_SCMD_DEV_NOT_ENABLED = 0x80000020,
+ IDXD_SCMD_WQ_ENABLED = 0x80000021,
+ IDXD_SCMD_DEV_DMA_ERR = 0x80020000,
+ IDXD_SCMD_WQ_NO_GRP = 0x80030000,
+ IDXD_SCMD_WQ_NO_NAME = 0x80040000,
+ IDXD_SCMD_WQ_NO_SVM = 0x80050000,
+ IDXD_SCMD_WQ_NO_THRESH = 0x80060000,
+ IDXD_SCMD_WQ_PORTAL_ERR = 0x80070000,
+ IDXD_SCMD_WQ_RES_ALLOC_ERR = 0x80080000,
+ IDXD_SCMD_PERCPU_ERR = 0x80090000,
+ IDXD_SCMD_DMA_CHAN_ERR = 0x800a0000,
+ IDXD_SCMD_CDEV_ERR = 0x800b0000,
+ IDXD_SCMD_WQ_NO_SWQ_SUPPORT = 0x800c0000,
+ IDXD_SCMD_WQ_NONE_CONFIGURED = 0x800d0000,
+ IDXD_SCMD_WQ_NO_SIZE = 0x800e0000,
+ IDXD_SCMD_WQ_NO_PRIV = 0x800f0000,
+ IDXD_SCMD_WQ_IRQ_ERR = 0x80100000,
+ IDXD_SCMD_WQ_USER_NO_IOMMU = 0x80110000,
+};
+
+#define IDXD_SCMD_SOFTERR_MASK 0x80000000
+#define IDXD_SCMD_SOFTERR_SHIFT 16
+
+/* Descriptor flags */
+#define IDXD_OP_FLAG_FENCE 0x0001
+#define IDXD_OP_FLAG_BOF 0x0002
+#define IDXD_OP_FLAG_CRAV 0x0004
+#define IDXD_OP_FLAG_RCR 0x0008
+#define IDXD_OP_FLAG_RCI 0x0010
+#define IDXD_OP_FLAG_CRSTS 0x0020
+#define IDXD_OP_FLAG_CR 0x0080
+#define IDXD_OP_FLAG_CC 0x0100
+#define IDXD_OP_FLAG_ADDR1_TCS 0x0200
+#define IDXD_OP_FLAG_ADDR2_TCS 0x0400
+#define IDXD_OP_FLAG_ADDR3_TCS 0x0800
+#define IDXD_OP_FLAG_CR_TCS 0x1000
+#define IDXD_OP_FLAG_STORD 0x2000
+#define IDXD_OP_FLAG_DRDBK 0x4000
+#define IDXD_OP_FLAG_DSTS 0x8000
+
+/* IAX */
+#define IDXD_OP_FLAG_RD_SRC2_AECS 0x010000
+#define IDXD_OP_FLAG_RD_SRC2_2ND 0x020000
+#define IDXD_OP_FLAG_WR_SRC2_AECS_COMP 0x040000
+#define IDXD_OP_FLAG_WR_SRC2_AECS_OVFL 0x080000
+#define IDXD_OP_FLAG_SRC2_STS 0x100000
+#define IDXD_OP_FLAG_CRC_RFC3720 0x200000
+
+/* Opcode */
+enum dsa_opcode {
+ DSA_OPCODE_NOOP = 0,
+ DSA_OPCODE_BATCH,
+ DSA_OPCODE_DRAIN,
+ DSA_OPCODE_MEMMOVE,
+ DSA_OPCODE_MEMFILL,
+ DSA_OPCODE_COMPARE,
+ DSA_OPCODE_COMPVAL,
+ DSA_OPCODE_CR_DELTA,
+ DSA_OPCODE_AP_DELTA,
+ DSA_OPCODE_DUALCAST,
+ DSA_OPCODE_CRCGEN = 0x10,
+ DSA_OPCODE_COPY_CRC,
+ DSA_OPCODE_DIF_CHECK,
+ DSA_OPCODE_DIF_INS,
+ DSA_OPCODE_DIF_STRP,
+ DSA_OPCODE_DIF_UPDT,
+ DSA_OPCODE_CFLUSH = 0x20,
+};
+
+enum iax_opcode {
+ IAX_OPCODE_NOOP = 0,
+ IAX_OPCODE_DRAIN = 2,
+ IAX_OPCODE_MEMMOVE,
+ IAX_OPCODE_DECOMPRESS = 0x42,
+ IAX_OPCODE_COMPRESS,
+ IAX_OPCODE_CRC64,
+ IAX_OPCODE_ZERO_DECOMP_32 = 0x48,
+ IAX_OPCODE_ZERO_DECOMP_16,
+ IAX_OPCODE_ZERO_COMP_32 = 0x4c,
+ IAX_OPCODE_ZERO_COMP_16,
+ IAX_OPCODE_SCAN = 0x50,
+ IAX_OPCODE_SET_MEMBER,
+ IAX_OPCODE_EXTRACT,
+ IAX_OPCODE_SELECT,
+ IAX_OPCODE_RLE_BURST,
+ IAX_OPCODE_FIND_UNIQUE,
+ IAX_OPCODE_EXPAND,
+};
+
+/* Completion record status */
+enum dsa_completion_status {
+ DSA_COMP_NONE = 0,
+ DSA_COMP_SUCCESS,
+ DSA_COMP_SUCCESS_PRED,
+ DSA_COMP_PAGE_FAULT_NOBOF,
+ DSA_COMP_PAGE_FAULT_IR,
+ DSA_COMP_BATCH_FAIL,
+ DSA_COMP_BATCH_PAGE_FAULT,
+ DSA_COMP_DR_OFFSET_NOINC,
+ DSA_COMP_DR_OFFSET_ERANGE,
+ DSA_COMP_DIF_ERR,
+ DSA_COMP_BAD_OPCODE = 0x10,
+ DSA_COMP_INVALID_FLAGS,
+ DSA_COMP_NOZERO_RESERVE,
+ DSA_COMP_XFER_ERANGE,
+ DSA_COMP_DESC_CNT_ERANGE,
+ DSA_COMP_DR_ERANGE,
+ DSA_COMP_OVERLAP_BUFFERS,
+ DSA_COMP_DCAST_ERR,
+ DSA_COMP_DESCLIST_ALIGN,
+ DSA_COMP_INT_HANDLE_INVAL,
+ DSA_COMP_CRA_XLAT,
+ DSA_COMP_CRA_ALIGN,
+ DSA_COMP_ADDR_ALIGN,
+ DSA_COMP_PRIV_BAD,
+ DSA_COMP_TRAFFIC_CLASS_CONF,
+ DSA_COMP_PFAULT_RDBA,
+ DSA_COMP_HW_ERR1,
+ DSA_COMP_HW_ERR_DRB,
+ DSA_COMP_TRANSLATION_FAIL,
+};
+
+enum iax_completion_status {
+ IAX_COMP_NONE = 0,
+ IAX_COMP_SUCCESS,
+ IAX_COMP_PAGE_FAULT_IR = 0x04,
+ IAX_COMP_ANALYTICS_ERROR = 0x0a,
+ IAX_COMP_OUTBUF_OVERFLOW,
+ IAX_COMP_BAD_OPCODE = 0x10,
+ IAX_COMP_INVALID_FLAGS,
+ IAX_COMP_NOZERO_RESERVE,
+ IAX_COMP_INVALID_SIZE,
+ IAX_COMP_OVERLAP_BUFFERS = 0x16,
+ IAX_COMP_INT_HANDLE_INVAL = 0x19,
+ IAX_COMP_CRA_XLAT,
+ IAX_COMP_CRA_ALIGN,
+ IAX_COMP_ADDR_ALIGN,
+ IAX_COMP_PRIV_BAD,
+ IAX_COMP_TRAFFIC_CLASS_CONF,
+ IAX_COMP_PFAULT_RDBA,
+ IAX_COMP_HW_ERR1,
+ IAX_COMP_HW_ERR_DRB,
+ IAX_COMP_TRANSLATION_FAIL,
+ IAX_COMP_PRS_TIMEOUT,
+ IAX_COMP_WATCHDOG,
+ IAX_COMP_INVALID_COMP_FLAG = 0x30,
+ IAX_COMP_INVALID_FILTER_FLAG,
+ IAX_COMP_INVALID_INPUT_SIZE,
+ IAX_COMP_INVALID_NUM_ELEMS,
+ IAX_COMP_INVALID_SRC1_WIDTH,
+ IAX_COMP_INVALID_INVERT_OUT,
+};
+
+#define DSA_COMP_STATUS_MASK 0x7f
+#define DSA_COMP_STATUS_WRITE 0x80
+
+struct dsa_hw_desc {
+ uint32_t pasid:20;
+ uint32_t rsvd:11;
+ uint32_t priv:1;
+ uint32_t flags:24;
+ uint32_t opcode:8;
+ uint64_t completion_addr;
+ union {
+ uint64_t src_addr;
+ uint64_t rdback_addr;
+ uint64_t pattern;
+ uint64_t desc_list_addr;
+ };
+ union {
+ uint64_t dst_addr;
+ uint64_t rdback_addr2;
+ uint64_t src2_addr;
+ uint64_t comp_pattern;
+ };
+ union {
+ uint32_t xfer_size;
+ uint32_t desc_count;
+ };
+ uint16_t int_handle;
+ uint16_t rsvd1;
+ union {
+ uint8_t expected_res;
+ /* create delta record */
+ struct {
+ uint64_t delta_addr;
+ uint32_t max_delta_size;
+ uint32_t delt_rsvd;
+ uint8_t expected_res_mask;
+ };
+ uint32_t delta_rec_size;
+ uint64_t dest2;
+ /* CRC */
+ struct {
+ uint32_t crc_seed;
+ uint32_t crc_rsvd;
+ uint64_t seed_addr;
+ };
+ /* DIF check or strip */
+ struct {
+ uint8_t src_dif_flags;
+ uint8_t dif_chk_res;
+ uint8_t dif_chk_flags;
+ uint8_t dif_chk_res2[5];
+ uint32_t chk_ref_tag_seed;
+ uint16_t chk_app_tag_mask;
+ uint16_t chk_app_tag_seed;
+ };
+ /* DIF insert */
+ struct {
+ uint8_t dif_ins_res;
+ uint8_t dest_dif_flag;
+ uint8_t dif_ins_flags;
+ uint8_t dif_ins_res2[13];
+ uint32_t ins_ref_tag_seed;
+ uint16_t ins_app_tag_mask;
+ uint16_t ins_app_tag_seed;
+ };
+ /* DIF update */
+ struct {
+ uint8_t src_upd_flags;
+ uint8_t upd_dest_flags;
+ uint8_t dif_upd_flags;
+ uint8_t dif_upd_res[5];
+ uint32_t src_ref_tag_seed;
+ uint16_t src_app_tag_mask;
+ uint16_t src_app_tag_seed;
+ uint32_t dest_ref_tag_seed;
+ uint16_t dest_app_tag_mask;
+ uint16_t dest_app_tag_seed;
+ };
+
+ uint8_t op_specific[24];
+ };
+} __attribute__((packed));
+
+struct iax_hw_desc {
+ uint32_t pasid:20;
+ uint32_t rsvd:11;
+ uint32_t priv:1;
+ uint32_t flags:24;
+ uint32_t opcode:8;
+ uint64_t completion_addr;
+ uint64_t src1_addr;
+ uint64_t dst_addr;
+ uint32_t src1_size;
+ uint16_t int_handle;
+ union {
+ uint16_t compr_flags;
+ uint16_t decompr_flags;
+ };
+ uint64_t src2_addr;
+ uint32_t max_dst_size;
+ uint32_t src2_size;
+ uint32_t filter_flags;
+ uint32_t num_inputs;
+} __attribute__((packed));
+
+struct dsa_raw_desc {
+ uint64_t field[8];
+} __attribute__((packed));
+
+/*
+ * The status field will be modified by hardware, therefore it should be
+ * volatile and prevent the compiler from optimizing the read.
+ */
+struct dsa_completion_record {
+ volatile uint8_t status;
+ union {
+ uint8_t result;
+ uint8_t dif_status;
+ };
+ uint16_t rsvd;
+ uint32_t bytes_completed;
+ uint64_t fault_addr;
+ union {
+ /* common record */
+ struct {
+ uint32_t invalid_flags:24;
+ uint32_t rsvd2:8;
+ };
+
+ uint32_t delta_rec_size;
+ uint64_t crc_val;
+
+ /* DIF check & strip */
+ struct {
+ uint32_t dif_chk_ref_tag;
+ uint16_t dif_chk_app_tag_mask;
+ uint16_t dif_chk_app_tag;
+ };
+
+ /* DIF insert */
+ struct {
+ uint64_t dif_ins_res;
+ uint32_t dif_ins_ref_tag;
+ uint16_t dif_ins_app_tag_mask;
+ uint16_t dif_ins_app_tag;
+ };
+
+ /* DIF update */
+ struct {
+ uint32_t dif_upd_src_ref_tag;
+ uint16_t dif_upd_src_app_tag_mask;
+ uint16_t dif_upd_src_app_tag;
+ uint32_t dif_upd_dest_ref_tag;
+ uint16_t dif_upd_dest_app_tag_mask;
+ uint16_t dif_upd_dest_app_tag;
+ };
+
+ uint8_t op_specific[16];
+ };
+} __attribute__((packed));
+
+struct dsa_raw_completion_record {
+ uint64_t field[4];
+} __attribute__((packed));
+
+struct iax_completion_record {
+ volatile uint8_t status;
+ uint8_t error_code;
+ uint16_t rsvd;
+ uint32_t bytes_completed;
+ uint64_t fault_addr;
+ uint32_t invalid_flags;
+ uint32_t rsvd2;
+ uint32_t output_size;
+ uint8_t output_bits;
+ uint8_t rsvd3;
+ uint16_t xor_csum;
+ uint32_t crc;
+ uint32_t min;
+ uint32_t max;
+ uint32_t sum;
+ uint64_t rsvd4[2];
+} __attribute__((packed));
+
+struct iax_raw_completion_record {
+ uint64_t field[8];
+} __attribute__((packed));
+
+#endif
--
2.30.2
* [PATCH 3/4] Implement zero page checking using DSA.
2023-05-29 18:19 [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
2023-05-29 18:19 ` [PATCH 1/4] Introduce new instruction set enqcmd/movdir64b to the build system Hao Xiang
2023-05-29 18:19 ` [PATCH 2/4] Add dependency idxd Hao Xiang
@ 2023-05-29 18:20 ` Hao Xiang
2023-05-29 18:20 ` [PATCH 4/4] Add QEMU command line argument to enable DSA offloading Hao Xiang
2023-05-29 18:24 ` [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
4 siblings, 0 replies; 6+ messages in thread
From: Hao Xiang @ 2023-05-29 18:20 UTC (permalink / raw)
To: pbonzini, quintela, qemu-devel; +Cc: Hao Xiang
1. Add a memory comparison function that submits
the work to the idxd driver.
2. Add an interface to set the bufferiszero accel function
to DSA offloading.
3. Fall back to the CPU accel function if DSA offloading
fails due to a page fault.
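A usage sketch of the new interfaces (the function names come from this
patch; the work queue path is hypothetical and depends on the local idxd
configuration):

/* Illustrative only: wiring DSA offloading into zero page checking. */
#include "qemu/osdep.h"
#include "qemu/cutils.h"

static void zero_check_example(const void *page, size_t page_size)
{
    /* "/dev/dsa/wq0.0" is a placeholder work queue path. */
    if (configure_dsa("/dev/dsa/wq0.0") == 0) {
        /*
         * buffer_is_zero() now dispatches to the DSA-backed accel function
         * for buffers of at least length_to_accel bytes; on a partial
         * (page-faulted) completion it falls back to the previously
         * selected CPU implementation.
         */
    }
    bool is_zero = buffer_is_zero(page, page_size);
    (void)is_zero;

    dsa_cleanup();
}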
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
include/qemu/cutils.h | 6 +
migration/ram.c | 4 +
util/bufferiszero.c | 14 ++
util/dsa.c | 295 ++++++++++++++++++++++++++++++++++++++++++
util/meson.build | 1 +
5 files changed, 320 insertions(+)
create mode 100644 util/dsa.c
diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index 92c436d8c7..9d0286ac99 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -188,9 +188,15 @@ char *freq_to_str(uint64_t freq_hz);
/* used to print char* safely */
#define STR_OR_NULL(str) ((str) ? (str) : "null")
+typedef bool (*buffer_accel_fn)(const void *, size_t);
+void set_accel(buffer_accel_fn, size_t len);
+void get_fallback_accel(buffer_accel_fn *);
bool buffer_is_zero(const void *buf, size_t len);
bool test_buffer_is_zero_next_accel(void);
+int configure_dsa(const char *dsa_path);
+void dsa_cleanup(void);
+
/*
* Implementation of ULEB128 (http://en.wikipedia.org/wiki/LEB128)
* Input is limited to 14-bit numbers
diff --git a/migration/ram.c b/migration/ram.c
index 88a6c82e63..b586ac4a99 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2280,6 +2280,10 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss)
if (preempt_active) {
qemu_mutex_unlock(&rs->bitmap_mutex);
}
+ /*
+ * TODO: Make ram_save_target_page async to take advantage
+ * of DSA offloading.
+ */
tmppages = migration_ops->ram_save_target_page(rs, pss);
if (tmppages >= 0) {
pages += tmppages;
diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 3e6a5dfd63..3d089ef1fe 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -206,6 +206,7 @@ buffer_zero_avx512(const void *buf, size_t len)
static unsigned used_accel = INIT_USED;
static unsigned length_to_accel = INIT_LENGTH;
static bool (*buffer_accel)(const void *, size_t) = INIT_ACCEL;
+static bool (*buffer_accel_fallback)(const void *, size_t) = INIT_ACCEL;
static unsigned __attribute__((noinline))
select_accel_cpuinfo(unsigned info)
@@ -231,6 +232,7 @@ select_accel_cpuinfo(unsigned info)
if (info & all[i].bit) {
length_to_accel = all[i].len;
buffer_accel = all[i].fn;
+ buffer_accel_fallback = all[i].fn;
return all[i].bit;
}
}
@@ -272,6 +274,17 @@ bool test_buffer_is_zero_next_accel(void)
}
#endif
+void set_accel(buffer_accel_fn fn, size_t len)
+{
+ buffer_accel = fn;
+ length_to_accel = len;
+}
+
+void get_fallback_accel(buffer_accel_fn *fn)
+{
+ *fn = buffer_accel_fallback;
+}
+
/*
* Checks if a buffer is all zeroes
*/
@@ -288,3 +301,4 @@ bool buffer_is_zero(const void *buf, size_t len)
includes a check for an unrolled loop over 64-bit integers. */
return select_accel_fn(buf, len);
}
+
diff --git a/util/dsa.c b/util/dsa.c
new file mode 100644
index 0000000000..2fdcdb4f49
--- /dev/null
+++ b/util/dsa.c
@@ -0,0 +1,295 @@
+/*
+ * Use Intel Data Streaming Accelerator to offload certain background
+ * operations.
+ *
+ * Copyright (c) 2023 Hao Xiang <hao.xiang@bytedance.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/bswap.h"
+#include "qemu/error-report.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+#pragma GCC target("movdir64b")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+#define DSA_WQ_SIZE 4096
+
+static bool use_simulation;
+static uint64_t total_bytes_checked;
+static uint64_t total_function_calls;
+static uint64_t total_success_count;
+static int max_retry_count;
+static int top_retry_count;
+
+static void *dsa_wq = MAP_FAILED;
+static uint8_t zero_page_buffer[4096];
+static bool dedicated_mode;
+static int length_to_accel = 64;
+
+static buffer_accel_fn buffer_zero_fallback;
+
+/**
+ * @brief This function opens a DSA device's work queue and
+ * maps the DSA device memory into the current process.
+ *
+ * @param dsa_wq_path A pointer to the DSA device work queue's file path.
+ * @return A pointer to the mapped memory.
+ */
+static void *map_dsa_device(const char *dsa_wq_path)
+{
+ void *dsa_device;
+ int fd;
+
+ fd = open(dsa_wq_path, O_RDWR);
+ if (fd < 0) {
+ fprintf(stderr, "open %s failed with errno = %d.\n",
+ dsa_wq_path, errno);
+ return MAP_FAILED;
+ }
+ dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, 0);
+ close(fd);
+ if (dsa_device == MAP_FAILED) {
+ fprintf(stderr, "mmap failed with errno = %d.\n", errno);
+ return MAP_FAILED;
+ }
+ return dsa_device;
+}
+
+/**
+ * @brief Submits a DSA work item to the device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int submit_wi(void *wq, void *descriptor)
+{
+ int retry = 0;
+
+ _mm_sfence();
+
+ if (dedicated_mode) {
+ _movdir64b(wq, descriptor);
+ } else {
+ while (true) {
+ if (_enqcmd(wq, descriptor) == 0) {
+ break;
+ }
+ retry++;
+ if (retry > max_retry_count) {
+ fprintf(stderr, "Submit work retry %d times.\n", retry);
+ exit(1);
+ }
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * @brief Poll for the DSA work item completion.
+ *
+ * @param completion A pointer to the DSA work item completion record.
+ * @param opcode The DSA opcode.
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int poll_completion(struct dsa_completion_record *completion,
+ enum dsa_opcode opcode)
+{
+ int retry = 0;
+
+ while (true) {
+ if (completion->status != DSA_COMP_NONE) {
+ /* TODO: Error handling here. */
+ if (completion->status != DSA_COMP_SUCCESS &&
+ completion->status != DSA_COMP_PAGE_FAULT_NOBOF) {
+ fprintf(stderr, "DSA opcode %d failed with status = %d.\n",
+ opcode, completion->status);
+ exit(1);
+ } else {
+ total_success_count++;
+ }
+ break;
+ }
+ retry++;
+ if (retry > max_retry_count) {
+ fprintf(stderr, "Wait for completion retry %d times.\n", retry);
+ exit(1);
+ }
+ _mm_pause();
+ }
+
+ if (retry > top_retry_count) {
+ top_retry_count = retry;
+ }
+
+ return 0;
+}
+
+static bool buffer_zero_dsa_simulation(const void *buf, size_t len)
+{
+ /* TODO: Handle page size greater than 4k. */
+ if (len > sizeof(zero_page_buffer)) {
+ fprintf(stderr, "Page size greater than %lu is not supported by DSA "
+ "buffer zero checking.\n", sizeof(zero_page_buffer));
+ exit(1);
+ }
+
+ total_bytes_checked += len;
+ total_function_calls++;
+
+ return memcmp(buf, zero_page_buffer, len) == 0;
+}
+
+/**
+ * @brief Sends a memory comparison work item to a DSA device and wait
+ * for completion.
+ *
+ * @param buf A pointer to the memory buffer for comparison.
+ * @param len Length of the memory buffer for comparison.
+ * @return true if the memory buffer is all zero, false otherwise.
+ */
+static bool buffer_zero_dsa(const void *buf, size_t len)
+{
+ struct dsa_completion_record completion __attribute__((aligned(32)));
+ struct dsa_hw_desc descriptor;
+ uint8_t test_byte;
+
+ /* TODO: Handle page size greater than 4k. */
+ if (len > sizeof(zero_page_buffer)) {
+ fprintf(stderr, "Page size greater than %lu is not supported by DSA "
+ "buffer zero checking.\n", sizeof(zero_page_buffer));
+ exit(1);
+ }
+
+ total_bytes_checked += len;
+ total_function_calls++;
+
+ memset(&completion, 0, sizeof(completion));
+ memset(&descriptor, 0, sizeof(descriptor));
+
+ descriptor.opcode = DSA_OPCODE_COMPARE;
+ descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+ descriptor.xfer_size = len;
+ descriptor.src_addr = (uintptr_t)buf;
+ descriptor.dst_addr = (uintptr_t)zero_page_buffer;
+ completion.status = 0;
+ descriptor.completion_addr = (uint64_t)&completion;
+
+ /*
+ * TODO: Find a better solution. The DSA device can encounter a page
+ * fault during the memory comparison operation. Blocking on page
+ * faults is turned off for better performance. As a temporary
+ * workaround, read and write back the first byte of the memory
+ * buffer so that the CPU takes the page fault and the DSA device
+ * won't hit it later.
+ */
+ test_byte = ((uint8_t *)buf)[0];
+ ((uint8_t *)buf)[0] = test_byte;
+
+ submit_wi(dsa_wq, &descriptor);
+ poll_completion(&completion, DSA_OPCODE_COMPARE);
+
+ if (completion.status == DSA_COMP_SUCCESS) {
+ return completion.result == 0;
+ }
+
+ /*
+ * DSA was able to partially complete the operation. Check the
+ * result. If we already know this is not a zero page, we can
+ * return now.
+ */
+ if (completion.bytes_completed != 0 && completion.result != 0) {
+ return false;
+ }
+
+ /* Let's fallback to use CPU to complete it. */
+ return buffer_zero_fallback((uint8_t *)buf + completion.bytes_completed,
+ len - completion.bytes_completed);
+}
+
+/**
+ * @brief Check if DSA devices are enabled in the current system
+ * and set DSA offloading for zero page checking operation.
+ * This function is called during QEMU initialization.
+ *
+ * @param dsa_path A pointer to the DSA device's work queue file path.
+ * @return int Zero if successful, non-zero otherwise.
+ */
+int configure_dsa(const char *dsa_path)
+{
+ dedicated_mode = false;
+ use_simulation = false;
+ max_retry_count = 3000;
+ total_bytes_checked = 0;
+ total_function_calls = 0;
+ total_success_count = 0;
+
+ memset(zero_page_buffer, 0, sizeof(zero_page_buffer));
+
+ dsa_wq = map_dsa_device(dsa_path);
+ if (dsa_wq == MAP_FAILED) {
+ fprintf(stderr, "map_dsa_device failed MAP_FAILED, "
+ "using simulation.\n");
+ return -1;
+ }
+
+ if (use_simulation) {
+ set_accel(buffer_zero_dsa_simulation, length_to_accel);
+ } else {
+ set_accel(buffer_zero_dsa, length_to_accel);
+ get_fallback_accel(&buffer_zero_fallback);
+ }
+
+ return 0;
+}
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ * This function is called during QEMU process teardown.
+ *
+ */
+void dsa_cleanup(void)
+{
+ if (dsa_wq != MAP_FAILED) {
+ munmap(dsa_wq, DSA_WQ_SIZE);
+ }
+}
+
+#else
+
+int configure_dsa(const char *dsa_path)
+{
+ fprintf(stderr, "Intel Data Streaming Accelerator is not supported "
+ "on this platform.\n");
+ return -1;
+}
+
+void dsa_cleanup(void) {}
+
+#endif
diff --git a/util/meson.build b/util/meson.build
index 3a93071d27..f493071c91 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -84,6 +84,7 @@ if have_block or have_ga
endif
if have_block
util_ss.add(files('aio-wait.c'))
+ util_ss.add(files('dsa.c'))
util_ss.add(files('buffer.c'))
util_ss.add(files('bufferiszero.c'))
util_ss.add(files('hbitmap.c'))
--
2.30.2
* [PATCH 4/4] Add QEMU command line argument to enable DSA offloading.
2023-05-29 18:19 [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
` (2 preceding siblings ...)
2023-05-29 18:20 ` [PATCH 3/4] Implement zero page checking using DSA Hao Xiang
@ 2023-05-29 18:20 ` Hao Xiang
2023-05-29 18:24 ` [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
4 siblings, 0 replies; 6+ messages in thread
From: Hao Xiang @ 2023-05-29 18:20 UTC (permalink / raw)
To: pbonzini, quintela, qemu-devel; +Cc: Hao Xiang
This change adds a new command line argument, -dsa-accelerate <path>, to QEMU. It takes the file path of a DSA work queue device (created by the idxd driver) and enables DSA offloading for zero page checking.
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
qemu-options.hx | 10 ++++++++++
softmmu/runstate.c | 4 ++++
softmmu/vl.c | 22 ++++++++++++++++++++++
storage-daemon/qemu-storage-daemon.c | 2 ++
4 files changed, 38 insertions(+)
diff --git a/qemu-options.hx b/qemu-options.hx
index b37eb9662b..29491ee691 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4890,6 +4890,16 @@ SRST
otherwise the option is ignored. Default is off.
ERST
+DEF("dsa-accelerate", HAS_ARG, QEMU_OPTION_dsa,
+ "-dsa-accelerate <file>\n"
+ " Use Intel Data Streaming Accelerator for certain QEMU\n"
+ " operations, eg, checkpoint.\n",
+ QEMU_ARCH_I386)
+SRST
+``-dsa-accelerate path``
+ The device path to a DSA accelerator work queue.
+ERST
+
DEF("dump-vmstate", HAS_ARG, QEMU_OPTION_dump_vmstate,
"-dump-vmstate <file>\n"
" Output vmstate information in JSON format to file.\n"
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 2f2396c819..1f938e192f 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -41,6 +41,7 @@
#include "qapi/qapi-commands-run-state.h"
#include "qapi/qapi-events-run-state.h"
#include "qemu/accel.h"
+#include "qemu/cutils.h"
#include "qemu/error-report.h"
#include "qemu/job.h"
#include "qemu/log.h"
@@ -834,6 +835,9 @@ void qemu_cleanup(void)
tpm_cleanup();
net_cleanup();
audio_cleanup();
+
+ dsa_cleanup();
+
monitor_cleanup();
qemu_chr_cleanup();
user_creatable_cleanup();
diff --git a/softmmu/vl.c b/softmmu/vl.c
index b0b96f67fa..8ace491183 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -161,6 +161,7 @@ static const char *mem_path;
static const char *incoming;
static const char *loadvm;
static const char *accelerators;
+static const char *dsa_path;
static bool have_custom_ram_size;
static const char *ram_memdev_id;
static QDict *machine_opts_dict;
@@ -373,6 +374,20 @@ static QemuOptsList qemu_msg_opts = {
},
};
+static QemuOptsList qemu_dsa_opts = {
+ .name = "dsa-accelerate",
+ .head = QTAILQ_HEAD_INITIALIZER(qemu_dsa_opts.head),
+ .desc = {
+ {
+ .name = "device",
+ .type = QEMU_OPT_STRING,
+ .help = "The device path to DSA accelerator used for certain "
+ "QEMU operations, eg, checkpoint\n",
+ },
+ { /* end of list */ }
+ },
+};
+
static QemuOptsList qemu_name_opts = {
.name = "name",
.implied_opt_name = "guest",
@@ -2704,6 +2719,7 @@ void qemu_init(int argc, char **argv)
qemu_add_opts(&qemu_semihosting_config_opts);
qemu_add_opts(&qemu_fw_cfg_opts);
qemu_add_opts(&qemu_action_opts);
+ qemu_add_opts(&qemu_dsa_opts);
module_call_init(MODULE_INIT_OPTS);
error_init(argv[0]);
@@ -3504,6 +3520,12 @@ void qemu_init(int argc, char **argv)
}
configure_msg(opts);
break;
+ case QEMU_OPTION_dsa:
+ dsa_path = optarg;
+ if (configure_dsa(dsa_path)) {
+ exit(1);
+ }
+ break;
case QEMU_OPTION_dump_vmstate:
if (vmstate_dump_file) {
error_report("only one '-dump-vmstate' "
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index 0e9354faa6..0e4375407a 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -439,6 +439,8 @@ int main(int argc, char *argv[])
job_cancel_sync_all();
bdrv_close_all();
+ dsa_cleanup();
+
monitor_cleanup();
qemu_chr_cleanup();
user_creatable_cleanup();
--
2.30.2
* Re: [PATCH 0/4] Add Intel Data Streaming Accelerator offloading
2023-05-29 18:19 [PATCH 0/4] Add Intel Data Streaming Accelerator offloading Hao Xiang
` (3 preceding siblings ...)
2023-05-29 18:20 ` [PATCH 4/4] Add QEMU command line argument to enable DSA offloading Hao Xiang
@ 2023-05-29 18:24 ` Hao Xiang
4 siblings, 0 replies; 6+ messages in thread
From: Hao Xiang @ 2023-05-29 18:24 UTC (permalink / raw)
To: pbonzini, quintela, qemu-devel
Hi all, this is meant to be an RFC. Sorry I didn't put that in the email
subject correctly.
From: "Hao Xiang"<hao.xiang@bytedance.com>
Date: Mon, May 29, 2023, 11:20
Subject: [PATCH 0/4] Add Intel Data Streaming Accelerator offloading
To: <pbonzini@redhat.com>, <quintela@redhat.com>, <qemu-devel@nongnu.org>
Cc: "Hao Xiang"<hao.xiang@bytedance.com>
* Idea:
Intel Data Streaming Accelerator (DSA) is introduced in Intel's 4th
generation Xeon server, aka Sapphire Rapids.
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
One of the things DSA can do is to offload memory comparison workloads from
the CPU to the DSA accelerator hardware. This change proposes a solution to
offload QEMU's zero page checking from the CPU to the DSA accelerator. With
this, I am looking into two potential improvements during QEMU background
operations, e.g., VM checkpoint and VM live migration.
1. Reduce CPU usage.
I did a simple tracing of the savevm workload. I started VMs with
64GB, 128GB and 256GB of RAM respectively and then ran the "savevm"
command from QEMU's command line. During this savevm workload, I recorded
the CPU cycles spent in the function save_snapshot and the total CPU cycles
spent in repeated calls into the function buffer_is_zero, which performs
zero page checking.

|-----------|---------------|----------------|
|VM memory  | save_snapshot | buffer_is_zero |
|capacity   | (CPU cycles)  | (CPU cycles)   |
|-----------|---------------|----------------|
|64GB       | 19449838924   | 5022504892     |
|128GB      | 36966951760   | 10049561066    |
|256GB      | 72744615676   | 20042076440    |
|-----------|---------------|----------------|
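For reference, the ratio works out to about 26% for the 64GB VM
(5022504892 / 19449838924), 27% for the 128GB VM and 28% for the 256GB VM.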
In the three scenarios, the CPU cycles spent in zero page checking account
for roughly 27% of the total cycles spent in save_snapshot. I believe this
is because a large part of save_snapshot performs file IO operations writing
all memory pages into the QEMU image file, and there is a linear increase in
the CPU cycles spent in savevm as the VM's total memory increases. If we can
offload all zero page checking function calls to the DSA accelerator, we can
reduce the CPU usage of the savevm workload by roughly 27%, potentially
freeing CPU resources for other work. The same savings should apply to the
live VM migration workload as well. Furthermore, if a guest VM's vCPU shares
the same physical CPU core used for live migration, the guest VM will gain
more underlying CPU resources, making it more responsive to its own guest
workload during live migration.
2. Reduce operation latency.
I did some benchmark testing on a pure CPU memory comparison implementation
and a DSA hardware offload implementation.

Testbed: Intel(R) Xeon(R) Platinum 8457C, CPU 3100MHz

Latency is measured by completing a memory comparison on two memory buffers,
each one GB in size. The memory comparison is done via the CPU and the DSA
accelerator respectively. When performing the CPU memory comparison, I use a
single thread. When performing the DSA accelerator memory comparison, I use
one DSA engine. In both cases, the implementation uses a 4k memory buffer as
the granularity for comparison.
|----------------|---------|
|Method          | Latency |
|----------------|---------|
|CPU, one thread | 80ms    |
|DSA, one engine | 89ms    |
|----------------|---------|
In our test system, we have two sockets and two DSA devices per socket. Each
DSA device has four engines built in. I believe that if we leverage more DSA
engine resources and parallelize zero page checking well, we can keep the
DSA devices busy and reduce CPU usage.
* Current state:
This patch series implements the DSA offloading operation for zero page
checking. The user can optionally replace the zero page checking function
with DSA offloading by specifying a new argument on the QEMU startup command
line. There is no performance gain from this change yet. This is mainly
because zero page checking is a synchronous operation and each page is 4k in
size. Offloading a single 4k memory page comparison to the DSA accelerator
and waiting for the driver to complete the operation introduces overhead.
Currently the overhead is bigger than the CPU cycles saved by offloading.
* Future work:
1. Make the zero page checking workflow asynchronous. The idea is that we
can throw lots of zero page checking operations at once at N (configurable)
DSA engines and then wait for those operations to be completed by idxd (the
DSA device driver); a rough sketch of this direction, together with a
umwait-based completion wait (item 3 below), follows after this list.
Currently ram_save_iterate has a loop that iterates through all the memory
blocks, finds the dirty pages and saves them all. The loop exits when there
are no more dirty pages to save. I think that when we walk through all the
memory blocks, we only need to identify whether there are dirty pages
remaining, and we can do the actual "save page" work asynchronously. We can
return from ram_save_iterate when we have finished walking through the
memory blocks and all pages are saved. This sounds like a pretty large
refactoring change and I am looking hard into this path to figure out
exactly how to tackle it. Any feedback would be really appreciated.
2. Implement an abstraction layer where QEMU can just throw zero page
checking operations at the DSA layer and the DSA layer will figure out which
work queue/engine handles each operation. We can probably use a round-robin
dispatcher to balance the work across multiple DSA engines.
3. The current patch uses a busy loop to poll for DSA completions, which is
really a bad practice. I need to either use the umonitor/umwait instructions
or user mode interrupts for true async completion.
4. The DSA device can also offload other operations.
* memcpy
* xbzrle encoding/decoding
* crc32
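To make items 1 and 3 above more concrete, below is a rough, untested sketch
of the direction I have in mind: submit several pre-filled COMPARE
descriptors back to back (round-robining across mapped work queue portals,
item 2), then wait for each completion with umonitor/umwait instead of a
busy _mm_pause() loop. All names are illustrative and the sketch assumes a
toolchain and CPU with ENQCMD and WAITPKG support:

/* Illustrative sketch only; not part of this series. */
#include <linux/idxd.h>
#include <x86intrin.h>
#include <stdint.h>
#include <stddef.h>

#pragma GCC push_options
#pragma GCC target("enqcmd")
#pragma GCC target("waitpkg")

#define SKETCH_BATCH_SIZE 8   /* descriptors kept in flight */
#define SKETCH_NUM_WQS    4   /* mapped work queue portals */

/* Portals would be filled in by map_dsa_device() at init time (not shown). */
static void *wq_portals[SKETCH_NUM_WQS];

/* Submit a batch of compare descriptors, round-robining across portals. */
static void sketch_submit_batch(struct dsa_hw_desc *descs,
                                struct dsa_completion_record *comps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        comps[i].status = DSA_COMP_NONE;
        descs[i].completion_addr = (uintptr_t)&comps[i];
        _mm_sfence();
        while (_enqcmd(wq_portals[i % SKETCH_NUM_WQS], &descs[i])) {
            /* WQ full: retry; a real implementation would back off or yield. */
        }
    }
}

/* Wait for one completion with umonitor/umwait instead of busy polling. */
static uint8_t sketch_wait_completion(struct dsa_completion_record *comp)
{
    while (comp->status == DSA_COMP_NONE) {
        _umonitor((void *)comp);                   /* arm the monitor on the record */
        if (comp->status != DSA_COMP_NONE) {
            break;                                 /* raced with the device */
        }
        _umwait(0 /* C0.2 */, __rdtsc() + 100000); /* sleep until write or deadline */
    }
    return comp->status & DSA_COMP_STATUS_MASK;
}

#pragma GCC pop_options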
base-commit: ac84b57b4d74606f7f83667a0606deef32b2049d
Hao Xiang (4):
Introduce new instruction set enqcmd/movdir64b to the build system.
Add dependency idxd.
Implement zero page checking using DSA.
Add QEMU command line argument to enable DSA offloading.
include/qemu/cutils.h | 6 +
linux-headers/linux/idxd.h | 356 +++++++++++++++++++++++++++
meson.build | 3 +
meson_options.txt | 4 +
migration/ram.c | 4 +
qemu-options.hx | 10 +
scripts/meson-buildoptions.sh | 6 +
softmmu/runstate.c | 4 +
softmmu/vl.c | 22 ++
storage-daemon/qemu-storage-daemon.c | 2 +
util/bufferiszero.c | 14 ++
util/dsa.c | 295 ++++++++++++++++++++++
util/meson.build | 1 +
13 files changed, 727 insertions(+)
create mode 100644 linux-headers/linux/idxd.h
create mode 100644 util/dsa.c
--
2.30.2