* [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
@ 2024-07-05 18:28 Yichen Wang
2024-07-05 18:28 ` [PATCH v4 1/4] meson: Introduce 'qatzip' feature to the build system Yichen Wang
` (4 more replies)
0 siblings, 5 replies; 14+ messages in thread
From: Yichen Wang @ 2024-07-05 18:28 UTC (permalink / raw)
To: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Laurent Vivier, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Zou, Nanhai, Ho-Ren (Jack) Chuang,
Yichen Wang
v4:
- Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
- Move the IOV initialization to qatzip implementation
- Only use qatzip to compress normal pages
v3:
- Rebase changes on top of master
- Merge two patches per Fabiano Rosas's comment
- Add versions to comments and documentation
v2:
- Rebase changes on top of recent multifd code changes.
- Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
- Remove parameter tuning and use QATzip's defaults for better
performance.
- Add parameter to enable QAT software fallback.
v1:
https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
* Performance
We present updated performance results. For circumstantial reasons, v1
presented performance on a low-bandwidth (1Gbps) network. Here, we use a
similar setup as before, but with two main differences:
1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
2. We had a bug in our memory allocation causing us to only use ~1/2 of
the VM's RAM. Now we properly allocate and fill nearly all of the VM's
RAM.
Thus, the test setup is as follows:
We perform multifd live migration over TCP using a VM with 64GB memory.
We prepare the machine's memory by powering it on, allocating a large
amount of memory (60GB) as a single buffer, and filling the buffer with
the repeated contents of the Silesia corpus[0]. This is in lieu of a more
realistic memory snapshot, which proved troublesome to acquire.
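A minimal sketch of such a memory-filling helper, run inside the guest, might
look like this (hypothetical program, not part of this series; error handling
is omitted for brevity, and it assumes the Silesia corpus has been
concatenated into a single file passed as argv[1]):

    /* fill_mem.c: allocate a large buffer and fill it with repeated corpus data */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const size_t total = 60ULL << 30;       /* ~60GB, as in the test above */
        FILE *f = fopen(argv[1], "rb");         /* concatenated Silesia corpus */

        fseek(f, 0, SEEK_END);
        size_t corpus_len = ftell(f);
        rewind(f);

        char *corpus = malloc(corpus_len);
        if (fread(corpus, 1, corpus_len, f) != corpus_len) {
            return 1;
        }
        fclose(f);

        char *buf = malloc(total);
        for (size_t off = 0; off < total; off += corpus_len) {
            size_t n = (total - off < corpus_len) ? total - off : corpus_len;
            memcpy(buf + off, corpus, n);       /* touching the pages commits them */
        }
        pause();                                /* keep the buffer resident */
        return 0;
    }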
We analyze CPU usage by sampling the output of 'top' every second during
migration and averaging the results. This is admittedly imprecise, but we
feel that it accurately portrays the relative CPU usage of the different
compression methods.
We present the latency, throughput, and CPU usage results for all of the
compression methods, with varying numbers of multifd threads (4, 8, and
16).
[0] The Silesia corpus can be accessed here:
https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
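A migration using the 'qatzip' method can be driven with HMP commands along
these lines (illustrative only, shown here with 4 multifd channels; the other
methods are selected by changing 'multifd-compression'):

  (qemu) migrate_set_capability multifd on
  (qemu) migrate_set_parameter multifd-channels 4
  (qemu) migrate_set_parameter multifd-compression qatzip
  (qemu) migrate_set_parameter multifd-qatzip-level 1
  (qemu) migrate -d tcp:<dest-host>:<port>

The destination side needs the same multifd/compression settings and a
matching '-incoming' URI (or '-incoming defer').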
** Results
4 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip | 23.13 | 8749.94 |117.50 |186.49 |
|---------------|---------------|----------------|---------|---------|
|zlib |254.35 | 771.87 |388.20 |144.40 |
|---------------|---------------|----------------|---------|---------|
|zstd | 54.52 | 3442.59 |414.59 |149.77 |
|---------------|---------------|----------------|---------|---------|
|none | 12.45 |43739.60 |159.71 |204.96 |
|---------------|---------------|----------------|---------|---------|
8 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip | 16.91 |12306.52 |186.37 |391.84 |
|---------------|---------------|----------------|---------|---------|
|zlib |130.11 | 1508.89 |753.86 |289.35 |
|---------------|---------------|----------------|---------|---------|
|zstd | 27.57 | 6823.23 |786.83 |303.80 |
|---------------|---------------|----------------|---------|---------|
|none | 11.82 |46072.63 |163.74 |238.56 |
|---------------|---------------|----------------|---------|---------|
16 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip |18.64 |11044.52 | 573.61 |437.65 |
|---------------|---------------|----------------|---------|---------|
|zlib |66.43 | 2955.79 |1469.68 |567.47 |
|---------------|---------------|----------------|---------|---------|
|zstd |14.17 |13290.66 |1504.08 |615.33 |
|---------------|---------------|----------------|---------|---------|
|none |16.82 |32363.26 | 180.74 |217.17 |
|---------------|---------------|----------------|---------|---------|
** Observations
- In general, not using compression outperforms using compression in a
non-network-bound environment.
- 'qatzip' outperforms the other compression methods with 4 and 8 workers,
achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
~58% latency reduction over 'zstd' with 4 workers.
- 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
showing a ~32% increase in latency. This performance difference
becomes more noticeable with more workers, as CPU compression is highly
parallelizable.
- 'qatzip' compression uses considerably less CPU than other compression
methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
compression CPU usage compared to 'zstd' and 'zlib'.
- 'qatzip' decompression CPU usage is less impressive, and is even
slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
Bryan Zhang (4):
meson: Introduce 'qatzip' feature to the build system
migration: Add migration parameters for QATzip
migration: Introduce 'qatzip' compression method
tests/migration: Add integration test for 'qatzip' compression method
hw/core/qdev-properties-system.c | 6 +-
meson.build | 10 +
meson_options.txt | 2 +
migration/meson.build | 1 +
migration/migration-hmp-cmds.c | 8 +
migration/multifd-qatzip.c | 391 +++++++++++++++++++++++++++++++
migration/multifd.h | 5 +-
migration/options.c | 57 +++++
migration/options.h | 2 +
qapi/migration.json | 38 +++
scripts/meson-buildoptions.sh | 3 +
tests/qtest/meson.build | 4 +
tests/qtest/migration-test.c | 35 +++
13 files changed, 559 insertions(+), 3 deletions(-)
create mode 100644 migration/multifd-qatzip.c
--
Yichen Wang
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v4 1/4] meson: Introduce 'qatzip' feature to the build system
2024-07-05 18:28 [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Yichen Wang
@ 2024-07-05 18:28 ` Yichen Wang
2024-07-05 18:28 ` [PATCH v4 2/4] migration: Add migration parameters for QATzip Yichen Wang
` (3 subsequent siblings)
4 siblings, 0 replies; 14+ messages in thread
From: Yichen Wang @ 2024-07-05 18:28 UTC (permalink / raw)
To: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Laurent Vivier, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Zou, Nanhai, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Add a 'qatzip' feature, which is disabled by default, and which
depends on the QATzip library if enabled.
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
meson.build | 10 ++++++++++
meson_options.txt | 2 ++
scripts/meson-buildoptions.sh | 3 +++
3 files changed, 15 insertions(+)
diff --git a/meson.build b/meson.build
index 54e6b09f4f..820baf4745 100644
--- a/meson.build
+++ b/meson.build
@@ -1244,6 +1244,14 @@ if not get_option('uadk').auto() or have_system
uadk = declare_dependency(dependencies: [libwd, libwd_comp])
endif
endif
+
+qatzip = not_found
+if get_option('qatzip').enabled()
+ qatzip = dependency('qatzip', version: '>=1.1.2',
+ required: get_option('qatzip'),
+ method: 'pkg-config')
+endif
+
virgl = not_found
have_vhost_user_gpu = have_tools and host_os == 'linux' and pixman.found()
@@ -2378,6 +2386,7 @@ config_host_data.set('CONFIG_STATX_MNT_ID', has_statx_mnt_id)
config_host_data.set('CONFIG_ZSTD', zstd.found())
config_host_data.set('CONFIG_QPL', qpl.found())
config_host_data.set('CONFIG_UADK', uadk.found())
+config_host_data.set('CONFIG_QATZIP', qatzip.found())
config_host_data.set('CONFIG_FUSE', fuse.found())
config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
config_host_data.set('CONFIG_SPICE_PROTOCOL', spice_protocol.found())
@@ -4480,6 +4489,7 @@ summary_info += {'lzfse support': liblzfse}
summary_info += {'zstd support': zstd}
summary_info += {'Query Processing Library support': qpl}
summary_info += {'UADK Library support': uadk}
+summary_info += {'qatzip support': qatzip}
summary_info += {'NUMA host support': numa}
summary_info += {'capstone': capstone}
summary_info += {'libpmem support': libpmem}
diff --git a/meson_options.txt b/meson_options.txt
index 0269fa0f16..35a69f6697 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -261,6 +261,8 @@ option('qpl', type : 'feature', value : 'auto',
description: 'Query Processing Library support')
option('uadk', type : 'feature', value : 'auto',
description: 'UADK Library support')
+option('qatzip', type: 'feature', value: 'disabled',
+ description: 'QATzip compression support')
option('fuse', type: 'feature', value: 'auto',
description: 'FUSE block device export')
option('fuse_lseek', type : 'feature', value : 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index cfadb5ea86..1ce467e9cc 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -163,6 +163,7 @@ meson_options_help() {
printf "%s\n" ' pixman pixman support'
printf "%s\n" ' plugins TCG plugins via shared library loading'
printf "%s\n" ' png PNG support with libpng'
+ printf "%s\n" ' qatzip QATzip compression support'
printf "%s\n" ' qcow1 qcow1 image format support'
printf "%s\n" ' qed qed image format support'
printf "%s\n" ' qga-vss build QGA VSS support (broken with MinGW)'
@@ -427,6 +428,8 @@ _meson_option_parse() {
--enable-png) printf "%s" -Dpng=enabled ;;
--disable-png) printf "%s" -Dpng=disabled ;;
--prefix=*) quote_sh "-Dprefix=$2" ;;
+ --enable-qatzip) printf "%s" -Dqatzip=enabled ;;
+ --disable-qatzip) printf "%s" -Dqatzip=disabled ;;
--enable-qcow1) printf "%s" -Dqcow1=enabled ;;
--disable-qcow1) printf "%s" -Dqcow1=disabled ;;
--enable-qed) printf "%s" -Dqed=enabled ;;
--
Yichen Wang
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v4 2/4] migration: Add migration parameters for QATzip
2024-07-05 18:28 [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Yichen Wang
2024-07-05 18:28 ` [PATCH v4 1/4] meson: Introduce 'qatzip' feature to the build system Yichen Wang
@ 2024-07-05 18:28 ` Yichen Wang
2024-07-08 21:10 ` Peter Xu
2024-07-05 18:29 ` [PATCH v4 3/4] migration: Introduce 'qatzip' compression method Yichen Wang
` (2 subsequent siblings)
4 siblings, 1 reply; 14+ messages in thread
From: Yichen Wang @ 2024-07-05 18:28 UTC (permalink / raw)
To: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Laurent Vivier, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Zou, Nanhai, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Adds support for migration parameters to control QATzip compression
level and to enable/disable software fallback when QAT hardware is
unavailable. This is a preparatory commit for a subsequent commit that
will actually use QATzip compression.
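As an illustration (not part of this patch), the new parameters can be set
from the monitor once the 'qatzip' method itself is wired up in the next
patch:

  (qemu) migrate_set_parameter multifd-qatzip-level 9
  (qemu) migrate_set_parameter multifd-qatzip-sw-fallback on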
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
migration/migration-hmp-cmds.c | 8 +++++
migration/options.c | 57 ++++++++++++++++++++++++++++++++++
migration/options.h | 2 ++
qapi/migration.json | 35 +++++++++++++++++++++
4 files changed, 102 insertions(+)
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 7d608d26e1..664e2390a3 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -576,6 +576,14 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
p->has_multifd_zlib_level = true;
visit_type_uint8(v, param, &p->multifd_zlib_level, &err);
break;
+ case MIGRATION_PARAMETER_MULTIFD_QATZIP_LEVEL:
+ p->has_multifd_qatzip_level = true;
+ visit_type_uint8(v, param, &p->multifd_qatzip_level, &err);
+ break;
+ case MIGRATION_PARAMETER_MULTIFD_QATZIP_SW_FALLBACK:
+ p->has_multifd_qatzip_sw_fallback = true;
+ visit_type_bool(v, param, &p->multifd_qatzip_sw_fallback, &err);
+ break;
case MIGRATION_PARAMETER_MULTIFD_ZSTD_LEVEL:
p->has_multifd_zstd_level = true;
visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
diff --git a/migration/options.c b/migration/options.c
index 645f55003d..334d70fb6d 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -55,6 +55,15 @@
#define DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE
/* 0: means nocompress, 1: best speed, ... 9: best compress ratio */
#define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1
+/*
+ * 1: best speed, ... 9: best compress ratio
+ * There is some nuance here. Refer to QATzip documentation to understand
+ * the mapping of QATzip levels to standard deflate levels.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_QATZIP_LEVEL 1
+/* QATzip's SW fallback implementation is extremely slow, so avoid fallback */
+#define DEFAULT_MIGRATE_MULTIFD_QATZIP_SW_FALLBACK false
+
/* 0: means nocompress, 1: best speed, ... 20: best compress ratio */
#define DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL 1
@@ -123,6 +132,12 @@ Property migration_properties[] = {
DEFINE_PROP_UINT8("multifd-zlib-level", MigrationState,
parameters.multifd_zlib_level,
DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL),
+ DEFINE_PROP_UINT8("multifd-qatzip-level", MigrationState,
+ parameters.multifd_qatzip_level,
+ DEFAULT_MIGRATE_MULTIFD_QATZIP_LEVEL),
+ DEFINE_PROP_BOOL("multifd-qatzip-sw-fallback", MigrationState,
+ parameters.multifd_qatzip_sw_fallback,
+ DEFAULT_MIGRATE_MULTIFD_QATZIP_SW_FALLBACK),
DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
parameters.multifd_zstd_level,
DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
@@ -787,6 +802,20 @@ int migrate_multifd_zlib_level(void)
return s->parameters.multifd_zlib_level;
}
+int migrate_multifd_qatzip_level(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->parameters.multifd_qatzip_level;
+}
+
+bool migrate_multifd_qatzip_sw_fallback(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->parameters.multifd_qatzip_sw_fallback;
+}
+
int migrate_multifd_zstd_level(void)
{
MigrationState *s = migrate_get_current();
@@ -892,6 +921,11 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
params->multifd_compression = s->parameters.multifd_compression;
params->has_multifd_zlib_level = true;
params->multifd_zlib_level = s->parameters.multifd_zlib_level;
+ params->has_multifd_qatzip_level = true;
+ params->multifd_qatzip_level = s->parameters.multifd_qatzip_level;
+ params->has_multifd_qatzip_sw_fallback = true;
+ params->multifd_qatzip_sw_fallback =
+ s->parameters.multifd_qatzip_sw_fallback;
params->has_multifd_zstd_level = true;
params->multifd_zstd_level = s->parameters.multifd_zstd_level;
params->has_xbzrle_cache_size = true;
@@ -946,6 +980,8 @@ void migrate_params_init(MigrationParameters *params)
params->has_multifd_channels = true;
params->has_multifd_compression = true;
params->has_multifd_zlib_level = true;
+ params->has_multifd_qatzip_level = true;
+ params->has_multifd_qatzip_sw_fallback = true;
params->has_multifd_zstd_level = true;
params->has_xbzrle_cache_size = true;
params->has_max_postcopy_bandwidth = true;
@@ -1038,6 +1074,14 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
return false;
}
+ if (params->has_multifd_qatzip_level &&
+ ((params->multifd_qatzip_level > 9) ||
+ (params->multifd_qatzip_level < 1))) {
+ error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "multifd_qatzip_level",
+ "a value between 1 and 9");
+ return false;
+ }
+
if (params->has_multifd_zstd_level &&
(params->multifd_zstd_level > 20)) {
error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "multifd_zstd_level",
@@ -1195,6 +1239,12 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
if (params->has_multifd_compression) {
dest->multifd_compression = params->multifd_compression;
}
+ if (params->has_multifd_qatzip_level) {
+ dest->multifd_qatzip_level = params->multifd_qatzip_level;
+ }
+ if (params->has_multifd_qatzip_sw_fallback) {
+ dest->multifd_qatzip_sw_fallback = params->multifd_qatzip_sw_fallback;
+ }
if (params->has_multifd_zlib_level) {
dest->multifd_zlib_level = params->multifd_zlib_level;
}
@@ -1315,6 +1365,13 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
if (params->has_multifd_compression) {
s->parameters.multifd_compression = params->multifd_compression;
}
+ if (params->has_multifd_qatzip_level) {
+ s->parameters.multifd_qatzip_level = params->multifd_qatzip_level;
+ }
+ if (params->has_multifd_qatzip_sw_fallback) {
+ s->parameters.multifd_qatzip_sw_fallback =
+ params->multifd_qatzip_sw_fallback;
+ }
if (params->has_multifd_zlib_level) {
s->parameters.multifd_zlib_level = params->multifd_zlib_level;
}
diff --git a/migration/options.h b/migration/options.h
index a2397026db..24d98c6a29 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -78,6 +78,8 @@ uint64_t migrate_max_postcopy_bandwidth(void);
int migrate_multifd_channels(void);
MultiFDCompression migrate_multifd_compression(void);
int migrate_multifd_zlib_level(void);
+int migrate_multifd_qatzip_level(void);
+bool migrate_multifd_qatzip_sw_fallback(void);
int migrate_multifd_zstd_level(void);
uint8_t migrate_throttle_trigger_threshold(void);
const char *migrate_tls_authz(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 0f24206bce..8c9f2a8aa7 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -789,6 +789,16 @@
# speed, and 9 means best compression ratio which will consume
# more CPU. Defaults to 1. (Since 5.0)
#
+# @multifd-qatzip-level: Set the compression level to be used in live
+# migration. The level is an integer between 1 and 9, where 1 means
+# the best compression speed, and 9 means the best compression
+# ratio which will consume more CPU. Defaults to 1. (Since 9.1)
+#
+# @multifd-qatzip-sw-fallback: Enable software fallback if QAT hardware
+# is unavailable. Defaults to false. Software fallback performance
+# is very poor compared to regular zlib, so be cautious about
+# enabling this option. (Since 9.1)
+#
# @multifd-zstd-level: Set the compression level to be used in live
# migration, the compression level is an integer between 0 and 20,
# where 0 means no compression, 1 means the best compression
@@ -849,6 +859,7 @@
'xbzrle-cache-size', 'max-postcopy-bandwidth',
'max-cpu-throttle', 'multifd-compression',
'multifd-zlib-level', 'multifd-zstd-level',
+ 'multifd-qatzip-level', 'multifd-qatzip-sw-fallback',
'block-bitmap-mapping',
{ 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
'vcpu-dirty-limit',
@@ -964,6 +975,16 @@
# speed, and 9 means best compression ratio which will consume
# more CPU. Defaults to 1. (Since 5.0)
#
+# @multifd-qatzip-level: Set the compression level to be used in live
+# migration. The level is an integer between 1 and 9, where 1 means
+# the best compression speed, and 9 means the best compression
+# ratio which will consume more CPU. Defaults to 1. (Since 9.1)
+#
+# @multifd-qatzip-sw-fallback: Enable software fallback if QAT hardware
+# is unavailable. Defaults to false. Software fallback performance
+# is very poor compared to regular zlib, so be cautious about
+# enabling this option. (Since 9.1)
+#
# @multifd-zstd-level: Set the compression level to be used in live
# migration, the compression level is an integer between 0 and 20,
# where 0 means no compression, 1 means the best compression
@@ -1037,6 +1058,8 @@
'*max-cpu-throttle': 'uint8',
'*multifd-compression': 'MultiFDCompression',
'*multifd-zlib-level': 'uint8',
+ '*multifd-qatzip-level': 'uint8',
+ '*multifd-qatzip-sw-fallback': 'bool',
'*multifd-zstd-level': 'uint8',
'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
@@ -1168,6 +1191,16 @@
# speed, and 9 means best compression ratio which will consume
# more CPU. Defaults to 1. (Since 5.0)
#
+# @multifd-qatzip-level: Set the compression level to be used in live
+# migration. The level is an integer between 1 and 9, where 1 means
+# the best compression speed, and 9 means the best compression
+# ratio which will consume more CPU. Defaults to 1. (Since 9.1)
+#
+# @multifd-qatzip-sw-fallback: Enable software fallback if QAT hardware
+# is unavailable. Defaults to false. Software fallback performance
+# is very poor compared to regular zlib, so be cautious about
+# enabling this option. (Since 9.1)
+#
# @multifd-zstd-level: Set the compression level to be used in live
# migration, the compression level is an integer between 0 and 20,
# where 0 means no compression, 1 means the best compression
@@ -1238,6 +1271,8 @@
'*max-cpu-throttle': 'uint8',
'*multifd-compression': 'MultiFDCompression',
'*multifd-zlib-level': 'uint8',
+ '*multifd-qatzip-level': 'uint8',
+ '*multifd-qatzip-sw-fallback': 'bool',
'*multifd-zstd-level': 'uint8',
'*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
'*x-vcpu-dirty-limit-period': { 'type': 'uint64',
--
Yichen Wang
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v4 3/4] migration: Introduce 'qatzip' compression method
2024-07-05 18:28 [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Yichen Wang
2024-07-05 18:28 ` [PATCH v4 1/4] meson: Introduce 'qatzip' feature to the build system Yichen Wang
2024-07-05 18:28 ` [PATCH v4 2/4] migration: Add migration parameters for QATzip Yichen Wang
@ 2024-07-05 18:29 ` Yichen Wang
2024-07-08 21:34 ` Peter Xu
2024-07-10 15:20 ` Liu, Yuan1
2024-07-05 18:29 ` [PATCH v4 4/4] tests/migration: Add integration test for " Yichen Wang
2024-07-09 8:42 ` [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Liu, Yuan1
4 siblings, 2 replies; 14+ messages in thread
From: Yichen Wang @ 2024-07-05 18:29 UTC (permalink / raw)
To: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Laurent Vivier, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Zou, Nanhai, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Adds support for 'qatzip' as an option for the multifd compression
method parameter, and implements using QAT for 'qatzip' compression and
decompression.
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
---
hw/core/qdev-properties-system.c | 6 +-
migration/meson.build | 1 +
migration/multifd-qatzip.c | 391 +++++++++++++++++++++++++++++++
migration/multifd.h | 5 +-
qapi/migration.json | 3 +
tests/qtest/meson.build | 4 +
6 files changed, 407 insertions(+), 3 deletions(-)
create mode 100644 migration/multifd-qatzip.c
diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index f13350b4fb..eb50d6ec5b 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -659,7 +659,11 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
const PropertyInfo qdev_prop_multifd_compression = {
.name = "MultiFDCompression",
.description = "multifd_compression values, "
- "none/zlib/zstd/qpl/uadk",
+ "none/zlib/zstd/qpl/uadk"
+#ifdef CONFIG_QATZIP
+ "/qatzip"
+#endif
+ ,
.enum_table = &MultiFDCompression_lookup,
.get = qdev_propinfo_get_enum,
.set = qdev_propinfo_set_enum,
diff --git a/migration/meson.build b/migration/meson.build
index 5ce2acb41e..c9454c26ae 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -41,6 +41,7 @@ system_ss.add(when: rdma, if_true: files('rdma.c'))
system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
system_ss.add(when: uadk, if_true: files('multifd-uadk.c'))
+system_ss.add(when: qatzip, if_true: files('multifd-qatzip.c'))
specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
if_true: files('ram.c',
diff --git a/migration/multifd-qatzip.c b/migration/multifd-qatzip.c
new file mode 100644
index 0000000000..a1502a5589
--- /dev/null
+++ b/migration/multifd-qatzip.c
@@ -0,0 +1,391 @@
+/*
+ * Multifd QATzip compression implementation
+ *
+ * Copyright (c) Bytedance
+ *
+ * Authors:
+ * Bryan Zhang <bryan.zhang@bytedance.com>
+ * Hao Xiang <hao.xiang@bytedance.com>
+ * Yichen Wang <yichen.wang@bytedance.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "exec/ramblock.h"
+#include "exec/target_page.h"
+#include "qapi/error.h"
+#include "migration.h"
+#include "options.h"
+#include "multifd.h"
+#include <qatzip.h>
+
+struct qatzip_data {
+ /*
+ * Unique session for use with QATzip API
+ */
+ QzSession_T sess;
+
+ /*
+ * For compression: Buffer for pages to compress
+ * For decompression: Buffer for data to decompress
+ */
+ uint8_t *in_buf;
+ uint32_t in_len;
+
+ /*
+ * For compression: Output buffer of compressed data
+ * For decompression: Output buffer of decompressed data
+ */
+ uint8_t *out_buf;
+ uint32_t out_len;
+};
+
+/**
+ * qatzip_send_setup: Set up QATzip session and private buffers.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_send_setup(MultiFDSendParams *p, Error **errp)
+{
+ struct qatzip_data *q;
+ QzSessionParamsDeflate_T params;
+ const char *err_msg;
+ int ret;
+ int sw_fallback;
+
+ q = g_new0(struct qatzip_data, 1);
+ p->compress_data = q;
+ /* We need one extra place for the packet header */
+ p->iov = g_new0(struct iovec, 2);
+
+ sw_fallback = 0;
+ if (migrate_multifd_qatzip_sw_fallback()) {
+ sw_fallback = 1;
+ }
+
+ ret = qzInit(&q->sess, sw_fallback);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzInit failed";
+ goto err_free_q;
+ }
+
+ ret = qzGetDefaultsDeflate(&params);
+ if (ret != QZ_OK) {
+ err_msg = "qzGetDefaultsDeflate failed";
+ goto err_close;
+ }
+
+ /* Make sure to use configured QATzip compression level. */
+ params.common_params.comp_lvl = migrate_multifd_qatzip_level();
+
+ ret = qzSetupSessionDeflate(&q->sess, &params);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzSetupSessionDeflate failed";
+ goto err_close;
+ }
+
+ /* TODO Add support for larger packets. */
+ if (MULTIFD_PACKET_SIZE > UINT32_MAX) {
+ err_msg = "packet size too large for QAT";
+ goto err_close;
+ }
+
+ q->in_len = MULTIFD_PACKET_SIZE;
+ q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
+ if (!q->in_buf) {
+ err_msg = "qzMalloc failed";
+ goto err_close;
+ }
+
+ q->out_len = qzMaxCompressedLength(MULTIFD_PACKET_SIZE, &q->sess);
+ q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
+ if (!q->out_buf) {
+ err_msg = "qzMalloc failed";
+ goto err_free_inbuf;
+ }
+
+ return 0;
+
+err_free_inbuf:
+ qzFree(q->in_buf);
+err_close:
+ qzClose(&q->sess);
+err_free_q:
+ g_free(q);
+ g_free(p->iov);
+ p->iov = NULL;
+ error_setg(errp, "multifd %u: %s", p->id, err_msg);
+ return -1;
+}
+
+/**
+ * qatzip_send_cleanup: Tear down QATzip session and release private buffers.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return None
+ */
+static void qatzip_send_cleanup(MultiFDSendParams *p, Error **errp)
+{
+ struct qatzip_data *q = p->compress_data;
+ const char *err_msg;
+ int ret;
+
+ ret = qzTeardownSession(&q->sess);
+ if (ret != QZ_OK) {
+ err_msg = "qzTeardownSession failed";
+ goto err;
+ }
+
+ ret = qzClose(&q->sess);
+ if (ret != QZ_OK) {
+ err_msg = "qzClose failed";
+ goto err;
+ }
+
+ qzFree(q->in_buf);
+ q->in_buf = NULL;
+ qzFree(q->out_buf);
+ q->out_buf = NULL;
+ g_free(p->iov);
+ p->iov = NULL;
+ g_free(p->compress_data);
+ p->compress_data = NULL;
+ return;
+
+err:
+ error_setg(errp, "multifd %u: %s", p->id, err_msg);
+}
+
+/**
+ * qatzip_send_prepare: Compress pages and update IO channel info.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+ MultiFDPages_t *pages = p->pages;
+ struct qatzip_data *q = p->compress_data;
+ int ret;
+ unsigned int in_len, out_len;
+
+ if (!multifd_send_prepare_common(p)) {
+ goto out;
+ }
+
+ /* memcpy all the pages into one buffer. */
+ for (int i = 0; i < pages->normal_num; i++) {
+ memcpy(q->in_buf + (i * p->page_size),
+ p->pages->block->host + pages->offset[i],
+ p->page_size);
+ }
+
+ in_len = pages->normal_num * p->page_size;
+ if (in_len > q->in_len) {
+ error_setg(errp, "multifd %u: unexpectedly large input", p->id);
+ return -1;
+ }
+ out_len = q->out_len;
+
+ /*
+ * Unlike other multifd compression implementations, we use a non-streaming
+ * API and place all the data into one buffer, rather than sending each page
+ * to the compression API at a time. Based on initial benchmarks, the
+ * non-streaming API outperforms the streaming API. Plus, the logic in QEMU
+ * is friendly to using the non-streaming API anyway. If either of these
+ * statements becomes no longer true, we can revisit adding a streaming
+ * implementation.
+ */
+ ret = qzCompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len, 1);
+ if (ret != QZ_OK) {
+ error_setg(errp, "multifd %u: QATzip returned %d instead of QZ_OK",
+ p->id, ret);
+ return -1;
+ }
+ if (in_len != pages->normal_num * p->page_size) {
+ error_setg(errp, "multifd %u: QATzip failed to compress all input",
+ p->id);
+ return -1;
+ }
+
+ p->iov[p->iovs_num].iov_base = q->out_buf;
+ p->iov[p->iovs_num].iov_len = out_len;
+ p->iovs_num++;
+ p->next_packet_size = out_len;
+
+out:
+ p->flags |= MULTIFD_FLAG_QATZIP;
+ multifd_send_fill_packet(p);
+ return 0;
+}
+
+/**
+ * qatzip_recv_setup: Set up QATzip session and allocate private buffers.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_recv_setup(MultiFDRecvParams *p, Error **errp)
+{
+ struct qatzip_data *q;
+ QzSessionParamsDeflate_T params;
+ const char *err_msg;
+ int ret;
+ int sw_fallback;
+
+ q = g_new0(struct qatzip_data, 1);
+ p->compress_data = q;
+
+ sw_fallback = 0;
+ if (migrate_multifd_qatzip_sw_fallback()) {
+ sw_fallback = 1;
+ }
+
+ ret = qzInit(&q->sess, sw_fallback);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzInit failed";
+ goto err_free_q;
+ }
+
+ ret = qzGetDefaultsDeflate(&params);
+ if (ret != QZ_OK) {
+ err_msg = "qzGetDefaultsDeflate failed";
+ goto err_close;
+ }
+
+ /* Make sure to use configured QATzip compression level. */
+ params.common_params.comp_lvl = migrate_multifd_qatzip_level();
+
+ ret = qzSetupSessionDeflate(&q->sess, &params);
+ if (ret != QZ_OK && ret != QZ_DUPLICATE) {
+ err_msg = "qzSetupSessionDeflate failed";
+ goto err_close;
+ }
+
+ /*
+ * Mimic multifd-zlib, which reserves extra space for the
+ * incoming packet.
+ */
+ q->in_len = MULTIFD_PACKET_SIZE * 2;
+ q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
+ if (!q->in_buf) {
+ err_msg = "qzMalloc failed";
+ goto err_close;
+ }
+
+ q->out_len = MULTIFD_PACKET_SIZE;
+ q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
+ if (!q->out_buf) {
+ err_msg = "qzMalloc failed";
+ goto err_free_inbuf;
+ }
+
+ return 0;
+
+err_free_inbuf:
+ qzFree(q->in_buf);
+err_close:
+ qzClose(&q->sess);
+err_free_q:
+ g_free(q);
+ error_setg(errp, "multifd %u: %s", p->id, err_msg);
+ return -1;
+}
+
+/**
+ * qatzip_recv_cleanup: Tear down QATzip session and release private buffers.
+ *
+ * @param p Multifd channel params
+ * @return None
+ */
+static void qatzip_recv_cleanup(MultiFDRecvParams *p)
+{
+ struct qatzip_data *q = p->compress_data;
+
+ /* Ignoring return values here due to function signature. */
+ qzTeardownSession(&q->sess);
+ qzClose(&q->sess);
+ qzFree(q->in_buf);
+ qzFree(q->out_buf);
+ g_free(p->compress_data);
+}
+
+
+/**
+ * qatzip_recv: Decompress pages and copy them to the appropriate
+ * locations.
+ *
+ * @param p Multifd channel params
+ * @param errp Pointer to error, which will be set in case of error
+ * @return 0 on success, -1 on error (and *errp will be set)
+ */
+static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
+{
+ struct qatzip_data *q = p->compress_data;
+ int ret;
+ unsigned int in_len, out_len;
+ uint32_t in_size = p->next_packet_size;
+ uint32_t expected_size = p->normal_num * p->page_size;
+ uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+
+ if (in_size > q->in_len) {
+ error_setg(errp, "multifd %u: received unexpectedly large packet",
+ p->id);
+ return -1;
+ }
+
+ if (flags != MULTIFD_FLAG_QATZIP) {
+ error_setg(errp, "multifd %u: flags received %x flags expected %x",
+ p->id, flags, MULTIFD_FLAG_QATZIP);
+ return -1;
+ }
+
+ ret = qio_channel_read_all(p->c, (void *)q->in_buf, in_size, errp);
+ if (ret != 0) {
+ return ret;
+ }
+
+ in_len = in_size;
+ out_len = q->out_len;
+ ret = qzDecompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len);
+ if (ret != QZ_OK) {
+ error_setg(errp, "multifd %u: qzDecompress failed", p->id);
+ return -1;
+ }
+ if (out_len != expected_size) {
+ error_setg(errp, "multifd %u: packet size received %u size expected %u",
+ p->id, out_len, expected_size);
+ return -1;
+ }
+
+ /* Copy each page to its appropriate location. */
+ for (int i = 0; i < p->normal_num; i++) {
+ memcpy(p->host + p->normal[i],
+ q->out_buf + p->page_size * i,
+ p->page_size);
+ }
+ return 0;
+}
+
+static MultiFDMethods multifd_qatzip_ops = {
+ .send_setup = qatzip_send_setup,
+ .send_cleanup = qatzip_send_cleanup,
+ .send_prepare = qatzip_send_prepare,
+ .recv_setup = qatzip_recv_setup,
+ .recv_cleanup = qatzip_recv_cleanup,
+ .recv = qatzip_recv
+};
+
+static void multifd_qatzip_register(void)
+{
+ multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
+}
+
+migration_init(multifd_qatzip_register);
diff --git a/migration/multifd.h b/migration/multifd.h
index 0ecd6f47d7..adceb65050 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -34,14 +34,15 @@ MultiFDRecvData *multifd_get_recv_data(void);
/* Multifd Compression flags */
#define MULTIFD_FLAG_SYNC (1 << 0)
-/* We reserve 4 bits for compression methods */
-#define MULTIFD_FLAG_COMPRESSION_MASK (0xf << 1)
+/* We reserve 5 bits for compression methods */
+#define MULTIFD_FLAG_COMPRESSION_MASK (0x1f << 1)
/* we need to be compatible. Before compression value was 0 */
#define MULTIFD_FLAG_NOCOMP (0 << 1)
#define MULTIFD_FLAG_ZLIB (1 << 1)
#define MULTIFD_FLAG_ZSTD (2 << 1)
#define MULTIFD_FLAG_QPL (4 << 1)
#define MULTIFD_FLAG_UADK (8 << 1)
+#define MULTIFD_FLAG_QATZIP (16 << 1)
/* This value needs to be a multiple of qemu_target_page_size() */
#define MULTIFD_PACKET_SIZE (512 * 1024)
diff --git a/qapi/migration.json b/qapi/migration.json
index 8c9f2a8aa7..ea62f983b1 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -558,6 +558,8 @@
#
# @zstd: use zstd compression method.
#
+# @qatzip: use qatzip compression method. (Since 9.1)
+#
# @qpl: use qpl compression method. Query Processing Library(qpl) is
# based on the deflate compression algorithm and use the Intel
# In-Memory Analytics Accelerator(IAA) accelerated compression
@@ -570,6 +572,7 @@
{ 'enum': 'MultiFDCompression',
'data': [ 'none', 'zlib',
{ 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
+ { 'name': 'qatzip', 'if': 'CONFIG_QATZIP'},
{ 'name': 'qpl', 'if': 'CONFIG_QPL' },
{ 'name': 'uadk', 'if': 'CONFIG_UADK' } ] }
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 12792948ff..23e46144d7 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -324,6 +324,10 @@ if gnutls.found()
endif
endif
+if qatzip.found()
+ migration_files += [qatzip]
+endif
+
qtests = {
'bios-tables-test': [io, 'boot-sector.c', 'acpi-utils.c', 'tpm-emu.c'],
'cdrom-test': files('boot-sector.c'),
--
Yichen Wang
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v4 4/4] tests/migration: Add integration test for 'qatzip' compression method
2024-07-05 18:28 [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Yichen Wang
` (2 preceding siblings ...)
2024-07-05 18:29 ` [PATCH v4 3/4] migration: Introduce 'qatzip' compression method Yichen Wang
@ 2024-07-05 18:29 ` Yichen Wang
2024-07-09 8:42 ` [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Liu, Yuan1
4 siblings, 0 replies; 14+ messages in thread
From: Yichen Wang @ 2024-07-05 18:29 UTC (permalink / raw)
To: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Peter Xu, Fabiano Rosas, Eric Blake, Markus Armbruster,
Laurent Vivier, qemu-devel
Cc: Hao Xiang, Liu, Yuan1, Zou, Nanhai, Ho-Ren (Jack) Chuang,
Yichen Wang, Bryan Zhang
From: Bryan Zhang <bryan.zhang@bytedance.com>
Adds an integration test for 'qatzip'.
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
tests/qtest/migration-test.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 70b606b888..b796dd21cb 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -32,6 +32,10 @@
# endif /* CONFIG_TASN1 */
#endif /* CONFIG_GNUTLS */
+#ifdef CONFIG_QATZIP
+#include <qatzip.h>
+#endif /* CONFIG_QATZIP */
+
/* For dirty ring test; so far only x86_64 is supported */
#if defined(__linux__) && defined(HOST_X86_64)
#include "linux/kvm.h"
@@ -2992,6 +2996,22 @@ test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
}
#endif /* CONFIG_ZSTD */
+#ifdef CONFIG_QATZIP
+static void *
+test_migrate_precopy_tcp_multifd_qatzip_start(QTestState *from,
+ QTestState *to)
+{
+ migrate_set_parameter_int(from, "multifd-qatzip-level", 2);
+ migrate_set_parameter_int(to, "multifd-qatzip-level", 2);
+
+ /* SW fallback is disabled by default, so enable it for testing. */
+ migrate_set_parameter_bool(from, "multifd-qatzip-sw-fallback", true);
+ migrate_set_parameter_bool(to, "multifd-qatzip-sw-fallback", true);
+
+ return test_migrate_precopy_tcp_multifd_start_common(from, to, "qatzip");
+}
+#endif
+
#ifdef CONFIG_QPL
static void *
test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
@@ -3089,6 +3109,17 @@ static void test_multifd_tcp_zstd(void)
}
#endif
+#ifdef CONFIG_QATZIP
+static void test_multifd_tcp_qatzip(void)
+{
+ MigrateCommon args = {
+ .listen_uri = "defer",
+ .start_hook = test_migrate_precopy_tcp_multifd_qatzip_start,
+ };
+ test_precopy_common(&args);
+}
+#endif
+
#ifdef CONFIG_QPL
static void test_multifd_tcp_qpl(void)
{
@@ -3992,6 +4023,10 @@ int main(int argc, char **argv)
migration_test_add("/migration/multifd/tcp/plain/zstd",
test_multifd_tcp_zstd);
#endif
+#ifdef CONFIG_QATZIP
+ migration_test_add("/migration/multifd/tcp/plain/qatzip",
+ test_multifd_tcp_qatzip);
+#endif
#ifdef CONFIG_QPL
migration_test_add("/migration/multifd/tcp/plain/qpl",
test_multifd_tcp_qpl);
--
Yichen Wang
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v4 2/4] migration: Add migration parameters for QATzip
2024-07-05 18:28 ` [PATCH v4 2/4] migration: Add migration parameters for QATzip Yichen Wang
@ 2024-07-08 21:10 ` Peter Xu
0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2024-07-08 21:10 UTC (permalink / raw)
To: Yichen Wang
Cc: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Fabiano Rosas, Eric Blake, Markus Armbruster, Laurent Vivier,
qemu-devel, Hao Xiang, Liu, Yuan1, Zou, Nanhai,
Ho-Ren (Jack) Chuang, Bryan Zhang
On Fri, Jul 05, 2024 at 11:28:59AM -0700, Yichen Wang wrote:
> +# @multifd-qatzip-sw-fallback: Enable software fallback if QAT hardware
> +# is unavailable. Defaults to false. Software fallback performance
> +# is very poor compared to regular zlib, so be cautious about
> +# enabling this option. (Since 9.1)
Could we avoid this parameter but always have the fallback?
IMHO anyone who is serious about using a HW-accelerated compression method
during migration should make sure that the HW is properly set up.
If you think such caution is required, would warn_report_once() work when
the fallback happens?
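Something along these lines, perhaps (untested sketch; how QATzip actually
reports "no HW available, fell back to SW" is a placeholder here):

    ret = qzInit(&q->sess, 1 /* always allow SW fallback */);
    if (ret != QZ_OK && ret != QZ_DUPLICATE) {
        err_msg = "qzInit failed";
        goto err_free_q;
    }
    /*
     * Placeholder condition: whatever return code or query API QATzip
     * provides to indicate that the HW path is unavailable.
     */
    if (qat_hw_unavailable) {
        warn_report_once("multifd: QAT hardware unavailable, "
                         "falling back to software compression");
    }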
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 3/4] migration: Introduce 'qatzip' compression method
2024-07-05 18:29 ` [PATCH v4 3/4] migration: Introduce 'qatzip' compression method Yichen Wang
@ 2024-07-08 21:34 ` Peter Xu
2024-07-10 15:20 ` Liu, Yuan1
1 sibling, 0 replies; 14+ messages in thread
From: Peter Xu @ 2024-07-08 21:34 UTC (permalink / raw)
To: Yichen Wang
Cc: Paolo Bonzini, Daniel P. Berrangé, Eduardo Habkost,
Marc-André Lureau, Thomas Huth, Philippe Mathieu-Daudé,
Fabiano Rosas, Eric Blake, Markus Armbruster, Laurent Vivier,
qemu-devel, Hao Xiang, Liu, Yuan1, Zou, Nanhai,
Ho-Ren (Jack) Chuang, Bryan Zhang
On Fri, Jul 05, 2024 at 11:29:00AM -0700, Yichen Wang wrote:
> From: Bryan Zhang <bryan.zhang@bytedance.com>
>
> Adds support for 'qatzip' as an option for the multifd compression
> method parameter, and implements using QAT for 'qatzip' compression and
> decompression.
>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
Mostly good to me, I left some nitpicks below here and there.
> ---
> hw/core/qdev-properties-system.c | 6 +-
> migration/meson.build | 1 +
> migration/multifd-qatzip.c | 391 +++++++++++++++++++++++++++++++
> migration/multifd.h | 5 +-
> qapi/migration.json | 3 +
> tests/qtest/meson.build | 4 +
> 6 files changed, 407 insertions(+), 3 deletions(-)
> create mode 100644 migration/multifd-qatzip.c
>
> diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
> index f13350b4fb..eb50d6ec5b 100644
> --- a/hw/core/qdev-properties-system.c
> +++ b/hw/core/qdev-properties-system.c
> @@ -659,7 +659,11 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
> const PropertyInfo qdev_prop_multifd_compression = {
> .name = "MultiFDCompression",
> .description = "multifd_compression values, "
> - "none/zlib/zstd/qpl/uadk",
> + "none/zlib/zstd/qpl/uadk"
> +#ifdef CONFIG_QATZIP
> + "/qatzip"
> +#endif
> + ,
> .enum_table = &MultiFDCompression_lookup,
> .get = qdev_propinfo_get_enum,
> .set = qdev_propinfo_set_enum,
> diff --git a/migration/meson.build b/migration/meson.build
> index 5ce2acb41e..c9454c26ae 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -41,6 +41,7 @@ system_ss.add(when: rdma, if_true: files('rdma.c'))
> system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
> system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
> system_ss.add(when: uadk, if_true: files('multifd-uadk.c'))
> +system_ss.add(when: qatzip, if_true: files('multifd-qatzip.c'))
>
> specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
> if_true: files('ram.c',
> diff --git a/migration/multifd-qatzip.c b/migration/multifd-qatzip.c
> new file mode 100644
> index 0000000000..a1502a5589
> --- /dev/null
> +++ b/migration/multifd-qatzip.c
> @@ -0,0 +1,391 @@
> +/*
> + * Multifd QATzip compression implementation
> + *
> + * Copyright (c) Bytedance
> + *
> + * Authors:
> + * Bryan Zhang <bryan.zhang@bytedance.com>
> + * Hao Xiang <hao.xiang@bytedance.com>
> + * Yichen Wang <yichen.wang@bytedance.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "exec/ramblock.h"
> +#include "exec/target_page.h"
> +#include "qapi/error.h"
> +#include "migration.h"
> +#include "options.h"
> +#include "multifd.h"
> +#include <qatzip.h>
> +
> +struct qatzip_data {
> + /*
> + * Unique session for use with QATzip API
> + */
> + QzSession_T sess;
> +
> + /*
> + * For compression: Buffer for pages to compress
> + * For decompression: Buffer for data to decompress
> + */
> + uint8_t *in_buf;
> + uint32_t in_len;
> +
> + /*
> + * For compression: Output buffer of compressed data
> + * For decompression: Output buffer of decompressed data
> + */
> + uint8_t *out_buf;
> + uint32_t out_len;
> +};
> +
> +/**
> + * qatzip_send_setup: Set up QATzip session and private buffers.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_send_setup(MultiFDSendParams *p, Error **errp)
> +{
> + struct qatzip_data *q;
> + QzSessionParamsDeflate_T params;
> + const char *err_msg;
> + int ret;
> + int sw_fallback;
> +
> + q = g_new0(struct qatzip_data, 1);
> + p->compress_data = q;
> + /* We need one extra place for the packet header */
> + p->iov = g_new0(struct iovec, 2);
> +
> + sw_fallback = 0;
> + if (migrate_multifd_qatzip_sw_fallback()) {
> + sw_fallback = 1;
> + }
> +
> + ret = qzInit(&q->sess, sw_fallback);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzInit failed";
> + goto err_free_q;
> + }
> +
> + ret = qzGetDefaultsDeflate(&params);
> + if (ret != QZ_OK) {
> + err_msg = "qzGetDefaultsDeflate failed";
> + goto err_close;
> + }
> +
> + /* Make sure to use configured QATzip compression level. */
> + params.common_params.comp_lvl = migrate_multifd_qatzip_level();
> +
> + ret = qzSetupSessionDeflate(&q->sess, &params);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzSetupSessionDeflate failed";
> + goto err_close;
> + }
> +
> + /* TODO Add support for larger packets. */
> + if (MULTIFD_PACKET_SIZE > UINT32_MAX) {
Is this a real TODO to support packets larger than 4G?
We'll need to double-check in the future before increasing MULTIFD_PACKET_SIZE
regardless of QAT capabilities - multifd already made the container world
unhappy by eating too many CPU resources (so it will cause throttling of vcpu
threads..). Increasing MULTIFD_PACKET_SIZE can definitely make it worse,
probably in the form of OOM kills.
We can drop the comment otherwise, as 4G sounds big enough.
> + err_msg = "packet size too large for QAT";
> + goto err_close;
> + }
> +
> + q->in_len = MULTIFD_PACKET_SIZE;
> + q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
I'm not sure what this implies, but if it means pinning memory I'd
suggest we put that into the documentation. E.g., I think QEMU can boot with
"ulimit -l" == 0, and it would be good to mention this before users start
to use the feature, or at least document what values should be used when
page pinning is allowed.
In this specific case, it could mention the multifd default packet size for
each pinned buffer, and how many buffers qatzip will need.
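As a rough guide with the current MULTIFD_PACKET_SIZE of 512KB: each send
channel pins about 1MB (a 512KB input buffer plus a slightly larger
qzMaxCompressedLength() output buffer), and each recv channel about 1.5MB
(a 1MB input buffer plus a 512KB output buffer), multiplied by the number
of multifd channels.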
> + if (!q->in_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_close;
> + }
> +
> + q->out_len = qzMaxCompressedLength(MULTIFD_PACKET_SIZE, &q->sess);
> + q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
> + if (!q->out_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_free_inbuf;
> + }
> +
> + return 0;
> +
> +err_free_inbuf:
> + qzFree(q->in_buf);
> +err_close:
> + qzClose(&q->sess);
> +err_free_q:
> + g_free(q);
> + g_free(p->iov);
> + p->iov = NULL;
> + error_setg(errp, "multifd %u: %s", p->id, err_msg);
> + return -1;
> +}
> +
> +/**
> + * qatzip_send_cleanup: Tear down QATzip session and release private buffers.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return None
> + */
> +static void qatzip_send_cleanup(MultiFDSendParams *p, Error **errp)
> +{
> + struct qatzip_data *q = p->compress_data;
> + const char *err_msg;
> + int ret;
> +
> + ret = qzTeardownSession(&q->sess);
> + if (ret != QZ_OK) {
> + err_msg = "qzTeardownSession failed";
> + goto err;
> + }
> +
> + ret = qzClose(&q->sess);
> + if (ret != QZ_OK) {
> + err_msg = "qzClose failed";
> + goto err;
> + }
> +
> + qzFree(q->in_buf);
> + q->in_buf = NULL;
> + qzFree(q->out_buf);
> + q->out_buf = NULL;
> + g_free(p->iov);
> + p->iov = NULL;
> + g_free(p->compress_data);
> + p->compress_data = NULL;
> + return;
> +
> +err:
> + error_setg(errp, "multifd %u: %s", p->id, err_msg);
> +}
> +
> +/**
> + * qatzip_send_prepare: Compress pages and update IO channel info.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_send_prepare(MultiFDSendParams *p, Error **errp)
> +{
> + MultiFDPages_t *pages = p->pages;
> + struct qatzip_data *q = p->compress_data;
> + int ret;
> + unsigned int in_len, out_len;
> +
> + if (!multifd_send_prepare_common(p)) {
> + goto out;
> + }
> +
> + /* memcpy all the pages into one buffer. */
> + for (int i = 0; i < pages->normal_num; i++) {
> + memcpy(q->in_buf + (i * p->page_size),
> + p->pages->block->host + pages->offset[i],
> + p->page_size);
> + }
> +
> + in_len = pages->normal_num * p->page_size;
> + if (in_len > q->in_len) {
> + error_setg(errp, "multifd %u: unexpectedly large input", p->id);
> + return -1;
> + }
> + out_len = q->out_len;
> +
> + /*
> + * Unlike other multifd compression implementations, we use a non-streaming
> + * API and place all the data into one buffer, rather than sending each page
> + * to the compression API at a time. Based on initial benchmarks, the
> + * non-streaming API outperforms the streaming API. Plus, the logic in QEMU
> + * is friendly to using the non-streaming API anyway. If either of these
> + * statements becomes no longer true, we can revisit adding a streaming
> + * implementation.
> + */
Such comments are helpful, thanks. Maybe it can be moved up to where
the page copy happens? We can also mention that in the docs/ patch.
> + ret = qzCompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len, 1);
> + if (ret != QZ_OK) {
> + error_setg(errp, "multifd %u: QATzip returned %d instead of QZ_OK",
> + p->id, ret);
> + return -1;
> + }
> + if (in_len != pages->normal_num * p->page_size) {
> + error_setg(errp, "multifd %u: QATzip failed to compress all input",
> + p->id);
> + return -1;
> + }
> +
> + p->iov[p->iovs_num].iov_base = q->out_buf;
> + p->iov[p->iovs_num].iov_len = out_len;
> + p->iovs_num++;
> + p->next_packet_size = out_len;
> +
> +out:
> + p->flags |= MULTIFD_FLAG_QATZIP;
> + multifd_send_fill_packet(p);
> + return 0;
> +}
> +
> +/**
> + * qatzip_recv_setup: Set up QATzip session and allocate private buffers.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_recv_setup(MultiFDRecvParams *p, Error **errp)
> +{
> + struct qatzip_data *q;
> + QzSessionParamsDeflate_T params;
> + const char *err_msg;
> + int ret;
> + int sw_fallback;
> +
> + q = g_new0(struct qatzip_data, 1);
> + p->compress_data = q;
> +
> + sw_fallback = 0;
> + if (migrate_multifd_qatzip_sw_fallback()) {
> + sw_fallback = 1;
> + }
> +
> + ret = qzInit(&q->sess, sw_fallback);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzInit failed";
> + goto err_free_q;
> + }
> +
> + ret = qzGetDefaultsDeflate(&params);
> + if (ret != QZ_OK) {
> + err_msg = "qzGetDefaultsDeflate failed";
> + goto err_close;
> + }
> +
> + /* Make sure to use configured QATzip compression level. */
> + params.common_params.comp_lvl = migrate_multifd_qatzip_level();
> +
> + ret = qzSetupSessionDeflate(&q->sess, &params);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzSetupSessionDeflate failed";
> + goto err_close;
> + }
> +
> + /*
> + * Mimic multifd-zlib, which reserves extra space for the
> + * incoming packet.
> + */
> + q->in_len = MULTIFD_PACKET_SIZE * 2;
> + q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
> + if (!q->in_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_close;
> + }
> +
> + q->out_len = MULTIFD_PACKET_SIZE;
> + q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
> + if (!q->out_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_free_inbuf;
> + }
> +
> + return 0;
> +
> +err_free_inbuf:
> + qzFree(q->in_buf);
> +err_close:
> + qzClose(&q->sess);
> +err_free_q:
> + g_free(q);
> + error_setg(errp, "multifd %u: %s", p->id, err_msg);
> + return -1;
> +}
> +
> +/**
> + * qatzip_recv_cleanup: Tear down QATzip session and release private buffers.
> + *
> + * @param p Multifd channel params
> + * @return None
> + */
> +static void qatzip_recv_cleanup(MultiFDRecvParams *p)
> +{
> + struct qatzip_data *q = p->compress_data;
> +
> + /* Ignoring return values here due to function signature. */
> + qzTeardownSession(&q->sess);
> + qzClose(&q->sess);
> + qzFree(q->in_buf);
> + qzFree(q->out_buf);
> + g_free(p->compress_data);
> +}
> +
> +
> +/**
> + * qatzip_recv: Decompress pages and copy them to the appropriate
> + * locations.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
> +{
> + struct qatzip_data *q = p->compress_data;
> + int ret;
> + unsigned int in_len, out_len;
> + uint32_t in_size = p->next_packet_size;
> + uint32_t expected_size = p->normal_num * p->page_size;
> + uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> +
> + if (in_size > q->in_len) {
> + error_setg(errp, "multifd %u: received unexpectedly large packet",
> + p->id);
> + return -1;
> + }
> +
> + if (flags != MULTIFD_FLAG_QATZIP) {
> + error_setg(errp, "multifd %u: flags received %x flags expected %x",
> + p->id, flags, MULTIFD_FLAG_QATZIP);
> + return -1;
> + }
> +
> + ret = qio_channel_read_all(p->c, (void *)q->in_buf, in_size, errp);
> + if (ret != 0) {
> + return ret;
> + }
> +
> + in_len = in_size;
> + out_len = q->out_len;
> + ret = qzDecompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len);
> + if (ret != QZ_OK) {
> + error_setg(errp, "multifd %u: qzDecompress failed", p->id);
> + return -1;
> + }
> + if (out_len != expected_size) {
> + error_setg(errp, "multifd %u: packet size received %u size expected %u",
> + p->id, out_len, expected_size);
> + return -1;
> + }
> +
> + /* Copy each page to its appropriate location. */
> + for (int i = 0; i < p->normal_num; i++) {
> + memcpy(p->host + p->normal[i],
> + q->out_buf + p->page_size * i,
> + p->page_size);
> + }
> + return 0;
> +}
> +
> +static MultiFDMethods multifd_qatzip_ops = {
> + .send_setup = qatzip_send_setup,
> + .send_cleanup = qatzip_send_cleanup,
> + .send_prepare = qatzip_send_prepare,
> + .recv_setup = qatzip_recv_setup,
> + .recv_cleanup = qatzip_recv_cleanup,
> + .recv = qatzip_recv
> +};
> +
> +static void multifd_qatzip_register(void)
> +{
> + multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
> +}
> +
> +migration_init(multifd_qatzip_register);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 0ecd6f47d7..adceb65050 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -34,14 +34,15 @@ MultiFDRecvData *multifd_get_recv_data(void);
> /* Multifd Compression flags */
> #define MULTIFD_FLAG_SYNC (1 << 0)
>
> -/* We reserve 4 bits for compression methods */
> -#define MULTIFD_FLAG_COMPRESSION_MASK (0xf << 1)
> +/* We reserve 5 bits for compression methods */
> +#define MULTIFD_FLAG_COMPRESSION_MASK (0x1f << 1)
> /* we need to be compatible. Before compression value was 0 */
> #define MULTIFD_FLAG_NOCOMP (0 << 1)
> #define MULTIFD_FLAG_ZLIB (1 << 1)
> #define MULTIFD_FLAG_ZSTD (2 << 1)
> #define MULTIFD_FLAG_QPL (4 << 1)
> #define MULTIFD_FLAG_UADK (8 << 1)
> +#define MULTIFD_FLAG_QATZIP (16 << 1)
This is not a problem of this series alone, but it's sad to see it keep
eating the flag bits for no good reason..
Since it will never happen that we enable more than one compressor, we should
have made this a type field rather than a bitmask from the start.. We may even
avoid having it in the packet flags at all, as migration so far relies on both
sides setting up the same migration parameters, and that already includes the
compression method, or everything goes to chaos. So far this flag does mostly
nothing good but waste some cycles on both sides..
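For instance, something like the following (purely illustrative; the existing
on-wire values would still need to be kept for compatibility):

    /* Encode the compression method as a value instead of one bit per method */
    #define MULTIFD_FLAG_COMPRESSION_SHIFT  1
    #define MULTIFD_FLAG_COMPRESSION_MASK   (0x1f << MULTIFD_FLAG_COMPRESSION_SHIFT)
    /* e.g. 0 = none, 1 = zlib, 2 = zstd, 3 = qpl, 4 = uadk, 5 = qatzip, ... */
    #define MULTIFD_FLAG_METHOD(m)          ((m) << MULTIFD_FLAG_COMPRESSION_SHIFT)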
Sign.. I think we can keep this for now, as we already have 5 others
anyway.. it's a pity we didn't notice this earlier, definitely not this
series's fault.
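For illustration only (a hypothetical sketch, not something this series is
expected to do): a type-field encoding would have looked roughly like this,
where adding another method costs no extra bit:

    /* Hypothetical sketch: encode the compressor as a small integer in
     * bits 1..3 instead of one-hot bits, so the mask never needs to grow. */
    #define MULTIFD_FLAG_COMPRESSION_MASK   (0x7 << 1)
    #define MULTIFD_FLAG_NOCOMP             (0 << 1)
    #define MULTIFD_FLAG_ZLIB               (1 << 1)
    #define MULTIFD_FLAG_ZSTD               (2 << 1)
    #define MULTIFD_FLAG_QPL                (3 << 1)
    #define MULTIFD_FLAG_UADK               (4 << 1)
    #define MULTIFD_FLAG_QATZIP             (5 << 1)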
>
> /* This value needs to be a multiple of qemu_target_page_size() */
> #define MULTIFD_PACKET_SIZE (512 * 1024)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8c9f2a8aa7..ea62f983b1 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -558,6 +558,8 @@
> #
> # @zstd: use zstd compression method.
> #
> +# @qatzip: use qatzip compression method. (Since 9.1)
> +#
> # @qpl: use qpl compression method. Query Processing Library(qpl) is
> # based on the deflate compression algorithm and use the Intel
> # In-Memory Analytics Accelerator(IAA) accelerated compression
> @@ -570,6 +572,7 @@
> { 'enum': 'MultiFDCompression',
> 'data': [ 'none', 'zlib',
> { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
> + { 'name': 'qatzip', 'if': 'CONFIG_QATZIP'},
> { 'name': 'qpl', 'if': 'CONFIG_QPL' },
> { 'name': 'uadk', 'if': 'CONFIG_UADK' } ] }
>
> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> index 12792948ff..23e46144d7 100644
> --- a/tests/qtest/meson.build
> +++ b/tests/qtest/meson.build
> @@ -324,6 +324,10 @@ if gnutls.found()
> endif
> endif
>
> +if qatzip.found()
> + migration_files += [qatzip]
> +endif
> +
> qtests = {
> 'bios-tables-test': [io, 'boot-sector.c', 'acpi-utils.c', 'tpm-emu.c'],
> 'cdrom-test': files('boot-sector.c'),
> --
> Yichen Wang
>
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
2024-07-05 18:28 [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Yichen Wang
` (3 preceding siblings ...)
2024-07-05 18:29 ` [PATCH v4 4/4] tests/migration: Add integration test for " Yichen Wang
@ 2024-07-09 8:42 ` Liu, Yuan1
2024-07-09 18:42 ` Peter Xu
4 siblings, 1 reply; 14+ messages in thread
From: Liu, Yuan1 @ 2024-07-09 8:42 UTC (permalink / raw)
To: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org
Cc: Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang, Wang, Yichen
> -----Original Message-----
> From: Yichen Wang <yichen.wang@bytedance.com>
> Sent: Saturday, July 6, 2024 2:29 AM
> To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>; qemu-
> devel@nongnu.org
> Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
>
> v4:
> - Rebase changes on top of 1a2d52c7fcaeaaf4f2fe8d4d5183dccaeab67768
> - Move the IOV initialization to qatzip implementation
> - Only use qatzip to compress normal pages
>
> v3:
> - Rebase changes on top of master
> - Merge two patches per Fabiano Rosas's comment
> - Add versions into comments and documentations
>
> v2:
> - Rebase changes on top of recent multifd code changes.
> - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> - Remove parameter tuning and use QATzip's defaults for better
> performance.
> - Add parameter to enable QAT software fallback.
>
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
>
> * Performance
>
> We present updated performance results. For circumstantial reasons, v1
> presented performance on a low-bandwidth (1Gbps) network.
>
> Here, we present updated results with a similar setup as before but with
> two main differences:
>
> 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> RAM.
>
> Thus, the test setup is as follows:
>
> We perform multifd live migration over TCP using a VM with 64GB memory.
> We prepare the machine's memory by powering it on, allocating a large
> amount of memory (60GB) as a single buffer, and filling the buffer with
> the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> realistic memory snapshot, which proved troublesome to acquire.
>
> We analyze CPU usage by averaging the output of 'top' every second
> during migration. This is admittedly imprecise, but we feel that it
> accurately portrays the different degrees of CPU usage of varying
> compression methods.
>
> We present the latency, throughput, and CPU usage results for all of the
> compression methods, with varying numbers of multifd threads (4, 8, and
> 16).
>
> [0] The Silesia corpus can be accessed here:
> https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
>
> ** Results
>
> 4 multifd threads:
>
> |---------------|---------------|----------------|---------|---------|
> |method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
> |---------------|---------------|----------------|---------|---------|
> |qatzip | 23.13 | 8749.94 |117.50 |186.49 |
> |---------------|---------------|----------------|---------|---------|
> |zlib |254.35 | 771.87 |388.20 |144.40 |
> |---------------|---------------|----------------|---------|---------|
> |zstd | 54.52 | 3442.59 |414.59 |149.77 |
> |---------------|---------------|----------------|---------|---------|
> |none | 12.45 |43739.60 |159.71 |204.96 |
> |---------------|---------------|----------------|---------|---------|
>
> 8 multifd threads:
>
> |---------------|---------------|----------------|---------|---------|
> |method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
> |---------------|---------------|----------------|---------|---------|
> |qatzip | 16.91 |12306.52 |186.37 |391.84 |
> |---------------|---------------|----------------|---------|---------|
> |zlib |130.11 | 1508.89 |753.86 |289.35 |
> |---------------|---------------|----------------|---------|---------|
> |zstd | 27.57 | 6823.23 |786.83 |303.80 |
> |---------------|---------------|----------------|---------|---------|
> |none | 11.82 |46072.63 |163.74 |238.56 |
> |---------------|---------------|----------------|---------|---------|
>
> 16 multifd threads:
>
> |---------------|---------------|----------------|---------|---------|
> |method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
> |---------------|---------------|----------------|---------|---------|
> |qatzip |18.64 |11044.52 | 573.61 |437.65 |
> |---------------|---------------|----------------|---------|---------|
> |zlib |66.43 | 2955.79 |1469.68 |567.47 |
> |---------------|---------------|----------------|---------|---------|
> |zstd |14.17 |13290.66 |1504.08 |615.33 |
> |---------------|---------------|----------------|---------|---------|
> |none |16.82 |32363.26 | 180.74 |217.17 |
> |---------------|---------------|----------------|---------|---------|
>
> ** Observations
>
> - In general, not using compression outperforms using compression in a
> non-network-bound environment.
> - 'qatzip' outperforms other compression workers with 4 and 8 workers,
> achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
> ~58% latency reduction over 'zstd' with 4 workers.
> - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
> showing a ~32% increase in latency. This performance difference
> becomes more noticeable with more workers, as CPU compression is highly
> parallelizable.
> - 'qatzip' compression uses considerably less CPU than other compression
> methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
> compression CPU usage compared to 'zstd' and 'zlib'.
> - 'qatzip' decompression CPU usage is less impressive, and is even
> slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
Hi Peter & Yichen,
I ran a test based on the v4 patch set.
VM configuration: 16 vCPUs, 64G memory
VM workload: all vCPUs are idle and 54G of memory is filled with Silesia data.
QAT devices: 4
Sender migration parameters
migrate_set_capability multifd on
migrate_set_parameter multifd-channels 2/4/8
migrate_set_parameter max-bandwidth 1G/10G
migrate_set_parameter multifd-compression qatzip/zstd
Receiver migration parameters
migrate_set_capability multifd on
migrate_set_parameter multifd-channels 2
migrate_set_parameter multifd-compression qatzip/zstd
max-bandwidth: 1GBps
|-----------|--------|---------|----------|------|------|
|2 Channels |Total |down |throughput| send | recv |
| |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip | 21607| 77| 8051| 88| 125|
|-----------|--------|---------|----------|------|------|
|zstd | 78351| 96| 2199| 204| 80|
|-----------|--------|---------|----------|------|------|
|-----------|--------|---------|----------|------|------|
|4 Channels |Total |down |throughput| send | recv |
| |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip | 20336| 25| 8557| 110| 190|
|-----------|--------|---------|----------|------|------|
|zstd | 39324| 31| 4389| 406| 160|
|-----------|--------|---------|----------|------|------|
|-----------|--------|---------|----------|------|------|
|8 Channels |Total |down |throughput| send | recv |
| |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip | 20208| 22| 8613| 125| 300|
|-----------|--------|---------|----------|------|------|
|zstd | 20515| 22| 8438| 800| 340|
|-----------|--------|---------|----------|------|------|
max-bandwidth: 10GBps
|-----------|--------|---------|----------|------|------|
|2 Channels |Total |down |throughput| send | recv |
| |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip | 22450| 77| 7748| 80| 125|
|-----------|--------|---------|----------|------|------|
|zstd | 78339| 76| 2199| 204| 80|
|-----------|--------|---------|----------|------|------|
|-----------|--------|---------|----------|------|------|
|4 Channels |Total |down |throughput| send | recv |
| |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip | 13017| 24| 13401| 180| 285|
|-----------|--------|---------|----------|------|------|
|zstd | 39466| 21| 4373| 406| 160|
|-----------|--------|---------|----------|------|------|
|-----------|--------|---------|----------|------|------|
|8 Channels |Total |down |throughput| send | recv |
| |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
|-----------|--------|---------|----------|------|------|
|qatzip | 10255| 22| 17037| 280| 590|
|-----------|--------|---------|----------|------|------|
|zstd | 20126| 77| 8595| 810| 340|
|-----------|--------|---------|----------|------|------|
If the user has enabled compression for live migration, using QAT
can save host CPU resources.
When compression is enabled, the migration bottleneck is usually the
compression throughput on the sender side; since CPU decompression
throughput is higher than compression throughput (see the reference data
at https://github.com/inikep/lzbench), more CPU resources need to be
allocated to the sender side.
Summary:
1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
but ZSTD needs 800%.
2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps,
but ZSTD still cannot reach 10Gbps even if it uses 810%.
3. The QAT decompression CPU utilization is higher than QAT compression
and than ZSTD. From my analysis:
3.1 When using QAT compression, the data needs to be copied into QAT
memory (for DMA operations), and the same applies to decompression. In
addition, do_user_addr_fault is triggered during decompression because
the QAT-decompressed data is copied into the VM address space for the
first time. Since the compression and decompression themselves are
handled by QAT and consume no CPU, these copies and faults make the
receiver's CPU utilization slightly higher than the sender's.
3.2 Since zstd decompresses data directly into the VM address space,
it has one less memory copy than QAT, so its receiver CPU utilization is
better than QAT's. For the 1GBps case, the QAT receiver CPU utilization
is 125%, and the memory copy accounts for ~80% of the CPU utilization.
I think this is acceptable; considering the overall CPU usage of the
sender and receiver, the QAT benefit is good.
> Bryan Zhang (4):
> meson: Introduce 'qatzip' feature to the build system
> migration: Add migration parameters for QATzip
> migration: Introduce 'qatzip' compression method
> tests/migration: Add integration test for 'qatzip' compression method
>
> hw/core/qdev-properties-system.c | 6 +-
> meson.build | 10 +
> meson_options.txt | 2 +
> migration/meson.build | 1 +
> migration/migration-hmp-cmds.c | 8 +
> migration/multifd-qatzip.c | 391 +++++++++++++++++++++++++++++++
> migration/multifd.h | 5 +-
> migration/options.c | 57 +++++
> migration/options.h | 2 +
> qapi/migration.json | 38 +++
> scripts/meson-buildoptions.sh | 3 +
> tests/qtest/meson.build | 4 +
> tests/qtest/migration-test.c | 35 +++
> 13 files changed, 559 insertions(+), 3 deletions(-)
> create mode 100644 migration/multifd-qatzip.c
>
> --
> Yichen Wang
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
2024-07-09 8:42 ` [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Liu, Yuan1
@ 2024-07-09 18:42 ` Peter Xu
2024-07-10 13:55 ` Liu, Yuan1
0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-07-09 18:42 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org,
Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang
On Tue, Jul 09, 2024 at 08:42:59AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Yichen Wang <yichen.wang@bytedance.com>
> > Sent: Saturday, July 6, 2024 2:29 AM
> > To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> > <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-André
> > Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> > Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>; qemu-
> > devel@nongnu.org
> > Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> > Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> > <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> > Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> >
> > [...]
>
> Hi Peter & Yichen
>
> I have a test based on the V4 patch set
> VM configuration:16 vCPU, 64G memory,
> VM Workload: all vCPUs are idle and 54G memory is filled with Silesia data.
> QAT Devices: 4
>
> Sender migration parameters
> migrate_set_capability multifd on
> migrate_set_parameter multifd-channels 2/4/8
> migrate_set_parameter max-bandwidth 1G/10G
Ah, I think this means GBps... not Gbps, then.
> migrate_set_parameter multifd-compression qatzip/zstd
>
> Receiver migration parameters
> migrate_set_capability multifd on
> migrate_set_parameter multifd-channels 2
> migrate_set_parameter multifd-compression qatzip/zstd
>
> max-bandwidth: 1GBps
> |-----------|--------|---------|----------|------|------|
> |2 Channels |Total |down |throughput| send | recv |
> | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip | 21607| 77| 8051| 88| 125|
> |-----------|--------|---------|----------|------|------|
> |zstd | 78351| 96| 2199| 204| 80|
> |-----------|--------|---------|----------|------|------|
>
> |-----------|--------|---------|----------|------|------|
> |4 Channels |Total |down |throughput| send | recv |
> | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip | 20336| 25| 8557| 110| 190|
> |-----------|--------|---------|----------|------|------|
> |zstd | 39324| 31| 4389| 406| 160|
> |-----------|--------|---------|----------|------|------|
>
> |-----------|--------|---------|----------|------|------|
> |8 Channels |Total |down |throughput| send | recv |
> | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip | 20208| 22| 8613| 125| 300|
> |-----------|--------|---------|----------|------|------|
> |zstd | 20515| 22| 8438| 800| 340|
> |-----------|--------|---------|----------|------|------|
>
> max-bandwidth: 10GBps
> |-----------|--------|---------|----------|------|------|
> |2 Channels |Total |down |throughput| send | recv |
> | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip | 22450| 77| 7748| 80| 125|
> |-----------|--------|---------|----------|------|------|
> |zstd | 78339| 76| 2199| 204| 80|
> |-----------|--------|---------|----------|------|------|
>
> |-----------|--------|---------|----------|------|------|
> |4 Channels |Total |down |throughput| send | recv |
> | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip | 13017| 24| 13401| 180| 285|
> |-----------|--------|---------|----------|------|------|
> |zstd | 39466| 21| 4373| 406| 160|
> |-----------|--------|---------|----------|------|------|
>
> |-----------|--------|---------|----------|------|------|
> |8 Channels |Total |down |throughput| send | recv |
> | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> |-----------|--------|---------|----------|------|------|
> |qatzip | 10255| 22| 17037| 280| 590|
> |-----------|--------|---------|----------|------|------|
> |zstd | 20126| 77| 8595| 810| 340|
> |-----------|--------|---------|----------|------|------|
PS: this 77ms downtime smells like it hit some spikes during save/load.
It doesn't look reproducible compared to the rest of the data.
>
> If the user has enabled compression in live migration, using QAT
> can save the host CPU resources.
>
> When compression is enabled, the bottleneck of migration is usually
> the compression throughput on the sender side, since CPU decompression
> throughput is higher than compression, some reference data
> https://github.com/inikep/lzbench, so more CPU resources need to be
> allocated to the sender side.
Thank you, Yuan.
>
> Summary:
> 1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
> but ZSTD needs 800%.
> 2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps
> But ZSTD still cannot reach 10Gbps even if it uses 810%.
So I assume you always meant GBps across all the test results, as only
that matches the max-bandwidth parameter.
In that case 10GBps is actually 80Gbps, which is not a low-bandwidth
test.
The one I would be most curious about is no compression in a
low-bandwidth network test. Would you mind running one more test with
the same workload, but with: no compression, 8 channels, 10Gbps (or 1GBps)?
I think multifd shouldn't matter a huge deal in this case, but let's still
enable it and just assume that's the baseline / default setup. I would
expect this result to obviously show a win for using compressors, but
just to check.
> 3. The QAT decompression CPU utilization is higher than compression and ZSTD,
> from my analysis
> 3.1 when using QAT compression, the data needs to be copied to the QAT
> memory (for DMA operations), and the same for decompression. However,
> do_user_addr_fault will be triggered during decompression because the
> QAT decompressed data is copied to the VM address space for the first time,
> in addition, both compression and decompression are processed by QAT and
> do not consume CPU resources, so the CPU utilization of the receiver is
> slightly higher than the sender.
I thought you hit this same issue when working on QPL and I remember you
used -mem-prealloc. Why not use it here?
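For reference, a minimal sketch of what I mean, assuming the destination
QEMU is started with an explicit RAM backend (the backend name and sizes
below are just placeholders):

    # Preallocate and fault in all guest RAM on the destination at startup,
    # so first-touch page faults don't land in the decompression path.
    qemu-system-x86_64 -m 64G \
        -object memory-backend-ram,id=mem0,size=64G,prealloc=on \
        -machine memory-backend=mem0 \
        ...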
>
> 3.2 Since zstd decompression decompresses data directly into the VM address space,
> there is one less memory copy than QAT, so the CPU utilization on the receiver
> is better than QAT. For the 1GBps case, the receiver CPU utilization is 125%,
> and the memory copy occupies ~80% of CPU utilization.
Hmm, yes, I read that part in the code and I thought it was a design
decision to do the copy; the comment said "it is faster". So it's not?
I think we can definitely submit compression tasks per page rather than
buffering, if that would be better.
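Roughly something like the following, as a sketch only (it reuses the
variable names from the patch; error handling and the matching
receive-side changes are omitted):

    /* Hypothetical per-page submission: compress each page straight from
     * guest memory instead of memcpy'ing all pages into q->in_buf first.
     * Each page becomes its own compressed chunk, so the receive side
     * would need matching per-page handling. */
    uint32_t total_out = 0;
    for (int i = 0; i < pages->normal_num; i++) {
        unsigned int in_len = p->page_size;
        unsigned int out_len = q->out_len - total_out;
        int ret = qzCompress(&q->sess,
                             pages->block->host + pages->offset[i], &in_len,
                             q->out_buf + total_out, &out_len, 1);
        if (ret != QZ_OK) {
            return -1;
        }
        total_out += out_len;
    }

Whether that actually helps is unclear given QAT's preference for large
blocks and pinned buffers; this is just to show what per-page submission
would look like.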
>
> I think this is acceptable. Considering the overall CPU usage of the sender and receiver,
> the QAT benefit is good.
Yes, I don't think there's any major issue blocking this from being
supported; it's more that while we're at it, we'd better figure all these
things out.
For example, I think we used to discuss the use case where a 100G*2
network is deployed, but the admin may still want to move some control
plane VMs around using very limited bandwidth for QoS. In that case, I
wonder whether any of you have thought about using postcopy? I assume the
control plane workload isn't super critical in this case, or it wouldn't
get provisioned with a low-bandwidth network for migrations, so maybe it
would also be fine to switch to postcopy after one round of precopy on
the slow-bandwidth network.
Again, the answer doesn't block this feature in any form for whoever
simply wants to use a compressor; just asking.
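For reference, the flow I'm thinking of would be driven with something
like this (HMP sketch only; the destination address and the point at
which to switch over are placeholders):

    (qemu) migrate_set_capability postcopy-ram on
    (qemu) migrate -d tcp:dst-host:4444
    ... let one round of precopy run on the slow link ...
    (qemu) migrate_start_postcopy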
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
2024-07-09 18:42 ` Peter Xu
@ 2024-07-10 13:55 ` Liu, Yuan1
2024-07-10 15:18 ` Peter Xu
0 siblings, 1 reply; 14+ messages in thread
From: Liu, Yuan1 @ 2024-07-10 13:55 UTC (permalink / raw)
To: Peter Xu
Cc: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org,
Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, July 10, 2024 2:43 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>; Eduardo
> Habkost <eduardo@habkost.net>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>; Philippe
> Mathieu-Daudé <philmd@linaro.org>; Fabiano Rosas <farosas@suse.de>; Eric
> Blake <eblake@redhat.com>; Markus Armbruster <armbru@redhat.com>; Laurent
> Vivier <lvivier@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> <hao.xiang@linux.dev>; Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack)
> Chuang <horenchuang@bytedance.com>
> Subject: Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
>
> On Tue, Jul 09, 2024 at 08:42:59AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Yichen Wang <yichen.wang@bytedance.com>
> > > Sent: Saturday, July 6, 2024 2:29 AM
> > > To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> > > <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-
> André
> > > Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> > > Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu
> <peterx@redhat.com>;
> > > Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>;
> Markus
> > > Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>;
> qemu-
> > > devel@nongnu.org
> > > Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> > > Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> > > <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>
> > > Subject: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
> > >
> > > [...]
> >
> > Hi Peter & Yichen
> >
> > I have a test based on the V4 patch set
> > VM configuration:16 vCPU, 64G memory,
> > VM Workload: all vCPUs are idle and 54G memory is filled with Silesia
> data.
> > QAT Devices: 4
> >
> > Sender migration parameters
> > migrate_set_capability multifd on
> > migrate_set_parameter multifd-channels 2/4/8
> > migrate_set_parameter max-bandwidth 1G/10G
>
> Ah, I think this means GBps... not Gbps, then.
>
> > migrate_set_parameter multifd-compression qatzip/zstd
> >
> > Receiver migration parameters
> > migrate_set_capability multifd on
> > migrate_set_parameter multifd-channels 2
> > migrate_set_parameter multifd-compression qatzip/zstd
> >
> > max-bandwidth: 1GBps
> > |-----------|--------|---------|----------|------|------|
> > |2 Channels |Total |down |throughput| send | recv |
> > | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip | 21607| 77| 8051| 88| 125|
> > |-----------|--------|---------|----------|------|------|
> > |zstd | 78351| 96| 2199| 204| 80|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |4 Channels |Total |down |throughput| send | recv |
> > | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip | 20336| 25| 8557| 110| 190|
> > |-----------|--------|---------|----------|------|------|
> > |zstd | 39324| 31| 4389| 406| 160|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |8 Channels |Total |down |throughput| send | recv |
> > | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip | 20208| 22| 8613| 125| 300|
> > |-----------|--------|---------|----------|------|------|
> > |zstd | 20515| 22| 8438| 800| 340|
> > |-----------|--------|---------|----------|------|------|
> >
> > max-bandwidth: 10GBps
> > |-----------|--------|---------|----------|------|------|
> > |2 Channels |Total |down |throughput| send | recv |
> > | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip | 22450| 77| 7748| 80| 125|
> > |-----------|--------|---------|----------|------|------|
> > |zstd | 78339| 76| 2199| 204| 80|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |4 Channels |Total |down |throughput| send | recv |
> > | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip | 13017| 24| 13401| 180| 285|
> > |-----------|--------|---------|----------|------|------|
> > |zstd | 39466| 21| 4373| 406| 160|
> > |-----------|--------|---------|----------|------|------|
> >
> > |-----------|--------|---------|----------|------|------|
> > |8 Channels |Total |down |throughput| send | recv |
> > | |time(ms)|time(ms) |(mbps) | cpu %| cpu% |
> > |-----------|--------|---------|----------|------|------|
> > |qatzip | 10255| 22| 17037| 280| 590|
> > |-----------|--------|---------|----------|------|------|
> > |zstd | 20126| 77| 8595| 810| 340|
> > |-----------|--------|---------|----------|------|------|
>
> PS: this 77ms downtime smells like it hits some spikes during save/load.
> Doesn't look like reproducable comparing to the rest data.
I agree with this.
> >
> > If the user has enabled compression in live migration, using QAT
> > can save the host CPU resources.
> >
> > When compression is enabled, the bottleneck of migration is usually
> > the compression throughput on the sender side, since CPU decompression
> > throughput is higher than compression, some reference data
> > https://github.com/inikep/lzbench, so more CPU resources need to be
> > allocated to the sender side.
>
> Thank you, Yuan.
>
> >
> > Summary:
> > 1. In the 1GBps case, QAT only uses 88% CPU utilization to reach 1GBps,
> > but ZSTD needs 800%.
> > 2. In the 10Gbps case, QAT uses 180% CPU utilization to reach 10GBps
> > But ZSTD still cannot reach 10Gbps even if it uses 810%.
>
> So I assumed you always meant GBps across all the test results, as only
> that matches with max-bandwidth parameter.
>
> Then in this case 10GBps is actually 80Gbps, which was not a low bandwidth
> test.
>
> And I think the most interesting one that I would be curious is nocomp in
> low network tests. Would you mind run one more test with the same
> workload, but with: no-comp, 8 channels, 10Gbps (or 1GBps)?
>
> I think in this case multifd shouldn't matter a huge deal, but let's still
> enable that just assume that's the baseline / default setup. I would
> expect this result should obviously show a win on using compressors, but
> just to check.
migrate_set_parameter max-bandwidth 1250M
|-----------|--------|---------|----------|----------|------|------|
|8 Channels |Total |down |throughput|pages per | send | recv |
| |time(ms)|time(ms) |(mbps) |second | cpu %| cpu% |
|-----------|--------|---------|----------|----------|------|------|
|qatzip | 16630| 28| 10467| 2940235| 160| 360|
|-----------|--------|---------|----------|----------|------|------|
|zstd | 20165| 24| 8579| 2391465| 810| 340|
|-----------|--------|---------|----------|----------|------|------|
|none | 46063| 40| 10848| 330240| 45| 85|
|-----------|--------|---------|----------|----------|------|------|
QATzip's dirty page processing throughput is much higher than that of no
compression.
In this test the vCPUs are idle, so the migration can succeed even
without compression.
> > 3. The QAT decompression CPU utilization is higher than compression and
> ZSTD,
> > from my analysis
> > 3.1 when using QAT compression, the data needs to be copied to the
> QAT
> > memory (for DMA operations), and the same for decompression.
> However,
> > do_user_addr_fault will be triggered during decompression because
> the
> > QAT decompressed data is copied to the VM address space for the
> first time,
> > in addition, both compression and decompression are processed by
> QAT and
> > do not consume CPU resources, so the CPU utilization of the
> receiver is
> > slightly higher than the sender.
>
> I thought you hit this same issue when working on QPL and I remember you
> used -mem-prealloc. Why not use it here?
>
> >
> > 3.2 Since zstd decompression decompresses data directly into the VM
> address space,
> > there is one less memory copy than QAT, so the CPU utilization on
> the receiver
> > is better than QAT. For the 1GBps case, the receiver CPU
> utilization is 125%,
> > and the memory copy occupies ~80% of CPU utilization.
>
> Hmm, yes I read that part in code and I thought it was a design decision
> to
> do the copy, the comment said "it is faster". So it's not?
>
> I think we can definitely submit compression tasks per-page rather than
> buffering, if that would be better.
I think "faster" here probably refers to QAT throughput; QAT is more
friendly to large-block compression (e.g. 32K), and the multifd packet
here is MULTIFD_PACKET_SIZE (512K). Also, QATzip doesn't support batching
compression tasks, so copying multiple small chunks into one buffer and
compressing that is a common practice.
> > I think this is acceptable. Considering the overall CPU usage of the
> sender and receiver,
> > the QAT benefit is good.
>
> Yes, I don't think there's any major issue to block this from supported,
> it's more about when we are at it we'd better figure all things out.
>
> For example, I think we used to discuss the use case where there's 100G*2
> network deployed, but the admin may still want to have some control plane
> VMs moving around using very limited network for QoS. In that case, I
> wonder any of you thought about using postcopy? I assume the control
> plane
> workload isn't super critical in this case or it won't get provisioned
> with
> low network for migrations, in that case maybe it'll also be fine to
> post-copy after one round of precopy on the slow-bandwidth network.
>
> Again, I don't think the answer blocks such feature in any form whoever
> simply wants to use a compressor, just to ask.
I don't have much experience with postcopy, but here are some of my thoughts:
1. For write-intensive VMs, this solution can improve the chance of migration
success, because in a limited-bandwidth network the dirty page processing
throughput drops significantly without compression; the previous data shows
this (pages_per_second). It means that in a no-compression precopy, the
workload generates dirty pages faster than migration can process them,
resulting in migration failure.
2. If the VM is read-intensive or has low vCPU utilization (for example, in my
current test scenario the vCPUs are all idle), I think no compression +
precopy + postcopy cannot improve migration performance either, and may also
cause a timeout failure due to the long migration time, same as a
no-compression precopy.
3. In my opinion, postcopy is a good solution in this scenario (low network
bandwidth, VM is not critical), because even with compression turned on the
migration may still fail (pages_per_second may still be less than the rate
of new dirty pages), and it is hard to predict whether VM memory is
compression-friendly.
> Thanks,
>
> --
> Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
2024-07-10 13:55 ` Liu, Yuan1
@ 2024-07-10 15:18 ` Peter Xu
2024-07-10 15:39 ` Liu, Yuan1
0 siblings, 1 reply; 14+ messages in thread
From: Peter Xu @ 2024-07-10 15:18 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org,
Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang
On Wed, Jul 10, 2024 at 01:55:23PM +0000, Liu, Yuan1 wrote:
[...]
> migrate_set_parameter max-bandwidth 1250M
> |-----------|--------|---------|----------|----------|------|------|
> |8 Channels |Total |down |throughput|pages per | send | recv |
> | |time(ms)|time(ms) |(mbps) |second | cpu %| cpu% |
> |-----------|--------|---------|----------|----------|------|------|
> |qatzip | 16630| 28| 10467| 2940235| 160| 360|
> |-----------|--------|---------|----------|----------|------|------|
> |zstd | 20165| 24| 8579| 2391465| 810| 340|
> |-----------|--------|---------|----------|----------|------|------|
> |none | 46063| 40| 10848| 330240| 45| 85|
> |-----------|--------|---------|----------|----------|------|------|
>
> QATzip's dirty page processing throughput is much higher than that no compression.
> In this test, the vCPUs are in idle state, so the migration can be successful even
> without compression.
Thanks! This may be good material to put into docs/ too, if Yichen is
going to pick up your doc patch when reposting.
[...]
> I don’t have much experience with postcopy, here are some of my thoughts
> 1. For write-intensive VMs, this solution can improve the migration success,
> because in a limited bandwidth network scenario, the dirty page processing
> throughput will be significantly reduced for no compression, the previous
> data includes this(pages_per_second), it means that in the no compression
> precopy, the dirty pages generated by the workload are greater than the
> migration processing, resulting in migration failure.
Yes.
>
> 2. If the VM is read-intensive or has low vCPU utilization (for example, my
> current test scenario is that the vCPUs are all idle). I think no compression +
> precopy + postcopy also cannot improve the migration performance, and may also
> cause timeout failure due to long migration time, same with no compression precopy.
I don't think postcopy will trigger timeout failures - postcopy should
take a roughly constant time to complete a migration, that is, guest
memsize / bw (e.g. on the order of 64s for a 64GB guest throttled to
1GBps, regardless of the dirty rate).
The challenge is normally that page request latency is higher than with
precopy, but in this case it might not be a big deal. And I wonder
whether it could also perform pretty well on 100G*2 cards, as the delay
might be minimal even if the bandwidth is throttled.
>
> 3. In my opinion, the postcopy is a good solution in this scenario(low network bandwidth,
> VM is not critical), because even if compression is turned on, the migration may still
> fail(page_per_second may still less than the new dirty pages), and it is hard to predict
> whether VM memory is compression-friendly.
Yes.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: [PATCH v4 3/4] migration: Introduce 'qatzip' compression method
2024-07-05 18:29 ` [PATCH v4 3/4] migration: Introduce 'qatzip' compression method Yichen Wang
2024-07-08 21:34 ` Peter Xu
@ 2024-07-10 15:20 ` Liu, Yuan1
1 sibling, 0 replies; 14+ messages in thread
From: Liu, Yuan1 @ 2024-07-10 15:20 UTC (permalink / raw)
To: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Peter Xu, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org
Cc: Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang, Wang, Yichen,
Bryan Zhang
> -----Original Message-----
> From: Yichen Wang <yichen.wang@bytedance.com>
> Sent: Saturday, July 6, 2024 2:29 AM
> To: Paolo Bonzini <pbonzini@redhat.com>; Daniel P. Berrangé
> <berrange@redhat.com>; Eduardo Habkost <eduardo@habkost.net>; Marc-André
> Lureau <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>;
> Philippe Mathieu-Daudé <philmd@linaro.org>; Peter Xu <peterx@redhat.com>;
> Fabiano Rosas <farosas@suse.de>; Eric Blake <eblake@redhat.com>; Markus
> Armbruster <armbru@redhat.com>; Laurent Vivier <lvivier@redhat.com>; qemu-
> devel@nongnu.org
> Cc: Hao Xiang <hao.xiang@linux.dev>; Liu, Yuan1 <yuan1.liu@intel.com>;
> Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack) Chuang
> <horenchuang@bytedance.com>; Wang, Yichen <yichen.wang@bytedance.com>;
> Bryan Zhang <bryan.zhang@bytedance.com>
> Subject: [PATCH v4 3/4] migration: Introduce 'qatzip' compression method
>
> From: Bryan Zhang <bryan.zhang@bytedance.com>
>
> Adds support for 'qatzip' as an option for the multifd compression
> method parameter, and implements using QAT for 'qatzip' compression and
> decompression.
>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@linux.dev>
> Signed-off-by: Yichen Wang <yichen.wang@bytedance.com>
> ---
> hw/core/qdev-properties-system.c | 6 +-
> migration/meson.build | 1 +
> migration/multifd-qatzip.c | 391 +++++++++++++++++++++++++++++++
> migration/multifd.h | 5 +-
> qapi/migration.json | 3 +
> tests/qtest/meson.build | 4 +
> 6 files changed, 407 insertions(+), 3 deletions(-)
> create mode 100644 migration/multifd-qatzip.c
>
> diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-
> system.c
> index f13350b4fb..eb50d6ec5b 100644
> --- a/hw/core/qdev-properties-system.c
> +++ b/hw/core/qdev-properties-system.c
> @@ -659,7 +659,11 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
> const PropertyInfo qdev_prop_multifd_compression = {
> .name = "MultiFDCompression",
> .description = "multifd_compression values, "
> - "none/zlib/zstd/qpl/uadk",
> + "none/zlib/zstd/qpl/uadk"
> +#ifdef CONFIG_QATZIP
> + "/qatzip"
> +#endif
> + ,
> .enum_table = &MultiFDCompression_lookup,
> .get = qdev_propinfo_get_enum,
> .set = qdev_propinfo_set_enum,
> diff --git a/migration/meson.build b/migration/meson.build
> index 5ce2acb41e..c9454c26ae 100644
> --- a/migration/meson.build
> +++ b/migration/meson.build
> @@ -41,6 +41,7 @@ system_ss.add(when: rdma, if_true: files('rdma.c'))
> system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
> system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
> system_ss.add(when: uadk, if_true: files('multifd-uadk.c'))
> +system_ss.add(when: qatzip, if_true: files('multifd-qatzip.c'))
>
> specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
> if_true: files('ram.c',
> diff --git a/migration/multifd-qatzip.c b/migration/multifd-qatzip.c
> new file mode 100644
> index 0000000000..a1502a5589
> --- /dev/null
> +++ b/migration/multifd-qatzip.c
> @@ -0,0 +1,391 @@
> +/*
> + * Multifd QATzip compression implementation
> + *
> + * Copyright (c) Bytedance
> + *
> + * Authors:
> + * Bryan Zhang <bryan.zhang@bytedance.com>
> + * Hao Xiang <hao.xiang@bytedance.com>
> + * Yichen Wang <yichen.wang@bytedance.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "exec/ramblock.h"
> +#include "exec/target_page.h"
> +#include "qapi/error.h"
> +#include "migration.h"
> +#include "options.h"
> +#include "multifd.h"
> +#include <qatzip.h>
"exec/target_page.h" may not required
use "qapi/qapi-types-migration.h" to instead "migration.h"
> +struct qatzip_data {
> + /*
> + * Unique session for use with QATzip API
> + */
> + QzSession_T sess;
> +
> + /*
> + * For compression: Buffer for pages to compress
> + * For decompression: Buffer for data to decompress
> + */
> + uint8_t *in_buf;
> + uint32_t in_len;
> +
> + /*
> + * For compression: Output buffer of compressed data
> + * For decompression: Output buffer of decompressed data
> + */
> + uint8_t *out_buf;
> + uint32_t out_len;
> +};
Please add a typedef and a CamelCase name:
typedef struct QatzipData
https://www.qemu.org/docs/master/devel/style.html#comment-style
Typedefs are used to eliminate the redundant 'struct' keyword, since
type names have a different style than other identifiers
("CamelCase" versus "snake_case"). Each named struct type should have
a CamelCase name and a corresponding typedef.
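A minimal sketch of what that would look like (fields taken from the
patch; behavior unchanged):

    /* Sketch of the suggested naming. */
    typedef struct {
        QzSession_T sess;   /* QATzip session */
        uint8_t *in_buf;    /* input buffer for (de)compression */
        uint32_t in_len;
        uint8_t *out_buf;   /* output buffer */
        uint32_t out_len;
    } QatzipData;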
> +/**
> + * qatzip_send_setup: Set up QATzip session and private buffers.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_send_setup(MultiFDSendParams *p, Error **errp)
> +{
> + struct qatzip_data *q;
> + QzSessionParamsDeflate_T params;
> + const char *err_msg;
> + int ret;
> + int sw_fallback;
> +
> + q = g_new0(struct qatzip_data, 1);
> + p->compress_data = q;
> + /* We need one extra place for the packet header */
> + p->iov = g_new0(struct iovec, 2);
> +
> + sw_fallback = 0;
> + if (migrate_multifd_qatzip_sw_fallback()) {
> + sw_fallback = 1;
> + }
> +
> + ret = qzInit(&q->sess, sw_fallback);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzInit failed";
> + goto err_free_q;
> + }
> +
> + ret = qzGetDefaultsDeflate(¶ms);
> + if (ret != QZ_OK) {
> + err_msg = "qzGetDefaultsDeflate failed";
> + goto err_close;
> + }
> +
> + /* Make sure to use configured QATzip compression level. */
> + params.common_params.comp_lvl = migrate_multifd_qatzip_level();
> +
> + ret = qzSetupSessionDeflate(&q->sess, ¶ms);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzSetupSessionDeflate failed";
> + goto err_close;
> + }
> +
> + /* TODO Add support for larger packets. */
> + if (MULTIFD_PACKET_SIZE > UINT32_MAX) {
> + err_msg = "packet size too large for QAT";
> + goto err_close;
> + }
> +
> + q->in_len = MULTIFD_PACKET_SIZE;
> + q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
> + if (!q->in_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_close;
> + }
> +
> + q->out_len = qzMaxCompressedLength(MULTIFD_PACKET_SIZE, &q->sess);
> + q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
> + if (!q->out_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_free_inbuf;
> + }
> +
> + return 0;
> +
> +err_free_inbuf:
> + qzFree(q->in_buf);
> +err_close:
> + qzClose(&q->sess);
> +err_free_q:
> + g_free(q);
There might be a risk here: p->compress_data is left pointing at freed
memory after g_free(q). We need to either set p->compress_data = NULL
here, or move the "p->compress_data = q" assignment to just before the
"return 0".
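e.g. (sketch of the error path):

    err_free_q:
        g_free(q);
        p->compress_data = NULL;   /* don't leave a dangling pointer */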
> + g_free(p->iov);
> + p->iov = NULL;
> + error_setg(errp, "multifd %u: %s", p->id, err_msg);
> + return -1;
> +}
> +
> +/**
> + * qatzip_send_cleanup: Tear down QATzip session and release private
> buffers.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return None
> + */
> +static void qatzip_send_cleanup(MultiFDSendParams *p, Error **errp)
> +{
> + struct qatzip_data *q = p->compress_data;
> + const char *err_msg;
> + int ret;
> +
> + ret = qzTeardownSession(&q->sess);
> + if (ret != QZ_OK) {
> + err_msg = "qzTeardownSession failed";
> + goto err;
> + }
> +
> + ret = qzClose(&q->sess);
> + if (ret != QZ_OK) {
> + err_msg = "qzClose failed";
> + goto err;
> + }
> +
> + qzFree(q->in_buf);
> + q->in_buf = NULL;
> + qzFree(q->out_buf);
> + q->out_buf = NULL;
> + g_free(p->iov);
> + p->iov = NULL;
> + g_free(p->compress_data);
> + p->compress_data = NULL;
> + return;
> +
> +err:
> + error_setg(errp, "multifd %u: %s", p->id, err_msg);
> +}
> +
> +/**
> + * qatzip_send_prepare: Compress pages and update IO channel info.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_send_prepare(MultiFDSendParams *p, Error **errp)
> +{
> + MultiFDPages_t *pages = p->pages;
> + struct qatzip_data *q = p->compress_data;
> + int ret;
> + unsigned int in_len, out_len;
> +
> + if (!multifd_send_prepare_common(p)) {
> + goto out;
> + }
> +
> + /* memcpy all the pages into one buffer. */
> + for (int i = 0; i < pages->normal_num; i++) {
> + memcpy(q->in_buf + (i * p->page_size),
> + p->pages->block->host + pages->offset[i],
> + p->page_size);
> + }
> +
> + in_len = pages->normal_num * p->page_size;
> + if (in_len > q->in_len) {
> + error_setg(errp, "multifd %u: unexpectedly large input", p->id);
> + return -1;
> + }
> + out_len = q->out_len;
> +
> + /*
> + * Unlike other multifd compression implementations, we use a non-streaming
> + * API and place all the data into one buffer, rather than sending each page
> + * to the compression API at a time. Based on initial benchmarks, the
> + * non-streaming API outperforms the streaming API. Plus, the logic in QEMU
> + * is friendly to using the non-streaming API anyway. If either of these
> + * statements becomes no longer true, we can revisit adding a streaming
> + * implementation.
> + */
> + ret = qzCompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len, 1);
> + if (ret != QZ_OK) {
> + error_setg(errp, "multifd %u: QATzip returned %d instead of
> QZ_OK",
> + p->id, ret);
> + return -1;
> + }
> + if (in_len != pages->normal_num * p->page_size) {
> + error_setg(errp, "multifd %u: QATzip failed to compress all
> input",
> + p->id);
> + return -1;
> + }
> +
> + p->iov[p->iovs_num].iov_base = q->out_buf;
> + p->iov[p->iovs_num].iov_len = out_len;
> + p->iovs_num++;
> + p->next_packet_size = out_len;
> +
> +out:
> + p->flags |= MULTIFD_FLAG_QATZIP;
> + multifd_send_fill_packet(p);
> + return 0;
> +}
> +
> +/**
> + * qatzip_recv_setup: Set up QATzip session and allocate private buffers.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_recv_setup(MultiFDRecvParams *p, Error **errp)
> +{
> + struct qatzip_data *q;
> + QzSessionParamsDeflate_T params;
> + const char *err_msg;
> + int ret;
> + int sw_fallback;
> +
> + q = g_new0(struct qatzip_data, 1);
> + p->compress_data = q;
> +
> + sw_fallback = 0;
> + if (migrate_multifd_qatzip_sw_fallback()) {
> + sw_fallback = 1;
> + }
> +
> + ret = qzInit(&q->sess, sw_fallback);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzInit failed";
> + goto err_free_q;
> + }
> +
> + ret = qzGetDefaultsDeflate(&params);
> + if (ret != QZ_OK) {
> + err_msg = "qzGetDefaultsDeflate failed";
> + goto err_close;
> + }
> +
> + /* Make sure to use configured QATzip compression level. */
> + params.common_params.comp_lvl = migrate_multifd_qatzip_level();
There is no need to set the compression level for decompression.
> + ret = qzSetupSessionDeflate(&q->sess, &params);
> + if (ret != QZ_OK && ret != QZ_DUPLICATE) {
> + err_msg = "qzSetupSessionDeflate failed";
> + goto err_close;
> + }
> +
> + /*
> + * Mimic multifd-zlib, which reserves extra space for the
> + * incoming packet.
> + */
> + q->in_len = MULTIFD_PACKET_SIZE * 2;
I don't quite understand why a buffer of MULTIFD_PACKET_SIZE * 2 is needed here.
> + q->in_buf = qzMalloc(q->in_len, 0, PINNED_MEM);
> + if (!q->in_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_close;
> + }
> +
> + q->out_len = MULTIFD_PACKET_SIZE;
> + q->out_buf = qzMalloc(q->out_len, 0, PINNED_MEM);
> + if (!q->out_buf) {
> + err_msg = "qzMalloc failed";
> + goto err_free_inbuf;
> + }
> +
> + return 0;
> +
> +err_free_inbuf:
> + qzFree(q->in_buf);
> +err_close:
> + qzClose(&q->sess);
> +err_free_q:
> + g_free(q);
Ditto
> + error_setg(errp, "multifd %u: %s", p->id, err_msg);
> + return -1;
> +}
> +
> +/**
> + * qatzip_recv_cleanup: Tear down QATzip session and release private buffers.
> + *
> + * @param p Multifd channel params
> + * @return None
> + */
> +static void qatzip_recv_cleanup(MultiFDRecvParams *p)
> +{
> + struct qatzip_data *q = p->compress_data;
> +
> + /* Ignoring return values here due to function signature. */
> + qzTeardownSession(&q->sess);
> + qzClose(&q->sess);
> + qzFree(q->in_buf);
> + qzFree(q->out_buf);
> + g_free(p->compress_data);
> +}
> +
> +
> +/**
> + * qatzip_recv: Decompress pages and copy them to the appropriate
> + * locations.
> + *
> + * @param p Multifd channel params
> + * @param errp Pointer to error, which will be set in case of error
> + * @return 0 on success, -1 on error (and *errp will be set)
> + */
> +static int qatzip_recv(MultiFDRecvParams *p, Error **errp)
> +{
> + struct qatzip_data *q = p->compress_data;
> + int ret;
> + unsigned int in_len, out_len;
> + uint32_t in_size = p->next_packet_size;
> + uint32_t expected_size = p->normal_num * p->page_size;
> + uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> +
> + if (in_size > q->in_len) {
> + error_setg(errp, "multifd %u: received unexpectedly large
> packet",
> + p->id);
> + return -1;
> + }
> +
> + if (flags != MULTIFD_FLAG_QATZIP) {
> + error_setg(errp, "multifd %u: flags received %x flags
> expected %x",
> + p->id, flags, MULTIFD_FLAG_QATZIP);
> + return -1;
> + }
The zero-page processing is missing here; something like:

    multifd_recv_zero_page_process(p);
    if (!p->normal_num) {
        assert(in_size == 0);
        return 0;
    }
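(If I read multifd-zlib and multifd-zstd correctly, they do this zero-page
handling before reading the compressed payload, so an all-zero packet carries
no compressed data; adding the same check here would keep qatzip consistent
with them.)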
> + ret = qio_channel_read_all(p->c, (void *)q->in_buf, in_size, errp);
> + if (ret != 0) {
> + return ret;
> + }
> +
> + in_len = in_size;
> + out_len = q->out_len;
> + ret = qzDecompress(&q->sess, q->in_buf, &in_len, q->out_buf, &out_len);
> + if (ret != QZ_OK) {
> + error_setg(errp, "multifd %u: qzDecompress failed", p->id);
> + return -1;
> + }
> + if (out_len != expected_size) {
> + error_setg(errp, "multifd %u: packet size received %u size
> expected %u",
> + p->id, out_len, expected_size);
> + return -1;
> + }
> +
> + /* Copy each page to its appropriate location. */
> + for (int i = 0; i < p->normal_num; i++) {
> + memcpy(p->host + p->normal[i],
> + q->out_buf + p->page_size * i,
> + p->page_size);
> + }
> + return 0;
> +}
> +
> +static MultiFDMethods multifd_qatzip_ops = {
> + .send_setup = qatzip_send_setup,
> + .send_cleanup = qatzip_send_cleanup,
> + .send_prepare = qatzip_send_prepare,
> + .recv_setup = qatzip_recv_setup,
> + .recv_cleanup = qatzip_recv_cleanup,
> + .recv = qatzip_recv
> +};
> +
> +static void multifd_qatzip_register(void)
> +{
> + multifd_register_ops(MULTIFD_COMPRESSION_QATZIP, &multifd_qatzip_ops);
> +}
> +
> +migration_init(multifd_qatzip_register);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 0ecd6f47d7..adceb65050 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -34,14 +34,15 @@ MultiFDRecvData *multifd_get_recv_data(void);
> /* Multifd Compression flags */
> #define MULTIFD_FLAG_SYNC (1 << 0)
>
> -/* We reserve 4 bits for compression methods */
> -#define MULTIFD_FLAG_COMPRESSION_MASK (0xf << 1)
> +/* We reserve 5 bits for compression methods */
> +#define MULTIFD_FLAG_COMPRESSION_MASK (0x1f << 1)
> /* we need to be compatible. Before compression value was 0 */
> #define MULTIFD_FLAG_NOCOMP (0 << 1)
> #define MULTIFD_FLAG_ZLIB (1 << 1)
> #define MULTIFD_FLAG_ZSTD (2 << 1)
> #define MULTIFD_FLAG_QPL (4 << 1)
> #define MULTIFD_FLAG_UADK (8 << 1)
> +#define MULTIFD_FLAG_QATZIP (16 << 1)
>
> /* This value needs to be a multiple of qemu_target_page_size() */
> #define MULTIFD_PACKET_SIZE (512 * 1024)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8c9f2a8aa7..ea62f983b1 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -558,6 +558,8 @@
> #
> # @zstd: use zstd compression method.
> #
> +# @qatzip: use qatzip compression method. (Since 9.1)
> +#
> # @qpl: use qpl compression method. Query Processing Library(qpl) is
> # based on the deflate compression algorithm and use the Intel
> # In-Memory Analytics Accelerator(IAA) accelerated compression
> @@ -570,6 +572,7 @@
> { 'enum': 'MultiFDCompression',
> 'data': [ 'none', 'zlib',
> { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
> + { 'name': 'qatzip', 'if': 'CONFIG_QATZIP'},
> { 'name': 'qpl', 'if': 'CONFIG_QPL' },
> { 'name': 'uadk', 'if': 'CONFIG_UADK' } ] }
>
> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> index 12792948ff..23e46144d7 100644
> --- a/tests/qtest/meson.build
> +++ b/tests/qtest/meson.build
> @@ -324,6 +324,10 @@ if gnutls.found()
> endif
> endif
>
> +if qatzip.found()
> + migration_files += [qatzip]
> +endif
> +
> qtests = {
> 'bios-tables-test': [io, 'boot-sector.c', 'acpi-utils.c', 'tpm-emu.c'],
> 'cdrom-test': files('boot-sector.c'),
> --
> Yichen Wang
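For completeness, a rough usage sketch; the parameter names
multifd-qatzip-level and multifd-qatzip-sw-fallback below are my guess based
on the accessors used in this patch, the actual names come from patch 2/4:

    migrate_set_capability multifd on
    migrate_set_parameter multifd-compression qatzip
    migrate_set_parameter multifd-qatzip-level 1
    migrate_set_parameter multifd-qatzip-sw-fallback on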
* RE: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
2024-07-10 15:18 ` Peter Xu
@ 2024-07-10 15:39 ` Liu, Yuan1
2024-07-10 18:51 ` Peter Xu
0 siblings, 1 reply; 14+ messages in thread
From: Liu, Yuan1 @ 2024-07-10 15:39 UTC (permalink / raw)
To: Peter Xu
Cc: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org,
Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, July 10, 2024 11:19 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: Wang, Yichen <yichen.wang@bytedance.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>; Eduardo
> Habkost <eduardo@habkost.net>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Thomas Huth <thuth@redhat.com>; Philippe
> Mathieu-Daudé <philmd@linaro.org>; Fabiano Rosas <farosas@suse.de>; Eric
> Blake <eblake@redhat.com>; Markus Armbruster <armbru@redhat.com>; Laurent
> Vivier <lvivier@redhat.com>; qemu-devel@nongnu.org; Hao Xiang
> <hao.xiang@linux.dev>; Zou, Nanhai <nanhai.zou@intel.com>; Ho-Ren (Jack)
> Chuang <horenchuang@bytedance.com>
> Subject: Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
>
> On Wed, Jul 10, 2024 at 01:55:23PM +0000, Liu, Yuan1 wrote:
>
> [...]
>
> > migrate_set_parameter max-bandwidth 1250M
> > |-----------|--------|---------|----------|----------|------|------|
> > |8 Channels |Total |down |throughput|pages per | send | recv |
> > | |time(ms)|time(ms) |(mbps) |second | cpu %| cpu% |
> > |-----------|--------|---------|----------|----------|------|------|
> > |qatzip | 16630| 28| 10467| 2940235| 160| 360|
> > |-----------|--------|---------|----------|----------|------|------|
> > |zstd | 20165| 24| 8579| 2391465| 810| 340|
> > |-----------|--------|---------|----------|----------|------|------|
> > |none | 46063| 40| 10848| 330240| 45| 85|
> > |-----------|--------|---------|----------|----------|------|------|
> >
> > QATzip's dirty page processing throughput is much higher than with no compression.
> > In this test, the vCPUs are in an idle state, so the migration can be successful
> > even without compression.
>
> Thanks! Maybe good material to be put into the docs/ too, if Yichen's
> going to pick up your doc patch when he reposts.
Sure, Yichen will add my doc patch; if he doesn't include this part in
the next version, I will add it later.
> [...]
>
> > I don't have much experience with postcopy; here are some of my thoughts.
> > 1. For write-intensive VMs, this solution can improve the chance of migration
> > success, because in a limited-bandwidth network scenario the dirty page
> > processing throughput drops significantly without compression. The previous
> > data includes this (pages_per_second); it means that in the no-compression
> > precopy, the dirty pages generated by the workload exceed what the migration
> > can process, resulting in migration failure.
>
> Yes.
>
> >
> > 2. If the VM is read-intensive or has low vCPU utilization (for example, my
> > current test scenario is that the vCPUs are all idle), I think no compression +
> > precopy + postcopy also cannot improve the migration performance, and may also
> > cause a timeout failure due to the long migration time, the same as with
> > no-compression precopy.
>
> I don't think postcopy will trigger timeout failures - postcopy should use
> constant time to complete a migration, that is guest memsize / bw.
Yes, the total migration time is predictable; saying "failure due to timeout" was
incorrect, "the migration taking a long time" would be more accurate.
> The challenge is normally that the page request delay is higher than with
> precopy, but in this case it might not be a big deal. And I wonder if on
> 100G*2 cards it can also perform pretty well, as the delay might be minimal
> even if bandwidth is throttled.
I got your point, I don't have much experience in this area.
So you mean to reserve a small amount of bandwidth on a NIC for postcopy
migration, and compare the migration performance with and without traffic
on the NIC? Will data plane traffic affect page request delays in postcopy?
> >
> > 3. In my opinion, postcopy is a good solution in this scenario (low network
> > bandwidth, VM is not critical), because even if compression is turned on, the
> > migration may still fail (pages_per_second may still be less than the new
> > dirty pages), and it is hard to predict whether VM memory is
> > compression-friendly.
>
> Yes.
>
> Thanks,
>
> --
> Peter Xu
* Re: [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB
2024-07-10 15:39 ` Liu, Yuan1
@ 2024-07-10 18:51 ` Peter Xu
0 siblings, 0 replies; 14+ messages in thread
From: Peter Xu @ 2024-07-10 18:51 UTC (permalink / raw)
To: Liu, Yuan1
Cc: Wang, Yichen, Paolo Bonzini, Daniel P. Berrangé,
Eduardo Habkost, Marc-André Lureau, Thomas Huth,
Philippe Mathieu-Daudé, Fabiano Rosas, Eric Blake,
Markus Armbruster, Laurent Vivier, qemu-devel@nongnu.org,
Hao Xiang, Zou, Nanhai, Ho-Ren (Jack) Chuang
On Wed, Jul 10, 2024 at 03:39:43PM +0000, Liu, Yuan1 wrote:
> > I don't think postcopy will trigger timeout failures - postcopy should use
> > constant time to complete a migration, that is guest memsize / bw.
>
> Yes, the total migration time is predictable; saying "failure due to timeout" was
> incorrect, "the migration taking a long time" would be more accurate.
It shouldn't: postcopy always runs together with precopy, so if you start
postcopy after one round of precopy, the total migration time should
always be smaller than running two rounds of precopy.
With postcopy, the migration completes after that; with precopy, the two
rounds are followed by a dirty sync which may say "there are unfortunately
more dirty pages, let's move on with the 3rd round and more".
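As a back-of-the-envelope illustration (assuming, purely for the sake of
argument, a 64 GB guest on a ~50 Gbps link): postcopy sends each page roughly
once, so the bulk transfer is about 64 GB * 8 / 50 Gbps, i.e. on the order of
10 seconds regardless of the dirty rate, whereas each additional precopy round
re-sends whatever was dirtied during the previous one.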
>
> > The challenge is normally that the page request delay is higher than with
> > precopy, but in this case it might not be a big deal. And I wonder if on
> > 100G*2 cards it can also perform pretty well, as the delay might be minimal
> > even if bandwidth is throttled.
>
> I got your point, I don't have much experience in this area.
> So you mean to reserve a small amount of bandwidth on a NIC for postcopy
> migration, and compare the migration performance with and without traffic
> on the NIC? Will data plane traffic affect page request delays in postcopy?
I'm not sure what the "data plane" you're describing here is, but logically
VMs should be migrated over management networks, which should be somewhat
separate from the I/O within the VMs.
I'm not really asking for another test, sorry for the confusion; this is
purely a discussion. I just feel that postcopy hasn't really been seriously
considered even for many valid cases, some of which postcopy can handle
pretty well without requiring any modern hardware. There's no need to prove
which is better for this series.
Thanks,
--
Peter Xu
end of thread, other threads:[~2024-07-10 18:52 UTC | newest]
Thread overview: 14+ messages
2024-07-05 18:28 [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Yichen Wang
2024-07-05 18:28 ` [PATCH v4 1/4] meson: Introduce 'qatzip' feature to the build system Yichen Wang
2024-07-05 18:28 ` [PATCH v4 2/4] migration: Add migration parameters for QATzip Yichen Wang
2024-07-08 21:10 ` Peter Xu
2024-07-05 18:29 ` [PATCH v4 3/4] migration: Introduce 'qatzip' compression method Yichen Wang
2024-07-08 21:34 ` Peter Xu
2024-07-10 15:20 ` Liu, Yuan1
2024-07-05 18:29 ` [PATCH v4 4/4] tests/migration: Add integration test for " Yichen Wang
2024-07-09 8:42 ` [PATCH v4 0/4] Implement using Intel QAT to offload ZLIB Liu, Yuan1
2024-07-09 18:42 ` Peter Xu
2024-07-10 13:55 ` Liu, Yuan1
2024-07-10 15:18 ` Peter Xu
2024-07-10 15:39 ` Liu, Yuan1
2024-07-10 18:51 ` Peter Xu