qemu-devel.nongnu.org archive mirror
* [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
@ 2024-01-03 11:28 Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 1/4] migration: Introduce multifd-compression-accel parameter Yuan Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Yuan Liu @ 2024-01-03 11:28 UTC (permalink / raw)
  To: quintela, peterx, farosas, leobras; +Cc: qemu-devel, yuan1.liu, nanhai.zou

Hi,

I am writing to submit a code change aimed at enhancing live migration
acceleration by leveraging the compression capability of the Intel
In-Memory Analytics Accelerator (IAA).

The implementation of the IAA (de)compression code is based on the Intel
Query Processing Library (QPL), an open-source software project designed
for high-level software programming of the IAA. https://github.com/intel/qpl

In the last version, there was some discussion about whether to
introduce a new compression algorithm for IAA. Because the IAA hardware
compression algorithm is based on deflate, and QPL already supports
Zlib, in this version I implemented IAA as an accelerator for the Zlib
compression method. However, for several reasons, QPL is currently not
fully compatible with the existing Zlib method: data compressed with
Zlib cannot necessarily be decompressed by QPL, and vice versa.

I have some concerns about the existing Zlib compression:
  1. Would you consider supporting multi-stream compression within one
     channel? This may reduce the compression ratio, but it allows the
     hardware to process the streams concurrently. We can have each
     stream process multiple pages to limit the loss of compression
     ratio; for example, 128 pages could be divided into 16 streams for
     independent compression (a rough sketch follows this list). I will
     provide early performance data in the next version (v4).

  2. Would you consider using QPL/IAA as an independent compression
     method instead of an accelerator? That way we can better utilize
     the hardware's performance and features, such as IAA's canned
     mode, in which a Huffman table is generated dynamically from
     statistics of the data to improve the compression ratio.
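
As a rough C sketch of point 1 (illustration only: it assumes the QPL job
API's asynchronous qpl_submit_job()/qpl_wait_job() calls, and jobs[], host,
offsets[], zbuf, num_pages, page_size and max_comp_size are hypothetical
helpers/buffers), each page is submitted as its own stream and then waited
on, so the hardware can work on all of them concurrently:

    /* sketch: per-stream (here per-page) asynchronous compression with QPL,
     * error handling omitted */
    for (int i = 0; i < num_pages; i++) {
        qpl_job *job = jobs[i];        /* one pre-initialized job per stream */

        job->op            = qpl_op_compress;
        job->level         = 1;        /* QPL supports compression level 1 */
        job->flags         = QPL_FLAG_FIRST | QPL_FLAG_LAST |
                             QPL_FLAG_OMIT_VERIFY | QPL_FLAG_ZLIB_MODE;
        job->next_in_ptr   = host + offsets[i];
        job->available_in  = page_size;
        job->next_out_ptr  = zbuf + i * max_comp_size;
        job->available_out = max_comp_size;

        qpl_submit_job(job);           /* non-blocking submit */
    }
    for (int i = 0; i < num_pages; i++) {
        qpl_wait_job(jobs[i]);         /* wait for each stream to complete */
    }

Grouping several pages into one stream would simply extend the
QPL_FLAG_FIRST/QPL_FLAG_LAST chaining over the pages of that group, trading
some concurrency back for compression ratio.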

Test conditions:
  1. Host CPUs are Sapphire Rapids, with the frequency locked to 3.4 GHz
  2. VM type: 16 vCPUs and 64 GB of memory
  3. The Idle workload means no workload is running in the VM
  4. The Redis workload means YCSB workload B + a Redis server running
     in the VM; about 20 GB or more of memory is used.
  5. Source side migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter downtime-limit 300
     d. migrate_set_parameter multifd-compression zlib
     e. migrate_set_parameter multifd-compression-accel none/qpl
     f. migrate_set_parameter max-bandwidth 100G
  6. Destination side migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter multifd-compression zlib
     d. migrate_set_parameter multifd-compression-accel none/qpl
     e. migrate_set_parameter max-bandwidth 100G

Early migration results, each result is the average of three tests
 +--------+-------------+--------+--------+---------+----------+
 |        | The number  |total   |downtime|network  |pages per |
 |        | of channels |time(ms)|(ms)    |bandwidth|second    |
 |        | and mode    |        |        |(mbps)   |          |
 |        +-------------+--------+--------+---------+----------+
 |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
 |        +-------------+--------+--------+---------+----------+
 | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
 |workload+-------------+--------+--------+---------+----------+
 |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
 |        +-------------+--------+--------+---------+----------+
 |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
 +--------+-------------+--------+--------+---------+----------+

 +--------+-------------+--------+--------+---------+----------+
 |        | The number  |total   |downtime|network  |pages per |
 |        | of channels |time(ms)|(ms)    |bandwidth|second    |
 |        | and mode    |        |        |(mbps)   |          |
 |        +-------------+--------+--------+---------+----------+
 |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
 |        +-------------+--------+--------+---------+----------+
 | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
 |workload+-------------+--------+--------+---------+----------+
 |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
 |        +-------------+--------+--------+---------+----------+
 |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
 |        +-------------+--------+--------+---------+----------+
 |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
 +--------+-------------+--------+--------+---------+----------+

v2:
  - add support for multifd compression accelerator
  - add support for the QPL accelerator in the multifd
    compression accelerator
  - fix the issue where QPL was compiled into the migration
    module by default

v3:
  - use Meson instead of pkg-config to resolve QPL build
    dependency issue
  - fix coding style
  - fix a CI issue for get_multifd_ops function in multifd.c file

Yuan Liu (4):
  migration: Introduce multifd-compression-accel parameter
  multifd: Implement multifd compression accelerator
  configure: add qpl option
  multifd: Introduce QPL compression accelerator

 hw/core/qdev-properties-system.c    |  11 +
 include/hw/qdev-properties-system.h |   4 +
 meson.build                         |  18 ++
 meson_options.txt                   |   2 +
 migration/meson.build               |   1 +
 migration/migration-hmp-cmds.c      |  10 +
 migration/multifd-qpl.c             | 323 ++++++++++++++++++++++++++++
 migration/multifd.c                 |  40 +++-
 migration/multifd.h                 |   8 +
 migration/options.c                 |  28 +++
 migration/options.h                 |   1 +
 qapi/migration.json                 |  31 ++-
 scripts/meson-buildoptions.sh       |   3 +
 13 files changed, 477 insertions(+), 3 deletions(-)
 create mode 100644 migration/multifd-qpl.c

-- 
2.39.3




* [PATCH v3 1/4] migration: Introduce multifd-compression-accel parameter
  2024-01-03 11:28 [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Yuan Liu
@ 2024-01-03 11:28 ` Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 2/4] multifd: Implement multifd compression accelerator Yuan Liu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Yuan Liu @ 2024-01-03 11:28 UTC (permalink / raw)
  To: quintela, peterx, farosas, leobras; +Cc: qemu-devel, yuan1.liu, nanhai.zou

Introduce the multifd-compression-accel option to enable or disable the
live migration data (de)compression accelerator.

The default value of multifd-compression-accel is auto; whether an
accelerator is enabled, and which one, is detected automatically. The
acceleration function can be disabled by setting
multifd-compression-accel=none. Similarly, users can explicitly select a
specific accelerator by name, for example multifd-compression-accel=qpl.
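
For example (HMP, matching the commands used in the cover-letter test
setup; the same accelerator is set on both the source and destination):

    migrate_set_parameter multifd-compression zlib
    migrate_set_parameter multifd-compression-accel qpl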

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 hw/core/qdev-properties-system.c    | 11 ++++++++++
 include/hw/qdev-properties-system.h |  4 ++++
 migration/migration-hmp-cmds.c      | 10 ++++++++++
 migration/options.c                 | 28 ++++++++++++++++++++++++++
 migration/options.h                 |  1 +
 qapi/migration.json                 | 31 ++++++++++++++++++++++++++++-
 6 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index 688340610e..ed23035845 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -673,6 +673,17 @@ const PropertyInfo qdev_prop_multifd_compression = {
     .set_default_value = qdev_propinfo_set_default_value_enum,
 };
 
+/* --- MultiFD Compression Accelerator --- */
+
+const PropertyInfo qdev_prop_multifd_compression_accel = {
+    .name = "MultiFDCompressionAccel",
+    .description = "MultiFD Compression Accelerator, "
+                   "auto/none/qpl",
+    .enum_table = &MultiFDCompressionAccel_lookup,
+    .get = qdev_propinfo_get_enum,
+    .set = qdev_propinfo_set_enum,
+    .set_default_value = qdev_propinfo_set_default_value_enum,
+};
 /* --- Reserved Region --- */
 
 /*
diff --git a/include/hw/qdev-properties-system.h b/include/hw/qdev-properties-system.h
index 0ac327ae60..3c125db3a3 100644
--- a/include/hw/qdev-properties-system.h
+++ b/include/hw/qdev-properties-system.h
@@ -7,6 +7,7 @@ extern const PropertyInfo qdev_prop_chr;
 extern const PropertyInfo qdev_prop_macaddr;
 extern const PropertyInfo qdev_prop_reserved_region;
 extern const PropertyInfo qdev_prop_multifd_compression;
+extern const PropertyInfo qdev_prop_multifd_compression_accel;
 extern const PropertyInfo qdev_prop_losttickpolicy;
 extern const PropertyInfo qdev_prop_blockdev_on_error;
 extern const PropertyInfo qdev_prop_bios_chs_trans;
@@ -41,6 +42,9 @@ extern const PropertyInfo qdev_prop_pcie_link_width;
 #define DEFINE_PROP_MULTIFD_COMPRESSION(_n, _s, _f, _d) \
     DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_multifd_compression, \
                        MultiFDCompression)
+#define DEFINE_PROP_MULTIFD_COMP_ACCEL(_n, _s, _f, _d) \
+    DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_multifd_compression_accel, \
+                       MultiFDCompressionAccel)
 #define DEFINE_PROP_LOSTTICKPOLICY(_n, _s, _f, _d) \
     DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_losttickpolicy, \
                         LostTickPolicy)
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index a82597f18e..3a278c89d9 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -344,6 +344,11 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %s\n",
             MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_COMPRESSION),
             MultiFDCompression_str(params->multifd_compression));
+        assert(params->has_multifd_compression_accel);
+        monitor_printf(mon, "%s: %s\n",
+            MigrationParameter_str(
+                MIGRATION_PARAMETER_MULTIFD_COMPRESSION_ACCEL),
+            MultiFDCompressionAccel_str(params->multifd_compression_accel));
         monitor_printf(mon, "%s: %" PRIu64 " bytes\n",
             MigrationParameter_str(MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE),
             params->xbzrle_cache_size);
@@ -610,6 +615,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         visit_type_MultiFDCompression(v, param, &p->multifd_compression,
                                       &err);
         break;
+    case MIGRATION_PARAMETER_MULTIFD_COMPRESSION_ACCEL:
+        p->has_multifd_compression_accel = true;
+        visit_type_MultiFDCompressionAccel(v, param,
+                                           &p->multifd_compression_accel, &err);
+        break;
     case MIGRATION_PARAMETER_MULTIFD_ZLIB_LEVEL:
         p->has_multifd_zlib_level = true;
         visit_type_uint8(v, param, &p->multifd_zlib_level, &err);
diff --git a/migration/options.c b/migration/options.c
index 42fb818956..6ef06d1816 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -59,6 +59,12 @@
 #define DEFAULT_MIGRATE_X_CHECKPOINT_DELAY (200 * 100)
 #define DEFAULT_MIGRATE_MULTIFD_CHANNELS 2
 #define DEFAULT_MIGRATE_MULTIFD_COMPRESSION MULTIFD_COMPRESSION_NONE
+
+/*
+ * When the compression method is available and supported by the
+ * accelerator, data compression is performed using the accelerator.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_COMPRESSION_ACCEL MULTIFD_COMPRESSION_ACCEL_AUTO
 /* 0: means nocompress, 1: best speed, ... 9: best compress ratio */
 #define DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL 1
 /* 0: means nocompress, 1: best speed, ... 20: best compress ratio */
@@ -139,6 +145,9 @@ Property migration_properties[] = {
     DEFINE_PROP_MULTIFD_COMPRESSION("multifd-compression", MigrationState,
                       parameters.multifd_compression,
                       DEFAULT_MIGRATE_MULTIFD_COMPRESSION),
+    DEFINE_PROP_MULTIFD_COMP_ACCEL("multifd-compression-accel", MigrationState,
+                      parameters.multifd_compression_accel,
+                      DEFAULT_MIGRATE_MULTIFD_COMPRESSION_ACCEL),
     DEFINE_PROP_UINT8("multifd-zlib-level", MigrationState,
                       parameters.multifd_zlib_level,
                       DEFAULT_MIGRATE_MULTIFD_ZLIB_LEVEL),
@@ -818,6 +827,15 @@ MultiFDCompression migrate_multifd_compression(void)
     return s->parameters.multifd_compression;
 }
 
+MultiFDCompressionAccel migrate_multifd_compression_accel(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    assert(s->parameters.multifd_compression_accel <
+           MULTIFD_COMPRESSION_ACCEL__MAX);
+    return s->parameters.multifd_compression_accel;
+}
+
 int migrate_multifd_zlib_level(void)
 {
     MigrationState *s = migrate_get_current();
@@ -945,6 +963,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->multifd_channels = s->parameters.multifd_channels;
     params->has_multifd_compression = true;
     params->multifd_compression = s->parameters.multifd_compression;
+    params->has_multifd_compression_accel = true;
+    params->multifd_compression_accel = s->parameters.multifd_compression_accel;
     params->has_multifd_zlib_level = true;
     params->multifd_zlib_level = s->parameters.multifd_zlib_level;
     params->has_multifd_zstd_level = true;
@@ -999,6 +1019,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_block_incremental = true;
     params->has_multifd_channels = true;
     params->has_multifd_compression = true;
+    params->has_multifd_compression_accel = true;
     params->has_multifd_zlib_level = true;
     params->has_multifd_zstd_level = true;
     params->has_xbzrle_cache_size = true;
@@ -1273,6 +1294,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_multifd_compression) {
         dest->multifd_compression = params->multifd_compression;
     }
+    if (params->has_multifd_compression_accel) {
+        dest->multifd_compression_accel = params->multifd_compression_accel;
+    }
     if (params->has_xbzrle_cache_size) {
         dest->xbzrle_cache_size = params->xbzrle_cache_size;
     }
@@ -1394,6 +1418,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_multifd_compression) {
         s->parameters.multifd_compression = params->multifd_compression;
     }
+    if (params->has_multifd_compression_accel) {
+        s->parameters.multifd_compression_accel =
+            params->multifd_compression_accel;
+    }
     if (params->has_xbzrle_cache_size) {
         s->parameters.xbzrle_cache_size = params->xbzrle_cache_size;
         xbzrle_cache_resize(params->xbzrle_cache_size, errp);
diff --git a/migration/options.h b/migration/options.h
index 237f2d6b4a..e59bf4b5c1 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -85,6 +85,7 @@ uint64_t migrate_avail_switchover_bandwidth(void);
 uint64_t migrate_max_postcopy_bandwidth(void);
 int migrate_multifd_channels(void);
 MultiFDCompression migrate_multifd_compression(void);
+MultiFDCompressionAccel migrate_multifd_compression_accel(void);
 int migrate_multifd_zlib_level(void);
 int migrate_multifd_zstd_level(void);
 uint8_t migrate_throttle_trigger_threshold(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index db3df12d6c..7a1dde6c5c 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -616,6 +616,22 @@
             { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
 
 ##
+# @MultiFDCompressionAccel:
+#
+# An enumeration of multifd compression accelerator.
+#
+# @auto: if accelerators are available, enable one of them.
+#
+# @none: disable compression accelerator.
+#
+# @qpl: enable qpl compression accelerator.
+#
+# Since: 8.2
+##
+{ 'enum': 'MultiFDCompressionAccel',
+  'data': [ 'auto', 'none',
+            { 'name': 'qpl', 'if': 'CONFIG_QPL' } ] }
+##
 # @BitmapMigrationBitmapAliasTransform:
 #
 # @persistent: If present, the bitmap will be made persistent or
@@ -798,6 +814,9 @@
 # @multifd-compression: Which compression method to use.  Defaults to
 #     none.  (Since 5.0)
 #
+# @multifd-compression-accel: Which compression accelerator to use.
+#     Defaults to auto.  (Since 8.2)
+#
 # @multifd-zlib-level: Set the compression level to be used in live
 #     migration, the compression level is an integer between 0 and 9,
 #     where 0 means no compression, 1 means the best compression
@@ -853,7 +872,9 @@
            'block-incremental',
            'multifd-channels',
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
-           'max-cpu-throttle', 'multifd-compression',
+           'max-cpu-throttle',
+           'multifd-compression',
+           'multifd-compression-accel',
            'multifd-zlib-level', 'multifd-zstd-level',
            'block-bitmap-mapping',
            { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
@@ -974,6 +995,9 @@
 # @multifd-compression: Which compression method to use.  Defaults to
 #     none.  (Since 5.0)
 #
+# @multifd-compression-accel: Which compression accelerator to use.
+#     Defaults to auto. (Since 8.2)
+#
 # @multifd-zlib-level: Set the compression level to be used in live
 #     migration, the compression level is an integer between 0 and 9,
 #     where 0 means no compression, 1 means the best compression
@@ -1046,6 +1070,7 @@
             '*max-postcopy-bandwidth': 'size',
             '*max-cpu-throttle': 'uint8',
             '*multifd-compression': 'MultiFDCompression',
+            '*multifd-compression-accel': 'MultiFDCompressionAccel',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
@@ -1188,6 +1213,9 @@
 # @multifd-compression: Which compression method to use.  Defaults to
 #     none.  (Since 5.0)
 #
+# @multifd-compression-accel: Which compression accelerator to use.
+#     Defaults to auto. (Since 8.2)
+#
 # @multifd-zlib-level: Set the compression level to be used in live
 #     migration, the compression level is an integer between 0 and 9,
 #     where 0 means no compression, 1 means the best compression
@@ -1257,6 +1285,7 @@
             '*max-postcopy-bandwidth': 'size',
             '*max-cpu-throttle': 'uint8',
             '*multifd-compression': 'MultiFDCompression',
+            '*multifd-compression-accel': 'MultiFDCompressionAccel',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ],
-- 
2.39.3




* [PATCH v3 2/4] multifd: Implement multifd compression accelerator
  2024-01-03 11:28 [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 1/4] migration: Introduce multifd-compression-accel parameter Yuan Liu
@ 2024-01-03 11:28 ` Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 3/4] configure: add qpl option Yuan Liu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Yuan Liu @ 2024-01-03 11:28 UTC (permalink / raw)
  To: quintela, peterx, farosas, leobras; +Cc: qemu-devel, yuan1.liu, nanhai.zou

When starting multifd live migration, if a compression method is
enabled, the compression can be accelerated using an accelerator.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 40 ++++++++++++++++++++++++++++++++++++++--
 migration/multifd.h |  8 ++++++++
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 1fe53d3b98..8ee083b691 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -165,6 +165,36 @@ static MultiFDMethods multifd_nocomp_ops = {
 static MultiFDMethods *multifd_ops[MULTIFD_COMPRESSION__MAX] = {
     [MULTIFD_COMPRESSION_NONE] = &multifd_nocomp_ops,
 };
+static MultiFDAccelMethods *accel_multifd_ops[MULTIFD_COMPRESSION_ACCEL__MAX];
+
+static MultiFDMethods *get_multifd_ops(void)
+{
+    MultiFDCompression comp = migrate_multifd_compression();
+    MultiFDCompressionAccel accel = migrate_multifd_compression_accel();
+
+    assert(comp < MULTIFD_COMPRESSION__MAX);
+    assert(accel < MULTIFD_COMPRESSION_ACCEL__MAX);
+    if (comp == MULTIFD_COMPRESSION_NONE ||
+        accel == MULTIFD_COMPRESSION_ACCEL_NONE) {
+        return multifd_ops[comp];
+    }
+    if (accel == MULTIFD_COMPRESSION_ACCEL_AUTO) {
+        for (int i = 0; i < MULTIFD_COMPRESSION_ACCEL__MAX; i++) {
+            if (accel_multifd_ops[i] &&
+                accel_multifd_ops[i]->is_supported(comp)) {
+                return accel_multifd_ops[i]->get_multifd_methods();
+            }
+        }
+        return multifd_ops[comp];
+    }
+
+    /* Check if a specified accelerator is available */
+    if (accel_multifd_ops[accel] &&
+        accel_multifd_ops[accel]->is_supported(comp)) {
+        return accel_multifd_ops[accel]->get_multifd_methods();
+    }
+    return multifd_ops[comp];
+}
 
 void multifd_register_ops(int method, MultiFDMethods *ops)
 {
@@ -172,6 +202,12 @@ void multifd_register_ops(int method, MultiFDMethods *ops)
     multifd_ops[method] = ops;
 }
 
+void multifd_register_accel_ops(int accel, MultiFDAccelMethods *ops)
+{
+    assert(0 < accel && accel < MULTIFD_COMPRESSION_ACCEL__MAX);
+    accel_multifd_ops[accel] = ops;
+}
+
 static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
 {
     MultiFDInit_t msg = {};
@@ -922,7 +958,7 @@ int multifd_save_setup(Error **errp)
     multifd_send_state->pages = multifd_pages_init(page_count);
     qemu_sem_init(&multifd_send_state->channels_ready, 0);
     qatomic_set(&multifd_send_state->exiting, 0);
-    multifd_send_state->ops = multifd_ops[migrate_multifd_compression()];
+    multifd_send_state->ops = get_multifd_ops();
 
     for (i = 0; i < thread_count; i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -1180,7 +1216,7 @@ int multifd_load_setup(Error **errp)
     multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
     qatomic_set(&multifd_recv_state->count, 0);
     qemu_sem_init(&multifd_recv_state->sem_sync, 0);
-    multifd_recv_state->ops = multifd_ops[migrate_multifd_compression()];
+    multifd_recv_state->ops = get_multifd_ops();
 
     for (i = 0; i < thread_count; i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
diff --git a/migration/multifd.h b/migration/multifd.h
index a835643b48..c40ff79443 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -206,7 +206,15 @@ typedef struct {
     int (*recv_pages)(MultiFDRecvParams *p, Error **errp);
 } MultiFDMethods;
 
+typedef struct {
+    /* Check if the compression method supports acceleration */
+    bool (*is_supported) (MultiFDCompression compression);
+    /* Get multifd methods of the accelerator */
+    MultiFDMethods* (*get_multifd_methods)(void);
+} MultiFDAccelMethods;
+
 void multifd_register_ops(int method, MultiFDMethods *ops);
+void multifd_register_accel_ops(int accel, MultiFDAccelMethods *ops);
 
 #endif
 
-- 
2.39.3




* [PATCH v3 3/4] configure: add qpl option
  2024-01-03 11:28 [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 1/4] migration: Introduce multifd-compression-accel parameter Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 2/4] multifd: Implement multifd compression accelerator Yuan Liu
@ 2024-01-03 11:28 ` Yuan Liu
  2024-01-03 11:28 ` [PATCH v3 4/4] multifd: Introduce QPL compression accelerator Yuan Liu
  2024-01-29 10:42 ` [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Peter Xu
  4 siblings, 0 replies; 9+ messages in thread
From: Yuan Liu @ 2024-01-03 11:28 UTC (permalink / raw)
  To: quintela, peterx, farosas, leobras; +Cc: qemu-devel, yuan1.liu, nanhai.zou

The Query Processing Library (QPL) is an open-source library that
provides data compression and decompression features.

Add --enable-qpl and --disable-qpl options to enable and disable
the QPL compression accelerator. The QPL compression accelerator
can accelerate the Zlib compression algorithm during live migration.
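
For example (a sketch of the resulting build options):

    ./configure --enable-qpl      # requires libqpl and libaccel-config
    ./configure --disable-qpl

These map to the Meson options -Dqpl=enabled and -Dqpl=disabled,
respectively.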

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 meson.build                   | 18 ++++++++++++++++++
 meson_options.txt             |  2 ++
 scripts/meson-buildoptions.sh |  3 +++
 3 files changed, 23 insertions(+)

diff --git a/meson.build b/meson.build
index 259dc5f308..f2bb81f9cb 100644
--- a/meson.build
+++ b/meson.build
@@ -1032,6 +1032,22 @@ if not get_option('zstd').auto() or have_block
                     required: get_option('zstd'),
                     method: 'pkg-config')
 endif
+qpl = not_found
+if not get_option('qpl').auto()
+  libqpl = cc.find_library('qpl', required: false)
+  if not libqpl.found()
+    error('libqpl not found, please install it from ' +
+    'https://intel.github.io/qpl/documentation/get_started_docs/installation.html')
+  endif
+  libaccel = cc.find_library('accel-config', required: false)
+  if not libaccel.found()
+    error('libaccel-config not found, please install it from ' +
+    'https://github.com/intel/idxd-config')
+  endif
+  qpl = declare_dependency(dependencies: [libqpl, libaccel,
+        cc.find_library('dl', required: get_option('qpl'))],
+        link_args: ['-lstdc++'])
+endif
 virgl = not_found
 
 have_vhost_user_gpu = have_tools and targetos == 'linux' and pixman.found()
@@ -2165,6 +2181,7 @@ config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim)
 config_host_data.set('CONFIG_STATX', has_statx)
 config_host_data.set('CONFIG_STATX_MNT_ID', has_statx_mnt_id)
 config_host_data.set('CONFIG_ZSTD', zstd.found())
+config_host_data.set('CONFIG_QPL', qpl.found())
 config_host_data.set('CONFIG_FUSE', fuse.found())
 config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
 config_host_data.set('CONFIG_SPICE_PROTOCOL', spice_protocol.found())
@@ -4325,6 +4342,7 @@ summary_info += {'snappy support':    snappy}
 summary_info += {'bzip2 support':     libbzip2}
 summary_info += {'lzfse support':     liblzfse}
 summary_info += {'zstd support':      zstd}
+summary_info += {'Query Processing Library support': qpl}
 summary_info += {'NUMA host support': numa}
 summary_info += {'capstone':          capstone}
 summary_info += {'libpmem support':   libpmem}
diff --git a/meson_options.txt b/meson_options.txt
index 3c7398f3c6..71cd533985 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -255,6 +255,8 @@ option('xkbcommon', type : 'feature', value : 'auto',
        description: 'xkbcommon support')
 option('zstd', type : 'feature', value : 'auto',
        description: 'zstd compression support')
+option('qpl', type : 'feature', value : 'auto',
+       description: 'Query Processing Library support')
 option('fuse', type: 'feature', value: 'auto',
        description: 'FUSE block device export')
 option('fuse_lseek', type : 'feature', value : 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 7ca4b77eae..0909d1d517 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -220,6 +220,7 @@ meson_options_help() {
   printf "%s\n" '                  Xen PCI passthrough support'
   printf "%s\n" '  xkbcommon       xkbcommon support'
   printf "%s\n" '  zstd            zstd compression support'
+  printf "%s\n" '  qpl             Query Processing Library support'
 }
 _meson_option_parse() {
   case $1 in
@@ -556,6 +557,8 @@ _meson_option_parse() {
     --disable-xkbcommon) printf "%s" -Dxkbcommon=disabled ;;
     --enable-zstd) printf "%s" -Dzstd=enabled ;;
     --disable-zstd) printf "%s" -Dzstd=disabled ;;
+    --enable-qpl) printf "%s" -Dqpl=enabled ;;
+    --disable-qpl) printf "%s" -Dqpl=disabled ;;
     *) return 1 ;;
   esac
 }
-- 
2.39.3




* [PATCH v3 4/4] multifd: Introduce QPL compression accelerator
  2024-01-03 11:28 [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Yuan Liu
                   ` (2 preceding siblings ...)
  2024-01-03 11:28 ` [PATCH v3 3/4] configure: add qpl option Yuan Liu
@ 2024-01-03 11:28 ` Yuan Liu
  2024-01-29 10:42 ` [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Peter Xu
  4 siblings, 0 replies; 9+ messages in thread
From: Yuan Liu @ 2024-01-03 11:28 UTC (permalink / raw)
  To: quintela, peterx, farosas, leobras; +Cc: qemu-devel, yuan1.liu, nanhai.zou

Intel Query Processing Library (QPL) is an open-source library
for data compression. It supports the deflate compression algorithm
and is compatible with Zlib and GZIP.

QPL supports both software compression and hardware compression.
Software compression uses optimized CPU instructions to accelerate
data compression and can be used widely on Intel CPUs. Hardware
compression utilizes the Intel In-Memory Analytics Accelerator (IAA)
hardware, which is available on Intel Xeon Sapphire Rapids processors.

During multifd live migration, the QPL accelerator can be specified to
accelerate the Zlib compression algorithm. QPL automatically chooses
software or hardware acceleration based on the platform.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 migration/meson.build   |   1 +
 migration/multifd-qpl.c | 323 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 324 insertions(+)
 create mode 100644 migration/multifd-qpl.c

diff --git a/migration/meson.build b/migration/meson.build
index 92b1cc4297..c155c2d781 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -40,6 +40,7 @@ if get_option('live_block_migration').allowed()
   system_ss.add(files('block.c'))
 endif
 system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
+system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
 
 specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
                 if_true: files('ram.c',
diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
new file mode 100644
index 0000000000..88ebe87c09
--- /dev/null
+++ b/migration/multifd-qpl.c
@@ -0,0 +1,323 @@
+/*
+ * Multifd qpl compression accelerator implementation
+ *
+ * Copyright (c) 2023 Intel Corporation
+ *
+ * Authors:
+ *  Yuan Liu<yuan1.liu@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/rcu.h"
+#include "exec/ramblock.h"
+#include "exec/target_page.h"
+#include "qapi/error.h"
+#include "migration.h"
+#include "trace.h"
+#include "options.h"
+#include "multifd.h"
+#include "qpl/qpl.h"
+
+#define MAX_BUF_SIZE (MULTIFD_PACKET_SIZE * 2)
+static bool support_compression_methods[MULTIFD_COMPRESSION__MAX];
+
+struct qpl_data {
+    qpl_job *job;
+    /* compressed data buffer */
+    uint8_t *buf;
+    /* decompressed data buffer */
+    uint8_t *zbuf;
+};
+
+static int init_qpl(struct qpl_data *qpl, uint8_t channel_id,  Error **errp)
+{
+    qpl_status status;
+    qpl_path_t path = qpl_path_auto;
+    uint32_t job_size = 0;
+
+    status = qpl_get_job_size(path, &job_size);
+    if (status != QPL_STS_OK) {
+        error_setg(errp, "multifd: %u: failed to get QPL size, error %d",
+                   channel_id, status);
+        return -1;
+    }
+
+    qpl->job = g_try_malloc0(job_size);
+    if (!qpl->job) {
+        error_setg(errp, "multifd: %u: failed to allocate QPL job", channel_id);
+        return -1;
+    }
+
+    status = qpl_init_job(path, qpl->job);
+    if (status != QPL_STS_OK) {
+        error_setg(errp, "multifd: %u: failed to init QPL hardware, error %d",
+                   channel_id, status);
+        return -1;
+    }
+    return 0;
+}
+
+static void deinit_qpl(struct qpl_data *qpl)
+{
+    if (qpl->job) {
+        qpl_fini_job(qpl->job);
+        g_free(qpl->job);
+    }
+}
+
+/**
+ * qpl_send_setup: setup send side
+ *
+ * Setup each channel with QPL compression.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_send_setup(MultiFDSendParams *p, Error **errp)
+{
+    struct qpl_data *qpl = g_new0(struct qpl_data, 1);
+    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
+    const char *err_msg;
+
+    if (init_qpl(qpl, p->id, errp) != 0) {
+        err_msg = "failed to initialize QPL\n";
+        goto err_qpl_init;
+    }
+    qpl->zbuf = mmap(NULL, MAX_BUF_SIZE, PROT_READ | PROT_WRITE, flags, -1, 0);
+    if (qpl->zbuf == MAP_FAILED) {
+        err_msg = "failed to allocate QPL zbuf\n";
+        goto err_zbuf_mmap;
+    }
+    p->data = qpl;
+    return 0;
+
+err_zbuf_mmap:
+    deinit_qpl(qpl);
+err_qpl_init:
+    g_free(qpl);
+    error_setg(errp, "multifd %u: %s", p->id, err_msg);
+    return -1;
+}
+
+/**
+ * qpl_send_cleanup: cleanup send side
+ *
+ * Close the channel and return memory.
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static void qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
+{
+    struct qpl_data *qpl = p->data;
+
+    deinit_qpl(qpl);
+    if (qpl->zbuf) {
+        munmap(qpl->zbuf, MAX_BUF_SIZE);
+        qpl->zbuf = NULL;
+    }
+    g_free(p->data);
+    p->data = NULL;
+}
+
+/**
+ * qpl_send_prepare: prepare data to be able to send
+ *
+ * Create a compressed buffer with all the pages that we are going to
+ * send.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+    struct qpl_data *qpl = p->data;
+    qpl_job *job = qpl->job;
+    qpl_status status;
+
+    job->op = qpl_op_compress;
+    job->next_out_ptr = qpl->zbuf;
+    job->available_out = MAX_BUF_SIZE;
+    job->flags = QPL_FLAG_FIRST | QPL_FLAG_OMIT_VERIFY | QPL_FLAG_ZLIB_MODE;
+    /* QPL supports compression level 1 */
+    job->level = 1;
+    for (int i = 0; i < p->normal_num; i++) {
+        if (i == p->normal_num - 1) {
+            job->flags |= (QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY);
+        }
+        job->next_in_ptr = p->pages->block->host + p->normal[i];
+        job->available_in = p->page_size;
+        status = qpl_execute_job(job);
+        if (status != QPL_STS_OK) {
+            error_setg(errp, "multifd %u: execute job error %d ",
+                       p->id, status);
+            return -1;
+        }
+        job->flags &= ~QPL_FLAG_FIRST;
+    }
+    p->iov[p->iovs_num].iov_base = qpl->zbuf;
+    p->iov[p->iovs_num].iov_len = job->total_out;
+    p->iovs_num++;
+    p->next_packet_size += job->total_out;
+    p->flags |= MULTIFD_FLAG_ZLIB;
+    return 0;
+}
+
+/**
+ * qpl_recv_setup: setup receive side
+ *
+ * Create the compressed channel and buffer.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_recv_setup(MultiFDRecvParams *p, Error **errp)
+{
+    struct qpl_data *qpl = g_new0(struct qpl_data, 1);
+    int flags = MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS;
+    const char *err_msg;
+
+    if (init_qpl(qpl, p->id, errp) != 0) {
+        err_msg = "failed to initialize QPL\n";
+        goto err_qpl_init;
+    }
+    qpl->zbuf = mmap(NULL, MAX_BUF_SIZE, PROT_READ | PROT_WRITE, flags, -1, 0);
+    if (qpl->zbuf == MAP_FAILED) {
+        err_msg = "failed to allocate QPL zbuf\n";
+        goto err_zbuf_mmap;
+    }
+    qpl->buf = mmap(NULL, MAX_BUF_SIZE, PROT_READ | PROT_WRITE, flags, -1, 0);
+    if (qpl->buf == MAP_FAILED) {
+        err_msg = "failed to allocate QPL buf\n";
+        goto err_buf_mmap;
+    }
+    p->data = qpl;
+    return 0;
+
+err_buf_mmap:
+    munmap(qpl->zbuf, MAX_BUF_SIZE);
+    qpl->zbuf = NULL;
+err_zbuf_mmap:
+    deinit_qpl(qpl);
+err_qpl_init:
+    g_free(qpl);
+    error_setg(errp, "multifd %u: %s", p->id, err_msg);
+    return -1;
+}
+
+/**
+ * qpl_recv_cleanup: cleanup receive side
+ *
+ * Release the QPL job and the receive buffers.
+ *
+ * @p: Params for the channel that we are using
+ */
+static void qpl_recv_cleanup(MultiFDRecvParams *p)
+{
+    struct qpl_data *qpl = p->data;
+
+    deinit_qpl(qpl);
+    if (qpl->zbuf) {
+        munmap(qpl->zbuf, MAX_BUF_SIZE);
+        qpl->zbuf = NULL;
+    }
+    if (qpl->buf) {
+        munmap(qpl->buf, MAX_BUF_SIZE);
+        qpl->buf = NULL;
+    }
+    g_free(p->data);
+    p->data = NULL;
+}
+
+/**
+ * qpl_recv_pages: read the data from the channel into actual pages
+ *
+ * Read the compressed buffer, and uncompress it into the actual
+ * pages.
+ *
+ * Returns 0 for success or -1 for error
+ *
+ * @p: Params for the channel that we are using
+ * @errp: pointer to an error
+ */
+static int qpl_recv_pages(MultiFDRecvParams *p, Error **errp)
+{
+    struct qpl_data *qpl = p->data;
+    uint32_t in_size = p->next_packet_size;
+    uint32_t expected_size = p->normal_num * p->page_size;
+    uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+    qpl_job *job = qpl->job;
+    qpl_status status;
+    int ret;
+
+    if (flags != MULTIFD_FLAG_ZLIB) {
+        error_setg(errp, "multifd %u: flags received %x flags expected %x",
+                   p->id, flags, MULTIFD_FLAG_ZLIB);
+        return -1;
+    }
+    ret = qio_channel_read_all(p->c, (void *)qpl->zbuf, in_size, errp);
+    if (ret != 0) {
+        return ret;
+    }
+
+    job->op = qpl_op_decompress;
+    job->next_in_ptr = qpl->zbuf;
+    job->available_in = in_size;
+    job->next_out_ptr = qpl->buf;
+    job->available_out = expected_size;
+    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY |
+                 QPL_FLAG_ZLIB_MODE;
+    status = qpl_execute_job(job);
+    if ((status != QPL_STS_OK) || (job->total_out != expected_size)) {
+        error_setg(errp, "multifd %u: execute job error %d, expect %u, out %u",
+                   p->id, status, job->total_out, expected_size);
+        return -1;
+    }
+    for (int i = 0; i < p->normal_num; i++) {
+        memcpy(p->host + p->normal[i], qpl->buf + (i * p->page_size),
+               p->page_size);
+    }
+    return 0;
+}
+
+static MultiFDMethods multifd_qpl_ops = {
+    .send_setup = qpl_send_setup,
+    .send_cleanup = qpl_send_cleanup,
+    .send_prepare = qpl_send_prepare,
+    .recv_setup = qpl_recv_setup,
+    .recv_cleanup = qpl_recv_cleanup,
+    .recv_pages = qpl_recv_pages
+};
+
+static bool is_supported(MultiFDCompression compression)
+{
+    return support_compression_methods[compression];
+}
+
+static MultiFDMethods *get_qpl_multifd_methods(void)
+{
+    return &multifd_qpl_ops;
+}
+
+static MultiFDAccelMethods multifd_qpl_accel_ops = {
+    .is_supported = is_supported,
+    .get_multifd_methods = get_qpl_multifd_methods,
+};
+
+static void multifd_qpl_register(void)
+{
+    multifd_register_accel_ops(MULTIFD_COMPRESSION_ACCEL_QPL,
+                               &multifd_qpl_accel_ops);
+    support_compression_methods[MULTIFD_COMPRESSION_ZLIB] = true;
+}
+
+migration_init(multifd_qpl_register);
-- 
2.39.3




* Re: [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
  2024-01-03 11:28 [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Yuan Liu
                   ` (3 preceding siblings ...)
  2024-01-03 11:28 ` [PATCH v3 4/4] multifd: Introduce QPL compression accelerator Yuan Liu
@ 2024-01-29 10:42 ` Peter Xu
  2024-01-30  3:56   ` Liu, Yuan1
  4 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2024-01-29 10:42 UTC (permalink / raw)
  To: Yuan Liu; +Cc: farosas, leobras, qemu-devel, nanhai.zou

On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> Hi,

Hi, Yuan,

I have a few comments and questions.  Many of them can be pure questions as
I don't know enough about these new technologies.

> 
> I am writing to submit a code change aimed at enhancing live migration
> acceleration by leveraging the compression capability of the Intel
> In-Memory Analytics Accelerator (IAA).
> 
> The implementation of the IAA (de)compression code is based on Intel Query
> Processing Library (QPL), an open-source software project designed for
> IAA high-level software programming. https://github.com/intel/qpl
> 
> In the last version, there was some discussion about whether to
> introduce a new compression algorithm for IAA. Because the compression
> algorithm of IAA hardware is based on deflate, and QPL already supports
> Zlib, so in this version, I implemented IAA as an accelerator for the
> Zlib compression method. However, due to some reasons, QPL is currently
> not compatible with the existing Zlib method that Zlib compressed data
> can be decompressed by QPl and vice versa.
> 
> I have some concerns about the existing Zlib compression
>   1. Will you consider supporting one channel to support multi-stream
>      compression? Of course, this may lead to a reduction in compression
>      ratio, but it will allow the hardware to process each stream 
>      concurrently. We can have each stream process multiple pages,
>      reducing the loss of compression ratio. For example, 128 pages are
>      divided into 16 streams for independent compression. I will provide
>      the a early performance data in the next version(v4).

I think Juan used to ask a similar question: how much can this help if
multifd can already achieve some form of concurrency over the pages?
Couldn't the user specify more multifd channels if they want to grant more
CPU resources for comp/decomp purposes?

IOW, how many concurrent channels can QPL provide?  What is the suggested
channel concurrency there?

> 
>   2. Will you consider using QPL/IAA as an independent compression
>      algorithm instead of an accelerator? In this way, we can better
>      utilize hardware performance and some features, such as IAA's
>      canned mode, which can be dynamically generated by some statistics
>      of data. A huffman table to improve the compression ratio.

Maybe one more knob will work?  If it's not compatible with the deflate
algo maybe it should never be the default.  IOW, the accelerators may be
extended into this (based on what you already proposed):

  - auto ("qpl" first, "none" second; never "qpl-optimized")
  - none (old zlib)
  - qpl (qpl compatible)
  - qpl-optimized (qpl uncompatible)

Then "auto"/"none"/"qpl" will always be compatible, only the last doesn't,
user can select it explicit, but only on both sides of QEMU.

> 
> Test condition:
>   1. Host CPUs are based on Sapphire Rapids, and frequency locked to 3.4G
>   2. VM type, 16 vCPU and 64G memory
>   3. The Idle workload means no workload is running in the VM 
>   4. The Redis workload means YCSB workloadb + Redis Server are running
>      in the VM, about 20G or more memory will be used.
>   5. Source side migartion configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter downtime-limit 300
>      d. migrate_set_parameter multifd-compression zlib
>      e. migrate_set_parameter multifd-compression-accel none/qpl
>      f. migrate_set_parameter max-bandwidth 100G
>   6. Desitination side migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter multifd-compression zlib
>      d. migrate_set_parameter multifd-compression-accel none/qpl
>      e. migrate_set_parameter max-bandwidth 100G

How is the zlib level set up?  Default (1)?

Btw, it seems neither the zlib nor the zstd level can actually be configured
right now.. probably overlooked in migrate_params_apply().

> 
> Early migration result, each result is the average of three tests
>  +--------+-------------+--------+--------+---------+----+-----+
>  |        | The number  |total   |downtime|network  |pages per |
>  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
>  |        | and mode    |        |        |(mbps)   |          |
>  |        +-------------+-----------------+---------+----------+
>  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
>  |        +-------------+--------+--------+---------+----------+
>  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
>  |workload+-------------+--------+--------+---------+----------+
>  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |

The numbers are slightly confusing to me.  If IAA can send 3x more pages
per second, shouldn't the total migration time be 1/3 of the other if
the guest is idle?  But the total times seem to be pretty close no matter
the number of channels.  Maybe I missed something?

>  +--------+-------------+--------+--------+---------+----------+
> 
>  +--------+-------------+--------+--------+---------+----+-----+
>  |        | The number  |total   |downtime|network  |pages per |
>  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
>  |        | and mode    |        |        |(mbps)   |          |
>  |        +-------------+-----------------+---------+----------+
>  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
>  |        +-------------+--------+--------+---------+----------+
>  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
>  |workload+-------------+--------+--------+---------+----------+
>  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
>  |        +-------------+--------+--------+---------+----------+
>  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
>  +--------+-------------+--------+--------+---------+----------+

The Redis results look much more favorable for IAA compared to the
idle tests.  Does it mean that IAA works less well with zero pages in
general (assuming they will be the majority in the idle test)?

From the manual, I see that IAA also supports encryption/decryption.  Would
it be able to accelerate TLS?

How should one weigh IAA against QAT?  What is the major difference?  I see
that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW is
something attached to the PCIe bus (I assume QAT is the same)?

Thanks,

-- 
Peter Xu




* RE: [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
  2024-01-29 10:42 ` [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Peter Xu
@ 2024-01-30  3:56   ` Liu, Yuan1
  2024-01-30 10:32     ` Peter Xu
  0 siblings, 1 reply; 9+ messages in thread
From: Liu, Yuan1 @ 2024-01-30  3:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: farosas@suse.de, leobras@redhat.com, qemu-devel@nongnu.org,
	Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Monday, January 29, 2024 6:43 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
> 
> On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > Hi,
> 
> Hi, Yuan,
> 
> I have a few comments and questions.  Many of them can be pure questions
> as I don't know enough on these new technologies.
> 
> >
> > I am writing to submit a code change aimed at enhancing live migration
> > acceleration by leveraging the compression capability of the Intel
> > In-Memory Analytics Accelerator (IAA).
> >
> > The implementation of the IAA (de)compression code is based on Intel
> > Query Processing Library (QPL), an open-source software project
> > designed for IAA high-level software programming.
> > https://github.com/intel/qpl
> >
> > In the last version, there was some discussion about whether to
> > introduce a new compression algorithm for IAA. Because the compression
> > algorithm of IAA hardware is based on deflate, and QPL already
> > supports Zlib, so in this version, I implemented IAA as an accelerator
> > for the Zlib compression method. However, due to some reasons, QPL is
> > currently not compatible with the existing Zlib method that Zlib
> > compressed data can be decompressed by QPl and vice versa.
> >
> > I have some concerns about the existing Zlib compression
> >   1. Will you consider supporting one channel to support multi-stream
> >      compression? Of course, this may lead to a reduction in compression
> >      ratio, but it will allow the hardware to process each stream
> >      concurrently. We can have each stream process multiple pages,
> >      reducing the loss of compression ratio. For example, 128 pages are
> >      divided into 16 streams for independent compression. I will provide
> >      the a early performance data in the next version(v4).
> 
> I think Juan used to ask similar question: how much this can help if
> multifd can already achieve some form of concurrency over the pages?


> Couldn't the user specify more multifd channels if they want to grant more
> cpu resource for comp/decomp purpose?
> 
> IOW, how many concurrent channels QPL can provide?  What is the suggested
> concurrency channels there?

On the QPL software side, there is no limit on the number of concurrent compression and decompression tasks.
On the IAA hardware side, one IAA physical device can process two compression tasks or eight decompression tasks concurrently. There are up to 8 IAA devices on an Intel SPR server; the exact number varies with the customer's product selection and deployment.

Regarding the required number of concurrent channels, I do not think this will be a bottleneck.
Please allow me to explain a little more here.

1. If the compression design is based on Zlib/Deflate/Gzip streaming mode, then we indeed need more channels to maintain concurrency, because each multifd packet (containing 128 independent pages) has to be compressed page by page; the 128 pages are not processed concurrently. The concurrency comes only from handling multifd packets on multiple channels.

2. Through testing, we prefer concurrent processing of 4K pages rather than of multifd packets, which means the 128 pages belonging to one packet can be compressed/decompressed concurrently. Even a single channel can then utilize all the IAA resources. But this is not compatible with the existing zlib method.
The code is similar to the following
  for (int i = 0; i < num_pages; i++) {
      job[i]->input_data = pages[i];
      submit_job(job[i]);  /* non-blocking submit of the compression/decompression task */
  }
  for (int i = 0; i < num_pages; i++) {
      wait_job(job[i]);    /* busy polling; in the future this and the data sending will be pipelined */
  }

3. Currently, the patches we provide to the community are based on streaming compression, in order to be compatible with the current zlib method. However, we found that there are still many problems with this, so in the next version we plan to provide a new change with the independent QPL/IAA acceleration function described above.
Compatibility issues include the following:
    1. QPL currently does not support the z_sync_flush operation.
    2. The IAA comp/decomp window is fixed at 4K, while the default zlib window size is 32K, and the window size should be the same on both the comp and decomp sides.
    3. I also researched the QAT compression scheme: QATzip currently supports neither zlib nor z_sync_flush, and its window size is 32K.

In general, I think it is a good suggestion to make the accelerator compatible with standard compression algorithms, but also to let the accelerator run independently, thus avoiding some of the accelerator's compatibility and performance problems. For example, we can add an "accel" option to the compression method, and then the user must specify the same accelerator via the compression accelerator parameter on both the source and remote ends (just like specifying the same compression algorithm).

> >
> >   2. Will you consider using QPL/IAA as an independent compression
> >      algorithm instead of an accelerator? In this way, we can better
> >      utilize hardware performance and some features, such as IAA's
> >      canned mode, which can be dynamically generated by some statistics
> >      of data. A huffman table to improve the compression ratio.
> 
> Maybe one more knob will work?  If it's not compatible with the deflate
> algo maybe it should never be the default.  IOW, the accelerators may be
> extended into this (based on what you already proposed):
> 
>   - auto ("qpl" first, "none" second; never "qpl-optimized")
>   - none (old zlib)
>   - qpl (qpl compatible)
>   - qpl-optimized (qpl uncompatible)
> 
> Then "auto"/"none"/"qpl" will always be compatible, only the last doesn't,
> user can select it explicit, but only on both sides of QEMU.
Yes, this is what I want; I need a way to express that QPL is not compatible with zlib. From my current point of view, if zlib chooses raw deflate mode, then QAT will be compatible with the current community zlib solution.
So my suggestion is as follows (a hypothetical example follows the two lists):

Compression method parameter
 - none
 - zlib
 - zstd
 - accel (both QEMU sides need to explicitly select the same accelerator via the compression accelerator parameter).

Compression accelerator parameter
 - auto
 - none
 - qpl (qpl will not support zlib/zstd; it will report an error when zlib/zstd is selected)
 - qat (it can provide acceleration of zlib/zstd)
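
Under this proposal, a hypothetical configuration (same commands on both
sides) could look like:

    migrate_set_parameter multifd-compression accel
    migrate_set_parameter multifd-compression-accel qpl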

> > Test condition:
> >   1. Host CPUs are based on Sapphire Rapids, and frequency locked to
> 3.4G
> >   2. VM type, 16 vCPU and 64G memory
> >   3. The Idle workload means no workload is running in the VM
> >   4. The Redis workload means YCSB workloadb + Redis Server are running
> >      in the VM, about 20G or more memory will be used.
> >   5. Source side migartion configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter downtime-limit 300
> >      d. migrate_set_parameter multifd-compression zlib
> >      e. migrate_set_parameter multifd-compression-accel none/qpl
> >      f. migrate_set_parameter max-bandwidth 100G
> >   6. Desitination side migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter multifd-compression zlib
> >      d. migrate_set_parameter multifd-compression-accel none/qpl
> >      e. migrate_set_parameter max-bandwidth 100G
> 
> How is zlib-level setup?  Default (1)?
Yes, level 1, the default level, is used.

> Btw, it seems both zlib/zstd levels are not even working right now to be
> configured.. probably overlooked in migrate_params_apply().
Ok, I will check this.

> > Early migration result, each result is the average of three tests
> > +--------+-------------+--------+--------+---------+----+-----+
> >  |        | The number  |total   |downtime|network  |pages per |
> >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> >  |        | and mode    |        |        |(mbps)   |          |
> >  |        +-------------+-----------------+---------+----------+
> >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> >  |        +-------------+--------+--------+---------+----------+
> >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> >  |workload+-------------+--------+--------+---------+----------+
> >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> 
> The number is slightly confusing to me.  If IAA can send 3x times more
> pages per-second, shouldn't the total migration time 1/3 of the other if
> the guest is idle?  But the total times seem to be pretty close no matter
> N of channels. Maybe I missed something?

This data is the information read from "info migrate" after the live migration status changes to "complete".
I think it reflects the maximum throughput when the expected downtime and the available network bandwidth are met.
When the vCPUs are idle, live migration does not run at maximum throughput for very long.

> >  +--------+-------------+--------+--------+---------+----------+
> >
> >  +--------+-------------+--------+--------+---------+----+-----+
> >  |        | The number  |total   |downtime|network  |pages per |
> >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> >  |        | and mode    |        |        |(mbps)   |          |
> >  |        +-------------+-----------------+---------+----------+
> >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> >  |        +-------------+--------+--------+---------+----------+
> >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> >  |workload+-------------+--------+--------+---------+----------+
> >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> >  |        +-------------+--------+--------+---------+----------+
> >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> >  +--------+-------------+--------+--------+---------+----------+
> 
> The redis results look much more preferred on using IAA comparing to the
> idle tests.  Does it mean that IAA works less good with zero pages in
> general (assuming that'll be the majority in idle test)?
Neither the Idle nor the Redis data shows the best possible performance for IAA, since both are based on multifd packet streaming compression.
In the idle case most pages are indeed zero pages, and compressing zero pages is not as good as simply detecting them, so the compression advantage is not reflected.
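
As a rough illustration of that point (a sketch, not the actual multifd code; QEMU's existing buffer_is_zero() would be the natural helper, and the other names are placeholders), the sender can test each page first and hand only non-zero pages to the compressor:

  for (int i = 0; i < num_pages; i++) {
      if (buffer_is_zero(pages[i], page_size)) {
          set_bit(i, zero_page_bitmap);    /* just record the page index */
      } else {
          compress_page(pages[i]);         /* only non-zero pages reach IAA */
      }
  }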

> From the manual, I see that IAA also supports encryption/decryption.
> Would it be able to accelerate TLS?
On Sapphire Rapids (SPR)/Emerald Rapids (EMR) Xeon servers, IAA does not support encryption/decryption. This feature may be available in future generations.
For TLS acceleration, QAT supports this function on SPR/EMR and has had successful cases in some scenarios.
https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html

> How should one consider IAA over QAT?  What is the major difference?  I
> see that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW
> is something attached to the pcie bus (assume QAT the same)?

Regarding the difference between using IAA or QAT for compression:
1. IAA is more suitable for 4K compression, while QAT is suitable for large-block data compression. This is determined by the deflate window size. In addition, QAT can support more compression levels, whereas the IAA hardware supports a single compression level.
2. From a throughput perspective, one IAA device supports a compression throughput of 4 GB/s and a decompression throughput of 30 GB/s; one QAT device supports a compression or decompression throughput of 20 GB/s.
3. Depending on the product type selected by the customer and on the deployment, the resources available for live migration will also differ.

Regarding the IOMMU scalable mode:
1. The current IAA software stack requires Shared Virtual Memory (SVM) technology, and SVM depends on IOMMU scalable mode.
2. Both IAA and QAT support the PCIe PASID capability, so IAA can support shared work queues.
https://docs.kernel.org/next/x86/sva.html

> Thanks,
> 
> --
> Peter Xu


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
  2024-01-30  3:56   ` Liu, Yuan1
@ 2024-01-30 10:32     ` Peter Xu
  2024-01-31  2:08       ` Liu, Yuan1
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Xu @ 2024-01-30 10:32 UTC (permalink / raw)
  To: Liu, Yuan1
  Cc: farosas@suse.de, leobras@redhat.com, qemu-devel@nongnu.org,
	Zou, Nanhai

On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Monday, January 29, 2024 6:43 PM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > Compression
> > 
> > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > Hi,
> > 
> > Hi, Yuan,
> > 
> > I have a few comments and questions.  Many of them can be pure questions
> > as I don't know enough on these new technologies.
> > 
> > >
> > > I am writing to submit a code change aimed at enhancing live migration
> > > acceleration by leveraging the compression capability of the Intel
> > > In-Memory Analytics Accelerator (IAA).
> > >
> > > The implementation of the IAA (de)compression code is based on Intel
> > > Query Processing Library (QPL), an open-source software project
> > > designed for IAA high-level software programming.
> > > https://github.com/intel/qpl
> > >
> > > In the last version, there was some discussion about whether to
> > > introduce a new compression algorithm for IAA. Because the compression
> > > algorithm of IAA hardware is based on deflate, and QPL already
> > > supports Zlib, so in this version, I implemented IAA as an accelerator
> > > for the Zlib compression method. However, due to some reasons, QPL is
> > > currently not compatible with the existing Zlib method that Zlib
> > > compressed data can be decompressed by QPl and vice versa.
> > >
> > > I have some concerns about the existing Zlib compression
> > >   1. Will you consider supporting one channel to support multi-stream
> > >      compression? Of course, this may lead to a reduction in compression
> > >      ratio, but it will allow the hardware to process each stream
> > >      concurrently. We can have each stream process multiple pages,
> > >      reducing the loss of compression ratio. For example, 128 pages are
> > >      divided into 16 streams for independent compression. I will provide
> > >      the a early performance data in the next version(v4).
> > 
> > I think Juan used to ask similar question: how much this can help if
> > multifd can already achieve some form of concurrency over the pages?
> 
> 
> > Couldn't the user specify more multifd channels if they want to grant more
> > cpu resource for comp/decomp purpose?
> > 
> > IOW, how many concurrent channels QPL can provide?  What is the suggested
> > concurrency channels there?
> 
> From the QPL software, there is no limit on the number of concurrent compression and decompression tasks.
> From the IAA hardware, one IAA physical device can process two compressions concurrently or eight decompression tasks concurrently. There are up to 8 IAA devices on an Intel SPR Server and it will vary according to the customer’s product selection and deployment.
> 
> Regarding the requirement for the number of concurrent channels, I think this may not be a bottleneck problem.
> Please allow me to introduce a little more here
> 
> 1. If the compression design is based on Zlib/Deflate/Gzip streaming mode, then we indeed need more channels to maintain concurrent processing. Because each time a multifd packet is compressed (including 128 independent pages), it needs to be compressed page by page. These 128 pages are not concurrent. The concurrency is reflected in the logic of multiple channels for the multifd packet.

Right.  However, since you said there are at most 8 IAA devices, would it
also mean n_multifd_threads=8 is a good enough scenario to achieve proper
concurrency, no matter the size of the data chunk for one compression
request?

Maybe you meant each device can still process concurrent compression
requests, so the real capability of concurrency can be much larger than 8?

> 
> 2. Through testing, we prefer concurrent processing on 4K pages, not multifd packet, which means that 128 pages belonging to a packet can be compressed/decompressed concurrently. Even one channel can also utilize all the resources of IAA. But this is not compatible with existing zlib.
> The code is similar to the following
>   for(int i = 0; i < num_pages; i++) {
>     job[i]->input_data = pages[i]
>     submit_job(job[i] //Non-block submit for compression/decompression tasks
>   }
>   for(int i = 0; i < num_pages; i++) {
>     wait_job(job[i])  //busy polling. In the future, we will make this part and data sending into pipeline mode.
>   } 

Right, if more concurrency is wanted, you can use this async model; I think
Juan used to suggest such and I agree it will also work.  It can be done on
top of the basic functionality merged.
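
For reference, a minimal sketch of that pipelined variant (all helper names
here are hypothetical, standing in for the QPL async submission API and the
multifd send path):

  /* Submit everything first, then overlap completion polling with sending. */
  for (int i = 0; i < num_pages; i++) {
      job[i]->input_data = pages[i];
      submit_job(job[i]);                 /* non-blocking submit to IAA */
  }
  for (int i = 0; i < num_pages; i++) {
      while (!job_done(job[i])) {
          /* busy poll; a real implementation could yield or send here */
      }
      send_compressed(job[i]);            /* send as soon as this page is ready */
  }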

> 
> 3. Currently, the patches we provide to the community are based on streaming compression. This is to be compatible with the current zlib method. However, we found that there are still many problems with this, so we plan to provide a new change in the next version that the independent QPL/IAA acceleration function as said above.
> Compatibility issues include the following
>     1. QPL currently does not support the z_sync_flush operation
>     2. IAA comp/decomp window is fixed 4K. By default, the zlib window size is 32K. And window size should be the same for Both comp/decomp sides. 
>     3. At the same time, I researched the QAT compression scheme. QATzip currently does not support zlib, nor does it support z_sync_flush. The window size is 32K
> 
> In general, I think it is a good suggestion to make the accelerator compatible with standard compression algorithms, but also let the accelerator run independently, thus avoiding some compatibility and performance problems of the accelerator. For example, we can add the "accel" option to the compression method, and then the user must specify the same accelerator by compression accelerator parameter on the source and remote ends (just like specifying the same compression algorithm)
> 
> > >
> > >   2. Will you consider using QPL/IAA as an independent compression
> > >      algorithm instead of an accelerator? In this way, we can better
> > >      utilize hardware performance and some features, such as IAA's
> > >      canned mode, which can be dynamically generated by some statistics
> > >      of data. A huffman table to improve the compression ratio.
> > 
> > Maybe one more knob will work?  If it's not compatible with the deflate
> > algo maybe it should never be the default.  IOW, the accelerators may be
> > extended into this (based on what you already proposed):
> > 
> >   - auto ("qpl" first, "none" second; never "qpl-optimized")
> >   - none (old zlib)
> >   - qpl (qpl compatible)
> >   - qpl-optimized (qpl uncompatible)
> > 
> > Then "auto"/"none"/"qpl" will always be compatible, only the last doesn't,
> > user can select it explicit, but only on both sides of QEMU.
> Yes, this is what I want, I need a way that QPL is not compatible with zlib. From my current point of view, if zlib chooses raw defalte mode, then QAT will be compatible with the current community's zlib solution.
> So my suggestion is as follows
> 
> Compression method parameter
>  - none
>  - zlib
>  - zstd
>  - accel (Both Qemu sides need to select the same accelerator from "Compression accelerator parameter" explicitly).

Can we avoid naming it as "accel"?  It's too generic, IMHO.

If it's a special algorithm that only applies to QPL, can we just call it
"qpl" here?  Then...

> 
> Compression accelerator parameter
>  - auto
>  - none
>  - qpl (qpl will not support zlib/zstd, it will inform an error when zlib/zstd is selected)
>  - qat (it can provide acceleration of zlib/zstd)

Here IMHO we don't need qpl then, because the "qpl" compression method can
enforce a hardware accelerator.  In summary, not sure whether this works:

Compression methods: none, zlib, zstd, qpl (describes all the algorithms
that might be used; again, qpl enforces HW support).

Compression accelerators: auto, none, qat (only applies when zlib/zstd
chosen above)
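
For illustration (hypothetical values), a QPL user would then only set:

  migrate_set_parameter multifd-compression qpl

while a QAT user would keep a software algorithm and add the accelerator:

  migrate_set_parameter multifd-compression zlib
  migrate_set_parameter multifd-compression-accel qat

with both sides of the migration choosing the same combination.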

> 
> > > Test condition:
> > >   1. Host CPUs are based on Sapphire Rapids, and frequency locked to
> > 3.4G
> > >   2. VM type, 16 vCPU and 64G memory
> > >   3. The Idle workload means no workload is running in the VM
> > >   4. The Redis workload means YCSB workloadb + Redis Server are running
> > >      in the VM, about 20G or more memory will be used.
> > >   5. Source side migartion configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter downtime-limit 300
> > >      d. migrate_set_parameter multifd-compression zlib
> > >      e. migrate_set_parameter multifd-compression-accel none/qpl
> > >      f. migrate_set_parameter max-bandwidth 100G
> > >   6. Desitination side migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter multifd-compression zlib
> > >      d. migrate_set_parameter multifd-compression-accel none/qpl
> > >      e. migrate_set_parameter max-bandwidth 100G
> > 
> > How is zlib-level setup?  Default (1)?
> Yes, use level 1 the default level.
> 
> > Btw, it seems both zlib/zstd levels are not even working right now to be
> > configured.. probably overlooked in migrate_params_apply().
> Ok, I will check this.

Thanks.  If you plan to post patch, please attach:

Reported-by: Xiaohui Li <xiaohli@redhat.com>

As that's reported by our QE team.

Maybe you can already add a unit test (migration-test.c, under tests/)
which should expose this issue, by setting z*-level to a non-1 value, then
querying it back and asserting that the value did change.
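
For illustration, the core of such a check could look roughly like this (a
sketch only; the helper names are assumed from the existing migration-test.c
style and may differ):

  /* Set a non-default level, then read it back via query-migrate-parameters. */
  migrate_set_parameter_int(from, "multifd-zlib-level", 2);
  migrate_check_parameter_int(from, "multifd-zlib-level", 2);
  migrate_set_parameter_int(from, "multifd-zstd-level", 5);
  migrate_check_parameter_int(from, "multifd-zstd-level", 5);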

> 
> > > Early migration result, each result is the average of three tests
> > > +--------+-------------+--------+--------+---------+----+-----+
> > >  |        | The number  |total   |downtime|network  |pages per |
> > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > >  |        | and mode    |        |        |(mbps)   |          |
> > >  |        +-------------+-----------------+---------+----------+
> > >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > >  |workload+-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > 
> > The number is slightly confusing to me.  If IAA can send 3x times more
> > pages per-second, shouldn't the total migration time 1/3 of the other if
> > the guest is idle?  But the total times seem to be pretty close no matter
> > N of channels. Maybe I missed something?
> 
> This data is the information read from "info migrate" after the live migration status changes to "complete".
> I think it is the max throughout when expected downtime and network available bandwidth are met.
> In vCPUs are idle, live migration does not run at maximum throughput for too long.
> 
> > >  +--------+-------------+--------+--------+---------+----------+
> > >
> > >  +--------+-------------+--------+--------+---------+----+-----+
> > >  |        | The number  |total   |downtime|network  |pages per |
> > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > >  |        | and mode    |        |        |(mbps)   |          |
> > >  |        +-------------+-----------------+---------+----------+
> > >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > >  |workload+-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > >  |        +-------------+--------+--------+---------+----------+
> > >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > >  +--------+-------------+--------+--------+---------+----------+
> > 
> > The redis results look much more preferred on using IAA comparing to the
> > idle tests.  Does it mean that IAA works less good with zero pages in
> > general (assuming that'll be the majority in idle test)?
> Both Idle and Redis data are not the best performance for IAA since it is based on multifd packet streaming compression.
> In the idle case, most pages are indeed zero page, zero page compression is not as good as only detecting zero pages, so the compression advantage is not reflected.
> 
> > From the manual, I see that IAA also supports encryption/decryption.
> > Would it be able to accelerate TLS?
> From Sapphire Rapids(SPR)/Emerald Rapids (EMR) Xeon servers, IAA can't support encryption/decryption. This feature may be available in future generations
> For TLS acceleration, QAT supports this function on SPR/EMR and has successful cases in some scenarios.
> https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html
> 
> > How should one consider IAA over QAT?  What is the major difference?  I
> > see that IAA requires IOMMU scalable mode, why?  Is it because the IAA HW
> > is something attached to the pcie bus (assume QAT the same)?
> 
> Regarding the difference between using IAA or QAT for compression
> 1. IAA is more suitable for 4K compression, and QAT is suitable for large block data compression. This is determined by the deflate windows size, and QAT can support more compression levels. IAA hardware supports 1 compression level.
> 2. From the perspective of throughput, one IAA device supports compression throughput is 4GBps and decompression is 30GBps. One QAT support compression or decompression throughput is 20GBps.
> 3. Depending on the product type selected by the customer and the deployment, the resources used for live migration will also be different.
> 
> Regarding the IOMMU scalable mode
> 1. The current IAA software stack requires Shared Virtual Memory (SVM) technology, and SVM depends on IOMMU scalable mode.
> 2. Both IAA and QAT support PCIe PASID capability, then IAA can support shared work queue.
> https://docs.kernel.org/next/x86/sva.html

Thanks for all this information.  I'm personally still curious why Intel
would like to provide two new technologies to serve similar purposes in
roughly the same time window.

Could you put many of these information into a doc file?  It can be
docs/devel/migration/QPL.rst.

Also, we may want a unit test to cover the new stuff when the whole design
settles.  It may cover all modes supported, but for sure we can skip the
hw-accelerated use case.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH v3 0/4] Live Migration Acceleration with IAA Compression
  2024-01-30 10:32     ` Peter Xu
@ 2024-01-31  2:08       ` Liu, Yuan1
  0 siblings, 0 replies; 9+ messages in thread
From: Liu, Yuan1 @ 2024-01-31  2:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: farosas@suse.de, leobras@redhat.com, qemu-devel@nongnu.org,
	Zou, Nanhai

> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Tuesday, January 30, 2024 6:32 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
> 
> On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Monday, January 29, 2024 6:43 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > > Compression
> > >
> > > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > > Hi,
> > >
> > > Hi, Yuan,
> > >
> > > I have a few comments and questions.  Many of them can be pure
> > > questions as I don't know enough on these new technologies.
> > >
> > > >
> > > > I am writing to submit a code change aimed at enhancing live
> > > > migration acceleration by leveraging the compression capability of
> > > > the Intel In-Memory Analytics Accelerator (IAA).
> > > >
> > > > The implementation of the IAA (de)compression code is based on
> > > > Intel Query Processing Library (QPL), an open-source software
> > > > project designed for IAA high-level software programming.
> > > > https://github.com/intel/qpl
> > > >
> > > > In the last version, there was some discussion about whether to
> > > > introduce a new compression algorithm for IAA. Because the
> > > > compression algorithm of IAA hardware is based on deflate, and QPL
> > > > already supports Zlib, so in this version, I implemented IAA as an
> > > > accelerator for the Zlib compression method. However, due to some
> > > > reasons, QPL is currently not compatible with the existing Zlib
> > > > method that Zlib compressed data can be decompressed by QPl and vice
> versa.
> > > >
> > > > I have some concerns about the existing Zlib compression
> > > >   1. Will you consider supporting one channel to support multi-
> stream
> > > >      compression? Of course, this may lead to a reduction in
> compression
> > > >      ratio, but it will allow the hardware to process each stream
> > > >      concurrently. We can have each stream process multiple pages,
> > > >      reducing the loss of compression ratio. For example, 128 pages
> are
> > > >      divided into 16 streams for independent compression. I will
> provide
> > > >      the a early performance data in the next version(v4).
> > >
> > > I think Juan used to ask similar question: how much this can help if
> > > multifd can already achieve some form of concurrency over the pages?
> >
> >
> > > Couldn't the user specify more multifd channels if they want to
> > > grant more cpu resource for comp/decomp purpose?
> > >
> > > IOW, how many concurrent channels QPL can provide?  What is the
> > > suggested concurrency channels there?
> >
> > From the QPL software, there is no limit on the number of concurrent
> compression and decompression tasks.
> > From the IAA hardware, one IAA physical device can process two
> compressions concurrently or eight decompression tasks concurrently. There
> are up to 8 IAA devices on an Intel SPR Server and it will vary according
> to the customer’s product selection and deployment.
> >
> > Regarding the requirement for the number of concurrent channels, I think
> this may not be a bottleneck problem.
> > Please allow me to introduce a little more here
> >
> > 1. If the compression design is based on Zlib/Deflate/Gzip streaming
> mode, then we indeed need more channels to maintain concurrent processing.
> Because each time a multifd packet is compressed (including 128
> independent pages), it needs to be compressed page by page. These 128
> pages are not concurrent. The concurrency is reflected in the logic of
> multiple channels for the multifd packet.
> 
> Right.  However since you said there're only a max of 8 IAA devices, would
> it also mean n_multifd_threads=8 can be a good enough scenario to achieve
> proper concurrency, no matter the size of data chunk for one compression
> request?
> 
> Maybe you meant each device can still process concurrent compression
> requests, so the real capability of concurrency can be much larger than 8?

Yes, the number of concurrent requests can be greater than 8; one device can
handle 2 compression requests or 8 decompression requests concurrently.
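(Assuming all 8 IAA devices on an SPR server are usable, that is up to
8 x 2 = 16 compression jobs and 8 x 8 = 64 decompression jobs in flight at
the same time.)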

> >
> > 2. Through testing, we prefer concurrent processing on 4K pages, not
> multifd packet, which means that 128 pages belonging to a packet can be
> compressed/decompressed concurrently. Even one channel can also utilize
> all the resources of IAA. But this is not compatible with existing zlib.
> > The code is similar to the following
> >   for(int i = 0; i < num_pages; i++) {
> >     job[i]->input_data = pages[i]
> >     submit_job(job[i] //Non-block submit for compression/decompression
> tasks
> >   }
> >   for(int i = 0; i < num_pages; i++) {
> >     wait_job(job[i])  //busy polling. In the future, we will make this
> part and data sending into pipeline mode.
> >   }
> 
> Right, if more concurrency is wanted, you can use this async model; I
> think Juan used to suggest such and I agree it will also work.  It can be
> done on top of the basic functionality merged.

Sure, I think we can show better performance based on it.

> > 3. Currently, the patches we provide to the community are based on
> streaming compression. This is to be compatible with the current zlib
> method. However, we found that there are still many problems with this, so
> we plan to provide a new change in the next version that the independent
> QPL/IAA acceleration function as said above.
> > Compatibility issues include the following
> >     1. QPL currently does not support the z_sync_flush operation
> >     2. IAA comp/decomp window is fixed 4K. By default, the zlib window
> size is 32K. And window size should be the same for Both comp/decomp
> sides.
> >     3. At the same time, I researched the QAT compression scheme.
> > QATzip currently does not support zlib, nor does it support
> > z_sync_flush. The window size is 32K
> >
> > In general, I think it is a good suggestion to make the accelerator
> > compatible with standard compression algorithms, but also let the
> > accelerator run independently, thus avoiding some compatibility and
> > performance problems of the accelerator. For example, we can add the
> > "accel" option to the compression method, and then the user must
> > specify the same accelerator by compression accelerator parameter on
> > the source and remote ends (just like specifying the same compression
> > algorithm)
> >
> > > >
> > > >   2. Will you consider using QPL/IAA as an independent compression
> > > >      algorithm instead of an accelerator? In this way, we can better
> > > >      utilize hardware performance and some features, such as IAA's
> > > >      canned mode, which can be dynamically generated by some
> statistics
> > > >      of data. A huffman table to improve the compression ratio.
> > >
> > > Maybe one more knob will work?  If it's not compatible with the
> > > deflate algo maybe it should never be the default.  IOW, the
> > > accelerators may be extended into this (based on what you already
> proposed):
> > >
> > >   - auto ("qpl" first, "none" second; never "qpl-optimized")
> > >   - none (old zlib)
> > >   - qpl (qpl compatible)
> > >   - qpl-optimized (qpl uncompatible)
> > >
> > > Then "auto"/"none"/"qpl" will always be compatible, only the last
> > > doesn't, user can select it explicit, but only on both sides of QEMU.
> > Yes, this is what I want, I need a way that QPL is not compatible with
> zlib. From my current point of view, if zlib chooses raw defalte mode,
> then QAT will be compatible with the current community's zlib solution.
> > So my suggestion is as follows
> >
> > Compression method parameter
> >  - none
> >  - zlib
> >  - zstd
> >  - accel (Both Qemu sides need to select the same accelerator from
> "Compression accelerator parameter" explicitly).
> 
> Can we avoid naming it as "accel"?  It's too generic, IMHO.
> 
> If it's a special algorithm that only applies to QPL, can we just call it
> "qpl" here?  Then...

Yes, I agree.

> > Compression accelerator parameter
> >  - auto
> >  - none
> >  - qpl (qpl will not support zlib/zstd, it will inform an error when
> > zlib/zstd is selected)
> >  - qat (it can provide acceleration of zlib/zstd)
> 
> Here IMHO we don't need qpl then, because the "qpl" compression method can
> enforce an hardware accelerator.  In summary, not sure whether this works;
> 
> Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> that might be used; again, qpl enforces HW support).
> 
> Compression accelerators: auto, none, qat (only applies when zlib/zstd
> chosen above)

I agree. QPL will dynamically detect IAA hardware resources and prioritize
hardware acceleration. If IAA is not available, QPL can also provide an
efficient deflate-based compression algorithm, and the software and hardware
paths are fully compatible with each other.
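
As a rough sketch of how that automatic path selection looks at the QPL API
level (based on my reading of the QPL documentation; exact field and flag
names may need checking), a job initialised with qpl_path_auto uses IAA when
it is present and otherwise falls back to the software deflate path:

  /* Sketch only, not the patch code; uses the qpl/qpl.h C API. */
  qpl_path_t path = qpl_path_auto;    /* IAA if available, else software */
  uint32_t job_size = 0;
  qpl_job *job;

  qpl_get_job_size(path, &job_size);
  job = (qpl_job *)malloc(job_size);
  qpl_init_job(path, job);

  job->op            = qpl_op_compress;
  job->next_in_ptr   = in_buf;             /* page(s) to compress */
  job->available_in  = in_len;
  job->next_out_ptr  = out_buf;
  job->available_out = out_len;
  job->level         = qpl_default_level;
  job->flags         = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_DYNAMIC_HUFFMAN;

  if (qpl_execute_job(job) != QPL_STS_OK) {
      /* fall back to plain zlib or report a migration error */
  }

  qpl_fini_job(job);
  free(job);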

> > > > Test condition:
> > > >   1. Host CPUs are based on Sapphire Rapids, and frequency locked
> > > > to
> > > 3.4G
> > > >   2. VM type, 16 vCPU and 64G memory
> > > >   3. The Idle workload means no workload is running in the VM
> > > >   4. The Redis workload means YCSB workloadb + Redis Server are
> running
> > > >      in the VM, about 20G or more memory will be used.
> > > >   5. Source side migartion configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      d. migrate_set_parameter multifd-compression zlib
> > > >      e. migrate_set_parameter multifd-compression-accel none/qpl
> > > >      f. migrate_set_parameter max-bandwidth 100G
> > > >   6. Desitination side migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression zlib
> > > >      d. migrate_set_parameter multifd-compression-accel none/qpl
> > > >      e. migrate_set_parameter max-bandwidth 100G
> > >
> > > How is zlib-level setup?  Default (1)?
> > Yes, use level 1 the default level.
> >
> > > Btw, it seems both zlib/zstd levels are not even working right now
> > > to be configured.. probably overlooked in migrate_params_apply().
> > Ok, I will check this.
> 
> Thanks.  If you plan to post patch, please attach:
> 
> Reported-by: Xiaohui Li <xiaohli@redhat.com>
> 
> As that's reported by our QE team.
> 
> Maybe you can already add an unit test (migration-test.c, under tests/)
> which should expose this issue already, by setting z*-level to non-1 then
> query it back, asserting that the value did change.

Thanks for your suggestions, I will improve the test part of the code

> > > > Early migration result, each result is the average of three tests
> > > > +--------+-------------+--------+--------+---------+----+-----+
> > > >  |        | The number  |total   |downtime|network  |pages per |
> > > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > > >  |        | and mode    |        |        |(mbps)   |          |
> > > >  |        +-------------+-----------------+---------+----------+
> > > >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > > >  |workload+-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > >
> > > The number is slightly confusing to me.  If IAA can send 3x times
> > > more pages per-second, shouldn't the total migration time 1/3 of the
> > > other if the guest is idle?  But the total times seem to be pretty
> > > close no matter N of channels. Maybe I missed something?
> >
> > This data is the information read from "info migrate" after the live
> migration status changes to "complete".
> > I think it is the max throughout when expected downtime and network
> available bandwidth are met.
> > In vCPUs are idle, live migration does not run at maximum throughput for
> too long.
> >
> > > >  +--------+-------------+--------+--------+---------+----------+
> > > >
> > > >  +--------+-------------+--------+--------+---------+----+-----+
> > > >  |        | The number  |total   |downtime|network  |pages per |
> > > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > > >  |        | and mode    |        |        |(mbps)   |          |
> > > >  |        +-------------+-----------------+---------+----------+
> > > >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > > >  |workload+-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > > >  +--------+-------------+--------+--------+---------+----------+
> > >
> > > The redis results look much more preferred on using IAA comparing to
> > > the idle tests.  Does it mean that IAA works less good with zero
> > > pages in general (assuming that'll be the majority in idle test)?
> > Both Idle and Redis data are not the best performance for IAA since it
> is based on multifd packet streaming compression.
> > In the idle case, most pages are indeed zero page, zero page compression
> is not as good as only detecting zero pages, so the compression advantage
> is not reflected.
> >
> > > From the manual, I see that IAA also supports encryption/decryption.
> > > Would it be able to accelerate TLS?
> > From Sapphire Rapids(SPR)/Emerald Rapids (EMR) Xeon servers, IAA can't
> > support encryption/decryption. This feature may be available in future
> generations For TLS acceleration, QAT supports this function on SPR/EMR
> and has successful cases in some scenarios.
> > https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-
> > https-with-qat-tuning-guide.html
> >
> > > How should one consider IAA over QAT?  What is the major difference?
> > > I see that IAA requires IOMMU scalable mode, why?  Is it because the
> > > IAA HW is something attached to the pcie bus (assume QAT the same)?
> >
> > Regarding the difference between using IAA or QAT for compression 1.
> > IAA is more suitable for 4K compression, and QAT is suitable for large
> block data compression. This is determined by the deflate windows size,
> and QAT can support more compression levels. IAA hardware supports 1
> compression level.
> > 2. From the perspective of throughput, one IAA device supports
> compression throughput is 4GBps and decompression is 30GBps. One QAT
> support compression or decompression throughput is 20GBps.
> > 3. Depending on the product type selected by the customer and the
> deployment, the resources used for live migration will also be different.
> >
> > Regarding the IOMMU scalable mode
> > 1. The current IAA software stack requires Shared Virtual Memory (SVM)
> technology, and SVM depends on IOMMU scalable mode.
> > 2. Both IAA and QAT support PCIe PASID capability, then IAA can support
> shared work queue.
> > https://docs.kernel.org/next/x86/sva.html
> 
> Thanks for all these information.  I'm personally still curious why Intel
> would like to provide two new technology to service similar purposes
> merely at the same time window.
> 
> Could you put many of these information into a doc file?  It can be
> docs/devel/migration/QPL.rst.

Sure, I will update the documentation

> Also, we may want an unit test to cover the new stuff when the whole
> design settles. It may cover all mode supported, but for sure we can skip
> hw accelerated use case.

For QPL, I think this is not a problem. QPL is used as a new compression 
method and can be used even when hardware accelerators are not available.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-01-31  2:09 UTC | newest]

Thread overview: 9+ messages
2024-01-03 11:28 [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Yuan Liu
2024-01-03 11:28 ` [PATCH v3 1/4] migration: Introduce multifd-compression-accel parameter Yuan Liu
2024-01-03 11:28 ` [PATCH v3 2/4] multifd: Implement multifd compression accelerator Yuan Liu
2024-01-03 11:28 ` [PATCH v3 3/4] configure: add qpl option Yuan Liu
2024-01-03 11:28 ` [PATCH v3 4/4] multifd: Introduce QPL compression accelerator Yuan Liu
2024-01-29 10:42 ` [PATCH v3 0/4] Live Migration Acceleration with IAA Compression Peter Xu
2024-01-30  3:56   ` Liu, Yuan1
2024-01-30 10:32     ` Peter Xu
2024-01-31  2:08       ` Liu, Yuan1
