* [PATCH v7 0/7] Live Migration With IAA
@ 2024-06-03 15:40 Yuan Liu
From: Yuan Liu @ 2024-06-03 15:40 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

I am writing to submit a code change that accelerates live migration
by leveraging the compression capability of the Intel In-Memory
Analytics Accelerator (IAA).

The implementation of the IAA (de)compression code is based on the Intel
Query Processing Library (QPL), an open-source software project designed
for high-performance query processing operations on Intel CPUs.
https://github.com/intel/qpl
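
For reference, the QPL job API used throughout this series looks like
this (a minimal software-path sketch: error handling is trimmed, the
in/out buffers and their sizes are assumed to be allocated by the
caller, and the calls and job fields are the ones used in the patches
below):

    uint32_t size = 0;
    qpl_get_job_size(qpl_path_software, &size);
    qpl_job *job = g_malloc0(size);
    qpl_init_job(qpl_path_software, job);
    job->op = qpl_op_compress;
    job->next_in_ptr = in;
    job->available_in = in_len;
    job->next_out_ptr = out;
    job->available_out = out_size;
    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
    job->level = 1; /* QPL supports only compression level 1 */
    qpl_execute_job(job); /* job->total_out holds the compressed length */
    qpl_fini_job(job);
    g_free(job);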

I would like to summarize the progress so far:
1. QPL will be used as an independent compression method like ZLIB and ZSTD.
   For a summary of the compatibility issues with ZLIB, please refer to
   docs/devel/migration/qpl-compression.rst

2. The QPL method supports both a software path and a hardware path (using
   the IAA device) for multifd migration compression and decompression.
   The hardware path is always tried first; if it is unavailable, QPL
   automatically falls back to the software path.

3. Compression accelerator related patches are removed from this patch set
   and will be added to the QAT patch set; we will submit separate patches
   that use QAT to accelerate ZLIB and ZSTD.

4. Advantages of using the IAA accelerator include:
   a. Compared with the non-compression method, it can improve downtime
      performance without adding additional host resources (both CPU and
      network).
   b. Compared with software compression methods (ZSTD/ZLIB), it can
      provide a high compression ratio and save a lot of the CPU resources
      used for compression.

Test conditions:
  1. Host CPUs are based on Sapphire Rapids
  2. VM type: 16 vCPUs and 64G memory
  3. The source and destination each use 4 IAA devices.
  4. The workload in the VM
    a. all vCPUs are in the idle state
    b. 90% of the virtual machine's memory is used, filled with the
       Silesia corpus.
       An introduction to Silesia:
       https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
  5. Set the "--mem-prealloc" boot parameter on the destination; this
     parameter improves IAA performance, and a related introduction is
     added in docs/devel/migration/qpl-compression.rst
  6. Source migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter downtime-limit 100
     d. migrate_set_parameter max-bandwidth 100G/1G
     e. migrate_set_parameter multifd-compression none/qpl/zstd
  7. Destination migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter multifd-compression none/qpl/zstd
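
     The HMP commands above have QMP equivalents; a sketch with
     illustrative values:

       {"execute": "migrate-set-capabilities",
        "arguments": {"capabilities": [
          {"capability": "multifd", "state": true}]}}
       {"execute": "migrate-set-parameters",
        "arguments": {"multifd-channels": 4,
                      "multifd-compression": "qpl"}}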

Early migration results; each result is the average of three tests

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|    7493|      64|    66782|   2549993|  205%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|    5901|      49|    84832|   5289304|  283%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|    5789|      58|    86488|   4674351|  266%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 |QPL(IAA)| of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|    7330|      23|    34537|   5261385|  216%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|    5219|      31|    48899|   6498718|  405%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|    5250|      22|    49073|   5578875|  768%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | QPL(SW)| of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   73535|      21|     3887|    515976|  202%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|   37299|      27|     7668|   1637228|  403%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   18964|      38|    15093|   3074972|  795%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   78291|      24|     2201|    435601|  212%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|   39544|      21|     4366|   1036449|  457%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   20180|      26|     8581|   1958901|  894%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 |QPL(IAA)| of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |without +-------------+--------+--------+---------+----------+------+
 | --mem- |            2|   51431|      68|     4884|    227428|  202%|
 |prealloc+-------------+--------+--------+---------+----------+------+
 |        |            4|   29988|      92|     8392|    405648|  403%|
 |Network +-------------+--------+--------+---------+----------+------+
 |BW:100G |            8|   22868|      89|    11039|    222222|  795%|
 +--------+-------------+--------+--------+---------+----------+------+

When network bandwidth is sufficient, QPL can reduce downtime by 2x
compared to no compression. In this scenario, with 4/8 channels, the
IAA hardware resources are fully used, so adding more channels will not
bring further benefit.


 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   57758|      66|     8643|    264617|   28%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   57216|      58|     8726|    266773|   28%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   56708|      53|     8804|    270223|   28%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 |QPL(IAA)| of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   30129|      34|     8345|   2224761|   26%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   30317|      39|     8300|   2025220|   58%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   29615|      35|     8514|   2250122|  130%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | QPL(SW)| of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   93358|      41|     3064|    534588|  202%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   47266|      52|     6067|   1392941|  403%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   33134|      45|     8706|   2433242|  480%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   95750|      24|     1802|    477236|  263%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   48907|      24|     3536|   1002142|  411%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   25568|      32|     6783|   1696437|  783%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 |QPL(IAA)| of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |without +-------------+--------+--------+---------+----------+------+
 | --mem- |            2|   50908|      66|     4935|    240301|  200%|
 |prealloc+-------------+--------+--------+---------+----------+------+
 |        |            4|   31334|      94|     8030|    451310|  400%|
 |Network +-------------+--------+--------+---------+----------+------+
 |BW:100G |            8|   29010|     103|     8690|    629132|  620%|
 +--------+-------------+--------+--------+---------+----------+------+

When network bandwidth is limited, the "pages per second" metric
decreases with no compression, and the migration success rate drops.
Comparing the QPL and ZSTD compression methods, QPL saves a large
amount of the CPU resources used for compression.

The test without --mem-prealloc covers the case where I/O page faults
are handled by the QPL software. The performance bottleneck occurs on
the receiving side because hardware descriptors need to be submitted
repeatedly, which increases the overhead, so it is still necessary to
use --mem-prealloc on the receiving side.
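
For reference, a destination launch sketch with --mem-prealloc (the
memory size matches the test VM above; the other options and the
elided ones are illustrative):

  qemu-system-x86_64 --enable-kvm -cpu host -m 64G --mem-prealloc ...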

v2:
  - add support for multifd compression accelerator
  - add support for the QPL accelerator in the multifd
    compression accelerator
  - fixed the issue that QPL was compiled into the migration
    module by default

v3:
  - use Meson instead of pkg-config to resolve QPL build
    dependency issue
  - fix coding style
  - fix a CI issue for get_multifd_ops function in multifd.c file

v4:
  - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
    https://github.com/philmd/qemu into staging
  - remove the compression accelerator implementation patches; the patches
    will be placed in the QAT accelerator implementation.
  - introduce QPL as a new compression method
  - add QPL compression documentation
  - add QPL compression migration test
  - fix zlib/zstd compression level issue

v5:
  - patch based on v9.0.0-rc0 (c62d54d0a8)
  - use pkg-config to check libaccel-config, which is already available
    in many distributions.
  - initialize the IOV of the sender by the specific compression method
  - refine the coding style
  - remove the zlib/zstd compression level not working patch, the issue
    has been solved

v6:
  - rebase to commit id 248f6f62df, Merge tag 'pull-axp-20240504' of
    https://gitlab.com/rth7680/qemu into staging.
  - add a qpl software path; if the hardware path (IAA) is unavailable,
    qpl will fall back to the software path automatically.
  - add pkg-config to check qpl; qpl version 1.5 already supports
    pkg-config, and users can install the qpl library and qpl.pc file from
    source code.
  - remove the libaccel-config library detection; if the library is absent,
    qpl will automatically switch to the software path.
  - use g_malloc0 instead of mmap to allocate memory.
  - add more introduction of IAA device management, including usage
    permissions, resource configuration, etc. in qpl-compression.rst.
  - modify the unit test to use the software path to complete testing when
    the hardware path is unavailable.

v7:
  - rebase to commit id 3ab42e46ac, Merge tag 'pull-ufs-20240603' of
    https://gitlab.com/jeuk20.kim/qemu into staging.
  - add a maximum number of retries for hardware job resubmission;
    if resubmission still fails, use the software job instead.
  - retest performance data based on the latest code base and add a test
    for I/O page fault handled by QPL.

Yuan Liu (7):
  docs/migration: add qpl compression feature
  migration/multifd: put IOV initialization into compression method
  configure: add --enable-qpl build option
  migration/multifd: add qpl compression method
  migration/multifd: implement initialization of qpl compression
  migration/multifd: implement qpl compression and decompression
  tests/migration-test: add qpl compression test

 docs/devel/migration/features.rst        |   1 +
 docs/devel/migration/qpl-compression.rst | 262 ++++++++
 hw/core/qdev-properties-system.c         |   2 +-
 meson.build                              |   8 +
 meson_options.txt                        |   2 +
 migration/meson.build                    |   1 +
 migration/multifd-qpl.c                  | 750 +++++++++++++++++++++++
 migration/multifd-zlib.c                 |   7 +
 migration/multifd-zstd.c                 |   8 +-
 migration/multifd.c                      |  22 +-
 migration/multifd.h                      |   1 +
 qapi/migration.json                      |   7 +-
 scripts/meson-buildoptions.sh            |   3 +
 tests/qtest/migration-test.c             |  24 +
 14 files changed, 1085 insertions(+), 13 deletions(-)
 create mode 100644 docs/devel/migration/qpl-compression.rst
 create mode 100644 migration/multifd-qpl.c

-- 
2.43.0




* [PATCH v7 1/7] docs/migration: add qpl compression feature
From: Yuan Liu @ 2024-06-03 15:41 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

add an introduction to the Intel Query Processing Library (QPL)
compression method

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 docs/devel/migration/features.rst        |   1 +
 docs/devel/migration/qpl-compression.rst | 262 +++++++++++++++++++++++
 2 files changed, 263 insertions(+)
 create mode 100644 docs/devel/migration/qpl-compression.rst

diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
index d5ca7b86d5..bc98b65075 100644
--- a/docs/devel/migration/features.rst
+++ b/docs/devel/migration/features.rst
@@ -12,3 +12,4 @@ Migration has plenty of features to support different use cases.
    virtio
    mapped-ram
    CPR
+   qpl-compression
diff --git a/docs/devel/migration/qpl-compression.rst b/docs/devel/migration/qpl-compression.rst
new file mode 100644
index 0000000000..13fb7a67b1
--- /dev/null
+++ b/docs/devel/migration/qpl-compression.rst
@@ -0,0 +1,262 @@
+===============
+QPL Compression
+===============
+The Intel Query Processing Library (Intel ``QPL``) is an open-source library
+that provides compression and decompression features based on the deflate
+compression algorithm (RFC 1951).
+
+The ``QPL`` compression relies on the Intel In-Memory Analytics Accelerator
+(``IAA``) and Shared Virtual Memory (``SVM``) technologies, which are new
+features supported since the 4th Gen Intel Xeon Scalable processors,
+codenamed Sapphire Rapids (``SPR``).
+
+For a more detailed introduction to ``QPL``, please refer to `QPL Introduction
+<https://intel.github.io/qpl/documentation/introduction_docs/introduction.html>`_
+
+QPL Compression Framework
+=========================
+
+::
+
+  +----------------+       +------------------+
+  | MultiFD Thread |       |accel-config tool |
+  +-------+--------+       +--------+---------+
+          |                         |
+          |                         |
+          |compress/decompress      |
+  +-------+--------+                | Setup IAA
+  |  QPL library   |                | Resources
+  +-------+---+----+                |
+          |   |                     |
+          |   +-------------+-------+
+          |   Open IAA      |
+          |   Devices +-----+-----+
+          |           |idxd driver|
+          |           +-----+-----+
+          |                 |
+          |                 |
+          |           +-----+-----+
+          +-----------+IAA Devices|
+      Submit jobs     +-----------+
+      via enqcmd
+
+
+QPL Build And Installation
+--------------------------
+
+.. code-block:: shell
+
+  $git clone --recursive https://github.com/intel/qpl.git qpl
+  $mkdir qpl/build
+  $cd qpl/build
+  $cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DQPL_LIBRARY_TYPE=SHARED ..
+  $sudo cmake --build . --target install
+
+For more details about ``QPL`` installation, please refer to `QPL Installation
+<https://intel.github.io/qpl/documentation/get_started_docs/installation.html>`_
+
+IAA Device Management
+---------------------
+
+The number of ``IAA`` devices will vary depending on the Xeon product model.
+On a ``SPR`` server, there can be a maximum of 8 ``IAA`` devices, with up to
+4 devices per socket.
+
+By default, all ``IAA`` devices are disabled and need to be configured and
+enabled by users manually.
+
+Check the number of devices with the following command:
+
+.. code-block:: shell
+
+  #lspci -d 8086:0cfe
+  6a:02.0 System peripheral: Intel Corporation Device 0cfe
+  6f:02.0 System peripheral: Intel Corporation Device 0cfe
+  74:02.0 System peripheral: Intel Corporation Device 0cfe
+  79:02.0 System peripheral: Intel Corporation Device 0cfe
+  e7:02.0 System peripheral: Intel Corporation Device 0cfe
+  ec:02.0 System peripheral: Intel Corporation Device 0cfe
+  f1:02.0 System peripheral: Intel Corporation Device 0cfe
+  f6:02.0 System peripheral: Intel Corporation Device 0cfe
+
+IAA Device Configuration And Enabling
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``accel-config`` tool is used to enable ``IAA`` devices and configure
+``IAA`` hardware resources (work queues and engines). One ``IAA`` device
+has 8 work queues and 8 processing engines; multiple engines can be
+assigned to a work queue via the ``group`` attribute.
+
+For ``accel-config`` installation, please refer to `accel-config installation
+<https://github.com/intel/idxd-config>`_
+
+An example of configuring and enabling an ``IAA`` device:
+
+.. code-block:: shell
+
+  #accel-config config-engine iax1/engine1.0 -g 0
+  #accel-config config-engine iax1/engine1.1 -g 0
+  #accel-config config-engine iax1/engine1.2 -g 0
+  #accel-config config-engine iax1/engine1.3 -g 0
+  #accel-config config-engine iax1/engine1.4 -g 0
+  #accel-config config-engine iax1/engine1.5 -g 0
+  #accel-config config-engine iax1/engine1.6 -g 0
+  #accel-config config-engine iax1/engine1.7 -g 0
+  #accel-config config-wq iax1/wq1.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m shared -y user -n app1 -d user
+  #accel-config enable-device iax1
+  #accel-config enable-wq iax1/wq1.0
+
+.. note::
+   IAX is an early name for IAA
+
+- The ``IAA`` device index is 1; use the ``ls -lh /sys/bus/dsa/devices/iax*``
+  command to query the ``IAA`` device index.
+
+- 8 engines and 1 work queue are configured in group 0, so all compression jobs
+  submitted to this work queue can be processed by all engines at the same time.
+
+- Set work queue attributes including the work mode, work queue size and so on.
+
+- Enable the ``IAA1`` device and work queue 1.0
+
+.. note::
+
+  Set the work queue mode to shared, since the ``QPL`` library only
+  supports shared mode
+
+For more detailed configuration, please refer to `IAA Configuration Samples
+<https://github.com/intel/idxd-config/tree/stable/Documentation/accfg>`_
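+
+The per-device commands above can also be scripted. A shell sketch (the
+device indices below are illustrative; query the real ones with
+``ls -lh /sys/bus/dsa/devices/iax*`` as noted above, and run as root):
+
+.. code-block:: shell
+
+  for dev_id in 1 3 5 7; do
+    for eng in 0 1 2 3 4 5 6 7; do
+      accel-config config-engine iax${dev_id}/engine${dev_id}.${eng} -g 0
+    done
+    accel-config config-wq iax${dev_id}/wq${dev_id}.0 -g 0 -s 128 -p 10 \
+      -b 1 -t 128 -m shared -y user -n app1 -d user
+    accel-config enable-device iax${dev_id}
+    accel-config enable-wq iax${dev_id}/wq${dev_id}.0
+  done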
+
+IAA Unit Test
+^^^^^^^^^^^^^
+
+- Enabling ``IAA`` devices for Xeon platform, please refer to `IAA User Guide
+  <https://www.intel.com/content/www/us/en/content-details/780887/intel-in-memory-analytics-accelerator-intel-iaa.html>`_
+
+- The ``IAA`` device driver is the Intel Data Accelerator Driver (idxd); a
+  Linux kernel of version 5.18 or later is recommended.
+
+- Add ``"intel_iommu=on,sm_on"`` parameter to kernel command line
+  for ``SVM`` feature enabling.
+
+Here is an easy way to verify the ``IAA`` device driver and ``SVM`` with `iaa_test
+<https://github.com/intel/idxd-config/tree/stable/test>`_
+
+.. code-block:: shell
+
+  #./test/iaa_test
+   [ info] alloc wq 0 shared size 128 addr 0x7f26cebe5000 batch sz 0xfffffffe xfer sz 0x80000000
+   [ info] test noop: tflags 0x1 num_desc 1
+   [ info] preparing descriptor for noop
+   [ info] Submitted all noop jobs
+   [ info] verifying task result for 0x16f7e20
+   [ info] test with op 0 passed
+
+
+IAA Resources Allocation For Migration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are no ``IAA`` resource configuration parameters for migration, and
+the ``accel-config`` tool configuration cannot directly specify the ``IAA``
+resources used for migration.
+
+The multifd migration with the ``QPL`` compression method will use all work
+queues that are enabled and in shared mode.
+
+.. note::
+
+  Accessing IAA resources requires the ``sudo`` command or ``root``
+  privileges by default. Administrators can modify the IAA device node
+  ownership so that QEMU can use IAA with specified user permissions.
+
+  For example::
+
+    #chown -R Qemu /dev/iax
+
+
+Shared Virtual Memory(SVM) Introduction
+=======================================
+
+``SVM`` is the ability of an accelerator I/O device to operate in the same
+virtual memory space as applications on the host processors. It also implies
+the ability to operate from pageable memory, avoiding the functional
+requirement to pin memory for DMA operations.
+
+When using ``SVM`` technology, users do not need to reserve memory for the
+``IAA`` device or perform memory pinning. The ``IAA`` device can
+directly access data using the virtual address of the process.
+
+For more on ``SVM`` technology, please refer to
+`Shared Virtual Addressing (SVA) with ENQCMD
+<https://docs.kernel.org/next/x86/sva.html>`_
+
+
+How To Use QPL Compression In Migration
+=======================================
+
+1 - Install the ``QPL`` library, and the ``accel-config`` library if using IAA
+
+2 - Configure and enable ``IAA`` devices and work queues via ``accel-config``
+
+3 - Build ``QEMU`` with the ``--enable-qpl`` parameter
+
+  E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-qpl``
+
+4 - Enable ``QPL`` compression during migration
+
+  Set ``migrate_set_parameter multifd-compression qpl`` when migrating. The
+  ``QPL`` compression does not support configuring the compression level;
+  it only supports one compression level.
+
+The Difference Between QPL And ZLIB
+===================================
+
+Although both ``QPL`` and ``ZLIB`` are based on the deflate compression
+algorithm, and ``QPL`` can support the ``ZLIB`` header and trailer, ``QPL``
+is still not fully compatible with ``ZLIB`` compression in the migration.
+
+``QPL`` only supports a 4K history buffer, while ``ZLIB`` uses 32K by
+default, so ``QPL`` may not correctly decompress ``ZLIB``-compressed data
+and vice versa.
+
+``QPL`` does not support the ``Z_SYNC_FLUSH`` operation used in ``ZLIB``
+streaming compression; the current ``ZLIB`` implementation uses
+``Z_SYNC_FLUSH``, so each ``multifd`` thread has a ``ZLIB`` streaming
+context, and all page compression and decompression are based on this
+stream. ``QPL`` cannot decompress such data and vice versa.
+
+For an introduction to ``Z_SYNC_FLUSH``, please refer to `Zlib Manual
+<https://www.zlib.net/manual.html>`_
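+
+As an illustration (a minimal sketch, not the QEMU implementation;
+buffer declarations are elided), the ``ZLIB`` sender keeps one stream
+alive across pages with ``Z_SYNC_FLUSH``, which requires the receiver
+to retain the full 32K history window:
+
+.. code-block:: c
+
+  /* All pages go into ONE continuous deflate stream */
+  z_stream zs = { 0 };
+  deflateInit(&zs, Z_BEST_SPEED);     /* 32K history window by default */
+  zs.next_out = out;
+  zs.avail_out = out_size;
+  for (size_t i = 0; i < npages; i++) {
+      zs.next_in = pages[i];
+      zs.avail_in = page_size;
+      /* Flush all pending output but keep the stream (and its history)
+       * alive for the next page; QPL has no equivalent operation */
+      deflate(&zs, Z_SYNC_FLUSH);
+  }
+  deflateEnd(&zs);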
+
+The Best Practices
+==================
+When a user enables the IAA device for ``QPL`` compression, it is
+recommended to add the ``-mem-prealloc`` parameter to the destination boot
+parameters. This parameter avoids I/O page faults and reduces the overhead
+of IAA compression and decompression.
+
+An example of booting with the ``-mem-prealloc`` parameter
+
+.. code-block:: shell
+
+   $qemu-system-x86_64 --enable-kvm -cpu host --mem-prealloc ...
+
+
+An example of I/O page fault measurement on the destination without
+``-mem-prealloc``; the ``svm_prq`` row indicates the number of I/O page
+fault occurrences and the processing time.
+
+.. code-block:: shell
+
+  #echo 1 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
+  #echo 2 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
+  #echo 3 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
+  #echo 4 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
+  #cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
+  IOMMU: dmar18 Register Base Address: c87fc000
+                  <0.1us   0.1us-1us    1us-10us  10us-100us   100us-1ms    1ms-10ms      >=10ms     min(us)     max(us) average(us)
+   inv_iotlb           0         286         123           0           0           0           0           0           1           0
+  inv_devtlb           0         276         133           0           0           0           0           0           2           0
+     inv_iec           0           0           0           0           0           0           0           0           0           0
+     svm_prq           0           0       25206         364         395           0           0           1         556           9
+
-- 
2.43.0




* [PATCH v7 2/7] migration/multifd: put IOV initialization into compression method
From: Yuan Liu @ 2024-06-03 15:41 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

Different compression methods may require different numbers of IOVs.
With the streaming compression of zlib and zstd, all pages are
compressed into one data block, so two IOVs are needed: one for the
packet header and one for the compressed data block.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd-zlib.c |  7 +++++++
 migration/multifd-zstd.c |  8 +++++++-
 migration/multifd.c      | 22 ++++++++++++----------
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 737a9645d2..2ced69487e 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -70,6 +70,10 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
         goto err_free_zbuff;
     }
     p->compress_data = z;
+
+    /* Needs 2 IOVs, one for packet header and one for compressed data */
+    p->iov = g_new0(struct iovec, 2);
+
     return 0;
 
 err_free_zbuff:
@@ -101,6 +105,9 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error **errp)
     z->buf = NULL;
     g_free(p->compress_data);
     p->compress_data = NULL;
+
+    g_free(p->iov);
+    p->iov = NULL;
 }
 
 /**
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 256858df0a..ca17b7e310 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -52,7 +52,6 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
     struct zstd_data *z = g_new0(struct zstd_data, 1);
     int res;
 
-    p->compress_data = z;
     z->zcs = ZSTD_createCStream();
     if (!z->zcs) {
         g_free(z);
@@ -77,6 +76,10 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
         error_setg(errp, "multifd %u: out of memory for zbuff", p->id);
         return -1;
     }
+    p->compress_data = z;
+
+    /* Needs 2 IOVs, one for packet header and one for compressed data */
+    p->iov = g_new0(struct iovec, 2);
     return 0;
 }
 
@@ -98,6 +101,9 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error **errp)
     z->zbuff = NULL;
     g_free(p->compress_data);
     p->compress_data = NULL;
+
+    g_free(p->iov);
+    p->iov = NULL;
 }
 
 /**
diff --git a/migration/multifd.c b/migration/multifd.c
index f317bff077..d82885fdbb 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -137,6 +137,13 @@ static int nocomp_send_setup(MultiFDSendParams *p, Error **errp)
         p->write_flags |= QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
     }
 
+    if (multifd_use_packets()) {
+        /* We need one extra place for the packet header */
+        p->iov = g_new0(struct iovec, p->page_count + 1);
+    } else {
+        p->iov = g_new0(struct iovec, p->page_count);
+    }
+
     return 0;
 }
 
@@ -150,6 +157,8 @@ static int nocomp_send_setup(MultiFDSendParams *p, Error **errp)
  */
 static void nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
 {
+    g_free(p->iov);
+    p->iov = NULL;
     return;
 }
 
@@ -228,6 +237,7 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
  */
 static int nocomp_recv_setup(MultiFDRecvParams *p, Error **errp)
 {
+    p->iov = g_new0(struct iovec, p->page_count);
     return 0;
 }
 
@@ -240,6 +250,8 @@ static int nocomp_recv_setup(MultiFDRecvParams *p, Error **errp)
  */
 static void nocomp_recv_cleanup(MultiFDRecvParams *p)
 {
+    g_free(p->iov);
+    p->iov = NULL;
 }
 
 /**
@@ -783,8 +795,6 @@ static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
-    g_free(p->iov);
-    p->iov = NULL;
     multifd_send_state->ops->send_cleanup(p, errp);
 
     return *errp == NULL;
@@ -1179,11 +1189,6 @@ bool multifd_send_setup(void)
             p->packet = g_malloc0(p->packet_len);
             p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
             p->packet->version = cpu_to_be32(MULTIFD_VERSION);
-
-            /* We need one extra place for the packet header */
-            p->iov = g_new0(struct iovec, page_count + 1);
-        } else {
-            p->iov = g_new0(struct iovec, page_count);
         }
         p->name = g_strdup_printf("multifdsend_%d", i);
         p->page_size = qemu_target_page_size();
@@ -1353,8 +1358,6 @@ static void multifd_recv_cleanup_channel(MultiFDRecvParams *p)
     p->packet_len = 0;
     g_free(p->packet);
     p->packet = NULL;
-    g_free(p->iov);
-    p->iov = NULL;
     g_free(p->normal);
     p->normal = NULL;
     g_free(p->zero);
@@ -1602,7 +1605,6 @@ int multifd_recv_setup(Error **errp)
             p->packet = g_malloc0(p->packet_len);
         }
         p->name = g_strdup_printf("multifdrecv_%d", i);
-        p->iov = g_new0(struct iovec, page_count);
         p->normal = g_new0(ram_addr_t, page_count);
         p->zero = g_new0(ram_addr_t, page_count);
         p->page_count = page_count;
-- 
2.43.0




* [PATCH v7 3/7] configure: add --enable-qpl build option
From: Yuan Liu @ 2024-06-03 15:41 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

add --enable-qpl and --disable-qpl options to enable and disable
the QPL compression method for multifd migration.

The Query Processing Library (QPL) is an open-source library
that supports data compression and decompression features. It
is based on the deflate compression algorithm and uses Intel
In-Memory Analytics Accelerator (IAA) hardware to accelerate
compression and decompression.

For more on live migration with IAA, please refer to the document
docs/devel/migration/qpl-compression.rst
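
For example (mirroring the documentation added in this series):

  ./configure --target-list=x86_64-softmmu --enable-kvm --enable-qpl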

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 meson.build                   | 8 ++++++++
 meson_options.txt             | 2 ++
 scripts/meson-buildoptions.sh | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/meson.build b/meson.build
index 6386607144..d97f312a42 100644
--- a/meson.build
+++ b/meson.build
@@ -1197,6 +1197,12 @@ if not get_option('zstd').auto() or have_block
                     required: get_option('zstd'),
                     method: 'pkg-config')
 endif
+qpl = not_found
+if not get_option('qpl').auto() or have_system
+  qpl = dependency('qpl', version: '>=1.5.0',
+                    required: get_option('qpl'),
+                    method: 'pkg-config')
+endif
 virgl = not_found
 
 have_vhost_user_gpu = have_tools and host_os == 'linux' and pixman.found()
@@ -2331,6 +2337,7 @@ config_host_data.set('CONFIG_MALLOC_TRIM', has_malloc_trim)
 config_host_data.set('CONFIG_STATX', has_statx)
 config_host_data.set('CONFIG_STATX_MNT_ID', has_statx_mnt_id)
 config_host_data.set('CONFIG_ZSTD', zstd.found())
+config_host_data.set('CONFIG_QPL', qpl.found())
 config_host_data.set('CONFIG_FUSE', fuse.found())
 config_host_data.set('CONFIG_FUSE_LSEEK', fuse_lseek.found())
 config_host_data.set('CONFIG_SPICE_PROTOCOL', spice_protocol.found())
@@ -4439,6 +4446,7 @@ summary_info += {'snappy support':    snappy}
 summary_info += {'bzip2 support':     libbzip2}
 summary_info += {'lzfse support':     liblzfse}
 summary_info += {'zstd support':      zstd}
+summary_info += {'Query Processing Library support': qpl}
 summary_info += {'NUMA host support': numa}
 summary_info += {'capstone':          capstone}
 summary_info += {'libpmem support':   libpmem}
diff --git a/meson_options.txt b/meson_options.txt
index 4c1583eb40..dd680a5faf 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -259,6 +259,8 @@ option('xkbcommon', type : 'feature', value : 'auto',
        description: 'xkbcommon support')
 option('zstd', type : 'feature', value : 'auto',
        description: 'zstd compression support')
+option('qpl', type : 'feature', value : 'auto',
+       description: 'Query Processing Library support')
 option('fuse', type: 'feature', value: 'auto',
        description: 'FUSE block device export')
 option('fuse_lseek', type : 'feature', value : 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 6ce5a8b72a..73ae8cedfc 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -220,6 +220,7 @@ meson_options_help() {
   printf "%s\n" '                  Xen PCI passthrough support'
   printf "%s\n" '  xkbcommon       xkbcommon support'
   printf "%s\n" '  zstd            zstd compression support'
+  printf "%s\n" '  qpl             Query Processing Library support'
 }
 _meson_option_parse() {
   case $1 in
@@ -558,6 +559,8 @@ _meson_option_parse() {
     --disable-xkbcommon) printf "%s" -Dxkbcommon=disabled ;;
     --enable-zstd) printf "%s" -Dzstd=enabled ;;
     --disable-zstd) printf "%s" -Dzstd=disabled ;;
+    --enable-qpl) printf "%s" -Dqpl=enabled ;;
+    --disable-qpl) printf "%s" -Dqpl=disabled ;;
     *) return 1 ;;
   esac
 }
-- 
2.43.0




* [PATCH v7 4/7] migration/multifd: add qpl compression method
From: Yuan Liu @ 2024-06-03 15:41 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

add the Query Processing Library (QPL) compression method

Introduce qpl as a new multifd migration compression method. It can
use the In-Memory Analytics Accelerator (IAA) to accelerate compression
and decompression, which not only reduces the network bandwidth
requirement but also reduces host compression and decompression CPU
overhead.

How to enable qpl compression during migration:
migrate_set_parameter multifd-compression qpl

There is no qpl compression level parameter added since it only supports
level one; users do not need to specify the qpl compression level.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 hw/core/qdev-properties-system.c |  2 +-
 migration/meson.build            |  1 +
 migration/multifd-qpl.c          | 20 ++++++++++++++++++++
 migration/multifd.h              |  1 +
 qapi/migration.json              |  7 ++++++-
 5 files changed, 29 insertions(+), 2 deletions(-)
 create mode 100644 migration/multifd-qpl.c

diff --git a/hw/core/qdev-properties-system.c b/hw/core/qdev-properties-system.c
index d79d6f4b53..6ccd7224f6 100644
--- a/hw/core/qdev-properties-system.c
+++ b/hw/core/qdev-properties-system.c
@@ -659,7 +659,7 @@ const PropertyInfo qdev_prop_fdc_drive_type = {
 const PropertyInfo qdev_prop_multifd_compression = {
     .name = "MultiFDCompression",
     .description = "multifd_compression values, "
-                   "none/zlib/zstd",
+                   "none/zlib/zstd/qpl",
     .enum_table = &MultiFDCompression_lookup,
     .get = qdev_propinfo_get_enum,
     .set = qdev_propinfo_set_enum,
diff --git a/migration/meson.build b/migration/meson.build
index bdc3244bce..5f146fe8a9 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -39,6 +39,7 @@ endif
 
 system_ss.add(when: rdma, if_true: files('rdma.c'))
 system_ss.add(when: zstd, if_true: files('multifd-zstd.c'))
+system_ss.add(when: qpl, if_true: files('multifd-qpl.c'))
 
 specific_ss.add(when: 'CONFIG_SYSTEM_ONLY',
                 if_true: files('ram.c',
diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
new file mode 100644
index 0000000000..056a68a060
--- /dev/null
+++ b/migration/multifd-qpl.c
@@ -0,0 +1,20 @@
+/*
+ * Multifd qpl compression accelerator implementation
+ *
+ * Copyright (c) 2023 Intel Corporation
+ *
+ * Authors:
+ *  Yuan Liu<yuan1.liu@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "qemu/module.h"
+
+static void multifd_qpl_register(void)
+{
+    /* noop */
+}
+
+migration_init(multifd_qpl_register);
diff --git a/migration/multifd.h b/migration/multifd.h
index c9d9b09239..5b7d9b15f8 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -40,6 +40,7 @@ MultiFDRecvData *multifd_get_recv_data(void);
 #define MULTIFD_FLAG_NOCOMP (0 << 1)
 #define MULTIFD_FLAG_ZLIB (1 << 1)
 #define MULTIFD_FLAG_ZSTD (2 << 1)
+#define MULTIFD_FLAG_QPL (4 << 1)
 
 /* This value needs to be a multiple of qemu_target_page_size() */
 #define MULTIFD_PACKET_SIZE (512 * 1024)
diff --git a/qapi/migration.json b/qapi/migration.json
index a351fd3714..f97bc3bb93 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -554,11 +554,16 @@
 #
 # @zstd: use zstd compression method.
 #
+# @qpl: use qpl compression method. The Query Processing Library (qpl) is
+#       based on the deflate compression algorithm and uses the Intel
+#       In-Memory Analytics Accelerator (IAA) to accelerate compression
+#       and decompression. (Since 9.1)
+#
 # Since: 5.0
 ##
 { 'enum': 'MultiFDCompression',
   'data': [ 'none', 'zlib',
-            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' } ] }
+            { 'name': 'zstd', 'if': 'CONFIG_ZSTD' },
+            { 'name': 'qpl', 'if': 'CONFIG_QPL' } ] }
 
 ##
 # @MigMode:
-- 
2.43.0




* [PATCH v7 5/7] migration/multifd: implement initialization of qpl compression
From: Yuan Liu @ 2024-06-03 15:41 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

During initialization, a software job is allocated to each channel
for software path fallback when the IAA hardware is unavailable or
the hardware job submission fails. If the IAA hardware is available,
multiple hardware jobs are allocated for batch processing.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 migration/multifd-qpl.c | 328 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 327 insertions(+), 1 deletion(-)

diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index 056a68a060..6791a204d5 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -9,12 +9,338 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
  */
+
 #include "qemu/osdep.h"
 #include "qemu/module.h"
+#include "qapi/error.h"
+#include "multifd.h"
+#include "qpl/qpl.h"
+
+typedef struct {
+    /* the QPL hardware path job */
+    qpl_job *job;
+    /* indicates if fallback to software path is required */
+    bool fallback_sw_path;
+    /* output data from the software path */
+    uint8_t *sw_output;
+    /* output data length from the software path */
+    uint32_t sw_output_len;
+} QplHwJob;
+
+typedef struct {
+    /* array of hardware jobs, the number of jobs equals the number of pages */
+    QplHwJob *hw_jobs;
+    /* the QPL software job for the slow path and software fallback */
+    qpl_job *sw_job;
+    /* the number of pages that the QPL needs to process at one time */
+    uint32_t page_num;
+    /* array of compressed page buffers */
+    uint8_t *zbuf;
+    /* array of compressed page lengths */
+    uint32_t *zlen;
+    /* the status of the hardware device */
+    bool hw_avail;
+} QplData;
+
+/**
+ * check_hw_avail: check if IAA hardware is available
+ *
+ * If the IAA hardware does not exist or is unavailable,
+ * the QPL hardware job initialization will fail.
+ *
+ * Returns true if IAA hardware is available, otherwise false.
+ *
+ * @job_size: indicates the hardware job size if hardware is available
+ */
+static bool check_hw_avail(uint32_t *job_size)
+{
+    qpl_path_t path = qpl_path_hardware;
+    uint32_t size = 0;
+    qpl_job *job;
+
+    if (qpl_get_job_size(path, &size) != QPL_STS_OK) {
+        return false;
+    }
+    assert(size > 0);
+    job = g_malloc0(size);
+    if (qpl_init_job(path, job) != QPL_STS_OK) {
+        g_free(job);
+        return false;
+    }
+    g_free(job);
+    *job_size = size;
+    return true;
+}
+
+/**
+ * multifd_qpl_free_sw_job: clean up software job
+ *
+ * Free the software job resources.
+ *
+ * @qpl: pointer to the QplData structure
+ */
+static void multifd_qpl_free_sw_job(QplData *qpl)
+{
+    assert(qpl);
+    if (qpl->sw_job) {
+        qpl_fini_job(qpl->sw_job);
+        g_free(qpl->sw_job);
+        qpl->sw_job = NULL;
+    }
+}
+
+/**
+ * multifd_qpl_free_hw_job: clean up hardware jobs
+ *
+ * Free all hardware job resources.
+ *
+ * @qpl: pointer to the QplData structure
+ */
+static void multifd_qpl_free_hw_job(QplData *qpl)
+{
+    assert(qpl);
+    if (qpl->hw_jobs) {
+        for (int i = 0; i < qpl->page_num; i++) {
+            qpl_fini_job(qpl->hw_jobs[i].job);
+            g_free(qpl->hw_jobs[i].job);
+            qpl->hw_jobs[i].job = NULL;
+        }
+        g_free(qpl->hw_jobs);
+        qpl->hw_jobs = NULL;
+    }
+}
+
+/**
+ * multifd_qpl_init_sw_job: initialize a software job
+ *
+ * Use the QPL software path to initialize a job
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @qpl: pointer to the QplData structure
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_init_sw_job(QplData *qpl, Error **errp)
+{
+    qpl_path_t path = qpl_path_software;
+    uint32_t size = 0;
+    qpl_job *job = NULL;
+    qpl_status status;
+
+    status = qpl_get_job_size(path, &size);
+    if (status != QPL_STS_OK) {
+        error_setg(errp, "qpl_get_job_size failed with error %d", status);
+        return -1;
+    }
+    job = g_malloc0(size);
+    status = qpl_init_job(path, job);
+    if (status != QPL_STS_OK) {
+        error_setg(errp, "qpl_init_job failed with error %d", status);
+        g_free(job);
+        return -1;
+    }
+    qpl->sw_job = job;
+    return 0;
+}
+
+/**
+ * multifd_qpl_init_hw_job: initialize hardware jobs
+ *
+ * Use the QPL hardware path to initialize jobs
+ *
+ * @qpl: pointer to the QplData structure
+ * @size: the size of QPL hardware path job
+ * @errp: pointer to an error
+ */
+static void multifd_qpl_init_hw_job(QplData *qpl, uint32_t size, Error **errp)
+{
+    qpl_path_t path = qpl_path_hardware;
+    qpl_job *job = NULL;
+    qpl_status status;
+
+    qpl->hw_jobs = g_new0(QplHwJob, qpl->page_num);
+    for (int i = 0; i < qpl->page_num; i++) {
+        job = g_malloc0(size);
+        status = qpl_init_job(path, job);
+        /* the job initialization should succeed after check_hw_avail */
+        assert(status == QPL_STS_OK);
+        qpl->hw_jobs[i].job = job;
+    }
+}
+
+/**
+ * multifd_qpl_init: initialize QplData structure
+ *
+ * Allocate and initialize a QplData structure
+ *
+ * Returns a QplData pointer on success or NULL on error
+ *
+ * @num: the number of pages
+ * @size: the page size
+ * @errp: pointer to an error
+ */
+static QplData *multifd_qpl_init(uint32_t num, uint32_t size, Error **errp)
+{
+    uint32_t job_size = 0;
+    QplData *qpl;
+
+    qpl = g_new0(QplData, 1);
+    qpl->page_num = num;
+    if (multifd_qpl_init_sw_job(qpl, errp) != 0) {
+        g_free(qpl);
+        return NULL;
+    }
+    qpl->hw_avail = check_hw_avail(&job_size);
+    if (qpl->hw_avail) {
+        multifd_qpl_init_hw_job(qpl, job_size, errp);
+    }
+    qpl->zbuf = g_malloc0(size * num);
+    qpl->zlen = g_new0(uint32_t, num);
+    return qpl;
+}
+
+/**
+ * multifd_qpl_deinit: clean up QplData structure
+ *
+ * Free jobs, buffers and the QplData structure
+ *
+ * @qpl: pointer to the QplData structure
+ */
+static void multifd_qpl_deinit(QplData *qpl)
+{
+    if (qpl) {
+        multifd_qpl_free_sw_job(qpl);
+        multifd_qpl_free_hw_job(qpl);
+        g_free(qpl->zbuf);
+        g_free(qpl->zlen);
+        g_free(qpl);
+    }
+}
+
+/**
+ * multifd_qpl_send_setup: set up send side
+ *
+ * Set up the channel with QPL compression.
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_send_setup(MultiFDSendParams *p, Error **errp)
+{
+    QplData *qpl;
+
+    qpl = multifd_qpl_init(p->page_count, p->page_size, errp);
+    if (!qpl) {
+        return -1;
+    }
+    p->compress_data = qpl;
+
+    /*
+     * Each page is compressed independently and sent using an IOV. The
+     * two additional IOVs are used to store the packet header and the
+     * compressed data lengths
+     */
+    p->iov = g_new0(struct iovec, p->page_count + 2);
+    return 0;
+}
+
+/**
+ * multifd_qpl_send_cleanup: clean up send side
+ *
+ * Close the channel and free memory.
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
+{
+    multifd_qpl_deinit(p->compress_data);
+    p->compress_data = NULL;
+    g_free(p->iov);
+    p->iov = NULL;
+}
+
+/**
+ * multifd_qpl_send_prepare: prepare data to be able to send
+ *
+ * Create a compressed buffer with all the pages that we are going to
+ * send.
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
+{
+    /* Implement in next patch */
+    return -1;
+}
+
+/**
+ * multifd_qpl_recv_setup: set up receive side
+ *
+ * Create the compressed channel and buffer.
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_recv_setup(MultiFDRecvParams *p, Error **errp)
+{
+    QplData *qpl;
+
+    qpl = multifd_qpl_init(p->page_count, p->page_size, errp);
+    if (!qpl) {
+        return -1;
+    }
+    p->compress_data = qpl;
+    return 0;
+}
+
+/**
+ * multifd_qpl_recv_cleanup: clean up receive side
+ *
+ * Close the channel and free memory.
+ *
+ * @p: Params for the channel being used
+ */
+static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
+{
+    multifd_qpl_deinit(p->compress_data);
+    p->compress_data = NULL;
+}
+
+/**
+ * multifd_qpl_recv: read the data from the channel into actual pages
+ *
+ * Read the compressed buffer, and uncompress it into the actual
+ * pages.
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp)
+{
+    /* Implement in next patch */
+    return -1;
+}
+
+static MultiFDMethods multifd_qpl_ops = {
+    .send_setup = multifd_qpl_send_setup,
+    .send_cleanup = multifd_qpl_send_cleanup,
+    .send_prepare = multifd_qpl_send_prepare,
+    .recv_setup = multifd_qpl_recv_setup,
+    .recv_cleanup = multifd_qpl_recv_cleanup,
+    .recv = multifd_qpl_recv,
+};
 
 static void multifd_qpl_register(void)
 {
-    /* noop */
+    multifd_register_ops(MULTIFD_COMPRESSION_QPL, &multifd_qpl_ops);
 }
 
 migration_init(multifd_qpl_register);
-- 
2.43.0




* [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression
From: Yuan Liu @ 2024-06-03 15:41 UTC
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

QPL compression and decompression will use the IAA hardware first.
If the IAA hardware is not available, it will automatically fall
back to the QPL software path; if the software job also fails,
the uncompressed page is sent directly.
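
In pseudo-C, the per-page send policy is roughly as follows (the
function names are the ones introduced in this patch; control flow
is simplified):

    if (qpl->hw_avail && multifd_qpl_submit_job(hw_job->job)) {
        /* poll for IAA completion, then send the hardware output */
    } else if (qpl_execute_job(sw_job) == QPL_STS_OK) {
        /* send the software-compressed output */
    } else {
        /* compression failed, send the uncompressed page directly */
    }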

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
---
 migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 408 insertions(+), 4 deletions(-)

diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
index 6791a204d5..18b3384bd5 100644
--- a/migration/multifd-qpl.c
+++ b/migration/multifd-qpl.c
@@ -13,9 +13,14 @@
 #include "qemu/osdep.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
+#include "qapi/qapi-types-migration.h"
+#include "exec/ramblock.h"
 #include "multifd.h"
 #include "qpl/qpl.h"
 
+/* Maximum number of retries to resubmit a job if IAA work queues are full */
+#define MAX_SUBMIT_RETRY_NUM (3)
+
 typedef struct {
     /* the QPL hardware path job */
     qpl_job *job;
@@ -260,6 +265,219 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
     p->iov = NULL;
 }
 
+/**
+ * multifd_qpl_prepare_job: prepare the job
+ *
+ * Set the QPL job parameters and properties.
+ *
+ * @job: pointer to the qpl_job structure
+ * @is_compression: indicates whether the job is compression or decompression
+ * @input: pointer to the input data buffer
+ * @input_len: the length of the input data
+ * @output: pointer to the output data buffer
+ * @output_len: the length of the output data
+ */
+static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
+                                    uint8_t *input, uint32_t input_len,
+                                    uint8_t *output, uint32_t output_len)
+{
+    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
+    job->next_in_ptr = input;
+    job->next_out_ptr = output;
+    job->available_in = input_len;
+    job->available_out = output_len;
+    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
+    /* only supports compression level 1 */
+    job->level = 1;
+}
+
+/**
+ * multifd_qpl_prepare_comp_job: prepare the compression job
+ *
+ * Set the compression job parameters and properties.
+ *
+ * @job: pointer to the qpl_job structure
+ * @input: pointer to the input data buffer
+ * @input_len: the length of the input data
+ * @output: pointer to the output data buffer
+ * @output_len: the length of the output data
+ */
+static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
+                                         uint32_t input_len, uint8_t *output,
+                                         uint32_t output_len)
+{
+    multifd_qpl_prepare_job(job, true, input, input_len, output, output_len);
+}
+
+/**
+ * multifd_qpl_prepare_decomp_job: prepare the decompression job
+ *
+ * Set the decompression job parameters and properties.
+ *
+ * @job: pointer to the qpl_job structure
+ * @input: pointer to the input data buffer
+ * @input_len: the length of the input data
+ * @output: pointer to the output data buffer
+ * @output_len: the length of the output data
+ */
+static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t *input,
+                                           uint32_t input_len, uint8_t *output,
+                                           uint32_t output_len)
+{
+    multifd_qpl_prepare_job(job, false, input, input_len, output, output_len);
+}
+
+/**
+ * multifd_qpl_fill_iov: fill in the IOV
+ *
+ * Fill in the QPL packet IOV
+ *
+ * @p: Params for the channel being used
+ * @data: pointer to the IOV data
+ * @len: The length of the IOV data
+ */
+static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
+                                 uint32_t len)
+{
+    p->iov[p->iovs_num].iov_base = data;
+    p->iov[p->iovs_num].iov_len = len;
+    p->iovs_num++;
+    p->next_packet_size += len;
+}
+
+/**
+ * multifd_qpl_fill_packet: fill the compressed page into the QPL packet
+ *
+ * Fill the compressed page length and IOV into the QPL packet
+ *
+ * @idx: The index of the compressed length array
+ * @p: Params for the channel being used
+ * @data: pointer to the compressed page buffer
+ * @len: The length of the compressed page
+ */
+static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
+                                    uint8_t *data, uint32_t len)
+{
+    QplData *qpl = p->compress_data;
+
+    qpl->zlen[idx] = cpu_to_be32(len);
+    multifd_qpl_fill_iov(p, data, len);
+}
+
+/**
+ * multifd_qpl_submit_job: submit a job to the hardware
+ *
+ * Submit a QPL hardware job to the IAA device
+ *
+ * Returns true if the job is submitted successfully, otherwise false.
+ *
+ * @job: pointer to the qpl_job structure
+ */
+static bool multifd_qpl_submit_job(qpl_job *job)
+{
+    qpl_status status;
+    uint32_t num = 0;
+
+retry:
+    status = qpl_submit_job(job);
+    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
+        if (num < MAX_SUBMIT_RETRY_NUM) {
+            num++;
+            goto retry;
+        }
+    }
+    return (status == QPL_STS_OK);
+}
+
+/**
+ * multifd_qpl_compress_pages_slow_path: compress pages using slow path
+ *
+ * Compress the pages using software. If compression fails, the page will
+ * be sent directly.
+ *
+ * @p: Params for the channel being used
+ */
+static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
+{
+    QplData *qpl = p->compress_data;
+    uint32_t size = p->page_size;
+    qpl_job *job = qpl->sw_job;
+    uint8_t *zbuf = qpl->zbuf;
+    uint8_t *buf;
+
+    for (int i = 0; i < p->pages->normal_num; i++) {
+        buf = p->pages->block->host + p->pages->offset[i];
+        /* Set output length to less than the page to reduce decompression */
+        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
+        if (qpl_execute_job(job) == QPL_STS_OK) {
+            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
+        } else {
+            /* send the page directly */
+            multifd_qpl_fill_packet(i, p, buf, size);
+        }
+        zbuf += size;
+    }
+}
+
+/**
+ * multifd_qpl_compress_pages: compress pages
+ *
+ * Submit the pages to the IAA hardware for compression. If hardware
+ * compression fails, it falls back to software compression. If software
+ * compression also fails, the page is sent directly
+ *
+ * @p: Params for the channel being used
+ */
+static void multifd_qpl_compress_pages(MultiFDSendParams *p)
+{
+    QplData *qpl = p->compress_data;
+    MultiFDPages_t *pages = p->pages;
+    uint32_t size = p->page_size;
+    QplHwJob *hw_job;
+    uint8_t *buf;
+    uint8_t *zbuf;
+
+    for (int i = 0; i < pages->normal_num; i++) {
+        buf = pages->block->host + pages->offset[i];
+        zbuf = qpl->zbuf + (size * i);
+        hw_job = &qpl->hw_jobs[i];
+        /* Set output length to less than the page to reduce decompression */
+        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
+        if (multifd_qpl_submit_job(hw_job->job)) {
+            hw_job->fallback_sw_path = false;
+        } else {
+            hw_job->fallback_sw_path = true;
+            /* Set output length less than page size to reduce decompression */
+            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
+                                         size - 1);
+            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
+                hw_job->sw_output = zbuf;
+                hw_job->sw_output_len = qpl->sw_job->total_out;
+            } else {
+                hw_job->sw_output = buf;
+                hw_job->sw_output_len = size;
+            }
+        }
+    }
+
+    for (int i = 0; i < pages->normal_num; i++) {
+        buf = pages->block->host + pages->offset[i];
+        zbuf = qpl->zbuf + (size * i);
+        hw_job = &qpl->hw_jobs[i];
+        if (hw_job->fallback_sw_path) {
+            multifd_qpl_fill_packet(i, p, hw_job->sw_output,
+                                    hw_job->sw_output_len);
+            continue;
+        }
+        if (qpl_wait_job(hw_job->job) == QPL_STS_OK) {
+            multifd_qpl_fill_packet(i, p, zbuf, hw_job->job->total_out);
+        } else {
+            /* send the page directly */
+            multifd_qpl_fill_packet(i, p, buf, size);
+        }
+    }
+}
+
 /**
  * multifd_qpl_send_prepare: prepare data to be able to send
  *
@@ -273,8 +491,26 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
  */
 static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
 {
-    /* Implement in next patch */
-    return -1;
+    QplData *qpl = p->compress_data;
+    uint32_t len = 0;
+
+    if (!multifd_send_prepare_common(p)) {
+        goto out;
+    }
+
+    /* The first IOV is used to store the compressed page lengths */
+    len = p->pages->normal_num * sizeof(uint32_t);
+    multifd_qpl_fill_iov(p, (uint8_t *) qpl->zlen, len);
+    if (qpl->hw_avail) {
+        multifd_qpl_compress_pages(p);
+    } else {
+        multifd_qpl_compress_pages_slow_path(p);
+    }
+
+out:
+    p->flags |= MULTIFD_FLAG_QPL;
+    multifd_send_fill_packet(p);
+    return 0;
 }
 
 /**
@@ -312,6 +548,134 @@ static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
     p->compress_data = NULL;
 }
 
+/**
+ * multifd_qpl_process_and_check_job: process and check a QPL job
+ *
+ * Process the job and check whether the job output length is the
+ * same as the specified length
+ *
+ * Returns true if the job execution succeeded and the output length
+ * is equal to the specified length, otherwise false.
+ *
+ * @job: pointer to the qpl_job structure
+ * @is_hardware: indicates whether the job is a hardware job
+ * @len: Specified output length
+ * @errp: pointer to an error
+ */
+static bool multifd_qpl_process_and_check_job(qpl_job *job, bool is_hardware,
+                                              uint32_t len, Error **errp)
+{
+    qpl_status status;
+
+    status = (is_hardware ? qpl_wait_job(job) : qpl_execute_job(job));
+    if (status != QPL_STS_OK) {
+        error_setg(errp, "qpl_execute_job failed with error %d", status);
+        return false;
+    }
+    if (job->total_out != len) {
+        error_setg(errp, "qpl decompressed len %u, expected len %u",
+                   job->total_out, len);
+        return false;
+    }
+    return true;
+}
+
+/**
+ * multifd_qpl_decompress_pages_slow_path: decompress pages using slow path
+ *
+ * Decompress the pages using software
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_decompress_pages_slow_path(MultiFDRecvParams *p,
+                                                  Error **errp)
+{
+    QplData *qpl = p->compress_data;
+    uint32_t size = p->page_size;
+    qpl_job *job = qpl->sw_job;
+    uint8_t *zbuf = qpl->zbuf;
+    uint8_t *addr;
+    uint32_t len;
+
+    for (int i = 0; i < p->normal_num; i++) {
+        len = qpl->zlen[i];
+        addr = p->host + p->normal[i];
+        /* the page is uncompressed, load it */
+        if (len == size) {
+            memcpy(addr, zbuf, size);
+            zbuf += size;
+            continue;
+        }
+        multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
+        if (!multifd_qpl_process_and_check_job(job, false, size, errp)) {
+            return -1;
+        }
+        zbuf += len;
+    }
+    return 0;
+}
+
+/**
+ * multifd_qpl_decompress_pages: decompress pages
+ *
+ * Decompress the pages using the IAA hardware. If hardware
+ * decompression fails, it falls back to software decompression.
+ *
+ * Returns 0 on success or -1 on error
+ *
+ * @p: Params for the channel being used
+ * @errp: pointer to an error
+ */
+static int multifd_qpl_decompress_pages(MultiFDRecvParams *p, Error **errp)
+{
+    QplData *qpl = p->compress_data;
+    uint32_t size = p->page_size;
+    uint8_t *zbuf = qpl->zbuf;
+    uint8_t *addr;
+    uint32_t len;
+    qpl_job *job;
+
+    for (int i = 0; i < p->normal_num; i++) {
+        addr = p->host + p->normal[i];
+        len = qpl->zlen[i];
+        /* the page is uncompressed if received length equals the page size */
+        if (len == size) {
+            memcpy(addr, zbuf, size);
+            zbuf += size;
+            continue;
+        }
+
+        job = qpl->hw_jobs[i].job;
+        multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
+        if (multifd_qpl_submit_job(job)) {
+            qpl->hw_jobs[i].fallback_sw_path = false;
+        } else {
+            qpl->hw_jobs[i].fallback_sw_path = true;
+            job = qpl->sw_job;
+            multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
+            if (!multifd_qpl_process_and_check_job(job, false, size, errp)) {
+                return -1;
+            }
+        }
+        zbuf += len;
+    }
+
+    for (int i = 0; i < p->normal_num; i++) {
+        /* ignore pages that have already been processed */
+        if (qpl->zlen[i] == size || qpl->hw_jobs[i].fallback_sw_path) {
+            continue;
+        }
+
+        job = qpl->hw_jobs[i].job;
+        if (!multifd_qpl_process_and_check_job(job, true, size, errp)) {
+            return -1;
+        }
+    }
+    return 0;
+}
 /**
  * multifd_qpl_recv: read the data from the channel into actual pages
  *
@@ -325,8 +689,48 @@ static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
  */
 static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp)
 {
-    /* Implement in next patch */
-    return -1;
+    QplData *qpl = p->compress_data;
+    uint32_t in_size = p->next_packet_size;
+    uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
+    uint32_t len = 0;
+    uint32_t zbuf_len = 0;
+    int ret;
+
+    if (flags != MULTIFD_FLAG_QPL) {
+        error_setg(errp, "multifd %u: flags received %x flags expected %x",
+                   p->id, flags, MULTIFD_FLAG_QPL);
+        return -1;
+    }
+    multifd_recv_zero_page_process(p);
+    if (!p->normal_num) {
+        assert(in_size == 0);
+        return 0;
+    }
+
+    /* read compressed page lengths */
+    len = p->normal_num * sizeof(uint32_t);
+    assert(len < in_size);
+    ret = qio_channel_read_all(p->c, (void *) qpl->zlen, len, errp);
+    if (ret != 0) {
+        return ret;
+    }
+    for (int i = 0; i < p->normal_num; i++) {
+        qpl->zlen[i] = be32_to_cpu(qpl->zlen[i]);
+        assert(qpl->zlen[i] <= p->page_size);
+        zbuf_len += qpl->zlen[i];
+    }
+
+    /* read compressed pages */
+    assert(in_size == len + zbuf_len);
+    ret = qio_channel_read_all(p->c, (void *) qpl->zbuf, zbuf_len, errp);
+    if (ret != 0) {
+        return ret;
+    }
+
+    if (qpl->hw_avail) {
+        return multifd_qpl_decompress_pages(p, errp);
+    }
+    return multifd_qpl_decompress_pages_slow_path(p, errp);
 }
 
 static MultiFDMethods multifd_qpl_ops = {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v7 7/7] tests/migration-test: add qpl compression test
  2024-06-03 15:40 [PATCH v7 0/7] Live Migration With IAA Yuan Liu
                   ` (5 preceding siblings ...)
  2024-06-03 15:41 ` [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression Yuan Liu
@ 2024-06-03 15:41 ` Yuan Liu
  2024-06-05 22:26   ` Fabiano Rosas
  6 siblings, 1 reply; 18+ messages in thread
From: Yuan Liu @ 2024-06-03 15:41 UTC (permalink / raw)
  To: peterx, farosas, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

add qpl to the compression method tests for multifd migration

the qpl compression supports a software path and a hardware
path (IAA device); the hardware path is used first by
default. If the hardware path is unavailable, it will
automatically fall back to the software path for testing.

Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 tests/qtest/migration-test.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b7e3406471..ef0c3f5e28 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2661,6 +2661,15 @@ test_migrate_precopy_tcp_multifd_zstd_start(QTestState *from,
 }
 #endif /* CONFIG_ZSTD */
 
+#ifdef CONFIG_QPL
+static void *
+test_migrate_precopy_tcp_multifd_qpl_start(QTestState *from,
+                                            QTestState *to)
+{
+    return test_migrate_precopy_tcp_multifd_start_common(from, to, "qpl");
+}
+#endif /* CONFIG_QPL */
+
 static void test_multifd_tcp_uri_none(void)
 {
     MigrateCommon args = {
@@ -2741,6 +2750,17 @@ static void test_multifd_tcp_zstd(void)
 }
 #endif
 
+#ifdef CONFIG_QPL
+static void test_multifd_tcp_qpl(void)
+{
+    MigrateCommon args = {
+        .listen_uri = "defer",
+        .start_hook = test_migrate_precopy_tcp_multifd_qpl_start,
+    };
+    test_precopy_common(&args);
+}
+#endif
+
 #ifdef CONFIG_GNUTLS
 static void *
 test_migrate_multifd_tcp_tls_psk_start_match(QTestState *from,
@@ -3626,6 +3646,10 @@ int main(int argc, char **argv)
     migration_test_add("/migration/multifd/tcp/plain/zstd",
                        test_multifd_tcp_zstd);
 #endif
+#ifdef CONFIG_QPL
+    migration_test_add("/migration/multifd/tcp/plain/qpl",
+                       test_multifd_tcp_qpl);
+#endif
 #ifdef CONFIG_GNUTLS
     migration_test_add("/migration/multifd/tcp/tls/psk/match",
                        test_multifd_tcp_tls_psk_match);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v7 1/7] docs/migration: add qpl compression feature
  2024-06-03 15:41 ` [PATCH v7 1/7] docs/migration: add qpl compression feature Yuan Liu
@ 2024-06-04 20:19   ` Peter Xu
  2024-06-05 19:59   ` Fabiano Rosas
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Xu @ 2024-06-04 20:19 UTC (permalink / raw)
  To: Yuan Liu
  Cc: farosas, pbonzini, marcandre.lureau, berrange, thuth, philmd,
	qemu-devel, nanhai.zou, shameerali.kolothum.thodi

On Mon, Jun 03, 2024 at 11:41:00PM +0800, Yuan Liu wrote:
> add Intel Query Processing Library (QPL) compression method
> introduction
> 
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v7 1/7] docs/migration: add qpl compression feature
  2024-06-03 15:41 ` [PATCH v7 1/7] docs/migration: add qpl compression feature Yuan Liu
  2024-06-04 20:19   ` Peter Xu
@ 2024-06-05 19:59   ` Fabiano Rosas
  2024-06-06  7:03     ` Liu, Yuan1
  1 sibling, 1 reply; 18+ messages in thread
From: Fabiano Rosas @ 2024-06-05 19:59 UTC (permalink / raw)
  To: Yuan Liu, peterx, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

Yuan Liu <yuan1.liu@intel.com> writes:

> add Intel Query Processing Library (QPL) compression method
> introduction
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>

Just some nits if you need to respin. Otherwise I can touch up in the
migration tree.

Reviewed-by: Fabiano Rosas <farosas@suse.de>

> ---
>  docs/devel/migration/features.rst        |   1 +
>  docs/devel/migration/qpl-compression.rst | 262 +++++++++++++++++++++++
>  2 files changed, 263 insertions(+)
>  create mode 100644 docs/devel/migration/qpl-compression.rst
>
> diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
> index d5ca7b86d5..bc98b65075 100644
> --- a/docs/devel/migration/features.rst
> +++ b/docs/devel/migration/features.rst
> @@ -12,3 +12,4 @@ Migration has plenty of features to support different use cases.
>     virtio
>     mapped-ram
>     CPR
> +   qpl-compression
> diff --git a/docs/devel/migration/qpl-compression.rst b/docs/devel/migration/qpl-compression.rst
> new file mode 100644
> index 0000000000..13fb7a67b1
> --- /dev/null
> +++ b/docs/devel/migration/qpl-compression.rst
> @@ -0,0 +1,262 @@
> +===============
> +QPL Compression
> +===============
> +The Intel Query Processing Library (Intel ``QPL``) is an open-source library to
> +provide compression and decompression features and it is based on deflate
> +compression algorithm (RFC 1951).
> +
> +The ``QPL`` compression relies on the Intel In-Memory Analytics Accelerator
> +(``IAA``) and Shared Virtual Memory (``SVM``) technology; they are new features
> +supported by 4th Gen Intel Xeon Scalable processors, codenamed Sapphire Rapids
> +(``SPR``).
> +
> +For more information about ``QPL``, please refer to `QPL Introduction
> +<https://intel.github.io/qpl/documentation/introduction_docs/introduction.html>`_
> +
> +QPL Compression Framework
> +=========================
> +
> +::
> +
> +  +----------------+       +------------------+
> +  | MultiFD Thread |       |accel-config tool |
> +  +-------+--------+       +--------+---------+
> +          |                         |
> +          |                         |
> +          |compress/decompress      |
> +  +-------+--------+                | Setup IAA
> +  |  QPL library   |                | Resources
> +  +-------+---+----+                |
> +          |   |                     |
> +          |   +-------------+-------+
> +          |   Open IAA      |
> +          |   Devices +-----+-----+
> +          |           |idxd driver|
> +          |           +-----+-----+
> +          |                 |
> +          |                 |
> +          |           +-----+-----+
> +          +-----------+IAA Devices|
> +      Submit jobs     +-----------+
> +      via enqcmd
> +
> +
> +QPL Build And Installation
> +--------------------------
> +
> +.. code-block:: shell
> +
> +  $git clone --recursive https://github.com/intel/qpl.git qpl
> +  $mkdir qpl/build
> +  $cd qpl/build
> +  $cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DQPL_LIBRARY_TYPE=SHARED ..
> +  $sudo cmake --build . --target install
> +
> +For more details about ``QPL`` installation, please refer to `QPL Installation
> +<https://intel.github.io/qpl/documentation/get_started_docs/installation.html>`_
> +
> +IAA Device Management
> +---------------------
> +
> +The number of ``IAA`` devices will vary depending on the Xeon product model.
> +On a ``SPR`` server, there can be a maximum of 8 ``IAA`` devices, with up to
> +4 devices per socket.
> +
> +By default, all ``IAA`` devices are disabled and need to be configured and
> +enabled by users manually.
> +
> +Check the number of devices through the following command
> +
> +.. code-block:: shell
> +
> +  #lspci -d 8086:0cfe
> +  6a:02.0 System peripheral: Intel Corporation Device 0cfe
> +  6f:02.0 System peripheral: Intel Corporation Device 0cfe
> +  74:02.0 System peripheral: Intel Corporation Device 0cfe
> +  79:02.0 System peripheral: Intel Corporation Device 0cfe
> +  e7:02.0 System peripheral: Intel Corporation Device 0cfe
> +  ec:02.0 System peripheral: Intel Corporation Device 0cfe
> +  f1:02.0 System peripheral: Intel Corporation Device 0cfe
> +  f6:02.0 System peripheral: Intel Corporation Device 0cfe
> +
> +IAA Device Configuration And Enabling
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The ``accel-config`` tool is used to enable ``IAA`` devices and configure
> +``IAA`` hardware resources(work queues and engines). One ``IAA`` device
> +has 8 work queues and 8 processing engines, multiple engines can be assigned
> +to a work queue via ``group`` attribute.
> +
> +For ``accel-config`` installation, please refer to `accel-config installation
> +<https://github.com/intel/idxd-config>`_
> +
> +One example of configuring and enabling an ``IAA`` device.
> +
> +.. code-block:: shell
> +
> +  #accel-config config-engine iax1/engine1.0 -g 0
> +  #accel-config config-engine iax1/engine1.1 -g 0
> +  #accel-config config-engine iax1/engine1.2 -g 0
> +  #accel-config config-engine iax1/engine1.3 -g 0
> +  #accel-config config-engine iax1/engine1.4 -g 0
> +  #accel-config config-engine iax1/engine1.5 -g 0
> +  #accel-config config-engine iax1/engine1.6 -g 0
> +  #accel-config config-engine iax1/engine1.7 -g 0
> +  #accel-config config-wq iax1/wq1.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m shared -y user -n app1 -d user
> +  #accel-config enable-device iax1
> +  #accel-config enable-wq iax1/wq1.0
> +
> +.. note::
> +   IAX is an early name for IAA
> +
> +- The ``IAA`` device index is 1, use ``ls -lh /sys/bus/dsa/devices/iax*``
> +  command to query the ``IAA`` device index.
> +
> +- 8 engines and 1 work queue are configured in group 0, so all compression jobs
> +  submitted to this work queue can be processed by all engines at the same time.
> +
> +- Set work queue attributes including the work mode, work queue size and so on.
> +
> +- Enable the ``IAA1`` device and work queue 1.0
> +
> +.. note::
> +
> +  Set work queue mode to shared mode, since ``QPL`` library only supports
> +  shared mode
> +
> +For more detailed configuration, please refer to `IAA Configuration Samples
> +<https://github.com/intel/idxd-config/tree/stable/Documentation/accfg>`_
> +
> +IAA Unit Test
> +^^^^^^^^^^^^^
> +
> +- Enabling ``IAA`` devices for Xeon platform, please refer to `IAA User Guide
> +  <https://www.intel.com/content/www/us/en/content-details/780887/intel-in-memory-analytics-accelerator-intel-iaa.html>`_
> +
> +- The ``IAA`` device driver is the Intel Data Accelerator Driver (idxd); it is
> +  recommended to use Linux kernel version 5.18 or later.
> +
> +- Add ``"intel_iommu=on,sm_on"`` parameter to kernel command line
> +  for ``SVM`` feature enabling.
> +
> +Here is an easy way to verify ``IAA`` device driver and ``SVM`` with `iaa_test
> +<https://github.com/intel/idxd-config/tree/stable/test>`_
> +
> +.. code-block:: shell
> +
> +  #./test/iaa_test
> +   [ info] alloc wq 0 shared size 128 addr 0x7f26cebe5000 batch sz 0xfffffffe xfer sz 0x80000000
> +   [ info] test noop: tflags 0x1 num_desc 1
> +   [ info] preparing descriptor for noop
> +   [ info] Submitted all noop jobs
> +   [ info] verifying task result for 0x16f7e20
> +   [ info] test with op 0 passed
> +
> +
> +IAA Resources Allocation For Migration
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +There are no ``IAA`` resource configuration parameters for migration, and
> +the ``accel-config`` tool configuration cannot directly specify the ``IAA``
> +resources used for migration.
> +
> +The multifd migration with the ``QPL`` compression method will use all work
> +queues that are enabled and in shared mode.
> +
> +.. note::
> +
> +  Accessing IAA resources requires ``sudo`` command or ``root`` privileges
> +  by default. Administrators can modify the IAA device node ownership
> +  so that Qemu can use IAA with specified user permissions.

QEMU

> +
> +  For example
> +
> +  #chown -R Qemu /dev/iax

qemu

> +
> +
> +Shared Virtual Memory(SVM) Introduction
> +=======================================
> +
> +``SVM`` is the ability for an accelerator I/O device to operate in the same
> +virtual memory space as applications on host processors. It also implies the
> +ability to operate from pageable memory, avoiding functional requirements
> +to pin memory for DMA operations.
> +
> +When using ``SVM`` technology, users do not need to reserve memory for the
> +``IAA`` device or pin memory for DMA. The ``IAA`` device can
> +directly access data using the virtual address of the process.
> +
> +For more details on ``SVM`` technology, please refer to
> +`Shared Virtual Addressing (SVA) with ENQCMD
> +<https://docs.kernel.org/next/x86/sva.html>`_
> +
> +
> +How To Use QPL Compression In Migration
> +=======================================
> +
> +1 - Installation of ``QPL`` library and ``accel-config`` library if using IAA
> +
> +2 - Configure and enable ``IAA`` devices and work queues via ``accel-config``
> +
> +3 - Build ``Qemu`` with ``--enable-qpl`` parameter

QEMU

> +
> +  E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-qpl``
> +
> +4 - Enable ``QPL`` compression during migration
> +
> +  Set ``migrate_set_parameter multifd-compression qpl`` when migrating. The
> +  ``QPL`` compression does not support configuring the compression level; it
> +  only supports one compression level.
> +
> +The Difference Between QPL And ZLIB
> +===================================
> +
> +Although both ``QPL`` and ``ZLIB`` are based on the deflate compression
> +algorithm, and ``QPL`` can support the header and tail of ``ZLIB``, ``QPL``
> +is still not fully compatible with the ``ZLIB`` compression in the migration.
> +
> +``QPL`` only supports 4K history buffer, and ``ZLIB`` is 32K by default. The
> +``ZLIB`` compressed data that ``QPL`` may not decompress correctly and
> +vice versa.

s/The ZLIB compressed/ZLIB compresses/

> +
> +``QPL`` does not support the ``Z_SYNC_FLUSH`` operation in ``ZLIB`` streaming
> +compression. The current ``ZLIB`` implementation uses ``Z_SYNC_FLUSH``, so each
> +``multifd`` thread has a ``ZLIB`` streaming context, and all page compression
> +and decompression are based on this stream. ``QPL`` cannot decompress such data
> +and vice versa.
> +
> +For an introduction to ``Z_SYNC_FLUSH``, please refer to the `Zlib Manual
> +<https://www.zlib.net/manual.html>`_
> +
> +The Best Practices
> +==================
> +When the user enables the IAA device for ``QPL`` compression, it is recommended
> +to add the ``-mem-prealloc`` parameter to the destination boot parameters. This
> +parameter can avoid the occurrence of I/O page faults and reduce the overhead
> +of IAA compression and decompression.
> +
> +The example of booting with ``-mem-prealloc`` parameter
> +
> +.. code-block:: shell
> +
> +   $qemu-system-x86_64 --enable-kvm -cpu host --mem-prealloc ...
> +
> +
> +An example of I/O page fault measurement on the destination without
> +``-mem-prealloc``; the ``svm_prq`` row indicates the number of I/O page fault
> +occurrences and the processing time.
> +
> +.. code-block:: shell
> +
> +  #echo 1 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> +  #echo 2 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> +  #echo 3 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> +  #echo 4 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> +  #cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
> +  IOMMU: dmar18 Register Base Address: c87fc000
> +                  <0.1us   0.1us-1us    1us-10us  10us-100us   100us-1ms    1ms-10ms      >=10ms     min(us)     max(us) average(us)
> +   inv_iotlb           0         286         123           0           0           0           0           0           1           0
> +  inv_devtlb           0         276         133           0           0           0           0           0           2           0
> +     inv_iec           0           0           0           0           0           0           0           0           0           0
> +     svm_prq           0           0       25206         364         395           0           0           1         556           9
> +


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v7 3/7] configure: add --enable-qpl build option
  2024-06-03 15:41 ` [PATCH v7 3/7] configure: add --enable-qpl build option Yuan Liu
@ 2024-06-05 20:04   ` Fabiano Rosas
  0 siblings, 0 replies; 18+ messages in thread
From: Fabiano Rosas @ 2024-06-05 20:04 UTC (permalink / raw)
  To: Yuan Liu, peterx, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

Yuan Liu <yuan1.liu@intel.com> writes:

> add --enable-qpl and --disable-qpl options to enable and disable
> the QPL compression method for multifd migration.
>
> The Query Processing Library (QPL) is an open-source library
> that supports data compression and decompression features. It
> is based on the deflate compression algorithm and use Intel
> In-Memory Analytics Accelerator(IAA) hardware for compression
> and decompression acceleration.
>
> For more live migration with IAA, please refer to the document
> docs/devel/migration/qpl-compression.rst
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v7 5/7] migration/multifd: implement initialization of qpl compression
  2024-06-03 15:41 ` [PATCH v7 5/7] migration/multifd: implement initialization of qpl compression Yuan Liu
@ 2024-06-05 20:19   ` Fabiano Rosas
  0 siblings, 0 replies; 18+ messages in thread
From: Fabiano Rosas @ 2024-06-05 20:19 UTC (permalink / raw)
  To: Yuan Liu, peterx, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

Yuan Liu <yuan1.liu@intel.com> writes:

> during initialization, a software job is allocated to each channel
> for software path fallabck when the IAA hardware is unavailable or
> the hardware job submission fails. If the IAA hardware is available,
> multiple hardware jobs are allocated for batch processing.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression
  2024-06-03 15:41 ` [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression Yuan Liu
@ 2024-06-05 22:25   ` Fabiano Rosas
  2024-06-06  6:12     ` Liu, Yuan1
  0 siblings, 1 reply; 18+ messages in thread
From: Fabiano Rosas @ 2024-06-05 22:25 UTC (permalink / raw)
  To: Yuan Liu, peterx, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

Yuan Liu <yuan1.liu@intel.com> writes:

> QPL compression and decompression will use IAA hardware first.
> If IAA hardware is not available, it will automatically fall
> back to QPL software path, if the software job also fails,
> the uncompressed page is sent directly.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> ---
>  migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 408 insertions(+), 4 deletions(-)
>
> diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> index 6791a204d5..18b3384bd5 100644
> --- a/migration/multifd-qpl.c
> +++ b/migration/multifd-qpl.c
> @@ -13,9 +13,14 @@
>  #include "qemu/osdep.h"
>  #include "qemu/module.h"
>  #include "qapi/error.h"
> +#include "qapi/qapi-types-migration.h"
> +#include "exec/ramblock.h"
>  #include "multifd.h"
>  #include "qpl/qpl.h"
>  
> +/* Maximum number of retries to resubmit a job if IAA work queues are full */
> +#define MAX_SUBMIT_RETRY_NUM (3)
> +
>  typedef struct {
>      /* the QPL hardware path job */
>      qpl_job *job;
> @@ -260,6 +265,219 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
>      p->iov = NULL;
>  }
>  
> +/**
> + * multifd_qpl_prepare_job: prepare the job
> + *
> + * Set the QPL job parameters and properties.
> + *
> + * @job: pointer to the qpl_job structure
> + * @is_compression: indicates compression and decompression
> + * @input: pointer to the input data buffer
> + * @input_len: the length of the input data
> + * @output: pointer to the output data buffer
> + * @output_len: the length of the output data
> + */
> +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
> +                                    uint8_t *input, uint32_t input_len,
> +                                    uint8_t *output, uint32_t output_len)
> +{
> +    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
> +    job->next_in_ptr = input;
> +    job->next_out_ptr = output;
> +    job->available_in = input_len;
> +    job->available_out = output_len;
> +    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
> +    /* only supports compression level 1 */
> +    job->level = 1;
> +}
> +
> +/**
> + * multifd_qpl_prepare_job: prepare the compression job

function name is wrong

> + *
> + * Set the compression job parameters and properties.
> + *
> + * @job: pointer to the qpl_job structure
> + * @input: pointer to the input data buffer
> + * @input_len: the length of the input data
> + * @output: pointer to the output data buffer
> + * @output_len: the length of the output data
> + */
> +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
> +                                         uint32_t input_len, uint8_t *output,
> +                                         uint32_t output_len)
> +{
> +    multifd_qpl_prepare_job(job, true, input, input_len, output, output_len);
> +}
> +
> +/**
> + * multifd_qpl_prepare_job: prepare the decompression job

here as well

> + *
> + * Set the decompression job parameters and properties.
> + *
> + * @job: pointer to the qpl_job structure
> + * @input: pointer to the input data buffer
> + * @input_len: the length of the input data
> + * @output: pointer to the output data buffer
> + * @output_len: the length of the output data
> + */
> +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t *input,
> +                                           uint32_t input_len, uint8_t *output,
> +                                           uint32_t output_len)
> +{
> +    multifd_qpl_prepare_job(job, false, input, input_len, output, output_len);
> +}
> +
> +/**
> + * multifd_qpl_fill_iov: fill in the IOV
> + *
> + * Fill in the QPL packet IOV
> + *
> + * @p: Params for the channel being used
> + * @data: pointer to the IOV data
> + * @len: The length of the IOV data
> + */
> +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
> +                                 uint32_t len)
> +{
> +    p->iov[p->iovs_num].iov_base = data;
> +    p->iov[p->iovs_num].iov_len = len;
> +    p->iovs_num++;
> +    p->next_packet_size += len;
> +}
> +
> +/**
> + * multifd_qpl_fill_packet: fill the compressed page into the QPL packet
> + *
> + * Fill the compressed page length and IOV into the QPL packet
> + *
> + * @idx: The index of the compressed length array
> + * @p: Params for the channel being used
> + * @data: pointer to the compressed page buffer
> + * @len: The length of the compressed page
> + */
> +static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
> +                                    uint8_t *data, uint32_t len)
> +{
> +    QplData *qpl = p->compress_data;
> +
> +    qpl->zlen[idx] = cpu_to_be32(len);
> +    multifd_qpl_fill_iov(p, data, len);
> +}
> +
> +/**
> + * multifd_qpl_submit_job: submit a job to the hardware
> + *
> + * Submit a QPL hardware job to the IAA device
> + *
> + * Returns true if the job is submitted successfully, otherwise false.
> + *
> + * @job: pointer to the qpl_job structure
> + */
> +static bool multifd_qpl_submit_job(qpl_job *job)
> +{
> +    qpl_status status;
> +    uint32_t num = 0;
> +
> +retry:
> +    status = qpl_submit_job(job);
> +    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
> +        if (num < MAX_SUBMIT_RETRY_NUM) {
> +            num++;
> +            goto retry;
> +        }
> +    }
> +    return (status == QPL_STS_OK);

How often do we expect this to fail? Will the queues be busy frequently
or is this an unlikely event? I'm thinking whether we really need to
allow a fallback for the hw path. Sorry if this has been discussed
already, I don't remember.

> +}
> +
> +/**
> + * multifd_qpl_compress_pages_slow_path: compress pages using slow path
> + *
> + * Compress the pages using software. If compression fails, the page will
> + * be sent directly.
> + *
> + * @p: Params for the channel being used
> + */
> +static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
> +{
> +    QplData *qpl = p->compress_data;
> +    uint32_t size = p->page_size;
> +    qpl_job *job = qpl->sw_job;
> +    uint8_t *zbuf = qpl->zbuf;
> +    uint8_t *buf;
> +
> +    for (int i = 0; i < p->pages->normal_num; i++) {
> +        buf = p->pages->block->host + p->pages->offset[i];
> +        /* Set output length to less than the page to reduce decompression */
> +        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
> +        if (qpl_execute_job(job) == QPL_STS_OK) {
> +            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> +        } else {
> +            /* send the page directly */

s/directly/uncompressed/

a bit clearer.

> +            multifd_qpl_fill_packet(i, p, buf, size);
> +        }
> +        zbuf += size;
> +    }
> +}
> +
> +/**
> + * multifd_qpl_compress_pages: compress pages
> + *
> + * Submit the pages to the IAA hardware for compression. If hardware
> + * compression fails, it falls back to software compression. If software
> + * compression also fails, the page is sent directly
> + *
> + * @p: Params for the channel being used
> + */
> +static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> +{
> +    QplData *qpl = p->compress_data;
> +    MultiFDPages_t *pages = p->pages;
> +    uint32_t size = p->page_size;
> +    QplHwJob *hw_job;
> +    uint8_t *buf;
> +    uint8_t *zbuf;
> +

Let's document the output size choice more explicitly:

    /*
     * Set output length to less than the page size to force the job to
     * fail in case it compresses to a larger size. We'll send that page
     * without compression and skip the decompression operation on the
     * destination.
     */
     out_size = size - 1;

you can then omit the other comments.

> +    for (int i = 0; i < pages->normal_num; i++) {
> +        buf = pages->block->host + pages->offset[i];
> +        zbuf = qpl->zbuf + (size * i);
> +        hw_job = &qpl->hw_jobs[i];
> +        /* Set output length to less than the page to reduce decompression */
> +        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
> +        if (multifd_qpl_submit_job(hw_job->job)) {
> +            hw_job->fallback_sw_path = false;
> +        } else {
> +            hw_job->fallback_sw_path = true;
> +            /* Set output length less than page size to reduce decompression */
> +            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
> +                                         size - 1);
> +            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
> +                hw_job->sw_output = zbuf;
> +                hw_job->sw_output_len = qpl->sw_job->total_out;
> +            } else {
> +                hw_job->sw_output = buf;
> +                hw_job->sw_output_len = size;
> +            }

Hmm, these look a bit cumbersome, would it work if we moved the fallback
qpl_execute_job() down into the other loop? We could then avoid the
extra fields. Something like:

static void multifd_qpl_compress_pages(MultiFDSendParams *p)
{
    QplData *qpl = p->compress_data;
    MultiFDPages_t *pages = p->pages;
    uint32_t out_size, size = p->page_size;
    uint8_t *buf, *zbuf;
    qpl_status ret;

    /*
     * Set output length to less than the page size to force the job to
     * fail in case it compresses to a larger size. We'll send that page
     * without compression to skip the decompression operation on the
     * destination.
     */
    out_size = size - 1;

    for (int i = 0; i < pages->normal_num; i++) {
        QplHwJob *hw_job = &qpl->hw_jobs[i];

        hw_job->fallback_sw_path = false;
        buf = pages->block->host + pages->offset[i];
        zbuf = qpl->zbuf + (size * i);

        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, out_size);

        if (!multifd_qpl_submit_job(hw_job->job)) {
            hw_job->fallback_sw_path = true;
        }
    }

    for (int i = 0; i < pages->normal_num; i++) {
        QplHwJob *hw_job = &qpl->hw_jobs[i];
        qpl_job *job;

        buf = pages->block->host + pages->offset[i];
        zbuf = qpl->zbuf + (size * i);

        if (hw_job->fallback_sw_path) {
            job = qpl->sw_job;
            multifd_qpl_prepare_comp_job(job, buf, size, zbuf, out_size);
            ret = qpl_execute_job(job);
        } else {            
            job = hw_job->job; 
            ret = qpl_wait_job(job);
        }

        if (ret == QPL_STS_OK) {
            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
        } else {
            multifd_qpl_fill_packet(i, p, buf, size);
        }
    }
}

> +        }
> +    }
> +
> +    for (int i = 0; i < pages->normal_num; i++) {
> +        buf = pages->block->host + pages->offset[i];
> +        zbuf = qpl->zbuf + (size * i);
> +        hw_job = &qpl->hw_jobs[i];
> +        if (hw_job->fallback_sw_path) {
> +            multifd_qpl_fill_packet(i, p, hw_job->sw_output,
> +                                    hw_job->sw_output_len);
> +            continue;
> +        }
> +        if (qpl_wait_job(hw_job->job) == QPL_STS_OK) {
> +            multifd_qpl_fill_packet(i, p, zbuf, hw_job->job->total_out);
> +        } else {
> +            /* send the page directly */
> +            multifd_qpl_fill_packet(i, p, buf, size);
> +        }
> +    }
> +}
> +
>  /**
>   * multifd_qpl_send_prepare: prepare data to be able to send
>   *
> @@ -273,8 +491,26 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
>   */
>  static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
>  {
> -    /* Implement in next patch */
> -    return -1;
> +    QplData *qpl = p->compress_data;
> +    uint32_t len = 0;
> +
> +    if (!multifd_send_prepare_common(p)) {
> +        goto out;
> +    }
> +
> +    /* The first IOV is used to store the compressed page lengths */
> +    len = p->pages->normal_num * sizeof(uint32_t);
> +    multifd_qpl_fill_iov(p, (uint8_t *) qpl->zlen, len);
> +    if (qpl->hw_avail) {
> +        multifd_qpl_compress_pages(p);
> +    } else {
> +        multifd_qpl_compress_pages_slow_path(p);
> +    }
> +
> +out:
> +    p->flags |= MULTIFD_FLAG_QPL;
> +    multifd_send_fill_packet(p);
> +    return 0;
>  }
>  
>  /**
> @@ -312,6 +548,134 @@ static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
>      p->compress_data = NULL;
>  }
>  
> +/**
> + * multifd_qpl_process_and_check_job: process and check a QPL job
> + *
> + * Process the job and check whether the job output length is the
> + * same as the specified length
> + *
> + * Returns true if the job execution succeeded and the output length
> + * is equal to the specified length, otherwise false.
> + *
> + * @job: pointer to the qpl_job structure
> + * @is_hardware: indicates whether the job is a hardware job
> + * @len: Specified output length
> + * @errp: pointer to an error
> + */
> +static bool multifd_qpl_process_and_check_job(qpl_job *job, bool is_hardware,
> +                                              uint32_t len, Error **errp)
> +{
> +    qpl_status status;
> +
> +    status = (is_hardware ? qpl_wait_job(job) : qpl_execute_job(job));
> +    if (status != QPL_STS_OK) {
> +        error_setg(errp, "qpl_execute_job failed with error %d", status);

The error message should also cover qpl_wait_job(), right? Maybe just
use "qpl job failed".

> +        return false;
> +    }
> +    if (job->total_out != len) {
> +        error_setg(errp, "qpl decompressed len %u, expected len %u",
> +                   job->total_out, len);
> +        return false;
> +    }
> +    return true;
> +}
> +
> +/**
> + * multifd_qpl_decompress_pages_slow_path: decompress pages using slow path
> + *
> + * Decompress the pages using software
> + *
> + * Returns 0 on success or -1 on error
> + *
> + * @p: Params for the channel being used
> + * @errp: pointer to an error
> + */
> +static int multifd_qpl_decompress_pages_slow_path(MultiFDRecvParams *p,
> +                                                  Error **errp)
> +{
> +    QplData *qpl = p->compress_data;
> +    uint32_t size = p->page_size;
> +    qpl_job *job = qpl->sw_job;
> +    uint8_t *zbuf = qpl->zbuf;
> +    uint8_t *addr;
> +    uint32_t len;
> +
> +    for (int i = 0; i < p->normal_num; i++) {
> +        len = qpl->zlen[i];
> +        addr = p->host + p->normal[i];
> +        /* the page is uncompressed, load it */
> +        if (len == size) {
> +            memcpy(addr, zbuf, size);
> +            zbuf += size;
> +            continue;
> +        }
> +        multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
> +        if (!multifd_qpl_process_and_check_job(job, false, size, errp)) {
> +            return -1;
> +        }
> +        zbuf += len;
> +    }
> +    return 0;
> +}
> +
> +/**
> + * multifd_qpl_decompress_pages: decompress pages
> + *
> + * Decompress the pages using the IAA hardware. If hardware
> + * decompression fails, it falls back to software decompression.
> + *
> + * Returns 0 on success or -1 on error
> + *
> + * @p: Params for the channel being used
> + * @errp: pointer to an error
> + */
> +static int multifd_qpl_decompress_pages(MultiFDRecvParams *p, Error **errp)
> +{
> +    QplData *qpl = p->compress_data;
> +    uint32_t size = p->page_size;
> +    uint8_t *zbuf = qpl->zbuf;
> +    uint8_t *addr;
> +    uint32_t len;
> +    qpl_job *job;
> +
> +    for (int i = 0; i < p->normal_num; i++) {
> +        addr = p->host + p->normal[i];
> +        len = qpl->zlen[i];
> +        /* the page is uncompressed if received length equals the page size */
> +        if (len == size) {
> +            memcpy(addr, zbuf, size);
> +            zbuf += size;
> +            continue;
> +        }
> +
> +        job = qpl->hw_jobs[i].job;
> +        multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
> +        if (multifd_qpl_submit_job(job)) {
> +            qpl->hw_jobs[i].fallback_sw_path = false;
> +        } else {
> +            qpl->hw_jobs[i].fallback_sw_path = true;
> +            job = qpl->sw_job;
> +            multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
> +            if (!multifd_qpl_process_and_check_job(job, false, size, errp)) {
> +                return -1;
> +            }

Here the same suggestion applies. You created
multifd_qpl_process_and_check_job() but is now calling it twice, which
seems to lose the purpose. If the fallback moves to the loop below, then
you do it all in one place:

    for (int i = 0; i < p->normal_num; i++) {
        bool is_sw = qpl->hw_jobs[i].fallback_sw_path;

        if (is_sw) {
            job = qpl->sw_job;
            multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
        } else {
            job = qpl->hw_jobs[i].job;
        }

        if (!multifd_qpl_process_and_check_job(job, !is_sw, size, errp)) {
            return -1;
        }
    }

> +        }
> +        zbuf += len;
> +    }
> +
> +    for (int i = 0; i < p->normal_num; i++) {
> +        /* ignore pages that have already been processed */
> +        if (qpl->zlen[i] == size || qpl->hw_jobs[i].fallback_sw_path) {
> +            continue;
> +        }
> +
> +        job = qpl->hw_jobs[i].job;
> +        if (!multifd_qpl_process_and_check_job(job, true, size, errp)) {
> +            return -1;
> +        }
> +    }
> +    return 0;
> +}
>  /**
>   * multifd_qpl_recv: read the data from the channel into actual pages
>   *
> @@ -325,8 +689,48 @@ static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
>   */
>  static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp)
>  {
> -    /* Implement in next patch */
> -    return -1;
> +    QplData *qpl = p->compress_data;
> +    uint32_t in_size = p->next_packet_size;
> +    uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> +    uint32_t len = 0;
> +    uint32_t zbuf_len = 0;
> +    int ret;
> +
> +    if (flags != MULTIFD_FLAG_QPL) {
> +        error_setg(errp, "multifd %u: flags received %x flags expected %x",
> +                   p->id, flags, MULTIFD_FLAG_QPL);
> +        return -1;
> +    }
> +    multifd_recv_zero_page_process(p);
> +    if (!p->normal_num) {
> +        assert(in_size == 0);
> +        return 0;
> +    }
> +
> +    /* read compressed page lengths */
> +    len = p->normal_num * sizeof(uint32_t);
> +    assert(len < in_size);
> +    ret = qio_channel_read_all(p->c, (void *) qpl->zlen, len, errp);
> +    if (ret != 0) {
> +        return ret;
> +    }
> +    for (int i = 0; i < p->normal_num; i++) {
> +        qpl->zlen[i] = be32_to_cpu(qpl->zlen[i]);
> +        assert(qpl->zlen[i] <= p->page_size);
> +        zbuf_len += qpl->zlen[i];
> +    }
> +
> +    /* read compressed pages */
> +    assert(in_size == len + zbuf_len);
> +    ret = qio_channel_read_all(p->c, (void *) qpl->zbuf, zbuf_len, errp);
> +    if (ret != 0) {
> +        return ret;
> +    }
> +
> +    if (qpl->hw_avail) {
> +        return multifd_qpl_decompress_pages(p, errp);
> +    }
> +    return multifd_qpl_decompress_pages_slow_path(p, errp);
>  }
>  
>  static MultiFDMethods multifd_qpl_ops = {


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v7 7/7] tests/migration-test: add qpl compression test
  2024-06-03 15:41 ` [PATCH v7 7/7] tests/migration-test: add qpl compression test Yuan Liu
@ 2024-06-05 22:26   ` Fabiano Rosas
  0 siblings, 0 replies; 18+ messages in thread
From: Fabiano Rosas @ 2024-06-05 22:26 UTC (permalink / raw)
  To: Yuan Liu, peterx, pbonzini, marcandre.lureau, berrange, thuth,
	philmd
  Cc: qemu-devel, yuan1.liu, nanhai.zou, shameerali.kolothum.thodi

Yuan Liu <yuan1.liu@intel.com> writes:

> add qpl to compression method test for multifd migration
>
> the qpl compression supports software path and hardware
> path(IAA device), and the hardware path is used first by
> default. If the hardware path is unavailable, it will
> automatically fallback to the software path for testing.
>
> Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression
  2024-06-05 22:25   ` Fabiano Rosas
@ 2024-06-06  6:12     ` Liu, Yuan1
  2024-06-06 13:51       ` Fabiano Rosas
  0 siblings, 1 reply; 18+ messages in thread
From: Liu, Yuan1 @ 2024-06-06  6:12 UTC (permalink / raw)
  To: Fabiano Rosas, peterx@redhat.com, pbonzini@redhat.com,
	marcandre.lureau@redhat.com, berrange@redhat.com,
	thuth@redhat.com, philmd@linaro.org
  Cc: qemu-devel@nongnu.org, Zou, Nanhai,
	shameerali.kolothum.thodi@huawei.com

> -----Original Message-----
> From: Fabiano Rosas <farosas@suse.de>
> Sent: Thursday, June 6, 2024 6:26 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com;
> pbonzini@redhat.com; marcandre.lureau@redhat.com; berrange@redhat.com;
> thuth@redhat.com; philmd@linaro.org
> Cc: qemu-devel@nongnu.org; Liu, Yuan1 <yuan1.liu@intel.com>; Zou, Nanhai
> <nanhai.zou@intel.com>; shameerali.kolothum.thodi@huawei.com
> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl compression
> and decompression
> 
> Yuan Liu <yuan1.liu@intel.com> writes:
> 
> > QPL compression and decompression will use IAA hardware first.
> > If IAA hardware is not available, it will automatically fall
> > back to QPL software path, if the software job also fails,
> > the uncompressed page is sent directly.
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> > ---
> >  migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 408 insertions(+), 4 deletions(-)
> >
> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> > index 6791a204d5..18b3384bd5 100644
> > --- a/migration/multifd-qpl.c
> > +++ b/migration/multifd-qpl.c
> > @@ -13,9 +13,14 @@
> >  #include "qemu/osdep.h"
> >  #include "qemu/module.h"
> >  #include "qapi/error.h"
> > +#include "qapi/qapi-types-migration.h"
> > +#include "exec/ramblock.h"
> >  #include "multifd.h"
> >  #include "qpl/qpl.h"
> >
> > +/* Maximum number of retries to resubmit a job if IAA work queues are
> full */
> > +#define MAX_SUBMIT_RETRY_NUM (3)
> > +
> >  typedef struct {
> >      /* the QPL hardware path job */
> >      qpl_job *job;
> > @@ -260,6 +265,219 @@ static void
> multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> >      p->iov = NULL;
> >  }
> >
> > +/**
> > + * multifd_qpl_prepare_job: prepare the job
> > + *
> > + * Set the QPL job parameters and properties.
> > + *
> > + * @job: pointer to the qpl_job structure
> > + * @is_compression: indicates compression and decompression
> > + * @input: pointer to the input data buffer
> > + * @input_len: the length of the input data
> > + * @output: pointer to the output data buffer
> > + * @output_len: the length of the output data
> > + */
> > +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
> > +                                    uint8_t *input, uint32_t input_len,
> > +                                    uint8_t *output, uint32_t
> output_len)
> > +{
> > +    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
> > +    job->next_in_ptr = input;
> > +    job->next_out_ptr = output;
> > +    job->available_in = input_len;
> > +    job->available_out = output_len;
> > +    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
> > +    /* only supports compression level 1 */
> > +    job->level = 1;
> > +}
> > +
> > +/**
> > + * multifd_qpl_prepare_job: prepare the compression job
> 
> function name is wrong

Thanks, I will fix this next version.
 
> > + *
> > + * Set the compression job parameters and properties.
> > + *
> > + * @job: pointer to the qpl_job structure
> > + * @input: pointer to the input data buffer
> > + * @input_len: the length of the input data
> > + * @output: pointer to the output data buffer
> > + * @output_len: the length of the output data
> > + */
> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
> > +                                         uint32_t input_len, uint8_t
> *output,
> > +                                         uint32_t output_len)
> > +{
> > +    multifd_qpl_prepare_job(job, true, input, input_len, output,
> output_len);
> > +}
> > +
> > +/**
> > + * multifd_qpl_prepare_job: prepare the decompression job

Thanks, I will fix this next version.
 
> > + *
> > + * Set the decompression job parameters and properties.
> > + *
> > + * @job: pointer to the qpl_job structure
> > + * @input: pointer to the input data buffer
> > + * @input_len: the length of the input data
> > + * @output: pointer to the output data buffer
> > + * @output_len: the length of the output data
> > + */
> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t
> *input,
> > +                                           uint32_t input_len, uint8_t
> *output,
> > +                                           uint32_t output_len)
> > +{
> > +    multifd_qpl_prepare_job(job, false, input, input_len, output,
> output_len);
> > +}
> > +
> > +/**
> > + * multifd_qpl_fill_iov: fill in the IOV
> > + *
> > + * Fill in the QPL packet IOV
> > + *
> > + * @p: Params for the channel being used
> > + * @data: pointer to the IOV data
> > + * @len: The length of the IOV data
> > + */
> > +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
> > +                                 uint32_t len)
> > +{
> > +    p->iov[p->iovs_num].iov_base = data;
> > +    p->iov[p->iovs_num].iov_len = len;
> > +    p->iovs_num++;
> > +    p->next_packet_size += len;
> > +}
> > +
> > +/**
> > + * multifd_qpl_fill_packet: fill the compressed page into the QPL
> packet
> > + *
> > + * Fill the compressed page length and IOV into the QPL packet
> > + *
> > + * @idx: The index of the compressed length array
> > + * @p: Params for the channel being used
> > + * @data: pointer to the compressed page buffer
> > + * @len: The length of the compressed page
> > + */
> > +static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
> > +                                    uint8_t *data, uint32_t len)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +
> > +    qpl->zlen[idx] = cpu_to_be32(len);
> > +    multifd_qpl_fill_iov(p, data, len);
> > +}
> > +
> > +/**
> > + * multifd_qpl_submit_job: submit a job to the hardware
> > + *
> > + * Submit a QPL hardware job to the IAA device
> > + *
> > + * Returns true if the job is submitted successfully, otherwise false.
> > + *
> > + * @job: pointer to the qpl_job structure
> > + */
> > +static bool multifd_qpl_submit_job(qpl_job *job)
> > +{
> > +    qpl_status status;
> > +    uint32_t num = 0;
> > +
> > +retry:
> > +    status = qpl_submit_job(job);
> > +    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
> > +        if (num < MAX_SUBMIT_RETRY_NUM) {
> > +            num++;
> > +            goto retry;
> > +        }
> > +    }
> > +    return (status == QPL_STS_OK);
> 
> How often do we expect this to fail? Will the queues be busy frequently
> or is this an unlikely event? I'm thinking whether we really need to
> allow a fallback for the hw path. Sorry if this has been discussed
> already, I don't remember.

In some scenarios this may happen frequently, such as when 4 channels are
configured but only one IAA device is available. In the case of insufficient
IAA hardware resources, retry and fallback can help optimize performance.
I have a comparison test below:

1. Retry + SW fallback:
   total time: 14649 ms
   downtime: 25 ms
   throughput: 17666.57 mbps
   pages-per-second: 1509647

2. No fallback, always wait for work queues to become available
   total time: 18381 ms
   downtime: 25 ms
   throughput: 13698.65 mbps
   pages-per-second: 859607
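
For reference, case 2 corresponds to a hypothetical "no fallback" variant of
multifd_qpl_submit_job() that just keeps resubmitting until a work queue
accepts the job, a sketch for illustration only (not part of this patch):

static bool multifd_qpl_submit_job_wait(qpl_job *job)
{
    qpl_status status;

    /* keep resubmitting while all enabled IAA work queues are busy */
    do {
        status = qpl_submit_job(job);
    } while (status == QPL_STS_QUEUES_ARE_BUSY_ERR);

    return (status == QPL_STS_OK);
}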

> > +}
> > +
> > +/**
> > + * multifd_qpl_compress_pages_slow_path: compress pages using slow path
> > + *
> > + * Compress the pages using software. If compression fails, the page will
> > + * be sent directly.
> > + *
> > + * @p: Params for the channel being used
> > + */
> > +static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +    uint32_t size = p->page_size;
> > +    qpl_job *job = qpl->sw_job;
> > +    uint8_t *zbuf = qpl->zbuf;
> > +    uint8_t *buf;
> > +
> > +    for (int i = 0; i < p->pages->normal_num; i++) {
> > +        buf = p->pages->block->host + p->pages->offset[i];
> > +        /* Set output length to less than the page to reduce decompression */
> > +        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
> > +        if (qpl_execute_job(job) == QPL_STS_OK) {
> > +            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> > +        } else {
> > +            /* send the page directly */
> 
> s/directly/uncompressed/
> 
> a bit clearer.

Sure, I will fix it next version. 

> > +            multifd_qpl_fill_packet(i, p, buf, size);
> > +        }
> > +        zbuf += size;
> > +    }
> > +}
> > +
> > +/**
> > + * multifd_qpl_compress_pages: compress pages
> > + *
> > + * Submit the pages to the IAA hardware for compression. If hardware
> > + * compression fails, it falls back to software compression. If software
> > + * compression also fails, the page is sent directly
> > + *
> > + * @p: Params for the channel being used
> > + */
> > +static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +    MultiFDPages_t *pages = p->pages;
> > +    uint32_t size = p->page_size;
> > +    QplHwJob *hw_job;
> > +    uint8_t *buf;
> > +    uint8_t *zbuf;
> > +
> 
> Let's document the output size choice more explicitly:
> 
>     /*
>      * Set output length to less than the page size to force the job to
>      * fail in case it compresses to a larger size. We'll send that page
>      * without compression and skip the decompression operation on the
>      * destination.
>      */
>      out_size = size - 1;
> 
> you can then omit the other comments.

Thanks for the comments, I will refine this next version.
 
> > +    for (int i = 0; i < pages->normal_num; i++) {
> > +        buf = pages->block->host + pages->offset[i];
> > +        zbuf = qpl->zbuf + (size * i);
> > +        hw_job = &qpl->hw_jobs[i];
> > +        /* Set output length to less than the page to reduce decompression */
> > +        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
> > +        if (multifd_qpl_submit_job(hw_job->job)) {
> > +            hw_job->fallback_sw_path = false;
> > +        } else {
> > +            hw_job->fallback_sw_path = true;
> > +            /* Set output length less than page size to reduce decompression */
> > +            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
> > +                                         size - 1);
> > +            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
> > +                hw_job->sw_output = zbuf;
> > +                hw_job->sw_output_len = qpl->sw_job->total_out;
> > +            } else {
> > +                hw_job->sw_output = buf;
> > +                hw_job->sw_output_len = size;
> > +            }
> 
> Hmm, these look a bit cumbersome, would it work if we moved the fallback
> qpl_execute_job() down into the other loop? We could then avoid the
> extra fields. Something like:
> 
> static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> {
>     QplData *qpl = p->compress_data;
>     MultiFDPages_t *pages = p->pages;
>     uint32_t out_size, size = p->page_size;
>     uint8_t *buf, *zbuf;
> 
>     /*
>      * Set output length to less than the page size to force the job to
>      * fail in case it compresses to a larger size. We'll send that page
>      * without compression to skip the decompression operation on the
>      * destination.
>      */
>     out_size = size - 1;
> 
>     for (int i = 0; i < pages->normal_num; i++) {
>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> 
>         hw_job->fallback_sw_path = false;
>         buf = pages->block->host + pages->offset[i];
>         zbuf = qpl->zbuf + (size * i);
> 
>         multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, out_size);
> 
>         if (!multifd_qpl_submit_job(hw_job->job)) {
>             hw_job->fallback_sw_path = true;
>         }
>     }
> 
>     for (int i = 0; i < pages->normal_num; i++) {
>         QplHwJob *hw_job = &qpl->hw_jobs[i];
>         qpl_job *job;
>         qpl_status ret;
> 
>         buf = pages->block->host + pages->offset[i];
>         zbuf = qpl->zbuf + (size * i);
> 
>         if (hw_job->fallback_sw_path) {
>             job = qpl->sw_job;
>             multifd_qpl_prepare_comp_job(job, buf, size, zbuf, out_size);
>             ret = qpl_execute_job(job);
>         } else {
>             job = hw_job->job;
>             ret = qpl_wait_job(job);
>         }
> 
>         if (ret == QPL_STS_OK) {
>             multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
>         } else {
>             multifd_qpl_fill_packet(i, p, buf, size);
>         }
>     }
> }

Thanks very much for the reference code, I have tested the code and the performance is not good.
When the work queue is full, after a hardware job fails to be submitted, the subsequent
job submission will most likely fail as well. So my idea is to use software job execution
instead immediately, but all subsequent jobs will still give priority to the hardware path.

There is almost no overhead in job submission because Intel uses the new "enqcmd" instruction,
which allows the user program to submit the job directly to the hardware.
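
For illustration, a minimal sketch of such a user-space submission, assuming
the _enqcmd() intrinsic from immintrin.h plus a hypothetical 64-byte
descriptor and mapped work queue portal (not the actual QPL internals):

#include <immintrin.h>

/* desc points to a 64-byte command descriptor and wq_portal to the
 * device's mapped submission portal; _enqcmd() returns non-zero when
 * the work queue did not accept the command, in which case the caller
 * retries or falls back to the software path. */
static int submit_once(void *wq_portal, const void *desc)
{
    return _enqcmd(wq_portal, desc);
}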

According to the implementation of the reference code, when a job fails to be submitted, there 
is a high probability that "ALL" subsequent jobs will fail to be submitted and then use software
compression, resulting in the IAA hardware not being fully utilized.

For the 4-channel, 1 IAA device test case, using the reference code will reduce IAA throughput
from 3.4GBps to 2.2GBps, thus affecting live migration performance (total time from 14s to 18s).

> > +        }
> > +    }
> > +
> > +    for (int i = 0; i < pages->normal_num; i++) {
> > +        buf = pages->block->host + pages->offset[i];
> > +        zbuf = qpl->zbuf + (size * i);
> > +        hw_job = &qpl->hw_jobs[i];
> > +        if (hw_job->fallback_sw_path) {
> > +            multifd_qpl_fill_packet(i, p, hw_job->sw_output,
> > +                                    hw_job->sw_output_len);
> > +            continue;
> > +        }
> > +        if (qpl_wait_job(hw_job->job) == QPL_STS_OK) {
> > +            multifd_qpl_fill_packet(i, p, zbuf, hw_job->job->total_out);
> > +        } else {
> > +            /* send the page directly */
> > +            multifd_qpl_fill_packet(i, p, buf, size);
> > +        }
> > +    }
> > +}
> > +
> >  /**
> >   * multifd_qpl_send_prepare: prepare data to be able to send
> >   *
> > @@ -273,8 +491,26 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> >   */
> >  static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp)
> >  {
> > -    /* Implement in next patch */
> > -    return -1;
> > +    QplData *qpl = p->compress_data;
> > +    uint32_t len = 0;
> > +
> > +    if (!multifd_send_prepare_common(p)) {
> > +        goto out;
> > +    }
> > +
> > +    /* The first IOV is used to store the compressed page lengths */
> > +    len = p->pages->normal_num * sizeof(uint32_t);
> > +    multifd_qpl_fill_iov(p, (uint8_t *) qpl->zlen, len);
> > +    if (qpl->hw_avail) {
> > +        multifd_qpl_compress_pages(p);
> > +    } else {
> > +        multifd_qpl_compress_pages_slow_path(p);
> > +    }
> > +
> > +out:
> > +    p->flags |= MULTIFD_FLAG_QPL;
> > +    multifd_send_fill_packet(p);
> > +    return 0;
> >  }
> >
> >  /**
> > @@ -312,6 +548,134 @@ static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
> >      p->compress_data = NULL;
> >  }
> >
> > +/**
> > + * multifd_qpl_process_and_check_job: process and check a QPL job
> > + *
> > + * Process the job and check whether the job output length is the
> > + * same as the specified length
> > + *
> > + * Returns true if the job execution succeeded and the output length
> > + * is equal to the specified length, otherwise false.
> > + *
> > + * @job: pointer to the qpl_job structure
> > + * @is_hardware: indicates whether the job is a hardware job
> > + * @len: Specified output length
> > + * @errp: pointer to an error
> > + */
> > +static bool multifd_qpl_process_and_check_job(qpl_job *job, bool is_hardware,
> > +                                              uint32_t len, Error **errp)
> > +{
> > +    qpl_status status;
> > +
> > +    status = (is_hardware ? qpl_wait_job(job) : qpl_execute_job(job));
> > +    if (status != QPL_STS_OK) {
> > +        error_setg(errp, "qpl_execute_job failed with error %d",
> status);
> 
> The error message should also cover qpl_wait_job(), right? Maybe just
> use "qpl job failed".

You are right, I will fix this next version.

> > +        return false;
> > +    }
> > +    if (job->total_out != len) {
> > +        error_setg(errp, "qpl decompressed len %u, expected len %u",
> > +                   job->total_out, len);
> > +        return false;
> > +    }
> > +    return true;
> > +}
> > +
> > +/**
> > + * multifd_qpl_decompress_pages_slow_path: decompress pages using slow path
> > + *
> > + * Decompress the pages using software
> > + *
> > + * Returns 0 on success or -1 on error
> > + *
> > + * @p: Params for the channel being used
> > + * @errp: pointer to an error
> > + */
> > +static int multifd_qpl_decompress_pages_slow_path(MultiFDRecvParams *p,
> > +                                                  Error **errp)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +    uint32_t size = p->page_size;
> > +    qpl_job *job = qpl->sw_job;
> > +    uint8_t *zbuf = qpl->zbuf;
> > +    uint8_t *addr;
> > +    uint32_t len;
> > +
> > +    for (int i = 0; i < p->normal_num; i++) {
> > +        len = qpl->zlen[i];
> > +        addr = p->host + p->normal[i];
> > +        /* the page is uncompressed, load it */
> > +        if (len == size) {
> > +            memcpy(addr, zbuf, size);
> > +            zbuf += size;
> > +            continue;
> > +        }
> > +        multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
> > +        if (!multifd_qpl_process_and_check_job(job, false, size, errp)) {
> > +            return -1;
> > +        }
> > +        zbuf += len;
> > +    }
> > +    return 0;
> > +}
> > +
> > +/**
> > + * multifd_qpl_decompress_pages: decompress pages
> > + *
> > + * Decompress the pages using the IAA hardware. If hardware
> > + * decompression fails, it falls back to software decompression.
> > + *
> > + * Returns 0 on success or -1 on error
> > + *
> > + * @p: Params for the channel being used
> > + * @errp: pointer to an error
> > + */
> > +static int multifd_qpl_decompress_pages(MultiFDRecvParams *p, Error **errp)
> > +{
> > +    QplData *qpl = p->compress_data;
> > +    uint32_t size = p->page_size;
> > +    uint8_t *zbuf = qpl->zbuf;
> > +    uint8_t *addr;
> > +    uint32_t len;
> > +    qpl_job *job;
> > +
> > +    for (int i = 0; i < p->normal_num; i++) {
> > +        addr = p->host + p->normal[i];
> > +        len = qpl->zlen[i];
> > +        /* the page is uncompressed if received length equals the page size */
> > +        if (len == size) {
> > +            memcpy(addr, zbuf, size);
> > +            zbuf += size;
> > +            continue;
> > +        }
> > +
> > +        job = qpl->hw_jobs[i].job;
> > +        multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
> > +        if (multifd_qpl_submit_job(job)) {
> > +            qpl->hw_jobs[i].fallback_sw_path = false;
> > +        } else {
> > +            qpl->hw_jobs[i].fallback_sw_path = true;
> > +            job = qpl->sw_job;
> > +            multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
> > +            if (!multifd_qpl_process_and_check_job(job, false, size, errp)) {
> > +                return -1;
> > +            }
> 
> Here the same suggestion applies. You created
> multifd_qpl_process_and_check_job() but is now calling it twice, which
> seems to lose the purpose. If the fallback moves to the loop below, then
> you do it all in one place:
> 
>     for (int i = 0; i < p->normal_num; i++) {
>         bool is_sw = qpl->hw_jobs[i].fallback_sw_path;
> 
>         if (is_sw) {
>             job = qpl->sw_job;
>             multifd_qpl_prepare_decomp_job(job, zbuf, len, addr, size);
>         } else {
>             job = qpl->hw_jobs[i].job;
>         }
> 
>         if (!multifd_qpl_process_and_check_job(job, !is_sw, size, errp)) {
>             return -1;
>         }
>     }

I think this is the same issue as discussed above: after a hardware job fails to
be submitted, a software job is executed immediately, and subsequent jobs still
give priority to the hardware path. That is why the same
multifd_qpl_process_and_check_job() is called in the two loops.
 
> > +        }
> > +        zbuf += len;
> > +    }
> > +
> > +    for (int i = 0; i < p->normal_num; i++) {
> > +        /* ignore pages that have already been processed */
> > +        if (qpl->zlen[i] == size || qpl->hw_jobs[i].fallback_sw_path) {
> > +            continue;
> > +        }
> > +
> > +        job = qpl->hw_jobs[i].job;
> > +        if (!multifd_qpl_process_and_check_job(job, true, size, errp)) {
> > +            return -1;
> > +        }
> > +    }
> > +    return 0;
> > +}
> >  /**
> >   * multifd_qpl_recv: read the data from the channel into actual pages
> >   *
> > @@ -325,8 +689,48 @@ static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p)
> >   */
> >  static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp)
> >  {
> > -    /* Implement in next patch */
> > -    return -1;
> > +    QplData *qpl = p->compress_data;
> > +    uint32_t in_size = p->next_packet_size;
> > +    uint32_t flags = p->flags & MULTIFD_FLAG_COMPRESSION_MASK;
> > +    uint32_t len = 0;
> > +    uint32_t zbuf_len = 0;
> > +    int ret;
> > +
> > +    if (flags != MULTIFD_FLAG_QPL) {
> > +        error_setg(errp, "multifd %u: flags received %x flags
> expected %x",
> > +                   p->id, flags, MULTIFD_FLAG_QPL);
> > +        return -1;
> > +    }
> > +    multifd_recv_zero_page_process(p);
> > +    if (!p->normal_num) {
> > +        assert(in_size == 0);
> > +        return 0;
> > +    }
> > +
> > +    /* read compressed page lengths */
> > +    len = p->normal_num * sizeof(uint32_t);
> > +    assert(len < in_size);
> > +    ret = qio_channel_read_all(p->c, (void *) qpl->zlen, len, errp);
> > +    if (ret != 0) {
> > +        return ret;
> > +    }
> > +    for (int i = 0; i < p->normal_num; i++) {
> > +        qpl->zlen[i] = be32_to_cpu(qpl->zlen[i]);
> > +        assert(qpl->zlen[i] <= p->page_size);
> > +        zbuf_len += qpl->zlen[i];
> > +    }
> > +
> > +    /* read compressed pages */
> > +    assert(in_size == len + zbuf_len);
> > +    ret = qio_channel_read_all(p->c, (void *) qpl->zbuf, zbuf_len, errp);
> > +    if (ret != 0) {
> > +        return ret;
> > +    }
> > +
> > +    if (qpl->hw_avail) {
> > +        return multifd_qpl_decompress_pages(p, errp);
> > +    }
> > +    return multifd_qpl_decompress_pages_slow_path(p, errp);
> >  }
> >
> >  static MultiFDMethods multifd_qpl_ops = {


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v7 1/7] docs/migration: add qpl compression feature
  2024-06-05 19:59   ` Fabiano Rosas
@ 2024-06-06  7:03     ` Liu, Yuan1
  0 siblings, 0 replies; 18+ messages in thread
From: Liu, Yuan1 @ 2024-06-06  7:03 UTC (permalink / raw)
  To: Fabiano Rosas, peterx@redhat.com, pbonzini@redhat.com,
	marcandre.lureau@redhat.com, berrange@redhat.com,
	thuth@redhat.com, philmd@linaro.org
  Cc: qemu-devel@nongnu.org, Zou, Nanhai,
	shameerali.kolothum.thodi@huawei.com

> -----Original Message-----
> From: Fabiano Rosas <farosas@suse.de>
> Sent: Thursday, June 6, 2024 4:00 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com;
> pbonzini@redhat.com; marcandre.lureau@redhat.com; berrange@redhat.com;
> thuth@redhat.com; philmd@linaro.org
> Cc: qemu-devel@nongnu.org; Liu, Yuan1 <yuan1.liu@intel.com>; Zou, Nanhai
> <nanhai.zou@intel.com>; shameerali.kolothum.thodi@huawei.com
> Subject: Re: [PATCH v7 1/7] docs/migration: add qpl compression feature
> 
> Yuan Liu <yuan1.liu@intel.com> writes:
> 
> > add Intel Query Processing Library (QPL) compression method
> > introduction
> >
> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> 
> Just some nits if you need to respin. Otherwise I can touch up in the
> migration tree.
> 
> Reviewed-by: Fabiano Rosas <farosas@suse.de>

Thank you very much, there is nothing I need to change for this patch;
if this patch set needs a next version, I will fix the nits
according to your suggestions.

> > ---
> >  docs/devel/migration/features.rst        |   1 +
> >  docs/devel/migration/qpl-compression.rst | 262 +++++++++++++++++++++++
> >  2 files changed, 263 insertions(+)
> >  create mode 100644 docs/devel/migration/qpl-compression.rst
> >
> > diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
> > index d5ca7b86d5..bc98b65075 100644
> > --- a/docs/devel/migration/features.rst
> > +++ b/docs/devel/migration/features.rst
> > @@ -12,3 +12,4 @@ Migration has plenty of features to support different use cases.
> >     virtio
> >     mapped-ram
> >     CPR
> > +   qpl-compression
> > diff --git a/docs/devel/migration/qpl-compression.rst b/docs/devel/migration/qpl-compression.rst
> > new file mode 100644
> > index 0000000000..13fb7a67b1
> > --- /dev/null
> > +++ b/docs/devel/migration/qpl-compression.rst
> > @@ -0,0 +1,262 @@
> > +===============
> > +QPL Compression
> > +===============
> > +The Intel Query Processing Library (Intel ``QPL``) is an open-source library to
> > +provide compression and decompression features and it is based on deflate
> > +compression algorithm (RFC 1951).
> > +
> > +The ``QPL`` compression relies on Intel In-Memory Analytics Accelerator(``IAA``)
> > +and Shared Virtual Memory(``SVM``) technology, they are new features supported
> > +from Intel 4th Gen Intel Xeon Scalable processors, codenamed Sapphire Rapids
> > +processor(``SPR``).
> > +
> > +For more ``QPL`` introduction, please refer to `QPL Introduction
> > +<https://intel.github.io/qpl/documentation/introduction_docs/introduction.html>`_
> > +
> > +QPL Compression Framework
> > +=========================
> > +
> > +::
> > +
> > +  +----------------+       +------------------+
> > +  | MultiFD Thread |       |accel-config tool |
> > +  +-------+--------+       +--------+---------+
> > +          |                         |
> > +          |                         |
> > +          |compress/decompress      |
> > +  +-------+--------+                | Setup IAA
> > +  |  QPL library   |                | Resources
> > +  +-------+---+----+                |
> > +          |   |                     |
> > +          |   +-------------+-------+
> > +          |   Open IAA      |
> > +          |   Devices +-----+-----+
> > +          |           |idxd driver|
> > +          |           +-----+-----+
> > +          |                 |
> > +          |                 |
> > +          |           +-----+-----+
> > +          +-----------+IAA Devices|
> > +      Submit jobs     +-----------+
> > +      via enqcmd
> > +
> > +
> > +QPL Build And Installation
> > +--------------------------
> > +
> > +.. code-block:: shell
> > +
> > +  $git clone --recursive https://github.com/intel/qpl.git qpl
> > +  $mkdir qpl/build
> > +  $cd qpl/build
> > +  $cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DQPL_LIBRARY_TYPE=SHARED ..
> > +  $sudo cmake --build . --target install
> > +
> > +For more details about ``QPL`` installation, please refer to `QPL Installation
> > +<https://intel.github.io/qpl/documentation/get_started_docs/installation.html>`_
> > +
> > +IAA Device Management
> > +---------------------
> > +
> > +The number of ``IAA`` devices will vary depending on the Xeon product model.
> > +On a ``SPR`` server, there can be a maximum of 8 ``IAA`` devices, with up to
> > +4 devices per socket.
> > +
> > +By default, all ``IAA`` devices are disabled and need to be configured and
> > +enabled by users manually.
> > +
> > +Check the number of devices through the following command
> > +
> > +.. code-block:: shell
> > +
> > +  #lspci -d 8086:0cfe
> > +  6a:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  6f:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  74:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  79:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  e7:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  ec:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  f1:02.0 System peripheral: Intel Corporation Device 0cfe
> > +  f6:02.0 System peripheral: Intel Corporation Device 0cfe
> > +
> > +IAA Device Configuration And Enabling
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +The ``accel-config`` tool is used to enable ``IAA`` devices and configure
> > +``IAA`` hardware resources(work queues and engines). One ``IAA`` device
> > +has 8 work queues and 8 processing engines, multiple engines can be assigned
> > +to a work queue via ``group`` attribute.
> > +
> > +For ``accel-config`` installation, please refer to `accel-config installation
> > +<https://github.com/intel/idxd-config>`_
> > +
> > +One example of configuring and enabling an ``IAA`` device.
> > +
> > +.. code-block:: shell
> > +
> > +  #accel-config config-engine iax1/engine1.0 -g 0
> > +  #accel-config config-engine iax1/engine1.1 -g 0
> > +  #accel-config config-engine iax1/engine1.2 -g 0
> > +  #accel-config config-engine iax1/engine1.3 -g 0
> > +  #accel-config config-engine iax1/engine1.4 -g 0
> > +  #accel-config config-engine iax1/engine1.5 -g 0
> > +  #accel-config config-engine iax1/engine1.6 -g 0
> > +  #accel-config config-engine iax1/engine1.7 -g 0
> > +  #accel-config config-wq iax1/wq1.0 -g 0 -s 128 -p 10 -b 1 -t 128 -m shared -y user -n app1 -d user
> > +  #accel-config enable-device iax1
> > +  #accel-config enable-wq iax1/wq1.0
> > +
> > +.. note::
> > +   IAX is an early name for IAA
> > +
> > +- The ``IAA`` device index is 1, use ``ls -lh /sys/bus/dsa/devices/iax*``
> > +  command to query the ``IAA`` device index.
> > +
> > +- 8 engines and 1 work queue are configured in group 0, so all compression jobs
> > +  submitted to this work queue can be processed by all engines at the same time.
> > +
> > +- Set work queue attributes including the work mode, work queue size and so on.
> > +
> > +- Enable the ``IAA1`` device and work queue 1.0
> > +
> > +.. note::
> > +
> > +  Set work queue mode to shared mode, since ``QPL`` library only supports
> > +  shared mode
> > +
> > +For more detailed configuration, please refer to `IAA Configuration Samples
> > +<https://github.com/intel/idxd-config/tree/stable/Documentation/accfg>`_
> > +
> > +IAA Unit Test
> > +^^^^^^^^^^^^^
> > +
> > +- Enabling ``IAA`` devices for Xeon platform, please refer to `IAA User Guide
> > +  <https://www.intel.com/content/www/us/en/content-details/780887/intel-in-memory-analytics-accelerator-intel-iaa.html>`_
> > +
> > +- ``IAA`` device driver is Intel Data Accelerator Driver (idxd), it is
> > +  recommended that the minimum version of Linux kernel is 5.18.
> > +
> > +- Add ``"intel_iommu=on,sm_on"`` parameter to kernel command line
> > +  for ``SVM`` feature enabling.
> > +
> > +Here is an easy way to verify ``IAA`` device driver and ``SVM`` with `iaa_test
> > +<https://github.com/intel/idxd-config/tree/stable/test>`_
> > +
> > +.. code-block:: shell
> > +
> > +  #./test/iaa_test
> > +   [ info] alloc wq 0 shared size 128 addr 0x7f26cebe5000 batch sz 0xfffffffe xfer sz 0x80000000
> > +   [ info] test noop: tflags 0x1 num_desc 1
> > +   [ info] preparing descriptor for noop
> > +   [ info] Submitted all noop jobs
> > +   [ info] verifying task result for 0x16f7e20
> > +   [ info] test with op 0 passed
> > +
> > +
> > +IAA Resources Allocation For Migration
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > +There is no ``IAA`` resource configuration parameters for migration and
> > +``accel-config`` tool configuration cannot directly specify the ``IAA``
> > +resources used for migration.
> > +
> > +The multifd migration with ``QPL`` compression method will use all work
> > +queues that are enabled and shared mode.
> > +
> > +.. note::
> > +
> > +  Accessing IAA resources requires ``sudo`` command or ``root`` privileges
> > +  by default. Administrators can modify the IAA device node ownership
> > +  so that Qemu can use IAA with specified user permissions.
> 
> QEMU
> 
> > +
> > +  For example
> > +
> > +  #chown -R Qemu /dev/iax
> 
> qemu
> 
> > +
> > +
> > +Shared Virtual Memory(SVM) Introduction
> > +=======================================
> > +
> > +An ability for an accelerator I/O device to operate in the same virtual
> > +memory space of applications on host processors. It also implies the
> > +ability to operate from pageable memory, avoiding functional requirements
> > +to pin memory for DMA operations.
> > +
> > +When using ``SVM`` technology, users do not need to reserve memory for the
> > +``IAA`` device and perform pin memory operation. The ``IAA`` device can
> > +directly access data using the virtual address of the process.
> > +
> > +For more ``SVM`` technology, please refer to
> > +`Shared Virtual Addressing (SVA) with ENQCMD
> > +<https://docs.kernel.org/next/x86/sva.html>`_
> > +
> > +
> > +How To Use QPL Compression In Migration
> > +=======================================
> > +
> > +1 - Installation of ``QPL`` library and ``accel-config`` library if using IAA
> > +
> > +2 - Configure and enable ``IAA`` devices and work queues via ``accel-config``
> > +
> > +3 - Build ``Qemu`` with ``--enable-qpl`` parameter
> 
> QEMU
> 
> > +
> > +  E.g. configure --target-list=x86_64-softmmu --enable-kvm ``--enable-qpl``
> > +
> > +4 - Enable ``QPL`` compression during migration
> > +
> > +  Set ``migrate_set_parameter multifd-compression qpl`` when migrating, the
> > +  ``QPL`` compression does not support configuring the compression level, it
> > +  only supports one compression level.
> > +
> > +The Difference Between QPL And ZLIB
> > +===================================
> > +
> > +Although both ``QPL`` and ``ZLIB`` are based on the deflate compression
> > +algorithm, and ``QPL`` can support the header and tail of ``ZLIB``, ``QPL``
> > +is still not fully compatible with the ``ZLIB`` compression in the migration.
> > +
> > +``QPL`` only supports 4K history buffer, and ``ZLIB`` is 32K by default. The
> > +``ZLIB`` compressed data that ``QPL`` may not decompress correctly and
> > +vice versa.
> 
> s/The ZLIB compressed/ZLIB compresses/
> 
> > +
> > +``QPL`` does not support the ``Z_SYNC_FLUSH`` operation in ``ZLIB`` streaming
> > +compression, current ``ZLIB`` implementation uses ``Z_SYNC_FLUSH``, so each
> > +``multifd`` thread has a ``ZLIB`` streaming context, and all page compression
> > +and decompression are based on this stream. ``QPL`` cannot decompress such data
> > +and vice versa.
> > +
> > +The introduction for ``Z_SYNC_FLUSH``, please refer to `Zlib Manual
> > +<https://www.zlib.net/manual.html>`_
> > +
> > +The Best Practices
> > +==================
> > +When user enables the IAA device for ``QPL`` compression, it is recommended
> > +to add ``-mem-prealloc`` parameter to the destination boot parameters. This
> > +parameter can avoid the occurrence of I/O page fault and reduce the overhead
> > +of IAA compression and decompression.
> > +
> > +The example of booting with ``-mem-prealloc`` parameter
> > +
> > +.. code-block:: shell
> > +
> > +   $qemu-system-x86_64 --enable-kvm -cpu host --mem-prealloc ...
> > +
> > +
> > +An example about I/O page fault measurement of destination without
> > +``-mem-prealloc``, the ``svm_prq`` indicates the number of I/O page fault
> > +occurrences and processing time.
> > +
> > +.. code-block:: shell
> > +
> > +  #echo 1 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> > +  #echo 2 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> > +  #echo 3 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> > +  #echo 4 > /sys/kernel/debug/iommu/intel/dmar_perf_latency
> > +  #cat /sys/kernel/debug/iommu/intel/dmar_perf_latency
> > +  IOMMU: dmar18 Register Base Address: c87fc000
> > +                  <0.1us   0.1us-1us    1us-10us  10us-100us   100us-1ms    1ms-10ms      >=10ms     min(us)     max(us) average(us)
> > +   inv_iotlb           0         286         123           0           0           0           0           0           1           0
> > +  inv_devtlb           0         276         133           0           0           0           0           0           2           0
> > +     inv_iec           0           0           0           0           0           0           0           0           0           0
> > +     svm_prq           0           0       25206         364         395           0           0           1         556           9
> > +


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression
  2024-06-06  6:12     ` Liu, Yuan1
@ 2024-06-06 13:51       ` Fabiano Rosas
  2024-06-06 14:29         ` Liu, Yuan1
  0 siblings, 1 reply; 18+ messages in thread
From: Fabiano Rosas @ 2024-06-06 13:51 UTC (permalink / raw)
  To: Liu, Yuan1, peterx@redhat.com, pbonzini@redhat.com,
	marcandre.lureau@redhat.com, berrange@redhat.com,
	thuth@redhat.com, philmd@linaro.org
  Cc: qemu-devel@nongnu.org, Zou, Nanhai,
	shameerali.kolothum.thodi@huawei.com

"Liu, Yuan1" <yuan1.liu@intel.com> writes:

>> -----Original Message-----
>> From: Fabiano Rosas <farosas@suse.de>
>> Sent: Thursday, June 6, 2024 6:26 AM
>> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com;
>> pbonzini@redhat.com; marcandre.lureau@redhat.com; berrange@redhat.com;
>> thuth@redhat.com; philmd@linaro.org
>> Cc: qemu-devel@nongnu.org; Liu, Yuan1 <yuan1.liu@intel.com>; Zou, Nanhai
>> <nanhai.zou@intel.com>; shameerali.kolothum.thodi@huawei.com
>> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl compression
>> and decompression
>> 
>> Yuan Liu <yuan1.liu@intel.com> writes:
>> 
>> > QPL compression and decompression will use IAA hardware first.
>> > If IAA hardware is not available, it will automatically fall
>> > back to QPL software path, if the software job also fails,
>> > the uncompressed page is sent directly.
>> >
>> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
>> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
>> > ---
>> >  migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
>> >  1 file changed, 408 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
>> > index 6791a204d5..18b3384bd5 100644
>> > --- a/migration/multifd-qpl.c
>> > +++ b/migration/multifd-qpl.c
>> > @@ -13,9 +13,14 @@
>> >  #include "qemu/osdep.h"
>> >  #include "qemu/module.h"
>> >  #include "qapi/error.h"
>> > +#include "qapi/qapi-types-migration.h"
>> > +#include "exec/ramblock.h"
>> >  #include "multifd.h"
>> >  #include "qpl/qpl.h"
>> >
>> > +/* Maximum number of retries to resubmit a job if IAA work queues are full */
>> > +#define MAX_SUBMIT_RETRY_NUM (3)
>> > +
>> >  typedef struct {
>> >      /* the QPL hardware path job */
>> >      qpl_job *job;
>> > @@ -260,6 +265,219 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
>> >      p->iov = NULL;
>> >  }
>> >
>> > +/**
>> > + * multifd_qpl_prepare_job: prepare the job
>> > + *
>> > + * Set the QPL job parameters and properties.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + * @is_compression: indicates compression and decompression
>> > + * @input: pointer to the input data buffer
>> > + * @input_len: the length of the input data
>> > + * @output: pointer to the output data buffer
>> > + * @output_len: the length of the output data
>> > + */
>> > +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
>> > +                                    uint8_t *input, uint32_t input_len,
>> > +                                    uint8_t *output, uint32_t output_len)
>> > +{
>> > +    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
>> > +    job->next_in_ptr = input;
>> > +    job->next_out_ptr = output;
>> > +    job->available_in = input_len;
>> > +    job->available_out = output_len;
>> > +    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
>> > +    /* only supports compression level 1 */
>> > +    job->level = 1;
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_prepare_job: prepare the compression job
>> 
>> function name is wrong
>
> Thanks, I will fix this next version.
>  
>> > + *
>> > + * Set the compression job parameters and properties.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + * @input: pointer to the input data buffer
>> > + * @input_len: the length of the input data
>> > + * @output: pointer to the output data buffer
>> > + * @output_len: the length of the output data
>> > + */
>> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
>> > +                                         uint32_t input_len, uint8_t *output,
>> > +                                         uint32_t output_len)
>> > +{
>> > +    multifd_qpl_prepare_job(job, true, input, input_len, output, output_len);
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_prepare_job: prepare the decompression job
>
> Thanks, I will fix this next version.
>  
>> > + *
>> > + * Set the decompression job parameters and properties.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + * @input: pointer to the input data buffer
>> > + * @input_len: the length of the input data
>> > + * @output: pointer to the output data buffer
>> > + * @output_len: the length of the output data
>> > + */
>> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t *input,
>> > +                                           uint32_t input_len, uint8_t *output,
>> > +                                           uint32_t output_len)
>> > +{
>> > +    multifd_qpl_prepare_job(job, false, input, input_len, output, output_len);
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_fill_iov: fill in the IOV
>> > + *
>> > + * Fill in the QPL packet IOV
>> > + *
>> > + * @p: Params for the channel being used
>> > + * @data: pointer to the IOV data
>> > + * @len: The length of the IOV data
>> > + */
>> > +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
>> > +                                 uint32_t len)
>> > +{
>> > +    p->iov[p->iovs_num].iov_base = data;
>> > +    p->iov[p->iovs_num].iov_len = len;
>> > +    p->iovs_num++;
>> > +    p->next_packet_size += len;
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_fill_packet: fill the compressed page into the QPL packet
>> > + *
>> > + * Fill the compressed page length and IOV into the QPL packet
>> > + *
>> > + * @idx: The index of the compressed length array
>> > + * @p: Params for the channel being used
>> > + * @data: pointer to the compressed page buffer
>> > + * @len: The length of the compressed page
>> > + */
>> > +static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
>> > +                                    uint8_t *data, uint32_t len)
>> > +{
>> > +    QplData *qpl = p->compress_data;
>> > +
>> > +    qpl->zlen[idx] = cpu_to_be32(len);
>> > +    multifd_qpl_fill_iov(p, data, len);
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_submit_job: submit a job to the hardware
>> > + *
>> > + * Submit a QPL hardware job to the IAA device
>> > + *
>> > + * Returns true if the job is submitted successfully, otherwise false.
>> > + *
>> > + * @job: pointer to the qpl_job structure
>> > + */
>> > +static bool multifd_qpl_submit_job(qpl_job *job)
>> > +{
>> > +    qpl_status status;
>> > +    uint32_t num = 0;
>> > +
>> > +retry:
>> > +    status = qpl_submit_job(job);
>> > +    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
>> > +        if (num < MAX_SUBMIT_RETRY_NUM) {
>> > +            num++;
>> > +            goto retry;
>> > +        }
>> > +    }
>> > +    return (status == QPL_STS_OK);
>> 
>> How often do we expect this to fail? Will the queues be busy frequently
>> or is this an unlikely event? I'm thinking whether we really need to
>> allow a fallback for the hw path. Sorry if this has been discussed
>> already, I don't remember.
>
> In some scenarios this may happen frequently, such as when 4 channels are
> configured but only one IAA device is available. In the case of insufficient
> IAA hardware resources, retry and fallback can help optimize performance.
> I have a comparison test below:
>
> 1. Retry + SW fallback:
>    total time: 14649 ms
>    downtime: 25 ms
>    throughput: 17666.57 mbps
>    pages-per-second: 1509647
>
> 2. No fallback, always wait for work queues to become available
>    total time: 18381 ms
>    downtime: 25 ms
>    throughput: 13698.65 mbps
>    pages-per-second: 859607

Thanks for the data, this is helpful. Let's include it in the commit
message, it's important to let people know you actually did that
analysis. I put a suggestion below:

---
QPL compression and decompression will use IAA hardware path if the IAA
hardware is available. Otherwise the QPL library software path is used.

The hardware path will automatically fall back to QPL software path if
the IAA queues are busy. In some scenarios, this may happen frequently,
such as configuring 4 channels but only one IAA device is available. In
the case of insufficient IAA hardware resources, retry and fallback can
help optimize performance:

 1. Retry + SW fallback:
    total time: 14649 ms
    downtime: 25 ms
    throughput: 17666.57 mbps
    pages-per-second: 1509647

 2. No fallback, always wait for work queues to become available
    total time: 18381 ms
    downtime: 25 ms
    throughput: 13698.65 mbps
    pages-per-second: 859607

If both the hardware and software paths fail, the uncompressed page is
sent directly.

>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_compress_pages_slow_path: compress pages using slow path
>> > + *
>> > + * Compress the pages using software. If compression fails, the page will
>> > + * be sent directly.
>> > + *
>> > + * @p: Params for the channel being used
>> > + */
>> > +static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
>> > +{
>> > +    QplData *qpl = p->compress_data;
>> > +    uint32_t size = p->page_size;
>> > +    qpl_job *job = qpl->sw_job;
>> > +    uint8_t *zbuf = qpl->zbuf;
>> > +    uint8_t *buf;
>> > +
>> > +    for (int i = 0; i < p->pages->normal_num; i++) {
>> > +        buf = p->pages->block->host + p->pages->offset[i];
>> > +        /* Set output length to less than the page to reduce decompression */
>> > +        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
>> > +        if (qpl_execute_job(job) == QPL_STS_OK) {
>> > +            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
>> > +        } else {
>> > +            /* send the page directly */
>> 
>> s/directly/uncompressed/
>> 
>> a bit clearer.
>
> Sure, I will fix it next version. 
>
>> > +            multifd_qpl_fill_packet(i, p, buf, size);
>> > +        }
>> > +        zbuf += size;
>> > +    }
>> > +}
>> > +
>> > +/**
>> > + * multifd_qpl_compress_pages: compress pages
>> > + *
>> > + * Submit the pages to the IAA hardware for compression. If hardware
>> > + * compression fails, it falls back to software compression. If
>> software
>> > + * compression also fails, the page is sent directly
>> > + *
>> > + * @p: Params for the channel being used
>> > + */
>> > +static void multifd_qpl_compress_pages(MultiFDSendParams *p)
>> > +{
>> > +    QplData *qpl = p->compress_data;
>> > +    MultiFDPages_t *pages = p->pages;
>> > +    uint32_t size = p->page_size;
>> > +    QplHwJob *hw_job;
>> > +    uint8_t *buf;
>> > +    uint8_t *zbuf;
>> > +
>> 
>> Let's document the output size choice more explicitly:
>> 
>>     /*
>>      * Set output length to less than the page size to force the job to
>>      * fail in case it compresses to a larger size. We'll send that page
>>      * without compression and skip the decompression operation on the
>>      * destination.
>>      */
>>      out_size = size - 1;
>> 
>> you can then omit the other comments.
>
> Thanks for the comments, I will refine this next version.
>  
>> > +    for (int i = 0; i < pages->normal_num; i++) {
>> > +        buf = pages->block->host + pages->offset[i];
>> > +        zbuf = qpl->zbuf + (size * i);
>> > +        hw_job = &qpl->hw_jobs[i];
>> > +        /* Set output length to less than the page to reduce decompression */
>> > +        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
>> > +        if (multifd_qpl_submit_job(hw_job->job)) {
>> > +            hw_job->fallback_sw_path = false;
>> > +        } else {
>> > +            hw_job->fallback_sw_path = true;
>> > +            /* Set output length less than page size to reduce decompression */
>> > +            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
>> > +                                         size - 1);
>> > +            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
>> > +                hw_job->sw_output = zbuf;
>> > +                hw_job->sw_output_len = qpl->sw_job->total_out;
>> > +            } else {
>> > +                hw_job->sw_output = buf;
>> > +                hw_job->sw_output_len = size;
>> > +            }
>> 
>> Hmm, these look a bit cumbersome, would it work if we moved the fallback
>> qpl_execute_job() down into the other loop? We could then avoid the
>> extra fields. Something like:
>> 
>> static void multifd_qpl_compress_pages(MultiFDSendParams *p)
>> {
>>     QplData *qpl = p->compress_data;
>>     MultiFDPages_t *pages = p->pages;
>>     uint32_t out_size, size = p->page_size;
>>     uint8_t *buf, *zbuf;
>> 
>>     /*
>>      * Set output length to less than the page size to force the job to
>>      * fail in case it compresses to a larger size. We'll send that page
>>      * without compression to skip the decompression operation on the
>>      * destination.
>>      */
>>     out_size = size - 1;
>> 
>>     for (int i = 0; i < pages->normal_num; i++) {
>>         QplHwJob *hw_job = &qpl->hw_jobs[i];
>> 
>>         hw_job->fallback_sw_path = false;
>>         buf = pages->block->host + pages->offset[i];
>>         zbuf = qpl->zbuf + (size * i);
>> 
>>         multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, out_size);
>> 
>>         if (!multifd_qpl_submit_job(hw_job->job)) {
>>             hw_job->fallback_sw_path = true;
>>         }
>>     }
>> 
>>     for (int i = 0; i < pages->normal_num; i++) {
>>         QplHwJob *hw_job = &qpl->hw_jobs[i];
>>         qpl_job *job;
>>         qpl_status ret;
>> 
>>         buf = pages->block->host + pages->offset[i];
>>         zbuf = qpl->zbuf + (size * i);
>> 
>>         if (hw_job->fallback_sw_path) {
>>             job = qpl->sw_job;
>>             multifd_qpl_prepare_comp_job(job, buf, size, zbuf, out_size);
>>             ret = qpl_execute_job(job);
>>         } else {
>>             job = hw_job->job;
>>             ret = qpl_wait_job(job);
>>         }
>> 
>>         if (ret == QPL_STS_OK) {
>>             multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
>>         } else {
>>             multifd_qpl_fill_packet(i, p, buf, size);
>>         }
>>     }
>> }
>
> Thanks very much for the reference code, I have tested the code and the performance is not good.
> When the work queue is full, after a hardware job fails to be submitted, the subsequent
> job submission will most likely fail as well. So my idea is to use software job execution
> instead immediately, but all subsequent jobs will still give priority to the hardware path.

So let me see if I get this, you're saying that going with the sw path
immediately after a hw path failure is beneficial because the time it
takes to call the sw path serves as a backoff time for the hw path?

Do you have an idea on the time difference of waiting for sw path
vs. introducing a delay to multifd_qpl_submit_job()? Aren't we leaving
performance on the table by going with a much slower sw path instead of
waiting for the queues to open up? Or some other strategy, such as going
once again over the not-submitted pages.

I understand there's a tradeoff here between your effort to investigate
these things and the amount of performance to be had, so feel free to
leave this question unanswered. We could choose to simply document this
with a comment:

    if (multifd_qpl_submit_job(hw_job->job)) {
        hw_job->fallback_sw_path = false;
        continue;
    }

    /* 
     * The IAA work queue is full, any immediate subsequent job
     * submission is likely to fail, sending the page via the QPL
     * software path at this point gives us a better chance of
     * finding the queue open for the next pages.
     */
    hw_job->fallback_sw_path = true;
    ...
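
If you ever want to experiment with the delay idea, a hypothetical variant
of multifd_qpl_submit_job() could back off between retries, sketch only
(the g_usleep() interval is made up and would need measuring):

static bool multifd_qpl_submit_job_backoff(qpl_job *job)
{
    qpl_status status;
    uint32_t num = 0;

    status = qpl_submit_job(job);
    while (status == QPL_STS_QUEUES_ARE_BUSY_ERR &&
           num < MAX_SUBMIT_RETRY_NUM) {
        /* hypothetical pause to let an IAA work queue drain */
        g_usleep(10);
        num++;
        status = qpl_submit_job(job);
    }
    return (status == QPL_STS_OK);
}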

> There is almost no overhead in job submission because Intel uses the new "enqcmd" instruction,
> which allows the user program to submit the job directly to the hardware.
>
> According to the implementation of the reference code, when a job fails to be submitted, there 
> is a high probability that "ALL" subsequent jobs will fail to be submitted and then use software
> compression, resulting in the IAA hardware not being fully utilized.
>
> For the 4-channel, 1 IAA device test case, using the reference code will reduce IAA throughput
> from 3.4GBps to 2.2GBps, thus affecting live migration performance (total time from 14s to 18s).
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression
  2024-06-06 13:51       ` Fabiano Rosas
@ 2024-06-06 14:29         ` Liu, Yuan1
  0 siblings, 0 replies; 18+ messages in thread
From: Liu, Yuan1 @ 2024-06-06 14:29 UTC (permalink / raw)
  To: Fabiano Rosas, peterx@redhat.com, pbonzini@redhat.com,
	marcandre.lureau@redhat.com, berrange@redhat.com,
	thuth@redhat.com, philmd@linaro.org
  Cc: qemu-devel@nongnu.org, Zou, Nanhai,
	shameerali.kolothum.thodi@huawei.com


> -----Original Message-----
> From: Fabiano Rosas <farosas@suse.de>
> Sent: Thursday, June 6, 2024 9:52 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com;
> pbonzini@redhat.com; marcandre.lureau@redhat.com; berrange@redhat.com;
> thuth@redhat.com; philmd@linaro.org
> Cc: qemu-devel@nongnu.org; Zou, Nanhai <nanhai.zou@intel.com>;
> shameerali.kolothum.thodi@huawei.com
> Subject: RE: [PATCH v7 6/7] migration/multifd: implement qpl compression
> and decompression
> 
> "Liu, Yuan1" <yuan1.liu@intel.com> writes:
> 
> >> -----Original Message-----
> >> From: Fabiano Rosas <farosas@suse.de>
> >> Sent: Thursday, June 6, 2024 6:26 AM
> >> To: Liu, Yuan1 <yuan1.liu@intel.com>; peterx@redhat.com;
> >> pbonzini@redhat.com; marcandre.lureau@redhat.com; berrange@redhat.com;
> >> thuth@redhat.com; philmd@linaro.org
> >> Cc: qemu-devel@nongnu.org; Liu, Yuan1 <yuan1.liu@intel.com>; Zou,
> Nanhai
> >> <nanhai.zou@intel.com>; shameerali.kolothum.thodi@huawei.com
> >> Subject: Re: [PATCH v7 6/7] migration/multifd: implement qpl
> compression
> >> and decompression
> >>
> >> Yuan Liu <yuan1.liu@intel.com> writes:
> >>
> >> > QPL compression and decompression will use IAA hardware first.
> >> > If IAA hardware is not available, it will automatically fall
> >> > back to QPL software path, if the software job also fails,
> >> > the uncompressed page is sent directly.
> >> >
> >> > Signed-off-by: Yuan Liu <yuan1.liu@intel.com>
> >> > Reviewed-by: Nanhai Zou <nanhai.zou@intel.com>
> >> > ---
> >> >  migration/multifd-qpl.c | 412 +++++++++++++++++++++++++++++++++++++++-
> >> >  1 file changed, 408 insertions(+), 4 deletions(-)
> >> >
> >> > diff --git a/migration/multifd-qpl.c b/migration/multifd-qpl.c
> >> > index 6791a204d5..18b3384bd5 100644
> >> > --- a/migration/multifd-qpl.c
> >> > +++ b/migration/multifd-qpl.c
> >> > @@ -13,9 +13,14 @@
> >> >  #include "qemu/osdep.h"
> >> >  #include "qemu/module.h"
> >> >  #include "qapi/error.h"
> >> > +#include "qapi/qapi-types-migration.h"
> >> > +#include "exec/ramblock.h"
> >> >  #include "multifd.h"
> >> >  #include "qpl/qpl.h"
> >> >
> >> > +/* Maximum number of retries to resubmit a job if IAA work queues are full */
> >> > +#define MAX_SUBMIT_RETRY_NUM (3)
> >> > +
> >> >  typedef struct {
> >> >      /* the QPL hardware path job */
> >> >      qpl_job *job;
> >> > @@ -260,6 +265,219 @@ static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp)
> >> >      p->iov = NULL;
> >> >  }
> >> >
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the job
> >> > + *
> >> > + * Set the QPL job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @is_compression: indicates compression and decompression
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_job(qpl_job *job, bool is_compression,
> >> > +                                    uint8_t *input, uint32_t input_len,
> >> > +                                    uint8_t *output, uint32_t output_len)
> >> > +{
> >> > +    job->op = is_compression ? qpl_op_compress : qpl_op_decompress;
> >> > +    job->next_in_ptr = input;
> >> > +    job->next_out_ptr = output;
> >> > +    job->available_in = input_len;
> >> > +    job->available_out = output_len;
> >> > +    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST | QPL_FLAG_OMIT_VERIFY;
> >> > +    /* only supports compression level 1 */
> >> > +    job->level = 1;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the compression job
> >>
> >> function name is wrong
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the compression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_comp_job(qpl_job *job, uint8_t *input,
> >> > +                                         uint32_t input_len, uint8_t *output,
> >> > +                                         uint32_t output_len)
> >> > +{
> >> > +    multifd_qpl_prepare_job(job, true, input, input_len, output, output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_prepare_job: prepare the decompression job
> >
> > Thanks, I will fix this next version.
> >
> >> > + *
> >> > + * Set the decompression job parameters and properties.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + * @input: pointer to the input data buffer
> >> > + * @input_len: the length of the input data
> >> > + * @output: pointer to the output data buffer
> >> > + * @output_len: the length of the output data
> >> > + */
> >> > +static void multifd_qpl_prepare_decomp_job(qpl_job *job, uint8_t *input,
> >> > +                                           uint32_t input_len, uint8_t *output,
> >> > +                                           uint32_t output_len)
> >> > +{
> >> > +    multifd_qpl_prepare_job(job, false, input, input_len, output, output_len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_fill_iov: fill in the IOV
> >> > + *
> >> > + * Fill in the QPL packet IOV
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + * @data: pointer to the IOV data
> >> > + * @len: The length of the IOV data
> >> > + */
> >> > +static void multifd_qpl_fill_iov(MultiFDSendParams *p, uint8_t *data,
> >> > +                                 uint32_t len)
> >> > +{
> >> > +    p->iov[p->iovs_num].iov_base = data;
> >> > +    p->iov[p->iovs_num].iov_len = len;
> >> > +    p->iovs_num++;
> >> > +    p->next_packet_size += len;
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_fill_packet: fill the compressed page into the QPL packet
> >> > + *
> >> > + * Fill the compressed page length and IOV into the QPL packet
> >> > + *
> >> > + * @idx: The index of the compressed length array
> >> > + * @p: Params for the channel being used
> >> > + * @data: pointer to the compressed page buffer
> >> > + * @len: The length of the compressed page
> >> > + */
> >> > +static void multifd_qpl_fill_packet(uint32_t idx, MultiFDSendParams *p,
> >> > +                                    uint8_t *data, uint32_t len)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +
> >> > +    qpl->zlen[idx] = cpu_to_be32(len);
> >> > +    multifd_qpl_fill_iov(p, data, len);
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_submit_job: submit a job to the hardware
> >> > + *
> >> > + * Submit a QPL hardware job to the IAA device
> >> > + *
> >> > + * Returns true if the job is submitted successfully, otherwise false.
> >> > + *
> >> > + * @job: pointer to the qpl_job structure
> >> > + */
> >> > +static bool multifd_qpl_submit_job(qpl_job *job)
> >> > +{
> >> > +    qpl_status status;
> >> > +    uint32_t num = 0;
> >> > +
> >> > +retry:
> >> > +    status = qpl_submit_job(job);
> >> > +    if (status == QPL_STS_QUEUES_ARE_BUSY_ERR) {
> >> > +        if (num < MAX_SUBMIT_RETRY_NUM) {
> >> > +            num++;
> >> > +            goto retry;
> >> > +        }
> >> > +    }
> >> > +    return (status == QPL_STS_OK);
> >>
> >> How often do we expect this to fail? Will the queues be busy frequently
> >> or is this an unlikely event? I'm thinking whether we really need to
> >> allow a fallback for the hw path. Sorry if this has been discussed
> >> already, I don't remember.
> >
> > In some scenarios this may happen frequently, such as when 4 channels are
> > configured but only one IAA device is available. In the case of insufficient
> > IAA hardware resources, retry and fallback can help optimize performance.
> > I have a comparison test below:
> >
> > 1. Retry + SW fallback:
> >    total time: 14649 ms
> >    downtime: 25 ms
> >    throughput: 17666.57 mbps
> >    pages-per-second: 1509647
> >
> > 2. No fallback, always wait for work queues to become available
> >    total time: 18381 ms
> >    downtime: 25 ms
> >    throughput: 13698.65 mbps
> >    pages-per-second: 859607
> 
> Thanks for the data, this is helpful. Let's include it in the commit
> message, it's important to let people know you actually did that
> analysis. I put a suggestion below:
> 
> ---
> QPL compression and decompression will use IAA hardware path if the IAA
> hardware is available. Otherwise the QPL library software path is used.
> 
> The hardware path will automatically fall back to QPL software path if
> the IAA queues are busy. In some scenarios, this may happen frequently,
> such as configuring 4 channels but only one IAA device is available. In
> the case of insufficient IAA hardware resources, retry and fallback can
> help optimize performance:
> 
>  1. Retry + SW fallback:
>     total time: 14649 ms
>     downtime: 25 ms
>     throughput: 17666.57 mbps
>     pages-per-second: 1509647
> 
>  2. No fallback, always wait for work queues to become available
>     total time: 18381 ms
>     downtime: 25 ms
>     throughput: 13698.65 mbps
>     pages-per-second: 859607
> 
> If both the hardware and software paths fail, the uncompressed page is
> sent directly.

Many thanks for your comments, I will add these to the commit message.

> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_compress_pages_slow_path: compress pages using slow path
> >> > + *
> >> > + * Compress the pages using software. If compression fails, the page will
> >> > + * be sent directly.
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + */
> >> > +static void multifd_qpl_compress_pages_slow_path(MultiFDSendParams *p)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +    uint32_t size = p->page_size;
> >> > +    qpl_job *job = qpl->sw_job;
> >> > +    uint8_t *zbuf = qpl->zbuf;
> >> > +    uint8_t *buf;
> >> > +
> >> > +    for (int i = 0; i < p->pages->normal_num; i++) {
> >> > +        buf = p->pages->block->host + p->pages->offset[i];
> >> > +        /* Set output length to less than the page to reduce decompression */
> >> > +        multifd_qpl_prepare_comp_job(job, buf, size, zbuf, size - 1);
> >> > +        if (qpl_execute_job(job) == QPL_STS_OK) {
> >> > +            multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> >> > +        } else {
> >> > +            /* send the page directly */
> >>
> >> s/directly/uncompressed/
> >>
> >> a bit clearer.
> >
> > Sure, I will fix it in the next version.
> >
> >> > +            multifd_qpl_fill_packet(i, p, buf, size);
> >> > +        }
> >> > +        zbuf += size;
> >> > +    }
> >> > +}
> >> > +
> >> > +/**
> >> > + * multifd_qpl_compress_pages: compress pages
> >> > + *
> >> > + * Submit the pages to the IAA hardware for compression. If hardware
> >> > + * compression fails, it falls back to software compression. If software
> >> > + * compression also fails, the page is sent directly
> >> > + *
> >> > + * @p: Params for the channel being used
> >> > + */
> >> > +static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> >> > +{
> >> > +    QplData *qpl = p->compress_data;
> >> > +    MultiFDPages_t *pages = p->pages;
> >> > +    uint32_t size = p->page_size;
> >> > +    QplHwJob *hw_job;
> >> > +    uint8_t *buf;
> >> > +    uint8_t *zbuf;
> >> > +
> >>
> >> Let's document the output size choice more explicitly:
> >>
> >>     /*
> >>      * Set output length to less than the page size to force the job to
> >>      * fail in case it compresses to a larger size. We'll send that page
> >>      * without compression and skip the decompression operation on the
> >>      * destination.
> >>      */
> >>      out_size = size - 1;
> >>
> >> you can then omit the other comments.
> >
> > Thanks for the comments, I will refine this in the next version.
> >
> >> > +    for (int i = 0; i < pages->normal_num; i++) {
> >> > +        buf = pages->block->host + pages->offset[i];
> >> > +        zbuf = qpl->zbuf + (size * i);
> >> > +        hw_job = &qpl->hw_jobs[i];
> >> > +        /* Set output length to less than the page to reduce decompression */
> >> > +        multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, size - 1);
> >> > +        if (multifd_qpl_submit_job(hw_job->job)) {
> >> > +            hw_job->fallback_sw_path = false;
> >> > +        } else {
> >> > +            hw_job->fallback_sw_path = true;
> >> > +            /* Set output length less than page size to reduce decompression */
> >> > +            multifd_qpl_prepare_comp_job(qpl->sw_job, buf, size, zbuf,
> >> > +                                         size - 1);
> >> > +            if (qpl_execute_job(qpl->sw_job) == QPL_STS_OK) {
> >> > +                hw_job->sw_output = zbuf;
> >> > +                hw_job->sw_output_len = qpl->sw_job->total_out;
> >> > +            } else {
> >> > +                hw_job->sw_output = buf;
> >> > +                hw_job->sw_output_len = size;
> >> > +            }
> >>
> >> Hmm, these look a bit cumbersome, would it work if we moved the fallback
> >> qpl_execute_job() down into the other loop? We could then avoid the
> >> extra fields. Something like:
> >>
> >> static void multifd_qpl_compress_pages(MultiFDSendParams *p)
> >> {
> >>     QplData *qpl = p->compress_data;
> >>     MultiFDPages_t *pages = p->pages;
> >>     uint32_t out_size, size = p->page_size;
> >>     uint8_t *buf, *zbuf;
> >>
> >>     /*
> >>      * Set output length to less than the page size to force the job to
> >>      * fail in case it compresses to a larger size. We'll send that page
> >>      * without compression to skip the decompression operation on the
> >>      * destination.
> >>      */
> >>     out_size = size - 1;
> >>
> >>     for (int i = 0; i < pages->normal_num; i++) {
> >>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> >>
> >>         hw_job->fallback_sw_path = false;
> >>         buf = pages->block->host + pages->offset[i];
> >>         zbuf = qpl->zbuf + (size * i);
> >>
> >>         multifd_qpl_prepare_comp_job(hw_job->job, buf, size, zbuf, out_size);
> >>
> >>         if (!multifd_qpl_submit_job(hw_job->job)) {
> >>             hw_job->fallback_sw_path = true;
> >>         }
> >>     }
> >>
> >>     for (int i = 0; i < pages->normal_num; i++) {
> >>         QplHwJob *hw_job = &qpl->hw_jobs[i];
> >>         qpl_job *job;
> >>         qpl_status ret;
> >>
> >>         buf = pages->block->host + pages->offset[i];
> >>         zbuf = qpl->zbuf + (size * i);
> >>
> >>         if (hw_job->fallback_sw_path) {
> >>             job = qpl->sw_job;
> >>             multifd_qpl_prepare_comp_job(job, buf, size, zbuf, out_size);
> >>             ret = qpl_execute_job(job);
> >>         } else {
> >>             job = hw_job->job;
> >>             ret = qpl_wait_job(job);
> >>         }
> >>
> >>         if (ret == QPL_STS_OK) {
> >>             multifd_qpl_fill_packet(i, p, zbuf, job->total_out);
> >>         } else {
> >>             multifd_qpl_fill_packet(i, p, buf, size);
> >>         }
> >>     }
> >> }
> >
> > Many thanks for the reference code. I have tested it and the
> > performance is not good.
> > When the work queue is full, after a hardware job fails to be submitted,
> > the subsequent job submissions will most likely fail as well. So my idea
> > is to use software job execution instead immediately, but all subsequent
> > jobs will still give priority to the hardware path.
> 
> So let me see if I get this, you're saying that going with the sw path
> immediately after a hw path failure is beneficial because the time it
> takes to call the sw path serves as a backoff time for the hw path?

Exactly, I want to use the sw path as the backoff time for the hardware path.

> Do you have an idea on the time difference of waiting for sw path
> vs. introducing a delay to multifd_qpl_submit_job()? Aren't we leaving
> performance on the table by going with a much slower sw path instead of
> waiting for the queues to open up? Or some other strategy, such as going
> once again over the not-submitted pages.

Choosing a specific delay time that guarantees performance is difficult right
now: the solution only supports the shared work queue mode, so while live
migration waits for the chosen delay, other workloads may still fill the
device work queue and the migration job submission can fail again afterwards.

I agree with your point. Using the software path as the backoff may itself
cost performance because of the software path overhead. I will consider how
to improve this in the future.
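
For reference, a delay-based variant might look like the following
untested sketch. It assumes the same context as multifd_qpl_submit_job()
(qpl_submit_job(), QPL_STS_QUEUES_ARE_BUSY_ERR and MAX_SUBMIT_RETRY_NUM
come from this patch, g_usleep() from GLib); the 10us backoff is an
arbitrary placeholder, not a tuned number:

static bool multifd_qpl_submit_job_with_backoff(qpl_job *job)
{
    qpl_status status;
    uint32_t num = 0;

    while (true) {
        status = qpl_submit_job(job);
        if (status != QPL_STS_QUEUES_ARE_BUSY_ERR ||
            num >= MAX_SUBMIT_RETRY_NUM) {
            break;
        }
        num++;
        /* back off instead of falling back to the software path */
        g_usleep(10);
    }
    return (status == QPL_STS_OK);
}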

> I understand there's a tradeoff here between your effort to investigate
> these things and the amount of performance to be had, so feel free to
> leave this question unanswered. We could choose to simply document this
> with a comment:

Sure, I'll add this comment so it can be improved in the future.

>     if (multifd_qpl_submit_job(hw_job->job)) {
>         hw_job->fallback_sw_path = false;
>         continue;
>     }
> 
>     /*
>      * The IAA work queue is full, any immediate subsequent job
>      * submission is likely to fail, sending the page via the QPL
>      * software path at this point gives us a better chance of
>      * finding the queue open for the next pages.
>      */
>     hw_job->fallback_sw_path = true;
>     ...
> 
> > There is almost no overhead in job submission because Intel uses the new
> > "enqcmd" instruction, which allows the user program to submit the job
> > directly to the hardware.
> >
> > According to the implementation of the reference code, when a job fails
> > to be submitted, there is a high probability that "ALL" subsequent jobs
> > will fail to be submitted and then use software compression, resulting
> > in the IAA hardware not being fully utilized.
> >
> > For the 4-channel, 1 IAA device test case, using the reference code
> > reduces IAA throughput from 3.4 GBps to 2.2 GBps, thus affecting live
> > migration performance (total time from 14 s to 18 s).
> >
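
To make the timing argument concrete, here is an illustration (not patch
code; submit_hw()/compress_sw() are hypothetical stand-ins for
multifd_qpl_submit_job() and qpl_execute_job()):

/*
 * Inline fallback (this patch): the software compression between two
 * hardware submissions takes time, during which the shared work queue
 * can drain:
 *
 *   submit_hw(page0) -> busy -> compress_sw(page0)  // queue drains here
 *   submit_hw(page1) -> ok                          // IAA is used again
 *
 * Two-loop variant: all submissions are issued back to back, so once
 * the queue is full, the remaining submissions in the burst are very
 * likely to fail as well:
 *
 *   submit_hw(page0) -> busy
 *   submit_hw(page1) -> busy   // no time for the queue to drain
 *   ...                        // every page falls back to software
 */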



Thread overview: 18+ messages
2024-06-03 15:40 [PATCH v7 0/7] Live Migration With IAA Yuan Liu
2024-06-03 15:41 ` [PATCH v7 1/7] docs/migration: add qpl compression feature Yuan Liu
2024-06-04 20:19   ` Peter Xu
2024-06-05 19:59   ` Fabiano Rosas
2024-06-06  7:03     ` Liu, Yuan1
2024-06-03 15:41 ` [PATCH v7 2/7] migration/multifd: put IOV initialization into compression method Yuan Liu
2024-06-03 15:41 ` [PATCH v7 3/7] configure: add --enable-qpl build option Yuan Liu
2024-06-05 20:04   ` Fabiano Rosas
2024-06-03 15:41 ` [PATCH v7 4/7] migration/multifd: add qpl compression method Yuan Liu
2024-06-03 15:41 ` [PATCH v7 5/7] migration/multifd: implement initialization of qpl compression Yuan Liu
2024-06-05 20:19   ` Fabiano Rosas
2024-06-03 15:41 ` [PATCH v7 6/7] migration/multifd: implement qpl compression and decompression Yuan Liu
2024-06-05 22:25   ` Fabiano Rosas
2024-06-06  6:12     ` Liu, Yuan1
2024-06-06 13:51       ` Fabiano Rosas
2024-06-06 14:29         ` Liu, Yuan1
2024-06-03 15:41 ` [PATCH v7 7/7] tests/migration-test: add qpl compression test Yuan Liu
2024-06-05 22:26   ` Fabiano Rosas
