From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 07F20C47074 for ; Sun, 31 Dec 2023 21:23:15 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1rK3G7-0001Hf-K9; Sun, 31 Dec 2023 16:22:11 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1rK2uo-00077x-7S for qemu-devel@nongnu.org; Sun, 31 Dec 2023 16:00:10 -0500 Received: from mail-qk1-x733.google.com ([2607:f8b0:4864:20::733]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1rK2ul-0007AI-Lu for qemu-devel@nongnu.org; Sun, 31 Dec 2023 16:00:09 -0500 Received: by mail-qk1-x733.google.com with SMTP id af79cd13be357-781600ab3d8so334282485a.0 for ; Sun, 31 Dec 2023 13:00:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1704056401; x=1704661201; darn=nongnu.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=1pM+6S5s1kWcUzrk/umGDrUTZJMQU/N8lp5nzmz05i8=; b=VyYg25qq87bPYMsts8ChaGC87Z2jrPp46USjMpbmMfmkyYMGkAaXazYuG0ufJlg1lk bgbQ3Luq9ThVITF2cobTDeFTnW6GpI56lY6r/SCfhyR5sIhosjIvAbd9tx+pyremHkzO saux7PzoHKe2+83Mw/46h9hU0vkx5GqkfgOsDn4Wzx2ZcAaOGOAYeofX+Xzzk5QfHNPc dKoNCqqAu1nkQSBQfehblOEgk9h8fN2HCjebeWrskg50dDkb9d9DZ1ub8p1JN8f3dZ/6 Rw8WJPMYSdMbdQDS4pukrOo2hoS8EcZo2xD15KDnjBqVyFc2aUPAAwLH+K24wo6XUgcJ Ox9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1704056401; x=1704661201; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=1pM+6S5s1kWcUzrk/umGDrUTZJMQU/N8lp5nzmz05i8=; b=NUkyTaJzBxDWwdlBi5hGv2MUMa/62uANmaD1kkzVUrMTDSg2xdsuF5iUk9Z1ojKXKX 0KkgaE0TgadOH2Mpy5pLBde9m+FE99hFiFvbnn0OkbApD15WmXR5yHwvRW1rMbFwwr9R D8gXtvnG49Kz4RhjOaINpP3GEx1z9/1pm3QYwOEidRlItyPbdvyhWXGL71qVLgpbHE6R SMrTQoNbtuc41wdP5dUAabbWIKhWGK+Bd0qCj1IOoSVKtYVjtU6CVFHqjxB6GsAfvwwS 2611/NGVtkUSGr4a3HDW41GXOMqM8gEzK41HRdypt9e9SAAv2AoTbPfXr6zg7FXEMXQb ZN7g== X-Gm-Message-State: AOJu0YxvVPx24974XFUrDQZNWGimaqYu3eHHRKI9z6tzNJlU5T0KpOMs LzJ/uiJOsxL7d4Jvm/74nLq2Ybq1tfF5GZQccge7xeefGx7UVA== X-Google-Smtp-Source: AGHT+IEcg1YPDqcJfQ48G2IE/ryWxY0UzUS1ktfXrdDN9UjvefAToTPQFywj2Yea0axYIFoIUunsKQ== X-Received: by 2002:a05:620a:1362:b0:781:5efd:403b with SMTP id d2-20020a05620a136200b007815efd403bmr9233059qkl.13.1704056400933; Sun, 31 Dec 2023 13:00:00 -0800 (PST) Received: from n36-186-108.byted.org. ([147.160.184.90]) by smtp.gmail.com with ESMTPSA id pb21-20020a05620a839500b007811da87cefsm8111750qkn.127.2023.12.31.13.00.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 31 Dec 2023 13:00:00 -0800 (PST) From: Bryan Zhang To: qemu-devel@nongnu.org, farosas@suse.de, marcandre.lureau@redhat.com, peterx@redhat.com, quintela@redhat.com, peter.maydell@linaro.org, hao.xiang@bytedance.com Cc: bryan.zhang@bytedance.com Subject: [PATCH 0/5] *** Implement using Intel QAT to offload ZLIB Date: Sun, 31 Dec 2023 20:57:59 +0000 Message-Id: <20231231205804.2366509-1-bryan.zhang@bytedance.com> X-Mailer: git-send-email 2.30.2 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::733; envelope-from=bryan.zhang@bytedance.com; helo=mail-qk1-x733.google.com X-Spam_score_int: 7 X-Spam_score: 0.7 X-Spam_bar: / X-Spam_report: (0.7 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RAZOR2_CF_RANGE_51_100=1.886, RAZOR2_CHECK=0.922, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=no autolearn_force=no X-Spam_action: no action X-Mailman-Approved-At: Sun, 31 Dec 2023 16:22:08 -0500 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org * Overview: This patchset implements using Intel's QAT accelerator to offload ZLIB compression and decompression in the multifd live migration path. * Background: Intel's 4th generation Xeon processors support Intel's QuickAssist Technology (QAT), a hardware accelerator for cryptography and compression operations. Intel has also released a software library, QATzip, that interacts with QAT and exposes an API for QAT-accelerated ZLIB compression and decompression. This patchset introduces a new multifd compression method, `qatzip`, which uses QATzip to perform ZLIB compression and decompression. * Implementation: The bulk of this patchset is in `migration/multifd-qatzip.c`, which mirrors the other compression implementation files, `migration/multifd-zlib.c` and `migration/multifd-zstd.c`, by providing an implementation of the multifd send/recv methods using the API exposed by QATzip. This is fairly straightforward, as the multifd setup/prepare/teardown methods align closely with QATzip's methods for initialization/(de)compression/teardown. The only major divergence from the other compression methods is that we use a non-streaming compression/decompression API, as opposed to streaming each page to the compression layer one at a time. This does not require any major code changes - by the time we want to call to the compression layer, we already have a batch of pages, so it is easy to copy them into a contiguous buffer. This decision is purely performance-based, as our initial QAT benchmark testing showed that QATzip's non-streaming API outperformed the streaming API. * Performance: ** Setup: We use two Intel 4th generation Xeon servers for testing. Architecture: x86_64 CPU(s): 192 Thread(s) per core: 2 Core(s) per socket: 48 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 143 Model name: Intel(R) Xeon(R) Platinum 8457C Stepping: 8 CPU MHz: 2538.624 CPU max MHz: 3800.0000 CPU min MHz: 800.0000 Each server has two QAT devices, and the network bandwidth between the two servers is 1Gbps. We perform multifd live migration over TCP using a VM with 64GB memory. We prepared the machine's memory by powering it on, allocating a large amount of memory (63GB) as a single buffer, and filling the buffer with the repeated contents of the Silesia corpus[0]. This is in lieu of a more realistic memory snapshot, which proved troublesome to acquire. We analyzed CPU usage by averaging the output of `top` every second during live migration. This is admittedly imprecise, but we feel that it accurately portrays the different degrees of CPU usage of varying compression methods. We present the latency, throughput, and CPU usage results for all of the compression methods, with varying numbers of multifd threads (4, 8, and 16). [0] The Silesia corpus can be accessed here: https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia ** Results: 4 multifd threads: |---------------|---------------|----------------|---------|---------| |method |time(sec) |throughput(mbps)|send cpu%|recv cpu%| |---------------|---------------|----------------|---------|---------| |qatzip |111.256 |916.03 | 29.08% | 51.90 | |---------------|---------------|----------------|---------|---------| |zlib |193.033 |562.16 |297.36% |237.84 | |---------------|---------------|----------------|---------|---------| |zstd |112.449 |920.67 |234.39% |157.57 | |---------------|---------------|----------------|---------|---------| |none |327.014 |933.41 | 9.50% | 25.28 | |---------------|---------------|----------------|---------|---------| 8 multifd threads: |---------------|---------------|----------------|---------|---------| |method |time(sec) |throughput(mbps)|send cpu%|recv cpu%| |---------------|---------------|----------------|---------|---------| |qatzip |111.349 |915.20 | 29.13% | 59.63 | |---------------|---------------|----------------|---------|---------| |zlib |149.378 |726.64 |516.24% |400.46 | |---------------|---------------|----------------|---------|---------| |zstd |111.942 |925.85 |345.75% |170.74 | |---------------|---------------|----------------|---------|---------| |none |327.417 |933.34 | 8.38% | 27.72 | |---------------|---------------|----------------|---------|---------| 16 multifd threads: |---------------|---------------|----------------|---------|---------| |method |time(sec) |throughput(mbps)|send cpu%|recv cpu%| |---------------|---------------|----------------|---------|---------| |qatzip |112.035 |908.96 | 29.93% | 63.83% | |---------------|---------------|----------------|---------|---------| |zlib |118.730 |912.94 |914.14% |621.59% | |---------------|---------------|----------------|---------|---------| |zstd |112.167 |924.78 |384.81% |171.54% | |---------------|---------------|----------------|---------|---------| |none |327.728 |932.08 | 9.31% | 29.89% | |---------------|---------------|----------------|---------|---------| ** Observations: Latency: In our test setting, live migration is mostly network-constrained, so compression performs relatively well in general. `qatzip` particularly shows a significant improvement over `zlib` with limited threads. With 4 multifd threads, `qatzip` shows a ~42% decrease in latency over `zlib`. In all scenarios, `qatzip` shows comparable performance with `zstd`. Throughput: In all scenarios, nearly all compression methods reach nearly the entire network throughput of 1Gbps except for `zlib`, which appears to be CPU-bound with 4 and 8 threads, but reaches comparable throughput performance with the other methods at 16 threads. CPU usage: In all scenarios, `qatzip` consumes a fraction of the CPU usage that `zlib` and `zstd` use. In the most limited case, with 4 multifd threads, `qatzip`'s sender CPU usage is ~10% that of `zlib`, and ~12% that of `zstd`, and its receiver CPU usage is ~22% that of `zlib`, and ~33% that of `zstd`. The magnitude of these savings increases as we increase to 8 and 16 threads. * Future work: - Comparing QAT offloading against other compression methods in environments that are not as network-constrained. - Combining compression offloading with offloading using other Intel accelerators (e.g. using Intel's Data Streaming Accelerator to offload zero page checking, which is part of another related patchset currently under discussion, and to offload `memcpy()` operations on the receiver side). - Reworking multifd logic to pipeline live migration work to improve device saturation. * Testing: This patchset adds an integration test for the new `qatzip` multifd compression method. * Patchset: This patchset was generated on top of commit 7425b627. Bryan Zhang (5): meson: Introduce 'qatzip' feature to the build system. migration: Add compression level parameter for QATzip migration: Introduce unimplemented 'qatzip' compression method migration: Implement 'qatzip' methods using QAT migration: Add integration test for 'qatzip' compression method hw/core/qdev-properties-system.c | 6 +- meson.build | 10 + meson_options.txt | 2 + migration/meson.build | 1 + migration/migration-hmp-cmds.c | 4 + migration/multifd-qatzip.c | 369 +++++++++++++++++++++++++++++++ migration/multifd.h | 1 + migration/options.c | 27 +++ migration/options.h | 1 + qapi/migration.json | 24 +- scripts/meson-buildoptions.sh | 3 + tests/qtest/meson.build | 4 + tests/qtest/migration-test.c | 37 ++++ 13 files changed, 486 insertions(+), 3 deletions(-) create mode 100644 migration/multifd-qatzip.c -- 2.30.2