* Overview:
This patchset adds support for offloading ZLIB compression and decompression to Intel's QAT accelerator in the multifd live migration path.
* Background:
Intel's 4th generation Xeon processors support QuickAssist Technology (QAT), a hardware accelerator for cryptography and compression operations.
Intel has also released a software library, QATzip, that interacts with QAT and exposes an API for QAT-accelerated ZLIB compression and decompression.
This patchset introduces a new multifd compression method, `qatzip`, which uses QATzip to perform ZLIB compression and decompression.
* Implementation:
The bulk of this patchset is in `migration/multifd-qatzip.c`, which mirrors the other compression implementation files, `migration/multifd-zlib.c` and `migration/multifd-zstd.c`, by providing an implementation of the multifd send/recv methods using the API exposed by QATzip. This is fairly straightforward, as the multifd setup/prepare/teardown methods align closely with QATzip's methods for initialization/(de)compression/teardown.
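As a rough illustration (not the patchset's exact code), the per-channel session lifecycle maps onto the classic QATzip API along the lines of the sketch below; the helper names, parameter handling, and error handling are simplified assumptions.

    /* Minimal sketch, assuming the classic QATzip session API from <qatzip.h>;
     * helper names and error handling are illustrative only. */
    #include <qatzip.h>
    #include <stdbool.h>

    static bool qatzip_session_open(QzSession_T *sess, int comp_level)
    {
        QzSessionParams_T params;

        /* sw_backup = 1: allow software fallback if no QAT device is available */
        int ret = qzInit(sess, 1);
        if (ret != QZ_OK && ret != QZ_DUPLICATE) {
            return false;
        }
        if (qzGetDefaults(&params) != QZ_OK) {
            return false;
        }
        params.comp_lvl = comp_level;  /* driven by the new compression level parameter */
        return qzSetupSession(sess, &params) == QZ_OK;
    }

    static void qatzip_session_close(QzSession_T *sess)
    {
        qzTeardownSession(sess);
        qzClose(sess);
    }

In the patchset, the multifd setup and cleanup hooks perform this kind of per-channel initialization and teardown.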
The only major divergence from the other compression methods is that we use a non-streaming compression/decompression API, rather than streaming each page to the compression layer one at a time. This does not require any major code changes: by the time we call into the compression layer, we already have a batch of pages, so it is easy to copy them into a contiguous buffer. This decision is purely performance-based, as our initial QAT benchmarking showed that QATzip's non-streaming API outperforms its streaming API.
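Correspondingly, the batched, non-streaming send path looks roughly like the following sketch; the page-gathering helper, the fixed page size, and the buffer sizing are assumptions for illustration, not the exact code in `migration/multifd-qatzip.c`.

    /* Minimal sketch of the batched, non-streaming send path (illustrative only). */
    #include <qatzip.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096  /* assumed page size for illustration */

    static bool qatzip_compress_batch(QzSession_T *sess,
                                      uint8_t * const *pages, unsigned n_pages,
                                      uint8_t *in_buf,   /* capacity: n_pages * PAGE_SIZE */
                                      uint8_t *out_buf, unsigned *out_len)
    {
        unsigned in_len = n_pages * PAGE_SIZE;

        /* The batch of pages is already known, so gather it into one
         * contiguous input buffer... */
        for (unsigned i = 0; i < n_pages; i++) {
            memcpy(in_buf + (size_t)i * PAGE_SIZE, pages[i], PAGE_SIZE);
        }

        /* ...and compress the whole batch with a single non-streaming call.
         * *out_len must hold the capacity of out_buf on entry and is updated
         * to the compressed size on success; last = 1 ends the request. */
        return qzCompress(sess, in_buf, &in_len, out_buf, out_len, 1) == QZ_OK;
    }

The receive side mirrors this with a single qzDecompress() call over the received buffer.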
* Performance:
** Setup:
We use two Intel 4th generation Xeon servers for testing:
Architecture: x86_64
CPU(s): 192
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8457C
Stepping: 8
CPU MHz: 2538.624
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
Each server has two QAT devices, and the network bandwidth between the two servers is 1Gbps.
We perform multifd live migration over TCP using a VM with 64GB of memory. We prepare the VM's memory by powering it on, allocating a single 63GB buffer, and filling the buffer with repeated copies of the Silesia corpus[0]. This stands in for a more realistic memory snapshot, which proved troublesome to acquire.
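The memory fill is conceptually equivalent to a small guest-side program like the one below; the file name, exact sizes, and structure are illustrative assumptions, not the exact tool used.

    /* Illustrative guest-side memory filler (a sketch, not the exact tool used):
     * allocate one large buffer and tile it with a concatenated Silesia corpus
     * file until the buffer is full. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t buf_size = 63ULL << 30;          /* ~63GB buffer */
        char *buf = malloc(buf_size);
        FILE *f = fopen("silesia.tar", "rb");         /* hypothetical corpus file */

        if (!buf || !f) {
            return 1;
        }

        size_t corpus_size = fread(buf, 1, buf_size, f);
        fclose(f);
        if (corpus_size == 0) {
            return 1;
        }

        /* Repeat the corpus contents until the whole buffer is populated. */
        for (size_t off = corpus_size; off < buf_size; off += corpus_size) {
            size_t n = buf_size - off < corpus_size ? buf_size - off : corpus_size;
            memcpy(buf + off, buf, n);
        }

        pause();  /* keep the memory resident while the migration runs */
        return 0;
    }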
We analyze CPU usage by averaging the output of `top` sampled every second during live migration. This is admittedly imprecise, but we feel it accurately portrays the relative CPU usage of the different compression methods.
We present the latency, throughput, and CPU usage results for all of the compression methods, with varying numbers of multifd threads (4, 8, and 16).
[0] The Silesia corpus can be accessed here: https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
** Results:
4 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |111.256        |916.03          | 29.08%  | 51.90%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |193.033        |562.16          |297.36%  |237.84%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |112.449        |920.67          |234.39%  |157.57%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.014        |933.41          |  9.50%  | 25.28%  |
|---------------|---------------|----------------|---------|---------|
8 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip         |111.349        |915.20          | 29.13%  | 59.63%  |
|---------------|---------------|----------------|---------|---------|
|zlib           |149.378        |726.64          |516.24%  |400.46%  |
|---------------|---------------|----------------|---------|---------|
|zstd           |111.942        |925.85          |345.75%  |170.74%  |
|---------------|---------------|----------------|---------|---------|
|none           |327.417        |933.34          |  8.38%  | 27.72%  |
|---------------|---------------|----------------|---------|---------|
16 multifd threads:
|---------------|---------------|----------------|---------|---------|
|method |time(sec) |throughput(mbps)|send cpu%|recv cpu%|
|---------------|---------------|----------------|---------|---------|
|qatzip |112.035 |908.96 | 29.93% | 63.83% |
|---------------|---------------|----------------|---------|---------|
|zlib |118.730 |912.94 |914.14% |621.59% |
|---------------|---------------|----------------|---------|---------|
|zstd |112.167 |924.78 |384.81% |171.54% |
|---------------|---------------|----------------|---------|---------|
|none |327.728 |932.08 | 9.31% | 29.89% |
|---------------|---------------|----------------|---------|---------|
** Observations:
Latency: In our test setting, live migration is mostly network-constrained, so compression performs relatively well in general. In particular, `qatzip` shows a significant improvement over `zlib` when threads are limited: with 4 multifd threads, `qatzip` reduces latency by ~42% relative to `zlib`. In all scenarios, `qatzip` performs comparably to `zstd`.
Throughput: In all scenarios, every method except `zlib` comes close to saturating the 1Gbps network link. `zlib` appears to be CPU-bound with 4 and 8 threads, but reaches throughput comparable to the other methods at 16 threads.
CPU usage: In all scenarios, `qatzip` consumes a fraction of the CPU that `zlib` and `zstd` do. In the most constrained case, with 4 multifd threads, `qatzip`'s sender CPU usage is ~10% that of `zlib` and ~12% that of `zstd`, and its receiver CPU usage is ~22% that of `zlib` and ~33% that of `zstd`. The magnitude of these savings increases as we scale to 8 and 16 threads.
* Future work:
- Comparing QAT offloading against other compression methods in environments that are not as network-constrained.
- Combining compression offloading with other Intel accelerators, e.g. using Intel's Data Streaming Accelerator (DSA) to offload zero-page checking (part of a related patchset currently under discussion) and to offload `memcpy()` operations on the receiver side.
- Reworking multifd logic to pipeline live migration work to improve device saturation.
* Testing:
This patchset adds an integration test for the new `qatzip` multifd compression method.
* Patchset:
This patchset was generated on top of commit 7425b627.
Bryan Zhang (5):
meson: Introduce 'qatzip' feature to the build system.
migration: Add compression level parameter for QATzip
migration: Introduce unimplemented 'qatzip' compression method
migration: Implement 'qatzip' methods using QAT
migration: Add integration test for 'qatzip' compression method
hw/core/qdev-properties-system.c | 6 +-
meson.build | 10 +
meson_options.txt | 2 +
migration/meson.build | 1 +
migration/migration-hmp-cmds.c | 4 +
migration/multifd-qatzip.c | 369 +++++++++++++++++++++++++++++++
migration/multifd.h | 1 +
migration/options.c | 27 +++
migration/options.h | 1 +
qapi/migration.json | 24 +-
scripts/meson-buildoptions.sh | 3 +
tests/qtest/meson.build | 4 +
tests/qtest/migration-test.c | 37 ++++
13 files changed, 486 insertions(+), 3 deletions(-)
create mode 100644 migration/multifd-qatzip.c
--
2.30.2