RE: [PATCH v2 0/5] *** Implement using Intel QAT to offload ZLIB

Posted by Liu, Yuan1
> -----Original Message-----
> From: Bryan Zhang <bryan.zhang@bytedance.com>
> Sent: Wednesday, March 27, 2024 6:42 AM
> To: qemu-devel@nongnu.org
> Cc: peterx@redhat.com; farosas@suse.de; Liu, Yuan1 <yuan1.liu@intel.com>;
> berrange@redhat.com; Zou, Nanhai <nanhai.zou@intel.com>;
> hao.xiang@linux.dev; Bryan Zhang <bryan.zhang@bytedance.com>
> Subject: [PATCH v2 0/5] *** Implement using Intel QAT to offload ZLIB
> 
> v2:
> - Rebase changes on top of recent multifd code changes.
> - Use QATzip API 'qzMalloc' and 'qzFree' to allocate QAT buffers.
> - Remove parameter tuning and use QATzip's defaults for better
>   performance.
> - Add parameter to enable QAT software fallback.
> 
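(As an aside for readers new to QATzip: a minimal sketch of the
qzMalloc/qzFree pattern mentioned above, with software fallback enabled
at init. This is an illustration rather than the patch code; the buffer
length, NUMA node, and pinned-memory arguments are placeholder
assumptions.)

    #include <qatzip.h>

    /* Sketch only, not the patch code. sw_backup = 1 asks QATzip to
     * fall back to software compression when QAT hardware is absent. */
    QzSession_T sess = {0};
    if (qzInit(&sess, 1) != QZ_OK ||
        qzSetupSession(&sess, NULL) != QZ_OK) { /* NULL = default params */
        /* handle setup failure */
    }

    /* qzMalloc() returns memory the QAT device can DMA to/from; the
     * trailing arguments (NUMA node, pinned) are illustrative values. */
    size_t buf_len = 512 * 1024;                /* placeholder size */
    unsigned char *qat_buf = qzMalloc(buf_len, 0, 1);
    if (!qat_buf) {
        /* handle allocation failure */
    }
    /* ... use qat_buf as the compression input/output buffer ... */
    qzFree(qat_buf);
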
> v1:
> https://lists.nongnu.org/archive/html/qemu-devel/2023-12/msg03761.html
> 
> * Performance
> 
> We present updated performance results. For circumstantial reasons, v1
> presented performance on a low-bandwidth (1Gbps) network.
> 
> Here, we present updated results with a similar setup as before but with
> two main differences:
> 
> 1. Our machines have a ~50Gbps connection, tested using 'iperf3'.
> 2. We had a bug in our memory allocation causing us to only use ~1/2 of
> the VM's RAM. Now we properly allocate and fill nearly all of the VM's
> RAM.
> 
> Thus, the test setup is as follows:
> 
> We perform multifd live migration over TCP using a VM with 64GB memory.
> We prepare the machine's memory by powering it on, allocating a large
> amount of memory (60GB) as a single buffer, and filling the buffer with
> the repeated contents of the Silesia corpus[0]. This is in lieu of a more
> realistic memory snapshot, which proved troublesome to acquire.
> 
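(The fill step can be approximated inside the guest with a short C
program along these lines; the file name "silesia.tar" and the 60 GB
size are assumptions taken from the description above.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t total = 60ULL << 30;      /* ~60 GB, as described above */
        unsigned char *mem = malloc(total);
        FILE *f = fopen("silesia.tar", "rb");
        if (!mem || !f) {
            return 1;
        }

        /* Read the corpus once... */
        fseek(f, 0, SEEK_END);
        size_t corpus_len = ftell(f);
        rewind(f);
        unsigned char *corpus = malloc(corpus_len);
        if (!corpus || fread(corpus, 1, corpus_len, f) != corpus_len) {
            return 1;
        }

        /* ...then tile it repeatedly across the whole buffer. */
        for (size_t off = 0; off < total; off += corpus_len) {
            size_t n = total - off < corpus_len ? total - off : corpus_len;
            memcpy(mem + off, corpus, n);
        }

        pause();   /* keep the pages resident during the migration */
        return 0;
    }
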
> We analyze CPU usage by averaging the output of 'top' every second
> during migration. This is admittedly imprecise, but we feel that it
> accurately portrays the different degrees of CPU usage of varying
> compression methods.
> 
> We present the latency, throughput, and CPU usage results for all of the
> compression methods, with varying numbers of multifd threads (4, 8, and
> 16).
> 
> [0] The Silesia corpus can be accessed here:
> https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> 
> ** Results
> 
> 4 multifd threads:
> 
>     |---------------|---------------|----------------|---------|---------|
>     |method         |time(sec)      |throughput(Mbps)|send cpu%|recv cpu%|
>     |---------------|---------------|----------------|---------|---------|
>     |qatzip         | 23.13         | 8749.94        |117.50   |186.49   |
>     |---------------|---------------|----------------|---------|---------|
>     |zlib           |254.35         |  771.87        |388.20   |144.40   |
>     |---------------|---------------|----------------|---------|---------|
>     |zstd           | 54.52         | 3442.59        |414.59   |149.77   |
>     |---------------|---------------|----------------|---------|---------|
>     |none           | 12.45         |43739.60        |159.71   |204.96   |
>     |---------------|---------------|----------------|---------|---------|
> 
> 8 multifd threads:
> 
>     |---------------|---------------|----------------|---------|---------|
>     |method         |time(sec)      |throughput(Mbps)|send cpu%|recv cpu%|
>     |---------------|---------------|----------------|---------|---------|
>     |qatzip         | 16.91         |12306.52        |186.37   |391.84   |
>     |---------------|---------------|----------------|---------|---------|
>     |zlib           |130.11         | 1508.89        |753.86   |289.35   |
>     |---------------|---------------|----------------|---------|---------|
>     |zstd           | 27.57         | 6823.23        |786.83   |303.80   |
>     |---------------|---------------|----------------|---------|---------|
>     |none           | 11.82         |46072.63        |163.74   |238.56   |
>     |---------------|---------------|----------------|---------|---------|
> 
> 16 multifd threads:
> 
>     |---------------|---------------|----------------|---------|---------|
>     |method         |time(sec)      |throughput(Mbps)|send cpu%|recv cpu%|
>     |---------------|---------------|----------------|---------|---------|
>     |qatzip         |18.64          |11044.52        | 573.61  |437.65   |
>     |---------------|---------------|----------------|---------|---------|
>     |zlib           |66.43          | 2955.79        |1469.68  |567.47   |
>     |---------------|---------------|----------------|---------|---------|
>     |zstd           |14.17          |13290.66        |1504.08  |615.33   |
>     |---------------|---------------|----------------|---------|---------|
>     |none           |16.82          |32363.26        | 180.74  |217.17   |
>     |---------------|---------------|----------------|---------|---------|
> 
> ** Observations

I'm a little confused about the CPU utilization on the destination side
for decompression: it looks as if the CPU is doing the decompression
rather than QAT. I checked the qzDecompress code, and it behaves the same
way as qzCompress: while a request is not yet completed, it tries to stay
in a sleep state as much as possible.

Maybe I am misunderstanding something, but I would expect QAT to help save
CPU resources in both compression and decompression.
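
For reference, the call in question is just the in-memory QATzip API; a
sketch, where the session, buffer names, and lengths are placeholders:

    /* On return, in_len and out_len are updated to the number of bytes
     * consumed and produced. Whether the wait for the hardware happens
     * by sleeping or by busy-polling is internal to QATzip. */
    unsigned int in_len  = comp_len;   /* compressed bytes available */
    unsigned int out_len = out_cap;    /* capacity of output buffer  */
    int rc = qzDecompress(&sess, comp_buf, &in_len, out_buf, &out_len);
    if (rc != QZ_OK) {
        /* error: fall back to software or abort */
    }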

Thank you very much for providing this version. I will set up an
environment based on your patch set and test the performance.
 
> - In general, not using compression outperforms using compression in a
>   non-network-bound environment.
> - 'qatzip' outperforms other compression workers with 4 and 8 workers,
>   achieving a ~91% latency reduction over 'zlib' with 4 workers, and a
>   ~58% latency reduction over 'zstd' with 4 workers.
> - 'qatzip' maintains comparable performance with 'zstd' at 16 workers,
>   showing a ~32% increase in latency. This performance difference
>   becomes more noticeable with more workers, as CPU compression is
>   highly parallelizable.
> - 'qatzip' compression uses considerably less CPU than other compression
>   methods. At 8 workers, 'qatzip' demonstrates a ~75% reduction in
>   compression CPU usage compared to 'zstd' and 'zlib'.
> - 'qatzip' decompression CPU usage is less impressive, and is even
>   slightly worse than 'zstd' and 'zlib' CPU usage at 4 and 16 workers.
> 
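As a quick cross-check, the quoted percentages follow directly from the
tables above:

    zlib,  4 workers: (254.35 - 23.13) / 254.35 ~= 91% less latency
    zstd,  4 workers: ( 54.52 - 23.13) /  54.52 ~= 58% less latency
    zstd, 16 workers: ( 18.64 - 14.17) /  14.17 ~= 32% more latency
    zlib,  8 workers: (753.86 - 186.37) / 753.86 ~= 75% less send CPU
    zstd,  8 workers: (786.83 - 186.37) / 786.83 ~= 76% less send CPU
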
> Bryan Zhang (5):
>   meson: Introduce 'qatzip' feature to the build system
>   migration: Add migration parameters for QATzip
>   migration: Introduce unimplemented 'qatzip' compression method
>   migration: Implement 'qatzip' methods using QAT
>   tests/migration: Add integration test for 'qatzip' compression method
> 
>  hw/core/qdev-properties-system.c |   6 +-
>  meson.build                      |  10 +
>  meson_options.txt                |   2 +
>  migration/meson.build            |   1 +
>  migration/migration-hmp-cmds.c   |   8 +
>  migration/multifd-qatzip.c       | 382 +++++++++++++++++++++++++++++++
>  migration/multifd.h              |   1 +
>  migration/options.c              |  57 +++++
>  migration/options.h              |   2 +
>  qapi/migration.json              |  40 +++-
>  scripts/meson-buildoptions.sh    |   3 +
>  tests/qtest/meson.build          |   4 +
>  tests/qtest/migration-test.c     |  35 +++
>  13 files changed, 549 insertions(+), 2 deletions(-)
>  create mode 100644 migration/multifd-qatzip.c
> 
> --
> 2.30.2