> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Tuesday, January 30, 2024 6:32 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
>
> On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Monday, January 29, 2024 6:43 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > > Compression
> > >
> > > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > > Hi,
> > >
> > > Hi, Yuan,
> > >
> > > I have a few comments and questions. Many of them can be pure
> > > questions as I don't know enough on these new technologies.
> > >
> > > >
> > > > I am writing to submit a code change aimed at enhancing live
> > > > migration acceleration by leveraging the compression capability of
> > > > the Intel In-Memory Analytics Accelerator (IAA).
> > > >
> > > > The implementation of the IAA (de)compression code is based on
> > > > Intel Query Processing Library (QPL), an open-source software
> > > > project designed for IAA high-level software programming.
> > > > https://github.com/intel/qpl
> > > >
> > > > In the last version, there was some discussion about whether to
> > > > introduce a new compression algorithm for IAA. Because the
> > > > compression algorithm of IAA hardware is based on deflate, and QPL
> > > > already supports Zlib, in this version I implemented IAA as an
> > > > accelerator for the Zlib compression method. However, for several
> > > > reasons, QPL is currently not compatible with the existing Zlib
> > > > method: data compressed by Zlib cannot be decompressed by QPL, and
> > > > vice versa.
> > > >
> > > > I have some concerns about the existing Zlib compression:
> > > > 1. Will you consider supporting one channel with multi-stream
> > > >    compression? This may reduce the compression ratio, but it
> > > >    allows the hardware to process each stream concurrently. We can
> > > >    have each stream process multiple pages to limit the loss of
> > > >    compression ratio; for example, 128 pages are divided into 16
> > > >    streams for independent compression. I will provide early
> > > >    performance data in the next version (v4).
> > >
> > > I think Juan used to ask a similar question: how much can this help
> > > if multifd can already achieve some form of concurrency over the
> > > pages?
> >
> > > Couldn't the user specify more multifd channels if they want to
> > > grant more CPU resources for comp/decomp purposes?
> > >
> > > IOW, how many concurrent channels can QPL provide? What is the
> > > suggested number of concurrent channels?
> >
> > From the QPL software side, there is no limit on the number of
> > concurrent compression and decompression tasks.
> > From the IAA hardware side, one IAA physical device can process two
> > compression tasks or eight decompression tasks concurrently. There are
> > up to 8 IAA devices on an Intel SPR server; the actual number varies
> > with the customer's product selection and deployment.
> >
> > Regarding the required number of concurrent channels, I don't think
> > this will be a bottleneck. Please allow me to explain a little more here:
> >
> > 1. If the compression design is based on Zlib/Deflate/Gzip streaming
> > mode, then we indeed need more channels to maintain concurrency,
> > because each multifd packet (containing 128 independent pages) is
> > compressed page by page, and those 128 pages are not processed
> > concurrently. The concurrency comes only from running multiple multifd
> > channels.
>
> Right. However, since you said there is a max of 8 IAA devices, would
> it also mean n_multifd_threads=8 is a good enough scenario to achieve
> proper concurrency, no matter the size of the data chunk for one
> compression request?
>
> Maybe you meant each device can still process concurrent compression
> requests, so the real concurrency capability can be much larger than 8?
Yes, the number of concurrent requests can be greater than 8; one device can
handle 2 compression requests or 8 decompression requests concurrently.
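So on a fully populated SPR server with 8 IAA devices, up to 8 * 2 = 16
compression jobs or 8 * 8 = 64 decompression jobs can be in flight at once.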
> >
> > 2. Through testing, we prefer concurrent processing of 4K pages rather
> > than whole multifd packets, which means the 128 pages belonging to one
> > packet can be compressed/decompressed concurrently. Even one channel
> > can then utilize all the resources of IAA. But this is not compatible
> > with existing zlib. The code is similar to the following:
> >
> > for (int i = 0; i < num_pages; i++) {
> >     job[i]->input_data = pages[i];
> >     submit_job(job[i]); /* non-blocking submit of a comp/decomp task */
> > }
> > for (int i = 0; i < num_pages; i++) {
> >     wait_job(job[i]); /* busy polling; in the future, this part and
> >                          the data sending will be pipelined */
> > }
>
> Right, if more concurrency is wanted, you can use this async model; I
> think Juan suggested something similar, and I agree it will also work. It
> can be done on top of the basic functionality once merged.
Sure, I think we can show better performance based on it.
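To make the async model concrete, below is a minimal sketch of per-page
compression using QPL's public C API (job allocation/initialization via
qpl_get_job_size()/qpl_init_job(), output-buffer sizing, and error handling
are omitted for brevity; variable names are illustrative only):

    #include <qpl/qpl.h>

    #define NUM_PAGES 128
    #define PAGE_SIZE 4096

    static void compress_pages(qpl_job *job[], uint8_t *page[],
                               uint8_t *out[], uint32_t out_size)
    {
        for (int i = 0; i < NUM_PAGES; i++) {
            job[i]->op            = qpl_op_compress;
            job[i]->level         = qpl_default_level;
            job[i]->next_in_ptr   = page[i];
            job[i]->available_in  = PAGE_SIZE;
            job[i]->next_out_ptr  = out[i];
            job[i]->available_out = out_size;
            /* Each page is compressed as an independent deflate stream. */
            job[i]->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST |
                            QPL_FLAG_OMIT_VERIFY;
            qpl_submit_job(job[i]);   /* non-blocking submit */
        }
        for (int i = 0; i < NUM_PAGES; i++) {
            qpl_wait_job(job[i]);     /* busy-poll until completion */
            /* job[i]->total_out now holds the compressed size of page i. */
        }
    }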
> > 3. Currently, the patches we provide to the community are based on
> > streaming compression, in order to be compatible with the current zlib
> > method. However, we found that there are still many problems with this,
> > so we plan to provide a new change in the next version: the independent
> > QPL/IAA acceleration function described above.
> > The compatibility issues include the following (see the zlib sketch
> > below):
> > 1. QPL currently does not support the z_sync_flush operation.
> > 2. The IAA comp/decomp window is fixed at 4K, while the default zlib
> >    window size is 32K, and the window size must be the same on both
> >    the comp and decomp sides.
> > 3. I also researched the QAT compression scheme: QATzip currently does
> >    not support zlib, nor does it support z_sync_flush, and its window
> >    size is 32K.
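For reference, forcing plain zlib down to IAA's 4K window, as item 2 above
requires, would look like the following (standard zlib API, shown only to
illustrate the window mismatch; not part of the patches):

    #include <zlib.h>

    z_stream s = {0};
    /* windowBits = 12 gives a 2^12 = 4K history window, matching IAA's
     * fixed window; zlib's default windowBits is 15, i.e. a 32K window. */
    int ret = deflateInit2(&s, Z_BEST_SPEED, Z_DEFLATED,
                           12 /* windowBits */, 8 /* memLevel */,
                           Z_DEFAULT_STRATEGY);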
> >
> > In general, I think it is a good suggestion to make the accelerator
> > compatible with standard compression algorithms, but also to let the
> > accelerator run independently, thus avoiding some of the accelerator's
> > compatibility and performance problems. For example, we can add an
> > "accel" option to the compression method, and then the user must
> > specify the same accelerator via a compression-accelerator parameter
> > on both the source and remote ends (just like specifying the same
> > compression algorithm).
> >
> > > >
> > > > 2. Will you consider using QPL/IAA as an independent compression
> > > > algorithm instead of an accelerator? In this way, we can better
> > > > utilize hardware performance and some features, such as IAA's
> > > > canned mode, in which a Huffman table is dynamically generated
> > > > from statistics of the data to improve the compression ratio.
> > >
> > > Maybe one more knob will work? If it's not compatible with the
> > > deflate algo, maybe it should never be the default. IOW, the
> > > accelerators may be extended into this (based on what you already
> > > proposed):
> > >
> > > - auto ("qpl" first, "none" second; never "qpl-optimized")
> > > - none (old zlib)
> > > - qpl (qpl compatible)
> > > - qpl-optimized (qpl incompatible)
> > >
> > > Then "auto"/"none"/"qpl" will always be compatible; only the last
> > > isn't. The user can select it explicitly, but only on both sides of
> > > QEMU.
> > Yes, this is what I want: I need a way in which QPL is not compatible
> > with zlib. From my current point of view, if zlib chooses raw deflate
> > mode, then QAT will be compatible with the current community's zlib
> > solution. So my suggestion is as follows:
> >
> > Compression method parameter:
> > - none
> > - zlib
> > - zstd
> > - accel (both QEMU sides need to explicitly select the same accelerator
> >   via the compression accelerator parameter)
>
> Can we avoid naming it as "accel"? It's too generic, IMHO.
>
> If it's a special algorithm that only applies to QPL, can we just call it
> "qpl" here? Then...
Yes, I agree.
> > Compression accelerator parameter:
> > - auto
> > - none
> > - qpl (qpl will not support zlib/zstd; it will report an error when
> >   zlib/zstd is selected)
> > - qat (it can provide acceleration of zlib/zstd)
>
> Here IMHO we don't need qpl then, because the "qpl" compression method
> can already enforce a hardware accelerator. In summary, not sure whether
> this works:
>
> Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> that might be used; again, qpl enforces HW support).
>
> Compression accelerators: auto, none, qat (only applies when zlib/zstd is
> chosen above)
I agree. QPL will dynamically detect IAA hardware resources and prioritize
hardware acceleration. If IAA is not available, QPL can also provide an
efficient deflate-based software compression algorithm, and the software
and hardware paths are fully compatible with each other.
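To illustrate, the source-side setup under this scheme would look like the
following (these parameter values are the proposed ones, not yet in QEMU):

    migrate_set_parameter multifd-compression zlib
    migrate_set_parameter multifd-compression-accel qat  # accelerate zlib/zstd

or, for the QPL-specific algorithm (QPL deciding between IAA and its
software path):

    migrate_set_parameter multifd-compression qpl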
> > > > Test conditions:
> > > > 1. Host CPUs are based on Sapphire Rapids, with frequency locked
> > > >    to 3.4 GHz
> > > > 2. VM type: 16 vCPUs and 64G memory
> > > > 3. The Idle workload means no workload is running in the VM
> > > > 4. The Redis workload means YCSB workload B + Redis server are
> > > >    running in the VM; about 20G or more memory will be used
> > > > 5. Source-side migration configuration commands:
> > > >    a. migrate_set_capability multifd on
> > > >    b. migrate_set_parameter multifd-channels 2/4/8
> > > >    c. migrate_set_parameter downtime-limit 300
> > > >    d. migrate_set_parameter multifd-compression zlib
> > > >    e. migrate_set_parameter multifd-compression-accel none/qpl
> > > >    f. migrate_set_parameter max-bandwidth 100G
> > > > 6. Destination-side migration configuration commands:
> > > >    a. migrate_set_capability multifd on
> > > >    b. migrate_set_parameter multifd-channels 2/4/8
> > > >    c. migrate_set_parameter multifd-compression zlib
> > > >    d. migrate_set_parameter multifd-compression-accel none/qpl
> > > >    e. migrate_set_parameter max-bandwidth 100G
> > >
> > > How is zlib-level set up? Default (1)?
> > Yes, we use level 1, the default level.
> >
> > > Btw, it seems both zlib/zstd levels cannot even be configured right
> > > now.. probably overlooked in migrate_params_apply().
> > Ok, I will check this.
>
> Thanks. If you plan to post a patch, please attach:
>
> Reported-by: Xiaohui Li <xiaohli@redhat.com>
>
> As that's reported by our QE team.
>
> Maybe you can already add a unit test (migration-test.c, under tests/)
> which should expose this issue, by setting z*-level to a non-1 value and
> then querying it back, asserting that the value did change.
Thanks for your suggestions, I will improve the test part of the code.
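Something like this rough sketch should already expose the issue (assuming
the migrate_set_parameter_int() helper that exists in migration-test.c,
which sets a parameter via QMP and immediately queries it back to assert
the new value took effect; exact helper names may differ):

    static void test_multifd_compression_levels(QTestState *from)
    {
        /* With the migrate_params_apply() bug, the queried values stay
         * at their defaults instead of 5, so the read-back assertions
         * inside the helper fail. */
        migrate_set_parameter_int(from, "multifd-zlib-level", 5);
        migrate_set_parameter_int(from, "multifd-zstd-level", 5);
    }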
> > > > Early migration results; each result is the average of three tests:
> > > > +--------+-------------+--------+--------+---------+----------+
> > > > |        | Number of   | Total  |Downtime| Network |Pages per |
> > > > |        | channels    |time(ms)| (ms)   |bandwidth| second   |
> > > > |        | and mode    |        |        | (mbps)  |          |
> > > > +--------+-------------+--------+--------+---------+----------+
> > > > |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > | Idle   | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > > > |workload+-------------+--------+--------+---------+----------+
> > > > |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > > > +--------+-------------+--------+--------+---------+----------+
> > >
> > > The numbers are slightly confusing to me. If IAA can send 3x more
> > > pages per second, shouldn't the total migration time be 1/3 of the
> > > other if the guest is idle? But the total times seem to be pretty
> > > close no matter the number of channels. Maybe I missed something?
> >
> > This data is the information read from "info migrate" after the live
> > migration status changes to "completed".
> > I think it shows the max throughput when the expected downtime and
> > available network bandwidth are met.
> > When vCPUs are idle, live migration does not run at maximum throughput
> > for very long.
> >
> > > > +--------+-------------+--------+--------+---------+----------+
> > > > |        | Number of   | Total  |Downtime| Network |Pages per |
> > > > |        | channels    |time(ms)| (ms)   |bandwidth| second   |
> > > > |        | and mode    |        |        | (mbps)  |          |
> > > > +--------+-------------+--------+--------+---------+----------+
> > > > |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > > > |workload+-------------+--------+--------+---------+----------+
> > > > |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > > > |        +-------------+--------+--------+---------+----------+
> > > > |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > > > +--------+-------------+--------+--------+---------+----------+
> > >
> > > The redis results look much more favorable for IAA compared to the
> > > idle tests. Does it mean that IAA works less well with zero pages
> > > in general (assuming they'll be the majority in the idle test)?
> > Neither the Idle nor the Redis data shows IAA's best performance, since
> > both are based on multifd packet streaming compression.
> > In the idle case, most pages are indeed zero pages, and compressing a
> > zero page is not as good as merely detecting it, so the compression
> > advantage is not reflected.
> >
> > > From the manual, I see that IAA also supports encryption/decryption.
> > > Would it be able to accelerate TLS?
> > On Sapphire Rapids (SPR) / Emerald Rapids (EMR) Xeon servers, IAA
> > cannot support encryption/decryption; this feature may be available in
> > future generations. For TLS acceleration, QAT supports this function on
> > SPR/EMR and has successful cases in some scenarios:
> > https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html
> >
> > > How should one consider IAA over QAT? What is the major difference?
> > > I see that IAA requires IOMMU scalable mode, why? Is it because the
> > > IAA HW is something attached to the PCIe bus (assume QAT is the same)?
> >
> > Regarding the difference between using IAA or QAT for compression:
> > 1. IAA is more suitable for 4K compression, while QAT is suitable for
> >    large-block data compression. This is determined by the deflate
> >    window size. QAT also supports more compression levels; IAA hardware
> >    supports one compression level.
> > 2. From the perspective of throughput, one IAA device supports a
> >    compression throughput of 4 GBps and a decompression throughput of
> >    30 GBps; one QAT device supports a compression or decompression
> >    throughput of 20 GBps.
> > 3. Depending on the product type selected by the customer and the
> >    deployment, the resources available for live migration will also
> >    differ.
> >
> > Regarding IOMMU scalable mode:
> > 1. The current IAA software stack requires Shared Virtual Memory (SVM)
> >    technology, and SVM depends on IOMMU scalable mode.
> > 2. Both IAA and QAT support the PCIe PASID capability, so IAA can
> >    support shared work queues.
> >    https://docs.kernel.org/next/x86/sva.html
>
> Thanks for all this information. I'm personally still curious why Intel
> provides two new technologies serving similar purposes in the same time
> window.
>
> Could you put much of this information into a doc file? It can be
> docs/devel/migration/QPL.rst.
Sure, I will update the documentation.
> Also, we may want a unit test to cover the new stuff when the whole
> design settles. It may cover all supported modes, but for sure we can
> skip the HW-accelerated use case.
For QPL, I think this is not a problem: QPL provides a software path, so
the new compression method can be tested even when hardware accelerators
are not available.
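For example, a unit test could force QPL's software path explicitly; a
small sketch using QPL's execution-path selection (error handling omitted):

    #include <qpl/qpl.h>
    #include <stdlib.h>

    uint32_t size = 0;
    qpl_get_job_size(qpl_path_software, &size);  /* pure-CPU deflate path */
    qpl_job *job = (qpl_job *)malloc(size);
    qpl_init_job(qpl_path_software, job);
    /* qpl_path_auto would instead pick IAA hardware when present and
     * fall back to the software path otherwise. */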