> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Tuesday, January 30, 2024 6:32 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> Compression
>
> On Tue, Jan 30, 2024 at 03:56:05AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Monday, January 29, 2024 6:43 PM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; leobras@redhat.com; qemu-devel@nongnu.org; Zou,
> > > Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v3 0/4] Live Migration Acceleration with IAA
> > > Compression
> > >
> > > On Wed, Jan 03, 2024 at 07:28:47PM +0800, Yuan Liu wrote:
> > > > Hi,
> > >
> > > Hi, Yuan,
> > >
> > > I have a few comments and questions.  Many of them can be pure
> > > questions as I don't know enough on these new technologies.
> > >
> > > > I am writing to submit a code change aimed at enhancing live
> > > > migration acceleration by leveraging the compression capability
> > > > of the Intel In-Memory Analytics Accelerator (IAA).
> > > >
> > > > The implementation of the IAA (de)compression code is based on
> > > > the Intel Query Processing Library (QPL), an open-source software
> > > > project designed for IAA high-level software programming.
> > > > https://github.com/intel/qpl
> > > >
> > > > In the last version, there was some discussion about whether to
> > > > introduce a new compression algorithm for IAA. Because the
> > > > compression algorithm of IAA hardware is based on deflate, and
> > > > QPL already supports Zlib, in this version I implemented IAA as
> > > > an accelerator for the Zlib compression method. However, for the
> > > > reasons described below, QPL is currently not interoperable with
> > > > the existing Zlib method: data compressed by Zlib cannot be
> > > > decompressed by QPL, and vice versa.
> > > >
> > > > I have some concerns about the existing Zlib compression
> > > >   1. Will you consider supporting one channel to support
> > > >      multi-stream compression? Of course, this may lead to a
> > > >      reduction in compression ratio, but it will allow the
> > > >      hardware to process each stream concurrently. We can have
> > > >      each stream process multiple pages, reducing the loss of
> > > >      compression ratio. For example, 128 pages are divided into
> > > >      16 streams for independent compression. I will provide early
> > > >      performance data in the next version (v4).
> > >
> > > I think Juan used to ask a similar question: how much can this help
> > > if multifd can already achieve some form of concurrency over the
> > > pages?
> > >
> > > Couldn't the user specify more multifd channels if they want to
> > > grant more cpu resource for comp/decomp purpose?
> > >
> > > IOW, how many concurrent channels can QPL provide?  What is the
> > > suggested number of concurrent channels there?
> >
> > From the QPL software side, there is no limit on the number of
> > concurrent compression and decompression tasks.
> > From the IAA hardware side, one IAA physical device can process two
> > compression tasks concurrently, or eight decompression tasks
> > concurrently. There are up to 8 IAA devices on an Intel SPR server;
> > the exact number varies with the customer's product selection and
> > deployment.
> >
> > Regarding the requirement for the number of concurrent channels, I
> > think this may not be a bottleneck problem.
> > Please allow me to introduce a little more here.
> >
> > 1. If the compression design is based on Zlib/Deflate/Gzip streaming
> > mode, then we indeed need more channels to maintain concurrent
> > processing, because each time a multifd packet is compressed (it
> > contains 128 independent pages), it has to be compressed page by
> > page. Those 128 pages are not processed concurrently; the concurrency
> > comes only from running multiple channels, each handling its own
> > multifd packet.
>
> Right.  However since you said there're only a max of 8 IAA devices,
> would it also mean n_multifd_threads=8 can be a good enough scenario
> to achieve proper concurrency, no matter the size of data chunk for
> one compression request?
>
> Maybe you meant each device can still process concurrent compression
> requests, so the real capability of concurrency can be much larger
> than 8?

Yes, the number of concurrent requests can be greater than 8: one
device can handle 2 compression requests or 8 decompression requests
concurrently, so 8 devices allow up to 16 concurrent compression jobs
and 64 concurrent decompression jobs.

> > 2. Through testing, we prefer concurrent processing on 4K pages, not
> > on the multifd packet, which means the 128 pages belonging to one
> > packet can be compressed/decompressed concurrently. Even one channel
> > can then utilize all the resources of IAA. But this is not compatible
> > with the existing zlib method. The code is similar to the following:
> >
> >     for (int i = 0; i < num_pages; i++) {
> >         job[i]->input_data = pages[i];
> >         /* non-blocking submission of a comp/decomp task */
> >         submit_job(job[i]);
> >     }
> >     for (int i = 0; i < num_pages; i++) {
> >         /* busy polling; in the future we will pipeline this part
> >          * with the data sending */
> >         wait_job(job[i]);
> >     }
>
> Right, if more concurrency is wanted, you can use this async model; I
> think Juan used to suggest such and I agree it will also work.  It can
> be done on top of the basic functionality merged.

Sure, I think we can show the better performance based on it.
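For reference, a minimal, untested sketch of that per-page asynchronous
loop with the public QPL C job API (function and field names as
documented at https://github.com/intel/qpl; flag selection, buffer
sizing, and error handling are simplified for illustration, and each
job[] is assumed to have been allocated via qpl_get_job_size() and
initialized via qpl_init_job() already):

    #include <qpl/qpl.h>

    /* Compress num_pages 4K pages concurrently: submit every job
     * without blocking, then wait for all of them to complete. */
    static int compress_pages_async(qpl_job *job[], uint8_t *pages[],
                                    uint8_t *out[], uint32_t out_size,
                                    int num_pages)
    {
        for (int i = 0; i < num_pages; i++) {
            job[i]->op            = qpl_op_compress;
            job[i]->next_in_ptr   = pages[i];
            job[i]->available_in  = 4096;
            job[i]->next_out_ptr  = out[i];
            job[i]->available_out = out_size;
            job[i]->level         = qpl_default_level;
            /* each page is an independent deflate stream */
            job[i]->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST |
                            QPL_FLAG_DYNAMIC_HUFFMAN |
                            QPL_FLAG_OMIT_VERIFY;
            if (qpl_submit_job(job[i]) != QPL_STS_OK) { /* non-blocking */
                return -1;
            }
        }
        for (int i = 0; i < num_pages; i++) {
            if (qpl_wait_job(job[i]) != QPL_STS_OK) {   /* busy-poll */
                return -1;
            }
            /* job[i]->total_out now holds page i's compressed size */
        }
        return 0;
    }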
> > 3. Currently, the patches we provide to the community are based on
> > streaming compression. This is to be compatible with the current zlib
> > method. However, we found that there are still many problems with
> > this, so we plan to provide a new change in the next version: the
> > independent QPL/IAA acceleration path described above.
> > Compatibility issues include the following:
> >   1. QPL currently does not support the z_sync_flush operation.
> >   2. The IAA comp/decomp window is fixed at 4K, while the default
> >      zlib window size is 32K, and the window size must be the same on
> >      both the comp and decomp sides.
> >   3. At the same time, I researched the QAT compression scheme.
> >      QATzip currently does not support zlib, nor does it support
> >      z_sync_flush; its window size is 32K.
> >
> > In general, I think it is a good suggestion to make the accelerator
> > compatible with standard compression algorithms, but also to let the
> > accelerator run independently, thus avoiding some compatibility and
> > performance problems of the accelerator. For example, we can add an
> > "accel" option to the compression method, and then the user must
> > specify the same accelerator via a compression accelerator parameter
> > on the source and destination ends (just like specifying the same
> > compression algorithm).
> >
> > > >   2. Will you consider using QPL/IAA as an independent
> > > >      compression algorithm instead of an accelerator? In this
> > > >      way, we can better utilize hardware performance and some
> > > >      features, such as IAA's canned mode, where a Huffman table
> > > >      can be dynamically generated from statistics of the data to
> > > >      improve the compression ratio.
> > >
> > > Maybe one more knob will work?  If it's not compatible with the
> > > deflate algo maybe it should never be the default.  IOW, the
> > > accelerators may be extended into this (based on what you already
> > > proposed):
> > >
> > >   - auto ("qpl" first, "none" second; never "qpl-optimized")
> > >   - none (old zlib)
> > >   - qpl (qpl compatible)
> > >   - qpl-optimized (qpl incompatible)
> > >
> > > Then "auto"/"none"/"qpl" will always be compatible, only the last
> > > doesn't; the user can select it explicitly, but only on both sides
> > > of QEMU.
> >
> > Yes, this is what I want: I need a mode in which QPL is not required
> > to be compatible with zlib. From my current point of view, if zlib
> > chooses raw deflate mode, then QAT will be compatible with the
> > current community's zlib solution.
> > So my suggestion is as follows:
> >
> > Compression method parameter
> >   - none
> >   - zlib
> >   - zstd
> >   - accel (both QEMU sides need to select the same accelerator from
> >     the compression accelerator parameter explicitly)
>
> Can we avoid naming it as "accel"?  It's too generic, IMHO.
>
> If it's a special algorithm that only applies to QPL, can we just call
> it "qpl" here?  Then...

Yes, I agree.

> > Compression accelerator parameter
> >   - auto
> >   - none
> >   - qpl (qpl will not support zlib/zstd; it will report an error
> >     when zlib/zstd is selected)
> >   - qat (it can provide acceleration of zlib/zstd)
>
> Here IMHO we don't need qpl then, because the "qpl" compression method
> can enforce a hardware accelerator.  In summary, not sure whether this
> works:
>
> Compression methods: none, zlib, zstd, qpl (describes all the
> algorithms that might be used; again, qpl enforces HW support).
>
> Compression accelerators: auto, none, qat (only applies when
> zlib/zstd chosen above)

I agree. QPL will dynamically detect IAA hardware resources and
prioritize hardware acceleration; if IAA is not available, QPL can also
provide an efficient deflate-based software compression path, and the
software and hardware paths are fully compatible with each other.
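For example, assuming the parameter names end up as proposed above (the
"qpl" method value and the "multifd-compression-accel" parameter are
still proposals from this series, not yet in upstream QEMU), both sides
would be configured like:

    # zlib algorithm, optionally offloaded to QAT:
    migrate_set_parameter multifd-compression zlib
    migrate_set_parameter multifd-compression-accel qat

    # or the QPL-specific algorithm, which enforces IAA hardware:
    migrate_set_parameter multifd-compression qpl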
> > > > Test condition:
> > > >   1. Host CPUs are based on Sapphire Rapids, frequency locked to
> > > >      3.4G
> > > >   2. VM type: 16 vCPUs and 64G memory
> > > >   3. The Idle workload means no workload is running in the VM
> > > >   4. The Redis workload means YCSB workloadb + Redis Server are
> > > >      running in the VM; about 20G or more memory will be used
> > > >   5. Source side migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      d. migrate_set_parameter multifd-compression zlib
> > > >      e. migrate_set_parameter multifd-compression-accel none/qpl
> > > >      f. migrate_set_parameter max-bandwidth 100G
> > > >   6. Destination side migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression zlib
> > > >      d. migrate_set_parameter multifd-compression-accel none/qpl
> > > >      e. migrate_set_parameter max-bandwidth 100G
> > >
> > > How is zlib-level setup?  Default (1)?
> >
> > Yes, use level 1, the default level.
> >
> > > Btw, it seems both zlib/zstd levels are not even working right now
> > > to be configured.. probably overlooked in migrate_params_apply().
> >
> > Ok, I will check this.
>
> Thanks.
>
> If you plan to post a patch, please attach:
>
>   Reported-by: Xiaohui Li <xiaohli@redhat.com>
>
> As that's reported by our QE team.
>
> Maybe you can already add a unit test (migration-test.c, under tests/)
> which should expose this issue already, by setting z*-level to non-1,
> then querying it back and asserting that the value did change.

Thanks for your suggestions, I will improve the test part of the code.
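A rough, untested sketch of such a check for
tests/qtest/migration-test.c (using the generic libqtest QMP helpers;
the harness that boots the QEMU instance under test, and the choice of
parameter value, are illustrative only):

    #include "libqtest.h"
    #include "qapi/qmp/qdict.h"

    /* Set multifd-zlib-level to a non-default value, query it back,
     * and assert that it actually changed; with the
     * migrate_params_apply() bug above, the queried value would
     * silently stay at the default. */
    static void check_multifd_zlib_level(QTestState *who)
    {
        QDict *rsp, *ret;

        qtest_qmp_assert_success(who,
            "{ 'execute': 'migrate-set-parameters',"
            "  'arguments': { 'multifd-zlib-level': 3 } }");

        rsp = qtest_qmp(who,
            "{ 'execute': 'query-migrate-parameters' }");
        ret = qdict_get_qdict(rsp, "return");
        g_assert_cmpint(qdict_get_int(ret, "multifd-zlib-level"),
                        ==, 3);
        qobject_unref(rsp);
    }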
> > > > Early migration result, each result is the average of three
> > > > tests:
> > > >
> > > >  +--------+-------------+--------+--------+---------+----------+
> > > >  |        | The number  |total   |downtime|network  |pages per |
> > > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > > >  |        | and mode    |        |        |(mbps)   |          |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 2 chl, Zlib | 20647  | 22     | 195     | 137767   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |  Idle  | 2 chl, IAA  | 17022  | 36     | 286     | 460289   |
> > > >  |workload+-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, Zlib | 18835  | 29     | 241     | 299028   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, IAA  | 16280  | 32     | 298     | 652456   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, Zlib | 17379  | 32     | 275     | 470591   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, IAA  | 15551  | 46     | 313     | 1315784  |
> > > >  +--------+-------------+--------+--------+---------+----------+
> > >
> > > The number is slightly confusing to me.  If IAA can send 3x times
> > > more pages per-second, shouldn't the total migration time be 1/3
> > > of the other if the guest is idle?  But the total times seem to be
> > > pretty close no matter N of channels.  Maybe I missed something?
> >
> > This data is the information read from "info migrate" after the
> > live migration status changes to "complete".
> > I think it shows the maximum throughput reached while the expected
> > downtime and available network bandwidth are met. When the vCPUs are
> > idle, live migration does not run at maximum throughput for very
> > long.
> >
> > > >  +--------+-------------+--------+--------+---------+----------+
> > > >  |        | The number  |total   |downtime|network  |pages per |
> > > >  |        | of channels |time(ms)|(ms)    |bandwidth|second    |
> > > >  |        | and mode    |        |        |(mbps)   |          |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 2 chl, Zlib | 100% failure, timeout is 120s        |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  | Redis  | 2 chl, IAA  | 62737  | 115    | 4547    | 387911   |
> > > >  |workload+-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, Zlib | 30% failure, timeout is 120s         |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 4 chl, IAA  | 54645  | 177    | 5382    | 656865   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, Zlib | 93488  | 74     | 1264    | 129486   |
> > > >  |        +-------------+--------+--------+---------+----------+
> > > >  |        | 8 chl, IAA  | 24367  | 303    | 6901    | 964380   |
> > > >  +--------+-------------+--------+--------+---------+----------+
> > >
> > > The redis results look much more preferred on using IAA comparing
> > > to the idle tests.  Does it mean that IAA works less good with
> > > zero pages in general (assuming that'll be the majority in idle
> > > test)?
> >
> > Neither the Idle nor the Redis data shows the best performance of
> > IAA, since both runs are based on multifd packet streaming
> > compression.
> > In the idle case, most pages are indeed zero pages, and compressing
> > a zero page is not as efficient as merely detecting it, so the
> > compression advantage is not reflected there.
> >
> > > From the manual, I see that IAA also supports
> > > encryption/decryption.  Would it be able to accelerate TLS?
> >
> > On Sapphire Rapids (SPR) / Emerald Rapids (EMR) Xeon servers, IAA
> > can't support encryption/decryption; this feature may be available
> > in future generations. For TLS acceleration, QAT supports this
> > function on SPR/EMR and has successful cases in some scenarios.
> > https://www.intel.cn/content/www/cn/zh/developer/articles/guide/nginx-https-with-qat-tuning-guide.html
> >
> > > How should one consider IAA over QAT?  What is the major
> > > difference?  I see that IAA requires IOMMU scalable mode, why?  Is
> > > it because the IAA HW is something attached to the pcie bus
> > > (assume QAT the same)?
> >
> > Regarding the difference between using IAA or QAT for compression:
> > 1. IAA is more suitable for 4K compression, while QAT is suitable
> >    for large-block data compression. This is determined by the
> >    deflate window size. QAT can also support more compression
> >    levels, whereas IAA hardware supports a single compression level.
> > 2. From the perspective of throughput, one IAA device supports a
> >    compression throughput of 4GB/s and a decompression throughput of
> >    30GB/s; one QAT device supports a compression or decompression
> >    throughput of 20GB/s.
> > 3. Depending on the product type selected by the customer and the
> >    deployment, the resources available for live migration will also
> >    differ.
> >
> > Regarding the IOMMU scalable mode:
> > 1. The current IAA software stack requires Shared Virtual Memory
> >    (SVM) technology, and SVM depends on IOMMU scalable mode.
> > 2. Both IAA and QAT support the PCIe PASID capability, so IAA can
> >    support a shared work queue.
> > https://docs.kernel.org/next/x86/sva.html
>
> Thanks for all this information.  I'm personally still curious why
> Intel would like to provide two new technologies to serve similar
> purposes merely at the same time window.
>
> Could you put much of this information into a doc file?  It can be
> docs/devel/migration/QPL.rst.

Sure, I will update the documentation.

> Also, we may want a unit test to cover the new stuff when the whole
> design settles.  It may cover all modes supported, but for sure we can
> skip the hw accelerated use case.

For QPL, I think this is not a problem. QPL is used as a new
compression method, and it can still be used when hardware accelerators
are not available.
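To illustrate that point (an untested sketch; the enum values and
functions are from the public QPL headers, while the fallback policy
itself is only an assumption about how the QEMU code might handle it),
QPL selects the execution path at job-initialization time, so a test
can fall back to the pure-software path when no IAA device is present:

    #include <qpl/qpl.h>
    #include <stdlib.h>

    /* Allocate and initialize a QPL job, preferring the IAA hardware
     * path but falling back to the software path when no accelerator
     * is available.  Returns NULL on failure. */
    static qpl_job *init_qpl_job(void)
    {
        qpl_path_t path = qpl_path_hardware;
        uint32_t size = 0;
        qpl_job *job;

        qpl_get_job_size(path, &size);
        job = malloc(size);
        if (!job) {
            return NULL;
        }
        if (qpl_init_job(path, job) != QPL_STS_OK) {
            /* no usable IAA device: retry with the software path,
             * whose job size may differ */
            free(job);
            path = qpl_path_software;
            qpl_get_job_size(path, &size);
            job = malloc(size);
            if (!job || qpl_init_job(path, job) != QPL_STS_OK) {
                free(job);
                return NULL;
            }
        }
        return job;
    }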