> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:46 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
>
> On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, March 27, 2024 4:30 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > >
> > > Hi, Yuan,
> > >
> > > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > > 1. QPL will be used as an independent compression method like ZLIB and ZSTD;
> > > >    QPL will force the use of the IAA accelerator and will not support
> > > >    software compression. For a summary of issues with Zlib compatibility,
> > > >    please refer to docs/devel/migration/qpl-compression.rst
> > >
> > > IIRC our previous discussion was that we should provide a software fallback
> > > for the new QEMU paths, right?  Why did the decision change?  Again, such a
> > > fallback can help us make sure qpl won't get broken easily by other changes.
> >
> > Hi Peter
> >
> > Previously your suggestion was as below:
> > https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
> >
> > Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> > that might be used; again, qpl enforces HW support).
> > Compression accelerators: auto, none, qat (only applies when zlib/zstd
> > chosen above)
> >
> > Maybe I misunderstood here; what you mean is that if the IAA hardware is
> > unavailable, it will fall back to the software path. This does not need to
> > be specified through live migration parameters, and QPL will automatically
> > determine whether to use the software or hardware path during
> > initialization, is that right?
>
> I think there are two questions.
>
> Firstly, we definitely want the qpl compressor to be able to run without
> any hardware support.  As I mentioned above, I think that's the only way
> that qpl code can always get covered by the CI, as CI hosts normally don't
> have such modern hardware.
>
> I think it also means that in the last test patch, instead of detecting
> /dev/iax, we should unconditionally run the qpl test as long as it is
> compiled in, because it should just fall back to the software path when
> the HW is not available.
>
> The second question is whether we'll want a new "compression accelerator"
> parameter; fundamentally its only use case is to enforce software fallback
> even if hardware exists.  I don't remember whether others had any opinion
> before, but to me it's good to have, however no strong opinion.  It's less
> important compared to the other question on CI coverage.

Yes, I will support software fallback to ensure CI testing and that users
can still use qpl compression without IAA hardware.

Although the qpl software solution will have better performance than zlib,
I still don't think it has a greater advantage than zstd, and I don't think
there is a need to add a migration option to configure the qpl software or
hardware path. So I will still only use QPL as an independent compression
method in the next version, and no other migration options are needed.

I will also add a guide to qpl-compression.rst about IAA permission issues
and how to determine whether the hardware path is available.

> > > > 2. Compression accelerator related patches are removed from this patch
> > > >    set and will be added to the QAT patch set; we will submit separate
> > > >    patches to use QAT to accelerate ZLIB and ZSTD.
> > > >
> > > > 3. Advantages of using the IAA accelerator include:
> > > >    a. Compared with the non-compression method, it can improve downtime
> > > >       performance without adding additional host resources (both CPU
> > > >       and network).
> > > >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> > > >       can provide a high data compression ratio and save a lot of CPU
> > > >       resources used for compression.
> > > >
> > > > Test conditions:
> > > >   1. Host CPUs are based on Sapphire Rapids
> > > >   2. VM type: 16 vCPUs and 64G memory
> > > >   3. The source and destination each use 4 IAA devices.
> > > >   4. The workload in the VM
> > > >      a. all vCPUs are in the idle state
> > > >      b. 90% of the virtual machine's memory is used, using silesia to
> > > >         fill the memory.
> > > >         An introduction to silesia:
> > > >         https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >   5. Set the "--mem-prealloc" boot parameter on the destination; this
> > > >      parameter can make IAA performance better, and a related
> > > >      introduction is added in docs/devel/migration/qpl-compression.rst
> > > >   6. Source migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      d. migrate_set_parameter max-bandwidth 100G/1G
> > > >      e. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >   7. Destination migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >
> > > > Early migration results, each result is the average of three tests
> > > >
> > > >  +--------+-------------+--------+--------+---------+----------+------+
> > > >  |        | The number  | total  |downtime| network |pages per | CPU  |
> > > >  | None   | of channels |time(ms)| (ms)   |bandwidth| second   | Util |
> > > >  | Comp   |             |        |        | (mbps)  |          |      |
> > > >  |        +-------------+--------+--------+---------+----------+------+
> > > >  |Network |           2 |   8571 |     69 |   58391 |  1896525 | 256% |
> > >
> > > Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> > > with only 2 channels.  My previous experience is around ~1XGbps per
> > > channel, so no more than 30Gbps for two channels.  Is it because of a
> > > faster processor?  Indeed from the 4/8 results it doesn't look like
> > > increasing the number of channels helped a lot, and it even got worse
> > > on the downtime.
> >
> > Yes, I used iperf3 to check the bandwidth for one core; the bandwidth is
> > 60Gbps.
> >  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> >  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> >  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> >
> > And in the live migration test, a multifd thread's CPU utilization is
> > almost 100%
>
> This 60Gbps per-channel is definitely impressive..
>
> Have you tried migration without multifd on your system?  Would that also
> perform similarly vs. 2 channels multifd?

Simple test results below:
VM type: 16 vCPUs, 64G memory
Workload in VM: fill 56G memory with Silesia data; vCPUs are idle
Migration configurations:
1. migrate_set_parameter max-bandwidth 100G
2. migrate_set_parameter downtime-limit 300
3. migrate_set_capability multifd on (multiFD test case)
4. migrate_set_parameter multifd-channels 2 (multiFD test case)

                 Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
Without multifd            23580            307              21221             689588
Multifd 2                   7657            198              65410            2221176

> The whole point of multifd is to scale on bandwidth.  If a single thread
> can already achieve 60Gbps (where in my previous memory of tests, multifd
> could only reach ~70Gbps before..), then either multifd will be less
> useful with the new hardware (especially with a most generic socket nocomp
> setup), or we need to start working on the bottlenecks of multifd to make
> it scale better.  Otherwise multifd will become a pool for compressor
> loads only.
>
> > > What is the rationale behind "downtime improvement" when with the QPL
> > > compressors?  IIUC in this 100Gbps case the bandwidth is never a
> > > limitation, so I don't understand why adding the compression phase can
> > > make the switchover faster.  I can expect many more pages sent in a
> > > NIC-limited env like you described below with 1Gbps, but not when the
> > > NIC has unlimited resources like here.
> >
> > The compression can reduce the network stack overhead (it does not help
> > the RDMA solution); the less data, the smaller the overhead in the
> > network protocol stack. If compression has no overhead and network
> > bandwidth is not limited, the last memory copy is faster with
> > compression.
> >
> > The migration hotspot focuses on _sys_sendmsg:
> > _sys_sendmsg
> >   |- tcp_sendmsg
> >      |- copy_user_enhanced_fast_string
> >      |- tcp_push_one
>
> Makes sense.  I assume that's logical indeed when the compression ratio
> is high enough, meanwhile if the compression work is fast enough to cost
> much less than sending the extra data without it.
>
> Thanks,
>
> --
> Peter Xu
On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> Yes, I will support software fallback to ensure CI testing and users can
> still use qpl compression without IAA hardware.
>
> Although the qpl software solution will have better performance than zlib,
> I still don't think it has a greater advantage than zstd. I don't think
> there is a need to add a migration option to configure the qpl software or
> hardware path. So I will still only use QPL as an independent compression
> method in the next version, and no other migration options are needed.

That should be fine.

> I will also add a guide to qpl-compression.rst about IAA permission issues
> and how to determine whether the hardware path is available.

OK.

[...]

> > > Yes, I used iperf3 to check the bandwidth for one core; the bandwidth
> > > is 60Gbps.
> > >  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > >  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > >  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > >
> > > And in the live migration test, a multifd thread's CPU utilization is
> > > almost 100%
> >
> > This 60Gbps per-channel is definitely impressive..
> >
> > Have you tried migration without multifd on your system?  Would that
> > also perform similarly vs. 2 channels multifd?
>
> Simple test results below:
> VM type: 16 vCPUs, 64G memory
> Workload in VM: fill 56G memory with Silesia data; vCPUs are idle
> Migration configurations:
> 1. migrate_set_parameter max-bandwidth 100G
> 2. migrate_set_parameter downtime-limit 300
> 3. migrate_set_capability multifd on (multiFD test case)
> 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
>
>                  Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
> Without multifd            23580            307              21221             689588
> Multifd 2                   7657            198              65410            2221176

Thanks for the test results.

So I am guessing the migration overheads besides pushing the socket are
high enough to make it drop drastically, even if in this case zero
detection shouldn't play a major role considering most of the guest mem
is pre-filled.

--
Peter Xu
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 11:22 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
>
> On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> > Yes, I will support software fallback to ensure CI testing and users can
> > still use qpl compression without IAA hardware.
> >
> > Although the qpl software solution will have better performance than
> > zlib, I still don't think it has a greater advantage than zstd. I don't
> > think there is a need to add a migration option to configure the qpl
> > software or hardware path. So I will still only use QPL as an
> > independent compression method in the next version, and no other
> > migration options are needed.
>
> That should be fine.
>
> > I will also add a guide to qpl-compression.rst about IAA permission
> > issues and how to determine whether the hardware path is available.
>
> OK.
>
> [...]
>
> > > > Yes, I used iperf3 to check the bandwidth for one core; the
> > > > bandwidth is 60Gbps.
> > > >  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > >  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > >  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > > >
> > > > And in the live migration test, a multifd thread's CPU utilization
> > > > is almost 100%
> > >
> > > This 60Gbps per-channel is definitely impressive..
> > >
> > > Have you tried migration without multifd on your system?  Would that
> > > also perform similarly vs. 2 channels multifd?
> >
> > Simple test results below:
> > VM type: 16 vCPUs, 64G memory
> > Workload in VM: fill 56G memory with Silesia data; vCPUs are idle
> > Migration configurations:
> > 1. migrate_set_parameter max-bandwidth 100G
> > 2. migrate_set_parameter downtime-limit 300
> > 3. migrate_set_capability multifd on (multiFD test case)
> > 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> >
> >                  Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
> > Without multifd            23580            307              21221             689588
> > Multifd 2                   7657            198              65410            2221176
>
> Thanks for the test results.
>
> So I am guessing the migration overheads besides pushing the socket are
> high enough to make it drop drastically, even if in this case zero
> detection shouldn't play a major role considering most of the guest mem
> is pre-filled.

Yes, for no-multifd migration, besides the network stack overhead, the zero
page detection overhead (on both the source and destination) is indeed very
high. Placing the zero page detection in multiple threads can reduce the
performance degradation caused by that overhead.

I also think migration doesn't need to detect zero pages by memcmp in all
cases. Zero page detection mainly benefits VMs whose memory is known to
contain a large number of zero pages.

My experience in this area may be insufficient; I am working with Hao and
Bryan to see if it is possible to use DSA hardware to accelerate this part
(including zero page detection and writing zero pages). DSA is an
accelerator for detecting, writing, and comparing memory:
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf