> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:46 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
>
> On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, March 27, 2024 4:30 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > >
> > > Hi, Yuan,
> > >
> > > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > > 1. QPL will be used as an independent compression method like ZLIB and
> > > >    ZSTD, QPL will force the use of the IAA accelerator and will not
> > > >    support software compression. For a summary of issues compatible with
> > > >    Zlib, please refer to docs/devel/migration/qpl-compression.rst
> > >
> > > IIRC our previous discussion was that we should provide a software fallback
> > > for the new QEMU paths, right? Why did the decision change? Again, such a
> > > fallback can help us make sure qpl won't get broken easily by other changes.
> >
> > Hi Peter
> >
> > Your previous suggestion is below:
> >
> > https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
> > Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> > that might be used; again, qpl enforces HW support).
> > Compression accelerators: auto, none, qat (only applies when zlib/zstd
> > chosen above)
> >
> > Maybe I misunderstood here. What you mean is that if the IAA hardware is
> > unavailable, it will fall back to the software path; this does not need to
> > be specified through live migration parameters, and QPL will automatically
> > determine whether to use the software or hardware path during
> > initialization. Is that right?
>
> I think there are two questions.
>
> Firstly, we definitely want the qpl compressor to be able to run without
> any hardware support. As I mentioned above, I think that's the only way
> the qpl code can always get covered by the CI, since CI hosts normally
> won't have such modern hardware.
>
> I think it also means that in the last test patch, instead of detecting
> /dev/iax, we should unconditionally run the qpl test as long as it is
> compiled in, because it should just fall back to the software path when
> the HW is not available?
>
> The second question is whether we'll want a new "compression accelerator"
> option; fundamentally, its only use case is to enforce the software fallback
> even if hardware exists. I don't remember whether others had any opinion on
> this before, but to me it seems good to have, though I have no strong
> opinion. It's less important compared to the other question on CI coverage.
Yes, I will support the software fallback to ensure CI testing and to let
users still use qpl compression without IAA hardware.

Although the qpl software path will have better performance than zlib, I
still don't think it has a clear advantage over zstd, so I don't think there
is a need to add a migration option to choose between the qpl software and
hardware paths. I will therefore keep QPL as an independent compression
method in the next version, with no additional migration options.

I will also add a guide to qpl-compression.rst about IAA permission issues
and how to determine whether the hardware path is available.
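
For what it's worth, a minimal sketch of such a probe-and-fallback using the
public qpl_* job API could look like the following (illustrative only, not
the actual patch code; the helper name is made up):

/* Illustrative sketch: try the IAA hardware path first and fall back to the
 * software path if hardware initialization fails, e.g. when /dev/iax is
 * absent or the process lacks permission to use it. */
#include <stdint.h>
#include <stdlib.h>
#include <qpl/qpl.h>

static qpl_job *multifd_qpl_job_new(qpl_path_t *chosen_path)  /* hypothetical helper */
{
    const qpl_path_t paths[] = { qpl_path_hardware, qpl_path_software };

    for (int i = 0; i < 2; i++) {
        uint32_t size = 0;
        qpl_job *job;

        if (qpl_get_job_size(paths[i], &size) != QPL_STS_OK) {
            continue;
        }
        job = malloc(size);
        if (!job) {
            return NULL;
        }
        if (qpl_init_job(paths[i], job) == QPL_STS_OK) {
            *chosen_path = paths[i];  /* record which path is actually in use */
            return job;
        }
        free(job);  /* hardware path failed, retry with the software path */
    }
    return NULL;
}

Whether the hardware path was actually selected could then be reported once
at setup time, which also matches the plan to document how to check
hardware-path availability.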
> > > > 2. Compression accelerator related patches are removed from this patch
> > > >    set and will be added to the QAT patch set, we will submit separate
> > > >    patches to use QAT to accelerate ZLIB and ZSTD.
> > > >
> > > > 3. Advantages of using IAA accelerator include:
> > > >    a. Compared with the non-compression method, it can improve downtime
> > > >       performance without adding additional host resources (both CPU and
> > > >       network).
> > > >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> > > >       can provide high data compression ratio and save a lot of CPU
> > > >       resources used for compression.
> > > >
> > > > Test condition:
> > > > 1. Host CPUs are based on Sapphire Rapids
> > > > 2. VM type, 16 vCPU and 64G memory
> > > > 3. The source and destination respectively use 4 IAA devices.
> > > > 4. The workload in the VM
> > > > a. all vCPUs are idle state
> > > >     b. 90% of the virtual machine's memory is used, use silesia to fill
> > > >        the memory.
> > > >        The introduction of silesia:
> > > >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >   5. Set "--mem-prealloc" boot parameter on the destination, this
> > > >      parameter can make IAA performance better and related introduction
> > > >      is added here.
> > > >      docs/devel/migration/qpl-compression.rst
> > > > 6. Source migration configuration commands
> > > > a. migrate_set_capability multifd on
> > > > b. migrate_set_parameter multifd-channels 2/4/8
> > > > c. migrate_set_parameter downtime-limit 300
> > > > f. migrate_set_parameter max-bandwidth 100G/1G
> > > > d. migrate_set_parameter multifd-compression none/qpl/zstd
> > > > 7. Destination migration configuration commands
> > > > a. migrate_set_capability multifd on
> > > > b. migrate_set_parameter multifd-channels 2/4/8
> > > > c. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >
> > > > Early migration result, each result is the average of three tests
> > > >
> > > >  +--------+-------------+--------+--------+---------+----------+------+
> > > >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> > > >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> > > >  | Comp   |             |        |        |(mbps)   |          |      |
> > > >  |        +-------------+--------+--------+---------+----------+------+
> > > >  |Network |            2|    8571|      69|    58391|   1896525|  256%|
> > >
> > > Is this the average bandwidth? I'm surprised that you can hit ~59Gbps with
> > > only 2 channels. My previous experience is around ~1XGbps per channel, so
> > > no more than 30Gbps for two channels. Is it because of a faster processor?
> > > Indeed, from the 4/8 results it doesn't look like increasing the number of
> > > channels helped a lot, and downtime even got worse.
> >
> > Yes, I use iperf3 to check the bandwidth for one core; the bandwidth is
> > 60Gbps.
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> >
> > And in the live migration test, a multifd thread's CPU utilization is
> > almost 100%.
>
> This 60Gbps per-channel is definitely impressive.
>
> Have you tried migration without multifd on your system? Would that also
> perform similarly vs. 2-channel multifd?
Simple test result below:
VM type: 16 vCPU, 64G memory
Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
Migration configurations:
1. migrate_set_parameter max-bandwidth 100G
2. migrate_set_parameter downtime-limit 300
3. migrate_set_capability multifd on (multiFD test case)
4. migrate_set_parameter multifd-channels 2 (multiFD test case)

                  Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
Without multifd             23580            307              21221            689588
Multifd 2                    7657            198              65410           2221176
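
For reference, these columns correspond to the statistics QEMU itself reports
for a migration; one way to read them back after a run is the HMP monitor
(output abbreviated here, values are the multifd-2 figures above, and the
exact field labels can vary between QEMU versions):

(qemu) info migrate
Migration status: completed
total time: 7657 ms
downtime: 198 ms
throughput: 65410.00 mbps
pages-per-second: 2221176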
>
> The whole point of multifd is to scale on bandwidth. If a single thread can
> already achieve 60Gbps (in my memory of previous tests, multifd could only
> reach ~70Gbps), then either multifd will be less useful with the new
> hardware (especially with the most generic socket no-compression setup), or
> we need to start working on the bottlenecks of multifd to make it scale
> better. Otherwise multifd will become a pool for compressor loads only.
>
> >
> > > What is the rationale behind "downtime improvement" with the QPL
> > > compressors? IIUC, in this 100Gbps case the bandwidth is never a
> > > limitation, so I don't understand why adding the compression phase can
> > > make the switchover faster. I can expect many more pages sent in a
> > > NIC-limited env like you described below with 1Gbps, but not when the NIC
> > > has unlimited resources like here.
> >
> > Compression can reduce the network stack overhead (it does not help an
> > RDMA solution): the less data, the smaller the overhead in the network
> > protocol stack. If compression itself has no overhead and network
> > bandwidth is not limited, the last memory copy is faster with compression.
> >
> > The migration hotspot is in _sys_sendmsg:
> >   _sys_sendmsg
> >   |- tcp_sendmsg
> >   |- copy_user_enhanced_fast_string
> >   |- tcp_push_one
>
> Makes sense. I assume that's indeed logical when the compression ratio is
> high enough, and when the compression work is fast enough to cost much less
> than sending the extra data without it.
>
> Thanks,
>
> --
> Peter Xu
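
As an aside, the sendmsg-dominated call chain quoted above is the kind of
hotspot a CPU profile of the source-side QEMU would show; a generic way to
capture such a profile during migration might look like this (the process
name and sampling window are only examples):

# Sample the source QEMU for ~20s while the migration runs, then inspect
# the call graph.
perf record -g -p "$(pgrep -f qemu-system-x86_64 | head -n1)" -- sleep 20
perf report -g --no-children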
On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> Yes, I will support the software fallback to ensure CI testing and to let
> users still use qpl compression without IAA hardware.
>
> Although the qpl software path will have better performance than zlib, I
> still don't think it has a clear advantage over zstd, so I don't think there
> is a need to add a migration option to choose between the qpl software and
> hardware paths. I will therefore keep QPL as an independent compression
> method in the next version, with no additional migration options.

That should be fine.

> I will also add a guide to qpl-compression.rst about IAA permission issues
> and how to determine whether the hardware path is available.

OK.

[...]

> > > Yes, I use iperf3 to check the bandwidth for one core; the bandwidth is
> > > 60Gbps.
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > >
> > > And in the live migration test, a multifd thread's CPU utilization is
> > > almost 100%.
> >
> > This 60Gbps per-channel is definitely impressive.
> >
> > Have you tried migration without multifd on your system? Would that also
> > perform similarly vs. 2-channel multifd?
>
> Simple test result below:
> VM type: 16 vCPU, 64G memory
> Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
> Migration configurations:
> 1. migrate_set_parameter max-bandwidth 100G
> 2. migrate_set_parameter downtime-limit 300
> 3. migrate_set_capability multifd on (multiFD test case)
> 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
>
>                   Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
> Without multifd             23580            307              21221            689588
> Multifd 2                    7657            198              65410           2221176

Thanks for the test results.

So I am guessing the migration overheads besides pushing the socket are high
enough to make it drop drastically, even though in this case zero detection
shouldn't play a major role, considering most of the guest memory is
pre-filled.

--
Peter Xu
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 11:22 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
>
> On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> > Yes, I will support the software fallback to ensure CI testing and to let
> > users still use qpl compression without IAA hardware.
> >
> > Although the qpl software path will have better performance than zlib, I
> > still don't think it has a clear advantage over zstd, so I don't think
> > there is a need to add a migration option to choose between the qpl
> > software and hardware paths. I will therefore keep QPL as an independent
> > compression method in the next version, with no additional migration
> > options.
>
> That should be fine.
>
> > I will also add a guide to qpl-compression.rst about IAA permission issues
> > and how to determine whether the hardware path is available.
>
> OK.
>
> [...]
>
> > > > Yes, I use iperf3 to check the bandwidth for one core; the bandwidth is
> > > > 60Gbps.
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > > >
> > > > And in the live migration test, a multifd thread's CPU utilization is
> > > > almost 100%.
> > >
> > > This 60Gbps per-channel is definitely impressive.
> > >
> > > Have you tried migration without multifd on your system? Would that also
> > > perform similarly vs. 2-channel multifd?
> >
> > Simple test result below:
> > VM type: 16 vCPU, 64G memory
> > Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
> > Migration configurations:
> > 1. migrate_set_parameter max-bandwidth 100G
> > 2. migrate_set_parameter downtime-limit 300
> > 3. migrate_set_capability multifd on (multiFD test case)
> > 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> >
> >                   Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
> > Without multifd             23580            307              21221            689588
> > Multifd 2                    7657            198              65410           2221176
>
> Thanks for the test results.
>
> So I am guessing the migration overheads besides pushing the socket are high
> enough to make it drop drastically, even though in this case zero detection
> shouldn't play a major role, considering most of the guest memory is
> pre-filled.

Yes, for non-multifd migration, besides the network stack overhead, the zero
page detection overhead (on both the source and the destination) is indeed
very high. Placing the zero page detection in multiple threads can reduce the
performance degradation caused by that overhead.

I also think migration doesn't need to detect zero pages by memcmp in all
cases; the benefit of zero page detection depends on the VM's memory actually
containing a large number of zero pages.

My experience in this area may be insufficient. I am working with Hao and
Bryan to see whether it is possible to use DSA hardware to accelerate this
part (both zero page detection and writing zero pages); DSA is an accelerator
for detecting, writing, and comparing memory.
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
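
To make the cost concrete, a naive per-page zero check of the kind being
discussed might look like the sketch below (illustrative only; QEMU's real
helper is the SIMD-optimized buffer_is_zero(), and a DSA compare job would
offload exactly this scan):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Naive zero-page check: scan the page in 64-bit words. Guest RAM pages are
 * page-aligned, so the uint64_t access is safe. */
static bool page_is_zero(const void *page, size_t page_size)
{
    const uint64_t *p = page;
    size_t words = page_size / sizeof(*p);

    for (size_t i = 0; i < words; i++) {
        if (p[i] != 0) {
            return false;   /* non-zero word found, the page must be sent */
        }
    }
    return true;            /* all zero, only a marker needs to be sent */
}

Running this check inside the multifd worker threads, or replacing it with a
DSA memory-compare job, moves the cost off the main migration thread, which
is the direction described above.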