> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:46 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
>
> On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, March 27, 2024 4:30 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > >
> > > Hi, Yuan,
> > >
> > > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > > 1. QPL will be used as an independent compression method like ZLIB and ZSTD;
> > > >    QPL will force the use of the IAA accelerator and will not support
> > > >    software compression. For a summary of issues with Zlib compatibility,
> > > >    please refer to docs/devel/migration/qpl-compression.rst
> > >
> > > IIRC our previous discussion was that we should provide a software fallback
> > > for the new QEMU paths, right?  Why did the decision change?  Again, such a
> > > fallback can help us make sure qpl won't get broken easily by other changes.
> >
> > Hi Peter
> >
> > Previously your suggestion was as below:
> > https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
> >
> > Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> > that might be used; again, qpl enforces HW support).
> > Compression accelerators: auto, none, qat (only applies when zlib/zstd
> > chosen above)
> >
> > Maybe I misunderstood here; what you mean is that if the IAA hardware is
> > unavailable, it will fall back to the software path. This does not need to
> > be specified through live migration parameters, and QPL will automatically
> > determine whether to use the software or hardware path during
> > initialization, is that right?
>
> I think there are two questions.
>
> Firstly, we definitely want the qpl compressor to be able to run without
> any hardware support.  As I mentioned above, I think that's the only way
> that qpl code can always get covered by the CI, as CI hosts normally don't
> have such modern hardware.
>
> I think it also means that in the last test patch, instead of detecting
> /dev/iax, we should unconditionally run the qpl test as long as it is
> compiled in, because it should just fall back to the software path when
> the HW is not available.
>
> The second question is whether we'll want a new "compression accelerator"
> parameter; fundamentally its only use case is to enforce software fallback
> even if hardware exists.  I don't remember whether others had any opinion
> before, but to me it's good to have, however no strong opinion.  It's less
> important compared to the other question on CI coverage.

Yes, I will support software fallback to ensure CI testing and that users
can still use qpl compression without IAA hardware.

Although the qpl software solution will have better performance than zlib,
I still don't think it has a greater advantage than zstd, and I don't think
there is a need to add a migration option to configure the qpl software or
hardware path. So I will still only use QPL as an independent compression
method in the next version, and no other migration options are needed.

I will also add a guide to qpl-compression.rst about IAA permission issues
and how to determine whether the hardware path is available.

> > > > 2. Compression accelerator related patches are removed from this patch
> > > >    set and will be added to the QAT patch set; we will submit separate
> > > >    patches to use QAT to accelerate ZLIB and ZSTD.
> > > >
> > > > 3. Advantages of using the IAA accelerator include:
> > > >    a. Compared with the non-compression method, it can improve downtime
> > > >       performance without adding additional host resources (both CPU
> > > >       and network).
> > > >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> > > >       can provide a high data compression ratio and save a lot of CPU
> > > >       resources used for compression.
> > > >
> > > > Test conditions:
> > > >   1. Host CPUs are based on Sapphire Rapids
> > > >   2. VM type: 16 vCPUs and 64G memory
> > > >   3. The source and destination each use 4 IAA devices.
> > > >   4. The workload in the VM
> > > >      a. all vCPUs are in the idle state
> > > >      b. 90% of the virtual machine's memory is used, using silesia to
> > > >         fill the memory.
> > > >         An introduction to silesia:
> > > >         https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >   5. Set the "--mem-prealloc" boot parameter on the destination; this
> > > >      parameter can make IAA performance better, and a related
> > > >      introduction is added in docs/devel/migration/qpl-compression.rst
> > > >   6. Source migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      d. migrate_set_parameter max-bandwidth 100G/1G
> > > >      e. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >   7. Destination migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >
> > > > Early migration results, each result is the average of three tests
> > > >
> > > >  +--------+-------------+--------+--------+---------+----------+------+
> > > >  |        | The number  | total  |downtime| network |pages per | CPU  |
> > > >  | None   | of channels |time(ms)| (ms)   |bandwidth| second   | Util |
> > > >  | Comp   |             |        |        | (mbps)  |          |      |
> > > >  |        +-------------+--------+--------+---------+----------+------+
> > > >  |Network |           2 |   8571 |     69 |   58391 |  1896525 | 256% |
> > >
> > > Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> > > with only 2 channels.  My previous experience is around ~1XGbps per
> > > channel, so no more than 30Gbps for two channels.  Is it because of a
> > > faster processor?  Indeed from the 4/8 results it doesn't look like
> > > increasing the number of channels helped a lot, and it even got worse
> > > on the downtime.
> >
> > Yes, I used iperf3 to check the bandwidth for one core; the bandwidth is
> > 60Gbps.
> >  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> >  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> >  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> >
> > And in the live migration test, a multifd thread's CPU utilization is
> > almost 100%
>
> This 60Gbps per-channel is definitely impressive..
>
> Have you tried migration without multifd on your system?  Would that also
> perform similarly vs. 2 channels multifd?

Simple test results below:
VM type: 16 vCPUs, 64G memory
Workload in VM: fill 56G memory with Silesia data; vCPUs are idle
Migration configurations:
1. migrate_set_parameter max-bandwidth 100G
2. migrate_set_parameter downtime-limit 300
3. migrate_set_capability multifd on (multiFD test case)
4. migrate_set_parameter multifd-channels 2 (multiFD test case)

                 Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
Without multifd            23580            307              21221             689588
Multifd 2                   7657            198              65410            2221176

> The whole point of multifd is to scale on bandwidth.  If a single thread
> can already achieve 60Gbps (where in my previous memory of tests, multifd
> could only reach ~70Gbps before..), then either multifd will be less
> useful with the new hardware (especially with a most generic socket nocomp
> setup), or we need to start working on the bottlenecks of multifd to make
> it scale better.  Otherwise multifd will become a pool for compressor
> loads only.
>
> > > What is the rationale behind "downtime improvement" when with the QPL
> > > compressors?  IIUC in this 100Gbps case the bandwidth is never a
> > > limitation, so I don't understand why adding the compression phase can
> > > make the switchover faster.  I can expect many more pages sent in a
> > > NIC-limited env like you described below with 1Gbps, but not when the
> > > NIC has unlimited resources like here.
> >
> > The compression can reduce the network stack overhead (it does not help
> > the RDMA solution); the less data, the smaller the overhead in the
> > network protocol stack. If compression has no overhead and network
> > bandwidth is not limited, the last memory copy is faster with
> > compression.
> >
> > The migration hotspot focuses on _sys_sendmsg:
> > _sys_sendmsg
> >   |- tcp_sendmsg
> >      |- copy_user_enhanced_fast_string
> >      |- tcp_push_one
>
> Makes sense.  I assume that's logical indeed when the compression ratio
> is high enough, meanwhile if the compression work is fast enough to cost
> much less than sending the extra data without it.
>
> Thanks,
>
> --
> Peter Xu
On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> Yes, I will support software fallback to ensure CI testing and users can
> still use qpl compression without IAA hardware.
>
> Although the qpl software solution will have better performance than zlib,
> I still don't think it has a greater advantage than zstd. I don't think
> there is a need to add a migration option to configure the qpl software or
> hardware path. So I will still only use QPL as an independent compression
> method in the next version, and no other migration options are needed.

That should be fine.

> I will also add a guide to qpl-compression.rst about IAA permission issues
> and how to determine whether the hardware path is available.

OK.

[...]

> > > Yes, I used iperf3 to check the bandwidth for one core; the bandwidth
> > > is 60Gbps.
> > >  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > >  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > >  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > >
> > > And in the live migration test, a multifd thread's CPU utilization is
> > > almost 100%
> >
> > This 60Gbps per-channel is definitely impressive..
> >
> > Have you tried migration without multifd on your system?  Would that
> > also perform similarly vs. 2 channels multifd?
>
> Simple test results below:
> VM type: 16 vCPUs, 64G memory
> Workload in VM: fill 56G memory with Silesia data; vCPUs are idle
> Migration configurations:
> 1. migrate_set_parameter max-bandwidth 100G
> 2. migrate_set_parameter downtime-limit 300
> 3. migrate_set_capability multifd on (multiFD test case)
> 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
>
>                  Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
> Without multifd            23580            307              21221             689588
> Multifd 2                   7657            198              65410            2221176

Thanks for the test results.

So I am guessing the migration overheads besides pushing the socket are
high enough to make it drop drastically, even if in this case zero
detection shouldn't play a major role considering most of the guest mem
is pre-filled.

--
Peter Xu
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 11:22 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
>
> On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> > Yes, I will support software fallback to ensure CI testing and users can
> > still use qpl compression without IAA hardware.
> >
> > Although the qpl software solution will have better performance than
> > zlib, I still don't think it has a greater advantage than zstd. I don't
> > think there is a need to add a migration option to configure the qpl
> > software or hardware path. So I will still only use QPL as an
> > independent compression method in the next version, and no other
> > migration options are needed.
>
> That should be fine.
>
> > I will also add a guide to qpl-compression.rst about IAA permission
> > issues and how to determine whether the hardware path is available.
>
> OK.
>
> [...]
>
> > > > Yes, I used iperf3 to check the bandwidth for one core; the
> > > > bandwidth is 60Gbps.
> > > >  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > >  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > >  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 MBytes
> > > >
> > > > And in the live migration test, a multifd thread's CPU utilization
> > > > is almost 100%
> > >
> > > This 60Gbps per-channel is definitely impressive..
> > >
> > > Have you tried migration without multifd on your system?  Would that
> > > also perform similarly vs. 2 channels multifd?
> >
> > Simple test results below:
> > VM type: 16 vCPUs, 64G memory
> > Workload in VM: fill 56G memory with Silesia data; vCPUs are idle
> > Migration configurations:
> > 1. migrate_set_parameter max-bandwidth 100G
> > 2. migrate_set_parameter downtime-limit 300
> > 3. migrate_set_capability multifd on (multiFD test case)
> > 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> >
> >                  Total time (ms)  Downtime (ms)  Throughput (mbps)  Pages-per-second
> > Without multifd            23580            307              21221             689588
> > Multifd 2                   7657            198              65410            2221176
>
> Thanks for the test results.
>
> So I am guessing the migration overheads besides pushing the socket are
> high enough to make it drop drastically, even if in this case zero
> detection shouldn't play a major role considering most of the guest mem
> is pre-filled.

Yes, for no-multifd migration, besides the network stack overhead, the zero
page detection overhead (on both the source and destination) is indeed very
high. Placing the zero page detection in multiple threads can reduce the
performance degradation caused by that overhead.

I also think migration doesn't need to detect zero pages by memcmp in all
cases. Zero page detection mainly benefits VMs whose memory is known to
contain a large number of zero pages.

My experience in this area may be insufficient; I am working with Hao and
Bryan to see if it is possible to use DSA hardware to accelerate this part
(including zero page detection and writing zero pages). DSA is an
accelerator for detecting, writing, and comparing memory:
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf