[PATCH v5 0/7] Live Migration With IAA

Yuan Liu posted 7 patches 1 month, 1 week ago
[PATCH v5 0/7] Live Migration With IAA
Posted by Yuan Liu 1 month, 1 week ago
I am writing to submit a code change aimed at enhancing live migration
acceleration by leveraging the compression capability of the Intel
In-Memory Analytics Accelerator (IAA).

The implementation of the IAA (de)compression code is based on Intel Query
Processing Library (QPL), an open-source software project designed for
IAA high-level software programming. https://github.com/intel/qpl
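For readers unfamiliar with QPL, below is a minimal, illustrative sketch (not
taken from this patch set) of a one-shot deflate compression call through QPL's
public C job API on the IAA hardware path; the qpl_compress_buffer() helper and
its buffer arguments are made up for the example:

#include <stdint.h>
#include <stdlib.h>
#include <qpl/qpl.h>

/* Compress src into dst on the IAA hardware path; returns 0 on success. */
static int qpl_compress_buffer(const uint8_t *src, uint32_t src_len,
                               uint8_t *dst, uint32_t dst_len,
                               uint32_t *out_len)
{
    uint32_t job_size = 0;
    qpl_job *job;
    qpl_status status;

    /* The job structure size depends on the chosen execution path */
    status = qpl_get_job_size(qpl_path_hardware, &job_size);
    if (status != QPL_STS_OK) {
        return -1;
    }
    job = malloc(job_size);
    if (!job) {
        return -1;
    }
    status = qpl_init_job(qpl_path_hardware, job);
    if (status != QPL_STS_OK) {
        free(job);
        return -1;
    }

    /* One-shot deflate compression of the whole buffer */
    job->op = qpl_op_compress;
    job->next_in_ptr = (uint8_t *)src;
    job->available_in = src_len;
    job->next_out_ptr = dst;
    job->available_out = dst_len;
    job->level = qpl_default_level;
    job->flags = QPL_FLAG_FIRST | QPL_FLAG_LAST |
                 QPL_FLAG_DYNAMIC_HUFFMAN | QPL_FLAG_OMIT_VERIFY;

    status = qpl_execute_job(job);
    if (status == QPL_STS_OK) {
        *out_len = job->total_out;   /* compressed size */
    }
    qpl_fini_job(job);
    free(job);
    return status == QPL_STS_OK ? 0 : -1;
}

Decompression uses the same job structure with job->op = qpl_op_decompress.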

I would like to summarize the progress so far:
1. QPL will be used as an independent compression method like ZLIB and ZSTD;
   QPL enforces the use of the IAA accelerator and will not support software
   compression. For a summary of Zlib compatibility issues, please refer to
   docs/devel/migration/qpl-compression.rst (a rough code-level sketch of how
   such a backend plugs into multifd follows after this list).

2. Compression accelerator related patches have been removed from this patch
   set and will be added to the QAT patch set; we will submit separate patches
   to use QAT to accelerate ZLIB and ZSTD.

3. Advantages of using the IAA accelerator include:
   a. Compared with the non-compression method, it can improve downtime
      performance without adding additional host resources (both CPU and
      network).
   b. Compared with software compression methods (ZSTD/ZLIB), it can provide
      a high data compression ratio while saving a large amount of CPU
      resources otherwise spent on compression.
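
As a rough code-level illustration of point 1 above, here is a hedged sketch of
how an independent multifd compression backend is typically registered, modeled
on migration/multifd-zlib.c; the multifd_qpl_* callbacks are placeholders, and
the exact MultiFDMethods field names can differ between QEMU versions, so treat
this as an outline rather than the actual patch:

/* Sketch only: registering a new multifd compression backend. */
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "migration/misc.h"   /* migration_init() */
#include "multifd.h"

/* Placeholder callbacks; their bodies would drive the QPL job API. */
static int multifd_qpl_send_setup(MultiFDSendParams *p, Error **errp);
static void multifd_qpl_send_cleanup(MultiFDSendParams *p, Error **errp);
static int multifd_qpl_send_prepare(MultiFDSendParams *p, Error **errp);
static int multifd_qpl_recv_setup(MultiFDRecvParams *p, Error **errp);
static void multifd_qpl_recv_cleanup(MultiFDRecvParams *p);
static int multifd_qpl_recv(MultiFDRecvParams *p, Error **errp);

static MultiFDMethods multifd_qpl_ops = {
    .send_setup   = multifd_qpl_send_setup,    /* allocate per-channel state */
    .send_cleanup = multifd_qpl_send_cleanup,
    .send_prepare = multifd_qpl_send_prepare,  /* compress pages, fill p->iov */
    .recv_setup   = multifd_qpl_recv_setup,
    .recv_cleanup = multifd_qpl_recv_cleanup,
    .recv         = multifd_qpl_recv,          /* named recv_pages in older trees */
};

static void multifd_qpl_register(void)
{
    /* MULTIFD_COMPRESSION_QPL is the QAPI enum value added by this series */
    multifd_register_ops(MULTIFD_COMPRESSION_QPL, &multifd_qpl_ops);
}

migration_init(multifd_qpl_register);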

Test conditions:
  1. Host CPUs are based on Sapphire Rapids
  2. VM type: 16 vCPUs and 64 GB memory
  3. The source and destination hosts each use 4 IAA devices.
  4. The workload in the VM
    a. all vCPUs are idle
    b. 90% of the virtual machine's memory is in use, filled with the Silesia
       corpus.
       An introduction to Silesia:
       https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
  5. Set the "--mem-prealloc" boot parameter on the destination; this parameter
     improves IAA performance, and a related introduction has been added to
     docs/devel/migration/qpl-compression.rst
  6. Source migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter downtime-limit 300
     d. migrate_set_parameter max-bandwidth 100G/1G
     e. migrate_set_parameter multifd-compression none/qpl/zstd
  7. Destination migration configuration commands
     a. migrate_set_capability multifd on
     b. migrate_set_parameter multifd-channels 2/4/8
     c. migrate_set_parameter multifd-compression none/qpl/zstd

Early migration results; each result is the average of three tests

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|    8571|      69|    58391|   1896525|  256%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|    7180|      92|    69736|   1865640|  300%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|    7090|     121|    70562|   2174060|  307%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|    8413|      34|    30067|   1732411|  230%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|    6559|      32|    38804|   1689954|  450%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|    6623|      37|    38745|   1566507|  790%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   95846|      24|     1800|    521829|  203%|
 |BW:100G +-------------+--------+--------+---------+----------+------+
 |        |            4|   49004|      24|     3529|    890532|  403%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   25574|      32|     6782|   1762222|  800%|
 +--------+-------------+--------+--------+---------+----------+------+

When network bandwidth is sufficient, QPL can reduce downtime by roughly 2x
compared to no compression. In this scenario the IAA hardware resources are
fully used with 4 channels, so adding more channels does not bring further
benefits.

 
 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   57758|      66|     8643|    264617|   34%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   57216|      58|     8726|    266773|   34%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   56708|      53|     8804|    270223|   33%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   30129|      34|     8345|   2224761|   54%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   30317|      39|     8300|   2025220|   73%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   29615|      35|     8514|   2250122|  131%|
 +--------+-------------+--------+--------+---------+----------+------+

 +--------+-------------+--------+--------+---------+----------+------+
 |        | The number  |total   |downtime|network  |pages per | CPU  |
 | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
 | Comp   |             |        |        |(mbps)   |          |      |
 |        +-------------+--------+--------+---------+----------+------+
 |Network |            2|   95750|      24|     1802|    477236|  202%|
 |BW:  1G +-------------+--------+--------+---------+----------+------+
 |        |            4|   48907|      24|     3536|   1002142|  404%|
 |        +-------------+--------+--------+---------+----------+------+
 |        |            8|   25568|      32|     6783|   1696437|  800%|
 +--------+-------------+--------+--------+---------+----------+------+

When network bandwidth is limited, the "pages per second" metric decreases
without compression, so the migration is less likely to converge successfully.
Comparing the QPL and ZSTD compression methods, QPL saves a large amount of
CPU resources otherwise used for compression.

v2:
  - add support for multifd compression accelerator
  - add support for the QPL accelerator in the multifd
    compression accelerator
  - fixed the issue that QPL was compiled into the migration
    module by default

v3:
  - use Meson instead of pkg-config to resolve QPL build
    dependency issue
  - fix coding style
  - fix a CI issue for get_multifd_ops function in multifd.c file

v4:
  - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
    https://github.com/philmd/qemu into staging
  - remove the compression accelerator implementation patches; they will
    be included in the QAT accelerator implementation
  - introduce QPL as a new compression method
  - add QPL compression documentation
  - add QPL compression migration test
  - fix zlib/zstd compression level issue

v5:
  - patch based on v9.0.0-rc0 (c62d54d0a8)
  - use pkg-config to check for libaccel-config, which is already packaged
    in many distributions
  - initialize the IOV of the sender by the specific compression method
  - refine the coding style
  - drop the patch for the zlib/zstd compression level not working; the
    issue has already been solved

Yuan Liu (7):
  docs/migration: add qpl compression feature
  migration/multifd: put IOV initialization into compression method
  configure: add --enable-qpl build option
  migration/multifd: add qpl compression method
  migration/multifd: implement initialization of qpl compression
  migration/multifd: implement qpl compression and decompression
  tests/migration-test: add qpl compression test

 docs/devel/migration/features.rst        |   1 +
 docs/devel/migration/qpl-compression.rst | 231 +++++++++++
 hw/core/qdev-properties-system.c         |   2 +-
 meson.build                              |  16 +
 meson_options.txt                        |   2 +
 migration/meson.build                    |   1 +
 migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
 migration/multifd-zlib.c                 |   4 +
 migration/multifd-zstd.c                 |   6 +-
 migration/multifd.c                      |   8 +-
 migration/multifd.h                      |   1 +
 qapi/migration.json                      |   7 +-
 scripts/meson-buildoptions.sh            |   3 +
 tests/qtest/migration-test.c             |  24 ++
 14 files changed, 782 insertions(+), 6 deletions(-)
 create mode 100644 docs/devel/migration/qpl-compression.rst
 create mode 100644 migration/multifd-qpl.c

-- 
2.39.3
Re: [PATCH v5 0/7] Live Migration With IAA
Posted by Peter Xu 1 month ago
Hi, Yuan,

On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> 1. QPL will be used as an independent compression method like ZLIB and ZSTD,
>    QPL will force the use of the IAA accelerator and will not support software
>    compression. For a summary of issues compatible with Zlib, please refer to
>    docs/devel/migration/qpl-compression.rst

IIRC our previous discussion was that we should provide a software fallback for
the new QEMU paths, right?  Why did the decision change?  Again, such a fallback
can help us make sure qpl won't get broken easily by other changes.

> 
> 2. Compression accelerator related patches are removed from this patch set and
>    will be added to the QAT patch set, we will submit separate patches to use
>    QAT to accelerate ZLIB and ZSTD.
> 
> 3. Advantages of using IAA accelerator include:
>    a. Compared with the non-compression method, it can improve downtime
>       performance without adding additional host resources (both CPU and
>       network).
>    b. Compared with using software compression methods (ZSTD/ZLIB), it can
>       provide high data compression ratio and save a lot of CPU resources
>       used for compression.
> 
> Test condition:
>   1. Host CPUs are based on Sapphire Rapids
>   2. VM type, 16 vCPU and 64G memory
>   3. The source and destination respectively use 4 IAA devices.
>   4. The workload in the VM
>     a. all vCPUs are idle state
>     b. 90% of the virtual machine's memory is used, use silesia to fill
>        the memory.
>        The introduction of silesia:
>        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
>   5. Set "--mem-prealloc" boot parameter on the destination, this parameter
>      can make IAA performance better and related introduction is added here.
>      docs/devel/migration/qpl-compression.rst
>   6. Source migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter downtime-limit 300
>      f. migrate_set_parameter max-bandwidth 100G/1G
>      d. migrate_set_parameter multifd-compression none/qpl/zstd
>   7. Destination migration configuration commands
>      a. migrate_set_capability multifd on
>      b. migrate_set_parameter multifd-channels 2/4/8
>      c. migrate_set_parameter multifd-compression none/qpl/zstd
> 
> Early migration result, each result is the average of three tests
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|    8571|      69|    58391|   1896525|  256%|

Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps with
only 2 channels.  My previous experience is around ~1XGbps per channel, so
no more than 30Gbps for two channels.  Is it because of a faster processor?
Indeed, from the 4/8 results it doesn't look like increasing the number of
channels helped a lot, and downtime even got worse.

What is the rationale behind the "downtime improvement" with the QPL
compressors?  IIUC in this 100Gbps case the bandwidth is never a
limitation, so I don't understand why adding the compression phase can
make the switchover faster.  I can expect many more pages sent in a
NIC-limited env like you described below with 1Gbps, but not when the NIC
has unlimited resources like here.

>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|    7180|      92|    69736|   1865640|  300%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|    7090|     121|    70562|   2174060|  307%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|    8413|      34|    30067|   1732411|  230%|
>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|    6559|      32|    38804|   1689954|  450%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|    6623|      37|    38745|   1566507|  790%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   95846|      24|     1800|    521829|  203%|
>  |BW:100G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   49004|      24|     3529|    890532|  403%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   25574|      32|     6782|   1762222|  800%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
> When network bandwidth resource is sufficient, QPL can improve downtime
> by 2x compared to no compression. In this scenario, with 4 channels, the
> IAA hardware resources are fully used, so adding more channels will not
> gain more benefits.
> 
>  
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   57758|      66|     8643|    264617|   34%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   57216|      58|     8726|    266773|   34%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   56708|      53|     8804|    270223|   33%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   30129|      34|     8345|   2224761|   54%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   30317|      39|     8300|   2025220|   73%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   29615|      35|     8514|   2250122|  131%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
>  +--------+-------------+--------+--------+---------+----------+------|
>  |        | The number  |total   |downtime|network  |pages per | CPU  |
>  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
>  | Comp   |             |        |        |(mbps)   |          |      |
>  |        +-------------+-----------------+---------+----------+------+
>  |Network |            2|   95750|      24|     1802|    477236|  202%|
>  |BW:  1G +-------------+--------+--------+---------+----------+------+
>  |        |            4|   48907|      24|     3536|   1002142|  404%|
>  |        +-------------+--------+--------+---------+----------+------+
>  |        |            8|   25568|      32|     6783|   1696437|  800%|
>  +--------+-------------+--------+--------+---------+----------+------+
> 
> When network bandwidth resource is limited, the "page perf second" metric
> decreases for none compression, the success rate of migration will reduce.
> Comparison of QPL and ZSTD compression methods, QPL can save a lot of CPU
> resources used for compression.
> 
> v2:
>   - add support for multifd compression accelerator
>   - add support for the QPL accelerator in the multifd
>     compression accelerator
>   - fixed the issue that QPL was compiled into the migration
>     module by default
> 
> v3:
>   - use Meson instead of pkg-config to resolve QPL build
>     dependency issue
>   - fix coding style
>   - fix a CI issue for get_multifd_ops function in multifd.c file
> 
> v4:
>   - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
>     https://github.com/philmd/qemu into staging
>   - remove the compression accelerator implementation patches, the patches
>     will be placed in the QAT accelerator implementation.
>   - introduce QPL as a new compression method
>   - add QPL compression documentation
>   - add QPL compression migration test
>   - fix zlib/zstd compression level issue
> 
> v5:
>   - patch based on v9.0.0-rc0 (c62d54d0a8)
>   - use pkgconfig to check libaccel-config, libaccel-config is already
>     in many distributions.
>   - initialize the IOV of the sender by the specific compression method
>   - refine the coding style
>   - remove the zlib/zstd compression level not working patch, the issue
>     has been solved
> 
> Yuan Liu (7):
>   docs/migration: add qpl compression feature
>   migration/multifd: put IOV initialization into compression method
>   configure: add --enable-qpl build option
>   migration/multifd: add qpl compression method
>   migration/multifd: implement initialization of qpl compression
>   migration/multifd: implement qpl compression and decompression
>   tests/migration-test: add qpl compression test
> 
>  docs/devel/migration/features.rst        |   1 +
>  docs/devel/migration/qpl-compression.rst | 231 +++++++++++
>  hw/core/qdev-properties-system.c         |   2 +-
>  meson.build                              |  16 +
>  meson_options.txt                        |   2 +
>  migration/meson.build                    |   1 +
>  migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
>  migration/multifd-zlib.c                 |   4 +
>  migration/multifd-zstd.c                 |   6 +-
>  migration/multifd.c                      |   8 +-
>  migration/multifd.h                      |   1 +
>  qapi/migration.json                      |   7 +-
>  scripts/meson-buildoptions.sh            |   3 +
>  tests/qtest/migration-test.c             |  24 ++
>  14 files changed, 782 insertions(+), 6 deletions(-)
>  create mode 100644 docs/devel/migration/qpl-compression.rst
>  create mode 100644 migration/multifd-qpl.c
> 
> -- 
> 2.39.3
> 

-- 
Peter Xu
RE: [PATCH v5 0/7] Live Migration With IAA
Posted by Liu, Yuan1 1 month ago
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Wednesday, March 27, 2024 4:30 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> 
> Hi, Yuan,
> 
> On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > 1. QPL will be used as an independent compression method like ZLIB and
> ZSTD,
> >    QPL will force the use of the IAA accelerator and will not support
> software
> >    compression. For a summary of issues compatible with Zlib, please
> refer to
> >    docs/devel/migration/qpl-compression.rst
> 
> IIRC our previous discussion is we should provide a software fallback for
> the new QEMU paths, right?  Why the decision changed?  Again, such
> fallback
> can help us to make sure qpl won't get broken easily by other changes.

Hi Peter

Your previous suggestion is below:

https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
Compression methods: none, zlib, zstd, qpl (describes all the algorithms
that might be used; again, qpl enforces HW support).
Compression accelerators: auto, none, qat (only applies when zlib/zstd
chosen above)

Maybe I misunderstood here. Do you mean that if the IAA hardware is unavailable,
QPL should fall back to the software path? That would not need to be specified
through a live migration parameter; QPL would automatically determine whether to
use the software or hardware path during initialization. Is that right?

> > 2. Compression accelerator related patches are removed from this patch
> set and
> >    will be added to the QAT patch set, we will submit separate patches
> to use
> >    QAT to accelerate ZLIB and ZSTD.
> >
> > 3. Advantages of using IAA accelerator include:
> >    a. Compared with the non-compression method, it can improve downtime
> >       performance without adding additional host resources (both CPU and
> >       network).
> >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> can
> >       provide high data compression ratio and save a lot of CPU
> resources
> >       used for compression.
> >
> > Test condition:
> >   1. Host CPUs are based on Sapphire Rapids
> >   2. VM type, 16 vCPU and 64G memory
> >   3. The source and destination respectively use 4 IAA devices.
> >   4. The workload in the VM
> >     a. all vCPUs are idle state
> >     b. 90% of the virtual machine's memory is used, use silesia to fill
> >        the memory.
> >        The introduction of silesia:
> >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> >   5. Set "--mem-prealloc" boot parameter on the destination, this
> parameter
> >      can make IAA performance better and related introduction is added
> here.
> >      docs/devel/migration/qpl-compression.rst
> >   6. Source migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter downtime-limit 300
> >      f. migrate_set_parameter max-bandwidth 100G/1G
> >      d. migrate_set_parameter multifd-compression none/qpl/zstd
> >   7. Destination migration configuration commands
> >      a. migrate_set_capability multifd on
> >      b. migrate_set_parameter multifd-channels 2/4/8
> >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> >
> > Early migration result, each result is the average of three tests
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|    8571|      69|    58391|   1896525|  256%|
> 
> Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> only
> with 2 channels.  My previous experience is around ~1XGbps per channel, so
> no more than 30Gbps for two channels.  Is it because of a faster
> processor?
> Indeed from the 4/8 results it doesn't look like increasing the num of
> channels helped a lot, and even it got worse on the downtime.

Yes, I used iperf3 to check the bandwidth for one core; the bandwidth is 60Gbps.
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
[  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes

And in the live migration test, a multifd thread's CPU utilization is almost 100%

> What is the rational behind "downtime improvement" when with the QPL
> compressors?  IIUC in this 100Gbps case the bandwidth is never a
> limitation, then I don't understand why adding the compression phase can
> make the switchover faster.  I can expect much more pages sent in a
> NIC-limted env like you described below with 1Gbps, but not when NIC has
> unlimited resources like here.

Compression reduces the network stack overhead (it does not help an RDMA
solution): the less data is sent, the smaller the overhead in the network
protocol stack. If the compression itself has negligible overhead and network
bandwidth is not limited, the final memory copy is faster with compression.

The migration hotspot is in _sys_sendmsg:
_sys_sendmsg
  |- tcp_sendmsg
    |- copy_user_enhanced_fast_string
    |- tcp_push_one


> >  |BW:100G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|    7180|      92|    69736|   1865640|  300%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|    7090|     121|    70562|   2174060|  307%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|    8413|      34|    30067|   1732411|  230%|
> >  |BW:100G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|    6559|      32|    38804|   1689954|  450%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|    6623|      37|    38745|   1566507|  790%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   95846|      24|     1800|    521829|  203%|
> >  |BW:100G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   49004|      24|     3529|    890532|  403%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   25574|      32|     6782|   1762222|  800%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> > When network bandwidth resource is sufficient, QPL can improve downtime
> > by 2x compared to no compression. In this scenario, with 4 channels, the
> > IAA hardware resources are fully used, so adding more channels will not
> > gain more benefits.
> >
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   57758|      66|     8643|    264617|   34%|
> >  |BW:  1G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   57216|      58|     8726|    266773|   34%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   56708|      53|     8804|    270223|   33%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | QPL    | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   30129|      34|     8345|   2224761|   54%|
> >  |BW:  1G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   30317|      39|     8300|   2025220|   73%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   29615|      35|     8514|   2250122|  131%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> >  +--------+-------------+--------+--------+---------+----------+------|
> >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> >  | ZSTD   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> >  | Comp   |             |        |        |(mbps)   |          |      |
> >  |        +-------------+-----------------+---------+----------+------+
> >  |Network |            2|   95750|      24|     1802|    477236|  202%|
> >  |BW:  1G +-------------+--------+--------+---------+----------+------+
> >  |        |            4|   48907|      24|     3536|   1002142|  404%|
> >  |        +-------------+--------+--------+---------+----------+------+
> >  |        |            8|   25568|      32|     6783|   1696437|  800%|
> >  +--------+-------------+--------+--------+---------+----------+------+
> >
> > When network bandwidth resource is limited, the "page perf second"
> metric
> > decreases for none compression, the success rate of migration will
> reduce.
> > Comparison of QPL and ZSTD compression methods, QPL can save a lot of
> CPU
> > resources used for compression.
> >
> > v2:
> >   - add support for multifd compression accelerator
> >   - add support for the QPL accelerator in the multifd
> >     compression accelerator
> >   - fixed the issue that QPL was compiled into the migration
> >     module by default
> >
> > v3:
> >   - use Meson instead of pkg-config to resolve QPL build
> >     dependency issue
> >   - fix coding style
> >   - fix a CI issue for get_multifd_ops function in multifd.c file
> >
> > v4:
> >   - patch based on commit: da96ad4a6a Merge tag 'hw-misc-20240215' of
> >     https://github.com/philmd/qemu into staging
> >   - remove the compression accelerator implementation patches, the
> patches
> >     will be placed in the QAT accelerator implementation.
> >   - introduce QPL as a new compression method
> >   - add QPL compression documentation
> >   - add QPL compression migration test
> >   - fix zlib/zstd compression level issue
> >
> > v5:
> >   - patch based on v9.0.0-rc0 (c62d54d0a8)
> >   - use pkgconfig to check libaccel-config, libaccel-config is already
> >     in many distributions.
> >   - initialize the IOV of the sender by the specific compression method
> >   - refine the coding style
> >   - remove the zlib/zstd compression level not working patch, the issue
> >     has been solved
> >
> > Yuan Liu (7):
> >   docs/migration: add qpl compression feature
> >   migration/multifd: put IOV initialization into compression method
> >   configure: add --enable-qpl build option
> >   migration/multifd: add qpl compression method
> >   migration/multifd: implement initialization of qpl compression
> >   migration/multifd: implement qpl compression and decompression
> >   tests/migration-test: add qpl compression test
> >
> >  docs/devel/migration/features.rst        |   1 +
> >  docs/devel/migration/qpl-compression.rst | 231 +++++++++++
> >  hw/core/qdev-properties-system.c         |   2 +-
> >  meson.build                              |  16 +
> >  meson_options.txt                        |   2 +
> >  migration/meson.build                    |   1 +
> >  migration/multifd-qpl.c                  | 482 +++++++++++++++++++++++
> >  migration/multifd-zlib.c                 |   4 +
> >  migration/multifd-zstd.c                 |   6 +-
> >  migration/multifd.c                      |   8 +-
> >  migration/multifd.h                      |   1 +
> >  qapi/migration.json                      |   7 +-
> >  scripts/meson-buildoptions.sh            |   3 +
> >  tests/qtest/migration-test.c             |  24 ++
> >  14 files changed, 782 insertions(+), 6 deletions(-)
> >  create mode 100644 docs/devel/migration/qpl-compression.rst
> >  create mode 100644 migration/multifd-qpl.c
> >
> > --
> > 2.39.3
> >
> 
> --
> Peter Xu

Re: [PATCH v5 0/7] Live Migration With IAA
Posted by Peter Xu 1 month ago
On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > -----Original Message-----
> > From: Peter Xu <peterx@redhat.com>
> > Sent: Wednesday, March 27, 2024 4:30 AM
> > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > 
> > Hi, Yuan,
> > 
> > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > 1. QPL will be used as an independent compression method like ZLIB and
> > ZSTD,
> > >    QPL will force the use of the IAA accelerator and will not support
> > software
> > >    compression. For a summary of issues compatible with Zlib, please
> > refer to
> > >    docs/devel/migration/qpl-compression.rst
> > 
> > IIRC our previous discussion is we should provide a software fallback for
> > the new QEMU paths, right?  Why the decision changed?  Again, such
> > fallback
> > can help us to make sure qpl won't get broken easily by other changes.
> 
> Hi Peter
> 
> Previous your suggestion below
> 
> https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5941.namprd11.prod.outlook.com/
> Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> that might be used; again, qpl enforces HW support).
> Compression accelerators: auto, none, qat (only applies when zlib/zstd
> chosen above)
> 
> Maybe I misunderstood here, what you mean is that if the IAA hardware is unavailable, 
> it will fall back to the software path. This does not need to be specified through live
> migration parameters, and it will automatically determine whether to use the software or
> hardware path during QPL initialization, is that right?

I think there are two questions.

Firstly, we definitely want the qpl compressor to be able to run without
any hardware support.  As I mentioned above, I think that's the only way
the qpl code can always get covered by the CI, since CI hosts normally
won't have such modern hardware.

I think it also means that in the last test patch, instead of detecting
/dev/iax, we should unconditionally run the qpl test as long as it is compiled
in, because it should just fall back to the software path when the HW is not
available?

The second question is whether we'll want a new "compression accelerator"
option; fundamentally its only use case is to enforce the software fallback
even when hardware exists.  I don't remember whether others had any opinion
on this before, but to me it seems good to have, though I have no strong
opinion.  It's less important compared to the other question on CI coverage.

> 
> > > 2. Compression accelerator related patches are removed from this patch
> > set and
> > >    will be added to the QAT patch set, we will submit separate patches
> > to use
> > >    QAT to accelerate ZLIB and ZSTD.
> > >
> > > 3. Advantages of using IAA accelerator include:
> > >    a. Compared with the non-compression method, it can improve downtime
> > >       performance without adding additional host resources (both CPU and
> > >       network).
> > >    b. Compared with using software compression methods (ZSTD/ZLIB), it
> > can
> > >       provide high data compression ratio and save a lot of CPU
> > resources
> > >       used for compression.
> > >
> > > Test condition:
> > >   1. Host CPUs are based on Sapphire Rapids
> > >   2. VM type, 16 vCPU and 64G memory
> > >   3. The source and destination respectively use 4 IAA devices.
> > >   4. The workload in the VM
> > >     a. all vCPUs are idle state
> > >     b. 90% of the virtual machine's memory is used, use silesia to fill
> > >        the memory.
> > >        The introduction of silesia:
> > >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > >   5. Set "--mem-prealloc" boot parameter on the destination, this
> > parameter
> > >      can make IAA performance better and related introduction is added
> > here.
> > >      docs/devel/migration/qpl-compression.rst
> > >   6. Source migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter downtime-limit 300
> > >      f. migrate_set_parameter max-bandwidth 100G/1G
> > >      d. migrate_set_parameter multifd-compression none/qpl/zstd
> > >   7. Destination migration configuration commands
> > >      a. migrate_set_capability multifd on
> > >      b. migrate_set_parameter multifd-channels 2/4/8
> > >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> > >
> > > Early migration result, each result is the average of three tests
> > >
> > >  +--------+-------------+--------+--------+---------+----------+------|
> > >  |        | The number  |total   |downtime|network  |pages per | CPU  |
> > >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    | Util |
> > >  | Comp   |             |        |        |(mbps)   |          |      |
> > >  |        +-------------+-----------------+---------+----------+------+
> > >  |Network |            2|    8571|      69|    58391|   1896525|  256%|
> > 
> > Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> > only
> > with 2 channels.  My previous experience is around ~1XGbps per channel, so
> > no more than 30Gbps for two channels.  Is it because of a faster
> > processor?
> > Indeed from the 4/8 results it doesn't look like increasing the num of
> > channels helped a lot, and even it got worse on the downtime.
> 
> Yes, I use iperf3 to check the bandwidth for one core, the bandwith is 60Gbps.
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> 
> And in the live migration test, a multifd thread's CPU utilization is almost 100%

This 60Gbps per-channel is definitely impressive..

Have you tried migration without multifd on your system? Would that also
perform similarly vs. 2-channel multifd?

The whole point of multifd is to scale on bandwidth.  If a single thread can
already achieve 60Gbps (where in my previous memory of tests, multifd could
only reach ~70Gbps before..), then either multifd will be less useful with
the new hardware (especially with the most generic socket no-compression
setup), or we need to start working on the bottlenecks of multifd to make it
scale better.  Otherwise multifd will become a pool for compressor loads
only.

> 
> > What is the rational behind "downtime improvement" when with the QPL
> > compressors?  IIUC in this 100Gbps case the bandwidth is never a
> > limitation, then I don't understand why adding the compression phase can
> > make the switchover faster.  I can expect much more pages sent in a
> > NIC-limted env like you described below with 1Gbps, but not when NIC has
> > unlimited resources like here.
> 
> The compression can improve the network stack overhead(not improve the RDMA 
> solution), the less data, the smaller the overhead in the 
> network protocol stack. If compression has no overhead, and network bandwidth
> is not limited, the last memory copy is faster with compression
> 
> The migration hotspot focuses on the _sys_sendmsg
> _sys_sendmsg
>   |- tcp_sendmsg
>     |- copy_user_enhanced_fast_string
>     |- tcp_push_one

Makes sense.  I assume that's logical indeed when the compression ratio is
high enough, meanwhile if the compression work is fast enough to be much
lower than sending extra data when without it.

Thanks,

-- 
Peter Xu
RE: [PATCH v5 0/7] Live Migration With IAA
Posted by Liu, Yuan1 1 month ago
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 3:46 AM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> 
> On Wed, Mar 27, 2024 at 03:20:19AM +0000, Liu, Yuan1 wrote:
> > > -----Original Message-----
> > > From: Peter Xu <peterx@redhat.com>
> > > Sent: Wednesday, March 27, 2024 4:30 AM
> > > To: Liu, Yuan1 <yuan1.liu@intel.com>
> > > Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> > > bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> > > Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> > >
> > > Hi, Yuan,
> > >
> > > On Wed, Mar 20, 2024 at 12:45:20AM +0800, Yuan Liu wrote:
> > > > 1. QPL will be used as an independent compression method like ZLIB
> and
> > > ZSTD,
> > > >    QPL will force the use of the IAA accelerator and will not
> support
> > > software
> > > >    compression. For a summary of issues compatible with Zlib, please
> > > refer to
> > > >    docs/devel/migration/qpl-compression.rst
> > >
> > > IIRC our previous discussion is we should provide a software fallback
> for
> > > the new QEMU paths, right?  Why the decision changed?  Again, such
> > > fallback
> > > can help us to make sure qpl won't get broken easily by other changes.
> >
> > Hi Peter
> >
> > Previous your suggestion below
> >
> >
> https://patchew.org/QEMU/PH7PR11MB5941019462E0ADDE231C7295A37C2@PH7PR11MB5
> 941.namprd11.prod.outlook.com/
> > Compression methods: none, zlib, zstd, qpl (describes all the algorithms
> > that might be used; again, qpl enforces HW support).
> > Compression accelerators: auto, none, qat (only applies when zlib/zstd
> > chosen above)
> >
> > Maybe I misunderstood here, what you mean is that if the IAA hardware is
> unavailable,
> > it will fall back to the software path. This does not need to be
> specified through live
> > migration parameters, and it will automatically determine whether to use
> the software or
> > hardware path during QPL initialization, is that right?
> 
> I think there are two questions.
> 
> Firstly, we definitely want the qpl compressor to be able to run without
> any hardware support.  As I mentioned above, I think that's the only way
> that qpl code can always get covered by the CI as CI hosts should normally
> don't have those modern hardwares.
> 
> I think it also means in the last test patch, instead of detecting
> /dev/iax
> we should unconditionally run the qpl test as long as compiled in, because
> it should just fallback to the software path then when HW not valid?
> 
> The second question is whether we'll want a new "compression accelerator",
> fundamentally the only use case of that is to enforce software fallback
> even if hardware existed.  I don't remember whether others have any
> opinion
> before, but to me I think it's good to have, however no strong opinion.
> It's less important comparing to the other question on CI coverage.

Yes, I will support software fallback to ensure CI testing works and that users
can still use qpl compression without IAA hardware.

Although the qpl software path will perform better than zlib, I still don't
think it has an advantage over zstd, so I don't see a need to add a migration
option to select between the qpl software and hardware paths. I will therefore
keep QPL as an independent compression method in the next version, with no
additional migration options.

I will also add a guide to qpl-compression.rst about IAA permission issues and how to
determine whether the hardware path is available.
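
For illustration, a minimal sketch of such an initialization-time fallback using
the public QPL job API (the actual patch may structure this differently; the
qpl_alloc_job_with_fallback() helper name is hypothetical):

#include <stdint.h>
#include <stdlib.h>
#include <qpl/qpl.h>

/* Prefer the IAA hardware path; fall back to QPL's software path. */
static qpl_job *qpl_alloc_job_with_fallback(void)
{
    const qpl_path_t paths[] = { qpl_path_hardware, qpl_path_software };

    for (int i = 0; i < 2; i++) {
        uint32_t size = 0;
        qpl_job *job;

        if (qpl_get_job_size(paths[i], &size) != QPL_STS_OK) {
            continue;
        }
        job = malloc(size);
        if (!job) {
            return NULL;
        }
        if (qpl_init_job(paths[i], job) == QPL_STS_OK) {
            return job;   /* hardware path if available, otherwise software */
        }
        free(job);
    }
    return NULL;
}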

> > > > 2. Compression accelerator related patches are removed from this
> patch
> > > set and
> > > >    will be added to the QAT patch set, we will submit separate
> patches
> > > to use
> > > >    QAT to accelerate ZLIB and ZSTD.
> > > >
> > > > 3. Advantages of using IAA accelerator include:
> > > >    a. Compared with the non-compression method, it can improve
> downtime
> > > >       performance without adding additional host resources (both CPU
> and
> > > >       network).
> > > >    b. Compared with using software compression methods (ZSTD/ZLIB),
> it
> > > can
> > > >       provide high data compression ratio and save a lot of CPU
> > > resources
> > > >       used for compression.
> > > >
> > > > Test condition:
> > > >   1. Host CPUs are based on Sapphire Rapids
> > > >   2. VM type, 16 vCPU and 64G memory
> > > >   3. The source and destination respectively use 4 IAA devices.
> > > >   4. The workload in the VM
> > > >     a. all vCPUs are idle state
> > > >     b. 90% of the virtual machine's memory is used, use silesia to
> fill
> > > >        the memory.
> > > >        The introduction of silesia:
> > > >        https://sun.aei.polsl.pl//~sdeor/index.php?page=silesia
> > > >   5. Set "--mem-prealloc" boot parameter on the destination, this
> > > parameter
> > > >      can make IAA performance better and related introduction is
> added
> > > here.
> > > >      docs/devel/migration/qpl-compression.rst
> > > >   6. Source migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter downtime-limit 300
> > > >      f. migrate_set_parameter max-bandwidth 100G/1G
> > > >      d. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >   7. Destination migration configuration commands
> > > >      a. migrate_set_capability multifd on
> > > >      b. migrate_set_parameter multifd-channels 2/4/8
> > > >      c. migrate_set_parameter multifd-compression none/qpl/zstd
> > > >
> > > > Early migration result, each result is the average of three tests
> > > >
> > > >  +--------+-------------+--------+--------+---------+----------+----
> --|
> > > >  |        | The number  |total   |downtime|network  |pages per | CPU
> |
> > > >  | None   | of channels |time(ms)|(ms)    |bandwidth|second    |
> Util |
> > > >  | Comp   |             |        |        |(mbps)   |          |
> |
> > > >  |        +-------------+-----------------+---------+----------+----
> --+
> > > >  |Network |            2|    8571|      69|    58391|   1896525|
> 256%|
> > >
> > > Is this the average bandwidth?  I'm surprised that you can hit ~59Gbps
> > > only
> > > with 2 channels.  My previous experience is around ~1XGbps per
> channel, so
> > > no more than 30Gbps for two channels.  Is it because of a faster
> > > processor?
> > > Indeed from the 4/8 results it doesn't look like increasing the num of
> > > channels helped a lot, and even it got worse on the downtime.
> >
> > Yes, I use iperf3 to check the bandwidth for one core, the bandwith is
> 60Gbps.
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> >
> > And in the live migration test, a multifd thread's CPU utilization is
> almost 100%
> 
> This 60Gpbs per-channel is definitely impressive..
> 
> Have you tried migration without multifd on your system? Would that also
> perform similarly v.s. 2 channels multifd?

Simple Test result below:
VM Type: 16vCPU, 64G memory
Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
Migration Configurations:
1. migrate_set_parameter max-bandwidth 100G
2. migrate_set_parameter downtime-limit 300
3. migrate_set_capability multifd on (multiFD test case)
4. migrate_set_parameter multifd-channels 2 (multiFD test case)

                    Total time (ms)   Downtime (ms)   Throughput (mbps)   Pages per second
Without multifd               23580             307              21221             689588
Multifd (2 chan.)              7657             198              65410            2221176

> 
> The whole point of multifd is to scale on bandwidth.  If single thread can
> already achieve 60Gbps (where in my previous memory of tests, multifd can
> only reach ~70Gbps before..), then either multifd will be less useful with
> the new hardwares (especially when with a most generic socket nocomp
> setup), or we need to start working on bottlenecks of multifd to make it
> scale better.  Otherwise multifd will become a pool for compressor loads
> only.
> 
> >
> > > What is the rational behind "downtime improvement" when with the QPL
> > > compressors?  IIUC in this 100Gbps case the bandwidth is never a
> > > limitation, then I don't understand why adding the compression phase
> can
> > > make the switchover faster.  I can expect much more pages sent in a
> > > NIC-limted env like you described below with 1Gbps, but not when NIC
> has
> > > unlimited resources like here.
> >
> > The compression can improve the network stack overhead(not improve the
> RDMA
> > solution), the less data, the smaller the overhead in the
> > network protocol stack. If compression has no overhead, and network
> bandwidth
> > is not limited, the last memory copy is faster with compression
> >
> > The migration hotspot focuses on the _sys_sendmsg
> > _sys_sendmsg
> >   |- tcp_sendmsg
> >     |- copy_user_enhanced_fast_string
> >     |- tcp_push_one
> 
> Makes sense.  I assume that's logical indeed when the compression ratio is
> high enough, meanwhile if the compression work is fast enough to be much
> lower than sending extra data when without it.
> 
> Thanks,
> 
> --
> Peter Xu

Re: [PATCH v5 0/7] Live Migration With IAA
Posted by Peter Xu 1 month ago
On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> Yes, I will support software fallback to ensure CI testing and users can 
> still use qpl compression without IAA hardware.
> 
> Although the qpl software solution will have better performance than zlib, 
> I still don't think it has a greater advantage than zstd. I don't think there
> is a need to add a migration option to configure the qpl software or hardware path.
> So I will still only use QPL as an independent compression in the next version, and
> no other migration options are needed.

That should be fine.

> 
> I will also add a guide to qpl-compression.rst about IAA permission issues and how to
> determine whether the hardware path is available.

OK.

[...]

> > > Yes, I use iperf3 to check the bandwidth for one core, the bandwith is
> > 60Gbps.
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87 MBytes
> > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87 Mbytes
> > >
> > > And in the live migration test, a multifd thread's CPU utilization is
> > almost 100%
> > 
> > This 60Gpbs per-channel is definitely impressive..
> > 
> > Have you tried migration without multifd on your system? Would that also
> > perform similarly v.s. 2 channels multifd?
> 
> Simple Test result below:
> VM Type: 16vCPU, 64G memory
> Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
> Migration Configurations:
> 1. migrate_set_parameter max-bandwidth 100G
> 2. migrate_set_parameter downtime-limit 300
> 3. migrate_set_capability multifd on (multiFD test case)
> 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> 
>                   Totaltime (ms) Downtime (ms) Throughput (mbps) Pages-per-second
> without Multifd	23580	            307	         21221	       689588
> Multifd 2	       7657	            198	         65410	      2221176

Thanks for the test results.

So I am guessing that the migration overheads besides pushing the socket are
high enough to make it drop drastically, even though zero-page detection
shouldn't play a major role in this case, considering most of the guest memory
is pre-filled.

-- 
Peter Xu
RE: [PATCH v5 0/7] Live Migration With IAA
Posted by Liu, Yuan1 1 month ago
> -----Original Message-----
> From: Peter Xu <peterx@redhat.com>
> Sent: Thursday, March 28, 2024 11:22 PM
> To: Liu, Yuan1 <yuan1.liu@intel.com>
> Cc: farosas@suse.de; qemu-devel@nongnu.org; hao.xiang@bytedance.com;
> bryan.zhang@bytedance.com; Zou, Nanhai <nanhai.zou@intel.com>
> Subject: Re: [PATCH v5 0/7] Live Migration With IAA
> 
> On Thu, Mar 28, 2024 at 03:02:30AM +0000, Liu, Yuan1 wrote:
> > Yes, I will support software fallback to ensure CI testing and users can
> > still use qpl compression without IAA hardware.
> >
> > Although the qpl software solution will have better performance than
> zlib,
> > I still don't think it has a greater advantage than zstd. I don't think
> there
> > is a need to add a migration option to configure the qpl software or
> hardware path.
> > So I will still only use QPL as an independent compression in the next
> version, and
> > no other migration options are needed.
> 
> That should be fine.
> 
> >
> > I will also add a guide to qpl-compression.rst about IAA permission
> issues and how to
> > determine whether the hardware path is available.
> 
> OK.
> 
> [...]
> 
> > > > Yes, I use iperf3 to check the bandwidth for one core, the bandwith
> is
> > > 60Gbps.
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec    0   2.87
> MBytes
> > > > [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec    0   2.87
> Mbytes
> > > >
> > > > And in the live migration test, a multifd thread's CPU utilization
> is
> > > almost 100%
> > >
> > > This 60Gpbs per-channel is definitely impressive..
> > >
> > > Have you tried migration without multifd on your system? Would that
> also
> > > perform similarly v.s. 2 channels multifd?
> >
> > Simple Test result below:
> > VM Type: 16vCPU, 64G memory
> > Workload in VM: fill 56G memory with Silesia data and vCPUs are idle
> > Migration Configurations:
> > 1. migrate_set_parameter max-bandwidth 100G
> > 2. migrate_set_parameter downtime-limit 300
> > 3. migrate_set_capability multifd on (multiFD test case)
> > 4. migrate_set_parameter multifd-channels 2 (multiFD test case)
> >
> >                   Totaltime (ms) Downtime (ms) Throughput (mbps) Pages-
> per-second
> > without Multifd	23580	            307	         21221	       689588
> > Multifd 2	       7657	            198	         65410	      2221176
> 
> Thanks for the test results.
> 
> So I am guessing the migration overheads besides pushing the socket is
> high
> enough to make it drop drastically, even if in this case zero detection
> shouldn't play a major role considering most of guest mem is pre-filled.

Yes, for non-multifd migration, besides the network stack overhead, the
zero-page detection overhead (on both source and destination) is indeed very
high. Moving zero-page detection into multiple threads can reduce the
performance degradation caused by that overhead.

I also think migration doesn't need to detect zero pages by memcmp in all
cases. Zero-page detection mainly pays off when the VM's memory is known to
contain a large number of zero pages.

My experience in this area may be insufficient; I am working with Hao and
Bryan to see whether DSA hardware can be used to accelerate this part
(including zero-page detection and zero-page writing).

DSA is an accelerator for memory operations such as comparing, filling, and moving memory:
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
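
For reference, an illustrative sketch of the per-page zero detection cost being
discussed; QEMU's real code uses its optimized buffer_is_zero() helper rather
than a plain loop, and a DSA offload would replace exactly this kind of scan
(the helper names below are made up):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Check one page-aligned 4 KiB page for being all zeroes (word-wise scan). */
static bool page_is_zero(const uint8_t *page)
{
    const uint64_t *p = (const uint64_t *)page;

    for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Mark zero pages in a block of guest memory; returns the zero-page count. */
static size_t scan_zero_pages(const uint8_t *block, size_t npages,
                              bool *zero_bitmap)
{
    size_t zeros = 0;

    for (size_t i = 0; i < npages; i++) {
        zero_bitmap[i] = page_is_zero(block + i * PAGE_SIZE);
        zeros += zero_bitmap[i];
    }
    return zeros;
}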