[PATCH 0/4] Add Intel Data Streaming Accelerator offloading

Hao Xiang posted 4 patches 11 months, 2 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20230529182001.2232069-1-hao.xiang@bytedance.com
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Cornelia Huck <cohuck@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, "Marc-André Lureau" <marcandre.lureau@redhat.com>, "Daniel P. Berrangé" <berrange@redhat.com>, Thomas Huth <thuth@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Juan Quintela <quintela@redhat.com>, Peter Xu <peterx@redhat.com>, Leonardo Bras <leobras@redhat.com>, Kevin Wolf <kwolf@redhat.com>
[PATCH 0/4] Add Intel Data Streaming Accelerator offloading
Posted by Hao Xiang 11 months, 2 weeks ago
* Idea:

Intel Data Streaming Accelerator (DSA) was introduced in Intel's 4th generation
Xeon server, aka Sapphire Rapids.
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

One of the things DSA can do is offload memory comparison workloads from the
CPU to the accelerator hardware. This change proposes offloading QEMU's zero
page checking from the CPU to the DSA accelerator. For this to work, I am
looking into two potential improvements during QEMU background operations,
e.g., VM checkpointing and VM live migration.

1. Reduce CPU usage.

I did simple tracing of the save VM workload. I started VMs with 64GB, 128GB
and 256GB of RAM respectively and then ran the "savevm" command from QEMU's
command line. During this savevm workload, I recorded the CPU cycles spent in
the function save_snapshot and the total CPU cycles spent in repeated calls
into the function buffer_is_zero, which performs zero page checking.

|------------------------------------------|
|VM memory  |save_snapshot  |buffer_is_zero|
|capacity   |(CPU cycles)   |(CPU cycles)  |
|------------------------------------------|
|64GB       |19449838924    |5022504892    |
|------------------------------------------|
|128GB      |36966951760    |10049561066   |
|------------------------------------------| 
|256GB      |72744615676    |20042076440   |
|------------------------------------------|

In all three scenarios, the CPU cycles spent in zero page checking account for
roughly 27% of the total cycles spent in save_snapshot (for example,
20042076440 / 72744615676 is about 27.6% in the 256GB case). I believe this is
because a large part of save_snapshot performs file IO, writing all memory
pages into the QEMU image file, and the CPU cycles spent in savevm increase
linearly with the VM's total memory. If we can offload all zero page checking
calls to the DSA accelerator, we will reduce CPU usage in the savevm workload
by about 27%, potentially freeing CPU resources for other work. The same
savings should apply to the live VM migration workload as well. Furthermore,
if a guest VM's vcpu shares the physical CPU core used for live migration, the
guest VM will gain more underlying CPU resources, making it more responsive to
its own workload during live migration.

2. Reduce operation latency.

I did some benchmark testing on a pure CPU memory comparison implementation
and a DSA hardware offload implementation.

Testbed: Intel(R) Xeon(R) Platinum 8457C, CPU 3100MHz

Latency is measured by completing a memory comparison of two buffers, each 1GB
in size. The comparison is done via the CPU and the DSA accelerator
respectively. For the CPU comparison I use a single thread; for the DSA
comparison I use one DSA engine. In both cases the comparison is performed at
a granularity of 4KB buffers.

|-------------------------------|
|Implementation   |Latency      |
|-------------------------------|
|CPU (one thread) |80ms         |
|-------------------------------|
|DSA (one engine) |89ms         |
|-------------------------------|

Our test system has two sockets and two DSA devices per socket, and each DSA
device has four engines built in. I believe that if we leverage more DSA
engine resources and parallelize zero page checking well, we can keep the DSA
devices busy and reduce CPU usage.
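
For reference, the shape of a single submission through the idxd user-space
interface looks roughly like the sketch below. This is only an illustration of
the UAPI, not the code in util/dsa.c: it assumes a dedicated work queue
configured for user access with shared virtual addressing (a shared work queue
would use ENQCMD instead of MOVDIR64B), the wq node name is just an example,
and error handling is omitted.

    /*
     * Sketch: check one 4KB page for zero content with a single DSA
     * "compare with pattern" descriptor. Build with: gcc -mmovdir64b
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <x86intrin.h>      /* _movdir64b(), _mm_pause() */
    #include <linux/idxd.h>     /* struct dsa_hw_desc, dsa_completion_record */

    static struct dsa_completion_record comp __attribute__((aligned(32)));
    static char page[4096];     /* buffer to test; all zeros here */

    int main(void)
    {
        /* Example wq node; the real name depends on the accel-config setup. */
        int fd = open("/dev/dsa/wq0.0", O_RDWR);
        void *portal = mmap(NULL, 0x1000, PROT_WRITE,
                            MAP_SHARED | MAP_POPULATE, fd, 0);

        struct dsa_hw_desc desc = {
            .opcode          = DSA_OPCODE_COMPVAL,  /* compare with pattern */
            .flags           = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR,
            .src_addr        = (uintptr_t)page,
            .comp_pattern    = 0,                   /* compare against zero */
            .xfer_size       = sizeof(page),
            .completion_addr = (uintptr_t)&comp,
        };

        comp.status = 0;
        _movdir64b(portal, &desc);                  /* 64-byte atomic store */

        /* Busy poll for completion; see "Future work" item 3 below. */
        while (*(volatile uint8_t *)&comp.status == 0) {
            _mm_pause();
        }

        if (comp.status == DSA_COMP_SUCCESS) {
            /* result == 0 means the buffer matched the pattern (all zero). */
            printf("page is %szero\n", comp.result == 0 ? "" : "not ");
        } else {
            printf("DSA error, completion status 0x%x\n", comp.status);
        }

        munmap(portal, 0x1000);
        close(fd);
        return 0;
    }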

* Current state:

This patch set implements the DSA offloading operation for zero page checking.
Users can optionally replace the zero page checking function with DSA
offloading by specifying a new argument on the QEMU command line. There is no
performance gain from this change yet. This is mainly because zero page
checking is a synchronous operation and each page is 4KB: offloading a single
4KB page comparison to the DSA accelerator and waiting for the driver to
complete the operation introduces overhead, and currently that overhead is
bigger than the CPU cycles saved by offloading.
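
To make the shape of the change concrete, the hook is conceptually something
like the sketch below. The names dsa_enabled, buffer_zero_dsa and
buffer_is_zero_cpu are placeholders for illustration only, not the exact
symbols used in the patches.

    #include <stdbool.h>
    #include <stddef.h>

    extern bool dsa_enabled;    /* hypothetical flag set from the command line */
    bool buffer_zero_dsa(const void *buf, size_t len);    /* one DSA compare */
    bool buffer_is_zero_cpu(const void *buf, size_t len); /* existing SIMD path */

    bool buffer_is_zero(const void *buf, size_t len)
    {
        if (dsa_enabled) {
            /*
             * One synchronous descriptor per call: for a 4KB page the
             * submit/wait overhead currently exceeds the cycles saved.
             */
            return buffer_zero_dsa(buf, len);
        }
        return buffer_is_zero_cpu(buf, len);
    }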

* Future work:

1. Make the zero page checking workflow asynchronous. The idea is that we can
throw lots of zero page checking operations at once at N (configurable) DSA
engines and then wait for those operations to be completed by idxd (the DSA
device driver). Currently ram_save_iterate has a loop that iterates through
all the memory blocks, finds the dirty pages and saves them all; the loop
exits when there are no more dirty pages to save. I think when we walk through
the memory blocks we only need to identify whether dirty pages remain, and we
can do the actual "save page" work asynchronously; ram_save_iterate can return
once we have finished walking the memory blocks and all pages are saved. This
sounds like a pretty large refactoring change and I am looking hard into this
path to figure out exactly how to tackle it. Any feedback would be really
appreciated.
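
For example, the submit-then-poll shape I have in mind looks roughly like the
sketch below, reusing the descriptor setup from the earlier sketch; BATCH,
submit_zero_checks and page_is_zero are made-up names.

    #include <stdbool.h>
    #include <stdint.h>
    #include <x86intrin.h>
    #include <linux/idxd.h>

    #define BATCH 64    /* illustrative batch size */

    static struct dsa_hw_desc descs[BATCH];
    static struct dsa_completion_record comps[BATCH] __attribute__((aligned(32)));

    /* Queue one compare-with-zero descriptor per dirty page, without waiting. */
    void submit_zero_checks(void *wq_portal, void *const pages[], int n)
    {
        for (int i = 0; i < n; i++) {
            descs[i] = (struct dsa_hw_desc) {
                .opcode          = DSA_OPCODE_COMPVAL,
                .flags           = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR,
                .src_addr        = (uintptr_t)pages[i],
                .comp_pattern    = 0,
                .xfer_size       = 4096,
                .completion_addr = (uintptr_t)&comps[i],
            };
            comps[i].status = 0;
            _movdir64b(wq_portal, &descs[i]);
        }
    }

    /* Called later, once the dirty-bitmap walk has moved on. */
    bool page_is_zero(int i)
    {
        while (*(volatile uint8_t *)&comps[i].status == 0) {
            _mm_pause();    /* placeholder; see item 3 below */
        }
        return comps[i].status == DSA_COMP_SUCCESS && comps[i].result == 0;
    }

DSA also has a BATCH opcode (DSA_OPCODE_BATCH) that submits a whole descriptor
list in one submission, which may be a better fit here.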

2. Implement an abstraction layer where QEMU can simply hand zero page
checking operations to the DSA layer, and the DSA layer figures out which
work queue/engine handles each operation. We can probably use a round-robin
dispatcher to balance the work across multiple DSA engines.
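
A trivial version of such a dispatcher could look like this; the portal array
and helper name are made up for illustration.

    #include <stdatomic.h>

    #define NUM_DSA_WQS 4                   /* e.g. one wq per engine in use */

    static void *wq_portals[NUM_DSA_WQS];   /* mmap'ed work queue portals */
    static atomic_uint next_wq;

    /* Return the next work queue portal in round-robin order. */
    void *pick_wq_portal(void)
    {
        unsigned idx = atomic_fetch_add(&next_wq, 1) % NUM_DSA_WQS;
        return wq_portals[idx];
    }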

3. The current patch uses a busy loop to poll for DSA completions, which is
really bad practice. I need to either use the umonitor/umwait instructions or
user-mode interrupts for true asynchronous completion.
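
For reference, a umonitor/umwait based wait could look roughly like this; it
requires a CPU with WAITPKG and building with -mwaitpkg, and the TSC deadline
value is arbitrary.

    #include <stdint.h>
    #include <x86intrin.h>      /* _umonitor(), _umwait(), __rdtsc() */
    #include <linux/idxd.h>

    void wait_for_completion(struct dsa_completion_record *comp)
    {
        while (*(volatile uint8_t *)&comp->status == 0) {
            _umonitor(comp);    /* arm the monitor on the completion record */
            if (*(volatile uint8_t *)&comp->status != 0) {
                break;          /* completed while we were arming */
            }
            /* Sleep in C0.1 until the record is written or the deadline
             * (an arbitrary ~100k cycles from now) expires. */
            _umwait(1, __rdtsc() + 100000);
        }
    }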

4. The DSA device can also offload other operations.
* memcpy
* xbzrle encoding/decoding
* crc32

base-commit: ac84b57b4d74606f7f83667a0606deef32b2049d

Hao Xiang (4):
  Introduce new instruction set enqcmd/mmovdir64b to the build system.
  Add dependency idxd.
  Implement zero page checking using DSA.
  Add QEMU command line argument to enable DSA offloading.

 include/qemu/cutils.h                |   6 +
 linux-headers/linux/idxd.h           | 356 +++++++++++++++++++++++++++
 meson.build                          |   3 +
 meson_options.txt                    |   4 +
 migration/ram.c                      |   4 +
 qemu-options.hx                      |  10 +
 scripts/meson-buildoptions.sh        |   6 +
 softmmu/runstate.c                   |   4 +
 softmmu/vl.c                         |  22 ++
 storage-daemon/qemu-storage-daemon.c |   2 +
 util/bufferiszero.c                  |  14 ++
 util/dsa.c                           | 295 ++++++++++++++++++++++
 util/meson.build                     |   1 +
 13 files changed, 727 insertions(+)
 create mode 100644 linux-headers/linux/idxd.h
 create mode 100644 util/dsa.c

-- 
2.30.2
Re: [PATCH 0/4] Add Intel Data Streaming Accelerator offloading
Posted by Hao Xiang 11 months, 2 weeks ago
Hi all, this is meant to be an RFC. Sorry I didn't put that in the email
subject correctly.
From: "Hao Xiang"<hao.xiang@bytedance.com>
Date:  Mon, May 29, 2023, 11:20
Subject:  [PATCH 0/4] Add Intel Data Streaming Accelerator offloading
To: <pbonzini@redhat.com>, <quintela@redhat.com>, <qemu-devel@nongnu.org>
Cc: "Hao Xiang"<hao.xiang@bytedance.com>
* Idea:

Intel Data Streaming Accelerator(DSA) is introduced in Intel's 4th
generation
Xeon server, aka Sapphire Rapids.
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

One of the things DSA can do is to offload memory comparison workload from
CPU to DSA accelerator hardware. This change proposes a solution to offload
QEMU's zero page checking from CPU to DSA accelerator hardware. For this to
work, I am looking into two potential improvements during QEMU background
operations, eg, VM checkpoint, VM live migration.

1. Reduce CPU usage.

I did a simple tracing for the save VM workload. I started VMs with
64GB RAM, 128GB RAM and 256GB RAM respectively and then ran the "savevm"
command from QEMU's commandline. During this savevm workload, I recorded
the CPU cycles spent in function save_snapshot and the total CPU cycles
spent in repeated callings into function buffer_is_zero, which performs
zero page checking.

|------------------------------------------|
|VM memory  |save_snapshot  |buffer_is_zero|
|capacity   |(CPU cycles)   |(CPU cycles)  |
|------------------------------------------|
|64GB       |19449838924    |5022504892    |
|------------------------------------------|
|128GB      |36966951760    |10049561066   |
|------------------------------------------|
|256GB      |72744615676    |20042076440   |
|------------------------------------------|

In the three scenarios, the CPU cycles spent in zero page checking accounts
roughly 27% of the total cycles spent in save_snapshot. I believe this is
due
to the fact that a large part of save_snapshot performs file IO operations
writing all memory pages into the QEMU image file and there is a linear
increase
on CPU cycles spent in savevm as the VM's total memory increases. If we can
offload all zero page checking function calls to the DSA accelerator, we
will
reduce the CPU usage by 27% in the savevm workload, potentially freeing CPU
resources for other work. The same savings should apply to live VM
migration
workload as well. Furthermore, if the guest VM's vcpu shares the same
physical
CPU core used for live migration, the guest VM will gain more underlying
CPU
resource and hence making the guest VM more responsive to it's own guest
workload
during live migration.

2. Reduce operation latency.

I did some benchmark testing on pure CPU memomory comparison implementation
and DSA hardware offload implementation.

Testbed: Intel(R) Xeon(R) Platinum 8457C, CPU 3100MHz

Latency is measured by completing memory comparison on two memory buffers,
each
with one GB in size. The memory comparison are done via CPU and DSA
accelerator
respectively. When performing CPU memory comparison, I use a single thread.
When
performing DSA accelerator memory comparison, I use one DSA engine. While
doing
memory comparison, both CPU and DSA based implementation uses 4k memory
buffer
as the granularity for comparison.

|-------------------------------|
|Memory           |Latency      |
|-------------------------------|
|CPU one thread   |80ms         |
|-------------------------------|
|DSA one engine   |89ms         |
|-------------------------------|

In our test system, we have two sockets and two DSA devices per socket.
Each
DSA device has four engines built in. I believe that if we leverage more
DSA
engine resources and a good parallelization on zero page checking, we can
keep the DSA devices busy and reduce CPU usage.

* Current state:

This patch implements the DSA offloading operation for zero page checking.
User can optionally replace the zero page checking function with DSA
offloading
by specifying a new argument in qemu start up commandline. There is no
performance gain in this change. This is mainly because zero page checking
is
a synchronous operation and each page size is 4k. Offloading a single 4k
memory
page comparison to the DSA accelerator and wait for the driver to complete
the operation introduces overhead. Currently the overhead is bigger than
the CPU cycles saved due to offloading.

* Future work:

1. Need to make the zero page checking workflow asynchronous. The idea is
that
we can throw lots of zero page checking operations at once to
N(configurable)
DSA engines. Then we wait for those operations to be completed by idxd (DSA
device driver). Currently ram_save_iterate has a loop to iterate through
all
the memory blocks, find the dirty pages and save them all. The loop exits
when there is no more dirty pages to save. I think when we walk through all
the memory blocks, we just need to identify whether there is dirty pages
remaining but we can do the actual "save page" asynchronously. We can
return
from ram_save_iterate when we finish walking through the memory blocks and
all pages are saved. This sounds like a pretty large refactoring change and
I am looking hard into this path to figure out exactly how I can tackle it.
Any feedback would be really appreciated.

2. Need to implement an abstraction layer where QEMU can just throw zero
page
checking operations to the DSA layer and the DSA layer will figure out
which
work queue/engine to handle the operation. Probably we can use a
round-robin
dispatcher to balance the work across multiple DSA engines.

3. The current patch uses busy loop to pull for DSA completions and that's
really a bad practice. I need to either use the umonitor/umwait
instructions
or user mode interrupt for true async completion.

4. The DSA device can also offload other operations.
* memcpy
* xbzrle encoding/decoding
* crc32

base-commit: ac84b57b4d74606f7f83667a0606deef32b2049d

Hao Xiang (4):
  Introduce new instruction set enqcmd/mmovdir64b to the build system.
  Add dependency idxd.
  Implement zero page checking using DSA.
  Add QEMU command line argument to enable DSA offloading.

include/qemu/cutils.h                |   6 +
linux-headers/linux/idxd.h           | 356 +++++++++++++++++++++++++++
meson.build                          |   3 +
meson_options.txt                    |   4 +
migration/ram.c                      |   4 +
qemu-options.hx                      |  10 +
scripts/meson-buildoptions.sh        |   6 +
softmmu/runstate.c                   |   4 +
softmmu/vl.c                         |  22 ++
storage-daemon/qemu-storage-daemon.c |   2 +
util/bufferiszero.c                  |  14 ++
util/dsa.c                           | 295 ++++++++++++++++++++++
util/meson.build                     |   1 +
13 files changed, 727 insertions(+)
create mode 100644 linux-headers/linux/idxd.h
create mode 100644 util/dsa.c

-- 
2.30.2