[QEMU-devel][RFC PATCH 0/1] Introduce HostMemType for 'memory-backend-*'

Posted by Ho-Ren (Jack) Chuang 11 months ago
Heterogeneous memory setups are becoming popular. Today, we can run a server
system combining DRAM, pmem, HBM, and CXL-attached DDR. CXL-attached DDR
memory shares the same set of attributes as normal system DRAM but with higher
latency and lower bandwidth. With the rapid increase in CPU core counts in
today's server platforms, memory capacity and bandwidth become bottlenecks.
High-capacity memory devices are very expensive and deliver poor value on a
dollar-per-GB basis. With only a limited number of memory channels per
socket, the total memory capacity per socket is effectively limited by cost.

As a cloud service provider, we offer virtual machines as a fundamental
service. Our virtual machine models have pre-set vCPU counts and memory
capacities; large-memory VM models provide a higher memory capacity per vCPU.
Delivering VM instances with the same vCPU and memory requirements
on new-generation Intel/AMD server platforms becomes challenging as
the CPU core count rapidly increases. With the help of CXL local
memory expanders, we can install more DDR memory devices on a socket and
almost double the total memory capacity per socket at a reasonable cost on
new server platforms. Thus, we can continue to deliver existing VM models. On
top of that, low-cost, large memory capacity VM models become a possibility.

CXL-attached memory (CXL type-3 device) can be used in exactly the same way
as system-DRAM but with somewhat degraded performance. QEMU is in the process
of supporting CXL virtualization. Currently, in QEMU, we can already create
virtualized CXL memory devices, and a guest OS running the latest Linux kernel
can successfully bring CXL memory online.
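
For reference, a minimal volatile CXL type-3 topology along the lines of
QEMU's CXL documentation looks roughly like the following (device IDs and
sizes are illustrative, not the exact configuration we used):

  qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 \
    -object memory-backend-ram,id=vmem0,share=on,size=256M \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
    -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

The guest then assembles a region with the cxl/daxctl tools and onlines the
resulting memory.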

We conducted benchmark testing on VMs with three setups:
1. VM with virtualized system-DRAM, backed by system-DRAM on
the physical host. No virtualized CXL memory.
2. VM with virtualized system-DRAM, backed by CXL-attached memory on
the physical host. No virtualized CXL memory.
3. VM with virtualized system-DRAM, backed by system-DRAM on
the physical host, and virtualized CXL memory backed by CXL-attached memory on
the physical host.
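
For context, setups (1) and (2) can be expressed with today's QEMU by
binding the memory backend to specific host NUMA nodes; a minimal sketch
(the host node IDs below are assumptions and are host-specific):

  # Setup (1): guest RAM backed by host system-DRAM (assume host node 0)
  -object memory-backend-ram,id=mem0,size=16G,policy=bind,host-nodes=0,prealloc=on

  # Setup (2): the same guest-visible RAM, backed by a host CXL NUMA node
  # (assume the dax kmem driver exposed the CXL memory as host node 2)
  -object memory-backend-ram,id=mem0,size=16G,policy=bind,host-nodes=2,prealloc=on

Setup (3) combines a system-DRAM-bound backend for the guest's system RAM
with a CXL-node-bound backend attached to the virtualized CXL type-3 device.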

Benchmark 1: Intel Memory Latency Checker
Link: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
Guest VM idle latency for random access in nanoseconds:
- System-DRAM backed by system-DRAM on the host = 116.1
- CXL memory backed by CXL-attached DRAM on the host = 266.8
- System-DRAM backed by CXL-attached DRAM on the host = 269.7
From within the guest VM, read/write latency on memory backed by host
CXL-DRAM is roughly 2X that of memory backed by host system-DRAM. We observe
the same result regardless of whether the memory is exposed to the guest as
virtualized system-DRAM or as virtualized CXL memory. The driving factor for
performance is the memory backend on the host (backing memory type:
system-DRAM vs CXL-DRAM), not the frontend memory type exposed to the guest
OS (virtualized system-DRAM vs virtualized CXL-DRAM).

Benchmark 2: Redis memtier benchmark
Link: https://redis.com/blog/memtier_benchmark-a-high-throughput-benchmarking-tool-for-redis-memcached/
Guest VM Redis concurrent read latency in milliseconds:
- Key size = 800B, value size = 800B
                         P50         P99         P999
(1) System-DRAM only    13.43       40.95       243.71
(2) CXL-DRAM only       29.18       49.15       249.85
(3) Tiered memory       13.76       39.16       241.66

- Key size = 800B, value size = 70kB
                           P50        P99        P999
(1) System-DRAM only     342.01     630.78      925.69
(2) CXL-DRAM only        696.32     720.89     1007.61
(3) Tiered memory        610.30     671.74     1011.71

From within the guest VM, the Redis server is filled with a large number of
in-memory key-value pairs. Almost all memory is used inside the VM. We then
start a workload with concurrent read operations. For (3), we only read the
key-value pairs located in CXL-DRAM.

- When the value size is small (800 bytes), the P50 read latency is almost
the same for the tiered memory and system-DRAM-only setups, even though in
the tiered setup the read workload consistently hits key-value pairs stored
in CXL memory. At this value size, read latency is dominated by the rest of
the stack (network communication, CPU caches) rather than by memory access.
However, the P50 latency of the CXL-only setup is more than 2X that of the
other two setups, so it appears critical for good performance that the guest
Linux kernel itself runs on system-DRAM-backed memory.

- When the value size is large (70 kB), the P50 read latency of the tiered
memory setup is about 70% worse than that of the system-DRAM-only setup, and
the CXL-only setup consistently performs the worst. At this value size, read
latency is dominated by reading the value out of the Redis server's memory,
and in the tiered setup the read workload consistently hits key-value pairs
stored in CXL memory.

Please note that in our experiment, the tiered memory system did not
promote/demote pages as expected. The Linux kernel community has developed a
tiered memory system to better utilize the various types of DRAM-like memory
(see "The future of memory tiering": https://lwn.net/Articles/931421/), and
the tiered setup should perform better as the page promotion/demotion
algorithms continue to improve.
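
As a side note, getting promotion/demotion to happen at all requires opting
in to the guest kernel's tiering knobs. A minimal sketch, assuming a recent
kernel with memory-tiering NUMA balancing and demotion support (the exact
knobs depend on the kernel version):

  # inside the guest: allow cold pages to be demoted to the CXL tier
  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  # enable NUMA balancing in memory-tiering mode so hot pages get promoted
  echo 2 > /proc/sys/kernel/numa_balancing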


Having the guest kernel running on system-DRAM-backed memory and
application data running on CXL-backed memory shows comparable performance to
the topline (system-DRAM only), and it is also a cost-effective solution.
Moreover, in the near future, users will be able to benefit from memories
with different characteristics. Enabling the tiered memory system in the
guest OS seems to be a step in the right direction. To enable these scenarios
end to end, we need some plumbing work in the memory backend and CXL memory
virtualization stacks.

- QEMU's memory backend object needs an option to automatically map guest
memory to a specified type of memory on the host. Taking CXL-attached memory
as an example, we can then create a virtualized CXL type-3 device as the
frontend and automatically map it to a CXL memory backend. This patchset
contains a prototype implementation that accomplishes this. We introduce a
new configuration option 'host-mem-type=', which lets users specify the type
of memory to allocate from. The value 'cxlram' automatically locates
CXL-DRAM NUMA nodes on the host and uses them as the backing memory, which
saves users from having to track down node IDs by hand. Since the Linux
kernel has no API for explicitly allocating memory from CXL-attached memory,
the prototype relies on the information exposed by the dax kmem driver under
the sysfs path
'/sys/bus/cxl/devices/region[X]/dax_region[X]/dax[X]/target_node'
(see the usage sketch after this list).

- Kernel memory tiering uses the dax kmem driver's device probe path to
query ACPI for the CXL device's attributes (latency, bandwidth) and to
calculate its abstract distance; the abstract distance places the memory in
the correct tier. Although QEMU already provides the "-numa hmat-lb" option
to set memory latency/bandwidth attributes (an example follows this list),
we were not able to connect the dots end to end: after setting the
attributes in QEMU, booting the VM, and creating devdax CXL devices, the
guest kernel could not correctly read the memory attributes of the devdax
devices. We are still debugging that path, but we suspect missing
functionality in CXL virtualization support.

- When creating two virtualized CXL type-3 devices and bringing them up
with the cxl and daxctl tools, we were not able to create the second memory
region/devdax device inside the VM. We are still debugging this issue, but
we would appreciate feedback if others are hitting similar problems.
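
As a usage sketch of the option proposed in this RFC (the 'host-mem-type'
and 'cxlram' names come from this patch; the exact syntax may change as the
series evolves):

  # back a guest NUMA node with host CXL-DRAM, located automatically
  -object memory-backend-ram,id=cxlmem0,size=16G,host-mem-type=cxlram \
  -numa node,nodeid=1,memdev=cxlmem0

  # on the host, the prototype discovers CXL nodes from dax kmem sysfs, e.g.
  cat /sys/bus/cxl/devices/region*/dax_region*/dax*/target_node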
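
For reference, the existing hmat-lb syntax we experimented with looks
roughly like this (requires '-machine hmat=on'; node 0 is assumed to hold
the vCPUs, and the latency/bandwidth values are purely illustrative):

  -machine hmat=on \
  -object memory-backend-ram,size=4G,id=m0 \
  -object memory-backend-ram,size=4G,id=m1 \
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1,initiator=0 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200G \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=270 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=30G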

Ho-Ren (Jack) Chuang (1):
  backends/hostmem: qapi/qom: Add ObjectOptions for memory-backend-*
    called HostMemType

 backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |   1 +
 qapi/common.json         |  19 ++++
 qapi/qom.json            |   1 +
 qemu-options.hx          |   2 +-
 5 files changed, 206 insertions(+), 1 deletion(-)

-- 
Regards,
Hao Xiang and Ho-Ren (Jack) Chuang