From: Jingqi Liu <jingqi.liu@intel.com>
To: imammedo@redhat.com, xiaoguangrong.eric@gmail.com, mst@redhat.com,
    marcel.apfelbaum@gmail.com, pbonzini@redhat.com,
    richard.henderson@linaro.org, ehabkost@redhat.com
Subject: [PATCH v2 1/1] nvdimm: add 'target-node' option
Date: Mon, 19 Jul 2021 10:01:53 +0800
Message-Id:
 <20210719020153.30574-2-jingqi.liu@intel.com>
In-Reply-To: <20210719020153.30574-1-jingqi.liu@intel.com>
References: <20210719020153.30574-1-jingqi.liu@intel.com>
Cc: Jingqi Liu <jingqi.liu@intel.com>, qemu-devel@nongnu.org

Linux kernel version 5.1 brings in support for the volatile use of
persistent memory as a hotplugged memory region (KMEM DAX).
When this feature is enabled, persistent memory can be seen as one or
more separate memory-only NUMA nodes. This newly added memory can be
selected by its unique NUMA node.

Add a 'target-node' option to the 'nvdimm' device to indicate this NUMA
node. The target node may be a new node beyond all existing NUMA nodes.

The 'node' option of the 'pc-dimm' device adds the DIMM to an existing
NUMA node, so 'node' must be one of the available NUMA nodes. For KMEM
DAX mode, persistent memory can instead live in a new, separate
memory-only NUMA node that is created dynamically. Users therefore use
'target-node' to control whether persistent memory is added to an
existing NUMA node or to a new NUMA node.
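For contrast, a minimal sketch of the two styles of device placement
(device and backend IDs here are made up for illustration):

```
# 'node' places a regular DIMM into an existing NUMA node:
-device pc-dimm,id=dimm1,memdev=mem1,node=1

# 'target-node' places an nvdimm into a dynamically created node:
-device nvdimm,id=nvdimm1,memdev=nvmem1,target-node=2
```

The two options are mutually exclusive on a given nvdimm device; the
pre-plug check in this patch rejects a device that sets both.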
An example configuration is as follows.

Using the following QEMU command:
 -object memory-backend-file,id=nvmem1,share=on,mem-path=/dev/dax0.0,size=3G,align=2M
 -device nvdimm,id=nvdimm1,memdev=nvmem1,label-size=128K,target-node=2

To list DAX devices:
 # daxctl list -u
 {
   "chardev":"dax0.0",
   "size":"3.00 GiB (3.22 GB)",
   "target_node":2,
   "mode":"devdax"
 }

To create a namespace in Device-DAX mode as standard memory:
 $ ndctl create-namespace --mode=devdax --map=mem

To reconfigure the DAX device from devdax mode to system-ram mode:
 $ daxctl reconfigure-device dax0.0 --mode=system-ram

There are two existing NUMA nodes in the guest. After these operations,
persistent memory is configured as a separate Node 2 and can be used as
volatile memory. This NUMA node is created dynamically according to
'target-node'.

Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
---
 docs/nvdimm.txt         | 93 +++++++++++++++++++++++++++++++++++++++++
 hw/acpi/nvdimm.c        | 18 ++++----
 hw/i386/acpi-build.c    | 12 +++++-
 hw/i386/pc.c            |  4 ++
 hw/mem/nvdimm.c         | 43 ++++++++++++++++++-
 include/hw/mem/nvdimm.h | 17 +++++++-
 util/nvdimm-utils.c     | 22 ++++++++++
 7 files changed, 198 insertions(+), 11 deletions(-)

diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
index 0aae682be3..083d954bb4 100644
--- a/docs/nvdimm.txt
+++ b/docs/nvdimm.txt
@@ -107,6 +107,99 @@
 Note: may result guest data corruption (e.g. breakage of guest file system).
 
+Target node
+-----------
+
+Linux kernel version 5.1 brings in support for the volatile use of
+persistent memory as a hotplugged memory region (KMEM DAX).
+When this feature is enabled, persistent memory can be seen as one or
+more separate memory-only NUMA nodes. This newly added memory can be
+selected by its unique NUMA node.
+The 'target-node' option of the nvdimm device indicates this NUMA node,
+which may be a new node beyond all existing NUMA nodes.
+
+An example configuration is presented below.
+
+Using the following QEMU command:
+ -object memory-backend-file,id=nvmem1,share=on,mem-path=/dev/dax0.0,size=3G,align=2M
+ -device nvdimm,id=nvdimm1,memdev=nvmem1,label-size=128K,target-node=1
+
+The operations below are performed in the guest.
+
+To list the available NUMA nodes using numactl:
+ # numactl -H
+ available: 1 nodes (0)
+ node 0 cpus: 0 1 2 3 4 5 6 7
+ node 0 size: 5933 MB
+ node 0 free: 5457 MB
+ node distances:
+ node   0
+   0:  10
+
+To create a namespace in Device-DAX mode as standard memory from
+all the available capacity of the NVDIMM:
+
+ # ndctl create-namespace --mode=devdax --map=mem
+ {
+   "dev":"namespace0.0",
+   "mode":"devdax",
+   "map":"mem",
+   "size":"3.00 GiB (3.22 GB)",
+   "uuid":"4e4d8293-dd3b-4e43-8ad9-7f3d2a8d1680",
+   "daxregion":{
+     "id":0,
+     "size":"3.00 GiB (3.22 GB)",
+     "align":2097152,
+     "devices":[
+       {
+         "chardev":"dax0.0",
+         "size":"3.00 GiB (3.22 GB)",
+         "target_node":1,
+         "mode":"devdax"
+       }
+     ]
+   },
+   "align":2097152
+ }
+
+To list DAX devices:
+ # daxctl list -u
+ {
+   "chardev":"dax0.0",
+   "size":"3.00 GiB (3.22 GB)",
+   "target_node":1,
+   "mode":"devdax"
+ }
+
+To reconfigure the DAX device from devdax mode to system-ram mode:
+ # daxctl reconfigure-device dax0.0 --mode=system-ram
+ [
+   {
+     "chardev":"dax0.0",
+     "size":3217031168,
+     "target_node":1,
+     "mode":"system-ram",
+     "movable":false
+   }
+ ]
+
+After this operation, persistent memory is configured as a separate NUMA node
+and can be used as volatile memory.
+The new NUMA node is Node 1:
+ # numactl -H
+ available: 2 nodes (0-1)
+ node 0 cpus: 0 1 2 3 4 5 6 7
+ node 0 size: 5933 MB
+ node 0 free: 5339 MB
+ node 1 cpus:
+ node 1 size: 2816 MB
+ node 1 free: 2815 MB
+ node distances:
+ node   0   1
+   0:  10  20
+   1:  20  10
+
+
 Hotplug
 -------
 
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index e3d5fe1939..ebce0d6e68 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -228,8 +228,8 @@ nvdimm_build_structure_spa(GArray *structures, DeviceState *dev)
                                             NULL);
     uint64_t size = object_property_get_uint(OBJECT(dev), PC_DIMM_SIZE_PROP,
                                              NULL);
-    uint32_t node = object_property_get_uint(OBJECT(dev), PC_DIMM_NODE_PROP,
-                                             NULL);
+    int target_node = object_property_get_uint(OBJECT(dev), NVDIMM_TARGET_NODE_PROP,
+                                               NULL);
     int slot = object_property_get_int(OBJECT(dev), PC_DIMM_SLOT_PROP,
                                        NULL);
 
@@ -251,7 +251,7 @@ nvdimm_build_structure_spa(GArray *structures, DeviceState *dev)
                               valid*/);
 
     /* NUMA node. */
-    nfit_spa->proximity_domain = cpu_to_le32(node);
+    nfit_spa->proximity_domain = cpu_to_le32(target_node);
     /* the region reported as PMEM.
      */
     memcpy(nfit_spa->type_guid, nvdimm_nfit_spa_uuid,
            sizeof(nvdimm_nfit_spa_uuid));
@@ -1337,8 +1337,9 @@ static void nvdimm_build_ssdt(GArray *table_offsets, GArray *table_data,
     free_aml_allocator();
 }
 
-void nvdimm_build_srat(GArray *table_data)
+int nvdimm_build_srat(GArray *table_data)
 {
+    int max_target_node = nvdimm_check_target_nodes();
     GSList *device_list = nvdimm_get_device_list();
 
     for (; device_list; device_list = device_list->next) {
@@ -1346,17 +1347,20 @@ void nvdimm_build_srat(GArray *table_data)
         DeviceState *dev = device_list->data;
         Object *obj = OBJECT(dev);
         uint64_t addr, size;
-        int node;
+        int target_node;
 
-        node = object_property_get_int(obj, PC_DIMM_NODE_PROP, &error_abort);
+        target_node = object_property_get_uint(obj, NVDIMM_TARGET_NODE_PROP,
+                                               &error_abort);
         addr = object_property_get_uint(obj, PC_DIMM_ADDR_PROP, &error_abort);
         size = object_property_get_uint(obj, PC_DIMM_SIZE_PROP, &error_abort);
 
         numamem = acpi_data_push(table_data, sizeof *numamem);
-        build_srat_memory(numamem, addr, size, node,
+        build_srat_memory(numamem, addr, size, target_node,
                           MEM_AFFINITY_ENABLED | MEM_AFFINITY_NON_VOLATILE);
     }
     g_slist_free(device_list);
+
+    return max_target_node;
 }
 
 void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 796ffc6f5c..19bf91063f 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1879,6 +1879,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
     AcpiSratMemoryAffinity *numamem;
 
     int i;
+    int max_node = 0;
     int srat_start, numa_start, slots;
     uint64_t mem_len, mem_base, next_base;
     MachineClass *mc = MACHINE_GET_CLASS(machine);
@@ -1974,7 +1975,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
     }
 
     if (machine->nvdimms_state->is_enabled) {
-        nvdimm_build_srat(table_data);
+        max_node = nvdimm_build_srat(table_data);
     }
 
     slots =
         (table_data->len - numa_start) / sizeof *numamem;
@@ -1992,9 +1993,16 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
      * providing _PXM method if necessary.
      */
     if (hotplugabble_address_space_size) {
+        if (max_node < 0) {
+            max_node = pcms->numa_nodes - 1;
+        } else {
+            max_node = max_node > pcms->numa_nodes - 1 ?
+                       max_node : pcms->numa_nodes - 1;
+        }
+
         numamem = acpi_data_push(table_data, sizeof *numamem);
         build_srat_memory(numamem, machine->device_memory->base,
-                          hotplugabble_address_space_size, pcms->numa_nodes - 1,
+                          hotplugabble_address_space_size, max_node,
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
     }
 
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index c6d8d0d84d..debf26b31e 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1251,6 +1251,10 @@ static void pc_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 
     pc_dimm_pre_plug(PC_DIMM(dev), MACHINE(hotplug_dev),
                      pcmc->enforce_aligned_dimm ? NULL : &legacy_align, errp);
+
+    if (is_nvdimm) {
+        nvdimm_pre_plug(NVDIMM(dev), MACHINE(hotplug_dev), errp);
+    }
 }
 
 static void pc_memory_plug(HotplugHandler *hotplug_dev,
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 7397b67156..c864e6717b 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -27,11 +27,52 @@
 #include "qemu/pmem.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include "hw/boards.h"
 #include "hw/mem/nvdimm.h"
 #include "hw/qdev-properties.h"
 #include "hw/mem/memory-device.h"
 #include "sysemu/hostmem.h"
 
+unsigned long nvdimm_target_nodes[BITS_TO_LONGS(MAX_NODES)];
+int nvdimm_max_target_node;
+
+void nvdimm_pre_plug(NVDIMMDevice *nvdimm, MachineState *machine,
+                     Error **errp)
+{
+    int node;
+
+    node = object_property_get_uint(OBJECT(nvdimm), PC_DIMM_NODE_PROP,
+                                    &error_abort);
+    if (node && (nvdimm->target_node != -1)) {
+        error_setg(errp, "Both property '" PC_DIMM_NODE_PROP
+                   "' and '" NVDIMM_TARGET_NODE_PROP
+                   "' cannot be set!");
+        return;
+    }
+
+    if
+        (nvdimm->target_node != -1) {
+        if (nvdimm->target_node >= MAX_NODES) {
+            error_setg(errp, "NVDIMM property '" NVDIMM_TARGET_NODE_PROP
+                       "' has value %" PRIu32
+                       " which exceeds the max number of numa nodes: %d",
+                       nvdimm->target_node, MAX_NODES);
+            return;
+        }
+        if (nvdimm->target_node >= machine->numa_state->num_nodes) {
+            set_bit(nvdimm->target_node, nvdimm_target_nodes);
+            if (nvdimm->target_node > nvdimm_max_target_node) {
+                nvdimm_max_target_node = nvdimm->target_node;
+            }
+        }
+    } else {
+        /*
+         * If the 'target-node' option is not set,
+         * the value of 'node' is used as the target node.
+         */
+        nvdimm->target_node = node;
+    }
+}
+
 static void nvdimm_get_label_size(Object *obj, Visitor *v, const char *name,
                                   void *opaque, Error **errp)
 {
@@ -96,7 +137,6 @@ static void nvdimm_set_uuid(Object *obj, Visitor *v, const char *name,
     g_free(value);
 }
 
-
 static void nvdimm_init(Object *obj)
 {
     object_property_add(obj, NVDIMM_LABEL_SIZE_PROP, "int",
@@ -229,6 +269,7 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, const void *buf,
 
 static Property nvdimm_properties[] = {
     DEFINE_PROP_BOOL(NVDIMM_UNARMED_PROP, NVDIMMDevice, unarmed, false),
+    DEFINE_PROP_UINT32(NVDIMM_TARGET_NODE_PROP, NVDIMMDevice, target_node, -1),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index bcf62f825c..4f490d0c2a 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -51,6 +51,7 @@ OBJECT_DECLARE_TYPE(NVDIMMDevice, NVDIMMClass, NVDIMM)
 #define NVDIMM_LABEL_SIZE_PROP "label-size"
 #define NVDIMM_UUID_PROP       "uuid"
 #define NVDIMM_UNARMED_PROP    "unarmed"
+#define NVDIMM_TARGET_NODE_PROP "target-node"
 
 struct NVDIMMDevice {
     /* private */
@@ -89,6 +90,14 @@ struct NVDIMMDevice {
      * The PPC64 - spapr requires each nvdimm device have a uuid.
      */
     QemuUUID uuid;
+
+    /*
+     * Support for the volatile use of persistent memory as normal RAM.
+     * This newly added memory can be selected by its unique NUMA node.
+     * This node can be extended to a new node beyond all existing NUMA
+     * nodes.
+     */
+    uint32_t target_node;
 };
 
 struct NVDIMMClass {
@@ -148,14 +157,20 @@ struct NVDIMMState {
 };
 typedef struct NVDIMMState NVDIMMState;
 
+extern unsigned long nvdimm_target_nodes[];
+extern int nvdimm_max_target_node;
+
 void nvdimm_init_acpi_state(NVDIMMState *state, MemoryRegion *io,
                             struct AcpiGenericAddress dsm_io,
                             FWCfgState *fw_cfg, Object *owner);
-void nvdimm_build_srat(GArray *table_data);
+int nvdimm_build_srat(GArray *table_data);
 void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
                        BIOSLinker *linker, NVDIMMState *state,
                        uint32_t ram_slots, const char *oem_id,
                        const char *oem_table_id);
 void nvdimm_plug(NVDIMMState *state);
 void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
+int nvdimm_check_target_nodes(void);
+void nvdimm_pre_plug(NVDIMMDevice *dimm, MachineState *machine,
+                     Error **errp);
 #endif
diff --git a/util/nvdimm-utils.c b/util/nvdimm-utils.c
index aa3d199f2d..767f1e4787 100644
--- a/util/nvdimm-utils.c
+++ b/util/nvdimm-utils.c
@@ -1,5 +1,7 @@
 #include "qemu/osdep.h"
 #include "qemu/nvdimm-utils.h"
+#include "qapi/error.h"
+#include "hw/boards.h"
 #include "hw/mem/nvdimm.h"
 
 static int nvdimm_device_list(Object *obj, void *opaque)
@@ -28,3 +30,23 @@ GSList *nvdimm_get_device_list(void)
     object_child_foreach(qdev_get_machine(), nvdimm_device_list, &list);
     return list;
 }
+
+int nvdimm_check_target_nodes(void)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    int node;
+
+    if (!nvdimm_max_target_node) {
+        return -1;
+    }
+
+    for (node = nb_numa_nodes; node <= nvdimm_max_target_node; node++) {
+        if (!test_bit(node, nvdimm_target_nodes)) {
+            error_report("nvdimm target-node: Node ID missing: %d", node);
+            exit(1);
+        }
+    }
+
+    return nvdimm_max_target_node;
+}
-- 
2.21.3
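The contiguity rule that nvdimm_check_target_nodes() enforces — every
node ID between the last static NUMA node and the highest requested
target node must be covered — can be sketched as a standalone model.
The bitmap helpers and function shapes below are simplified stand-ins,
not the QEMU implementation, so treat this as an illustration of the
rule rather than the real code:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Simplified stand-ins for QEMU's MAX_NODES and bitmap helpers. */
#define MAX_NODES 128
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static unsigned long target_nodes[BITS_TO_LONGS(MAX_NODES)];
static int max_target_node;

static void set_node(int n)
{
    target_nodes[n / BITS_PER_LONG] |= 1UL << (n % BITS_PER_LONG);
}

static bool test_node(int n)
{
    return target_nodes[n / BITS_PER_LONG] & (1UL << (n % BITS_PER_LONG));
}

/* Like nvdimm_pre_plug(): remember target nodes beyond the static ones. */
static void pre_plug(int target_node, int num_nodes)
{
    if (target_node >= num_nodes) {
        set_node(target_node);
        if (target_node > max_target_node) {
            max_target_node = target_node;
        }
    }
}

/*
 * Like nvdimm_check_target_nodes(): dynamically created nodes must be
 * contiguous, starting right after the last static NUMA node. Returns
 * the highest target node, -1 if none were requested, or -2 on a gap
 * (where the real code reports an error and exits).
 */
static int check_target_nodes(int num_nodes)
{
    int node;

    if (!max_target_node) {
        return -1;
    }
    for (node = num_nodes; node <= max_target_node; node++) {
        if (!test_node(node)) {
            return -2;
        }
    }
    return max_target_node;
}
```

With two static nodes and nvdimms targeting nodes 2 and 3, the check
passes and yields 3; an nvdimm targeting only node 3 would leave node 2
unset and trip the gap detection.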