From: He Chen
To: qemu-devel@nongnu.org
Date: Fri, 10 Mar 2017 18:18:17 +0800
Message-Id: <1489141097-28587-1-git-send-email-he.chen@linux.intel.com>
X-Mailer: git-send-email 2.7.4
Subject: [Qemu-devel] [PATCH] x86: Allow to set NUMA distance for different NUMA nodes
Cc: Eduardo Habkost, "Michael S. Tsirkin", He Chen, Markus Armbruster,
    Paolo Bonzini, Igor Mammedov, Richard Henderson

Currently, QEMU does not provide a clear command to set the vNUMA distance
for a guest, although we already have the `-numa` option to define vNUMA
nodes. vNUMA distance matters in certain scenarios: if we create a guest
with 4 vNUMA nodes and check its topology via `numactl -H`, we will see:

node distance:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

When there is no SLIT table (QEMU does not build one), the guest kernel
regards every local node as distance 10 and every remote node as distance
20. This looks a little strange once you have seen the distances on an
actual physical machine with 4 NUMA nodes. My machine shows:

node distance:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  21  31
  2:  31  21  10  21
  3:  41  31  21  10

This patch adds SLIT table support to QEMU and provides the additional
`dist` option for the `-numa` command, allowing the user to set the vNUMA
distance on the QEMU command line.
With this patch, when a user wants to create a guest that contains several
vNUMA nodes and also wants to set the distances among those nodes, the QEMU
command line would look like:

```
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
-numa node,nodeid=0,cpus=0,memdev=node0 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
-numa node,nodeid=1,cpus=1,memdev=node1 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
-numa node,nodeid=2,cpus=2,memdev=node2 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
-numa node,nodeid=3,cpus=3,memdev=node3 \
-numa dist,a=0,b=1,val=21 \
-numa dist,a=0,b=2,val=31 \
-numa dist,a=0,b=3,val=41 \
-numa dist,a=1,b=0,val=21 \
...
```

Thanks,
-He

Signed-off-by: He Chen <he.chen@linux.intel.com>
---
 hw/i386/acpi-build.c        | 28 ++++++++++++++++++++++++++
 include/hw/acpi/acpi-defs.h |  9 +++++++++
 include/sysemu/numa.h       |  1 +
 numa.c                      | 48 +++++++++++++++++++++++++++++++++++++++++++++
 qapi-schema.json            | 24 +++++++++++++++++++++--
 qemu-options.hx             |  5 ++++-
 6 files changed, 112 insertions(+), 3 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 2073108..7ced37d 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2396,6 +2396,32 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 }
 
 static void
+build_slit(GArray *table_data, BIOSLinker *linker, MachineState *machine)
+{
+    struct AcpiSystemLocalityDistanceTable *slit;
+    uint8_t *entry;
+    int slit_start, slit_data_len, i, j;
+    slit_start = table_data->len;
+
+    slit = acpi_data_push(table_data, sizeof(*slit));
+    slit->nb_localities = nb_numa_nodes;
+
+    slit_data_len = sizeof(uint8_t) * nb_numa_nodes * nb_numa_nodes;
+    entry = acpi_data_push(table_data, slit_data_len);
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        for (j = 0; j < nb_numa_nodes; j++) {
+            entry[i * nb_numa_nodes + j] = numa_info[i].distance[j];
+        }
+    }
+
+    build_header(linker, table_data,
+                 (void *)(table_data->data + slit_start),
+                 "SLIT",
+                 table_data->len - slit_start, 1, NULL, NULL);
+}
+
+static void
 build_mcfg_q35(GArray *table_data, BIOSLinker *linker, AcpiMcfgInfo *info)
 {
     AcpiTableMcfg *mcfg;
@@ -2678,6 +2704,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
     if (pcms->numa_nodes) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, machine);
+        acpi_add_table(table_offsets, tables_blob);
+        build_slit(tables_blob, tables->linker, machine);
     }
     if (acpi_get_mcfg(&mcfg)) {
         acpi_add_table(table_offsets, tables_blob);
diff --git a/include/hw/acpi/acpi-defs.h b/include/hw/acpi/acpi-defs.h
index 4cc3630..b183a8f 100644
--- a/include/hw/acpi/acpi-defs.h
+++ b/include/hw/acpi/acpi-defs.h
@@ -527,6 +527,15 @@ struct AcpiSratProcessorGiccAffinity
 
 typedef struct AcpiSratProcessorGiccAffinity AcpiSratProcessorGiccAffinity;
 
+/*
+ * SLIT (NUMA distance description) table
+ */
+struct AcpiSystemLocalityDistanceTable {
+    ACPI_TABLE_HEADER_DEF
+    uint64_t    nb_localities;
+} QEMU_PACKED;
+typedef struct AcpiSystemLocalityDistanceTable AcpiSystemLocalityDistanceTable;
+
 /* PCI fw r3.0 MCFG table. */
 /* Subtable */
 struct AcpiMcfgAllocation {
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 8f09dcf..2f7a941 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -21,6 +21,7 @@ typedef struct node_info {
     struct HostMemoryBackend *node_memdev;
     bool present;
     QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
+    uint8_t distance[MAX_NODES];
 } NodeInfo;
 
 extern NodeInfo numa_info[MAX_NODES];
diff --git a/numa.c b/numa.c
index e01cb54..897657a 100644
--- a/numa.c
+++ b/numa.c
@@ -50,6 +50,9 @@ static int have_memdevs = -1;
 static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
                             * For all nodes, nodeid < max_numa_nodeid */
+static int min_numa_distance = 10;
+static int def_numa_distance = 20;
+static int max_numa_distance = 255;
 int nb_numa_nodes;
 NodeInfo numa_info[MAX_NODES];
 
@@ -208,10 +211,33 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
         numa_info[nodenr].node_mem = object_property_get_int(o, "size", NULL);
         numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
     }
+
     numa_info[nodenr].present = true;
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
 }
 
+static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
+{
+    uint8_t a = dist->a;
+    uint8_t b = dist->b;
+    uint8_t val = dist->val;
+
+    if (a >= MAX_NODES || b >= MAX_NODES) {
+        error_setg(errp, "NUMA node index out of range: %" PRIu8,
+                   a > b ? a : b);
+        return;
+    }
+
+    if (val < min_numa_distance || val > max_numa_distance) {
+        error_setg(errp,
+                   "NUMA distance (%" PRIu8 ") out of range (%d)-(%d)",
+                   val, min_numa_distance, max_numa_distance);
+        return;
+    }
+
+    numa_info[a].distance[b] = val;
+}
+
 static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
 {
     NumaOptions *object = NULL;
@@ -235,6 +261,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
         }
         nb_numa_nodes++;
         break;
+    case NUMA_OPTIONS_TYPE_DIST:
+        numa_distance_parse(&object->u.dist, opts, &err);
+        if (err) {
+            goto end;
+        }
+        break;
     default:
         abort();
     }
@@ -294,6 +326,21 @@ static void validate_numa_cpus(void)
     g_free(seen_cpus);
 }
 
+static void default_numa_distance(void)
+{
+    int i, j;
+
+    for (i = 0; i < nb_numa_nodes; i++) {
+        for (j = 0; j < nb_numa_nodes; j++) {
+            if (i == j && numa_info[i].distance[j] != min_numa_distance) {
+                numa_info[i].distance[j] = min_numa_distance;
+            } else if (numa_info[i].distance[j] < min_numa_distance) {
+                numa_info[i].distance[j] = def_numa_distance;
+            }
+        }
+    }
+}
+
 void parse_numa_opts(MachineClass *mc)
 {
     int i;
@@ -390,6 +437,7 @@ void parse_numa_opts(MachineClass *mc)
         }
 
         validate_numa_cpus();
+        default_numa_distance();
     } else {
         numa_set_mem_node_id(0, ram_size, 0);
     }
diff --git a/qapi-schema.json b/qapi-schema.json
index 32b4a4b..2988304 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -5647,7 +5647,7 @@
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node' ] }
+  'data': [ 'node', 'dist' ] }
 
 ##
 # @NumaOptions:
@@ -5660,7 +5660,8 @@
   'base': { 'type': 'NumaOptionsType' },
   'discriminator': 'type',
   'data': {
-    'node': 'NumaNodeOptions' }}
+    'node': 'NumaNodeOptions',
+    'dist': 'NumaDistOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -5689,6 +5690,25 @@
     '*memdev': 'str' }}
 
 ##
+# @NumaDistOptions:
+#
+# Set distance between 2 NUMA nodes. (for OptsVisitor)
+#
+# @a: first NUMA node.
+#
+# @b: second NUMA node.
+#
+# @val: NUMA distance between 2 given NUMA nodes.
+#
+# Since: 2.9
+##
+{ 'struct': 'NumaDistOptions',
+  'data': {
+   'a': 'uint8',
+   'b': 'uint8',
+   'val': 'uint8' }}
+
+##
 # @HostMemPolicy:
 #
 # Host memory policy types
diff --git a/qemu-options.hx b/qemu-options.hx
index 8dd8ee3..0de5cf8 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -139,12 +139,15 @@ ETEXI
 
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
-    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
+    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
+    "-numa dist,a=firstnode,b=secondnode,val=distance\n", QEMU_ARCH_ALL)
 STEXI
 @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
 @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
+@itemx -numa dist,a=@var{firstnode},b=@var{secondnode},val=@var{distance}
 @findex -numa
 Define a NUMA node and assign RAM and VCPUs to it.
+Set NUMA distance between 2 NUMA nodes.
 
 @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
 @samp{cpus} option represent a contiguous range of CPU indexes
-- 
2.7.4