From: He Chen
To: qemu-devel@nongnu.org
Cc: Peter Maydell, Eduardo Habkost, "Michael S. Tsirkin", Markus Armbruster, Paolo Bonzini, qemu-arm@nongnu.org, Shannon Zhao, Igor Mammedov, Richard Henderson
Date: Wed, 22 Mar 2017 17:32:46 +0800
Message-Id: <1490175166-19785-1-git-send-email-he.chen@linux.intel.com>
Subject: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes

Currently, QEMU does not provide a clear command to set the vNUMA distance for a guest, although we already have the `-numa` option to create vNUMA nodes. vNUMA distance matters in certain scenarios. Today, if we create a guest with 4 vNUMA nodes and check the NUMA info via `numactl -H`, we will see:

node distance:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

When there is no SLIT table (QEMU does not build one), the guest kernel regards every local node as distance 10 and every remote node as distance 20. That looks a little strange once you have seen the distances on an actual physical machine that contains 4 NUMA nodes. My machine shows:

node distance:
node   0   1   2   3
  0:  10  21  31  41
  1:  21  10  21  31
  2:  31  21  10  21
  3:  41  31  21  10

To set vNUMA distances, the guest needs to see a complete SLIT table.
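For readers unfamiliar with the table layout: a SLIT (ACPI spec 5.2.17) is a standard 36-byte ACPI table header, an 8-byte locality count, then the N x N distance matrix as raw bytes. The following is a minimal Python sketch of that layout, not QEMU's code (the patch's `build_slit()` does the equivalent in C via `build_append_int_noprefix()`); the OEM and creator strings are placeholders.

```python
import struct

def build_slit(distance):
    """Sketch of an ACPI SLIT: 36-byte ACPI table header, an 8-byte
    locality count, then the N x N distance matrix, one byte per entry."""
    n = len(distance)
    body = struct.pack("<Q", n)                       # Number of System Localities
    body += bytes(d for row in distance for d in row) # matrix, row-major
    length = 36 + len(body)
    # Header packed with a zero checksum; patched below so all bytes sum to 0 mod 256.
    header = struct.pack("<4sIBB6s8sI4sI",
                         b"SLIT", length, 1, 0,
                         b"BOCHS ", b"BXPCSLIT", 1, b"QEMU", 1)
    table = bytearray(header + body)
    table[9] = (256 - sum(table)) % 256  # checksum field lives at offset 9
    return bytes(table)

# The 4-node default case shown above: 10 on the diagonal, 20 elsewhere.
matrix = [[10 if i == j else 20 for j in range(4)] for i in range(4)]
slit = build_slit(matrix)
print(len(slit))  # 60 bytes: 36 (header) + 8 (count) + 16 (4x4 matrix)
```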
I found that QEMU provides the `-acpitable` option, which allows users to add an ACPI table to the guest, but it requires users to build the ACPI table themselves first. Using `-acpitable` to add a SLIT table is neither straightforward nor flexible: whenever the vNUMA configuration changes, another SLIT table has to be generated manually. That is not friendly to users or to upper-layer software like libvirt.

This patch adds SLIT table support to QEMU and provides an additional `dist` option for the `-numa` command so that users can set vNUMA distances on the QEMU command line. With this patch, when a user wants to create a guest that contains several vNUMA nodes and also wants to set the distances among those nodes, the QEMU command line would look like:

```
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
-numa node,nodeid=0,cpus=0,memdev=node0 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
-numa node,nodeid=1,cpus=1,memdev=node1 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
-numa node,nodeid=2,cpus=2,memdev=node2 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
-numa node,nodeid=3,cpus=3,memdev=node3 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=31 \
-numa dist,src=0,dst=3,val=41 \
-numa dist,src=1,dst=0,val=21 \
...
```

Signed-off-by: He Chen
---
 hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
 hw/arm/virt-acpi-build.c    |  2 ++
 hw/i386/acpi-build.c        |  2 ++
 include/hw/acpi/aml-build.h |  1 +
 include/sysemu/numa.h       |  1 +
 include/sysemu/sysemu.h     |  4 ++++
 numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
 qapi-schema.json            | 30 ++++++++++++++++++++++++++---
 qemu-options.hx             | 12 +++++++++++-
 9 files changed, 121 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index c6f2032..410b30e 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -24,6 +24,7 @@
 #include "hw/acpi/aml-build.h"
 #include "qemu/bswap.h"
 #include "qemu/bitops.h"
+#include "sysemu/numa.h"
 
 static GArray *build_alloc_array(void)
 {
@@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
     numamem->base_addr = cpu_to_le64(base);
     numamem->range_length = cpu_to_le64(len);
 }
+
+/*
+ * ACPI spec 5.2.17 System Locality Distance Information Table
+ * (Revision 2.0 or later)
+ */
+void build_slit(GArray *table_data, BIOSLinker *linker)
+{
+    int slit_start, i, j;
+    slit_start = table_data->len;
+
+    acpi_data_push(table_data, sizeof(AcpiTableHeader));
+
+    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
+    for (i = 0; i < nb_numa_nodes; i++) {
+        for (j = 0; j < nb_numa_nodes; j++) {
+            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
+        }
+    }
+
+    build_header(linker, table_data,
+                 (void *)(table_data->data + slit_start),
+                 "SLIT",
+                 table_data->len - slit_start, 1, NULL, NULL);
+}
+
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 0835e59..d9e6828 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -781,6 +781,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
     if (nb_numa_nodes > 0) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, vms);
+        acpi_add_table(table_offsets, tables_blob);
+        build_slit(tables_blob, tables->linker);
     }
 
     if (its_class_name() && !vmc->no_its) {
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 2073108..12730ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
     if (pcms->numa_nodes) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, machine);
+        acpi_add_table(table_offsets, tables_blob);
+        build_slit(tables_blob, tables->linker);
     }
     if (acpi_get_mcfg(&mcfg)) {
         acpi_add_table(table_offsets, tables_blob);
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 00c21f1..329a0d0 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
 void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
                        uint64_t len, int node, MemoryAffinityFlags flags);
 
+void build_slit(GArray *table_data, BIOSLinker *linker);
 #endif
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 8f09dcf..2f7a941 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -21,6 +21,7 @@ typedef struct node_info {
     struct HostMemoryBackend *node_memdev;
     bool present;
     QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
+    uint8_t distance[MAX_NODES];
 } NodeInfo;
 
 extern NodeInfo numa_info[MAX_NODES];
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 576c7ce..a4e328d 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -169,6 +169,10 @@ extern int mem_prealloc;
 
 #define MAX_NODES 128
 #define NUMA_NODE_UNASSIGNED MAX_NODES
+#define MIN_NUMA_DISTANCE 10
+#define DEF_NUMA_DISTANCE 20
+#define MAX_NUMA_DISTANCE 254
+#define NUMA_DISTANCE_UNREACHABLE 255
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/numa.c b/numa.c
index e01cb54..425a320 100644
--- a/numa.c
+++ b/numa.c
@@ -212,6 +212,28 @@ static void
numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
 }
 
+static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
+{
+    uint64_t src = dist->src;
+    uint64_t dst = dist->dst;
+    uint8_t val = dist->val;
+
+    if (src >= MAX_NODES || dst >= MAX_NODES) {
+        error_setg(errp, "Max number of NUMA nodes reached: %"
+                   PRIu64 "", src > dst ? src : dst);
+        return;
+    }
+
+    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {
+        error_setg(errp,
+                   "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
+                   dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
+        return;
+    }
+
+    numa_info[src].distance[dst] = val;
+}
+
 static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
 {
     NumaOptions *object = NULL;
@@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
         }
         nb_numa_nodes++;
         break;
+    case NUMA_OPTIONS_TYPE_DIST:
+        numa_distance_parse(&object->u.dist, opts, &err);
+        if (err) {
+            goto end;
+        }
+        break;
     default:
         abort();
     }
@@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
     g_free(seen_cpus);
 }
 
+static void default_numa_distance(void)
+{
+    int src, dst;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
+                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
+            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
+                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
+                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
+                else
+                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
+            }
+        }
+    }
+}
+
 void parse_numa_opts(MachineClass *mc)
 {
     int i;
@@ -390,6 +436,7 @@ void parse_numa_opts(MachineClass *mc)
         }
 
         validate_numa_cpus();
+        default_numa_distance();
     } else {
         numa_set_mem_node_id(0, ram_size, 0);
     }
diff --git a/qapi-schema.json b/qapi-schema.json
index 32b4a4b..21ad94a 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -5644,15 +5644,19 @@
 ##
 # @NumaOptionsType:
 #
+# @node: NUMA nodes configuration
+#
+# @dist: NUMA distance configuration
+#
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node' ] }
+  'data': [ 'node', 'dist' ] }
 
 ##
 # @NumaOptions:
 #
-# A discriminated record of NUMA options. (for OptsVisitor)
+# A discriminated record of NUMA options.
 #
 # Since: 2.1
 ##
@@ -5660,7 +5664,8 @@
   'base': { 'type': 'NumaOptionsType' },
   'discriminator': 'type',
   'data': {
-    'node': 'NumaNodeOptions' }}
+    'node': 'NumaNodeOptions',
+    'dist': 'NumaDistOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -5689,6 +5694,25 @@
     '*memdev': 'str' }}
 
 ##
+# @NumaDistOptions:
+#
+# Set distance between 2 NUMA nodes. (for OptsVisitor)
+#
+# @src: source NUMA node.
+#
+# @dst: destination NUMA node.
+#
+# @val: NUMA distance from source node to destination node.
+#
+# Since: 2.10
+##
+{ 'struct': 'NumaDistOptions',
+  'data': {
+    'src': 'uint64',
+    'dst': 'uint64',
+    'val': 'uint8' }}
+
+##
 # @HostMemPolicy:
 #
 # Host memory policy types
diff --git a/qemu-options.hx b/qemu-options.hx
index 8dd8ee3..43c3950 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -139,12 +139,15 @@ ETEXI
 
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
-    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
+    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
+    "-numa dist,src=source,dst=destination,val=distance\n", QEMU_ARCH_ALL)
 STEXI
 @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
 @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
+@itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
 @findex -numa
 Define a NUMA node and assign RAM and VCPUs to it.
+Set the NUMA distance from a source node to a destination node.
 
 @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
 @samp{cpus} option represent a contiguous range of CPU indexes
@@ -167,6 +170,13 @@ split equally between them.
 @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore, if one
 node uses @samp{memdev}, all of them have to use it.
 
+@var{source} and @var{destination} are NUMA node IDs.
+@var{distance} is the NUMA distance from @var{source} to @var{destination}.
+The distance from node A to node B may be different from the distance from
+node B to node A, since the distance can be asymmetric.
+If the distance is not set, the default distance for a local NUMA node is 10,
+and 20 for a remote node.
+
 Note that the -@option{numa} option doesn't allocate any of the
 specified resources, it just assigns existing resources to NUMA nodes. This
 means that one still has to use the @option{-m},
-- 
2.7.4
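For reviewers, the defaulting rule that the patch's `default_numa_distance()` implements can be sketched in Python (a hypothetical helper, not part of QEMU; it treats every pair given on the command line as set, which matches the C code for valid distances above 10):

```python
MIN_NUMA_DISTANCE = 10   # local node, and the smallest valid distance
DEF_NUMA_DISTANCE = 20   # fallback for an unset remote pair

def apply_default_distances(given, n):
    """Fill an n x n distance matrix the way default_numa_distance() does:
    10 on the diagonal, the symmetric counterpart when only the opposite
    direction was given on the command line, else 20."""
    m = [[0] * n for _ in range(n)]
    for s in range(n):
        for d in range(n):
            if s == d:
                m[s][d] = MIN_NUMA_DISTANCE
            elif (s, d) in given:
                m[s][d] = given[(s, d)]
            elif (d, s) in given:
                m[s][d] = given[(d, s)]   # mirror the opposite direction
            else:
                m[s][d] = DEF_NUMA_DISTANCE
    return m

# Setting only 0->1 to 21: the reverse direction 1->0 is mirrored to 21
# rather than defaulting to 20.
print(apply_default_distances({(0, 1): 21}, 2))  # [[10, 21], [21, 10]]
```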