[Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes

He Chen posted 1 patch 7 years ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1490175166-19785-1-git-send-email-he.chen@linux.intel.com
Test checkpatch passed
Test docker passed
Test s390x passed
There is a newer version of this series
hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
hw/arm/virt-acpi-build.c    |  2 ++
hw/i386/acpi-build.c        |  2 ++
include/hw/acpi/aml-build.h |  1 +
include/sysemu/numa.h       |  1 +
include/sysemu/sysemu.h     |  4 ++++
numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
qapi-schema.json            | 30 ++++++++++++++++++++++++++---
qemu-options.hx             | 12 +++++++++++-
9 files changed, 121 insertions(+), 4 deletions(-)
[Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by He Chen 7 years ago
Currently, QEMU does not provide a clear command to set vNUMA distance
for a guest, although we already have the `-numa` command to set vNUMA
nodes.

vNUMA distance makes sense in certain scenarios.
Today, if we create a guest that has 4 vNUMA nodes and check the NUMA
info via `numactl -H`, we will see:

node distance:
node    0    1    2    3
  0:   10   20   20   20
  1:   20   10   20   20
  2:   20   20   10   20
  3:   20   20   20   10

The guest kernel regards all local nodes as distance 10 and all remote
nodes as distance 20 when there is no SLIT table, since QEMU doesn't
build one. This looks a little strange once you have seen the distances
on an actual physical machine that contains 4 NUMA nodes. My machine
shows:

node distance:
node    0    1    2    3
  0:   10   21   31   41
  1:   21   10   21   31
  2:   31   21   10   21
  3:   41   31   21   10

To set vNUMA distance, the guest should see a complete SLIT table.
QEMU already provides the `-acpitable` command that allows users to add
an ACPI table to the guest, but it requires users to build the ACPI
table themselves first. Using `-acpitable` to add a SLIT table is
neither straightforward nor flexible: whenever the vNUMA configuration
changes, another SLIT table has to be generated manually. That is not
friendly to users or to upper-layer software like libvirt.

This patch adds SLIT table support to QEMU and provides an additional
option, `dist`, for the `-numa` command that allows users to set the
vNUMA distance on the QEMU command line.
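
(For reference: the SLIT consists of the standard 36-byte ACPI table
header, an 8-byte locality count, and an N x N matrix of one-byte
distances, so a 4-node guest gets a 36 + 8 + 16 = 60 byte table.)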

With this patch, when a user wants to create a guest that contains
several vNUMA nodes and also wants to set the distances among those
nodes, the QEMU command line would look like:

```
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
-numa node,nodeid=0,cpus=0,memdev=node0 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
-numa node,nodeid=1,cpus=1,memdev=node1 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
-numa node,nodeid=2,cpus=2,memdev=node2 \
-object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
-numa node,nodeid=3,cpus=3,memdev=node3 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=31 \
-numa dist,src=0,dst=3,val=41 \
-numa dist,src=1,dst=0,val=21 \
...
```
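
Assuming the elided `-numa dist` options above fill in the rest of the
matrix (mirroring the physical machine shown earlier), `numactl -H` in
the guest should then report:

node distance:
node    0    1    2    3
  0:   10   21   31   41
  1:   21   10   21   31
  2:   31   21   10   21
  3:   41   31   21   10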

Signed-off-by: He Chen <he.chen@linux.intel.com>

fix
---
 hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
 hw/arm/virt-acpi-build.c    |  2 ++
 hw/i386/acpi-build.c        |  2 ++
 include/hw/acpi/aml-build.h |  1 +
 include/sysemu/numa.h       |  1 +
 include/sysemu/sysemu.h     |  4 ++++
 numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
 qapi-schema.json            | 30 ++++++++++++++++++++++++++---
 qemu-options.hx             | 12 +++++++++++-
 9 files changed, 121 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index c6f2032..410b30e 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -24,6 +24,7 @@
 #include "hw/acpi/aml-build.h"
 #include "qemu/bswap.h"
 #include "qemu/bitops.h"
+#include "sysemu/numa.h"
 
 static GArray *build_alloc_array(void)
 {
@@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
     numamem->base_addr = cpu_to_le64(base);
     numamem->range_length = cpu_to_le64(len);
 }
+
+/*
+ * ACPI spec 5.2.17 System Locality Distance Information Table
+ * (Revision 2.0 or later)
+ */
+void build_slit(GArray *table_data, BIOSLinker *linker)
+{
+    int slit_start, i, j;
+    slit_start = table_data->len;
+
+    acpi_data_push(table_data, sizeof(AcpiTableHeader));
+
+    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
+    for (i = 0; i < nb_numa_nodes; i++) {
+        for (j = 0; j < nb_numa_nodes; j++) {
+            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
+        }
+    }
+
+    build_header(linker, table_data,
+                 (void *)(table_data->data + slit_start),
+                 "SLIT",
+                 table_data->len - slit_start, 1, NULL, NULL);
+}
+
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 0835e59..d9e6828 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -781,6 +781,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
     if (nb_numa_nodes > 0) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, vms);
+        acpi_add_table(table_offsets, tables_blob);
+        build_slit(tables_blob, tables->linker);
     }
 
     if (its_class_name() && !vmc->no_its) {
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 2073108..12730ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
     if (pcms->numa_nodes) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, machine);
+        acpi_add_table(table_offsets, tables_blob);
+        build_slit(tables_blob, tables->linker);
     }
     if (acpi_get_mcfg(&mcfg)) {
         acpi_add_table(table_offsets, tables_blob);
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 00c21f1..329a0d0 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
 void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
                        uint64_t len, int node, MemoryAffinityFlags flags);
 
+void build_slit(GArray *table_data, BIOSLinker *linker);
 #endif
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 8f09dcf..2f7a941 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -21,6 +21,7 @@ typedef struct node_info {
     struct HostMemoryBackend *node_memdev;
     bool present;
     QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
+    uint8_t distance[MAX_NODES];
 } NodeInfo;
 
 extern NodeInfo numa_info[MAX_NODES];
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 576c7ce..a4e328d 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -169,6 +169,10 @@ extern int mem_prealloc;
 
 #define MAX_NODES 128
 #define NUMA_NODE_UNASSIGNED MAX_NODES
+#define MIN_NUMA_DISTANCE 10
+#define DEF_NUMA_DISTANCE 20
+#define MAX_NUMA_DISTANCE 254
+#define NUMA_DISTANCE_UNREACHABLE 255
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/numa.c b/numa.c
index e01cb54..425a320 100644
--- a/numa.c
+++ b/numa.c
@@ -212,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
 }
 
+static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
+{
+    uint64_t src = dist->src;
+    uint64_t dst = dist->dst;
+    uint8_t val = dist->val;
+
+    if (src >= MAX_NODES || dst >= MAX_NODES) {
+        error_setg(errp, "Max number of NUMA nodes reached: %"
+                   PRIu64 "", src > dst ? src : dst);
+        return;
+    }
+
+    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {
+        error_setg(errp,
+                "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
+                dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
+        return;
+    }
+
+    numa_info[src].distance[dst] = val;
+}
+
 static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
 {
     NumaOptions *object = NULL;
@@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
         }
         nb_numa_nodes++;
         break;
+    case NUMA_OPTIONS_TYPE_DIST:
+        numa_distance_parse(&object->u.dist, opts, &err);
+        if (err) {
+            goto end;
+        }
+        break;
     default:
         abort();
     }
@@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
     g_free(seen_cpus);
 }
 
+static void default_numa_distance(void)
+{
+    int src, dst;
+
+    for (src = 0; src < nb_numa_nodes; src++) {
+        for (dst = 0; dst < nb_numa_nodes; dst++) {
+            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
+                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
+            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
+                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
+                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
+                else
+                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
+            }
+        }
+    }
+}
+
 void parse_numa_opts(MachineClass *mc)
 {
     int i;
@@ -390,6 +436,7 @@ void parse_numa_opts(MachineClass *mc)
         }
 
         validate_numa_cpus();
+        default_numa_distance();
     } else {
         numa_set_mem_node_id(0, ram_size, 0);
     }
diff --git a/qapi-schema.json b/qapi-schema.json
index 32b4a4b..21ad94a 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -5644,15 +5644,19 @@
 ##
 # @NumaOptionsType:
 #
+# @node: NUMA nodes configuration
+#
+# @dist: NUMA distance configuration
+#
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node' ] }
+  'data': [ 'node', 'dist' ] }
 
 ##
 # @NumaOptions:
 #
-# A discriminated record of NUMA options. (for OptsVisitor)
+# A discriminated record of NUMA options.
 #
 # Since: 2.1
 ##
@@ -5660,7 +5664,8 @@
   'base': { 'type': 'NumaOptionsType' },
   'discriminator': 'type',
   'data': {
-    'node': 'NumaNodeOptions' }}
+    'node': 'NumaNodeOptions',
+    'dist': 'NumaDistOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -5689,6 +5694,25 @@
    '*memdev': 'str' }}
 
 ##
+# @NumaDistOptions:
+#
+# Set distance between 2 NUMA nodes. (for OptsVisitor)
+#
+# @src: source NUMA node.
+#
+# @dst: destination NUMA node.
+#
+# @val: NUMA distance from source node to destination node.
+#
+# Since: 2.10
+##
+{ 'struct': 'NumaDistOptions',
+  'data': {
+   'src': 'uint64',
+   'dst': 'uint64',
+   'val': 'uint8' }}
+
+##
 # @HostMemPolicy:
 #
 # Host memory policy types
diff --git a/qemu-options.hx b/qemu-options.hx
index 8dd8ee3..43c3950 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -139,12 +139,15 @@ ETEXI
 
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
-    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
+    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
+    "-numa dist,src=source,dst=destination,val=distance\n", QEMU_ARCH_ALL)
 STEXI
 @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
 @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
+@itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
 @findex -numa
 Define a NUMA node and assign RAM and VCPUs to it.
+Set NUMA distance from source node to destination node.
 
 @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
 @samp{cpus} option represent a contiguous range of CPU indexes
@@ -167,6 +170,13 @@ split equally between them.
 @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
 if one node uses @samp{memdev}, all of them have to use it.
 
+@var{source} and @var{destination} are NUMA node IDs.
+@var{distance} is the NUMA distance from @var{source} to @var{destination}.
+The distance from node A to node B may be different from the distance from
+node B to node A since the distance can to be asymmetry.
+If the distance is not set, the default distance for a local NUMA node is 10,
+and 20 for a remote node.
+
 Note that the -@option{numa} option doesn't allocate any of the
 specified resources, it just assigns existing resources to NUMA
 nodes. This means that one still has to use the @option{-m},
-- 
2.7.4


Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Andrew Jones 7 years ago
You should have CC'ed me, considering this version is addressing my
review comments, but whatever, I usually skim the list so I found it...

On Wed, Mar 22, 2017 at 05:32:46PM +0800, He Chen wrote:
> Currently, QEMU does not provide a clear command to set vNUMA distance
> for a guest, although we already have the `-numa` command to set vNUMA
> nodes.
> 
> vNUMA distance makes sense in certain scenarios.
> Today, if we create a guest that has 4 vNUMA nodes and check the NUMA
> info via `numactl -H`, we will see:
> 
> node distance:
> node    0    1    2    3
>   0:   10   20   20   20
>   1:   20   10   20   20
>   2:   20   20   10   20
>   3:   20   20   20   10
> 
> The guest kernel regards all local nodes as distance 10 and all remote
> nodes as distance 20 when there is no SLIT table, since QEMU doesn't
> build one. This looks a little strange once you have seen the distances
> on an actual physical machine that contains 4 NUMA nodes. My machine
> shows:
> 
> node distance:
> node    0    1    2    3
>   0:   10   21   31   41
>   1:   21   10   21   31
>   2:   31   21   10   21
>   3:   41   31   21   10
> 
> To set vNUMA distance, the guest should see a complete SLIT table.
> QEMU already provides the `-acpitable` command that allows users to add
> an ACPI table to the guest, but it requires users to build the ACPI
> table themselves first. Using `-acpitable` to add a SLIT table is
> neither straightforward nor flexible: whenever the vNUMA configuration
> changes, another SLIT table has to be generated manually. That is not
> friendly to users or to upper-layer software like libvirt.
> 
> This patch adds SLIT table support to QEMU and provides an additional
> option, `dist`, for the `-numa` command that allows users to set the
> vNUMA distance on the QEMU command line.
> 
> With this patch, when a user wants to create a guest that contains
> several vNUMA nodes and also wants to set the distances among those
> nodes, the QEMU command line would look like:
> 
> ```
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
> -numa node,nodeid=0,cpus=0,memdev=node0 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
> -numa node,nodeid=1,cpus=1,memdev=node1 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
> -numa node,nodeid=2,cpus=2,memdev=node2 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
> -numa node,nodeid=3,cpus=3,memdev=node3 \
> -numa dist,src=0,dst=1,val=21 \
> -numa dist,src=0,dst=2,val=31 \
> -numa dist,src=0,dst=3,val=41 \
> -numa dist,src=1,dst=0,val=21 \
> ...
> ```
> 
> Signed-off-by: He Chen <he.chen@linux.intel.com>
> 
> fix

stray 'fix' above the ---

> ---
>  hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
>  hw/arm/virt-acpi-build.c    |  2 ++
>  hw/i386/acpi-build.c        |  2 ++
>  include/hw/acpi/aml-build.h |  1 +
>  include/sysemu/numa.h       |  1 +
>  include/sysemu/sysemu.h     |  4 ++++
>  numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
>  qapi-schema.json            | 30 ++++++++++++++++++++++++++---
>  qemu-options.hx             | 12 +++++++++++-
>  9 files changed, 121 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index c6f2032..410b30e 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -24,6 +24,7 @@
>  #include "hw/acpi/aml-build.h"
>  #include "qemu/bswap.h"
>  #include "qemu/bitops.h"
> +#include "sysemu/numa.h"
>  
>  static GArray *build_alloc_array(void)
>  {
> @@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>      numamem->base_addr = cpu_to_le64(base);
>      numamem->range_length = cpu_to_le64(len);
>  }
> +
> +/*
> + * ACPI spec 5.2.17 System Locality Distance Information Table
> + * (Revision 2.0 or later)
> + */
> +void build_slit(GArray *table_data, BIOSLinker *linker)
> +{
> +    int slit_start, i, j;
> +    slit_start = table_data->len;
> +
> +    acpi_data_push(table_data, sizeof(AcpiTableHeader));
> +
> +    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        for (j = 0; j < nb_numa_nodes; j++) {
> +            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> +        }
> +    }
> +
> +    build_header(linker, table_data,
> +                 (void *)(table_data->data + slit_start),
> +                 "SLIT",
> +                 table_data->len - slit_start, 1, NULL, NULL);
> +}
> +
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 0835e59..d9e6828 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -781,6 +781,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>      if (nb_numa_nodes > 0) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, vms);
> +        acpi_add_table(table_offsets, tables_blob);
> +        build_slit(tables_blob, tables->linker);

I didn't expect you'd actually add the support to the ARM machine
model, but only provide the API. Anyway, thanks for this, assuming
you've tested it and it works.

>      }
>  
>      if (its_class_name() && !vmc->no_its) {
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 2073108..12730ea 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>      if (pcms->numa_nodes) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, machine);
> +        acpi_add_table(table_offsets, tables_blob);
> +        build_slit(tables_blob, tables->linker);
>      }
>      if (acpi_get_mcfg(&mcfg)) {
>          acpi_add_table(table_offsets, tables_blob);
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 00c21f1..329a0d0 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
>  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>                         uint64_t len, int node, MemoryAffinityFlags flags);
>  
> +void build_slit(GArray *table_data, BIOSLinker *linker);
>  #endif
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 8f09dcf..2f7a941 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -21,6 +21,7 @@ typedef struct node_info {
>      struct HostMemoryBackend *node_memdev;
>      bool present;
>      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> +    uint8_t distance[MAX_NODES];
>  } NodeInfo;
>  
>  extern NodeInfo numa_info[MAX_NODES];
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 576c7ce..a4e328d 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -169,6 +169,10 @@ extern int mem_prealloc;
>  
>  #define MAX_NODES 128
>  #define NUMA_NODE_UNASSIGNED MAX_NODES
> +#define MIN_NUMA_DISTANCE 10
> +#define DEF_NUMA_DISTANCE 20
> +#define MAX_NUMA_DISTANCE 254

You ignored my comment about starting with 'NUMA_DISTANCE' without
even replying to that comment stating why you prefer it this way.
That's a bit annoying.

> +#define NUMA_DISTANCE_UNREACHABLE 255

Here you do start with NUMA_DISTANCE, so these names have an
inconsistent format.
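
E.g., a consistent scheme (just a sketch of one possible naming):

    #define NUMA_DISTANCE_MIN         10
    #define NUMA_DISTANCE_DEFAULT     20
    #define NUMA_DISTANCE_MAX         254
    #define NUMA_DISTANCE_UNREACHABLE 255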

>  
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
> diff --git a/numa.c b/numa.c
> index e01cb54..425a320 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -212,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>  }
>  
> +static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
> +{
> +    uint64_t src = dist->src;
> +    uint64_t dst = dist->dst;
> +    uint8_t val = dist->val;
> +
> +    if (src >= MAX_NODES || dst >= MAX_NODES) {
> +        error_setg(errp, "Max number of NUMA nodes reached: %"
> +                   PRIu64 "", src > dst ? src : dst);
> +        return;
> +    }
> +
> +    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {

val > MAX_NUMA_DISTANCE is not invalid, as NUMA_DISTANCE_UNREACHABLE
is 255. Isn't it possible that that could be what the user wants?
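
Something like this would do it (a sketch; it assumes 255 should be
accepted and treated as NUMA_DISTANCE_UNREACHABLE per the spec):

    /* val is uint8_t, so the only value above MAX_NUMA_DISTANCE is
     * 255, i.e. NUMA_DISTANCE_UNREACHABLE -- accept it and only
     * enforce the lower bound. */
    if (val < MIN_NUMA_DISTANCE) {
        error_setg(errp, "NUMA distance (%" PRIu8 ") cannot be less "
                   "than %d", val, MIN_NUMA_DISTANCE);
        return;
    }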

> +        error_setg(errp,
> +                "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
> +                dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
> +        return;
> +    }
> +
> +    numa_info[src].distance[dst] = val;
> +}
> +
>  static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
>  {
>      NumaOptions *object = NULL;
> @@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
>          }
>          nb_numa_nodes++;
>          break;
> +    case NUMA_OPTIONS_TYPE_DIST:
> +        numa_distance_parse(&object->u.dist, opts, &err);
> +        if (err) {
> +            goto end;
> +        }
> +        break;
>      default:
>          abort();
>      }
> @@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
>      g_free(seen_cpus);
>  }
>  
> +static void default_numa_distance(void)
> +{
> +    int src, dst;
> +
> +    for (src = 0; src < nb_numa_nodes; src++) {
> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> +            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
> +                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
> +            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
> +                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
> +                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
> +                else
> +                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
> +            }
> +        }
> +    }
> +}
> +
>  void parse_numa_opts(MachineClass *mc)
>  {
>      int i;
> @@ -390,6 +436,7 @@ void parse_numa_opts(MachineClass *mc)
>          }
>  
>          validate_numa_cpus();
> +        default_numa_distance();
>      } else {
>          numa_set_mem_node_id(0, ram_size, 0);
>      }
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 32b4a4b..21ad94a 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -5644,15 +5644,19 @@
>  ##
>  # @NumaOptionsType:
>  #
> +# @node: NUMA nodes configuration
> +#
> +# @dist: NUMA distance configuration
> +#
>  # Since: 2.1
>  ##
>  { 'enum': 'NumaOptionsType',
> -  'data': [ 'node' ] }
> +  'data': [ 'node', 'dist' ] }
>  
>  ##
>  # @NumaOptions:
>  #
> -# A discriminated record of NUMA options. (for OptsVisitor)
> +# A discriminated record of NUMA options.
>  #
>  # Since: 2.1
>  ##
> @@ -5660,7 +5664,8 @@
>    'base': { 'type': 'NumaOptionsType' },
>    'discriminator': 'type',
>    'data': {
> -    'node': 'NumaNodeOptions' }}
> +    'node': 'NumaNodeOptions',
> +    'dist': 'NumaDistOptions' }}
>  
>  ##
>  # @NumaNodeOptions:
> @@ -5689,6 +5694,25 @@
>     '*memdev': 'str' }}
>  
>  ##
> +# @NumaDistOptions:
> +#
> +# Set distance between 2 NUMA nodes. (for OptsVisitor)

Set the distance

> +#
> +# @src: source NUMA node.
> +#
> +# @dst: destination NUMA node.
> +#
> +# @val: NUMA distance from source node to destination node.
> +#
> +# Since: 2.10
> +##
> +{ 'struct': 'NumaDistOptions',
> +  'data': {
> +   'src': 'uint64',
> +   'dst': 'uint64',
> +   'val': 'uint8' }}
> +
> +##
>  # @HostMemPolicy:
>  #
>  # Host memory policy types
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8dd8ee3..43c3950 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -139,12 +139,15 @@ ETEXI
>  
>  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> +    "-numa dist,src=source,dst=destination,val=distance\n", QEMU_ARCH_ALL)
>  STEXI
>  @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>  @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> +@itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
>  @findex -numa
>  Define a NUMA node and assign RAM and VCPUs to it.
> +Set NUMA distance from source node to destination node.

Set the NUMA distance from a source node to a destination node.

>  
>  @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
>  @samp{cpus} option represent a contiguous range of CPU indexes
> @@ -167,6 +170,13 @@ split equally between them.
>  @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
>  if one node uses @samp{memdev}, all of them have to use it.
>  
> +@var{source} and @var{destination} are NUMA node IDs.
> +@var{distance} is the NUMA distance from @var{source} to @var{destination}.
> +The distance from node A to node B may be different from the distance from
> +node B to node A since the distance can to be asymmetry.

/since/as/
/asymmetry/asymmetrical/

> +If the distance is not set, the default distance for a local NUMA node is 10,

no need for ,

> +and 20 for a remote node.

and the default distance for a remote node is 20.

Also need sentence like: When the distance for only one direction is
given, then the opposite direction defaults to the same distance.

And maybe should also point out how to set an unreachable node with 255.

> +
>  Note that the -@option{numa} option doesn't allocate any of the
>  specified resources, it just assigns existing resources to NUMA
>  nodes. This means that one still has to use the @option{-m},
> -- 
> 2.7.4
> 
> 

drew

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by He Chen 7 years ago
On Wed, Mar 22, 2017 at 11:34:05AM +0100, Andrew Jones wrote:
> 
> You should have CC'ed me, considering this version is addressing my
> review comments, but whatever, I usually skim the list so I found it...
> 
> On Wed, Mar 22, 2017 at 05:32:46PM +0800, He Chen wrote:
> > Currently, QEMU does not provide a clear command to set vNUMA distance
> > for a guest, although we already have the `-numa` command to set vNUMA
> > nodes.
> > 
> > vNUMA distance makes sense in certain scenarios.
> > Today, if we create a guest that has 4 vNUMA nodes and check the NUMA
> > info via `numactl -H`, we will see:
> > 
> > node distance:
> > node    0    1    2    3
> >   0:   10   20   20   20
> >   1:   20   10   20   20
> >   2:   20   20   10   20
> >   3:   20   20   20   10
> > 
> > The guest kernel regards all local nodes as distance 10 and all remote
> > nodes as distance 20 when there is no SLIT table, since QEMU doesn't
> > build one. This looks a little strange once you have seen the distances
> > on an actual physical machine that contains 4 NUMA nodes. My machine
> > shows:
> > 
> > node distance:
> > node    0    1    2    3
> >   0:   10   21   31   41
> >   1:   21   10   21   31
> >   2:   31   21   10   21
> >   3:   41   31   21   10
> > 
> > To set vNUMA distance, the guest should see a complete SLIT table.
> > QEMU already provides the `-acpitable` command that allows users to add
> > an ACPI table to the guest, but it requires users to build the ACPI
> > table themselves first. Using `-acpitable` to add a SLIT table is
> > neither straightforward nor flexible: whenever the vNUMA configuration
> > changes, another SLIT table has to be generated manually. That is not
> > friendly to users or to upper-layer software like libvirt.
> > 
> > This patch adds SLIT table support to QEMU and provides an additional
> > option, `dist`, for the `-numa` command that allows users to set the
> > vNUMA distance on the QEMU command line.
> > 
> > With this patch, when a user wants to create a guest that contains
> > several vNUMA nodes and also wants to set the distances among those
> > nodes, the QEMU command line would look like:
> > 
> > ```
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
> > -numa node,nodeid=0,cpus=0,memdev=node0 \
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
> > -numa node,nodeid=1,cpus=1,memdev=node1 \
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
> > -numa node,nodeid=2,cpus=2,memdev=node2 \
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
> > -numa node,nodeid=3,cpus=3,memdev=node3 \
> > -numa dist,src=0,dst=1,val=21 \
> > -numa dist,src=0,dst=2,val=31 \
> > -numa dist,src=0,dst=3,val=41 \
> > -numa dist,src=1,dst=0,val=21 \
> > ...
> > ```
> > 
> > Signed-off-by: He Chen <he.chen@linux.intel.com>
> > 
> > fix
> 
> stray 'fix' above the ---
> 
> > ---
> >  hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
> >  hw/arm/virt-acpi-build.c    |  2 ++
> >  hw/i386/acpi-build.c        |  2 ++
> >  include/hw/acpi/aml-build.h |  1 +
> >  include/sysemu/numa.h       |  1 +
> >  include/sysemu/sysemu.h     |  4 ++++
> >  numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
> >  qapi-schema.json            | 30 ++++++++++++++++++++++++++---
> >  qemu-options.hx             | 12 +++++++++++-
> >  9 files changed, 121 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > index c6f2032..410b30e 100644
> > --- a/hw/acpi/aml-build.c
> > +++ b/hw/acpi/aml-build.c
> > @@ -24,6 +24,7 @@
> >  #include "hw/acpi/aml-build.h"
> >  #include "qemu/bswap.h"
> >  #include "qemu/bitops.h"
> > +#include "sysemu/numa.h"
> >  
> >  static GArray *build_alloc_array(void)
> >  {
> > @@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> >      numamem->base_addr = cpu_to_le64(base);
> >      numamem->range_length = cpu_to_le64(len);
> >  }
> > +
> > +/*
> > + * ACPI spec 5.2.17 System Locality Distance Information Table
> > + * (Revision 2.0 or later)
> > + */
> > +void build_slit(GArray *table_data, BIOSLinker *linker)
> > +{
> > +    int slit_start, i, j;
> > +    slit_start = table_data->len;
> > +
> > +    acpi_data_push(table_data, sizeof(AcpiTableHeader));
> > +
> > +    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> > +    for (i = 0; i < nb_numa_nodes; i++) {
> > +        for (j = 0; j < nb_numa_nodes; j++) {
> > +            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> > +        }
> > +    }
> > +
> > +    build_header(linker, table_data,
> > +                 (void *)(table_data->data + slit_start),
> > +                 "SLIT",
> > +                 table_data->len - slit_start, 1, NULL, NULL);
> > +}
> > +
> > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> > index 0835e59..d9e6828 100644
> > --- a/hw/arm/virt-acpi-build.c
> > +++ b/hw/arm/virt-acpi-build.c
> > @@ -781,6 +781,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
> >      if (nb_numa_nodes > 0) {
> >          acpi_add_table(table_offsets, tables_blob);
> >          build_srat(tables_blob, tables->linker, vms);
> > +        acpi_add_table(table_offsets, tables_blob);
> > +        build_slit(tables_blob, tables->linker);
> 
> I didn't expect you'd actually add the support to the ARM machine
> model, but only provide the API. Anyway, thanks for this, assuming
> you've tested it and it works.
> 
> >      }

Well... I am sorry, I actually did not test it since I don't have an
ARM platform. I am afraid I will remove it in the next patch.
> >  
> >      if (its_class_name() && !vmc->no_its) {
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index 2073108..12730ea 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
> >      if (pcms->numa_nodes) {
> >          acpi_add_table(table_offsets, tables_blob);
> >          build_srat(tables_blob, tables->linker, machine);
> > +        acpi_add_table(table_offsets, tables_blob);
> > +        build_slit(tables_blob, tables->linker);
> >      }
> >      if (acpi_get_mcfg(&mcfg)) {
> >          acpi_add_table(table_offsets, tables_blob);
> > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > index 00c21f1..329a0d0 100644
> > --- a/include/hw/acpi/aml-build.h
> > +++ b/include/hw/acpi/aml-build.h
> > @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
> >  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> >                         uint64_t len, int node, MemoryAffinityFlags flags);
> >  
> > +void build_slit(GArray *table_data, BIOSLinker *linker);
> >  #endif
> > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> > index 8f09dcf..2f7a941 100644
> > --- a/include/sysemu/numa.h
> > +++ b/include/sysemu/numa.h
> > @@ -21,6 +21,7 @@ typedef struct node_info {
> >      struct HostMemoryBackend *node_memdev;
> >      bool present;
> >      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> > +    uint8_t distance[MAX_NODES];
> >  } NodeInfo;
> >  
> >  extern NodeInfo numa_info[MAX_NODES];
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 576c7ce..a4e328d 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -169,6 +169,10 @@ extern int mem_prealloc;
> >  
> >  #define MAX_NODES 128
> >  #define NUMA_NODE_UNASSIGNED MAX_NODES
> > +#define MIN_NUMA_DISTANCE 10
> > +#define DEF_NUMA_DISTANCE 20
> > +#define MAX_NUMA_DISTANCE 254
> 
> You ignored my comment about starting with 'NUMA_DISTANCE' without
> even replying to that comment stating why you prefer it this way.
> That's a bit annoying.
> 
Apologies for my carelessness. I missed this comment; I will surely
change these names in the next patch.

> > +#define NUMA_DISTANCE_UNREACHABLE 255
> 
> Here you do start with NUMA_DISTANCE, so these names have an
> inconsistent format.
> 
> >  
> >  #define MAX_OPTION_ROMS 16
> >  typedef struct QEMUOptionRom {
> > diff --git a/numa.c b/numa.c
> > index e01cb54..425a320 100644
> > --- a/numa.c
> > +++ b/numa.c
> > @@ -212,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
> >      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> >  }
> >  
> > +static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
> > +{
> > +    uint64_t src = dist->src;
> > +    uint64_t dst = dist->dst;
> > +    uint8_t val = dist->val;
> > +
> > +    if (src >= MAX_NODES || dst >= MAX_NODES) {
> > +        error_setg(errp, "Max number of NUMA nodes reached: %"
> > +                   PRIu64 "", src > dst ? src : dst);
> > +        return;
> > +    }
> > +
> > +    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {
> 
> val > MAX_NUMA_DISTANCE is not invalid, as NUMA_DISTANCE_UNREACHABLE
> is 255. Isn't it possible that that could be what the user wants?
> 
Yes, you're right.

> > +        error_setg(errp,
> > +                "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
> > +                dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
> > +        return;
> > +    }
> > +
> > +    numa_info[src].distance[dst] = val;
> > +}
> > +
> >  static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >  {
> >      NumaOptions *object = NULL;
> > @@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >          }
> >          nb_numa_nodes++;
> >          break;
> > +    case NUMA_OPTIONS_TYPE_DIST:
> > +        numa_distance_parse(&object->u.dist, opts, &err);
> > +        if (err) {
> > +            goto end;
> > +        }
> > +        break;
> >      default:
> >          abort();
> >      }
> > @@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
> >      g_free(seen_cpus);
> >  }
> >  
> > +static void default_numa_distance(void)
> > +{
> > +    int src, dst;
> > +
> > +    for (src = 0; src < nb_numa_nodes; src++) {
> > +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> > +            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
> > +                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
> > +            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
> > +                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
> > +                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
> > +                else
> > +                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
> > +            }
> > +        }
> > +    }
> > +}
> > +
> >  void parse_numa_opts(MachineClass *mc)
> >  {
> >      int i;
> > @@ -390,6 +436,7 @@ void parse_numa_opts(MachineClass *mc)
> >          }
> >  
> >          validate_numa_cpus();
> > +        default_numa_distance();
> >      } else {
> >          numa_set_mem_node_id(0, ram_size, 0);
> >      }
> > diff --git a/qapi-schema.json b/qapi-schema.json
> > index 32b4a4b..21ad94a 100644
> > --- a/qapi-schema.json
> > +++ b/qapi-schema.json
> > @@ -5644,15 +5644,19 @@
> >  ##
> >  # @NumaOptionsType:
> >  #
> > +# @node: NUMA nodes configuration
> > +#
> > +# @dist: NUMA distance configuration
> > +#
> >  # Since: 2.1
> >  ##
> >  { 'enum': 'NumaOptionsType',
> > -  'data': [ 'node' ] }
> > +  'data': [ 'node', 'dist' ] }
> >  
> >  ##
> >  # @NumaOptions:
> >  #
> > -# A discriminated record of NUMA options. (for OptsVisitor)
> > +# A discriminated record of NUMA options.
> >  #
> >  # Since: 2.1
> >  ##
> > @@ -5660,7 +5664,8 @@
> >    'base': { 'type': 'NumaOptionsType' },
> >    'discriminator': 'type',
> >    'data': {
> > -    'node': 'NumaNodeOptions' }}
> > +    'node': 'NumaNodeOptions',
> > +    'dist': 'NumaDistOptions' }}
> >  
> >  ##
> >  # @NumaNodeOptions:
> > @@ -5689,6 +5694,25 @@
> >     '*memdev': 'str' }}
> >  
> >  ##
> > +# @NumaDistOptions:
> > +#
> > +# Set distance between 2 NUMA nodes. (for OptsVisitor)
> 
> Set the distance
> 
> > +#
> > +# @src: source NUMA node.
> > +#
> > +# @dst: destination NUMA node.
> > +#
> > +# @val: NUMA distance from source node to destination node.
> > +#
> > +# Since: 2.10
> > +##
> > +{ 'struct': 'NumaDistOptions',
> > +  'data': {
> > +   'src': 'uint64',
> > +   'dst': 'uint64',
> > +   'val': 'uint8' }}
> > +
> > +##
> >  # @HostMemPolicy:
> >  #
> >  # Host memory policy types
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 8dd8ee3..43c3950 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -139,12 +139,15 @@ ETEXI
> >  
> >  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> >      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> > -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
> > +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> > +    "-numa dist,src=source,dst=destination,val=distance\n", QEMU_ARCH_ALL)
> >  STEXI
> >  @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> >  @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> > +@itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
> >  @findex -numa
> >  Define a NUMA node and assign RAM and VCPUs to it.
> > +Set NUMA distance from source node to destination node.
> 
> Set the NUMA distance from a source node to a destination node.
> 
> >  
> >  @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
> >  @samp{cpus} option represent a contiguous range of CPU indexes
> > @@ -167,6 +170,13 @@ split equally between them.
> >  @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
> >  if one node uses @samp{memdev}, all of them have to use it.
> >  
> > +@var{source} and @var{destination} are NUMA node IDs.
> > +@var{distance} is the NUMA distance from @var{source} to @var{destination}.
> > +The distance from node A to node B may be different from the distance from
> > +node B to node A since the distance can to be asymmetry.
> 
> /since/as/
> /asymmetry/asymmetrical/
> 
> > +If the distance is not set, the default distance for a local NUMA node is 10,
> 
> no need for ,
> 
> > +and 20 for a remote node.
> 
> and the default distance for a remote node is 20.
> 
> Also need sentence like: When the distance for only one direction is
> given, then the opposite direction defaults to the same distance.
> 
> And maybe should also point out how to set an unreachable node with 255.
> 
Thanks for your careful review. Regarding setting an unreachable node,
which way is better: setting exactly 255 as the distance, or treating
any distance larger than 255 as unreachable?
> > +
> >  Note that the -@option{numa} option doesn't allocate any of the
> >  specified resources, it just assigns existing resources to NUMA
> >  nodes. This means that one still has to use the @option{-m},
> > -- 
> > 2.7.4
> > 
> > 
> 
> drew

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Andrew Jones 7 years ago
On Thu, Mar 23, 2017 at 03:23:35PM +0800, He Chen wrote:
> > 
> > And maybe should also point out how to set an unreachable node with 255.
> > 
> Thanks for your careful review. Regarding setting an unreachable node,
> which way is better: setting exactly 255 as the distance, or treating
> any distance larger than 255 as unreachable?

If a numeric distance is to be used, then I think following the spec with
255 is best. Alternatively, we could have something like
'-numa dist,src=0,dst=1,unreachable', rather than specifying val at all,
but, IMO, it's not worth it.
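
I.e., with the syntax from this patch, an unreachable pair would simply
be declared as:

    -numa dist,src=0,dst=1,val=255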

Thanks,
drew

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Eric Blake 7 years ago
On 03/23/2017 02:23 AM, He Chen wrote:

>>> -numa dist,src=1,dst=0,val=21 \

>>
>> And maybe should also point out how to set an unreachable node with 255.
>>
> Thanks for your careful review. Regarding setting an unreachable node,
> which way is better: setting exactly 255 as the distance, or treating
> any distance larger than 255 as unreachable?

Given that you spec'd val as 'uint8', you can't pass a value for
distance that is larger than 255.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Igor Mammedov 7 years ago
On Wed, 22 Mar 2017 17:32:46 +0800
He Chen <he.chen@linux.intel.com> wrote:

> Currently, QEMU does not provide a clear command to set vNUMA distance
> for a guest, although we already have the `-numa` command to set vNUMA
> nodes.
> 
> vNUMA distance makes sense in certain scenarios.
> Today, if we create a guest that has 4 vNUMA nodes and check the NUMA
> info via `numactl -H`, we will see:
> 
> node distance:
> node    0    1    2    3
>   0:   10   20   20   20
>   1:   20   10   20   20
>   2:   20   20   10   20
>   3:   20   20   20   10
> 
> The guest kernel regards all local nodes as distance 10 and all remote
> nodes as distance 20 when there is no SLIT table, since QEMU doesn't
> build one. This looks a little strange once you have seen the distances
> on an actual physical machine that contains 4 NUMA nodes. My machine
> shows:
> 
> node distance:
> node    0    1    2    3
>   0:   10   21   31   41
>   1:   21   10   21   31
>   2:   31   21   10   21
>   3:   41   31   21   10
> 
> To set vNUMA distance, the guest should see a complete SLIT table.
> QEMU already provides the `-acpitable` command that allows users to add
> an ACPI table to the guest, but it requires users to build the ACPI
> table themselves first. Using `-acpitable` to add a SLIT table is
> neither straightforward nor flexible: whenever the vNUMA configuration
> changes, another SLIT table has to be generated manually. That is not
> friendly to users or to upper-layer software like libvirt.
> 
> This patch adds SLIT table support to QEMU and provides an additional
> option, `dist`, for the `-numa` command that allows users to set the
> vNUMA distance on the QEMU command line.
> 
> With this patch, when a user wants to create a guest that contains
> several vNUMA nodes and also wants to set the distances among those
> nodes, the QEMU command line would look like:
> 
> ```
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
> -numa node,nodeid=0,cpus=0,memdev=node0 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
> -numa node,nodeid=1,cpus=1,memdev=node1 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
> -numa node,nodeid=2,cpus=2,memdev=node2 \
> -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
> -numa node,nodeid=3,cpus=3,memdev=node3 \
> -numa dist,src=0,dst=1,val=21 \
> -numa dist,src=0,dst=2,val=31 \
> -numa dist,src=0,dst=3,val=41 \
> -numa dist,src=1,dst=0,val=21 \
> ...
> ```
> 
> Signed-off-by: He Chen <he.chen@linux.intel.com>
> 
> fix
> ---
>  hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
>  hw/arm/virt-acpi-build.c    |  2 ++
>  hw/i386/acpi-build.c        |  2 ++
>  include/hw/acpi/aml-build.h |  1 +
>  include/sysemu/numa.h       |  1 +
>  include/sysemu/sysemu.h     |  4 ++++
>  numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
>  qapi-schema.json            | 30 ++++++++++++++++++++++++++---
>  qemu-options.hx             | 12 +++++++++++-
>  9 files changed, 121 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index c6f2032..410b30e 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -24,6 +24,7 @@
>  #include "hw/acpi/aml-build.h"
>  #include "qemu/bswap.h"
>  #include "qemu/bitops.h"
> +#include "sysemu/numa.h"
>  
>  static GArray *build_alloc_array(void)
>  {
> @@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>      numamem->base_addr = cpu_to_le64(base);
>      numamem->range_length = cpu_to_le64(len);
>  }
> +
> +/*
> + * ACPI spec 5.2.17 System Locality Distance Information Table
> + * (Revision 2.0 or later)
> + */
> +void build_slit(GArray *table_data, BIOSLinker *linker)
> +{
> +    int slit_start, i, j;
> +    slit_start = table_data->len;
> +
> +    acpi_data_push(table_data, sizeof(AcpiTableHeader));
> +
> +    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> +    for (i = 0; i < nb_numa_nodes; i++) {
> +        for (j = 0; j < nb_numa_nodes; j++) {
> +            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> +        }
> +    }
> +
> +    build_header(linker, table_data,
> +                 (void *)(table_data->data + slit_start),
> +                 "SLIT",
> +                 table_data->len - slit_start, 1, NULL, NULL);
> +}
> +
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 0835e59..d9e6828 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -781,6 +781,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>      if (nb_numa_nodes > 0) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, vms);
> +        acpi_add_table(table_offsets, tables_blob);
> +        build_slit(tables_blob, tables->linker);
>      }
>  
>      if (its_class_name() && !vmc->no_its) {
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 2073108..12730ea 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>      if (pcms->numa_nodes) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, machine);
> +        acpi_add_table(table_offsets, tables_blob);
> +        build_slit(tables_blob, tables->linker);
>      }
>      if (acpi_get_mcfg(&mcfg)) {
>          acpi_add_table(table_offsets, tables_blob);
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 00c21f1..329a0d0 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
>  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>                         uint64_t len, int node, MemoryAffinityFlags flags);
>  
> +void build_slit(GArray *table_data, BIOSLinker *linker);
>  #endif
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 8f09dcf..2f7a941 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -21,6 +21,7 @@ typedef struct node_info {
>      struct HostMemoryBackend *node_memdev;
>      bool present;
>      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> +    uint8_t distance[MAX_NODES];
>  } NodeInfo;
>  
>  extern NodeInfo numa_info[MAX_NODES];
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 576c7ce..a4e328d 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -169,6 +169,10 @@ extern int mem_prealloc;
>  
>  #define MAX_NODES 128
>  #define NUMA_NODE_UNASSIGNED MAX_NODES
> +#define MIN_NUMA_DISTANCE 10
> +#define DEF_NUMA_DISTANCE 20
> +#define MAX_NUMA_DISTANCE 254
> +#define NUMA_DISTANCE_UNREACHABLE 255
>  
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
> diff --git a/numa.c b/numa.c
> index e01cb54..425a320 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -212,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>  }
>  
> +static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
> +{
> +    uint64_t src = dist->src;
> +    uint64_t dst = dist->dst;
> +    uint8_t val = dist->val;
> +
> +    if (src >= MAX_NODES || dst >= MAX_NODES) {
this allows specifying any src/dst, not just present nodes;
you should probably check here that src/dst are present,
see: numa_info[x].present

if a node is not present, complain that it should be declared with
the '-numa node' option first, and exit.
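
Something along these lines (a sketch only):

    if (!numa_info[src].present || !numa_info[dst].present) {
        error_setg(errp, "Source/Destination NUMA node is missing; "
                   "please use '-numa node' to declare it first");
        return;
    }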

> +        error_setg(errp, "Max number of NUMA nodes reached: %"
> +                   PRIu64 "", src > dst ? src : dst);
> +        return;
> +    }
> +
> +    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {
> +        error_setg(errp,
> +                "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
> +                dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
> +        return;
> +    }
> +
> +    numa_info[src].distance[dst] = val;
> +}
> +
>  static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
>  {
>      NumaOptions *object = NULL;
> @@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
>          }
>          nb_numa_nodes++;
>          break;
> +    case NUMA_OPTIONS_TYPE_DIST:
> +        numa_distance_parse(&object->u.dist, opts, &err);
> +        if (err) {
> +            goto end;
> +        }
> +        break;
>      default:
>          abort();
>      }
> @@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
>      g_free(seen_cpus);
>  }
>  
> +static void default_numa_distance(void)
> +{
> +    int src, dst;
> +
fill in defaults only if there weren't any '-numa dist' options on the
CLI, and refuse to start if a partially filled table was explicitly
provided on the CLI
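
For example (a sketch; 'have_numa_distance' is a hypothetical flag that
numa_distance_parse() would set):

    if (have_numa_distance) {
        /* The distance array is zero-initialized, so a remaining 0
         * for a remote pair means the user left a hole in the table. */
        for (src = 0; src < nb_numa_nodes; src++) {
            for (dst = 0; dst < nb_numa_nodes; dst++) {
                if (src != dst && numa_info[src].distance[dst] == 0) {
                    error_report("NUMA distance from node %d to %d "
                                 "is missing", src, dst);
                    exit(1);
                }
            }
        }
    }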

> +    for (src = 0; src < nb_numa_nodes; src++) {
> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> +            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
> +                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
> +            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
> +                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
> +                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
> +                else
> +                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
> +            }
> +        }
> +    }
> +}
> +
>  void parse_numa_opts(MachineClass *mc)
>  {
>      int i;
> @@ -390,6 +436,7 @@ void parse_numa_opts(MachineClass *mc)
>          }
>  
>          validate_numa_cpus();
> +        default_numa_distance();
>      } else {
>          numa_set_mem_node_id(0, ram_size, 0);
>      }
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 32b4a4b..21ad94a 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -5644,15 +5644,19 @@
>  ##
>  # @NumaOptionsType:
>  #
> +# @node: NUMA nodes configuration
> +#
> +# @dist: NUMA distance configuration
> +#
>  # Since: 2.1
>  ##
>  { 'enum': 'NumaOptionsType',
> -  'data': [ 'node' ] }
> +  'data': [ 'node', 'dist' ] }
>  
>  ##
>  # @NumaOptions:
>  #
> -# A discriminated record of NUMA options. (for OptsVisitor)
> +# A discriminated record of NUMA options.
unrelated change?

>  #
>  # Since: 2.1
>  ##
> @@ -5660,7 +5664,8 @@
>    'base': { 'type': 'NumaOptionsType' },
>    'discriminator': 'type',
>    'data': {
> -    'node': 'NumaNodeOptions' }}
> +    'node': 'NumaNodeOptions',
> +    'dist': 'NumaDistOptions' }}
>  
>  ##
>  # @NumaNodeOptions:
> @@ -5689,6 +5694,25 @@
>     '*memdev': 'str' }}
>  
>  ##
> +# @NumaDistOptions:
> +#
> +# Set distance between 2 NUMA nodes. (for OptsVisitor)
> +#
> +# @src: source NUMA node.
> +#
> +# @dst: destination NUMA node.
> +#
> +# @val: NUMA distance from source node to destination node.
> +#
> +# Since: 2.10
> +##
> +{ 'struct': 'NumaDistOptions',
> +  'data': {
> +   'src': 'uint64',
> +   'dst': 'uint64',
nodeid is uint16, so these should probably also be uint16


> +   'val': 'uint8' }}
> +
> +##
>  # @HostMemPolicy:
>  #
>  # Host memory policy types
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8dd8ee3..43c3950 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -139,12 +139,15 @@ ETEXI
>  
>  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n", QEMU_ARCH_ALL)
> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> +    "-numa dist,src=source,dst=destination,val=distance\n", QEMU_ARCH_ALL)
>  STEXI
>  @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>  @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> +@itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
>  @findex -numa
>  Define a NUMA node and assign RAM and VCPUs to it.
> +Set NUMA distance from source node to destination node.
>  
>  @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
>  @samp{cpus} option represent a contiguous range of CPU indexes
> @@ -167,6 +170,13 @@ split equally between them.
>  @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
>  if one node uses @samp{memdev}, all of them have to use it.
>  
> +@var{source} and @var{destination} are NUMA node IDs.
> +@var{distance} is the NUMA distance from @var{source} to @var{destination}.
> +The distance from node A to node B may be different from the distance from
> +node B to node A since the distance can to be asymmetry.
> +If the distance is not set, the default distance for a local NUMA node is 10,
> +and 20 for a remote node.
> +
>  Note that the -@option{numa} option doesn't allocate any of the
>  specified resources, it just assigns existing resources to NUMA
>  nodes. This means that one still has to use the @option{-m},


Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by He Chen 6 years, 12 months ago
On Thu, Mar 23, 2017 at 10:29:55AM +0100, Igor Mammedov wrote:
> On Wed, 22 Mar 2017 17:32:46 +0800
> He Chen <he.chen@linux.intel.com> wrote:
> 
> > Currently, QEMU does not provide a clear command to set vNUMA distance
> > for a guest, although we already have the `-numa` command to set vNUMA
> > nodes.
> > 
> > vNUMA distance makes sense in certain scenarios.
> > Today, if we create a guest that has 4 vNUMA nodes and check the NUMA
> > info via `numactl -H`, we will see:
> > 
> > node distance:
> > node    0    1    2    3
> >   0:   10   20   20   20
> >   1:   20   10   20   20
> >   2:   20   20   10   20
> >   3:   20   20   20   10
> > 
> > Guest kernel regards all local node as distance 10, and all remote node
> > as distance 20 when there is no SLIT table since QEMU doesn't build it.
> > It looks like a little strange when you have seen the distance in an
> > actual physical machine that contains 4 NUMA nodes. My machine shows:
> > 
> > node distance:
> > node    0    1    2    3
> >   0:   10   21   31   41
> >   1:   21   10   21   31
> >   2:   31   21   10   21
> >   3:   41   31   21   10
> > 
> > To set vNUMA distance, guest should see a complete SLIT table.
> > I found QEMU has provide `-acpitable` command that allows users to add
> > a ACPI table into guest, but it requires users building ACPI table by
> > themselves first. Using `-acpitable` to add a SLIT table may be not so
> > straightforward or flexible, imagine that when the vNUMA configuration
> > is changes and we need to generate another SLIT table manually. It may
> > not be friendly to users or upper software like libvirt.
> > 
> > This patch is going to add SLIT table support in QEMU, and provides
> > additional option `dist` for command `-numa` to allow user set vNUMA
> > distance by QEMU command.
> > 
> > With this patch, when a user wants to create a guest that contains
> > several vNUMA nodes and also wants to set distance among those nodes,
> > the QEMU command would like:
> > 
> > ```
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
> > -numa node,nodeid=0,cpus=0,memdev=node0 \
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
> > -numa node,nodeid=1,cpus=1,memdev=node1 \
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
> > -numa node,nodeid=2,cpus=2,memdev=node2 \
> > -object memory-backend-ram,size=1G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
> > -numa node,nodeid=3,cpus=3,memdev=node3 \
> > -numa dist,src=0,dst=1,val=21 \
> > -numa dist,src=0,dst=2,val=31 \
> > -numa dist,src=0,dst=3,val=41 \
> > -numa dist,src=1,dst=0,val=21 \
> > ...
> > ```
> > 
> > Signed-off-by: He Chen <he.chen@linux.intel.com>
> > 
> > fix
> > ---
> >  hw/acpi/aml-build.c         | 26 +++++++++++++++++++++++++
> >  hw/arm/virt-acpi-build.c    |  2 ++
> >  hw/i386/acpi-build.c        |  2 ++
> >  include/hw/acpi/aml-build.h |  1 +
> >  include/sysemu/numa.h       |  1 +
> >  include/sysemu/sysemu.h     |  4 ++++
> >  numa.c                      | 47 +++++++++++++++++++++++++++++++++++++++++++++
> >  qapi-schema.json            | 30 ++++++++++++++++++++++++++---
> >  qemu-options.hx             | 12 +++++++++++-
> >  9 files changed, 121 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> > index c6f2032..410b30e 100644
> > --- a/hw/acpi/aml-build.c
> > +++ b/hw/acpi/aml-build.c
> > @@ -24,6 +24,7 @@
> >  #include "hw/acpi/aml-build.h"
> >  #include "qemu/bswap.h"
> >  #include "qemu/bitops.h"
> > +#include "sysemu/numa.h"
> >  
> >  static GArray *build_alloc_array(void)
> >  {
> > @@ -1609,3 +1610,28 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> >      numamem->base_addr = cpu_to_le64(base);
> >      numamem->range_length = cpu_to_le64(len);
> >  }
> > +
> > +/*
> > + * ACPI spec 5.2.17 System Locality Distance Information Table
> > + * (Revision 2.0 or later)
> > + */
> > +void build_slit(GArray *table_data, BIOSLinker *linker)
> > +{
> > +    int slit_start, i, j;
> > +    slit_start = table_data->len;
> > +
> > +    acpi_data_push(table_data, sizeof(AcpiTableHeader));
> > +
> > +    build_append_int_noprefix(table_data, nb_numa_nodes, 8);
> > +    for (i = 0; i < nb_numa_nodes; i++) {
> > +        for (j = 0; j < nb_numa_nodes; j++) {
> > +            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> > +        }
> > +    }
> > +
> > +    build_header(linker, table_data,
> > +                 (void *)(table_data->data + slit_start),
> > +                 "SLIT",
> > +                 table_data->len - slit_start, 1, NULL, NULL);
> > +}
> > +
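
For reference, with a 4-node guest using the distances from the commit message,
the payload this function appends after the 36-byte ACPI header would look like
this (a sketch; bytes shown little-endian):

```
04 00 00 00 00 00 00 00   ; number of system localities = 4 (uint64)
0a 15 1f 29               ; node 0 -> 0..3 = 10 21 31 41
15 0a 15 1f               ; node 1 -> 0..3 = 21 10 21 31
1f 15 0a 15               ; node 2 -> 0..3 = 31 21 10 21
29 1f 15 0a               ; node 3 -> 0..3 = 41 31 21 10
```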
> > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> > index 0835e59..d9e6828 100644
> > --- a/hw/arm/virt-acpi-build.c
> > +++ b/hw/arm/virt-acpi-build.c
> > @@ -781,6 +781,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
> >      if (nb_numa_nodes > 0) {
> >          acpi_add_table(table_offsets, tables_blob);
> >          build_srat(tables_blob, tables->linker, vms);
> > +        acpi_add_table(table_offsets, tables_blob);
> > +        build_slit(tables_blob, tables->linker);
> >      }
> >  
> >      if (its_class_name() && !vmc->no_its) {
> > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > index 2073108..12730ea 100644
> > --- a/hw/i386/acpi-build.c
> > +++ b/hw/i386/acpi-build.c
> > @@ -2678,6 +2678,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
> >      if (pcms->numa_nodes) {
> >          acpi_add_table(table_offsets, tables_blob);
> >          build_srat(tables_blob, tables->linker, machine);
> > +        acpi_add_table(table_offsets, tables_blob);
> > +        build_slit(tables_blob, tables->linker);
> >      }
> >      if (acpi_get_mcfg(&mcfg)) {
> >          acpi_add_table(table_offsets, tables_blob);
> > diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> > index 00c21f1..329a0d0 100644
> > --- a/include/hw/acpi/aml-build.h
> > +++ b/include/hw/acpi/aml-build.h
> > @@ -389,4 +389,5 @@ GCC_FMT_ATTR(2, 3);
> >  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
> >                         uint64_t len, int node, MemoryAffinityFlags flags);
> >  
> > +void build_slit(GArray *table_data, BIOSLinker *linker);
> >  #endif
> > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> > index 8f09dcf..2f7a941 100644
> > --- a/include/sysemu/numa.h
> > +++ b/include/sysemu/numa.h
> > @@ -21,6 +21,7 @@ typedef struct node_info {
> >      struct HostMemoryBackend *node_memdev;
> >      bool present;
> >      QLIST_HEAD(, numa_addr_range) addr; /* List to store address ranges */
> > +    uint8_t distance[MAX_NODES];
> >  } NodeInfo;
> >  
> >  extern NodeInfo numa_info[MAX_NODES];
> > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > index 576c7ce..a4e328d 100644
> > --- a/include/sysemu/sysemu.h
> > +++ b/include/sysemu/sysemu.h
> > @@ -169,6 +169,10 @@ extern int mem_prealloc;
> >  
> >  #define MAX_NODES 128
> >  #define NUMA_NODE_UNASSIGNED MAX_NODES
> > +#define MIN_NUMA_DISTANCE 10
> > +#define DEF_NUMA_DISTANCE 20
> > +#define MAX_NUMA_DISTANCE 254
> > +#define NUMA_DISTANCE_UNREACHABLE 255
> >  
> >  #define MAX_OPTION_ROMS 16
> >  typedef struct QEMUOptionRom {
> > diff --git a/numa.c b/numa.c
> > index e01cb54..425a320 100644
> > --- a/numa.c
> > +++ b/numa.c
> > @@ -212,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
> >      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> >  }
> >  
> > +static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
> > +{
> > +    uint64_t src = dist->src;
> > +    uint64_t dst = dist->dst;
> > +    uint8_t val = dist->val;
> > +
> > +    if (src >= MAX_NODES || dst >= MAX_NODES) {
> this allows specifying any src/dst instead of present nodes only;
> you should probably check here that src/dst are present,
> see: numa_info[x].present
> 
> if a node is not present, complain that it should be declared with
> the '-numa node' option first, and exit.
> 
Thanks for pointing this out.
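
For illustration, the check Igor suggests could sit at the top of
numa_distance_parse(), roughly like this (a sketch only; the error wording
here is made up, not taken from any posted patch):

```c
    /* Reject IDs outside the supported range first... */
    if (src >= MAX_NODES || dst >= MAX_NODES) {
        error_setg(errp, "NUMA node ID out of range: %" PRIu64,
                   src > dst ? src : dst);
        return;
    }

    /* ...then require that both nodes were declared with '-numa node'. */
    if (!numa_info[src].present || !numa_info[dst].present) {
        error_setg(errp, "NUMA node %" PRIu64 " is missing; declare it "
                   "with '-numa node' first",
                   !numa_info[src].present ? src : dst);
        return;
    }
```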

> > +        error_setg(errp, "Max number of NUMA nodes reached: %"
> > +                   PRIu64 "", src > dst ? src : dst);
> > +        return;
> > +    }
> > +
> > +    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {
> > +        error_setg(errp,
> > +                "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
> > +                dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
> > +        return;
> > +    }
> > +
> > +    numa_info[src].distance[dst] = val;
> > +}
> > +
> >  static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >  {
> >      NumaOptions *object = NULL;
> > @@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> >          }
> >          nb_numa_nodes++;
> >          break;
> > +    case NUMA_OPTIONS_TYPE_DIST:
> > +        numa_distance_parse(&object->u.dist, opts, &err);
> > +        if (err) {
> > +            goto end;
> > +        }
> > +        break;
> >      default:
> >          abort();
> >      }
> > @@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
> >      g_free(seen_cpus);
> >  }
> >  
> > +static void default_numa_distance(void)
> > +{
> > +    int src, dst;
> > +
> fill in defaults only if there weren't any '-numa dist' on CLI
> and refuse to start if a partially filled table was explicitly provided on CLI
> 
I am afraid that I may have a bad function name here, fill_numa_distance
might be a better name?

Also, since the distance can be asymmetric, IMHO, providing a partially
filled table looks fine. If we set many NUMA nodes e.g. 8 nodes for
guest, the whole filled table would require many command lines which
might be inconvenient and less flexible.
Please correct me if I am wrong, thanks.
> > +    for (src = 0; src < nb_numa_nodes; src++) {
> > +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> > +            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
> > +                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
> > +            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
> > +                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
> > +                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
> > +                else
> > +                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
> > +            }
> > +        }
> > +    }
> > +}
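
For what it's worth, tracing the loop above makes the fill behaviour concrete:
on a 3-node guest where only `-numa dist,src=0,dst=1,val=21` was given, the
result would be

```
node    0    1    2
  0:   10   21   20
  1:   21   10   20
  2:   20   20   10
```

i.e. the explicit 0->1 value is mirrored to 1->0, and everything else falls
back to the 10/20 defaults.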
> > +
> >  void parse_numa_opts(MachineClass *mc)
> >  {
> >      int i;
> > @@ -390,6 +436,7 @@ void parse_numa_opts(MachineClass *mc)
> >          }
> >  
> >          validate_numa_cpus();
> > +        default_numa_distance();
> >      } else {
> >          numa_set_mem_node_id(0, ram_size, 0);
> >      }
> [...]

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Igor Mammedov 6 years, 12 months ago
On Wed, 29 Mar 2017 16:50:00 +0800
He Chen <he.chen@linux.intel.com> wrote:

> On Thu, Mar 23, 2017 at 10:29:55AM +0100, Igor Mammedov wrote:
> > On Wed, 22 Mar 2017 17:32:46 +0800
> > He Chen <he.chen@linux.intel.com> wrote:
> >   
> > > [...]
> > > diff --git a/numa.c b/numa.c
> > > index e01cb54..425a320 100644
> > > --- a/numa.c
> > > +++ b/numa.c
> > > @@ -212,6 +212,28 @@ static void numa_node_parse(NumaNodeOptions *node, QemuOpts *opts, Error **errp)
> > >      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> > >  }
> > >  
> > > +static void numa_distance_parse(NumaDistOptions *dist, QemuOpts *opts, Error **errp)
> > > +{
> > > +    uint64_t src = dist->src;
> > > +    uint64_t dst = dist->dst;
> > > +    uint8_t val = dist->val;
> > > +
> > > +    if (src >= MAX_NODES || dst >= MAX_NODES) {  
> > this allows specifying any src/dst instead of present nodes only;
> > you should probably check here that src/dst are present,
> > see: numa_info[x].present
> > 
> > if a node is not present, complain that it should be declared with
> > the '-numa node' option first, and exit.
> >   
> Thanks for pointing this out.
> 
> > > +        error_setg(errp, "Max number of NUMA nodes reached: %"
> > > +                   PRIu64 "", src > dst ? src : dst);
> > > +        return;
> > > +    }
> > > +
> > > +    if (val < MIN_NUMA_DISTANCE || val > MAX_NUMA_DISTANCE) {
> > > +        error_setg(errp,
> > > +                "NUMA distance (%" PRIu8 ") out of range (%d) ~ (%d)",
> > > +                dist->val, MIN_NUMA_DISTANCE, MAX_NUMA_DISTANCE);
> > > +        return;
> > > +    }
> > > +
> > > +    numa_info[src].distance[dst] = val;
> > > +}
> > > +
> > >  static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> > >  {
> > >      NumaOptions *object = NULL;
> > > @@ -235,6 +257,12 @@ static int parse_numa(void *opaque, QemuOpts *opts, Error **errp)
> > >          }
> > >          nb_numa_nodes++;
> > >          break;
> > > +    case NUMA_OPTIONS_TYPE_DIST:
> > > +        numa_distance_parse(&object->u.dist, opts, &err);
> > > +        if (err) {
> > > +            goto end;
> > > +        }
> > > +        break;
> > >      default:
> > >          abort();
> > >      }
> > > @@ -294,6 +322,24 @@ static void validate_numa_cpus(void)
> > >      g_free(seen_cpus);
> > >  }
> > >  
> > > +static void default_numa_distance(void)
> > > +{
> > > +    int src, dst;
> > > +  
> > fill in defaults only if there weren't any '-numa dist' on CLI
> > and refuse to start if a partially filled table was explicitly provided on CLI
> >   
> I am afraid that I may have a bad function name here, fill_numa_distance
> might be a better name?
> 
> Also, since the distance can be asymmetric, IMHO, providing a partially
> filled table looks fine. If we set many NUMA nodes e.g. 8 nodes for
> guest, the whole filled table would require many command lines which
> might be inconvenient and less flexible.
asymmetric doesn't imply sparse, so one has to specify the full matrix
it might be inconvenient /long/ but is very flexible.

if you want to shorten the CLI, maybe adding an extra option could help:
  -numa dist,default=xxx
that way QEMU won't have to maintain a default value; it would
be up to the user to maintain it.
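
As a sketch of how such a hypothetical option might look on the command line
(note that `default=` is only a proposal here, not something the posted patch
implements):

```
-numa dist,default=31 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=1,dst=0,val=21
```

i.e. the user states the fallback distance once and only spells out the pairs
that differ from it.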

> Please correct me if I am wrong, thanks.
> > > +    for (src = 0; src < nb_numa_nodes; src++) {
> > > +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> > > +            if (src == dst && numa_info[src].distance[dst] != MIN_NUMA_DISTANCE) {
> > > +                numa_info[src].distance[dst] = MIN_NUMA_DISTANCE;
spec says that the diagonal elements of the matrix are equal to 10;
there shouldn't be a silent fixup.
If the user provided a wrong value, just error out pointing to the CLI mistake.
Maybe add a diagonal check to numa_distance_parse().
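
A possible shape for that check, next to the existing range check in
numa_distance_parse() (a sketch; the error wording is illustrative):

```c
    /* The SLIT spec fixes a node's distance to itself at 10, so reject
     * anything else instead of silently overwriting it later. */
    if (src == dst && val != MIN_NUMA_DISTANCE) {
        error_setg(errp, "Local distance of node %" PRIu64 " must be %d",
                   src, MIN_NUMA_DISTANCE);
        return;
    }
```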


> > > +            } else if (numa_info[src].distance[dst] <= MIN_NUMA_DISTANCE) {
> > > +                if (numa_info[dst].distance[src] > MIN_NUMA_DISTANCE)
> > > +                    numa_info[src].distance[dst] = numa_info[dst].distance[src];
> > > +                else
> > > +                    numa_info[src].distance[dst] = DEF_NUMA_DISTANCE;
> > > +            }
> > > +        }
> > > +    }
> > > +}
> [...]


Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Andrew Jones 6 years, 12 months ago
On Wed, Mar 29, 2017 at 01:44:58PM +0200, Igor Mammedov wrote:
> On Wed, 29 Mar 2017 16:50:00 +0800
> He Chen <he.chen@linux.intel.com> wrote:
> > > > +static void default_numa_distance(void)
> > > > +{
> > > > +    int src, dst;
> > > > +  
> > > fill in defaults only if there weren't any '-numa dist' on CLI
> > > and refuse to start if a partially filled table was explicitly provided on CLI
> > >   
> > I am afraid that I may have a bad function name here, fill_numa_distance
> > might be a better name?
> > 
> > Also, since the distance can be asymmetric, IMHO, providing a partially
> > filled table looks fine. If we set many NUMA nodes e.g. 8 nodes for
> > guest, the whole filled table would require many command lines which
> > might be inconvenient and less flexible.
> asymmetric doesn't imply sparse, so one has to specify the full matrix
> it might be inconvenient /long/ but is very flexible.

This makes me realize that a user inputting only one of the A -> B or B -> A
distances on the command line doesn't imply symmetry.  It could be that the user
just forgot to input the opposite.  To avoid the ambiguity we either need
to force both to be input (as it was before I suggested otherwise), or
add a '-numa symmetric' type of property to enable the shortcut.  I guess
we should just avoid the shortcut, at least for now.

Thanks,
drew

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Eduardo Habkost 6 years, 12 months ago
On Wed, Mar 29, 2017 at 02:30:38PM +0200, Andrew Jones wrote:
> On Wed, Mar 29, 2017 at 01:44:58PM +0200, Igor Mammedov wrote:
> > On Wed, 29 Mar 2017 16:50:00 +0800
> > He Chen <he.chen@linux.intel.com> wrote:
> > > > > +static void default_numa_distance(void)
> > > > > +{
> > > > > +    int src, dst;
> > > > > +  
> > > > fill in defaults only if there weren't any '-numa dist' on CLI
> > > > and refuse to start if a partially filled table was explicitly provided on CLI
> > > >   
> > > I am afraid that I may have a bad function name here, fill_numa_distance
> > > might be a better name?
> > > 
> > > Also, since the distance can be asymmetric, IMHO, providing a partially
> > > filled table looks fine. If we set many NUMA nodes e.g. 8 nodes for
> > > guest, the whole filled table would require many command lines which
> > > might be inconvenient and less flexible.
> > asymmetric doesn't imply sparse, so one has to specify the full matrix
> > it might be inconvenient /long/ but is very flexible.
> 
> This makes me realize that a user inputting only one of the A -> B or B -> A
> distances on the command line doesn't imply symmetry.  It could be that the user
> just forgot to input the opposite.  To avoid the ambiguity we either need
> to force both to be input (as it was before I suggested otherwise), or
> add a '-numa symmetric' type of property to enable the shortcut.  I guess
> we should just avoid the shortcut, at least for now.

Is protecting the user from one very specific (and very rare[1])
mistake a good reason for making the automatic default less
useful? Requiring an explicit '-numa symmetric' option to enable
the automatic default seems to defeat the purpose of having an
automatic default, to me.

I bet people would just specify the distances for both
directions, instead of having to read the documentation for
"-numa symmetric" to be sure if that's exactly what they want.

[1] I mean: forgetting an option isn't rare, but asymmetric NUMA
    distances are so rare that most people who reviewed those
    patches were surprised when they learned that the specs allow
    asymmetric distances.

-- 
Eduardo

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Andrew Jones 6 years, 12 months ago
On Thu, Mar 30, 2017 at 12:00:48PM -0300, Eduardo Habkost wrote:
> On Wed, Mar 29, 2017 at 02:30:38PM +0200, Andrew Jones wrote:
> > On Wed, Mar 29, 2017 at 01:44:58PM +0200, Igor Mammedov wrote:
> > > On Wed, 29 Mar 2017 16:50:00 +0800
> > > He Chen <he.chen@linux.intel.com> wrote:
> > > > > > +static void default_numa_distance(void)
> > > > > > +{
> > > > > > +    int src, dst;
> > > > > > +  
> > > > > fill in defaults only if there weren't any '-numa dist' on CLI
> > > > > and refuse to start if a partially filled table was explicitly provided on CLI
> > > > >   
> > > > I am afraid that I may have a bad function name here, fill_numa_distance
> > > > might be a better name?
> > > > 
> > > > Also, since the distance can be asymmetric, IMHO, providing a partially
> > > > filled table looks fine. If we set many NUMA nodes e.g. 8 nodes for
> > > > guest, the whole filled table would require many command lines which
> > > > might be inconvenient and less flexible.
> > > asymmetric doesn't imply sparse, so one has to specify the full matrix
> > > it might be inconvenient /long/ but is very flexible.
> > 
> > This makes me realize that a user inputting only one of the A -> B or B -> A
> > distances on the command line doesn't imply symmetry.  It could be that the user
> > just forgot to input the opposite.  To avoid the ambiguity we either need
> > to force both to be input (as it was before I suggested otherwise), or
> > add a '-numa symmetric' type of property to enable the shortcut.  I guess
> > we should just avoid the shortcut, at least for now.
> 
> Is protecting the user from one very specific (and very rare[1])
> mistake a good reason for making the automatic default less
> useful?

Maybe not, but creating new interfaces with known ambiguities isn't ideal
either.

> Requiring an explicit '-numa symmetric' option to enable
> the automatic default seems to defeat the purpose of having an
> automatic default, to me.

We could reverse it: a '-numa asymmetric' option could turn off the defaults
and abort on an incomplete table.


Or, we could leave it ambiguous, but improve the heuristic by requiring
all directions to be given as soon as even one explicitly given pair is
asymmetric. With that, there's still a chance for user error to slip
through, but much less. I could live with that.
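
A sketch of what that heuristic could look like (the helper name is made up;
it assumes it runs before any defaults are filled in, so unset entries are
still 0):

```c
/* Return true if any pair with both directions explicitly given is
 * asymmetric; only then must the user supply the complete matrix. */
static bool numa_dist_given_asymmetric(void)
{
    int src, dst;

    for (src = 0; src < nb_numa_nodes; src++) {
        for (dst = src + 1; dst < nb_numa_nodes; dst++) {
            uint8_t d1 = numa_info[src].distance[dst];
            uint8_t d2 = numa_info[dst].distance[src];

            /* 0 means "not given on the CLI", so it cannot prove asymmetry. */
            if (d1 && d2 && d1 != d2) {
                return true;
            }
        }
    }
    return false;
}
```

If it returns true, QEMU would refuse to start on any unset off-diagonal
entry; otherwise the existing mirror-or-default logic applies.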

Thanks,
drew

Re: [Qemu-devel] [PATCH v3] Allow setting NUMA distance for different NUMA nodes
Posted by Eduardo Habkost 6 years, 12 months ago
On Thu, Mar 30, 2017 at 05:26:26PM +0200, Andrew Jones wrote:
> On Thu, Mar 30, 2017 at 12:00:48PM -0300, Eduardo Habkost wrote:
> > On Wed, Mar 29, 2017 at 02:30:38PM +0200, Andrew Jones wrote:
> > > On Wed, Mar 29, 2017 at 01:44:58PM +0200, Igor Mammedov wrote:
> > > > On Wed, 29 Mar 2017 16:50:00 +0800
> > > > He Chen <he.chen@linux.intel.com> wrote:
> > > > > > > +static void default_numa_distance(void)
> > > > > > > +{
> > > > > > > +    int src, dst;
> > > > > > > +  
> > > > > > fill in defaults only if there weren't any '-numa dist' on CLI
> > > > > > and refuse to start if a partially filled table was explicitly provided on CLI
> > > > > >   
> > > > > I am afraid that I may have a bad function name here, fill_numa_distance
> > > > > might be a better name?
> > > > > 
> > > > > Also, since the distance can be asymmetric, IMHO, providing a partially
> > > > > filled table looks fine. If we set many NUMA nodes e.g. 8 nodes for
> > > > > guest, the whole filled table would require many command lines which
> > > > > might be inconvenient and less flexible.
> > > > asymmetric doesn't imply sparse, so one has to specify the full matrix
> > > > it might be inconvenient /long/ but is very flexible.
> > > 
> > > This makes me realize that a user inputting only one of the A -> B or B -> A
> > > distances on the command line doesn't imply symmetry.  It could be that the user
> > > just forgot to input the opposite.  To avoid the ambiguity we either need
> > > to force both to be input (as it was before I suggested otherwise), or
> > > add a '-numa symmetric' type of property to enable the shortcut.  I guess
> > > we should just avoid the shortcut, at least for now.
> > 
> > Is protecting the user from one very specific (and very rare[1])
> > mistake a good reason for making the automatic default less
> > useful?
> 
> Maybe not, but creating new interfaces with known ambiguities isn't ideal
> either.
> 
> > Requiring an explicit '-numa symmetric' option to enable
> > the automatic default seems to defeat the purpose of having an
> > automatic default, to me.
> 
> We could reverse it: a '-numa asymmetric' option could turn off the defaults
> and abort on an incomplete table.

I doubt anybody would ever use that option, either.

> 
> Or, we could leave it ambiguous, but improve the heuristic by requiring
> all directions to be given as soon as even one explicitly given pair is
> asymmetric. With that, there's still a chance for user error to slip
> through, but much less. I could live with that.

Sounds good to me.

-- 
Eduardo