[libvirt] [RFC PATCH v1 0/4] NUMA Host or Node Partitioning

Wim Ten Have posted 4 patches 5 years, 1 month ago
git fetch https://github.com/patchew-project/libvirt tags/patchew/20191021192108.25974-1-wim.ten.have@oracle.com
[libvirt] [RFC PATCH v1 0/4] NUMA Host or Node Partitioning
Posted by Wim Ten Have 5 years, 1 month ago
From: Wim ten Have <wim.ten.have@oracle.com>

This patch extends guest domain administration by adding a feature that
creates a guest with a NUMA layout, also referred to as vNUMA (Virtual
NUMA).

NUMA (Non-Uniform Memory Access) is a method of configuring a cluster of
nodes within a single multiprocessing system such that each node shares
its processor local memory with other nodes, improving performance and
the ability of the system to be expanded.

The illustration below shows a typical 4-node NUMA system. Within this
system, each socket is equipped with its own distinct memory and some
also with I/O. Access to memory or I/O on remote nodes is only possible
by communicating through the "Interconnect".

    +-------------+-------+        +-------+-------------+
    |NODE0|       |       |        |       |       |NODE3|
    |     | CPU00 | CPU03 |        | CPU12 | CPU15 |     |
    |     |       |       |        |       |       |     |
    | Mem +--- Socket0 ---<-------->--- Socket3 ---+ Mem |
    |     |       |       |        |       |       |     |
    +-----+ CPU01 | CPU02 |        | CPU13 | CPU14 |     |
    | I/O |       |       |        |       |       |     |
    +-----+-------^-------+        +-------^-------+-----+
                  |                        |
                  |      Interconnect      |
                  |                        |
    +-------------v-------+        +-------v-------------+
    |NODE1|       |       |        |       |       |NODE2|
    |     | CPU04 | CPU07 |        | CPU08 | CPU11 |     |
    |     |       |       |        |       |       |     |
    | Mem +--- Socket1 ---<-------->--- Socket2 ---+ Mem |
    |     |       |       |        |       |       |     |
    +-----+ CPU05 | CPU06 |        | CPU09 | CPU10 |     |
    | I/O |       |       |        |       |       |     |
    +-----+-------+-------+        +-------+-------+-----+

Unfortunately, NUMA architectures have some drawbacks. For example,
when data is stored in memory associated with Socket2 but is accessed
by a CPU in Socket0, that CPU uses the interconnect to access the
memory associated with Socket2. These interconnect hops add data access
delays. Some high performance software takes NUMA architecture into
account by carefully placing data in memory and pinning the processes
most likely to access that data to CPUs with the shortest access times.
Similarly, such software can pin its I/O processes to CPUs with the
shortest access times to I/O devices. When such software is run within
a guest VM, constructing the VM such that its virtual NUMA topology
mirrors the physical NUMA topology preserves the application software's
performance.

The changes brought by this patch series add a new libvirt domain element
named <vnuma> that allows for dynamic 'host' or 'node' partitioning of
a guest: libvirt inspects the host capabilities and renders a guest XML
design whose vNUMA topology best matches the host.

  <domain>
    ..
    <vnuma mode='host|node'
           distribution='contiguous|siblings|round-robin|interleave'>
      <memory unit='KiB'>524288</memory>
      <partition nodeset="1-4,^3" cells="8"/>
    </vnuma>
    ..
  </domain>

The content of this <vnuma> element causes libvirt to dynamically
partition the guest domain XML into a 'host' or 'node' numa model.

Under <vnuma mode='host' ... > the guest domain is automatically
partitioned according to the "host" capabilities.
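
As a rough illustration of what inspecting the host capabilities could
involve, the sketch below reads the NUMA cells out of a capabilities
document of the kind 'virsh capabilities' prints. The sample XML is a
hand-written, trimmed fragment and the helper name host_cells is
hypothetical; this is not the code from the patch series.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the shape of libvirt's host capabilities
# XML. Real output carries many more elements and attributes.
SAMPLE_CAPS = """
<capabilities>
  <host>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>4194304</memory>
          <cpus num='2'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>4194304</memory>
          <cpus num='2'>
            <cpu id='2' socket_id='1' core_id='0' siblings='2'/>
            <cpu id='3' socket_id='1' core_id='1' siblings='3'/>
          </cpus>
        </cell>
      </cells>
    </topology>
  </host>
</capabilities>
"""


def host_cells(caps_xml):
    """Return {cell id: {"memory_kib": int, "cpus": [cpu ids]}}.

    Assumes every <memory> element uses unit='KiB', as the sample does.
    """
    root = ET.fromstring(caps_xml.strip())
    cells = {}
    for cell in root.findall("./host/topology/cells/cell"):
        cells[int(cell.get("id"))] = {
            "memory_kib": int(cell.find("memory").text),
            "cpus": [int(c.get("id")) for c in cell.findall("./cpus/cpu")],
        }
    return cells
```

A 'host' mode policy would start from a structure like this one to decide
how many guest cells to create and which vCPUs belong to each.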

Under <vnuma mode='node' ... > the guest domain is partitioned according
to the nodeset and cells under the vnuma partition subelement.
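
To make the nodeset="1-4,^3" value from the example above concrete, here
is a minimal sketch of expanding such a string into a list of host node
ids. The function name parse_nodeset is hypothetical, and the sketch
simplifies matters by applying all '^' exclusions after all inclusions;
libvirt's own parser is not reproduced here.

```python
def parse_nodeset(spec):
    """Expand a nodeset string such as "1-4,^3" into a sorted id list.

    Simplification (an assumption of this sketch): exclusions marked
    with '^' are applied after every inclusion has been collected.
    """
    include, exclude = set(), set()
    for part in spec.split(","):
        part = part.strip()
        target = include
        if part.startswith("^"):
            target, part = exclude, part[1:]
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            target.update(range(first, last + 1))
        else:
            target.add(int(part))
    return sorted(include - exclude)
```

With this reading, "1-4,^3" selects host nodes 1, 2 and 4.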

The optional <vnuma> attribute distribution='type' indicates how the
guest cpus are distributed over the numa cells. It can have the
following values:
- 'contiguous': cpus are enumerated sequentially over the defined numa
   cells.
- 'siblings': cpus are distributed over the numa cells matching the host
   CPU SMT model.
- 'round-robin': cpus are distributed over the numa cells matching the
   host CPU topology.
- 'interleave': cpus are interleaved one at a time over the numa cells.
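
The two host-independent distributions in the list above can be sketched
as follows. This is only my reading of 'contiguous' and 'interleave'; the
function names and the handling of a vCPU count that does not divide
evenly are assumptions. 'siblings' and 'round-robin' depend on the host
SMT layout and topology and are not modelled here.

```python
def contiguous(ncpus, ncells):
    """Fill cell 0 with the first run of vCPU ids, then cell 1, and so on.

    Assumption: leftover vCPUs go to the lowest-numbered cells.
    """
    base, extra = divmod(ncpus, ncells)
    cells, start = [], 0
    for i in range(ncells):
        size = base + (1 if i < extra else 0)
        cells.append(list(range(start, start + size)))
        start += size
    return cells


def interleave(ncpus, ncells):
    """Hand out vCPU ids one at a time, cycling over the cells."""
    return [list(range(i, ncpus, ncells)) for i in range(ncells)]
```

For 8 vCPUs over 2 cells, contiguous() gives 0-3 and 4-7, while
interleave() gives 0,2,4,6 and 1,3,5,7.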

The optional subelement <memory> specifies the amount of memory used to
size the guest's <numa> <cell> entries. If it is not specified, the
value is taken from the guest's total memory, the <domain> <memory>
setting.
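
A small sketch of the fallback described above, combined with one
possible way of sizing the cells. The even split across cells is an
assumption made for illustration, since this cover letter does not spell
out how memory is divided; the name cell_memory is hypothetical.

```python
def cell_memory(domain_kib, ncells, vnuma_kib=None):
    """Return the per-cell memory sizes in KiB.

    Uses the <vnuma> <memory> value when given, otherwise the domain's
    total memory. Assumption: memory is split evenly, with any remainder
    spread over the lowest-numbered cells.
    """
    total = domain_kib if vnuma_kib is None else vnuma_kib
    base, extra = divmod(total, ncells)
    return [base + (1 if i < extra else 0) for i in range(ncells)]
```

With the values from the example, 524288 KiB over 8 cells, each cell
would get 65536 KiB under this assumption.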

The optional subelement <partition> is only active when <vnuma mode='node'>
is in effect and defines the "nodeset" and "cells" to target for the
guest domain. For example, the "nodeset" attribute can limit the host NUMA
nodes assigned to the guest with the help of NUMA node tuning (<numatune>).
Alternatively, the "cells" attribute can define the number of vNUMA cells
to render for the guest.

We're planning a 'virsh vnuma' command to convert existing guest domains
to one of these vNUMA models.

Wim ten Have (4):
  XML definitions for guest vNUMA and parsing routines
  qemu: driver changes adding vNUMA vCPU hotplug support
  qemu: driver changes adding vNUMA memory hotplug support
  tests: add various tests to exercise vNUMA host partitioning

 docs/formatdomain.html.in                     |  94 ++++
 docs/schemas/domaincommon.rng                 |  65 +++
 src/conf/domain_conf.c                        | 482 +++++++++++++++++-
 src/conf/domain_conf.h                        |   2 +
 src/conf/numa_conf.c                          | 241 ++++++++-
 src/conf/numa_conf.h                          |  58 ++-
 src/libvirt_private.syms                      |   8 +
 src/qemu/qemu_driver.c                        |  65 ++-
 src/qemu/qemu_hotplug.c                       |  95 +++-
 .../cpu-host-passthrough-nonuma.args          |  29 ++
 .../cpu-host-passthrough-nonuma.xml           |  19 +
 .../cpu-host-passthrough-numa-contiguous.args |  37 ++
 .../cpu-host-passthrough-numa-contiguous.xml  |  20 +
 .../cpu-host-passthrough-numa-interleave.args |  41 ++
 .../cpu-host-passthrough-numa-interleave.xml  |  19 +
 ...host-passthrough-numa-node-contiguous.args |  53 ++
 ...-host-passthrough-numa-node-contiguous.xml |  21 +
 ...host-passthrough-numa-node-interleave.args |  41 ++
 ...-host-passthrough-numa-node-interleave.xml |  22 +
 ...ost-passthrough-numa-node-round-robin.args | 125 +++++
 ...host-passthrough-numa-node-round-robin.xml |  21 +
 ...u-host-passthrough-numa-node-siblings.args |  32 ++
 ...pu-host-passthrough-numa-node-siblings.xml |  23 +
 ...cpu-host-passthrough-numa-round-robin.args |  37 ++
 .../cpu-host-passthrough-numa-round-robin.xml |  22 +
 .../cpu-host-passthrough-numa-siblings.args   |  37 ++
 .../cpu-host-passthrough-numa-siblings.xml    |  20 +
 .../cpu-host-passthrough-numa.args            |  37 ++
 .../cpu-host-passthrough-numa.xml             |  20 +
 tests/qemuxml2argvtest.c                      |  10 +
 30 files changed, 1765 insertions(+), 31 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-nonuma.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-nonuma.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-contiguous.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-contiguous.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-interleave.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-interleave.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-contiguous.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-contiguous.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-interleave.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-interleave.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-round-robin.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-round-robin.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-siblings.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-node-siblings.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-round-robin.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-round-robin.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-siblings.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa-siblings.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa.xml

-- 
2.21.0

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [libvirt] [RFC PATCH v1 0/4] NUMA Host or Node Partitioning
Posted by Daniel P. Berrangé 5 years, 1 month ago
On Mon, Oct 21, 2019 at 09:21:04PM +0200, Wim Ten Have wrote:
> From: Wim ten Have <wim.ten.have@oracle.com>
> 
> This patch extends guest domain administration by adding a feature that
> creates a guest with a NUMA layout, also referred to as vNUMA (Virtual
> NUMA).

Errr, that feature already exists. You can create a guest NUMA layout
with this:

<domain>
   <cpu>
     ...
     <numa>
       <cell id='0' cpus='0-3' memory='512000' unit='KiB' discard='yes'/>
       <cell id='1' cpus='4-7' memory='512000' unit='KiB' memAccess='shared'/>
     </numa>
     ...
   </cpu>
</domain>

[snip]

> The changes brought by this patch series add a new libvirt domain element
> named <vnuma> that allows for dynamic 'host' or 'node' partitioning of
> a guest where libvirt inspects the host capabilities and renders a best
> guest XML design holding a host matching vNUMA topology.
> 
>   <domain>
>     ..
>     <vnuma mode='host|node'
>            distribution='contiguous|siblings|round-robin|interleave'>
>       <memory unit='KiB'>524288</memory>
>       <partition nodeset="1-4,^3" cells="8"/>
>     </vnuma>
>     ..
>   </domain>
> 
> The content of this <vnuma> element causes libvirt to dynamically
> partition the guest domain XML into a 'host' or 'node' numa model.
> 
> Under <vnuma mode='host' ... > the guest domain is automatically
> partitioned according to the "host" capabilities.
> 
> Under <vnuma mode='node' ... > the guest domain is partitioned according
> to the nodeset and cells under the vnuma partition subelement.
> 
> The optional <vnuma> attribute distribution='type' is to indicate the
> guest numa cell cpus distribution. This distribution='type' can have
> the following values:
> - 'contiguous' delivery, under which the cpus enumerate sequentially
>    over the numa defined cells.
> - 'siblings' cpus are distributed over the numa cells matching the host
>    CPU SMT model.
> - 'round-robin' cpus are distributed over the numa cells matching the
>    host CPU topology.
> - 'interleave' cpus are interleaved one at a time over the numa cells.
> 
> The optional subelement <memory> specifies the memory size reserved
> for the guest to dimension its <numa> <cell id> size. If no memory is
> specified, the <vnuma> <memory> setting is acquired from the guest's
> total memory, <domain> <memory> setting.

This seems to be just implementing some specific policies to
automagically configure the NUMA config. This is all already
possible for the mgmt apps to do with the existing XML configs
we expose AFAIK.  Libvirt's goal is to /not/ implement specific
policies like this, but instead expose the mechanism for apps
to use to define policies as they see fit.
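
As a minimal sketch of that split, a management application could compute
the layout with whatever policy it likes and then emit the existing
<numa> XML itself. The function name numa_element and its input format
are hypothetical; only the <numa>/<cell> element and the id, cpus, memory
and unit attributes come from the existing domain XML shown earlier.

```python
import xml.etree.ElementTree as ET


def numa_element(cells):
    """Build a <numa> element from a list of (cpu ids, memory KiB) pairs.

    The policy that produced 'cells' lives in the application; this
    helper only serializes the result.
    """
    numa = ET.Element("numa")
    for cell_id, (cpus, mem_kib) in enumerate(cells):
        ET.SubElement(numa, "cell", {
            "id": str(cell_id),
            "cpus": ",".join(str(c) for c in cpus),
            "memory": str(mem_kib),
            "unit": "KiB",
        })
    return ET.tostring(numa, encoding="unicode")


# Example: two cells of four vCPUs and 512000 KiB each.
xml_text = numa_element([([0, 1, 2, 3], 512000), ([4, 5, 6, 7], 512000)])
```

The resulting string would be placed under <cpu> in the domain XML.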

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
