[libvirt] [RFC PATCH auto partition NUMA guest domains v1 0/2] auto partition guests providing the host NUMA topology
Posted by Wim Ten Have 5 years, 6 months ago
From: Wim ten Have <wim.ten.have@oracle.com>

This patch series extends guest domain administration, adding support
for automatically advertising the NUMA architecture obtained from the
host capabilities under a guest by creating a vNUMA copy of it.

The mechanism is enabled by setting the check='numa' attribute under
the CPU 'host-passthrough' topology:
  <cpu mode='host-passthrough' check='numa' .../> 

When enabled, the mechanism renders the NUMA architecture provided by
the host capabilities into the guest, evenly balances the guest's
reserved vcpus and memory amongst the composed vNUMA cells, and pins
the vcpus allocated to each cell to the physical cpusets of the
corresponding host NUMA node.  This way the host NUMA topology remains
in effect under the partitioned guest domain.  For the example below,
the 16-vcpu, 64 GiB guest on the 8-node host is split into 8 cells of
2 vcpus and 8388608 KiB of memory each.

The example below auto partitions the physical NUMA detail of the host,
as listed by 'lscpu', into a guest domain vNUMA description.

    [root@host ]# lscpu 
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                240
    On-line CPU(s) list:   0-239
    Thread(s) per core:    2
    Core(s) per socket:    15
    Socket(s):             8
    NUMA node(s):          8
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 62
    Model name:            Intel(R) Xeon(R) CPU E7-8895 v2 @ 2.80GHz
    Stepping:              7
    CPU MHz:               3449.555
    CPU max MHz:           3600.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              5586.28
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              38400K
    NUMA node0 CPU(s):     0-14,120-134
    NUMA node1 CPU(s):     15-29,135-149
    NUMA node2 CPU(s):     30-44,150-164
    NUMA node3 CPU(s):     45-59,165-179
    NUMA node4 CPU(s):     60-74,180-194
    NUMA node5 CPU(s):     75-89,195-209
    NUMA node6 CPU(s):     90-104,210-224
    NUMA node7 CPU(s):     105-119,225-239
    Flags:                 ...

Without the auto partition rendering enabled, the guest 'anuma'
reads:   "<cpu mode='host-passthrough' check='none'/>"

    <domain type='kvm'>
      <name>anuma</name>
      <uuid>3f439f5f-1156-4d48-9491-945a2c0abc6d</uuid>
      <memory unit='KiB'>67108864</memory>
      <currentMemory unit='KiB'>67108864</currentMemory>
      <vcpu placement='static'>16</vcpu>
      <os>
        <type arch='x86_64' machine='pc-q35-2.11'>hvm</type>
        <boot dev='hd'/>
      </os>
      <features>
        <acpi/>
        <apic/>
        <vmport state='off'/>
      </features>
      <cpu mode='host-passthrough' check='none'/>
      <clock offset='utc'>
        <timer name='rtc' tickpolicy='catchup'/>
        <timer name='pit' tickpolicy='delay'/>
        <timer name='hpet' present='no'/>
      </clock>
      <on_poweroff>destroy</on_poweroff>
      <on_reboot>restart</on_reboot>
      <on_crash>destroy</on_crash>
      <pm>
        <suspend-to-mem enabled='no'/>
        <suspend-to-disk enabled='no'/>
      </pm>
      <devices>
        <emulator>/usr/bin/qemu-system-x86_64</emulator>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='/var/lib/libvirt/images/anuma.qcow2'/>

With auto partitioning enabled, the guest 'anuma' XML is rewritten
as listed below:   "<cpu mode='host-passthrough' check='numa'>"

    <domain type='kvm'>
      <name>anuma</name>
      <uuid>3f439f5f-1156-4d48-9491-945a2c0abc6d</uuid>
      <memory unit='KiB'>67108864</memory>
      <currentMemory unit='KiB'>67108864</currentMemory>
      <vcpu placement='static'>16</vcpu>
      <cputune>
        <vcpupin vcpu='0' cpuset='0-14,120-134'/>
        <vcpupin vcpu='1' cpuset='15-29,135-149'/>
        <vcpupin vcpu='2' cpuset='30-44,150-164'/>
        <vcpupin vcpu='3' cpuset='45-59,165-179'/>
        <vcpupin vcpu='4' cpuset='60-74,180-194'/>
        <vcpupin vcpu='5' cpuset='75-89,195-209'/>
        <vcpupin vcpu='6' cpuset='90-104,210-224'/>
        <vcpupin vcpu='7' cpuset='105-119,225-239'/>
        <vcpupin vcpu='8' cpuset='0-14,120-134'/>
        <vcpupin vcpu='9' cpuset='15-29,135-149'/>
        <vcpupin vcpu='10' cpuset='30-44,150-164'/>
        <vcpupin vcpu='11' cpuset='45-59,165-179'/>
        <vcpupin vcpu='12' cpuset='60-74,180-194'/>
        <vcpupin vcpu='13' cpuset='75-89,195-209'/>
        <vcpupin vcpu='14' cpuset='90-104,210-224'/>
        <vcpupin vcpu='15' cpuset='105-119,225-239'/>
      </cputune>
      <os>
        <type arch='x86_64' machine='pc-q35-2.11'>hvm</type>
        <boot dev='hd'/>
      </os>
      <features>
        <acpi/>
        <apic/>
        <vmport state='off'/>
      </features>
      <cpu mode='host-passthrough' check='numa'>
        <topology sockets='8' cores='1' threads='2'/>
        <numa>
          <cell id='0' cpus='0,8' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='10'/>
              <sibling id='1' value='21'/>
              <sibling id='2' value='31'/>
              <sibling id='3' value='21'/>
              <sibling id='4' value='21'/>
              <sibling id='5' value='31'/>
              <sibling id='6' value='31'/>
              <sibling id='7' value='31'/>
            </distances>
          </cell>
          <cell id='1' cpus='1,9' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='21'/>
              <sibling id='1' value='10'/>
              <sibling id='2' value='21'/>
              <sibling id='3' value='31'/>
              <sibling id='4' value='31'/>
              <sibling id='5' value='21'/>
              <sibling id='6' value='31'/>
              <sibling id='7' value='31'/>
            </distances>
          </cell>
          <cell id='2' cpus='2,10' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='31'/>
              <sibling id='1' value='21'/>
              <sibling id='2' value='10'/>
              <sibling id='3' value='21'/>
              <sibling id='4' value='31'/>
              <sibling id='5' value='31'/>
              <sibling id='6' value='21'/>
              <sibling id='7' value='31'/>
            </distances>
          </cell>
          <cell id='3' cpus='3,11' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='21'/>
              <sibling id='1' value='31'/>
              <sibling id='2' value='21'/>
              <sibling id='3' value='10'/>
              <sibling id='4' value='31'/>
              <sibling id='5' value='31'/>
              <sibling id='6' value='31'/>
              <sibling id='7' value='21'/>
            </distances>
          </cell>
          <cell id='4' cpus='4,12' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='21'/>
              <sibling id='1' value='31'/>
              <sibling id='2' value='31'/>
              <sibling id='3' value='31'/>
              <sibling id='4' value='10'/>
              <sibling id='5' value='21'/>
              <sibling id='6' value='21'/>
              <sibling id='7' value='31'/>
            </distances>
          </cell>
          <cell id='5' cpus='5,13' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='31'/>
              <sibling id='1' value='21'/>
              <sibling id='2' value='31'/>
              <sibling id='3' value='31'/>
              <sibling id='4' value='21'/>
              <sibling id='5' value='10'/>
              <sibling id='6' value='31'/>
              <sibling id='7' value='21'/>
            </distances>
          </cell>
          <cell id='6' cpus='6,14' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='31'/>
              <sibling id='1' value='31'/>
              <sibling id='2' value='21'/>
              <sibling id='3' value='31'/>
              <sibling id='4' value='21'/>
              <sibling id='5' value='31'/>
              <sibling id='6' value='10'/>
              <sibling id='7' value='21'/>
            </distances>
          </cell>
          <cell id='7' cpus='7,15' memory='8388608' unit='KiB'>
            <distances>
              <sibling id='0' value='31'/>
              <sibling id='1' value='31'/>
              <sibling id='2' value='31'/>
              <sibling id='3' value='21'/>
              <sibling id='4' value='31'/>
              <sibling id='5' value='21'/>
              <sibling id='6' value='21'/>
              <sibling id='7' value='10'/>
            </distances>
          </cell>
        </numa>
      </cpu>
      <clock offset='utc'>
        <timer name='rtc' tickpolicy='catchup'/>
        <timer name='pit' tickpolicy='delay'/>
        <timer name='hpet' present='no'/>
      </clock>
      <on_poweroff>destroy</on_poweroff>
      <on_reboot>restart</on_reboot>
      <on_crash>destroy</on_crash>
      <pm>
        <suspend-to-mem enabled='no'/>
        <suspend-to-disk enabled='no'/>
      </pm>
      <devices>
        <emulator>/usr/bin/qemu-system-x86_64</emulator>
        <disk type='file' device='disk'>
          <driver name='qemu' type='qcow2'/>
          <source file='/var/lib/libvirt/images/anuma.qcow2'/>

Finally, the virtual vNUMA detail listed by 'lscpu' inside the auto
partitioned guest 'anuma'.

    [root@anuma ~]# lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              16
    On-line CPU(s) list: 0-15
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           8
    NUMA node(s):        8
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               62
    Model name:          Intel(R) Xeon(R) CPU E7-8895 v2 @ 2.80GHz
    Stepping:            7
    CPU MHz:             2793.268
    BogoMIPS:            5586.53
    Virtualization:      VT-x
    Hypervisor vendor:   KVM
    Virtualization type: full
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            4096K
    L3 cache:            16384K
    NUMA node0 CPU(s):   0,8
    NUMA node1 CPU(s):   1,9
    NUMA node2 CPU(s):   2,10
    NUMA node3 CPU(s):   3,11
    NUMA node4 CPU(s):   4,12
    NUMA node5 CPU(s):   5,13
    NUMA node6 CPU(s):   6,14
    NUMA node7 CPU(s):   7,15
    Flags:               ...

Wim ten Have (2):
  domain: auto partition guests providing the host NUMA topology
  qemuxml2argv: add tests that exercise vNUMA auto partition topology

 docs/formatdomain.html.in                     |   7 +
 docs/schemas/cputypes.rng                     |   1 +
 src/conf/cpu_conf.c                           |   3 +-
 src/conf/cpu_conf.h                           |   1 +
 src/conf/domain_conf.c                        | 166 ++++++++++++++++++
 .../cpu-host-passthrough-nonuma.args          |  25 +++
 .../cpu-host-passthrough-nonuma.xml           |  18 ++
 .../cpu-host-passthrough-numa.args            |  29 +++
 .../cpu-host-passthrough-numa.xml             |  18 ++
 tests/qemuxml2argvtest.c                      |   2 +
 10 files changed, 269 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-nonuma.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-nonuma.xml
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa.args
 create mode 100644 tests/qemuxml2argvdata/cpu-host-passthrough-numa.xml

-- 
2.17.1

Re: [libvirt] [RFC PATCH auto partition NUMA guest domains v1 0/2] auto partition guests providing the host NUMA topology
Posted by Jiri Denemark 5 years, 6 months ago
On Tue, Sep 25, 2018 at 12:02:40 +0200, Wim Ten Have wrote:
> From: Wim ten Have <wim.ten.have@oracle.com>
> 
> This patch series extends guest domain administration, adding support
> for automatically advertising the NUMA architecture obtained from the
> host capabilities under a guest by creating a vNUMA copy of it.

I'm pretty sure someone would find this useful and such configuration is
perfectly valid in libvirt. But I don't think there is a compelling
reason to add some magic into the domain XML which would automatically
expand to such configuration. It's basically a NUMA placement policy and
libvirt generally tries to avoid including any kind of policies and
rather just provide all the mechanisms and knobs which can be used by
applications to implement any policy they like.

> The mechanism is enabled by setting the check='numa' attribute under
> the CPU 'host-passthrough' topology:
>   <cpu mode='host-passthrough' check='numa' .../>

Anyway, this is definitely not the right place for such an option. The
'check' attribute is described as

    "Since 3.2.0, an optional check attribute can be used to request a
    specific way of checking whether the virtual CPU matches the
    specification."

and the new 'numa' value does not fit in there in any way.
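
For reference, the values 'check' currently accepts are 'none',
'partial' and 'full', all of which describe how strictly the virtual
CPU definition is verified rather than anything about NUMA layout,
e.g.:

    <cpu mode='host-passthrough' check='none'/>
    <cpu mode='host-model' check='partial'/>
    <cpu mode='host-model' check='full'/>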

Moreover the code does the automatic NUMA placement at the moment
libvirt parses the domain XML, which is not the right place since it
would break migration, snapshots, and save/restore features.

We have existing placement attributes for the vcpu and numatune/memory
elements, which would have been a much better place for implementing
such a feature. And even the cpu/numa element could have been enhanced
to support similar configuration.
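
As a minimal sketch with today's XML (illustrative only, not part of
this patch), automatic placement can already be requested through
those attributes, in which case libvirt consults numad for the actual
layout:

    <vcpu placement='auto'>16</vcpu>
    <numatune>
      <memory mode='strict' placement='auto'/>
    </numatune>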

Jirka

Re: [libvirt] [RFC PATCH auto partition NUMA guest domains v1 0/2] auto partition guests providing the host NUMA topology
Posted by Wim ten Have 5 years, 5 months ago
On Tue, 25 Sep 2018 14:37:15 +0200
Jiri Denemark <jdenemar@redhat.com> wrote:

> On Tue, Sep 25, 2018 at 12:02:40 +0200, Wim Ten Have wrote:
> > From: Wim ten Have <wim.ten.have@oracle.com>
> > 
> > This patch series extends guest domain administration, adding support
> > for automatically advertising the NUMA architecture obtained from the
> > host capabilities under a guest by creating a vNUMA copy of it.
> 
> I'm pretty sure someone would find this useful and such configuration is
> perfectly valid in libvirt. But I don't think there is a compelling
> reason to add some magic into the domain XML which would automatically
> expand to such configuration. It's basically a NUMA placement policy and
> libvirt generally tries to avoid including any kind of policies and
> rather just provide all the mechanisms and knobs which can be used by
> applications to implement any policy they like.
> 
> > The mechanism is enabled by setting the check='numa' attribute under
> > the CPU 'host-passthrough' topology:
> >   <cpu mode='host-passthrough' check='numa' .../>  
> 
> Anyway, this is definitely not the right place for such an option. The
> 'check' attribute is described as
> 
>     "Since 3.2.0, an optional check attribute can be used to request a
>     specific way of checking whether the virtual CPU matches the
>     specification."
> 
> and the new 'numa' value does not fit in there in any way.
> 
> Moreover the code does the automatic NUMA placement at the moment
> libvirt parses the domain XML, which is not the right place since it
> would break migration, snapshots, and save/restore features.

  Howdy, thanks for your fast response.  I was Out Of Office for a
  while, unable to reply earlier.  The beef of this code indeed does
  not belong under the domain code and should rather move into the
  NUMA-specific code; check='numa' is simply badly chosen.

  Also, whilst OOO it occurred to me that besides auto partitioning the
  host into a vNUMA replica, there is probably another configuration
  target we may introduce: reserving a single NUMA node out of the
  host's nodes for the guest to configure.
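
  (Purely as a hand-written illustration of that idea with existing
  elements, not necessarily what the reworked RFC will propose: the
  example guest could be confined to host node 0 along these lines.)

      <vcpu placement='static' cpuset='0-14,120-134'>16</vcpu>
      <numatune>
        <memory mode='strict' nodeset='0'/>
      </numatune>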

  So my plan is to come back asap with reworked code.

> We have existing placement attributes for the vcpu and numatune/memory
> elements, which would have been a much better place for implementing
> such a feature. And even the cpu/numa element could have been enhanced
> to support similar configuration.

  Going over the libvirt documentation I am more drawn to the vcpu
  area.  As said, let me rework and return with a better approach/RFC
  asap.

Rgds,
- Wim10H.


> Jirka
