[libvirt PATCH 0/4] src: add configurable support for cgroups usage

Daniel P. Berrangé posted 4 patches 2 weeks ago
Test syntax-check failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20200320134014.3123811-1-berrange@redhat.com
docs/formatdomain.html.in                 | 24 +++++++-
docs/schemas/domaincommon.rng             | 17 +++++-
src/conf/domain_conf.c                    | 46 ++++++++++++----
src/conf/domain_conf.h                    | 13 +++++
src/libvirt_private.syms                  |  2 +
src/lxc/lxc_cgroup.c                      | 32 +++++++++++
src/lxc/lxc_cgroup.h                      |  3 +
src/lxc/lxc_process.c                     |  7 ++-
src/qemu/qemu_cgroup.c                    | 67 +++++++++++++++++++++--
src/qemu/qemu_command.c                   |  9 +--
src/util/vircgroup.c                      | 57 ++++++++++++-------
src/util/vircgroup.h                      |  7 +++
tests/lxcxml2xmldata/lxc-capabilities.xml |  2 +-
tests/lxcxml2xmldata/lxc-idmap.xml        |  2 +-
14 files changed, 239 insertions(+), 49 deletions(-)

[libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Daniel P. Berrangé 2 weeks ago
This simple series allows apps to choose how cgroups are managed by
libvirt, between cgroupfs and machined, or disabled entirely.

Daniel P. Berrangé (4):
  conf: allow different resource registration modes
  util: allow choice between machined or direct cgroups filesystem
  qemu: wire up support for controlling use of cgroups backend
  lxc: wire up support for controlling use of cgroups backend

 docs/formatdomain.html.in                 | 24 +++++++-
 docs/schemas/domaincommon.rng             | 17 +++++-
 src/conf/domain_conf.c                    | 46 ++++++++++++----
 src/conf/domain_conf.h                    | 13 +++++
 src/libvirt_private.syms                  |  2 +
 src/lxc/lxc_cgroup.c                      | 32 +++++++++++
 src/lxc/lxc_cgroup.h                      |  3 +
 src/lxc/lxc_process.c                     |  7 ++-
 src/qemu/qemu_cgroup.c                    | 67 +++++++++++++++++++++--
 src/qemu/qemu_command.c                   |  9 +--
 src/util/vircgroup.c                      | 57 ++++++++++++-------
 src/util/vircgroup.h                      |  7 +++
 tests/lxcxml2xmldata/lxc-capabilities.xml |  2 +-
 tests/lxcxml2xmldata/lxc-idmap.xml        |  2 +-
 14 files changed, 239 insertions(+), 49 deletions(-)

-- 
2.24.1

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Pavel Hrdina 2 weeks ago
On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> This simple series allows apps to choose how cgroups are managed by
> libvirt, between cgroupfs and machined, or disabled entirely.

I'm not so sure about this series.  The situation with cgroups and
systemd is a bit more complex then the current code handles.

There is an existing issue where we are violating the delegation rules
described by cgroups and systemd.

Currently the "cgroupfs" approach is used only on non-systemd hosts and
we should keep it that way.  Libvirt is not allowed to mangle with
cgroups owned by systemd so IMHO the warning in configuration file is
not enough because without delegation cgroups will not work properly in
libvirt.

One example is that without delegation the VM cgroups would not have any
controllers enabled by default.

Having an option to completely disable cgroups is OK but that's the only
thing I would expose for now.

When using machined the current topology looks like this (only the files
that we use are listed):

/sys/fs/cgroup/machine.slice/
└── machine-qemu\x2d1\x2dcentos8.scope
    ├── emulator
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu0
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu1
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu2
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── vcpu3
    │   ├── cgroup.controllers
    │   ├── cgroup.threads
    │   ├── cgroup.type
    │   ├── cpuset.cpus
    │   └── cpuset.mems
    ├── cgroup.controllers
    ├── cgroup.procs
    ├── cgroup.subtree_control
    ├── cgroup.type
    ├── cpu.max
    ├── cpu.stat
    ├── cpu.weight
    ├── io.bfq.weight
    ├── io.max
    ├── io.stat
    ├── io.weight
    ├── memory.high
    ├── memory.max
    ├── memory.stat
    └── memory.swap.max

which is incorrect.  The group "machine-qemu\x2d1\x2dcentos8.scope" is
marked as delegated which means we can create new sub-cgroups there and
that we can access only these files directly in that group:

    cgroup.subtree_control - to enable cgroup controllers for our
                             sub-cgroups

    cgroup.procs - to move processes around into our sub-cgroups

Other files are off limit and we should not touch them.  In addition
systemd may create its own sub-cgroups and it would fail to do so
because there is another limitation from kernel that processes can live
only in the leaves except for the root cgroup.

Exactly that is happening right now and it was discovered by user
reporting a BUG that systemctl daemon-reloaed changed the values in
cpu.shares because it owns that file and it was our fault to change it.

Currently we use these cgroups files:

    cgroup.type

        - to enable threaded mode that is required for vcpu, emulator 
          and iothread pinning.  We should not do that in the delegated
          cgroup because we may break systemd, this has to be moved to
          a sub-cgroup together with the pinning.

    io.weight / io.bfq.weight, cpu.weight

        - These cannot be moved into the sub-cgroup because it would
          lose the effect and would be completely pointless to set it.
          All the *.weight files works in a way that the resources
          available to parent are distributed using the weight to the
          children.  If we would move it to the sub-cgroup there would
          be only single child all the time.  Therefore we have to use
          D-Bus to talk to systemd to have these values configured
          directly for the delegated cgroup.

    io.max, memory.max, memory.high, memory.swap.max, cpu.max

        - All of these can be safely set in the sub-cgroup that we will
          have to create as they are absolute limits and they are not
          affected by siblings.

    cpuset.mems, cpuset.cpus

        - These are used together with the threaded model and can also
          be safely set in the sub-cgroup.

Based on all of the above this is the new topology that has to be
created by libvirt if systemd is present:

/sys/fs/cgroup/machine.slice/
└── machine-qemu\x2d1\x2dcentos8.scope
    ├── libvirt-vm (or something else)
    │   ├── emulator
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu0
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu1
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu2
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── vcpu3
    │   │   ├── cgroup.controllers
    │   │   ├── cgroup.threads
    │   │   ├── cgroup.type
    │   │   ├── cpuset.cpus
    │   │   └── cpuset.mems
    │   ├── cgroup.controllers
    │   ├── cgroup.procs
    │   ├── cgroup.subtree_control
    │   ├── cgroup.type
    │   ├── cpu.max
    │   ├── cpu.stat
    │   ├── io.max
    │   ├── io.stat
    │   ├── memory.high
    │   ├── memory.max
    │   ├── memory.stat
    │   └── memory.swap.max
    ├── cgroup.procs
    ├── cgroup.subtree_control
    ├── cpu.weight          (only via systemd using D-Bus)
    ├── io.bfq.weight       (only via systemd using D-Bus)
    └── io.weight           (only via systemd using D-Bus)

Pavel

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Daniel P. Berrangé 2 weeks ago
On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
> On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> > This simple series allows apps to choose how cgroups are managed by
> > libvirt, between cgroupfs and machined, or disabled entirely.
> 
> I'm not so sure about this series.  The situation with cgroups and
> systemd is a bit more complex then the current code handles.
> 
> There is an existing issue where we are violating the delegation rules
> described by cgroups and systemd.
> 
> Currently the "cgroupfs" approach is used only on non-systemd hosts and
> we should keep it that way.  Libvirt is not allowed to mangle with
> cgroups owned by systemd so IMHO the warning in configuration file is
> not enough because without delegation cgroups will not work properly in
> libvirt.

That isn't the case currently AFAICT from current code.

Before this series, the virCgroupNewMachine method will first try
systemd and then fallback to directly cgroupfs.

This fallback can happen on a systemd host, when machined is not
installed, as machined is an optional component.

If we need to mandate use of systemd on systemd hosts, then our
existing code is broken and needs fixing. 

I'm happy todo such a fix, and then adjust this series to take
account of it. Essentially we'd allow apps to specify 'cgroupfs'
or 'machined' but we'd enforce they only make safe choices.

ie on a systemd host we'd only allow 'none' or 'machined'

On a non-systemd host, or on a systemd host where we've
been delegated a subtree, we'd only allow 'none' or 'cgroupfs'

> One example is that without delegation the VM cgroups would not have any
> controllers enabled by default.
> 
> Having an option to completely disable cgroups is OK but that's the only
> thing I would expose for now.
> 
> When using machined the current topology looks like this (only the files
> that we use are listed):
> 
> /sys/fs/cgroup/machine.slice/
> └── machine-qemu\x2d1\x2dcentos8.scope
>     ├── emulator
>     │   ├── cgroup.controllers
>     │   ├── cgroup.threads
>     │   ├── cgroup.type
>     │   ├── cpuset.cpus
>     │   └── cpuset.mems
>     ├── vcpu0
>     │   ├── cgroup.controllers
>     │   ├── cgroup.threads
>     │   ├── cgroup.type
>     │   ├── cpuset.cpus
>     │   └── cpuset.mems
>     ├── vcpu1
>     │   ├── cgroup.controllers
>     │   ├── cgroup.threads
>     │   ├── cgroup.type
>     │   ├── cpuset.cpus
>     │   └── cpuset.mems
>     ├── vcpu2
>     │   ├── cgroup.controllers
>     │   ├── cgroup.threads
>     │   ├── cgroup.type
>     │   ├── cpuset.cpus
>     │   └── cpuset.mems
>     ├── vcpu3
>     │   ├── cgroup.controllers
>     │   ├── cgroup.threads
>     │   ├── cgroup.type
>     │   ├── cpuset.cpus
>     │   └── cpuset.mems
>     ├── cgroup.controllers
>     ├── cgroup.procs
>     ├── cgroup.subtree_control
>     ├── cgroup.type
>     ├── cpu.max
>     ├── cpu.stat
>     ├── cpu.weight
>     ├── io.bfq.weight
>     ├── io.max
>     ├── io.stat
>     ├── io.weight
>     ├── memory.high
>     ├── memory.max
>     ├── memory.stat
>     └── memory.swap.max
> 
> which is incorrect.  The group "machine-qemu\x2d1\x2dcentos8.scope" is
> marked as delegated which means we can create new sub-cgroups there and
> that we can access only these files directly in that group:
> 
>     cgroup.subtree_control - to enable cgroup controllers for our
>                              sub-cgroups
> 
>     cgroup.procs - to move processes around into our sub-cgroups
> 
> Other files are off limit and we should not touch them.  In addition
> systemd may create its own sub-cgroups and it would fail to do so
> because there is another limitation from kernel that processes can live
> only in the leaves except for the root cgroup.
> 
> Exactly that is happening right now and it was discovered by user
> reporting a BUG that systemctl daemon-reloaed changed the values in
> cpu.shares because it owns that file and it was our fault to change it.
> 
> Currently we use these cgroups files:
> 
>     cgroup.type
> 
>         - to enable threaded mode that is required for vcpu, emulator 
>           and iothread pinning.  We should not do that in the delegated
>           cgroup because we may break systemd, this has to be moved to
>           a sub-cgroup together with the pinning.
> 
>     io.weight / io.bfq.weight, cpu.weight
> 
>         - These cannot be moved into the sub-cgroup because it would
>           lose the effect and would be completely pointless to set it.
>           All the *.weight files works in a way that the resources
>           available to parent are distributed using the weight to the
>           children.  If we would move it to the sub-cgroup there would
>           be only single child all the time.  Therefore we have to use
>           D-Bus to talk to systemd to have these values configured
>           directly for the delegated cgroup.
> 
>     io.max, memory.max, memory.high, memory.swap.max, cpu.max
> 
>         - All of these can be safely set in the sub-cgroup that we will
>           have to create as they are absolute limits and they are not
>           affected by siblings.
> 
>     cpuset.mems, cpuset.cpus
> 
>         - These are used together with the threaded model and can also
>           be safely set in the sub-cgroup.
> 
> Based on all of the above this is the new topology that has to be
> created by libvirt if systemd is present:
> 
> /sys/fs/cgroup/machine.slice/
> └── machine-qemu\x2d1\x2dcentos8.scope
>     ├── libvirt-vm (or something else)

So IIUC, we can allow  'cgroupfs' on a systemd host, provided that
the path for the <partition> points to a sub-tree that is delegated.

>     │   ├── emulator
>     │   │   ├── cgroup.controllers
>     │   │   ├── cgroup.threads
>     │   │   ├── cgroup.type
>     │   │   ├── cpuset.cpus
>     │   │   └── cpuset.mems
>     │   ├── vcpu0
>     │   │   ├── cgroup.controllers
>     │   │   ├── cgroup.threads
>     │   │   ├── cgroup.type
>     │   │   ├── cpuset.cpus
>     │   │   └── cpuset.mems
>     │   ├── vcpu1
>     │   │   ├── cgroup.controllers
>     │   │   ├── cgroup.threads
>     │   │   ├── cgroup.type
>     │   │   ├── cpuset.cpus
>     │   │   └── cpuset.mems
>     │   ├── vcpu2
>     │   │   ├── cgroup.controllers
>     │   │   ├── cgroup.threads
>     │   │   ├── cgroup.type
>     │   │   ├── cpuset.cpus
>     │   │   └── cpuset.mems
>     │   ├── vcpu3
>     │   │   ├── cgroup.controllers
>     │   │   ├── cgroup.threads
>     │   │   ├── cgroup.type
>     │   │   ├── cpuset.cpus
>     │   │   └── cpuset.mems
>     │   ├── cgroup.controllers
>     │   ├── cgroup.procs
>     │   ├── cgroup.subtree_control
>     │   ├── cgroup.type
>     │   ├── cpu.max
>     │   ├── cpu.stat
>     │   ├── io.max
>     │   ├── io.stat
>     │   ├── memory.high
>     │   ├── memory.max
>     │   ├── memory.stat
>     │   └── memory.swap.max
>     ├── cgroup.procs
>     ├── cgroup.subtree_control
>     ├── cpu.weight          (only via systemd using D-Bus)
>     ├── io.bfq.weight       (only via systemd using D-Bus)
>     └── io.weight           (only via systemd using D-Bus)
> 
> Pavel



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Pavel Hrdina 2 weeks ago
On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
> On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
> > On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> > > This simple series allows apps to choose how cgroups are managed by
> > > libvirt, between cgroupfs and machined, or disabled entirely.
> > 
> > I'm not so sure about this series.  The situation with cgroups and
> > systemd is a bit more complex then the current code handles.
> > 
> > There is an existing issue where we are violating the delegation rules
> > described by cgroups and systemd.
> > 
> > Currently the "cgroupfs" approach is used only on non-systemd hosts and
> > we should keep it that way.  Libvirt is not allowed to mangle with
> > cgroups owned by systemd so IMHO the warning in configuration file is
> > not enough because without delegation cgroups will not work properly in
> > libvirt.
> 
> That isn't the case currently AFAICT from current code.
> 
> Before this series, the virCgroupNewMachine method will first try
> systemd and then fallback to directly cgroupfs.
> 
> This fallback can happen on a systemd host, when machined is not
> installed, as machined is an optional component.

Right, I should have checked the code.

> If we need to mandate use of systemd on systemd hosts, then our
> existing code is broken and needs fixing. 

I think we should do that because without delegation we should not touch
anything in cgroups.

> I'm happy todo such a fix, and then adjust this series to take
> account of it. Essentially we'd allow apps to specify 'cgroupfs'
> or 'machined' but we'd enforce they only make safe choices.
> 
> ie on a systemd host we'd only allow 'none' or 'machined'
> 
> On a non-systemd host, or on a systemd host where we've
> been delegated a subtree, we'd only allow 'none' or 'cgroupfs'

This sounds reasonable, I was thinking about the check if we have
delegated subtree but decided not to mention it.  The only place where
we can check the delegation is using systemd API, probably over D-Bus.

It's not reflected anywhere in the cgroup files.

Pavel

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Daniel P. Berrangé 2 weeks ago
On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
> On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
> > On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
> > > On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> > > > This simple series allows apps to choose how cgroups are managed by
> > > > libvirt, between cgroupfs and machined, or disabled entirely.
> > > 
> > > I'm not so sure about this series.  The situation with cgroups and
> > > systemd is a bit more complex then the current code handles.
> > > 
> > > There is an existing issue where we are violating the delegation rules
> > > described by cgroups and systemd.
> > > 
> > > Currently the "cgroupfs" approach is used only on non-systemd hosts and
> > > we should keep it that way.  Libvirt is not allowed to mangle with
> > > cgroups owned by systemd so IMHO the warning in configuration file is
> > > not enough because without delegation cgroups will not work properly in
> > > libvirt.
> > 
> > That isn't the case currently AFAICT from current code.
> > 
> > Before this series, the virCgroupNewMachine method will first try
> > systemd and then fallback to directly cgroupfs.
> > 
> > This fallback can happen on a systemd host, when machined is not
> > installed, as machined is an optional component.
> 
> Right, I should have checked the code.
> 
> > If we need to mandate use of systemd on systemd hosts, then our
> > existing code is broken and needs fixing. 
> 
> I think we should do that because without delegation we should not touch
> anything in cgroups.
> 
> > I'm happy todo such a fix, and then adjust this series to take
> > account of it. Essentially we'd allow apps to specify 'cgroupfs'
> > or 'machined' but we'd enforce they only make safe choices.
> > 
> > ie on a systemd host we'd only allow 'none' or 'machined'
> > 
> > On a non-systemd host, or on a systemd host where we've
> > been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
> 
> This sounds reasonable, I was thinking about the check if we have
> delegated subtree but decided not to mention it.  The only place where
> we can check the delegation is using systemd API, probably over D-Bus.
> 
> It's not reflected anywhere in the cgroup files.

The key scenario for cgroupfs would be if libvirtd was run inside a
container on a host that otherwise uses systemd. IIUC, the container
should get given a delegated subtree and we need to be able to use
cgroupfs backend here. Currently this works (accidentally) because
we'll try to talk to machined and fail, so fallback to direct usage.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Pavel Hrdina 2 weeks ago
On Fri, Mar 20, 2020 at 03:59:41PM +0000, Daniel P. Berrangé wrote:
> On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
> > On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
> > > On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
> > > > On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> > > > > This simple series allows apps to choose how cgroups are managed by
> > > > > libvirt, between cgroupfs and machined, or disabled entirely.
> > > > 
> > > > I'm not so sure about this series.  The situation with cgroups and
> > > > systemd is a bit more complex then the current code handles.
> > > > 
> > > > There is an existing issue where we are violating the delegation rules
> > > > described by cgroups and systemd.
> > > > 
> > > > Currently the "cgroupfs" approach is used only on non-systemd hosts and
> > > > we should keep it that way.  Libvirt is not allowed to mangle with
> > > > cgroups owned by systemd so IMHO the warning in configuration file is
> > > > not enough because without delegation cgroups will not work properly in
> > > > libvirt.
> > > 
> > > That isn't the case currently AFAICT from current code.
> > > 
> > > Before this series, the virCgroupNewMachine method will first try
> > > systemd and then fallback to directly cgroupfs.
> > > 
> > > This fallback can happen on a systemd host, when machined is not
> > > installed, as machined is an optional component.
> > 
> > Right, I should have checked the code.
> > 
> > > If we need to mandate use of systemd on systemd hosts, then our
> > > existing code is broken and needs fixing. 
> > 
> > I think we should do that because without delegation we should not touch
> > anything in cgroups.
> > 
> > > I'm happy todo such a fix, and then adjust this series to take
> > > account of it. Essentially we'd allow apps to specify 'cgroupfs'
> > > or 'machined' but we'd enforce they only make safe choices.
> > > 
> > > ie on a systemd host we'd only allow 'none' or 'machined'
> > > 
> > > On a non-systemd host, or on a systemd host where we've
> > > been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
> > 
> > This sounds reasonable, I was thinking about the check if we have
> > delegated subtree but decided not to mention it.  The only place where
> > we can check the delegation is using systemd API, probably over D-Bus.
> > 
> > It's not reflected anywhere in the cgroup files.
> 
> The key scenario for cgroupfs would be if libvirtd was run inside a
> container on a host that otherwise uses systemd. IIUC, the container
> should get given a delegated subtree and we need to be able to use
> cgroupfs backend here. Currently this works (accidentally) because
> we'll try to talk to machined and fail, so fallback to direct usage.

That should be OK and we can support this even outside of containers if
we can verify that the cgroup subtree is delegated and it's safe to use
it for VMs.

The wrong usage of delegation by libvirt is the source of the reported
BUG so we should be restrictive.

Pavel

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Daniel P. Berrangé 2 weeks ago
On Fri, Mar 20, 2020 at 05:42:57PM +0100, Pavel Hrdina wrote:
> On Fri, Mar 20, 2020 at 03:59:41PM +0000, Daniel P. Berrangé wrote:
> > On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
> > > On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
> > > > On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
> > > > > On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> > > > > > This simple series allows apps to choose how cgroups are managed by
> > > > > > libvirt, between cgroupfs and machined, or disabled entirely.
> > > > > 
> > > > > I'm not so sure about this series.  The situation with cgroups and
> > > > > systemd is a bit more complex then the current code handles.
> > > > > 
> > > > > There is an existing issue where we are violating the delegation rules
> > > > > described by cgroups and systemd.
> > > > > 
> > > > > Currently the "cgroupfs" approach is used only on non-systemd hosts and
> > > > > we should keep it that way.  Libvirt is not allowed to mangle with
> > > > > cgroups owned by systemd so IMHO the warning in configuration file is
> > > > > not enough because without delegation cgroups will not work properly in
> > > > > libvirt.
> > > > 
> > > > That isn't the case currently AFAICT from current code.
> > > > 
> > > > Before this series, the virCgroupNewMachine method will first try
> > > > systemd and then fallback to directly cgroupfs.
> > > > 
> > > > This fallback can happen on a systemd host, when machined is not
> > > > installed, as machined is an optional component.
> > > 
> > > Right, I should have checked the code.
> > > 
> > > > If we need to mandate use of systemd on systemd hosts, then our
> > > > existing code is broken and needs fixing. 
> > > 
> > > I think we should do that because without delegation we should not touch
> > > anything in cgroups.
> > > 
> > > > I'm happy todo such a fix, and then adjust this series to take
> > > > account of it. Essentially we'd allow apps to specify 'cgroupfs'
> > > > or 'machined' but we'd enforce they only make safe choices.
> > > > 
> > > > ie on a systemd host we'd only allow 'none' or 'machined'
> > > > 
> > > > On a non-systemd host, or on a systemd host where we've
> > > > been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
> > > 
> > > This sounds reasonable, I was thinking about the check if we have
> > > delegated subtree but decided not to mention it.  The only place where
> > > we can check the delegation is using systemd API, probably over D-Bus.
> > > 
> > > It's not reflected anywhere in the cgroup files.
> > 
> > The key scenario for cgroupfs would be if libvirtd was run inside a
> > container on a host that otherwise uses systemd. IIUC, the container
> > should get given a delegated subtree and we need to be able to use
> > cgroupfs backend here. Currently this works (accidentally) because
> > we'll try to talk to machined and fail, so fallback to direct usage.
> 
> That should be OK and we can support this even outside of containers if
> we can verify that the cgroup subtree is delegated and it's safe to use
> it for VMs.
> 
> The wrong usage of delegation by libvirt is the source of the reported
> BUG so we should be restrictive.

Do you have a pointer to the bugs you've seen in this area. In particular
I'm wondering if there's a need for different behaviour with v1 vs v2
cgroups. Systemd has always wanted proper delegation behaviour from apps,
but IIUC, this was mostly a nice-to-have under v1, compared to a must-have
under v2.

My big concern is that we put in a restriction and accidentally break
someone's deployment scenario under v1.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt PATCH 0/4] src: add configurable support for cgroups usage

Posted by Pavel Hrdina 2 weeks ago
On Fri, Mar 20, 2020 at 05:01:40PM +0000, Daniel P. Berrangé wrote:
> On Fri, Mar 20, 2020 at 05:42:57PM +0100, Pavel Hrdina wrote:
> > On Fri, Mar 20, 2020 at 03:59:41PM +0000, Daniel P. Berrangé wrote:
> > > On Fri, Mar 20, 2020 at 04:48:58PM +0100, Pavel Hrdina wrote:
> > > > On Fri, Mar 20, 2020 at 03:39:58PM +0000, Daniel P. Berrangé wrote:
> > > > > On Fri, Mar 20, 2020 at 04:30:07PM +0100, Pavel Hrdina wrote:
> > > > > > On Fri, Mar 20, 2020 at 01:40:10PM +0000, Daniel P. Berrangé wrote:
> > > > > > > This simple series allows apps to choose how cgroups are managed by
> > > > > > > libvirt, between cgroupfs and machined, or disabled entirely.
> > > > > > 
> > > > > > I'm not so sure about this series.  The situation with cgroups and
> > > > > > systemd is a bit more complex then the current code handles.
> > > > > > 
> > > > > > There is an existing issue where we are violating the delegation rules
> > > > > > described by cgroups and systemd.
> > > > > > 
> > > > > > Currently the "cgroupfs" approach is used only on non-systemd hosts and
> > > > > > we should keep it that way.  Libvirt is not allowed to mangle with
> > > > > > cgroups owned by systemd so IMHO the warning in configuration file is
> > > > > > not enough because without delegation cgroups will not work properly in
> > > > > > libvirt.
> > > > > 
> > > > > That isn't the case currently AFAICT from current code.
> > > > > 
> > > > > Before this series, the virCgroupNewMachine method will first try
> > > > > systemd and then fallback to directly cgroupfs.
> > > > > 
> > > > > This fallback can happen on a systemd host, when machined is not
> > > > > installed, as machined is an optional component.
> > > > 
> > > > Right, I should have checked the code.
> > > > 
> > > > > If we need to mandate use of systemd on systemd hosts, then our
> > > > > existing code is broken and needs fixing. 
> > > > 
> > > > I think we should do that because without delegation we should not touch
> > > > anything in cgroups.
> > > > 
> > > > > I'm happy todo such a fix, and then adjust this series to take
> > > > > account of it. Essentially we'd allow apps to specify 'cgroupfs'
> > > > > or 'machined' but we'd enforce they only make safe choices.
> > > > > 
> > > > > ie on a systemd host we'd only allow 'none' or 'machined'
> > > > > 
> > > > > On a non-systemd host, or on a systemd host where we've
> > > > > been delegated a subtree, we'd only allow 'none' or 'cgroupfs'
> > > > 
> > > > This sounds reasonable, I was thinking about the check if we have
> > > > delegated subtree but decided not to mention it.  The only place where
> > > > we can check the delegation is using systemd API, probably over D-Bus.
> > > > 
> > > > It's not reflected anywhere in the cgroup files.
> > > 
> > > The key scenario for cgroupfs would be if libvirtd was run inside a
> > > container on a host that otherwise uses systemd. IIUC, the container
> > > should get given a delegated subtree and we need to be able to use
> > > cgroupfs backend here. Currently this works (accidentally) because
> > > we'll try to talk to machined and fail, so fallback to direct usage.
> > 
> > That should be OK and we can support this even outside of containers if
> > we can verify that the cgroup subtree is delegated and it's safe to use
> > it for VMs.
> > 
> > The wrong usage of delegation by libvirt is the source of the reported
> > BUG so we should be restrictive.
> 
> Do you have a pointer to the bugs you've seen in this area. In particular
> I'm wondering if there's a need for different behaviour with v1 vs v2
> cgroups. Systemd has always wanted proper delegation behaviour from apps,
> but IIUC, this was mostly a nice-to-have under v1, compared to a must-have
> under v2.

Sure, the first BZ was actually reported on RHEL-7:

https://bugzilla.redhat.com/show_bug.cgi?id=1789824

and we have clones for RHEL-8 and RHEL-AV-8, the issue is present with
both v1 and v2.

It is nice to have with v1 because nothing is really enforced, not even
the no-processes-in-inner-nodes rule.  In addition cgroups v1 are broken
in many ways.

The only difference that I can remember now is that with cgroups v2 we
cannot enable controllers that are not supported by systemd because of
the unified hierarchy and the fact that we should not touch the
cgroups.subtree_control file.  In cgroups v1 where each controller is
usually mounted separately we are free to do whatever we like with the
controllers not managed by systemd.  I have to check the code, but we
probably violate the restriction with v2 when we fallback to the
non-machined codepath.

> My big concern is that we put in a restriction and accidentally break
> someone's deployment scenario under v1.

That might happen and unfortunately they should fix their scenario as
it might not work as they are expecting as systemd is the owner of
non-delegated cgroups in both v1 and v2 as the BZ demonstrates.

Pavel