[libvirt] [RFC v2 0/4] LXC with block device and enabled userns

Radostin Stoyanov posted 4 patches 5 years, 9 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20180610111426.5211-1-rstoyanov1@gmail.com
Test syntax-check passed
Posted by Radostin Stoyanov 5 years, 9 months ago
Hi all,

This patch series aims to resolve
https://bugzilla.redhat.com/show_bug.cgi?id=1328946

For background information about the issue see v1 of this RFC.
https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html

The current state of this series enables starting an LXC container with an NBD
file system and user namespace enabled.

However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
https://pastebin.com/raw/y0ycSM0H

The reason for this is that the qemu-nbd process is terminated/killed without
unmounting the container root file system.

This issue has been reported in [1] and [2].
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
[2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html

As a workaround we could unmount the root file system of the container before shutdown.

For example with:
    $ CT_PID=$(pidof libvirt_lxc)
    $ sudo nsenter \
        --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
        /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
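
Before attempting the umount, it may help to check whether the mount point is
visible at all in the mount namespace of the target PID. A minimal sketch,
assuming the standard /proc/PID/mountinfo layout where field 5 is the mount
point; the function name mount_visible is made up for illustration:

```shell
# Succeeds if mount point $2 is listed in the mountinfo file $1
# (for example /proc/$CT_PID/mountinfo). Field 5 of each mountinfo
# line is the mount point. One trailing slash on $2 is ignored.
# mount_visible is a hypothetical helper, not part of libvirt.
mount_visible() {
    awk -v mp="${2%/}" '$5 == mp { found = 1 } END { exit !found }' "$1"
}

# Possible usage (path taken from the example above):
#   mount_visible "/proc/$CT_PID/mountinfo" /var/run/libvirt/lxc/guest.root/
```

If /var/run is a symlink to /run, mountinfo may list the path under /run
instead, so the check may need to be given the resolved path.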

I noticed that we already have the functions lxcContainerUnmountSubtree
and virProcessRunInMountNamespace.

Any suggestions on how to properly implement this?

Thanks,

Radostin Stoyanov (4):
  lxc: Make lxcContainerMountFSBlock non static
  lxc: Move up virLXCControllerAppendNBDPids
  lxc: Mount NBD devices before clone
  lxc: Remove unused lxcContainerPrepareRoot

 src/lxc/lxc_container.c  |  58 +-------------
 src/lxc/lxc_container.h  |   4 +
 src/lxc/lxc_controller.c | 158 +++++++++++++++++++++++----------------
 3 files changed, 97 insertions(+), 123 deletions(-)

--
2.17.1

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [RFC v2 0/4] LXC with block device and enabled userns
Posted by Daniel P. Berrangé 5 years, 9 months ago
On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
> Hi all,
> 
> This patch series aims to resolve
> https://bugzilla.redhat.com/show_bug.cgi?id=1328946
> 
> For background information about the issue see v1 of this RFC.
> https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
> 
> The current state of this series enables starting an LXC container with an NBD
> file system and user namespace enabled.
> 
> However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
> https://pastebin.com/raw/y0ycSM0H
> 
> The reason for this is that the qemu-nbd process is terminated/killed without
> unmounting the container root file system.
> 
> This issue has been reported in [1] and [2].
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html

This is not really a kernel bug at the end of the day. We have a filesystem
backed by NBD block device, and we're killing the NBD block device. So there's
nothing the kernel can really do here if there's outstanding I/O pending at
this time.

There is also this BZ reported against libvirt that has more info:

  https://bugzilla.redhat.com/show_bug.cgi?id=1570902

> As a workaround we could unmount the root file system of the container before shutdown.
> 
> For example with:
>     $ CT_PID=$(pidof libvirt_lxc)
>     $ sudo nsenter \
>         --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
>         /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
> 
> I noticed that we already have the functions lxcContainerUnmountSubtree
> and virProcessRunInMountNamespace.
> 
> Any suggestions on how to properly implement this?

We can't unmount the filesystem directly because we don't have any process
running inside the container's mount namespace at this time. The libvirt_lxc
controller is running in a custom mount namespace that is different from what
the container has.

The first thing we need to do is take qemu-nbd out of the cgroups. This will
ensure that it doesn't get killed at the same time as we're killing off all
the container PIDs. It will also fix the OOM deadlocks we see when the memory
controller prevents qemu-nbd allocating RAM needed to process I/O.

Then, we can kill all processes in the container as normal. Once they are
all gone, we know the kernel will have cleaned up the mount namespace. We
can thus safely kill qemu-nbd at this point.
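
The ordering described above could be sketched roughly as follows. This is
only an illustration in shell, not how libvirt_lxc would implement it: the
function names and all three arguments are placeholders, and it handles a
single cgroup directory, whereas with cgroups v1 each controller hierarchy
would need the same treatment.

```shell
# Move PID $1 into the cgroup directory $2 by writing to its cgroup.procs.
move_pid_to_cgroup() {
    printf '%s\n' "$1" > "$2/cgroup.procs"
}

# Succeeds once no PIDs remain listed in cgroup directory $1.
# The content is read rather than the size tested, because files on
# cgroupfs do not report a meaningful size.
cgroup_is_empty() {
    [ -z "$(cat "$1/cgroup.procs")" ]
}

# $1: qemu-nbd PID, $2: container cgroup dir, $3: cgroup dir outside it.
stop_container_then_nbd() {
    nbd_pid=$1
    container_cgroup=$2
    outside_cgroup=$3

    # 1. Take qemu-nbd out of the container cgroup first.
    move_pid_to_cgroup "$nbd_pid" "$outside_cgroup" || return 1

    # 2. Kill everything still listed in the container cgroup and wait
    #    until it is empty; PIDs that already exited are ignored.
    while ! cgroup_is_empty "$container_cgroup"; do
        for pid in $(cat "$container_cgroup/cgroup.procs"); do
            kill -KILL "$pid" 2>/dev/null
        done
        sleep 1
    done

    # 3. Only now stop qemu-nbd.
    kill -TERM "$nbd_pid"
}
```

The point of the sketch is the order of the three steps: qemu-nbd is still
serving I/O while the container processes are being killed.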

Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
was released (i.e. when the filesystem was unmounted). This is something you can
enable for loopback devices, but I'm not sure it works for NBD. This would
be a useful kernel enhancement if someone feels adventurous.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt] [RFC v2 0/4] LXC with block device and enabled userns
Posted by Radostin Stoyanov 5 years, 9 months ago
On 13/06/18 11:46, Daniel P. Berrangé wrote:
> On Sun, Jun 10, 2018 at 12:14:22PM +0100, Radostin Stoyanov wrote:
>> Hi all,
>>
>> This patch series aims to resolve
>> https://bugzilla.redhat.com/show_bug.cgi?id=1328946
>>
>> For background information about the issue see v1 of this RFC.
>> https://www.redhat.com/archives/libvir-list/2018-April/msg01270.html
>>
>> The current state of this series enables starting an LXC container with an NBD
>> file system and user namespace enabled.
>>
>> However, container shutdown causes "kernel BUG at fs/buffer.c:3058!"
>> https://pastebin.com/raw/y0ycSM0H
>>
>> The reason for this is that the qemu-nbd process is terminated/killed without
>> unmounting the container root file system.
>>
>> This issue has been reported in [1] and [2].
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1356110
>> [2] http://lkml.iu.edu/hypermail/linux/kernel/1509.3/00027.html
> This is not really a kernel bug at the end of the day. We have a filesystem
> backed by NBD block device, and we're killing the NBD block device. So there's
> nothing the kernel can really do here if there's outstanding I/O pending at
> this time.
>
> There is also this BZ reported against libvirt that has more info:
>
>   https://bugzilla.redhat.com/show_bug.cgi?id=1570902
>
>> As a workaround we could unmount the root file system of the container before shutdown.
>>
>> For example with:
>>     $ CT_PID=$(pidof libvirt_lxc)
>>     $ sudo nsenter \
>>         --mount=/proc/$CT_PID/task/$CT_PID/ns/mnt \
>>         /bin/bash -c "umount /var/run/libvirt/lxc/guest.root/"
>>
>> I noticed that we already have the functions lxcContainerUnmountSubtree
>> and virProcessRunInMountNamespace.
>>
>> Any suggestions on how to properly implement this?
> We can't unmount the filesystem directly because we don't have any process
> running inside the container's mount namespace at this time. The libvirt_lxc
> controller is running in a custom mount namespace that is different from what
> the container has.
>
> The first thing we need to do is take qemu-nbd out of the cgroups. This will
> ensure that it doesn't get killed at the same time as we're killing off all
> the container PIDs. It will also fix the OOM deadlocks we see when the memory
> controller prevents qemu-nbd allocating RAM needed to process I/O.
>
> Then, we can kill all processes in the container as normal. Once they are
> all gone, we know the kernel will have cleaned up the mount namespace. We
> can thus safely kill qemu-nbd at this point.
Thank you for the pointers!
> Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
> was released (i.e. when the filesystem was unmounted). This is something you can
> enable for loopback devices, but I'm not sure it works for NBD. This would
> be a useful kernel enhancement if someone feels adventurous.
It seems like qemu-nbd terminates automatically when the last client
disconnects.

https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4ee5bf362533406ce2494;hb=HEAD#l341

I will send a patch that takes qemu-nbd out of the cgroups and
disconnects qemu-nbd on container shutdown.
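
The disconnect step might look roughly like the sketch below. It assumes
that the kernel exposes /sys/block/nbdN/pid only while the device is
connected, and it should only run once the file system on the device is no
longer mounted. The function names are made up for illustration.

```shell
# Succeeds if the NBD device named $1 (e.g. nbd0) appears connected.
# Assumption: <sysfs>/block/<dev>/pid exists only while connected.
# $2 optionally overrides the sysfs root, which is handy for dry runs.
nbd_is_connected() {
    [ -e "${2:-/sys}/block/$1/pid" ]
}

# Disconnect /dev/$1 if it is connected. The qemu-nbd process serving
# the device should then exit on its own.
nbd_disconnect_if_connected() {
    if nbd_is_connected "$1" "${2:-}"; then
        qemu-nbd --disconnect "/dev/$1"
    fi
}

# Possible usage after all container processes are gone:
#   nbd_disconnect_if_connected nbd0
```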

Radostin


Re: [libvirt] [RFC v2 0/4] LXC with block device and enabled userns
Posted by Daniel P. Berrangé 5 years, 9 months ago
On Wed, Jun 13, 2018 at 03:18:02PM +0100, Radostin Stoyanov wrote:

> > Ideally qemu-nbd would automatically exit when the last use of /dev/nbdNNN
> > was released (i.e. when the filesystem was unmounted). This is something you can
> > enable for loopback devices, but I'm not sure it works for NBD. This would
> > be a useful kernel enhancement if someone feels adventurous.
> It seems like qemu-nbd terminates automatically when the last client
> disconnects.

Right, but in the case where the kernel has connected qemu-nbd to a block
device, the kernel itself is a client that won't exit.

So what I was describing was that the kernel should automatically delete
the /dev/nbd0 device when the last mount was unmounted, and thus in turn
it can close the last client of qemu-nbd.

> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-nbd.c;h=51b9d38c72732c821cb4ee5bf362533406ce2494;hb=HEAD#l341
> 
> I will send a patch that takes qemu-nbd out of the cgroups and
> disconnects qemu-nbd on container shutdown.
> 
> Radostin
> 

Regards,
Daniel
