Picking up Stefano's v1 [1], this series adds netns support to
vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
namespaces, deferring that for future implementation and discussion.
Any vsock created with /dev/vhost-vsock is a global vsock, accessible
from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
"scoped" vsock, accessible only to sockets in its namespace. If a global
vsock and a scoped vsock share the same CID, the scoped vsock takes
precedence.
If a socket in a namespace connects with a global vsock, the CID becomes
unavailable to any VMM in that namespace when creating new vsocks. If
disconnected, the CID becomes available again.
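To make the precedence rule concrete, here is a tiny self-contained toy
model (not code from the series; the table, helper, and namespace ids are
invented purely for illustration): a lookup prefers a vsock bound in the
caller's namespace and only falls back to the global address space.

  #include <stddef.h>
  #include <stdio.h>

  /* Toy model: netns_id == 0 stands for the global address space
   * (/dev/vhost-vsock), non-zero for a scoped vsock created via
   * /dev/vhost-vsock-netns. */
  struct toy_vsock {
          unsigned int cid;
          unsigned int netns_id;
  };

  static const struct toy_vsock *toy_lookup(const struct toy_vsock *tbl,
                                            size_t n, unsigned int cid,
                                            unsigned int caller_ns)
  {
          const struct toy_vsock *global_match = NULL;

          for (size_t i = 0; i < n; i++) {
                  if (tbl[i].cid != cid)
                          continue;
                  if (tbl[i].netns_id == caller_ns)
                          return &tbl[i];         /* scoped match wins */
                  if (tbl[i].netns_id == 0)
                          global_match = &tbl[i]; /* remember global fallback */
          }
          return global_match;
  }

  int main(void)
  {
          const struct toy_vsock tbl[] = {
                  { .cid = 15, .netns_id = 0 },   /* global VM with CID 15 */
                  { .cid = 15, .netns_id = 1 },   /* scoped VM in ns1, also CID 15 */
          };

          /* ns1 resolves CID 15 to its scoped VM; ns2 falls back to the global VM. */
          printf("ns1 -> netns %u\n", toy_lookup(tbl, 2, 15, 1)->netns_id);
          printf("ns2 -> netns %u\n", toy_lookup(tbl, 2, 15, 2)->netns_id);
          return 0;
  }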
Testing
QEMU with /dev/vhost-vsock-netns support:
https://github.com/beshleman/qemu/tree/vsock-netns
Test: Scoped vsocks isolated by namespace
host# ip netns add ns1
host# ip netns add ns2
host# ip netns exec ns1 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE1} \
-device vhost-vsock-pci,netns=on,guest-cid=15
host# ip netns exec ns2 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,netns=on,guest-cid=15
host# socat - VSOCK-CONNECT:15:1234
2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
vm1# socat - VSOCK-LISTEN:1234
foobar1
vm2# socat - VSOCK-LISTEN:1234
foobar2
Test: Global vsocks accessible to any namespace
host# qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,guest-cid=15,netns=off
host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
vm# socat - VSOCK-LISTEN:1234
foobar
Test: Connecting to global vsock makes CID unavailable to namespace
host# qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,guest-cid=15,netns=off
vm# socat - VSOCK-LISTEN:1234
host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
host# ip netns exec ns1 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE1} \
-device vhost-vsock-pci,netns=on,guest-cid=15
qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
---
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Stefano Garzarella (3):
vsock: add network namespace support
vsock/virtio_transport_common: handle netns of received packets
vhost/vsock: use netns of process that opens the vhost-vsock-netns device
drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
include/linux/miscdevice.h | 1 +
include/linux/virtio_vsock.h | 2 +
include/net/af_vsock.h | 10 ++--
net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 5 +-
net/vmw_vsock/virtio_transport_common.c | 14 ++++-
net/vmw_vsock/vmci_transport.c | 4 +-
net/vmw_vsock/vsock_loopback.c | 4 +-
10 files changed, 180 insertions(+), 43 deletions(-)
---
base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
change-id: 20250312-vsock-netns-45da9424f726
Best regards,
--
Bobby Eshleman <bobbyeshleman@gmail.com>
CCing Daniel
On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
>Picking up Stefano's v1 [1], this series adds netns support to
>vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
>namespaces, defering that for future implementation and discussion.
>
>Any vsock created with /dev/vhost-vsock is a global vsock, accessible
>from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
>"scoped" vsock, accessible only to sockets in its namespace. If a global
>vsock or scoped vsock share the same CID, the scoped vsock takes
>precedence.
>
>If a socket in a namespace connects with a global vsock, the CID becomes
>unavailable to any VMM in that namespace when creating new vsocks. If
>disconnected, the CID becomes available again.
I was talking about this feature with Daniel and he pointed out
something interesting (Daniel please feel free to correct me):
If we have a process in the host that does a listen(AF_VSOCK) in a
namespace, can this receive connections from guests connected to
/dev/vhost-vsock in any namespace?
Should we provide something (e.g. sysctl/sysfs entry) to disable
this behaviour, preventing a process in a namespace from receiving
connections from the global vsock address space (i.e.
/dev/vhost-vsock VMs)?
I understand that by default maybe we should allow this behaviour in
order to not break current applications, but in some cases the user may
want to isolate sockets in a namespace also from being accessed by VMs
running in the global vsock address space.
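For concreteness, the listener in that scenario is just an ordinary
AF_VSOCK server like the sketch below (standard AF_VSOCK socket API; the
port number is arbitrary, and it would be run under e.g. "ip netns exec
ns1" to place it in a namespace):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>

  int main(void)
  {
          struct sockaddr_vm addr;
          int fd, conn;

          fd = socket(AF_VSOCK, SOCK_STREAM, 0);

          memset(&addr, 0, sizeof(addr));
          addr.svm_family = AF_VSOCK;
          addr.svm_cid = VMADDR_CID_ANY;
          addr.svm_port = 1234;

          if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
              listen(fd, 1) < 0) {
                  perror("vsock listen");
                  return 1;
          }

          /* The open question: when this runs inside a netns, can the
           * connection come from a guest attached to the global
           * /dev/vhost-vsock, or only from VMs scoped to this netns? */
          conn = accept(fd, NULL, NULL);
          if (conn < 0) {
                  perror("accept");
                  return 1;
          }

          printf("accepted vsock connection\n");
          close(conn);
          close(fd);
          return 0;
  }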
Indeed in this series we have talked mostly about the host -> guest path
(as the direction of the connection), but little about the guest -> host
path, maybe we should explain it better in the cover/commit
descriptions/documentation.
Thanks,
Stefano
>
>Testing
>
>QEMU with /dev/vhost-vsock-netns support:
> https://github.com/beshleman/qemu/tree/vsock-netns
>
>Test: Scoped vsocks isolated by namespace
>
> host# ip netns add ns1
> host# ip netns add ns2
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> host# ip netns exec ns2 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> host# socat - VSOCK-CONNECT:15:1234
> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>
> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>
> vm1# socat - VSOCK-LISTEN:1234
> foobar1
> vm2# socat - VSOCK-LISTEN:1234
> foobar2
>
>Test: Global vsocks accessible to any namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> vm# socat - VSOCK-LISTEN:1234
> foobar
>
>Test: Connecting to global vsock makes CID unavailble to namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm# socat - VSOCK-LISTEN:1234
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
>Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
>---
>Changes in v2:
>- only support vhost-vsock namespaces
>- all g2h namespaces retain old behavior, only common API changes
> impacted by vhost-vsock changes
>- add /dev/vhost-vsock-netns for "opt-in"
>- leave /dev/vhost-vsock to old behavior
>- removed netns module param
>- Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>
>Changes in v1:
>- added 'netns' module param to vsock.ko to enable the
> network namespace support (disabled by default)
>- added 'vsock_net_eq()' to check the "net" assigned to a socket
> only when 'netns' support is enabled
>- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>
>---
>Stefano Garzarella (3):
> vsock: add network namespace support
> vsock/virtio_transport_common: handle netns of received packets
> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>
> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
> include/linux/miscdevice.h | 1 +
> include/linux/virtio_vsock.h | 2 +
> include/net/af_vsock.h | 10 ++--
> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 +-
> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
> net/vmw_vsock/vmci_transport.c | 4 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 10 files changed, 180 insertions(+), 43 deletions(-)
>---
>base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
>change-id: 20250312-vsock-netns-45da9424f726
>
>Best regards,
>--
>Bobby Eshleman <bobbyeshleman@gmail.com>
>
On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote:
> CCing Daniel
>
> On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > Picking up Stefano's v1 [1], this series adds netns support to
> > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > namespaces, defering that for future implementation and discussion.
> >
> > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > vsock or scoped vsock share the same CID, the scoped vsock takes
> > precedence.
> >
> > If a socket in a namespace connects with a global vsock, the CID becomes
> > unavailable to any VMM in that namespace when creating new vsocks. If
> > disconnected, the CID becomes available again.
>
> I was talking about this feature with Daniel and he pointed out something
> interesting (Daniel please feel free to correct me):
>
> If we have a process in the host that does a listen(AF_VSOCK) in a
> namespace, can this receive connections from guests connected to
> /dev/vhost-vsock in any namespace?
>
> Should we provide something (e.g. sysctl/sysfs entry) to disable
> this behaviour, preventing a process in a namespace from receiving
> connections from the global vsock address space (i.e. /dev/vhost-vsock
> VMs)?
I think my concern goes a bit beyond that, to the general conceptual
idea of sharing the CID space between the global vsocks and namespace
vsocks. So I'm not sure a sysctl would be sufficient...details later
below..
> I understand that by default maybe we should allow this behaviour in order
> to not break current applications, but in some cases the user may want to
> isolate sockets in a namespace also from being accessed by VMs running in
> the global vsock address space.
>
> Indeed in this series we have talked mostly about the host -> guest path (as
> the direction of the connection), but little about the guest -> host path,
> maybe we should explain it better in the cover/commit
> descriptions/documentation.
> > Testing
> >
> > QEMU with /dev/vhost-vsock-netns support:
> > https://github.com/beshleman/qemu/tree/vsock-netns
> >
> > Test: Scoped vsocks isolated by namespace
> >
> > host# ip netns add ns1
> > host# ip netns add ns2
> > host# ip netns exec ns1 \
> > qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE1} \
> > -device vhost-vsock-pci,netns=on,guest-cid=15
> > host# ip netns exec ns2 \
> > qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE2} \
> > -device vhost-vsock-pci,netns=on,guest-cid=15
> >
> > host# socat - VSOCK-CONNECT:15:1234
> > 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
> >
> > host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
> >
> > vm1# socat - VSOCK-LISTEN:1234
> > foobar1
> > vm2# socat - VSOCK-LISTEN:1234
> > foobar2
> >
> > Test: Global vsocks accessible to any namespace
> >
> > host# qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE2} \
> > -device vhost-vsock-pci,guest-cid=15,netns=off
> >
> > host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> >
> > vm# socat - VSOCK-LISTEN:1234
> > foobar
> >
> > Test: Connecting to global vsock makes CID unavailble to namespace
> >
> > host# qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE2} \
> > -device vhost-vsock-pci,guest-cid=15,netns=off
> >
> > vm# socat - VSOCK-LISTEN:1234
> >
> > host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > host# ip netns exec ns1 \
> > qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE1} \
> > -device vhost-vsock-pci,netns=on,guest-cid=15
> >
> > qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
I find it conceptually quite unsettling that the VSOCK CID address
space for AF_VSOCK is shared between the host and the namespace.
That feels contrary to how namespaces are more commonly used for
deterministically isolating resources between the namespace and the
host.
Naively I would expect that in a namespace, all VSOCK CIDs are
free for use, without having to concern yourself with what CIDs
are in use in the host now, or in future.
What happens if we reverse the QEMU order above, to get the
following scenario
# Launch VM1 inside the NS
host# ip netns exec ns1 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE1} \
-device vhost-vsock-pci,netns=on,guest-cid=15
# Launch VM2
host# qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,guest-cid=15,netns=off
vm1# socat - VSOCK-LISTEN:1234
vm2# socat - VSOCK-LISTEN:1234
host# socat - VSOCK-CONNECT:15:1234
=> Presume this connects to "VM2" running outside the NS
host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
=> Does this connect to "VM1" inside the NS, or "VM2"
outside the NS ?
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Apr 01, 2025 at 08:05:16PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote:
> > CCing Daniel
> >
> > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > > Picking up Stefano's v1 [1], this series adds netns support to
> > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > > namespaces, defering that for future implementation and discussion.
> > >
> > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > > vsock or scoped vsock share the same CID, the scoped vsock takes
> > > precedence.
> > >
> > > If a socket in a namespace connects with a global vsock, the CID becomes
> > > unavailable to any VMM in that namespace when creating new vsocks. If
> > > disconnected, the CID becomes available again.
> >
> > I was talking about this feature with Daniel and he pointed out something
> > interesting (Daniel please feel free to correct me):
> >
> > If we have a process in the host that does a listen(AF_VSOCK) in a
> > namespace, can this receive connections from guests connected to
> > /dev/vhost-vsock in any namespace?
> >
> > Should we provide something (e.g. sysctl/sysfs entry) to disable
> > this behaviour, preventing a process in a namespace from receiving
> > connections from the global vsock address space (i.e. /dev/vhost-vsock
> > VMs)?
>
> I think my concern goes a bit beyond that, to the general conceptual
> idea of sharing the CID space between the global vsocks and namespace
> vsocks. So I'm not sure a sysctl would be sufficient...details later
> below..
>
> > I understand that by default maybe we should allow this behaviour in order
> > to not break current applications, but in some cases the user may want to
> > isolate sockets in a namespace also from being accessed by VMs running in
> > the global vsock address space.
> >
> > Indeed in this series we have talked mostly about the host -> guest path (as
> > the direction of the connection), but little about the guest -> host path,
> > maybe we should explain it better in the cover/commit
> > descriptions/documentation.
>
> > > Testing
> > >
> > > QEMU with /dev/vhost-vsock-netns support:
> > > https://github.com/beshleman/qemu/tree/vsock-netns
> > >
> > > Test: Scoped vsocks isolated by namespace
> > >
> > > host# ip netns add ns1
> > > host# ip netns add ns2
> > > host# ip netns exec ns1 \
> > > qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE1} \
> > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > host# ip netns exec ns2 \
> > > qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE2} \
> > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > >
> > > host# socat - VSOCK-CONNECT:15:1234
> > > 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
> > >
> > > host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
> > >
> > > vm1# socat - VSOCK-LISTEN:1234
> > > foobar1
> > > vm2# socat - VSOCK-LISTEN:1234
> > > foobar2
> > >
> > > Test: Global vsocks accessible to any namespace
> > >
> > > host# qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE2} \
> > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > >
> > > host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > >
> > > vm# socat - VSOCK-LISTEN:1234
> > > foobar
> > >
> > > Test: Connecting to global vsock makes CID unavailble to namespace
> > >
> > > host# qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE2} \
> > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > >
> > > vm# socat - VSOCK-LISTEN:1234
> > >
> > > host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > host# ip netns exec ns1 \
> > > qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE1} \
> > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > >
> > > qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
> I find it conceptually quite unsettling that the VSOCK CID address
> space for AF_VSOCK is shared between the host and the namespace.
> That feels contrary to how namespaces are more commonly used for
> deterministically isolating resources between the namespace and the
> host.
>
> Naively I would expect that in a namespace, all VSOCK CIDs are
> free for use, without having to concern yourself with what CIDs
> are in use in the host now, or in future.
>
True, that would be ideal. I think the definition of backwards
compatibility we've established includes the notion that any VM may
reach any namespace and any namespace may reach any VM. IIUC, it sounds
like you are suggesting this be revised to more strictly adhere to
namespace semantics?
I do like Stefano's suggestion to add a sysctl for a "strict" mode,
since it offers the best of both worlds and still tends conservative in
protecting existing applications... but I agree, the non-strict mode
vsock would be unique WRT the usual concept of namespaces.
> What happens if we reverse the QEMU order above, to get the
> following scenario
>
> # Launch VM1 inside the NS
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> # Launch VM2
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm1# socat - VSOCK-LISTEN:1234
> vm2# socat - VSOCK-LISTEN:1234
>
> host# socat - VSOCK-CONNECT:15:1234
> => Presume this connects to "VM2" running outside the NS
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> => Does this connect to "VM1" inside the NS, or "VM2"
> outside the NS ?
>
VM1 inside the NS. Current logic says that whenever two CIDs collide
(local vs global), always select the one in the local namespace
(irrespective of creation order).
With the proposed strict-mode sysctl enabled, it would *never* connect
to the global one, even if there was no local match but there was a
global one.
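Roughly speaking (reusing the toy model sketched earlier in this thread,
purely illustrative and not from the series), strict mode would just drop
the global fallback from the lookup:

  /* Hypothetical strict-mode wrapper: a socket in a namespace never
   * resolves a CID to the global address space, even when the namespace
   * has no VM of its own bound to that CID. */
  static const struct toy_vsock *toy_lookup_strict(const struct toy_vsock *tbl,
                                                   size_t n, unsigned int cid,
                                                   unsigned int caller_ns,
                                                   int strict)
  {
          const struct toy_vsock *hit = toy_lookup(tbl, n, cid, caller_ns);

          if (strict && hit && hit->netns_id != caller_ns)
                  return NULL;    /* no global fallback in strict mode */
          return hit;
  }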
>
>
> With regards,
> Daniel
Thanks for the review!
Best,
Bobby
On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>
> On Tue, Apr 01, 2025 at 08:05:16PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote:
> > > CCing Daniel
> > >
> > > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > > > Picking up Stefano's v1 [1], this series adds netns support to
> > > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > > > namespaces, defering that for future implementation and discussion.
> > > >
> > > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > > > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > > > vsock or scoped vsock share the same CID, the scoped vsock takes
> > > > precedence.
> > > >
> > > > If a socket in a namespace connects with a global vsock, the CID becomes
> > > > unavailable to any VMM in that namespace when creating new vsocks. If
> > > > disconnected, the CID becomes available again.
> > >
> > > I was talking about this feature with Daniel and he pointed out something
> > > interesting (Daniel please feel free to correct me):
> > >
> > > If we have a process in the host that does a listen(AF_VSOCK) in a
> > > namespace, can this receive connections from guests connected to
> > > /dev/vhost-vsock in any namespace?
> > >
> > > Should we provide something (e.g. sysctl/sysfs entry) to disable
> > > this behaviour, preventing a process in a namespace from receiving
> > > connections from the global vsock address space (i.e. /dev/vhost-vsock
> > > VMs)?
> >
> > I think my concern goes a bit beyond that, to the general conceptual
> > idea of sharing the CID space between the global vsocks and namespace
> > vsocks. So I'm not sure a sysctl would be sufficient...details later
> > below..
> >
> > > I understand that by default maybe we should allow this behaviour in order
> > > to not break current applications, but in some cases the user may want to
> > > isolate sockets in a namespace also from being accessed by VMs running in
> > > the global vsock address space.
> > >
> > > Indeed in this series we have talked mostly about the host -> guest path (as
> > > the direction of the connection), but little about the guest -> host path,
> > > maybe we should explain it better in the cover/commit
> > > descriptions/documentation.
> >
> > > > Testing
> > > >
> > > > QEMU with /dev/vhost-vsock-netns support:
> > > > https://github.com/beshleman/qemu/tree/vsock-netns
> > > >
> > > > Test: Scoped vsocks isolated by namespace
> > > >
> > > > host# ip netns add ns1
> > > > host# ip netns add ns2
> > > > host# ip netns exec ns1 \
> > > > qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE1} \
> > > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > > host# ip netns exec ns2 \
> > > > qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE2} \
> > > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > >
> > > > host# socat - VSOCK-CONNECT:15:1234
> > > > 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
> > > >
> > > > host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > > host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
> > > >
> > > > vm1# socat - VSOCK-LISTEN:1234
> > > > foobar1
> > > > vm2# socat - VSOCK-LISTEN:1234
> > > > foobar2
> > > >
> > > > Test: Global vsocks accessible to any namespace
> > > >
> > > > host# qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE2} \
> > > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > > >
> > > > host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > >
> > > > vm# socat - VSOCK-LISTEN:1234
> > > > foobar
> > > >
> > > > Test: Connecting to global vsock makes CID unavailble to namespace
> > > >
> > > > host# qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE2} \
> > > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > > >
> > > > vm# socat - VSOCK-LISTEN:1234
> > > >
> > > > host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > > host# ip netns exec ns1 \
> > > > qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE1} \
> > > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > >
> > > > qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
> >
> > I find it conceptually quite unsettling that the VSOCK CID address
> > space for AF_VSOCK is shared between the host and the namespace.
> > That feels contrary to how namespaces are more commonly used for
> > deterministically isolating resources between the namespace and the
> > host.
> >
> > Naively I would expect that in a namespace, all VSOCK CIDs are
> > free for use, without having to concern yourself with what CIDs
> > are in use in the host now, or in future.
> >
>
> True, that would be ideal. I think the definition of backwards
> compatibility we've established includes the notion that any VM may
> reach any namespace and any namespace may reach any VM. IIUC, it
> sounds
> like you are suggesting this be revised to more strictly adhere to
> namespace semantics?
>
> I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> Since it offers the best of both worlds, and still tends conservative in
> protecting existing applications... but I agree, the non-strict mode
> vsock would be unique WRT the usual concept of namespaces.
Maybe we could do the opposite, enable strict mode by default (I think
it was similar to what I had tried to do with the kernel module in v1, I
was young I know xD)
And provide a way to disable it for those use cases where the user wants
backward compatibility, while paying the cost of less isolation.
I was thinking two options (not sure if the second one can be done):
1. provide a global sysfs/sysctl that disables strict mode, but this
then applies to all namespaces
2. provide something that allows disabling strict mode by namespace.
Maybe when it is created there are options, or something that can be
set later.
2 would be ideal, but that might be too much, so 1 might be enough. In
any case, 2 could also be a next step.
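Just to sketch what option 2 might look like (rough, hypothetical code:
the structure and field names are invented, the init/exit and knob
plumbing are omitted, and how the setting is actually exposed -- sysctl,
sysfs, or an option at namespace creation -- is exactly the open question
above), the setting could live in per-netns state:

  /* Hypothetical per-netns state, for discussion only; would be
   * registered with register_pernet_subsys(&vsock_pernet_ops). */
  struct vsock_pernet {
          bool strict;    /* false = legacy/global behaviour */
  };

  static unsigned int vsock_pernet_id;

  static int __net_init vsock_pernet_init(struct net *net)
  {
          struct vsock_pernet *vp = net_generic(net, vsock_pernet_id);

          vp->strict = false;     /* default: backward-compatible */
          return 0;
  }

  static struct pernet_operations vsock_pernet_ops = {
          .init   = vsock_pernet_init,
          .id     = &vsock_pernet_id,
          .size   = sizeof(struct vsock_pernet),
  };

  /* Checked wherever a CID lookup would otherwise fall back to the
   * global address space. */
  static bool vsock_net_strict(struct net *net)
  {
          struct vsock_pernet *vp = net_generic(net, vsock_pernet_id);

          return vp->strict;
  }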
WDYT?
Thanks,
Stefano
On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> >
> > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > Since it offers the best of both worlds, and still tends conservative in
> > protecting existing applications... but I agree, the non-strict mode
> > vsock would be unique WRT the usual concept of namespaces.
>
> Maybe we could do the opposite, enable strict mode by default (I think
> it was similar to what I had tried to do with the kernel module in v1, I
> was young I know xD)
> And provide a way to disable it for those use cases where the user wants
> backward compatibility, while paying the cost of less isolation.

I think backwards compatibility has to be the default behaviour, otherwise
the change has too high a risk of breaking existing deployments that are
already using netns and relying on VSOCK being global. Breakage has to
be opt in.

> I was thinking two options (not sure if the second one can be done):
>
> 1. provide a global sysfs/sysctl that disables strict mode, but this
> then applies to all namespaces
>
> 2. provide something that allows disabling strict mode by namespace.
> Maybe when it is created there are options, or something that can be
> set later.
>
> 2 would be ideal, but that might be too much, so 1 might be enough. In
> any case, 2 could also be a next step.
>
> WDYT?

It occurred to me that the problem we face with the CID space usage is
somewhat similar to the UID/GID space usage for user namespaces.

In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.

At the risk of being overkill, is it worth trying a similar kind of
approach for the vsock CID space?

A simple variant would be a /proc/net/vsock_cid_outside specifying a set
of CIDs which are exclusively referencing /dev/vhost-vsock associations
created outside the namespace. Anything not listed would be exclusively
referencing associations created inside the namespace.

A more complex variant would be to allow a full remapping of CIDs as is
done with userns, via a /proc/net/vsock_cid_map with the same three
parameters, so that a CID=15 association outside the namespace could be
remapped to CID=9015 inside the namespace, allowing the inside namespace
to define its own association for CID=15 without clashing.

IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
associations created outside the namespace, while unmapped CIDs would be
exclusively referencing /dev/vhost-vsock associations inside the
namespace.

A likely benefit of relying on a kernel-defined mapping/partition of
the CID space is that apps like QEMU don't need changing, as there's
no need to invent a new /dev/vhost-vsock-netns device node.

Both approaches give the desirable security protection whereby the
inside namespace can be prevented from accessing certain CIDs that
were associated outside the namespace.

Some rule would need to be defined for updating the /proc/net/vsock_cid_map
file as it is the security control mechanism. If it is write-once, then
once the container mgmt app initializes it, nothing later could change
it.

A key question is: do we need the "first come, first served" behaviour
for CIDs, where a CID can be arbitrarily used by the outside or inside
namespace according to whatever tries to associate a CID first?

IMHO those semantics lead to unpredictable behaviour for apps because
what happens depends on the ordering of app launches inside & outside the
namespace, but they do sort of allow for VSOCK namespace behaviour to
be 'zero conf' out of the box.

A mapping that strictly partitions CIDs to either outside or inside
namespace usage, but never both, gives well-defined behaviour, at the
cost of needing to set up an initial mapping/partition.

With regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
>On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
>> On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>> >
>> > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
>> > Since it offers the best of both worlds, and still tends conservative in
>> > protecting existing applications... but I agree, the non-strict mode
>> > vsock would be unique WRT the usual concept of namespaces.
>>
>> Maybe we could do the opposite, enable strict mode by default (I think
>> it was similar to what I had tried to do with the kernel module in v1, I
>> was young I know xD)
>> And provide a way to disable it for those use cases where the user wants
>> backward compatibility, while paying the cost of less isolation.
>
>I think backwards compatible has to be the default behaviour, otherwise
>the change has too high risk of breaking existing deployments that are
>already using netns and relying on VSOCK being global. Breakage has to
>be opt in.
>
>> I was thinking two options (not sure if the second one can be done):
>>
>> 1. provide a global sysfs/sysctl that disables strict mode, but this
>> then applies to all namespaces
>>
>> 2. provide something that allows disabling strict mode by namespace.
>> Maybe when it is created there are options, or something that can be
>> set later.
>>
>> 2 would be ideal, but that might be too much, so 1 might be enough. In
>> any case, 2 could also be a next step.
>>
>> WDYT?
>
>It occured to me that the problem we face with the CID space usage is
>somewhat similar to the UID/GID space usage for user namespaces.
>
>In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
>allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
>
>At the risk of being overkill, is it worth trying a similar kind of
>approach for the vsock CID space ?
>
>A simple variant would be a /proc/net/vsock_cid_outside specifying a set
>of CIDs which are exclusively referencing /dev/vhost-vsock associations
>created outside the namespace. Anything not listed would be exclusively
>referencing associations created inside the namespace.

I like the idea and I think it is also easily usable in a nested
environment, where for example in L1 we can decide whether or not a
namespace can access the L0 host (CID=2), by adding 2 to
/proc/net/vsock_cid_outside

>
>A more complex variant would be to allow a full remapping of CIDs as is
>done with userns, via a /proc/net/vsock_cid_map, which the same three
>parameters, so that CID=15 association outside the namespace could be
>remapped to CID=9015 inside the namespace, allow the inside namespace
>to define its out association for CID=15 without clashing.
>
>IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
>associations created outside namespace, while unmapped CIDs would be
>exclusively referencing /dev/vhost-vsock associations inside the
>namespace.

This is maybe a little overkill, but I don't object to it! It could
also be a next step. But if it's easy to implement, we can go straight
with it.

>
>A likely benefit of relying on a kernel defined mapping/partition of
>the CID space is that apps like QEMU don't need changing, as there's
>no need to invent a new /dev/vhost-vsock-netns device node.

Yeah, I see that! However, should this be paired with a sysctl/sysfs to
do opt-in? Or can we do something to figure out if the user didn't
write these files, then behave as before (but maybe we need to reverse
the logic, I don't know if that makes sense).

>
>Both approaches give the desirable security protection whereby the
>inside namespace can be prevented from accessing certain CIDs that
>were associated outside the namespace.
>
>Some rule would need to be defined for updating the /proc/net/vsock_cid_map
>file as it is the security control mechanism. If it is write-once then
>if the container mgmt app initializes it, nothing later could change
>it.
>
>A key question is do we need the "first come, first served" behaviour
>for CIDs where a CID can be arbitrarily used by outside or inside namespace
>according to whatever tries to associate a CID first ?
>
>IMHO those semantics lead to unpredictable behaviour for apps because
>what happens depends on ordering of app launches inside & outside the
>namespace, but they do sort of allow for VSOCK namespace behaviour to
>be 'zero conf' out of the box.

Yes, I agree that we should avoid it if possible.

>
>A mapping that strictly partitions CIDs to either outside or inside
>namespace usage, but never both, gives well defined behaviour, at the
>cost of needing to setup an initial mapping/partition.

Thanks for your points!
Stefano
On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> > >
> > > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > > Since it offers the best of both worlds, and still tends conservative in
> > > protecting existing applications... but I agree, the non-strict mode
> > > vsock would be unique WRT the usual concept of namespaces.
> >
> > Maybe we could do the opposite, enable strict mode by default (I think
> > it was similar to what I had tried to do with the kernel module in v1, I
> > was young I know xD)
> > And provide a way to disable it for those use cases where the user wants
> > backward compatibility, while paying the cost of less isolation.
>
> I think backwards compatible has to be the default behaviour, otherwise
> the change has too high risk of breaking existing deployments that are
> already using netns and relying on VSOCK being global. Breakage has to
> be opt in.
>
> > I was thinking two options (not sure if the second one can be done):
> >
> > 1. provide a global sysfs/sysctl that disables strict mode, but this
> > then applies to all namespaces
> >
> > 2. provide something that allows disabling strict mode by namespace.
> > Maybe when it is created there are options, or something that can be
> > set later.
> >
> > 2 would be ideal, but that might be too much, so 1 might be enough. In
> > any case, 2 could also be a next step.
> >
> > WDYT?
>
> It occured to me that the problem we face with the CID space usage is
> somewhat similar to the UID/GID space usage for user namespaces.
>
> In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
>
> At the risk of being overkill, is it worth trying a similar kind of
> approach for the vsock CID space ?
>
> A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> of CIDs which are exclusively referencing /dev/vhost-vsock associations
> created outside the namespace. Anything not listed would be exclusively
> referencing associations created inside the namespace.
>
> A more complex variant would be to allow a full remapping of CIDs as is
> done with userns, via a /proc/net/vsock_cid_map, which the same three
> parameters, so that CID=15 association outside the namespace could be
> remapped to CID=9015 inside the namespace, allow the inside namespace
> to define its out association for CID=15 without clashing.
>
> IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> associations created outside namespace, while unmapped CIDs would be
> exclusively referencing /dev/vhost-vsock associations inside the
> namespace.
>
> A likely benefit of relying on a kernel defined mapping/partition of
> the CID space is that apps like QEMU don't need changing, as there's
> no need to invent a new /dev/vhost-vsock-netns device node.
>
> Both approaches give the desirable security protection whereby the
> inside namespace can be prevented from accessing certain CIDs that
> were associated outside the namespace.
>
> Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> file as it is the security control mechanism. If it is write-once then
> if the container mgmt app initializes it, nothing later could change
> it.
>
> A key question is do we need the "first come, first served" behaviour
> for CIDs where a CID can be arbitrarily used by outside or inside namespace
> according to whatever tries to associate a CID first ?

I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
from being used, this could be solved by disallowing remapping the CID
while in use?

The thing I like about this is that users can check
/proc/net/vsock_cid_outside to figure out what might be going on,
instead of trying to check lsof or ps to figure out if the VMM processes
have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.

Just to check I am following... I suppose we would have a few typical
configurations for /proc/net/vsock_cid_outside. Following uid_map file
format of:

"<local cid start> <global cid start> <range size>"

1. Identity mapping, current namespace CID is global CID (default
setting for new namespaces):

# empty file

OR

0 0 4294967295

2. Complete isolation from global space (initialized, but no mappings):

0 0 0

3. Mapping in ranges of global CIDs

For example, global CID space starts at 7000, up to 32-bit max:

7000 0 4294960295

Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
8000-8100) :

7000 0 100
8000 1000 100

One thing I don't love is that option 3 seems to not be addressing a
known use case. It doesn't necessarily hurt to have, but it will add
complexity to CID handling that might never get used?

Since options 1/2 could also be represented by a boolean (yes/no
"current ns shares CID with global"), I wonder if we could either A)
only support the first two options at first, or B) add just
/proc/net/vsock_ns_mode at first, which supports only "global" and
"local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
or the full mapping if the need arises?

This could also be how we support Option 2 from Stefano's last email of
supporting per-namespace opt-in/opt-out.

Any thoughts on this?

>
> IMHO those semantics lead to unpredictable behaviour for apps because
> what happens depends on ordering of app launches inside & outside the
> namespace, but they do sort of allow for VSOCK namespace behaviour to
> be 'zero conf' out of the box.
>
> A mapping that strictly partitions CIDs to either outside or inside
> namespace usage, but never both, gives well defined behaviour, at the
> cost of needing to setup an initial mapping/partition.
>

Agreed, I do like the plainness of reasoning through it.

Thanks!
Bobby
On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > It occured to me that the problem we face with the CID space usage is
> > somewhat similar to the UID/GID space usage for user namespaces.
> >
> > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
> >
> > At the risk of being overkill, is it worth trying a similar kind of
> > approach for the vsock CID space ?
> >
> > A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> > of CIDs which are exclusively referencing /dev/vhost-vsock associations
> > created outside the namespace. Anything not listed would be exclusively
> > referencing associations created inside the namespace.
> >
> > A more complex variant would be to allow a full remapping of CIDs as is
> > done with userns, via a /proc/net/vsock_cid_map, which the same three
> > parameters, so that CID=15 association outside the namespace could be
> > remapped to CID=9015 inside the namespace, allow the inside namespace
> > to define its out association for CID=15 without clashing.
> >
> > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > associations created outside namespace, while unmapped CIDs would be
> > exclusively referencing /dev/vhost-vsock associations inside the
> > namespace.
> >
> > A likely benefit of relying on a kernel defined mapping/partition of
> > the CID space is that apps like QEMU don't need changing, as there's
> > no need to invent a new /dev/vhost-vsock-netns device node.
> >
> > Both approaches give the desirable security protection whereby the
> > inside namespace can be prevented from accessing certain CIDs that
> > were associated outside the namespace.
> >
> > Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> > file as it is the security control mechanism. If it is write-once then
> > if the container mgmt app initializes it, nothing later could change
> > it.
> >
> > A key question is do we need the "first come, first served" behaviour
> > for CIDs where a CID can be arbitrarily used by outside or inside namespace
> > according to whatever tries to associate a CID first ?
>
> I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
> from being used, this could be solved by disallowing remapping the CID
> while in use?
>
> The thing I like about this is that users can check
> /proc/net/vsock_cid_outside to figure out what might be going on,
> instead of trying to check lsof or ps to figure out if the VMM processes
> have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
>
> Just to check I am following... I suppose we would have a few typical
> configurations for /proc/net/vsock_cid_outside. Following uid_map file
> format of:
> "<local cid start> <global cid start> <range size>"
>
> 1. Identity mapping, current namespace CID is global CID (default
> setting for new namespaces):
>
> # empty file
>
> OR
>
> 0 0 4294967295
>
> 2. Complete isolation from global space (initialized, but no mappings):
>
> 0 0 0
>
> 3. Mapping in ranges of global CIDs
>
> For example, global CID space starts at 7000, up to 32-bit max:
>
> 7000 0 4294960295
>
> Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
> 8000-8100) :
>
> 7000 0 100
> 8000 1000 100
>
>
> One thing I don't love is that option 3 seems to not be addressing a
> known use case. It doesn't necessarily hurt to have, but it will add
> complexity to CID handling that might never get used?
Yeah, I have the same feeling that full remapping of CIDs is probably
adding complexity without clear benefit, unless it somehow helps us
with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ?
I've not thought the latter through to any great level of detail
though.
> Since options 1/2 could also be represented by a boolean (yes/no
> "current ns shares CID with global"), I wonder if we could either A)
> only support the first two options at first, or B) add just
> /proc/net/vsock_ns_mode at first, which supports only "global" and
> "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> or the full mapping if the need arises?
Two options are sufficient if you want to control AF_VSOCK usage
and /dev/vhost-vsock usage as a pair. If you want to separately
control them though, it would push for three options - global,
local, and mixed. By mixed I mean AF_VSOCK in the NS can access
the global CID from the NS, but the NS can't associate the global
CID with a guest.
IOW, this breaks down like:
* CID=N local - aka fully private
Outside NS: Can associate outside CID=N with a guest.
AF_VSOCK permitted to access outside CID=N
Inside NS: Can NOT associate outside CID=N with a guest
Can associate inside CID=N with a guest
AF_VSOCK forbidden to access outside CID=N
AF_VSOCK permitted to access inside CID=N
* CID=N mixed - aka partially shared
Outside NS: Can associate outside CID=N with a guest.
AF_VSOCK permitted to access outside CID=N
Inside NS: Can NOT associate outside CID=N with a guest
AF_VSOCK permitted to access outside CID=N
No inside CID=N concept
* CID=N global - aka current historic behaviour
Outside NS: Can associate outside CID=N with a guest.
AF_VSOCK permitted to access outside CID=N
Inside NS: Can associate outside CID=N with a guest
AF_VSOCK permitted to access outside CID=N
No inside CID=N concept
I was thinking the 'mixed' mode might be useful if the outside NS wants
to retain control over setting up the association, but delegate to
processes in the inside NS for providing individual services to that
guest. This means if the outside NS needs to restart the VM, there is
no race window in which the inside NS can grab the association with the
CID.
As for whether we need to control this per-CID, or with a single setting
applying to all CIDs:
Consider that the host OS can be running one or more "service VMs" on
well known CIDs that can be leveraged from other NS, while those other
NS also run some "end user VMs" that should be private to the NS.
IOW, the CIDs for the service VMs would need to be using "mixed"
policy, while the CIDs for the end user VMs would be "local".
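Restated as a tiny C sketch (the enum and helper names are invented here
purely to encode the table above, nothing more), the two permissions and
the "inside CID" concept split out per mode as:

  #include <stdbool.h>

  /* Hypothetical per-CID policy, directly encoding the breakdown above. */
  enum vsock_cid_mode { CID_MODE_LOCAL, CID_MODE_MIXED, CID_MODE_GLOBAL };

  /* May a VMM inside the NS associate the outside CID=N with a guest? */
  static bool ns_may_associate_outside_cid(enum vsock_cid_mode mode)
  {
          return mode == CID_MODE_GLOBAL;
  }

  /* May AF_VSOCK sockets inside the NS reach the outside CID=N? */
  static bool ns_may_connect_outside_cid(enum vsock_cid_mode mode)
  {
          return mode != CID_MODE_LOCAL;  /* mixed and global allow it */
  }

  /* Does the NS keep its own, inside notion of CID=N? */
  static bool ns_has_inside_cid(enum vsock_cid_mode mode)
  {
          return mode == CID_MODE_LOCAL;
  }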
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Apr 04, 2025 at 02:05:32PM +0100, Daniel P. Berrangé wrote: > On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: > > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: > > > It occured to me that the problem we face with the CID space usage is > > > somewhat similar to the UID/GID space usage for user namespaces. > > > > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to > > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. > > > > > > At the risk of being overkill, is it worth trying a similar kind of > > > approach for the vsock CID space ? > > > > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set > > > of CIDs which are exclusively referencing /dev/vhost-vsock associations > > > created outside the namespace. Anything not listed would be exclusively > > > referencing associations created inside the namespace. > > > > > > A more complex variant would be to allow a full remapping of CIDs as is > > > done with userns, via a /proc/net/vsock_cid_map, which the same three > > > parameters, so that CID=15 association outside the namespace could be > > > remapped to CID=9015 inside the namespace, allow the inside namespace > > > to define its out association for CID=15 without clashing. > > > > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock > > > associations created outside namespace, while unmapped CIDs would be > > > exclusively referencing /dev/vhost-vsock associations inside the > > > namespace. > > > > > > A likely benefit of relying on a kernel defined mapping/partition of > > > the CID space is that apps like QEMU don't need changing, as there's > > > no need to invent a new /dev/vhost-vsock-netns device node. > > > > > > Both approaches give the desirable security protection whereby the > > > inside namespace can be prevented from accessing certain CIDs that > > > were associated outside the namespace. > > > > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map > > > file as it is the security control mechanism. If it is write-once then > > > if the container mgmt app initializes it, nothing later could change > > > it. > > > > > > A key question is do we need the "first come, first served" behaviour > > > for CIDs where a CID can be arbitrarily used by outside or inside namespace > > > according to whatever tries to associate a CID first ? > > > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID > > from being used, this could be solved by disallowing remapping the CID > > while in use? > > > > The thing I like about this is that users can check > > /proc/net/vsock_cid_outside to figure out what might be going on, > > instead of trying to check lsof or ps to figure out if the VMM processes > > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. > > > > Just to check I am following... I suppose we would have a few typical > > configurations for /proc/net/vsock_cid_outside. Following uid_map file > > format of: > > "<local cid start> <global cid start> <range size>" > > > > 1. Identity mapping, current namespace CID is global CID (default > > setting for new namespaces): > > > > # empty file > > > > OR > > > > 0 0 4294967295 > > > > 2. Complete isolation from global space (initialized, but no mappings): > > > > 0 0 0 > > > > 3. 
Mapping in ranges of global CIDs > > > > For example, global CID space starts at 7000, up to 32-bit max: > > > > 7000 0 4294960295 > > > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to > > 8000-8100) : > > > > 7000 0 100 > > 8000 1000 100 > > > > > > One thing I don't love is that option 3 seems to not be addressing a > > known use case. It doesn't necessarily hurt to have, but it will add > > complexity to CID handling that might never get used? > > Yeah, I have the same feeling that full remapping of CIDs is probably > adding complexity without clear benefit, unless it somehow helps us > with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ? > I've not thought the latter through to any great level of detail > though > > > Since options 1/2 could also be represented by a boolean (yes/no > > "current ns shares CID with global"), I wonder if we could either A) > > only support the first two options at first, or B) add just > > /proc/net/vsock_ns_mode at first, which supports only "global" and > > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside > > or the full mapping if the need arises? > > Two options is sufficient if you want to control AF_VSOCK usage > and /dev/vhost-vsock usage as a pair. If you want to separately > control them though, it would push for three options - global, > local, and mixed. By mixed I mean AF_VSOCK in the NS can access > the global CID from the NS, but the NS can't associate the global > CID with a guest. > > IOW, this breaks down like: > > * CID=N local - aka fully private > > Outside NS: Can associate outside CID=N with a guest. > AF_VSOCK permitted to access outside CID=N > > Inside NS: Can NOT associate outside CID=N with a guest > Can associate inside CID=N with a guest > AF_VSOCK forbidden to access outside CID=N > AF_VSOCK permitted to access inside CID=N > > > * CID=N mixed - aka partially shared > > Outside NS: Can associate outside CID=N with a guest. > AF_VSOCK permitted to access outside CID=N > > Inside NS: Can NOT associate outside CID=N with a guest > AF_VSOCK permitted to access outside CID=N > No inside CID=N concept > > > * CID=N global - aka current historic behaviour > > Outside NS: Can associate outside CID=N with a guest. > AF_VSOCK permitted to access outside CID=N > > Inside NS: Can associate outside CID=N with a guest > AF_VSOCK permitted to access outside CID=N > No inside CID=N concept > > > I was thinking the 'mixed' mode might be useful if the outside NS wants > to retain control over setting up the association, but delegate to > processes in the inside NS for providing individual services to that > guest. This means if the outside NS needs to restart the VM, there is > no race window in which the inside NS can grab the assocaition with the > CID > > As for whether we need to control this per-CID, or a single setting > applying to all CID. > > Consider that the host OS can be running one or more "service VMs" on > well known CIDs that can be leveraged from other NS, while those other > NS also run some "end user VMs" that should be private to the NS. > > IOW, the CIDs for the service VMs would need to be using "mixed" > policy, while the CIDs for the end user VMs would be "local". > I think this sounds pretty flexible, and IMO adding the third mode doesn't add much more additional complexity. 
Going this route, we have:

- three modes: local, global, mixed
- at first, no vsock_cid_map (local has no outside CIDs, global and mixed have no inside
  CIDs, so no cross-mapping needed)
- only later add a full mapped mode and vsock_cid_map if necessary.

Stefano, any preferences on this vs starting with the restricted
vsock_cid_map (only supporting "0 0 0" and "0 0 <size>")?

I'm leaning towards the modes because it covers more use cases and seems
like a clearer user interface?

To clarify another aspect... child namespaces must inherit the parent's
local mode. So if namespace P sets the mode to local, and then creates a
child process that then creates namespace C... then C's global and mixed
modes are implicitly restricted to P's local space?

Thanks,
Bobby
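To make the three-mode split concrete, a minimal sketch of how a per-netns
setting might look on the kernel side; every name here (vsock_ns_mode,
vsock_net_mode(), vsock_ns_may_associate()) is invented for illustration
and is not part of the posted series:

enum vsock_ns_mode {
	VSOCK_NS_MODE_GLOBAL,	/* historic behaviour: shares the global CID space */
	VSOCK_NS_MODE_LOCAL,	/* fully private: only namespace-scoped CIDs */
	VSOCK_NS_MODE_MIXED,	/* AF_VSOCK may reach global CIDs, but this ns
				 * may not associate a global CID with a guest
				 */
};

/* Hypothetical check used when a VMM in @net asks to associate @cid. */
static bool vsock_ns_may_associate(struct net *net, u32 cid)
{
	switch (vsock_net_mode(net)) {	/* hypothetical per-netns getter */
	case VSOCK_NS_MODE_GLOBAL:
		return true;	/* old behaviour, global CID space */
	case VSOCK_NS_MODE_LOCAL:
		return true;	/* allowed, but only into the scoped CID space */
	case VSOCK_NS_MODE_MIXED:
		return false;	/* association reserved for the outside namespace */
	}
	return false;
}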
On Fri, Apr 18, 2025 at 10:57:52AM -0700, Bobby Eshleman wrote: >On Fri, Apr 04, 2025 at 02:05:32PM +0100, Daniel P. Berrangé wrote: >> On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: >> > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: >> > > It occured to me that the problem we face with the CID space usage is >> > > somewhat similar to the UID/GID space usage for user namespaces. >> > > >> > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to >> > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. >> > > >> > > At the risk of being overkill, is it worth trying a similar kind of >> > > approach for the vsock CID space ? >> > > >> > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set >> > > of CIDs which are exclusively referencing /dev/vhost-vsock associations >> > > created outside the namespace. Anything not listed would be exclusively >> > > referencing associations created inside the namespace. >> > > >> > > A more complex variant would be to allow a full remapping of CIDs as is >> > > done with userns, via a /proc/net/vsock_cid_map, which the same three >> > > parameters, so that CID=15 association outside the namespace could be >> > > remapped to CID=9015 inside the namespace, allow the inside namespace >> > > to define its out association for CID=15 without clashing. >> > > >> > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock >> > > associations created outside namespace, while unmapped CIDs would be >> > > exclusively referencing /dev/vhost-vsock associations inside the >> > > namespace. >> > > >> > > A likely benefit of relying on a kernel defined mapping/partition of >> > > the CID space is that apps like QEMU don't need changing, as there's >> > > no need to invent a new /dev/vhost-vsock-netns device node. >> > > >> > > Both approaches give the desirable security protection whereby the >> > > inside namespace can be prevented from accessing certain CIDs that >> > > were associated outside the namespace. >> > > >> > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map >> > > file as it is the security control mechanism. If it is write-once then >> > > if the container mgmt app initializes it, nothing later could change >> > > it. >> > > >> > > A key question is do we need the "first come, first served" behaviour >> > > for CIDs where a CID can be arbitrarily used by outside or inside namespace >> > > according to whatever tries to associate a CID first ? >> > >> > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID >> > from being used, this could be solved by disallowing remapping the CID >> > while in use? >> > >> > The thing I like about this is that users can check >> > /proc/net/vsock_cid_outside to figure out what might be going on, >> > instead of trying to check lsof or ps to figure out if the VMM processes >> > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. >> > >> > Just to check I am following... I suppose we would have a few typical >> > configurations for /proc/net/vsock_cid_outside. Following uid_map file >> > format of: >> > "<local cid start> <global cid start> <range size>" >> > >> > 1. Identity mapping, current namespace CID is global CID (default >> > setting for new namespaces): >> > >> > # empty file >> > >> > OR >> > >> > 0 0 4294967295 >> > >> > 2. Complete isolation from global space (initialized, but no mappings): >> > >> > 0 0 0 >> > >> > 3. 
Mapping in ranges of global CIDs >> > >> > For example, global CID space starts at 7000, up to 32-bit max: >> > >> > 7000 0 4294960295 >> > >> > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to >> > 8000-8100) : >> > >> > 7000 0 100 >> > 8000 1000 100 >> > >> > >> > One thing I don't love is that option 3 seems to not be addressing a >> > known use case. It doesn't necessarily hurt to have, but it will add >> > complexity to CID handling that might never get used? >> >> Yeah, I have the same feeling that full remapping of CIDs is probably >> adding complexity without clear benefit, unless it somehow helps us >> with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ? >> I've not thought the latter through to any great level of detail >> though >> >> > Since options 1/2 could also be represented by a boolean (yes/no >> > "current ns shares CID with global"), I wonder if we could either A) >> > only support the first two options at first, or B) add just >> > /proc/net/vsock_ns_mode at first, which supports only "global" and >> > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside >> > or the full mapping if the need arises? >> >> Two options is sufficient if you want to control AF_VSOCK usage >> and /dev/vhost-vsock usage as a pair. If you want to separately >> control them though, it would push for three options - global, >> local, and mixed. By mixed I mean AF_VSOCK in the NS can access >> the global CID from the NS, but the NS can't associate the global >> CID with a guest. >> >> IOW, this breaks down like: >> >> * CID=N local - aka fully private >> >> Outside NS: Can associate outside CID=N with a guest. >> AF_VSOCK permitted to access outside CID=N >> >> Inside NS: Can NOT associate outside CID=N with a guest >> Can associate inside CID=N with a guest >> AF_VSOCK forbidden to access outside CID=N >> AF_VSOCK permitted to access inside CID=N >> >> >> * CID=N mixed - aka partially shared >> >> Outside NS: Can associate outside CID=N with a guest. >> AF_VSOCK permitted to access outside CID=N >> >> Inside NS: Can NOT associate outside CID=N with a guest >> AF_VSOCK permitted to access outside CID=N >> No inside CID=N concept >> >> >> * CID=N global - aka current historic behaviour >> >> Outside NS: Can associate outside CID=N with a guest. >> AF_VSOCK permitted to access outside CID=N >> >> Inside NS: Can associate outside CID=N with a guest >> AF_VSOCK permitted to access outside CID=N >> No inside CID=N concept >> >> >> I was thinking the 'mixed' mode might be useful if the outside NS wants >> to retain control over setting up the association, but delegate to >> processes in the inside NS for providing individual services to that >> guest. This means if the outside NS needs to restart the VM, there is >> no race window in which the inside NS can grab the assocaition with the >> CID >> >> As for whether we need to control this per-CID, or a single setting >> applying to all CID. >> >> Consider that the host OS can be running one or more "service VMs" on >> well known CIDs that can be leveraged from other NS, while those other >> NS also run some "end user VMs" that should be private to the NS. >> >> IOW, the CIDs for the service VMs would need to be using "mixed" >> policy, while the CIDs for the end user VMs would be "local". >> > >I think this sounds pretty flexible, and IMO adding the third mode >doesn't add much more additional complexity. 
>
>Going this route, we have:
>- three modes: local, global, mixed
>- at first, no vsock_cid_map (local has no outside CIDs, global and mixed have no inside
> CIDs, so no cross-mapping needed)
>- only later add a full mapped mode and vsock_cid_map if necessary.
>
>Stefano, any preferences on this vs starting with the restricted
>vsock_cid_map (only supporting "0 0 0" and "0 0 <size>")?

No preference, I also like this idea.

>
>I'm leaning towards the modes because it covers more use cases and seems
>like a clearer user interface?

Sure, go ahead!

>
>To clarify another aspect... child namespaces must inherit the parent's
>local mode. So if namespace P sets the mode to local, and then creates a
>child process that then creates namespace C... then C's global and mixed
>modes are implicitly restricted to P's local space?

I think so, but it's still not clear to me if the mode can be selected
per namespace or it's a setting for the entire system, but I think we
can discuss this better on a proposal with some code :-)

Thanks,
Stefano
On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: > > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote: > > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode, > > > > Since it offers the best of both worlds, and still tends conservative in > > > > protecting existing applications... but I agree, the non-strict mode > > > > vsock would be unique WRT the usual concept of namespaces. > > > > > > Maybe we could do the opposite, enable strict mode by default (I think > > > it was similar to what I had tried to do with the kernel module in v1, I > > > was young I know xD) > > > And provide a way to disable it for those use cases where the user wants > > > backward compatibility, while paying the cost of less isolation. > > > > I think backwards compatible has to be the default behaviour, otherwise > > the change has too high risk of breaking existing deployments that are > > already using netns and relying on VSOCK being global. Breakage has to > > be opt in. > > > > > I was thinking two options (not sure if the second one can be done): > > > > > > 1. provide a global sysfs/sysctl that disables strict mode, but this > > > then applies to all namespaces > > > > > > 2. provide something that allows disabling strict mode by namespace. > > > Maybe when it is created there are options, or something that can be > > > set later. > > > > > > 2 would be ideal, but that might be too much, so 1 might be enough. In > > > any case, 2 could also be a next step. > > > > > > WDYT? > > > > It occured to me that the problem we face with the CID space usage is > > somewhat similar to the UID/GID space usage for user namespaces. > > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. > > > > At the risk of being overkill, is it worth trying a similar kind of > > approach for the vsock CID space ? > > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set > > of CIDs which are exclusively referencing /dev/vhost-vsock associations > > created outside the namespace. Anything not listed would be exclusively > > referencing associations created inside the namespace. > > > > A more complex variant would be to allow a full remapping of CIDs as is > > done with userns, via a /proc/net/vsock_cid_map, which the same three > > parameters, so that CID=15 association outside the namespace could be > > remapped to CID=9015 inside the namespace, allow the inside namespace > > to define its out association for CID=15 without clashing. > > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock > > associations created outside namespace, while unmapped CIDs would be > > exclusively referencing /dev/vhost-vsock associations inside the > > namespace. > > > > A likely benefit of relying on a kernel defined mapping/partition of > > the CID space is that apps like QEMU don't need changing, as there's > > no need to invent a new /dev/vhost-vsock-netns device node. > > > > Both approaches give the desirable security protection whereby the > > inside namespace can be prevented from accessing certain CIDs that > > were associated outside the namespace. > > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map > > file as it is the security control mechanism. 
If it is write-once then > > if the container mgmt app initializes it, nothing later could change > > it. > > > > A key question is do we need the "first come, first served" behaviour > > for CIDs where a CID can be arbitrarily used by outside or inside namespace > > according to whatever tries to associate a CID first ? > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID > from being used, this could be solved by disallowing remapping the CID > while in use? > > The thing I like about this is that users can check > /proc/net/vsock_cid_outside to figure out what might be going on, > instead of trying to check lsof or ps to figure out if the VMM processes > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. > > Just to check I am following... I suppose we would have a few typical > configurations for /proc/net/vsock_cid_outside. Following uid_map file > format of: > "<local cid start> <global cid start> <range size>" > > 1. Identity mapping, current namespace CID is global CID (default > setting for new namespaces): > > # empty file > > OR > > 0 0 4294967295 > > 2. Complete isolation from global space (initialized, but no mappings): > > 0 0 0 > > 3. Mapping in ranges of global CIDs > > For example, global CID space starts at 7000, up to 32-bit max: > > 7000 0 4294960295 > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to > 8000-8100) : > > 7000 0 100 > 8000 1000 100 > > > One thing I don't love is that option 3 seems to not be addressing a > known use case. It doesn't necessarily hurt to have, but it will add > complexity to CID handling that might never get used? > > Since options 1/2 could also be represented by a boolean (yes/no > "current ns shares CID with global"), I wonder if we could either A) > only support the first two options at first, or B) add just > /proc/net/vsock_ns_mode at first, which supports only "global" and > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside > or the full mapping if the need arises? > > This could also be how we support Option 2 from Stefano's last email of > supporting per-namespace opt-in/opt-out. > > Any thoughts on this? > Stefano, Would only supporting 1/2 still support the Kata use case? Thanks, Bobby
On Wed, Apr 02, 2025 at 03:28:19PM -0700, Bobby Eshleman wrote: >On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: >> On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: >> > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote: >> > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote: >> > > > >> > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode, >> > > > Since it offers the best of both worlds, and still tends conservative in >> > > > protecting existing applications... but I agree, the non-strict mode >> > > > vsock would be unique WRT the usual concept of namespaces. >> > > >> > > Maybe we could do the opposite, enable strict mode by default (I think >> > > it was similar to what I had tried to do with the kernel module in v1, I >> > > was young I know xD) >> > > And provide a way to disable it for those use cases where the user wants >> > > backward compatibility, while paying the cost of less isolation. >> > >> > I think backwards compatible has to be the default behaviour, otherwise >> > the change has too high risk of breaking existing deployments that are >> > already using netns and relying on VSOCK being global. Breakage has to >> > be opt in. >> > >> > > I was thinking two options (not sure if the second one can be done): >> > > >> > > 1. provide a global sysfs/sysctl that disables strict mode, but this >> > > then applies to all namespaces >> > > >> > > 2. provide something that allows disabling strict mode by namespace. >> > > Maybe when it is created there are options, or something that can be >> > > set later. >> > > >> > > 2 would be ideal, but that might be too much, so 1 might be enough. In >> > > any case, 2 could also be a next step. >> > > >> > > WDYT? >> > >> > It occured to me that the problem we face with the CID space usage is >> > somewhat similar to the UID/GID space usage for user namespaces. >> > >> > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to >> > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. >> > >> > At the risk of being overkill, is it worth trying a similar kind of >> > approach for the vsock CID space ? >> > >> > A simple variant would be a /proc/net/vsock_cid_outside specifying a set >> > of CIDs which are exclusively referencing /dev/vhost-vsock associations >> > created outside the namespace. Anything not listed would be exclusively >> > referencing associations created inside the namespace. >> > >> > A more complex variant would be to allow a full remapping of CIDs as is >> > done with userns, via a /proc/net/vsock_cid_map, which the same three >> > parameters, so that CID=15 association outside the namespace could be >> > remapped to CID=9015 inside the namespace, allow the inside namespace >> > to define its out association for CID=15 without clashing. >> > >> > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock >> > associations created outside namespace, while unmapped CIDs would be >> > exclusively referencing /dev/vhost-vsock associations inside the >> > namespace. >> > >> > A likely benefit of relying on a kernel defined mapping/partition of >> > the CID space is that apps like QEMU don't need changing, as there's >> > no need to invent a new /dev/vhost-vsock-netns device node. >> > >> > Both approaches give the desirable security protection whereby the >> > inside namespace can be prevented from accessing certain CIDs that >> > were associated outside the namespace. 
>> > >> > Some rule would need to be defined for updating the /proc/net/vsock_cid_map >> > file as it is the security control mechanism. If it is write-once then >> > if the container mgmt app initializes it, nothing later could change >> > it. >> > >> > A key question is do we need the "first come, first served" behaviour >> > for CIDs where a CID can be arbitrarily used by outside or inside namespace >> > according to whatever tries to associate a CID first ? >> >> I think with /proc/net/vsock_cid_outside, instead of disallowing the CID >> from being used, this could be solved by disallowing remapping the CID >> while in use? >> >> The thing I like about this is that users can check >> /proc/net/vsock_cid_outside to figure out what might be going on, >> instead of trying to check lsof or ps to figure out if the VMM processes >> have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. Yes, although the user in theory should not care about this information, right? I mean I don't even know if it makes sense to expose the contents of /proc/net/vsock_cid_outside in the namespace. >> >> Just to check I am following... I suppose we would have a few typical >> configurations for /proc/net/vsock_cid_outside. Following uid_map file >> format of: >> "<local cid start> <global cid start> <range size>" This seems to relate more to /proc/net/vsock_cid_map, for /proc/net/vsock_cid_outside I think 2 parameters are enough (CID, range), right? >> >> 1. Identity mapping, current namespace CID is global CID (default >> setting for new namespaces): >> >> # empty file >> >> OR >> >> 0 0 4294967295 >> >> 2. Complete isolation from global space (initialized, but no mappings): >> >> 0 0 0 >> >> 3. Mapping in ranges of global CIDs >> >> For example, global CID space starts at 7000, up to 32-bit max: >> >> 7000 0 4294960295 >> >> Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to >> 8000-8100) : >> >> 7000 0 100 >> 8000 1000 100 >> >> >> One thing I don't love is that option 3 seems to not be addressing a >> known use case. It doesn't necessarily hurt to have, but it will add >> complexity to CID handling that might never get used? Yes, as I also mentioned in the previous email, we could also do a step-by-step thing. IMHO we can define /proc/net/vsock_cid_map (with the structure you just defined), but for now only support 1-1 mapping (with the ranges of course, I mean the first two parameters should always be the same) and then add option 3 in the future. >> >> Since options 1/2 could also be represented by a boolean (yes/no >> "current ns shares CID with global"), I wonder if we could either A) >> only support the first two options at first, or B) add just >> /proc/net/vsock_ns_mode at first, which supports only "global" and >> "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside >> or the full mapping if the need arises? I think option A is the same as I meant above :-) >> >> This could also be how we support Option 2 from Stefano's last email of >> supporting per-namespace opt-in/opt-out. Hmm, how can we do it by namespace? Isn't that global? >> >> Any thoughts on this? >> > >Stefano, > >Would only supporting 1/2 still support the Kata use case? I think so, actually I was thinking something similar in the message I just sent. By default (if the file is empty), nothing should change, so that's fine IMO. As Paolo suggested, we absolutely have to have tests to verify these things. Thanks, Stefano
On Thu, Apr 03, 2025 at 11:33:14AM +0200, Stefano Garzarella wrote: > On Wed, Apr 02, 2025 at 03:28:19PM -0700, Bobby Eshleman wrote: > > On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: > > > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: > > > > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote: > > > > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > > > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode, > > > > > > Since it offers the best of both worlds, and still tends conservative in > > > > > > protecting existing applications... but I agree, the non-strict mode > > > > > > vsock would be unique WRT the usual concept of namespaces. > > > > > > > > > > Maybe we could do the opposite, enable strict mode by default (I think > > > > > it was similar to what I had tried to do with the kernel module in v1, I > > > > > was young I know xD) > > > > > And provide a way to disable it for those use cases where the user wants > > > > > backward compatibility, while paying the cost of less isolation. > > > > > > > > I think backwards compatible has to be the default behaviour, otherwise > > > > the change has too high risk of breaking existing deployments that are > > > > already using netns and relying on VSOCK being global. Breakage has to > > > > be opt in. > > > > > > > > > I was thinking two options (not sure if the second one can be done): > > > > > > > > > > 1. provide a global sysfs/sysctl that disables strict mode, but this > > > > > then applies to all namespaces > > > > > > > > > > 2. provide something that allows disabling strict mode by namespace. > > > > > Maybe when it is created there are options, or something that can be > > > > > set later. > > > > > > > > > > 2 would be ideal, but that might be too much, so 1 might be enough. In > > > > > any case, 2 could also be a next step. > > > > > > > > > > WDYT? > > > > > > > > It occured to me that the problem we face with the CID space usage is > > > > somewhat similar to the UID/GID space usage for user namespaces. > > > > > > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to > > > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. > > > > > > > > At the risk of being overkill, is it worth trying a similar kind of > > > > approach for the vsock CID space ? > > > > > > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set > > > > of CIDs which are exclusively referencing /dev/vhost-vsock associations > > > > created outside the namespace. Anything not listed would be exclusively > > > > referencing associations created inside the namespace. > > > > > > > > A more complex variant would be to allow a full remapping of CIDs as is > > > > done with userns, via a /proc/net/vsock_cid_map, which the same three > > > > parameters, so that CID=15 association outside the namespace could be > > > > remapped to CID=9015 inside the namespace, allow the inside namespace > > > > to define its out association for CID=15 without clashing. > > > > > > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock > > > > associations created outside namespace, while unmapped CIDs would be > > > > exclusively referencing /dev/vhost-vsock associations inside the > > > > namespace. 
> > > > > > > > A likely benefit of relying on a kernel defined mapping/partition of > > > > the CID space is that apps like QEMU don't need changing, as there's > > > > no need to invent a new /dev/vhost-vsock-netns device node. > > > > > > > > Both approaches give the desirable security protection whereby the > > > > inside namespace can be prevented from accessing certain CIDs that > > > > were associated outside the namespace. > > > > > > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map > > > > file as it is the security control mechanism. If it is write-once then > > > > if the container mgmt app initializes it, nothing later could change > > > > it. > > > > > > > > A key question is do we need the "first come, first served" behaviour > > > > for CIDs where a CID can be arbitrarily used by outside or inside namespace > > > > according to whatever tries to associate a CID first ? > > > > > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID > > > from being used, this could be solved by disallowing remapping the CID > > > while in use? > > > > > > The thing I like about this is that users can check > > > /proc/net/vsock_cid_outside to figure out what might be going on, > > > instead of trying to check lsof or ps to figure out if the VMM processes > > > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. > > Yes, although the user in theory should not care about this information, > right? > I mean I don't even know if it makes sense to expose the contents of > /proc/net/vsock_cid_outside in the namespace. > > > > > > > Just to check I am following... I suppose we would have a few typical > > > configurations for /proc/net/vsock_cid_outside. Following uid_map file > > > format of: > > > "<local cid start> <global cid start> <range size>" > > This seems to relate more to /proc/net/vsock_cid_map, for > /proc/net/vsock_cid_outside I think 2 parameters are enough > (CID, range), right? > True, yes vsock_cid_map. > > > > > > 1. Identity mapping, current namespace CID is global CID (default > > > setting for new namespaces): > > > > > > # empty file > > > > > > OR > > > > > > 0 0 4294967295 > > > > > > 2. Complete isolation from global space (initialized, but no mappings): > > > > > > 0 0 0 > > > > > > 3. Mapping in ranges of global CIDs > > > > > > For example, global CID space starts at 7000, up to 32-bit max: > > > > > > 7000 0 4294960295 > > > > > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to > > > 8000-8100) : > > > > > > 7000 0 100 > > > 8000 1000 100 > > > > > > > > > One thing I don't love is that option 3 seems to not be addressing a > > > known use case. It doesn't necessarily hurt to have, but it will add > > > complexity to CID handling that might never get used? > > Yes, as I also mentioned in the previous email, we could also do a > step-by-step thing. > > IMHO we can define /proc/net/vsock_cid_map (with the structure you just > defined), but for now only support 1-1 mapping (with the ranges of > course, I mean the first two parameters should always be the same) and > then add option 3 in the future. > makes sense, sgtm! 
> > >
> > > Since options 1/2 could also be represented by a boolean (yes/no
> > > "current ns shares CID with global"), I wonder if we could either A)
> > > only support the first two options at first, or B) add just
> > > /proc/net/vsock_ns_mode at first, which supports only "global" and
> > > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> > > or the full mapping if the need arises?
>
> I think option A is the same as I meant above :-)
>

Indeed.

> > >
> > > This could also be how we support Option 2 from Stefano's last email of
> > > supporting per-namespace opt-in/opt-out.
>
> Hmm, how can we do it by namespace? Isn't that global?
>

I think the file path is global but the contents are tied per-namespace,
according to the namespace of the process that called open() on it. This
way the container mgr can write-once lock it, and the namespace processes
can read it?

> > >
> > > Any thoughts on this?
> > >
> >
> > Stefano,
> >
> > Would only supporting 1/2 still support the Kata use case?
>
> I think so, actually I was thinking something similar in the message I just
> sent.
>
> By default (if the file is empty), nothing should change, so that's fine
> IMO. As Paolo suggested, we absolutely have to have tests to verify these
> things.
>

Sounds like a plan! I'm working on the new vsock vmtest now and will
include the new tests in the next rev.

Also, I'm thinking we should protect vsock_cid_map behind a capability,
but I'm not sure which one is correct (CAP_NET_ADMIN?). WDYT?

Thanks!
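As a concrete reference for the capability question raised above, a
minimal sketch of what the write-path gate could look like; the
write-once flag helper (vsock_net_cid_map_locked()) is invented for this
sketch and nothing here is from the series:

/* Sketch only: gate writes to /proc/net/vsock_cid_map on CAP_NET_ADMIN in
 * the user namespace owning the netns, with write-once semantics so the
 * container manager can lock the map down before handing off the ns.
 */
static int vsock_cid_map_may_write(struct net *net)
{
	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
		return -EPERM;

	if (vsock_net_cid_map_locked(net))	/* hypothetical write-once flag */
		return -EBUSY;

	return 0;
}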
On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote: > CCing Daniel > > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote: > > Picking up Stefano's v1 [1], this series adds netns support to > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h) > > namespaces, defering that for future implementation and discussion. > > > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a > > "scoped" vsock, accessible only to sockets in its namespace. If a global > > vsock or scoped vsock share the same CID, the scoped vsock takes > > precedence. > > > > If a socket in a namespace connects with a global vsock, the CID becomes > > unavailable to any VMM in that namespace when creating new vsocks. If > > disconnected, the CID becomes available again. > > I was talking about this feature with Daniel and he pointed out something > interesting (Daniel please feel free to correct me): > > If we have a process in the host that does a listen(AF_VSOCK) in a > namespace, can this receive connections from guests connected to > /dev/vhost-vsock in any namespace? > > Should we provide something (e.g. sysctl/sysfs entry) to disable > this behaviour, preventing a process in a namespace from receiving > connections from the global vsock address space (i.e. /dev/vhost-vsock > VMs)? > > I understand that by default maybe we should allow this behaviour in order > to not break current applications, but in some cases the user may want to > isolate sockets in a namespace also from being accessed by VMs running in > the global vsock address space. > Adding this strict namespace mode makes sense to me, and I think the sysctl/sysfs approach works well to minimize application changes. The approach we were taking was to only allow /dev/vhost-vsock-netns (no global /dev/vhost-vsock mixed in on the system), but adding the explicit system-wide option I think improves the overall security posture of g2h connections. > Indeed in this series we have talked mostly about the host -> guest path (as > the direction of the connection), but little about the guest -> host path, > maybe we should explain it better in the cover/commit > descriptions/documentation. > Sounds good! Best, Bobby
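For the sysctl idea mentioned above, the usual per-netns pattern would be
roughly the following; the vsock_strict field, the header pointer, and the
knob name are invented for this sketch, and the exact ctl_table
registration details depend on the target kernel version:

/* Sketch of a per-netns knob, e.g. /proc/sys/net/vsock/strict_netns:
 * 0 = legacy behaviour (default), 1 = sockets in this netns are not
 * reachable from guests attached to the global /dev/vhost-vsock.
 * Exit/unregister path omitted for brevity.
 */
static struct ctl_table vsock_ns_table[] = {
	{
		.procname	= "strict_netns",
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
};

static int __net_init vsock_sysctl_net_init(struct net *net)
{
	struct ctl_table *tbl;

	tbl = kmemdup(vsock_ns_table, sizeof(vsock_ns_table), GFP_KERNEL);
	if (!tbl)
		return -ENOMEM;
	tbl[0].data = &net->vsock_strict;		/* hypothetical per-netns field */

	/* hypothetical field to stash the header for later unregistering */
	net->vsock_sysctl_hdr = register_net_sysctl_sz(net, "net/vsock", tbl,
						       ARRAY_SIZE(vsock_ns_table));
	if (!net->vsock_sysctl_hdr) {
		kfree(tbl);
		return -ENOMEM;
	}
	return 0;
}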
On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> Picking up Stefano's v1 [1], this series adds netns support to
> vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> namespaces, defering that for future implementation and discussion.
>
> Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> "scoped" vsock, accessible only to sockets in its namespace. If a global
> vsock or scoped vsock share the same CID, the scoped vsock takes
> precedence.
>
> If a socket in a namespace connects with a global vsock, the CID becomes
> unavailable to any VMM in that namespace when creating new vsocks. If
> disconnected, the CID becomes available again.
yea that's a sane way to do it.
Thanks!
> Testing
>
> QEMU with /dev/vhost-vsock-netns support:
> https://github.com/beshleman/qemu/tree/vsock-netns
>
> Test: Scoped vsocks isolated by namespace
>
> host# ip netns add ns1
> host# ip netns add ns2
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> host# ip netns exec ns2 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> host# socat - VSOCK-CONNECT:15:1234
> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>
> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>
> vm1# socat - VSOCK-LISTEN:1234
> foobar1
> vm2# socat - VSOCK-LISTEN:1234
> foobar2
>
> Test: Global vsocks accessible to any namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> vm# socat - VSOCK-LISTEN:1234
> foobar
>
> Test: Connecting to global vsock makes CID unavailble to namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm# socat - VSOCK-LISTEN:1234
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
> ---
> Changes in v2:
> - only support vhost-vsock namespaces
> - all g2h namespaces retain old behavior, only common API changes
> impacted by vhost-vsock changes
> - add /dev/vhost-vsock-netns for "opt-in"
> - leave /dev/vhost-vsock to old behavior
> - removed netns module param
> - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>
> Changes in v1:
> - added 'netns' module param to vsock.ko to enable the
> network namespace support (disabled by default)
> - added 'vsock_net_eq()' to check the "net" assigned to a socket
> only when 'netns' support is enabled
> - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>
> ---
> Stefano Garzarella (3):
> vsock: add network namespace support
> vsock/virtio_transport_common: handle netns of received packets
> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>
> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
> include/linux/miscdevice.h | 1 +
> include/linux/virtio_vsock.h | 2 +
> include/net/af_vsock.h | 10 ++--
> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 +-
> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
> net/vmw_vsock/vmci_transport.c | 4 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 10 files changed, 180 insertions(+), 43 deletions(-)
> ---
> base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
> change-id: 20250312-vsock-netns-45da9424f726
>
> Best regards,
> --
> Bobby Eshleman <bobbyeshleman@gmail.com>
On Fri, Mar 21, 2025 at 03:49:38PM -0400, Michael S. Tsirkin wrote: > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote: > > Picking up Stefano's v1 [1], this series adds netns support to > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h) > > namespaces, defering that for future implementation and discussion. > > > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a > > "scoped" vsock, accessible only to sockets in its namespace. If a global > > vsock or scoped vsock share the same CID, the scoped vsock takes > > precedence. > > > > If a socket in a namespace connects with a global vsock, the CID becomes > > unavailable to any VMM in that namespace when creating new vsocks. If > > disconnected, the CID becomes available again. > > > yea that's a sane way to do it. > Thanks! > Sgtm, thank you! Best, Bobby
Hey all,
Apologies for forgetting the 'net-next' prefix on this one. Should I
resend or no?
Best,
Bobby
On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> Picking up Stefano's v1 [1], this series adds netns support to
> vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> namespaces, defering that for future implementation and discussion.
>
> Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> "scoped" vsock, accessible only to sockets in its namespace. If a global
> vsock or scoped vsock share the same CID, the scoped vsock takes
> precedence.
>
> If a socket in a namespace connects with a global vsock, the CID becomes
> unavailable to any VMM in that namespace when creating new vsocks. If
> disconnected, the CID becomes available again.
>
> Testing
>
> QEMU with /dev/vhost-vsock-netns support:
> https://github.com/beshleman/qemu/tree/vsock-netns
>
> Test: Scoped vsocks isolated by namespace
>
> host# ip netns add ns1
> host# ip netns add ns2
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> host# ip netns exec ns2 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> host# socat - VSOCK-CONNECT:15:1234
> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>
> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>
> vm1# socat - VSOCK-LISTEN:1234
> foobar1
> vm2# socat - VSOCK-LISTEN:1234
> foobar2
>
> Test: Global vsocks accessible to any namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> vm# socat - VSOCK-LISTEN:1234
> foobar
>
> Test: Connecting to global vsock makes CID unavailble to namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm# socat - VSOCK-LISTEN:1234
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
> ---
> Changes in v2:
> - only support vhost-vsock namespaces
> - all g2h namespaces retain old behavior, only common API changes
> impacted by vhost-vsock changes
> - add /dev/vhost-vsock-netns for "opt-in"
> - leave /dev/vhost-vsock to old behavior
> - removed netns module param
> - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>
> Changes in v1:
> - added 'netns' module param to vsock.ko to enable the
> network namespace support (disabled by default)
> - added 'vsock_net_eq()' to check the "net" assigned to a socket
> only when 'netns' support is enabled
> - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>
> ---
> Stefano Garzarella (3):
> vsock: add network namespace support
> vsock/virtio_transport_common: handle netns of received packets
> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>
> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
> include/linux/miscdevice.h | 1 +
> include/linux/virtio_vsock.h | 2 +
> include/net/af_vsock.h | 10 ++--
> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 +-
> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
> net/vmw_vsock/vmci_transport.c | 4 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 10 files changed, 180 insertions(+), 43 deletions(-)
> ---
> base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
> change-id: 20250312-vsock-netns-45da9424f726
>
> Best regards,
> --
> Bobby Eshleman <bobbyeshleman@gmail.com>
>
Hi Bobby,
first of all, thank you for starting this work again!
On Wed, Mar 12, 2025 at 07:28:33PM -0700, Bobby Eshleman wrote:
>Hey all,
>
>Apologies for forgetting the 'net-next' prefix on this one. Should I
>resend or no?
I'd say let's do a first review cycle on this, then you can re-post.
Please also check the maintainers cc'ed; it looks like someone is missing:
https://patchwork.kernel.org/project/netdevbpf/patch/20250312-vsock-netns-v2-1-84bffa1aa97a@gmail.com/
>
>Best,
>Bobby
>
>On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
>> Picking up Stefano's v1 [1], this series adds netns support to
>> vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
>> namespaces, defering that for future implementation and discussion.
>>
>> Any vsock created with /dev/vhost-vsock is a global vsock, accessible
>> from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
>> "scoped" vsock, accessible only to sockets in its namespace. If a global
>> vsock or scoped vsock share the same CID, the scoped vsock takes
>> precedence.
This is inside the netns, right?
I mean if we are in a netns, and there is a VM A attached to
/dev/vhost-vsock-netns with CID=42 and a VM B attached to
/dev/vhost-vsock also with CID=42, this means that VM A will not be
accessible in the netns, but it can be accessible outside of the netns,
right?
>>
>> If a socket in a namespace connects with a global vsock, the CID becomes
>> unavailable to any VMM in that namespace when creating new vsocks. If
>> disconnected, the CID becomes available again.
IIUC if an application on the host running in a netns is connected to a
guest attached to /dev/vhost-vsock (e.g. CID=42), a new guest can't ask
for the same CID (42) on /dev/vhost-vsock-netns in the same netns while
that connection is active. Is that right?
>>
>> Testing
>>
>> QEMU with /dev/vhost-vsock-netns support:
>> https://github.com/beshleman/qemu/tree/vsock-netns
You can also use unmodified QEMU using the `vhostfd` parameter of the
`vhost-vsock-pci` device:
# FD will contain the file descriptor to /dev/vhost-vsock-netns
exec {FD}<>/dev/vhost-vsock-netns
# pass FD to the device, this is used for example by libvirt
qemu-system-x86_64 -smp 2 -M q35,accel=kvm,memory-backend=mem \
-drive file=fedora.qcow2,format=qcow2,if=virtio \
-object memory-backend-memfd,id=mem,size=512M \
-device vhost-vsock-pci,vhostfd=${FD},guest-cid=42 -nographic
That said, I agree we can extend QEMU with `netns` param too.
BTW, I'm traveling, I'll be back next Tuesday and I hope to take a
deeper look at the patches.
Thanks,
Stefano
>>
>> Test: Scoped vsocks isolated by namespace
>>
>> host# ip netns add ns1
>> host# ip netns add ns2
>> host# ip netns exec ns1 \
>> qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE1} \
>> -device
>> vhost-vsock-pci,netns=on,guest-cid=15
>> host# ip netns exec ns2 \
>> qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE2} \
>> -device vhost-vsock-pci,netns=on,guest-cid=15
>>
>> host# socat - VSOCK-CONNECT:15:1234
>> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>>
>> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>>
>> vm1# socat - VSOCK-LISTEN:1234
>> foobar1
>> vm2# socat - VSOCK-LISTEN:1234
>> foobar2
>>
>> Test: Global vsocks accessible to any namespace
>>
>> host# qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE2} \
>> -device vhost-vsock-pci,guest-cid=15,netns=off
>>
>> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>>
>> vm# socat - VSOCK-LISTEN:1234
>> foobar
>>
>> Test: Connecting to global vsock makes CID unavailble to namespace
>>
>> host# qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE2} \
>> -device vhost-vsock-pci,guest-cid=15,netns=off
>>
>> vm# socat - VSOCK-LISTEN:1234
>>
>> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>> host# ip netns exec ns1 \
>> qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE1} \
>> -device vhost-vsock-pci,netns=on,guest-cid=15
>>
>> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>>
>> Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
>> ---
>> Changes in v2:
>> - only support vhost-vsock namespaces
>> - all g2h namespaces retain old behavior, only common API changes
>> impacted by vhost-vsock changes
>> - add /dev/vhost-vsock-netns for "opt-in"
>> - leave /dev/vhost-vsock to old behavior
>> - removed netns module param
>> - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>>
>> Changes in v1:
>> - added 'netns' module param to vsock.ko to enable the
>> network namespace support (disabled by default)
>> - added 'vsock_net_eq()' to check the "net" assigned to a socket
>> only when 'netns' support is enabled
>> - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>>
>> ---
>> Stefano Garzarella (3):
>> vsock: add network namespace support
>> vsock/virtio_transport_common: handle netns of received packets
>> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>>
>> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
>> include/linux/miscdevice.h | 1 +
>> include/linux/virtio_vsock.h | 2 +
>> include/net/af_vsock.h | 10 ++--
>> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
>> net/vmw_vsock/hyperv_transport.c | 2 +-
>> net/vmw_vsock/virtio_transport.c | 5 +-
>> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
>> net/vmw_vsock/vmci_transport.c | 4 +-
>> net/vmw_vsock/vsock_loopback.c | 4 +-
>> 10 files changed, 180 insertions(+), 43 deletions(-)
>> ---
>> base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
>> change-id: 20250312-vsock-netns-45da9424f726
>>
>> Best regards,
>> --
>> Bobby Eshleman <bobbyeshleman@gmail.com>
>>
>
On Thu, Mar 13, 2025 at 04:37:16PM +0100, Stefano Garzarella wrote:
> Hi Bobby,
> first of all, thank you for starting this work again!
>
You're welcome, thank you for your work getting it started!
> On Wed, Mar 12, 2025 at 07:28:33PM -0700, Bobby Eshleman wrote:
> > Hey all,
> >
> > Apologies for forgetting the 'net-next' prefix on this one. Should I
> > resend or no?
>
> I'd say let's do a firts review cycle on this, then you can re-post.
> Please check also maintainer cced, it looks like someone is missing:
> https://patchwork.kernel.org/project/netdevbpf/patch/20250312-vsock-netns-v2-1-84bffa1aa97a@gmail.com/
>
Duly noted, I'll double-check the ccs next time. sgtm on the re-post!
> > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > > Picking up Stefano's v1 [1], this series adds netns support to
> > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > > namespaces, defering that for future implementation and discussion.
> > >
> > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > > vsock or scoped vsock share the same CID, the scoped vsock takes
> > > precedence.
>
> This inside the netns, right?
> I mean if we are in a netns, and there is a VM A attached to
> /dev/vhost-vsock-netns witch CID=42 and a VM B attached to /dev/vhost-vsock
> also with CID=42, this means that VM A will not be accessible in the netns,
> but it can be accessible outside of the netns,
> right?
>
In this scenario, CID=42 goes to VM A (/dev/vhost-vsock-netns) for any
socket in its namespace. For any other namespace, CID=42 will go to VM
B (/dev/vhost-vsock).
If I understand your setup correctly:
Namespace 1:
VM A - /dev/vhost-vsock-netns, CID=42
Process X
Namespace 2:
VM B - /dev/vhost-vsock, CID=42
Process Y
Namespace 3:
Process Z
In this scenario, taking connect() as an example:
Process X connect(CID=42) goes to VM A
Process Y connect(CID=42) goes to VM B
Process Z connect(CID=42) goes to VM B
If VM A goes away (migration, shutdown, etc...):
Process X connect(CID=42) also goes to VM B
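In other words, the resolution order is "scoped first, then global". A
rough sketch of that shape, where both lookup helpers are invented names
and not the actual functions used by the series:

/* Pick the vhost-vsock instance a connect(CID=cid) from @net should reach:
 * a vsock created via /dev/vhost-vsock-netns in this netns wins, otherwise
 * fall back to a global /dev/vhost-vsock one (e.g. after VM A goes away).
 */
static struct vhost_vsock *vhost_vsock_get_for_cid(struct net *net, u32 cid)
{
	struct vhost_vsock *vsock;

	vsock = vhost_vsock_find_scoped(net, cid);	/* /dev/vhost-vsock-netns */
	if (vsock)
		return vsock;

	return vhost_vsock_find_global(cid);		/* /dev/vhost-vsock */
}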
> > >
> > > If a socket in a namespace connects with a global vsock, the CID becomes
> > > unavailable to any VMM in that namespace when creating new vsocks. If
> > > disconnected, the CID becomes available again.
>
> IIUC if an application in the host running in a netns, is connected to a
> guest attached to /dev/vhost-vsock (e.g. CID=42), a new guest can't be ask
> for the same CID (42) on /dev/vhost-vsock-netns in the same netns till that
> connection is active. Is that right?
>
Right. Here is the scenario I am trying to avoid:
Step 1: namespace 1, VM A allocated with CID 42 on /dev/vhost-vsock
Step 2: namespace 2, connect(CID=42) (this is legal, preserves old
behavior)
Step 3: namespace 2, VM B allocated with CID 42 on
/dev/vhost-vsock-netns
After step 3, CID=42 in this current namespace should belong to VM B, but
the connection from step 2 would be with VM A.
I think we have some options:
1. disallow the new VM B because the namespace is already active with VM A
2. try to allow the connection to resume, but make sure that new
connections go to VM B
3. close the connection from namespace 2, spin up VM B, hope user
manages connection retry
4. auto-retry connect to the new VM B? (seems like doing too much on the
kernel side to me)
I chose option 1 for this rev mostly for simplicity, but I'm definitely
open to suggestions. I think option 3 would also be simple to implement.
Option 2 would require adding some concept of "vhost-vsock ns at time of
connection" to each socket, so the transport would know which vhost_vsock
to use for which socket.
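For option 1, the check at guest-CID assignment time would be roughly of
this shape; the helper that looks for live connections from this netns to
a global vsock with that CID is hypothetical, and this is not the actual
code from the series:

/* Option 1 sketch: refuse VHOST_VSOCK_SET_GUEST_CID on the scoped device
 * if a socket in this netns is currently connected to a global vsock
 * using the same CID. QEMU would then see "Address already in use".
 */
static int vhost_vsock_scoped_cid_ok(struct net *net, u32 guest_cid)
{
	if (vsock_net_has_global_conn(net, guest_cid))	/* hypothetical */
		return -EADDRINUSE;

	return 0;
}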
> > >
> > > Testing
> > >
> > > QEMU with /dev/vhost-vsock-netns support:
> > > https://github.com/beshleman/qemu/tree/vsock-netns
>
> You can also use unmodified QEMU using `vhostfd` parameter of
> `vhost-vsock-pci` device:
>
> # FD will contain the file descriptor to /dev/vhost-vsock-netns
> exec {FD}<>/dev/vhost-vsock-netns
>
> # pass FD to the device, this is used for example by libvirt
> qemu-system-x86_64 -smp 2 -M q35,accel=kvm,memory-backend=mem \
> -drive file=fedora.qcow2,format=qcow2,if=virtio \
> -object memory-backend-memfd,id=mem,size=512M \
> -device vhost-vsock-pci,vhostfd=${FD},guest-cid=42 -nographic
>
Very nice, thanks, I didn't realize that!
> That said, I agree we can extend QEMU with `netns` param too.
>
I'm open to either. Your solution above is super elegant.
> BTW, I'm traveling, I'll be back next Tuesday and I hope to take a deeper
> look to the patches.
>
> Thanks,
> Stefano
>
Thanks Stefano! Enjoy the travel.
Best,
Bobby