Picking up Stefano's v1 [1], this series adds netns support to
vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
namespaces, deferring that for future implementation and discussion.
Any vsock created with /dev/vhost-vsock is a global vsock, accessible
from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
"scoped" vsock, accessible only to sockets in its namespace. If a global
vsock and a scoped vsock share the same CID, the scoped vsock takes
precedence.
If a socket in a namespace connects with a global vsock, the CID becomes
unavailable to any VMM in that namespace when creating new vsocks. If
disconnected, the CID becomes available again.
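To make the precedence rule concrete, here is a tiny self-contained toy
model (not code from the series; the table, helper, and namespace ids are
invented purely for illustration): a lookup prefers a vsock bound in the
caller's namespace and only falls back to the global address space.

  #include <stddef.h>
  #include <stdio.h>

  /* Toy model: netns_id == 0 stands for the global address space
   * (/dev/vhost-vsock), non-zero for a scoped vsock created via
   * /dev/vhost-vsock-netns. */
  struct toy_vsock {
          unsigned int cid;
          unsigned int netns_id;
  };

  static const struct toy_vsock *toy_lookup(const struct toy_vsock *tbl,
                                            size_t n, unsigned int cid,
                                            unsigned int caller_ns)
  {
          const struct toy_vsock *global_match = NULL;

          for (size_t i = 0; i < n; i++) {
                  if (tbl[i].cid != cid)
                          continue;
                  if (tbl[i].netns_id == caller_ns)
                          return &tbl[i];         /* scoped match wins */
                  if (tbl[i].netns_id == 0)
                          global_match = &tbl[i]; /* remember global fallback */
          }
          return global_match;
  }

  int main(void)
  {
          const struct toy_vsock tbl[] = {
                  { .cid = 15, .netns_id = 0 },   /* global VM with CID 15 */
                  { .cid = 15, .netns_id = 1 },   /* scoped VM in ns1, also CID 15 */
          };

          /* ns1 resolves CID 15 to its scoped VM; ns2 falls back to the global VM. */
          printf("ns1 -> netns %u\n", toy_lookup(tbl, 2, 15, 1)->netns_id);
          printf("ns2 -> netns %u\n", toy_lookup(tbl, 2, 15, 2)->netns_id);
          return 0;
  }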
Testing
QEMU with /dev/vhost-vsock-netns support:
https://github.com/beshleman/qemu/tree/vsock-netns
Test: Scoped vsocks isolated by namespace
host# ip netns add ns1
host# ip netns add ns2
host# ip netns exec ns1 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE1} \
-device vhost-vsock-pci,netns=on,guest-cid=15
host# ip netns exec ns2 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,netns=on,guest-cid=15
host# socat - VSOCK-CONNECT:15:1234
2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
vm1# socat - VSOCK-LISTEN:1234
foobar1
vm2# socat - VSOCK-LISTEN:1234
foobar2
Test: Global vsocks accessible to any namespace
host# qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,guest-cid=15,netns=off
host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
vm# socat - VSOCK-LISTEN:1234
foobar
Test: Connecting to global vsock makes CID unavailable to namespace
host# qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,guest-cid=15,netns=off
vm# socat - VSOCK-LISTEN:1234
host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
host# ip netns exec ns1 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE1} \
-device vhost-vsock-pci,netns=on,guest-cid=15
qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
---
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Stefano Garzarella (3):
vsock: add network namespace support
vsock/virtio_transport_common: handle netns of received packets
vhost/vsock: use netns of process that opens the vhost-vsock-netns device
drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
include/linux/miscdevice.h | 1 +
include/linux/virtio_vsock.h | 2 +
include/net/af_vsock.h | 10 ++--
net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
net/vmw_vsock/hyperv_transport.c | 2 +-
net/vmw_vsock/virtio_transport.c | 5 +-
net/vmw_vsock/virtio_transport_common.c | 14 ++++-
net/vmw_vsock/vmci_transport.c | 4 +-
net/vmw_vsock/vsock_loopback.c | 4 +-
10 files changed, 180 insertions(+), 43 deletions(-)
---
base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
change-id: 20250312-vsock-netns-45da9424f726
Best regards,
--
Bobby Eshleman <bobbyeshleman@gmail.com>
CCing Daniel
On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
>Picking up Stefano's v1 [1], this series adds netns support to
>vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
>namespaces, defering that for future implementation and discussion.
>
>Any vsock created with /dev/vhost-vsock is a global vsock, accessible
>from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
>"scoped" vsock, accessible only to sockets in its namespace. If a global
>vsock or scoped vsock share the same CID, the scoped vsock takes
>precedence.
>
>If a socket in a namespace connects with a global vsock, the CID becomes
>unavailable to any VMM in that namespace when creating new vsocks. If
>disconnected, the CID becomes available again.
I was talking about this feature with Daniel and he pointed out
something interesting (Daniel please feel free to correct me):
If we have a process in the host that does a listen(AF_VSOCK) in a
namespace, can this receive connections from guests connected to
/dev/vhost-vsock in any namespace?
Should we provide something (e.g. sysctl/sysfs entry) to disable
this behaviour, preventing a process in a namespace from receiving
connections from the global vsock address space (i.e.
/dev/vhost-vsock VMs)?
I understand that by default maybe we should allow this behaviour in
order to not break current applications, but in some cases the user may
want to isolate sockets in a namespace also from being accessed by VMs
running in the global vsock address space.
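For concreteness, the listener in that scenario is just an ordinary
AF_VSOCK server like the sketch below (standard AF_VSOCK socket API; the
port number is arbitrary, and it would be run under e.g. "ip netns exec
ns1" to place it in a namespace):

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <linux/vm_sockets.h>

  int main(void)
  {
          struct sockaddr_vm addr;
          int fd, conn;

          fd = socket(AF_VSOCK, SOCK_STREAM, 0);

          memset(&addr, 0, sizeof(addr));
          addr.svm_family = AF_VSOCK;
          addr.svm_cid = VMADDR_CID_ANY;
          addr.svm_port = 1234;

          if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
              listen(fd, 1) < 0) {
                  perror("vsock listen");
                  return 1;
          }

          /* The open question: when this runs inside a netns, can the
           * connection come from a guest attached to the global
           * /dev/vhost-vsock, or only from VMs scoped to this netns? */
          conn = accept(fd, NULL, NULL);
          if (conn < 0) {
                  perror("accept");
                  return 1;
          }

          printf("accepted vsock connection\n");
          close(conn);
          close(fd);
          return 0;
  }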
Indeed in this series we have talked mostly about the host -> guest path
(as the direction of the connection), but little about the guest -> host
path, maybe we should explain it better in the cover/commit
descriptions/documentation.
Thanks,
Stefano
>
>Testing
>
>QEMU with /dev/vhost-vsock-netns support:
> https://github.com/beshleman/qemu/tree/vsock-netns
>
>Test: Scoped vsocks isolated by namespace
>
> host# ip netns add ns1
> host# ip netns add ns2
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> host# ip netns exec ns2 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> host# socat - VSOCK-CONNECT:15:1234
> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>
> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>
> vm1# socat - VSOCK-LISTEN:1234
> foobar1
> vm2# socat - VSOCK-LISTEN:1234
> foobar2
>
>Test: Global vsocks accessible to any namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> vm# socat - VSOCK-LISTEN:1234
> foobar
>
>Test: Connecting to global vsock makes CID unavailble to namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm# socat - VSOCK-LISTEN:1234
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
>Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
>---
>Changes in v2:
>- only support vhost-vsock namespaces
>- all g2h namespaces retain old behavior, only common API changes
> impacted by vhost-vsock changes
>- add /dev/vhost-vsock-netns for "opt-in"
>- leave /dev/vhost-vsock to old behavior
>- removed netns module param
>- Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>
>Changes in v1:
>- added 'netns' module param to vsock.ko to enable the
> network namespace support (disabled by default)
>- added 'vsock_net_eq()' to check the "net" assigned to a socket
> only when 'netns' support is enabled
>- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>
>---
>Stefano Garzarella (3):
> vsock: add network namespace support
> vsock/virtio_transport_common: handle netns of received packets
> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>
> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
> include/linux/miscdevice.h | 1 +
> include/linux/virtio_vsock.h | 2 +
> include/net/af_vsock.h | 10 ++--
> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 +-
> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
> net/vmw_vsock/vmci_transport.c | 4 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 10 files changed, 180 insertions(+), 43 deletions(-)
>---
>base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
>change-id: 20250312-vsock-netns-45da9424f726
>
>Best regards,
>--
>Bobby Eshleman <bobbyeshleman@gmail.com>
>
On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote:
> CCing Daniel
>
> On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > Picking up Stefano's v1 [1], this series adds netns support to
> > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > namespaces, defering that for future implementation and discussion.
> >
> > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > vsock or scoped vsock share the same CID, the scoped vsock takes
> > precedence.
> >
> > If a socket in a namespace connects with a global vsock, the CID becomes
> > unavailable to any VMM in that namespace when creating new vsocks. If
> > disconnected, the CID becomes available again.
>
> I was talking about this feature with Daniel and he pointed out something
> interesting (Daniel please feel free to correct me):
>
> If we have a process in the host that does a listen(AF_VSOCK) in a
> namespace, can this receive connections from guests connected to
> /dev/vhost-vsock in any namespace?
>
> Should we provide something (e.g. sysctl/sysfs entry) to disable
> this behaviour, preventing a process in a namespace from receiving
> connections from the global vsock address space (i.e. /dev/vhost-vsock
> VMs)?
I think my concern goes a bit beyond that, to the general conceptual
idea of sharing the CID space between the global vsocks and namespace
vsocks. So I'm not sure a sysctl would be sufficient...details later
below..
> I understand that by default maybe we should allow this behaviour in order
> to not break current applications, but in some cases the user may want to
> isolate sockets in a namespace also from being accessed by VMs running in
> the global vsock address space.
>
> Indeed in this series we have talked mostly about the host -> guest path (as
> the direction of the connection), but little about the guest -> host path,
> maybe we should explain it better in the cover/commit
> descriptions/documentation.
> > Testing
> >
> > QEMU with /dev/vhost-vsock-netns support:
> > https://github.com/beshleman/qemu/tree/vsock-netns
> >
> > Test: Scoped vsocks isolated by namespace
> >
> > host# ip netns add ns1
> > host# ip netns add ns2
> > host# ip netns exec ns1 \
> > qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE1} \
> > -device vhost-vsock-pci,netns=on,guest-cid=15
> > host# ip netns exec ns2 \
> > qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE2} \
> > -device vhost-vsock-pci,netns=on,guest-cid=15
> >
> > host# socat - VSOCK-CONNECT:15:1234
> > 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
> >
> > host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
> >
> > vm1# socat - VSOCK-LISTEN:1234
> > foobar1
> > vm2# socat - VSOCK-LISTEN:1234
> > foobar2
> >
> > Test: Global vsocks accessible to any namespace
> >
> > host# qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE2} \
> > -device vhost-vsock-pci,guest-cid=15,netns=off
> >
> > host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> >
> > vm# socat - VSOCK-LISTEN:1234
> > foobar
> >
> > Test: Connecting to global vsock makes CID unavailble to namespace
> >
> > host# qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE2} \
> > -device vhost-vsock-pci,guest-cid=15,netns=off
> >
> > vm# socat - VSOCK-LISTEN:1234
> >
> > host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > host# ip netns exec ns1 \
> > qemu-system-x86_64 \
> > -m 8G -smp 4 -cpu host -enable-kvm \
> > -serial mon:stdio \
> > -drive if=virtio,file=${IMAGE1} \
> > -device vhost-vsock-pci,netns=on,guest-cid=15
> >
> > qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
I find it conceptually quite unsettling that the VSOCK CID address
space for AF_VSOCK is shared between the host and the namespace.
That feels contrary to how namespaces are more commonly used for
deterministically isolating resources between the namespace and the
host.
Naively I would expect that in a namespace, all VSOCK CIDs are
free for use, without having to concern yourself with what CIDs
are in use in the host now, or in future.
What happens if we reverse the QEMU order above, to get the
following scenario
# Launch VM1 inside the NS
host# ip netns exec ns1 \
qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE1} \
-device vhost-vsock-pci,netns=on,guest-cid=15
# Launch VM2
host# qemu-system-x86_64 \
-m 8G -smp 4 -cpu host -enable-kvm \
-serial mon:stdio \
-drive if=virtio,file=${IMAGE2} \
-device vhost-vsock-pci,guest-cid=15,netns=off
vm1# socat - VSOCK-LISTEN:1234
vm2# socat - VSOCK-LISTEN:1234
host# socat - VSOCK-CONNECT:15:1234
=> Presume this connects to "VM2" running outside the NS
host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
=> Does this connect to "VM1" inside the NS, or "VM2"
outside the NS ?
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Apr 01, 2025 at 08:05:16PM +0100, Daniel P. Berrangé wrote:
> On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote:
> > CCing Daniel
> >
> > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > > Picking up Stefano's v1 [1], this series adds netns support to
> > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > > namespaces, defering that for future implementation and discussion.
> > >
> > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > > vsock or scoped vsock share the same CID, the scoped vsock takes
> > > precedence.
> > >
> > > If a socket in a namespace connects with a global vsock, the CID becomes
> > > unavailable to any VMM in that namespace when creating new vsocks. If
> > > disconnected, the CID becomes available again.
> >
> > I was talking about this feature with Daniel and he pointed out something
> > interesting (Daniel please feel free to correct me):
> >
> > If we have a process in the host that does a listen(AF_VSOCK) in a
> > namespace, can this receive connections from guests connected to
> > /dev/vhost-vsock in any namespace?
> >
> > Should we provide something (e.g. sysctl/sysfs entry) to disable
> > this behaviour, preventing a process in a namespace from receiving
> > connections from the global vsock address space (i.e. /dev/vhost-vsock
> > VMs)?
>
> I think my concern goes a bit beyond that, to the general conceptual
> idea of sharing the CID space between the global vsocks and namespace
> vsocks. So I'm not sure a sysctl would be sufficient...details later
> below..
>
> > I understand that by default maybe we should allow this behaviour in order
> > to not break current applications, but in some cases the user may want to
> > isolate sockets in a namespace also from being accessed by VMs running in
> > the global vsock address space.
> >
> > Indeed in this series we have talked mostly about the host -> guest path (as
> > the direction of the connection), but little about the guest -> host path,
> > maybe we should explain it better in the cover/commit
> > descriptions/documentation.
>
> > > Testing
> > >
> > > QEMU with /dev/vhost-vsock-netns support:
> > > https://github.com/beshleman/qemu/tree/vsock-netns
> > >
> > > Test: Scoped vsocks isolated by namespace
> > >
> > > host# ip netns add ns1
> > > host# ip netns add ns2
> > > host# ip netns exec ns1 \
> > > qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE1} \
> > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > host# ip netns exec ns2 \
> > > qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE2} \
> > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > >
> > > host# socat - VSOCK-CONNECT:15:1234
> > > 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
> > >
> > > host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
> > >
> > > vm1# socat - VSOCK-LISTEN:1234
> > > foobar1
> > > vm2# socat - VSOCK-LISTEN:1234
> > > foobar2
> > >
> > > Test: Global vsocks accessible to any namespace
> > >
> > > host# qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE2} \
> > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > >
> > > host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > >
> > > vm# socat - VSOCK-LISTEN:1234
> > > foobar
> > >
> > > Test: Connecting to global vsock makes CID unavailble to namespace
> > >
> > > host# qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE2} \
> > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > >
> > > vm# socat - VSOCK-LISTEN:1234
> > >
> > > host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > host# ip netns exec ns1 \
> > > qemu-system-x86_64 \
> > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > -serial mon:stdio \
> > > -drive if=virtio,file=${IMAGE1} \
> > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > >
> > > qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
> I find it conceptually quite unsettling that the VSOCK CID address
> space for AF_VSOCK is shared between the host and the namespace.
> That feels contrary to how namespaces are more commonly used for
> deterministically isolating resources between the namespace and the
> host.
>
> Naively I would expect that in a namespace, all VSOCK CIDs are
> free for use, without having to concern yourself with what CIDs
> are in use in the host now, or in future.
>
True, that would be ideal. I think the definition of backwards
compatibility we've established includes the notion that any VM may
reach any namespace and any namespace may reach any VM. IIUC, it sounds
like you are suggesting this be revised to more strictly adhere to
namespace semantics?
I do like Stefano's suggestion to add a sysctl for a "strict" mode,
since it offers the best of both worlds and still tends conservative in
protecting existing applications... but I agree, the non-strict mode
vsock would be unique WRT the usual concept of namespaces.
> What happens if we reverse the QEMU order above, to get the
> following scenario
>
> # Launch VM1 inside the NS
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> # Launch VM2
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm1# socat - VSOCK-LISTEN:1234
> vm2# socat - VSOCK-LISTEN:1234
>
> host# socat - VSOCK-CONNECT:15:1234
> => Presume this connects to "VM2" running outside the NS
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> => Does this connect to "VM1" inside the NS, or "VM2"
> outside the NS ?
>
VM1 inside the NS. Current logic says that whenever two CIDs collide
(local vs global), always select the one in the local namespace
(irrespective of creation order).
With the proposed strict-mode sysctl enabled, it would *never* connect
to the global one, even if there was no local match but there was a
global one.
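Roughly speaking (reusing the toy model sketched earlier in this thread,
purely illustrative and not from the series), strict mode would just drop
the global fallback from the lookup:

  /* Hypothetical strict-mode wrapper: a socket in a namespace never
   * resolves a CID to the global address space, even when the namespace
   * has no VM of its own bound to that CID. */
  static const struct toy_vsock *toy_lookup_strict(const struct toy_vsock *tbl,
                                                   size_t n, unsigned int cid,
                                                   unsigned int caller_ns,
                                                   int strict)
  {
          const struct toy_vsock *hit = toy_lookup(tbl, n, cid, caller_ns);

          if (strict && hit && hit->netns_id != caller_ns)
                  return NULL;    /* no global fallback in strict mode */
          return hit;
  }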
>
>
> With regards,
> Daniel
Thanks for the review!
Best,
Bobby
On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>
> On Tue, Apr 01, 2025 at 08:05:16PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote:
> > > CCing Daniel
> > >
> > > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > > > Picking up Stefano's v1 [1], this series adds netns support to
> > > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > > > namespaces, defering that for future implementation and discussion.
> > > >
> > > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > > > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > > > vsock or scoped vsock share the same CID, the scoped vsock takes
> > > > precedence.
> > > >
> > > > If a socket in a namespace connects with a global vsock, the CID becomes
> > > > unavailable to any VMM in that namespace when creating new vsocks. If
> > > > disconnected, the CID becomes available again.
> > >
> > > I was talking about this feature with Daniel and he pointed out something
> > > interesting (Daniel please feel free to correct me):
> > >
> > > If we have a process in the host that does a listen(AF_VSOCK) in a
> > > namespace, can this receive connections from guests connected to
> > > /dev/vhost-vsock in any namespace?
> > >
> > > Should we provide something (e.g. sysctl/sysfs entry) to disable
> > > this behaviour, preventing a process in a namespace from receiving
> > > connections from the global vsock address space (i.e. /dev/vhost-vsock
> > > VMs)?
> >
> > I think my concern goes a bit beyond that, to the general conceptual
> > idea of sharing the CID space between the global vsocks and namespace
> > vsocks. So I'm not sure a sysctl would be sufficient...details later
> > below..
> >
> > > I understand that by default maybe we should allow this behaviour in order
> > > to not break current applications, but in some cases the user may want to
> > > isolate sockets in a namespace also from being accessed by VMs running in
> > > the global vsock address space.
> > >
> > > Indeed in this series we have talked mostly about the host -> guest path (as
> > > the direction of the connection), but little about the guest -> host path,
> > > maybe we should explain it better in the cover/commit
> > > descriptions/documentation.
> >
> > > > Testing
> > > >
> > > > QEMU with /dev/vhost-vsock-netns support:
> > > > https://github.com/beshleman/qemu/tree/vsock-netns
> > > >
> > > > Test: Scoped vsocks isolated by namespace
> > > >
> > > > host# ip netns add ns1
> > > > host# ip netns add ns2
> > > > host# ip netns exec ns1 \
> > > > qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE1} \
> > > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > > host# ip netns exec ns2 \
> > > > qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE2} \
> > > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > >
> > > > host# socat - VSOCK-CONNECT:15:1234
> > > > 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
> > > >
> > > > host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > > host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
> > > >
> > > > vm1# socat - VSOCK-LISTEN:1234
> > > > foobar1
> > > > vm2# socat - VSOCK-LISTEN:1234
> > > > foobar2
> > > >
> > > > Test: Global vsocks accessible to any namespace
> > > >
> > > > host# qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE2} \
> > > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > > >
> > > > host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > >
> > > > vm# socat - VSOCK-LISTEN:1234
> > > > foobar
> > > >
> > > > Test: Connecting to global vsock makes CID unavailble to namespace
> > > >
> > > > host# qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE2} \
> > > > -device vhost-vsock-pci,guest-cid=15,netns=off
> > > >
> > > > vm# socat - VSOCK-LISTEN:1234
> > > >
> > > > host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> > > > host# ip netns exec ns1 \
> > > > qemu-system-x86_64 \
> > > > -m 8G -smp 4 -cpu host -enable-kvm \
> > > > -serial mon:stdio \
> > > > -drive if=virtio,file=${IMAGE1} \
> > > > -device vhost-vsock-pci,netns=on,guest-cid=15
> > > >
> > > > qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
> >
> > I find it conceptually quite unsettling that the VSOCK CID address
> > space for AF_VSOCK is shared between the host and the namespace.
> > That feels contrary to how namespaces are more commonly used for
> > deterministically isolating resources between the namespace and the
> > host.
> >
> > Naively I would expect that in a namespace, all VSOCK CIDs are
> > free for use, without having to concern yourself with what CIDs
> > are in use in the host now, or in future.
> >
>
> True, that would be ideal. I think the definition of backwards
> compatibility we've established includes the notion that any VM may
> reach any namespace and any namespace may reach any VM. IIUC, it
> sounds
> like you are suggesting this be revised to more strictly adhere to
> namespace semantics?
>
> I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> Since it offers the best of both worlds, and still tends conservative in
> protecting existing applications... but I agree, the non-strict mode
> vsock would be unique WRT the usual concept of namespaces.
Maybe we could do the opposite, enable strict mode by default (I think
it was similar to what I had tried to do with the kernel module in v1, I
was young I know xD)
And provide a way to disable it for those use cases where the user wants
backward compatibility, while paying the cost of less isolation.
I was thinking two options (not sure if the second one can be done):
1. provide a global sysfs/sysctl that disables strict mode, but this
then applies to all namespaces
2. provide something that allows disabling strict mode by namespace.
Maybe when it is created there are options, or something that can be
set later.
2 would be ideal, but that might be too much, so 1 might be enough. In
any case, 2 could also be a next step.
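Just to sketch what option 2 might look like (rough, hypothetical code:
the structure and field names are invented, the init/exit and knob
plumbing are omitted, and how the setting is actually exposed -- sysctl,
sysfs, or an option at namespace creation -- is exactly the open question
above), the setting could live in per-netns state:

  /* Hypothetical per-netns state, for discussion only; would be
   * registered with register_pernet_subsys(&vsock_pernet_ops). */
  struct vsock_pernet {
          bool strict;    /* false = legacy/global behaviour */
  };

  static unsigned int vsock_pernet_id;

  static int __net_init vsock_pernet_init(struct net *net)
  {
          struct vsock_pernet *vp = net_generic(net, vsock_pernet_id);

          vp->strict = false;     /* default: backward-compatible */
          return 0;
  }

  static struct pernet_operations vsock_pernet_ops = {
          .init   = vsock_pernet_init,
          .id     = &vsock_pernet_id,
          .size   = sizeof(struct vsock_pernet),
  };

  /* Checked wherever a CID lookup would otherwise fall back to the
   * global address space. */
  static bool vsock_net_strict(struct net *net)
  {
          struct vsock_pernet *vp = net_generic(net, vsock_pernet_id);

          return vp->strict;
  }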
WDYT?
Thanks,
Stefano
On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> >
> > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > Since it offers the best of both worlds, and still tends conservative in
> > protecting existing applications... but I agree, the non-strict mode
> > vsock would be unique WRT the usual concept of namespaces.
>
> Maybe we could do the opposite, enable strict mode by default (I think
> it was similar to what I had tried to do with the kernel module in v1, I
> was young I know xD)
> And provide a way to disable it for those use cases where the user wants
> backward compatibility, while paying the cost of less isolation.

I think backwards compatibility has to be the default behaviour, otherwise
the change has too high a risk of breaking existing deployments that are
already using netns and relying on VSOCK being global. Breakage has to
be opt in.

> I was thinking two options (not sure if the second one can be done):
>
> 1. provide a global sysfs/sysctl that disables strict mode, but this
> then applies to all namespaces
>
> 2. provide something that allows disabling strict mode by namespace.
> Maybe when it is created there are options, or something that can be
> set later.
>
> 2 would be ideal, but that might be too much, so 1 might be enough. In
> any case, 2 could also be a next step.
>
> WDYT?

It occurred to me that the problem we face with the CID space usage is
somewhat similar to the UID/GID space usage for user namespaces.

In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.

At the risk of being overkill, is it worth trying a similar kind of
approach for the vsock CID space?

A simple variant would be a /proc/net/vsock_cid_outside specifying a set
of CIDs which are exclusively referencing /dev/vhost-vsock associations
created outside the namespace. Anything not listed would be exclusively
referencing associations created inside the namespace.

A more complex variant would be to allow a full remapping of CIDs as is
done with userns, via a /proc/net/vsock_cid_map with the same three
parameters, so that a CID=15 association outside the namespace could be
remapped to CID=9015 inside the namespace, allowing the inside namespace
to define its own association for CID=15 without clashing.

IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
associations created outside the namespace, while unmapped CIDs would be
exclusively referencing /dev/vhost-vsock associations inside the
namespace.

A likely benefit of relying on a kernel-defined mapping/partition of
the CID space is that apps like QEMU don't need changing, as there's
no need to invent a new /dev/vhost-vsock-netns device node.

Both approaches give the desirable security protection whereby the
inside namespace can be prevented from accessing certain CIDs that
were associated outside the namespace.

Some rule would need to be defined for updating the /proc/net/vsock_cid_map
file as it is the security control mechanism. If it is write-once, then
once the container mgmt app initializes it, nothing later could change
it.

A key question is: do we need the "first come, first served" behaviour
for CIDs, where a CID can be arbitrarily used by the outside or inside
namespace according to whatever tries to associate a CID first?

IMHO those semantics lead to unpredictable behaviour for apps because
what happens depends on the ordering of app launches inside & outside the
namespace, but they do sort of allow for VSOCK namespace behaviour to
be 'zero conf' out of the box.

A mapping that strictly partitions CIDs to either outside or inside
namespace usage, but never both, gives well-defined behaviour, at the
cost of needing to set up an initial mapping/partition.

With regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
>On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
>> On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
>> >
>> > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
>> > Since it offers the best of both worlds, and still tends conservative in
>> > protecting existing applications... but I agree, the non-strict mode
>> > vsock would be unique WRT the usual concept of namespaces.
>>
>> Maybe we could do the opposite, enable strict mode by default (I think
>> it was similar to what I had tried to do with the kernel module in v1, I
>> was young I know xD)
>> And provide a way to disable it for those use cases where the user wants
>> backward compatibility, while paying the cost of less isolation.
>
>I think backwards compatible has to be the default behaviour, otherwise
>the change has too high risk of breaking existing deployments that are
>already using netns and relying on VSOCK being global. Breakage has to
>be opt in.
>
>> I was thinking two options (not sure if the second one can be done):
>>
>> 1. provide a global sysfs/sysctl that disables strict mode, but this
>> then applies to all namespaces
>>
>> 2. provide something that allows disabling strict mode by namespace.
>> Maybe when it is created there are options, or something that can be
>> set later.
>>
>> 2 would be ideal, but that might be too much, so 1 might be enough. In
>> any case, 2 could also be a next step.
>>
>> WDYT?
>
>It occured to me that the problem we face with the CID space usage is
>somewhat similar to the UID/GID space usage for user namespaces.
>
>In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
>allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
>
>At the risk of being overkill, is it worth trying a similar kind of
>approach for the vsock CID space ?
>
>A simple variant would be a /proc/net/vsock_cid_outside specifying a set
>of CIDs which are exclusively referencing /dev/vhost-vsock associations
>created outside the namespace. Anything not listed would be exclusively
>referencing associations created inside the namespace.

I like the idea and I think it is also easily usable in a nested
environment, where for example in L1 we can decide whether or not a
namespace can access the L0 host (CID=2), by adding 2 to
/proc/net/vsock_cid_outside

>
>A more complex variant would be to allow a full remapping of CIDs as is
>done with userns, via a /proc/net/vsock_cid_map, which the same three
>parameters, so that CID=15 association outside the namespace could be
>remapped to CID=9015 inside the namespace, allow the inside namespace
>to define its out association for CID=15 without clashing.
>
>IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
>associations created outside namespace, while unmapped CIDs would be
>exclusively referencing /dev/vhost-vsock associations inside the
>namespace.

This is maybe a little overkill, but I don't object to it! It could
also be a next step. But if it's easy to implement, we can go straight
with it.

>
>A likely benefit of relying on a kernel defined mapping/partition of
>the CID space is that apps like QEMU don't need changing, as there's
>no need to invent a new /dev/vhost-vsock-netns device node.

Yeah, I see that! However, should this be paired with a sysctl/sysfs to
do opt-in? Or can we do something to figure out if the user didn't
write these files, then behave as before (but maybe we need to reverse
the logic, I don't know if that makes sense).

>
>Both approaches give the desirable security protection whereby the
>inside namespace can be prevented from accessing certain CIDs that
>were associated outside the namespace.
>
>Some rule would need to be defined for updating the /proc/net/vsock_cid_map
>file as it is the security control mechanism. If it is write-once then
>if the container mgmt app initializes it, nothing later could change
>it.
>
>A key question is do we need the "first come, first served" behaviour
>for CIDs where a CID can be arbitrarily used by outside or inside namespace
>according to whatever tries to associate a CID first ?
>
>IMHO those semantics lead to unpredictable behaviour for apps because
>what happens depends on ordering of app launches inside & outside the
>namespace, but they do sort of allow for VSOCK namespace behaviour to
>be 'zero conf' out of the box.

Yes, I agree that we should avoid it if possible.

>
>A mapping that strictly partitions CIDs to either outside or inside
>namespace usage, but never both, gives well defined behaviour, at the
>cost of needing to setup an initial mapping/partition.

Thanks for your points!
Stefano
On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote:
> > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote:
> > >
> > > I do like Stefano's suggestion to add a sysctl for a "strict" mode,
> > > Since it offers the best of both worlds, and still tends conservative in
> > > protecting existing applications... but I agree, the non-strict mode
> > > vsock would be unique WRT the usual concept of namespaces.
> >
> > Maybe we could do the opposite, enable strict mode by default (I think
> > it was similar to what I had tried to do with the kernel module in v1, I
> > was young I know xD)
> > And provide a way to disable it for those use cases where the user wants
> > backward compatibility, while paying the cost of less isolation.
>
> I think backwards compatible has to be the default behaviour, otherwise
> the change has too high risk of breaking existing deployments that are
> already using netns and relying on VSOCK being global. Breakage has to
> be opt in.
>
> > I was thinking two options (not sure if the second one can be done):
> >
> > 1. provide a global sysfs/sysctl that disables strict mode, but this
> > then applies to all namespaces
> >
> > 2. provide something that allows disabling strict mode by namespace.
> > Maybe when it is created there are options, or something that can be
> > set later.
> >
> > 2 would be ideal, but that might be too much, so 1 might be enough. In
> > any case, 2 could also be a next step.
> >
> > WDYT?
>
> It occured to me that the problem we face with the CID space usage is
> somewhat similar to the UID/GID space usage for user namespaces.
>
> In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
>
> At the risk of being overkill, is it worth trying a similar kind of
> approach for the vsock CID space ?
>
> A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> of CIDs which are exclusively referencing /dev/vhost-vsock associations
> created outside the namespace. Anything not listed would be exclusively
> referencing associations created inside the namespace.
>
> A more complex variant would be to allow a full remapping of CIDs as is
> done with userns, via a /proc/net/vsock_cid_map, which the same three
> parameters, so that CID=15 association outside the namespace could be
> remapped to CID=9015 inside the namespace, allow the inside namespace
> to define its out association for CID=15 without clashing.
>
> IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> associations created outside namespace, while unmapped CIDs would be
> exclusively referencing /dev/vhost-vsock associations inside the
> namespace.
>
> A likely benefit of relying on a kernel defined mapping/partition of
> the CID space is that apps like QEMU don't need changing, as there's
> no need to invent a new /dev/vhost-vsock-netns device node.
>
> Both approaches give the desirable security protection whereby the
> inside namespace can be prevented from accessing certain CIDs that
> were associated outside the namespace.
>
> Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> file as it is the security control mechanism. If it is write-once then
> if the container mgmt app initializes it, nothing later could change
> it.
>
> A key question is do we need the "first come, first served" behaviour
> for CIDs where a CID can be arbitrarily used by outside or inside namespace
> according to whatever tries to associate a CID first ?

I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
from being used, this could be solved by disallowing remapping the CID
while in use?

The thing I like about this is that users can check
/proc/net/vsock_cid_outside to figure out what might be going on,
instead of trying to check lsof or ps to figure out if the VMM processes
have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.

Just to check I am following... I suppose we would have a few typical
configurations for /proc/net/vsock_cid_outside. Following uid_map file
format of:

"<local cid start> <global cid start> <range size>"

1. Identity mapping, current namespace CID is global CID (default
setting for new namespaces):

# empty file

OR

0 0 4294967295

2. Complete isolation from global space (initialized, but no mappings):

0 0 0

3. Mapping in ranges of global CIDs

For example, global CID space starts at 7000, up to 32-bit max:

7000 0 4294960295

Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
8000-8100) :

7000 0 100
8000 1000 100

One thing I don't love is that option 3 seems to not be addressing a
known use case. It doesn't necessarily hurt to have, but it will add
complexity to CID handling that might never get used?

Since options 1/2 could also be represented by a boolean (yes/no
"current ns shares CID with global"), I wonder if we could either A)
only support the first two options at first, or B) add just
/proc/net/vsock_ns_mode at first, which supports only "global" and
"local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
or the full mapping if the need arises?

This could also be how we support Option 2 from Stefano's last email of
supporting per-namespace opt-in/opt-out.

Any thoughts on this?

>
> IMHO those semantics lead to unpredictable behaviour for apps because
> what happens depends on ordering of app launches inside & outside the
> namespace, but they do sort of allow for VSOCK namespace behaviour to
> be 'zero conf' out of the box.
>
> A mapping that strictly partitions CIDs to either outside or inside
> namespace usage, but never both, gives well defined behaviour, at the
> cost of needing to setup an initial mapping/partition.
>

Agreed, I do like the plainness of reasoning through it.

Thanks!
Bobby
On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote:
> On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote:
> > It occured to me that the problem we face with the CID space usage is
> > somewhat similar to the UID/GID space usage for user namespaces.
> >
> > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to
> > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host.
> >
> > At the risk of being overkill, is it worth trying a similar kind of
> > approach for the vsock CID space ?
> >
> > A simple variant would be a /proc/net/vsock_cid_outside specifying a set
> > of CIDs which are exclusively referencing /dev/vhost-vsock associations
> > created outside the namespace. Anything not listed would be exclusively
> > referencing associations created inside the namespace.
> >
> > A more complex variant would be to allow a full remapping of CIDs as is
> > done with userns, via a /proc/net/vsock_cid_map, which the same three
> > parameters, so that CID=15 association outside the namespace could be
> > remapped to CID=9015 inside the namespace, allow the inside namespace
> > to define its out association for CID=15 without clashing.
> >
> > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock
> > associations created outside namespace, while unmapped CIDs would be
> > exclusively referencing /dev/vhost-vsock associations inside the
> > namespace.
> >
> > A likely benefit of relying on a kernel defined mapping/partition of
> > the CID space is that apps like QEMU don't need changing, as there's
> > no need to invent a new /dev/vhost-vsock-netns device node.
> >
> > Both approaches give the desirable security protection whereby the
> > inside namespace can be prevented from accessing certain CIDs that
> > were associated outside the namespace.
> >
> > Some rule would need to be defined for updating the /proc/net/vsock_cid_map
> > file as it is the security control mechanism. If it is write-once then
> > if the container mgmt app initializes it, nothing later could change
> > it.
> >
> > A key question is do we need the "first come, first served" behaviour
> > for CIDs where a CID can be arbitrarily used by outside or inside namespace
> > according to whatever tries to associate a CID first ?
>
> I think with /proc/net/vsock_cid_outside, instead of disallowing the CID
> from being used, this could be solved by disallowing remapping the CID
> while in use?
>
> The thing I like about this is that users can check
> /proc/net/vsock_cid_outside to figure out what might be going on,
> instead of trying to check lsof or ps to figure out if the VMM processes
> have used /dev/vhost-vsock vs /dev/vhost-vsock-netns.
>
> Just to check I am following... I suppose we would have a few typical
> configurations for /proc/net/vsock_cid_outside. Following uid_map file
> format of:
> "<local cid start> <global cid start> <range size>"
>
> 1. Identity mapping, current namespace CID is global CID (default
> setting for new namespaces):
>
> # empty file
>
> OR
>
> 0 0 4294967295
>
> 2. Complete isolation from global space (initialized, but no mappings):
>
> 0 0 0
>
> 3. Mapping in ranges of global CIDs
>
> For example, global CID space starts at 7000, up to 32-bit max:
>
> 7000 0 4294960295
>
> Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to
> 8000-8100) :
>
> 7000 0 100
> 8000 1000 100
>
>
> One thing I don't love is that option 3 seems to not be addressing a
> known use case. It doesn't necessarily hurt to have, but it will add
> complexity to CID handling that might never get used?
Yeah, I have the same feeling that full remapping of CIDs is probably
adding complexity without clear benefit, unless it somehow helps us
with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ?
I've not thought the latter through to any great level of detail
though.
> Since options 1/2 could also be represented by a boolean (yes/no
> "current ns shares CID with global"), I wonder if we could either A)
> only support the first two options at first, or B) add just
> /proc/net/vsock_ns_mode at first, which supports only "global" and
> "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> or the full mapping if the need arises?
Two options are sufficient if you want to control AF_VSOCK usage
and /dev/vhost-vsock usage as a pair. If you want to separately
control them though, it would push for three options - global,
local, and mixed. By mixed I mean AF_VSOCK in the NS can access
the global CID from the NS, but the NS can't associate the global
CID with a guest.
IOW, this breaks down like:
* CID=N local - aka fully private
Outside NS: Can associate outside CID=N with a guest.
AF_VSOCK permitted to access outside CID=N
Inside NS: Can NOT associate outside CID=N with a guest
Can associate inside CID=N with a guest
AF_VSOCK forbidden to access outside CID=N
AF_VSOCK permitted to access inside CID=N
* CID=N mixed - aka partially shared
Outside NS: Can associate outside CID=N with a guest.
AF_VSOCK permitted to access outside CID=N
Inside NS: Can NOT associate outside CID=N with a guest
AF_VSOCK permitted to access outside CID=N
No inside CID=N concept
* CID=N global - aka current historic behaviour
Outside NS: Can associate outside CID=N with a guest.
AF_VSOCK permitted to access outside CID=N
Inside NS: Can associate outside CID=N with a guest
AF_VSOCK permitted to access outside CID=N
No inside CID=N concept
I was thinking the 'mixed' mode might be useful if the outside NS wants
to retain control over setting up the association, but delegate to
processes in the inside NS for providing individual services to that
guest. This means if the outside NS needs to restart the VM, there is
no race window in which the inside NS can grab the association with the
CID.
As for whether we need to control this per-CID, or with a single setting
applying to all CIDs:
Consider that the host OS can be running one or more "service VMs" on
well known CIDs that can be leveraged from other NS, while those other
NS also run some "end user VMs" that should be private to the NS.
IOW, the CIDs for the service VMs would need to be using "mixed"
policy, while the CIDs for the end user VMs would be "local".
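Restated as a tiny C sketch (the enum and helper names are invented here
purely to encode the table above, nothing more), the two permissions and
the "inside CID" concept split out per mode as:

  #include <stdbool.h>

  /* Hypothetical per-CID policy, directly encoding the breakdown above. */
  enum vsock_cid_mode { CID_MODE_LOCAL, CID_MODE_MIXED, CID_MODE_GLOBAL };

  /* May a VMM inside the NS associate the outside CID=N with a guest? */
  static bool ns_may_associate_outside_cid(enum vsock_cid_mode mode)
  {
          return mode == CID_MODE_GLOBAL;
  }

  /* May AF_VSOCK sockets inside the NS reach the outside CID=N? */
  static bool ns_may_connect_outside_cid(enum vsock_cid_mode mode)
  {
          return mode != CID_MODE_LOCAL;  /* mixed and global allow it */
  }

  /* Does the NS keep its own, inside notion of CID=N? */
  static bool ns_has_inside_cid(enum vsock_cid_mode mode)
  {
          return mode == CID_MODE_LOCAL;
  }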
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Apr 04, 2025 at 02:05:32PM +0100, Daniel P. Berrangé wrote: > On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: > > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: > > > It occured to me that the problem we face with the CID space usage is > > > somewhat similar to the UID/GID space usage for user namespaces. > > > > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to > > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. > > > > > > At the risk of being overkill, is it worth trying a similar kind of > > > approach for the vsock CID space ? > > > > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set > > > of CIDs which are exclusively referencing /dev/vhost-vsock associations > > > created outside the namespace. Anything not listed would be exclusively > > > referencing associations created inside the namespace. > > > > > > A more complex variant would be to allow a full remapping of CIDs as is > > > done with userns, via a /proc/net/vsock_cid_map, which the same three > > > parameters, so that CID=15 association outside the namespace could be > > > remapped to CID=9015 inside the namespace, allow the inside namespace > > > to define its out association for CID=15 without clashing. > > > > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock > > > associations created outside namespace, while unmapped CIDs would be > > > exclusively referencing /dev/vhost-vsock associations inside the > > > namespace. > > > > > > A likely benefit of relying on a kernel defined mapping/partition of > > > the CID space is that apps like QEMU don't need changing, as there's > > > no need to invent a new /dev/vhost-vsock-netns device node. > > > > > > Both approaches give the desirable security protection whereby the > > > inside namespace can be prevented from accessing certain CIDs that > > > were associated outside the namespace. > > > > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map > > > file as it is the security control mechanism. If it is write-once then > > > if the container mgmt app initializes it, nothing later could change > > > it. > > > > > > A key question is do we need the "first come, first served" behaviour > > > for CIDs where a CID can be arbitrarily used by outside or inside namespace > > > according to whatever tries to associate a CID first ? > > > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID > > from being used, this could be solved by disallowing remapping the CID > > while in use? > > > > The thing I like about this is that users can check > > /proc/net/vsock_cid_outside to figure out what might be going on, > > instead of trying to check lsof or ps to figure out if the VMM processes > > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. > > > > Just to check I am following... I suppose we would have a few typical > > configurations for /proc/net/vsock_cid_outside. Following uid_map file > > format of: > > "<local cid start> <global cid start> <range size>" > > > > 1. Identity mapping, current namespace CID is global CID (default > > setting for new namespaces): > > > > # empty file > > > > OR > > > > 0 0 4294967295 > > > > 2. Complete isolation from global space (initialized, but no mappings): > > > > 0 0 0 > > > > 3. 
Mapping in ranges of global CIDs > > > > For example, global CID space starts at 7000, up to 32-bit max: > > > > 7000 0 4294960295 > > > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to > > 8000-8100) : > > > > 7000 0 100 > > 8000 1000 100 > > > > > > One thing I don't love is that option 3 seems to not be addressing a > > known use case. It doesn't necessarily hurt to have, but it will add > > complexity to CID handling that might never get used? > > Yeah, I have the same feeling that full remapping of CIDs is probably > adding complexity without clear benefit, unless it somehow helps us > with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ? > I've not thought the latter through to any great level of detail > though > > > Since options 1/2 could also be represented by a boolean (yes/no > > "current ns shares CID with global"), I wonder if we could either A) > > only support the first two options at first, or B) add just > > /proc/net/vsock_ns_mode at first, which supports only "global" and > > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside > > or the full mapping if the need arises? > > Two options is sufficient if you want to control AF_VSOCK usage > and /dev/vhost-vsock usage as a pair. If you want to separately > control them though, it would push for three options - global, > local, and mixed. By mixed I mean AF_VSOCK in the NS can access > the global CID from the NS, but the NS can't associate the global > CID with a guest. > > IOW, this breaks down like: > > * CID=N local - aka fully private > > Outside NS: Can associate outside CID=N with a guest. > AF_VSOCK permitted to access outside CID=N > > Inside NS: Can NOT associate outside CID=N with a guest > Can associate inside CID=N with a guest > AF_VSOCK forbidden to access outside CID=N > AF_VSOCK permitted to access inside CID=N > > > * CID=N mixed - aka partially shared > > Outside NS: Can associate outside CID=N with a guest. > AF_VSOCK permitted to access outside CID=N > > Inside NS: Can NOT associate outside CID=N with a guest > AF_VSOCK permitted to access outside CID=N > No inside CID=N concept > > > * CID=N global - aka current historic behaviour > > Outside NS: Can associate outside CID=N with a guest. > AF_VSOCK permitted to access outside CID=N > > Inside NS: Can associate outside CID=N with a guest > AF_VSOCK permitted to access outside CID=N > No inside CID=N concept > > > I was thinking the 'mixed' mode might be useful if the outside NS wants > to retain control over setting up the association, but delegate to > processes in the inside NS for providing individual services to that > guest. This means if the outside NS needs to restart the VM, there is > no race window in which the inside NS can grab the assocaition with the > CID > > As for whether we need to control this per-CID, or a single setting > applying to all CID. > > Consider that the host OS can be running one or more "service VMs" on > well known CIDs that can be leveraged from other NS, while those other > NS also run some "end user VMs" that should be private to the NS. > > IOW, the CIDs for the service VMs would need to be using "mixed" > policy, while the CIDs for the end user VMs would be "local". > I think this sounds pretty flexible, and IMO adding the third mode doesn't add much more additional complexity. 
Going this route, we have:

- three modes: local, global, mixed
- at first, no vsock_cid_map (local has no outside CIDs, global and mixed have no inside
  CIDs, so no cross-mapping needed)
- only later add a full mapped mode and vsock_cid_map if necessary.

Stefano, any preferences on this vs starting with the restricted
vsock_cid_map (only supporting "0 0 0" and "0 0 <size>")?

I'm leaning towards the modes because it covers more use cases and seems
like a clearer user interface?

To clarify another aspect... child namespaces must inherit the parent's
local mode. So if namespace P sets the mode to local, and then creates a
child process that then creates namespace C... then C's global and mixed
modes are implicitly restricted to P's local space?

Thanks,
Bobby
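To make the three-mode split concrete, a minimal sketch of how a per-netns
setting might look on the kernel side; every name here (vsock_ns_mode,
vsock_net_mode(), vsock_ns_may_associate()) is invented for illustration
and is not part of the posted series:

enum vsock_ns_mode {
	VSOCK_NS_MODE_GLOBAL,	/* historic behaviour: shares the global CID space */
	VSOCK_NS_MODE_LOCAL,	/* fully private: only namespace-scoped CIDs */
	VSOCK_NS_MODE_MIXED,	/* AF_VSOCK may reach global CIDs, but this ns
				 * may not associate a global CID with a guest
				 */
};

/* Hypothetical check used when a VMM in @net asks to associate @cid. */
static bool vsock_ns_may_associate(struct net *net, u32 cid)
{
	switch (vsock_net_mode(net)) {	/* hypothetical per-netns getter */
	case VSOCK_NS_MODE_GLOBAL:
		return true;	/* old behaviour, global CID space */
	case VSOCK_NS_MODE_LOCAL:
		return true;	/* allowed, but only into the scoped CID space */
	case VSOCK_NS_MODE_MIXED:
		return false;	/* association reserved for the outside namespace */
	}
	return false;
}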
On Fri, Apr 18, 2025 at 10:57:52AM -0700, Bobby Eshleman wrote: >On Fri, Apr 04, 2025 at 02:05:32PM +0100, Daniel P. Berrangé wrote: >> On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: >> > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: >> > > It occured to me that the problem we face with the CID space usage is >> > > somewhat similar to the UID/GID space usage for user namespaces. >> > > >> > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to >> > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. >> > > >> > > At the risk of being overkill, is it worth trying a similar kind of >> > > approach for the vsock CID space ? >> > > >> > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set >> > > of CIDs which are exclusively referencing /dev/vhost-vsock associations >> > > created outside the namespace. Anything not listed would be exclusively >> > > referencing associations created inside the namespace. >> > > >> > > A more complex variant would be to allow a full remapping of CIDs as is >> > > done with userns, via a /proc/net/vsock_cid_map, which the same three >> > > parameters, so that CID=15 association outside the namespace could be >> > > remapped to CID=9015 inside the namespace, allow the inside namespace >> > > to define its out association for CID=15 without clashing. >> > > >> > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock >> > > associations created outside namespace, while unmapped CIDs would be >> > > exclusively referencing /dev/vhost-vsock associations inside the >> > > namespace. >> > > >> > > A likely benefit of relying on a kernel defined mapping/partition of >> > > the CID space is that apps like QEMU don't need changing, as there's >> > > no need to invent a new /dev/vhost-vsock-netns device node. >> > > >> > > Both approaches give the desirable security protection whereby the >> > > inside namespace can be prevented from accessing certain CIDs that >> > > were associated outside the namespace. >> > > >> > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map >> > > file as it is the security control mechanism. If it is write-once then >> > > if the container mgmt app initializes it, nothing later could change >> > > it. >> > > >> > > A key question is do we need the "first come, first served" behaviour >> > > for CIDs where a CID can be arbitrarily used by outside or inside namespace >> > > according to whatever tries to associate a CID first ? >> > >> > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID >> > from being used, this could be solved by disallowing remapping the CID >> > while in use? >> > >> > The thing I like about this is that users can check >> > /proc/net/vsock_cid_outside to figure out what might be going on, >> > instead of trying to check lsof or ps to figure out if the VMM processes >> > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. >> > >> > Just to check I am following... I suppose we would have a few typical >> > configurations for /proc/net/vsock_cid_outside. Following uid_map file >> > format of: >> > "<local cid start> <global cid start> <range size>" >> > >> > 1. Identity mapping, current namespace CID is global CID (default >> > setting for new namespaces): >> > >> > # empty file >> > >> > OR >> > >> > 0 0 4294967295 >> > >> > 2. Complete isolation from global space (initialized, but no mappings): >> > >> > 0 0 0 >> > >> > 3. 
Mapping in ranges of global CIDs >> > >> > For example, global CID space starts at 7000, up to 32-bit max: >> > >> > 7000 0 4294960295 >> > >> > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to >> > 8000-8100) : >> > >> > 7000 0 100 >> > 8000 1000 100 >> > >> > >> > One thing I don't love is that option 3 seems to not be addressing a >> > known use case. It doesn't necessarily hurt to have, but it will add >> > complexity to CID handling that might never get used? >> >> Yeah, I have the same feeling that full remapping of CIDs is probably >> adding complexity without clear benefit, unless it somehow helps us >> with the nested-virt scenario to disambiguate L0/L1/L2 CID ranges ? >> I've not thought the latter through to any great level of detail >> though >> >> > Since options 1/2 could also be represented by a boolean (yes/no >> > "current ns shares CID with global"), I wonder if we could either A) >> > only support the first two options at first, or B) add just >> > /proc/net/vsock_ns_mode at first, which supports only "global" and >> > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside >> > or the full mapping if the need arises? >> >> Two options is sufficient if you want to control AF_VSOCK usage >> and /dev/vhost-vsock usage as a pair. If you want to separately >> control them though, it would push for three options - global, >> local, and mixed. By mixed I mean AF_VSOCK in the NS can access >> the global CID from the NS, but the NS can't associate the global >> CID with a guest. >> >> IOW, this breaks down like: >> >> * CID=N local - aka fully private >> >> Outside NS: Can associate outside CID=N with a guest. >> AF_VSOCK permitted to access outside CID=N >> >> Inside NS: Can NOT associate outside CID=N with a guest >> Can associate inside CID=N with a guest >> AF_VSOCK forbidden to access outside CID=N >> AF_VSOCK permitted to access inside CID=N >> >> >> * CID=N mixed - aka partially shared >> >> Outside NS: Can associate outside CID=N with a guest. >> AF_VSOCK permitted to access outside CID=N >> >> Inside NS: Can NOT associate outside CID=N with a guest >> AF_VSOCK permitted to access outside CID=N >> No inside CID=N concept >> >> >> * CID=N global - aka current historic behaviour >> >> Outside NS: Can associate outside CID=N with a guest. >> AF_VSOCK permitted to access outside CID=N >> >> Inside NS: Can associate outside CID=N with a guest >> AF_VSOCK permitted to access outside CID=N >> No inside CID=N concept >> >> >> I was thinking the 'mixed' mode might be useful if the outside NS wants >> to retain control over setting up the association, but delegate to >> processes in the inside NS for providing individual services to that >> guest. This means if the outside NS needs to restart the VM, there is >> no race window in which the inside NS can grab the assocaition with the >> CID >> >> As for whether we need to control this per-CID, or a single setting >> applying to all CID. >> >> Consider that the host OS can be running one or more "service VMs" on >> well known CIDs that can be leveraged from other NS, while those other >> NS also run some "end user VMs" that should be private to the NS. >> >> IOW, the CIDs for the service VMs would need to be using "mixed" >> policy, while the CIDs for the end user VMs would be "local". >> > >I think this sounds pretty flexible, and IMO adding the third mode >doesn't add much more additional complexity. 
>
>Going this route, we have:
>- three modes: local, global, mixed
>- at first, no vsock_cid_map (local has no outside CIDs, global and mixed have no inside
> CIDs, so no cross-mapping needed)
>- only later add a full mapped mode and vsock_cid_map if necessary.
>
>Stefano, any preferences on this vs starting with the restricted
>vsock_cid_map (only supporting "0 0 0" and "0 0 <size>")?

No preference, I also like this idea.

>
>I'm leaning towards the modes because it covers more use cases and seems
>like a clearer user interface?

Sure, go ahead!

>
>To clarify another aspect... child namespaces must inherit the parent's
>local mode. So if namespace P sets the mode to local, and then creates a
>child process that then creates namespace C... then C's global and mixed
>modes are implicitly restricted to P's local space?

I think so, but it's still not clear to me if the mode can be selected
per namespace or it's a setting for the entire system, but I think we
can discuss this better on a proposal with some code :-)

Thanks,
Stefano
On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: > > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote: > > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode, > > > > Since it offers the best of both worlds, and still tends conservative in > > > > protecting existing applications... but I agree, the non-strict mode > > > > vsock would be unique WRT the usual concept of namespaces. > > > > > > Maybe we could do the opposite, enable strict mode by default (I think > > > it was similar to what I had tried to do with the kernel module in v1, I > > > was young I know xD) > > > And provide a way to disable it for those use cases where the user wants > > > backward compatibility, while paying the cost of less isolation. > > > > I think backwards compatible has to be the default behaviour, otherwise > > the change has too high risk of breaking existing deployments that are > > already using netns and relying on VSOCK being global. Breakage has to > > be opt in. > > > > > I was thinking two options (not sure if the second one can be done): > > > > > > 1. provide a global sysfs/sysctl that disables strict mode, but this > > > then applies to all namespaces > > > > > > 2. provide something that allows disabling strict mode by namespace. > > > Maybe when it is created there are options, or something that can be > > > set later. > > > > > > 2 would be ideal, but that might be too much, so 1 might be enough. In > > > any case, 2 could also be a next step. > > > > > > WDYT? > > > > It occured to me that the problem we face with the CID space usage is > > somewhat similar to the UID/GID space usage for user namespaces. > > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. > > > > At the risk of being overkill, is it worth trying a similar kind of > > approach for the vsock CID space ? > > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set > > of CIDs which are exclusively referencing /dev/vhost-vsock associations > > created outside the namespace. Anything not listed would be exclusively > > referencing associations created inside the namespace. > > > > A more complex variant would be to allow a full remapping of CIDs as is > > done with userns, via a /proc/net/vsock_cid_map, which the same three > > parameters, so that CID=15 association outside the namespace could be > > remapped to CID=9015 inside the namespace, allow the inside namespace > > to define its out association for CID=15 without clashing. > > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock > > associations created outside namespace, while unmapped CIDs would be > > exclusively referencing /dev/vhost-vsock associations inside the > > namespace. > > > > A likely benefit of relying on a kernel defined mapping/partition of > > the CID space is that apps like QEMU don't need changing, as there's > > no need to invent a new /dev/vhost-vsock-netns device node. > > > > Both approaches give the desirable security protection whereby the > > inside namespace can be prevented from accessing certain CIDs that > > were associated outside the namespace. > > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map > > file as it is the security control mechanism. 
If it is write-once then > > if the container mgmt app initializes it, nothing later could change > > it. > > > > A key question is do we need the "first come, first served" behaviour > > for CIDs where a CID can be arbitrarily used by outside or inside namespace > > according to whatever tries to associate a CID first ? > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID > from being used, this could be solved by disallowing remapping the CID > while in use? > > The thing I like about this is that users can check > /proc/net/vsock_cid_outside to figure out what might be going on, > instead of trying to check lsof or ps to figure out if the VMM processes > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. > > Just to check I am following... I suppose we would have a few typical > configurations for /proc/net/vsock_cid_outside. Following uid_map file > format of: > "<local cid start> <global cid start> <range size>" > > 1. Identity mapping, current namespace CID is global CID (default > setting for new namespaces): > > # empty file > > OR > > 0 0 4294967295 > > 2. Complete isolation from global space (initialized, but no mappings): > > 0 0 0 > > 3. Mapping in ranges of global CIDs > > For example, global CID space starts at 7000, up to 32-bit max: > > 7000 0 4294960295 > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to > 8000-8100) : > > 7000 0 100 > 8000 1000 100 > > > One thing I don't love is that option 3 seems to not be addressing a > known use case. It doesn't necessarily hurt to have, but it will add > complexity to CID handling that might never get used? > > Since options 1/2 could also be represented by a boolean (yes/no > "current ns shares CID with global"), I wonder if we could either A) > only support the first two options at first, or B) add just > /proc/net/vsock_ns_mode at first, which supports only "global" and > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside > or the full mapping if the need arises? > > This could also be how we support Option 2 from Stefano's last email of > supporting per-namespace opt-in/opt-out. > > Any thoughts on this? > Stefano, Would only supporting 1/2 still support the Kata use case? Thanks, Bobby
On Wed, Apr 02, 2025 at 03:28:19PM -0700, Bobby Eshleman wrote: >On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: >> On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: >> > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote: >> > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote: >> > > > >> > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode, >> > > > Since it offers the best of both worlds, and still tends conservative in >> > > > protecting existing applications... but I agree, the non-strict mode >> > > > vsock would be unique WRT the usual concept of namespaces. >> > > >> > > Maybe we could do the opposite, enable strict mode by default (I think >> > > it was similar to what I had tried to do with the kernel module in v1, I >> > > was young I know xD) >> > > And provide a way to disable it for those use cases where the user wants >> > > backward compatibility, while paying the cost of less isolation. >> > >> > I think backwards compatible has to be the default behaviour, otherwise >> > the change has too high risk of breaking existing deployments that are >> > already using netns and relying on VSOCK being global. Breakage has to >> > be opt in. >> > >> > > I was thinking two options (not sure if the second one can be done): >> > > >> > > 1. provide a global sysfs/sysctl that disables strict mode, but this >> > > then applies to all namespaces >> > > >> > > 2. provide something that allows disabling strict mode by namespace. >> > > Maybe when it is created there are options, or something that can be >> > > set later. >> > > >> > > 2 would be ideal, but that might be too much, so 1 might be enough. In >> > > any case, 2 could also be a next step. >> > > >> > > WDYT? >> > >> > It occured to me that the problem we face with the CID space usage is >> > somewhat similar to the UID/GID space usage for user namespaces. >> > >> > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to >> > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. >> > >> > At the risk of being overkill, is it worth trying a similar kind of >> > approach for the vsock CID space ? >> > >> > A simple variant would be a /proc/net/vsock_cid_outside specifying a set >> > of CIDs which are exclusively referencing /dev/vhost-vsock associations >> > created outside the namespace. Anything not listed would be exclusively >> > referencing associations created inside the namespace. >> > >> > A more complex variant would be to allow a full remapping of CIDs as is >> > done with userns, via a /proc/net/vsock_cid_map, which the same three >> > parameters, so that CID=15 association outside the namespace could be >> > remapped to CID=9015 inside the namespace, allow the inside namespace >> > to define its out association for CID=15 without clashing. >> > >> > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock >> > associations created outside namespace, while unmapped CIDs would be >> > exclusively referencing /dev/vhost-vsock associations inside the >> > namespace. >> > >> > A likely benefit of relying on a kernel defined mapping/partition of >> > the CID space is that apps like QEMU don't need changing, as there's >> > no need to invent a new /dev/vhost-vsock-netns device node. >> > >> > Both approaches give the desirable security protection whereby the >> > inside namespace can be prevented from accessing certain CIDs that >> > were associated outside the namespace. 
>> > >> > Some rule would need to be defined for updating the /proc/net/vsock_cid_map >> > file as it is the security control mechanism. If it is write-once then >> > if the container mgmt app initializes it, nothing later could change >> > it. >> > >> > A key question is do we need the "first come, first served" behaviour >> > for CIDs where a CID can be arbitrarily used by outside or inside namespace >> > according to whatever tries to associate a CID first ? >> >> I think with /proc/net/vsock_cid_outside, instead of disallowing the CID >> from being used, this could be solved by disallowing remapping the CID >> while in use? >> >> The thing I like about this is that users can check >> /proc/net/vsock_cid_outside to figure out what might be going on, >> instead of trying to check lsof or ps to figure out if the VMM processes >> have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. Yes, although the user in theory should not care about this information, right? I mean I don't even know if it makes sense to expose the contents of /proc/net/vsock_cid_outside in the namespace. >> >> Just to check I am following... I suppose we would have a few typical >> configurations for /proc/net/vsock_cid_outside. Following uid_map file >> format of: >> "<local cid start> <global cid start> <range size>" This seems to relate more to /proc/net/vsock_cid_map, for /proc/net/vsock_cid_outside I think 2 parameters are enough (CID, range), right? >> >> 1. Identity mapping, current namespace CID is global CID (default >> setting for new namespaces): >> >> # empty file >> >> OR >> >> 0 0 4294967295 >> >> 2. Complete isolation from global space (initialized, but no mappings): >> >> 0 0 0 >> >> 3. Mapping in ranges of global CIDs >> >> For example, global CID space starts at 7000, up to 32-bit max: >> >> 7000 0 4294960295 >> >> Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to >> 8000-8100) : >> >> 7000 0 100 >> 8000 1000 100 >> >> >> One thing I don't love is that option 3 seems to not be addressing a >> known use case. It doesn't necessarily hurt to have, but it will add >> complexity to CID handling that might never get used? Yes, as I also mentioned in the previous email, we could also do a step-by-step thing. IMHO we can define /proc/net/vsock_cid_map (with the structure you just defined), but for now only support 1-1 mapping (with the ranges of course, I mean the first two parameters should always be the same) and then add option 3 in the future. >> >> Since options 1/2 could also be represented by a boolean (yes/no >> "current ns shares CID with global"), I wonder if we could either A) >> only support the first two options at first, or B) add just >> /proc/net/vsock_ns_mode at first, which supports only "global" and >> "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside >> or the full mapping if the need arises? I think option A is the same as I meant above :-) >> >> This could also be how we support Option 2 from Stefano's last email of >> supporting per-namespace opt-in/opt-out. Hmm, how can we do it by namespace? Isn't that global? >> >> Any thoughts on this? >> > >Stefano, > >Would only supporting 1/2 still support the Kata use case? I think so, actually I was thinking something similar in the message I just sent. By default (if the file is empty), nothing should change, so that's fine IMO. As Paolo suggested, we absolutely have to have tests to verify these things. Thanks, Stefano
On Thu, Apr 03, 2025 at 11:33:14AM +0200, Stefano Garzarella wrote: > On Wed, Apr 02, 2025 at 03:28:19PM -0700, Bobby Eshleman wrote: > > On Wed, Apr 02, 2025 at 03:18:13PM -0700, Bobby Eshleman wrote: > > > On Wed, Apr 02, 2025 at 10:21:36AM +0100, Daniel P. Berrangé wrote: > > > > On Wed, Apr 02, 2025 at 10:13:43AM +0200, Stefano Garzarella wrote: > > > > > On Wed, 2 Apr 2025 at 02:21, Bobby Eshleman <bobbyeshleman@gmail.com> wrote: > > > > > > > > > > > > I do like Stefano's suggestion to add a sysctl for a "strict" mode, > > > > > > Since it offers the best of both worlds, and still tends conservative in > > > > > > protecting existing applications... but I agree, the non-strict mode > > > > > > vsock would be unique WRT the usual concept of namespaces. > > > > > > > > > > Maybe we could do the opposite, enable strict mode by default (I think > > > > > it was similar to what I had tried to do with the kernel module in v1, I > > > > > was young I know xD) > > > > > And provide a way to disable it for those use cases where the user wants > > > > > backward compatibility, while paying the cost of less isolation. > > > > > > > > I think backwards compatible has to be the default behaviour, otherwise > > > > the change has too high risk of breaking existing deployments that are > > > > already using netns and relying on VSOCK being global. Breakage has to > > > > be opt in. > > > > > > > > > I was thinking two options (not sure if the second one can be done): > > > > > > > > > > 1. provide a global sysfs/sysctl that disables strict mode, but this > > > > > then applies to all namespaces > > > > > > > > > > 2. provide something that allows disabling strict mode by namespace. > > > > > Maybe when it is created there are options, or something that can be > > > > > set later. > > > > > > > > > > 2 would be ideal, but that might be too much, so 1 might be enough. In > > > > > any case, 2 could also be a next step. > > > > > > > > > > WDYT? > > > > > > > > It occured to me that the problem we face with the CID space usage is > > > > somewhat similar to the UID/GID space usage for user namespaces. > > > > > > > > In the latter case, userns has exposed /proc/$PID/uid_map & gid_map, to > > > > allow IDs in the namespace to be arbitrarily mapped onto IDs in the host. > > > > > > > > At the risk of being overkill, is it worth trying a similar kind of > > > > approach for the vsock CID space ? > > > > > > > > A simple variant would be a /proc/net/vsock_cid_outside specifying a set > > > > of CIDs which are exclusively referencing /dev/vhost-vsock associations > > > > created outside the namespace. Anything not listed would be exclusively > > > > referencing associations created inside the namespace. > > > > > > > > A more complex variant would be to allow a full remapping of CIDs as is > > > > done with userns, via a /proc/net/vsock_cid_map, which the same three > > > > parameters, so that CID=15 association outside the namespace could be > > > > remapped to CID=9015 inside the namespace, allow the inside namespace > > > > to define its out association for CID=15 without clashing. > > > > > > > > IOW, mapped CIDs would be exclusively referencing /dev/vhost-vsock > > > > associations created outside namespace, while unmapped CIDs would be > > > > exclusively referencing /dev/vhost-vsock associations inside the > > > > namespace. 
> > > > > > > > A likely benefit of relying on a kernel defined mapping/partition of > > > > the CID space is that apps like QEMU don't need changing, as there's > > > > no need to invent a new /dev/vhost-vsock-netns device node. > > > > > > > > Both approaches give the desirable security protection whereby the > > > > inside namespace can be prevented from accessing certain CIDs that > > > > were associated outside the namespace. > > > > > > > > Some rule would need to be defined for updating the /proc/net/vsock_cid_map > > > > file as it is the security control mechanism. If it is write-once then > > > > if the container mgmt app initializes it, nothing later could change > > > > it. > > > > > > > > A key question is do we need the "first come, first served" behaviour > > > > for CIDs where a CID can be arbitrarily used by outside or inside namespace > > > > according to whatever tries to associate a CID first ? > > > > > > I think with /proc/net/vsock_cid_outside, instead of disallowing the CID > > > from being used, this could be solved by disallowing remapping the CID > > > while in use? > > > > > > The thing I like about this is that users can check > > > /proc/net/vsock_cid_outside to figure out what might be going on, > > > instead of trying to check lsof or ps to figure out if the VMM processes > > > have used /dev/vhost-vsock vs /dev/vhost-vsock-netns. > > Yes, although the user in theory should not care about this information, > right? > I mean I don't even know if it makes sense to expose the contents of > /proc/net/vsock_cid_outside in the namespace. > > > > > > > Just to check I am following... I suppose we would have a few typical > > > configurations for /proc/net/vsock_cid_outside. Following uid_map file > > > format of: > > > "<local cid start> <global cid start> <range size>" > > This seems to relate more to /proc/net/vsock_cid_map, for > /proc/net/vsock_cid_outside I think 2 parameters are enough > (CID, range), right? > True, yes vsock_cid_map. > > > > > > 1. Identity mapping, current namespace CID is global CID (default > > > setting for new namespaces): > > > > > > # empty file > > > > > > OR > > > > > > 0 0 4294967295 > > > > > > 2. Complete isolation from global space (initialized, but no mappings): > > > > > > 0 0 0 > > > > > > 3. Mapping in ranges of global CIDs > > > > > > For example, global CID space starts at 7000, up to 32-bit max: > > > > > > 7000 0 4294960295 > > > > > > Or for multiple mappings (0-100 map to 7000-7100, 1000-1100 map to > > > 8000-8100) : > > > > > > 7000 0 100 > > > 8000 1000 100 > > > > > > > > > One thing I don't love is that option 3 seems to not be addressing a > > > known use case. It doesn't necessarily hurt to have, but it will add > > > complexity to CID handling that might never get used? > > Yes, as I also mentioned in the previous email, we could also do a > step-by-step thing. > > IMHO we can define /proc/net/vsock_cid_map (with the structure you just > defined), but for now only support 1-1 mapping (with the ranges of > course, I mean the first two parameters should always be the same) and > then add option 3 in the future. > makes sense, sgtm! 
> > >
> > > Since options 1/2 could also be represented by a boolean (yes/no
> > > "current ns shares CID with global"), I wonder if we could either A)
> > > only support the first two options at first, or B) add just
> > > /proc/net/vsock_ns_mode at first, which supports only "global" and
> > > "local", and later add a "mapped" mode plus /proc/net/vsock_cid_outside
> > > or the full mapping if the need arises?
>
> I think option A is the same as I meant above :-)
>

Indeed.

> > >
> > > This could also be how we support Option 2 from Stefano's last email of
> > > supporting per-namespace opt-in/opt-out.
>
> Hmm, how can we do it by namespace? Isn't that global?
>

I think the file path is global but the contents are tied per-namespace,
according to the namespace of the process that called open() on it. This
way the container mgr can write-once lock it, and the namespace processes
can read it?

> > >
> > > Any thoughts on this?
> > >
> >
> > Stefano,
> >
> > Would only supporting 1/2 still support the Kata use case?
>
> I think so, actually I was thinking something similar in the message I just
> sent.
>
> By default (if the file is empty), nothing should change, so that's fine
> IMO. As Paolo suggested, we absolutely have to have tests to verify these
> things.
>

Sounds like a plan! I'm working on the new vsock vmtest now and will
include the new tests in the next rev.

Also, I'm thinking we should protect vsock_cid_map behind a capability,
but I'm not sure which one is correct (CAP_NET_ADMIN?). WDYT?

Thanks!
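As a concrete reference for the capability question raised above, a
minimal sketch of what the write-path gate could look like; the
write-once flag helper (vsock_net_cid_map_locked()) is invented for this
sketch and nothing here is from the series:

/* Sketch only: gate writes to /proc/net/vsock_cid_map on CAP_NET_ADMIN in
 * the user namespace owning the netns, with write-once semantics so the
 * container manager can lock the map down before handing off the ns.
 */
static int vsock_cid_map_may_write(struct net *net)
{
	if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
		return -EPERM;

	if (vsock_net_cid_map_locked(net))	/* hypothetical write-once flag */
		return -EBUSY;

	return 0;
}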
On Fri, Mar 28, 2025 at 06:03:19PM +0100, Stefano Garzarella wrote: > CCing Daniel > > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote: > > Picking up Stefano's v1 [1], this series adds netns support to > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h) > > namespaces, defering that for future implementation and discussion. > > > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a > > "scoped" vsock, accessible only to sockets in its namespace. If a global > > vsock or scoped vsock share the same CID, the scoped vsock takes > > precedence. > > > > If a socket in a namespace connects with a global vsock, the CID becomes > > unavailable to any VMM in that namespace when creating new vsocks. If > > disconnected, the CID becomes available again. > > I was talking about this feature with Daniel and he pointed out something > interesting (Daniel please feel free to correct me): > > If we have a process in the host that does a listen(AF_VSOCK) in a > namespace, can this receive connections from guests connected to > /dev/vhost-vsock in any namespace? > > Should we provide something (e.g. sysctl/sysfs entry) to disable > this behaviour, preventing a process in a namespace from receiving > connections from the global vsock address space (i.e. /dev/vhost-vsock > VMs)? > > I understand that by default maybe we should allow this behaviour in order > to not break current applications, but in some cases the user may want to > isolate sockets in a namespace also from being accessed by VMs running in > the global vsock address space. > Adding this strict namespace mode makes sense to me, and I think the sysctl/sysfs approach works well to minimize application changes. The approach we were taking was to only allow /dev/vhost-vsock-netns (no global /dev/vhost-vsock mixed in on the system), but adding the explicit system-wide option I think improves the overall security posture of g2h connections. > Indeed in this series we have talked mostly about the host -> guest path (as > the direction of the connection), but little about the guest -> host path, > maybe we should explain it better in the cover/commit > descriptions/documentation. > Sounds good! Best, Bobby
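For the sysctl idea mentioned above, the usual per-netns pattern would be
roughly the following; the vsock_strict field, the header pointer, and the
knob name are invented for this sketch, and the exact ctl_table
registration details depend on the target kernel version:

/* Sketch of a per-netns knob, e.g. /proc/sys/net/vsock/strict_netns:
 * 0 = legacy behaviour (default), 1 = sockets in this netns are not
 * reachable from guests attached to the global /dev/vhost-vsock.
 * Exit/unregister path omitted for brevity.
 */
static struct ctl_table vsock_ns_table[] = {
	{
		.procname	= "strict_netns",
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= SYSCTL_ONE,
	},
};

static int __net_init vsock_sysctl_net_init(struct net *net)
{
	struct ctl_table *tbl;

	tbl = kmemdup(vsock_ns_table, sizeof(vsock_ns_table), GFP_KERNEL);
	if (!tbl)
		return -ENOMEM;
	tbl[0].data = &net->vsock_strict;		/* hypothetical per-netns field */

	/* hypothetical field to stash the header for later unregistering */
	net->vsock_sysctl_hdr = register_net_sysctl_sz(net, "net/vsock", tbl,
						       ARRAY_SIZE(vsock_ns_table));
	if (!net->vsock_sysctl_hdr) {
		kfree(tbl);
		return -ENOMEM;
	}
	return 0;
}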
On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> Picking up Stefano's v1 [1], this series adds netns support to
> vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> namespaces, defering that for future implementation and discussion.
>
> Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> "scoped" vsock, accessible only to sockets in its namespace. If a global
> vsock or scoped vsock share the same CID, the scoped vsock takes
> precedence.
>
> If a socket in a namespace connects with a global vsock, the CID becomes
> unavailable to any VMM in that namespace when creating new vsocks. If
> disconnected, the CID becomes available again.
yea that's a sane way to do it.
Thanks!
> Testing
>
> QEMU with /dev/vhost-vsock-netns support:
> https://github.com/beshleman/qemu/tree/vsock-netns
>
> Test: Scoped vsocks isolated by namespace
>
> host# ip netns add ns1
> host# ip netns add ns2
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> host# ip netns exec ns2 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> host# socat - VSOCK-CONNECT:15:1234
> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>
> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>
> vm1# socat - VSOCK-LISTEN:1234
> foobar1
> vm2# socat - VSOCK-LISTEN:1234
> foobar2
>
> Test: Global vsocks accessible to any namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> vm# socat - VSOCK-LISTEN:1234
> foobar
>
> Test: Connecting to global vsock makes CID unavailble to namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm# socat - VSOCK-LISTEN:1234
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
> ---
> Changes in v2:
> - only support vhost-vsock namespaces
> - all g2h namespaces retain old behavior, only common API changes
> impacted by vhost-vsock changes
> - add /dev/vhost-vsock-netns for "opt-in"
> - leave /dev/vhost-vsock to old behavior
> - removed netns module param
> - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>
> Changes in v1:
> - added 'netns' module param to vsock.ko to enable the
> network namespace support (disabled by default)
> - added 'vsock_net_eq()' to check the "net" assigned to a socket
> only when 'netns' support is enabled
> - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>
> ---
> Stefano Garzarella (3):
> vsock: add network namespace support
> vsock/virtio_transport_common: handle netns of received packets
> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>
> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
> include/linux/miscdevice.h | 1 +
> include/linux/virtio_vsock.h | 2 +
> include/net/af_vsock.h | 10 ++--
> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 +-
> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
> net/vmw_vsock/vmci_transport.c | 4 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 10 files changed, 180 insertions(+), 43 deletions(-)
> ---
> base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
> change-id: 20250312-vsock-netns-45da9424f726
>
> Best regards,
> --
> Bobby Eshleman <bobbyeshleman@gmail.com>
On Fri, Mar 21, 2025 at 03:49:38PM -0400, Michael S. Tsirkin wrote: > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote: > > Picking up Stefano's v1 [1], this series adds netns support to > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h) > > namespaces, defering that for future implementation and discussion. > > > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a > > "scoped" vsock, accessible only to sockets in its namespace. If a global > > vsock or scoped vsock share the same CID, the scoped vsock takes > > precedence. > > > > If a socket in a namespace connects with a global vsock, the CID becomes > > unavailable to any VMM in that namespace when creating new vsocks. If > > disconnected, the CID becomes available again. > > > yea that's a sane way to do it. > Thanks! > Sgtm, thank you! Best, Bobby
Hey all,
Apologies for forgetting the 'net-next' prefix on this one. Should I
resend or no?
Best,
Bobby
On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> Picking up Stefano's v1 [1], this series adds netns support to
> vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> namespaces, defering that for future implementation and discussion.
>
> Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> "scoped" vsock, accessible only to sockets in its namespace. If a global
> vsock or scoped vsock share the same CID, the scoped vsock takes
> precedence.
>
> If a socket in a namespace connects with a global vsock, the CID becomes
> unavailable to any VMM in that namespace when creating new vsocks. If
> disconnected, the CID becomes available again.
>
> Testing
>
> QEMU with /dev/vhost-vsock-netns support:
> https://github.com/beshleman/qemu/tree/vsock-netns
>
> Test: Scoped vsocks isolated by namespace
>
> host# ip netns add ns1
> host# ip netns add ns2
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
> host# ip netns exec ns2 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> host# socat - VSOCK-CONNECT:15:1234
> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>
> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>
> vm1# socat - VSOCK-LISTEN:1234
> foobar1
> vm2# socat - VSOCK-LISTEN:1234
> foobar2
>
> Test: Global vsocks accessible to any namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>
> vm# socat - VSOCK-LISTEN:1234
> foobar
>
> Test: Connecting to global vsock makes CID unavailble to namespace
>
> host# qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE2} \
> -device vhost-vsock-pci,guest-cid=15,netns=off
>
> vm# socat - VSOCK-LISTEN:1234
>
> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
> host# ip netns exec ns1 \
> qemu-system-x86_64 \
> -m 8G -smp 4 -cpu host -enable-kvm \
> -serial mon:stdio \
> -drive if=virtio,file=${IMAGE1} \
> -device vhost-vsock-pci,netns=on,guest-cid=15
>
> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>
> Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
> ---
> Changes in v2:
> - only support vhost-vsock namespaces
> - all g2h namespaces retain old behavior, only common API changes
> impacted by vhost-vsock changes
> - add /dev/vhost-vsock-netns for "opt-in"
> - leave /dev/vhost-vsock to old behavior
> - removed netns module param
> - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>
> Changes in v1:
> - added 'netns' module param to vsock.ko to enable the
> network namespace support (disabled by default)
> - added 'vsock_net_eq()' to check the "net" assigned to a socket
> only when 'netns' support is enabled
> - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>
> ---
> Stefano Garzarella (3):
> vsock: add network namespace support
> vsock/virtio_transport_common: handle netns of received packets
> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>
> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
> include/linux/miscdevice.h | 1 +
> include/linux/virtio_vsock.h | 2 +
> include/net/af_vsock.h | 10 ++--
> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
> net/vmw_vsock/hyperv_transport.c | 2 +-
> net/vmw_vsock/virtio_transport.c | 5 +-
> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
> net/vmw_vsock/vmci_transport.c | 4 +-
> net/vmw_vsock/vsock_loopback.c | 4 +-
> 10 files changed, 180 insertions(+), 43 deletions(-)
> ---
> base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
> change-id: 20250312-vsock-netns-45da9424f726
>
> Best regards,
> --
> Bobby Eshleman <bobbyeshleman@gmail.com>
>
Hi Bobby,
first of all, thank you for starting this work again!
On Wed, Mar 12, 2025 at 07:28:33PM -0700, Bobby Eshleman wrote:
>Hey all,
>
>Apologies for forgetting the 'net-next' prefix on this one. Should I
>resend or no?
I'd say let's do a first review cycle on this, then you can re-post.
Please also check the maintainers cc'ed; it looks like someone is missing:
https://patchwork.kernel.org/project/netdevbpf/patch/20250312-vsock-netns-v2-1-84bffa1aa97a@gmail.com/
>
>Best,
>Bobby
>
>On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
>> Picking up Stefano's v1 [1], this series adds netns support to
>> vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
>> namespaces, defering that for future implementation and discussion.
>>
>> Any vsock created with /dev/vhost-vsock is a global vsock, accessible
>> from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
>> "scoped" vsock, accessible only to sockets in its namespace. If a global
>> vsock or scoped vsock share the same CID, the scoped vsock takes
>> precedence.
This is inside the netns, right?
I mean if we are in a netns, and there is a VM A attached to
/dev/vhost-vsock-netns with CID=42 and a VM B attached to
/dev/vhost-vsock also with CID=42, this means that VM A will not be
accessible in the netns, but it can be accessible outside of the netns,
right?
>>
>> If a socket in a namespace connects with a global vsock, the CID becomes
>> unavailable to any VMM in that namespace when creating new vsocks. If
>> disconnected, the CID becomes available again.
IIUC if an application on the host running in a netns is connected to a
guest attached to /dev/vhost-vsock (e.g. CID=42), a new guest can't ask
for the same CID (42) on /dev/vhost-vsock-netns in the same netns while
that connection is active. Is that right?
>>
>> Testing
>>
>> QEMU with /dev/vhost-vsock-netns support:
>> https://github.com/beshleman/qemu/tree/vsock-netns
You can also use unmodified QEMU using the `vhostfd` parameter of the
`vhost-vsock-pci` device:
# FD will contain the file descriptor to /dev/vhost-vsock-netns
exec {FD}<>/dev/vhost-vsock-netns
# pass FD to the device, this is used for example by libvirt
qemu-system-x86_64 -smp 2 -M q35,accel=kvm,memory-backend=mem \
-drive file=fedora.qcow2,format=qcow2,if=virtio \
-object memory-backend-memfd,id=mem,size=512M \
-device vhost-vsock-pci,vhostfd=${FD},guest-cid=42 -nographic
That said, I agree we can extend QEMU with `netns` param too.
BTW, I'm traveling, I'll be back next Tuesday and I hope to take a
deeper look at the patches.
Thanks,
Stefano
>>
>> Test: Scoped vsocks isolated by namespace
>>
>> host# ip netns add ns1
>> host# ip netns add ns2
>> host# ip netns exec ns1 \
>> qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE1} \
>> -device
>> vhost-vsock-pci,netns=on,guest-cid=15
>> host# ip netns exec ns2 \
>> qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE2} \
>> -device vhost-vsock-pci,netns=on,guest-cid=15
>>
>> host# socat - VSOCK-CONNECT:15:1234
>> 2025/03/10 17:09:40 socat[255741] E connect(5, AF=40 cid:15 port:1234, 16): No such device
>>
>> host# echo foobar1 | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>> host# echo foobar2 | sudo ip netns exec ns2 socat - VSOCK-CONNECT:15:1234
>>
>> vm1# socat - VSOCK-LISTEN:1234
>> foobar1
>> vm2# socat - VSOCK-LISTEN:1234
>> foobar2
>>
>> Test: Global vsocks accessible to any namespace
>>
>> host# qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE2} \
>> -device vhost-vsock-pci,guest-cid=15,netns=off
>>
>> host# echo foobar | sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>>
>> vm# socat - VSOCK-LISTEN:1234
>> foobar
>>
>> Test: Connecting to global vsock makes CID unavailble to namespace
>>
>> host# qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE2} \
>> -device vhost-vsock-pci,guest-cid=15,netns=off
>>
>> vm# socat - VSOCK-LISTEN:1234
>>
>> host# sudo ip netns exec ns1 socat - VSOCK-CONNECT:15:1234
>> host# ip netns exec ns1 \
>> qemu-system-x86_64 \
>> -m 8G -smp 4 -cpu host -enable-kvm \
>> -serial mon:stdio \
>> -drive if=virtio,file=${IMAGE1} \
>> -device vhost-vsock-pci,netns=on,guest-cid=15
>>
>> qemu-system-x86_64: -device vhost-vsock-pci,netns=on,guest-cid=15: vhost-vsock: unable to set guest cid: Address already in use
>>
>> Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
>> ---
>> Changes in v2:
>> - only support vhost-vsock namespaces
>> - all g2h namespaces retain old behavior, only common API changes
>> impacted by vhost-vsock changes
>> - add /dev/vhost-vsock-netns for "opt-in"
>> - leave /dev/vhost-vsock to old behavior
>> - removed netns module param
>> - Link to v1: https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
>>
>> Changes in v1:
>> - added 'netns' module param to vsock.ko to enable the
>> network namespace support (disabled by default)
>> - added 'vsock_net_eq()' to check the "net" assigned to a socket
>> only when 'netns' support is enabled
>> - Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
>>
>> ---
>> Stefano Garzarella (3):
>> vsock: add network namespace support
>> vsock/virtio_transport_common: handle netns of received packets
>> vhost/vsock: use netns of process that opens the vhost-vsock-netns device
>>
>> drivers/vhost/vsock.c | 96 +++++++++++++++++++++++++++------
>> include/linux/miscdevice.h | 1 +
>> include/linux/virtio_vsock.h | 2 +
>> include/net/af_vsock.h | 10 ++--
>> net/vmw_vsock/af_vsock.c | 85 +++++++++++++++++++++++------
>> net/vmw_vsock/hyperv_transport.c | 2 +-
>> net/vmw_vsock/virtio_transport.c | 5 +-
>> net/vmw_vsock/virtio_transport_common.c | 14 ++++-
>> net/vmw_vsock/vmci_transport.c | 4 +-
>> net/vmw_vsock/vsock_loopback.c | 4 +-
>> 10 files changed, 180 insertions(+), 43 deletions(-)
>> ---
>> base-commit: 0ea09cbf8350b70ad44d67a1dcb379008a356034
>> change-id: 20250312-vsock-netns-45da9424f726
>>
>> Best regards,
>> --
>> Bobby Eshleman <bobbyeshleman@gmail.com>
>>
>
On Thu, Mar 13, 2025 at 04:37:16PM +0100, Stefano Garzarella wrote:
> Hi Bobby,
> first of all, thank you for starting this work again!
>
You're welcome, thank you for your work getting it started!
> On Wed, Mar 12, 2025 at 07:28:33PM -0700, Bobby Eshleman wrote:
> > Hey all,
> >
> > Apologies for forgetting the 'net-next' prefix on this one. Should I
> > resend or no?
>
> I'd say let's do a firts review cycle on this, then you can re-post.
> Please check also maintainer cced, it looks like someone is missing:
> https://patchwork.kernel.org/project/netdevbpf/patch/20250312-vsock-netns-v2-1-84bffa1aa97a@gmail.com/
>
Duly noted, I'll double-check the ccs next time. sgtm on the re-post!
> > On Wed, Mar 12, 2025 at 01:59:34PM -0700, Bobby Eshleman wrote:
> > > Picking up Stefano's v1 [1], this series adds netns support to
> > > vhost-vsock. Unlike v1, this series does not address guest-to-host (g2h)
> > > namespaces, defering that for future implementation and discussion.
> > >
> > > Any vsock created with /dev/vhost-vsock is a global vsock, accessible
> > > from any namespace. Any vsock created with /dev/vhost-vsock-netns is a
> > > "scoped" vsock, accessible only to sockets in its namespace. If a global
> > > vsock or scoped vsock share the same CID, the scoped vsock takes
> > > precedence.
>
> This inside the netns, right?
> I mean if we are in a netns, and there is a VM A attached to
> /dev/vhost-vsock-netns witch CID=42 and a VM B attached to /dev/vhost-vsock
> also with CID=42, this means that VM A will not be accessible in the netns,
> but it can be accessible outside of the netns,
> right?
>
In this scenario, CID=42 goes to VM A (/dev/vhost-vsock-netns) for any
socket in its namespace. For any other namespace, CID=42 will go to VM
B (/dev/vhost-vsock).
If I understand your setup correctly:
Namespace 1:
VM A - /dev/vhost-vsock-netns, CID=42
Process X
Namespace 2:
VM B - /dev/vhost-vsock, CID=42
Process Y
Namespace 3:
Process Z
In this scenario, taking connect() as an example:
Process X connect(CID=42) goes to VM A
Process Y connect(CID=42) goes to VM B
Process Z connect(CID=42) goes to VM B
If VM A goes away (migration, shutdown, etc...):
Process X connect(CID=42) also goes to VM B
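In other words, the resolution order is "scoped first, then global". A
rough sketch of that shape, where both lookup helpers are invented names
and not the actual functions used by the series:

/* Pick the vhost-vsock instance a connect(CID=cid) from @net should reach:
 * a vsock created via /dev/vhost-vsock-netns in this netns wins, otherwise
 * fall back to a global /dev/vhost-vsock one (e.g. after VM A goes away).
 */
static struct vhost_vsock *vhost_vsock_get_for_cid(struct net *net, u32 cid)
{
	struct vhost_vsock *vsock;

	vsock = vhost_vsock_find_scoped(net, cid);	/* /dev/vhost-vsock-netns */
	if (vsock)
		return vsock;

	return vhost_vsock_find_global(cid);		/* /dev/vhost-vsock */
}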
> > >
> > > If a socket in a namespace connects with a global vsock, the CID becomes
> > > unavailable to any VMM in that namespace when creating new vsocks. If
> > > disconnected, the CID becomes available again.
>
> IIUC if an application in the host running in a netns, is connected to a
> guest attached to /dev/vhost-vsock (e.g. CID=42), a new guest can't be ask
> for the same CID (42) on /dev/vhost-vsock-netns in the same netns till that
> connection is active. Is that right?
>
Right. Here is the scenario I am trying to avoid:
Step 1: namespace 1, VM A allocated with CID 42 on /dev/vhost-vsock
Step 2: namespace 2, connect(CID=42) (this is legal, preserves old
behavior)
Step 3: namespace 2, VM B allocated with CID 42 on
/dev/vhost-vsock-netns
After step 3, CID=42 in this current namespace should belong to VM B, but
the connection from step 2 would be with VM A.
I think we have some options:
1. disallow the new VM B because the namespace is already active with VM A
2. try to allow the connection to resume, but make sure that new
connections go to VM B
3. close the connection from namespace 2, spin up VM B, hope user
manages connection retry
4. auto-retry connect to the new VM B? (seems like doing too much on the
kernel side to me)
I chose option 1 for this rev mostly for simplicity, but I'm definitely
open to suggestions. I think option 3 would also be simple to implement.
Option 2 would require adding some concept of "vhost-vsock ns at time of
connection" to each socket, so the transport would know which vhost_vsock
to use for which socket.
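For option 1, the check at guest-CID assignment time would be roughly of
this shape; the helper that looks for live connections from this netns to
a global vsock with that CID is hypothetical, and this is not the actual
code from the series:

/* Option 1 sketch: refuse VHOST_VSOCK_SET_GUEST_CID on the scoped device
 * if a socket in this netns is currently connected to a global vsock
 * using the same CID. QEMU would then see "Address already in use".
 */
static int vhost_vsock_scoped_cid_ok(struct net *net, u32 guest_cid)
{
	if (vsock_net_has_global_conn(net, guest_cid))	/* hypothetical */
		return -EADDRINUSE;

	return 0;
}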
> > >
> > > Testing
> > >
> > > QEMU with /dev/vhost-vsock-netns support:
> > > https://github.com/beshleman/qemu/tree/vsock-netns
>
> You can also use unmodified QEMU using `vhostfd` parameter of
> `vhost-vsock-pci` device:
>
> # FD will contain the file descriptor to /dev/vhost-vsock-netns
> exec {FD}<>/dev/vhost-vsock-netns
>
> # pass FD to the device, this is used for example by libvirt
> qemu-system-x86_64 -smp 2 -M q35,accel=kvm,memory-backend=mem \
> -drive file=fedora.qcow2,format=qcow2,if=virtio \
> -object memory-backend-memfd,id=mem,size=512M \
> -device vhost-vsock-pci,vhostfd=${FD},guest-cid=42 -nographic
>
Very nice, thanks, I didn't realize that!
> That said, I agree we can extend QEMU with `netns` param too.
>
I'm open to either. Your solution above is super elegant.
> BTW, I'm traveling, I'll be back next Tuesday and I hope to take a deeper
> look to the patches.
>
> Thanks,
> Stefano
>
Thanks Stefano! Enjoy the travel.
Best,
Bobby