virtiofsd: drop Linux capabilities(7)

[PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Stefan Hajnoczi 5 years ago

virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
whitelisted set of capabilities that we require.  This improves security in
case virtiofsd is compromised by making it hard for an attacker to gain further
access to the system.

Stefan Hajnoczi (2):
  virtiofsd: only retain file system capabilities
  virtiofsd: drop all capabilities in the wait parent process

 tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

-- 
2.25.1

Re: [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 12 months ago

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Queued.

> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Hi Stefan,

I just noticed that this patch set breaks overlayfs on top of virtiofs.

overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
need CAP_SYS_ADMIN.

man xattr says.

   Trusted extended attributes
       Trusted  extended  attributes  are  visible and accessible only to pro‐
       cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
       class are used to implement mechanisms in user space (i.e., outside the
       kernel) which keep information in extended attributes to which ordinary
       processes should not have access.

There is a chance that overlay moves away from trusted xattr in future.
But for now we need to make it work. This is an important use case for
kata docker in docker build.

May be we can add an option to virtiofsd say "--add-cap <capability>" and
ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
with this capaibility.

Thanks
Vivek

> 
> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
> 
> Hi Stefan,
> 
> I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> need CAP_SYS_ADMIN.
> 
> man xattr says.
> 
>    Trusted extended attributes
>        Trusted  extended  attributes  are  visible and accessible only to pro‐
>        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
>        class are used to implement mechanisms in user space (i.e., outside the
>        kernel) which keep information in extended attributes to which ordinary
>        processes should not have access.
> 
> There is a chance that overlay moves away from trusted xattr in future.
> But for now we need to make it work. This is an important use case for
> kata docker in docker build.
> 
> May be we can add an option to virtiofsd say "--add-cap <capability>" and
> ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> with this capaibility.

I'll admit I don't like the idea of giving it cap_sys_admin.
Can you explain:
  a) What overlayfs uses trusted for?
  b) If something nasty was to write junk into the trusted attributes,
    what would happen?
  c) I see overlayfs has a fallback check if xattr isn't supported at
all - what is the consequence?

Dave

> Thanks
> Vivek
> 
> > 
> > Stefan Hajnoczi (2):
> >   virtiofsd: only retain file system capabilities
> >   virtiofsd: drop all capabilities in the wait parent process
> > 
> >  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
> >  1 file changed, 51 insertions(+)
> > 
> > -- 
> > 2.25.1
> > 
> > _______________________________________________
> > Virtio-fs mailing list
> > Virtio-fs@redhat.com
> > https://www.redhat.com/mailman/listinfo/virtio-fs
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> > 
> > Hi Stefan,
> > 
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > 
> > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > need CAP_SYS_ADMIN.
> > 
> > man xattr says.
> > 
> >    Trusted extended attributes
> >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> >        class are used to implement mechanisms in user space (i.e., outside the
> >        kernel) which keep information in extended attributes to which ordinary
> >        processes should not have access.
> > 
> > There is a chance that overlay moves away from trusted xattr in future.
> > But for now we need to make it work. This is an important use case for
> > kata docker in docker build.
> > 
> > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > with this capaibility.
> 
> I'll admit I don't like the idea of giving it cap_sys_admin.
> Can you explain:
>   a) What overlayfs uses trusted for?

overlayfs stores bunch of metadata and uses "trusted" xattrs for it.

>   b) If something nasty was to write junk into the trusted attributes,
>     what would happen?

This directory is owned by guest. So it should be able to write
anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?

>   c) I see overlayfs has a fallback check if xattr isn't supported at
> all - what is the consequence?

It falls back to I think read only mode. 

For a moment forget about overlayfs. Say a user process in guest with
CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
passthrough filesystem, so it should go through. But currently it
wont.

Thanks
Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Chirantan Ekbote 4 years, 10 months ago

On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > >
> > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > need CAP_SYS_ADMIN.
> > >

Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
user namespace and it cannot set any trusted or security xattrs.  The
security xattrs check is at least namespace aware so you only need
CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
help us much.

We ended up working around it by prefixing "user.virtiofs." to the
xattr name[2], which has its own problems but there was pretty much no
chance that we would be able to give the fs device CAP_SYS_ADMIN in
the init namespace.

[1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116
[2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Chirantan Ekbote (chirantan@chromium.org) wrote:
> On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > >
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > >
> 
> Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> user namespace and it cannot set any trusted or security xattrs.  The
> security xattrs check is at least namespace aware so you only need
> CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> help us much.

It would have been good if you'd mentioned that; it would have saved
Vivek some confusion!

> We ended up working around it by prefixing "user.virtiofs." to the
> xattr name[2], which has its own problems but there was pretty much no
> chance that we would be able to give the fs device CAP_SYS_ADMIN in
> the init namespace.


What problems did you hit with that?  We should standardise the renaming
so we make an on-disc format that's compatible.

Dave

> 
> [1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116
> [2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Chirantan Ekbote 4 years, 10 months ago

On Fri, Jun 19, 2020 at 5:40 PM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Chirantan Ekbote (chirantan@chromium.org) wrote:
>
> > We ended up working around it by prefixing "user.virtiofs." to the
> > xattr name[2], which has its own problems but there was pretty much no
> > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > the init namespace.
>
>
> What problems did you hit with that?  We should standardise the renaming
> so we make an on-disc format that's compatible.
>

I guess what I meant by problems is that it made what was previously a
simple and straightforward implementation into something more complex
and added some limitations.  For example, we now need to parse the
result of the listxattr system call and strip out the prefix from any
name in the list.  It also means that we cannot allow the guest to
directly set or remove any "user.virtiofs." xattr as this would allow
an unprivileged process in the vm to modify an xattr that it wouldn't
otherwise be allowed to modify.  On top of being a somewhat arbitrary
restriction this also means that you can't have stacked virtiofs
instances as the lower instance would reject attempts by the upper one
to set those xattrs.  These limitations aren't really a problem for us
but I can see how they might be a problem for others.

The change was also merged just yesterday so there may be other
problems with it that haven't surfaced yet.

I didn't mention it before because I figured this was something that
we brought upon ourselves as chrome os is a bit extreme about
sandboxing.  If we can come up with a standardized way to handle this
I think we'll gladly switch the chrome os implementation to use it.

Chirantan

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Chirantan Ekbote (chirantan@chromium.org) wrote:
> On Fri, Jun 19, 2020 at 5:40 PM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Chirantan Ekbote (chirantan@chromium.org) wrote:
> >
> > > We ended up working around it by prefixing "user.virtiofs." to the
> > > xattr name[2], which has its own problems but there was pretty much no
> > > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > > the init namespace.
> >
> >
> > What problems did you hit with that?  We should standardise the renaming
> > so we make an on-disc format that's compatible.
> >
> 
> I guess what I meant by problems is that it made what was previously a
> simple and straightforward implementation into something more complex
> and added some limitations.

Yeh.

>  For example, we now need to parse the
> result of the listxattr system call and strip out the prefix from any
> name in the list.  It also means that we cannot allow the guest to
> directly set or remove any "user.virtiofs." xattr as this would allow
> an unprivileged process in the vm to modify an xattr that it wouldn't
> otherwise be allowed to modify.  On top of being a somewhat arbitrary
> restriction this also means that you can't have stacked virtiofs
> instances as the lower instance would reject attempts by the upper one
> to set those xattrs.  These limitations aren't really a problem for us
> but I can see how they might be a problem for others.

Isn't this a classic escaping problem?
Would it work if you prepended  'user.virtiofs.' onto any xattr
that started with 'trusted' or 'user.virtiofs.' ?

> The change was also merged just yesterday so there may be other
> problems with it that haven't surfaced yet.
> 
> I didn't mention it before because I figured this was something that
> we brought upon ourselves as chrome os is a bit extreme about
> sandboxing. 

I think we're trying to be as extreme in virtiofsd, but it is causing us
similar problems.

> If we can come up with a standardized way to handle this
> I think we'll gladly switch the chrome os implementation to use it.

Great!

Dave

> 
> Chirantan
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote:
> On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > >
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > >
> 
> Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> user namespace and it cannot set any trusted or security xattrs.  The
> security xattrs check is at least namespace aware so you only need
> CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> help us much.
> 
> We ended up working around it by prefixing "user.virtiofs." to the
> xattr name[2], which has its own problems but there was pretty much no
> chance that we would be able to give the fs device CAP_SYS_ADMIN in
> the init namespace.

Chirantan, 

So you ended up renaming all "trusted", "security" and "system" xattrs?
Only "user" xattrs are complete passthrough?

IOW, security.selinux will be renamed to user.virtiofs.security.selinux
on host?

Thanks
Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Chirantan Ekbote 4 years, 10 months ago

On Sat, Jun 20, 2020 at 4:15 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote:
> > On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > >
> > > > > Hi Stefan,
> > > > >
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > >
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > >
> >
> > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1].  We
> > have the same problem.  Our virtiofs process has CAP_SYS_ADMIN in a
> > user namespace and it cannot set any trusted or security xattrs.  The
> > security xattrs check is at least namespace aware so you only need
> > CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't
> > help us much.
> >
> > We ended up working around it by prefixing "user.virtiofs." to the
> > xattr name[2], which has its own problems but there was pretty much no
> > chance that we would be able to give the fs device CAP_SYS_ADMIN in
> > the init namespace.
>
> Chirantan,
>
> So you ended up renaming all "trusted", "security" and "system" xattrs?
> Only "user" xattrs are complete passthrough?
>

No, we only rename "security" xattrs (except for selinux).

>
> IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> on host?
>

We don't relabel security.selinux because it only requires CAP_FOWNER
in the process's user namespace and it also does its own MAC-based
checks.  Also we have some tools that label files beforehand so it
seemed easier to leave them unchanged.

Chirantan

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
[..]
> > Chirantan,
> >
> > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > Only "user" xattrs are complete passthrough?
> >
> 
> No, we only rename "security" xattrs (except for selinux).
> 
> >
> > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > on host?
> >
> 
> We don't relabel security.selinux because it only requires CAP_FOWNER
> in the process's user namespace and it also does its own MAC-based
> checks.  Also we have some tools that label files beforehand so it
> seemed easier to leave them unchanged.

If we rename selinux xattr also, then we can support selinux both in
guest and host and they both can have their own independent policies.

Otherwise we either have to disable selinux on host (if we want to
support it in guest) or somehow guest and how policies will have
to know about each other and be able to work together (which will
be hard for a generic use case).

Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Chirantan Ekbote 4 years, 9 months ago

On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
> [..]
> > > Chirantan,
> > >
> > > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > > Only "user" xattrs are complete passthrough?
> > >
> >
> > No, we only rename "security" xattrs (except for selinux).
> >
> > >
> > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > > on host?
> > >
> >
> > We don't relabel security.selinux because it only requires CAP_FOWNER
> > in the process's user namespace and it also does its own MAC-based
> > checks.  Also we have some tools that label files beforehand so it
> > seemed easier to leave them unchanged.
>
> If we rename selinux xattr also, then we can support selinux both in
> guest and host and they both can have their own independent policies.
>

This works as long as you don't need to support setfscreatecon, which
exists to guarantee that an inode is atomically created with the
correct selinux context.  It's kind of possible for files with
O_TMPFILE + fsetxattr + linkat but there is no equivalent for
directories.  You're basically forced to create the directory and then
set the xattr and while it's possible to prevent other threads in the
server from seeing the unfinished directories with a rwlock or similar
there is no protection against power loss.  If the machine loses power
after the directory is created but before the selinux xattr is set
then that directory will have the wrong selinux context and the guest
would need to run restorecon at boot to assign the correct label.

> Otherwise we either have to disable selinux on host (if we want to
> support it in guest) or somehow guest and how policies will have
> to know about each other and be able to work together (which will
> be hard for a generic use case).

Yes, I agree this is hard to do for a generic case but unfortunately
the more I understand how selinux works the less I feel that it works
well with a passthrough style file system.  As you said it either
needs to be turned off on the host or the host and guest need to work
together.

Chirantan

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 9 months ago

On Mon, Jul 13, 2020 at 05:54:19PM +0900, Chirantan Ekbote wrote:
> On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote:
> > [..]
> > > > Chirantan,
> > > >
> > > > So you ended up renaming all "trusted", "security" and "system" xattrs?
> > > > Only "user" xattrs are complete passthrough?
> > > >
> > >
> > > No, we only rename "security" xattrs (except for selinux).
> > >
> > > >
> > > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux
> > > > on host?
> > > >
> > >
> > > We don't relabel security.selinux because it only requires CAP_FOWNER
> > > in the process's user namespace and it also does its own MAC-based
> > > checks.  Also we have some tools that label files beforehand so it
> > > seemed easier to leave them unchanged.
> >
> > If we rename selinux xattr also, then we can support selinux both in
> > guest and host and they both can have their own independent policies.
> >
> 
> This works as long as you don't need to support setfscreatecon, which
> exists to guarantee that an inode is atomically created with the
> correct selinux context.  It's kind of possible for files with
> O_TMPFILE + fsetxattr + linkat but there is no equivalent for
> directories.  You're basically forced to create the directory and then
> set the xattr and while it's possible to prevent other threads in the
> server from seeing the unfinished directories with a rwlock or similar
> there is no protection against power loss.  If the machine loses power
> after the directory is created but before the selinux xattr is set
> then that directory will have the wrong selinux context and the guest
> would need to run restorecon at boot to assign the correct label.

Overlayfs has this notion of work directory where they create directory
(and file if O_TMPFILE is not supported), and then rename it to correct
place. I am wondering if we could do something similar. If server is
given a work directory where.

A. Create new directory
B. set selinux lables
C. rename directory

That way if machine crashes after step A or step B, that directory
never becomes visible to guest and no relabeling is required.

Thanks
Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > > 
> > > Hi Stefan,
> > > 
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > 
> > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > need CAP_SYS_ADMIN.
> > > 
> > > man xattr says.
> > > 
> > >    Trusted extended attributes
> > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > >        class are used to implement mechanisms in user space (i.e., outside the
> > >        kernel) which keep information in extended attributes to which ordinary
> > >        processes should not have access.
> > > 
> > > There is a chance that overlay moves away from trusted xattr in future.
> > > But for now we need to make it work. This is an important use case for
> > > kata docker in docker build.
> > > 
> > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > with this capaibility.
> > 
> > I'll admit I don't like the idea of giving it cap_sys_admin.
> > Can you explain:
> >   a) What overlayfs uses trusted for?
> 
> overlayfs stores bunch of metadata and uses "trusted" xattrs for it.

Tell me more about this metadata.
Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
Or what happens if I was to write random numbers into OVL_XATTR_NLINK?

> >   b) If something nasty was to write junk into the trusted attributes,
> >     what would happen?
> 
> This directory is owned by guest. So it should be able to write
> anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?

Well, we shouldn't be able to break/crash/escape into the host; how
much does overlayfs validate trusted.* it uses?

> >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > all - what is the consequence?
> 
> It falls back to I think read only mode. 

It looks like the fallback is more subtle to me:
        /*
         * Check if upper/work fs supports trusted.overlay.* xattr
         */
        err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
        if (err) {
                ofs->noxattr = true;
                ofs->config.index = false;
                ofs->config.metacopy = false;
                pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");

but I don't know what index and metacopy are.

> For a moment forget about overlayfs. Say a user process in guest with
> CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> passthrough filesystem, so it should go through. But currently it
> wont.

As long as any effects of what it writes are contained to the area of
the filesystem exposed to the guest, yes - however it worries me what
the consequences of broken trusted metadata is.  If it's delicate enough
that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Daniel P. Berrangé 4 years, 10 months ago

On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > > 
> > > > Hi Stefan,
> > > > 
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > 
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > > 
> > > > man xattr says.
> > > > 
> > > >    Trusted extended attributes
> > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > >        kernel) which keep information in extended attributes to which ordinary
> > > >        processes should not have access.
> > > > 
> > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > But for now we need to make it work. This is an important use case for
> > > > kata docker in docker build.
> > > > 
> > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > with this capaibility.
> > > 
> > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > Can you explain:
> > >   a) What overlayfs uses trusted for?
> > 
> > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> 
> Tell me more about this metadata.
> Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> 
> > >   b) If something nasty was to write junk into the trusted attributes,
> > >     what would happen?
> > 
> > This directory is owned by guest. So it should be able to write
> > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> 
> Well, we shouldn't be able to break/crash/escape into the host; how
> much does overlayfs validate trusted.* it uses?
> 
> > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > all - what is the consequence?
> > 
> > It falls back to I think read only mode. 
> 
> It looks like the fallback is more subtle to me:
>         /*
>          * Check if upper/work fs supports trusted.overlay.* xattr
>          */
>         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
>         if (err) {
>                 ofs->noxattr = true;
>                 ofs->config.index = false;
>                 ofs->config.metacopy = false;
>                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> 
> but I don't know what index and metacopy are.
> 
> > For a moment forget about overlayfs. Say a user process in guest with
> > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > passthrough filesystem, so it should go through. But currently it
> > wont.
> 
> As long as any effects of what it writes are contained to the area of
> the filesystem exposed to the guest, yes - however it worries me what
> the consequences of broken trusted metadata is.  If it's delicate enough
> that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
mechanism for applications to control access. The host kernel doesn'
tuse this namespace itself. Linux has four namespaces for xattrs:

 -  user - for userspace apps. accessible based on read/write permissions
 -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
 -  system - for kernel only. used by ACLs
 -  security - for kernel only. used by SELinux

The use case for "trusted" xattrs is thus where a privileged management
application or service wants to store metadata against the file, but
also needs to grant an unprivileged process access to write to this file
while not allowing that unprivileged process the ability to change the
metadata. This is mentioned in the man page:

[man xattr(7)]
   Trusted extended attributes
       Trusted  extended attributes are visible and accessible only to pro‐
       cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
       class  are used to implement mechanisms in user space (i.e., outside
       the kernel) which keep information in extended attributes  to  which
       ordinary processes should not have access.

   User extended attributes
       User  extended  attributes  may be assigned to files and directories
       for storing arbitrary additional information such as the mime  type,
       character  set  or  encoding  of a file.  The access permissions for
       user attributes are defined by the file permission bits:  read  per‐
       mission is required to retrieve the attribute value, and writer per‐
       mission is required to change it.
[/man]

Libvirtd uses the "trusted." xattr namespace to record information against
disk images for QEMU, because we need to grant QEMU access to read/write
the disk iamges, but don't want QEMU to be able to alter our xattrs.

It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
It really ought to have had its own dedicated capability :-( Such is
life with anything that uses CAP_SYS_ADMIN...

With this in mind we really should have both trusted. & user. xattrs
allowed to the guest by default.

Conversely, we'll need to block usage of the security. and system.
namespaces.

Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
have a "--trusted-xattrs=on|off" flag to allow people to run in a more
locked down state if they knowingly want to block trusted xattrs.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote:
[..]
> The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> mechanism for applications to control access. The host kernel doesn'
> tuse this namespace itself. Linux has four namespaces for xattrs:
> 
>  -  user - for userspace apps. accessible based on read/write permissions
>  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
>  -  system - for kernel only. used by ACLs
>  -  security - for kernel only. used by SELinux
> 
> The use case for "trusted" xattrs is thus where a privileged management
> application or service wants to store metadata against the file, but
> also needs to grant an unprivileged process access to write to this file
> while not allowing that unprivileged process the ability to change the
> metadata. This is mentioned in the man page:
> 
> [man xattr(7)]
>    Trusted extended attributes
>        Trusted  extended attributes are visible and accessible only to pro‐
>        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
>        class  are used to implement mechanisms in user space (i.e., outside
>        the kernel) which keep information in extended attributes  to  which
>        ordinary processes should not have access.
> 
>    User extended attributes
>        User  extended  attributes  may be assigned to files and directories
>        for storing arbitrary additional information such as the mime  type,
>        character  set  or  encoding  of a file.  The access permissions for
>        user attributes are defined by the file permission bits:  read  per‐
>        mission is required to retrieve the attribute value, and writer per‐
>        mission is required to change it.
> [/man]
> 
> Libvirtd uses the "trusted." xattr namespace to record information against
> disk images for QEMU, because we need to grant QEMU access to read/write
> the disk iamges, but don't want QEMU to be able to alter our xattrs.
> 
> It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> It really ought to have had its own dedicated capability :-( Such is
> life with anything that uses CAP_SYS_ADMIN...
> 
> With this in mind we really should have both trusted. & user. xattrs
> allowed to the guest by default.
> 
> Conversely, we'll need to block usage of the security. and system.
> namespaces.

I am wondering can we block usage of "system" and "security"?  What
about guest setting acls over virtiofs files. These will have to
go through and that means we need to allow system xattrs.

Similarly setting file capabilities inside should trigger
setxattr(security.capability) and that means we need to allow security
xattr as well.

Thanks
Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote:
> [..]
> > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > mechanism for applications to control access. The host kernel doesn'
> > tuse this namespace itself. Linux has four namespaces for xattrs:
> > 
> >  -  user - for userspace apps. accessible based on read/write permissions
> >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> >  -  system - for kernel only. used by ACLs
> >  -  security - for kernel only. used by SELinux
> > 
> > The use case for "trusted" xattrs is thus where a privileged management
> > application or service wants to store metadata against the file, but
> > also needs to grant an unprivileged process access to write to this file
> > while not allowing that unprivileged process the ability to change the
> > metadata. This is mentioned in the man page:
> > 
> > [man xattr(7)]
> >    Trusted extended attributes
> >        Trusted  extended attributes are visible and accessible only to pro‐
> >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> >        class  are used to implement mechanisms in user space (i.e., outside
> >        the kernel) which keep information in extended attributes  to  which
> >        ordinary processes should not have access.
> > 
> >    User extended attributes
> >        User  extended  attributes  may be assigned to files and directories
> >        for storing arbitrary additional information such as the mime  type,
> >        character  set  or  encoding  of a file.  The access permissions for
> >        user attributes are defined by the file permission bits:  read  per‐
> >        mission is required to retrieve the attribute value, and writer per‐
> >        mission is required to change it.
> > [/man]
> > 
> > Libvirtd uses the "trusted." xattr namespace to record information against
> > disk images for QEMU, because we need to grant QEMU access to read/write
> > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > 
> > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > It really ought to have had its own dedicated capability :-( Such is
> > life with anything that uses CAP_SYS_ADMIN...
> > 
> > With this in mind we really should have both trusted. & user. xattrs
> > allowed to the guest by default.
> > 
> > Conversely, we'll need to block usage of the security. and system.
> > namespaces.
> 
> I am wondering can we block usage of "system" and "security"?  What
> about guest setting acls over virtiofs files. These will have to
> go through and that means we need to allow system xattrs.
> 
> Similarly setting file capabilities inside should trigger
> setxattr(security.capability) and that means we need to allow security
> xattr as well.

Yep, we see that when people install Fedora packages, when rpm
unpackgs /usr/bin/newgidmap which has:

$ getfattr -d '--match=.*' /usr/bin/newgidmap 
getfattr: Removing leading '/' from absolute path names
# file: usr/bin/newgidmap
security.capability=0sAQAAAkAAAAAAAAAAAAAAAAAAAAA=
security.selinux="system_u:object_r:bin_t:s0"


Dave

> Thanks
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > > 
> > > > > Hi Stefan,
> > > > > 
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > 
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > > 
> > > > > man xattr says.
> > > > > 
> > > > >    Trusted extended attributes
> > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > >        processes should not have access.
> > > > > 
> > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > But for now we need to make it work. This is an important use case for
> > > > > kata docker in docker build.
> > > > > 
> > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > with this capaibility.
> > > > 
> > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > Can you explain:
> > > >   a) What overlayfs uses trusted for?
> > > 
> > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > 
> > Tell me more about this metadata.
> > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > 
> > > >   b) If something nasty was to write junk into the trusted attributes,
> > > >     what would happen?
> > > 
> > > This directory is owned by guest. So it should be able to write
> > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > 
> > Well, we shouldn't be able to break/crash/escape into the host; how
> > much does overlayfs validate trusted.* it uses?
> > 
> > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > all - what is the consequence?
> > > 
> > > It falls back to I think read only mode. 
> > 
> > It looks like the fallback is more subtle to me:
> >         /*
> >          * Check if upper/work fs supports trusted.overlay.* xattr
> >          */
> >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> >         if (err) {
> >                 ofs->noxattr = true;
> >                 ofs->config.index = false;
> >                 ofs->config.metacopy = false;
> >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > 
> > but I don't know what index and metacopy are.
> > 
> > > For a moment forget about overlayfs. Say a user process in guest with
> > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > passthrough filesystem, so it should go through. But currently it
> > > wont.
> > 
> > As long as any effects of what it writes are contained to the area of
> > the filesystem exposed to the guest, yes - however it worries me what
> > the consequences of broken trusted metadata is.  If it's delicate enough
> > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> 
> The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> mechanism for applications to control access. The host kernel doesn'
> tuse this namespace itself. Linux has four namespaces for xattrs:
> 
>  -  user - for userspace apps. accessible based on read/write permissions
>  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
>  -  system - for kernel only. used by ACLs
>  -  security - for kernel only. used by SELinux
> 
> The use case for "trusted" xattrs is thus where a privileged management
> application or service wants to store metadata against the file, but
> also needs to grant an unprivileged process access to write to this file
> while not allowing that unprivileged process the ability to change the
> metadata. This is mentioned in the man page:
> 
> [man xattr(7)]
>    Trusted extended attributes
>        Trusted  extended attributes are visible and accessible only to pro‐
>        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
>        class  are used to implement mechanisms in user space (i.e., outside
>        the kernel) which keep information in extended attributes  to  which
>        ordinary processes should not have access.

overlayfs inside the kernel is using trusted though which is the case
Vivek ran into.

>    User extended attributes
>        User  extended  attributes  may be assigned to files and directories
>        for storing arbitrary additional information such as the mime  type,
>        character  set  or  encoding  of a file.  The access permissions for
>        user attributes are defined by the file permission bits:  read  per‐
>        mission is required to retrieve the attribute value, and writer per‐
>        mission is required to change it.
> [/man]
> 
> Libvirtd uses the "trusted." xattr namespace to record information against
> disk images for QEMU, because we need to grant QEMU access to read/write
> the disk iamges, but don't want QEMU to be able to alter our xattrs.
> 
> It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> It really ought to have had its own dedicated capability :-( Such is
> life with anything that uses CAP_SYS_ADMIN...
> 
> With this in mind we really should have both trusted. & user. xattrs
> allowed to the guest by default.
> 
> Conversely, we'll need to block usage of the security. and system.
> namespaces.
> 
> Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> locked down state if they knowingly want to block trusted xattrs.

Does the trick described by Chirantan work for you, where the clients
'trusted' view is different from the host?

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Daniel P. Berrangé 4 years, 10 months ago

On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > > access to the system.
> > > > > > 
> > > > > > Hi Stefan,
> > > > > > 
> > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > > 
> > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > > need CAP_SYS_ADMIN.
> > > > > > 
> > > > > > man xattr says.
> > > > > > 
> > > > > >    Trusted extended attributes
> > > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > > >        processes should not have access.
> > > > > > 
> > > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > > But for now we need to make it work. This is an important use case for
> > > > > > kata docker in docker build.
> > > > > > 
> > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > > with this capaibility.
> > > > > 
> > > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > > Can you explain:
> > > > >   a) What overlayfs uses trusted for?
> > > > 
> > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > > 
> > > Tell me more about this metadata.
> > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > > 
> > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > >     what would happen?
> > > > 
> > > > This directory is owned by guest. So it should be able to write
> > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > 
> > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > much does overlayfs validate trusted.* it uses?
> > > 
> > > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > > all - what is the consequence?
> > > > 
> > > > It falls back to I think read only mode. 
> > > 
> > > It looks like the fallback is more subtle to me:
> > >         /*
> > >          * Check if upper/work fs supports trusted.overlay.* xattr
> > >          */
> > >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> > >         if (err) {
> > >                 ofs->noxattr = true;
> > >                 ofs->config.index = false;
> > >                 ofs->config.metacopy = false;
> > >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > > 
> > > but I don't know what index and metacopy are.
> > > 
> > > > For a moment forget about overlayfs. Say a user process in guest with
> > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > > passthrough filesystem, so it should go through. But currently it
> > > > wont.
> > > 
> > > As long as any effects of what it writes are contained to the area of
> > > the filesystem exposed to the guest, yes - however it worries me what
> > > the consequences of broken trusted metadata is.  If it's delicate enough
> > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> > 
> > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > mechanism for applications to control access. The host kernel doesn'
> > tuse this namespace itself. Linux has four namespaces for xattrs:
> > 
> >  -  user - for userspace apps. accessible based on read/write permissions
> >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> >  -  system - for kernel only. used by ACLs
> >  -  security - for kernel only. used by SELinux
> > 
> > The use case for "trusted" xattrs is thus where a privileged management
> > application or service wants to store metadata against the file, but
> > also needs to grant an unprivileged process access to write to this file
> > while not allowing that unprivileged process the ability to change the
> > metadata. This is mentioned in the man page:
> > 
> > [man xattr(7)]
> >    Trusted extended attributes
> >        Trusted  extended attributes are visible and accessible only to pro‐
> >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> >        class  are used to implement mechanisms in user space (i.e., outside
> >        the kernel) which keep information in extended attributes  to  which
> >        ordinary processes should not have access.
> 
> overlayfs inside the kernel is using trusted though which is the case
> Vivek ran into.

IIUC, this is need on the lower-layer FS. If you're running overlayfs on the
host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted."
on the upper layer won't clash with overlayfs in the host.

If you're running overlayfs in the guest, then virtiofs is the lower layer
so needs to support the "trusted." on the FS its exposing. Again this should
not clash with anything on the host.

> >    User extended attributes
> >        User  extended  attributes  may be assigned to files and directories
> >        for storing arbitrary additional information such as the mime  type,
> >        character  set  or  encoding  of a file.  The access permissions for
> >        user attributes are defined by the file permission bits:  read  per‐
> >        mission is required to retrieve the attribute value, and writer per‐
> >        mission is required to change it.
> > [/man]
> > 
> > Libvirtd uses the "trusted." xattr namespace to record information against
> > disk images for QEMU, because we need to grant QEMU access to read/write
> > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > 
> > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > It really ought to have had its own dedicated capability :-( Such is
> > life with anything that uses CAP_SYS_ADMIN...
> > 
> > With this in mind we really should have both trusted. & user. xattrs
> > allowed to the guest by default.
> > 
> > Conversely, we'll need to block usage of the security. and system.
> > namespaces.
> > 
> > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> > have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> > locked down state if they knowingly want to block trusted xattrs.
> 
> Does the trick described by Chirantan work for you, where the clients
> 'trusted' view is different from the host?

That works as long as you don't require any interoperability with
processes that have setup the exposed file tree on the host, or
might need to concurrently access it.

This probably suggests a need various modes of exposing xattrs

 - "passthrough"  - run CAP_SYS_ADMIN and pass directly
 - "mapped"       - no capabilities, rewrite everything into user. namespace
 - "disable"      - don't allow xattr

As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present
in the initial namespace, so to support "passthrough" we'd have to NOT
use user namespace too. Something to be aware of when we accept the
user namespace patches.

Such "mapping" of attributes could also be used if virtiofs was running
as an unprivileged user account, to expose a full range of file owner,
group and permissions to the guest, while having everything be owned
by the normal user on the host.   QEMU's current 9p filesystem has
this kind of mapping support for owner/permissions, so guest view can
be distinct from the host view.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 01:05:47PM +0100, Daniel P. Berrangé wrote:
> On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > > > access to the system.
> > > > > > > 
> > > > > > > Hi Stefan,
> > > > > > > 
> > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > > > 
> > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > > > need CAP_SYS_ADMIN.
> > > > > > > 
> > > > > > > man xattr says.
> > > > > > > 
> > > > > > >    Trusted extended attributes
> > > > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > > > >        processes should not have access.
> > > > > > > 
> > > > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > > > But for now we need to make it work. This is an important use case for
> > > > > > > kata docker in docker build.
> > > > > > > 
> > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > > > with this capaibility.
> > > > > > 
> > > > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > > > Can you explain:
> > > > > >   a) What overlayfs uses trusted for?
> > > > > 
> > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > > > 
> > > > Tell me more about this metadata.
> > > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> > > > 
> > > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > > >     what would happen?
> > > > > 
> > > > > This directory is owned by guest. So it should be able to write
> > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > > 
> > > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > > much does overlayfs validate trusted.* it uses?
> > > > 
> > > > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > > > all - what is the consequence?
> > > > > 
> > > > > It falls back to I think read only mode. 
> > > > 
> > > > It looks like the fallback is more subtle to me:
> > > >         /*
> > > >          * Check if upper/work fs supports trusted.overlay.* xattr
> > > >          */
> > > >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> > > >         if (err) {
> > > >                 ofs->noxattr = true;
> > > >                 ofs->config.index = false;
> > > >                 ofs->config.metacopy = false;
> > > >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > > > 
> > > > but I don't know what index and metacopy are.
> > > > 
> > > > > For a moment forget about overlayfs. Say a user process in guest with
> > > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > > > passthrough filesystem, so it should go through. But currently it
> > > > > wont.
> > > > 
> > > > As long as any effects of what it writes are contained to the area of
> > > > the filesystem exposed to the guest, yes - however it worries me what
> > > > the consequences of broken trusted metadata is.  If it's delicate enough
> > > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> > > 
> > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful
> > > mechanism for applications to control access. The host kernel doesn'
> > > tuse this namespace itself. Linux has four namespaces for xattrs:
> > > 
> > >  -  user - for userspace apps. accessible based on read/write permissions
> > >  -  trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only
> > >  -  system - for kernel only. used by ACLs
> > >  -  security - for kernel only. used by SELinux
> > > 
> > > The use case for "trusted" xattrs is thus where a privileged management
> > > application or service wants to store metadata against the file, but
> > > also needs to grant an unprivileged process access to write to this file
> > > while not allowing that unprivileged process the ability to change the
> > > metadata. This is mentioned in the man page:
> > > 
> > > [man xattr(7)]
> > >    Trusted extended attributes
> > >        Trusted  extended attributes are visible and accessible only to pro‐
> > >        cesses that have the CAP_SYS_ADMIN capability.  Attributes  in  this
> > >        class  are used to implement mechanisms in user space (i.e., outside
> > >        the kernel) which keep information in extended attributes  to  which
> > >        ordinary processes should not have access.
> > 
> > overlayfs inside the kernel is using trusted though which is the case
> > Vivek ran into.
> 
> IIUC, this is need on the lower-layer FS. If you're running overlayfs on the
> host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted."
> on the upper layer won't clash with overlayfs in the host.
> 
> If you're running overlayfs in the guest, then virtiofs is the lower layer
> so needs to support the "trusted." on the FS its exposing. Again this should
> not clash with anything on the host.
> 
> > >    User extended attributes
> > >        User  extended  attributes  may be assigned to files and directories
> > >        for storing arbitrary additional information such as the mime  type,
> > >        character  set  or  encoding  of a file.  The access permissions for
> > >        user attributes are defined by the file permission bits:  read  per‐
> > >        mission is required to retrieve the attribute value, and writer per‐
> > >        mission is required to change it.
> > > [/man]
> > > 
> > > Libvirtd uses the "trusted." xattr namespace to record information against
> > > disk images for QEMU, because we need to grant QEMU access to read/write
> > > the disk iamges, but don't want QEMU to be able to alter our xattrs.
> > > 
> > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap.
> > > It really ought to have had its own dedicated capability :-( Such is
> > > life with anything that uses CAP_SYS_ADMIN...
> > > 
> > > With this in mind we really should have both trusted. & user. xattrs
> > > allowed to the guest by default.
> > > 
> > > Conversely, we'll need to block usage of the security. and system.
> > > namespaces.
> > > 
> > > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could
> > > have a "--trusted-xattrs=on|off" flag to allow people to run in a more
> > > locked down state if they knowingly want to block trusted xattrs.
> > 
> > Does the trick described by Chirantan work for you, where the clients
> > 'trusted' view is different from the host?
> 
> That works as long as you don't require any interoperability with
> processes that have setup the exposed file tree on the host, or
> might need to concurrently access it.
> 
> This probably suggests a need various modes of exposing xattrs
> 
>  - "passthrough"  - run CAP_SYS_ADMIN and pass directly
>  - "mapped"       - no capabilities, rewrite everything into user. namespace
>  - "disable"      - don't allow xattr
> 

We already do "passthrough" (this is our default) and "disable" (-o no_xattr).
We probably need to implement this "mapped" mode and that will allow us
to run without CAP_SYS_ADMIN in init_user_ns.

> As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present
> in the initial namespace, so to support "passthrough" we'd have to NOT
> use user namespace too. Something to be aware of when we accept the
> user namespace patches.

> 
> Such "mapping" of attributes could also be used if virtiofs was running
> as an unprivileged user account, to expose a full range of file owner,
> group and permissions to the guest, while having everything be owned
> by the normal user on the host.   QEMU's current 9p filesystem has
> this kind of mapping support for owner/permissions, so guest view can
> be distinct from the host view.

If we figure out a way to run in a user namespace, then we don't
necessarily require this mapping functionality, right? User namespace will
automatically provide that mapping. We can just run as root of that user
namespace.

When 9p wrote this remapping method, we probably did not have user
namespace as popular as they are now. So we probably should implement
remapping of trusted xattr and then run in a user namesapce.

And complete passthrough mode will have to run in init user namespace
and this will be least securre mode (from host security point of view).

Thanks
Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
> > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > > 
> > > > Hi Stefan,
> > > > 
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > 
> > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > need CAP_SYS_ADMIN.
> > > > 
> > > > man xattr says.
> > > > 
> > > >    Trusted extended attributes
> > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > >        kernel) which keep information in extended attributes to which ordinary
> > > >        processes should not have access.
> > > > 
> > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > But for now we need to make it work. This is an important use case for
> > > > kata docker in docker build.
> > > > 
> > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > with this capaibility.
> > > 
> > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > Can you explain:
> > >   a) What overlayfs uses trusted for?
> > 
> > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> 
> Tell me more about this metadata.
> Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?

It contains path information which is used for lookup into lower layer.

> Or what happens if I was to write random numbers into OVL_XATTR_NLINK?

Overlay is storing its metadata in trusted.* xattrs. If a user modifies
metadata, then various kind of bad things can happen. I think one can
do some kind of checks on metadata to make sure it does not crash
atleast.

And that's true for any filesystem. Isn't. If user manages to modify
metadata outside of filesystem, then lot of bad things can happen. I
thought that's the reason that people are not comfortable with the
idea of allowing mounting filesystem from inside user namespace because
it makes it easy to mount a hand crafted filesystem.

Anyway, I think overlayfs is just one use case of trusted xattr. Even
if overlayfs moves away from trusted xattr, what about other users.
We need to have a story around how will we support trusted xattrs
safely.


> 
> > >   b) If something nasty was to write junk into the trusted attributes,
> > >     what would happen?
> > 
> > This directory is owned by guest. So it should be able to write
> > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> 
> Well, we shouldn't be able to break/crash/escape into the host; how
> much does overlayfs validate trusted.* it uses?

I thought qemu and kvm are the one who should ensure we should not be
able to break out of sandbox. Kernel implementation could be as 
buggy as it wanted to be. We are working with this security model
that kernel is completely untrusted.

> 
> > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > all - what is the consequence?
> > 
> > It falls back to I think read only mode. 
> 
> It looks like the fallback is more subtle to me:
>         /*
>          * Check if upper/work fs supports trusted.overlay.* xattr
>          */
>         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
>         if (err) {
>                 ofs->noxattr = true;
>                 ofs->config.index = false;
>                 ofs->config.metacopy = false;
>                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> 
> but I don't know what index and metacopy are.

They enable certain features in overlayfs. In fact, we fall back to
lesser capability on if we are running on ext4/xfs. For virtiofs, 
we deny the mount completely.

        /*
         * We allowed sub-optimal upper fs configuration and don't want to break
         * users over kernel upgrade, but we never allowed remote upper fs, so
         * we can enforce strict requirements for remote upper fs.
         */
        if (ovl_dentry_remote(ofs->workdir) &&
            (!d_type || !rename_whiteout || ofs->noxattr)) {
                pr_err("upper fs missing required features.\n");
                err = -EINVAL;
                goto out;
        }

> 
> > For a moment forget about overlayfs. Say a user process in guest with
> > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > passthrough filesystem, so it should go through. But currently it
> > wont.
> 
> As long as any effects of what it writes are contained to the area of
> the filesystem exposed to the guest, yes - however it worries me what
> the consequences of broken trusted metadata is.  If it's delicate enough
> that it's guarded by CAP_SYS_ADMIN someone must have worried about it.

Agreed that we need to look into whether having CAP_SYS_ADMIN allow
virtiofsd to break out of jail. 

May be we need to provide that remapping trusted xattr feature so
that we don't have to have CAP_SYS_ADMIN in init_user_ns and can
provide this emulation even when running in user namespace.

Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Vivek Goyal (vgoyal@redhat.com) wrote:
> > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > > access to the system.
> > > > > 
> > > > > Hi Stefan,
> > > > > 
> > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > > > > 
> > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain
> > > > > need CAP_SYS_ADMIN.
> > > > > 
> > > > > man xattr says.
> > > > > 
> > > > >    Trusted extended attributes
> > > > >        Trusted  extended  attributes  are  visible and accessible only to pro‐
> > > > >        cesses that have the  CAP_SYS_ADMIN  capability.   Attributes  in  this
> > > > >        class are used to implement mechanisms in user space (i.e., outside the
> > > > >        kernel) which keep information in extended attributes to which ordinary
> > > > >        processes should not have access.
> > > > > 
> > > > > There is a chance that overlay moves away from trusted xattr in future.
> > > > > But for now we need to make it work. This is an important use case for
> > > > > kata docker in docker build.
> > > > > 
> > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and
> > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon
> > > > > with this capaibility.
> > > > 
> > > > I'll admit I don't like the idea of giving it cap_sys_admin.
> > > > Can you explain:
> > > >   a) What overlayfs uses trusted for?
> > > 
> > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it.
> > 
> > Tell me more about this metadata.
> > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do?
> 
> It contains path information which is used for lookup into lower layer.
> 
> > Or what happens if I was to write random numbers into OVL_XATTR_NLINK?
> 
> Overlay is storing its metadata in trusted.* xattrs. If a user modifies
> metadata, then various kind of bad things can happen. I think one can
> do some kind of checks on metadata to make sure it does not crash
> atleast.
> 
> And that's true for any filesystem. Isn't. If user manages to modify
> metadata outside of filesystem, then lot of bad things can happen. I
> thought that's the reason that people are not comfortable with the
> idea of allowing mounting filesystem from inside user namespace because
> it makes it easy to mount a hand crafted filesystem.
> 
> Anyway, I think overlayfs is just one use case of trusted xattr. Even
> if overlayfs moves away from trusted xattr, what about other users.
> We need to have a story around how will we support trusted xattrs
> safely.
> 
> 
> > 
> > > >   b) If something nasty was to write junk into the trusted attributes,
> > > >     what would happen?
> > > 
> > > This directory is owned by guest. So it should be able to write
> > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > 
> > Well, we shouldn't be able to break/crash/escape into the host; how
> > much does overlayfs validate trusted.* it uses?
> 
> I thought qemu and kvm are the one who should ensure we should not be
> able to break out of sandbox. Kernel implementation could be as 
> buggy as it wanted to be. We are working with this security model
> that kernel is completely untrusted.

But with virtiofs we allow the guest to do a lot of filesystem
operations on the host.  It's the virtiofsd that has to ensure that
these are safe and contained within the fs it's exposed; the qemu/kvm
can't protect us from that.

That's why we sandbox the virtiofsd like we do - if we allow a
priviliged guest to perform calls to an unconstrained virtiofsd it would
be able to escape.  That's what I want to check.

Dave

> > 
> > > >   c) I see overlayfs has a fallback check if xattr isn't supported at
> > > > all - what is the consequence?
> > > 
> > > It falls back to I think read only mode. 
> > 
> > It looks like the fallback is more subtle to me:
> >         /*
> >          * Check if upper/work fs supports trusted.overlay.* xattr
> >          */
> >         err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0);
> >         if (err) {
> >                 ofs->noxattr = true;
> >                 ofs->config.index = false;
> >                 ofs->config.metacopy = false;
> >                 pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n");
> > 
> > but I don't know what index and metacopy are.
> 
> They enable certain features in overlayfs. In fact, we fall back to
> lesser capability on if we are running on ext4/xfs. For virtiofs, 
> we deny the mount completely.
> 
>         /*
>          * We allowed sub-optimal upper fs configuration and don't want to break
>          * users over kernel upgrade, but we never allowed remote upper fs, so
>          * we can enforce strict requirements for remote upper fs.
>          */
>         if (ovl_dentry_remote(ofs->workdir) &&
>             (!d_type || !rename_whiteout || ofs->noxattr)) {
>                 pr_err("upper fs missing required features.\n");
>                 err = -EINVAL;
>                 goto out;
>         }
> 
> > 
> > > For a moment forget about overlayfs. Say a user process in guest with
> > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a
> > > passthrough filesystem, so it should go through. But currently it
> > > wont.
> > 
> > As long as any effects of what it writes are contained to the area of
> > the filesystem exposed to the guest, yes - however it worries me what
> > the consequences of broken trusted metadata is.  If it's delicate enough
> > that it's guarded by CAP_SYS_ADMIN someone must have worried about it.
> 
> Agreed that we need to look into whether having CAP_SYS_ADMIN allow
> virtiofsd to break out of jail. 
> 
> May be we need to provide that remapping trusted xattr feature so
> that we don't have to have CAP_SYS_ADMIN in init_user_ns and can
> provide this emulation even when running in user namespace.
> 
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote:

[..]
> > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > >     what would happen?
> > > > 
> > > > This directory is owned by guest. So it should be able to write
> > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > 
> > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > much does overlayfs validate trusted.* it uses?
> > 
> > I thought qemu and kvm are the one who should ensure we should not be
> > able to break out of sandbox. Kernel implementation could be as 
> > buggy as it wanted to be. We are working with this security model
> > that kernel is completely untrusted.
> 
> But with virtiofs we allow the guest to do a lot of filesystem
> operations on the host.  It's the virtiofsd that has to ensure that
> these are safe and contained within the fs it's exposed; the qemu/kvm
> can't protect us from that.

Fair enough. I should have added virtiofsd to list. Its an attack
vector and is of concern.

> 
> That's why we sandbox the virtiofsd like we do - if we allow a
> priviliged guest to perform calls to an unconstrained virtiofsd it would
> be able to escape.  That's what I want to check.

Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape
the jail. If it does we need to implement what crossvm folks did,
remapping of trusted xattr. That will also allow us to run inside
user namespace and still be able to support trusted xattr emulation
for guest.

Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Dr. David Alan Gilbert 4 years, 10 months ago

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote:
> 
> [..]
> > > > > >   b) If something nasty was to write junk into the trusted attributes,
> > > > > >     what would happen?
> > > > > 
> > > > > This directory is owned by guest. So it should be able to write
> > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right?
> > > > 
> > > > Well, we shouldn't be able to break/crash/escape into the host; how
> > > > much does overlayfs validate trusted.* it uses?
> > > 
> > > I thought qemu and kvm are the one who should ensure we should not be
> > > able to break out of sandbox. Kernel implementation could be as 
> > > buggy as it wanted to be. We are working with this security model
> > > that kernel is completely untrusted.
> > 
> > But with virtiofs we allow the guest to do a lot of filesystem
> > operations on the host.  It's the virtiofsd that has to ensure that
> > these are safe and contained within the fs it's exposed; the qemu/kvm
> > can't protect us from that.
> 
> Fair enough. I should have added virtiofsd to list. Its an attack
> vector and is of concern.
> 
> > 
> > That's why we sandbox the virtiofsd like we do - if we allow a
> > priviliged guest to perform calls to an unconstrained virtiofsd it would
> > be able to escape.  That's what I want to check.
> 
> Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape
> the jail.

So that's *my* question - what bad things can someone do by setting
attributes (trusted/system/security) - it's fine if they break they
screwup the security inside the container, because they'd need to be
CAP_SYS_ADMIN inside the container to do it - as long as they can't
break the host.  So what happens if someone starts doing bad things to
trusted.* attributes on an overlayfs - no other fs uses them as far as I
know.

> If it does we need to implement what crossvm folks did,
> remapping of trusted xattr. That will also allow us to run inside
> user namespace and still be able to support trusted xattr emulation
> for guest.

I think we need to do that anyway, it's just going to take a while to
figure out.

Dave

> 
> Vivek
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Miklos Szeredi 4 years, 10 months ago

On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
>
> Hi Stefan,
>
> I just noticed that this patch set breaks overlayfs on top of virtiofs.

How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
requires CAP_SYS_ADMIN, not the accessor.

Thanks,
Miklos

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> >
> > Hi Stefan,
> >
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> requires CAP_SYS_ADMIN, not the accessor.

virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
fails in lo_setxattr().

This is triggered when we mount overlayfs on top of virtiofs and overlayfs
tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
supported or not.

Thanks
Vivek

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Miklos Szeredi 4 years, 10 months ago

On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > whitelisted set of capabilities that we require.  This improves security in
> > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> >
> > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > requires CAP_SYS_ADMIN, not the accessor.
>
> virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> fails in lo_setxattr().
>
> This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> supported or not.

Ah, right.

Plan is to use "user.*" xattr for unprivileged overlay.  This would be
a good way to eliminate this attack surface in the overlay on virtiofs
case as well.

Other ways to minimize risk is to separate operations requiring
CAP_SYS_ADMIN into a separate process, preferably a separate
executable, that communicates with virtiofsd using a pipe and contains
the minimum amount of code.

Thanks,
Miklos

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 05:26:37PM +0200, Miklos Szeredi wrote:
> On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > > > whitelisted set of capabilities that we require.  This improves security in
> > > > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > > > access to the system.
> > > >
> > > > Hi Stefan,
> > > >
> > > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> > >
> > > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > > requires CAP_SYS_ADMIN, not the accessor.
> >
> > virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> > fails in lo_setxattr().
> >
> > This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> > tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> > supported or not.
> 
> Ah, right.
> 
> Plan is to use "user.*" xattr for unprivileged overlay.  This would be
> a good way to eliminate this attack surface in the overlay on virtiofs
> case as well.

So unpriviliged overlay is one which is mounted from inside a user
namespace. But in this case we might be mounting it from init_user_ns
of guest. So from overlayfs perspective this will still be treated
as priviliged overlay instance and it will use trusted xattrs, IIUC?

Thanks
Vivek

> 
> Other ways to minimize risk is to separate operations requiring
> CAP_SYS_ADMIN into a separate process, preferably a separate
> executable, that communicates with virtiofsd using a pipe and contains
> the minimum amount of code.

> 
> Thanks,
> Miklos
>

Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 4 years, 10 months ago

On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > > whitelisted set of capabilities that we require.  This improves security in
> > > case virtiofsd is compromised by making it hard for an attacker to gain further
> > > access to the system.
> >
> > Hi Stefan,
> >
> > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> 
> How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> requires CAP_SYS_ADMIN, not the accessor.

Are you thinking of virtiofs on top of overlayfs? I was referring to
mounting overlayfs on top of virtiofs inside VM.

Thanks
Vivek

Re: [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Vivek Goyal 5 years ago

On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> whitelisted set of capabilities that we require.  This improves security in
> case virtiofsd is compromised by making it hard for an attacker to gain further
> access to the system.

Hi Stefan,

Good to see this patch. We needed to limit capabilities to reduce attack
surface.

What tests have you run to make sure this current set of whitelisted
capabilities is good enough.

Vivek

> 
> Stefan Hajnoczi (2):
>   virtiofsd: only retain file system capabilities
>   virtiofsd: drop all capabilities in the wait parent process
> 
>  tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> -- 
> 2.25.1
>

Re: [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

Posted by Stefan Hajnoczi 5 years ago

On Thu, Apr 16, 2020 at 04:10:22PM -0400, Vivek Goyal wrote:
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need of all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain further
> > access to the system.
> 
> Hi Stefan,
> 
> Good to see this patch. We needed to limit capabilities to reduce attack
> surface.
> 
> What tests have you run to make sure this current set of whitelisted
> capabilities is good enough.

Booting and light usage of Fedora 29 and running blogbench.

I would appreciate it if others could try it out with their
tests/workloads.

Stefan