virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a whitelisted set of capabilities that we require. This improves security in case virtiofsd is compromised by making it hard for an attacker to gain further access to the system. Stefan Hajnoczi (2): virtiofsd: only retain file system capabilities virtiofsd: drop all capabilities in the wait parent process tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) -- 2.25.1
* Stefan Hajnoczi (stefanha@redhat.com) wrote: > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > whitelisted set of capabilities that we require. This improves security in > case virtiofsd is compromised by making it hard for an attacker to gain further > access to the system. Queued. > Stefan Hajnoczi (2): > virtiofsd: only retain file system capabilities > virtiofsd: drop all capabilities in the wait parent process > > tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++ > 1 file changed, 51 insertions(+) > > -- > 2.25.1 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > whitelisted set of capabilities that we require. This improves security in > case virtiofsd is compromised by making it hard for an attacker to gain further > access to the system. Hi Stefan, I just noticed that this patch set breaks overlayfs on top of virtiofs. overlayfs sets "trusted.overlay.*" and xattrs in trusted domain need CAP_SYS_ADMIN. man xattr says. Trusted extended attributes Trusted extended attributes are visible and accessible only to pro‐ cesses that have the CAP_SYS_ADMIN capability. Attributes in this class are used to implement mechanisms in user space (i.e., outside the kernel) which keep information in extended attributes to which ordinary processes should not have access. There is a chance that overlay moves away from trusted xattr in future. But for now we need to make it work. This is an important use case for kata docker in docker build. May be we can add an option to virtiofsd say "--add-cap <capability>" and ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon with this capaibility. Thanks Vivek > > Stefan Hajnoczi (2): > virtiofsd: only retain file system capabilities > virtiofsd: drop all capabilities in the wait parent process > > tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++ > 1 file changed, 51 insertions(+) > > -- > 2.25.1 > > _______________________________________________ > Virtio-fs mailing list > Virtio-fs@redhat.com > https://www.redhat.com/mailman/listinfo/virtio-fs
* Vivek Goyal (vgoyal@redhat.com) wrote: > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > whitelisted set of capabilities that we require. This improves security in > > case virtiofsd is compromised by making it hard for an attacker to gain further > > access to the system. > > Hi Stefan, > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > need CAP_SYS_ADMIN. > > man xattr says. > > Trusted extended attributes > Trusted extended attributes are visible and accessible only to pro‐ > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > class are used to implement mechanisms in user space (i.e., outside the > kernel) which keep information in extended attributes to which ordinary > processes should not have access. > > There is a chance that overlay moves away from trusted xattr in future. > But for now we need to make it work. This is an important use case for > kata docker in docker build. > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > with this capaibility. I'll admit I don't like the idea of giving it cap_sys_admin. Can you explain: a) What overlayfs uses trusted for? b) If something nasty was to write junk into the trusted attributes, what would happen? c) I see overlayfs has a fallback check if xattr isn't supported at all - what is the consequence? Dave > Thanks > Vivek > > > > > Stefan Hajnoczi (2): > > virtiofsd: only retain file system capabilities > > virtiofsd: drop all capabilities in the wait parent process > > > > tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++ > > 1 file changed, 51 insertions(+) > > > > -- > > 2.25.1 > > > > _______________________________________________ > > Virtio-fs mailing list > > Virtio-fs@redhat.com > > https://www.redhat.com/mailman/listinfo/virtio-fs > > _______________________________________________ > Virtio-fs mailing list > Virtio-fs@redhat.com > https://www.redhat.com/mailman/listinfo/virtio-fs -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > * Vivek Goyal (vgoyal@redhat.com) wrote: > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > whitelisted set of capabilities that we require. This improves security in > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > access to the system. > > > > Hi Stefan, > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > need CAP_SYS_ADMIN. > > > > man xattr says. > > > > Trusted extended attributes > > Trusted extended attributes are visible and accessible only to pro‐ > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > class are used to implement mechanisms in user space (i.e., outside the > > kernel) which keep information in extended attributes to which ordinary > > processes should not have access. > > > > There is a chance that overlay moves away from trusted xattr in future. > > But for now we need to make it work. This is an important use case for > > kata docker in docker build. > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > with this capaibility. > > I'll admit I don't like the idea of giving it cap_sys_admin. > Can you explain: > a) What overlayfs uses trusted for? overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > b) If something nasty was to write junk into the trusted attributes, > what would happen? This directory is owned by guest. So it should be able to write anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > c) I see overlayfs has a fallback check if xattr isn't supported at > all - what is the consequence? It falls back to I think read only mode. For a moment forget about overlayfs. Say a user process in guest with CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a passthrough filesystem, so it should go through. But currently it wont. Thanks Vivek
On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > whitelisted set of capabilities that we require. This improves security in > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > access to the system. > > > > > > Hi Stefan, > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > need CAP_SYS_ADMIN. > > > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1]. We have the same problem. Our virtiofs process has CAP_SYS_ADMIN in a user namespace and it cannot set any trusted or security xattrs. The security xattrs check is at least namespace aware so you only need CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't help us much. We ended up working around it by prefixing "user.virtiofs." to the xattr name[2], which has its own problems but there was pretty much no chance that we would be able to give the fs device CAP_SYS_ADMIN in the init namespace. [1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116 [2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111
* Chirantan Ekbote (chirantan@chromium.org) wrote: > On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > access to the system. > > > > > > > > Hi Stefan, > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > need CAP_SYS_ADMIN. > > > > > > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1]. We > have the same problem. Our virtiofs process has CAP_SYS_ADMIN in a > user namespace and it cannot set any trusted or security xattrs. The > security xattrs check is at least namespace aware so you only need > CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't > help us much. It would have been good if you'd mentioned that; it would have saved Vivek some confusion! > We ended up working around it by prefixing "user.virtiofs." to the > xattr name[2], which has its own problems but there was pretty much no > chance that we would be able to give the fs device CAP_SYS_ADMIN in > the init namespace. What problems did you hit with that? We should standardise the renaming so we make an on-disc format that's compatible. Dave > > [1]: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/5e857ce6eae7ca21b2055cca4885545e29228fe2/fs/xattr.c#116 > [2]: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2243111 > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Jun 19, 2020 at 5:40 PM Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Chirantan Ekbote (chirantan@chromium.org) wrote: > > > We ended up working around it by prefixing "user.virtiofs." to the > > xattr name[2], which has its own problems but there was pretty much no > > chance that we would be able to give the fs device CAP_SYS_ADMIN in > > the init namespace. > > > What problems did you hit with that? We should standardise the renaming > so we make an on-disc format that's compatible. > I guess what I meant by problems is that it made what was previously a simple and straightforward implementation into something more complex and added some limitations. For example, we now need to parse the result of the listxattr system call and strip out the prefix from any name in the list. It also means that we cannot allow the guest to directly set or remove any "user.virtiofs." xattr as this would allow an unprivileged process in the vm to modify an xattr that it wouldn't otherwise be allowed to modify. On top of being a somewhat arbitrary restriction this also means that you can't have stacked virtiofs instances as the lower instance would reject attempts by the upper one to set those xattrs. These limitations aren't really a problem for us but I can see how they might be a problem for others. The change was also merged just yesterday so there may be other problems with it that haven't surfaced yet. I didn't mention it before because I figured this was something that we brought upon ourselves as chrome os is a bit extreme about sandboxing. If we can come up with a standardized way to handle this I think we'll gladly switch the chrome os implementation to use it. Chirantan
* Chirantan Ekbote (chirantan@chromium.org) wrote: > On Fri, Jun 19, 2020 at 5:40 PM Dr. David Alan Gilbert > <dgilbert@redhat.com> wrote: > > > > * Chirantan Ekbote (chirantan@chromium.org) wrote: > > > > > We ended up working around it by prefixing "user.virtiofs." to the > > > xattr name[2], which has its own problems but there was pretty much no > > > chance that we would be able to give the fs device CAP_SYS_ADMIN in > > > the init namespace. > > > > > > What problems did you hit with that? We should standardise the renaming > > so we make an on-disc format that's compatible. > > > > I guess what I meant by problems is that it made what was previously a > simple and straightforward implementation into something more complex > and added some limitations. Yeh. > For example, we now need to parse the > result of the listxattr system call and strip out the prefix from any > name in the list. It also means that we cannot allow the guest to > directly set or remove any "user.virtiofs." xattr as this would allow > an unprivileged process in the vm to modify an xattr that it wouldn't > otherwise be allowed to modify. On top of being a somewhat arbitrary > restriction this also means that you can't have stacked virtiofs > instances as the lower instance would reject attempts by the upper one > to set those xattrs. These limitations aren't really a problem for us > but I can see how they might be a problem for others. Isn't this a classic escaping problem? Would it work if you prepended 'user.virtiofs.' onto any xattr that started with 'trusted' or 'user.virtiofs.' ? > The change was also merged just yesterday so there may be other > problems with it that haven't surfaced yet. > > I didn't mention it before because I figured this was something that > we brought upon ourselves as chrome os is a bit extreme about > sandboxing. I think we're trying to be as extreme in virtiofsd, but it is causing us similar problems. > If we can come up with a standardized way to handle this > I think we'll gladly switch the chrome os implementation to use it. Great! Dave > > Chirantan > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote: > On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > access to the system. > > > > > > > > Hi Stefan, > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > need CAP_SYS_ADMIN. > > > > > > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1]. We > have the same problem. Our virtiofs process has CAP_SYS_ADMIN in a > user namespace and it cannot set any trusted or security xattrs. The > security xattrs check is at least namespace aware so you only need > CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't > help us much. > > We ended up working around it by prefixing "user.virtiofs." to the > xattr name[2], which has its own problems but there was pretty much no > chance that we would be able to give the fs device CAP_SYS_ADMIN in > the init namespace. Chirantan, So you ended up renaming all "trusted", "security" and "system" xattrs? Only "user" xattrs are complete passthrough? IOW, security.selinux will be renamed to user.virtiofs.security.selinux on host? Thanks Vivek
On Sat, Jun 20, 2020 at 4:15 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Jun 19, 2020 at 01:46:20PM +0900, Chirantan Ekbote wrote: > > On Fri, Jun 19, 2020 at 4:27 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > > access to the system. > > > > > > > > > > Hi Stefan, > > > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > > need CAP_SYS_ADMIN. > > > > > > > > > Not just that but it needs CAP_SYS_ADMIN in the init namespace[1]. We > > have the same problem. Our virtiofs process has CAP_SYS_ADMIN in a > > user namespace and it cannot set any trusted or security xattrs. The > > security xattrs check is at least namespace aware so you only need > > CAP_SYS_ADMIN in the namespace that mounted the fs but that doesn't > > help us much. > > > > We ended up working around it by prefixing "user.virtiofs." to the > > xattr name[2], which has its own problems but there was pretty much no > > chance that we would be able to give the fs device CAP_SYS_ADMIN in > > the init namespace. > > Chirantan, > > So you ended up renaming all "trusted", "security" and "system" xattrs? > Only "user" xattrs are complete passthrough? > No, we only rename "security" xattrs (except for selinux). > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux > on host? > We don't relabel security.selinux because it only requires CAP_FOWNER in the process's user namespace and it also does its own MAC-based checks. Also we have some tools that label files beforehand so it seemed easier to leave them unchanged. Chirantan
On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote: [..] > > Chirantan, > > > > So you ended up renaming all "trusted", "security" and "system" xattrs? > > Only "user" xattrs are complete passthrough? > > > > No, we only rename "security" xattrs (except for selinux). > > > > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux > > on host? > > > > We don't relabel security.selinux because it only requires CAP_FOWNER > in the process's user namespace and it also does its own MAC-based > checks. Also we have some tools that label files beforehand so it > seemed easier to leave them unchanged. If we rename selinux xattr also, then we can support selinux both in guest and host and they both can have their own independent policies. Otherwise we either have to disable selinux on host (if we want to support it in guest) or somehow guest and how policies will have to know about each other and be able to work together (which will be hard for a generic use case). Vivek
On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote: > [..] > > > Chirantan, > > > > > > So you ended up renaming all "trusted", "security" and "system" xattrs? > > > Only "user" xattrs are complete passthrough? > > > > > > > No, we only rename "security" xattrs (except for selinux). > > > > > > > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux > > > on host? > > > > > > > We don't relabel security.selinux because it only requires CAP_FOWNER > > in the process's user namespace and it also does its own MAC-based > > checks. Also we have some tools that label files beforehand so it > > seemed easier to leave them unchanged. > > If we rename selinux xattr also, then we can support selinux both in > guest and host and they both can have their own independent policies. > This works as long as you don't need to support setfscreatecon, which exists to guarantee that an inode is atomically created with the correct selinux context. It's kind of possible for files with O_TMPFILE + fsetxattr + linkat but there is no equivalent for directories. You're basically forced to create the directory and then set the xattr and while it's possible to prevent other threads in the server from seeing the unfinished directories with a rwlock or similar there is no protection against power loss. If the machine loses power after the directory is created but before the selinux xattr is set then that directory will have the wrong selinux context and the guest would need to run restorecon at boot to assign the correct label. > Otherwise we either have to disable selinux on host (if we want to > support it in guest) or somehow guest and how policies will have > to know about each other and be able to work together (which will > be hard for a generic use case). Yes, I agree this is hard to do for a generic case but unfortunately the more I understand how selinux works the less I feel that it works well with a passthrough style file system. As you said it either needs to be turned off on the host or the host and guest need to work together. Chirantan
On Mon, Jul 13, 2020 at 05:54:19PM +0900, Chirantan Ekbote wrote: > On Thu, Jun 25, 2020 at 9:55 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Jun 25, 2020 at 12:19:39PM +0900, Chirantan Ekbote wrote: > > [..] > > > > Chirantan, > > > > > > > > So you ended up renaming all "trusted", "security" and "system" xattrs? > > > > Only "user" xattrs are complete passthrough? > > > > > > > > > > No, we only rename "security" xattrs (except for selinux). > > > > > > > > > > > IOW, security.selinux will be renamed to user.virtiofs.security.selinux > > > > on host? > > > > > > > > > > We don't relabel security.selinux because it only requires CAP_FOWNER > > > in the process's user namespace and it also does its own MAC-based > > > checks. Also we have some tools that label files beforehand so it > > > seemed easier to leave them unchanged. > > > > If we rename selinux xattr also, then we can support selinux both in > > guest and host and they both can have their own independent policies. > > > > This works as long as you don't need to support setfscreatecon, which > exists to guarantee that an inode is atomically created with the > correct selinux context. It's kind of possible for files with > O_TMPFILE + fsetxattr + linkat but there is no equivalent for > directories. You're basically forced to create the directory and then > set the xattr and while it's possible to prevent other threads in the > server from seeing the unfinished directories with a rwlock or similar > there is no protection against power loss. If the machine loses power > after the directory is created but before the selinux xattr is set > then that directory will have the wrong selinux context and the guest > would need to run restorecon at boot to assign the correct label. Overlayfs has this notion of work directory where they create directory (and file if O_TMPFILE is not supported), and then rename it to correct place. I am wondering if we could do something similar. If server is given a work directory where. A. Create new directory B. set selinux lables C. rename directory That way if machine crashes after step A or step B, that directory never becomes visible to guest and no relabeling is required. Thanks Vivek
* Vivek Goyal (vgoyal@redhat.com) wrote: > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > whitelisted set of capabilities that we require. This improves security in > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > access to the system. > > > > > > Hi Stefan, > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > need CAP_SYS_ADMIN. > > > > > > man xattr says. > > > > > > Trusted extended attributes > > > Trusted extended attributes are visible and accessible only to pro‐ > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > class are used to implement mechanisms in user space (i.e., outside the > > > kernel) which keep information in extended attributes to which ordinary > > > processes should not have access. > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > But for now we need to make it work. This is an important use case for > > > kata docker in docker build. > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > with this capaibility. > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > Can you explain: > > a) What overlayfs uses trusted for? > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. Tell me more about this metadata. Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > b) If something nasty was to write junk into the trusted attributes, > > what would happen? > > This directory is owned by guest. So it should be able to write > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? Well, we shouldn't be able to break/crash/escape into the host; how much does overlayfs validate trusted.* it uses? > > c) I see overlayfs has a fallback check if xattr isn't supported at > > all - what is the consequence? > > It falls back to I think read only mode. It looks like the fallback is more subtle to me: /* * Check if upper/work fs supports trusted.overlay.* xattr */ err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); if (err) { ofs->noxattr = true; ofs->config.index = false; ofs->config.metacopy = false; pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); but I don't know what index and metacopy are. > For a moment forget about overlayfs. Say a user process in guest with > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > passthrough filesystem, so it should go through. But currently it > wont. As long as any effects of what it writes are contained to the area of the filesystem exposed to the guest, yes - however it worries me what the consequences of broken trusted metadata is. If it's delicate enough that it's guarded by CAP_SYS_ADMIN someone must have worried about it. Dave > Thanks > Vivek -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > * Vivek Goyal (vgoyal@redhat.com) wrote: > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > access to the system. > > > > > > > > Hi Stefan, > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > need CAP_SYS_ADMIN. > > > > > > > > man xattr says. > > > > > > > > Trusted extended attributes > > > > Trusted extended attributes are visible and accessible only to pro‐ > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > > class are used to implement mechanisms in user space (i.e., outside the > > > > kernel) which keep information in extended attributes to which ordinary > > > > processes should not have access. > > > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > > But for now we need to make it work. This is an important use case for > > > > kata docker in docker build. > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > > with this capaibility. > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > Can you explain: > > > a) What overlayfs uses trusted for? > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > Tell me more about this metadata. > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > > > b) If something nasty was to write junk into the trusted attributes, > > > what would happen? > > > > This directory is owned by guest. So it should be able to write > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > Well, we shouldn't be able to break/crash/escape into the host; how > much does overlayfs validate trusted.* it uses? > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > all - what is the consequence? > > > > It falls back to I think read only mode. > > It looks like the fallback is more subtle to me: > /* > * Check if upper/work fs supports trusted.overlay.* xattr > */ > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > if (err) { > ofs->noxattr = true; > ofs->config.index = false; > ofs->config.metacopy = false; > pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); > > but I don't know what index and metacopy are. > > > For a moment forget about overlayfs. Say a user process in guest with > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > passthrough filesystem, so it should go through. But currently it > > wont. > > As long as any effects of what it writes are contained to the area of > the filesystem exposed to the guest, yes - however it worries me what > the consequences of broken trusted metadata is. If it's delicate enough > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful mechanism for applications to control access. The host kernel doesn' tuse this namespace itself. Linux has four namespaces for xattrs: - user - for userspace apps. accessible based on read/write permissions - trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only - system - for kernel only. used by ACLs - security - for kernel only. used by SELinux The use case for "trusted" xattrs is thus where a privileged management application or service wants to store metadata against the file, but also needs to grant an unprivileged process access to write to this file while not allowing that unprivileged process the ability to change the metadata. This is mentioned in the man page: [man xattr(7)] Trusted extended attributes Trusted extended attributes are visible and accessible only to pro‐ cesses that have the CAP_SYS_ADMIN capability. Attributes in this class are used to implement mechanisms in user space (i.e., outside the kernel) which keep information in extended attributes to which ordinary processes should not have access. User extended attributes User extended attributes may be assigned to files and directories for storing arbitrary additional information such as the mime type, character set or encoding of a file. The access permissions for user attributes are defined by the file permission bits: read per‐ mission is required to retrieve the attribute value, and writer per‐ mission is required to change it. [/man] Libvirtd uses the "trusted." xattr namespace to record information against disk images for QEMU, because we need to grant QEMU access to read/write the disk iamges, but don't want QEMU to be able to alter our xattrs. It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap. It really ought to have had its own dedicated capability :-( Such is life with anything that uses CAP_SYS_ADMIN... With this in mind we really should have both trusted. & user. xattrs allowed to the guest by default. Conversely, we'll need to block usage of the security. and system. namespaces. Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could have a "--trusted-xattrs=on|off" flag to allow people to run in a more locked down state if they knowingly want to block trusted xattrs. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote: [..] > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful > mechanism for applications to control access. The host kernel doesn' > tuse this namespace itself. Linux has four namespaces for xattrs: > > - user - for userspace apps. accessible based on read/write permissions > - trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only > - system - for kernel only. used by ACLs > - security - for kernel only. used by SELinux > > The use case for "trusted" xattrs is thus where a privileged management > application or service wants to store metadata against the file, but > also needs to grant an unprivileged process access to write to this file > while not allowing that unprivileged process the ability to change the > metadata. This is mentioned in the man page: > > [man xattr(7)] > Trusted extended attributes > Trusted extended attributes are visible and accessible only to pro‐ > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > class are used to implement mechanisms in user space (i.e., outside > the kernel) which keep information in extended attributes to which > ordinary processes should not have access. > > User extended attributes > User extended attributes may be assigned to files and directories > for storing arbitrary additional information such as the mime type, > character set or encoding of a file. The access permissions for > user attributes are defined by the file permission bits: read per‐ > mission is required to retrieve the attribute value, and writer per‐ > mission is required to change it. > [/man] > > Libvirtd uses the "trusted." xattr namespace to record information against > disk images for QEMU, because we need to grant QEMU access to read/write > the disk iamges, but don't want QEMU to be able to alter our xattrs. > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap. > It really ought to have had its own dedicated capability :-( Such is > life with anything that uses CAP_SYS_ADMIN... > > With this in mind we really should have both trusted. & user. xattrs > allowed to the guest by default. > > Conversely, we'll need to block usage of the security. and system. > namespaces. I am wondering can we block usage of "system" and "security"? What about guest setting acls over virtiofs files. These will have to go through and that means we need to allow system xattrs. Similarly setting file capabilities inside should trigger setxattr(security.capability) and that means we need to allow security xattr as well. Thanks Vivek
* Vivek Goyal (vgoyal@redhat.com) wrote: > On Fri, Jun 19, 2020 at 12:39:14PM +0100, Daniel P. Berrangé wrote: > [..] > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful > > mechanism for applications to control access. The host kernel doesn' > > tuse this namespace itself. Linux has four namespaces for xattrs: > > > > - user - for userspace apps. accessible based on read/write permissions > > - trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only > > - system - for kernel only. used by ACLs > > - security - for kernel only. used by SELinux > > > > The use case for "trusted" xattrs is thus where a privileged management > > application or service wants to store metadata against the file, but > > also needs to grant an unprivileged process access to write to this file > > while not allowing that unprivileged process the ability to change the > > metadata. This is mentioned in the man page: > > > > [man xattr(7)] > > Trusted extended attributes > > Trusted extended attributes are visible and accessible only to pro‐ > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > class are used to implement mechanisms in user space (i.e., outside > > the kernel) which keep information in extended attributes to which > > ordinary processes should not have access. > > > > User extended attributes > > User extended attributes may be assigned to files and directories > > for storing arbitrary additional information such as the mime type, > > character set or encoding of a file. The access permissions for > > user attributes are defined by the file permission bits: read per‐ > > mission is required to retrieve the attribute value, and writer per‐ > > mission is required to change it. > > [/man] > > > > Libvirtd uses the "trusted." xattr namespace to record information against > > disk images for QEMU, because we need to grant QEMU access to read/write > > the disk iamges, but don't want QEMU to be able to alter our xattrs. > > > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap. > > It really ought to have had its own dedicated capability :-( Such is > > life with anything that uses CAP_SYS_ADMIN... > > > > With this in mind we really should have both trusted. & user. xattrs > > allowed to the guest by default. > > > > Conversely, we'll need to block usage of the security. and system. > > namespaces. > > I am wondering can we block usage of "system" and "security"? What > about guest setting acls over virtiofs files. These will have to > go through and that means we need to allow system xattrs. > > Similarly setting file capabilities inside should trigger > setxattr(security.capability) and that means we need to allow security > xattr as well. Yep, we see that when people install Fedora packages, when rpm unpackgs /usr/bin/newgidmap which has: $ getfattr -d '--match=.*' /usr/bin/newgidmap getfattr: Removing leading '/' from absolute path names # file: usr/bin/newgidmap security.capability=0sAQAAAkAAAAAAAAAAAAAAAAAAAAA= security.selinux="system_u:object_r:bin_t:s0" Dave > Thanks > Vivek -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > > access to the system. > > > > > > > > > > Hi Stefan, > > > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > > need CAP_SYS_ADMIN. > > > > > > > > > > man xattr says. > > > > > > > > > > Trusted extended attributes > > > > > Trusted extended attributes are visible and accessible only to pro‐ > > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > > > class are used to implement mechanisms in user space (i.e., outside the > > > > > kernel) which keep information in extended attributes to which ordinary > > > > > processes should not have access. > > > > > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > > > But for now we need to make it work. This is an important use case for > > > > > kata docker in docker build. > > > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > > > with this capaibility. > > > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > > Can you explain: > > > > a) What overlayfs uses trusted for? > > > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > > > Tell me more about this metadata. > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > what would happen? > > > > > > This directory is owned by guest. So it should be able to write > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > much does overlayfs validate trusted.* it uses? > > > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > > all - what is the consequence? > > > > > > It falls back to I think read only mode. > > > > It looks like the fallback is more subtle to me: > > /* > > * Check if upper/work fs supports trusted.overlay.* xattr > > */ > > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > > if (err) { > > ofs->noxattr = true; > > ofs->config.index = false; > > ofs->config.metacopy = false; > > pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); > > > > but I don't know what index and metacopy are. > > > > > For a moment forget about overlayfs. Say a user process in guest with > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > > passthrough filesystem, so it should go through. But currently it > > > wont. > > > > As long as any effects of what it writes are contained to the area of > > the filesystem exposed to the guest, yes - however it worries me what > > the consequences of broken trusted metadata is. If it's delicate enough > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful > mechanism for applications to control access. The host kernel doesn' > tuse this namespace itself. Linux has four namespaces for xattrs: > > - user - for userspace apps. accessible based on read/write permissions > - trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only > - system - for kernel only. used by ACLs > - security - for kernel only. used by SELinux > > The use case for "trusted" xattrs is thus where a privileged management > application or service wants to store metadata against the file, but > also needs to grant an unprivileged process access to write to this file > while not allowing that unprivileged process the ability to change the > metadata. This is mentioned in the man page: > > [man xattr(7)] > Trusted extended attributes > Trusted extended attributes are visible and accessible only to pro‐ > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > class are used to implement mechanisms in user space (i.e., outside > the kernel) which keep information in extended attributes to which > ordinary processes should not have access. overlayfs inside the kernel is using trusted though which is the case Vivek ran into. > User extended attributes > User extended attributes may be assigned to files and directories > for storing arbitrary additional information such as the mime type, > character set or encoding of a file. The access permissions for > user attributes are defined by the file permission bits: read per‐ > mission is required to retrieve the attribute value, and writer per‐ > mission is required to change it. > [/man] > > Libvirtd uses the "trusted." xattr namespace to record information against > disk images for QEMU, because we need to grant QEMU access to read/write > the disk iamges, but don't want QEMU to be able to alter our xattrs. > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap. > It really ought to have had its own dedicated capability :-( Such is > life with anything that uses CAP_SYS_ADMIN... > > With this in mind we really should have both trusted. & user. xattrs > allowed to the guest by default. > > Conversely, we'll need to block usage of the security. and system. > namespaces. > > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could > have a "--trusted-xattrs=on|off" flag to allow people to run in a more > locked down state if they knowingly want to block trusted xattrs. Does the trick described by Chirantan work for you, where the clients 'trusted' view is different from the host? Dave > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote: > * Daniel P. Berrangé (berrange@redhat.com) wrote: > > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > > > access to the system. > > > > > > > > > > > > Hi Stefan, > > > > > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > > > need CAP_SYS_ADMIN. > > > > > > > > > > > > man xattr says. > > > > > > > > > > > > Trusted extended attributes > > > > > > Trusted extended attributes are visible and accessible only to pro‐ > > > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > > > > class are used to implement mechanisms in user space (i.e., outside the > > > > > > kernel) which keep information in extended attributes to which ordinary > > > > > > processes should not have access. > > > > > > > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > > > > But for now we need to make it work. This is an important use case for > > > > > > kata docker in docker build. > > > > > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > > > > with this capaibility. > > > > > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > > > Can you explain: > > > > > a) What overlayfs uses trusted for? > > > > > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > > > > > Tell me more about this metadata. > > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > > > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > > what would happen? > > > > > > > > This directory is owned by guest. So it should be able to write > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > > much does overlayfs validate trusted.* it uses? > > > > > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > > > all - what is the consequence? > > > > > > > > It falls back to I think read only mode. > > > > > > It looks like the fallback is more subtle to me: > > > /* > > > * Check if upper/work fs supports trusted.overlay.* xattr > > > */ > > > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > > > if (err) { > > > ofs->noxattr = true; > > > ofs->config.index = false; > > > ofs->config.metacopy = false; > > > pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); > > > > > > but I don't know what index and metacopy are. > > > > > > > For a moment forget about overlayfs. Say a user process in guest with > > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > > > passthrough filesystem, so it should go through. But currently it > > > > wont. > > > > > > As long as any effects of what it writes are contained to the area of > > > the filesystem exposed to the guest, yes - however it worries me what > > > the consequences of broken trusted metadata is. If it's delicate enough > > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. > > > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful > > mechanism for applications to control access. The host kernel doesn' > > tuse this namespace itself. Linux has four namespaces for xattrs: > > > > - user - for userspace apps. accessible based on read/write permissions > > - trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only > > - system - for kernel only. used by ACLs > > - security - for kernel only. used by SELinux > > > > The use case for "trusted" xattrs is thus where a privileged management > > application or service wants to store metadata against the file, but > > also needs to grant an unprivileged process access to write to this file > > while not allowing that unprivileged process the ability to change the > > metadata. This is mentioned in the man page: > > > > [man xattr(7)] > > Trusted extended attributes > > Trusted extended attributes are visible and accessible only to pro‐ > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > class are used to implement mechanisms in user space (i.e., outside > > the kernel) which keep information in extended attributes to which > > ordinary processes should not have access. > > overlayfs inside the kernel is using trusted though which is the case > Vivek ran into. IIUC, this is need on the lower-layer FS. If you're running overlayfs on the host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted." on the upper layer won't clash with overlayfs in the host. If you're running overlayfs in the guest, then virtiofs is the lower layer so needs to support the "trusted." on the FS its exposing. Again this should not clash with anything on the host. > > User extended attributes > > User extended attributes may be assigned to files and directories > > for storing arbitrary additional information such as the mime type, > > character set or encoding of a file. The access permissions for > > user attributes are defined by the file permission bits: read per‐ > > mission is required to retrieve the attribute value, and writer per‐ > > mission is required to change it. > > [/man] > > > > Libvirtd uses the "trusted." xattr namespace to record information against > > disk images for QEMU, because we need to grant QEMU access to read/write > > the disk iamges, but don't want QEMU to be able to alter our xattrs. > > > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap. > > It really ought to have had its own dedicated capability :-( Such is > > life with anything that uses CAP_SYS_ADMIN... > > > > With this in mind we really should have both trusted. & user. xattrs > > allowed to the guest by default. > > > > Conversely, we'll need to block usage of the security. and system. > > namespaces. > > > > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could > > have a "--trusted-xattrs=on|off" flag to allow people to run in a more > > locked down state if they knowingly want to block trusted xattrs. > > Does the trick described by Chirantan work for you, where the clients > 'trusted' view is different from the host? That works as long as you don't require any interoperability with processes that have setup the exposed file tree on the host, or might need to concurrently access it. This probably suggests a need various modes of exposing xattrs - "passthrough" - run CAP_SYS_ADMIN and pass directly - "mapped" - no capabilities, rewrite everything into user. namespace - "disable" - don't allow xattr As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present in the initial namespace, so to support "passthrough" we'd have to NOT use user namespace too. Something to be aware of when we accept the user namespace patches. Such "mapping" of attributes could also be used if virtiofs was running as an unprivileged user account, to expose a full range of file owner, group and permissions to the guest, while having everything be owned by the normal user on the host. QEMU's current 9p filesystem has this kind of mapping support for owner/permissions, so guest view can be distinct from the host view. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Jun 19, 2020 at 01:05:47PM +0100, Daniel P. Berrangé wrote: > On Fri, Jun 19, 2020 at 12:49:57PM +0100, Dr. David Alan Gilbert wrote: > > * Daniel P. Berrangé (berrange@redhat.com) wrote: > > > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > > > > access to the system. > > > > > > > > > > > > > > Hi Stefan, > > > > > > > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > > > > need CAP_SYS_ADMIN. > > > > > > > > > > > > > > man xattr says. > > > > > > > > > > > > > > Trusted extended attributes > > > > > > > Trusted extended attributes are visible and accessible only to pro‐ > > > > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > > > > > class are used to implement mechanisms in user space (i.e., outside the > > > > > > > kernel) which keep information in extended attributes to which ordinary > > > > > > > processes should not have access. > > > > > > > > > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > > > > > But for now we need to make it work. This is an important use case for > > > > > > > kata docker in docker build. > > > > > > > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > > > > > with this capaibility. > > > > > > > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > > > > Can you explain: > > > > > > a) What overlayfs uses trusted for? > > > > > > > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > > > > > > > Tell me more about this metadata. > > > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? > > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > > > > > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > > > what would happen? > > > > > > > > > > This directory is owned by guest. So it should be able to write > > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > > > much does overlayfs validate trusted.* it uses? > > > > > > > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > > > > all - what is the consequence? > > > > > > > > > > It falls back to I think read only mode. > > > > > > > > It looks like the fallback is more subtle to me: > > > > /* > > > > * Check if upper/work fs supports trusted.overlay.* xattr > > > > */ > > > > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > > > > if (err) { > > > > ofs->noxattr = true; > > > > ofs->config.index = false; > > > > ofs->config.metacopy = false; > > > > pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); > > > > > > > > but I don't know what index and metacopy are. > > > > > > > > > For a moment forget about overlayfs. Say a user process in guest with > > > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > > > > passthrough filesystem, so it should go through. But currently it > > > > > wont. > > > > > > > > As long as any effects of what it writes are contained to the area of > > > > the filesystem exposed to the guest, yes - however it worries me what > > > > the consequences of broken trusted metadata is. If it's delicate enough > > > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. > > > > > > The CAP_SYS_ADMIN requirement for 'trusted.' xattrs is simply a useful > > > mechanism for applications to control access. The host kernel doesn' > > > tuse this namespace itself. Linux has four namespaces for xattrs: > > > > > > - user - for userspace apps. accessible based on read/write permissions > > > - trusted - for userspace apps. accessible by CAP_SYS_ADMIN processes only > > > - system - for kernel only. used by ACLs > > > - security - for kernel only. used by SELinux > > > > > > The use case for "trusted" xattrs is thus where a privileged management > > > application or service wants to store metadata against the file, but > > > also needs to grant an unprivileged process access to write to this file > > > while not allowing that unprivileged process the ability to change the > > > metadata. This is mentioned in the man page: > > > > > > [man xattr(7)] > > > Trusted extended attributes > > > Trusted extended attributes are visible and accessible only to pro‐ > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > class are used to implement mechanisms in user space (i.e., outside > > > the kernel) which keep information in extended attributes to which > > > ordinary processes should not have access. > > > > overlayfs inside the kernel is using trusted though which is the case > > Vivek ran into. > > IIUC, this is need on the lower-layer FS. If you're running overlayfs on the > host, then virtiofsd is exposing the upper layer FS, so guest use of "trusted." > on the upper layer won't clash with overlayfs in the host. > > If you're running overlayfs in the guest, then virtiofs is the lower layer > so needs to support the "trusted." on the FS its exposing. Again this should > not clash with anything on the host. > > > > User extended attributes > > > User extended attributes may be assigned to files and directories > > > for storing arbitrary additional information such as the mime type, > > > character set or encoding of a file. The access permissions for > > > user attributes are defined by the file permission bits: read per‐ > > > mission is required to retrieve the attribute value, and writer per‐ > > > mission is required to change it. > > > [/man] > > > > > > Libvirtd uses the "trusted." xattr namespace to record information against > > > disk images for QEMU, because we need to grant QEMU access to read/write > > > the disk iamges, but don't want QEMU to be able to alter our xattrs. > > > > > > It is unfortunate that this namespace is tied to the CAP_SYS_ADMIN cap. > > > It really ought to have had its own dedicated capability :-( Such is > > > life with anything that uses CAP_SYS_ADMIN... > > > > > > With this in mind we really should have both trusted. & user. xattrs > > > allowed to the guest by default. > > > > > > Conversely, we'll need to block usage of the security. and system. > > > namespaces. > > > > > > Unfortuntely this implies allowing CAP_SYS_ADMIN by default. We could > > > have a "--trusted-xattrs=on|off" flag to allow people to run in a more > > > locked down state if they knowingly want to block trusted xattrs. > > > > Does the trick described by Chirantan work for you, where the clients > > 'trusted' view is different from the host? > > That works as long as you don't require any interoperability with > processes that have setup the exposed file tree on the host, or > might need to concurrently access it. > > This probably suggests a need various modes of exposing xattrs > > - "passthrough" - run CAP_SYS_ADMIN and pass directly > - "mapped" - no capabilities, rewrite everything into user. namespace > - "disable" - don't allow xattr > We already do "passthrough" (this is our default) and "disable" (-o no_xattr). We probably need to implement this "mapped" mode and that will allow us to run without CAP_SYS_ADMIN in init_user_ns. > As mentioned in the other thread though, CAP_SYS_ADMIN needs to be present > in the initial namespace, so to support "passthrough" we'd have to NOT > use user namespace too. Something to be aware of when we accept the > user namespace patches. > > Such "mapping" of attributes could also be used if virtiofs was running > as an unprivileged user account, to expose a full range of file owner, > group and permissions to the guest, while having everything be owned > by the normal user on the host. QEMU's current 9p filesystem has > this kind of mapping support for owner/permissions, so guest view can > be distinct from the host view. If we figure out a way to run in a user namespace, then we don't necessarily require this mapping functionality, right? User namespace will automatically provide that mapping. We can just run as root of that user namespace. When 9p wrote this remapping method, we probably did not have user namespace as popular as they are now. So we probably should implement remapping of trusted xattr and then run in a user namesapce. And complete passthrough mode will have to run in init user namespace and this will be least securre mode (from host security point of view). Thanks Vivek
On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > * Vivek Goyal (vgoyal@redhat.com) wrote: > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > access to the system. > > > > > > > > Hi Stefan, > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > need CAP_SYS_ADMIN. > > > > > > > > man xattr says. > > > > > > > > Trusted extended attributes > > > > Trusted extended attributes are visible and accessible only to pro‐ > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > > class are used to implement mechanisms in user space (i.e., outside the > > > > kernel) which keep information in extended attributes to which ordinary > > > > processes should not have access. > > > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > > But for now we need to make it work. This is an important use case for > > > > kata docker in docker build. > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > > with this capaibility. > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > Can you explain: > > > a) What overlayfs uses trusted for? > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > Tell me more about this metadata. > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? It contains path information which is used for lookup into lower layer. > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? Overlay is storing its metadata in trusted.* xattrs. If a user modifies metadata, then various kind of bad things can happen. I think one can do some kind of checks on metadata to make sure it does not crash atleast. And that's true for any filesystem. Isn't. If user manages to modify metadata outside of filesystem, then lot of bad things can happen. I thought that's the reason that people are not comfortable with the idea of allowing mounting filesystem from inside user namespace because it makes it easy to mount a hand crafted filesystem. Anyway, I think overlayfs is just one use case of trusted xattr. Even if overlayfs moves away from trusted xattr, what about other users. We need to have a story around how will we support trusted xattrs safely. > > > > b) If something nasty was to write junk into the trusted attributes, > > > what would happen? > > > > This directory is owned by guest. So it should be able to write > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > Well, we shouldn't be able to break/crash/escape into the host; how > much does overlayfs validate trusted.* it uses? I thought qemu and kvm are the one who should ensure we should not be able to break out of sandbox. Kernel implementation could be as buggy as it wanted to be. We are working with this security model that kernel is completely untrusted. > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > all - what is the consequence? > > > > It falls back to I think read only mode. > > It looks like the fallback is more subtle to me: > /* > * Check if upper/work fs supports trusted.overlay.* xattr > */ > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > if (err) { > ofs->noxattr = true; > ofs->config.index = false; > ofs->config.metacopy = false; > pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); > > but I don't know what index and metacopy are. They enable certain features in overlayfs. In fact, we fall back to lesser capability on if we are running on ext4/xfs. For virtiofs, we deny the mount completely. /* * We allowed sub-optimal upper fs configuration and don't want to break * users over kernel upgrade, but we never allowed remote upper fs, so * we can enforce strict requirements for remote upper fs. */ if (ovl_dentry_remote(ofs->workdir) && (!d_type || !rename_whiteout || ofs->noxattr)) { pr_err("upper fs missing required features.\n"); err = -EINVAL; goto out; } > > > For a moment forget about overlayfs. Say a user process in guest with > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > passthrough filesystem, so it should go through. But currently it > > wont. > > As long as any effects of what it writes are contained to the area of > the filesystem exposed to the guest, yes - however it worries me what > the consequences of broken trusted metadata is. If it's delicate enough > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. Agreed that we need to look into whether having CAP_SYS_ADMIN allow virtiofsd to break out of jail. May be we need to provide that remapping trusted xattr feature so that we don't have to have CAP_SYS_ADMIN in init_user_ns and can provide this emulation even when running in user namespace. Vivek
* Vivek Goyal (vgoyal@redhat.com) wrote: > On Fri, Jun 19, 2020 at 09:27:46AM +0100, Dr. David Alan Gilbert wrote: > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > On Thu, Jun 18, 2020 at 08:16:55PM +0100, Dr. David Alan Gilbert wrote: > > > > * Vivek Goyal (vgoyal@redhat.com) wrote: > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > > access to the system. > > > > > > > > > > Hi Stefan, > > > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > > > > > overlayfs sets "trusted.overlay.*" and xattrs in trusted domain > > > > > need CAP_SYS_ADMIN. > > > > > > > > > > man xattr says. > > > > > > > > > > Trusted extended attributes > > > > > Trusted extended attributes are visible and accessible only to pro‐ > > > > > cesses that have the CAP_SYS_ADMIN capability. Attributes in this > > > > > class are used to implement mechanisms in user space (i.e., outside the > > > > > kernel) which keep information in extended attributes to which ordinary > > > > > processes should not have access. > > > > > > > > > > There is a chance that overlay moves away from trusted xattr in future. > > > > > But for now we need to make it work. This is an important use case for > > > > > kata docker in docker build. > > > > > > > > > > May be we can add an option to virtiofsd say "--add-cap <capability>" and > > > > > ask user to pass in "--add-cap cap_sys_admin" if they need to run daemon > > > > > with this capaibility. > > > > > > > > I'll admit I don't like the idea of giving it cap_sys_admin. > > > > Can you explain: > > > > a) What overlayfs uses trusted for? > > > > > > overlayfs stores bunch of metadata and uses "trusted" xattrs for it. > > > > Tell me more about this metadata. > > Taking a juicy looking one, what does OVL_XATTR_REDIRECT do? > > It contains path information which is used for lookup into lower layer. > > > Or what happens if I was to write random numbers into OVL_XATTR_NLINK? > > Overlay is storing its metadata in trusted.* xattrs. If a user modifies > metadata, then various kind of bad things can happen. I think one can > do some kind of checks on metadata to make sure it does not crash > atleast. > > And that's true for any filesystem. Isn't. If user manages to modify > metadata outside of filesystem, then lot of bad things can happen. I > thought that's the reason that people are not comfortable with the > idea of allowing mounting filesystem from inside user namespace because > it makes it easy to mount a hand crafted filesystem. > > Anyway, I think overlayfs is just one use case of trusted xattr. Even > if overlayfs moves away from trusted xattr, what about other users. > We need to have a story around how will we support trusted xattrs > safely. > > > > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > what would happen? > > > > > > This directory is owned by guest. So it should be able to write > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > much does overlayfs validate trusted.* it uses? > > I thought qemu and kvm are the one who should ensure we should not be > able to break out of sandbox. Kernel implementation could be as > buggy as it wanted to be. We are working with this security model > that kernel is completely untrusted. But with virtiofs we allow the guest to do a lot of filesystem operations on the host. It's the virtiofsd that has to ensure that these are safe and contained within the fs it's exposed; the qemu/kvm can't protect us from that. That's why we sandbox the virtiofsd like we do - if we allow a priviliged guest to perform calls to an unconstrained virtiofsd it would be able to escape. That's what I want to check. Dave > > > > > > c) I see overlayfs has a fallback check if xattr isn't supported at > > > > all - what is the consequence? > > > > > > It falls back to I think read only mode. > > > > It looks like the fallback is more subtle to me: > > /* > > * Check if upper/work fs supports trusted.overlay.* xattr > > */ > > err = ovl_do_setxattr(ofs->workdir, OVL_XATTR_OPAQUE, "0", 1, 0); > > if (err) { > > ofs->noxattr = true; > > ofs->config.index = false; > > ofs->config.metacopy = false; > > pr_warn("upper fs does not support xattr, falling back to index=off and metacopy=off.\n"); > > > > but I don't know what index and metacopy are. > > They enable certain features in overlayfs. In fact, we fall back to > lesser capability on if we are running on ext4/xfs. For virtiofs, > we deny the mount completely. > > /* > * We allowed sub-optimal upper fs configuration and don't want to break > * users over kernel upgrade, but we never allowed remote upper fs, so > * we can enforce strict requirements for remote upper fs. > */ > if (ovl_dentry_remote(ofs->workdir) && > (!d_type || !rename_whiteout || ofs->noxattr)) { > pr_err("upper fs missing required features.\n"); > err = -EINVAL; > goto out; > } > > > > > > For a moment forget about overlayfs. Say a user process in guest with > > > CAP_SYS_ADMIN is writing trusted.foo. Should that succeed? Is a > > > passthrough filesystem, so it should go through. But currently it > > > wont. > > > > As long as any effects of what it writes are contained to the area of > > the filesystem exposed to the guest, yes - however it worries me what > > the consequences of broken trusted metadata is. If it's delicate enough > > that it's guarded by CAP_SYS_ADMIN someone must have worried about it. > > Agreed that we need to look into whether having CAP_SYS_ADMIN allow > virtiofsd to break out of jail. > > May be we need to provide that remapping trusted xattr feature so > that we don't have to have CAP_SYS_ADMIN in init_user_ns and can > provide this emulation even when running in user namespace. > > Vivek -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote: [..] > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > > what would happen? > > > > > > > > This directory is owned by guest. So it should be able to write > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > > much does overlayfs validate trusted.* it uses? > > > > I thought qemu and kvm are the one who should ensure we should not be > > able to break out of sandbox. Kernel implementation could be as > > buggy as it wanted to be. We are working with this security model > > that kernel is completely untrusted. > > But with virtiofs we allow the guest to do a lot of filesystem > operations on the host. It's the virtiofsd that has to ensure that > these are safe and contained within the fs it's exposed; the qemu/kvm > can't protect us from that. Fair enough. I should have added virtiofsd to list. Its an attack vector and is of concern. > > That's why we sandbox the virtiofsd like we do - if we allow a > priviliged guest to perform calls to an unconstrained virtiofsd it would > be able to escape. That's what I want to check. Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape the jail. If it does we need to implement what crossvm folks did, remapping of trusted xattr. That will also allow us to run inside user namespace and still be able to support trusted xattr emulation for guest. Vivek
* Vivek Goyal (vgoyal@redhat.com) wrote: > On Fri, Jun 19, 2020 at 05:16:48PM +0100, Dr. David Alan Gilbert wrote: > > [..] > > > > > > b) If something nasty was to write junk into the trusted attributes, > > > > > > what would happen? > > > > > > > > > > This directory is owned by guest. So it should be able to write > > > > > anything it wants, as long as process in guest has CAP_SYS_ADMIN, right? > > > > > > > > Well, we shouldn't be able to break/crash/escape into the host; how > > > > much does overlayfs validate trusted.* it uses? > > > > > > I thought qemu and kvm are the one who should ensure we should not be > > > able to break out of sandbox. Kernel implementation could be as > > > buggy as it wanted to be. We are working with this security model > > > that kernel is completely untrusted. > > > > But with virtiofs we allow the guest to do a lot of filesystem > > operations on the host. It's the virtiofsd that has to ensure that > > these are safe and contained within the fs it's exposed; the qemu/kvm > > can't protect us from that. > > Fair enough. I should have added virtiofsd to list. Its an attack > vector and is of concern. > > > > > That's why we sandbox the virtiofsd like we do - if we allow a > > priviliged guest to perform calls to an unconstrained virtiofsd it would > > be able to escape. That's what I want to check. > > Sure. So does giving CAP_SYS_ADMIN to virtiofsd allow daemon to escape > the jail. So that's *my* question - what bad things can someone do by setting attributes (trusted/system/security) - it's fine if they break they screwup the security inside the container, because they'd need to be CAP_SYS_ADMIN inside the container to do it - as long as they can't break the host. So what happens if someone starts doing bad things to trusted.* attributes on an overlayfs - no other fs uses them as far as I know. > If it does we need to implement what crossvm folks did, > remapping of trusted xattr. That will also allow us to run inside > user namespace and still be able to support trusted xattr emulation > for guest. I think we need to do that anyway, it's just going to take a while to figure out. Dave > > Vivek -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > whitelisted set of capabilities that we require. This improves security in > > case virtiofsd is compromised by making it hard for an attacker to gain further > > access to the system. > > Hi Stefan, > > I just noticed that this patch set breaks overlayfs on top of virtiofs. How so? Virtiofs isn't mounting overlayfs, is it? Only the mounter requires CAP_SYS_ADMIN, not the accessor. Thanks, Miklos
On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote: > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > whitelisted set of capabilities that we require. This improves security in > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > access to the system. > > > > Hi Stefan, > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > How so? Virtiofs isn't mounting overlayfs, is it? Only the mounter > requires CAP_SYS_ADMIN, not the accessor. virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque) fails in lo_setxattr(). This is triggered when we mount overlayfs on top of virtiofs and overlayfs tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are supported or not. Thanks Vivek
On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote: > > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > whitelisted set of capabilities that we require. This improves security in > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > access to the system. > > > > > > Hi Stefan, > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > How so? Virtiofs isn't mounting overlayfs, is it? Only the mounter > > requires CAP_SYS_ADMIN, not the accessor. > > virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque) > fails in lo_setxattr(). > > This is triggered when we mount overlayfs on top of virtiofs and overlayfs > tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are > supported or not. Ah, right. Plan is to use "user.*" xattr for unprivileged overlay. This would be a good way to eliminate this attack surface in the overlay on virtiofs case as well. Other ways to minimize risk is to separate operations requiring CAP_SYS_ADMIN into a separate process, preferably a separate executable, that communicates with virtiofsd using a pipe and contains the minimum amount of code. Thanks, Miklos
On Fri, Jun 19, 2020 at 05:26:37PM +0200, Miklos Szeredi wrote: > On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote: > > > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > > > whitelisted set of capabilities that we require. This improves security in > > > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > > > access to the system. > > > > > > > > Hi Stefan, > > > > > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > > > > > How so? Virtiofs isn't mounting overlayfs, is it? Only the mounter > > > requires CAP_SYS_ADMIN, not the accessor. > > > > virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque) > > fails in lo_setxattr(). > > > > This is triggered when we mount overlayfs on top of virtiofs and overlayfs > > tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are > > supported or not. > > Ah, right. > > Plan is to use "user.*" xattr for unprivileged overlay. This would be > a good way to eliminate this attack surface in the overlay on virtiofs > case as well. So unpriviliged overlay is one which is mounted from inside a user namespace. But in this case we might be mounting it from init_user_ns of guest. So from overlayfs perspective this will still be treated as priviliged overlay instance and it will use trusted xattrs, IIUC? Thanks Vivek > > Other ways to minimize risk is to separate operations requiring > CAP_SYS_ADMIN into a separate process, preferably a separate > executable, that communicates with virtiofsd using a pipe and contains > the minimum amount of code. > > Thanks, > Miklos >
On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote: > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > > whitelisted set of capabilities that we require. This improves security in > > > case virtiofsd is compromised by making it hard for an attacker to gain further > > > access to the system. > > > > Hi Stefan, > > > > I just noticed that this patch set breaks overlayfs on top of virtiofs. > > How so? Virtiofs isn't mounting overlayfs, is it? Only the mounter > requires CAP_SYS_ADMIN, not the accessor. Are you thinking of virtiofs on top of overlayfs? I was referring to mounting overlayfs on top of virtiofs inside VM. Thanks Vivek
On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > whitelisted set of capabilities that we require. This improves security in > case virtiofsd is compromised by making it hard for an attacker to gain further > access to the system. Hi Stefan, Good to see this patch. We needed to limit capabilities to reduce attack surface. What tests have you run to make sure this current set of whitelisted capabilities is good enough. Vivek > > Stefan Hajnoczi (2): > virtiofsd: only retain file system capabilities > virtiofsd: drop all capabilities in the wait parent process > > tools/virtiofsd/passthrough_ll.c | 51 ++++++++++++++++++++++++++++++++ > 1 file changed, 51 insertions(+) > > -- > 2.25.1 >
On Thu, Apr 16, 2020 at 04:10:22PM -0400, Vivek Goyal wrote: > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote: > > virtiofsd doesn't need of all Linux capabilities(7) available to root. Keep a > > whitelisted set of capabilities that we require. This improves security in > > case virtiofsd is compromised by making it hard for an attacker to gain further > > access to the system. > > Hi Stefan, > > Good to see this patch. We needed to limit capabilities to reduce attack > surface. > > What tests have you run to make sure this current set of whitelisted > capabilities is good enough. Booting and light usage of Fedora 29 and running blogbench. I would appreciate it if others could try it out with their tests/workloads. Stefan
© 2016 - 2025 Red Hat, Inc.