If a container manager restricts its unprivileged (user namespaced)
children by a device cgroup, it is no longer necessary to deny mknod.
Thus, user space applications may map devices to different
locations in the file system by using mknod() inside the container.
One use case, which we also rely on in GyroidOS, is running virsh for
VMs inside an unprivileged container. virsh creates device nodes,
e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null", which currently fails
in a non-initial userns, even if a cgroup device white list with the
corresponding major, minor of /dev/null exists. Thus, in this case
the usual bind mounts or pre-populated device nodes under /dev are
not sufficient.
To circumvent this limitation, we allow mknod() in fs/namei.c if a
bpf cgroup device guard is enabled for the current task, as reported by
devcgroup_task_is_guarded(), and check CAP_MKNOD for the current user
namespace by ns_capable() instead of the global CAP_MKNOD.
To avoid unusable device nodes on file systems mounted in a
non-initial user namespace, may_open_dev() ignores SB_I_NODEV
for cgroup device guarded tasks.
Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
---
fs/namei.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..ef4f22b9575c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
bool may_open_dev(const struct path *path)
{
+ if (devcgroup_task_is_guarded(current))
+ return !(path->mnt->mnt_flags & MNT_NODEV);
+
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}
@@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
if (error)
return error;
- if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
- !capable(CAP_MKNOD))
- return -EPERM;
+ /*
+ * In case of a device cgroup restriction allow mknod in user
+ * namespace. Otherwise just check global capability; thus,
+ * mknod is also disabled for user namespace other than the
+ * initial one.
+ */
+ if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
+ if (devcgroup_task_is_guarded(current)) {
+ if (!ns_capable(current_user_ns(), CAP_MKNOD))
+ return -EPERM;
+ } else if (!capable(CAP_MKNOD))
+ return -EPERM;
+ }
if (!dir->i_op->mknod)
return -EPERM;
--
2.30.2
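For illustration only (this is not part of the patch), the two decisions changed by the hunks above can be modelled in user space as pure functions. The struct and function names below are hypothetical stand-ins and do not exist in the kernel; each boolean field simply models the result of the kernel helper named in its comment.

```c
#include <stdbool.h>

/* Hypothetical model of the task state consulted by the patched checks. */
struct task_model {
	bool guarded;        /* models devcgroup_task_is_guarded(current) */
	bool cap_mknod_ns;   /* models ns_capable(current_user_ns(), CAP_MKNOD) */
	bool cap_mknod_init; /* models capable(CAP_MKNOD) */
};

/* Hypothetical model of the mount state consulted by may_open_dev(). */
struct mount_model {
	bool mnt_nodev;  /* models MNT_NODEV set in mnt_flags */
	bool sb_i_nodev; /* models SB_I_NODEV set in s_iflags */
};

/*
 * Models the capability check in the patched vfs_mknod(): returns true
 * if the check passes, false where the kernel would return -EPERM.
 */
static bool model_mknod_dev_allowed(const struct task_model *t,
				    bool is_dev, bool is_whiteout)
{
	/* The check only applies to char/block devices that are not whiteouts. */
	if (!is_dev || is_whiteout)
		return true;

	/* Guarded tasks only need CAP_MKNOD in their own user namespace. */
	if (t->guarded)
		return t->cap_mknod_ns;

	/* Everyone else still needs the global capability. */
	return t->cap_mknod_init;
}

/* Models the patched may_open_dev(). */
static bool model_may_open_dev(const struct task_model *t,
			       const struct mount_model *m)
{
	/* Guarded tasks ignore SB_I_NODEV but still honour MNT_NODEV. */
	if (t->guarded)
		return !m->mnt_nodev;

	return !m->mnt_nodev && !m->sb_i_nodev;
}
```

Under this model, a task in a non-initial userns that holds CAP_MKNOD only in its own namespace is refused when unguarded and allowed when guarded, and the same guarded task may open a device node on a superblock flagged SB_I_NODEV as long as the mount itself is not MNT_NODEV.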
Hello,
kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
in testcase: boot
compiler: gcc-12
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com
[ 14.468719][ T139]
[ 14.468999][ T139] =============================
[ 14.469545][ T139] WARNING: suspicious RCU usage
[ 14.469968][ T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
[ 14.470520][ T139] -----------------------------
[ 14.470940][ T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
[ 14.471703][ T139]
[ 14.471703][ T139] other info that might help us debug this:
[ 14.471703][ T139]
[ 14.472692][ T139]
[ 14.472692][ T139] rcu_scheduler_active = 2, debug_locks = 1
[ 14.473469][ T139] no locks held by (journald)/139.
[ 14.473935][ T139]
[ 14.473935][ T139] stack backtrace:
[ 14.474454][ T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
[ 14.475296][ T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 14.476298][ T139] Call Trace:
[ 14.476608][ T139] dump_stack_lvl+0x78/0x8c
[ 14.477055][ T139] dump_stack+0x12/0x18
[ 14.477420][ T139] lockdep_rcu_suspicious+0x153/0x1a4
[ 14.477928][ T139] cgroup_bpf_device_guard_enabled+0x14f/0x168
[ 14.478476][ T139] devcgroup_task_is_guarded+0x10/0x20
[ 14.478973][ T139] may_open_dev+0x11/0x44
[ 14.479367][ T139] may_open+0x115/0x13c
[ 14.479727][ T139] do_open+0xa1/0x378
[ 14.480113][ T139] path_openat+0xdc/0x1bc
[ 14.480512][ T139] do_filp_open+0x91/0x124
[ 14.480911][ T139] ? lock_release+0x62/0x118
[ 14.481329][ T139] ? _raw_spin_unlock+0x18/0x34
[ 14.481797][ T139] ? alloc_fd+0x112/0x1c4
[ 14.482183][ T139] do_sys_openat2+0x7a/0xa0
[ 14.482592][ T139] __ia32_sys_openat+0x66/0x9c
[ 14.483065][ T139] do_int80_syscall_32+0x27/0x48
[ 14.483502][ T139] entry_INT80_32+0x10d/0x10d
[ 14.483962][ T139] EIP: 0xa7f39092
[ 14.484267][ T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 14.485995][ T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
[ 14.486622][ T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
[ 14.487225][ T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue, Aug 15, 2023 at 9:18 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed "WARNING:suspicious_RCU_usage" on:
>
> commit: bffc333633f1e681c01ada11bd695aa220518bd8 ("[PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard")
> url: https://github.com/intel-lab-lkp/linux/commits/Michael-Wei/bpf-add-cgroup-device-guard-to-flag-a-cgroup-device-prog/20230814-224110
> patch link: https://lore.kernel.org/all/20230814-devcg_guard-v1-4-654971ab88b1@aisec.fraunhofer.de/
> patch subject: [PATCH RFC 4/4] fs: allow mknod in non-initial userns using cgroup device guard
>
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> [ 14.468719][ T139]
> [ 14.468999][ T139] =============================
> [ 14.469545][ T139] WARNING: suspicious RCU usage
> [ 14.469968][ T139] 6.5.0-rc6-00004-gbffc333633f1 #1 Not tainted
> [ 14.470520][ T139] -----------------------------
> [ 14.470940][ T139] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage!
Most likely this is because, in the cgroup_bpf_device_guard_enabled() function, the line
struct cgroup *cgrp = task_dfl_cgroup(task);
should run under rcu_read_lock() (or with cgroup_mutex held). If we get rid of
cgroup_mutex and make cgroup_bpf_device_guard_enabled()
specific to the "current" task, we will solve this issue too.
> [ 14.471703][ T139]
> [ 14.471703][ T139] other info that might help us debug this:
> [ 14.471703][ T139]
> [ 14.472692][ T139]
> [ 14.472692][ T139] rcu_scheduler_active = 2, debug_locks = 1
> [ 14.473469][ T139] no locks held by (journald)/139.
> [ 14.473935][ T139]
> [ 14.473935][ T139] stack backtrace:
> [ 14.474454][ T139] CPU: 1 PID: 139 Comm: (journald) Not tainted 6.5.0-rc6-00004-gbffc333633f1 #1
> [ 14.475296][ T139] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 14.476298][ T139] Call Trace:
> [ 14.476608][ T139] dump_stack_lvl+0x78/0x8c
> [ 14.477055][ T139] dump_stack+0x12/0x18
> [ 14.477420][ T139] lockdep_rcu_suspicious+0x153/0x1a4
> [ 14.477928][ T139] cgroup_bpf_device_guard_enabled+0x14f/0x168
> [ 14.478476][ T139] devcgroup_task_is_guarded+0x10/0x20
> [ 14.478973][ T139] may_open_dev+0x11/0x44
> [ 14.479367][ T139] may_open+0x115/0x13c
> [ 14.479727][ T139] do_open+0xa1/0x378
> [ 14.480113][ T139] path_openat+0xdc/0x1bc
> [ 14.480512][ T139] do_filp_open+0x91/0x124
> [ 14.480911][ T139] ? lock_release+0x62/0x118
> [ 14.481329][ T139] ? _raw_spin_unlock+0x18/0x34
> [ 14.481797][ T139] ? alloc_fd+0x112/0x1c4
> [ 14.482183][ T139] do_sys_openat2+0x7a/0xa0
> [ 14.482592][ T139] __ia32_sys_openat+0x66/0x9c
> [ 14.483065][ T139] do_int80_syscall_32+0x27/0x48
> [ 14.483502][ T139] entry_INT80_32+0x10d/0x10d
> [ 14.483962][ T139] EIP: 0xa7f39092
> [ 14.484267][ T139] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 f8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> [ 14.485995][ T139] EAX: ffffffda EBX: ffffff9c ECX: 005df542 EDX: 00008100
> [ 14.486622][ T139] ESI: 00000000 EDI: 00000000 EBP: affeb888 ESP: affeb6ec
> [ 14.487225][ T139] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200246
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230815/202308151506.6be3b169-oliver.sang@intel.com
>
>
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
>
+CC Stéphane Graber <stgraber@ubuntu.com>
On Mon, Aug 14, 2023 at 4:26 PM Michael Weiß
<michael.weiss@aisec.fraunhofer.de> wrote:
>
> If a container manager restricts its unprivileged (user namespaced)
> children by a device cgroup, it is not necessary to deny mknod
> anymore. Thus, user space applications may map devices on different
> locations in the file system by using mknod() inside the container.
>
> A use case for this, we also use in GyroidOS, is to run virsh for
> VMs inside an unprivileged container. virsh creates device nodes,
> e.g., "/var/run/libvirt/qemu/11-fgfg.dev/null" which currently fails
> in a non-initial userns, even if a cgroup device white list with the
> corresponding major, minor of /dev/null exists. Thus, in this case
> the usual bind mounts or pre populated device nodes under /dev are
> not sufficient.
>
> To circumvent this limitation, we allow mknod() in fs/namei.c if a
> bpf cgroup device guard is enabled for the current task using
> devcgroup_task_is_guarded() and check CAP_MKNOD for the current user
> namespace by ns_capable() instead of the global CAP_MKNOD.
>
> To avoid unusable device nodes on file systems mounted in
> non-initial user namespace, may_open_dev() ignores the SB_I_NODEV
> for cgroup device guarded tasks.
>
> Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> ---
> fs/namei.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index e56ff39a79bc..ef4f22b9575c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3221,6 +3221,9 @@ EXPORT_SYMBOL(vfs_mkobj);
>
> bool may_open_dev(const struct path *path)
> {
> + if (devcgroup_task_is_guarded(current))
> + return !(path->mnt->mnt_flags & MNT_NODEV);
> +
> return !(path->mnt->mnt_flags & MNT_NODEV) &&
> !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
> }
> @@ -3976,9 +3979,19 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
> if (error)
> return error;
>
> - if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout &&
> - !capable(CAP_MKNOD))
> - return -EPERM;
> + /*
> + * In case of a device cgroup restriction allow mknod in user
> + * namespace. Otherwise just check global capability; thus,
> + * mknod is also disabled for user namespace other than the
> + * initial one.
> + */
> + if ((S_ISCHR(mode) || S_ISBLK(mode)) && !is_whiteout) {
> + if (devcgroup_task_is_guarded(current)) {
> + if (!ns_capable(current_user_ns(), CAP_MKNOD))
> + return -EPERM;
> + } else if (!capable(CAP_MKNOD))
> + return -EPERM;
> + }
>
> if (!dir->i_op->mknod)
> return -EPERM;
>
> --
> 2.30.2
>