From nobody Tue Jun 23 21:16:01 2026
Date: Tue, 1 Feb 2022 12:55:30 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-2-haoluo@google.com>
References: <20220201205534.1962784-1-haoluo@google.com>
X-Mailer: git-send-email 2.35.0.rc2.247.g8bbb082509-goog
Subject: [PATCH RFC bpf-next v2 1/5] bpf: Bpffs directory tag
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
 Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
 linux-kernel@vger.kernel.org, Hao Luo
X-Mailing-List: linux-kernel@vger.kernel.org

Introduce a tag structure for directories in bpffs. A tag carries special
information about a directory. For example, a BPF_DIR_KERNFS_REP tag
denotes that a directory is a replicate of a kernfs hierarchy.

At mkdir, if the parent directory has a tag, the child directory also gets
a tag. For KERNFS_REP directories, the tag references a kernfs node, and
the KERNFS_REP hierarchy mirrors the hierarchy in kernfs. Userspace is
responsible for keeping the two hierarchies in sync.

The initial tag can be created by pinning a certain type of bpf object.
The following patches will introduce such objects, and the tagged directory
will mirror the cgroup hierarchy. Tags are destroyed at rmdir.

Signed-off-by: Hao Luo
---
 kernel/bpf/inode.c | 80 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/inode.h | 22 +++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/inode.h

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 5a8d9f7467bf..ecc357009df5 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -16,11 +16,13 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include "preload/bpf_preload.h"
+#include "inode.h"
 
 enum bpf_type {
 	BPF_TYPE_UNSPEC = 0,
@@ -142,6 +144,52 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type)
 	return 0;
 }
 
+static struct bpf_dir_tag *inode_tag(const struct inode *inode)
+{
+	if (unlikely(!S_ISDIR(inode->i_mode)))
+		return NULL;
+
+	return inode->i_private;
+}
+
+/* tag_dir_inode - tag a newly created directory.
+ * @tag: tag of parent directory
+ * @dentry: dentry of the new directory
+ * @inode: inode of the new directory
+ *
+ * Called from bpf_mkdir.
+ */
+static int tag_dir_inode(const struct bpf_dir_tag *tag,
+			 const struct dentry *dentry, struct inode *inode)
+{
+	struct bpf_dir_tag *t;
+	struct kernfs_node *kn;
+
+	WARN_ON(tag->type != BPF_DIR_KERNFS_REP);
+
+	/* kn is put at tag deallocation. */
+	kn = kernfs_find_and_get_ns(tag->private, dentry->d_name.name, NULL);
+	if (unlikely(!kn))
+		return -ENOENT;
+
+	if (unlikely(kernfs_type(kn) != KERNFS_DIR)) {
+		kernfs_put(kn);
+		return -EPERM;
+	}
+
+	t = kzalloc(sizeof(struct bpf_dir_tag), GFP_KERNEL | __GFP_NOWARN);
+	if (unlikely(!t)) {
+		kernfs_put(kn);
+		return -ENOMEM;
+	}
+
+	t->type = tag->type;
+	t->private = kn;
+
+	inode->i_private = t;
+	return 0;
+}
+
 static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
 				struct inode *dir)
 {
@@ -156,6 +204,8 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 		     struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
+	struct bpf_dir_tag *tag;
+	int err;
 
 	inode = bpf_get_inode(dir->i_sb, dir, mode | S_IFDIR);
 	if (IS_ERR(inode))
@@ -164,6 +214,15 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	inode->i_op = &bpf_dir_iops;
 	inode->i_fop = &simple_dir_operations;
 
+	tag = inode_tag(dir);
+	if (tag) {
+		err = tag_dir_inode(tag, dentry, inode);
+		if (err) {
+			iput(inode);
+			return err;
+		}
+	}
+
 	inc_nlink(inode);
 	inc_nlink(dir);
 
@@ -404,11 +463,30 @@ static int bpf_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	return 0;
 }
 
+static void untag_dir_inode(struct inode *dir)
+{
+	struct bpf_dir_tag *tag = inode_tag(dir);
+
+	WARN_ON(tag->type != BPF_DIR_KERNFS_REP);
+
+	dir->i_private = NULL;
+	kernfs_put(tag->private);
+	kfree(tag);
+}
+
+static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
+{
+	if (inode_tag(dir))
+		untag_dir_inode(dir);
+
+	return simple_rmdir(dir, dentry);
+}
+
 static const struct inode_operations bpf_dir_iops = {
 	.lookup = bpf_lookup,
 	.mkdir = bpf_mkdir,
 	.symlink = bpf_symlink,
-	.rmdir = simple_rmdir,
+	.rmdir = bpf_rmdir,
 	.rename = simple_rename,
 	.link = simple_link,
 	.unlink = simple_unlink,
diff --git a/kernel/bpf/inode.h b/kernel/bpf/inode.h
new file mode 100644
index 000000000000..2cfeef39e861
--- /dev/null
+++ b/kernel/bpf/inode.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2022 Google
+ */
+#ifndef __BPF_INODE_H_
+#define __BPF_INODE_H_
+
+enum tag_type {
+	/* The directory is a replicate of a kernfs directory hierarchy. */
+	BPF_DIR_KERNFS_REP = 0,
+};
+
+/* A tag for bpffs directories. It carries special information about a
+ * directory. For example, BPF_DIR_KERNFS_REP denotes that the directory is
+ * a replicate of a kernfs hierarchy. Pinning a certain type of objects tags
+ * a directory and the tag will be removed at rmdir.
+ */
+struct bpf_dir_tag {
+	enum tag_type type;
+	void *private; /* tag private data */
+};
+
+#endif
-- 
2.35.0.rc2.247.g8bbb082509-goog

From nobody Tue Jun 23 21:16:01 2026
Date: Tue, 1 Feb 2022 12:55:31 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-3-haoluo@google.com>
References: <20220201205534.1962784-1-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 2/5] bpf: Introduce inherit list for dir tag.
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
 Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
 linux-kernel@vger.kernel.org, Hao Luo

Embed a list of bpf objects in a directory's tag. This list is shared by
all the directories in the tagged hierarchy.

When a new tagged directory is created, it is prepopulated with the objects
in the inherit list. When the directory is removed, the inherited objects
are removed automatically.

Because the whole tagged hierarchy shares the same list, all the
directories in the hierarchy have the same set of objects to be
prepopulated.

Signed-off-by: Hao Luo
---
 kernel/bpf/inode.c | 110 +++++++++++++++++++++++++++++++++++++++++----
 kernel/bpf/inode.h |  33 ++++++++++++++
 2 files changed, 135 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index ecc357009df5..9ae17a2bf779 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -24,13 +24,6 @@
 #include "preload/bpf_preload.h"
 #include "inode.h"
 
-enum bpf_type {
-	BPF_TYPE_UNSPEC = 0,
-	BPF_TYPE_PROG,
-	BPF_TYPE_MAP,
-	BPF_TYPE_LINK,
-};
-
 static void *bpf_any_get(void *raw, enum bpf_type type)
 {
 	switch (type) {
@@ -69,6 +62,20 @@ static void bpf_any_put(void *raw, enum bpf_type type)
 	}
 }
 
+static void free_obj_list(struct kref *kref)
+{
+	struct obj_list *list;
+	struct bpf_inherit_entry *e;
+
+	list = container_of(kref, struct obj_list, refcnt);
+	list_for_each_entry(e, &list->list, list) {
+		list_del_rcu(&e->list);
+		bpf_any_put(e->obj, e->type);
+		kfree(e);
+	}
+	kfree(list);
+}
+
 static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type)
 {
 	void *raw;
@@ -100,6 +107,10 @@ static const struct inode_operations bpf_prog_iops = { };
 static const struct inode_operations bpf_map_iops = { };
 static const struct inode_operations bpf_link_iops = { };
 
+static int bpf_mkprog(struct dentry *dentry, umode_t mode, void *arg);
+static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg);
+static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg);
+
 static struct inode *bpf_get_inode(struct super_block *sb,
 				   const struct inode *dir,
 				   umode_t mode)
@@ -184,12 +195,62 @@ static int tag_dir_inode(const struct bpf_dir_tag *tag,
 	}
 
 	t->type = tag->type;
+	t->inherit_objects = tag->inherit_objects;
+	kref_get(&t->inherit_objects->refcnt);
 	t->private = kn;
 
 	inode->i_private = t;
 	return 0;
 }
 
+/* populate_dir - populate directory with bpf objects in a tag's
+ * inherit_objects.
+ * @dir: dentry of the directory.
+ * @inode: inode of the directory.
+ *
+ * Called from mkdir. Must be called after dentry has been finalized.
+ */
+static int populate_dir(struct dentry *dir, struct inode *inode)
+{
+	struct bpf_dir_tag *tag = inode_tag(inode);
+	struct bpf_inherit_entry *e;
+	struct dentry *child;
+	int ret;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &tag->inherit_objects->list, list) {
+		child = lookup_one_len_unlocked(e->name.name, dir,
+						strlen(e->name.name));
+		if (unlikely(IS_ERR(child))) {
+			ret = PTR_ERR(child);
+			break;
+		}
+
+		switch (e->type) {
+		case BPF_TYPE_PROG:
+			ret = bpf_mkprog(child, e->mode, e->obj);
+			break;
+		case BPF_TYPE_MAP:
+			ret = bpf_mkmap(child, e->mode, e->obj);
+			break;
+		case BPF_TYPE_LINK:
+			ret = bpf_mklink(child, e->mode, e->obj);
+			break;
+		default:
+			ret = -EPERM;
+			break;
+		}
+		dput(child);
+		if (ret)
+			break;
+
+		/* To match bpf_any_put in bpf_free_inode. */
+		bpf_any_get(e->obj, e->type);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
 static void bpf_dentry_finalize(struct dentry *dentry, struct inode *inode,
 				struct inode *dir)
 {
@@ -227,6 +288,12 @@ static int bpf_mkdir(struct user_namespace *mnt_userns, struct inode *dir,
 	inc_nlink(dir);
 
 	bpf_dentry_finalize(dentry, inode, dir);
+
+	if (tag) {
+		err = populate_dir(dentry, inode);
+		if (err)
+			return err;
+	}
 	return 0;
 }
 
@@ -463,6 +530,30 @@ static int bpf_symlink(struct user_namespace *mnt_userns, struct inode *dir,
 	return 0;
 }
 
+/* unpopulate_dir - remove pre-populated entries from directory.
+ * @dentry: dentry of directory
+ * @inode: inode of directory
+ *
+ * Called from rmdir.
+ */
+static void unpopulate_dir(struct dentry *dentry, struct inode *inode)
+{
+	struct bpf_dir_tag *tag = inode_tag(inode);
+	struct bpf_inherit_entry *e;
+	struct dentry *child;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &tag->inherit_objects->list, list) {
+		child = d_hash_and_lookup(dentry, &e->name);
+		if (unlikely(IS_ERR(child)))
+			continue;
+
+		simple_unlink(inode, child);
+		dput(child);
+	}
+	rcu_read_unlock();
+}
+
 static void untag_dir_inode(struct inode *dir)
 {
 	struct bpf_dir_tag *tag = inode_tag(dir);
@@ -471,13 +562,16 @@ static void untag_dir_inode(struct inode *dir)
 
 	dir->i_private = NULL;
 	kernfs_put(tag->private);
+	kref_put(&tag->inherit_objects->refcnt, free_obj_list);
 	kfree(tag);
 }
 
 static int bpf_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	if (inode_tag(dir))
+	if (inode_tag(dir)) {
+		unpopulate_dir(dentry, dir);
 		untag_dir_inode(dir);
+	}
 
 	return simple_rmdir(dir, dentry);
 }
diff --git a/kernel/bpf/inode.h b/kernel/bpf/inode.h
index 2cfeef39e861..a8207122643d 100644
--- a/kernel/bpf/inode.h
+++ b/kernel/bpf/inode.h
@@ -4,11 +4,42 @@
 #ifndef __BPF_INODE_H_
 #define __BPF_INODE_H_
 
+#include
+#include
+
+enum bpf_type {
+	BPF_TYPE_UNSPEC = 0,
+	BPF_TYPE_PROG,
+	BPF_TYPE_MAP,
+	BPF_TYPE_LINK,
+};
+
 enum tag_type {
 	/* The directory is a replicate of a kernfs directory hierarchy. */
 	BPF_DIR_KERNFS_REP = 0,
 };
 
+/* Entry for bpf_dir_tag->inherit_objects.
+ *
+ * When a new directory is created from a tagged directory, the new directory
+ * will be populated with bpf objects in the tag's inherit_objects list. Each
+ * entry holds a reference of a bpf object and the information needed to
+ * recreate the object's entry in the new directory.
+ */
+struct bpf_inherit_entry {
+	struct list_head list;
+	void *obj; /* bpf object to inherit. */
+	enum bpf_type type; /* type of the object (prog, map or link). */
+	struct qstr name; /* name of the entry. */
+	umode_t mode; /* access mode of the entry. */
+};
+
+struct obj_list {
+	struct list_head list;
+	struct kref refcnt;
+	struct inode *root;
+};
+
 /* A tag for bpffs directories. It carries special information about a
  * directory. For example, BPF_DIR_KERNFS_REP denotes that the directory is
  * a replicate of a kernfs hierarchy. Pinning a certain type of objects tags
@@ -16,6 +47,8 @@ enum tag_type {
  */
 struct bpf_dir_tag {
 	enum tag_type type;
+	/* list of bpf objects that a directory inherits from its parent. */
+	struct obj_list *inherit_objects;
 	void *private; /* tag private data */
 };
 
-- 
2.35.0.rc2.247.g8bbb082509-goog

From nobody Tue Jun 23 21:16:01 2026
Date: Tue, 1 Feb 2022 12:55:32 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-4-haoluo@google.com>
References: <20220201205534.1962784-1-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 3/5] bpf: cgroup_view iter
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
 Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
 linux-kernel@vger.kernel.org, Hao Luo

Introduce a new type of iter prog: 'cgroup_view'. It prints out a cgroup's
state. cgroup_view is meant to be used together with directory tagging.
When cgroup_view is pinned in a directory, it tags that directory as
KERNFS_REP, i.e. a replicate of the cgroup hierarchy. Whenever a
subdirectory is created, if a child cgroup of the same name exists, the
subdirectory inherits the pinned cgroup_view object from its parent and
holds a reference to the corresponding kernfs node.
The cgroup_view prog takes a pointer to the cgroup and can use the family
of seq_print helpers to print out cgroup state. A typical use case of
cgroup_view is to extend the cgroupfs interface.

Signed-off-by: Hao Luo
---
 include/linux/bpf.h           |   2 +
 kernel/bpf/Makefile           |   2 +-
 kernel/bpf/bpf_iter.c         |  11 ++++
 kernel/bpf/cgroup_view_iter.c | 114 ++++++++++++++++++++++++++++++++++
 4 files changed, 128 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/cgroup_view_iter.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6eb0b180d33b..494927b2b3c2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1610,6 +1610,7 @@ typedef const struct bpf_func_proto *
 
 enum bpf_iter_feature {
 	BPF_ITER_RESCHED = BIT(0),
+	BPF_ITER_INHERIT = BIT(1),
 };
 
 #define BPF_ITER_CTX_ARG_MAX 2
@@ -1647,6 +1648,7 @@ bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, struct bpf_prog *prog);
 int bpf_iter_new_fd(struct bpf_link *link);
 bool bpf_link_is_iter(struct bpf_link *link);
+bool bpf_link_support_inherit(struct bpf_link *link);
 struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop);
 int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx);
 void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux,
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1a9be6a4b9f..d9d2b8541ba7 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,7 +8,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
-obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
+obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o cgroup_view_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 110029ede71e..ff5577a5f73a 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -496,6 +496,17 @@ bool bpf_link_is_iter(struct bpf_link *link)
 	return link->ops == &bpf_iter_link_lops;
 }
 
+bool bpf_link_support_inherit(struct bpf_link *link)
+{
+	struct bpf_iter_link *iter_link;
+
+	if (!bpf_link_is_iter(link))
+		return false;
+
+	iter_link = container_of(link, struct bpf_iter_link, link);
+	return iter_link->tinfo->reg_info->feature & BPF_ITER_INHERIT;
+}
+
 int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr,
 			 struct bpf_prog *prog)
 {
diff --git a/kernel/bpf/cgroup_view_iter.c b/kernel/bpf/cgroup_view_iter.c
new file mode 100644
index 000000000000..a44d115235c4
--- /dev/null
+++ b/kernel/bpf/cgroup_view_iter.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2022 Google */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "inode.h"
+
+static void *cgroup_view_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct bpf_dir_tag *tag;
+	struct kernfs_node *kn;
+	struct cgroup *cgroup;
+	struct inode *dir;
+
+	/* Only one session is supported. */
+	if (*pos > 0)
+		return NULL;
+
+	dir = d_inode(seq->file->f_path.dentry->d_parent);
+	tag = dir->i_private;
+	if (!tag)
+		return NULL;
+
+	kn = tag->private;
+
+	rcu_read_lock();
+	cgroup = rcu_dereference(*(void __rcu __force **)&kn->priv);
+	if (!cgroup || !cgroup_tryget(cgroup))
+		cgroup = NULL;
+	rcu_read_unlock();
+
+	if (!cgroup)
+		return NULL;
+
+	if (*pos == 0)
+		++*pos;
+	return cgroup;
+}
+
+static void *cgroup_view_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return NULL;
+}
+
+struct bpf_iter__cgroup_view {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct cgroup *, cgroup);
+};
+
+DEFINE_BPF_ITER_FUNC(cgroup_view, struct bpf_iter_meta *meta, struct cgroup *cgroup)
+
+static int cgroup_view_seq_show(struct seq_file *seq, void *v)
+{
+	struct bpf_iter__cgroup_view ctx;
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	ctx.meta = &meta;
+	ctx.cgroup = v;
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, false);
+	if (prog)
+		ret = bpf_iter_run_prog(prog, &ctx);
+
+	return ret;
+}
+
+static void cgroup_view_seq_stop(struct seq_file *seq, void *v)
+{
+	if (v)
+		cgroup_put(v);
+}
+
+static const struct seq_operations cgroup_view_seq_ops = {
+	.start = cgroup_view_seq_start,
+	.next = cgroup_view_seq_next,
+	.stop = cgroup_view_seq_stop,
+	.show = cgroup_view_seq_show,
+};
+
+BTF_ID_LIST(btf_cgroup_id)
+BTF_ID(struct, cgroup)
+
+static const struct bpf_iter_seq_info cgroup_view_seq_info = {
+	.seq_ops = &cgroup_view_seq_ops,
+	.init_seq_private = NULL,
+	.fini_seq_private = NULL,
+	.seq_priv_size = 0,
+};
+
+static struct bpf_iter_reg cgroup_view_reg_info = {
+	.target = "cgroup_view",
+	.feature = BPF_ITER_INHERIT,
+	.ctx_arg_info_size = 1,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__cgroup_view, cgroup),
+		  PTR_TO_BTF_ID },
+	},
+	.seq_info = &cgroup_view_seq_info,
+};
+
+static int __init cgroup_view_init(void)
+{
+	cgroup_view_reg_info.ctx_arg_info[0].btf_id = *btf_cgroup_id;
+	return bpf_iter_reg_target(&cgroup_view_reg_info);
+}
+
+late_initcall(cgroup_view_init);
-- 
2.35.0.rc2.247.g8bbb082509-goog

From nobody Tue Jun 23 21:16:01 2026
Date: Tue, 1 Feb 2022 12:55:33 -0800
In-Reply-To: <20220201205534.1962784-1-haoluo@google.com>
Message-Id: <20220201205534.1962784-5-haoluo@google.com>
References: <20220201205534.1962784-1-haoluo@google.com>
Subject: [PATCH RFC bpf-next v2 4/5] bpf: Pin cgroup_view
From: Hao Luo
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann
Cc: Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt,
 Joe Burton, Stanislav Fomichev, bpf@vger.kernel.org,
 linux-kernel@vger.kernel.org, Hao Luo

When a cgroup_view iter is pinned into bpffs, the pinning adds a tag to the
parent directory, indicating that the directory hierarchy underneath is a
replicate of the cgroup hierarchy. Each directory is connected to a cgroup
directory. Whenever a subdirectory is created, if a subcgroup of the same
name exists, the subdirectory is populated with entries for the bpf objects
registered in the tag's inherit list.
The inherit list is formed by the objects pinned in the top-level tagged
directory. For example,

  bpf_obj_pin(cgroup_view_link, "/sys/fs/bpf/A/obj");

pins a link in A. A becomes tagged and the link object is registered in the
inherit list of A's tag.

  mkdir("/sys/fs/bpf/A/B");

When A/B is created, B inherits the pinned objects in A. B is populated
with objects.

  > ls /sys/fs/bpf/A/B
  obj

Currently, only pinning a cgroup_view link can tag a directory. Once
tagged, only rmdir can remove the tag.

Signed-off-by: Hao Luo
---
 kernel/bpf/inode.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 9ae17a2bf779..b71840bf979d 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -71,6 +71,7 @@ static void free_obj_list(struct kref *kref)
 	list_for_each_entry(e, &list->list, list) {
 		list_del_rcu(&e->list);
 		bpf_any_put(e->obj, e->type);
+		kfree(e->name.name);
 		kfree(e);
 	}
 	kfree(list);
@@ -486,9 +487,20 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg)
 			     &bpffs_map_fops : &bpffs_obj_fops);
 }
 
+static int
+bpf_inherit_object(struct dentry *dentry, umode_t mode, void *obj,
+		   enum bpf_type type);
+
 static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg)
 {
 	struct bpf_link *link = arg;
+	int err;
+
+	if (bpf_link_support_inherit(link)) {
+		err = bpf_inherit_object(dentry, mode, link, BPF_TYPE_LINK);
+		if (err)
+			return err;
+	}
 
 	return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops,
 			     bpf_link_is_iter(link) ?
@@ -586,6 +598,78 @@ static const struct inode_operations bpf_dir_iops = {
 	.unlink = simple_unlink,
 };
 
+/* bpf_inherit_object - register an object in a directory tag's inherit list
+ * @dentry: dentry of the location to pin
+ * @mode: mode of created file entry
+ * @obj: bpf object
+ * @type: type of bpf object
+ *
+ * Could be called from bpf_obj_do_pin() or from mkdir().
+ */
+static int bpf_inherit_object(struct dentry *dentry, umode_t mode,
+			      void *obj, enum bpf_type type)
+{
+	struct inode *dir = d_inode(dentry->d_parent);
+	struct obj_list *inherits;
+	struct bpf_inherit_entry *e;
+	struct bpf_dir_tag *tag;
+	const char *name;
+	bool queued = false, new_tag = false;
+
+	/* allocate bpf_dir_tag */
+	tag = inode_tag(dir);
+	if (!tag) {
+		new_tag = true;
+		tag = kzalloc(sizeof(struct bpf_dir_tag), GFP_KERNEL);
+		if (unlikely(!tag))
+			return -ENOMEM;
+
+		tag->type = BPF_DIR_KERNFS_REP;
+		inherits = kzalloc(sizeof(struct obj_list), GFP_KERNEL);
+		if (unlikely(!inherits)) {
+			kfree(tag);
+			return -ENOMEM;
+		}
+
+		kref_init(&inherits->refcnt);
+		INIT_LIST_HEAD(&inherits->list);
+		tag->inherit_objects = inherits;
+		/* initial tag points to the default root cgroup. */
+		tag->private = cgrp_dfl_root.kf_root->kn;
+		dir->i_private = tag;
+	} else {
+		inherits = tag->inherit_objects;
+	}
+
+	list_for_each_entry_rcu(e, &inherits->list, list) {
+		if (!strcmp(dentry->d_name.name, e->name.name)) {
+			queued = true;
+			break;
+		}
+	}
+
+	/* queue in tag's inherit_list. */
+	if (!queued) {
+		e = kzalloc(sizeof(struct bpf_inherit_entry), GFP_KERNEL);
+		if (!e) {
+			if (new_tag) {
+				kfree(tag);
+				kfree(inherits);
+			}
+			return -ENOMEM;
+		}
+
+		INIT_LIST_HEAD(&e->list);
+		e->mode = mode;
+		e->obj = obj;
+		e->type = type;
+		name = kstrdup(dentry->d_name.name, GFP_USER | __GFP_NOWARN);
+		e->name = (struct qstr)QSTR_INIT(name, strlen(name));
+		list_add_rcu(&e->list, &inherits->list);
+	}
+	return 0;
+}
+
 /* pin iterator link into bpffs */
 static int bpf_iter_link_pin_kernel(struct dentry *parent,
 				    const char *name, struct bpf_link *link)
-- 
2.35.0.rc2.247.g8bbb082509-goog

From nobody Tue Jun 23 21:16:01 2026
qyyQptR7MsdpfC7LZCBs+JmxPf700zWwkxD6xiTZRkk8IcluJD7XR60iDT01WAl6DP+G ygG2vSjX8X5sGmeHwzxsFlPFFeaxnrVNuesRujenB9aTqUvV2ILCLTW4bckP+y9yJhuw HejzUN2pY4TY054MLd4OIijtaAGbHt0mdUDYd2hE2Iywrk/r50G5z4cfcIE5K7DBjaNO +a2Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=pL1E8fJxKw13cCsiOnlzxl+3OEwkHyaKnHI4u2VvDps=; b=F5YXosHah9jMSSPd33Odkn2nV6RcvtIMnrE1mub3WM+zAGozoHMLjh28tOxhXOVYOm ZnHUpzLw1gxXfCcb90Bl1O5MzCNSjVwA8yPvyLxzUyJcdeNfTJwpDPBLYGlp+qWu1k5G L75+kwIMkOnZZATN1YvKGanfkXOY+L1at3+Cp49aQkcHkTl609nXX+K5tisUfKT1qreO WmhSXTVNcvdRpQGt/cRAZMpJ4qVnvsEqJn2UT86NvyUXH5pblqOgoBQh1wNyigv//oKe hWV5aDi5e+9P10U5bIKIFfDdIAy6UyhNr6hIfEIq8F+Ev/BVn3jwy6BCK0gWKo4CsOha AxPQ== X-Gm-Message-State: AOAM532EYJ54vgI4UTrIgIrkm7rOCmXhNBuGlx5hB/WycVK3YN14FxJI ykhhEaEGp28DTY14UhZet7QoBIRzcVk= X-Google-Smtp-Source: ABdhPJx2K4/2udytZZXqNqYSXlX+Tejacq3qe7zEpZvsjURu1Lo9LnFuORhx1dED7XoxakLOdMSUBRGlftA= X-Received: from haoluo.svl.corp.google.com ([2620:15c:2cd:202:1cdb:a263:2495:80fd]) (user=haoluo job=sendgmr) by 2002:a25:c5cc:: with SMTP id v195mr41095321ybe.373.1643748951613; Tue, 01 Feb 2022 12:55:51 -0800 (PST) Date: Tue, 1 Feb 2022 12:55:34 -0800 In-Reply-To: <20220201205534.1962784-1-haoluo@google.com> Message-Id: <20220201205534.1962784-6-haoluo@google.com> Mime-Version: 1.0 References: <20220201205534.1962784-1-haoluo@google.com> X-Mailer: git-send-email 2.35.0.rc2.247.g8bbb082509-goog Subject: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link From: Hao Luo To: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann Cc: Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Shakeel Butt , Joe Burton , Stanislav Fomichev , bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Hao Luo Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; 
charset="utf-8"

A selftest for demonstrating and testing cgroup_view pinning, directory tagging, and reading (cat) of pinned cgroup_view objects. The added cgroup_view prog is a naive use case, which merely reads the cgroup's id.

This selftest introduces two metrics for measuring cgroup-level scheduling queueing delays:

 - queue_self: queueing delays caused by waiting behind the task's own cgroup.
 - queue_other: queueing delays caused by waiting behind other cgroups.

The selftest collects task-level queueing delays and breaks them into queue_self delays and queue_other delays. This is done by hooking handlers at context switch and wakeup. Only the max queue_self and queue_other delays are recorded. This breakdown is helpful in analyzing the cause of long scheduling delays. A large value in queue_self indicates contention that comes from within the cgroup. A large value in queue_other indicates contention between cgroups.

A new iter prog of type cgroup_view, "dump_cgroup_lat", is implemented. It reads the recorded delays and dumps the stats through bpffs interfaces. Specifically, cgroup_view is initially pinned in an empty directory in bpffs, which effectively turns the directory into a mirror of the cgroup hierarchy. When a new cgroup is created, we manually create a corresponding directory in bpffs. The new directory will contain a file prepopulated with the pinned cgroup_view object. Reading that file yields the queue stats of the corresponding cgroup.
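The breakdown described above can be sketched in plain userspace C. This is only an illustration of the arithmetic, not the BPF program from this patch: the names delay_split, split_delay and record_max are hypothetical, and the inputs stand in for the wall-clock wait and the time the task's own cgroup ran during that wait.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for one queueing-delay breakdown (nanoseconds). */
struct delay_split {
	uint64_t queue_self;	/* time spent waiting behind the same cgroup */
	uint64_t queue_other;	/* time spent waiting behind other cgroups */
};

/* Split a total queueing delay into its two parts.
 * total_ns:      time from enqueue until the task got the cpu
 * cgroup_ran_ns: how long the task's own cgroup ran during that window
 */
static struct delay_split split_delay(uint64_t total_ns, uint64_t cgroup_ran_ns)
{
	struct delay_split s;

	/* the self part can never exceed the total wait, so clamp it */
	if (cgroup_ran_ns > total_ns)
		cgroup_ran_ns = total_ns;

	s.queue_self = cgroup_ran_ns;
	s.queue_other = total_ns - cgroup_ran_ns;
	return s;
}

/* Fold one sample into the running maxima; only the max is kept. */
static void record_max(struct delay_split *acc, struct delay_split sample)
{
	assert(acc != NULL);

	if (sample.queue_self > acc->queue_self)
		acc->queue_self = sample.queue_self;
	if (sample.queue_other > acc->queue_other)
		acc->queue_other = sample.queue_other;
}
```

For instance, if a task waits 10 units while a sibling in the same cgroup runs for 4 of them, split_delay(10, 4) attributes 4 to queue_self and the remaining 6 to queue_other.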
Signed-off-by: Hao Luo --- .../selftests/bpf/prog_tests/pinning_cgroup.c | 143 +++++++++++ tools/testing/selftests/bpf/progs/bpf_iter.h | 7 + .../bpf/progs/bpf_iter_cgroup_view.c | 232 ++++++++++++++++++ 3 files changed, 382 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c diff --git a/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c b/tool= s/testing/selftests/bpf/prog_tests/pinning_cgroup.c new file mode 100644 index 000000000000..ebef154e63c9 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/pinning_cgroup.c @@ -0,0 +1,143 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include "bpf_iter_cgroup_view.skel.h" + +static void spin_on_cpu(int seconds) +{ + time_t start, now; + + start =3D time(NULL); + do { + now =3D time(NULL); + } while (now - start < seconds); +} + +static void do_work(const char *cgroup) +{ + int i, cpu =3D 0, pid; + char cmd[128]; + + /* make cgroup threaded */ + snprintf(cmd, 128, "echo threaded > %s/cgroup.type", cgroup); + system(cmd); + + /* try to enable cpu controller. this may fail if the cpu controller + * is not available in cgroup.controllers or there is a cgroup v1 already + * mounted in the system.
+ */ + snprintf(cmd, 128, "echo \"+cpu\" > %s/cgroup.subtree_control", cgroup); + system(cmd); + + /* launch two children, both running in child cgroup */ + for (i =3D 0; i < 2; ++i) { + pid =3D fork(); + if (pid =3D=3D 0) { + /* attach to cgroup */ + snprintf(cmd, 128, "echo %d > %s/cgroup.procs", getpid(), cgroup); + system(cmd); + + /* pin process to target cpu */ + snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid()); + system(cmd); + + spin_on_cpu(3); /* spin on cpu for 3 seconds */ + exit(0); + } + } + + /* pin process to target cpu */ + snprintf(cmd, 128, "taskset -pc %d %d", cpu, getpid()); + system(cmd); + + spin_on_cpu(3); /* spin on cpu for 3 seconds */ + wait(NULL); +} + +static void check_pinning(const char *rootpath) +{ + const char *child_cgroup =3D "/sys/fs/cgroup/child"; + struct bpf_iter_cgroup_view *skel; + struct bpf_link *link; + struct stat statbuf =3D {}; + FILE *file; + unsigned long queue_self, queue_other; + int cgroup_id, link_fd; + char path[64]; + char buf[64]; + + skel =3D bpf_iter_cgroup_view__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_iter_cgroup_view__open_and_load")) + return; + + /* pin path at parent dir. 
*/ + link =3D bpf_program__attach_iter(skel->progs.dump_cgroup_lat, NULL); + link_fd =3D bpf_link__fd(link); + + /* test initial pinning */ + snprintf(path, 64, "%s/obj", rootpath); + ASSERT_OK(bpf_obj_pin(link_fd, path), "bpf_obj_pin"); + ASSERT_OK(stat(path, &statbuf), "pinned_object_exists"); + + /* test mkdir */ + mkdir(child_cgroup, 0755); + snprintf(path, 64, "%s/child", rootpath); + ASSERT_OK(mkdir(path, 0755), "mkdir"); + + /* test that new dir has been pre-populated with pinned objects */ + snprintf(path, 64, "%s/child/obj", rootpath); + ASSERT_OK(stat(path, &statbuf), "populate"); + + bpf_iter_cgroup_view__attach(skel); + do_work(child_cgroup); + bpf_iter_cgroup_view__detach(skel); + + /* test cat inherited objects */ + file =3D fopen(path, "r"); + if (ASSERT_OK_PTR(file, "open")) { + ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat"); + ASSERT_EQ(sscanf(buf, "cgroup_id: %8d", &cgroup_id), 1, "output"); + + ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat"); + ASSERT_EQ(sscanf(buf, "queue_self: %8lu", &queue_self), 1, "output"); + + ASSERT_OK_PTR(fgets(buf, sizeof(buf), file), "cat"); + ASSERT_EQ(sscanf(buf, "queue_other: %8lu", &queue_other), 1, "output"); + + fclose(file); + } + + /* test rmdir */ + snprintf(path, 64, "%s/child", rootpath); + ASSERT_OK(rmdir(path), "rmdir"); + + /* unpin object */ + snprintf(path, 64, "%s/obj", rootpath); + ASSERT_OK(unlink(path), "unlink"); + + bpf_link__destroy(link); + bpf_iter_cgroup_view__destroy(skel); +} + +void test_pinning_cgroup(void) +{ + char tmpl[] =3D "/sys/fs/bpf/pinning_test_XXXXXX"; + char *rootpath; + + system("mount -t cgroup2 none /sys/fs/cgroup"); + system("mount -t bpf bpffs /sys/fs/bpf"); + + rootpath =3D mkdtemp(tmpl); + chmod(rootpath, 0755); + + /* check pinning map, prog and link in kernfs */ + if (test__start_subtest("pinning")) + check_pinning(rootpath); + + rmdir(rootpath); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_iter.h b/tools/testing/s= elftests/bpf/progs/bpf_iter.h index 
8cfaeba1ddbf..506bb3efd9b4 100644 --- a/tools/testing/selftests/bpf/progs/bpf_iter.h +++ b/tools/testing/selftests/bpf/progs/bpf_iter.h @@ -16,6 +16,7 @@ #define bpf_iter__bpf_map_elem bpf_iter__bpf_map_elem___not_used #define bpf_iter__bpf_sk_storage_map bpf_iter__bpf_sk_storage_map___not_us= ed #define bpf_iter__sockmap bpf_iter__sockmap___not_used +#define bpf_iter__cgroup_view bpf_iter__cgroup_view___not_used #define btf_ptr btf_ptr___not_used #define BTF_F_COMPACT BTF_F_COMPACT___not_used #define BTF_F_NONAME BTF_F_NONAME___not_used @@ -37,6 +38,7 @@ #undef bpf_iter__bpf_map_elem #undef bpf_iter__bpf_sk_storage_map #undef bpf_iter__sockmap +#undef bpf_iter__cgroup_view #undef btf_ptr #undef BTF_F_COMPACT #undef BTF_F_NONAME @@ -132,6 +134,11 @@ struct bpf_iter__sockmap { struct sock *sk; }; =20 +struct bpf_iter__cgroup_view { + struct bpf_iter_meta *meta; + struct cgroup *cgroup; +} __attribute__((preserve_access_index)); + struct btf_ptr { void *ptr; __u32 type_id; diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c b/too= ls/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c new file mode 100644 index 000000000000..43404c21aee3 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_cgroup_view.c @@ -0,0 +1,232 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Google */ +#include "bpf_iter.h" +#include +#include +#include + +char _license[] SEC("license") =3D "GPL"; + +#define TASK_RUNNING 0 +#define BPF_F_CURRENT_CPU 0xffffffffULL + +extern void fair_sched_class __ksym; +extern bool CONFIG_FAIR_GROUP_SCHED __kconfig; +extern bool CONFIG_CGROUP_SCHED __kconfig; + +struct wait_lat { + /* Queue_self stands for the latency a task experiences while waiting + * behind the tasks that are from the same cgroup. + * + * Queue_other stands for the latency a task experiences while waiting + * behind the tasks that are from other cgroups. + * + * For example, if there are three tasks: A, B and C. 
Suppose A and B + * are in the same cgroup and C is in another cgroup and we see A has + * a queueing latency X milliseconds. Let's say during the X milliseconds, + * B has run for Y milliseconds. We can break down X to two parts: time + * when B is on cpu, that is Y; the time when C is on cpu, that is X - Y. + * + * Queue_self is the former (Y) while queue_other is the latter (X - Y). + * + * large value in queue_self is an indication of contention within a + * cgroup; while large value in queue_other is an indication of + * contention from multiple cgroups. + */ + u64 queue_self; + u64 queue_other; +}; + +struct timestamp { + /* timestamp when last queued */ + u64 tsp; + + /* cgroup exec_clock when last queued */ + u64 exec_clock; +}; + +/* Map to store per-cgroup wait latency */ +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, u64); + __type(value, struct wait_lat); + __uint(max_entries, 65532); +} cgroup_lat SEC(".maps"); + +/* Map to store per-task queue timestamp */ +struct { + __uint(type, BPF_MAP_TYPE_TASK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct timestamp); +} start SEC(".maps"); + +/* adapt from task_cfs_rq in kernel/sched/sched.h */ +__always_inline +struct cfs_rq *task_cfs_rq(struct task_struct *t) +{ + if (!CONFIG_FAIR_GROUP_SCHED) + return NULL; + + return BPF_CORE_READ(&t->se, cfs_rq); +} + +/* record enqueue timestamp */ +__always_inline +static int trace_enqueue(struct task_struct *t) +{ + u32 pid =3D t->pid; + struct timestamp *ptr; + struct cfs_rq *cfs_rq; + + if (!pid) + return 0; + + /* only measure for CFS tasks */ + if (t->sched_class !=3D &fair_sched_class) + return 0; + + ptr =3D bpf_task_storage_get(&start, t, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!ptr) + return 0; + + /* CONFIG_FAIR_GROUP_SCHED may not be enabled */ + cfs_rq =3D task_cfs_rq(t); + if (!cfs_rq) + return 0; + + ptr->tsp =3D bpf_ktime_get_ns(); + ptr->exec_clock =3D BPF_CORE_READ(cfs_rq, exec_clock); + return 
0; +} + +SEC("tp_btf/sched_wakeup") +int handle__sched_wakeup(u64 *ctx) +{ + /* TP_PROTO(struct task_struct *p) */ + struct task_struct *p =3D (void *)ctx[0]; + + return trace_enqueue(p); +} + +SEC("tp_btf/sched_wakeup_new") +int handle__sched_wakeup_new(u64 *ctx) +{ + /* TP_PROTO(struct task_struct *p) */ + struct task_struct *p =3D (void *)ctx[0]; + + return trace_enqueue(p); +} + +/* task_group() from kernel/sched/sched.h */ +__always_inline +struct task_group *task_group(struct task_struct *p) +{ + if (!CONFIG_CGROUP_SCHED) + return NULL; + + return BPF_CORE_READ(p, sched_task_group); +} + +__always_inline +struct cgroup *task_cgroup(struct task_struct *p) +{ + struct task_group *tg; + + tg =3D task_group(p); + if (!tg) + return NULL; + + return BPF_CORE_READ(tg, css).cgroup; +} + +__always_inline +u64 max(u64 x, u64 y) +{ + return x > y ? x : y; +} + +SEC("tp_btf/sched_switch") +int handle__sched_switch(u64 *ctx) +{ + /* TP_PROTO(bool preempt, struct task_struct *prev, + * struct task_struct *next) + */ + struct task_struct *prev =3D (struct task_struct *)ctx[1]; + struct task_struct *next =3D (struct task_struct *)ctx[2]; + u64 delta, delta_self, delta_other, id; + struct cfs_rq *cfs_rq; + struct timestamp *tsp; + struct wait_lat *lat; + struct cgroup *cgroup; + + /* ivcsw: treat like an enqueue event and store timestamp */ + if (prev->__state =3D=3D TASK_RUNNING) + trace_enqueue(prev); + + /* only measure for CFS tasks */ + if (next->sched_class !=3D &fair_sched_class) + return 0; + + /* fetch timestamp and calculate delta */ + tsp =3D bpf_task_storage_get(&start, next, 0, 0); + if (!tsp) + return 0; /* missed enqueue */ + + /* CONFIG_FAIR_GROUP_SCHED may not be enabled */ + cfs_rq =3D task_cfs_rq(next); + if (!cfs_rq) + return 0; + + /* cpu controller may not be enabled */ + cgroup =3D task_cgroup(next); + if (!cgroup) + return 0; + + /* calculate self delay and other delay */ + delta =3D bpf_ktime_get_ns() - tsp->tsp; + delta_self =3D BPF_CORE_READ(cfs_rq, 
exec_clock) - tsp->exec_clock; + if (delta_self > delta) + delta_self =3D delta; + delta_other =3D delta - delta_self; + + /* insert into cgroup_lat map */ + id =3D BPF_CORE_READ(cgroup, kn, id); + lat =3D bpf_map_lookup_elem(&cgroup_lat, &id); + if (!lat) { + struct wait_lat w =3D { + .queue_self =3D delta_self, + .queue_other =3D delta_other, + }; + + bpf_map_update_elem(&cgroup_lat, &id, &w, BPF_ANY); + } else { + lat->queue_self =3D max(delta_self, lat->queue_self); + lat->queue_other =3D max(delta_other, lat->queue_other); + } + + bpf_task_storage_delete(&start, next); + return 0; +} + +SEC("iter/cgroup_view") +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx) +{ + struct seq_file *seq =3D ctx->meta->seq; + struct cgroup *cgroup =3D ctx->cgroup; + struct wait_lat *lat; + u64 id =3D cgroup->kn->id; /* lookup key, same id as in handle__sched_switch */ + + BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", id); + lat =3D bpf_map_lookup_elem(&cgroup_lat, &id); + if (lat) { + BPF_SEQ_PRINTF(seq, "queue_self: %8lu\n", lat->queue_self); + BPF_SEQ_PRINTF(seq, "queue_other: %8lu\n", lat->queue_other); + } else { + /* print anyway for universal parsing logic in userspace. */ + BPF_SEQ_PRINTF(seq, "queue_self: %8d\n", 0); + BPF_SEQ_PRINTF(seq, "queue_other: %8d\n", 0); + } + return 0; +} --=20 2.35.0.rc2.247.g8bbb082509-goog