From nobody Sat Feb  7 16:39:57 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A5B7FC00140
	for <linux-kernel@archiver.kernel.org>; Fri,  5 Aug 2022 21:50:45 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S241646AbiHEVuo (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Fri, 5 Aug 2022 17:50:44 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38758 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S241413AbiHEVts (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 5 Aug 2022 17:49:48 -0400
Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com
 [IPv6:2607:f8b0:4864:20::1149])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6ED087821D
        for <linux-kernel@vger.kernel.org>;
 Fri,  5 Aug 2022 14:49:13 -0700 (PDT)
Received: by mail-yw1-x1149.google.com with SMTP id
 00721157ae682-32851d0f8beso31037907b3.22
        for <linux-kernel@vger.kernel.org>;
 Fri, 05 Aug 2022 14:49:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=cc:to:from:subject:references:mime-version:message-id:in-reply-to
         :date:from:to:cc;
        bh=jzlqfxGqbALVlE62T11CZvVYWQyynjwbnPgXmnDU8Ms=;
        b=PnkhDO72c+35oAf4MOmM8Hfo7DqzG1vCWTh7pI6jFbsVDYaIgzfp1o57anUh70rgQB
         P85jllN982gOzW7oT07kQT4nQe1PSZh0ZIQJ6KqHtxSW45FS7wE1A++HDiJ+7ov2fqRb
         Nif6fkOgbeUBhND4LV1i90KDjrRXOXhzkagnVVrngsU+jkTO/OT6B/RmS5KaZAqe2X24
         yfMXGXjOalg6AiSXvJrYeBOBKt/jh4HfKSkoaSrbVM4D8CrVihmUaqrkA34D6PUj98ax
         mUa9Zk+wv+gUNPjGAJRWG/EvMN+ZclOzFH+JuzPik5ZRBIqEM2Js1W8BkNA8YNDcbNf6
         bhYg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:from:subject:references:mime-version:message-id:in-reply-to
         :date:x-gm-message-state:from:to:cc;
        bh=jzlqfxGqbALVlE62T11CZvVYWQyynjwbnPgXmnDU8Ms=;
        b=OdF6LoK0wDy+JeZY4RDsqmIZrrnGSdvZzc1UbhzSKfldslRth3nUhnk/SpE0qKWDa1
         Fn4PyzKulYsJ0Or8NcBowB0f4EJH232w4mM2Jsobinb4valHy002qVdPnIvFOkwBw8Eq
         p9fFUscIeJVFdSsCM9A3vqlBbaiyY+MfaGXhc+do2yfdtkRshrGXyhhAW49bQV6F3hiP
         wArg/Ey9JMKzwhjYH3SFYJZ1fau+BadZBuhXxIcGshwuXjqftXWDgRWh83NR/ppfgyQq
         fSGNm3Xkm88H4d8D7kUwsUSlkvXgGsm3UBHRL8lLx6uWAfXTIO4CS/67dGyICg98lbCX
         gwxQ==
X-Gm-Message-State: ACgBeo1yxu3DHXL86dZXyFPTbIVCL/7NzhBGoXQNW5XhezCCxieFQkgN
        RKOSNKIffKO/4FOyCVkeWqKj7kOuyt8kISMtOIgl0+3NxT9tBvNnl5YQ0wamv43s1J7FpW9R/D0
        iAgpQOyfkJPOy/iwfQGV/qyMTNvNUJtyJUZIzOrZtusAoP+zGOUY08OcAx5h8rnWu6PLV3w==
X-Google-Smtp-Source: 
 AA6agR6UHXjXR4qBThOA7qNq4kyibbUbcx6W4g+nBPphyBum9Qfyds3C1747PPlApC29M8fGgZcK2BffDuM=
X-Received: from haoluo.svl.corp.google.com
 ([2620:15c:2d4:203:4f27:97db:8644:dc82])
 (user=haoluo job=sendgmr) by 2002:a0d:ead2:0:b0:31f:6484:e81d with SMTP id
 t201-20020a0dead2000000b0031f6484e81dmr7920987ywe.129.1659736152450; Fri, 05
 Aug 2022 14:49:12 -0700 (PDT)
Date: Fri,  5 Aug 2022 14:48:21 -0700
In-Reply-To: <20220805214821.1058337-1-haoluo@google.com>
Message-Id: <20220805214821.1058337-9-haoluo@google.com>
Mime-Version: 1.0
References: <20220805214821.1058337-1-haoluo@google.com>
X-Mailer: git-send-email 2.37.1.559.g78731f0fdb-goog
Subject: [PATCH bpf-next v7 8/8] selftests/bpf: add a selftest for cgroup
 hierarchical stats collection
From: Hao Luo <haoluo@google.com>
To: linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
        cgroups@vger.kernel.org, netdev@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
        Andrii Nakryiko <andrii@kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Martin KaFai Lau <martin.lau@linux.dev>,
        Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
        Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>,
        KP Singh <kpsingh@kernel.org>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Michal Hocko <mhocko@kernel.org>,
        Benjamin Tissoires <benjamin.tissoires@redhat.com>,
        John Fastabend <john.fastabend@gmail.com>,
        Michal Koutny <mkoutny@suse.com>,
        Roman Gushchin <roman.gushchin@linux.dev>,
        David Rientjes <rientjes@google.com>,
        Stanislav Fomichev <sdf@google.com>,
        Shakeel Butt <shakeelb@google.com>,
        Yosry Ahmed <yosryahmed@google.com>,
        Hao Luo <haoluo@google.com>
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Yosry Ahmed <yosryahmed@google.com>

Add a selftest that tests the whole workflow for collecting,
aggregating (flushing), and displaying cgroup hierarchical stats.

TL;DR:
- Userspace program creates a cgroup hierarchy and induces memcg reclaim
  in parts of it.
- Whenever reclaim happens, vmscan_start and vmscan_end update
  per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
  have updates.
- When userspace tries to read the stats, vmscan_dump calls rstat to flush
  the stats, and outputs the stats in text format to userspace (similar
  to cgroupfs stats).
- rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
  updates, vmscan_flush aggregates cpu readings and propagates updates
  to parents.
- Userspace program makes sure the stats are aggregated and read
  correctly.

Detailed explanation:
- The test loads tracing bpf programs, vmscan_start and vmscan_end, to
  measure the latency of cgroup reclaim. Per-cgroup readings are stored in
  percpu maps for efficiency. When a cgroup reading is updated on a cpu,
  cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
  rstat updated tree on that cpu.

- A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
  each cgroup. Reading this file invokes the program, which calls
  cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
  cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
  the stats are exposed to the user. vmscan_dump returns 1 to terminate
  iteration early, so that we only expose stats for one cgroup per read.

- An ftrace program, vmscan_flush, is also loaded and attached to
  bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
  once for each (cgroup, cpu) pair that has updates. cgroups are popped
  from the rstat tree in a bottom-up fashion, so calls will always be
  made for cgroups that have updates before their parents. The program
  aggregates percpu readings to a total per-cgroup reading, and also
  propagates them to the parent cgroup. After rstat flushing is over, all
  cgroups will have correct updated hierarchical readings (including all
  cpus and all their descendants).

- Finally, the test creates a cgroup hierarchy and induces memcg reclaim
  in parts of it, and makes sure that the stats collection, aggregation,
  and reading workflow works as expected.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Hao Luo <haoluo@google.com>
---
 .../prog_tests/cgroup_hierarchical_stats.c    | 358 ++++++++++++++++++
 .../bpf/progs/cgroup_hierarchical_stats.c     | 226 +++++++++++
 2 files changed, 584 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_hierarchi=
cal_stats.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_hierarchical_s=
tats.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_sta=
ts.c b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..9d48b7187992
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_hierarchical_stats.c
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include <asm-generic/errno.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "cgroup_hierarchical_stats.skel.h"
+
+#define PAGE_SIZE 4096
+#define MB(x) (x << 20)
+
+#define BPFFS_ROOT "/sys/fs/bpf/"
+#define BPFFS_VMSCAN BPFFS_ROOT"vmscan/"
+
+#define CG_ROOT_NAME "root"
+#define CG_ROOT_ID 1
+
+#define CGROUP_PATH(p, n) {.path =3D p"/"n, .name =3D n}
+
+static struct {
+	const char *path, *name;
+	unsigned long long id;
+	int fd;
+} cgroups[] =3D {
+	CGROUP_PATH("/", "test"),
+	CGROUP_PATH("/test", "child1"),
+	CGROUP_PATH("/test", "child2"),
+	CGROUP_PATH("/test/child1", "child1_1"),
+	CGROUP_PATH("/test/child1", "child1_2"),
+	CGROUP_PATH("/test/child2", "child2_1"),
+	CGROUP_PATH("/test/child2", "child2_2"),
+};
+
+#define N_CGROUPS ARRAY_SIZE(cgroups)
+#define N_NON_LEAF_CGROUPS 3
+
+static int root_cgroup_fd;
+static bool mounted_bpffs;
+
+/* reads file at 'path' to 'buf', returns 0 on success. */
+static int read_from_file(const char *path, char *buf, size_t size)
+{
+	int fd, len;
+
+	fd =3D open(path, O_RDONLY);
+	if (fd < 0)
+		return fd;
+
+	len =3D read(fd, buf, size);
+	close(fd);
+	if (len < 0)
+		return len;
+
+	buf[len] =3D 0;
+	return 0;
+}
+
+/* mounts bpffs and mkdir for reading stats, returns 0 on success. */
+static int setup_bpffs(void)
+{
+	int err;
+
+	/* Mount bpffs */
+	err =3D mount("bpf", BPFFS_ROOT, "bpf", 0, NULL);
+	mounted_bpffs =3D !err;
+	if (ASSERT_FALSE(err && errno !=3D EBUSY, "mount"))
+		return err;
+
+	/* Create a directory to contain stat files in bpffs */
+	err =3D mkdir(BPFFS_VMSCAN, 0755);
+	if (!ASSERT_OK(err, "mkdir"))
+		return err;
+
+	return 0;
+}
+
+static void cleanup_bpffs(void)
+{
+	/* Remove created directory in bpffs */
+	ASSERT_OK(rmdir(BPFFS_VMSCAN), "rmdir "BPFFS_VMSCAN);
+
+	/* Unmount bpffs, if it wasn't already mounted when we started */
+	if (mounted_bpffs)
+		return;
+
+	ASSERT_OK(umount(BPFFS_ROOT), "unmount bpffs");
+}
+
+/* sets up cgroups, returns 0 on success. */
+static int setup_cgroups(void)
+{
+	int i, fd, err;
+
+	err =3D setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		return err;
+
+	root_cgroup_fd =3D get_root_cgroup();
+	if (!ASSERT_GE(root_cgroup_fd, 0, "get_root_cgroup"))
+		return root_cgroup_fd;
+
+	for (i =3D 0; i < N_CGROUPS; i++) {
+		fd =3D create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(fd, 0, "create_and_get_cgroup"))
+			return fd;
+
+		cgroups[i].fd =3D fd;
+		cgroups[i].id =3D get_cgroup_id(cgroups[i].path);
+
+		/*
+		 * Enable memcg controller for the entire hierarchy.
+		 * Note that stats are collected for all cgroups in a hierarchy
+		 * with memcg enabled anyway, but are only exposed for cgroups
+		 * that have memcg enabled.
+		 */
+		if (i < N_NON_LEAF_CGROUPS) {
+			err =3D enable_controllers(cgroups[i].path, "memory");
+			if (!ASSERT_OK(err, "enable_controllers"))
+				return err;
+		}
+	}
+	return 0;
+}
+
+static void cleanup_cgroups(void)
+{
+	close(root_cgroup_fd);
+	for (int i =3D 0; i < N_CGROUPS; i++)
+		close(cgroups[i].fd);
+	cleanup_cgroup_environment();
+}
+
+/* Sets up cgroup hiearchary, returns 0 on success. */
+static int setup_hierarchy(void)
+{
+	return setup_bpffs() || setup_cgroups();
+}
+
+static void destroy_hierarchy(void)
+{
+	cleanup_cgroups();
+	cleanup_bpffs();
+}
+
+static int reclaimer(const char *cgroup_path, size_t size)
+{
+	static char size_buf[128];
+	char *buf, *ptr;
+	int err;
+
+	/* Join cgroup in the parent process workdir */
+	if (join_parent_cgroup(cgroup_path))
+		return EACCES;
+
+	/* Allocate memory */
+	buf =3D malloc(size);
+	if (!buf)
+		return ENOMEM;
+
+	/* Write to memory to make sure it's actually allocated */
+	for (ptr =3D buf; ptr < buf + size; ptr +=3D PAGE_SIZE)
+		*ptr =3D 1;
+
+	/* Try to reclaim memory */
+	snprintf(size_buf, 128, "%lu", size);
+	err =3D write_cgroup_file_parent(cgroup_path, "memory.reclaim", size_buf);
+
+	free(buf);
+	/* memory.reclaim returns EAGAIN if the amount is not fully reclaimed */
+	if (err && errno !=3D EAGAIN)
+		return errno;
+
+	return 0;
+}
+
+static int induce_vmscan(void)
+{
+	int i, status;
+
+	/*
+	 * In every leaf cgroup, run a child process that allocates some memory
+	 * and attempts to reclaim some of it.
+	 */
+	for (i =3D N_NON_LEAF_CGROUPS; i < N_CGROUPS; i++) {
+		pid_t pid;
+
+		/* Create reclaimer child */
+		pid =3D fork();
+		if (pid =3D=3D 0) {
+			status =3D reclaimer(cgroups[i].path, MB(5));
+			exit(status);
+		}
+
+		/* Cleanup reclaimer child */
+		waitpid(pid, &status, 0);
+		ASSERT_TRUE(WIFEXITED(status), "reclaimer exited");
+		ASSERT_EQ(WEXITSTATUS(status), 0, "reclaim exit code");
+	}
+	return 0;
+}
+
+static unsigned long long
+get_cgroup_vmscan_delay(unsigned long long cgroup_id, const char *file_nam=
e)
+{
+	unsigned long long vmscan =3D 0, id =3D 0;
+	static char buf[128], path[128];
+
+	/* For every cgroup, read the file generated by cgroup_iter */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	if (!ASSERT_OK(read_from_file(path, buf, 128), "read cgroup_iter"))
+		return 0;
+
+	/* Check the output file formatting */
+	ASSERT_EQ(sscanf(buf, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			 &id, &vmscan), 2, "output format");
+
+	/* Check that the cgroup_id is displayed correctly */
+	ASSERT_EQ(id, cgroup_id, "cgroup_id");
+	/* Check that the vmscan reading is non-zero */
+	ASSERT_GT(vmscan, 0, "vmscan_reading");
+	return vmscan;
+}
+
+static void check_vmscan_stats(void)
+{
+	unsigned long long vmscan_readings[N_CGROUPS], vmscan_root;
+	int i;
+
+	for (i =3D 0; i < N_CGROUPS; i++) {
+		vmscan_readings[i] =3D get_cgroup_vmscan_delay(cgroups[i].id,
+							     cgroups[i].name);
+	}
+
+	/* Read stats for root too */
+	vmscan_root =3D get_cgroup_vmscan_delay(CG_ROOT_ID, CG_ROOT_NAME);
+
+	/* Check that child1 =3D=3D child1_1 + child1_2 */
+	ASSERT_EQ(vmscan_readings[1], vmscan_readings[3] + vmscan_readings[4],
+		  "child1_vmscan");
+	/* Check that child2 =3D=3D child2_1 + child2_2 */
+	ASSERT_EQ(vmscan_readings[2], vmscan_readings[5] + vmscan_readings[6],
+		  "child2_vmscan");
+	/* Check that test =3D=3D child1 + child2 */
+	ASSERT_EQ(vmscan_readings[0], vmscan_readings[1] + vmscan_readings[2],
+		  "test_vmscan");
+	/* Check that root >=3D test */
+	ASSERT_GE(vmscan_root, vmscan_readings[1], "root_vmscan");
+}
+
+/* Creates iter link and pins in bpffs, returns 0 on success, -errno on fa=
ilure.
+ */
+static int setup_cgroup_iter(struct cgroup_hierarchical_stats *obj,
+			     int cgroup_fd, const char *file_name)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo =3D {};
+	struct bpf_link *link;
+	static char path[128];
+	int err;
+
+	/*
+	 * Create an iter link, parameterized by cgroup_fd.
+	 * We only want to traverse one cgroup, so set the traversal order to
+	 * "pre", and return 1 from dump_vmscan to stop iteration after the
+	 * first cgroup.
+	 */
+	linfo.cgroup.cgroup_fd =3D cgroup_fd;
+	linfo.cgroup.order =3D BPF_ITER_SELF;
+	opts.link_info =3D &linfo;
+	opts.link_info_len =3D sizeof(linfo);
+	link =3D bpf_program__attach_iter(obj->progs.dump_vmscan, &opts);
+	if (!ASSERT_OK_PTR(link, "attach iter"))
+		return -EFAULT;
+
+	/* Pin the link to a bpffs file */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, file_name);
+	err =3D bpf_link__pin(link, path);
+	ASSERT_OK(err, "pin cgroup_iter");
+
+	/* Remove the link, leaving only the ref held by the pinned file */
+	bpf_link__destroy(link);
+	return err;
+}
+
+/* Sets up programs for collecting stats, returns 0 on success. */
+static int setup_progs(struct cgroup_hierarchical_stats **skel)
+{
+	int i, err;
+
+	*skel =3D cgroup_hierarchical_stats__open_and_load();
+	if (!ASSERT_OK_PTR(*skel, "open_and_load"))
+		return 1;
+
+	/* Attach cgroup_iter program that will dump the stats to cgroups */
+	for (i =3D 0; i < N_CGROUPS; i++) {
+		err =3D setup_cgroup_iter(*skel, cgroups[i].fd, cgroups[i].name);
+		if (!ASSERT_OK(err, "setup_cgroup_iter"))
+			return err;
+	}
+
+	/* Also dump stats for root */
+	err =3D setup_cgroup_iter(*skel, root_cgroup_fd, CG_ROOT_NAME);
+	if (!ASSERT_OK(err, "setup_cgroup_iter"))
+		return err;
+
+	err =3D cgroup_hierarchical_stats__attach(*skel);
+	if (!ASSERT_OK(err, "attach"))
+		return err;
+
+	return 0;
+}
+
+static void destroy_progs(struct cgroup_hierarchical_stats *skel)
+{
+	static char path[128];
+	int i;
+
+	for (i =3D 0; i < N_CGROUPS; i++) {
+		/* Delete files in bpffs that cgroup_iters are pinned in */
+		snprintf(path, 128, "%s%s", BPFFS_VMSCAN,
+			 cgroups[i].name);
+		ASSERT_OK(remove(path), "remove cgroup_iter pin");
+	}
+
+	/* Delete root file in bpffs */
+	snprintf(path, 128, "%s%s", BPFFS_VMSCAN, CG_ROOT_NAME);
+	ASSERT_OK(remove(path), "remove cgroup_iter root pin");
+	cgroup_hierarchical_stats__destroy(skel);
+}
+
+void test_cgroup_hierarchical_stats(void)
+{
+	struct cgroup_hierarchical_stats *skel =3D NULL;
+
+	if (setup_hierarchy())
+		goto hierarchy_cleanup;
+	if (setup_progs(&skel))
+		goto cleanup;
+	if (induce_vmscan())
+		goto cleanup;
+	check_vmscan_stats();
+cleanup:
+	destroy_progs(skel);
+hierarchy_cleanup:
+	destroy_hierarchy();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c =
b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
new file mode 100644
index 000000000000..8ab4253a1592
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_hierarchical_stats.c
@@ -0,0 +1,226 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Functions to manage eBPF programs attached to cgroup subsystems
+ *
+ * Copyright 2022 Google LLC.
+ */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+
+char _license[] SEC("license") =3D "GPL";
+
+/*
+ * Start times are stored per-task, not per-cgroup, as multiple tasks in o=
ne
+ * cgroup can perform reclaim concurrently.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, __u64);
+} vmscan_start_time SEC(".maps");
+
+struct vmscan_percpu {
+	/* Previous percpu state, to figure out if we have new updates */
+	__u64 prev;
+	/* Current percpu state */
+	__u64 state;
+};
+
+struct vmscan {
+	/* State propagated through children, pending aggregation */
+	__u64 pending;
+	/* Total state, including all cpus and all children */
+	__u64 state;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+	__uint(max_entries, 100);
+	__type(key, __u64);
+	__type(value, struct vmscan_percpu);
+} pcpu_cgroup_vmscan_elapsed SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 100);
+	__type(key, __u64);
+	__type(value, struct vmscan);
+} cgroup_vmscan_elapsed SEC(".maps");
+
+extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
+extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;
+
+static struct cgroup *task_memcg(struct task_struct *task)
+{
+	int cgrp_id;
+
+#if __has_builtin(__builtin_preserve_enum_value)
+	cgrp_id =3D bpf_core_enum_value(enum cgroup_subsys_id, memory_cgrp_id);
+#else
+	cgrp_id =3D memory_cgrp_id;
+#endif
+	return task->cgroups->subsys[cgrp_id]->cgroup;
+}
+
+static uint64_t cgroup_id(struct cgroup *cgrp)
+{
+	return cgrp->kn->id;
+}
+
+static int create_vmscan_percpu_elem(__u64 cg_id, __u64 state)
+{
+	struct vmscan_percpu pcpu_init =3D {.state =3D state, .prev =3D 0};
+
+	return bpf_map_update_elem(&pcpu_cgroup_vmscan_elapsed, &cg_id,
+				   &pcpu_init, BPF_NOEXIST);
+}
+
+static int create_vmscan_elem(__u64 cg_id, __u64 state, __u64 pending)
+{
+	struct vmscan init =3D {.state =3D state, .pending =3D pending};
+
+	return bpf_map_update_elem(&cgroup_vmscan_elapsed, &cg_id,
+				   &init, BPF_NOEXIST);
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
+int BPF_PROG(vmscan_start, int order, gfp_t gfp_flags)
+{
+	struct task_struct *task =3D bpf_get_current_task_btf();
+	__u64 *start_time_ptr;
+
+	start_time_ptr =3D bpf_task_storage_get(&vmscan_start_time, task, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (start_time_ptr)
+		*start_time_ptr =3D bpf_ktime_get_ns();
+	return 0;
+}
+
+SEC("tp_btf/mm_vmscan_memcg_reclaim_end")
+int BPF_PROG(vmscan_end, unsigned long nr_reclaimed)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct task_struct *current =3D bpf_get_current_task_btf();
+	struct cgroup *cgrp;
+	__u64 *start_time_ptr;
+	__u64 current_elapsed, cg_id;
+	__u64 end_time =3D bpf_ktime_get_ns();
+
+	/*
+	 * cgrp is the first parent cgroup of current that has memcg enabled in
+	 * its subtree_control, or NULL if memcg is disabled in the entire tree.
+	 * In a cgroup hierarchy like this:
+	 *                               a
+	 *                              / \
+	 *                             b   c
+	 *  If "a" has memcg enabled, while "b" doesn't, then processes in "b"
+	 *  will accumulate their stats directly to "a". This makes sure that no
+	 *  stats are lost from processes in leaf cgroups that don't have memcg
+	 *  enabled, but only exposes stats for cgroups that have memcg enabled.
+	 */
+	cgrp =3D task_memcg(current);
+	if (!cgrp)
+		return 0;
+
+	cg_id =3D cgroup_id(cgrp);
+	start_time_ptr =3D bpf_task_storage_get(&vmscan_start_time, current, 0,
+					      BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!start_time_ptr)
+		return 0;
+
+	current_elapsed =3D end_time - *start_time_ptr;
+	pcpu_stat =3D bpf_map_lookup_elem(&pcpu_cgroup_vmscan_elapsed,
+					&cg_id);
+	if (pcpu_stat)
+		pcpu_stat->state +=3D current_elapsed;
+	else if (create_vmscan_percpu_elem(cg_id, current_elapsed))
+		return 0;
+
+	cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
+	return 0;
+}
+
+SEC("fentry/bpf_rstat_flush")
+int BPF_PROG(vmscan_flush, struct cgroup *cgrp, struct cgroup *parent, int=
 cpu)
+{
+	struct vmscan_percpu *pcpu_stat;
+	struct vmscan *total_stat, *parent_stat;
+	__u64 cg_id =3D cgroup_id(cgrp);
+	__u64 parent_cg_id =3D parent ? cgroup_id(parent) : 0;
+	__u64 *pcpu_vmscan;
+	__u64 state;
+	__u64 delta =3D 0;
+
+	/* Add CPU changes on this level since the last flush */
+	pcpu_stat =3D bpf_map_lookup_percpu_elem(&pcpu_cgroup_vmscan_elapsed,
+					       &cg_id, cpu);
+	if (pcpu_stat) {
+		state =3D pcpu_stat->state;
+		delta +=3D state - pcpu_stat->prev;
+		pcpu_stat->prev =3D state;
+	}
+
+	total_stat =3D bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		if (create_vmscan_elem(cg_id, delta, 0))
+			return 0;
+
+		goto update_parent;
+	}
+
+	/* Collect pending stats from subtree */
+	if (total_stat->pending) {
+		delta +=3D total_stat->pending;
+		total_stat->pending =3D 0;
+	}
+
+	/* Propagate changes to this cgroup's total */
+	total_stat->state +=3D delta;
+
+update_parent:
+	/* Skip if there are no changes to propagate, or no parent */
+	if (!delta || !parent_cg_id)
+		return 0;
+
+	/* Propagate changes to cgroup's parent */
+	parent_stat =3D bpf_map_lookup_elem(&cgroup_vmscan_elapsed,
+					  &parent_cg_id);
+	if (parent_stat)
+		parent_stat->pending +=3D delta;
+	else
+		create_vmscan_elem(parent_cg_id, 0, delta);
+	return 0;
+}
+
+SEC("iter.s/cgroup")
+int BPF_PROG(dump_vmscan, struct bpf_iter_meta *meta, struct cgroup *cgrp)
+{
+	struct seq_file *seq =3D meta->seq;
+	struct vmscan *total_stat;
+	__u64 cg_id =3D cgrp ? cgroup_id(cgrp) : 0;
+
+	/* Do nothing for the terminal call */
+	if (!cg_id)
+		return 1;
+
+	/* Flush the stats to make sure we get the most updated numbers */
+	cgroup_rstat_flush(cgrp);
+
+	total_stat =3D bpf_map_lookup_elem(&cgroup_vmscan_elapsed, &cg_id);
+	if (!total_stat) {
+		BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: 0\n",
+			       cg_id);
+	} else {
+		BPF_SEQ_PRINTF(seq, "cg_id: %llu, total_vmscan_delay: %llu\n",
+			       cg_id, total_stat->state);
+	}
+
+	/*
+	 * We only dump stats for one cgroup here, so return 1 to stop
+	 * iteration after the first cgroup.
+	 */
+	return 1;
+}
--=20
2.37.1.559.g78731f0fdb-goog