From nobody Sat Jun 13 23:11:08 2026
Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 668C937C901;
	Tue,  5 May 2026 10:55:38 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=193.142.43.55
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1777978541; cv=none;
 b=Rx6VRSNZFgC2IWA2BmZKWLqsqRCYBkQwayjdifpoe8DqSRxTyLZnOAPUYA3x8FtkZ2v/2HHuGQHWSjanzND6QjyA/WyWn5C4u52c//Xm5ZcNCsg0Di++tOUH92fY1BjzByYXcsCGfGShrw3D9/UHS4ckXH+c9wrQCv7gis9Gv78=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1777978541; c=relaxed/simple;
	bh=cmvqbVllm6UilRlRWlpSJ7Y07fW4Q9ioe2MlFrpi6YU=;
	h=Date:From:To:Subject:Cc:In-Reply-To:References:MIME-Version:
	 Message-ID:Content-Type;
 b=MiUIeCotlGOP2VYtGVqPiovKSn9IZkos//dMnyC/gQoqKJQbs9GZf28ojfueJfWejVp0ZF/K4NbuHKjXw4vuaxt2SaZpltt64yop663w/kOhU0MXow4RNxRnyMwTM4gpyinBNAF55xynUpxdIiuZVlny/VcdqwA1ij1t9Xs2y7o=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de;
 spf=pass smtp.mailfrom=linutronix.de;
 dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=k1BMPeLM;
 dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b=X5fSqIR6; arc=none smtp.client-ip=193.142.43.55
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=none dis=none) header.from=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=linutronix.de
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="k1BMPeLM";
	dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de
 header.b="X5fSqIR6"
Date: Tue, 05 May 2026 10:55:35 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020; t=1777978537;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=7N8UpyuXlAJtY2c5lkj7gVPt7V1Yf0zxM2tFyl1OMDs=;
	b=k1BMPeLMZHfDZhQapoSBH12JWDVmVPNGwulkuXwN9vSojJ4d5wUwsdTkJPExouoZU1iBiH
	8jKorGkfAAPO/9KxswMUDSY5UD2EuyNKAiGjnWQ5Ma4B/KWs6ona9oGKt1eykAyfPaxO+d
	lw6Z0NzgdLgZ9tDtMaJdSEPyrDq6tTq67n7boSTezJFfmZyswAgX9YavaS3niLp5fINHVY
	mgbLbRmDFd4NshUILv4YUoPGue7BijVr9o08KMHWV/+aiBRHgXrq7z1MZE0L6LYrYs0+sR
	WvOxg3pE6hCs0u5fIVBFAuM0N/upZ24GUiGaXZFW4ndrV2azLvUj+VeNBVrR+Q==
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de;
	s=2020e; t=1777978537;
	h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=7N8UpyuXlAJtY2c5lkj7gVPt7V1Yf0zxM2tFyl1OMDs=;
	b=X5fSqIR60zmAd0k8EKhV+v85WET1EvUQ1h+JMzOBTlg4W1S76qFknnLa9+uH1P5LDBonBo
	nUxHwofWFKIaLyBw==
From: "tip-bot2 for Aniket Gattani" <tip-bot2@linutronix.de>
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: locking/core] selftests/membarrier: Add rseq stress test for
 CFS throttle interactions
Cc: kernel test robot <lkp@intel.com>,
 Aniket Gattani <aniketgattani@google.com>,
 "Peter Zijlstra (Intel)" <peterz@infradead.org>, x86@kernel.org,
 linux-kernel@vger.kernel.org
In-Reply-To: <20260503212205.3714217-4-aniketgattani@google.com>
References: <20260503212205.3714217-4-aniketgattani@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Message-ID: <177797853561.424702.2347141406578693763.tip-bot2@tip-bot2>
Robot-ID: <tip-bot2@linutronix.de>
Robot-Unsubscribe: 
 Contact <mailto:tglx@kernel.org> to get blacklisted from these emails
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     03240f5de2dd312f388f3e493f194d54d43e2924
Gitweb:        https://git.kernel.org/tip/03240f5de2dd312f388f3e493f194d54d=
43e2924
Author:        Aniket Gattani <aniketgattani@google.com>
AuthorDate:    Sun, 03 May 2026 21:22:05=20
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 05 May 2026 12:50:48 +02:00

selftests/membarrier: Add rseq stress test for CFS throttle interactions

Add a new stress test to exercise the interaction between targeted
expedited membarrier commands and CFS bandwidth throttling.

The test creates a wide and deep cgroup hierarchy under a tight CFS
bandwidth quota and aggressively hammers the membarrier syscall to expose
lock contention and latency issues. This serves as a reliable reproducer
for the `membarrier_ipi_mutex` cascade lockup and guards against future
membarrier locking changes regressing targeted command latency.

Closes: https://lore.kernel.org/r/202604151516.Vc7Ro4LP-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Aniket Gattani <aniketgattani@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260503212205.3714217-4-aniketgattani@googl=
e.com
---
 tools/testing/selftests/membarrier/Makefile                 |   5 +-
 tools/testing/selftests/membarrier/membarrier_rseq_stress.c | 951 +++++++-
 2 files changed, 954 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/membarrier/membarrier_rseq_stre=
ss.c

diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/se=
lftests/membarrier/Makefile
index fc840e0..829f95c 100644
--- a/tools/testing/selftests/membarrier/Makefile
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS +=3D -g $(KHDR_INCLUDES)
+CFLAGS +=3D -g $(KHDR_INCLUDES) -pthread -I../../../../tools/include
 LDLIBS +=3D -lpthread
=20
 TEST_GEN_PROGS :=3D membarrier_test_single_thread \
-		membarrier_test_multi_thread
+		membarrier_test_multi_thread \
+		membarrier_rseq_stress
=20
 include ../lib.mk
diff --git a/tools/testing/selftests/membarrier/membarrier_rseq_stress.c b/=
tools/testing/selftests/membarrier/membarrier_rseq_stress.c
new file mode 100644
index 0000000..c188d74
--- /dev/null
+++ b/tools/testing/selftests/membarrier/membarrier_rseq_stress.c
@@ -0,0 +1,951 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Membarrier stress test for CFS throttle interactions.
+ *
+ * Reproducer for the interaction between CFS throttle and expedited memba=
rrier.
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <syscall.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <errno.h>
+#include <sched.h>
+#include <time.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <dirent.h>
+#include <sys/prctl.h>
+#include <sys/mman.h>
+
+#include "../kselftest.h"
+
+/* -- Architecture-specific rseq signature -- */
+#if defined(__x86_64__) || defined(__i386__)
+# define RSEQ_SIG  0x53053053U
+#elif defined(__aarch64__)
+# define RSEQ_SIG  0xd428bc00U
+#elif defined(__powerpc__) || defined(__powerpc64__)
+# define RSEQ_SIG  0x0f000000U
+#elif defined(__s390__) || defined(__s390x__)
+# define RSEQ_SIG  0x0c000000U
+#else
+# define RSEQ_SIG  0
+# define UNSUPPORTED_ARCH 1
+#endif
+
+/* -- rseq ABI (kernel uapi; define locally for portability) -- */
+#define RSEQ_CPU_ID_UNINITIALIZED       ((__u32)-1)
+
+#include <linux/compiler.h>
+
+struct rseq_abi {
+	__u32 cpu_id_start;
+	__u32 cpu_id;
+	__u64 rseq_cs;
+	__u32 flags;
+	__u32 node_id;
+	__u32 mm_cid;
+	char  end[0];
+} __aligned(32);
+
+/* -- membarrier constants (not in all distro headers) -- */
+#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
+# define MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ          (1 << 7)
+#endif
+#ifndef MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ
+# define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ (1 << 8)
+#endif
+#ifndef MEMBARRIER_CMD_FLAG_CPU
+# define MEMBARRIER_CMD_FLAG_CPU  (1 << 0)
+#endif
+
+/* -- Test parameters -- */
+#define N_SIBLINGS          2000
+#define NEST_DEPTH          5
+static char g_cgroup_path[4096];
+static int use_cgroup_v2;
+
+#define CFS_QUOTA_US        1000
+#define CFS_PERIOD_US       5000
+#define N_HAMMER_PER_CPU    25
+#define N_BURNER_PER_CPU    50
+#define MAX_STRESS_CPUS     1024
+#define TEST_DURATION_SEC   20
+
+/* Latency thresholds for the sentinel */
+#define LATENCY_WARN_MS     50
+#define LATENCY_CRITICAL_MS 200
+
+/* Sentinel sampling interval */
+#define SENTINEL_INTERVAL_US  500
+
+/* -- Shared globals -- */
+static atomic_int  g_stop;
+static atomic_int  g_stop_sentinel;
+static atomic_long g_max_latency_us;
+static atomic_long g_interval_max_latency_us;
+static atomic_long g_mb_ok;
+static atomic_long g_mb_err;
+static int         g_ncpus_stress;
+static int *g_stress_cpus;
+
+static atomic_int  g_test_ready;
+
+/* Per-thread rseq ABI block registered with the kernel */
+static __thread struct rseq_abi tls_rseq
+	__attribute__((tls_model("initial-exec"))) __aligned(32) =3D {
+	.cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+/* -- Utility -- */
+static int write_file(const char *path, const char *val)
+{
+	int fd =3D open(path, O_WRONLY | O_CLOEXEC);
+
+	if (fd < 0)
+		return -errno;
+
+	size_t len =3D strlen(val);
+	ssize_t r =3D write(fd, val, len);
+
+	close(fd);
+	if (r < 0)
+		return -errno;
+	if ((size_t)r !=3D len)
+		return -EIO;
+	return 0;
+}
+
+static uint64_t monotonic_us(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000ULL;
+}
+
+static void update_max_latency(long lat)
+{
+	long old =3D atomic_load_explicit(&g_max_latency_us, memory_order_relaxed=
);
+
+	while (lat > old) {
+		if (atomic_compare_exchange_weak_explicit(&g_max_latency_us, &old, lat,
+				memory_order_relaxed, memory_order_relaxed))
+			break;
+	}
+
+	old =3D atomic_load_explicit(&g_interval_max_latency_us, memory_order_rel=
axed);
+	while (lat > old) {
+		if (atomic_compare_exchange_weak_explicit(&g_interval_max_latency_us, &o=
ld, lat,
+				memory_order_relaxed, memory_order_relaxed))
+			break;
+	}
+}
+
+static void init_stress_cpus(void)
+{
+	cpu_set_t set;
+	int capacity =3D MAX_STRESS_CPUS;
+
+	g_stress_cpus =3D malloc(capacity * sizeof(int));
+	if (!g_stress_cpus)
+		ksft_exit_fail_msg("malloc failed for g_stress_cpus\n");
+
+	if (sched_getaffinity(0, sizeof(set), &set) < 0)
+		ksft_exit_fail_msg("sched_getaffinity failed\n");
+
+	for (int i =3D 0; i < CPU_SETSIZE && g_ncpus_stress < capacity; i++) {
+		if (CPU_ISSET(i, &set))
+			g_stress_cpus[g_ncpus_stress++] =3D i;
+	}
+
+	if (g_ncpus_stress =3D=3D 0)
+		ksft_exit_skip("No CPUs available for stress test\n");
+
+	ksft_print_msg("Stressing %d CPUs discovered via affinity\n", g_ncpus_str=
ess);
+}
+
+/* -- rseq / membarrier helpers -- */
+static int rseq_register_thread(void)
+{
+	int r =3D syscall(SYS_rseq, &tls_rseq, sizeof(tls_rseq), 0, RSEQ_SIG);
+
+	return (r =3D=3D 0 || errno =3D=3D EBUSY || errno =3D=3D EINVAL) ? 0 : -1;
+}
+
+static int rseq_register_thread_at(struct rseq_abi *rseq)
+{
+	int r =3D syscall(SYS_rseq, rseq, sizeof(*rseq), 0, RSEQ_SIG);
+
+	return (r =3D=3D 0 || errno =3D=3D EBUSY || errno =3D=3D EINVAL) ? 0 : -1;
+}
+
+static int membarrier_register_rseq_mm(void)
+{
+	return syscall(SYS_membarrier,
+		       MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
+}
+
+/* -- cgroup helpers -- */
+static void rm_cgroup_recursive(const char *path)
+{
+	DIR *dir =3D opendir(path);
+
+	if (!dir)
+		return;
+	struct dirent *entry;
+
+	while ((entry =3D readdir(dir)) !=3D NULL) {
+		if (strcmp(entry->d_name, ".") =3D=3D 0 || strcmp(entry->d_name, "..") =
=3D=3D 0)
+			continue;
+		if (entry->d_type =3D=3D DT_DIR) {
+			char sub_path[4096];
+
+			snprintf(sub_path, sizeof(sub_path), "%s/%s", path, entry->d_name);
+			rm_cgroup_recursive(sub_path);
+		}
+	}
+	closedir(dir);
+	rmdir(path);
+}
+
+static void cgroup_teardown(void);
+
+static int cgroup_setup(void)
+{
+	struct stat st;
+
+	if (stat("/sys/fs/cgroup/cpu", &st) =3D=3D 0) {
+		use_cgroup_v2 =3D 0;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+			 "/sys/fs/cgroup/cpu/membarrier_stress_test");
+	} else if (stat("/dev/cgroup/cpu", &st) =3D=3D 0) {
+		use_cgroup_v2 =3D 0;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+			 "/dev/cgroup/cpu/membarrier_stress_test");
+	} else if (stat("/cgroup/cpu", &st) =3D=3D 0) {
+		use_cgroup_v2 =3D 0;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+			 "/cgroup/cpu/membarrier_stress_test");
+	} else if (stat("/sys/fs/cgroup/cgroup.controllers", &st) =3D=3D 0) {
+		use_cgroup_v2 =3D 1;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+			 "/sys/fs/cgroup/membarrier_stress_test");
+	} else {
+		ksft_print_msg("WARN: cgroup mount not found. Using v2 at /sys/fs/cgroup=
\n");
+		use_cgroup_v2 =3D 1;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+			 "/sys/fs/cgroup/membarrier_stress_test");
+	}
+
+	/* Robust cleanup before setup */
+	cgroup_teardown();
+
+	if (use_cgroup_v2) {
+		/* Enable cpu controller in root cgroup */
+		if (write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpu") < 0)
+			ksft_print_msg("WARN: failed to enable cpu controller in /sys/fs/cgroup=
\n");
+	}
+
+	if (mkdir(g_cgroup_path, 0755) < 0 && errno !=3D EEXIST) {
+		ksft_print_msg("mkdir base %s failed: %s\n", g_cgroup_path, strerror(err=
no));
+		return -1;
+	}
+
+	if (use_cgroup_v2) {
+		char ctrl_path[4096];
+
+		snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.subtree_control", g_cg=
roup_path);
+		if (write_file(ctrl_path, "+cpu") < 0)
+			ksft_print_msg("WARN: failed to enable cpu controller in %s\n",
+				       g_cgroup_path);
+	}
+
+	for (int i =3D 0; i < N_SIBLINGS; i++) {
+		char sibling_path[4096];
+
+		snprintf(sibling_path, sizeof(sibling_path), "%s/n%d", g_cgroup_path, i);
+		if (mkdir(sibling_path, 0755) < 0 && errno !=3D EEXIST) {
+			ksft_print_msg("mkdir wide %s failed: %s\n", sibling_path, strerror(err=
no));
+			return -1;
+		}
+
+		if (use_cgroup_v2) {
+			char ctrl_path[4096];
+
+			snprintf(ctrl_path, sizeof(ctrl_path),
+				 "%s/cgroup.subtree_control", sibling_path);
+			if (write_file(ctrl_path, "+cpu") < 0)
+				ksft_print_msg("WARN: failed to enable cpu controller in %s\n",
+					       sibling_path);
+		}
+
+		char current_path[4096];
+
+		snprintf(current_path, sizeof(current_path), "%s", sibling_path);
+		for (int j =3D 0; j < NEST_DEPTH; j++) {
+			snprintf(current_path + strlen(current_path),
+				 sizeof(current_path) - strlen(current_path), "/d%d", j);
+			if (mkdir(current_path, 0755) < 0 && errno !=3D EEXIST) {
+				ksft_print_msg("mkdir deep %s failed: %s\n",
+					       current_path, strerror(errno));
+				return -1;
+			}
+
+			/* Enable for all but the leaf */
+			if (use_cgroup_v2 && j < NEST_DEPTH - 1) {
+				char ctrl_path[4096];
+
+				snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.subtree_control",
+					 current_path);
+				if (write_file(ctrl_path, "+cpu") < 0)
+					ksft_print_msg("WARN: cannot enable cpu controller in %s\n",
+						       current_path);
+			}
+		}
+	}
+
+	char quota[64], period[64], max_str[128];
+
+	snprintf(quota, sizeof(quota), "%d", CFS_QUOTA_US);
+	snprintf(period, sizeof(period), "%d", CFS_PERIOD_US);
+	snprintf(max_str, sizeof(max_str), "%d %d", CFS_QUOTA_US, CFS_PERIOD_US);
+
+	if (use_cgroup_v2) {
+		char max_path[4096];
+
+		snprintf(max_path, sizeof(max_path), "%s/cpu.max", g_cgroup_path);
+		if (write_file(max_path, max_str) < 0) {
+			ksft_print_msg("ERROR: cannot write cpu.max at %s\n", max_path);
+			return -1;
+		}
+		ksft_print_msg("cgroup (v2) %s: cpu.max=3D%s\n", g_cgroup_path, max_str);
+	} else {
+		char quota_path[4096], period_path[4096];
+
+		snprintf(quota_path, sizeof(quota_path), "%s/cpu.cfs_quota_us", g_cgroup=
_path);
+		snprintf(period_path, sizeof(period_path), "%s/cpu.cfs_period_us", g_cgr=
oup_path);
+
+		if (write_file(period_path, period) < 0) {
+			ksft_print_msg("ERROR: cannot write cpu.cfs_period_us at %s\n",
+				       period_path);
+			return -1;
+		}
+		if (write_file(quota_path, quota) < 0) {
+			ksft_print_msg("ERROR: cannot write cpu.cfs_quota_us at %s\n", quota_pa=
th);
+			return -1;
+		}
+		ksft_print_msg("cgroup (v1) %s: cpu.cfs_quota_us=3D%d cpu.cfs_period_us=
=3D%d\n",
+			       g_cgroup_path, CFS_QUOTA_US, CFS_PERIOD_US);
+	}
+
+	return 0;
+}
+
+static int cgroup_add_pid_to_path(pid_t pid, const char *path)
+{
+	char buf[32], file_path[4096];
+
+	snprintf(buf, sizeof(buf), "%d", (int)pid);
+	if (use_cgroup_v2) {
+		snprintf(file_path, sizeof(file_path), "%s/cgroup.procs", path);
+		return write_file(file_path, buf);
+	}
+	/* In v1, try tasks first, fallback to cgroup.procs */
+	snprintf(file_path, sizeof(file_path), "%s/tasks", path);
+	int r =3D write_file(file_path, buf);
+
+	if (r < 0) {
+		snprintf(file_path, sizeof(file_path), "%s/cgroup.procs", path);
+		r =3D write_file(file_path, buf);
+	}
+	return r;
+}
+
+static void cgroup_teardown(void)
+{
+	rm_cgroup_recursive(g_cgroup_path);
+}
+
+static void cgroup_unthrottle(void)
+{
+	if (use_cgroup_v2) {
+		char max_path[4096];
+
+		snprintf(max_path, sizeof(max_path), "%s/cpu.max", g_cgroup_path);
+		write_file(max_path, "max");
+	} else {
+		char quota_path[4096];
+
+		snprintf(quota_path, sizeof(quota_path), "%s/cpu.cfs_quota_us", g_cgroup=
_path);
+		write_file(quota_path, "-1");
+	}
+}
+
+/* -- CPU burner (inside throttled child process) -- */
+static void *burner_thread_fn(void *arg)
+{
+	struct rseq_abi my_rseq;
+	int cpu =3D (int)(uintptr_t)arg;
+
+	memset(&my_rseq, 0, sizeof(my_rseq));
+	my_rseq.cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED;
+
+	if (rseq_register_thread_at(&my_rseq) < 0) {
+		perror("rseq_register (burner)");
+		return NULL;
+	}
+
+	cpu_set_t set;
+
+	CPU_ZERO(&set);
+	CPU_SET(cpu, &set);
+	if (sched_setaffinity(0, sizeof(set), &set) < 0)
+		perror("sched_setaffinity (burner)");
+
+	unsigned long sink =3D 0;
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		sink++;
+		/* Prevent compiler from optimizing the loop away */
+		asm volatile("" : "+g"(sink));
+	}
+
+	return NULL;
+}
+
+static int burner_thread_fn_wrapper(void *arg)
+{
+	burner_thread_fn(arg);
+	return 0;
+}
+
+static int leaf_child_fn(void *arg)
+{
+	int i =3D (int)(uintptr_t)arg;
+	int total_burners =3D g_ncpus_stress * N_BURNER_PER_CPU;
+	int n_threads_per_leaf =3D total_burners / N_SIBLINGS;
+
+	if (i < (total_burners % N_SIBLINGS))
+		n_threads_per_leaf++;
+
+	prctl(PR_SET_PDEATHSIG, SIGTERM);
+	if (getppid() =3D=3D 1)
+		_exit(1);
+
+	char leaf_path[4096];
+
+	snprintf(leaf_path, sizeof(leaf_path), "%s/n%d", g_cgroup_path, i);
+	for (int j =3D 0; j < NEST_DEPTH; j++)
+		snprintf(leaf_path + strlen(leaf_path),
+			 sizeof(leaf_path) - strlen(leaf_path), "/d%d", j);
+
+	int r =3D cgroup_add_pid_to_path(getpid(), leaf_path);
+
+	if (r < 0) {
+		char buf[512];
+		int len =3D snprintf(buf, sizeof(buf),
+				   "[leaf child %d] failed to join cgroup %s: err %d\n",
+				   i, leaf_path, -r);
+		(void)!write(2, buf, len);
+		_exit(1);
+	}
+
+	for (int j =3D 0; j < n_threads_per_leaf; j++) {
+		int cpu =3D g_stress_cpus[(i * n_threads_per_leaf + j) % g_ncpus_stress];
+
+		/* Allocate stack via mmap (bypasses heap) */
+		size_t stack_size =3D 64 * 1024;
+		void *stack =3D mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (stack =3D=3D MAP_FAILED) {
+			const char *msg =3D "mmap stack failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+
+		/* Use raw clone to create a thread sharing the VM and thread group */
+		pid_t pid =3D clone(burner_thread_fn_wrapper, stack + stack_size,
+				  CLONE_VM | CLONE_THREAD | CLONE_SIGHAND,
+				  (void *)(uintptr_t)cpu);
+		if (pid < 0) {
+			const char *msg =3D "clone burner failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+	}
+
+	/* Wait for SIGTERM */
+	sigset_t mask;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGTERM);
+	int sig;
+
+	sigwait(&mask, &sig);
+
+	_exit(0);
+}
+
+struct leaf_info {
+	pid_t pid;
+	void *stack;
+};
+
+static int run_throttle_child(void *arg)
+{
+	(void)arg;
+	prctl(PR_SET_PDEATHSIG, SIGTERM);
+	if (getppid() =3D=3D 1)
+		_exit(1);
+
+	int n_leafs =3D N_SIBLINGS;
+
+	/* Block signals before spawning to avoid missing early failures */
+	sigset_t mask;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGTERM);
+	sigaddset(&mask, SIGCHLD);
+	sigprocmask(SIG_BLOCK, &mask, NULL);
+
+	/* Use mmap for tracking structures to avoid glibc heap usage */
+	struct leaf_info *leaves =3D mmap(NULL, n_leafs * sizeof(struct leaf_info=
),
+					PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (leaves =3D=3D MAP_FAILED) {
+		const char *msg =3D "mmap leaves array failed\n";
+		(void)!write(2, msg, strlen(msg));
+		_exit(1);
+	}
+
+	for (int i =3D 0; i < n_leafs; i++) {
+		size_t stack_size =3D 64 * 1024;
+		void *stack =3D mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (stack =3D=3D MAP_FAILED) {
+			const char *msg =3D "mmap leaf stack failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+
+		leaves[i].stack =3D stack;
+
+		pid_t pid =3D clone(leaf_child_fn, stack + stack_size,
+				  CLONE_VM | SIGCHLD, (void *)(uintptr_t)i);
+
+		if (pid < 0) {
+			const char *msg =3D "clone (leaf child) failed\n";
+			/* Pick the exit code now; the cleanup below may clobber errno */
+			int code =3D (errno =3D=3D EAGAIN) ? 4 : 1;
+
+			(void)!write(2, msg, strlen(msg));
+
+			/* Clean up successfully spawned children */
+			for (int j =3D 0; j < i; j++) {
+				kill(leaves[j].pid, SIGTERM);
+				waitpid(leaves[j].pid, NULL, 0);
+				munmap(leaves[j].stack, stack_size);
+			}
+			munmap(leaves, n_leafs * sizeof(struct leaf_info));
+
+			_exit(code);
+		}
+		leaves[i].pid =3D pid;
+	}
+
+	int failed =3D 0;
+
+	while (1) {
+		int sig;
+
+		sigwait(&mask, &sig);
+
+		if (sig =3D=3D SIGTERM) {
+			break;
+		} else if (sig =3D=3D SIGCHLD) {
+			int status;
+			pid_t pid;
+
+			/* Reap all dead children */
+			while ((pid =3D waitpid(-1, &status, WNOHANG)) > 0) {
+				for (int i =3D 0; i < n_leafs; i++) {
+					if (leaves[i].pid =3D=3D pid) {
+						leaves[i].pid =3D 0;
+						break;
+					}
+				}
+				if ((WIFEXITED(status) && WEXITSTATUS(status) !=3D 0) ||
+				    WIFSIGNALED(status)) {
+					char buf[128];
+					int len =3D snprintf(buf, sizeof(buf),
+							   "[manager] child %d died unexpectedly (wait status 0x%x)\n",
+							   pid, status);
+					(void)!write(2, buf, len);
+					failed =3D 1;
+				}
+			}
+			if (failed)
+				break;
+		}
+	}
+
+	/* Terminate all leaf children */
+	for (int i =3D 0; i < n_leafs; i++) {
+		if (leaves[i].pid > 0)
+			kill(leaves[i].pid, SIGTERM);
+	}
+
+	for (int i =3D 0; i < n_leafs; i++) {
+		if (leaves[i].pid > 0)
+			waitpid(leaves[i].pid, NULL, 0);
+		munmap(leaves[i].stack, 64 * 1024);
+	}
+
+	munmap(leaves, n_leafs * sizeof(struct leaf_info));
+
+	_exit(failed ? 1 : 0);
+}
+
+/* -- Membarrier hammer thread -- */
+static void *hammer_thread_fn(void *arg)
+{
+	int target_cpu =3D *(int *)arg;
+	long local_ok =3D 0;
+	long local_err =3D 0;
+	int count =3D 0;
+	const int batch_size =3D 1024;
+
+	if (rseq_register_thread() < 0) {
+		ksft_print_msg("[hammer] rseq_register failed: %s\n", strerror(errno));
+		return NULL;
+	}
+
+	membarrier_register_rseq_mm();
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		int r =3D syscall(SYS_membarrier,
+				MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
+				MEMBARRIER_CMD_FLAG_CPU,
+				target_cpu);
+		if (__builtin_expect(r =3D=3D 0, 1))
+			local_ok++;
+		else
+			local_err++;
+
+		count++;
+		if (__builtin_expect(count >=3D batch_size, 0)) {
+			atomic_fetch_add_explicit(&g_mb_ok, local_ok, memory_order_relaxed);
+			atomic_fetch_add_explicit(&g_mb_err, local_err, memory_order_relaxed);
+			local_ok =3D 0;
+			local_err =3D 0;
+			count =3D 0;
+		}
+	}
+
+	/* Flush any remaining counts on exit */
+	if (local_ok > 0)
+		atomic_fetch_add_explicit(&g_mb_ok, local_ok, memory_order_relaxed);
+	if (local_err > 0)
+		atomic_fetch_add_explicit(&g_mb_err, local_err, memory_order_relaxed);
+
+	return NULL;
+}
+
+/* -- Latency sentinel -- */
+static void *sentinel_thread_fn(void *arg)
+{
+	(void)arg;
+	struct sched_param sp =3D { .sched_priority =3D 20 };
+
+	if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
+		ksft_print_msg("WARN: no SCHED_FIFO for sentinel (less precise)\n");
+
+	while (!atomic_load_explicit(&g_test_ready, memory_order_relaxed) &&
+	       !atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		struct timespec ts =3D {0, 1000 * 1000}; /* 1ms */
+
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
+	}
+
+	uint64_t prev =3D monotonic_us();
+
+	while (!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		struct timespec ts =3D {
+			.tv_sec  =3D 0,
+			.tv_nsec =3D SENTINEL_INTERVAL_US * 1000L,
+		};
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
+
+		uint64_t now =3D monotonic_us();
+		long latency_us =3D (long)(now - prev) - SENTINEL_INTERVAL_US;
+
+		prev =3D now;
+
+		if (latency_us <=3D 0)
+			continue;
+
+		update_max_latency(latency_us);
+
+		if (latency_us > LATENCY_CRITICAL_MS * 1000L) {
+			ksft_print_msg("\n[SENTINEL] CRITICAL: %ld ms delay (lockup precursor!)=
\n",
+				latency_us / 1000);
+		} else if (latency_us > LATENCY_WARN_MS * 1000L) {
+			ksft_print_msg("\n[SENTINEL] WARN: %ld ms latency spike\n",
+				latency_us / 1000);
+		}
+	}
+	return NULL;
+}
+
+/* -- Progress reporter -- */
+static void *reporter_thread_fn(void *arg)
+{
+	(void)arg;
+	int elapsed =3D 0;
+
+	while (!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		for (int i =3D 0; i < 5; i++) {
+			sleep(1);
+			if (atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed))
+				break;
+		}
+		if (atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed))
+			break;
+		elapsed +=3D 5;
+		long interval_max =3D atomic_exchange_explicit(&g_interval_max_latency_u=
s,
+							     0, memory_order_relaxed);
+
+		ksft_print_msg("[%3ds] mb: ok=3D%-10ld err=3D%-8ld | max_lat=3D%ld us\n",
+		       elapsed,
+		       atomic_load(&g_mb_ok),
+		       atomic_load(&g_mb_err),
+		       interval_max);
+	}
+	return NULL;
+}
+
+/* -- Main -- */
+int main(void)
+{
+	ksft_print_header();
+#ifdef UNSUPPORTED_ARCH
+	ksft_exit_skip("Unsupported architecture\n");
+#endif
+	ksft_set_plan(1);
+
+	if (geteuid() !=3D 0)
+		ksft_exit_skip("Must run as root (cgroup + SCHED_FIFO)\n");
+
+	init_stress_cpus();
+
+	ksft_print_msg("=3D=3D=3D membarrier rseq + CFS unthrottle stress =3D=3D=
=3D\n");
+	ksft_print_msg("Stressing CPUs: %d\n", g_ncpus_stress);
+	ksft_print_msg("Quota: %d/%d us  (~%d unthrottles/sec/CPU)\n",
+	       CFS_QUOTA_US, CFS_PERIOD_US,
+	       1000000 / CFS_PERIOD_US);
+	ksft_print_msg("Hammer threads: %d per CPU (%d total)\n",
+	       N_HAMMER_PER_CPU, g_ncpus_stress * N_HAMMER_PER_CPU);
+	ksft_print_msg("Duration: %d seconds\n\n", TEST_DURATION_SEC);
+
+	if (cgroup_setup() < 0) {
+		cgroup_teardown();
+		ksft_exit_skip("cgroup_setup failed (missing permissions or v2 ctrls?)\n=
");
+	}
+
+	if (rseq_register_thread() < 0) {
+		ksft_print_msg("rseq_register (%s) failed: %s\n", __func__, strerror(err=
no));
+		cgroup_teardown();
+		ksft_exit_skip("rseq syscall failed or not available\n");
+	}
+	if (membarrier_register_rseq_mm() < 0) {
+		ksft_print_msg("MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ: %s\n"
+			"Kernel >=3D 5.10 with CONFIG_RSEQ required.\n",
+			strerror(errno));
+		cgroup_teardown();
+		ksft_exit_skip("membarrier register failed\n");
+	}
+	ksft_print_msg("rseq membarrier registered OK\n");
+
+	sigset_t sigmask;
+
+	sigemptyset(&sigmask);
+	sigaddset(&sigmask, SIGTERM);
+	sigprocmask(SIG_BLOCK, &sigmask, NULL);
+
+	void *stack =3D malloc(1024 * 1024);
+
+	if (!stack) {
+		perror("malloc stack");
+		cgroup_teardown();
+		ksft_exit_fail_msg("Malloc stack failed\n");
+	}
+	pid_t child =3D clone(run_throttle_child, stack + 1024 * 1024, CLONE_VM |=
 SIGCHLD, NULL);
+
+	if (child < 0) {
+		perror("clone");
+		cgroup_teardown();
+		ksft_exit_fail_msg("Clone failed\n");
+	}
+
+	sigprocmask(SIG_UNBLOCK, &sigmask, NULL);
+	ksft_print_msg("Throttle child PID %d started\n", child);
+
+	int n_threads =3D g_ncpus_stress * N_HAMMER_PER_CPU + 2;
+	pthread_t *threads =3D (pthread_t *)calloc(n_threads, sizeof(pthread_t));
+	int       *cpuargs =3D (int *)calloc(g_ncpus_stress * N_HAMMER_PER_CPU, s=
izeof(int));
+
+	if (!threads || !cpuargs) {
+		perror("calloc");
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		ksft_exit_fail_msg("Thread allocation failed\n");
+	}
+
+	int ti =3D 0, ai =3D 0;
+	int r;
+
+	ksft_print_msg("Creating sentinel thread...\n");
+	r =3D pthread_create(&threads[ti], NULL, sentinel_thread_fn, NULL);
+	if (r !=3D 0) {
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		free(threads);
+		free(cpuargs);
+		free(g_stress_cpus);
+		ksft_exit_fail_msg("pthread_create (sentinel) failed: %s\n", strerror(r)=
);
+	}
+	ti++;
+
+	ksft_print_msg("Creating reporter thread...\n");
+	r =3D pthread_create(&threads[ti], NULL, reporter_thread_fn, NULL);
+	if (r !=3D 0) {
+		atomic_store(&g_stop_sentinel, 1);
+		pthread_join(threads[0], NULL);
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		free(threads);
+		free(cpuargs);
+		free(g_stress_cpus);
+		ksft_exit_fail_msg("pthread_create (reporter) failed: %s\n", strerror(r)=
);
+	}
+	ti++;
+
+	ksft_print_msg("Creating %d hammer threads...\n", g_ncpus_stress * N_HAMM=
ER_PER_CPU);
+	for (int i =3D 0; i < g_ncpus_stress; i++) {
+		int cpu =3D g_stress_cpus[i];
+
+		for (int j =3D 0; j < N_HAMMER_PER_CPU; j++) {
+			cpuargs[ai] =3D cpu;
+			r =3D pthread_create(&threads[ti], NULL, hammer_thread_fn, &cpuargs[ai]=
);
+			if (r !=3D 0) {
+				ksft_print_msg("pthread_create failed at thread %d: %s\n",
+					       ti, strerror(r));
+
+				atomic_store(&g_stop_sentinel, 1);
+				pthread_join(threads[0], NULL);
+				pthread_join(threads[1], NULL);
+
+				atomic_store(&g_stop, 1);
+				for (int k =3D 2; k < ti; k++)
+					pthread_join(threads[k], NULL);
+
+				kill(child, SIGTERM);
+				waitpid(child, NULL, 0);
+				cgroup_teardown();
+
+				free(threads);
+				free(cpuargs);
+				free(g_stress_cpus);
+
+				if (r =3D=3D EAGAIN)
+					ksft_exit_skip("Resource limits prevent threads\n");
+				else
+					ksft_exit_fail_msg("Failed to create hammer thread\n");
+			}
+			ti++;
+			ai++;
+		}
+	}
+
+	ksft_print_msg("All threads running. Tip: monitor dmesg for lockups\n\n");
+
+	atomic_store_explicit(&g_test_ready, 1, memory_order_relaxed);
+	int child_failed =3D 0;
+	int child_status =3D 0;
+
+	for (int i =3D 0; i < TEST_DURATION_SEC; i++) {
+		sleep(1);
+		pid_t reaped =3D waitpid(child, &child_status, WNOHANG);
+
+		if (reaped =3D=3D child) {
+			child_failed =3D 1;
+			break;
+		}
+	}
+
+	atomic_store(&g_stop_sentinel, 1);
+	pthread_join(threads[0], NULL);
+	pthread_join(threads[1], NULL);
+
+	atomic_store(&g_stop, 1);
+
+	/* Unthrottle to allow children to exit quickly */
+	cgroup_unthrottle();
+
+	if (!child_failed) {
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+	}
+	for (int i =3D 2; i < ti; i++)
+		pthread_join(threads[i], NULL);
+
+	long max_lat   =3D atomic_load(&g_max_latency_us);
+	long total_ok  =3D atomic_load(&g_mb_ok);
+	long total_err =3D atomic_load(&g_mb_err);
+
+	ksft_print_msg("\n=3D=3D=3D RESULTS =3D=3D=3D\n");
+	ksft_print_msg("membarrier syscalls : %ld ok  %ld errors\n", total_ok, to=
tal_err);
+	ksft_print_msg("Max scheduler latency: %ld us  (%ld ms)\n", max_lat, max_=
lat / 1000);
+	cgroup_teardown();
+	free(threads);
+	free(cpuargs);
+	free(g_stress_cpus);
+
+	if (child_failed) {
+		if (WIFEXITED(child_status) && WEXITSTATUS(child_status) =3D=3D 4)
+			ksft_exit_skip("Manager child skipped (resource limits?)\n");
+		ksft_test_result_fail("membarrier_rseq_stress: Manager child died early\=
n");
+		ksft_exit_fail();
+	} else if (total_ok =3D=3D 0) {
+		ksft_test_result_fail("membarrier_rseq_stress: No successful membarrier =
calls\n");
+		ksft_exit_fail();
+	} else if (total_err > 0) {
+		ksft_test_result_fail("membarrier_rseq_stress: syscall errors\n");
+		ksft_exit_fail();
+	} else if (max_lat > LATENCY_CRITICAL_MS * 1000L) {
+		ksft_test_result_fail("membarrier_rseq_stress: LOCKUP PRECURSOR\n");
+		ksft_exit_fail();
+	} else if (max_lat > LATENCY_WARN_MS * 1000L) {
+		ksft_test_result_fail("membarrier_rseq_stress: significant latency spike=
\n");
+		ksft_exit_fail();
+	} else {
+		ksft_test_result_pass("membarrier_rseq_stress\n");
+		ksft_exit_pass();
+	}
+
+	return 0;
+}