From nobody Sat Jun 13 23:11:08 2026
Date: Tue, 05 May 2026 10:55:35 -0000
From: "tip-bot2 for Aniket Gattani"
Sender: tip-bot2@linutronix.de
Reply-to: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: locking/core] selftests/membarrier: Add rseq stress test for CFS throttle interactions
Cc: kernel test robot, Aniket Gattani, "Peter Zijlstra (Intel)", x86@kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260503212205.3714217-4-aniketgattani@google.com>
References: <20260503212205.3714217-4-aniketgattani@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Message-ID: <177797853561.424702.2347141406578693763.tip-bot2@tip-bot2>
Content-Type: text/plain;
 charset="utf-8"

The following commit has been merged into the locking/core branch of tip:

Commit-ID:     03240f5de2dd312f388f3e493f194d54d43e2924
Gitweb:        https://git.kernel.org/tip/03240f5de2dd312f388f3e493f194d54d43e2924
Author:        Aniket Gattani
AuthorDate:    Sun, 03 May 2026 21:22:05
Committer:     Peter Zijlstra
CommitterDate: Tue, 05 May 2026 12:50:48 +02:00

selftests/membarrier: Add rseq stress test for CFS throttle interactions

Add a new stress test to exercise the interaction between targeted
expedited membarrier commands and CFS bandwidth throttling. The test
creates a deep cgroup hierarchy and aggressively hammers the membarrier
syscall to expose lock contention and latency issues.

This serves as a reliable reproducer for the `membarrier_ipi_mutex`
cascade lockup, ensuring future changes to membarrier locking do not
regress targeted command latency.

Closes: https://lore.kernel.org/r/202604151516.Vc7Ro4LP-lkp@intel.com/
Reported-by: kernel test robot
Signed-off-by: Aniket Gattani
Signed-off-by: Peter Zijlstra (Intel)
Link: https://patch.msgid.link/20260503212205.3714217-4-aniketgattani@google.com
---
 tools/testing/selftests/membarrier/Makefile                 |   5 +-
 tools/testing/selftests/membarrier/membarrier_rseq_stress.c | 951 +++++++-
 2 files changed, 954 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/membarrier/membarrier_rseq_stress.c

diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
index fc840e0..829f95c 100644
--- a/tools/testing/selftests/membarrier/Makefile
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -g $(KHDR_INCLUDES)
+CFLAGS += -g $(KHDR_INCLUDES) -pthread -I../../../../tools/include
 LDLIBS += -lpthread
 
 TEST_GEN_PROGS := membarrier_test_single_thread \
-	membarrier_test_multi_thread
+	membarrier_test_multi_thread \
+	membarrier_rseq_stress
 
 include ../lib.mk
diff --git a/tools/testing/selftests/membarrier/membarrier_rseq_stress.c b/tools/testing/selftests/membarrier/membarrier_rseq_stress.c
new file mode 100644
index 0000000..c188d74
--- /dev/null
+++ b/tools/testing/selftests/membarrier/membarrier_rseq_stress.c
@@ -0,0 +1,951 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Membarrier stress test for CFS throttle interactions.
+ *
+ * Reproducer for the interaction between CFS throttle and expedited membarrier.
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "../kselftest.h"
+
+/* -- Architecture-specific rseq signature -- */
+#if defined(__x86_64__) || defined(__i386__)
+# define RSEQ_SIG 0x53053053U
+#elif defined(__aarch64__)
+# define RSEQ_SIG 0xd428bc00U
+#elif defined(__powerpc__) || defined(__powerpc64__)
+# define RSEQ_SIG 0x0f000000U
+#elif defined(__s390__) || defined(__s390x__)
+# define RSEQ_SIG 0x0c000000U
+#else
+# define RSEQ_SIG 0
+# define UNSUPPORTED_ARCH 1
+#endif
+
+/* -- rseq ABI (kernel uapi; define locally for portability) -- */
+#define RSEQ_CPU_ID_UNINITIALIZED ((__u32)-1)
+
+#include
+
+struct rseq_abi {
+	__u32 cpu_id_start;
+	__u32 cpu_id;
+	__u64 rseq_cs;
+	__u32 flags;
+	__u32 node_id;
+	__u32 mm_cid;
+	char end[0];
+} __aligned(32);
+
+/* -- membarrier constants (not in all distro headers) -- */
+#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
+# define MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ (1 << 7)
+#endif
+#ifndef MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ
+# define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ (1 << 8)
+#endif
+#ifndef MEMBARRIER_CMD_FLAG_CPU
+# define MEMBARRIER_CMD_FLAG_CPU (1 << 0)
+#endif
+
+/* -- Test parameters -- */
+#define N_SIBLINGS 2000
+#define NEST_DEPTH 5
+static char g_cgroup_path[4096];
+static int use_cgroup_v2;
+
+#define CFS_QUOTA_US 1000
+#define CFS_PERIOD_US 5000
+#define N_HAMMER_PER_CPU 25
+#define N_BURNER_PER_CPU 50
+#define MAX_STRESS_CPUS 1024
+#define TEST_DURATION_SEC 20
+
+/* Latency thresholds for the sentinel */
+#define LATENCY_WARN_MS 50
+#define LATENCY_CRITICAL_MS 200
+
+/* Sentinel sampling interval */
+#define SENTINEL_INTERVAL_US 500
+
+/* -- Shared globals -- */
+static atomic_int g_stop;
+static atomic_int g_stop_sentinel;
+static atomic_long g_max_latency_us;
+static atomic_long g_interval_max_latency_us;
+static atomic_long g_mb_ok;
+static atomic_long g_mb_err;
+static int g_ncpus_stress;
+static int *g_stress_cpus;
+
+static atomic_int g_test_ready;
+
+/* Per-thread rseq ABI block registered with the kernel */
+static __thread struct rseq_abi tls_rseq
+	__attribute__((tls_model("initial-exec"))) __aligned(32) = {
+	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+/* -- Utility -- */
+static int write_file(const char *path, const char *val)
+{
+	int fd = open(path, O_WRONLY | O_CLOEXEC);
+
+	if (fd < 0)
+		return -errno;
+
+	size_t len = strlen(val);
+	ssize_t r = write(fd, val, len);
+
+	close(fd);
+	if (r < 0)
+		return -errno;
+	if ((size_t)r != len)
+		return -EIO;
+	return 0;
+}
+
+static uint64_t monotonic_us(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000ULL;
+}
+
+static void update_max_latency(long lat)
+{
+	long old = atomic_load_explicit(&g_max_latency_us, memory_order_relaxed);
+
+	while (lat > old) {
+		if (atomic_compare_exchange_weak_explicit(&g_max_latency_us, &old, lat,
+				memory_order_relaxed, memory_order_relaxed))
+			break;
+	}
+
+	old = atomic_load_explicit(&g_interval_max_latency_us, memory_order_relaxed);
+	while (lat > old) {
+		if (atomic_compare_exchange_weak_explicit(&g_interval_max_latency_us, &old, lat,
+				memory_order_relaxed, memory_order_relaxed))
+			break;
+	}
+}
+
+static void init_stress_cpus(void)
+{
+	cpu_set_t set;
+	int capacity = MAX_STRESS_CPUS;
+
+	g_stress_cpus = malloc(capacity * sizeof(int));
+	if (!g_stress_cpus)
+		ksft_exit_fail_msg("malloc failed for g_stress_cpus\n");
+
+	if (sched_getaffinity(0, sizeof(set), &set) < 0)
+		ksft_exit_fail_msg("sched_getaffinity failed\n");
+
+	for (int i = 0; i < CPU_SETSIZE && g_ncpus_stress < capacity; i++) {
+		if (CPU_ISSET(i, &set))
+			g_stress_cpus[g_ncpus_stress++] = i;
+	}
+
+	if (g_ncpus_stress == 0)
+		ksft_exit_skip("No CPUs available for stress test\n");
+
+	ksft_print_msg("Stressing %d CPUs discovered via affinity\n", g_ncpus_stress);
+}
+
+/* -- rseq / membarrier helpers -- */
+static int rseq_register_thread(void)
+{
+	int r = syscall(SYS_rseq, &tls_rseq, sizeof(tls_rseq), 0, RSEQ_SIG);
+
+	return (r == 0 || errno == EBUSY || errno == EINVAL) ? 0 : -1;
+}
+
+static int rseq_register_thread_at(struct rseq_abi *rseq)
+{
+	int r = syscall(SYS_rseq, rseq, sizeof(*rseq), 0, RSEQ_SIG);
+
+	return (r == 0 || errno == EBUSY || errno == EINVAL) ? 0 : -1;
+}
+
+static int membarrier_register_rseq_mm(void)
+{
+	return syscall(SYS_membarrier,
+			MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);
+}
+
+/* -- cgroup helpers -- */
+static void rm_cgroup_recursive(const char *path)
+{
+	DIR *dir = opendir(path);
+
+	if (!dir)
+		return;
+	struct dirent *entry;
+
+	while ((entry = readdir(dir)) != NULL) {
+		if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+			continue;
+		if (entry->d_type == DT_DIR) {
+			char sub_path[4096];
+
+			snprintf(sub_path, sizeof(sub_path), "%s/%s", path, entry->d_name);
+			rm_cgroup_recursive(sub_path);
+		}
+	}
+	closedir(dir);
+	rmdir(path);
+}
+
+static void cgroup_teardown(void);
+
+static int cgroup_setup(void)
+{
+	struct stat st;
+
+	if (stat("/sys/fs/cgroup/cpu", &st) == 0) {
+		use_cgroup_v2 = 0;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+				"/sys/fs/cgroup/cpu/membarrier_stress_test");
+	} else if (stat("/dev/cgroup/cpu", &st) == 0) {
+		use_cgroup_v2 = 0;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+				"/dev/cgroup/cpu/membarrier_stress_test");
+	} else if (stat("/cgroup/cpu", &st) == 0) {
+		use_cgroup_v2 = 0;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+				"/cgroup/cpu/membarrier_stress_test");
+	} else if (stat("/sys/fs/cgroup/cgroup.controllers", &st) == 0) {
+		use_cgroup_v2 = 1;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+				"/sys/fs/cgroup/membarrier_stress_test");
+	} else {
+		ksft_print_msg("WARN: cgroup mount not found. Using v2 at /sys/fs/cgroup\n");
+		use_cgroup_v2 = 1;
+		snprintf(g_cgroup_path, sizeof(g_cgroup_path),
+				"/sys/fs/cgroup/membarrier_stress_test");
+	}
+
+	/* Robust cleanup before setup */
+	cgroup_teardown();
+
+	if (use_cgroup_v2) {
+		/* Enable cpu controller in root cgroup */
+		if (write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpu") < 0)
+			ksft_print_msg("WARN: failed to enable cpu controller in /sys/fs/cgroup\n");
+	}
+
+	if (mkdir(g_cgroup_path, 0755) < 0 && errno != EEXIST) {
+		ksft_print_msg("mkdir base %s failed: %s\n", g_cgroup_path, strerror(errno));
+		return -1;
+	}
+
+	if (use_cgroup_v2) {
+		char ctrl_path[4096];
+
+		snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.subtree_control", g_cgroup_path);
+		if (write_file(ctrl_path, "+cpu") < 0)
+			ksft_print_msg("WARN: failed to enable cpu controller in %s\n",
+					g_cgroup_path);
+	}
+
+	for (int i = 0; i < N_SIBLINGS; i++) {
+		char sibling_path[4096];
+
+		snprintf(sibling_path, sizeof(sibling_path), "%s/n%d", g_cgroup_path, i);
+		if (mkdir(sibling_path, 0755) < 0 && errno != EEXIST) {
+			ksft_print_msg("mkdir wide %s failed: %s\n", sibling_path, strerror(errno));
+			return -1;
+		}
+
+		if (use_cgroup_v2) {
+			char ctrl_path[4096];
+
+			snprintf(ctrl_path, sizeof(ctrl_path),
+					"%s/cgroup.subtree_control", sibling_path);
+			if (write_file(ctrl_path, "+cpu") < 0)
+				ksft_print_msg("WARN: failed to enable cpu controller in %s\n",
+						sibling_path);
+		}
+
+		char current_path[4096];
+
+		snprintf(current_path, sizeof(current_path), "%s", sibling_path);
+		for (int j = 0; j < NEST_DEPTH; j++) {
+			snprintf(current_path + strlen(current_path),
+					sizeof(current_path) - strlen(current_path), "/d%d", j);
+			if (mkdir(current_path, 0755) < 0 && errno != EEXIST) {
+				ksft_print_msg("mkdir deep %s failed: %s\n",
+						current_path, strerror(errno));
+				return -1;
+			}
+
+			/* Enable for all but the leaf */
+			if (use_cgroup_v2 && j < NEST_DEPTH - 1) {
+				char ctrl_path[4096];
+
+				snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.subtree_control",
+						current_path);
+				if (write_file(ctrl_path, "+cpu") < 0)
+					ksft_print_msg("WARN: cannot enable cpu controller in %s\n",
+							current_path);
+			}
+		}
+	}
+
+	char quota[64], period[64], max_str[128];
+
+	snprintf(quota, sizeof(quota), "%d", CFS_QUOTA_US);
+	snprintf(period, sizeof(period), "%d", CFS_PERIOD_US);
+	snprintf(max_str, sizeof(max_str), "%d %d", CFS_QUOTA_US, CFS_PERIOD_US);
+
+	if (use_cgroup_v2) {
+		char max_path[4096];
+
+		snprintf(max_path, sizeof(max_path), "%s/cpu.max", g_cgroup_path);
+		if (write_file(max_path, max_str) < 0) {
+			ksft_print_msg("ERROR: cannot write cpu.max at %s\n", max_path);
+			return -1;
+		}
+		ksft_print_msg("cgroup (v2) %s: cpu.max=%s\n", g_cgroup_path, max_str);
+	} else {
+		char quota_path[4096], period_path[4096];
+
+		snprintf(quota_path, sizeof(quota_path), "%s/cpu.cfs_quota_us", g_cgroup_path);
+		snprintf(period_path, sizeof(period_path), "%s/cpu.cfs_period_us", g_cgroup_path);
+
+		if (write_file(period_path, period) < 0) {
+			ksft_print_msg("ERROR: cannot write cpu.cfs_period_us at %s\n",
+					period_path);
+			return -1;
+		}
+		if (write_file(quota_path, quota) < 0) {
+			ksft_print_msg("ERROR: cannot write cpu.cfs_quota_us at %s\n", quota_path);
+			return -1;
+		}
+		ksft_print_msg("cgroup (v1) %s: cpu.cfs_quota_us=%d cpu.cfs_period_us=%d\n",
+				g_cgroup_path, CFS_QUOTA_US, CFS_PERIOD_US);
+	}
+
+	return 0;
+}
+
+static int cgroup_add_pid_to_path(pid_t pid, const char *path)
+{
+	char buf[32], file_path[4096];
+
+	snprintf(buf, sizeof(buf), "%d", (int)pid);
+	if (use_cgroup_v2) {
+		snprintf(file_path, sizeof(file_path), "%s/cgroup.procs", path);
+		return write_file(file_path, buf);
+	}
+	/* In v1, try tasks first, fallback to cgroup.procs */
+	snprintf(file_path, sizeof(file_path), "%s/tasks", path);
+	int r = write_file(file_path, buf);
+
+	if (r < 0) {
+		snprintf(file_path, sizeof(file_path), "%s/cgroup.procs", path);
+		r = write_file(file_path, buf);
+	}
+	return r;
+}
+
+static void cgroup_teardown(void)
+{
+	rm_cgroup_recursive(g_cgroup_path);
+}
+
+static void cgroup_unthrottle(void)
+{
+	if (use_cgroup_v2) {
+		char max_path[4096];
+
+		snprintf(max_path, sizeof(max_path), "%s/cpu.max", g_cgroup_path);
+		write_file(max_path, "max");
+	} else {
+		char quota_path[4096];
+
+		snprintf(quota_path, sizeof(quota_path), "%s/cpu.cfs_quota_us", g_cgroup_path);
+		write_file(quota_path, "-1");
+	}
+}
+
+/* -- CPU burner (inside throttled child process) -- */
+static void *burner_thread_fn(void *arg)
+{
+	struct rseq_abi my_rseq;
+	int cpu = (int)(uintptr_t)arg;
+
+	memset(&my_rseq, 0, sizeof(my_rseq));
+	my_rseq.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+
+	if (rseq_register_thread_at(&my_rseq) < 0) {
+		perror("rseq_register (burner)");
+		return NULL;
+	}
+
+	cpu_set_t set;
+
+	CPU_ZERO(&set);
+	CPU_SET(cpu, &set);
+	if (sched_setaffinity(0, sizeof(set), &set) < 0)
+		perror("sched_setaffinity (burner)");
+
+	unsigned long sink = 0;
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		sink++;
+		/* Prevent compiler from optimizing the loop away */
+		asm volatile("" : "+g"(sink));
+	}
+
+	return NULL;
+}
+
+static int burner_thread_fn_wrapper(void *arg)
+{
+	burner_thread_fn(arg);
+	return 0;
+}
+
+static int leaf_child_fn(void *arg)
+{
+	int i = (int)(uintptr_t)arg;
+	int total_burners = g_ncpus_stress * N_BURNER_PER_CPU;
+	int n_threads_per_leaf = total_burners / N_SIBLINGS;
+
+	if (i < (total_burners % N_SIBLINGS))
+		n_threads_per_leaf++;
+
+	prctl(PR_SET_PDEATHSIG, SIGTERM);
+	if (getppid() == 1)
+		_exit(1);
+
+	char leaf_path[4096];
+
+	snprintf(leaf_path, sizeof(leaf_path), "%s/n%d", g_cgroup_path, i);
+	for (int j = 0; j < NEST_DEPTH; j++)
+		snprintf(leaf_path + strlen(leaf_path),
+				sizeof(leaf_path) - strlen(leaf_path), "/d%d", j);
+
+	int r = cgroup_add_pid_to_path(getpid(), leaf_path);
+
+	if (r < 0) {
+		char buf[512];
+		int len = snprintf(buf, sizeof(buf),
+				"[leaf child %d] failed to join cgroup %s: err %d\n",
+				i, leaf_path, -r);
+		(void)!write(2, buf, len);
+		_exit(1);
+	}
+
+	for (int j = 0; j < n_threads_per_leaf; j++) {
+		int cpu = g_stress_cpus[(i * n_threads_per_leaf + j) % g_ncpus_stress];
+
+		/* Allocate stack via mmap (bypasses heap) */
+		size_t stack_size = 64 * 1024;
+		void *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (stack == MAP_FAILED) {
+			const char *msg = "mmap stack failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+
+		/* Use raw clone to create a thread sharing the VM and thread group */
+		pid_t pid = clone(burner_thread_fn_wrapper, stack + stack_size,
+				CLONE_VM | CLONE_THREAD | CLONE_SIGHAND,
+				(void *)(uintptr_t)cpu);
+		if (pid < 0) {
+			const char *msg = "clone burner failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+	}
+
+	// Wait for SIGTERM
+	sigset_t mask;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGTERM);
+	int sig;
+
+	sigwait(&mask, &sig);
+
+	_exit(0);
+}
+
+struct leaf_info {
+	pid_t pid;
+	void *stack;
+};
+
+static int run_throttle_child(void *arg)
+{
+	(void)arg;
+	prctl(PR_SET_PDEATHSIG, SIGTERM);
+	if (getppid() == 1)
+		_exit(1);
+
+	int n_leafs = N_SIBLINGS;
+
+	/* Block signals before spawning to avoid missing early failures */
+	sigset_t mask;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGTERM);
+	sigaddset(&mask, SIGCHLD);
+	sigprocmask(SIG_BLOCK, &mask, NULL);
+
+	/* Use mmap for tracking structures to avoid glibc heap usage */
+	struct leaf_info *leaves = mmap(NULL, n_leafs * sizeof(struct leaf_info),
+			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (leaves == MAP_FAILED) {
+		const char *msg = "mmap leaves array failed\n";
+		(void)!write(2, msg, strlen(msg));
+		_exit(1);
+	}
+
+	for (int i = 0; i < n_leafs; i++) {
+		size_t stack_size = 64 * 1024;
+		void *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (stack == MAP_FAILED) {
+			const char *msg = "mmap leaf stack failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+
+		leaves[i].stack = stack;
+
+		pid_t pid = clone(leaf_child_fn, stack + stack_size,
+				CLONE_VM | SIGCHLD, (void *)(uintptr_t)i);
+
+		if (pid < 0) {
+			const char *msg = "clone (leaf child) failed\n";
+			(void)!write(2, msg, strlen(msg));
+
+			/* Clean up successfully spawned children */
+			for (int j = 0; j < i; j++) {
+				kill(leaves[j].pid, SIGTERM);
+				waitpid(leaves[j].pid, NULL, 0);
+				munmap(leaves[j].stack, stack_size);
+			}
+			munmap(leaves, n_leafs * sizeof(struct leaf_info));
+
+			if (errno == EAGAIN)
+				_exit(4);
+			else
+				_exit(1);
+		}
+		leaves[i].pid = pid;
+	}
+
+	int failed = 0;
+
+	while (1) {
+		int sig;
+
+		sigwait(&mask, &sig);
+
+		if (sig == SIGTERM) {
+			break;
+		} else if (sig == SIGCHLD) {
+			int status;
+			pid_t pid;
+
+			// Reap all dead children
+			while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
+				for (int i = 0; i < n_leafs; i++) {
+					if (leaves[i].pid == pid) {
+						leaves[i].pid = 0;
+						break;
+					}
+				}
+				if ((WIFEXITED(status) && WEXITSTATUS(status) != 0) ||
+						WIFSIGNALED(status)) {
+					char buf[128];
+					int len = snprintf(buf, sizeof(buf),
+							"[manager] child %d died unexpectedly (status %d)\n",
+							pid, WEXITSTATUS(status));
+					(void)!write(2, buf, len);
+					failed = 1;
+				}
+			}
+			if (failed)
+				break;
+		}
+	}
+
+	// Terminate all leaf kids
+	for (int i = 0; i < n_leafs; i++) {
+		if (leaves[i].pid > 0)
+			kill(leaves[i].pid, SIGTERM);
+	}
+
+	for (int i = 0; i < n_leafs; i++) {
+		if (leaves[i].pid > 0)
+			waitpid(leaves[i].pid, NULL, 0);
+		munmap(leaves[i].stack, 64 * 1024);
+	}
+
+	munmap(leaves, n_leafs * sizeof(struct leaf_info));
+
+	_exit(failed ? 1 : 0);
+}
+
+/* -- Membarrier hammer thread -- */
+static void *hammer_thread_fn(void *arg)
+{
+	int target_cpu = *(int *)arg;
+	long local_ok = 0;
+	long local_err = 0;
+	int count = 0;
+	const int batch_size = 1024;
+
+	if (rseq_register_thread() < 0) {
+		ksft_print_msg("[hammer] rseq_register failed: %s\n", strerror(errno));
+		return NULL;
+	}
+
+	membarrier_register_rseq_mm();
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		int r = syscall(SYS_membarrier,
+				MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
+				MEMBARRIER_CMD_FLAG_CPU,
+				target_cpu);
+		if (__builtin_expect(r == 0, 1))
+			local_ok++;
+		else
+			local_err++;
+
+		count++;
+		if (__builtin_expect(count >= batch_size, 0)) {
+			atomic_fetch_add_explicit(&g_mb_ok, local_ok, memory_order_relaxed);
+			atomic_fetch_add_explicit(&g_mb_err, local_err, memory_order_relaxed);
+			local_ok = 0;
+			local_err = 0;
+			count = 0;
+		}
+	}
+
+	/* Flush any remaining counts on exit */
+	if (local_ok > 0)
+		atomic_fetch_add_explicit(&g_mb_ok, local_ok, memory_order_relaxed);
+	if (local_err > 0)
+		atomic_fetch_add_explicit(&g_mb_err, local_err, memory_order_relaxed);
+
+	return NULL;
+}
+
+/* -- Latency sentinel -- */
+static void *sentinel_thread_fn(void *arg)
+{
+	(void)arg;
+	struct sched_param sp = { .sched_priority = 20 };
+
+	if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
+		ksft_print_msg("WARN: no SCHED_FIFO for sentinel (less precise)\n");
+
+	while (!atomic_load_explicit(&g_test_ready, memory_order_relaxed) &&
+			!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		struct timespec ts = {0, 1000 * 1000}; /* 1ms */
+
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
+	}
+
+	uint64_t prev = monotonic_us();
+
+	while (!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		struct timespec ts = {
+			.tv_sec = 0,
+			.tv_nsec = SENTINEL_INTERVAL_US * 1000L,
+		};
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
+
+		uint64_t now = monotonic_us();
+		long latency_us = (long)(now - prev) - SENTINEL_INTERVAL_US;
+
+		prev = now;
+
+		if (latency_us <= 0)
+			continue;
+
+		update_max_latency(latency_us);
+
+		if (latency_us > LATENCY_CRITICAL_MS * 1000L) {
+			ksft_print_msg("\n[SENTINEL] CRITICAL: %ld ms delay (lockup precursor!)\n",
+					latency_us / 1000);
+		} else if (latency_us > LATENCY_WARN_MS * 1000L) {
+			ksft_print_msg("\n[SENTINEL] WARN: %ld ms latency spike\n",
+					latency_us / 1000);
+		}
+	}
+	return NULL;
+}
+
+/* -- Progress reporter -- */
+static void *reporter_thread_fn(void *arg)
+{
+	(void)arg;
+	int elapsed = 0;
+
+	while (!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		for (int i = 0; i < 5; i++) {
+			sleep(1);
+			if (atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed))
+				break;
+		}
+		if (atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed))
+			break;
+		elapsed += 5;
+		long interval_max = atomic_exchange_explicit(&g_interval_max_latency_us,
+				0, memory_order_relaxed);
+
+		ksft_print_msg("[%3ds] mb: ok=%-10ld err=%-8ld | max_lat=%ld us\n",
+				elapsed,
+				atomic_load(&g_mb_ok),
+				atomic_load(&g_mb_err),
+				interval_max);
+	}
+	return NULL;
+}
+
+/* -- Main -- */
+int main(void)
+{
+	ksft_print_header();
+#ifdef UNSUPPORTED_ARCH
+	ksft_exit_skip("Unsupported architecture\n");
+#endif
+	ksft_set_plan(1);
+
+	if (geteuid() != 0)
+		ksft_exit_skip("Must run as root (cgroup + SCHED_FIFO)\n");
+
+	init_stress_cpus();
+
+	ksft_print_msg("=== membarrier rseq + CFS unthrottle stress ===\n");
+	ksft_print_msg("Stressing CPUs: %d\n", g_ncpus_stress);
+	ksft_print_msg("Quota: %d/%d us (~%d unthrottles/sec/CPU)\n",
+			CFS_QUOTA_US, CFS_PERIOD_US,
+			1000000 / CFS_PERIOD_US);
+	ksft_print_msg("Hammer threads: %d per CPU (%d total)\n",
+			N_HAMMER_PER_CPU, g_ncpus_stress * N_HAMMER_PER_CPU);
+	ksft_print_msg("Duration: %d seconds\n\n", TEST_DURATION_SEC);
+
+	if (cgroup_setup() < 0) {
+		cgroup_teardown();
+		ksft_exit_skip("cgroup_setup failed (missing permissions or v2 ctrls?)\n");
+	}
+
+	if (rseq_register_thread() < 0) {
+		ksft_print_msg("rseq_register (%s) failed: %s\n", __func__, strerror(errno));
+		cgroup_teardown();
+		ksft_exit_skip("rseq syscall failed or not available\n");
+	}
+	if (membarrier_register_rseq_mm() < 0) {
+		ksft_print_msg("MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ: %s\n"
+				"Kernel >= 5.10 with CONFIG_RSEQ required.\n",
+				strerror(errno));
+		cgroup_teardown();
+		ksft_exit_skip("membarrier register failed\n");
+	}
+	ksft_print_msg("rseq membarrier registered OK\n");
+
+	sigset_t sigmask;
+
+	sigemptyset(&sigmask);
+	sigaddset(&sigmask, SIGTERM);
+	sigprocmask(SIG_BLOCK, &sigmask, NULL);
+
+	void *stack = malloc(1024 * 1024);
+
+	if (!stack) {
+		perror("malloc stack");
+		cgroup_teardown();
+		ksft_exit_fail_msg("Malloc stack failed\n");
+	}
+	pid_t child = clone(run_throttle_child, stack + 1024 * 1024, CLONE_VM | SIGCHLD, NULL);
+
+	if (child < 0) {
+		perror("clone");
+		cgroup_teardown();
+		ksft_exit_fail_msg("Clone failed\n");
+	}
+
+	sigprocmask(SIG_UNBLOCK, &sigmask, NULL);
+	ksft_print_msg("Throttle child PID %d started\n", child);
+
+	int n_threads = g_ncpus_stress * N_HAMMER_PER_CPU + 2;
+	pthread_t *threads = (pthread_t *)calloc(n_threads, sizeof(pthread_t));
+	int *cpuargs = (int *)calloc(g_ncpus_stress * N_HAMMER_PER_CPU, sizeof(int));
+
+	if (!threads || !cpuargs) {
+		perror("calloc");
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		ksft_exit_fail_msg("Thread allocation failed\n");
+	}
+
+	int ti = 0, ai = 0;
+	int r;
+
+	ksft_print_msg("Creating sentinel thread...\n");
+	r = pthread_create(&threads[ti], NULL, sentinel_thread_fn, NULL);
+	if (r != 0) {
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		free(threads);
+		free(cpuargs);
+		free(g_stress_cpus);
+		ksft_exit_fail_msg("pthread_create (sentinel) failed: %s\n", strerror(r));
+	}
+	ti++;
+
+	ksft_print_msg("Creating reporter thread...\n");
+	r = pthread_create(&threads[ti], NULL, reporter_thread_fn, NULL);
+	if (r != 0) {
+		atomic_store(&g_stop_sentinel, 1);
+		pthread_join(threads[0], NULL);
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		free(threads);
+		free(cpuargs);
+		free(g_stress_cpus);
+		ksft_exit_fail_msg("pthread_create (reporter) failed: %s\n", strerror(r));
+	}
+	ti++;
+
+	ksft_print_msg("Creating %d hammer threads...\n", g_ncpus_stress * N_HAMMER_PER_CPU);
+	for (int i = 0; i < g_ncpus_stress; i++) {
+		int cpu = g_stress_cpus[i];
+
+		for (int j = 0; j < N_HAMMER_PER_CPU; j++) {
+			cpuargs[ai] = cpu;
+			r = pthread_create(&threads[ti], NULL, hammer_thread_fn, &cpuargs[ai]);
+			if (r != 0) {
+				ksft_print_msg("pthread_create failed at thread %d: %s\n",
+						ti, strerror(r));
+
+				atomic_store(&g_stop_sentinel, 1);
+				pthread_join(threads[0], NULL);
+				pthread_join(threads[1], NULL);
+
+				atomic_store(&g_stop, 1);
+				for (int k = 2; k < ti; k++)
+					pthread_join(threads[k], NULL);
+
+				kill(child, SIGTERM);
+				waitpid(child, NULL, 0);
+				cgroup_teardown();
+
+				free(threads);
+				free(cpuargs);
+				free(g_stress_cpus);
+
+				if (r == EAGAIN)
+					ksft_exit_skip("Resource limits prevent threads\n");
+				else
+					ksft_exit_fail_msg("Failed to create hammer thread\n");
+			}
+			ti++;
+			ai++;
+		}
+	}
+
+	ksft_print_msg("All threads running. Tip: monitor dmesg for lockups\n\n");
+
+	atomic_store_explicit(&g_test_ready, 1, memory_order_relaxed);
+	int child_failed = 0;
+	int child_status = 0;
+
+	for (int i = 0; i < TEST_DURATION_SEC; i++) {
+		sleep(1);
+		int r = waitpid(child, &child_status, WNOHANG);
+
+		if (r == child) {
+			child_failed = 1;
+			break;
+		}
+	}
+
+	atomic_store(&g_stop_sentinel, 1);
+	pthread_join(threads[0], NULL);
+	pthread_join(threads[1], NULL);
+
+	atomic_store(&g_stop, 1);
+
+	/* Unthrottle to allow children to exit quickly */
+	cgroup_unthrottle();
+
+	if (!child_failed) {
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+	}
+	for (int i = 2; i < ti; i++)
+		pthread_join(threads[i], NULL);
+
+	long max_lat = atomic_load(&g_max_latency_us);
+	long total_ok = atomic_load(&g_mb_ok);
+	long total_err = atomic_load(&g_mb_err);
+
+	ksft_print_msg("\n=== RESULTS ===\n");
+	ksft_print_msg("membarrier syscalls : %ld ok %ld errors\n", total_ok, total_err);
+	ksft_print_msg("Max scheduler latency: %ld us (%ld ms)\n", max_lat, max_lat / 1000);
+	cgroup_teardown();
+	free(threads);
+	free(cpuargs);
+	free(g_stress_cpus);
+
+	if (child_failed) {
+		if (WIFEXITED(child_status) && WEXITSTATUS(child_status) == 4)
+			ksft_exit_skip("Manager child skipped (resource limits?)\n");
+		ksft_test_result_fail("membarrier_rseq_stress: Manager child died early\n");
+		ksft_exit_fail();
+	} else if (total_ok == 0) {
+		ksft_test_result_fail("membarrier_rseq_stress: No successful membarrier calls\n");
+		ksft_exit_fail();
+	} else if (total_err > 0) {
+		ksft_test_result_fail("membarrier_rseq_stress: syscall errors\n");
+		ksft_exit_fail();
+	} else if (max_lat > LATENCY_CRITICAL_MS * 1000L) {
+		ksft_test_result_fail("membarrier_rseq_stress: LOCKUP PRECURSOR\n");
+		ksft_exit_fail();
+	} else if (max_lat > LATENCY_WARN_MS * 1000L) {
+		ksft_test_result_fail("membarrier_rseq_stress: significant latency spike\n");
+		ksft_exit_fail();
+	} else {
+		ksft_test_result_pass("membarrier_rseq_stress\n");
+		ksft_exit_pass();
+	}
+
+	return 0;
+}
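
The sentinel in the patch measures how far each 500 us sleep overshoots, keeps a lock-free running maximum of that overshoot, and compares it with the 50 ms and 200 ms thresholds. The sketch below isolates that logic outside the test's threads, so it is an illustration rather than code from the commit. The names `overshoot_us`, `classify_overshoot`, `record_max` and `enum spike_level` are hypothetical; only the three threshold values are copied from the patch's `#define`s.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Values copied from the patch; every other name here is illustrative. */
#define SENTINEL_INTERVAL_US 500
#define LATENCY_WARN_MS      50
#define LATENCY_CRITICAL_MS  200

enum spike_level { SPIKE_NONE, SPIKE_WARN, SPIKE_CRITICAL };

/* Overshoot of one sleep: elapsed time minus the requested interval. */
static long overshoot_us(uint64_t prev_us, uint64_t now_us)
{
	assert(now_us >= prev_us); /* timestamps are assumed to be monotonic */
	return (long)(now_us - prev_us) - SENTINEL_INTERVAL_US;
}

/* Map an overshoot to a severity level using strict '>' comparisons. */
static enum spike_level classify_overshoot(long latency_us)
{
	if (latency_us > LATENCY_CRITICAL_MS * 1000L)
		return SPIKE_CRITICAL;
	if (latency_us > LATENCY_WARN_MS * 1000L)
		return SPIKE_WARN;
	return SPIKE_NONE;
}

/* Lock-free running maximum built on a compare-and-swap loop. */
static void record_max(atomic_long *max, long sample)
{
	long old = atomic_load_explicit(max, memory_order_relaxed);

	/*
	 * A failed CAS reloads 'old' with the current value, so the loop
	 * stops as soon as another thread has stored something >= sample.
	 */
	while (sample > old &&
	       !atomic_compare_exchange_weak_explicit(max, &old, sample,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
}
```

Relaxed ordering should be sufficient in this sketch because the maximum is a standalone statistic that publishes no other data. A caller would take two monotonic timestamps around a sleep, pass them to `overshoot_us`, skip non-positive results, and feed the remainder to `record_max` and `classify_overshoot`.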