From nobody Sun Jun 14 04:11:30 2026 Received: from mail-dy1-f202.google.com (mail-dy1-f202.google.com [74.125.82.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 64AF037CD20 for ; Sun, 3 May 2026 21:23:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=74.125.82.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777843391; cv=none; b=kfSaDNF+7sNq5/Gk22XVd1mnXx5IUDc/1Uc1s4Ry05M7KfDfaiNjRLA3Y2IKSeURPdZBy/ONVFw+Tje2Wip9ZiJN+F6Ye++2lz4sMUv+zG/SOb3pqGbka5c1eSbWANBozOc5cn010ADPJjfRRNstC5IcvgqVZXQHN0WANh2Qas8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777843391; c=relaxed/simple; bh=tcfwiHfBmdPlDswQOFCg9Ur4ZjaLagCYDx1ehhgDEQk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=OFy7UwSR/0vZAlZvgQGHl1UuvGC0N4OIAxnLpW+hYZKATIU5A9iYkEVN8XxzZ/rEB3leWD9AnwWV0UMHW9pZ6Dp1hHR/wGgShXIdI44MvDiZ2fQb1Grs+RKitil6Q79xt9TWVRscmPKdULviM6ub7Wo/5B2p6fMnY5oZ8XCweGs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--aniketgattani.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=p7i0qwuy; arc=none smtp.client-ip=74.125.82.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--aniketgattani.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="p7i0qwuy" Received: by mail-dy1-f202.google.com with SMTP id 5a478bee46e88-2ba8013a9e3so5651090eec.0 for ; Sun, 03 May 2026 14:23:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1777843389; 
x=1778448189; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=dHo3S2sC3hPKYGBbJhzMJdbsQdshtIhZZObsDHnBi1U=; b=p7i0qwuy3JnZnNzgmJksfIiMNm/QjtbdUSpH0iDwjg2t4BriTrRXeQKV4js0MkOIRB FNGJaS0pb/JvBs2FSp+t1bQPymGRIQ8NEzi1GSDBhO5tGvfYZFrJT8tnf4KQoFAKtP7S EwOnRG1Nwv27o3hzb180zRmNFqt+Qx+h9qXzDHkvaz8hxvi1BQ5E+pedtySAMTC+K4nu Z+WmqBzhDr0xi9R68zMf95GX6TGBA8eCO+lwKCI9vEhx1M7OiYGesjXPIprauNFE+KYF rNdT817Udg9dmpZBhQ3jbyJqSb9WRYqvPDsQbRphrWmyf4PbZZFojEo9C0PBnKmzZKgs clDw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777843389; x=1778448189; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=dHo3S2sC3hPKYGBbJhzMJdbsQdshtIhZZObsDHnBi1U=; b=NntbpmEjPi10iYK2Kscr8M+narri9+VzK92QXvS2mliOhHln+r27Gzff/FLzWWyyMw L48WP4NUAEl2vpFn6LUcLcVMpoXumqJiZrQ/O84vbXffvLGr7BLNH92OzANkUFDReFjz nhjdiLziMPSQbdhVvs6xYBQtcdRSiJOHAntHa2XZRvaEjrR0I5n5agb6gVGdijRPYJMp NmVR4Rz9bfLN3TLHrXXd24oCt/xPtJL2MgkLFBBKQ7hSORkQX5E6S+npUZ2H5ijXvLZo aVUHvxoGH4SS3yQgR6J2UNiJxb5cYRdknJkVXfgoK0f4bbhqDtTO0lc/Y2gk8zs5EEnX ZVVA== X-Forwarded-Encrypted: i=1; AFNElJ9QUD52nqgUaqYMEgr6A7JkdL8eJE0m4sXXpFQ+iiR3C36fDXQMAwmk8Kbhijm3qzldi0QZ1PPR1aOnhSY=@vger.kernel.org X-Gm-Message-State: AOJu0YwGywPUJwnYcwWywaCNXj0WTGzQB1QCot/IlmAOHIsOpOHTAMIw 7ZIEHUe7KKkBqxTnsSPk7jfLnWo4bS210wX7o8rc6nSr9a2PZR4y0sx9pH6bBzgaNSkDnWG0k39 SSfdwfUkWDj8QD/dAbKdz0yaR96j0phjgPQ== X-Received: from dybcr21.prod.google.com ([2002:a05:7300:ac95:b0:2e2:2088:dca7]) (user=aniketgattani job=prod-delivery.src-stubby-dispatcher) by 2002:a05:7300:cb17:b0:2ed:e14:7f54 with SMTP id 5a478bee46e88-2efba7a6c9dmr2712951eec.30.1777843389183; Sun, 03 May 2026 14:23:09 -0700 (PDT) Date: Sun, 3 May 2026 21:22:03 +0000 In-Reply-To: <20260503212205.3714217-1-aniketgattani@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: 
Mime-Version: 1.0
References: <20260503212205.3714217-1-aniketgattani@google.com>
X-Mailer: git-send-email 2.54.0.545.g6539524ca2-goog
Message-ID: <20260503212205.3714217-2-aniketgattani@google.com>
Subject: [PATCH v3 1/3] sched/membarrier: Use per-CPU mutexes for targeted commands
From: Aniket Gattani
To: Mathieu Desnoyers, "Paul E . McKenney"
Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, Josh Don, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Aniket Gattani
Content-Type: text/plain; charset="utf-8"

Currently, the membarrier system call uses a single global mutex
(`membarrier_ipi_mutex`) to serialize expedited commands. This causes
significant contention on large systems when multiple threads invoke
membarrier concurrently, even if they target different CPUs.

This contention becomes critical when combined with CFS bandwidth
throttling/unthrottling, during which interrupts can be disabled for
relatively long periods on target CPUs. If membarrier is waiting for a
response from such a CPU, it holds the global mutex, blocking all other
membarrier calls on the system. This cascade effect can lead to hard
lockups when thousands of threads stall waiting for the mutex.

Optimize `MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ` when a specific CPU is
targeted by introducing per-CPU mutexes. Broadcast commands and commands
without a specific CPU target continue to use the global mutex.
This prevents the cascade lockup scenario.

As measured by the stress test introduced in the subsequent patch, on an
AMD Turin machine with 384 CPUs (2 NUMA nodes with SMT=2), this
optimization yields 200x more throughput.

Changes in v3:
- Fixed the code path when `cpu_id < 0` in membarrier_private_expedited
  as reported by Peter.

Link: https://lore.kernel.org/lkml/20260428124823.GY3102624@noisy.programming.kicks-ass.net/
Signed-off-by: Aniket Gattani
---
 kernel/sched/membarrier.c | 77 ++++++++++++++++++++++-----------------
 1 file changed, 44 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 623445603725..3d88e900a17f 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -164,8 +164,26 @@
 	| MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK \
 	| MEMBARRIER_CMD_GET_REGISTRATIONS)
 
+/*
+ * Scoped guard for memory barriers on entry and exit.
+ * Matches memory barriers before & after rq->curr modification in scheduler.
+ */
+DEFINE_LOCK_GUARD_0(mb, smp_mb(), smp_mb())
 static DEFINE_MUTEX(membarrier_ipi_mutex);
+static DEFINE_PER_CPU(struct mutex, membarrier_cpu_mutexes);
+
 #define SERIALIZE_IPI() guard(mutex)(&membarrier_ipi_mutex)
+#define SERIALIZE_IPI_CPU(cpu_id) guard(mutex)(&per_cpu(membarrier_cpu_mutexes, cpu_id))
+
+static int __init membarrier_init(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		mutex_init(&per_cpu(membarrier_cpu_mutexes, i));
+	return 0;
+}
+core_initcall(membarrier_init);
 
 static void ipi_mb(void *info)
 {
@@ -315,7 +333,6 @@ static int membarrier_global_expedited(void)
 
 static int membarrier_private_expedited(int flags, int cpu_id)
 {
-	cpumask_var_t tmpmask;
 	struct mm_struct *mm = current->mm;
 	smp_call_func_t ipi_func = ipi_mb;
 
@@ -352,30 +369,45 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 	 * On RISC-V, this barrier pairing is also needed for the
 	 * SYNC_CORE command when switching between processes, cf.
 	 * the inline comments in membarrier_arch_switch_mm().
+	 *
+	 * Memory barrier on the caller thread _after_ we finished
+	 * waiting for the last IPI. Matches memory barriers before
+	 * rq->curr modification in scheduler.
 	 */
-	smp_mb(); /* system call entry is not a mb. */
-
-	if (cpu_id < 0 && !zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
-		return -ENOMEM;
-
-	SERIALIZE_IPI();
-	cpus_read_lock();
-
+	guard(mb)();
 	if (cpu_id >= 0) {
+		if (cpu_id >= nr_cpu_ids || !cpu_possible(cpu_id))
+			return 0;
+
+		SERIALIZE_IPI_CPU(cpu_id);
+		guard(cpus_read_lock)();
 		struct task_struct *p;
 
-		if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
-			goto out;
+		if (!cpu_online(cpu_id))
+			return 0;
+
 		rcu_read_lock();
 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
 		if (!p || p->mm != mm) {
 			rcu_read_unlock();
-			goto out;
+			return 0;
 		}
 		rcu_read_unlock();
+		/*
+		 * smp_call_function_single() will call ipi_func() if cpu_id
+		 * is the calling CPU.
+		 */
+		smp_call_function_single(cpu_id, ipi_func, NULL, 1);
 	} else {
+		cpumask_var_t __free(free_cpumask_var) tmpmask = CPUMASK_VAR_NULL;
 		int cpu;
 
+		if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
+			return -ENOMEM;
+
+		SERIALIZE_IPI();
+		guard(cpus_read_lock)();
+
 		rcu_read_lock();
 		for_each_online_cpu(cpu) {
 			struct task_struct *p;
@@ -385,15 +417,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 				__cpumask_set_cpu(cpu, tmpmask);
 		}
 		rcu_read_unlock();
-	}
-
-	if (cpu_id >= 0) {
-		/*
-		 * smp_call_function_single() will call ipi_func() if cpu_id
-		 * is the calling CPU.
-		 */
-		smp_call_function_single(cpu_id, ipi_func, NULL, 1);
-	} else {
 		/*
 		 * For regular membarrier, we can save a few cycles by
 		 * skipping the current cpu -- we're about to do smp_mb()
@@ -420,18 +443,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 		}
 	}
 
-out:
-	if (cpu_id < 0)
-		free_cpumask_var(tmpmask);
-	cpus_read_unlock();
-
-	/*
-	 * Memory barrier on the caller thread _after_ we finished
-	 * waiting for the last IPI. Matches memory barriers before
-	 * rq->curr modification in scheduler.
-	 */
-	smp_mb(); /* exit from system call is not a mb */
-
 	return 0;
 }
 
-- 
2.54.0.545.g6539524ca2-goog

From nobody Sun Jun 14 04:11:30 2026
Date: Sun, 3 May 2026 21:22:04 +0000
In-Reply-To: <20260503212205.3714217-1-aniketgattani@google.com>
References: <20260503212205.3714217-1-aniketgattani@google.com>
Message-ID: <20260503212205.3714217-3-aniketgattani@google.com>
Subject: [PATCH v3 2/3] sched/membarrier: Modernize membarrier_global_expedited with cleanup guards
From: Aniket Gattani
To: Mathieu Desnoyers, "Paul E . McKenney"
Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, Josh Don, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Aniket Gattani

Replace explicit lock/unlock and free calls with scoped guards and
automatic cleanup constructs.

Signed-off-by: Aniket Gattani
---
 kernel/sched/membarrier.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 3d88e900a17f..12b68a77630a 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -267,23 +267,19 @@ void membarrier_update_current_mm(struct mm_struct *next_mm)
 
 static int membarrier_global_expedited(void)
 {
+	cpumask_var_t __free(free_cpumask_var) tmpmask = CPUMASK_VAR_NULL;
 	int cpu;
-	cpumask_var_t tmpmask;
 
 	if (num_online_cpus() == 1)
 		return 0;
 
-	/*
-	 * Matches memory barriers after rq->curr modification in
-	 * scheduler.
-	 */
-	smp_mb(); /* system call entry is not a mb. */
-
 	if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
 		return -ENOMEM;
 
+	guard(mb)();
 	SERIALIZE_IPI();
-	cpus_read_lock();
+	guard(cpus_read_lock)();
+
 	rcu_read_lock();
 	for_each_online_cpu(cpu) {
 		struct task_struct *p;
@@ -319,15 +315,6 @@ static int membarrier_global_expedited(void)
 	smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
 	preempt_enable();
 
-	free_cpumask_var(tmpmask);
-	cpus_read_unlock();
-
-	/*
-	 * Memory barrier on the caller thread _after_ we finished
-	 * waiting for the last IPI. Matches memory barriers before
-	 * rq->curr modification in scheduler.
-	 */
-	smp_mb(); /* exit from system call is not a mb */
 	return 0;
 }
 
-- 
2.54.0.545.g6539524ca2-goog

From nobody Sun Jun 14 04:11:30 2026
Date: Sun, 3 May 2026 21:22:05 +0000
In-Reply-To: <20260503212205.3714217-1-aniketgattani@google.com>
References: <20260503212205.3714217-1-aniketgattani@google.com>
Message-ID: <20260503212205.3714217-4-aniketgattani@google.com>
Subject: [PATCH v3 3/3] selftests/membarrier: Add rseq stress test for CFS throttle interactions
From: Aniket Gattani
To: Mathieu Desnoyers, "Paul E . McKenney"
Cc: Peter Zijlstra, Ingo Molnar, Ben Segall, Josh Don, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, Aniket Gattani, kernel test robot

Add a new stress test to exercise the interaction between targeted
expedited membarrier commands and CFS bandwidth throttling.

The test creates a deep cgroup hierarchy and aggressively hammers the
membarrier syscall to expose lock contention and latency issues. This
serves as a reliable reproducer for the `membarrier_ipi_mutex` cascade
lockup, ensuring future changes to membarrier locking do not regress
targeted command latency.

Reported-by: kernel test robot
Closes: https://lore.kernel.org/r/202604151516.Vc7Ro4LP-lkp@intel.com/
Signed-off-by: Aniket Gattani
---
Changes in v3:
- None
---
 tools/testing/selftests/membarrier/Makefile   |   5 +-
 .../membarrier/membarrier_rseq_stress.c       | 951 ++++++++++++++++++
 2 files changed, 954 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/membarrier/membarrier_rseq_stress.c

diff --git a/tools/testing/selftests/membarrier/Makefile b/tools/testing/selftests/membarrier/Makefile
index fc840e06ff56..829f95c83515 100644
--- a/tools/testing/selftests/membarrier/Makefile
+++ b/tools/testing/selftests/membarrier/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0-only
-CFLAGS += -g $(KHDR_INCLUDES)
+CFLAGS += -g $(KHDR_INCLUDES) -pthread -I../../../../tools/include
 LDLIBS += -lpthread
 
 TEST_GEN_PROGS := membarrier_test_single_thread \
-	membarrier_test_multi_thread
+	membarrier_test_multi_thread \
+	membarrier_rseq_stress
 
 include ../lib.mk
diff --git a/tools/testing/selftests/membarrier/membarrier_rseq_stress.c b/tools/testing/selftests/membarrier/membarrier_rseq_stress.c
new file mode 100644
index 000000000000..c188d7498610
--- /dev/null
+++ b/tools/testing/selftests/membarrier/membarrier_rseq_stress.c
@@ -0,0 +1,951 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Membarrier stress test for CFS throttle interactions.
+ *
+ * Reproducer for the interaction between CFS throttle and expedited membarrier.
+ */ + +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +/* -- Architecture-specific rseq signature -- */ +#if defined(__x86_64__) || defined(__i386__) +# define RSEQ_SIG 0x53053053U +#elif defined(__aarch64__) +# define RSEQ_SIG 0xd428bc00U +#elif defined(__powerpc__) || defined(__powerpc64__) +# define RSEQ_SIG 0x0f000000U +#elif defined(__s390__) || defined(__s390x__) +# define RSEQ_SIG 0x0c000000U +#else +# define RSEQ_SIG 0 +# define UNSUPPORTED_ARCH 1 +#endif + +/* -- rseq ABI (kernel uapi; define locally for portability) -- */ +#define RSEQ_CPU_ID_UNINITIALIZED ((__u32)-1) + +#include + +struct rseq_abi { + __u32 cpu_id_start; + __u32 cpu_id; + __u64 rseq_cs; + __u32 flags; + __u32 node_id; + __u32 mm_cid; + char end[0]; +} __aligned(32); + +/* -- membarrier constants (not in all distro headers) -- */ +#ifndef MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ +# define MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ (1 << 7) +#endif +#ifndef MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ +# define MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ (1 << 8) +#endif +#ifndef MEMBARRIER_CMD_FLAG_CPU +# define MEMBARRIER_CMD_FLAG_CPU (1 << 0) +#endif + +/* -- Test parameters -- */ +#define N_SIBLINGS 2000 +#define NEST_DEPTH 5 +static char g_cgroup_path[4096]; +static int use_cgroup_v2; + +#define CFS_QUOTA_US 1000 +#define CFS_PERIOD_US 5000 +#define N_HAMMER_PER_CPU 25 +#define N_BURNER_PER_CPU 50 +#define MAX_STRESS_CPUS 1024 +#define TEST_DURATION_SEC 20 + +/* Latency thresholds for the sentinel */ +#define LATENCY_WARN_MS 50 +#define LATENCY_CRITICAL_MS 200 + +/* Sentinel sampling interval */ +#define SENTINEL_INTERVAL_US 500 + +/* -- Shared globals -- */ +static atomic_int g_stop; +static atomic_int g_stop_sentinel; +static atomic_long g_max_latency_us; +static 
atomic_long g_interval_max_latency_us; +static atomic_long g_mb_ok; +static atomic_long g_mb_err; +static int g_ncpus_stress; +static int *g_stress_cpus; + +static atomic_int g_test_ready; + +/* Per-thread rseq ABI block registered with the kernel */ +static __thread struct rseq_abi tls_rseq + __attribute__((tls_model("initial-exec"))) __aligned(32) =3D { + .cpu_id =3D RSEQ_CPU_ID_UNINITIALIZED, +}; + +/* -- Utility -- */ +static int write_file(const char *path, const char *val) +{ + int fd =3D open(path, O_WRONLY | O_CLOEXEC); + + if (fd < 0) + return -errno; + + size_t len =3D strlen(val); + ssize_t r =3D write(fd, val, len); + + close(fd); + if (r < 0) + return -errno; + if ((size_t)r !=3D len) + return -EIO; + return 0; +} + +static uint64_t monotonic_us(void) +{ + struct timespec ts; + + clock_gettime(CLOCK_MONOTONIC, &ts); + return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000ULL; +} + +static void update_max_latency(long lat) +{ + long old =3D atomic_load_explicit(&g_max_latency_us, memory_order_relaxed= ); + + while (lat > old) { + if (atomic_compare_exchange_weak_explicit(&g_max_latency_us, &old, lat, + memory_order_relaxed, memory_order_relaxed)) + break; + } + + old =3D atomic_load_explicit(&g_interval_max_latency_us, memory_order_rel= axed); + while (lat > old) { + if (atomic_compare_exchange_weak_explicit(&g_interval_max_latency_us, &o= ld, lat, + memory_order_relaxed, memory_order_relaxed)) + break; + } +} + +static void init_stress_cpus(void) +{ + cpu_set_t set; + int capacity =3D MAX_STRESS_CPUS; + + g_stress_cpus =3D malloc(capacity * sizeof(int)); + if (!g_stress_cpus) + ksft_exit_fail_msg("malloc failed for g_stress_cpus\n"); + + if (sched_getaffinity(0, sizeof(set), &set) < 0) + ksft_exit_fail_msg("sched_getaffinity failed\n"); + + for (int i =3D 0; i < CPU_SETSIZE && g_ncpus_stress < capacity; i++) { + if (CPU_ISSET(i, &set)) + g_stress_cpus[g_ncpus_stress++] =3D i; + } + + if (g_ncpus_stress =3D=3D 0) + ksft_exit_skip("No CPUs available 
for stress test\n"); + + ksft_print_msg("Stressing %d CPUs discovered via affinity\n", g_ncpus_str= ess); +} + +/* -- rseq / membarrier helpers -- */ +static int rseq_register_thread(void) +{ + int r =3D syscall(SYS_rseq, &tls_rseq, sizeof(tls_rseq), 0, RSEQ_SIG); + + return (r =3D=3D 0 || errno =3D=3D EBUSY || errno =3D=3D EINVAL) ? 0 : -1; +} + +static int rseq_register_thread_at(struct rseq_abi *rseq) +{ + int r =3D syscall(SYS_rseq, rseq, sizeof(*rseq), 0, RSEQ_SIG); + + return (r =3D=3D 0 || errno =3D=3D EBUSY || errno =3D=3D EINVAL) ? 0 : -1; +} + +static int membarrier_register_rseq_mm(void) +{ + return syscall(SYS_membarrier, + MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0); +} + +/* -- cgroup helpers -- */ +static void rm_cgroup_recursive(const char *path) +{ + DIR *dir =3D opendir(path); + + if (!dir) + return; + struct dirent *entry; + + while ((entry =3D readdir(dir)) !=3D NULL) { + if (strcmp(entry->d_name, ".") =3D=3D 0 || strcmp(entry->d_name, "..") = =3D=3D 0) + continue; + if (entry->d_type =3D=3D DT_DIR) { + char sub_path[4096]; + + snprintf(sub_path, sizeof(sub_path), "%s/%s", path, entry->d_name); + rm_cgroup_recursive(sub_path); + } + } + closedir(dir); + rmdir(path); +} + +static void cgroup_teardown(void); + +static int cgroup_setup(void) +{ + struct stat st; + + if (stat("/sys/fs/cgroup/cpu", &st) =3D=3D 0) { + use_cgroup_v2 =3D 0; + snprintf(g_cgroup_path, sizeof(g_cgroup_path), + "/sys/fs/cgroup/cpu/membarrier_stress_test"); + } else if (stat("/dev/cgroup/cpu", &st) =3D=3D 0) { + use_cgroup_v2 =3D 0; + snprintf(g_cgroup_path, sizeof(g_cgroup_path), + "/dev/cgroup/cpu/membarrier_stress_test"); + } else if (stat("/cgroup/cpu", &st) =3D=3D 0) { + use_cgroup_v2 =3D 0; + snprintf(g_cgroup_path, sizeof(g_cgroup_path), + "/cgroup/cpu/membarrier_stress_test"); + } else if (stat("/sys/fs/cgroup/cgroup.controllers", &st) =3D=3D 0) { + use_cgroup_v2 =3D 1; + snprintf(g_cgroup_path, sizeof(g_cgroup_path), + 
"/sys/fs/cgroup/membarrier_stress_test"); + } else { + ksft_print_msg("WARN: cgroup mount not found. Using v2 at /sys/fs/cgroup= \n"); + use_cgroup_v2 =3D 1; + snprintf(g_cgroup_path, sizeof(g_cgroup_path), + "/sys/fs/cgroup/membarrier_stress_test"); + } + + /* Robust cleanup before setup */ + cgroup_teardown(); + + if (use_cgroup_v2) { + /* Enable cpu controller in root cgroup */ + if (write_file("/sys/fs/cgroup/cgroup.subtree_control", "+cpu") < 0) + ksft_print_msg("WARN: failed to enable cpu controller in /sys/fs/cgroup= \n"); + } + + if (mkdir(g_cgroup_path, 0755) < 0 && errno !=3D EEXIST) { + ksft_print_msg("mkdir base %s failed: %s\n", g_cgroup_path, strerror(err= no)); + return -1; + } + + if (use_cgroup_v2) { + char ctrl_path[4096]; + + snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.subtree_control", g_cg= roup_path); + if (write_file(ctrl_path, "+cpu") < 0) + ksft_print_msg("WARN: failed to enable cpu controller in %s\n", + g_cgroup_path); + } + + for (int i =3D 0; i < N_SIBLINGS; i++) { + char sibling_path[4096]; + + snprintf(sibling_path, sizeof(sibling_path), "%s/n%d", g_cgroup_path, i); + if (mkdir(sibling_path, 0755) < 0 && errno !=3D EEXIST) { + ksft_print_msg("mkdir wide %s failed: %s\n", sibling_path, strerror(err= no)); + return -1; + } + + if (use_cgroup_v2) { + char ctrl_path[4096]; + + snprintf(ctrl_path, sizeof(ctrl_path), + "%s/cgroup.subtree_control", sibling_path); + if (write_file(ctrl_path, "+cpu") < 0) + ksft_print_msg("WARN: failed to enable cpu controller in %s\n", + sibling_path); + } + + char current_path[4096]; + + snprintf(current_path, sizeof(current_path), "%s", sibling_path); + for (int j =3D 0; j < NEST_DEPTH; j++) { + snprintf(current_path + strlen(current_path), + sizeof(current_path) - strlen(current_path), "/d%d", j); + if (mkdir(current_path, 0755) < 0 && errno !=3D EEXIST) { + ksft_print_msg("mkdir deep %s failed: %s\n", + current_path, strerror(errno)); + return -1; + } + + /* Enable for all but the leaf */ + if 
(use_cgroup_v2 && j < NEST_DEPTH - 1) { + char ctrl_path[4096]; + + snprintf(ctrl_path, sizeof(ctrl_path), "%s/cgroup.subtree_control", + current_path); + if (write_file(ctrl_path, "+cpu") < 0) + ksft_print_msg("WARN: cannot enable cpu controller in %s\n", + current_path); + } + } + } + + char quota[64], period[64], max_str[128]; + + snprintf(quota, sizeof(quota), "%d", CFS_QUOTA_US); + snprintf(period, sizeof(period), "%d", CFS_PERIOD_US); + snprintf(max_str, sizeof(max_str), "%d %d", CFS_QUOTA_US, CFS_PERIOD_US); + + if (use_cgroup_v2) { + char max_path[4096]; + + snprintf(max_path, sizeof(max_path), "%s/cpu.max", g_cgroup_path); + if (write_file(max_path, max_str) < 0) { + ksft_print_msg("ERROR: cannot write cpu.max at %s\n", max_path); + return -1; + } + ksft_print_msg("cgroup (v2) %s: cpu.max=3D%s\n", g_cgroup_path, max_str); + } else { + char quota_path[4096], period_path[4096]; + + snprintf(quota_path, sizeof(quota_path), "%s/cpu.cfs_quota_us", g_cgroup= _path); + snprintf(period_path, sizeof(period_path), "%s/cpu.cfs_period_us", g_cgr= oup_path); + + if (write_file(period_path, period) < 0) { + ksft_print_msg("ERROR: cannot write cpu.cfs_period_us at %s\n", + period_path); + return -1; + } + if (write_file(quota_path, quota) < 0) { + ksft_print_msg("ERROR: cannot write cpu.cfs_quota_us at %s\n", quota_pa= th); + return -1; + } + ksft_print_msg("cgroup (v1) %s: cpu.cfs_quota_us=3D%d cpu.cfs_period_us= =3D%d\n", + g_cgroup_path, CFS_QUOTA_US, CFS_PERIOD_US); + } + + return 0; +} + +static int cgroup_add_pid_to_path(pid_t pid, const char *path) +{ + char buf[32], file_path[4096]; + + snprintf(buf, sizeof(buf), "%d", (int)pid); + if (use_cgroup_v2) { + snprintf(file_path, sizeof(file_path), "%s/cgroup.procs", path); + return write_file(file_path, buf); + } + /* In v1, try tasks first, fallback to cgroup.procs */ + snprintf(file_path, sizeof(file_path), "%s/tasks", path); + int r =3D write_file(file_path, buf); + + if (r < 0) { + snprintf(file_path, 
sizeof(file_path), "%s/cgroup.procs", path);
+		r = write_file(file_path, buf);
+	}
+	return r;
+}
+
+static void cgroup_teardown(void)
+{
+	rm_cgroup_recursive(g_cgroup_path);
+}
+
+static void cgroup_unthrottle(void)
+{
+	if (use_cgroup_v2) {
+		char max_path[4096];
+
+		snprintf(max_path, sizeof(max_path), "%s/cpu.max", g_cgroup_path);
+		write_file(max_path, "max");
+	} else {
+		char quota_path[4096];
+
+		snprintf(quota_path, sizeof(quota_path), "%s/cpu.cfs_quota_us", g_cgroup_path);
+		write_file(quota_path, "-1");
+	}
+}
+
+/* -- CPU burner (inside throttled child process) -- */
+static void *burner_thread_fn(void *arg)
+{
+	struct rseq_abi my_rseq;
+	int cpu = (int)(uintptr_t)arg;
+
+	memset(&my_rseq, 0, sizeof(my_rseq));
+	my_rseq.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+
+	if (rseq_register_thread_at(&my_rseq) < 0) {
+		perror("rseq_register (burner)");
+		return NULL;
+	}
+
+	cpu_set_t set;
+
+	CPU_ZERO(&set);
+	CPU_SET(cpu, &set);
+	if (sched_setaffinity(0, sizeof(set), &set) < 0)
+		perror("sched_setaffinity (burner)");
+
+	unsigned long sink = 0;
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		sink++;
+		/* Prevent compiler from optimizing the loop away */
+		asm volatile("" : "+g"(sink));
+	}
+
+	return NULL;
+}
+
+static int burner_thread_fn_wrapper(void *arg)
+{
+	burner_thread_fn(arg);
+	return 0;
+}
+
+static int leaf_child_fn(void *arg)
+{
+	int i = (int)(uintptr_t)arg;
+	int total_burners = g_ncpus_stress * N_BURNER_PER_CPU;
+	int n_threads_per_leaf = total_burners / N_SIBLINGS;
+
+	if (i < (total_burners % N_SIBLINGS))
+		n_threads_per_leaf++;
+
+	prctl(PR_SET_PDEATHSIG, SIGTERM);
+	if (getppid() == 1)
+		_exit(1);
+
+	char leaf_path[4096];
+
+	snprintf(leaf_path, sizeof(leaf_path), "%s/n%d", g_cgroup_path, i);
+	for (int j = 0; j < NEST_DEPTH; j++)
+		snprintf(leaf_path + strlen(leaf_path),
+			 sizeof(leaf_path) - strlen(leaf_path), "/d%d", j);
+
+	int r = cgroup_add_pid_to_path(getpid(), leaf_path);
+
+	if (r < 0) {
+		char buf[512];
+		int len = snprintf(buf, sizeof(buf),
+				   "[leaf child %d] failed to join cgroup %s: err %d\n",
+				   i, leaf_path, -r);
+		(void)!write(2, buf, len);
+		_exit(1);
+	}
+
+	for (int j = 0; j < n_threads_per_leaf; j++) {
+		int cpu = g_stress_cpus[(i * n_threads_per_leaf + j) % g_ncpus_stress];
+
+		/* Allocate stack via mmap (bypasses heap) */
+		size_t stack_size = 64 * 1024;
+		void *stack = mmap(NULL, stack_size, PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (stack == MAP_FAILED) {
+			const char *msg = "mmap stack failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+
+		/* Use raw clone to create a thread sharing the VM and thread group */
+		pid_t pid = clone(burner_thread_fn_wrapper, stack + stack_size,
+				  CLONE_VM | CLONE_THREAD | CLONE_SIGHAND,
+				  (void *)(uintptr_t)cpu);
+		if (pid < 0) {
+			const char *msg = "clone burner failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+	}
+
+	// Wait for SIGTERM
+	sigset_t mask;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGTERM);
+	int sig;
+
+	sigwait(&mask, &sig);
+
+	_exit(0);
+}
+
+struct leaf_info {
+	pid_t pid;
+	void *stack;
+};
+
+static int run_throttle_child(void *arg)
+{
+	(void)arg;
+	prctl(PR_SET_PDEATHSIG, SIGTERM);
+	if (getppid() == 1)
+		_exit(1);
+
+	int n_leafs = N_SIBLINGS;
+
+	/* Block signals before spawning to avoid missing early failures */
+	sigset_t mask;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGTERM);
+	sigaddset(&mask, SIGCHLD);
+	sigprocmask(SIG_BLOCK, &mask, NULL);
+
+	/* Use mmap for tracking structures to avoid glibc heap usage */
+	struct leaf_info *leaves = mmap(NULL, n_leafs * sizeof(struct leaf_info),
+					PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (leaves == MAP_FAILED) {
+		const char *msg = "mmap leaves array failed\n";
+		(void)!write(2, msg, strlen(msg));
+		_exit(1);
+	}
+
+	for (int i = 0; i < n_leafs; i++) {
+		size_t stack_size = 64 * 1024;
+		void *stack = mmap(NULL,
stack_size, PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+		if (stack == MAP_FAILED) {
+			const char *msg = "mmap leaf stack failed\n";
+			(void)!write(2, msg, strlen(msg));
+			_exit(1);
+		}
+
+		leaves[i].stack = stack;
+
+		pid_t pid = clone(leaf_child_fn, stack + stack_size,
+				  CLONE_VM | SIGCHLD, (void *)(uintptr_t)i);
+
+		if (pid < 0) {
+			/* Save errno before the cleanup calls below can overwrite it */
+			int saved_errno = errno;
+			const char *msg = "clone (leaf child) failed\n";
+			(void)!write(2, msg, strlen(msg));
+
+			/* Clean up successfully spawned children */
+			for (int j = 0; j < i; j++) {
+				kill(leaves[j].pid, SIGTERM);
+				waitpid(leaves[j].pid, NULL, 0);
+				munmap(leaves[j].stack, stack_size);
+			}
+			munmap(leaves, n_leafs * sizeof(struct leaf_info));
+
+			if (saved_errno == EAGAIN)
+				_exit(4);
+			else
+				_exit(1);
+		}
+		leaves[i].pid = pid;
+	}
+
+	int failed = 0;
+
+	while (1) {
+		int sig;
+
+		sigwait(&mask, &sig);
+
+		if (sig == SIGTERM) {
+			break;
+		} else if (sig == SIGCHLD) {
+			int status;
+			pid_t pid;
+
+			// Reap all dead children
+			while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
+				for (int i = 0; i < n_leafs; i++) {
+					if (leaves[i].pid == pid) {
+						leaves[i].pid = 0;
+						break;
+					}
+				}
+				if ((WIFEXITED(status) && WEXITSTATUS(status) != 0) ||
+				    WIFSIGNALED(status)) {
+					char buf[128];
+					int len = snprintf(buf, sizeof(buf),
+						"[manager] child %d died unexpectedly (status %d)\n",
+						pid, WEXITSTATUS(status));
+					(void)!write(2, buf, len);
+					failed = 1;
+				}
+			}
+			if (failed)
+				break;
+		}
+	}
+
+	// Terminate all leaf kids
+	for (int i = 0; i < n_leafs; i++) {
+		if (leaves[i].pid > 0)
+			kill(leaves[i].pid, SIGTERM);
+	}
+
+	for (int i = 0; i < n_leafs; i++) {
+		if (leaves[i].pid > 0)
+			waitpid(leaves[i].pid, NULL, 0);
+		munmap(leaves[i].stack, 64 * 1024);
+	}
+
+	munmap(leaves, n_leafs * sizeof(struct leaf_info));
+
+	_exit(failed ?
1 : 0);
+}
+
+/* -- Membarrier hammer thread -- */
+static void *hammer_thread_fn(void *arg)
+{
+	int target_cpu = *(int *)arg;
+	long local_ok = 0;
+	long local_err = 0;
+	int count = 0;
+	const int batch_size = 1024;
+
+	if (rseq_register_thread() < 0) {
+		ksft_print_msg("[hammer] rseq_register failed: %s\n", strerror(errno));
+		return NULL;
+	}
+
+	membarrier_register_rseq_mm();
+
+	while (!atomic_load_explicit(&g_stop, memory_order_relaxed)) {
+		int r = syscall(SYS_membarrier,
+				MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
+				MEMBARRIER_CMD_FLAG_CPU,
+				target_cpu);
+		if (__builtin_expect(r == 0, 1))
+			local_ok++;
+		else
+			local_err++;
+
+		count++;
+		if (__builtin_expect(count >= batch_size, 0)) {
+			atomic_fetch_add_explicit(&g_mb_ok, local_ok, memory_order_relaxed);
+			atomic_fetch_add_explicit(&g_mb_err, local_err, memory_order_relaxed);
+			local_ok = 0;
+			local_err = 0;
+			count = 0;
+		}
+	}
+
+	/* Flush any remaining counts on exit */
+	if (local_ok > 0)
+		atomic_fetch_add_explicit(&g_mb_ok, local_ok, memory_order_relaxed);
+	if (local_err > 0)
+		atomic_fetch_add_explicit(&g_mb_err, local_err, memory_order_relaxed);
+
+	return NULL;
+}
+
+/* -- Latency sentinel -- */
+static void *sentinel_thread_fn(void *arg)
+{
+	(void)arg;
+	struct sched_param sp = { .sched_priority = 20 };
+
+	if (sched_setscheduler(0, SCHED_FIFO, &sp) < 0)
+		ksft_print_msg("WARN: no SCHED_FIFO for sentinel (less precise)\n");
+
+	while (!atomic_load_explicit(&g_test_ready, memory_order_relaxed) &&
+	       !atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		struct timespec ts = {0, 1000 * 1000}; /* 1ms */
+
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
+	}
+
+	uint64_t prev = monotonic_us();
+
+	while (!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		struct timespec ts = {
+			.tv_sec = 0,
+			.tv_nsec = SENTINEL_INTERVAL_US * 1000L,
+		};
+		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
+
+		uint64_t now = monotonic_us();
+		long
latency_us = (long)(now - prev) - SENTINEL_INTERVAL_US;
+
+		prev = now;
+
+		if (latency_us <= 0)
+			continue;
+
+		update_max_latency(latency_us);
+
+		if (latency_us > LATENCY_CRITICAL_MS * 1000L) {
+			ksft_print_msg("\n[SENTINEL] CRITICAL: %ld ms delay (lockup precursor!)\n",
+				       latency_us / 1000);
+		} else if (latency_us > LATENCY_WARN_MS * 1000L) {
+			ksft_print_msg("\n[SENTINEL] WARN: %ld ms latency spike\n",
+				       latency_us / 1000);
+		}
+	}
+	return NULL;
+}
+
+/* -- Progress reporter -- */
+static void *reporter_thread_fn(void *arg)
+{
+	(void)arg;
+	int elapsed = 0;
+
+	while (!atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed)) {
+		for (int i = 0; i < 5; i++) {
+			sleep(1);
+			if (atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed))
+				break;
+		}
+		if (atomic_load_explicit(&g_stop_sentinel, memory_order_relaxed))
+			break;
+		elapsed += 5;
+		long interval_max = atomic_exchange_explicit(&g_interval_max_latency_us,
+							     0, memory_order_relaxed);
+
+		ksft_print_msg("[%3ds] mb: ok=%-10ld err=%-8ld | max_lat=%ld us\n",
+			       elapsed,
+			       atomic_load(&g_mb_ok),
+			       atomic_load(&g_mb_err),
+			       interval_max);
+	}
+	return NULL;
+}
+
+/* -- Main -- */
+int main(void)
+{
+	ksft_print_header();
+#ifdef UNSUPPORTED_ARCH
+	ksft_exit_skip("Unsupported architecture\n");
+#endif
+	ksft_set_plan(1);
+
+	if (geteuid() != 0)
+		ksft_exit_skip("Must run as root (cgroup + SCHED_FIFO)\n");
+
+	init_stress_cpus();
+
+	ksft_print_msg("=== membarrier rseq + CFS unthrottle stress ===\n");
+	ksft_print_msg("Stressing CPUs: %d\n", g_ncpus_stress);
+	ksft_print_msg("Quota: %d/%d us (~%d unthrottles/sec/CPU)\n",
+		       CFS_QUOTA_US, CFS_PERIOD_US,
+		       1000000 / CFS_PERIOD_US);
+	ksft_print_msg("Hammer threads: %d per CPU (%d total)\n",
+		       N_HAMMER_PER_CPU, g_ncpus_stress * N_HAMMER_PER_CPU);
+	ksft_print_msg("Duration: %d seconds\n\n", TEST_DURATION_SEC);
+
+	if (cgroup_setup() < 0) {
+		cgroup_teardown();
+		ksft_exit_skip("cgroup_setup failed (missing permissions or v2 ctrls?)\n");
+	}
+
+	if (rseq_register_thread() < 0) {
+		ksft_print_msg("rseq_register (%s) failed: %s\n", __func__, strerror(errno));
+		cgroup_teardown();
+		ksft_exit_skip("rseq syscall failed or not available\n");
+	}
+	if (membarrier_register_rseq_mm() < 0) {
+		ksft_print_msg("MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ: %s\n"
+			       "Kernel >= 5.10 with CONFIG_RSEQ required.\n",
+			       strerror(errno));
+		cgroup_teardown();
+		ksft_exit_skip("membarrier register failed\n");
+	}
+	ksft_print_msg("rseq membarrier registered OK\n");
+
+	sigset_t sigmask;
+
+	sigemptyset(&sigmask);
+	sigaddset(&sigmask, SIGTERM);
+	sigprocmask(SIG_BLOCK, &sigmask, NULL);
+
+	void *stack = malloc(1024 * 1024);
+
+	if (!stack) {
+		perror("malloc stack");
+		cgroup_teardown();
+		ksft_exit_fail_msg("Malloc stack failed\n");
+	}
+	pid_t child = clone(run_throttle_child, stack + 1024 * 1024, CLONE_VM | SIGCHLD, NULL);
+
+	if (child < 0) {
+		perror("clone");
+		cgroup_teardown();
+		ksft_exit_fail_msg("Clone failed\n");
+	}
+
+	sigprocmask(SIG_UNBLOCK, &sigmask, NULL);
+	ksft_print_msg("Throttle child PID %d started\n", child);
+
+	int n_threads = g_ncpus_stress * N_HAMMER_PER_CPU + 2;
+	pthread_t *threads = (pthread_t *)calloc(n_threads, sizeof(pthread_t));
+	int *cpuargs = (int *)calloc(g_ncpus_stress * N_HAMMER_PER_CPU, sizeof(int));
+
+	if (!threads || !cpuargs) {
+		perror("calloc");
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		ksft_exit_fail_msg("Thread allocation failed\n");
+	}
+
+	int ti = 0, ai = 0;
+	int r;
+
+	ksft_print_msg("Creating sentinel thread...\n");
+	r = pthread_create(&threads[ti], NULL, sentinel_thread_fn, NULL);
+	if (r != 0) {
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		free(threads);
+		free(cpuargs);
+		free(g_stress_cpus);
+		ksft_exit_fail_msg("pthread_create (sentinel) failed: %s\n", strerror(r));
+	}
+	ti++;
+
+	ksft_print_msg("Creating reporter thread...\n");
+	r =
pthread_create(&threads[ti], NULL, reporter_thread_fn, NULL);
+	if (r != 0) {
+		atomic_store(&g_stop_sentinel, 1);
+		pthread_join(threads[0], NULL);
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+		cgroup_teardown();
+		free(threads);
+		free(cpuargs);
+		free(g_stress_cpus);
+		ksft_exit_fail_msg("pthread_create (reporter) failed: %s\n", strerror(r));
+	}
+	ti++;
+
+	ksft_print_msg("Creating %d hammer threads...\n", g_ncpus_stress * N_HAMMER_PER_CPU);
+	for (int i = 0; i < g_ncpus_stress; i++) {
+		int cpu = g_stress_cpus[i];
+
+		for (int j = 0; j < N_HAMMER_PER_CPU; j++) {
+			cpuargs[ai] = cpu;
+			r = pthread_create(&threads[ti], NULL, hammer_thread_fn, &cpuargs[ai]);
+			if (r != 0) {
+				ksft_print_msg("pthread_create failed at thread %d: %s\n",
+					       ti, strerror(r));
+
+				atomic_store(&g_stop_sentinel, 1);
+				pthread_join(threads[0], NULL);
+				pthread_join(threads[1], NULL);
+
+				atomic_store(&g_stop, 1);
+				for (int k = 2; k < ti; k++)
+					pthread_join(threads[k], NULL);
+
+				kill(child, SIGTERM);
+				waitpid(child, NULL, 0);
+				cgroup_teardown();
+
+				free(threads);
+				free(cpuargs);
+				free(g_stress_cpus);
+
+				if (r == EAGAIN)
+					ksft_exit_skip("Resource limits prevent threads\n");
+				else
+					ksft_exit_fail_msg("Failed to create hammer thread\n");
+			}
+			ti++;
+			ai++;
+		}
+	}
+
+	ksft_print_msg("All threads running. Tip: monitor dmesg for lockups\n\n");
+
+	atomic_store_explicit(&g_test_ready, 1, memory_order_relaxed);
+	int child_failed = 0;
+	int child_status = 0;
+
+	for (int i = 0; i < TEST_DURATION_SEC; i++) {
+		sleep(1);
+		int r = waitpid(child, &child_status, WNOHANG);
+
+		if (r == child) {
+			child_failed = 1;
+			break;
+		}
+	}
+
+	atomic_store(&g_stop_sentinel, 1);
+	pthread_join(threads[0], NULL);
+	pthread_join(threads[1], NULL);
+
+	atomic_store(&g_stop, 1);
+
+	/* Unthrottle to allow children to exit quickly */
+	cgroup_unthrottle();
+
+	if (!child_failed) {
+		kill(child, SIGTERM);
+		waitpid(child, NULL, 0);
+	}
+	for (int i = 2; i < ti; i++)
+		pthread_join(threads[i], NULL);
+
+	long max_lat = atomic_load(&g_max_latency_us);
+	long total_ok = atomic_load(&g_mb_ok);
+	long total_err = atomic_load(&g_mb_err);
+
+	ksft_print_msg("\n=== RESULTS ===\n");
+	ksft_print_msg("membarrier syscalls : %ld ok %ld errors\n", total_ok, total_err);
+	ksft_print_msg("Max scheduler latency: %ld us (%ld ms)\n", max_lat, max_lat / 1000);
+	cgroup_teardown();
+	free(threads);
+	free(cpuargs);
+	free(g_stress_cpus);
+
+	if (child_failed) {
+		if (WIFEXITED(child_status) && WEXITSTATUS(child_status) == 4)
+			ksft_exit_skip("Manager child skipped (resource limits?)\n");
+		ksft_test_result_fail("membarrier_rseq_stress: Manager child died early\n");
+		ksft_exit_fail();
+	} else if (total_ok == 0) {
+		ksft_test_result_fail("membarrier_rseq_stress: No successful membarrier calls\n");
+		ksft_exit_fail();
+	} else if (total_err > 0) {
+		ksft_test_result_fail("membarrier_rseq_stress: syscall errors\n");
+		ksft_exit_fail();
+	} else if (max_lat > LATENCY_CRITICAL_MS * 1000L) {
+		ksft_test_result_fail("membarrier_rseq_stress: LOCKUP PRECURSOR\n");
+		ksft_exit_fail();
+	} else if (max_lat > LATENCY_WARN_MS * 1000L) {
+		ksft_test_result_fail("membarrier_rseq_stress: significant latency spike\n");
+		ksft_exit_fail();
+	} else {
+		ksft_test_result_pass("membarrier_rseq_stress\n");
+		ksft_exit_pass();
+	}
+
+	return 0;
+}
-- 
2.54.0.545.g6539524ca2-goog