From nobody Thu Apr 2 15:02:05 2026
From: "Joel Fernandes (Google)"
To: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, rushikesh.s.kadam@intel.com, urezki@gmail.com, neeraj.iitr10@gmail.com, frederic@kernel.org, paulmck@kernel.org, rostedt@goodmis.org, "Joel Fernandes (Google)"
Subject: [PATCH v6 1/4] rcu: Make call_rcu() lazy to save power
Date: Thu, 22 Sep 2022 22:01:01 +0000
Message-Id: <20220922220104.2446868-2-joel@joelfernandes.org>
In-Reply-To: <20220922220104.2446868-1-joel@joelfernandes.org>
References: <20220922220104.2446868-1-joel@joelfernandes.org>

Implement timer-based RCU lazy callback batching. The batch is flushed
whenever a certain amount of time has passed, or the batch on a
particular CPU grows too big. Also memory pressure will flush it in a
future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
length has the lazy CB length included in it.
A separate lazy CB length counter is also introduced to keep track of
the number of lazy CBs.

v5->v6:

[ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
  deferral levels wake much earlier so for those it is not needed. ]
[ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
[ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]
[ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]
[ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
[ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]

Suggested-by: Paul McKenney
Signed-off-by: Joel Fernandes (Google)
Reported-by: Paul E. McKenney
---
 include/linux/rcupdate.h |   7 ++
 kernel/rcu/Kconfig       |   8 ++
 kernel/rcu/rcu.h         |   8 ++
 kernel/rcu/tiny.c        |   2 +-
 kernel/rcu/tree.c        | 133 ++++++++++++++++++----------
 kernel/rcu/tree.h        |  17 +++-
 kernel/rcu/tree_exp.h    |   2 +-
 kernel/rcu/tree_nocb.h   | 184 ++++++++++++++++++++++++++++++++-------
 8 files changed, 277 insertions(+), 84 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d..40ae36904825 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func);
+#else
+static inline void call_rcu_flush(struct rcu_head *head,
+		rcu_callback_t func) { call_rcu(head, func); }
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;

diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index f53ad63b2bc6..edd632e68497 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index be5979da07f5..65704cbc9df7 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -474,6 +474,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);

diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a33a8d4942c3..810479cf17ba 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
 
 void rcu_barrier(void)
 {
-	wait_rcu_gp(call_rcu);
+	wait_rcu_gp(call_rcu_flush);
 }
 EXPORT_SYMBOL(rcu_barrier);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..736d0d724207 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
 	raw_spin_unlock_rcu_node(rnp);
 }
 
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed.  However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested.  In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section.  On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu().  It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section.  Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
 	static atomic_t doublefrees;
 	unsigned long flags;
@@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
 	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
 
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_flush() - Queue RCU callback for invocation after grace period, and
+ * flush all lazy callbacks (including the new one) to the main ->cblist while
+ * doing so.
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't want the callback to be
+ * delayed for very long periods of time, as can otherwise happen on systems
+ * without memory pressure and on systems which are lightly loaded or mostly
+ * idle.
+ *
+ * Other than the difference in when callbacks are invoked, this function is
+ * identical to, and reuses, call_rcu()'s logic. Refer to call_rcu() for more
+ * details about memory ordering and other functionality.
+ */
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, false);
+}
+EXPORT_SYMBOL_GPL(call_rcu_flush);
+#endif
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * By default the callbacks are 'lazy' and are kept hidden from the main
+ * ->cblist to prevent starting of grace periods too soon.
+ * If you desire grace periods to start very soon, use call_rcu_flush().
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.  However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (5 * HZ)
@@ -3507,7 +3544,7 @@ void synchronize_rcu(void)
 		if (rcu_gp_is_expedited())
 			synchronize_rcu_expedited();
 		else
-			wait_rcu_gp(call_rcu);
+			wait_rcu_gp(call_rcu_flush);
 		return;
 	}
 
@@ -3902,7 +3939,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	/*
+	 * Flush the bypass list, but also wake up the GP thread as otherwise
+	 * bypass/lazy CBs may not be noticed, and can cause real long delays!
+	 */
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
 	} else {
@@ -4323,7 +4364,7 @@ void rcutree_migrate_callbacks(int cpu)
 	my_rdp = this_cpu_ptr(&rcu_data);
 	my_rnp = my_rdp->mynode;
 	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
 	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
 	/* Leverage recent GPs and set GP for new callbacks.
	 */
	needwake = rcu_advance_cbs(my_rnp, rdp) ||

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..361c41d642c7 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -263,14 +263,16 @@ struct rcu_data {
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
 
+	long lazy_len;			/* Length of buffered lazy callbacks. */
 	int cpu;
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
 					/* For jiffies_till_first_fqs and */
@@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
+
+#define FLUSH_BP_NONE 0
+/* Is the CB being enqueued after the flush, a lazy CB? */
+#define FLUSH_BP_LAZY BIT(0)
+/* Wake up nocb-GP thread after flush? */
+#define FLUSH_BP_WAKE BIT(1)
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, unsigned long flush_flags);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 18e9b4cd78ef..5cac05600798 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void)
 
 	/* If expedited grace periods are prohibited, fall back to normal. */
 	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu);
+		wait_rcu_gp(call_rcu_flush);
 		return;
 	}
 
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index f77a6d7e1356..661c685aba3f 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
 
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * could be flushed much earlier for a number of other reasons
+ * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
+ * left unsubmitted to RCU after those many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
-	 * Bypass wakeup overrides previous deferments. In case
-	 * of callback storm, no need to wake up too early.
+	 * Bypass wakeup overrides previous deferments. In case of
+	 * callback storm, no need to wake up too early.
 	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
+	if (waketype == RCU_NOCB_WAKE_LAZY
+	    && READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
 		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 	} else {
@@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * proves to be initially empty, just return false because the no-CB GP
  * kthread may need to be awakened in this case.
  *
+ * Return true if there was something to be flushed and it succeeded, otherwise
+ * false.
+ *
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, unsigned long flush_flags)
 {
 	struct rcu_cblist rcl;
+	bool lazy = flush_flags & FLUSH_BP_LAZY;
 
 	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
 	rcu_lockdep_assert_cblist_protected(rdp);
@@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/*
+	 * If the new CB requested was a lazy one, queue it onto the main
+	 * ->cblist so we can take advantage of a sooner grace period.
+	 */
+	if (lazy && rhp) {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
+		rcu_cblist_enqueue(&rcl, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	} else {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	}
+
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
 	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +372,33 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, unsigned long flush_flags)
 {
+	bool ret;
+	bool was_alldone = false;
+	bool bypass_all_lazy = false;
+	long bypass_ncbs;
+
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+
+	if (flush_flags & FLUSH_BP_WAKE) {
+		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+		bypass_all_lazy = bypass_ncbs && (bypass_ncbs == rdp->lazy_len);
+	}
+
+	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
+
+	// Wake up the nocb GP thread if needed. GP thread could be sleeping
+	// while waiting for lazy timer to expire (otherwise rcu_barrier may
+	// end up waiting for the duration of the lazy timer).
+	if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy)
+		wake_nocb_gp(rdp, false);
+
+	return ret;
 }
 
 /*
@@ -345,7 +411,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE));
 }
 
 /*
@@ -367,12 +433,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	unsigned long c;
 	unsigned long cur_gp_seq;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
 
 	lockdep_assert_irqs_disabled();
 
@@ -417,25 +485,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// If there hasn't yet been all that many ->cblist enqueues
 	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
 	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// Lazy CBs throttle this back and do immediate bypass queuing.
+	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 		if (*was_alldone)
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE));
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
 
 	// If ->nocb_bypass has been used too long or is too full,
 	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	    (ncbs && bypass_is_lazy &&
+	     (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j,
+					   lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) {
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 						    TPS("FirstQ"));
@@ -460,16 +533,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// We need to use the bypass.
 	rcu_nocb_wait_contended(rdp);
 	rcu_nocb_bypass_lock(rdp);
+
 	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
 	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+
+	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
+		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
+
 	if (!ncbs) {
 		WRITE_ONCE(rdp->nocb_bypass_first, j);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
 	}
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
-	if (ncbs) {
+
+	// A wake up of the grace period kthread or timer adjustment needs to
+	// be done only if:
+	// 1. Bypass list was fully empty before (this is the first bypass list entry).
+	//	Or, both the below conditions are met:
+	// 1. Bypass list had only lazy CBs before.
+	// 2. The new CB is non-lazy.
+	if (ncbs && (!bypass_is_lazy || lazy)) {
 		local_irq_restore(flags);
 	} else {
 		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
@@ -499,7 +585,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 {
 	unsigned long cur_gp_seq;
 	unsigned long j;
-	long len;
+	long len, lazy_len, bypass_len;
 	struct task_struct *t;
 
 	// If we are being polled or there is no kthread, just leave.
@@ -512,9 +598,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 	}
 	// Need to actually do a wakeup.
 	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = READ_ONCE(rdp->lazy_len);
 	if (was_alldone) {
 		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
 			/* ... if queue was empty ... */
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp(rdp, false);
@@ -604,8 +697,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu)
  */
 static void nocb_gp_wait(struct rcu_data *my_rdp)
 {
-	bool bypass = false;
-	long bypass_ncbs;
+	bool bypass = false, lazy = false;
+	long bypass_ncbs, lazy_ncbs;
 	int __maybe_unused cpu = my_rdp->cpu;
 	unsigned long cur_gp_seq;
 	unsigned long flags;
@@ -640,24 +733,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 * won't be ignored for long.
 	 */
 	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
+		bool flush_bypass = false;
+
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
 		rcu_nocb_lock_irqsave(rdp, flags);
 		lockdep_assert_held(&rdp->nocb_lock);
 		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = READ_ONCE(rdp->lazy_len);
+
+		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
 		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
 		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
 		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			continue; /* No callbacks here, try next. */
 		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = READ_ONCE(rdp->lazy_len);
+		}
+
 		if (bypass_ncbs) {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+					    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
 		}
 		rnp = rdp->mynode;
 
@@ -705,12 +815,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If bypass list only has lazy CBs. Add a deferred
+		// lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
					TPS("WakeBypassIsDeferred"));
+		}
 	}
+
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
 		if (gotcbs)
@@ -1036,7 +1155,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
*/ - WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE)); /* * Start with invoking rcu_core() early. This way if the current thread * happens to preempt an ongoing call to rcu_core() in the middle, @@ -1278,6 +1397,7 @@ static void __init rcu_boot_init_nocb_percpu_data(str= uct rcu_data *rdp) raw_spin_lock_init(&rdp->nocb_gp_lock); timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0); rcu_cblist_init(&rdp->nocb_bypass); + WRITE_ONCE(rdp->lazy_len, 0); mutex_init(&rdp->nocb_gp_kthread_mutex); } =20 @@ -1559,13 +1679,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp) } =20 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *r= hp, - unsigned long j) + unsigned long j, unsigned long flush_flags) { return true; } =20 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, - bool *was_alldone, unsigned long flags) + bool *was_alldone, unsigned long flags, bool lazy) { return false; } --=20 2.37.3.998.g577e59143f-goog From nobody Thu Apr 2 15:02:05 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 47B8CECAAD8 for ; Thu, 22 Sep 2022 22:01:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229738AbiIVWBj (ORCPT ); Thu, 22 Sep 2022 18:01:39 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39896 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230021AbiIVWB3 (ORCPT ); Thu, 22 Sep 2022 18:01:29 -0400 Received: from mail-qt1-x835.google.com (mail-qt1-x835.google.com [IPv6:2607:f8b0:4864:20::835]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE349F8F9F for ; Thu, 22 Sep 2022 15:01:28 -0700 (PDT) Received: by mail-qt1-x835.google.com with SMTP id 
b23so7293448qtr.13 for ; Thu, 22 Sep 2022 15:01:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=zIWTRfnvBQN5vTCxbMIC47LGFoMDr9nrugSvdFV6LqQ=; b=gUN22foRH9YYHKT9UteEYRTA+owayfilXutUXe3Zk13WvNcaCda8c21NFVdilPC1jx N+5tQb1HODwuk6YQWLhWfrrTSgo+5nDRdpPNNSZDNu4MMOany9YsQdKeKyDWdyUo/5p5 m5cscoZ65fgawWZwXKCUNz5m+EZKuCholMDXg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=zIWTRfnvBQN5vTCxbMIC47LGFoMDr9nrugSvdFV6LqQ=; b=VXbraio3C+/isgdmzwBj0JTxDJ/0TdQFLqzdTqSJ2zwkJFGbChZqQ5tFXLkzZsy4ka 2bkjCoRaxkn8yGyhOjW+Ti+tAClTQrnO5KdHFfyCULlLbOdNz6f2U24R1ks4LYnpz6bO wJXejJqPz6JhEoWmfMV75zwq4recaCPGu5jBlWra68omm/rKGEzmMhbHJ8/sBRHc31pF U+S59CxHcmJOBR/EQUt21tlcawRxIzZuq6BM+dSVjf4KyxsmbuA7cZyy+5Lek9N8ULbc /6vDUhEr0AMSZ6DdimIcKYSW8VDFlVbfdR/wJLlklNO1aiUFNX8BmVWSvOpVi/QmskVk IKHA== X-Gm-Message-State: ACrzQf3/60qhL09lOi47fRZTBWaJqxrX9gC/cQAO46GnCJcDz6dZHYw5 uZ0W9/Cfjin+8pBf2xipUxEU2A== X-Google-Smtp-Source: AMsMyM4bB5yLNsDEX8+uaXcyDCoP0jgWLZBaeOJw2x0V4bRfyYrRh6IHLLk/Er3Lp8Yj5rrWanyR8Q== X-Received: by 2002:a05:622a:1909:b0:344:9f41:9477 with SMTP id w9-20020a05622a190900b003449f419477mr4599043qtc.619.1663884088046; Thu, 22 Sep 2022 15:01:28 -0700 (PDT) Received: from joelboxx.c.googlers.com.com (48.230.85.34.bc.googleusercontent.com. 
From: "Joel Fernandes (Google)"
To: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, rushikesh.s.kadam@intel.com, urezki@gmail.com, neeraj.iitr10@gmail.com, frederic@kernel.org, paulmck@kernel.org, rostedt@goodmis.org, Vineeth Pillai, Joel Fernandes
Subject: [PATCH v6 2/4] rcu: shrinker for lazy rcu
Date: Thu, 22 Sep 2022 22:01:02 +0000
Message-Id: <20220922220104.2446868-3-joel@joelfernandes.org>
In-Reply-To: <20220922220104.2446868-1-joel@joelfernandes.org>
References: <20220922220104.2446868-1-joel@joelfernandes.org>

From: Vineeth Pillai

The shrinker is used to speed up the freeing of memory potentially held
by RCU lazy callbacks. RCU kernel module test cases show this to be
effective; the test is introduced in a later patch.

Signed-off-by: Vineeth Pillai
Signed-off-by: Joel Fernandes (Google)
---
 kernel/rcu/tree_nocb.h | 52 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 661c685aba3f..1a182b9c4f6c 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1332,6 +1332,55 @@ int rcu_nocb_cpu_offload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);

+static unsigned long
+lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+		count += READ_ONCE(rdp->lazy_len);
+	}
+
+	return count ? count : SHRINK_EMPTY;
+}
+
+static unsigned long
+lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long flags;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+		int _count = READ_ONCE(rdp->lazy_len);
+
+		if (_count == 0)
+			continue;
+		rcu_nocb_lock_irqsave(rdp, flags);
+		WRITE_ONCE(rdp->lazy_len, 0);
+		rcu_nocb_unlock_irqrestore(rdp, flags);
+		wake_nocb_gp(rdp, false);
+		sc->nr_to_scan -= _count;
+		count += _count;
+		if (sc->nr_to_scan <= 0)
+			break;
+	}
+	return count ? count : SHRINK_STOP;
+}
+
+static struct shrinker lazy_rcu_shrinker = {
+	.count_objects = lazy_rcu_shrink_count,
+	.scan_objects = lazy_rcu_shrink_scan,
+	.batch = 0,
+	.seeks = DEFAULT_SEEKS,
+};
+
 void __init rcu_init_nohz(void)
 {
	int cpu;
@@ -1362,6 +1411,9 @@ void __init rcu_init_nohz(void)
	if (!rcu_state.nocb_is_setup)
		return;

+	if (register_shrinker(&lazy_rcu_shrinker, "rcu-lazy"))
+		pr_err("Failed to register lazy_rcu shrinker!\n");
+
	if (!cpumask_subset(rcu_nocb_mask, cpu_possible_mask)) {
		pr_info("\tNote: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs.\n");
		cpumask_and(rcu_nocb_mask, cpu_possible_mask,
-- 
2.37.3.998.g577e59143f-goog
From: "Joel Fernandes (Google)"
To: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, rushikesh.s.kadam@intel.com, urezki@gmail.com, neeraj.iitr10@gmail.com, frederic@kernel.org, paulmck@kernel.org, rostedt@goodmis.org, "Joel Fernandes (Google)"
Subject: [PATCH v6 3/4] rcuscale: Add laziness and kfree tests
Date: Thu, 22 Sep 2022 22:01:03 +0000
Message-Id: <20220922220104.2446868-4-joel@joelfernandes.org>
In-Reply-To: <20220922220104.2446868-1-joel@joelfernandes.org>
References: <20220922220104.2446868-1-joel@joelfernandes.org>

Add two tests to rcuscale. The first is a startup test that checks
whether we are neither too lazy nor too hard-working. The second
emulates kfree_rcu() using call_rcu() and checks the resulting memory
pressure. In my testing, the new call_rcu() keeps memory pressure under
control about as well as kfree_rcu() does.
Signed-off-by: Joel Fernandes (Google)
---
 kernel/rcu/rcuscale.c | 65 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 3ef02d4a8108..027b7c1e7613 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -95,6 +95,7 @@ torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
+torture_param(int, kfree_by_call_rcu, 0, "Use call_rcu() to emulate kfree_rcu()?");

 static char *scale_type = "rcu";
 module_param(scale_type, charp, 0444);
@@ -659,6 +660,14 @@ struct kfree_obj {
	struct rcu_head rh;
 };

+/* Used if doing RCU-kfree'ing via call_rcu(). */
+static void kfree_call_rcu(struct rcu_head *rh)
+{
+	struct kfree_obj *obj = container_of(rh, struct kfree_obj, rh);
+
+	kfree(obj);
+}
+
 static int
 kfree_scale_thread(void *arg)
 {
@@ -696,6 +705,11 @@ kfree_scale_thread(void *arg)
		if (!alloc_ptr)
			return -ENOMEM;

+		if (kfree_by_call_rcu) {
+			call_rcu(&(alloc_ptr->rh), kfree_call_rcu);
+			continue;
+		}
+
		// By default kfree_rcu_test_single and kfree_rcu_test_double are
		// initialized to false. If both have the same value (false or true)
		// both are randomly tested, otherwise only the one with value true
@@ -767,11 +781,58 @@ kfree_scale_shutdown(void *arg)
	return -EINVAL;
 }

+// Used if doing RCU-kfree'ing via call_rcu().
+static unsigned long jiffies_at_lazy_cb;
+static struct rcu_head lazy_test1_rh;
+static int rcu_lazy_test1_cb_called;
+static void call_rcu_lazy_test1(struct rcu_head *rh)
+{
+	jiffies_at_lazy_cb = jiffies;
+	WRITE_ONCE(rcu_lazy_test1_cb_called, 1);
+}
+
 static int __init
 kfree_scale_init(void)
 {
	long i;
	int firsterr = 0;
+	unsigned long orig_jif, jif_start;
+
+	// Also, do a quick self-test to ensure laziness is as much as
+	// expected.
+	if (kfree_by_call_rcu && !IS_ENABLED(CONFIG_RCU_LAZY)) {
+		pr_alert("CONFIG_RCU_LAZY is disabled, falling back to kfree_rcu() "
+			 "for delayed RCU kfree'ing\n");
+		kfree_by_call_rcu = 0;
+	}
+
+	if (kfree_by_call_rcu) {
+		/* do a test to check the timeout. */
+		orig_jif = rcu_lazy_get_jiffies_till_flush();
+
+		rcu_lazy_set_jiffies_till_flush(2 * HZ);
+		rcu_barrier();
+
+		jif_start = jiffies;
+		jiffies_at_lazy_cb = 0;
+		call_rcu(&lazy_test1_rh, call_rcu_lazy_test1);
+
+		smp_cond_load_relaxed(&rcu_lazy_test1_cb_called, VAL == 1);
+
+		rcu_lazy_set_jiffies_till_flush(orig_jif);
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
+			pr_alert("ERROR: call_rcu() CBs are not being lazy as expected!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start > 3 * HZ)) {
+			pr_alert("ERROR: call_rcu() CBs are being too lazy!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+	}

	kfree_nrealthreads = compute_real(kfree_nthreads);
	/* Start up the kthreads. */
@@ -784,7 +845,9 @@ kfree_scale_init(void)
		schedule_timeout_uninterruptible(1);
	}

-	pr_alert("kfree object size=%zu\n", kfree_mult * sizeof(struct kfree_obj));
+	pr_alert("kfree object size=%zu, kfree_by_call_rcu=%d\n",
+		 kfree_mult * sizeof(struct kfree_obj),
+		 kfree_by_call_rcu);

	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
			GFP_KERNEL);
-- 
2.37.3.998.g577e59143f-goog
From: "Joel Fernandes (Google)"
To: rcu@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, rushikesh.s.kadam@intel.com, urezki@gmail.com, neeraj.iitr10@gmail.com, frederic@kernel.org, paulmck@kernel.org, rostedt@goodmis.org, "Joel Fernandes (Google)"
Subject: [PATCH v6 4/4] percpu-refcount: Use call_rcu_flush() for atomic switch
Date: Thu, 22 Sep 2022 22:01:04 +0000
Message-Id: <20220922220104.2446868-5-joel@joelfernandes.org>
In-Reply-To: <20220922220104.2446868-1-joel@joelfernandes.org>
References: <20220922220104.2446868-1-joel@joelfernandes.org>

The call_rcu() changes to save power will slow down the percpu
refcounter's per-CPU-to-atomic switch path. That primitive uses RCU
when switching to atomic mode, and the enqueued asynchronous callback
wakes up waiters on percpu_ref_switch_waitq. Lazy callbacks would
therefore slow down per-CPU refcount users such as
blk_pre_runtime_suspend(). Use the call_rcu_flush() API instead, which
reverts to the old behavior.
Signed-off-by: Joel Fernandes (Google)
---
 lib/percpu-refcount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index e5c5315da274..65c58a029297 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -230,7 +230,8 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
			percpu_ref_noop_confirm_switch;

	percpu_ref_get(ref);	/* put after confirmation */
-	call_rcu(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu);
+	call_rcu_flush(&ref->data->rcu,
+		       percpu_ref_switch_to_atomic_rcu);
 }

 static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
-- 
2.37.3.998.g577e59143f-goog