[RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe

Posted by Shakeel Butt 9 months, 2 weeks ago
To make css_rstat_updated() safe to call from nmi context, it must not
spin on locks and instead has to trylock the per-cpu per-ss raw
spinlock. This patch implements a backlog mechanism to handle failure
to acquire that per-cpu per-ss raw spinlock.

Each subsystem provides a per-cpu lockless list on which the kernel
stores the css passed to css_rstat_updated() on trylock failure. These
lockless lists serve as the backlog. On the cgroup stats flushing path,
the kernel first processes all the per-cpu lockless backlog lists of
the given ss and then proceeds to flush the updated stat trees.

With css_rstat_updated() being nmi safe, the memcg stats can and will
be converted to be nmi safe as well, enabling nmi safe memcg charging.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++----------
 1 file changed, 76 insertions(+), 23 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d3092b4c85d7..ac533e46afa9 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -11,6 +11,7 @@
 
 static DEFINE_SPINLOCK(rstat_base_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, rstat_base_cpu_lock);
+static DEFINE_PER_CPU(struct llist_head, rstat_backlog_list);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
 
@@ -42,6 +43,13 @@ static raw_spinlock_t *ss_rstat_cpu_lock(struct cgroup_subsys *ss, int cpu)
 	return per_cpu_ptr(&rstat_base_cpu_lock, cpu);
 }
 
+static struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
+{
+	if (ss)
+		return per_cpu_ptr(ss->lhead, cpu);
+	return per_cpu_ptr(&rstat_backlog_list, cpu);
+}
+
 /*
  * Helper functions for rstat per CPU locks.
  *
@@ -86,6 +94,21 @@ unsigned long _css_rstat_cpu_lock(struct cgroup_subsys_state *css, int cpu,
 	return flags;
 }
 
+static __always_inline
+bool _css_rstat_cpu_trylock(struct cgroup_subsys_state *css, int cpu,
+			    unsigned long *flags)
+{
+	struct cgroup *cgrp = css->cgroup;
+	raw_spinlock_t *cpu_lock;
+	bool contended;
+
+	cpu_lock = ss_rstat_cpu_lock(css->ss, cpu);
+	contended = !raw_spin_trylock_irqsave(cpu_lock, *flags);
+	if (contended)
+		trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+	return !contended;
+}
+
 static __always_inline
 void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
 		unsigned long flags, const bool fast_path)
@@ -102,32 +125,16 @@ void _css_rstat_cpu_unlock(struct cgroup_subsys_state *css, int cpu,
 	raw_spin_unlock_irqrestore(cpu_lock, flags);
 }
 
-/**
- * css_rstat_updated - keep track of updated rstat_cpu
- * @css: target cgroup subsystem state
- * @cpu: cpu on which rstat_cpu was updated
- *
- * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
- * rstat_cpu->updated_children list. See the comment on top of
- * css_rstat_cpu definition for details.
- */
-__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+static void css_add_to_backlog(struct cgroup_subsys_state *css, int cpu)
 {
-	unsigned long flags;
-
-	/*
-	 * Speculative already-on-list test. This may race leading to
-	 * temporary inaccuracies, which is fine.
-	 *
-	 * Because @parent's updated_children is terminated with @parent
-	 * instead of NULL, we can tell whether @css is on the list by
-	 * testing the next pointer for NULL.
-	 */
-	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
-		return;
+	struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
+	struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
 
-	flags = _css_rstat_cpu_lock(css, cpu, true);
+	llist_add_iff_not_on_list(&rstatc->lnode, lhead);
+}
 
+static void __css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+{
 	/* put @css and all ancestors on the corresponding updated lists */
 	while (true) {
 		struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);
@@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 
 		css = parent;
 	}
+}
+
+static void css_process_backlog(struct cgroup_subsys *ss, int cpu)
+{
+	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
+	struct llist_node *lnode;
+
+	while ((lnode = llist_del_first_init(lhead))) {
+		struct css_rstat_cpu *rstatc;
+
+		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
+		__css_rstat_updated(rstatc->owner, cpu);
+	}
+}
+
+/**
+ * css_rstat_updated - keep track of updated rstat_cpu
+ * @css: target cgroup subsystem state
+ * @cpu: cpu on which rstat_cpu was updated
+ *
+ * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
+ * rstat_cpu->updated_children list. See the comment on top of
+ * css_rstat_cpu definition for details.
+ */
+__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
+{
+	unsigned long flags;
+
+	/*
+	 * Speculative already-on-list test. This may race leading to
+	 * temporary inaccuracies, which is fine.
+	 *
+	 * Because @parent's updated_children is terminated with @parent
+	 * instead of NULL, we can tell whether @css is on the list by
+	 * testing the next pointer for NULL.
+	 */
+	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
+		return;
+
+	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
+		css_add_to_backlog(css, cpu);
+		return;
+	}
+
+	__css_rstat_updated(css, cpu);
 
 	_css_rstat_cpu_unlock(css, cpu, flags, true);
 }
@@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
 
 	flags = _css_rstat_cpu_lock(root, cpu, false);
 
+	css_process_backlog(root->ss, cpu);
 	/* Return NULL if this subtree is not on-list */
 	if (!rstatc->updated_next)
 		goto unlock_ret;
-- 
2.47.1
Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Posted by Yosry Ahmed 9 months, 1 week ago
On Mon, Apr 28, 2025 at 11:12:09PM -0700, Shakeel Butt wrote:
> To make css_rstat_updated() able to safely run in nmi context, it can
> not spin on locks and rather has to do trylock on the per-cpu per-ss raw
> spinlock. This patch implements the backlog mechanism to handle the
> failure in acquiring the per-cpu per-ss raw spinlock.
> 
> Each subsystem provides a per-cpu lockless list on which the kernel
> stores the css given to css_rstat_updated() on trylock failure. These
> lockless lists serve as backlog. On cgroup stats flushing code path, the
> kernel first processes all the per-cpu lockless backlog lists of the
> given ss and then proceeds to flush the update stat trees.
> 
> With css_rstat_updated() being nmi safe, the memch stats can and will be
> converted to be nmi safe to enable nmi safe mem charging.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  kernel/cgroup/rstat.c | 99 +++++++++++++++++++++++++++++++++----------
>  1 file changed, 76 insertions(+), 23 deletions(-)
> 
[..]
> @@ -153,6 +160,51 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>  
>  		css = parent;
>  	}
> +}
> +
> +static void css_process_backlog(struct cgroup_subsys *ss, int cpu)
> +{
> +	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> +	struct llist_node *lnode;
> +
> +	while ((lnode = llist_del_first_init(lhead))) {
> +		struct css_rstat_cpu *rstatc;
> +
> +		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
> +		__css_rstat_updated(rstatc->owner, cpu);
> +	}
> +}
> +
> +/**
> + * css_rstat_updated - keep track of updated rstat_cpu
> + * @css: target cgroup subsystem state
> + * @cpu: cpu on which rstat_cpu was updated
> + *
> + * @css's rstat_cpu on @cpu was updated. Put it on the parent's matching
> + * rstat_cpu->updated_children list. See the comment on top of
> + * css_rstat_cpu definition for details.
> + */
> +__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> +{
> +	unsigned long flags;
> +
> +	/*
> +	 * Speculative already-on-list test. This may race leading to
> +	 * temporary inaccuracies, which is fine.
> +	 *
> +	 * Because @parent's updated_children is terminated with @parent
> +	 * instead of NULL, we can tell whether @css is on the list by
> +	 * testing the next pointer for NULL.
> +	 */
> +	if (data_race(css_rstat_cpu(css, cpu)->updated_next))
> +		return;
> +
> +	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {


IIUC this trylock will only fail if a BPF program runs in NMI context
and tries to update cgroup stats, interrupting a context that is already
holding the lock (i.e. updating or flushing stats).

How often does this happen in practice tho? Is it worth the complexity?

I wonder if it's better if we make css_rstat_updated() inherently
lockless instead.

What if css_rstat_updated() always just adds to a lockless tree, and we
defer constructing the proper tree to the flushing side? This should
make updates generally faster and avoids locking or disabling interrupts
in the fast path. We essentially push more work to the flushing side.

We may be able to consolidate some of the code too if all the logic
manipulating the tree is on the flushing side.

WDYT? Am I missing something here?
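
Roughly, the update side could become something like the below
(completely untested sketch, reusing the lnode member and the
llist_add_iff_not_on_list() helper this patch already relies on):

__bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
{
	struct llist_head *lhead = ss_lhead_cpu(css->ss, cpu);
	struct css_rstat_cpu *rstatc = css_rstat_cpu(css, cpu);

	/*
	 * Just publish this css on the per-cpu lockless list; nmi, irq
	 * and task contexts all take the same path, with no locking and
	 * no irq fiddling. Linking @css and its ancestors into the
	 * updated_children/updated_next tree is deferred to flush time.
	 */
	llist_add_iff_not_on_list(&rstatc->lnode, lhead);
}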

> +		css_add_to_backlog(css, cpu);
> +		return;
> +	}
> +
> +	__css_rstat_updated(css, cpu);
>  
>  	_css_rstat_cpu_unlock(css, cpu, flags, true);
>  }
> @@ -255,6 +307,7 @@ static struct cgroup_subsys_state *css_rstat_updated_list(
>  
>  	flags = _css_rstat_cpu_lock(root, cpu, false);
>  
> +	css_process_backlog(root->ss, cpu);
>  	/* Return NULL if this subtree is not on-list */
>  	if (!rstatc->updated_next)
>  		goto unlock_ret;
> -- 
> 2.47.1
>
Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Posted by Shakeel Butt 9 months, 1 week ago
On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
[...]
> > +
> > +	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> 
> 
> IIUC this trylock will only fail if a BPF program runs in NMI context
> and tries to update cgroup stats, interrupting a context that is already
> holding the lock (i.e. updating or flushing stats).
> 

Correct (though note that the flushing side can be on a different CPU).

> How often does this happen in practice tho? Is it worth the complexity?

This is about correctness, so even a chance of occurrence needs a
solution.

> 
> I wonder if it's better if we make css_rstat_updated() inherently
> lockless instead.
> 
> What if css_rstat_updated() always just adds to a lockless tree,

Here I assume you meant lockless list instead of tree.

> and we
> defer constructing the proper tree to the flushing side? This should
> make updates generally faster and avoids locking or disabling interrupts
> in the fast path. We essentially push more work to the flushing side.
> 
> We may be able to consolidate some of the code too if all the logic
> manipulating the tree is on the flushing side.
> 
> WDYT? Am I missing something here?
> 

Yes this can be done, but I don't think we need to tie that to the current
series. I think we can start with lockless in the nmi context and then
iteratively make css_rstat_updated() lockless for all contexts.
Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Posted by Yosry Ahmed 9 months, 1 week ago
On Thu, May 01, 2025 at 03:10:20PM -0700, Shakeel Butt wrote:
> On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
> [...]
> > > +
> > > +	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> > 
> > 
> > IIUC this trylock will only fail if a BPF program runs in NMI context
> > and tries to update cgroup stats, interrupting a context that is already
> > holding the lock (i.e. updating or flushing stats).
> > 
> 
> Correct (though note that flushing side can be on a different CPU).
> 
> > How often does this happen in practice tho? Is it worth the complexity?
> 
> This is about correctness, so even a chance of occurance need the
> solution.

Right, my question was more about the need to special case NMIs, see
below.

> 
> > 
> > I wonder if it's better if we make css_rstat_updated() inherently
> > lockless instead.
> > 
> > What if css_rstat_updated() always just adds to a lockless tree,
> 
> Here I assume you meant lockless list instead of tree.

Yeah, in a sense. I meant using lockless lists to implement the rstat
tree instead of normal linked lists.

> 
> > and we
> > defer constructing the proper tree to the flushing side? This should
> > make updates generally faster and avoids locking or disabling interrupts
> > in the fast path. We essentially push more work to the flushing side.
> > 
> > We may be able to consolidate some of the code too if all the logic
> > manipulating the tree is on the flushing side.
> > 
> > WDYT? Am I missing something here?
> > 
> 
> Yes this can be done but I don't think we need to tie that to current
> series. I think we can start with lockless in the nmi context and then
> iteratively make css_rstat_updated() lockless for all contexts.

My question is basically whether it would be simpler to actually make it
all lockless than to special-case NMIs. With this patch we have two
different paths and a deferred list that we process at a later point. I
think it may be simpler if we just make it all lockless to begin with.
Then we would have a single path and no special deferred processing.

WDYT?
Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Posted by Shakeel Butt 9 months, 1 week ago
On Tue, May 06, 2025 at 09:41:04AM +0000, Yosry Ahmed wrote:
> On Thu, May 01, 2025 at 03:10:20PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
> > [...]
> > > > +
> > > > +	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> > > 
> > > 
> > > IIUC this trylock will only fail if a BPF program runs in NMI context
> > > and tries to update cgroup stats, interrupting a context that is already
> > > holding the lock (i.e. updating or flushing stats).
> > > 
> > 
> > Correct (though note that flushing side can be on a different CPU).
> > 
> > > How often does this happen in practice tho? Is it worth the complexity?
> > 
> > This is about correctness, so even a chance of occurance need the
> > solution.
> 
> Right, my question was more about the need to special case NMIs, see
> below.
> 
> > 
> > > 
> > > I wonder if it's better if we make css_rstat_updated() inherently
> > > lockless instead.
> > > 
> > > What if css_rstat_updated() always just adds to a lockless tree,
> > 
> > Here I assume you meant lockless list instead of tree.
> 
> Yeah, in a sense. I meant using lockless lists to implement the rstat
> tree instead of normal linked lists.
> 
> > 
> > > and we
> > > defer constructing the proper tree to the flushing side? This should
> > > make updates generally faster and avoids locking or disabling interrupts
> > > in the fast path. We essentially push more work to the flushing side.
> > > 
> > > We may be able to consolidate some of the code too if all the logic
> > > manipulating the tree is on the flushing side.
> > > 
> > > WDYT? Am I missing something here?
> > > 
> > 
> > Yes this can be done but I don't think we need to tie that to current
> > series. I think we can start with lockless in the nmi context and then
> > iteratively make css_rstat_updated() lockless for all contexts.
> 
> My question is basically whether it would be simpler to actually make it
> all lockless than special casing NMIs. With this patch we have two
> different paths and a deferred list that we process at a later point. I
> think it may be simpler if we just make it all lockless to begin with.
> Then we would have a single path and no special deferred processing.
> 
> WDYT?

So, on the update side, always add to the lockless list (if not already
on it) and on the flush side, build the update tree from the lockless
list and flush it. Hopefully this tree building and flushing can be done
in a more optimized way. Is this what you are suggesting?
Re: [RFC PATCH 3/3] cgroup: make css_rstat_updated nmi safe
Posted by Yosry Ahmed 9 months ago
On Tue, May 06, 2025 at 12:30:18PM -0700, Shakeel Butt wrote:
> On Tue, May 06, 2025 at 09:41:04AM +0000, Yosry Ahmed wrote:
> > On Thu, May 01, 2025 at 03:10:20PM -0700, Shakeel Butt wrote:
> > > On Wed, Apr 30, 2025 at 06:14:28AM -0700, Yosry Ahmed wrote:
> > > [...]
> > > > > +
> > > > > +	if (!_css_rstat_cpu_trylock(css, cpu, &flags)) {
> > > > 
> > > > 
> > > > IIUC this trylock will only fail if a BPF program runs in NMI context
> > > > and tries to update cgroup stats, interrupting a context that is already
> > > > holding the lock (i.e. updating or flushing stats).
> > > > 
> > > 
> > > Correct (though note that flushing side can be on a different CPU).
> > > 
> > > > How often does this happen in practice tho? Is it worth the complexity?
> > > 
> > > This is about correctness, so even a chance of occurance need the
> > > solution.
> > 
> > Right, my question was more about the need to special case NMIs, see
> > below.
> > 
> > > 
> > > > 
> > > > I wonder if it's better if we make css_rstat_updated() inherently
> > > > lockless instead.
> > > > 
> > > > What if css_rstat_updated() always just adds to a lockless tree,
> > > 
> > > Here I assume you meant lockless list instead of tree.
> > 
> > Yeah, in a sense. I meant using lockless lists to implement the rstat
> > tree instead of normal linked lists.
> > 
> > > 
> > > > and we
> > > > defer constructing the proper tree to the flushing side? This should
> > > > make updates generally faster and avoids locking or disabling interrupts
> > > > in the fast path. We essentially push more work to the flushing side.
> > > > 
> > > > We may be able to consolidate some of the code too if all the logic
> > > > manipulating the tree is on the flushing side.
> > > > 
> > > > WDYT? Am I missing something here?
> > > > 
> > > 
> > > Yes this can be done but I don't think we need to tie that to current
> > > series. I think we can start with lockless in the nmi context and then
> > > iteratively make css_rstat_updated() lockless for all contexts.
> > 
> > My question is basically whether it would be simpler to actually make it
> > all lockless than special casing NMIs. With this patch we have two
> > different paths and a deferred list that we process at a later point. I
> > think it may be simpler if we just make it all lockless to begin with.
> > Then we would have a single path and no special deferred processing.
> > 
> > WDYT?
> 
> So, in the update side, always add to the lockless list (if not already)
> and on the flush side, built the udpate tree from the lockless list and
> flush it.

Exactly, yes.

> Hopefully this tree building and flushing can be done in a
> more optimized way. Is this what you are suggesting?

Yes, but this latter part can be a follow-up if it's not
straightforward. For now we can just use a lockless list on the update
side and move updating the tree (i.e. updated_next and updated_children)
to the flush side.
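
For completeness, the flush side could then look roughly like the below
(untested sketch; it is essentially css_process_backlog() from this
patch promoted to being the only way css's get linked into the update
tree, and the function name is just for illustration):

static void css_build_updated_tree(struct cgroup_subsys *ss, int cpu)
{
	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
	struct llist_node *lnode;

	while ((lnode = llist_del_first_init(lhead))) {
		struct css_rstat_cpu *rstatc;

		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
		/*
		 * All updated_children/updated_next manipulation now
		 * happens here, at flush time, under the per-cpu lock.
		 */
		__css_rstat_updated(rstatc->owner, cpu);
	}
}

css_rstat_updated_list() would call this right after taking the per-cpu
lock, exactly where this patch calls css_process_backlog(), and the rest
of the tree walk would stay as-is.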