[PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed

Shakeel Butt posted 1 patch 1 week, 2 days ago
include/linux/memcontrol.h | 26 +++++++++++++++++++-------
mm/memcontrol.c            |  7 ++++---
2 files changed, 23 insertions(+), 10 deletions(-)
[PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Shakeel Butt 1 week, 2 days ago
Generally memcg charging is allowed from all the contexts including NMI
where even spinning on spinlock can cause locking issues. However one
call chain was missed during the addition of memcg charging from any
context support. That is try_charge_memcg() -> memcg_memory_event() ->
cgroup_file_notify().

The possible function call tree under cgroup_file_notify() can acquire
many different spin locks in spinning mode. Some of them are
cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
just skip cgroup_file_notify() from memcg charging if the context does
not allow spinning.

Alternative approach was also explored where instead of skipping
cgroup_file_notify(), we defer the memcg event processing to irq_work
[1]. However it adds complexity and it was decided to keep things simple
until we need more memcg events with !allow_spinning requirement.

Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
---
Changes since v1:
- Add warning if !allow_spinning is used with memcg events other than
  MEMCG_MAX (requested by Roman)

 include/linux/memcontrol.h | 26 +++++++++++++++++++-------
 mm/memcontrol.c            |  7 ++++---
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 16fe0306e50e..873e510d6f8d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1001,22 +1001,28 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
 	count_memcg_events_mm(mm, idx, 1);
 }
 
-static inline void memcg_memory_event(struct mem_cgroup *memcg,
-				      enum memcg_memory_event event)
+static inline void __memcg_memory_event(struct mem_cgroup *memcg,
+					enum memcg_memory_event event,
+					bool allow_spinning)
 {
 	bool swap_event = event == MEMCG_SWAP_HIGH || event == MEMCG_SWAP_MAX ||
 			  event == MEMCG_SWAP_FAIL;
 
+	/* For now only MEMCG_MAX can happen with !allow_spinning context. */
+	VM_WARN_ON_ONCE(!allow_spinning && event != MEMCG_MAX);
+
 	atomic_long_inc(&memcg->memory_events_local[event]);
-	if (!swap_event)
+	if (!swap_event && allow_spinning)
 		cgroup_file_notify(&memcg->events_local_file);
 
 	do {
 		atomic_long_inc(&memcg->memory_events[event]);
-		if (swap_event)
-			cgroup_file_notify(&memcg->swap_events_file);
-		else
-			cgroup_file_notify(&memcg->events_file);
+		if (allow_spinning) {
+			if (swap_event)
+				cgroup_file_notify(&memcg->swap_events_file);
+			else
+				cgroup_file_notify(&memcg->events_file);
+		}
 
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 			break;
@@ -1026,6 +1032,12 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg,
 		 !mem_cgroup_is_root(memcg));
 }
 
+static inline void memcg_memory_event(struct mem_cgroup *memcg,
+				      enum memcg_memory_event event)
+{
+	__memcg_memory_event(memcg, event, true);
+}
+
 static inline void memcg_memory_event_mm(struct mm_struct *mm,
 					 enum memcg_memory_event event)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e090f29eb03b..4deda33625f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2307,12 +2307,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
+	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
 
 retry:
 	if (consume_stock(memcg, nr_pages))
 		return 0;
 
-	if (!gfpflags_allow_spinning(gfp_mask))
+	if (!allow_spinning)
 		/* Avoid the refill and flush of the older stock */
 		batch = nr_pages;
 
@@ -2348,7 +2349,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
-	memcg_memory_event(mem_over_limit, MEMCG_MAX);
+	__memcg_memory_event(mem_over_limit, MEMCG_MAX, allow_spinning);
 	raised_max_event = true;
 
 	psi_memstall_enter(&pflags);
@@ -2415,7 +2416,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * a MEMCG_MAX event.
 	 */
 	if (!raised_max_event)
-		memcg_memory_event(mem_over_limit, MEMCG_MAX);
+		__memcg_memory_event(mem_over_limit, MEMCG_MAX, allow_spinning);
 
 	/*
 	 * The allocation either can't fail or will lead to more memory
-- 
2.47.3
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Andrew Morton 1 week, 2 days ago
On Mon, 22 Sep 2025 15:02:03 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Generally memcg charging is allowed from all the contexts including NMI
> where even spinning on spinlock can cause locking issues. However one
> call chain was missed during the addition of memcg charging from any
> context support. That is try_charge_memcg() -> memcg_memory_event() ->
> cgroup_file_notify().
> 
> The possible function call tree under cgroup_file_notify() can acquire
> many different spin locks in spinning mode. Some of them are
> cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
> just skip cgroup_file_notify() from memcg charging if the context does
> not allow spinning.
> 
> Alternative approach was also explored where instead of skipping
> cgroup_file_notify(), we defer the memcg event processing to irq_work
> [1]. However it adds complexity and it was decided to keep things simple
> until we need more memcg events with !allow_spinning requirement.

What are the downsides here?  Inaccurate charging obviously, but how
might this affect users?

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2307,12 +2307,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	bool drained = false;
>  	bool raised_max_event = false;
>  	unsigned long pflags;
> +	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
>  

Does this affect only the problematic call chain which you have
identified, or might other callers be undesirably affected?
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Shakeel Butt 1 week, 2 days ago
On Mon, Sep 22, 2025 at 04:03:08PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2025 15:02:03 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Generally memcg charging is allowed from all the contexts including NMI
> > where even spinning on spinlock can cause locking issues. However one
> > call chain was missed during the addition of memcg charging from any
> > context support. That is try_charge_memcg() -> memcg_memory_event() ->
> > cgroup_file_notify().
> > 
> > The possible function call tree under cgroup_file_notify() can acquire
> > many different spin locks in spinning mode. Some of them are
> > cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
> > just skip cgroup_file_notify() from memcg charging if the context does
> > not allow spinning.
> > 
> > Alternative approach was also explored where instead of skipping
> > cgroup_file_notify(), we defer the memcg event processing to irq_work
> > [1]. However it adds complexity and it was decided to keep things simple
> > until we need more memcg events with !allow_spinning requirement.
> 
> What are the downsides here?  Inaccurate charging obviously, but how
> might this affect users?

Charging will still be accurate. The only thing we will miss is the
possible notifications to the userspace for memory.events[.local] files.

> 
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2307,12 +2307,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	bool drained = false;
> >  	bool raised_max_event = false;
> >  	unsigned long pflags;
> > +	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> >  
> 
> Does this affect only the problematic call chain which you have
> identified, or might other callers be undesirably affected?

It will only affect the call chain which can not spin due to possibly
NMI context and at the moment only bpf programs can cause that.
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Andrew Morton 1 week, 2 days ago
On Mon, 22 Sep 2025 16:22:57 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> > 
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2307,12 +2307,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  	bool drained = false;
> > >  	bool raised_max_event = false;
> > >  	unsigned long pflags;
> > > +	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> > >  
> > 
> > Does this affect only the problematic call chain which you have
> > identified, or might other callers be undesirably affected?
> 
> It will only affect the call chain which can not spin due to possibly
> NMI context and at the moment only bpf programs can cause that.

"possibly" NMI context?  Is it possible that a bpf caller which could
have taken locks will now skip the notifications?  Or do the gfp_flags
get propagated all the way through?
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Shakeel Butt 1 week, 2 days ago
On Mon, Sep 22, 2025 at 04:43:28PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2025 16:22:57 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > > 
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2307,12 +2307,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  	bool drained = false;
> > > >  	bool raised_max_event = false;
> > > >  	unsigned long pflags;
> > > > +	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> > > >  
> > > 
> > > Does this affect only the problematic call chain which you have
> > > identified, or might other callers be undesirably affected?
> > 
> > It will only affect the call chain which can not spin due to possibly
> > NMI context and at the moment only bpf programs can cause that.
> 
> "possibly" NMI context? 

NMI is one source which can cause recursive context but bpf programs
attached to specific call chains can also cause this recursion. For
example in [1], a bpf program related to sched_ext was called with
scheduler locks held and then that program made a memcg charged
allocation which can potentially call cgroup_file_notify(). The
notification call chain again tries to grab scheduler locks and
potentially causing deadlocks.

[1] https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/

> Is it possible that a bpf caller which could
> have taken locks will now skip the notifications?  Or do the gfp_flags
> get propagated all the way through?

The bpf programs which might be called/triggerd in a context where
spinning on a lock might not be safe, will skip the notifications. The
gfp_flags are plugged through. Basically we have kmalloc_nolock()
interfaces coming up which will make sure the correct gfp flags are
passed through.
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Andrew Morton 1 week, 2 days ago
On Mon, 22 Sep 2025 15:02:03 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Generally memcg charging is allowed from all the contexts including NMI
> where even spinning on spinlock can cause locking issues. However one
> call chain was missed during the addition of memcg charging from any
> context support. That is try_charge_memcg() -> memcg_memory_event() ->
> cgroup_file_notify().
> 
> The possible function call tree under cgroup_file_notify() can acquire
> many different spin locks in spinning mode. Some of them are
> cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
> just skip cgroup_file_notify() from memcg charging if the context does
> not allow spinning.
> 
> Alternative approach was also explored where instead of skipping
> cgroup_file_notify(), we defer the memcg event processing to irq_work
> [1]. However it adds complexity and it was decided to keep things simple
> until we need more memcg events with !allow_spinning requirement.
> 
> Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: Michal Hocko <mhocko@suse.com>

Fixes a possible kernel deadlock, yes?

Is a cc:stable appropriate and can we identify a Fixes: target?

Thanks.

(Did it ever generate lockdep warnings?)
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Shakeel Butt 1 week, 2 days ago
On Mon, Sep 22, 2025 at 04:04:43PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2025 15:02:03 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Generally memcg charging is allowed from all the contexts including NMI
> > where even spinning on spinlock can cause locking issues. However one
> > call chain was missed during the addition of memcg charging from any
> > context support. That is try_charge_memcg() -> memcg_memory_event() ->
> > cgroup_file_notify().
> > 
> > The possible function call tree under cgroup_file_notify() can acquire
> > many different spin locks in spinning mode. Some of them are
> > cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
> > just skip cgroup_file_notify() from memcg charging if the context does
> > not allow spinning.
> > 
> > Alternative approach was also explored where instead of skipping
> > cgroup_file_notify(), we defer the memcg event processing to irq_work
> > [1]. However it adds complexity and it was decided to keep things simple
> > until we need more memcg events with !allow_spinning requirement.
> > 
> > Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1]
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Fixes a possible kernel deadlock, yes?
> 
> Is a cc:stable appropriate and can we identify a Fixes: target?
> 
> Thanks.
> 
> (Did it ever generate lockdep warnings?)

The report is here:
https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/

I am not sure about the Fixes tag though or more like which one to put
in the Fixes as we recently started supporting memcg charging for NMI
context or allowing bpf programs to do memcg charged allocations in
recursive context (see the above report for this recursive call chain).
There is no single commit which can be blamed here.

Alexei, what do you suggest?
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Andrew Morton 1 week, 2 days ago
On Mon, 22 Sep 2025 16:39:53 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> On Mon, Sep 22, 2025 at 04:04:43PM -0700, Andrew Morton wrote:
> > On Mon, 22 Sep 2025 15:02:03 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > 
> > > Generally memcg charging is allowed from all the contexts including NMI
> > > where even spinning on spinlock can cause locking issues. However one
> > > call chain was missed during the addition of memcg charging from any
> > > context support. That is try_charge_memcg() -> memcg_memory_event() ->
> > > cgroup_file_notify().
> > > 
> > > The possible function call tree under cgroup_file_notify() can acquire
> > > many different spin locks in spinning mode. Some of them are
> > > cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
> > > just skip cgroup_file_notify() from memcg charging if the context does
> > > not allow spinning.
> > > 
> > > Alternative approach was also explored where instead of skipping
> > > cgroup_file_notify(), we defer the memcg event processing to irq_work
> > > [1]. However it adds complexity and it was decided to keep things simple
> > > until we need more memcg events with !allow_spinning requirement.
> > > 
> > > Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1]
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > Fixes a possible kernel deadlock, yes?
> > 
> > Is a cc:stable appropriate and can we identify a Fixes: target?
> > 
> > Thanks.
> > 
> > (Did it ever generate lockdep warnings?)
> 
> The report is here:
> https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/
> 
> I am not sure about the Fixes tag though or more like which one to put
> in the Fixes as we recently started supporting memcg charging for NMI
> context or allowing bpf programs to do memcg charged allocations in
> recursive context (see the above report for this recursive call chain).
> There is no single commit which can be blamed here.

I tend to view the Fixes: as us suggesting which kernel versions should
be patched.  I'm suspecting that's 6.16+, so using the final relevant
patch in that release as a Fixes: target would work.
Re: [PATCH v2] memcg: skip cgroup_file_notify if spinning is not allowed
Posted by Shakeel Butt 1 week, 2 days ago
On Mon, Sep 22, 2025 at 04:55:09PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2025 16:39:53 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > On Mon, Sep 22, 2025 at 04:04:43PM -0700, Andrew Morton wrote:
> > > On Mon, 22 Sep 2025 15:02:03 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > 
> > > > Generally memcg charging is allowed from all the contexts including NMI
> > > > where even spinning on spinlock can cause locking issues. However one
> > > > call chain was missed during the addition of memcg charging from any
> > > > context support. That is try_charge_memcg() -> memcg_memory_event() ->
> > > > cgroup_file_notify().
> > > > 
> > > > The possible function call tree under cgroup_file_notify() can acquire
> > > > many different spin locks in spinning mode. Some of them are
> > > > cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
> > > > just skip cgroup_file_notify() from memcg charging if the context does
> > > > not allow spinning.
> > > > 
> > > > Alternative approach was also explored where instead of skipping
> > > > cgroup_file_notify(), we defer the memcg event processing to irq_work
> > > > [1]. However it adds complexity and it was decided to keep things simple
> > > > until we need more memcg events with !allow_spinning requirement.
> > > > 
> > > > Link: https://lore.kernel.org/all/5qi2llyzf7gklncflo6gxoozljbm4h3tpnuv4u4ej4ztysvi6f@x44v7nz2wdzd/ [1]
> > > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > Fixes a possible kernel deadlock, yes?
> > > 
> > > Is a cc:stable appropriate and can we identify a Fixes: target?
> > > 
> > > Thanks.
> > > 
> > > (Did it ever generate lockdep warnings?)
> > 
> > The report is here:
> > https://lore.kernel.org/all/20250905061919.439648-1-yepeilin@google.com/
> > 
> > I am not sure about the Fixes tag though or more like which one to put
> > in the Fixes as we recently started supporting memcg charging for NMI
> > context or allowing bpf programs to do memcg charged allocations in
> > recursive context (see the above report for this recursive call chain).
> > There is no single commit which can be blamed here.
> 
> I tend to view the Fixes: as us suggesting which kernel versions should
> be patched.  I'm suspecting that's 6.16+, so using the final relevant
> patch in that release as a Fixes: target would work.
> 

Sounds good. Let use the following.

Fixes: 3ac4638a734a ("memcg: make memcg_rstat_updated nmi safe")