[PATCH 4.4 4.9 v1 2/2] mm: slub: allocate_slab() enables IRQ right after scheduler starts

Kazuhiro Hayashi posted 2 patches 10 months, 1 week ago
This patch resolves a problem in the 4.4 and 4.9 PREEMPT_RT kernels
where the following WARNING is raised repeatedly because slab
allocation mistakenly runs with IRQs disabled, breaking the context.

WARNING: CPU: * PID: ** at */kernel/cpu.c:197 unpin_current_cpu+0x60/0x70()

Once it occurs, the system becomes almost unresponsive and the boot
stalls. The repeated WARNING only happens while the kernel is booting
(before it reaches userland), with quite low reproducibility:
only once in around 1,000 to 10,000 reboots.

[Problem details]

On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables
IRQs with local_irq_save(), allocate_slab() is responsible for
re-enabling them only under specific conditions:

(1) gfpflags_allow_blocking(flags) OR
(2) system_state == SYSTEM_RUNNING

The problem happens when (1) is false AND system_state == SYSTEM_BOOTING,
caused by the following scenario:

1. Some kernel code invokes the allocator without the __GFP_DIRECT_RECLAIM
   bit (i.e. blocking not allowed) while system_state is SYSTEM_BOOTING
2. allocate_slab() calls down into the page allocator with IRQs disabled
3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock), which
   might call schedule() and enable IRQs if it fails to acquire pa_lock
4. The migrate_disable counter, which is not intended to be updated with
   IRQs disabled, is accidentally updated after schedule(), then
   migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0)
5. The unpin_current_cpu() WARNING is raised eventually because the
   refcount counter is linked to the migrate_disable counter

The behavior in steps 2-5 above has been observed[1] using ftrace.
Condition (2) above is intended to make the memory allocator fully
preemptible on PREEMPT_RT kernels[2], so the lock function in step 3
above works as expected under SYSTEM_RUNNING but not under SYSTEM_BOOTING.

[How this is resolved in newer RT kernels]

A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3].
On top of this, v4.14-rt (6cec8467) changes the condition (2) above:

-	if (system_state == SYSTEM_RUNNING)
+	if (system_state > SYSTEM_BOOTING)

This avoids the problem by enabling IRQs once system_state reaches SYSTEM_SCHEDULING.
Thus, the conditions under which allocate_slab() enables IRQs are:

(2)system_state   v4.9-rt or before  v4.14-rt or later
SYSTEM_BOOTING        (1)==true          (1)==true
                          :                  :
                          :                  v
SYSTEM_SCHEDULING         : < Problem      Always
                          v < occurs here    |
SYSTEM_RUNNING          Always               |
                          |                  |
                          v                  v

[How this patch works]

A simple option would be to backport the series[3], which is possible
and has been verified[4]. However, that series pulls in functional
changes such as SYSTEM_SCHEDULING and adjustments for it, early
might_sleep() and smp_processor_id() support, etc.
Therefore, instead of introducing SYSTEM_SCHEDULING, this patch uses
an extra (non-mainline) flag "system_scheduling" provided by the
prior patch, then applies the same condition as newer RT kernels
in allocate_slab().

This patch also applies the fix in v5.4-rt (7adf5bc5) to account for
SYSTEM_SUSPEND in the condition check.

[1] https://lore.kernel.org/all/TYCPR01MB11385E3CDF05544B63F7EF9C1E1622@TYCPR01MB11385.jpnprd01.prod.outlook.com/
[2] https://docs.kernel.org/locking/locktypes.html#raw-spinlock-t-on-rt
[3] https://lore.kernel.org/all/20170516184231.564888231@linutronix.de/T/
[4] https://lore.kernel.org/all/TYCPR01MB1138579CA7612B568BB880652E1272@TYCPR01MB11385.jpnprd01.prod.outlook.com/

Signed-off-by: Kazuhiro Hayashi <kazuhiro3.hayashi@toshiba.co.jp>
Reviewed-by: Pavel Machek <pavel@denx.de>
---
 mm/slub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index fd23ff951395..6186a2586289 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1412,7 +1412,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if (gfpflags_allow_blocking(flags))
 		enableirqs = true;
 #ifdef CONFIG_PREEMPT_RT_FULL
-	if (system_state == SYSTEM_RUNNING)
+	/* SYSTEM_SCHEDULING <= system_state < SYSTEM_SUSPEND in the mainline */
+	if (system_scheduling && system_state < SYSTEM_SUSPEND)
 		enableirqs = true;
 #endif
 	if (enableirqs)
-- 
2.30.2
Re: [PATCH 4.4 4.9 v1 2/2] mm: slub: allocate_slab() enables IRQ right after scheduler starts
Posted by Sebastian Andrzej Siewior 10 months, 1 week ago
On 2025-02-04 09:46:04 [+0900], Kazuhiro Hayashi wrote:
> On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables
> IRQs with local_irq_save(), allocate_slab() is responsible for
> re-enabling them only under specific conditions:
> 
> (1) gfpflags_allow_blocking(flags) OR
> (2) system_state == SYSTEM_RUNNING
>
> The problem happens when (1) is false AND system_state == SYSTEM_BOOTING,
> caused by the following scenario:
> 
> 1. Some kernel code invokes the allocator without the __GFP_DIRECT_RECLAIM
>    bit (i.e. blocking not allowed) while system_state is SYSTEM_BOOTING
> 2. allocate_slab() calls down into the page allocator with IRQs disabled
> 3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock), which
>    might call schedule() and enable IRQs if it fails to acquire pa_lock
> 4. The migrate_disable counter, which is not intended to be updated with
>    IRQs disabled, is accidentally updated after schedule(), then
>    migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0)
> 5. The unpin_current_cpu() WARNING is raised eventually because the
>    refcount counter is linked to the migrate_disable counter

I think the problem is that system_state _is_ SYSTEM_BOOTING but
scheduling and/or an additional CPU is already enabled. This is why the
lock is contended. Otherwise you wouldn't get that far. 

> The behavior in steps 2-5 above has been observed[1] using ftrace.
> Condition (2) above is intended to make the memory allocator fully
> preemptible on PREEMPT_RT kernels[2], so the lock function in step 3
> above works as expected under SYSTEM_RUNNING but not under SYSTEM_BOOTING.
> 
> [How this is resolved in newer RT kernels]
> 
> A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3].
> On top of this, v4.14-rt (6cec8467) changes the condition (2) above:
> 
> -	if (system_state == SYSTEM_RUNNING)
> +	if (system_state > SYSTEM_BOOTING)
> 
> This avoids the problem by enabling IRQs once system_state reaches SYSTEM_SCHEDULING.

Yes. Once scheduling is enabled any sleeping lock might be contended so
interrupts must be enabled. This is not done unconditionally
because early boot code expects to run with disabled interrupts.
This "system_state == SYSTEM_RUNNING" looked like a sane state in
between. However, once the scheduler is up and running, interrupts in
the system are enabled and so the slub allocator needs to always enable
interrupts.

> Thus, the conditions under which allocate_slab() enables IRQs are:
> 
> (2)system_state   v4.9-rt or before  v4.14-rt or later
> SYSTEM_BOOTING        (1)==true          (1)==true
>                           :                  :
>                           :                  v
> SYSTEM_SCHEDULING         : < Problem      Always
>                           v < occurs here    |
> SYSTEM_RUNNING          Always               |
>                           |                  |
>                           v                  v
> 
> [How this patch works]
> 
> A simple option would be to backport the series[3], which is possible
> and has been verified[4]. However, that series pulls in functional
> changes such as SYSTEM_SCHEDULING and adjustments for it, early
> might_sleep() and smp_processor_id() support, etc.
> Therefore, instead of introducing SYSTEM_SCHEDULING, this patch uses
> an extra (non-mainline) flag "system_scheduling" provided by the
> prior patch, then applies the same condition as newer RT kernels
> in allocate_slab().

The proposal looks okay. However the verified upstream version not only
addresses your issue but also makes smp_processor_id() and might_sleep()
work in the early phase. I would prefer the upstream solution for those
two reasons.

Sebastian
Re: [PATCH 4.4 4.9 v1 2/2] mm: slub: allocate_slab() enables IRQ right after scheduler starts
Posted by Pavel Machek 10 months, 1 week ago
Hi!

> > A simple option would be to backport the series[3], which is possible
> > and has been verified[4]. However, that series pulls in functional
> > changes such as SYSTEM_SCHEDULING and adjustments for it, early
> > might_sleep() and smp_processor_id() support, etc.
> > Therefore, instead of introducing SYSTEM_SCHEDULING, this patch uses
> > an extra (non-mainline) flag "system_scheduling" provided by the
> > prior patch, then applies the same condition as newer RT kernels
> > in allocate_slab().
> 
> The proposal looks okay. However the verified upstream version not only
> addresses your issue but also makes smp_processor_id() and might_sleep()
> work in the early phase. I would prefer the upstream solution for those
> two reasons.

So... first, thanks for the review. This is mostly my fault: Kazuhiro
Hayashi did the full backport, but I asked for a minimal version.

AFAICT there's no place for stable-rt to apply this, so we'll just be
taking this into our trees in the CIP project. (And I prefer the smaller
version. The additional checking is nice for development but not so nice
this late in the development cycle.)

Best regards,
								Pavel
-- 
DENX Software Engineering GmbH,        Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany