This patch resolves problem in 4.4 & 4.9 PREEMPT_RT kernels that
the following WARNING happens repeatedly due to broken context
caused by running slab allocation with IRQ disabled by mistake.
WARNING: CPU: * PID: ** at */kernel/cpu.c:197 unpin_current_cpu+0x60/0x70()
The system is almost unresponsive and the boot stalls once it occurs.
This repeated WARNING only happens while kernel is booting
(before reaches the userland) with a quite low reproducibility:
Only one time in around 1,000 ~ 10,000 reboots.
[Problem details]
On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables
IRQ with local_irq_save(), allocate_slab() is responsible for
re-enabling IRQ only under the specific conditions:
(1) gfpflags_allow_blocking(flags) OR
(2) system_state == SYSTEM_RUNNING
The problem happens when (1) is false AND system_state == SYSTEM_BOOTING,
caused by the following scenario:
1. Some kernel codes invokes the allocator without __GFP_DIRECT_RECLAIM
bit (i.e. blocking not allowed) while SYSTEM_BOOTING
2. allocate_slab() calls the following functions with IRQ disabled
3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock) which
might call schedule() and enable IRQ, if it failed to get pa_lock
4. The migrate_disable counter, which is not intended to be updated with
IRQs disabled, is accidentally updated after schedule() then
migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0)
5. The unpin_current_cpu() WARNING is raised eventually because the
refcount counter is linked to the migrate_disable counter
The behavior 2-5 above has been obsereved[1] using ftrace.
The condition (2) above intends to make the memory allocator fully
preemptible on PREEMPT_RT kernels[2], so the lock function in the
step 3 above should work if SYSTEM_RUNNING but not if SYSTEM_BOOTING.
[How this is resolved in newer RT kernels]
A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3].
On top of this, v4.14-rt (6cec8467) changes the condition (2) above:
- if (system_state == SYSTEM_RUNNING)
+ if (system_state > SYSTEM_BOOTING)
This avoids the problem by enabling IRQ after SYSTEM_SCHEULDING.
Thus, the conditions that allocate_slab() enables IRQ are like:
(2)system_state v4.9-rt or before v4.14-rt or later
SYSTEM_BOOTING (1)==true (1)==true
: :
: v
SYSTEM_SCHEDULING : < Problem Always
v < occurs here |
SYSTEM_RUNNING Always |
| |
v v
[How this patch works]
An simple option would be to backport the series[3], which is possible
and has been verified[4]. However, that series pulls functional
changes like SYSTEM_SCHEDULING and adjustments for it,
early might_sleep() and smp_processor_id() supports, etc.
Therefore, this patch uses an extra (but not mainline) flag
"system_scheduling" provided by the prior patch instead of
introducing SYSTEM_SCHEDULING, then uses the same condition as
newer RT kernels in allocate_slab().
This patch also applies the fix in v5.4-rt (7adf5bc5) to care
SYSTEM_SUSPEND in the condition check.
[1] https://lore.kernel.org/all/TYCPR01MB11385E3CDF05544B63F7EF9C1E1622@TYCPR01MB11385.jpnprd01.prod.outlook.com/
[2] https://docs.kernel.org/locking/locktypes.html#raw-spinlock-t-on-rt
[3] https://lore.kernel.org/all/20170516184231.564888231@linutronix.de/T/
[4] https://lore.kernel.org/all/TYCPR01MB1138579CA7612B568BB880652E1272@TYCPR01MB11385.jpnprd01.prod.outlook.com/
Signed-off-by: Kazuhiro Hayashi <kazuhiro3.hayashi@toshiba.co.jp>
Reviewed-by: Pavel Machek <pavel@denx.de>
---
mm/slub.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/slub.c b/mm/slub.c
index fd23ff951395..6186a2586289 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1412,7 +1412,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
if (gfpflags_allow_blocking(flags))
enableirqs = true;
#ifdef CONFIG_PREEMPT_RT_FULL
- if (system_state == SYSTEM_RUNNING)
+ /* SYSTEM_SCHEDULING <= system_state < SYSTEM_SUSPEND in the mainline */
+ if (system_scheduling && system_state < SYSTEM_SUSPEND)
enableirqs = true;
#endif
if (enableirqs)
--
2.30.2
On 2025-02-04 09:46:04 [+0900], Kazuhiro Hayashi wrote: > On PREEMPT_RT kernels < v4.14-rt, after __slab_alloc() disables > IRQ with local_irq_save(), allocate_slab() is responsible for > re-enabling IRQ only under the specific conditions: > > (1) gfpflags_allow_blocking(flags) OR > (2) system_state == SYSTEM_RUNNING > > The problem happens when (1) is false AND system_state == SYSTEM_BOOTING, > caused by the following scenario: > > 1. Some kernel codes invokes the allocator without __GFP_DIRECT_RECLAIM > bit (i.e. blocking not allowed) while SYSTEM_BOOTING > 2. allocate_slab() calls the following functions with IRQ disabled > 3. buffered_rmqueue() invokes local_[spin_]lock_irqsave(pa_lock) which > might call schedule() and enable IRQ, if it failed to get pa_lock > 4. The migrate_disable counter, which is not intended to be updated with > IRQs disabled, is accidentally updated after schedule() then > migrate_enable() raises WARN_ON_ONCE(p->migrate_disable <= 0) > 5. The unpin_current_cpu() WARNING is raised eventually because the > refcount counter is linked to the migrate_disable counter I think the problem is that system_state _is_ SYSTEM_BOOTING but scheduling and/or an additional CPU is already enabled. This is why the lock is contended. Otherwise you wouldn't get that far. > The behavior 2-5 above has been obsereved[1] using ftrace. > The condition (2) above intends to make the memory allocator fully > preemptible on PREEMPT_RT kernels[2], so the lock function in the > step 3 above should work if SYSTEM_RUNNING but not if SYSTEM_BOOTING. > > [How this is resolved in newer RT kernels] > > A patch series in the mainline (v4.13) introduces SYSTEM_SCHEDULING[3]. > On top of this, v4.14-rt (6cec8467) changes the condition (2) above: > > - if (system_state == SYSTEM_RUNNING) > + if (system_state > SYSTEM_BOOTING) > > This avoids the problem by enabling IRQ after SYSTEM_SCHEULDING. Yes. Once scheduling is enabled any sleeping lock might be contended so interrupts must be enabled. This is not done done unconditionally because early boot code expects to run with disabled interrupts. This "system_state == SYSTEM_RUNNING" looked like a sane state in between. However, once the scheduler is up and running, interrupts in the system are enabled and so the slub allocator needs to always enable interrupts. > Thus, the conditions that allocate_slab() enables IRQ are like: > > (2)system_state v4.9-rt or before v4.14-rt or later > SYSTEM_BOOTING (1)==true (1)==true > : : > : v > SYSTEM_SCHEDULING : < Problem Always > v < occurs here | > SYSTEM_RUNNING Always | > | | > v v > > [How this patch works] > > An simple option would be to backport the series[3], which is possible > and has been verified[4]. However, that series pulls functional > changes like SYSTEM_SCHEDULING and adjustments for it, > early might_sleep() and smp_processor_id() supports, etc. > Therefore, this patch uses an extra (but not mainline) flag > "system_scheduling" provided by the prior patch instead of > introducing SYSTEM_SCHEDULING, then uses the same condition as > newer RT kernels in allocate_slab(). The proposal looks okay. However the verified upstream version not only addresses your issue but also makes smp_processor_id() and might_sleep() work in the early phase. I would prefer the upstream solution for those two reasons. Sebastian
Hi! > > An simple option would be to backport the series[3], which is possible > > and has been verified[4]. However, that series pulls functional > > changes like SYSTEM_SCHEDULING and adjustments for it, > > early might_sleep() and smp_processor_id() supports, etc. > > Therefore, this patch uses an extra (but not mainline) flag > > "system_scheduling" provided by the prior patch instead of > > introducing SYSTEM_SCHEDULING, then uses the same condition as > > newer RT kernels in allocate_slab(). > > The proposal looks okay. However the verified upstream version not only > addresses your issue but also makes smp_processor_id() and might_sleep() > work in the early phase. I would prefer the upstream solution for those > two reasons. So... first, thanks for review. This is mostly my fault, Kazuhiro Hayashi did full backport, but I asked for minimal version. AFAICT there's no place for stable-rt to apply this, so we'll just be taking this to our trees in CIP project. (And I prefer smaller version. Additional checking is nice for development but not so nice this late in development cycle). Best regards, Pavel -- DENX Software Engineering GmbH, Managing Director: Erika Unter HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
© 2016 - 2025 Red Hat, Inc.