kernel/sched/sched.h | 2 ++ 1 file changed, 2 insertions(+)
The dead lock can happen if we try to use printk(), such as a call of
SCHED_WARN_ON(), during the rq->__lock is held. The printk() will try to
print the message to the console, and the console driver can call
queue_work_on(), which will try to obtain rq->__lock again.
This means that any WARN during the kernel function that hold the
rq->__lock, such as schedule(), sched_ttwu_pending(), etc, can cause dead
lock.
Following is the call trace of the deadlock case that I encounter:
PID: 0 TASK: ff36bfda010c8000 CPU: 156 COMMAND: "swapper/156"
#0 crash_nmi_callback+30
#1 nmi_handle+85
#2 default_do_nmi+66
#3 exc_nmi+291
#4 end_repeat_nmi+22
[exception RIP: native_queued_spin_lock_slowpath+96]
#5 native_queued_spin_lock_slowpath+96
#6 _raw_spin_lock+30
#7 ttwu_queue+111
#8 try_to_wake_up+375
#9 __queue_work+462
#10 queue_work_on+32
#11 soft_cursor+420
#12 bit_cursor+898
#13 hide_cursor+39
#14 vt_console_print+995
#15 call_console_drivers.constprop.0+204
#16 console_unlock+374
#17 vprintk_emit+280
#18 printk+88
#19 __warn_printk+71
#20 enqueue_task_fair+1779
#21 activate_task+102
#22 ttwu_do_activate+155
#23 sched_ttwu_pending+177
#24 flush_smp_call_function_from_idle+42
#25 do_idle+161
#26 cpu_startup_entry+25
#27 secondary_startup_64_no_verify+194
Fix this by using __printk_safe_enter()/__printk_safe_exit() in
rq_pin_lock()/rq_unpin_lock(). Then, printk will defer to print out the
buffers to the console.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Bin Lai <laib2@chinatelecom.cn>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc680361..ba06dfdaea55 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1632,6 +1632,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
+ __printk_safe_enter();
#ifdef CONFIG_SCHED_DEBUG
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
@@ -1648,6 +1649,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
rf->clock_update_flags = RQCF_UPDATED;
#endif
+ __printk_safe_exit();
lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
}
--
2.39.2
Hi Menglong,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20240806]
url: https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/sched-make-printk-safe-when-rq-lock-is-held/20240806-155153
base: next-20240806
patch link: https://lore.kernel.org/r/20240806074131.36007-1-dongml2%40chinatelecom.cn
patch subject: [PATCH next] sched: make printk safe when rq lock is held
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240808/202408080114.hpttUZjC-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240808/202408080114.hpttUZjC-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408080114.hpttUZjC-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/core.c:88:
kernel/sched/sched.h: In function 'rq_pin_lock':
>> kernel/sched/sched.h:1705:9: error: implicit declaration of function '__printk_safe_enter'; did you mean '__printk_deferred_enter'? [-Wimplicit-function-declaration]
1705 | __printk_safe_enter();
| ^~~~~~~~~~~~~~~~~~~
| __printk_deferred_enter
kernel/sched/sched.h: In function 'rq_unpin_lock':
>> kernel/sched/sched.h:1722:9: error: implicit declaration of function '__printk_safe_exit'; did you mean '__printk_ratelimit'? [-Wimplicit-function-declaration]
1722 | __printk_safe_exit();
| ^~~~~~~~~~~~~~~~~~
| __printk_ratelimit
vim +1705 kernel/sched/sched.h
1690
1691 /*
1692 * Lockdep annotation that avoids accidental unlocks; it's like a
1693 * sticky/continuous lockdep_assert_held().
1694 *
1695 * This avoids code that has access to 'struct rq *rq' (basically everything in
1696 * the scheduler) from accidentally unlocking the rq if they do not also have a
1697 * copy of the (on-stack) 'struct rq_flags rf'.
1698 *
1699 * Also see Documentation/locking/lockdep-design.rst.
1700 */
1701 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
1702 {
1703 rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
1704
> 1705 __printk_safe_enter();
1706 #ifdef CONFIG_SCHED_DEBUG
1707 rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
1708 rf->clock_update_flags = 0;
1709 # ifdef CONFIG_SMP
1710 SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != &balance_push_callback);
1711 # endif
1712 #endif
1713 }
1714
1715 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
1716 {
1717 #ifdef CONFIG_SCHED_DEBUG
1718 if (rq->clock_update_flags > RQCF_ACT_SKIP)
1719 rf->clock_update_flags = RQCF_UPDATED;
1720 #endif
1721
> 1722 __printk_safe_exit();
1723 lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
1724 }
1725
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Menglong,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20240806]
url: https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/sched-make-printk-safe-when-rq-lock-is-held/20240806-155153
base: next-20240806
patch link: https://lore.kernel.org/r/20240806074131.36007-1-dongml2%40chinatelecom.cn
patch subject: [PATCH next] sched: make printk safe when rq lock is held
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20240808/202408080033.hGD8Z1m1-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240808/202408080033.hGD8Z1m1-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408080033.hGD8Z1m1-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
548 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
561 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
574 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
585 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
595 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
605 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
693 | readsb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
701 | readsw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
709 | readsl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
718 | writesb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
727 | writesw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
736 | writesl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
In file included from kernel/sched/core.c:88:
>> kernel/sched/sched.h:1705:2: error: call to undeclared function '__printk_safe_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1705 | __printk_safe_enter();
| ^
kernel/sched/sched.h:1705:2: note: did you mean '__printk_deferred_enter'?
include/linux/printk.h:166:13: note: '__printk_deferred_enter' declared here
166 | extern void __printk_deferred_enter(void);
| ^
In file included from kernel/sched/core.c:88:
>> kernel/sched/sched.h:1722:2: error: call to undeclared function '__printk_safe_exit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1722 | __printk_safe_exit();
| ^
kernel/sched/sched.h:1722:2: note: did you mean '__printk_ratelimit'?
include/linux/printk.h:182:12: note: '__printk_ratelimit' declared here
182 | extern int __printk_ratelimit(const char *func);
| ^
12 warnings and 2 errors generated.
vim +/__printk_safe_enter +1705 kernel/sched/sched.h
1690
1691 /*
1692 * Lockdep annotation that avoids accidental unlocks; it's like a
1693 * sticky/continuous lockdep_assert_held().
1694 *
1695 * This avoids code that has access to 'struct rq *rq' (basically everything in
1696 * the scheduler) from accidentally unlocking the rq if they do not also have a
1697 * copy of the (on-stack) 'struct rq_flags rf'.
1698 *
1699 * Also see Documentation/locking/lockdep-design.rst.
1700 */
1701 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
1702 {
1703 rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
1704
> 1705 __printk_safe_enter();
1706 #ifdef CONFIG_SCHED_DEBUG
1707 rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
1708 rf->clock_update_flags = 0;
1709 # ifdef CONFIG_SMP
1710 SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != &balance_push_callback);
1711 # endif
1712 #endif
1713 }
1714
1715 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
1716 {
1717 #ifdef CONFIG_SCHED_DEBUG
1718 if (rq->clock_update_flags > RQCF_ACT_SKIP)
1719 rf->clock_update_flags = RQCF_UPDATED;
1720 #endif
1721
> 1722 __printk_safe_exit();
1723 lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
1724 }
1725
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Tue, Aug 06, 2024 at 03:41:31PM +0800, Menglong Dong wrote: > The dead lock can happen if we try to use printk(), such as a call of > SCHED_WARN_ON(), during the rq->__lock is held. The printk() will try to > print the message to the console, and the console driver can call > queue_work_on(), which will try to obtain rq->__lock again. > > This means that any WARN during the kernel function that hold the > rq->__lock, such as schedule(), sched_ttwu_pending(), etc, can cause dead > lock. > > Following is the call trace of the deadlock case that I encounter: > > PID: 0 TASK: ff36bfda010c8000 CPU: 156 COMMAND: "swapper/156" > #0 crash_nmi_callback+30 > #1 nmi_handle+85 > #2 default_do_nmi+66 > #3 exc_nmi+291 > #4 end_repeat_nmi+22 > [exception RIP: native_queued_spin_lock_slowpath+96] > #5 native_queued_spin_lock_slowpath+96 > #6 _raw_spin_lock+30 > #7 ttwu_queue+111 > #8 try_to_wake_up+375 > #9 __queue_work+462 > #10 queue_work_on+32 > #11 soft_cursor+420 > #12 bit_cursor+898 > #13 hide_cursor+39 > #14 vt_console_print+995 > #15 call_console_drivers.constprop.0+204 > #16 console_unlock+374 > #17 vprintk_emit+280 > #18 printk+88 > #19 __warn_printk+71 > #20 enqueue_task_fair+1779 > #21 activate_task+102 > #22 ttwu_do_activate+155 > #23 sched_ttwu_pending+177 > #24 flush_smp_call_function_from_idle+42 > #25 do_idle+161 > #26 cpu_startup_entry+25 > #27 secondary_startup_64_no_verify+194 > > Fix this by using __printk_safe_enter()/__printk_safe_exit() in > rq_pin_lock()/rq_unpin_lock(). Then, printk will defer to print out the > buffers to the console. Nope, sorry, not happening.
© 2016 - 2025 Red Hat, Inc.