* [PATCH next] sched: make printk safe when rq lock is held
@ 2024-08-06 7:41 Menglong Dong
2024-08-06 10:39 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Menglong Dong @ 2024-08-06 7:41 UTC (permalink / raw)
To: peterz
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, Menglong Dong, Bin Lai
The dead lock can happen if we try to use printk(), such as a call of
SCHED_WARN_ON(), during the rq->__lock is held. The printk() will try to
print the message to the console, and the console driver can call
queue_work_on(), which will try to obtain rq->__lock again.
This means that any WARN during the kernel function that hold the
rq->__lock, such as schedule(), sched_ttwu_pending(), etc, can cause dead
lock.
Following is the call trace of the deadlock case that I encounter:
PID: 0 TASK: ff36bfda010c8000 CPU: 156 COMMAND: "swapper/156"
#0 crash_nmi_callback+30
#1 nmi_handle+85
#2 default_do_nmi+66
#3 exc_nmi+291
#4 end_repeat_nmi+22
[exception RIP: native_queued_spin_lock_slowpath+96]
#5 native_queued_spin_lock_slowpath+96
#6 _raw_spin_lock+30
#7 ttwu_queue+111
#8 try_to_wake_up+375
#9 __queue_work+462
#10 queue_work_on+32
#11 soft_cursor+420
#12 bit_cursor+898
#13 hide_cursor+39
#14 vt_console_print+995
#15 call_console_drivers.constprop.0+204
#16 console_unlock+374
#17 vprintk_emit+280
#18 printk+88
#19 __warn_printk+71
#20 enqueue_task_fair+1779
#21 activate_task+102
#22 ttwu_do_activate+155
#23 sched_ttwu_pending+177
#24 flush_smp_call_function_from_idle+42
#25 do_idle+161
#26 cpu_startup_entry+25
#27 secondary_startup_64_no_verify+194
Fix this by using __printk_safe_enter()/__printk_safe_exit() in
rq_pin_lock()/rq_unpin_lock(). Then, printk will defer to print out the
buffers to the console.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Bin Lai <laib2@chinatelecom.cn>
---
kernel/sched/sched.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc680361..ba06dfdaea55 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1632,6 +1632,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
+ __printk_safe_enter();
#ifdef CONFIG_SCHED_DEBUG
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
@@ -1648,6 +1649,7 @@ static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
rf->clock_update_flags = RQCF_UPDATED;
#endif
+ __printk_safe_exit();
lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
}
--
2.39.2
^ permalink raw reply related [flat|nested] 4+ messages in thread* Re: [PATCH next] sched: make printk safe when rq lock is held
2024-08-06 7:41 [PATCH next] sched: make printk safe when rq lock is held Menglong Dong
@ 2024-08-06 10:39 ` Peter Zijlstra
2024-08-07 17:11 ` kernel test robot
2024-08-07 17:31 ` kernel test robot
2 siblings, 0 replies; 4+ messages in thread
From: Peter Zijlstra @ 2024-08-06 10:39 UTC (permalink / raw)
To: Menglong Dong
Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
bsegall, mgorman, vschneid, linux-kernel, Menglong Dong, Bin Lai
On Tue, Aug 06, 2024 at 03:41:31PM +0800, Menglong Dong wrote:
> The dead lock can happen if we try to use printk(), such as a call of
> SCHED_WARN_ON(), during the rq->__lock is held. The printk() will try to
> print the message to the console, and the console driver can call
> queue_work_on(), which will try to obtain rq->__lock again.
>
> This means that any WARN during the kernel function that hold the
> rq->__lock, such as schedule(), sched_ttwu_pending(), etc, can cause dead
> lock.
>
> Following is the call trace of the deadlock case that I encounter:
>
> PID: 0 TASK: ff36bfda010c8000 CPU: 156 COMMAND: "swapper/156"
> #0 crash_nmi_callback+30
> #1 nmi_handle+85
> #2 default_do_nmi+66
> #3 exc_nmi+291
> #4 end_repeat_nmi+22
> [exception RIP: native_queued_spin_lock_slowpath+96]
> #5 native_queued_spin_lock_slowpath+96
> #6 _raw_spin_lock+30
> #7 ttwu_queue+111
> #8 try_to_wake_up+375
> #9 __queue_work+462
> #10 queue_work_on+32
> #11 soft_cursor+420
> #12 bit_cursor+898
> #13 hide_cursor+39
> #14 vt_console_print+995
> #15 call_console_drivers.constprop.0+204
> #16 console_unlock+374
> #17 vprintk_emit+280
> #18 printk+88
> #19 __warn_printk+71
> #20 enqueue_task_fair+1779
> #21 activate_task+102
> #22 ttwu_do_activate+155
> #23 sched_ttwu_pending+177
> #24 flush_smp_call_function_from_idle+42
> #25 do_idle+161
> #26 cpu_startup_entry+25
> #27 secondary_startup_64_no_verify+194
>
> Fix this by using __printk_safe_enter()/__printk_safe_exit() in
> rq_pin_lock()/rq_unpin_lock(). Then, printk will defer to print out the
> buffers to the console.
Nope, sorry, not happening.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH next] sched: make printk safe when rq lock is held
2024-08-06 7:41 [PATCH next] sched: make printk safe when rq lock is held Menglong Dong
2024-08-06 10:39 ` Peter Zijlstra
@ 2024-08-07 17:11 ` kernel test robot
2024-08-07 17:31 ` kernel test robot
2 siblings, 0 replies; 4+ messages in thread
From: kernel test robot @ 2024-08-07 17:11 UTC (permalink / raw)
To: Menglong Dong, peterz
Cc: llvm, oe-kbuild-all, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, Menglong Dong, Bin Lai
Hi Menglong,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20240806]
url: https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/sched-make-printk-safe-when-rq-lock-is-held/20240806-155153
base: next-20240806
patch link: https://lore.kernel.org/r/20240806074131.36007-1-dongml2%40chinatelecom.cn
patch subject: [PATCH next] sched: make printk safe when rq lock is held
config: um-allnoconfig (https://download.01.org/0day-ci/archive/20240808/202408080033.hGD8Z1m1-lkp@intel.com/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240808/202408080033.hGD8Z1m1-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408080033.hGD8Z1m1-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:548:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
548 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:561:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
561 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:574:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
574 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from arch/um/include/asm/hardirq.h:5:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/um/include/asm/io.h:24:
include/asm-generic/io.h:585:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
585 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:595:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
595 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:605:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
605 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:693:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
693 | readsb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:701:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
701 | readsw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:709:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
709 | readsl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:718:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
718 | writesb(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:727:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
727 | writesw(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:736:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
736 | writesl(PCI_IOBASE + addr, buffer, count);
| ~~~~~~~~~~ ^
In file included from kernel/sched/core.c:88:
>> kernel/sched/sched.h:1705:2: error: call to undeclared function '__printk_safe_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1705 | __printk_safe_enter();
| ^
kernel/sched/sched.h:1705:2: note: did you mean '__printk_deferred_enter'?
include/linux/printk.h:166:13: note: '__printk_deferred_enter' declared here
166 | extern void __printk_deferred_enter(void);
| ^
In file included from kernel/sched/core.c:88:
>> kernel/sched/sched.h:1722:2: error: call to undeclared function '__printk_safe_exit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1722 | __printk_safe_exit();
| ^
kernel/sched/sched.h:1722:2: note: did you mean '__printk_ratelimit'?
include/linux/printk.h:182:12: note: '__printk_ratelimit' declared here
182 | extern int __printk_ratelimit(const char *func);
| ^
12 warnings and 2 errors generated.
vim +/__printk_safe_enter +1705 kernel/sched/sched.h
1690
1691 /*
1692 * Lockdep annotation that avoids accidental unlocks; it's like a
1693 * sticky/continuous lockdep_assert_held().
1694 *
1695 * This avoids code that has access to 'struct rq *rq' (basically everything in
1696 * the scheduler) from accidentally unlocking the rq if they do not also have a
1697 * copy of the (on-stack) 'struct rq_flags rf'.
1698 *
1699 * Also see Documentation/locking/lockdep-design.rst.
1700 */
1701 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
1702 {
1703 rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
1704
> 1705 __printk_safe_enter();
1706 #ifdef CONFIG_SCHED_DEBUG
1707 rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
1708 rf->clock_update_flags = 0;
1709 # ifdef CONFIG_SMP
1710 SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != &balance_push_callback);
1711 # endif
1712 #endif
1713 }
1714
1715 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
1716 {
1717 #ifdef CONFIG_SCHED_DEBUG
1718 if (rq->clock_update_flags > RQCF_ACT_SKIP)
1719 rf->clock_update_flags = RQCF_UPDATED;
1720 #endif
1721
> 1722 __printk_safe_exit();
1723 lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
1724 }
1725
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: [PATCH next] sched: make printk safe when rq lock is held
2024-08-06 7:41 [PATCH next] sched: make printk safe when rq lock is held Menglong Dong
2024-08-06 10:39 ` Peter Zijlstra
2024-08-07 17:11 ` kernel test robot
@ 2024-08-07 17:31 ` kernel test robot
2 siblings, 0 replies; 4+ messages in thread
From: kernel test robot @ 2024-08-07 17:31 UTC (permalink / raw)
To: Menglong Dong, peterz
Cc: oe-kbuild-all, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, Menglong Dong, Bin Lai
Hi Menglong,
kernel test robot noticed the following build errors:
[auto build test ERROR on next-20240806]
url: https://github.com/intel-lab-lkp/linux/commits/Menglong-Dong/sched-make-printk-safe-when-rq-lock-is-held/20240806-155153
base: next-20240806
patch link: https://lore.kernel.org/r/20240806074131.36007-1-dongml2%40chinatelecom.cn
patch subject: [PATCH next] sched: make printk safe when rq lock is held
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240808/202408080114.hpttUZjC-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 14.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240808/202408080114.hpttUZjC-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408080114.hpttUZjC-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/core.c:88:
kernel/sched/sched.h: In function 'rq_pin_lock':
>> kernel/sched/sched.h:1705:9: error: implicit declaration of function '__printk_safe_enter'; did you mean '__printk_deferred_enter'? [-Wimplicit-function-declaration]
1705 | __printk_safe_enter();
| ^~~~~~~~~~~~~~~~~~~
| __printk_deferred_enter
kernel/sched/sched.h: In function 'rq_unpin_lock':
>> kernel/sched/sched.h:1722:9: error: implicit declaration of function '__printk_safe_exit'; did you mean '__printk_ratelimit'? [-Wimplicit-function-declaration]
1722 | __printk_safe_exit();
| ^~~~~~~~~~~~~~~~~~
| __printk_ratelimit
vim +1705 kernel/sched/sched.h
1690
1691 /*
1692 * Lockdep annotation that avoids accidental unlocks; it's like a
1693 * sticky/continuous lockdep_assert_held().
1694 *
1695 * This avoids code that has access to 'struct rq *rq' (basically everything in
1696 * the scheduler) from accidentally unlocking the rq if they do not also have a
1697 * copy of the (on-stack) 'struct rq_flags rf'.
1698 *
1699 * Also see Documentation/locking/lockdep-design.rst.
1700 */
1701 static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
1702 {
1703 rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
1704
> 1705 __printk_safe_enter();
1706 #ifdef CONFIG_SCHED_DEBUG
1707 rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
1708 rf->clock_update_flags = 0;
1709 # ifdef CONFIG_SMP
1710 SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != &balance_push_callback);
1711 # endif
1712 #endif
1713 }
1714
1715 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
1716 {
1717 #ifdef CONFIG_SCHED_DEBUG
1718 if (rq->clock_update_flags > RQCF_ACT_SKIP)
1719 rf->clock_update_flags = RQCF_UPDATED;
1720 #endif
1721
> 1722 __printk_safe_exit();
1723 lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
1724 }
1725
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-08-07 17:32 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-06 7:41 [PATCH next] sched: make printk safe when rq lock is held Menglong Dong
2024-08-06 10:39 ` Peter Zijlstra
2024-08-07 17:11 ` kernel test robot
2024-08-07 17:31 ` kernel test robot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox