linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH] watchdog/core: Fix AA deadlock due to watchdog holding cpu_hotplug_lock and wait for wq
@ 2024-06-06 15:38 Luo Gengkun
  2024-06-11  7:56 ` kernel test robot
  0 siblings, 1 reply; 2+ messages in thread
From: Luo Gengkun @ 2024-06-06 15:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: pmladek, mhocko, lecopzer.chen, yaoma, linuxppc-dev, dianders,
	song, bpf, npiggin, trix, naveen.n.rao, kernelfans, akpm,
	luogengkun, tglx

We found an AA deadlock, as shown below:

TaskA				TaskB				WatchDog			system_wq

...
css_killed_work_fn:
P(cgroup_mutex)
...
								...
								__lockup_detector_reconfigure:
								P(cpu_hotplug_lock.read)
								...
				...
				cpu_up:
				percpu_down_write:
				P(cpu_hotplug_lock.write)
												...
												cgroup_bpf_release:
												P(cgroup_mutex)
								smp_call_on_cpu:
								Wait system_wq

cpuset_css_offline:
P(cpu_hotplug_lock.read)

WatchDog is waiting for system_wq to finish its work, system_wq is waiting
for cgroup_mutex, and the owner of cgroup_mutex (TaskA) is waiting for
cpu_hotplug_lock, which WatchDog holds for read while TaskB's pending write
blocks new readers. The key point is cpu_hotplug_lock, because system_wq may
be waiting on other locks. It is unhealthy to hold a lock while waiting for
system_wq, because we never know what work system_wq is running. Fix this by
replacing cpus_read_lock()/cpus_read_unlock() with
cpu_hotplug_disable()/cpu_hotplug_enable() to prevent CPU offline/online.

Fixes: e31d6883f21c ("watchdog/core, powerpc: Lock cpus across reconfiguration")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
---
 kernel/watchdog.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 51915b44ac73..6ac6fb8d3be0 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -867,7 +867,7 @@ int lockup_detector_offline_cpu(unsigned int cpu)
 
 static void __lockup_detector_reconfigure(void)
 {
-	cpus_read_lock();
+	cpu_hotplug_disable();
 	watchdog_hardlockup_stop();
 
 	softlockup_stop_all();
@@ -877,7 +877,7 @@ static void __lockup_detector_reconfigure(void)
 		softlockup_start_all();
 
 	watchdog_hardlockup_start();
-	cpus_read_unlock();
+	cpu_hotplug_enable();
 	/*
 	 * Must be called outside the cpus locked section to prevent
 	 * recursive locking in the perf code.
@@ -916,11 +916,11 @@ static __init void lockup_detector_setup(void)
 #else /* CONFIG_SOFTLOCKUP_DETECTOR */
 static void __lockup_detector_reconfigure(void)
 {
-	cpus_read_lock();
+	cpu_hotplug_disable();
 	watchdog_hardlockup_stop();
 	lockup_detector_update_enable();
 	watchdog_hardlockup_start();
-	cpus_read_unlock();
+	cpu_hotplug_enable();
 }
 void lockup_detector_reconfigure(void)
 {
-- 
2.34.1
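
The wait chain in the commit message above can be checked mechanically. The
following is a minimal user-space sketch in C, not kernel code: it models each
actor from the diagram as waiting on at most one other actor and follows the
chain to see whether it loops back. The enum labels, the waits_for tables, and
in_wait_cycle() are hypothetical names introduced for illustration, and the
after_patch table assumes that the patched watchdog path no longer holds
cpu_hotplug_lock while it waits for its work item.

```c
#include <assert.h>

/* Hypothetical labels for the four actors in the commit message diagram. */
enum actor { TASK_A, TASK_B, WATCHDOG, SYSTEM_WQ, NR_ACTORS };

#define NOT_WAITING (-1)

/*
 * waits_for[x] is the actor that x is blocked on, or NOT_WAITING.
 * Returns 1 if following the chain from 'start' leads back to 'start',
 * 0 otherwise. The walk is bounded by n hops, so it also terminates if
 * the chain runs into a cycle that does not contain 'start'.
 */
int in_wait_cycle(const int waits_for[], int n, int start)
{
	int cur, steps;

	assert(n > 0 && start >= 0 && start < n);

	cur = waits_for[start];
	for (steps = 0; steps < n && cur != NOT_WAITING; steps++) {
		if (cur == start)
			return 1;
		cur = waits_for[cur];
	}
	return 0;
}

/* Wait-for edges as described in the commit message. */
const int before_patch[NR_ACTORS] = {
	[TASK_A]    = TASK_B,    /* read lock blocked behind the pending writer */
	[TASK_B]    = WATCHDOG,  /* percpu_down_write() waits for the reader */
	[WATCHDOG]  = SYSTEM_WQ, /* smp_call_on_cpu() waits for its work item */
	[SYSTEM_WQ] = TASK_A,    /* cgroup_bpf_release() needs cgroup_mutex */
};

/*
 * Assumed state with the patch applied: WatchDog holds no cpu_hotplug_lock,
 * so neither TaskA nor TaskB ends up waiting on it.
 */
const int after_patch[NR_ACTORS] = {
	[TASK_A]    = NOT_WAITING,
	[TASK_B]    = NOT_WAITING,
	[WATCHDOG]  = SYSTEM_WQ,
	[SYSTEM_WQ] = TASK_A,
};
```

With these tables, in_wait_cycle(before_patch, NR_ACTORS, WATCHDOG) returns 1
and the same call on after_patch returns 0. The model says nothing about new
lock dependencies that cpu_hotplug_disable() itself may introduce.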



* Re: [PATCH] watchdog/core: Fix AA deadlock due to watchdog holding cpu_hotplug_lock and wait for wq
  2024-06-06 15:38 [PATCH] watchdog/core: Fix AA deadlock due to watchdog holding cpu_hotplug_lock and wait for wq Luo Gengkun
@ 2024-06-11  7:56 ` kernel test robot
  0 siblings, 0 replies; 2+ messages in thread
From: kernel test robot @ 2024-06-11  7:56 UTC (permalink / raw)
  To: Luo Gengkun
  Cc: luogengkun, pmladek, mhocko, lkp, song, trix, linux-kernel,
	npiggin, dianders, yaoma, tglx, kernelfans, oe-lkp, oliver.sang,
	bpf, linuxppc-dev, akpm, naveen.n.rao, lecopzer.chen



Hello,

kernel test robot noticed "WARNING:possible_circular_locking_dependency_detected" on:

commit: d362c5c67bb96ccdc4dd34a781d23348d927392d ("[PATCH] watchdog/core: Fix AA deadlock due to watchdog holding cpu_hotplug_lock and wait for wq")
url: https://github.com/intel-lab-lkp/linux/commits/Luo-Gengkun/watchdog-core-Fix-AA-deadlock-due-to-watchdog-holding-cpu_hotplug_lock-and-wait-for-wq/20240606-233305
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20240606153828.3261006-1-luogengkun@huaweicloud.com/
patch subject: [PATCH] watchdog/core: Fix AA deadlock due to watchdog holding cpu_hotplug_lock and wait for wq

in testcase: rcutorture
version: 
with following parameters:

	runtime: 300s
	test: cpuhotplug
	torture_type: busted



compiler: clang-18
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202406111537.dd9d27e9-lkp@intel.com


[   87.506482][    T9] WARNING: possible circular locking dependency detected
[   87.506854][    T9] 6.10.0-rc1-00236-gd362c5c67bb9 #1 Not tainted
[   87.507186][    T9] ------------------------------------------------------
[   87.507554][    T9] kworker/0:1/9 is trying to acquire lock:
[ 87.507861][ T9] ffffffff84305f90 (watchdog_mutex){+.+.}-{3:3}, at: lockup_detector_cleanup (kernel/watchdog.c:937) 
[   87.509166][    T9]
[   87.509166][    T9] but task is already holding lock:
[ 87.509550][ T9] ffffc9000009fd58 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3207) 
[   87.510129][    T9]
[   87.510129][    T9] which lock already depends on the new lock.
[   87.510129][    T9]
[   87.510660][    T9]
[   87.510660][    T9] the existing dependency chain (in reverse order) is:
[   87.511125][    T9]
[   87.511125][    T9] -> #2 ((work_completion)(&wfc.work)){+.+.}-{0:0}:
[ 87.511584][ T9] __flush_work (kernel/workqueue.c:3894) 
[ 87.511849][ T9] work_on_cpu_key (kernel/workqueue.c:683 kernel/workqueue.c:6693) 
[ 87.512120][ T9] cpu_down (kernel/cpu.c:1487) 
[ 87.512358][ T9] device_offline (drivers/base/core.c:?) 
[ 87.512631][ T9] remove_cpu (kernel/cpu.c:1522) 
[ 87.512876][ T9] torture_offline (??:?) torture
[ 87.513217][ T9] torture_onoff (??:?) torture
[ 87.513535][ T9] kthread (kernel/kthread.c:391) 
[ 87.513777][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 87.514035][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[   87.514311][    T9]
[   87.514311][    T9] -> #1 (cpu_add_remove_lock){+.+.}-{3:3}:
[ 87.514727][ T9] __mutex_lock (kernel/locking/mutex.c:608) 
[ 87.514986][ T9] cpu_hotplug_disable (kernel/cpu.c:555) 
[ 87.515271][ T9] __lockup_detector_reconfigure (kernel/watchdog.c:871) 
[ 87.515599][ T9] lockup_detector_setup (kernel/watchdog.c:912) 
[ 87.515914][ T9] kernel_init_freeable (init/main.c:1570) 
[ 87.516213][ T9] kernel_init (init/main.c:1469) 
[ 87.516467][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 87.516727][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[   87.517002][    T9]
[   87.517002][    T9] -> #0 (watchdog_mutex){+.+.}-{3:3}:
[ 87.517415][ T9] __lock_acquire (kernel/locking/lockdep.c:3135) 
[ 87.517695][ T9] lock_acquire (kernel/locking/lockdep.c:5754) 
[ 87.517957][ T9] __mutex_lock (kernel/locking/mutex.c:608) 
[ 87.518215][ T9] lockup_detector_cleanup (kernel/watchdog.c:937) 
[ 87.518518][ T9] _cpu_down (kernel/cpu.c:1450) 
[ 87.518768][ T9] __cpu_down_maps_locked (kernel/cpu.c:1463) 
[ 87.519065][ T9] work_for_cpu_fn (kernel/workqueue.c:6670) 
[ 87.519333][ T9] process_scheduled_works (kernel/workqueue.c:?) 
[ 87.519648][ T9] worker_thread (include/linux/list.h:373 kernel/workqueue.c:946 kernel/workqueue.c:3394) 
[ 87.519915][ T9] kthread (kernel/kthread.c:391) 
[ 87.520157][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 87.520415][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[   87.520690][    T9]
[   87.520690][    T9] other info that might help us debug this:
[   87.520690][    T9]
[   87.521221][    T9] Chain exists of:
[   87.521221][    T9]   watchdog_mutex --> cpu_add_remove_lock --> (work_completion)(&wfc.work)
[   87.521221][    T9]
[   87.521963][    T9]  Possible unsafe locking scenario:
[   87.521963][    T9]
[   87.522347][    T9]        CPU0                    CPU1
[   87.522624][    T9]        ----                    ----
[   87.522902][    T9]   lock((work_completion)(&wfc.work));
[   87.523191][    T9]                                lock(cpu_add_remove_lock);
[   87.523569][    T9]                                lock((work_completion)(&wfc.work));
[   87.523984][    T9]   lock(watchdog_mutex);
[   87.524212][    T9]
[   87.524212][    T9]  *** DEADLOCK ***
[   87.524212][    T9]
[   87.524628][    T9] 2 locks held by kworker/0:1/9:
[ 87.524885][ T9] #0: ffff88810007cd58 ((wq_completion)events){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3206) 
[ 87.525461][ T9] #1: ffffc9000009fd58 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_scheduled_works (kernel/workqueue.c:3207) 
[   87.526065][    T9]
[   87.526065][    T9] stack backtrace:
[   87.526372][    T9] CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.10.0-rc1-00236-gd362c5c67bb9 #1
[   87.526839][    T9] Workqueue: events work_for_cpu_fn
[   87.527114][    T9] Call Trace:
[   87.527292][    T9]  <TASK>
[ 87.527451][ T9] dump_stack_lvl (lib/dump_stack.c:119) 
[ 87.527691][ T9] check_noncircular (kernel/locking/lockdep.c:?) 
[ 87.527955][ T9] __lock_acquire (kernel/locking/lockdep.c:3135) 
[ 87.528218][ T9] ? lock_release (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:352 kernel/locking/lockdep.c:5436 kernel/locking/lockdep.c:5774) 
[ 87.528466][ T9] lock_acquire (kernel/locking/lockdep.c:5754) 
[ 87.528703][ T9] ? lockup_detector_cleanup (kernel/watchdog.c:937) 
[ 87.528991][ T9] ? lockup_detector_cleanup (kernel/watchdog.c:937) 
[ 87.529293][ T9] __mutex_lock (kernel/locking/mutex.c:608) 
[ 87.529530][ T9] ? lockup_detector_cleanup (kernel/watchdog.c:937) 
[ 87.529817][ T9] ? mark_lock (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4656) 
[ 87.530047][ T9] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:?) 
[ 87.530361][ T9] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202) 
[ 87.530635][ T9] lockup_detector_cleanup (kernel/watchdog.c:937) 
[ 87.530911][ T9] _cpu_down (kernel/cpu.c:1450) 
[ 87.531139][ T9] ? process_scheduled_works (kernel/workqueue.c:3207) 
[ 87.531440][ T9] __cpu_down_maps_locked (kernel/cpu.c:1463) 
[ 87.531716][ T9] ? __pfx___cpu_down_maps_locked (kernel/cpu.c:1460) 
[ 87.532039][ T9] work_for_cpu_fn (kernel/workqueue.c:6670) 
[ 87.532285][ T9] process_scheduled_works (kernel/workqueue.c:?) 
[ 87.532594][ T9] worker_thread (include/linux/list.h:373 kernel/workqueue.c:946 kernel/workqueue.c:3394) 
[ 87.532839][ T9] ? lock_release (arch/x86/include/asm/bitops.h:227 arch/x86/include/asm/bitops.h:239 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:352 kernel/locking/lockdep.c:5436 kernel/locking/lockdep.c:5774) 
[ 87.533103][ T9] ? __kthread_parkme (kernel/kthread.c:?) 
[ 87.533365][ T9] ? __kthread_parkme (include/linux/instrumented.h:? include/asm-generic/bitops/instrumented-non-atomic.h:141 kernel/kthread.c:280) 
[ 87.533629][ T9] kthread (kernel/kthread.c:391) 
[ 87.533846][ T9] ? __pfx_worker_thread (kernel/workqueue.c:3339) 
[ 87.534117][ T9] ? __pfx_kthread (kernel/kthread.c:342) 
[ 87.534361][ T9] ret_from_fork (arch/x86/kernel/process.c:153) 
[ 87.534597][ T9] ? __pfx_kthread (kernel/kthread.c:342) 
[ 87.534841][ T9] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
[   87.535111][    T9]  </TASK>
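
The dependency chain in the report above can be replayed on a toy lock-order
graph. This is a simplified user-space sketch in C of the general idea (refuse
a new "held -> wanted" edge if "wanted" already reaches "held"); it is not how
lockdep is implemented. The enum labels, struct lock_graph, and all function
names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical labels for the three lock classes named in the report. */
enum lock_class { WATCHDOG_MUTEX, CPU_ADD_REMOVE_LOCK, WFC_WORK, NR_CLASSES };

/* dep[a][b] != 0 records "b was acquired while a was held". */
struct lock_graph {
	unsigned char dep[NR_CLASSES][NR_CLASSES];
};

/* Depth-first search: is 'to' reachable from 'from' via recorded edges? */
static int reaches(const struct lock_graph *g, int from, int to,
		   unsigned char *seen)
{
	int next;

	if (from == to)
		return 1;
	seen[from] = 1;
	for (next = 0; next < NR_CLASSES; next++)
		if (g->dep[from][next] && !seen[next] &&
		    reaches(g, next, to, seen))
			return 1;
	return 0;
}

/*
 * Record the edge held -> wanted. Returns 0 on success, or -1 (without
 * recording the edge) if it would close a cycle in the graph.
 */
int lock_graph_add(struct lock_graph *g, int held, int wanted)
{
	unsigned char seen[NR_CLASSES] = { 0 };

	assert(held >= 0 && held < NR_CLASSES);
	assert(wanted >= 0 && wanted < NR_CLASSES);

	if (reaches(g, wanted, held, seen))
		return -1;
	g->dep[held][wanted] = 1;
	return 0;
}

/* Replay entries #1, #2 and #0 of the report; returns 1 if #0 is refused. */
int report_chain_closes_cycle(void)
{
	struct lock_graph g;

	memset(&g, 0, sizeof(g));
	/* #1: lockup_detector_setup() -> cpu_hotplug_disable() */
	if (lock_graph_add(&g, WATCHDOG_MUTEX, CPU_ADD_REMOVE_LOCK))
		return 0;
	/* #2: cpu_down() -> work_on_cpu_key() -> __flush_work() */
	if (lock_graph_add(&g, CPU_ADD_REMOVE_LOCK, WFC_WORK))
		return 0;
	/* #0: work_for_cpu_fn() -> _cpu_down() -> lockup_detector_cleanup() */
	return lock_graph_add(&g, WFC_WORK, WATCHDOG_MUTEX) == -1;
}
```

Here report_chain_closes_cycle() returns 1: the first two edges are accepted,
and the third is refused because watchdog_mutex already reaches the work
completion through cpu_add_remove_lock.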



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240611/202406111537.dd9d27e9-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

