public inbox for linux-kernel@vger.kernel.org
* [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling
@ 2023-06-09 11:24 Xiongfeng Wang
  2023-06-09 14:55 ` Thomas Gleixner
  2023-08-30 10:29 ` [tip: smp/urgent] cpu/hotplug: Prevent self deadlock on CPU hot-unplug tip-bot2 for Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Xiongfeng Wang @ 2023-06-09 11:24 UTC (permalink / raw)
  To: Thomas Gleixner, vschneid, Phil Auld, vdonnefort
  Cc: Linux Kernel Mailing List, wangxiongfeng2, Wei Li, liaoyu (E),
	zhangqiao22

Hello,
 While running some low-power tests, I got the following hung task report.

  Call trace:
   __switch_to+0xd4/0x160
   __schedule+0x38c/0x8c4
   __cond_resched+0x24/0x50
   unmap_kernel_range_noflush+0x210/0x240
   kretprobe_trampoline+0x0/0xc8
   __vunmap+0x70/0x31c
   __vfree+0x34/0x8c
   vfree+0x40/0x58
   free_vm_stack_cache+0x44/0x74
   cpuhp_invoke_callback+0xc4/0x71c
   _cpu_down+0x108/0x284
   kretprobe_trampoline+0x0/0xc8
   suspend_enter+0xd8/0x8ec
   suspend_devices_and_enter+0x1f0/0x360
   pm_suspend.part.1+0x428/0x53c
   pm_suspend+0x3c/0xa0
   devdrv_suspend_proc+0x148/0x248 [drv_devmng]
   devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
   devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
   drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
   __arm64_sys_ioctl+0xb4/0xf0
   el0_svc_common.constprop.0+0x140/0x374
   do_el0_svc+0x80/0xa0
   el0_svc+0x1c/0x28
   el0_sync_handler+0x90/0xf0
   el0_sync+0x168/0x180

After some analysis, I found it is caused by the following race condition.

1. A task running on CPU1 is throttled by CFS bandwidth control. CPU1 starts
the cfs_bandwidth hrtimer 'period_timer' and enqueues it on CPU1's rbtree.
2. The task is then migrated to CPU2 and starts to offline CPU1. CPU1 starts
the CPUHP AP steps; meanwhile the hrtimer 'period_timer' expires and is
re-enqueued on CPU1.
3. CPU1 reaches take_cpu_down() and disables interrupts. After CPU1 has
finished the CPUHP AP steps, CPU2 starts the remaining CPUHP steps.
4. When CPU2 reaches free_vm_stack_cache(), the task is scheduled out in
__vunmap() because it has run out of CPU quota. start_cfs_bandwidth() does not
restart the hrtimer because 'cfs_b->period_active' is set.
5. The task waits for the hrtimer 'period_timer' to expire and wake it up, but
CPU1 has disabled interrupts, so the hrtimer cannot expire until it is migrated
to CPU2 in hrtimers_dead_cpu(). The task is blocked, however, and can never
reach the hrtimers_dead_cpu() step, so it hangs.

CPU1                                                CPU2
Task sets cfs_quota
start hrtimer cfs_bandwidth 'period_timer'
                                                    start to offline CPU1
CPU1 starts CPUHP AP steps
...
'period_timer' expires and is re-enqueued on CPU1
...
disable irq in take_cpu_down()
...
                                                    CPU2 starts the remaining CPUHP steps
                                                    ...
                                                    sched out in free_vm_stack_cache()
                                                    wait for 'period_timer' to expire
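
The circular wait in steps 1-5 can be illustrated with a small user-space
model. This is only a sketch of the reasoning above, not kernel code: the
Cpu and CfsBandwidth classes, timer_can_fire() and simulate() are hypothetical
names made up for illustration, and the functions that borrow kernel names
(start_cfs_bandwidth, hrtimers_dead_cpu) imitate only the behavior described
in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    irqs_enabled: bool = True
    timers: set = field(default_factory=set)  # hrtimers queued on this CPU


@dataclass
class CfsBandwidth:
    period_active: bool = False  # stands in for cfs_b->period_active


def start_cfs_bandwidth(cfs_b, cpu):
    """Arm 'period_timer' on cpu unless it is already marked active."""
    if cfs_b.period_active:
        return False  # step 4: the timer is not restarted
    cfs_b.period_active = True
    cpu.timers.add("period_timer")
    return True


def timer_can_fire(cpu):
    """A queued timer only expires on a CPU that still takes interrupts."""
    return cpu.irqs_enabled and "period_timer" in cpu.timers


def hrtimers_dead_cpu(dead, target):
    """Move every timer still queued on the dead CPU to the target CPU."""
    target.timers |= dead.timers
    dead.timers.clear()


def simulate():
    cpu1, cpu2 = Cpu(), Cpu()
    cfs_b = CfsBandwidth()

    start_cfs_bandwidth(cfs_b, cpu1)  # step 1: timer armed on CPU1
    # step 2: the timer expires and is re-enqueued on CPU1, so nothing moves
    cpu1.irqs_enabled = False  # step 3: take_cpu_down()
    restarted = start_cfs_bandwidth(cfs_b, cpu2)  # step 4: throttled on CPU2

    # step 5: the throttled control task runs again only if the timer fires,
    # and the timer is migrated only if the control task runs again.
    can_fire = timer_can_fire(cpu1) or timer_can_fire(cpu2)
    if can_fire:
        hrtimers_dead_cpu(cpu1, cpu2)
    return {"restarted": restarted, "can_fire": can_fire,
            "deadlocked": not can_fire}
```

Calling simulate() in this model reports that the timer was not restarted on
CPU2 and that nothing can fire it, which is the hang described above. Calling
hrtimers_dead_cpu(cpu1, cpu2) by hand would make the timer eligible to fire on
CPU2, but in the reported scenario only the blocked task could get that far.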


I would appreciate any suggestions on how to fix this problem.

Thanks,
Xiongfeng




end of thread, other threads:[~2023-08-30 19:22 UTC | newest]

Thread overview: 19+ messages
2023-06-09 11:24 [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling Xiongfeng Wang
2023-06-09 14:55 ` Thomas Gleixner
2023-06-12 12:49   ` Xiongfeng Wang
2023-06-26  8:23     ` Xiongfeng Wang
2023-06-27 16:46       ` Vincent Guittot
2023-06-28 12:03         ` Thomas Gleixner
2023-06-28 12:35           ` Vincent Guittot
2023-06-28 22:01             ` Thomas Gleixner
2023-06-29  1:41               ` Xiongfeng Wang
2023-06-29  8:30               ` Vincent Guittot
2023-08-22  8:58                 ` Xiongfeng Wang
2023-08-23 10:14                 ` Thomas Gleixner
2023-08-24  7:25                   ` Yu Liao
2023-08-29  7:18                   ` Vincent Guittot
2023-06-28 13:30         ` Vincent Guittot
2023-06-28 21:09           ` Thomas Gleixner
2023-06-29  1:26         ` Xiongfeng Wang
2023-06-29  8:33           ` Vincent Guittot
2023-08-30 10:29 ` [tip: smp/urgent] cpu/hotplug: Prevent self deadlock on CPU hot-unplug tip-bot2 for Thomas Gleixner
