* [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling
@ 2023-06-09 11:24 Xiongfeng Wang
  2023-06-09 14:55 ` Thomas Gleixner
  2023-08-30 10:29 ` [tip: smp/urgent] cpu/hotplug: Prevent self deadlock on CPU hot-unplug tip-bot2 for Thomas Gleixner
  0 siblings, 2 replies; 19+ messages in thread
From: Xiongfeng Wang @ 2023-06-09 11:24 UTC (permalink / raw)
  To: Thomas Gleixner, vschneid, Phil Auld, vdonnefort
  Cc: Linux Kernel Mailing List, wangxiongfeng2, Wei Li, liaoyu (E),
	zhangqiao22

Hello,
While running some low-power tests, I hit the following hung task:

  Call trace:
   __switch_to+0xd4/0x160
   __schedule+0x38c/0x8c4
   __cond_resched+0x24/0x50
   unmap_kernel_range_noflush+0x210/0x240
   kretprobe_trampoline+0x0/0xc8
   __vunmap+0x70/0x31c
   __vfree+0x34/0x8c
   vfree+0x40/0x58
   free_vm_stack_cache+0x44/0x74
   cpuhp_invoke_callback+0xc4/0x71c
   _cpu_down+0x108/0x284
   kretprobe_trampoline+0x0/0xc8
   suspend_enter+0xd8/0x8ec
   suspend_devices_and_enter+0x1f0/0x360
   pm_suspend.part.1+0x428/0x53c
   pm_suspend+0x3c/0xa0
   devdrv_suspend_proc+0x148/0x248 [drv_devmng]
   devdrv_manager_set_power_state+0x140/0x680 [drv_devmng]
   devdrv_manager_ioctl+0xcc/0x210 [drv_devmng]
   drv_ascend_intf_ioctl+0x84/0x248 [drv_davinci_intf]
   __arm64_sys_ioctl+0xb4/0xf0
   el0_svc_common.constprop.0+0x140/0x374
   do_el0_svc+0x80/0xa0
   el0_svc+0x1c/0x28
   el0_sync_handler+0x90/0xf0
   el0_sync+0x168/0x180

After some analysis, I found that it is caused by the following race condition:

1. A task running on CPU1 is throttled by CFS bandwidth control. CPU1 starts the
cfs_bandwidth hrtimer 'period_timer' and enqueues it on CPU1's hrtimer rbtree.
2. The task is then migrated to CPU2 and starts offlining CPU1. CPU1 begins the
CPUHP AP steps; meanwhile 'period_timer' expires and is re-enqueued on CPU1.
3. CPU1 reaches take_cpu_down() and disables interrupts. Once CPU1 has finished
the CPUHP AP steps, CPU2 runs the remaining CPUHP steps.
4. When CPU2 reaches free_vm_stack_cache(), the task is scheduled out in
__vunmap() because it has run out of CPU quota. start_cfs_bandwidth() does not
restart the hrtimer because 'cfs_b->period_active' is already set.
5. The task waits for 'period_timer' to expire so that it can be woken up again,
but CPU1 has interrupts disabled, and the hrtimer will not expire until it is
migrated to CPU2 in hrtimers_dead_cpu(). The task is blocked, however, and
cannot proceed to the hrtimers_dead_cpu() step, so it hangs.
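Steps 1-5 rest on two facts: start_cfs_bandwidth() leaves an already-active period timer where it is, and an hrtimer queued on a CPU that has disabled interrupts cannot fire. Below is a minimal userspace model of just those two facts, written as a sketch under stated assumptions. All names (model_cpu, model_cfs_bandwidth, model_start_cfs_bandwidth(), model_timer_can_fire(), model_replay_steps()) are hypothetical and only mirror the fields mentioned above; this is not kernel code and ignores locking, the slack timer and the actual timer migration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 3

/* Hypothetical per-CPU state: only whether the CPU still takes interrupts. */
struct model_cpu {
	bool irqs_enabled;          /* becomes false in take_cpu_down() */
};

/* Hypothetical stand-in for the two cfs_bandwidth details that matter here. */
struct model_cfs_bandwidth {
	bool period_active;         /* mirrors cfs_b->period_active */
	int timer_cpu;              /* CPU whose hrtimer queue holds 'period_timer' */
};

/* Step 4: an already-active period timer is not re-armed, so it stays
 * queued on whichever CPU armed it. Returns true only if it armed the timer. */
static bool model_start_cfs_bandwidth(struct model_cfs_bandwidth *b, int this_cpu)
{
	if (b->period_active)
		return false;
	b->period_active = true;
	b->timer_cpu = this_cpu;
	return true;
}

/* Step 5: the timer can only expire if its CPU still handles interrupts. */
static bool model_timer_can_fire(const struct model_cfs_bandwidth *b,
				 const struct model_cpu cpus[])
{
	return cpus[b->timer_cpu].irqs_enabled;
}

/* Replays steps 1-5; returns true if the throttled task can never be
 * unthrottled, i.e. the reported hang. */
static bool model_replay_steps(void)
{
	struct model_cpu cpus[MODEL_NR_CPUS] = { { true }, { true }, { true } };
	struct model_cfs_bandwidth b = { false, -1 };

	/* Step 1: throttled on CPU1, period timer armed there. */
	bool armed = model_start_cfs_bandwidth(&b, 1);
	assert(armed && b.timer_cpu == 1);

	/* Steps 2-3: the timer expires and is re-enqueued on CPU1 (period_active
	 * stays set); the task moves to CPU2; CPU1 disables irqs in take_cpu_down(). */
	cpus[1].irqs_enabled = false;

	/* Step 4: throttled again on CPU2; the active timer is not re-armed. */
	armed = model_start_cfs_bandwidth(&b, 2);
	assert(!armed && b.timer_cpu == 1);

	/* Step 5: nothing can unthrottle the task until the timer is migrated. */
	return !model_timer_can_fire(&b, cpus);
}
```

In this model, model_replay_steps() returns true: the only timer that could unthrottle the task stays queued on CPU1, which no longer takes interrupts.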

CPU1                                            CPU2
Task sets cfs_quota
starts hrtimer cfs_bandwidth 'period_timer'
                                                starts to offline CPU1
CPU1 starts CPUHP AP steps
...
'period_timer' expires and is re-enqueued on CPU1
...
disables irqs in take_cpu_down()
...
                                                CPU2 runs the remaining CPUHP steps
                                                ...
                                                task scheduled out in free_vm_stack_cache()
                                                waits for 'period_timer' to expire
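The timeline shows why this is a self-deadlock: the control task on CPU2 blocks in a teardown step that runs before hrtimers_dead_cpu(), and only that same blocked task could go on to run hrtimers_dead_cpu(). Below is a hypothetical C model of that ordering problem. struct model_step and model_find_self_deadlock() are invented names, not kernel APIs, and the model assumes, as in the report, that throttling hits inside free_vm_stack_cache() and that only hrtimers_dead_cpu() moves CPU1's timers to a live CPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one BP-side teardown step run by the control
 * task on CPU2. Only the two steps named in the report are modelled. */
struct model_step {
	const char *name;
	bool may_block_on_period_timer;  /* assumption: task is throttled inside it */
	bool migrates_dead_cpu_timers;   /* moves CPU1's hrtimers to a live CPU */
};

/* Returns the index of the first step at which the control task would wait
 * forever, or -1 if the sequence completes. A blocking step can only be
 * unblocked if an earlier step has already migrated the dead CPU's timers,
 * because any later step would have to be run by the blocked task itself. */
static int model_find_self_deadlock(const struct model_step steps[], int n)
{
	bool timers_migrated = false;

	assert(n >= 0);
	for (int i = 0; i < n; i++) {
		if (steps[i].may_block_on_period_timer && !timers_migrated)
			return i;
		if (steps[i].migrates_dead_cpu_timers)
			timers_migrated = true;
	}
	return -1;
}
```

With the two steps in the order from the report (free_vm_stack_cache first, hrtimers_dead_cpu second), model_find_self_deadlock() returns 0: the very first modelled step never completes.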


I would appreciate any suggestions on how to fix this problem!

Thanks,
Xiongfeng





Thread overview: 19+ messages
2023-06-09 11:24 [Question] report a race condition between CPU hotplug state machine and hrtimer 'sched_cfs_period_timer' for cfs bandwidth throttling Xiongfeng Wang
2023-06-09 14:55 ` Thomas Gleixner
2023-06-12 12:49   ` Xiongfeng Wang
2023-06-26  8:23     ` Xiongfeng Wang
2023-06-27 16:46       ` Vincent Guittot
2023-06-28 12:03         ` Thomas Gleixner
2023-06-28 12:35           ` Vincent Guittot
2023-06-28 22:01             ` Thomas Gleixner
2023-06-29  1:41               ` Xiongfeng Wang
2023-06-29  8:30               ` Vincent Guittot
2023-08-22  8:58                 ` Xiongfeng Wang
2023-08-23 10:14                 ` Thomas Gleixner
2023-08-24  7:25                   ` Yu Liao
2023-08-29  7:18                   ` Vincent Guittot
2023-06-28 13:30         ` Vincent Guittot
2023-06-28 21:09           ` Thomas Gleixner
2023-06-29  1:26         ` Xiongfeng Wang
2023-06-29  8:33           ` Vincent Guittot
2023-08-30 10:29 ` [tip: smp/urgent] cpu/hotplug: Prevent self deadlock on CPU hot-unplug tip-bot2 for Thomas Gleixner
