public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
@ 2024-07-11 12:59 Valentin Schneider
From: Valentin Schneider @ 2024-07-11 12:59 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Phil Auld, Clark Williams, Tomas Glozar, Paul E. McKenney,
	Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
	Josh Triplett, Boqun Feng, Mathieu Desnoyers, Lai Jiangshan,
	Zqiang, Alexander Gordeev, Catalin Marinas, Arnd Bergmann,
	Guo Ren, Palmer Dabbelt, Andrew Morton, Oleg Nesterov, Jens Axboe

Hi folks,

The patches are based on top of tip/sched/core at:
  db43a609d01e ("sched: Update MAINTAINERS and CREDITS")
They are also available at:
  https://gitlab.com/vschneid/linux.git -b mainline/sched/fair/throttle-on-exit-all-v2

Problem statement
=================

CFS tasks can end up throttled while holding locks that other, non-throttled
tasks are blocking on.

For !PREEMPT_RT, this is a source of latency: tasks blocking on the lock
cannot acquire it until the throttled holder has its runtime replenished and
gets to run again.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter the
  slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer that would replenish p0's runtime is
enqueued on the same CPU where ktimers/ksoftirqd is blocked on
read_lock(&lock), the replenishment can never happen: this is a circular
dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.

Proposed approach
=================

Peter mentioned [1] that there have been discussions on changing /when/ the
throttling happens: rather than doing it immediately upon updating the runtime
statistics and realizing the cfs_rq has depleted its quota, wait until the
task is about to return to userspace. If it's in userspace, it can't be
holding any in-kernel lock.

Ben Segall proposed a different approach [2] which I worked on at [3], but I
eventually lost faith in all the .*nr_running hackery that it required.
Furthermore, talking about this at OSPM24, I felt like the first approach had
more appeal, so guess who's back (back again).

See the last patch for implementation considerations.

I've tried splitting this into as many bite-sized patches as possible, but
the final patch, which switches from cfs_rq-wide to per-task throttling,
resists further splitting; apologies for the big diff.

Testing
=======

Tested on QEMU via the below with the usual DEBUG options enabled.

  mount -t cgroup -o cpu none /root/cpu

  mkdir /root/cpu/cg0
  echo 10000 > /root/cpu/cg0/cpu.cfs_period_us
  echo 1000 > /root/cpu/cg0/cpu.cfs_quota_us

  mkdir /root/cpu/cg0/cg00
  mkdir /root/cpu/cg0/cg01

  mkdir /root/cpu/cg0/cg00/cg000
  mkdir /root/cpu/cg0/cg00/cg001

  spawn() {
      while true; do cat /sys/devices/system/cpu/smt/active &>/dev/null; done &
      PID=$!
      echo "Starting PID${PID}"
      echo $PID > $1
  }

  spawn cpu/cg0/tasks
  spawn cpu/cg0/tasks
  spawn cpu/cg0/tasks
  spawn cpu/cg0/tasks

  spawn cpu/cg0/cg01/tasks

  spawn cpu/cg0/cg00/cg000/tasks
  spawn cpu/cg0/cg00/cg001/tasks

  sleep 120

  kill $(jobs -p)  

I've also looked at the PELT side of things using rt-app + LISA. The PELT behaviour
itself looks mostly the same, but the execution patterns change, again see last
patch for details.
  
Links
=====
  
[1]: https://lore.kernel.org/lkml/20231031160120.GE15024@noisy.programming.kicks-ass.net/
[2]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com
[3]: https://lore.kernel.org/lkml/20240202080920.3337862-1-vschneid@redhat.com/

Valentin Schneider (10):
  rcuwait: Split type definition to its own header
  irq_work: Split type definition to its own header
  task_work, sched: Add a _locked variant to task_work_cancel()
  sched/fair: Introduce sched_throttle_work
  sched/fair: Introduce an irq_work for cancelling throttle task_work
  sched/fair: Prepare switched_from & switched_to for per-task
    throttling
  sched/fair: Prepare task_change_group_fair() for per-task throttling
  sched/fair: Prepare migrate_task_rq_fair() for per-task throttling
  sched/fair: Add a class->task_woken callback in preparation for
    per-task throttling
  sched/fair: Throttle CFS tasks on return to userspace

 include/linux/irq_work.h       |   8 +-
 include/linux/irq_work_types.h |  14 +
 include/linux/rcuwait.h        |   9 +-
 include/linux/rcuwait_types.h  |  16 ++
 include/linux/sched.h          |   6 +
 kernel/sched/core.c            |   4 +
 kernel/sched/fair.c            | 497 +++++++++++++++++++++++++--------
 kernel/sched/sched.h           |   6 +
 kernel/sched/task_work_sched.h |  14 +
 kernel/task_work.c             |  67 ++++-
 10 files changed, 499 insertions(+), 142 deletions(-)
 create mode 100644 include/linux/irq_work_types.h
 create mode 100644 include/linux/rcuwait_types.h
 create mode 100644 kernel/sched/task_work_sched.h

--
2.43.0

