* [RFC PATCH] cgroup: Track time in cgroup v2 freezer
@ 2025-06-03 22:43 Tiffany Yang
  2025-06-03 23:03 ` Tejun Heo
  2025-06-17  9:49 ` Michal Koutný
  0 siblings, 2 replies; 11+ messages in thread
From: Tiffany Yang @ 2025-06-03 22:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, kernel-team, John Stultz, Thomas Gleixner, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, Tejun Heo,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

The cgroup v2 freezer controller allows user processes to be dynamically
added to and removed from an interruptible frozen state from
userspace. This feature is helpful for application management, as it
allows background tasks to be frozen to prevent them from being
scheduled or otherwise contending with foreground tasks for resources.
Still, applications are usually unaware of their having been placed in
the freezer cgroup, so any watchdog timers they may have set will fire
when they exit. To address this problem, I propose tracking the per-task
frozen time and exposing it to userland via procfs.

Currently, the cgroup css_set_lock is used to serialize accesses to the
new task_struct counters (frozen_time_total and frozen_time_start). If
we start to see higher contention on this lock, we may want to introduce
a separate per-task mutex or seq_lock, but the main focus in this
initial submission is establishing the right UAPI for this accounting
information.

While any comments on this RFC are appreciated, there are several areas
where feedback would be especially welcome:
   1. I know there is some hesitancy toward adding new proc files to
      the system, so I would welcome suggestions as to how this per-task
      accounting might be better exposed to userland.
   2. Unlike the cgroup v1 freezer controller, the cgroup v2 freezer
      does not use the system-wide freezer shared by the power
      management system to freeze tasks. Instead, tasks are placed into
      a cgroup v2 freezer-specific frozen state similar to jobctl
      stop. Consequently, the time being accounted for here is somewhat
      narrow and specific to cgroup v2 functionality, but there may be
      better ways to generalize it.

Since this is a first stab at discussing the potential interface, I've
not yet updated the procfs documentation for this. Once there is
consensus around the interface, I will fill that out.

Thank you for your time!
Tiffany

Signed-off-by: Tiffany Yang <ynaffit@google.com>
---
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Machek <pavel@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Chen Ridong <chenridong@huawei.com>
---
 fs/proc/base.c          |  2 ++
 include/linux/cgroup.h  |  2 ++
 include/linux/sched.h   |  3 +++
 kernel/cgroup/cgroup.c  |  2 ++
 kernel/cgroup/freezer.c | 20 ++++++++++++++++++++
 5 files changed, 29 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c667702dc69b..38a05bb53cd1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3377,6 +3377,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_CGROUPS
 	ONE("cgroup",  S_IRUGO, proc_cgroup_show),
+	ONE("cgroup_v2_freezer_time_frozen",  0444, proc_cgroup_frztime_show),
 #endif
 #ifdef CONFIG_PROC_CPU_RESCTRL
 	ONE("cpu_resctrl_groups", S_IRUGO, proc_resctrl_show),
@@ -3724,6 +3725,7 @@ static const struct pid_entry tid_base_stuff[] = {
 #endif
 #ifdef CONFIG_CGROUPS
 	ONE("cgroup",  S_IRUGO, proc_cgroup_show),
+	ONE("cgroup_v2_freezer_time_frozen",  0444, proc_cgroup_frztime_show),
 #endif
 #ifdef CONFIG_PROC_CPU_RESCTRL
 	ONE("cpu_resctrl_groups", S_IRUGO, proc_resctrl_show),
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b18fb5fcb38e..871831808e22 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -837,6 +837,8 @@ void cgroup_update_frozen(struct cgroup *cgrp);
 void cgroup_freeze(struct cgroup *cgrp, bool freeze);
 void cgroup_freezer_migrate_task(struct task_struct *task, struct cgroup *src,
 				 struct cgroup *dst);
+int proc_cgroup_frztime_show(struct seq_file *m, struct pid_namespace *ns,
+			     struct pid *pid, struct task_struct *tsk);
 
 static inline bool cgroup_task_frozen(struct task_struct *task)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index aa9c5be7a632..55d173fd070c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1321,6 +1321,9 @@ struct task_struct {
 	struct css_set __rcu		*cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock: */
 	struct list_head		cg_list;
+	/* freezer stats protected by the css_set_lock: */
+	u64				frozen_time_total;
+	u64				frozen_time_start;
 #endif
 #ifdef CONFIG_X86_CPU_RESCTRL
 	u32				closid;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a723b7dc6e4e..05e1d2cf3654 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6470,6 +6470,8 @@ void cgroup_fork(struct task_struct *child)
 {
 	RCU_INIT_POINTER(child->cgroups, &init_css_set);
 	INIT_LIST_HEAD(&child->cg_list);
+	child->frozen_time_total = 0;
+	child->frozen_time_start = 0;
 }
 
 /**
diff --git a/kernel/cgroup/freezer.c b/kernel/cgroup/freezer.c
index bf1690a167dd..7dd9e70a47c5 100644
--- a/kernel/cgroup/freezer.c
+++ b/kernel/cgroup/freezer.c
@@ -110,6 +110,7 @@ void cgroup_enter_frozen(void)
 
 	spin_lock_irq(&css_set_lock);
 	current->frozen = true;
+	current->frozen_time_start = ktime_get_ns();
 	cgrp = task_dfl_cgroup(current);
 	cgroup_inc_frozen_cnt(cgrp);
 	cgroup_update_frozen(cgrp);
@@ -132,10 +133,13 @@ void cgroup_leave_frozen(bool always_leave)
 	spin_lock_irq(&css_set_lock);
 	cgrp = task_dfl_cgroup(current);
 	if (always_leave || !test_bit(CGRP_FREEZE, &cgrp->flags)) {
+		u64 end_ns;
 		cgroup_dec_frozen_cnt(cgrp);
 		cgroup_update_frozen(cgrp);
 		WARN_ON_ONCE(!current->frozen);
 		current->frozen = false;
+		end_ns = ktime_get_ns();
+		current->frozen_time_total += (end_ns - current->frozen_time_start);
 	} else if (!(current->jobctl & JOBCTL_TRAP_FREEZE)) {
 		spin_lock(&current->sighand->siglock);
 		current->jobctl |= JOBCTL_TRAP_FREEZE;
@@ -254,6 +258,22 @@ void cgroup_freezer_migrate_task(struct task_struct *task,
 	cgroup_freeze_task(task, test_bit(CGRP_FREEZE, &dst->flags));
 }
 
+int proc_cgroup_frztime_show(struct seq_file *m, struct pid_namespace *ns,
+			     struct pid *pid, struct task_struct *tsk)
+{
+	u64 delta = 0;
+
+	spin_lock_irq(&css_set_lock);
+	if (tsk->frozen)
+		delta = ktime_get_ns() - tsk->frozen_time_start;
+
+	seq_printf(m, "%llu\n",
+		   (unsigned long long)(tsk->frozen_time_total + delta));
+	spin_unlock_irq(&css_set_lock);
+
+	return 0;
+}
+
 void cgroup_freeze(struct cgroup *cgrp, bool freeze)
 {
 	struct cgroup_subsys_state *css;
-- 
2.49.0.1204.g71687c7c1d-goog



* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-03 22:43 [RFC PATCH] cgroup: Track time in cgroup v2 freezer Tiffany Yang
@ 2025-06-03 23:03 ` Tejun Heo
  2025-06-04 19:39   ` Tiffany Yang
  2025-06-17  9:49 ` Michal Koutný
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-06-03 23:03 UTC (permalink / raw)
  To: Tiffany Yang
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

On Tue, Jun 03, 2025 at 10:43:05PM +0000, Tiffany Yang wrote:
> The cgroup v2 freezer controller allows user processes to be dynamically
> added to and removed from an interruptible frozen state from
> userspace. This feature is helpful for application management, as it
> allows background tasks to be frozen to prevent them from being
> scheduled or otherwise contending with foreground tasks for resources.
> Still, applications are usually unaware of their having been placed in
> the freezer cgroup, so any watchdog timers they may have set will fire
> when they exit. To address this problem, I propose tracking the per-task
> frozen time and exposing it to userland via procfs.

Just on a glance, it feels rather odd to be tracking this per task given
that the state is per cgroup. Can you account this per cgroup?

Thanks.

-- 
tejun


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-03 23:03 ` Tejun Heo
@ 2025-06-04 19:39   ` Tiffany Yang
  2025-06-04 22:47     ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Tiffany Yang @ 2025-06-04 19:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Tejun Heo <tj@kernel.org> writes:

> On Tue, Jun 03, 2025 at 10:43:05PM +0000, Tiffany Yang wrote:
>> The cgroup v2 freezer controller allows user processes to be dynamically
>> added to and removed from an interruptible frozen state from
>> userspace. This feature is helpful for application management, as it
>> allows background tasks to be frozen to prevent them from being
>> scheduled or otherwise contending with foreground tasks for resources.
>> Still, applications are usually unaware of their having been placed in
>> the freezer cgroup, so any watchdog timers they may have set will fire
>> when they exit. To address this problem, I propose tracking the per-task
>> frozen time and exposing it to userland via procfs.
>
> Just on a glance, it feels rather odd to be tracking this per task given
> that the state is per cgroup. Can you account this per cgroup?
>
> Thanks.

Hi Tejun!

Thanks for taking a look! In this case, I would argue that the value we
are accounting for (time that a task has not been able to run because it
is in the cgroup v2 frozen state) is task-specific and distinct from the
time that the cgroup it belongs to has been frozen.

A cgroup is not considered frozen until all of its members are frozen,
and if one task then leaves the frozen state, the entire cgroup is
considered no longer frozen, even if its other members stay in the
frozen state. Similarly, even if a task is migrated from one frozen
cgroup (A) to another frozen cgroup (B), the time cgroup B has been
frozen would not be representative of that task even though it is a
member.

There is also latency between when each task in a cgroup is marked as
to-be-frozen/unfrozen and when it actually enters the frozen state, so
each descendant task has a different frozen time. For watchdogs that
elapse on a per-task basis, a per-cgroup time-in-frozen value would
underreport the actual time each task spent unable to run. Tasks that
miss a deadline might incorrectly be considered misbehaving when the
time they spent suspended was not correctly accounted for.

Please let me know if that answers your question or if there's something
I'm missing. I agree that it would be cleaner/preferable to keep this
accounting under a cgroup-specific umbrella, so I hope there is some way
to get around these issues, but it doesn't look like cgroup fs has a
good way to keep task-specific stats at the moment.

-- 
Tiffany Y. Yang


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-04 19:39   ` Tiffany Yang
@ 2025-06-04 22:47     ` Tejun Heo
  2025-06-27  2:19       ` Tiffany Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-06-04 22:47 UTC (permalink / raw)
  To: Tiffany Yang
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Hello, Tiffany.

On Wed, Jun 04, 2025 at 07:39:29PM +0000, Tiffany Yang wrote:
...
> Thanks for taking a look! In this case, I would argue that the value we
> are accounting for (time that a task has not been able to run because it
> is in the cgroup v2 frozen state) is task-specific and distinct from the
> time that the cgroup it belongs to has been frozen.
> 
> A cgroup is not considered frozen until all of its members are frozen,
> and if one task then leaves the frozen state, the entire cgroup is
> considered no longer frozen, even if its other members stay in the
> frozen state. Similarly, even if a task is migrated from one frozen
> cgroup (A) to another frozen cgroup (B), the time cgroup B has been
> frozen would not be representative of that task even though it is a
> member.
> 
> There is also latency between when each task in a cgroup is marked as
> to-be-frozen/unfrozen and when it actually enters the frozen state, so
> each descendant task has a different frozen time. For watchdogs that
> elapse on a per-task basis, a per-cgroup time-in-frozen value would
> underreport the actual time each task spent unable to run. Tasks that
> miss a deadline might incorrectly be considered misbehaving when the
> time they spent suspended was not correctly accounted for.
> 
> Please let me know if that answers your question or if there's something
> I'm missing. I agree that it would be cleaner/preferable to keep this
> accounting under a cgroup-specific umbrella, so I hope there is some way
> to get around these issues, but it doesn't look like cgroup fs has a
> good way to keep task-specific stats at the moment.

I'm not sure freezing/frozen distinction is that meaningful. If each cgroup
tracks total durations for both states, most threads should be able to rely
on freezing duration delta, right? There shouldn't be significant time gap
between freezing starting and most threads being frozen although the cgroup
may not reach full frozen state due to e.g. NFS and what not.

As long as threads are not migrated across cgroups, it should be able to do
something like:

1. Read /proc/self/cgroup to determine the current cgroup.
2. Read and remember freezing duration $CGRP/cgroup.stat.
3. Do time taking operation.
4. Read $CGRP/cgrp.stat and calculate delta and deduct that from time taken.

Would that work?
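
To make those steps concrete, a rough userspace sketch could look like the
following. It assumes a hypothetical "frozen_usec" key in cgroup.stat for
the per-cgroup freezing duration discussed here; no such key exists today,
so the field name and unit are placeholders rather than an existing
interface:

/*
 * Sketch of the four-step procedure above. "frozen_usec" is a
 * hypothetical cgroup.stat key standing in for the proposed per-cgroup
 * freezing duration; treat the name and the microsecond unit as
 * placeholders.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Step 1: resolve the caller's cgroup v2 path from /proc/self/cgroup. */
static int self_cgroup_path(char *buf, size_t len)
{
	char line[256];
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "0::", 3)) {	/* "0::/<path>" is the v2 entry */
			line[strcspn(line, "\n")] = '\0';
			snprintf(buf, len, "/sys/fs/cgroup%s", line + 3);
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;
}

/* Steps 2 and 4: read the (hypothetical) frozen_usec key from cgroup.stat. */
static int read_frozen_usec(const char *cgrp, unsigned long long *usec)
{
	char path[512], key[64];
	unsigned long long val;
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.stat", cgrp);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "frozen_usec")) {
			*usec = val;
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	char cgrp[256];
	unsigned long long before, after;
	struct timespec t0, t1;

	if (self_cgroup_path(cgrp, sizeof(cgrp)) ||
	    read_frozen_usec(cgrp, &before))
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* Step 3: the time-taking operation goes here. */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (read_frozen_usec(cgrp, &after))
		return 1;

	/* Step 4: deduct the time the cgroup spent freezing while we measured. */
	long long elapsed_us = (long long)(t1.tv_sec - t0.tv_sec) * 1000000 +
			       (t1.tv_nsec - t0.tv_nsec) / 1000;
	printf("adjusted elapsed: %lld us\n",
	       elapsed_us - (long long)(after - before));
	return 0;
}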

Thanks.

-- 
tejun


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-03 22:43 [RFC PATCH] cgroup: Track time in cgroup v2 freezer Tiffany Yang
  2025-06-03 23:03 ` Tejun Heo
@ 2025-06-17  9:49 ` Michal Koutný
  2025-06-27  7:47   ` Tiffany Yang
  1 sibling, 1 reply; 11+ messages in thread
From: Michal Koutný @ 2025-06-17  9:49 UTC (permalink / raw)
  To: Tiffany Yang
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Tejun Heo,
	Johannes Weiner, Rafael J. Wysocki, Pavel Machek, Roman Gushchin,
	Chen Ridong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider


Hello.

On Tue, Jun 03, 2025 at 10:43:05PM +0000, Tiffany Yang <ynaffit@google.com> wrote:
> The cgroup v2 freezer controller allows user processes to be dynamically
> added to and removed from an interruptible frozen state from
> userspace.

Beware of freezing by migration vs freezing by cgroup attribute change.
The latter is the primary design of cgroup v2; the former is "only" for
consistency.

> This feature is helpful for application management, as it
> allows background tasks to be frozen to prevent them from being
> scheduled or otherwise contending with foreground tasks for resources.

> Still, applications are usually unaware of their having been placed in
> the freezer cgroup, so any watchdog timers they may have set will fire
> when they exit. To address this problem, I propose tracking the per-task
> frozen time and exposing it to userland via procfs.

But the watchdog fires rightfully when the application does not run,
doesn't it?
It should be the responsibility of the "freezing agent" to prepare or
notify the application about expected latencies.

> but the main focus in this initial submission is establishing the
> right UAPI for this accounting information.

/proc/<pid>/cgroup_v2_freezer_time_frozen looks quite extraordinary with
other similar metrics, my first thought would be a field in
/proc/<pid>/stat (or track it per cgroup as Tejun suggests).

Could you please primarily explain why the application itself should
care about the frozen time (and not other causes of delay)?

Thanks,
Michal



* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-04 22:47     ` Tejun Heo
@ 2025-06-27  2:19       ` Tiffany Yang
  2025-06-27 19:01         ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Tiffany Yang @ 2025-06-27  2:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Tejun Heo <tj@kernel.org> writes:

> Hello, Tiffany.
>
> On Wed, Jun 04, 2025 at 07:39:29PM +0000, Tiffany Yang wrote:
> ...
>> Thanks for taking a look! In this case, I would argue that the value we
>> are accounting for (time that a task has not been able to run because it
>> is in the cgroup v2 frozen state) is task-specific and distinct from the
>> time that the cgroup it belongs to has been frozen.
>> 
>> A cgroup is not considered frozen until all of its members are frozen,
>> and if one task then leaves the frozen state, the entire cgroup is
>> considered no longer frozen, even if its other members stay in the
>> frozen state. Similarly, even if a task is migrated from one frozen
>> cgroup (A) to another frozen cgroup (B), the time cgroup B has been
>> frozen would not be representative of that task even though it is a
>> member.
>> 
>> There is also latency between when each task in a cgroup is marked as
>> to-be-frozen/unfrozen and when it actually enters the frozen state, so
>> each descendant task has a different frozen time. For watchdogs that
>> elapse on a per-task basis, a per-cgroup time-in-frozen value would
>> underreport the actual time each task spent unable to run. Tasks that
>> miss a deadline might incorrectly be considered misbehaving when the
>> time they spent suspended was not correctly accounted for.
>> 
>> Please let me know if that answers your question or if there's something
>> I'm missing. I agree that it would be cleaner/preferable to keep this
>> accounting under a cgroup-specific umbrella, so I hope there is some way
>> to get around these issues, but it doesn't look like cgroup fs has a
>> good way to keep task-specific stats at the moment.
>
> I'm not sure freezing/frozen distinction is that meaningful. If each cgroup
> tracks total durations for both states, most threads should be able to rely
> on freezing duration delta, right? There shouldn't be significant time gap
> between freezing starting and most threads being frozen although the cgroup
> may not reach full frozen state due to e.g. NFS and what not.
>
> As long as threads are not migrated across cgroups, it should be able to do
> something like:
>
> 1. Read /proc/self/cgroup to determine the current cgroup.
> 2. Read and remember freezing duration $CGRP/cgroup.stat.
> 3. Do time taking operation.
> 4. Read $CGRP/cgrp.stat and calculate delta and deduct that from time taken.
>
> Would that work?
>
> Thanks.

Hi Tejun,

Thank you for your feedback! You made a good observation that it's
really the duration delta that matters here. I looked at tracking the
time from when we set/clear a cgroup's CGRP_FREEZE flag and compared
that to the per-task measurements of its members. For large (1000+
thread) cgroups, the latency between when a cgroup starts freezing and
when a task near the tail end of its cset->tasks actually enters the
handler is fairly significant. On an x86 VM, I saw a difference of about
1 tick per hundred tasks (i.e., the 6000th task would have been frozen
for 60 ticks less than the duration reported by its cgroup). We'd expect
this latency to accumulate more slowly on bare metal, but it would still
grow linearly.

Fortunately, since this same latency is present when we
unfreeze a cgroup and each member task, it's effectively canceled out
when we look at the freezing duration for tasks in cgroups that are not
currently frozen. For a running task, the measurement of how long it had
spent frozen in the past was within 1-2 ticks of its cgroup's. Our use
case does not look at this accounting until after a task has become
unfrozen, so the per-cgroup values seem like a reasonable substitution
for our purposes!

That being said, I realized from Michal's reply that the tracked value
doesn't have to be as narrow as the cgroup v2 freezing time. Basically,
we just want to give userspace some measure of time that a task cannot
run when it expects to be running. It doesn't seem practical to give an
exact accounting, but maybe tracking the time that each task spends in
some combination of stopped or frozen would provide a useful estimate.

What do you think?

-- 
Tiffany Y. Yang


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-17  9:49 ` Michal Koutný
@ 2025-06-27  7:47   ` Tiffany Yang
  2025-06-30 17:40     ` Michal Koutný
  0 siblings, 1 reply; 11+ messages in thread
From: Tiffany Yang @ 2025-06-27  7:47 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Tejun Heo,
	Johannes Weiner, Rafael J. Wysocki, Pavel Machek, Roman Gushchin,
	Chen Ridong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

Michal Koutný <mkoutny@suse.com> writes:

Hello! Thanks for taking the time to respond!

> Hello.
>
> On Tue, Jun 03, 2025 at 10:43:05PM +0000, Tiffany Yang <ynaffit@google.com> wrote:
>> The cgroup v2 freezer controller allows user processes to be dynamically
>> added to and removed from an interruptible frozen state from
>> userspace.
>
> Beware of freezing by migration vs freezing by cgroup attribute change.
> The latter is the primary design of cgroup v2; the former is "only" for
> consistency.
>
>> This feature is helpful for application management, as it
>> allows background tasks to be frozen to prevent them from being
>> scheduled or otherwise contending with foreground tasks for resources.
>
>> Still, applications are usually unaware of their having been placed in
>> the freezer cgroup, so any watchdog timers they may have set will fire
>> when they exit. To address this problem, I propose tracking the per-task
>> frozen time and exposing it to userland via procfs.
>
> But the watchdog fires rightfully when the application does not run,
> doesn't it?

Good question. I should've been clearer about our use case. In both
cases, the watchdog is being used to ensure that a job is completed
before some deadline. When the deadline is relative to the system time,
then yes, it would be firing correctly. In our case, the deadline is
meant to be relative to the time our task spends running; since we don't
have a clock for that, we set our timer against the system time
(CLOCK_MONOTONIC, in this case) as an approximation.

This timer may fire (correctly) while our application is still frozen,
but our watchdog task won't run until it's unfrozen. At that point, it
can check how much time has been spent in the cgroup v2 freezer and
decide whether to rearm the timer or to initiate a corrective action.
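
To sketch that flow, assuming the per-task file proposed in this RFC
(/proc/<pid>/cgroup_v2_freezer_time_frozen, reporting nanoseconds), a
minimal watchdog loop could look like the following; the watched pid, the
seconds-scale interval, and the corrective action are all placeholders:

/*
 * Minimal sketch of the watchdog pattern described above. It assumes the
 * per-task proc file proposed in this RFC (nanoseconds); the watched
 * task, the interval, and the corrective action are placeholders.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Read the proposed frozen-time counter (ns) for a task. */
static unsigned long long task_frozen_ns(pid_t pid)
{
	char path[64];
	unsigned long long ns = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/proc/%d/cgroup_v2_freezer_time_frozen", (int)pid);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%llu", &ns) != 1)
			ns = 0;
		fclose(f);
	}
	return ns;
}

int main(void)
{
	const pid_t watched = getpid();	/* task under deadline (placeholder) */
	struct timespec interval = { .tv_sec = 3 };	/* seconds-scale deadline */
	unsigned long long frozen_at_arm = task_frozen_ns(watched);

	for (;;) {
		/* Timer approximated against CLOCK_MONOTONIC, as described above. */
		clock_nanosleep(CLOCK_MONOTONIC, 0, &interval, NULL);

		/* (A real watchdog would first check whether the job finished.) */
		unsigned long long frozen_now = task_frozen_ns(watched);
		unsigned long long frozen_ns = frozen_now - frozen_at_arm;

		if (frozen_ns) {
			/* Frozen during the interval: rearm for the frozen time
			 * instead of declaring a timeout. */
			interval.tv_sec = frozen_ns / 1000000000ULL;
			interval.tv_nsec = frozen_ns % 1000000000ULL;
			frozen_at_arm = frozen_now;
			continue;
		}
		/* No frozen time to excuse the delay: corrective action here. */
		fprintf(stderr, "watchdog: deadline missed\n");
		return 1;
	}
}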

> It should be the responsibility of the "freezing agent" to prepare or
> notify the application about expected latencies.
>

Fair point! The freezing agent could roughly track freeze-entrance and
freeze-exit times, but how it would communicate those values to every
application being frozen, along with who would be responsible for keeping
track of per-thread accumulated frozen times, makes this a little
messy. The accuracy of those user timestamps compared to ones taken in
the kernel may be further degraded by possible preemptions, etc.

>> but the main focus in this initial submission is establishing the
>> right UAPI for this accounting information.
>
> /proc/<pid>/cgroup_v2_freezer_time_frozen looks quite extraordinary with

Agreed.

> other similar metrics, my first thought would be a field in
> /proc/<pid>/stat (or track it per cgroup as Tejun suggests).
>

Adding it to /proc/<pid>/stat is an option, but because this metric
isn't very widely used and exactly what it measures is pretty particular
("freezer time, but no, cgroup freezer time, but v2 and not v1"), we
were hesitant to add it there and make this interface even more
difficult for folks to parse.

> Could you please primarily explain why the application itself should
> care about the frozen time (and not other causes of delay)?
>

Thank you for asking this! This is a very helpful question. My answer is
that other causes of delay may be equally important, but this is another
place where things get messy because of the spectrum of types of
"delay". If we break delays into 2 categories, delays that were
requested (sleep) and delays that were not (SIGSTOP), I can say that we
are primarily interested in delays that were not requested. However,
there are many cases that fall somewhere in between, like the wakeup
latency after a sleep, or that are difficult to account for, like
blocking on a futex (requested), where the owner might be preempted (not
requested).

Which is all to say that it is hard to pin down generalized semantics
for this.

We can usually ignore the smaller sources of delay on a time-shared
system, but larger causes of delay (e.g., cgroup v2 freezer, SIGSTOP,
or really bad cases of scheduler starvation) can cause problems.

In this case, we've focused on a narrowish solution to just the cgroup
v2 freezer delays because it's fairly tractable. Ideally, we could
abstract this out in a more general way to other delays (like SIGSTOP),
but the challenge here is that there isn't a clear line that separates a
problematic delay from an acceptable delay. Suggestions for a framework
to approach this more generally are very welcome.

In the meantime, focusing on task frozen/stopped time seems like the
most reasonable approach. Maybe that would be clear enough to make it
palatable for /proc/<pid>/stat?

-- 
Tiffany Y. Yang


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-27  2:19       ` Tiffany Yang
@ 2025-06-27 19:01         ` Tejun Heo
  2025-07-14  4:44           ` Tiffany Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-06-27 19:01 UTC (permalink / raw)
  To: Tiffany Yang
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Hello,

On Thu, Jun 26, 2025 at 07:19:18PM -0700, Tiffany Yang wrote:
...
> Fortunately, since this same latency is present when we
> unfreeze a cgroup and each member task, it's effectively canceled out
> when we look at the freezing duration for tasks in cgroups that are not
> currently frozen. For a running task, the measurement of how long it had
> spent frozen in the past was within 1-2 ticks of its cgroup's. Our use
> case does not look at this accounting until after a task has become
> unfrozen, so the per-cgroup values seem like a reasonable substitution
> for our purposes!

Glad that worked out, but I'm curious what are the involved time scales.
Let's say you get things off by some tens of msecs, or maybe even hundreds,
does that matter for your purpose?

> That being said, I realized from Michal's reply that the tracked value
> doesn't have to be as narrow as the cgroup v2 freezing time. Basically,
> we just want to give userspace some measure of time that a task cannot
> run when it expects to be running. It doesn't seem practical to give an
> exact accounting, but maybe tracking the time that each task spends in
> some combination of stopped or frozen would provide a useful estimate.

While it's not my call, I'm not necessarily against. However, as you noted
in another reply, the challenge is that there are multiple states and it's
not clear what combinations would be useful for whom. When/if we encounter
more real-world use cases that would require these numbers, they may shed
light on what the right combination / interface is. IOW, I'm not sure this
is a case where adding something preemptively is a good idea.

Thanks.

-- 
tejun


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-27  7:47   ` Tiffany Yang
@ 2025-06-30 17:40     ` Michal Koutný
  2025-07-14  4:53       ` Tiffany Yang
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Koutný @ 2025-06-30 17:40 UTC (permalink / raw)
  To: Tiffany Yang
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Tejun Heo,
	Johannes Weiner, Rafael J. Wysocki, Pavel Machek, Roman Gushchin,
	Chen Ridong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider


On Fri, Jun 27, 2025 at 12:47:23AM -0700, Tiffany Yang <ynaffit@google.com> wrote:
> In our case, the deadline is meant to be relative to the time our task
> spends running; since we don't have a clock for that, we set our timer
> against the system time (CLOCK_MONOTONIC, in this case) as an
> approximation.

Would it be sufficient to measure that deadline against
cpu.stat:usage_usec (CPU time consumed by the cgroup)? Or do I
misunderstand your latter deadline metric?

> Adding it to /proc/<pid>/stat is an option, but because this metric
> isn't very widely used and exactly what it measures is pretty particular
> ("freezer time, but no, cgroup freezer time, but v2 and not v1"), we
> were hesitant to add it there and make this interface even more
> difficult for folks to parse.

Yeah, it'd need strong use case to add it there.

> Thank you for asking this! This is a very helpful question. My answer is
> that other causes of delay may be equally important, but this is another
> place where things get messy because of the spectrum of types of
> "delay". If we break delays into 2 categories, delays that were
> requested (sleep) and delays that were not (SIGSTOP), I can say that we
> are primarily interested in delays that were not requested.

(Note that SIGSTOP may be sent to self or within the group but) mind
that even the category "not requested" splits into two others: resource
contention and freezing management. And the latter should be under the
control of the agent that sets the deadlines.

> However, there are many cases that fall somewhere in between, like the
> wakeup latency after a sleep, or that are difficult to account for,
> like blocking on a futex (requested), where the owner might be
> preempted (not requested).

Those are order(s) of magnitude different. I can't imagine using the
freezer for jobs where wakeup latency also matters.


> Ideally, we could abstract this out in a more general way to other
> delays (like SIGSTOP), but the challenge here is that there isn't a
> clear line that separates a problematic delay from an acceptable
> delay. Suggestions for a framework to approach this more generally are
> very welcome.

Well, there are multiple similar metrics: various (cgroup) PSI, (global)
steal time, cpu.stat:throttled_usage and perhaps some more.

> In the meantime, focusing on task frozen/stopped time seems like the
> most reasonable approach. Maybe that would be clear enough to make it
> palatable for /proc/<pid>/stat?

Tejun's suggestion of tracking the frozen time of the whole cgroup
could complement other "debugging" stats provided by cgroups, but I
tend to think that it's not a good (and certainly not a complete)
solution to your problem.

Regards,
Michal



* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-27 19:01         ` Tejun Heo
@ 2025-07-14  4:44           ` Tiffany Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Tiffany Yang @ 2025-07-14  4:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	Johannes Weiner, Michal Koutný, Rafael J. Wysocki,
	Pavel Machek, Roman Gushchin, Chen Ridong, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider

Hello!

> ...
> Glad that worked out, but I'm curious what are the involved time scales.
> Let's say you get things off by some tens of msecs, or maybe even  
> hundreds,
> does that matter for your purpose?


Our watchdog timers are on the order of a few seconds. Tens of msecs
would be acceptable, but I think variances of hundreds of msecs would be
large enough to cause issues. I mainly wanted to include the rough
numbers in case anybody was curious about the actual magnitude of the
offsets (if they wanted this accounting for a different use case, for
example).


> While it's not my call, I'm not necessarily against. However, as you noted
> in another reply, the challenge is that there are multiple states and it's
> not clear what combinations would be useful for whom. When/if we encounter
> more real-world use cases that would require these numbers, they may shed
> light on what the right combination / interface is. IOW, I'm not sure this
> is a case where adding something preemptively is a good idea.

Ack. The per-cgroup accounting gets us most of the way there, so I'll be
sending out that updated version shortly! I agree that another relevant
use case would be immensely helpful for figuring out a way to describe
the broader problem we are trying to solve and what shape a solution
should take.

Thank you for your feedback and for helping me to arrive at (what I hope
is) a reasonable approach.

-- 
Tiffany Y. Yang


* Re: [RFC PATCH] cgroup: Track time in cgroup v2 freezer
  2025-06-30 17:40     ` Michal Koutný
@ 2025-07-14  4:53       ` Tiffany Yang
  0 siblings, 0 replies; 11+ messages in thread
From: Tiffany Yang @ 2025-07-14  4:53 UTC (permalink / raw)
  To: Michal Koutný
  Cc: linux-kernel, cgroups, kernel-team, John Stultz, Thomas Gleixner,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker, Tejun Heo,
	Johannes Weiner, Rafael J. Wysocki, Pavel Machek, Roman Gushchin,
	Chen Ridong, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider

Michal Koutný <mkoutny@suse.com> writes:

> Would it be sufficient to measure that deadline against
> cpu.stat:usage_usec (CPU time consumed by the cgroup)? Or do I
> misunderstand your latter deadline metric?

CPU time is a good way to think about the quantity we are trying to
measure against, but it does not account for sleep time (either
voluntarily or waiting on a futex, etc.). Unlike freeze time, we would
want sleep time to count against our deadline because a timeout would
likely indicate a problem in the application's logic.

> (Note that SIGSTOP may be sent to self or within the group but) mind
> that even the category "not requested" splits into two others: resource
> contention and freezing management. And the latter should be under the
> control of the agent that sets the deadlines.


This would be ideal, but in our case, the agent that sets/enforces the
deadlines is a task in the same application. It has no control over
freezing events and (currently) no way to know when one has
occurred. Consequently, even if the freezing manager were to send the
relevant information to our agent, none of those messages could be
processed until the application was unfrozen.

The result would be competing directly against the task under deadline
(to handle communication as it came in) or delaying corrective action
decisions (to wait until the deadline to deal with any messages). If the
application were frozen multiple times during the timer interval, that
cost would be incurred each time. As an alternative, the watchdog could
request this information from the freezing manager upon timer elapse,
but that would also introduce significant latency to deadline
enforcement.

> Those are order(s) of magnitude different. I can't imagine using the
> freezer for jobs where wakeup latency also matters.

This is true! These examples were mainly to illustrate the breadth of
the problem space/how slippery it can be to generalize.

> Well, there are multiple similar metrics: various (cgroup) PSI, (global)
> steal time, cpu.stat:throttled_usage and perhaps some more.

Ah! Thanks for noting these. It's helpful to have these concrete
examples to find ways to think about this problem.

Philosophically, I think the time we're trying to account for is most
similar to steal time because it allows a VM to correct the internal
accounting it uses to enforce policy. After considering how the delay
we're trying to track fits among these, I think one quality that makes
it somewhat difficult to formalize is that we are trying to account for
multiple external sources of delay, but we also want to exclude
"internal" delay (contention, voluntary sleep). The specificity of this
is making an iterative approach seem more appealing...

> Tejun's suggestion of tracking the frozen time of the whole cgroup
> could complement other "debugging" stats provided by cgroups, but I
> tend to think that it's not a good (and certainly not a complete)
> solution to your problem.

I agree that it doesn't necessarily feel complete, but after spending
this time mulling over the problem, I think it still feels too narrow to
know what a more general solution should look like.

Since there isn't yet a clear way to identify a set of "lost" time that
everyone (or at least a wider group of users) cares about, it seems like
iterating over components of interest is the best way to make progress
for now. That way, at least folks can track some combination of the
values that matter to them. (One aspect of this I find interesting is
time that is accounted for in multiple metrics. Maybe a better way to
think about this problem can be found in some relation between these
overlaps.)

I really appreciate the effort that you've put into trying to understand
the larger problem and the questions you've asked to help me think about
it. Thank you very much for your time!

-- 
Tiffany Y. Yang
