[RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues

* [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
@ 2025-08-07 12:14 Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 1/9] freezer: Introduce freeze_priority field in task_struct Zihuan Zhang
                   ` (10 more replies)
  0 siblings, 11 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
asynchronous I/O, and deep event loops—the original freezer model has shown its age.

## Background

Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
It sends each task a fake signal and waits for the task to enter the frozen state (`frozen()` becoming true). While this model works well in many cases, it has several inherent limitations:

- Signal-based logic cannot freeze uninterruptible (D-state) tasks
- Dependencies between processes can cause freeze retries 
- Retry-based recovery introduces unpredictable suspend latency

## Real-world problem illustration

Consider the following scenario during suspend:

Freeze Window Begins

    [process A] - epoll_wait()
        │
        ▼
    [process B] - event source (already frozen)

→ A enters D-state while waiting for B
→ Cannot respond to freezing signal
→ Freezer retries in a loop
→ Suspend latency spikes

In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**. 
Worse, the kernel has no insight into the root cause and simply retries blindly.

## Proposed solution: Freeze priority model

To address this, we propose a **layered freeze model** based on per-task freeze priorities.

### Design

We introduce 4 levels of freeze priority:

| Priority | Level             | Description                       |
|----------|-------------------|-----------------------------------|
| 0        | HIGH              | D-state tasks                     |
| 1        | NORMAL            | regular user-space tasks          |
| 2        | LOW               | not yet used                      |
| 4        | NEVER_FREEZE      | zombie tasks, PF_SUSPEND_TASK     |

The kernel will freeze processes **in priority order**, so that higher-priority tasks (those with lower numeric values) are frozen first.
This provides a deterministic path forward for tricky cases.
By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely, which avoids dependency inversion.

Although introducing more fine-grained freeze_priority levels improves extensibility and allows better modeling of task dependencies, 
it may also introduce additional overhead during task traversal, potentially affecting freezer performance.

In our test environment, increasing the maximum freeze retries to 16 only added ~4ms of overhead to the total suspend latency,
suggesting the added robustness comes at a relatively low cost. However, for latency-critical systems, this trade-off should be carefully evaluated.

## Benefits

- Solves D-state process freeze stalls caused by premature freezing of dependencies
- Enables more robust and reliable suspend/resume on complex userspace systems
- Introduces extensibility: tasks can be categorized by role, urgency, or dependency
- Reduces race conditions by introducing deterministic freezing order

## Previous Discussion
Link: https://lore.kernel.org/all/20250606062502.19607-1-zhangzihuan@kylinos.cn/
Link: https://lore.kernel.org/all/1ca889fd-6ead-4d4f-a3c7-361ea05bb659@kylinos.cn/

## Future directions

This framework opens up several promising areas for further development:

1. Adaptive behavior based on runtime statistics or retry feedback
The freezer adapts dynamically during suspend/hibernate based on the number of retries and which tasks failed to freeze. 
Tasks that failed in previous rounds will be assigned a higher freeze priority, improving convergence speed and reducing unnecessary retries.

2. cgroup-aware hierarchical freezing for containerized systems
The design supports cgroup-aware task traversal and freezing. 
This ensures compatibility with containerized environments, allowing for better control and visibility when freezing processes in different cgroups.

3. Unified freezing of userspace processes and kernel threads
Based on extensive testing, we found that freezing userspace tasks and kernel threads together works reliably in practice. 
Separating them does not resolve dependency issues between user and kernel context. Moreover, most kernel threads are marked as non-freezable,
so including them in the same freeze pass does not impact correctness and simplifies the logic.

Although the current implementation is relatively simple, it already helps alleviate some suspend failures caused by tasks stuck in D state.
In our testing, we observed that certain D-state tasks are triggered by filesystem sync operations during the freezing phase.
At this stage, we don't yet have a comprehensive solution for that class of problems.
This patchset represents a testable version of our design. We plan to further investigate and address such filesystem-related D-state issues in future revisions.

Patch summary:
 - Patch 1-3: Core infrastructure: field, API, layered freeze logic
 - Patch 4-7: Default priorities and dynamic adjustments
 - Patch 8: Statistics: freeze pass retry count
 - Patch 9: Procfs interface for userspace access

Zihuan Zhang (9):
  freezer: Introduce freeze_priority field in task_struct
  freezer: Introduce API to set per-task freeze priority
  freezer: Add per-priority layered freeze logic
  freezer: Set default freeze priority for userspace tasks
  freezer: set default freeze priority for PF_SUSPEND_TASK processes
  freezer: Set default freeze priority for zombie tasks
  freezer: raise freeze priority of tasks failed to freeze last time
  freezer: Add retry count statistics for freeze pass iterations
  proc: Add /proc/<pid>/freeze_priority interface

 Documentation/filesystems/proc.rst | 14 ++++++-
 fs/proc/base.c                     | 64 ++++++++++++++++++++++++++++++
 include/linux/freezer.h            | 20 ++++++++++
 include/linux/sched.h              |  3 ++
 kernel/fork.c                      |  1 +
 kernel/power/process.c             | 23 ++++++++++-
 kernel/sched/core.c                |  2 +
 7 files changed, 124 insertions(+), 3 deletions(-)

-- 
2.25.1


* [RFC PATCH v1 1/9] freezer: Introduce freeze_priority field in task_struct
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 2/9] freezer: Introduce API to set per-task freeze priority Zihuan Zhang
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

To improve the flexibility and correctness of the freezer subsystem,
we introduce a new field `freeze_priority` in `task_struct`.

This field will allow us to assign different freezing priorities to
tasks, enabling prioritized traversal in future changes. This is
particularly useful when dealing with complex inter-process dependencies
in modern userspace workloads (e.g., service managers, IPC daemons).

Although this patch does not change behavior yet, it provides the
necessary infrastructure for upcoming logic to address issues like
dependency stalls and D-state hangs. It also helps avoid potential
race conditions by paving the way for deterministic freezing order.

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 include/linux/freezer.h | 7 +++++++
 include/linux/sched.h   | 3 +++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index b303472255be..6314f8b68035 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -16,6 +16,13 @@ DECLARE_STATIC_KEY_FALSE(freezer_active);
 extern bool pm_freezing;		/* PM freezing in effect */
 extern bool pm_nosig_freezing;		/* PM nosig freezing in effect */
 
+enum freeze_priority {
+	FREEZE_PRIORITY_HIGH		= 0,
+	FREEZE_PRIORITY_NORMAL		= 1,
+	FREEZE_PRIORITY_LOW		= 2,
+	FREEZE_PRIORITY_NEVER		= 4
+};
+
 /*
  * Timeout for stopping processes
  */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b272382673d..7915e6214e50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -910,6 +910,9 @@ struct task_struct {
 	unsigned int			btrace_seq;
 #endif
 
+#ifdef CONFIG_FREEZER
+	unsigned int			freeze_priority;
+#endif
 	unsigned int			policy;
 	unsigned long			max_allowed_capacity;
 	int				nr_cpus_allowed;
-- 
2.25.1




* [RFC PATCH v1 2/9] freezer: Introduce API to set per-task freeze priority
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 1/9] freezer: Introduce freeze_priority field in task_struct Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 3/9] freezer: Add per-priority layered freeze logic Zihuan Zhang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

This patch introduces the basic API for setting freeze priority on a per-task basis.
The actual usage and policy for assigning specific priorities will be added in
subsequent patches.

This change lays the groundwork for implementing a more flexible and
dependency-aware freezer model.
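
As a rough illustration, the checks performed by the new setter can be modelled in plain userspace C. This is only a sketch: `struct model_task` and `model_set_priority()` are hypothetical stand-ins for `task_struct` and `freeze_set_default_priority()`, and `is_kthread` models the `PF_KTHREAD` flag test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the enum added in patch 1. */
enum freeze_priority {
	FREEZE_PRIORITY_HIGH   = 0,
	FREEZE_PRIORITY_NORMAL = 1,
	FREEZE_PRIORITY_LOW    = 2,
	FREEZE_PRIORITY_NEVER  = 4
};

/* Hypothetical stand-in for the task_struct fields the setter touches. */
struct model_task {
	bool is_kthread;		/* models (p->flags & PF_KTHREAD) */
	unsigned int freeze_priority;
};

/* Same checks as the setter: refuse kernel threads and out-of-range values. */
static bool model_set_priority(struct model_task *t, unsigned int prio)
{
	assert(t != NULL);

	if (t->is_kthread || prio > FREEZE_PRIORITY_NEVER)
		return false;

	t->freeze_priority = prio;
	return true;
}
```

Note that a range check of this form also accepts the value 3, which does not correspond to a named level.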

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 include/linux/freezer.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 6314f8b68035..f231a60c3120 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -46,6 +46,18 @@ static inline bool freezing(struct task_struct *p)
 	return false;
 }
 
+static inline bool freeze_set_default_priority(struct task_struct *p, unsigned int prio)
+{
+	if ((p->flags & PF_KTHREAD) || prio > FREEZE_PRIORITY_NEVER)
+		return false;
+
+	p->freeze_priority = prio;
+
+	pr_debug("set default freeze priority for comm:%s pid:%d prio:%d\n",
+		 p->comm, p->pid, p->freeze_priority);
+	return true;
+}
+
 /* Takes and releases task alloc lock using task_lock() */
 extern void __thaw_task(struct task_struct *t);
 
@@ -80,6 +92,7 @@ static inline bool cgroup_freezing(struct task_struct *task)
 #else /* !CONFIG_FREEZER */
 static inline bool frozen(struct task_struct *p) { return false; }
 static inline bool freezing(struct task_struct *p) { return false; }
+static inline bool freeze_set_default_priority(struct task_struct *p, unsigned int prio) { return false; }
 static inline void __thaw_task(struct task_struct *t) {}
 
 static inline bool __refrigerator(bool check_kthr_stop) { return false; }
-- 
2.25.1




* [RFC PATCH v1 3/9] freezer: Add per-priority layered freeze logic
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 1/9] freezer: Introduce freeze_priority field in task_struct Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 2/9] freezer: Introduce API to set per-task freeze priority Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 4/9] freezer: Set default freeze priority for userspace tasks Zihuan Zhang
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

The current freezer traverses all user tasks in a single pass, without
distinguishing between tasks that are easier or harder to freeze. This
uniform treatment may cause suboptimal behavior when certain newly created
tasks, service daemons, or system threads block the progress of freeze due
to dependency ordering issues.

This patch introduces a simple multi-pass traversal model in
try_to_freeze_tasks(), where user tasks are grouped and frozen from
highest to lowest priority, i.e. in ascending numeric freeze_priority
order. Tasks marked with higher priority are attempted earlier, which
can help break dependency cycles sooner and reduce retry iterations.

Specifically:
 - A new loop iterates over priority levels.
 - In each round, only tasks with freeze_priority <= the current round number are visited.
 - The behavior applies only to user task freezing (when user_only == true).
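
The round filter can be sketched as a small userspace model. This is an illustration only: `struct model_task`, `visited_in_round()` and `first_round_visited()` are hypothetical names, and the predicate simply restates the skip test added to the `for_each_process_thread()` loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FREEZE_PRIORITY_NEVER 4u	/* same value as the enum in patch 1 */

/* Hypothetical stand-in for the task_struct fields the filter reads. */
struct model_task {
	bool is_kthread;		/* models (p->flags & PF_KTHREAD) */
	unsigned int freeze_priority;
};

/* A user task is skipped while the round number is below its priority value. */
static bool visited_in_round(const struct model_task *t, unsigned int round,
			     bool user_only)
{
	assert(t != NULL);

	if (user_only && !t->is_kthread && round < t->freeze_priority)
		return false;
	return true;
}

/* First round in which a task is visited, or -1 if the loop ends first. */
static int first_round_visited(const struct model_task *t)
{
	unsigned int round;

	for (round = 0; round < FREEZE_PRIORITY_NEVER; round++)
		if (visited_in_round(t, round, true))
			return (int)round;
	return -1;
}
```

In this model a task stays eligible in every round after its first, and a task with FREEZE_PRIORITY_NEVER is never visited because the loop stops before round 4.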

This approach preserves compatibility with the current logic, while enabling
fine-grained control via future enhancements (e.g., dynamic priority tuning).

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 kernel/power/process.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index dc0dfc349f22..06eafdb32abb 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -32,10 +32,12 @@ static int try_to_freeze_tasks(bool user_only)
 	struct task_struct *g, *p;
 	unsigned long end_time;
 	unsigned int todo;
+	unsigned int round = 0;
 	bool wq_busy = false;
 	ktime_t start, end, elapsed;
 	unsigned int elapsed_msecs;
 	bool wakeup = false;
+	bool has_freezable_task;
 	int sleep_usecs = USEC_PER_MSEC;
 
 	pr_info("Freezing %s\n", what);
@@ -47,13 +49,18 @@ static int try_to_freeze_tasks(bool user_only)
 	if (!user_only)
 		freeze_workqueues_begin();
 
-	while (true) {
+	while (round < FREEZE_PRIORITY_NEVER) {
 		todo = 0;
+		has_freezable_task = false;
 		read_lock(&tasklist_lock);
 		for_each_process_thread(g, p) {
+			if (user_only && !(p->flags & PF_KTHREAD) && round < p->freeze_priority)
+				continue;
+
 			if (p == current || !freeze_task(p))
 				continue;
 
+			has_freezable_task = true;
 			todo++;
 		}
 		read_unlock(&tasklist_lock);
@@ -63,6 +70,12 @@ static int try_to_freeze_tasks(bool user_only)
 			todo += wq_busy;
 		}
 
+		round++;
+
+		/* sleep only if need to freeze tasks */
+		if (user_only && !has_freezable_task)
+			continue;
+
 		if (!todo || time_after(jiffies, end_time))
 			break;
 
-- 
2.25.1




* [RFC PATCH v1 4/9] freezer: Set default freeze priority for userspace tasks
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (2 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 3/9] freezer: Add per-priority layered freeze logic Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes Zihuan Zhang
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

The freezer framework now supports per-task freeze priorities. To
ensure consistent behavior, this patch assigns a default freeze
priority (FREEZE_PRIORITY_NORMAL) to all newly created userspace tasks.

This helps maintain deterministic freezing order and prepares the
ground for future enhancements based on priority-aware freezing logic.

Kernel threads are not affected by this change, since
freeze_set_default_priority() rejects tasks with PF_KTHREAD set.

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 kernel/fork.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9ce93fd20f82..04af5390af25 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2422,6 +2422,7 @@ __latent_entropy struct task_struct *copy_process(
 
 	copy_oom_score_adj(clone_flags, p);
 
+	freeze_set_default_priority(p, FREEZE_PRIORITY_NORMAL);
 	return p;
 
 bad_fork_core_free:
-- 
2.25.1




* [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (3 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 4/9] freezer: Set default freeze priority for userspace tasks Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-08 14:39   ` Oleg Nesterov
  2025-08-07 12:14 ` [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks Zihuan Zhang
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

Tasks marked with PF_SUSPEND_TASK are involved in system suspend or
hibernate operations. These tasks must not be frozen, as they are
responsible for coordinating or executing parts of the suspend/resume
sequence.

This patch explicitly sets the freeze_priority of the suspending task to
FREEZE_PRIORITY_NEVER in freeze_processes() and restores
FREEZE_PRIORITY_NORMAL in thaw_processes(). This makes the exemption from
the freezer logic clear in the new freeze-priority model and avoids
redundant evaluations during process traversal.

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 kernel/power/process.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 06eafdb32abb..21bbca7040cf 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -147,6 +147,7 @@ int freeze_processes(void)
 
 	pm_wakeup_clear(0);
 	pm_freezing = true;
+	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
 	error = try_to_freeze_tasks(true);
 	if (!error)
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
@@ -218,6 +219,7 @@ void thaw_processes(void)
 	WARN_ON(!(curr->flags & PF_SUSPEND_TASK));
 	curr->flags &= ~PF_SUSPEND_TASK;
 
+	freeze_set_default_priority(current, FREEZE_PRIORITY_NORMAL);
 	usermodehelper_enable();
 
 	schedule();
-- 
2.25.1




* [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (4 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-08 14:29   ` Oleg Nesterov
  2025-08-07 12:14 ` [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time Zihuan Zhang
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

Zombie processes are not subject to freezing, but they are still part of
the global task list. During freeze traversal, tasks are examined for
priority and eligibility, which may involve unnecessary locking even for
non-freezable tasks like zombies.

This patch assigns FREEZE_PRIORITY_NEVER to zombie tasks in
do_task_dead(), so that the freezer can skip priority setup and locking
for them in subsequent iterations.

This helps reduce overhead during freeze traversal, especially when many
zombie processes exist in the system.

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..5a26d7511047 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -42,6 +42,7 @@
 #include <linux/context_tracking.h>
 #include <linux/cpuset.h>
 #include <linux/delayacct.h>
+#include <linux/freezer.h>
 #include <linux/init_task.h>
 #include <linux/interrupt.h>
 #include <linux/ioprio.h>
@@ -6980,6 +6981,7 @@ void __noreturn do_task_dead(void)
 	current->flags |= PF_NOFREEZE;
 
 	__schedule(SM_NONE);
+	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
 	BUG();
 
 	/* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
-- 
2.25.1




* [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (5 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-08 14:53   ` Oleg Nesterov
  2025-08-07 12:14 ` [RFC PATCH v1 8/9] freezer: Add retry count statistics for freeze pass iterations Zihuan Zhang
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

If there are tasks that fail to freeze in the current suspend attempt,
we raise their freeze priority for the next attempt.

This change ensures that such tasks are frozen earlier in the next
round, helping to reduce retry counts and avoid persistent D-state
tasks due to dependency misordering.
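
The promotion step can be sketched in userspace as follows. This is a simplified model, not the kernel code: `struct model_task` and `promote_stragglers()` are hypothetical, and the `freezing`/`frozen` fields stand in for the `freezing(p)` and `frozen(p)` checks in the failure-reporting loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FREEZE_PRIORITY_HIGH 0u	/* same value as the enum in patch 1 */

/* Hypothetical stand-in for the task state the reporting loop inspects. */
struct model_task {
	bool is_kthread;		/* the setter refuses kernel threads */
	bool freezing;			/* models freezing(p) */
	bool frozen;			/* models frozen(p) */
	unsigned int freeze_priority;
};

/* Raise every user task that was asked to freeze but did not; return the count. */
static size_t promote_stragglers(struct model_task *tasks, size_t n)
{
	size_t promoted = 0;
	size_t i;

	assert(tasks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct model_task *t = &tasks[i];

		if (!t->freezing || t->frozen)
			continue;	/* not a straggler */
		if (t->is_kthread)
			continue;	/* priority is never set on kthreads */

		t->freeze_priority = FREEZE_PRIORITY_HIGH;
		promoted++;
	}
	return promoted;
}
```

In the patch itself the promotion only runs inside the failure-reporting path, so it takes effect on the next suspend attempt rather than the current one.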

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 kernel/power/process.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 21bbca7040cf..9d3cbde905b9 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -111,8 +111,10 @@ static int try_to_freeze_tasks(bool user_only)
 		if (!wakeup || pm_debug_messages_on) {
 			read_lock(&tasklist_lock);
 			for_each_process_thread(g, p) {
-				if (p != current && freezing(p) && !frozen(p))
+				if (p != current && freezing(p) && !frozen(p)) {
+					freeze_set_default_priority(p, FREEZE_PRIORITY_HIGH);
 					sched_show_task(p);
+				}
 			}
 			read_unlock(&tasklist_lock);
 		}
-- 
2.25.1




* [RFC PATCH v1 8/9] freezer: Add retry count statistics for freeze pass iterations
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (6 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-07 12:14 ` [RFC PATCH v1 9/9] proc: Add /proc/<pid>/freeze_priority interface Zihuan Zhang
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

Freezer retry loops during suspend are often triggered by tasks entering
D-state (TASK_UNINTERRUPTIBLE), which cannot be frozen. This patch logs
the current round and the number of tasks still to freeze on each pass of
try_to_freeze_tasks(), to help quantify how many attempts were required
before all tasks entered the frozen state. This is useful for performance
tuning and debugging unpredictable suspend delays.

A new dmesg log is added for visibility:

freeze round: xx, tasks to freeze: xx

This message allows users to correlate freeze instability with system
state.
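
For tooling that post-processes the kernel log, a line in the format shown above could be parsed roughly as follows. This is a sketch under the assumption that the message text matches that format exactly; `parse_freeze_round()` is a hypothetical helper, not part of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the round number and remaining task count from one log line.
 * Any prefix (such as a dmesg timestamp) before the message is ignored.
 */
static bool parse_freeze_round(const char *line, unsigned int *round,
			       unsigned int *todo)
{
	const char *msg;

	assert(line != NULL && round != NULL && todo != NULL);

	msg = strstr(line, "freeze round:");
	if (!msg)
		return false;

	return sscanf(msg, "freeze round: %u, tasks to freeze: %u",
		      round, todo) == 2;
}
```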

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 kernel/power/process.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 9d3cbde905b9..442d2ebba3ed 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -76,6 +76,8 @@ static int try_to_freeze_tasks(bool user_only)
 		if (user_only && !has_freezable_task)
 			continue;
 
+		pr_info("freeze round: %u, tasks to freeze: %u\n", round, todo);
+
 		if (!todo || time_after(jiffies, end_time))
 			break;
 
-- 
2.25.1




* [RFC PATCH v1 9/9] proc: Add /proc/<pid>/freeze_priority interface
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (7 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 8/9] freezer: Add retry count statistics for freeze pass iterations Zihuan Zhang
@ 2025-08-07 12:14 ` Zihuan Zhang
  2025-08-07 13:25 ` [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Michal Hocko
  2025-08-14 14:37 ` Peter Zijlstra
  10 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-07 12:14 UTC (permalink / raw)
  To: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Michal Hocko, Jonathan Corbet
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, Zihuan Zhang

This patch introduces a new proc file `/proc/<pid>/freeze_priority`
that allows reading and writing the freeze priority of a task.

This is useful for process freezing mechanisms that wish to prioritize
which tasks to freeze first during suspend or hibernation.

To avoid misuse and for system integrity, userspace is not permitted to
assign the `FREEZE_PRIORITY_NEVER` level to any task.
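
The validation applied on write can be approximated in userspace as below. This is a sketch only: `parse_freeze_priority()` is a hypothetical helper built on `strtoull()`, which is not identical to the kernel's `kstrtoull_from_user()` (for example, this version rejects a leading sign).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define FREEZE_PRIORITY_NEVER 4ULL	/* same value as the enum in patch 1 */

/* Returns 0 and stores the value on success, a negative errno otherwise. */
static int parse_freeze_priority(const char *buf, unsigned int *out)
{
	unsigned long long val;
	char *end;

	assert(buf != NULL && out != NULL);

	/* Require a leading digit; strtoull() alone would accept spaces or a sign. */
	if (*buf < '0' || *buf > '9')
		return -EINVAL;

	errno = 0;
	val = strtoull(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;

	/* Allow one trailing newline, as produced by "echo 1 > file". */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* Userspace may not select FREEZE_PRIORITY_NEVER or anything above it. */
	if (val >= FREEZE_PRIORITY_NEVER)
		return -EINVAL;

	*out = (unsigned int)val;
	return 0;
}
```

With a `>=` comparison of this form, the value 3 is accepted even though it is not a named level.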

Signed-off-by: Zihuan Zhang <zhangzihuan@kylinos.cn>
---
 Documentation/filesystems/proc.rst | 14 ++++++-
 fs/proc/base.c                     | 64 ++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2971551b7235..4b7bc695b249 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -48,7 +48,8 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
   3.12	/proc/<pid>/arch_status - Task architecture specific information
   3.13  /proc/<pid>/fd - List of symlinks to open files
-  3.14  /proc/<pid/ksm_stat - Information about the process's ksm status.
+  3.14  /proc/<pid>/ksm_stat - Information about the process's ksm status
+  3.15  /proc/<pid>/freeze_priority - Information about freeze_priority
 
   4	Configuring procfs
   4.1	Mount options
@@ -2349,6 +2350,17 @@ applicable to KSM.
 More information about KSM can be found in
 Documentation/admin-guide/mm/ksm.rst.
 
+3.15	/proc/<pid>/freeze_priority - Information about freeze_priority
+-----------------------------------------------------------------------
+This file exposes the `freeze_priority` value of a given task.
+
+The freezer subsystem uses `freeze_priority` to determine the order
+in which tasks are frozen during suspend/hibernate. Tasks with
+lower values are frozen earlier. Higher values defer the task to
+later freeze rounds.
+
+Writing a value to this file allows user space to adjust the
+priority of the task in the freezer traversal.
 
 Chapter 4: Configuring procfs
 =============================
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 62d35631ba8c..724145356128 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -86,6 +86,7 @@
 #include <linux/user_namespace.h>
 #include <linux/fs_parser.h>
 #include <linux/fs_struct.h>
+#include <linux/freezer.h>
 #include <linux/slab.h>
 #include <linux/sched/autogroup.h>
 #include <linux/sched/mm.h>
@@ -3290,6 +3291,66 @@ static int proc_pid_ksm_stat(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_KSM */
 
+#ifdef CONFIG_FREEZER
+static int freeze_priority_show(struct seq_file *m, void *v)
+{
+	struct inode *inode = m->private;
+	struct task_struct *p;
+
+	p = get_proc_task(inode);
+	if (!p)
+		return -ESRCH;
+
+	task_lock(p);
+	seq_printf(m, "%u\n", p->freeze_priority);
+	task_unlock(p);
+
+	put_task_struct(p);
+
+	return 0;
+}
+
+static ssize_t freeze_priority_write(struct file *file, const char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct inode *inode = file_inode(file);
+	struct task_struct *p;
+	u64 freeze_priority;
+	int err;
+
+	err = kstrtoull_from_user(buf, count, 10, &freeze_priority);
+	if (err < 0)
+		return err;
+
+	if (freeze_priority >= FREEZE_PRIORITY_NEVER)
+		return -EINVAL;
+
+	p = get_proc_task(inode);
+	if (!p)
+		return -ESRCH;
+
+	task_lock(p);
+	p->freeze_priority = freeze_priority;
+	task_unlock(p);
+
+	put_task_struct(p);
+	return count;
+}
+
+static int freeze_priority_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, freeze_priority_show, inode);
+}
+
+static const struct file_operations proc_pid_freeze_priority = {
+	.open		= freeze_priority_open,
+	.read		= seq_read,
+	.write		= freeze_priority_write,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif /* CONFIG_FREEZER */
+
 #ifdef CONFIG_KSTACK_ERASE_METRICS
 static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *task)
@@ -3407,6 +3468,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("timers",	  S_IRUGO, proc_timers_operations),
 #endif
 	REG("timerslack_ns", S_IRUGO|S_IWUGO, proc_pid_set_timerslack_ns_operations),
+#ifdef CONFIG_FREEZER
+	REG("freeze_priority",  S_IRUGO|S_IWUSR, proc_pid_freeze_priority),
+#endif
 #ifdef CONFIG_LIVEPATCH
 	ONE("patch_state",  S_IRUSR, proc_pid_patch_state),
 #endif
-- 
2.25.1




* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (8 preceding siblings ...)
  2025-08-07 12:14 ` [RFC PATCH v1 9/9] proc: Add /proc/<pid>/freeze_priority interface Zihuan Zhang
@ 2025-08-07 13:25 ` Michal Hocko
  2025-08-08  1:13   ` Zihuan Zhang
  2025-08-14 14:37 ` Peter Zijlstra
  10 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2025-08-07 13:25 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
> The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
> Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
> asynchronous I/O, and deep event loops—the original freezer model has shown its age.

A modern userspace might be more complex or convoluted but I do not
think the above statement is accurate or even correct.

> ## Background
> 
> Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
> It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
> 
> - Signal-based logic cannot freeze uninterruptible (D-state) tasks
> - Dependencies between processes can cause freeze retries 
> - Retry-based recovery introduces unpredictable suspend latency
> 
> ## Real-world problem illustration
> 
> Consider the following scenario during suspend:
> 
> Freeze Window Begins
> 
>     [process A] - epoll_wait()
>         │
>         ▼
>     [process B] - event source (already frozen)
> 
> → A enters D-state because of waiting for B

I thought epoll_wait was waiting in interruptible sleep.

> → Cannot respond to freezing signal
> → Freezer retries in a loop
> → Suspend latency spikes
> 
> In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**. 
> Worse, the kernel has no insight into the root cause and simply retries blindly.
> 
> ## Proposed solution: Freeze priority model
> 
> To address this, we propose a **layered freeze model** based on per-task freeze priorities.
> 
> ### Design
> 
> We introduce 4 levels of freeze priority:
> 
> 
> | Priority | Level             | Description                       |
> |----------|-------------------|-----------------------------------|
> | 0        | HIGH              | D-state tasks                     |
> | 1        | NORMAL            | regular user space tasks          |
> | 2        | LOW               | not yet used                      |
> | 4        | NEVER_FREEZE      | zombie tasks, PF_SUSPEND_TASK     |
> 
> 
> The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
> This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
> By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion.

I really fail to see how that is supposed to work to be honest. If a
process is running in the userspace then the priority shouldn't really
matter much. Tasks will get a signal, freeze themselves and you are
done. If they are running in the userspace and e.g. sleeping while not
TASK_FREEZABLE then priority simply makes no difference. And if they are
TASK_FREEZABLE then the priority doesn't matter either.

What am I missing?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-07 13:25 ` [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Michal Hocko
@ 2025-08-08  1:13   ` Zihuan Zhang
  2025-08-08  7:00     ` Michal Hocko
  2025-08-08  7:57     ` Oleg Nesterov
  0 siblings, 2 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-08  1:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

Hi,

On 2025/8/7 21:25, Michal Hocko wrote:
> On Thu 07-08-25 20:14:09, Zihuan Zhang wrote:
>> The Linux task freezer was designed in a much earlier era, when userspace was relatively simple and flat.
>> Over the years, as modern desktop and mobile systems have become increasingly complex—with intricate IPC,
>> asynchronous I/O, and deep event loops—the original freezer model has shown its age.
> A modern userspace might be more complex or convoluted but I do not
> think the above statement is accurate or even correct.
You’re right — that statement may not be accurate. I’ll be more careful 
with the wording.
>> ## Background
>>
>> Currently, the freezer traverses the task list linearly and attempts to freeze all tasks equally.
>> It sends a signal and waits for `freezing()` to become true. While this model works well in many cases, it has several inherent limitations:
>>
>> - Signal-based logic cannot freeze uninterruptible (D-state) tasks
>> - Dependencies between processes can cause freeze retries
>> - Retry-based recovery introduces unpredictable suspend latency
>>
>> ## Real-world problem illustration
>>
>> Consider the following scenario during suspend:
>>
>> Freeze Window Begins
>>
>>      [process A] - epoll_wait()
>>          │
>>          ▼
>>      [process B] - event source (already frozen)
>>
>> → A enters D-state because of waiting for B
> I thought epoll_wait was waiting in interruptible sleep.

Apologies — my description may not be entirely accurate.

But there are some dmesg logs:

[   62.880497] PM: suspend entry (deep)
[   63.130639] Filesystems sync: 0.249 seconds
[   63.130643] PM: Preparing system for sleep (deep)
[   63.226398] Freezing user space processes
[   63.227193] freeze round: 0, task to freeze: 681
[   63.228110] freeze round: 1, task to freeze: 1
[   63.230064] task:Xorg            state:D stack:0     pid:1404  tgid:1404  ppid:1348   task_flags:0x400100 flags:0x00004004
[   63.230068] Call Trace:
[   63.230069]  <TASK>
[   63.230071]  __schedule+0x52e/0xea0
[   63.230077]  schedule+0x27/0x80
[   63.230079]  schedule_timeout+0xf2/0x100
[   63.230082]  wait_for_completion+0x85/0x130
[   63.230085]  __flush_work+0x21f/0x310
[   63.230087]  ? __pfx_wq_barrier_func+0x10/0x10
[   63.230091]  drm_mode_rmfb+0x138/0x1b0
[   63.230093]  ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[   63.230095]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   63.230097]  drm_ioctl_kernel+0xa5/0x100
[   63.230099]  drm_ioctl+0x270/0x4b0
[   63.230101]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   63.230104]  ? syscall_exit_work+0x108/0x140
[   63.230107]  radeon_drm_ioctl+0x4a/0x80 [radeon]
[   63.230141]  __x64_sys_ioctl+0x93/0xe0
[   63.230144]  ? syscall_trace_enter+0xfa/0x1c0
[   63.230146]  do_syscall_64+0x7d/0x2c0
[   63.230148]  ? do_syscall_64+0x1f3/0x2c0
[   63.230150]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   63.230153] RIP: 0033:0x7f1aa132550b
[   63.230154] RSP: 002b:00007ffebab69678 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   63.230156] RAX: ffffffffffffffda RBX: 00007ffebab696bc RCX: 00007f1aa132550b
[   63.230158] RDX: 00007ffebab696bc RSI: 00000000c00464af RDI: 000000000000000e
[   63.230159] RBP: 00000000c00464af R08: 00007f1aa0c41220 R09: 000055a71ce32310
[   63.230160] R10: 0000000000000087 R11: 0000000000000246 R12: 000055a71b813660
[   63.230161] R13: 000000000000000e R14: 0000000003a8f5cd R15: 000055a71b6bbfb0
[   63.230164]  </TASK>
[   63.230248] freeze round: 2, task to freeze: 1


You can find it in this patch

link: 
https://lore.kernel.org/all/20250619035355.33402-1-zhangzihuan@kylinos.cn/

>> → Cannot respond to freezing signal
>> → Freezer retries in a loop
>> → Suspend latency spikes
>>
>> In such cases, we observed that a normal 1–2ms freezer cycle could balloon to **tens of milliseconds**.
>> Worse, the kernel has no insight into the root cause and simply retries blindly.
>>
>> ## Proposed solution: Freeze priority model
>>
>> To address this, we propose a **layered freeze model** based on per-task freeze priorities.
>>
>> ### Design
>>
>> We introduce 4 levels of freeze priority:
>>
>>
>> | Priority | Level             | Description                       |
>> |----------|-------------------|-----------------------------------|
>> | 0        | HIGH              | D-state tasks                     |
>> | 1        | NORMAL            | regular user space tasks          |
>> | 2        | LOW               | not yet used                      |
>> | 4        | NEVER_FREEZE      | zombie tasks, PF_SUSPEND_TASK     |
>>
>>
>> The kernel will freeze processes **in priority order**, ensuring that higher-priority tasks are frozen first.
>> This avoids dependency inversion scenarios and provides a deterministic path forward for tricky cases.
>> By freezing control or event-source threads first, we prevent dependent tasks from entering D-state prematurely — effectively avoiding dependency inversion.
> I really fail to see how that is supposed to work to be honest. If a
> process is running in the userspace then the priority shouldn't really
> matter much. Tasks will get a signal, freeze themselves and you are
> done. If they are running in the userspace and e.g. sleeping while not
> TASK_FREEZABLE then priority simply makes no difference. And if they are
> TASK_FREEZABLE then the priority doesn't matter either.
>
> What am I missing?
Under ideal conditions, if a userspace task is TASK_FREEZABLE, receives 
the freezing() signal, and enters the refrigerator in a timely manner, 
then freeze priority wouldn’t make a difference.

However, in practice, we’ve observed cases where tasks appear stuck in 
uninterruptible sleep (D state) during the freeze phase  — and thus 
cannot respond to signals or enter the refrigerator. These tasks are 
technically TASK_FREEZABLE, but due to the nature of their sleep state, 
they don’t freeze promptly, and may require multiple retry rounds, or 
cause the entire suspend to fail.



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-08  1:13   ` Zihuan Zhang
@ 2025-08-08  7:00     ` Michal Hocko
  2025-08-08  7:52       ` Zihuan Zhang
  2025-08-08  7:57     ` Oleg Nesterov
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2025-08-08  7:00 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
[...]
> However, in practice, we’ve observed cases where tasks appear stuck in
> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
> respond to signals or enter the refrigerator. These tasks are technically
> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
> freeze promptly, and may require multiple retry rounds, or cause the entire
> suspend to fail.

Right, but that is an inherent problem of the freezer implementation.
It is not really clear to me how priorities or layers improve on that.
Could you please elaborate on that?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-08  7:00     ` Michal Hocko
@ 2025-08-08  7:52       ` Zihuan Zhang
  2025-08-08  8:58         ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-08  7:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 2025/8/8 15:00, Michal Hocko wrote:
> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
> [...]
>> However, in practice, we’ve observed cases where tasks appear stuck in
>> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
>> respond to signals or enter the refrigerator. These tasks are technically
>> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
>> freeze promptly, and may require multiple retry rounds, or cause the entire
>> suspend to fail.
> Right, but that is an inherent problem of the freezer implementation.
> It is not really clear to me how priorities or layers improve on that.
> Could you please elaborate on that?

Thanks for the follow-up.

From our observations, we’ve seen processes like Xorg that are in a
normal state before freezing begins, but enter D state during the freeze
window. Upon investigation, we found that these processes often depend on
other user processes (e.g., I/O helpers or system services), and when
those dependencies are frozen first, the dependent process (like Xorg)
gets stuck and can’t be frozen itself.

This led us to treat such processes as “hard to freeze” tasks — not 
because they’re inherently unfreezable, but because they are more likely 
to become problematic if not frozen early enough.

So our model works as follows:
     •    By default, freezer tries to freeze all freezable tasks in 
each round.
     •    With our approach, we only attempt to freeze tasks whose 
freeze_priority is less than or equal to the current round number.
     •    This ensures that higher-priority (i.e., harder-to-freeze) 
tasks are attempted earlier, increasing the chance that they freeze 
before being blocked by others.
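
The round-based selection described above can be sketched as a small
userspace simulation. This is only a toy model under stated assumptions, not
the code from this series: `struct toy_task`, `eligible()` and
`freeze_round()` are hypothetical names, the priority constants mirror the
table in the cover letter, and every eligible task is assumed to freeze on
its first attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the levels listed in the cover letter. */
enum { PRIO_HIGH = 0, PRIO_NORMAL = 1, PRIO_LOW = 2 };

struct toy_task {
	const char *comm;
	unsigned int freeze_priority;
	bool frozen;
};

/* A task is only attempted once the round number reaches its priority. */
static bool eligible(const struct toy_task *t, unsigned int round)
{
	assert(t);
	return !t->frozen && t->freeze_priority <= round;
}

/*
 * Run one freeze round over the task array. Assumes every eligible task
 * freezes successfully; returns how many tasks were frozen in this round.
 */
static size_t freeze_round(struct toy_task *tasks, size_t n, unsigned int round)
{
	size_t frozen = 0;

	for (size_t i = 0; i < n; i++) {
		if (eligible(&tasks[i], round)) {
			tasks[i].frozen = true;
			frozen++;
		}
	}
	return frozen;
}
```

In this model a task with priority 0 is attempted in round 0, before any
task with priority 1 is touched, which is the ordering effect the series
relies on.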

Since we cannot know in advance which tasks will be difficult to freeze, 
we use heuristics:
     •    Any task that causes freeze failure or is found in D state 
during the freeze window is treated as hard-to-freeze in the next 
attempt and its priority is increased.
     •    Additionally, users can manually raise or lower the freeze
priority of known problematic tasks via the exposed
/proc/<pid>/freeze_priority interface, giving them fine-grained control.

This doesn’t change the fundamental logic of the freezer (it still
retries until all tasks are frozen), but by adjusting the traversal order
we’ve observed significantly fewer retries and more reliable success in
scenarios where these D state transitions occur.


* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-08  1:13   ` Zihuan Zhang
  2025-08-08  7:00     ` Michal Hocko
@ 2025-08-08  7:57     ` Oleg Nesterov
  2025-08-08  8:40       ` Zihuan Zhang
  1 sibling, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-08  7:57 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Michal Hocko, Rafael J . Wysocki, Peter Zijlstra,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/08, Zihuan Zhang wrote:
>
> On 2025/8/7 21:25, Michal Hocko wrote:
> >If they are running in the userspace and e.g. sleeping while not
> >TASK_FREEZABLE then priority simply makes no difference. And if they are
> >TASK_FREEZABLE then the priority doesn't matter either.
> >
> >What am I missing?

I too do not understand how this series can improve the freezer.

> under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the
> freezing() signal, and enters the refrigerator in a timely manner,

Note that __freeze_task() won't even send a signal to a sleeping
TASK_FREEZABLE task, __freeze_task() will just change its state to
TASK_FROZEN.
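
As a rough illustration of that distinction, here is a toy userspace model.
It is not the kernel implementation: `struct model_task`,
`enum model_action` and `model_freeze_task()` are hypothetical names, and
the real state checks are reduced to two booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy task state; the field names are illustrative, not kernel fields. */
struct model_task {
	bool sleeping;   /* blocked and not on a runqueue */
	bool freezable;  /* the sleep was marked as freezable */
	bool frozen;
};

enum model_action {
	MODEL_FROZEN_IN_PLACE,  /* state flipped directly, no wakeup sent */
	MODEL_NEEDS_WAKEUP,     /* task has to run and freeze itself */
};

/*
 * A sleeping, freezable task is marked frozen where it sleeps. Any other
 * task has to be woken so that it can enter the refrigerator on its own.
 */
static enum model_action model_freeze_task(struct model_task *t)
{
	assert(t);
	if (t->sleeping && t->freezable) {
		t->frozen = true;
		return MODEL_FROZEN_IN_PLACE;
	}
	return MODEL_NEEDS_WAKEUP;
}
```

In this model the second case is where a task in uninterruptible sleep
without the freezable marking would stay unfrozen until it wakes up by
itself.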

Oleg.




* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-08  7:57     ` Oleg Nesterov
@ 2025-08-08  8:40       ` Zihuan Zhang
  0 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-08  8:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Michal Hocko, Rafael J . Wysocki, Peter Zijlstra,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

Hi,

On 2025/8/8 15:57, Oleg Nesterov wrote:
> On 08/08, Zihuan Zhang wrote:
>> On 2025/8/7 21:25, Michal Hocko wrote:
>>> If they are running in the userspace and e.g. sleeping while not
>>> TASK_FREEZABLE then priority simply makes no difference. And if they are
>>> TASK_FREEZABLE then the priority doesn't matter either.
>>>
>>> What am I missing?
> I too do not understand how this series can improve the freezer.

Thanks for your question — actually, I just replied to Michal with a 
similar explanation, but I really appreciate you raising the same point, 
so let me add a bit more context here.

Right now, we're trying to address the case where certain tasks fail to 
freeze (often due to short-lived D-state issues). Our current workaround 
is to increase the number of freeze iterations in the next suspend 
attempt for those tasks.

While this isn't a perfect solution, the overhead of a few extra 
iterations is minimal compared to the cost of retrying the whole suspend 
cycle due to a stuck D-state task. So for now, we believe this is a 
reasonable tradeoff until we find a more deterministic way to 
preemptively detect and prioritize problematic tasks.

Happy to hear your thoughts or suggestions if you think there's a better 
direction to explore.

>> under ideal conditions, if a userspace task is TASK_FREEZABLE, receives the
>> freezing() signal, and enters the refrigerator in a timely manner,
> Note that __freeze_task() won't even send a signal to a sleeping
> TASK_FREEZABLE task, __freeze_task() will just change its state to
> TASK_FROZEN.
>
> Oleg.
>
You are right.


* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-08  7:52       ` Zihuan Zhang
@ 2025-08-08  8:58         ` Michal Hocko
  2025-08-11  9:13           ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2025-08-08  8:58 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
> 
> On 2025/8/8 15:00, Michal Hocko wrote:
> > On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
> > [...]
> > > However, in practice, we’ve observed cases where tasks appear stuck in
> > > uninterruptible sleep (D state) during the freeze phase  — and thus cannot
> > > respond to signals or enter the refrigerator. These tasks are technically
> > > TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
> > > freeze promptly, and may require multiple retry rounds, or cause the entire
> > > suspend to fail.
> > Right, but that is an inherent problem of the freezer implementation.
> > It is not really clear to me how priorities or layers improve on that.
> > Could you please elaborate on that?
> 
> Thanks for the follow-up.
> 
> From our observations, we’ve seen processes like Xorg that are in a normal
> state before freezing begins, but enter D state during the freeze window.
> Upon investigation,
> 
> we found that these processes often depend on other user processes (e.g.,
> I/O helpers or system services), and when those dependencies are frozen
> first, the dependent process (like Xorg) gets stuck and can’t be frozen
> itself.

OK, I see.

> This led us to treat such processes as “hard to freeze” tasks — not because
> they’re inherently unfreezable, but because they are more likely to become
> problematic if not frozen early enough.
> 
> So our model works as follows:
>     •    By default, freezer tries to freeze all freezable tasks in each
> round.
>     •    With our approach, we only attempt to freeze tasks whose
> freeze_priority is less than or equal to the current round number.
>     •    This ensures that higher-priority (i.e., harder-to-freeze) tasks
> are attempted earlier, increasing the chance that they freeze before being
> blocked by others.
> 
> Since we cannot know in advance which tasks will be difficult to freeze, we
> use heuristics:
>     •    Any task that causes freeze failure or is found in D state during
> the freeze window is treated as hard-to-freeze in the next attempt and its
> priority is increased.
>     •    Additionally, users can manually raise/reduce the freeze priority
> of known problematic tasks via the exposed /proc/<pid>/freeze_priority
> interface, giving them fine-grained control.

This would have been very useful information for the changelog so that
we can understand what you are trying to achieve.

> This doesn’t change the fundamental logic of the freezer — it still retries
> until all tasks are frozen — but by adjusting the traversal order,
> 
>  we’ve observed significantly fewer retries and more reliable success in
> scenarios where these D state transitions occur.
 
OK, I believe I do understand what you are trying to achieve but I am
not convinced this is a robust way to deal with the problem. This all
seems highly timing specific; it might work in a very specific use case,
but you are essentially trying to fight tiny race windows with a very
probabilistic interface.

Also the interface seems to be really coarse grained and it can easily
turn out insufficient for other usecases while it is not entirely clear
to me how this could be extended for those.

I believe it would be more useful to find sources of those freezer
blockers and try to address those. Making more blocked tasks
__set_task_frozen compatible sounds like a general improvement in
itself.

Thanks
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks
  2025-08-07 12:14 ` [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks Zihuan Zhang
@ 2025-08-08 14:29   ` Oleg Nesterov
  2025-08-11  9:29     ` Zihuan Zhang
  2025-08-12  8:07     ` Zihuan Zhang
  0 siblings, 2 replies; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-08 14:29 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/07, Zihuan Zhang wrote:
>
> @@ -6980,6 +6981,7 @@ void __noreturn do_task_dead(void)
>  	current->flags |= PF_NOFREEZE;
>
>  	__schedule(SM_NONE);
> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
>  	BUG();

But this change has no effect?

Firstly, this last __schedule() should not return, note the BUG() we have.

Secondly, this zombie is already PF_NOFREEZE, freeze_task() will return
false anyway.

Oleg.




* Re: [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-07 12:14 ` [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes Zihuan Zhang
@ 2025-08-08 14:39   ` Oleg Nesterov
  2025-08-11  9:25     ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-08 14:39 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/07, Zihuan Zhang wrote:
>
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -147,6 +147,7 @@ int freeze_processes(void)
>
>  	pm_wakeup_clear(0);
>  	pm_freezing = true;
> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);

But why?

Again, freeze_task() will return false anyway, this process is
PF_SUSPEND_TASK.

> @@ -218,6 +219,7 @@ void thaw_processes(void)
>  	WARN_ON(!(curr->flags & PF_SUSPEND_TASK));
>  	curr->flags &= ~PF_SUSPEND_TASK;
>
> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NORMAL);
>  	usermodehelper_enable();

What if current->freeze_priority was changed via
/proc/pid/freeze_priority you add in 9/9 ?

Oleg.




* Re: [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time
  2025-08-07 12:14 ` [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time Zihuan Zhang
@ 2025-08-08 14:53   ` Oleg Nesterov
  2025-08-11  9:31     ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-08 14:53 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/07, Zihuan Zhang wrote:
>
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -111,8 +111,10 @@ static int try_to_freeze_tasks(bool user_only)
>  		if (!wakeup || pm_debug_messages_on) {
>  			read_lock(&tasklist_lock);
>  			for_each_process_thread(g, p) {
> -				if (p != current && freezing(p) && !frozen(p))
> +				if (p != current && freezing(p) && !frozen(p)) {
> +					freeze_set_default_priority(p, FREEZE_PRIORITY_HIGH);
>  					sched_show_task(p);
> +				}

IMO, this change should go into 3/9 to make the intent more clear.

Other than that, I leave this series to maintainers and other reviewers.
Personally, I am skeptical.

Oleg.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-08  8:58         ` Michal Hocko
@ 2025-08-11  9:13           ` Zihuan Zhang
  2025-08-11 10:58             ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-11  9:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/8 16:58, Michal Hocko wrote:
> On Fri 08-08-25 15:52:31, Zihuan Zhang wrote:
>> On 2025/8/8 15:00, Michal Hocko wrote:
>>> On Fri 08-08-25 09:13:30, Zihuan Zhang wrote:
>>> [...]
>>>> However, in practice, we’ve observed cases where tasks appear stuck in
>>>> uninterruptible sleep (D state) during the freeze phase  — and thus cannot
>>>> respond to signals or enter the refrigerator. These tasks are technically
>>>> TASK_FREEZABLE, but due to the nature of their sleep state, they don’t
>>>> freeze promptly, and may require multiple retry rounds, or cause the entire
>>>> suspend to fail.
>>> Right, but that is an inherent problem of the freezer implementation.
>>> It is not really clear to me how priorities or layers improve on that.
>>> Could you please elaborate on that?
>> Thanks for the follow-up.
>>
>> From our observations, we’ve seen processes like Xorg that are in a normal
>> state before freezing begins, but enter D state during the freeze window.
>> Upon investigation, we found that these processes often depend on other
>> user processes (e.g., I/O helpers or system services), and when those
>> dependencies are frozen first, the dependent process (like Xorg) gets stuck
>> and can’t be frozen itself.
> OK, I see.
>
>> This led us to treat such processes as “hard to freeze” tasks — not because
>> they’re inherently unfreezable, but because they are more likely to become
>> problematic if not frozen early enough.
>>
>> So our model works as follows:
>>      •    By default, freezer tries to freeze all freezable tasks in each
>> round.
>>      •    With our approach, we only attempt to freeze tasks whose
>> freeze_priority is less than or equal to the current round number.
>>      •    This ensures that higher-priority (i.e., harder-to-freeze) tasks
>> are attempted earlier, increasing the chance that they freeze before being
>> blocked by others.
>>
>> Since we cannot know in advance which tasks will be difficult to freeze, we
>> use heuristics:
>>      •    Any task that causes freeze failure or is found in D state during
>> the freeze window is treated as hard-to-freeze in the next attempt and its
>> priority is increased.
>>      •    Additionally, users can manually raise/reduce the freeze priority
>> of known problematic tasks via an exposed sysfs interface, giving them
>> fine-grained control.
> This would have been a very useful information for the changelog so that
> we can understand what you are trying to achieve.
>
Got it, I’ll add that info to the changelog. Thanks!
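
As a rough illustration of the round-based selection described in the quoted
text, here is a minimal user-space model in C. It is a sketch under stated
assumptions, not the code from the series: the numeric priority values, the
struct model_task type and model_freeze_round() are hypothetical, and the
model pretends that every attempted task freezes immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical values: the series names these levels, but the thread does
 * not show their numbers. A lower value means "attempted in an earlier round".
 */
enum model_freeze_priority {
	FREEZE_PRIORITY_HIGH   = 1,
	FREEZE_PRIORITY_NORMAL = 2,
	FREEZE_PRIORITY_NEVER  = 1000,
};

struct model_task {
	const char *comm;
	int freeze_priority;
	bool frozen;
};

/*
 * One freezer round: skip tasks that are already frozen or whose priority
 * value is still above the round number, and freeze the rest.
 * Returns how many tasks were attempted in this round.
 */
static size_t model_freeze_round(struct model_task *tasks, size_t n, int round)
{
	size_t attempted = 0;

	assert(round >= 1);
	for (size_t i = 0; i < n; i++) {
		struct model_task *p = &tasks[i];

		if (p->frozen || round < p->freeze_priority)
			continue;
		p->frozen = true;	/* model only: an attempted task freezes at once */
		attempted++;
	}
	return attempted;
}
```

With a HIGH task (say Xorg) and a NORMAL task (one of its helpers), round 1
attempts only the HIGH task and round 2 picks up the NORMAL one, which is the
ordering effect the model relies on.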
>> This doesn’t change the fundamental logic of the freezer (it still retries
>> until all tasks are frozen), but by adjusting the traversal order we’ve
>> observed significantly fewer retries and more reliable success in
>> scenarios where these D state transitions occur.
>
> OK, I believe I do understand what you are trying to achieve but I am
> not convinced this is a robust way to deal with the problem. This all
> seems highly timing specific that might work in very specific usecase
> but you are essentially trying to fight tiny race windows with a very
> probabilistic interface.

Actually, our approach does not conflict with solving the problem. We 
plan to keep the freeze priority mechanism disabled by default and only 
enable it when issues arise, so as to maintain the consistency of the 
existing code flow as much as possible. It acts like a fallback mechanism.

We acknowledge that the causes of D-state tasks are complex and require 
high effort to fully resolve, which the current freezer mechanism cannot 
achieve. Our solution is low-cost and able to capture some problematic 
tasks effectively.

> Also the interface seems to be really coarse grained and it can easily
> turn out insufficient for other usecases while it is not entirely clear
> to me how this could be extended for those.
  We recognize that the current interface is relatively coarse-grained 
and may not be sufficient for all scenarios. The present implementation 
is a basic version.

Our plan is to introduce a classification-based mechanism that assigns 
different freeze priorities according to process categories. For 
example, filesystem and graphics-related processes will be given higher 
default freeze priority, as they are critical in the freezing workflow. 
This classification approach helps target important processes more 
precisely.

However, this requires further testing and refinement before full 
deployment. We believe this incremental, category-based design will make 
the mechanism more effective and adaptable over time while keeping it 
manageable.
> I believe it would be more useful to find sources of those freezer
> blockers and try to address those. Making more blocked tasks
> __set_task_frozen compatible sounds like a general improvement in
> itself.

We have already identified some causes of D-state tasks, many of which
are related to the filesystem. On some systems, certain processes 
frequently execute ext4_sync_file, and under contention this can lead to 
D-state tasks.

[ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
[ 6616.650485] Call Trace:
[ 6616.650486]  <TASK>
[ 6616.650489]  __schedule+0x532/0xea0
[ 6616.650494]  schedule+0x27/0x80
[ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502]  ext4_sync_file+0x1ba/0x380
[ 6616.650505]  do_fsync+0x3b/0x80
[ 6616.650507]  __x64_sys_fdatasync+0x17/0x20
[ 6616.650509]  do_syscall_64+0x7d/0x2c0
[ 6616.650512]  ? syscall_exit_work+0x108/0x140
[ 6616.650515]  ? do_syscall_64+0x1f3/0x2c0
[ 6616.650517]  ? syscall_exit_work+0x108/0x140
[ 6616.650519]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650522]  ? audit_reset_context.part.0+0x284/0x2f0
[ 6616.650524]  ? syscall_exit_work+0x108/0x140
[ 6616.650527]  ? do_syscall_64+0x1f3/0x2c0
[ 6616.650529]  ? futex_unqueue+0x4e/0x80
[ 6616.650531]  ? __futex_wait+0x9b/0x100
[ 6616.650534]  ? __pfx_futex_wake_mark+0x10/0x10
[ 6616.650536]  ? timerqueue_del+0x2e/0x50
[ 6616.650539]  ? __remove_hrtimer+0x39/0x70
[ 6616.650542]  ? hrtimer_try_to_cancel+0x85/0x100
[ 6616.650544]  ? hrtimer_cancel+0x15/0x30
[ 6616.650546]  ? futex_wait+0x7d/0x110
[ 6616.650549]  ? __pfx_hrtimer_wakeup+0x10/0x10
[ 6616.650552]  ? audit_reset_context.part.0+0x284/0x2f0
[ 6616.650554]  ? syscall_exit_work+0x108/0x140
[ 6616.650556]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650558]  ? switch_fpu_return+0x4f/0xd0
[ 6616.650560]  ? do_syscall_64+0x1d5/0x2c0
[ 6616.650563]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 6616.650565] RIP: 0033:0x7f095ef8f3eb
[ 6616.650567] RSP: 002b:00007f07409fa360 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 6616.650569] RAX: ffffffffffffffda RBX: 00000d38021f03a0 RCX: 00007f095ef8f3eb
[ 6616.650570] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000009a
[ 6616.650571] RBP: 00007f07409fa410 R08: 0000000000000000 R09: 00007f07409fa570
[ 6616.650572] R10: 00007f0960a60000 R11: 0000000000000293 R12: 00000d38021f0380
[ 6616.650573] R13: 000055c28c70b400 R14: 00007f07409fa3a0 R15: 00007f07409fa380


While the kernel already supports freezing the filesystem, which can
address this problem, it is quite expensive: enabling this feature
increases the suspend time by about 3 to 4 seconds in our tests. We are
therefore exploring lower-cost approaches to mitigate the issue without
such a heavy performance impact.

root@zzhwaxy-pc:/sys/power# echo 1 > freeze_filesystems
root@zzhwaxy-pc:/sys/power# sudo dmesg | grep -E 'suspend'
[ 9844.984658] PM: suspend entry (deep)
[ 9850.998197] PM: suspend exit

root@zzhwaxy-pc:/sys/power# echo 0 > freeze_filesystems
root@zzhwaxy-pc:/sys/power# sudo dmesg | grep -E 'suspend'
[ 9893.928486] PM: suspend entry (deep)
[ 9896.239425] PM: suspend exit
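
For reference, the difference between the two runs can be worked out from
the dmesg timestamps above. The C sketch below does that arithmetic; the
helper names dmesg_timestamp() and suspend_duration() are made up for this
example, and it only measures the interval between the "suspend entry" and
"suspend exit" lines as printed, nothing more.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the "[ seconds.usecs]" prefix of a dmesg line; returns 0 on success. */
static int dmesg_timestamp(const char *line, double *out)
{
	assert(line && out);
	return sscanf(line, " [ %lf ]", out) == 1 ? 0 : -1;
}

/*
 * Interval in seconds between a "suspend entry" line and a "suspend exit"
 * line. Returns a negative value if either timestamp cannot be parsed.
 */
static double suspend_duration(const char *entry_line, const char *exit_line)
{
	double start, end;

	if (dmesg_timestamp(entry_line, &start) || dmesg_timestamp(exit_line, &end))
		return -1.0;
	return end - start;
}
```

Applied to the two runs above, this gives roughly 6.01 seconds with
freeze_filesystems set to 1 and roughly 2.31 seconds with it set to 0, a
difference of about 3.7 seconds.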

> Thanks


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-08 14:39   ` Oleg Nesterov
@ 2025-08-11  9:25     ` Zihuan Zhang
  2025-08-11  9:32       ` Oleg Nesterov
  0 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-11  9:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/8 22:39, Oleg Nesterov wrote:
> On 08/07, Zihuan Zhang wrote:
>> --- a/kernel/power/process.c
>> +++ b/kernel/power/process.c
>> @@ -147,6 +147,7 @@ int freeze_processes(void)
>>
>>   	pm_wakeup_clear(0);
>>   	pm_freezing = true;
>> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
> But why?
>
> Again, freeze_task() will return false anyway, this process is
> PF_SUSPEND_TASK.

I think there is a reason to put it here. For example, systemd-sleep is a
user-space process that executes the suspend flow.

If we don’t set its freeze priority explicitly, our current code may
end up with this user process being the last one that cannot freeze.

Of course, we could adjust the code logic to handle it differently, but 
in our model its freeze priority is considered the lowest, so setting it 
here ensures consistent behavior.


>> @@ -218,6 +219,7 @@ void thaw_processes(void)
>>   	WARN_ON(!(curr->flags & PF_SUSPEND_TASK));
>>   	curr->flags &= ~PF_SUSPEND_TASK;
>>
>> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NORMAL);
>>   	usermodehelper_enable();
> What if current->freeze_priority was changed via
> /proc/pid/freeze_priority you add in 9/9 ?

Sorry, my oversight. You are right — in this case we probably should not 
allow user space to change the freeze_priority of a PF_SUSPEND_TASK. 
This would avoid unintended behavior during suspend.

>
> Oleg.
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks
  2025-08-08 14:29   ` Oleg Nesterov
@ 2025-08-11  9:29     ` Zihuan Zhang
  2025-08-11  9:42       ` Oleg Nesterov
  2025-08-12  8:07     ` Zihuan Zhang
  1 sibling, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-11  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/8 22:29, Oleg Nesterov wrote:
> On 08/07, Zihuan Zhang wrote:
>> @@ -6980,6 +6981,7 @@ void __noreturn do_task_dead(void)
>>   	current->flags |= PF_NOFREEZE;
>>
>>   	__schedule(SM_NONE);
>> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
>>   	BUG();
> But this change has no effect?
>
> Firstly, this last __schedule() should not return, note the BUG() we have.
>
> Secondly, this zombie is already PF_NOFREEZE, freeze_task() will return
> false anyway.
Sorry, but in our tests with a large number of zombie tasks, returning 
early reduced the overhead. Even though freeze_task() would return false 
for PF_NOFREEZE, skipping the extra path still saved time in our 
suspend/freezer loop.
> Oleg.
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time
  2025-08-08 14:53   ` Oleg Nesterov
@ 2025-08-11  9:31     ` Zihuan Zhang
  0 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-11  9:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/8 22:53, Oleg Nesterov wrote:
>> -				if (p != current && freezing(p) && !frozen(p))
>> +				if (p != current && freezing(p) && !frozen(p)) {
>> +					freeze_set_default_priority(p, FREEZE_PRIORITY_HIGH);
>>   					sched_show_task(p);
>> +				}
> IMO, this change should go into 3/9 to make the intent more clear.
Will do. Indeed, D-state tasks are not easy to solve, so our plan is to 
provide a fallback-like mechanism.
> Other than that, I leave this series to maintainers and other reviewers.
> Personally, I am skeptical.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-11  9:25     ` Zihuan Zhang
@ 2025-08-11  9:32       ` Oleg Nesterov
  2025-08-11  9:42         ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-11  9:32 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/11, Zihuan Zhang wrote:
> 
> On 2025/8/8 22:39, Oleg Nesterov wrote:
> >On 08/07, Zihuan Zhang wrote:
> >>--- a/kernel/power/process.c
> >>+++ b/kernel/power/process.c
> >>@@ -147,6 +147,7 @@ int freeze_processes(void)
> >>
> >>  	pm_wakeup_clear(0);
> >>  	pm_freezing = true;
> >>+	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
> >But why?
> >
> >Again, freeze_task() will return false anyway, this process is
> >PF_SUSPEND_TASK.
>
> I think there is a reason to put it here. For example, systemd-sleep is a
> user-space process that executes the suspend flow.
>
>  If we don’t set its freeze priority explicitly, our current code may end up
> with this user process being the last one that cannot freeze.

How so? sorry I don't follow.

Oleg.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-11  9:32       ` Oleg Nesterov
@ 2025-08-11  9:42         ` Zihuan Zhang
  2025-08-11  9:46           ` Oleg Nesterov
  0 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-11  9:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/11 17:32, Oleg Nesterov wrote:
> On 08/11, Zihuan Zhang wrote:
>> On 2025/8/8 22:39, Oleg Nesterov wrote:
>>> On 08/07, Zihuan Zhang wrote:
>>>> --- a/kernel/power/process.c
>>>> +++ b/kernel/power/process.c
>>>> @@ -147,6 +147,7 @@ int freeze_processes(void)
>>>>
>>>>   	pm_wakeup_clear(0);
>>>>   	pm_freezing = true;
>>>> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
>>> But why?
>>>
>>> Again, freeze_task() will return false anyway, this process is
>>> PF_SUSPEND_TASK.
>> I think there is a reason to put it here. For example, systemd-sleep is a
>> user-space process that executes the suspend flow.
>>
>>   If we don’t set its freeze priority explicitly, our current code may end up
>> with this user process being the last one that cannot freeze.
> How so? sorry I don't follow.

The problem is in this part:

+            if (user_only && !(p->flags & PF_KTHREAD) && round < p->freeze_priority)
+                continue;

The PF_SUSPEND_TASK task is a user process, so it meets the “needs freezing”
condition and todo gets incremented. But it does not actually need to
freeze, which results in an infinite loop.

> Oleg.
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks
  2025-08-11  9:29     ` Zihuan Zhang
@ 2025-08-11  9:42       ` Oleg Nesterov
  0 siblings, 0 replies; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-11  9:42 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/11, Zihuan Zhang wrote:
> 
> On 2025/8/8 22:29, Oleg Nesterov wrote:
> >On 08/07, Zihuan Zhang wrote:
> >>@@ -6980,6 +6981,7 @@ void __noreturn do_task_dead(void)
> >>  	current->flags |= PF_NOFREEZE;
> >>
> >>  	__schedule(SM_NONE);
> >>+	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
> >>  	BUG();
> >But this change has no effect?
> >
> >Firstly, this last __schedule() should not return, note the BUG() we have.
> >
> >Secondly, this zombie is already PF_NOFREEZE, freeze_task() will return
> >false anyway.
> Sorry, but in our tests with a large number of zombie tasks, returning early
> reduced the overhead. Even though freeze_task() would return false for
> PF_NOFREEZE, skipping the extra path still saved time in our suspend/freezer

https://lore.kernel.org/all/20250707084214.GD1613200@noisy.programming.kicks-ass.net/

Anyway the patch makes no sense in its current form, see my note
about __schedule() above.

Oleg.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-11  9:42         ` Zihuan Zhang
@ 2025-08-11  9:46           ` Oleg Nesterov
  2025-08-11  9:54             ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Oleg Nesterov @ 2025-08-11  9:46 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On 08/11, Zihuan Zhang wrote:
> 
> On 2025/8/11 17:32, Oleg Nesterov wrote:
> >On 08/11, Zihuan Zhang wrote:
> >>On 2025/8/8 22:39, Oleg Nesterov wrote:
> >>>On 08/07, Zihuan Zhang wrote:
> >>>>--- a/kernel/power/process.c
> >>>>+++ b/kernel/power/process.c
> >>>>@@ -147,6 +147,7 @@ int freeze_processes(void)
> >>>>
> >>>>  	pm_wakeup_clear(0);
> >>>>  	pm_freezing = true;
> >>>>+	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
> >>>But why?
> >>>
> >>>Again, freeze_task() will return false anyway, this process is
> >>>PF_SUSPEND_TASK.
> >>I think there is a reason to put it here. For example, systemd-sleep is a
> >>user-space process that executes the suspend flow.
> >>
> >>  If we don’t set its freeze priority explicitly, our current code may end up
> >>with this user process being the last one that cannot freeze.
> >How so? sorry I don't follow.
>
> The problem is in this part:
>
> +            if (user_only && !(p->flags & PF_KTHREAD) && round <
> p->freeze_priority)
> +                continue;
>
> PF_SUSPEND_TASK is a user process, so it meets the “needs freezing”
> condition and todo gets incremented.
            ^^^^^^^^^^^^^^^^^^^^^^^^^

No.
	if (p == current || !freeze_task(p))
		continue;

	todo++;

Again, again, freeze_task(p) returns false.
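
To make the point concrete, here is a small user-space model in C of the
loop body quoted above. It is only a sketch: the flag values are
placeholders, struct model_task, model_freeze_task() and count_todo() are
invented for the example, and the real freeze_task() does considerably more.
The one property it mirrors is that a task carrying PF_NOFREEZE or
PF_SUSPEND_TASK is refused, so it never reaches todo++.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder flag values for this model, not the kernel's definitions. */
#define MODEL_PF_NOFREEZE	0x1u
#define MODEL_PF_SUSPEND_TASK	0x2u

struct model_task {
	unsigned int flags;
};

/* Simplified stand-in for freeze_task(): exempt tasks are refused. */
static bool model_freeze_task(const struct model_task *p)
{
	if (p->flags & (MODEL_PF_NOFREEZE | MODEL_PF_SUSPEND_TASK))
		return false;
	return true;
}

/*
 * Mirrors the quoted loop body: skip the caller itself and anything that
 * model_freeze_task() refuses, and count only the remaining tasks.
 */
static size_t count_todo(const struct model_task *tasks, size_t n,
			 const struct model_task *current_task)
{
	size_t todo = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct model_task *p = &tasks[i];

		if (p == current_task || !model_freeze_task(p))
			continue;
		todo++;
	}
	return todo;
}
```

In this model a task with MODEL_PF_SUSPEND_TASK set contributes nothing to
todo regardless of its freeze priority, so it cannot keep the retry loop
alive.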

> But it actually doesn’t need to freeze,
> so resulting in an infinite loop

I don't think so.

Oleg.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes
  2025-08-11  9:46           ` Oleg Nesterov
@ 2025-08-11  9:54             ` Zihuan Zhang
  0 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-11  9:54 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/11 17:46, Oleg Nesterov wrote:
> On 08/11, Zihuan Zhang wrote:
>> On 2025/8/11 17:32, Oleg Nesterov wrote:
>>> On 08/11, Zihuan Zhang wrote:
>>>> On 2025/8/8 22:39, Oleg Nesterov wrote:
>>>>> On 08/07, Zihuan Zhang wrote:
>>>>>> --- a/kernel/power/process.c
>>>>>> +++ b/kernel/power/process.c
>>>>>> @@ -147,6 +147,7 @@ int freeze_processes(void)
>>>>>>
>>>>>>   	pm_wakeup_clear(0);
>>>>>>   	pm_freezing = true;
>>>>>> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
>>>>> But why?
>>>>>
>>>>> Again, freeze_task() will return false anyway, this process is
>>>>> PF_SUSPEND_TASK.
>>>> I think there is a reason to put it here. For example, systemd-sleep is a
>>>> user-space process that executes the suspend flow.
>>>>
>>>>   If we don’t set its freeze priority explicitly, our current code may end up
>>>> with this user process being the last one that cannot freeze.
>>> How so? sorry I don't follow.
>> The problem is in this part:
>>
>> +            if (user_only && !(p->flags & PF_KTHREAD) && round <
>> p->freeze_priority)
>> +                continue;
>>
>> PF_SUSPEND_TASK is a user process, so it meets the “needs freezing”
>> condition and todo gets incremented.
>              ^^^^^^^^^^^^^^^^^^^^^^^^^
>
> No.
> 	if (p == current || !freeze_task(p))
> 		continue;
>
> 	todo++;
>
> Again, again, freeze_task(p) returns false.
>
>> But it actually doesn’t need to freeze,
>> so resulting in an infinite loop
> I don't think so.
>
> Oleg.
Sorry, you’re right — it’s indeed unnecessary. In an earlier version, I 
incremented the counter before the continue, but I later removed that 
and forgot about it.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-11  9:13           ` Zihuan Zhang
@ 2025-08-11 10:58             ` Michal Hocko
  2025-08-12  5:57               ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2025-08-11 10:58 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
> 
> On 2025/8/8 16:58, Michal Hocko wrote:
[...]
> > Also the interface seems to be really coarse grained and it can easily
> > turn out insufficient for other usecases while it is not entirely clear
> > to me how this could be extended for those.
>  We recognize that the current interface is relatively coarse-grained and
> may not be sufficient for all scenarios. The present implementation is a
> basic version.
> 
> Our plan is to introduce a classification-based mechanism that assigns
> different freeze priorities according to process categories. For example,
> filesystem and graphics-related processes will be given higher default
> freeze priority, as they are critical in the freezing workflow. This
> classification approach helps target important processes more precisely.
> 
> However, this requires further testing and refinement before full
> deployment. We believe this incremental, category-based design will make the
> mechanism more effective and adaptable over time while keeping it
> manageable.

Unless there is a clear path for a more extendable interface then
introducing this one is a no-go. We do not want to grow different ways
to establish freezing policies.

But much more fundamentally, so far I haven't really seen any argument
why different priorities help with the underlying problem, other than that
the timing might be slightly different if you change the order of freezing.
To me this sounds like the proposed scheme mostly works around the
problem you are seeing and as such is not a really good candidate to be
merged as a long term solution. Not to mention that it comes with a user
API that needs to be maintained forever.

So NAK from me on the interface.

> > I believe it would be more useful to find sources of those freezer
> > blockers and try to address those. Making more blocked tasks
> > __set_task_frozen compatible sounds like a general improvement in
> > itself.
> 
> we have already identified some causes of D-state tasks, many of which are
> related to the filesystem. On some systems, certain processes frequently
> execute ext4_sync_file, and under contention this can lead to D-state tasks.

Please work with maintainers of those subsystems to find proper
solutions.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-11 10:58             ` Michal Hocko
@ 2025-08-12  5:57               ` Zihuan Zhang
  2025-08-12 17:26                 ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-12  5:57 UTC (permalink / raw)
  To: Michal Hocko, Theodore Ts'o, Jan Kara
  Cc: Rafael J . Wysocki, Peter Zijlstra, Oleg Nesterov,
	David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel, linux-ext4

Hi all,

We encountered an issue where the number of freeze retries increased due 
to processes stuck in D state. The logs point to jbd2-related activity.

log1:

[ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
[ 6616.650485] Call Trace:
[ 6616.650486]  <TASK>
[ 6616.650489]  __schedule+0x532/0xea0
[ 6616.650494]  schedule+0x27/0x80
[ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
[ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6616.650502]  ext4_sync_file+0x1ba/0x380
[ 6616.650505]  do_fsync+0x3b/0x80

log2:

[  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
[  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
[  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
[  631.262167] Filesystems sync: 0.424 seconds
[  631.262821] Freezing user space processes
[  631.263839] freeze round: 1, task to freeze: 852
[  631.265128] freeze round: 2, task to freeze: 2
[  631.267039] freeze round: 3, task to freeze: 2
[  631.271176] freeze round: 4, task to freeze: 2
[  631.279160] freeze round: 5, task to freeze: 2
[  631.287152] freeze round: 6, task to freeze: 2
[  631.295346] freeze round: 7, task to freeze: 2
[  631.301747] freeze round: 8, task to freeze: 2
[  631.309346] freeze round: 9, task to freeze: 2
[  631.317353] freeze round: 10, task to freeze: 2
[  631.325348] freeze round: 11, task to freeze: 2
[  631.333353] freeze round: 12, task to freeze: 2
[  631.341358] freeze round: 13, task to freeze: 2
[  631.349357] freeze round: 14, task to freeze: 2
[  631.357363] freeze round: 15, task to freeze: 2
[  631.365361] freeze round: 16, task to freeze: 2
[  631.373379] freeze round: 17, task to freeze: 2
[  631.381366] freeze round: 18, task to freeze: 2
[  631.389365] freeze round: 19, task to freeze: 2
[  631.397371] freeze round: 20, task to freeze: 2
[  631.405373] freeze round: 21, task to freeze: 2
[  631.413373] freeze round: 22, task to freeze: 2
[  631.421392] freeze round: 23, task to freeze: 1
[  631.429948] freeze round: 24, task to freeze: 1
[  631.438295] freeze round: 25, task to freeze: 1
[  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
[  631.446387] freeze round: 26, task to freeze: 0
[  631.446390] Freezing user space processes completed (elapsed 0.183 
seconds)
[  631.446392] OOM killer disabled.
[  631.446393] Freezing remaining freezable tasks
[  631.446656] freeze round: 1, task to freeze: 4
[  631.447976] freeze round: 2, task to freeze: 0
[  631.447978] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[  631.447980] PM: suspend debug: Waiting for 1 second(s).
[  632.450858] OOM killer enabled.
[  632.450859] Restarting tasks: Starting
[  632.453140] Restarting tasks: Done
[  632.453173] random: crng reseeded on system resumption
[  632.453370] PM: suspend exit
[  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
[  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)

This is the reason:

[  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)


During freezing, user processes executing jbd2_log_wait_commit() enter D 
state because this function calls wait_event() and can take tens to 
hundreds of milliseconds to complete (0.249 seconds in the log above). 
While such a wait is in progress the task cannot be frozen, so the 
freezer keeps retrying until the wait finishes.
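
The round count in log2 is consistent with the freezer's retry backoff. 
The following is a rough userspace sketch, not kernel code. It assumes 
the freezer sleeps 1 ms after the first round, doubles the sleep up to 
8 ms, and that every sleep takes its full upper bound. The function name 
rounds_until_frozen() and the single-blocked-task model are made up for 
illustration.

```c
#define USEC_PER_MSEC 1000UL

/*
 * Simplified model: one task stays unfreezable for 'blocked_usecs'
 * after the first freeze round. Returns the number of the round in
 * which the freezer would first see nothing left to freeze.
 * Scan time per round is ignored (assumption of this sketch).
 */
unsigned int rounds_until_frozen(unsigned long blocked_usecs)
{
	unsigned long now = 0;                     /* time since round 1 */
	unsigned long sleep_usecs = USEC_PER_MSEC; /* initial 1 ms sleep */
	unsigned int round = 1;

	while (now < blocked_usecs) {
		now += sleep_usecs;                /* wait before retrying */
		if (sleep_usecs < 8 * USEC_PER_MSEC)
			sleep_usecs *= 2;          /* back off, capped at 8 ms */
		round++;
	}
	return round;
}
```

Under these assumptions a task that stays unfreezable for about 183 ms 
needs 26 rounds (1 + 2 + 4 + 22 * 8 = 183 ms of sleeping), which lines up 
with the 26 rounds and 0.183 seconds reported above.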

While we understand that jbd2 is a freezable kernel thread, we would 
like to know if there is a way to freeze it earlier or freeze some 
critical processes proactively to reduce this contention.

Thanks for your input and suggestions.

On 2025/8/11 18:58, Michal Hocko wrote:
> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>> On 2025/8/8 16:58, Michal Hocko wrote:
> [...]
>>> Also the interface seems to be really coarse grained and it can easily
>>> turn out insufficient for other usecases while it is not entirely clear
>>> to me how this could be extended for those.
>>   We recognize that the current interface is relatively coarse-grained and
>> may not be sufficient for all scenarios. The present implementation is a
>> basic version.
>>
>> Our plan is to introduce a classification-based mechanism that assigns
>> different freeze priorities according to process categories. For example,
>> filesystem and graphics-related processes will be given higher default
>> freeze priority, as they are critical in the freezing workflow. This
>> classification approach helps target important processes more precisely.
>>
>> However, this requires further testing and refinement before full
>> deployment. We believe this incremental, category-based design will make the
>> mechanism more effective and adaptable over time while keeping it
>> manageable.
> Unless there is a clear path for a more extendable interface then
> introducing this one is a no-go. We do not want to grow different ways
> to establish freezing policies.
>
> But much more fundamentally. So far I haven't really seen any argument
> why different priorities help with the underlying problem other than the
> timing might be slightly different if you change the order of freezing.
> This to me sounds like the proposed scheme mostly works around the
> problem you are seeing and as such is not a really good candidate to be
> merged as a long term solution. Not to mention with a user API that
> needs to be maintained for ever.
>
> So NAK from me on the interface.
>
Thanks for the feedback. I understand your concern that changing the 
freezer priority order looks like working around the symptom rather than 
solving the root cause.

Since the last discussion, we have analyzed the D-state processes 
further and identified that the long wait time is caused by 
jbd2_log_wait_commit. This wait happens because user tasks call into 
this function during fsync/fdatasync and it can take tens of 
milliseconds to complete. When this coincides with the freezer 
operation, the tasks are stuck in D state and retried multiple times, 
increasing the total freeze time.

Although we know that jbd2 is a freezable kernel thread, we are 
exploring whether freezing it earlier — or freezing certain key 
processes first — could reduce this contention and improve freeze 
completion time.


>>> I believe it would be more useful to find sources of those freezer
>>> blockers and try to address those. Making more blocked tasks
>>> __set_task_frozen compatible sounds like a general improvement in
>>> itself.
>> we have already identified some causes of D-state tasks, many of which are
>> related to the filesystem. On some systems, certain processes frequently
>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
> Please work with maintainers of those subsystems to find proper
> solutions.

We’ve pulled in the jbd2 maintainer to get feedback on whether changing 
the freeze ordering for jbd2 is safe or if there’s a better approach to 
avoid the repeated retries caused by this wait.



* Re: [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks
  2025-08-08 14:29   ` Oleg Nesterov
  2025-08-11  9:29     ` Zihuan Zhang
@ 2025-08-12  8:07     ` Zihuan Zhang
  1 sibling, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-12  8:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rafael J . Wysocki, Peter Zijlstra, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/8 22:29, Oleg Nesterov wrote:
> On 08/07, Zihuan Zhang wrote:
>> @@ -6980,6 +6981,7 @@ void __noreturn do_task_dead(void)
>>   	current->flags |= PF_NOFREEZE;
>>
>>   	__schedule(SM_NONE);
>> +	freeze_set_default_priority(current, FREEZE_PRIORITY_NEVER);
>>   	BUG();
> But this change has no effect?
>
> Firstly, this last __schedule() should not return, note the BUG() we have.
>
> Secondly, this zombie is already PF_NOFREEZE, freeze_task() will return
> false anyway.

Thanks for pointing that out.
You are right: in its current position the call has no effect, since the 
final __schedule() never returns and the task is already PF_NOFREEZE. 
Moving it to a more appropriate place should make it both safer and more 
useful than the current implementation.

> Oleg.
>



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-12  5:57               ` Zihuan Zhang
@ 2025-08-12 17:26                 ` Darrick J. Wong
  2025-08-13  5:48                   ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-08-12 17:26 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Michal Hocko, Theodore Ts'o, Jan Kara, Rafael J . Wysocki,
	Peter Zijlstra, Oleg Nesterov, David Hildenbrand, Jonathan Corbet,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, linux-ext4

On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
> Hi all,
> 
> We encountered an issue where the number of freeze retries increased due to
> processes stuck in D state. The logs point to jbd2-related activity.
> 
> log1:
> 
> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026
> tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
> [ 6616.650485] Call Trace:
> [ 6616.650486]  <TASK>
> [ 6616.650489]  __schedule+0x532/0xea0
> [ 6616.650494]  schedule+0x27/0x80
> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
> [ 6616.650505]  do_fsync+0x3b/0x80
> 
> log2:
> 
> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
> [  631.262167] Filesystems sync: 0.424 seconds
> [  631.262821] Freezing user space processes
> [  631.263839] freeze round: 1, task to freeze: 852
> [  631.265128] freeze round: 2, task to freeze: 2
> [  631.267039] freeze round: 3, task to freeze: 2
> [  631.271176] freeze round: 4, task to freeze: 2
> [  631.279160] freeze round: 5, task to freeze: 2
> [  631.287152] freeze round: 6, task to freeze: 2
> [  631.295346] freeze round: 7, task to freeze: 2
> [  631.301747] freeze round: 8, task to freeze: 2
> [  631.309346] freeze round: 9, task to freeze: 2
> [  631.317353] freeze round: 10, task to freeze: 2
> [  631.325348] freeze round: 11, task to freeze: 2
> [  631.333353] freeze round: 12, task to freeze: 2
> [  631.341358] freeze round: 13, task to freeze: 2
> [  631.349357] freeze round: 14, task to freeze: 2
> [  631.357363] freeze round: 15, task to freeze: 2
> [  631.365361] freeze round: 16, task to freeze: 2
> [  631.373379] freeze round: 17, task to freeze: 2
> [  631.381366] freeze round: 18, task to freeze: 2
> [  631.389365] freeze round: 19, task to freeze: 2
> [  631.397371] freeze round: 20, task to freeze: 2
> [  631.405373] freeze round: 21, task to freeze: 2
> [  631.413373] freeze round: 22, task to freeze: 2
> [  631.421392] freeze round: 23, task to freeze: 1
> [  631.429948] freeze round: 24, task to freeze: 1
> [  631.438295] freeze round: 25, task to freeze: 1
> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> [  631.446387] freeze round: 26, task to freeze: 0
> [  631.446390] Freezing user space processes completed (elapsed 0.183
> seconds)
> [  631.446392] OOM killer disabled.
> [  631.446393] Freezing remaining freezable tasks
> [  631.446656] freeze round: 1, task to freeze: 4
> [  631.447976] freeze round: 2, task to freeze: 0
> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001
> seconds)
> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
> [  632.450858] OOM killer enabled.
> [  632.450859] Restarting tasks: Starting
> [  632.453140] Restarting tasks: Done
> [  632.453173] random: crng reseeded on system resumption
> [  632.453370] PM: suspend exit
> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> 
> This is the reason:
> 
> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> 
> 
> During freezing, user processes executing jbd2_log_wait_commit enter D state
> because this function calls wait_event and can take tens of milliseconds to
> complete. This long execution time, coupled with possible competition with
> the freezer, causes repeated freeze retries.
> 
> While we understand that jbd2 is a freezable kernel thread, we would like to
> know if there is a way to freeze it earlier or freeze some critical
> processes proactively to reduce this contention.

Freeze the filesystem before you start freezing kthreads?  That should
quiesce the jbd2 workers and pause anyone trying to write to the fs.
Maybe the missing piece here is the device model not knowing how to call
bdev_freeze prior to a suspend?

That said, I think that doesn't 100% work for XFS because it has
kworkers for metadata buffer read completions, and freezes don't affect
read operations...

(just my clueless 2c)

--D

> Thanks for your input and suggestions.
> 
> On 2025/8/11 18:58, Michal Hocko wrote:
> > On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
> > > On 2025/8/8 16:58, Michal Hocko wrote:
> > [...]
> > > > Also the interface seems to be really coarse grained and it can easily
> > > > turn out insufficient for other usecases while it is not entirely clear
> > > > to me how this could be extended for those.
> > >   We recognize that the current interface is relatively coarse-grained and
> > > may not be sufficient for all scenarios. The present implementation is a
> > > basic version.
> > > 
> > > Our plan is to introduce a classification-based mechanism that assigns
> > > different freeze priorities according to process categories. For example,
> > > filesystem and graphics-related processes will be given higher default
> > > freeze priority, as they are critical in the freezing workflow. This
> > > classification approach helps target important processes more precisely.
> > > 
> > > However, this requires further testing and refinement before full
> > > deployment. We believe this incremental, category-based design will make the
> > > mechanism more effective and adaptable over time while keeping it
> > > manageable.
> > Unless there is a clear path for a more extendable interface then
> > introducing this one is a no-go. We do not want to grow different ways
> > to establish freezing policies.
> > 
> > But much more fundamentally. So far I haven't really seen any argument
> > why different priorities help with the underlying problem other than the
> > timing might be slightly different if you change the order of freezing.
> > This to me sounds like the proposed scheme mostly works around the
> > problem you are seeing and as such is not a really good candidate to be
> > merged as a long term solution. Not to mention with a user API that
> > needs to be maintained for ever.
> > 
> > So NAK from me on the interface.
> > 
> Thanks for the feedback. I understand your concern that changing the freezer
> priority order looks like working around the symptom rather than solving the
> root cause.
> 
> Since the last discussion, we have analyzed the D-state processes further
> and identified that the long wait time is caused by jbd2_log_wait_commit.
> This wait happens because user tasks call into this function during
> fsync/fdatasync and it can take tens of milliseconds to complete. When this
> coincides with the freezer operation, the tasks are stuck in D state and
> retried multiple times, increasing the total freeze time.
> 
> Although we know that jbd2 is a freezable kernel thread, we are exploring
> whether freezing it earlier — or freezing certain key processes first —
> could reduce this contention and improve freeze completion time.
> 
> 
> > > > I believe it would be more useful to find sources of those freezer
> > > > blockers and try to address those. Making more blocked tasks
> > > > __set_task_frozen compatible sounds like a general improvement in
> > > > itself.
> > > we have already identified some causes of D-state tasks, many of which are
> > > related to the filesystem. On some systems, certain processes frequently
> > > execute ext4_sync_file, and under contention this can lead to D-state tasks.
> > Please work with maintainers of those subsystems to find proper
> > solutions.
> 
> We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
> freeze ordering for jbd2 is safe or if there’s a better approach to avoid
> the repeated retries caused by this wait.
> 



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-12 17:26                 ` Darrick J. Wong
@ 2025-08-13  5:48                   ` Zihuan Zhang
  2025-08-14 16:43                     ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-13  5:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Michal Hocko, Theodore Ts'o, Jan Kara, Rafael J . Wysocki,
	Peter Zijlstra, Oleg Nesterov, David Hildenbrand, Jonathan Corbet,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, linux-ext4

Hi,

On 2025/8/13 01:26, Darrick J. Wong wrote:
> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>> Hi all,
>>
>> We encountered an issue where the number of freeze retries increased due to
>> processes stuck in D state. The logs point to jbd2-related activity.
>>
>> log1:
>>
>> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026
>> tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
>> [ 6616.650485] Call Trace:
>> [ 6616.650486]  <TASK>
>> [ 6616.650489]  __schedule+0x532/0xea0
>> [ 6616.650494]  schedule+0x27/0x80
>> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
>> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
>> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
>> [ 6616.650505]  do_fsync+0x3b/0x80
>>
>> log2:
>>
>> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
>> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
>> [  631.262167] Filesystems sync: 0.424 seconds
>> [  631.262821] Freezing user space processes
>> [  631.263839] freeze round: 1, task to freeze: 852
>> [  631.265128] freeze round: 2, task to freeze: 2
>> [  631.267039] freeze round: 3, task to freeze: 2
>> [  631.271176] freeze round: 4, task to freeze: 2
>> [  631.279160] freeze round: 5, task to freeze: 2
>> [  631.287152] freeze round: 6, task to freeze: 2
>> [  631.295346] freeze round: 7, task to freeze: 2
>> [  631.301747] freeze round: 8, task to freeze: 2
>> [  631.309346] freeze round: 9, task to freeze: 2
>> [  631.317353] freeze round: 10, task to freeze: 2
>> [  631.325348] freeze round: 11, task to freeze: 2
>> [  631.333353] freeze round: 12, task to freeze: 2
>> [  631.341358] freeze round: 13, task to freeze: 2
>> [  631.349357] freeze round: 14, task to freeze: 2
>> [  631.357363] freeze round: 15, task to freeze: 2
>> [  631.365361] freeze round: 16, task to freeze: 2
>> [  631.373379] freeze round: 17, task to freeze: 2
>> [  631.381366] freeze round: 18, task to freeze: 2
>> [  631.389365] freeze round: 19, task to freeze: 2
>> [  631.397371] freeze round: 20, task to freeze: 2
>> [  631.405373] freeze round: 21, task to freeze: 2
>> [  631.413373] freeze round: 22, task to freeze: 2
>> [  631.421392] freeze round: 23, task to freeze: 1
>> [  631.429948] freeze round: 24, task to freeze: 1
>> [  631.438295] freeze round: 25, task to freeze: 1
>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>> [  631.446387] freeze round: 26, task to freeze: 0
>> [  631.446390] Freezing user space processes completed (elapsed 0.183
>> seconds)
>> [  631.446392] OOM killer disabled.
>> [  631.446393] Freezing remaining freezable tasks
>> [  631.446656] freeze round: 1, task to freeze: 4
>> [  631.447976] freeze round: 2, task to freeze: 0
>> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001
>> seconds)
>> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
>> [  632.450858] OOM killer enabled.
>> [  632.450859] Restarting tasks: Starting
>> [  632.453140] Restarting tasks: Done
>> [  632.453173] random: crng reseeded on system resumption
>> [  632.453370] PM: suspend exit
>> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
>> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>
>> This is the reason:
>>
>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>
>>
>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>> because this function calls wait_event and can take tens of milliseconds to
>> complete. This long execution time, coupled with possible competition with
>> the freezer, causes repeated freeze retries.
>>
>> While we understand that jbd2 is a freezable kernel thread, we would like to
>> know if there is a way to freeze it earlier or freeze some critical
>> processes proactively to reduce this contention.
> Freeze the filesystem before you start freezing kthreads?  That should
> quiesce the jbd2 workers and pause anyone trying to write to the fs.
Indeed, freezing the filesystem can work.

However, this approach is quite expensive: it increases the total 
suspend time by about 3 to 4 seconds. Because of this overhead, we are 
exploring alternative solutions with lower cost.

We have tested it:

https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/ 
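
For anyone who wants to reproduce the cost from userspace, a filesystem 
can be frozen and thawed with the FIFREEZE and FITHAW ioctls, which is 
what fsfreeze(8) uses. The sketch below is only an experiment aid, not 
what the suspend path does; the helper name set_fs_frozen() is made up 
for illustration, and the calls need CAP_SYS_ADMIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */

/*
 * Freeze (freeze == true) or thaw (freeze == false) the filesystem
 * mounted at 'mountpoint'.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_fs_frozen(const char *mountpoint, bool freeze)
{
	int fd = open(mountpoint, O_RDONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;

	/* The ioctl argument is unused for both requests. */
	if (ioctl(fd, freeze ? FIFREEZE : FITHAW, 0) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Timing a set_fs_frozen(path, true) call followed by 
set_fs_frozen(path, false) on a busy mount gives a rough idea of how 
much a freeze adds, since the freeze has to flush dirty data first.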

> Maybe the missing piece here is the device model not knowing how to call
> bdev_freeze prior to a suspend?
Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you 
have any plans or insights on improving or integrating this 
functionality more smoothly into the device model and suspend sequence?
> That said, I think that doesn't 100% work for XFS because it has
> kworkers for metadata buffer read completions, and freezes don't affect
> read operations...

Does read activity also cause processes to enter D (uninterruptible 
sleep) state?

From what I understand, it’s usually writes or synchronous operations 
that do, but I’m curious if reads can also lead to D state under certain 
conditions.

> (just my clueless 2c)
>
> --D
>
>> Thanks for your input and suggestions.
>>
>> On 2025/8/11 18:58, Michal Hocko wrote:
>>> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>>>> On 2025/8/8 16:58, Michal Hocko wrote:
>>> [...]
>>>>> Also the interface seems to be really coarse grained and it can easily
>>>>> turn out insufficient for other usecases while it is not entirely clear
>>>>> to me how this could be extended for those.
>>>>    We recognize that the current interface is relatively coarse-grained and
>>>> may not be sufficient for all scenarios. The present implementation is a
>>>> basic version.
>>>>
>>>> Our plan is to introduce a classification-based mechanism that assigns
>>>> different freeze priorities according to process categories. For example,
>>>> filesystem and graphics-related processes will be given higher default
>>>> freeze priority, as they are critical in the freezing workflow. This
>>>> classification approach helps target important processes more precisely.
>>>>
>>>> However, this requires further testing and refinement before full
>>>> deployment. We believe this incremental, category-based design will make the
>>>> mechanism more effective and adaptable over time while keeping it
>>>> manageable.
>>> Unless there is a clear path for a more extendable interface then
>>> introducing this one is a no-go. We do not want to grow different ways
>>> to establish freezing policies.
>>>
>>> But much more fundamentally. So far I haven't really seen any argument
>>> why different priorities help with the underlying problem other than the
>>> timing might be slightly different if you change the order of freezing.
>>> This to me sounds like the proposed scheme mostly works around the
>>> problem you are seeing and as such is not a really good candidate to be
>>> merged as a long term solution. Not to mention with a user API that
>>> needs to be maintained for ever.
>>>
>>> So NAK from me on the interface.
>>>
>> Thanks for the feedback. I understand your concern that changing the freezer
>> priority order looks like working around the symptom rather than solving the
>> root cause.
>>
>> Since the last discussion, we have analyzed the D-state processes further
>> and identified that the long wait time is caused by jbd2_log_wait_commit.
>> This wait happens because user tasks call into this function during
>> fsync/fdatasync and it can take tens of milliseconds to complete. When this
>> coincides with the freezer operation, the tasks are stuck in D state and
>> retried multiple times, increasing the total freeze time.
>>
>> Although we know that jbd2 is a freezable kernel thread, we are exploring
>> whether freezing it earlier — or freezing certain key processes first —
>> could reduce this contention and improve freeze completion time.
>>
>>
>>>>> I believe it would be more useful to find sources of those freezer
>>>>> blockers and try to address those. Making more blocked tasks
>>>>> __set_task_frozen compatible sounds like a general improvement in
>>>>> itself.
>>>> we have already identified some causes of D-state tasks, many of which are
>>>> related to the filesystem. On some systems, certain processes frequently
>>>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
>>> Please work with maintainers of those subsystems to find proper
>>> solutions.
>> We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
>> freeze ordering for jbd2 is safe or if there’s a better approach to avoid
>> the repeated retries caused by this wait.
>>



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
                   ` (9 preceding siblings ...)
  2025-08-07 13:25 ` [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Michal Hocko
@ 2025-08-14 14:37 ` Peter Zijlstra
  2025-08-15  8:27   ` Zihuan Zhang
  10 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2025-08-14 14:37 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Rafael J . Wysocki, Oleg Nesterov, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel

On Thu, Aug 07, 2025 at 08:14:09PM +0800, Zihuan Zhang wrote:

> Freeze Window Begins
> 
>     [process A] - epoll_wait()
>         │
>         ▼
>     [process B] - event source (already frozen)
> 

Can we make epoll_wait() TASK_FREEZABLE? AFAICT it doesn't hold any
resources, it just sits there waiting for stuff.
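
To make the difference concrete, here is a toy userspace model of how 
the freezer treats a sleeping user task depending on its sleep state. It 
is a sketch under stated assumptions, not kernel code: the SK_* bit 
values, the enum, and classify_sleeper() are invented for illustration 
and only mirror the idea that a sleeper marked freezable can be frozen 
where it sits.

```c
/* Illustrative bit values for this sketch only, not kernel constants. */
#define SK_INTERRUPTIBLE	0x1u
#define SK_UNINTERRUPTIBLE	0x2u
#define SK_FREEZABLE		0x4u

enum freeze_outcome {
	FROZEN_IN_PLACE,	/* marked frozen without being woken up */
	WOKEN_TO_FREEZE,	/* woken, freezes on its way back to userspace */
	MUST_RETRY,		/* cannot be woken; the freezer tries again later */
};

/* Classify what one freeze round can do with a sleeping user task. */
enum freeze_outcome classify_sleeper(unsigned int state)
{
	if (state & SK_FREEZABLE)
		return FROZEN_IN_PLACE;
	if (state & SK_INTERRUPTIBLE)
		return WOKEN_TO_FREEZE;
	return MUST_RETRY;
}
```

In this model an epoll_wait() sleeper today falls into WOKEN_TO_FREEZE, 
and adding the freezable bit would move it to FROZEN_IN_PLACE, while a 
plain uninterruptible wait lands in MUST_RETRY.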



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-13  5:48                   ` Zihuan Zhang
@ 2025-08-14 16:43                     ` Darrick J. Wong
  2025-08-15  8:17                       ` Zihuan Zhang
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2025-08-14 16:43 UTC (permalink / raw)
  To: Zihuan Zhang
  Cc: Michal Hocko, Theodore Ts'o, Jan Kara, Rafael J . Wysocki,
	Peter Zijlstra, Oleg Nesterov, David Hildenbrand, Jonathan Corbet,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, linux-ext4

On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
> Hi,
> 
> On 2025/8/13 01:26, Darrick J. Wong wrote:
> > On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
> > > Hi all,
> > > 
> > > We encountered an issue where the number of freeze retries increased due to
> > > processes stuck in D state. The logs point to jbd2-related activity.
> > > 
> > > log1:
> > > 
> > > [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026
> > > tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
> > > [ 6616.650485] Call Trace:
> > > [ 6616.650486]  <TASK>
> > > [ 6616.650489]  __schedule+0x532/0xea0
> > > [ 6616.650494]  schedule+0x27/0x80
> > > [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
> > > [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > [ 6616.650502]  ext4_sync_file+0x1ba/0x380
> > > [ 6616.650505]  do_fsync+0x3b/0x80
> > > 
> > > log2:
> > > 
> > > [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
> > > [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> > > [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
> > > [  631.262167] Filesystems sync: 0.424 seconds
> > > [  631.262821] Freezing user space processes
> > > [  631.263839] freeze round: 1, task to freeze: 852
> > > [  631.265128] freeze round: 2, task to freeze: 2
> > > [  631.267039] freeze round: 3, task to freeze: 2
> > > [  631.271176] freeze round: 4, task to freeze: 2
> > > [  631.279160] freeze round: 5, task to freeze: 2
> > > [  631.287152] freeze round: 6, task to freeze: 2
> > > [  631.295346] freeze round: 7, task to freeze: 2
> > > [  631.301747] freeze round: 8, task to freeze: 2
> > > [  631.309346] freeze round: 9, task to freeze: 2
> > > [  631.317353] freeze round: 10, task to freeze: 2
> > > [  631.325348] freeze round: 11, task to freeze: 2
> > > [  631.333353] freeze round: 12, task to freeze: 2
> > > [  631.341358] freeze round: 13, task to freeze: 2
> > > [  631.349357] freeze round: 14, task to freeze: 2
> > > [  631.357363] freeze round: 15, task to freeze: 2
> > > [  631.365361] freeze round: 16, task to freeze: 2
> > > [  631.373379] freeze round: 17, task to freeze: 2
> > > [  631.381366] freeze round: 18, task to freeze: 2
> > > [  631.389365] freeze round: 19, task to freeze: 2
> > > [  631.397371] freeze round: 20, task to freeze: 2
> > > [  631.405373] freeze round: 21, task to freeze: 2
> > > [  631.413373] freeze round: 22, task to freeze: 2
> > > [  631.421392] freeze round: 23, task to freeze: 1
> > > [  631.429948] freeze round: 24, task to freeze: 1
> > > [  631.438295] freeze round: 25, task to freeze: 1
> > > [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> > > [  631.446387] freeze round: 26, task to freeze: 0
> > > [  631.446390] Freezing user space processes completed (elapsed 0.183
> > > seconds)
> > > [  631.446392] OOM killer disabled.
> > > [  631.446393] Freezing remaining freezable tasks
> > > [  631.446656] freeze round: 1, task to freeze: 4
> > > [  631.447976] freeze round: 2, task to freeze: 0
> > > [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001
> > > seconds)
> > > [  631.447980] PM: suspend debug: Waiting for 1 second(s).
> > > [  632.450858] OOM killer enabled.
> > > [  632.450859] Restarting tasks: Starting
> > > [  632.453140] Restarting tasks: Done
> > > [  632.453173] random: crng reseeded on system resumption
> > > [  632.453370] PM: suspend exit
> > > [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
> > > [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
> > > 
> > > This is the reason:
> > > 
> > > [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
> > > 
> > > 
> > > During freezing, user processes executing jbd2_log_wait_commit enter D state
> > > because this function calls wait_event and can take tens of milliseconds to
> > > complete. This long execution time, coupled with possible competition with
> > > the freezer, causes repeated freeze retries.
> > > 
> > > While we understand that jbd2 is a freezable kernel thread, we would like to
> > > know if there is a way to freeze it earlier or freeze some critical
> > > processes proactively to reduce this contention.
> > Freeze the filesystem before you start freezing kthreads?  That should
> > quiesce the jbd2 workers and pause anyone trying to write to the fs.
> Indeed, freezing the filesystem can work.
> 
> However, this approach is quite expensive: it increases the total suspend
> time by about 3 to 4 seconds. Because of this overhead, we are exploring
> alternative solutions with lower cost.

Indeed it does, because now XFS and friends will actually shut down
their background workers and flush all the dirty data and metadata to
disk.  On the other hand, if the system crashes while suspended, there's
a lot less recovery work to be done.

Granted the kernel (or userspace) will usually sync() before suspending
so that's not been a huge problem in production afaict.

> We have tested it:
> 
> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
> 
> > Maybe the missing piece here is the device model not knowing how to call
> > bdev_freeze prior to a suspend?
> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have
> any plans or insights on improving or integrating this functionality more
> smoothly into the device model and suspend sequence?
> > That said, I think that doesn't 100% work for XFS because it has
> > kworkers for metadata buffer read completions, and freezes don't affect
> > read operations...
> 
> Does read activity also cause processes to enter D (uninterruptible sleep)
> state?

Usually.

> From what I understand, it’s usually writes or synchronous operations that
> do, but I’m curious if reads can also lead to D state under certain
> conditions.

Anything that sets the task state to uninterruptible.
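
As a rough userspace illustration of the point above (a sketch, not part of the patches discussed in this thread): a task's scheduler state is the third field of /proc/<pid>/stat, and `D` marks uninterruptible sleep. The helper names below are hypothetical, and the scan only covers thread-group leaders, not the individual threads under /proc/<pid>/task.

```python
import os


def task_state(stat_line):
    """Return (comm, state) parsed from one /proc/<pid>/stat line.

    The comm field is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' rather than on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Yield (pid, comm) for every process currently in D state."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                comm, state = task_state(f.read())
        except OSError:
            continue  # the task exited while we were scanning
        if state == "D":
            yield int(entry), comm
```

Sampling `uninterruptible_tasks()` shortly before a suspend attempt would show which tasks are in D state at that moment, whether they got there through a read, a write, or an ioctl.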

--D

> > (just my clueless 2c)
> > 
> > --D
> > 
> > > Thanks for your input and suggestions.
> > > 
> > > On 2025/8/11 18:58, Michal Hocko wrote:
> > > > On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
> > > > > On 2025/8/8 16:58, Michal Hocko wrote:
> > > > [...]
> > > > > > Also the interface seems to be really coarse grained and it can easily
> > > > > > turn out insufficient for other usecases while it is not entirely clear
> > > > > > to me how this could be extended for those.
> > > > >    We recognize that the current interface is relatively coarse-grained and
> > > > > may not be sufficient for all scenarios. The present implementation is a
> > > > > basic version.
> > > > > 
> > > > > Our plan is to introduce a classification-based mechanism that assigns
> > > > > different freeze priorities according to process categories. For example,
> > > > > filesystem and graphics-related processes will be given higher default
> > > > > freeze priority, as they are critical in the freezing workflow. This
> > > > > classification approach helps target important processes more precisely.
> > > > > 
> > > > > However, this requires further testing and refinement before full
> > > > > deployment. We believe this incremental, category-based design will make the
> > > > > mechanism more effective and adaptable over time while keeping it
> > > > > manageable.
> > > > Unless there is a clear path for a more extendable interface then
> > > > introducing this one is a no-go. We do not want to grow different ways
> > > > to establish freezing policies.
> > > > 
> > > > But much more fundamentally. So far I haven't really seen any argument
> > > > why different priorities help with the underlying problem other than the
> > > > timing might be slightly different if you change the order of freezing.
> > > > This to me sounds like the proposed scheme mostly works around the
> > > > problem you are seeing and as such is not a really good candidate to be
> > > > merged as a long term solution. Not to mention with a user API that
> > > > needs to be maintained for ever.
> > > > 
> > > > So NAK from me on the interface.
> > > > 
> > > Thanks for the feedback. I understand your concern that changing the freezer
> > > priority order looks like working around the symptom rather than solving the
> > > root cause.
> > > 
> > > Since the last discussion, we have analyzed the D-state processes further
> > > and identified that the long wait time is caused by jbd2_log_wait_commit.
> > > This wait happens because user tasks call into this function during
> > > fsync/fdatasync and it can take tens of milliseconds to complete. When this
> > > coincides with the freezer operation, the tasks are stuck in D state and
> > > retried multiple times, increasing the total freeze time.
> > > 
> > > Although we know that jbd2 is a freezable kernel thread, we are exploring
> > > whether freezing it earlier — or freezing certain key processes first —
> > > could reduce this contention and improve freeze completion time.
> > > 
> > > 
> > > > > > I believe it would be more useful to find sources of those freezer
> > > > > > blockers and try to address those. Making more blocked tasks
> > > > > > __set_task_frozen compatible sounds like a general improvement in
> > > > > > itself.
> > > > > > We have already identified some causes of D-state tasks, many of which are
> > > > > related to the filesystem. On some systems, certain processes frequently
> > > > > execute ext4_sync_file, and under contention this can lead to D-state tasks.
> > > > Please work with maintainers of those subsystems to find proper
> > > > solutions.
> > > We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
> > > freeze ordering for jbd2 is safe or if there’s a better approach to avoid
> > > the repeated retries caused by this wait.
> > > 
> 



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-14 16:43                     ` Darrick J. Wong
@ 2025-08-15  8:17                       ` Zihuan Zhang
  0 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-15  8:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Michal Hocko, Theodore Ts'o, Jan Kara, Rafael J . Wysocki,
	Peter Zijlstra, Oleg Nesterov, David Hildenbrand, Jonathan Corbet,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	len brown, pavel machek, Kees Cook, Andrew Morton,
	Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin,
	wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner,
	Jeff Layton, Al Viro, Adrian Ratiu, linux-pm, linux-mm,
	linux-fsdevel, linux-doc, linux-kernel, linux-ext4


On 2025/8/15 00:43, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
>> Hi,
>>
>> On 2025/8/13 01:26, Darrick J. Wong wrote:
>>> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>>>> Hi all,
>>>>
>>>> We encountered an issue where the number of freeze retries increased due to
>>>> processes stuck in D state. The logs point to jbd2-related activity.
>>>>
>>>> log1:
>>>>
>>>> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
>>>> [ 6616.650485] Call Trace:
>>>> [ 6616.650486]  <TASK>
>>>> [ 6616.650489]  __schedule+0x532/0xea0
>>>> [ 6616.650494]  schedule+0x27/0x80
>>>> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
>>>> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
>>>> [ 6616.650505]  do_fsync+0x3b/0x80
>>>>
>>>> log2:
>>>>
>>>> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
>>>> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>>> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
>>>> [  631.262167] Filesystems sync: 0.424 seconds
>>>> [  631.262821] Freezing user space processes
>>>> [  631.263839] freeze round: 1, task to freeze: 852
>>>> [  631.265128] freeze round: 2, task to freeze: 2
>>>> [  631.267039] freeze round: 3, task to freeze: 2
>>>> [  631.271176] freeze round: 4, task to freeze: 2
>>>> [  631.279160] freeze round: 5, task to freeze: 2
>>>> [  631.287152] freeze round: 6, task to freeze: 2
>>>> [  631.295346] freeze round: 7, task to freeze: 2
>>>> [  631.301747] freeze round: 8, task to freeze: 2
>>>> [  631.309346] freeze round: 9, task to freeze: 2
>>>> [  631.317353] freeze round: 10, task to freeze: 2
>>>> [  631.325348] freeze round: 11, task to freeze: 2
>>>> [  631.333353] freeze round: 12, task to freeze: 2
>>>> [  631.341358] freeze round: 13, task to freeze: 2
>>>> [  631.349357] freeze round: 14, task to freeze: 2
>>>> [  631.357363] freeze round: 15, task to freeze: 2
>>>> [  631.365361] freeze round: 16, task to freeze: 2
>>>> [  631.373379] freeze round: 17, task to freeze: 2
>>>> [  631.381366] freeze round: 18, task to freeze: 2
>>>> [  631.389365] freeze round: 19, task to freeze: 2
>>>> [  631.397371] freeze round: 20, task to freeze: 2
>>>> [  631.405373] freeze round: 21, task to freeze: 2
>>>> [  631.413373] freeze round: 22, task to freeze: 2
>>>> [  631.421392] freeze round: 23, task to freeze: 1
>>>> [  631.429948] freeze round: 24, task to freeze: 1
>>>> [  631.438295] freeze round: 25, task to freeze: 1
>>>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>>> [  631.446387] freeze round: 26, task to freeze: 0
>>>> [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
>>>> [  631.446392] OOM killer disabled.
>>>> [  631.446393] Freezing remaining freezable tasks
>>>> [  631.446656] freeze round: 1, task to freeze: 4
>>>> [  631.447976] freeze round: 2, task to freeze: 0
>>>> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>>>> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
>>>> [  632.450858] OOM killer enabled.
>>>> [  632.450859] Restarting tasks: Starting
>>>> [  632.453140] Restarting tasks: Done
>>>> [  632.453173] random: crng reseeded on system resumption
>>>> [  632.453370] PM: suspend exit
>>>> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
>>>> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>>>
>>>> This is the reason:
>>>>
>>>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>>>
>>>>
>>>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>>>> because this function calls wait_event and can take tens of milliseconds to
>>>> complete. This long execution time, coupled with possible competition with
>>>> the freezer, causes repeated freeze retries.
>>>>
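
A back-of-the-envelope model of the retry loop helps relate the round count to the elapsed time in log2. It assumes the backoff used by try_to_freeze_tasks() in kernel/power/process.c as I read it: sleep for up to 1 ms after the first pass, double the sleep after each retry, and cap it at 8 ms. The function name is hypothetical, and the "freeze round" lines themselves appear to come from locally added debug prints, not from this model.

```python
def freeze_sleep_upper_bound_ms(rounds, first_ms=1, cap_ms=8):
    """Upper bound on the time spent sleeping between `rounds` freeze passes."""
    total = 0
    sleep_ms = first_ms
    for _ in range(rounds - 1):  # one sleep between consecutive passes
        total += sleep_ms
        if sleep_ms < cap_ms:
            sleep_ms *= 2  # exponential backoff until the cap is reached
    return total


print(freeze_sleep_upper_bound_ms(26))  # 1 + 2 + 4 + 22 * 8 = 183
```

For 26 rounds the bound is 183 ms, which is close to the "elapsed 0.183 seconds" reported in log2. Under these assumptions, nearly all of the freeze time in that run is backoff sleep while waiting for the last one or two tasks.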
>>>> While we understand that jbd2 is a freezable kernel thread, we would like to
>>>> know if there is a way to freeze it earlier or freeze some critical
>>>> processes proactively to reduce this contention.
>>> Freeze the filesystem before you start freezing kthreads?  That should
>>> quiesce the jbd2 workers and pause anyone trying to write to the fs.
>> Indeed, freezing the filesystem can work.
>>
>> However, this approach is quite expensive: it increases the total suspend
>> time by about 3 to 4 seconds. Because of this overhead, we are exploring
>> alternative solutions with lower cost.
> Indeed it does, because now XFS and friends will actually shut down
> their background workers and flush all the dirty data and metadata to
> disk.  On the other hand, if the system crashes while suspended, there's
> a lot less recovery work to be done.
>
> Granted the kernel (or userspace) will usually sync() before suspending
> so that's not been a huge problem in production afaict.


Thank you for your explanation!

>> We have tested it:
>>
>> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
>>
>>> Maybe the missing piece here is the device model not knowing how to call
>>> bdev_freeze prior to a suspend?
>> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you have
>> any plans or insights on improving or integrating this functionality more
>> smoothly into the device model and suspend sequence?
>>> That said, I think that doesn't 100% work for XFS because it has
>>> kworkers for metadata buffer read completions, and freezes don't affect
>>> read operations...
>> Does read activity also cause processes to enter D (uninterruptible sleep)
>> state?
> Usually.

I think you are right.

Read operations such as vfs_read() also cause it:

[   79.179682] PM: suspend entry (deep)
[   79.302703] Filesystems sync: 0.123 seconds
[   79.385416] Freezing user space processes
[   79.386223] round:0 todo:673
[   79.387025] currnet process has not been frozen :Xorg pid:1588
[   79.387026] task:Xorg            state:D stack:0     pid:1588 tgid:1588  ppid:1471   flags:0x00000004
[   79.387030] Call Trace:
[   79.387031]  <TASK>
[   79.387032]  __schedule+0x46c/0xe40
[   79.387038]  schedule+0x32/0xb0
[   79.387040]  schedule_timeout+0x23d/0x2a0
[   79.387043]  ? pollwake+0x78/0xa0
[   79.387046]  wait_for_completion+0x8c/0x180
[   79.387048]  __flush_work+0x204/0x2d0
[   79.387051]  ? __pfx_wq_barrier_func+0x10/0x10
[   79.387054]  drm_mode_rmfb+0x1a0/0x200
[   79.387057]  ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[   79.387058]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   79.387060]  drm_ioctl_kernel+0xbc/0x150
[   79.387062]  ? __stack_depot_save+0x38/0x4c0
[   79.387066]  drm_ioctl+0x270/0x470
[   79.387068]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   79.387072]  radeon_drm_ioctl+0x4a/0x80 [radeon]
[   79.387108]  __x64_sys_ioctl+0x8c/0xc0
[   79.387110]  do_syscall_64+0x7e/0x270
[   79.387112]  ? __fsnotify_parent+0x113/0x370
[   79.387114]  ? drm_read+0x284/0x320
[   79.387117]  ? syscall_exit_work+0x110/0x140
[   79.387120]  ? vfs_read+0x220/0x2f0
[   79.387122]  ? vfs_read+0x220/0x2f0
[   79.387123]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387126]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387128]  ? syscall_exit_work+0x110/0x140
[   79.387130]  ? do_syscall_64+0x10f/0x270
[   79.387131]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387133]  ? syscall_exit_work+0x110/0x140
[   79.387135]  ? do_syscall_64+0x10f/0x270
[   79.387137]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387139]  ? syscall_exit_work+0x110/0x140
[   79.387141]  ? do_syscall_64+0x10f/0x270
[   79.387142]  ? syscall_exit_work+0x110/0x140
[   79.387144]  ? do_syscall_64+0x10f/0x270
[   79.387145]  ? irqtime_account_irq+0x40/0xc0
[   79.387148]  ? irqentry_exit_to_user_mode+0x74/0x1e0
[   79.387150]  entry_SYSCALL_64_after_hwframe+0x76/0xe0
[   79.387153] RIP: 0033:0x7f91baf2550b
[   79.387155] RSP: 002b:00007ffc673d5668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   79.387157] RAX: ffffffffffffffda RBX: 00007ffc673d56ac RCX: 00007f91baf2550b
[   79.387158] RDX: 00007ffc673d56ac RSI: 00000000c00464af RDI: 000000000000000e
[   79.387159] RBP: 00000000c00464af R08: 00007f91ba860220 R09: 000056429d1d9fa0
[   79.387160] R10: 0000000000000103 R11: 0000000000000246 R12: 000056429ba931e0
[   79.387161] R13: 000000000000000e R14: 00000000049f0b22 R15: 000056429b93bfb0
[   79.387164]  </TASK>
[   79.387255] round:1 todo:1
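
When reading a trace like the one above, it can help to separate the frames the unwinder could verify from those prefixed with `?`, which are addresses found on the stack that may be left over from earlier calls. A small sketch (the function name and the regular expression are illustrative assumptions, not an existing tool):

```python
import re

# Matches e.g. "[   79.387046]  wait_for_completion+0x8c/0x180"
# and          "[   79.387120]  ? vfs_read+0x220/0x2f0".
FRAME_RE = re.compile(r"\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def split_frames(trace_lines):
    """Return (reliable, unreliable) lists of function names."""
    reliable, unreliable = [], []
    for line in trace_lines:
        m = FRAME_RE.search(line)
        if not m:
            continue  # header, register dump, <TASK> markers, ...
        (unreliable if m.group(1) else reliable).append(m.group(2))
    return reliable, unreliable
```

Applied to the Xorg trace, the reliable frames run from radeon_drm_ioctl through drm_mode_rmfb and __flush_work to wait_for_completion, while the vfs_read entries carry the `?` marker.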

>>  From what I understand, it’s usually writes or synchronous operations that
>> do, but I’m curious if reads can also lead to D state under certain
>> conditions.
> Anything that sets the task state to uninterruptible.
>
> --D
>
>>> (just my clueless 2c)
>>>
>>> --D
>>>
>>>> Thanks for your input and suggestions.
>>>>
>>>> On 2025/8/11 18:58, Michal Hocko wrote:
>>>>> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>>>>>> On 2025/8/8 16:58, Michal Hocko wrote:
>>>>> [...]
>>>>>>> Also the interface seems to be really coarse grained and it can easily
>>>>>>> turn out insufficient for other usecases while it is not entirely clear
>>>>>>> to me how this could be extended for those.
>>>>>>     We recognize that the current interface is relatively coarse-grained and
>>>>>> may not be sufficient for all scenarios. The present implementation is a
>>>>>> basic version.
>>>>>>
>>>>>> Our plan is to introduce a classification-based mechanism that assigns
>>>>>> different freeze priorities according to process categories. For example,
>>>>>> filesystem and graphics-related processes will be given higher default
>>>>>> freeze priority, as they are critical in the freezing workflow. This
>>>>>> classification approach helps target important processes more precisely.
>>>>>>
>>>>>> However, this requires further testing and refinement before full
>>>>>> deployment. We believe this incremental, category-based design will make the
>>>>>> mechanism more effective and adaptable over time while keeping it
>>>>>> manageable.
>>>>> Unless there is a clear path for a more extendable interface then
>>>>> introducing this one is a no-go. We do not want to grow different ways
>>>>> to establish freezing policies.
>>>>>
>>>>> But much more fundamentally. So far I haven't really seen any argument
>>>>> why different priorities help with the underlying problem other than the
>>>>> timing might be slightly different if you change the order of freezing.
>>>>> This to me sounds like the proposed scheme mostly works around the
>>>>> problem you are seeing and as such is not a really good candidate to be
>>>>> merged as a long term solution. Not to mention with a user API that
>>>>> needs to be maintained for ever.
>>>>>
>>>>> So NAK from me on the interface.
>>>>>
>>>> Thanks for the feedback. I understand your concern that changing the freezer
>>>> priority order looks like working around the symptom rather than solving the
>>>> root cause.
>>>>
>>>> Since the last discussion, we have analyzed the D-state processes further
>>>> and identified that the long wait time is caused by jbd2_log_wait_commit.
>>>> This wait happens because user tasks call into this function during
>>>> fsync/fdatasync and it can take tens of milliseconds to complete. When this
>>>> coincides with the freezer operation, the tasks are stuck in D state and
>>>> retried multiple times, increasing the total freeze time.
>>>>
>>>> Although we know that jbd2 is a freezable kernel thread, we are exploring
>>>> whether freezing it earlier — or freezing certain key processes first —
>>>> could reduce this contention and improve freeze completion time.
>>>>
>>>>
>>>>>>> I believe it would be more useful to find sources of those freezer
>>>>>>> blockers and try to address those. Making more blocked tasks
>>>>>>> __set_task_frozen compatible sounds like a general improvement in
>>>>>>> itself.
>>>>>> We have already identified some causes of D-state tasks, many of which are
>>>>>> related to the filesystem. On some systems, certain processes frequently
>>>>>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
>>>>> Please work with maintainers of those subsystems to find proper
>>>>> solutions.
>>>> We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
>>>> freeze ordering for jbd2 is safe or if there’s a better approach to avoid
>>>> the repeated retries caused by this wait.
>>>>



* Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
  2025-08-14 14:37 ` Peter Zijlstra
@ 2025-08-15  8:27   ` Zihuan Zhang
  0 siblings, 0 replies; 38+ messages in thread
From: Zihuan Zhang @ 2025-08-15  8:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J . Wysocki, Oleg Nesterov, David Hildenbrand,
	Michal Hocko, Jonathan Corbet, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, len brown, pavel machek,
	Kees Cook, Andrew Morton, Lorenzo Stoakes, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko,
	Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro,
	Adrian Ratiu, linux-pm, linux-mm, linux-fsdevel, linux-doc,
	linux-kernel


On 2025/8/14 22:37, Peter Zijlstra wrote:
> On Thu, Aug 07, 2025 at 08:14:09PM +0800, Zihuan Zhang wrote:
>
>> Freeze Window Begins
>>
>>      [process A] - epoll_wait()
>>          │
>>          ▼
>>      [process B] - event source (already frozen)
>>
> Can we make epoll_wait() TASK_FREEZABLE? AFAICT it doesn't hold any
> resources, it just sits there waiting for stuff.

Based on the code, it’s ep_poll() that puts the task into the D state, 
most likely due to I/O or lower-level driver behavior. In fs/eventpoll.c:

Line 2097: __set_current_state(TASK_INTERRUPTIBLE);
<https://elixir.bootlin.com/linux/v6.16/C/ident/__set_current_state>
<https://elixir.bootlin.com/linux/v6.16/C/ident/TASK_INTERRUPTIBLE>

Simply changing the task state may not actually address the root cause. 
Currently, our approach is to identify tasks that are more likely to 
cause such issues and freeze them earlier or later in the process to 
avoid conflicts.




Thread overview: 38+ messages
2025-08-07 12:14 [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 1/9] freezer: Introduce freeze_priority field in task_struct Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 2/9] freezer: Introduce API to set per-task freeze priority Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 3/9] freezer: Add per-priority layered freeze logic Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 4/9] freezer: Set default freeze priority for userspace tasks Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 5/9] freezer: set default freeze priority for PF_SUSPEND_TASK processes Zihuan Zhang
2025-08-08 14:39   ` Oleg Nesterov
2025-08-11  9:25     ` Zihuan Zhang
2025-08-11  9:32       ` Oleg Nesterov
2025-08-11  9:42         ` Zihuan Zhang
2025-08-11  9:46           ` Oleg Nesterov
2025-08-11  9:54             ` Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 6/9] freezer: Set default freeze priority for zombie tasks Zihuan Zhang
2025-08-08 14:29   ` Oleg Nesterov
2025-08-11  9:29     ` Zihuan Zhang
2025-08-11  9:42       ` Oleg Nesterov
2025-08-12  8:07     ` Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 7/9] freezer: raise freeze priority of tasks failed to freeze last time Zihuan Zhang
2025-08-08 14:53   ` Oleg Nesterov
2025-08-11  9:31     ` Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 8/9] freezer: Add retry count statistics for freeze pass iterations Zihuan Zhang
2025-08-07 12:14 ` [RFC PATCH v1 9/9] proc: Add /proc/<pid>/freeze_priority interface Zihuan Zhang
2025-08-07 13:25 ` [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues Michal Hocko
2025-08-08  1:13   ` Zihuan Zhang
2025-08-08  7:00     ` Michal Hocko
2025-08-08  7:52       ` Zihuan Zhang
2025-08-08  8:58         ` Michal Hocko
2025-08-11  9:13           ` Zihuan Zhang
2025-08-11 10:58             ` Michal Hocko
2025-08-12  5:57               ` Zihuan Zhang
2025-08-12 17:26                 ` Darrick J. Wong
2025-08-13  5:48                   ` Zihuan Zhang
2025-08-14 16:43                     ` Darrick J. Wong
2025-08-15  8:17                       ` Zihuan Zhang
2025-08-08  7:57     ` Oleg Nesterov
2025-08-08  8:40       ` Zihuan Zhang
2025-08-14 14:37 ` Peter Zijlstra
2025-08-15  8:27   ` Zihuan Zhang