Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues


From: Zihuan Zhang <zhangzihuan@kylinos.cn>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>, Theodore Ts'o <tytso@mit.edu>,
	Jan Kara <jack@suse.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Oleg Nesterov <oleg@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	len brown <len.brown@intel.com>, pavel machek <pavel@kernel.org>,
	Kees Cook <kees@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Nico Pache <npache@redhat.com>, xu xin <xu.xin16@zte.com.cn>,
	wangfushuai <wangfushuai@baidu.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jeff Layton <jlayton@kernel.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Adrian Ratiu <adrian.ratiu@collabora.com>,
	linux-pm@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
Date: Fri, 15 Aug 2025 16:17:14 +0800
Message-ID: <6fadd7e2-a404-4514-8b42-8872beea1ac8@kylinos.cn>
In-Reply-To: <20250814164313.GO7942@frogsfrogsfrogs>


On 2025/8/15 00:43, Darrick J. Wong wrote:
> On Wed, Aug 13, 2025 at 01:48:37PM +0800, Zihuan Zhang wrote:
>> Hi,
>>
>> On 2025/8/13 01:26, Darrick J. Wong wrote:
>>> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>>>> Hi all,
>>>>
>>>> We encountered an issue where the number of freeze retries increased due to
>>>> processes stuck in D state. The logs point to jbd2-related activity.
>>>>
>>>> log1:
>>>>
>>>> [ 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026 tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
>>>> [ 6616.650485] Call Trace:
>>>> [ 6616.650486]  <TASK>
>>>> [ 6616.650489]  __schedule+0x532/0xea0
>>>> [ 6616.650494]  schedule+0x27/0x80
>>>> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
>>>> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
>>>> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
>>>> [ 6616.650505]  do_fsync+0x3b/0x80
>>>>
>>>> log2:
>>>>
>>>> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
>>>> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>>> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
>>>> [  631.262167] Filesystems sync: 0.424 seconds
>>>> [  631.262821] Freezing user space processes
>>>> [  631.263839] freeze round: 1, task to freeze: 852
>>>> [  631.265128] freeze round: 2, task to freeze: 2
>>>> [  631.267039] freeze round: 3, task to freeze: 2
>>>> [  631.271176] freeze round: 4, task to freeze: 2
>>>> [  631.279160] freeze round: 5, task to freeze: 2
>>>> [  631.287152] freeze round: 6, task to freeze: 2
>>>> [  631.295346] freeze round: 7, task to freeze: 2
>>>> [  631.301747] freeze round: 8, task to freeze: 2
>>>> [  631.309346] freeze round: 9, task to freeze: 2
>>>> [  631.317353] freeze round: 10, task to freeze: 2
>>>> [  631.325348] freeze round: 11, task to freeze: 2
>>>> [  631.333353] freeze round: 12, task to freeze: 2
>>>> [  631.341358] freeze round: 13, task to freeze: 2
>>>> [  631.349357] freeze round: 14, task to freeze: 2
>>>> [  631.357363] freeze round: 15, task to freeze: 2
>>>> [  631.365361] freeze round: 16, task to freeze: 2
>>>> [  631.373379] freeze round: 17, task to freeze: 2
>>>> [  631.381366] freeze round: 18, task to freeze: 2
>>>> [  631.389365] freeze round: 19, task to freeze: 2
>>>> [  631.397371] freeze round: 20, task to freeze: 2
>>>> [  631.405373] freeze round: 21, task to freeze: 2
>>>> [  631.413373] freeze round: 22, task to freeze: 2
>>>> [  631.421392] freeze round: 23, task to freeze: 1
>>>> [  631.429948] freeze round: 24, task to freeze: 1
>>>> [  631.438295] freeze round: 25, task to freeze: 1
>>>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>>> [  631.446387] freeze round: 26, task to freeze: 0
>>>> [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
>>>> [  631.446392] OOM killer disabled.
>>>> [  631.446393] Freezing remaining freezable tasks
>>>> [  631.446656] freeze round: 1, task to freeze: 4
>>>> [  631.447976] freeze round: 2, task to freeze: 0
>>>> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>>>> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
>>>> [  632.450858] OOM killer enabled.
>>>> [  632.450859] Restarting tasks: Starting
>>>> [  632.453140] Restarting tasks: Done
>>>> [  632.453173] random: crng reseeded on system resumption
>>>> [  632.453370] PM: suspend exit
>>>> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
>>>> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>>>
>>>> This is the reason:
>>>>
>>>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>>>
>>>>
>>>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>>>> because this function calls wait_event and can take tens of milliseconds to
>>>> complete. This long execution time, coupled with possible competition with
>>>> the freezer, causes repeated freeze retries.
>>>>
>>>> While we understand that jbd2 is a freezable kernel thread, we would like to
>>>> know if there is a way to freeze it earlier or freeze some critical
>>>> processes proactively to reduce this contention.
>>> Freeze the filesystem before you start freezing kthreads?  That should
>>> quiesce the jbd2 workers and pause anyone trying to write to the fs.
>> Indeed, freezing the filesystem can work.
>>
>> However, this approach is quite expensive: it increases the total suspend
>> time by about 3 to 4 seconds. Because of this overhead, we are exploring
>> alternative solutions with lower cost.
> Indeed it does, because now XFS and friends will actually shut down
> their background workers and flush all the dirty data and metadata to
> disk.  On the other hand, if the system crashes while suspended, there's
> a lot less recovery work to be done.
>
> Granted the kernel (or userspace) will usually sync() before suspending
> so that's not been a huge problem in production afaict.


Thank you for your explanation!

>> We have tested it:
>>
>> https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/
>>
>>> Maybe the missing piece here is the device model not knowing how to call
>>> bdev_freeze prior to a suspend?
>> Currently, the suspend flow does not seem to invoke bdev_freeze(). Do you
>> have any plans or insights on improving or integrating this functionality
>> more smoothly into the device model and suspend sequence?
>>> That said, I think that doesn't 100% work for XFS because it has
>>> kworkers for metadata buffer read completions, and freezes don't affect
>>> read operations...
>> Does read activity also cause processes to enter D (uninterruptible sleep)
>> state?
> Usually.

I think you are right.

Read operations such as vfs_read can also cause it:

[   79.179682] PM: suspend entry (deep)
[   79.302703] Filesystems sync: 0.123 seconds
[   79.385416] Freezing user space processes
[   79.386223] round:0 todo:673
[   79.387025] currnet process has not been frozen :Xorg pid:1588
[   79.387026] task:Xorg            state:D stack:0     pid:1588 tgid:1588  ppid:1471   flags:0x00000004
[   79.387030] Call Trace:
[   79.387031]  <TASK>
[   79.387032]  __schedule+0x46c/0xe40
[   79.387038]  schedule+0x32/0xb0
[   79.387040]  schedule_timeout+0x23d/0x2a0
[   79.387043]  ? pollwake+0x78/0xa0
[   79.387046]  wait_for_completion+0x8c/0x180
[   79.387048]  __flush_work+0x204/0x2d0
[   79.387051]  ? __pfx_wq_barrier_func+0x10/0x10
[   79.387054]  drm_mode_rmfb+0x1a0/0x200
[   79.387057]  ? __pfx_drm_mode_rmfb_work_fn+0x10/0x10
[   79.387058]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   79.387060]  drm_ioctl_kernel+0xbc/0x150
[   79.387062]  ? __stack_depot_save+0x38/0x4c0
[   79.387066]  drm_ioctl+0x270/0x470
[   79.387068]  ? __pfx_drm_mode_rmfb_ioctl+0x10/0x10
[   79.387072]  radeon_drm_ioctl+0x4a/0x80 [radeon]
[   79.387108]  __x64_sys_ioctl+0x8c/0xc0
[   79.387110]  do_syscall_64+0x7e/0x270
[   79.387112]  ? __fsnotify_parent+0x113/0x370
[   79.387114]  ? drm_read+0x284/0x320
[   79.387117]  ? syscall_exit_work+0x110/0x140
[   79.387120]  ? vfs_read+0x220/0x2f0
[   79.387122]  ? vfs_read+0x220/0x2f0
[   79.387123]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387126]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387128]  ? syscall_exit_work+0x110/0x140
[   79.387130]  ? do_syscall_64+0x10f/0x270
[   79.387131]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387133]  ? syscall_exit_work+0x110/0x140
[   79.387135]  ? do_syscall_64+0x10f/0x270
[   79.387137]  ? audit_reset_context.part.0+0x27a/0x2f0
[   79.387139]  ? syscall_exit_work+0x110/0x140
[   79.387141]  ? do_syscall_64+0x10f/0x270
[   79.387142]  ? syscall_exit_work+0x110/0x140
[   79.387144]  ? do_syscall_64+0x10f/0x270
[   79.387145]  ? irqtime_account_irq+0x40/0xc0
[   79.387148]  ? irqentry_exit_to_user_mode+0x74/0x1e0
[   79.387150]  entry_SYSCALL_64_after_hwframe+0x76/0xe0
[   79.387153] RIP: 0033:0x7f91baf2550b
[   79.387155] RSP: 002b:00007ffc673d5668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   79.387157] RAX: ffffffffffffffda RBX: 00007ffc673d56ac RCX: 00007f91baf2550b
[   79.387158] RDX: 00007ffc673d56ac RSI: 00000000c00464af RDI: 000000000000000e
[   79.387159] RBP: 00000000c00464af R08: 00007f91ba860220 R09: 000056429d1d9fa0
[   79.387160] R10: 0000000000000103 R11: 0000000000000246 R12: 000056429ba931e0
[   79.387161] R13: 000000000000000e R14: 00000000049f0b22 R15: 000056429b93bfb0
[   79.387164]  </TASK>
[   79.387255] round:1 todo:1
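
When reading a trace like the one above, it can help to separate the frames the unwinder verified from the entries prefixed with "?", which are addresses found on the stack that are not confirmed to be part of the current call chain. The following is only a minimal sketch, assuming the dmesg line format shown in this mail; split_frames and FRAME_RE are hypothetical names used for illustration, not part of any patch.

```python
import re

# Matches a call-trace frame such as
#   "[   79.387046]  wait_for_completion+0x8c/0x180"
# or
#   "[   79.387043]  ? pollwake+0x78/0xa0"
# Group 1 is set when the frame carries the "?" marker (unverified entry).
# The pattern is an assumption based on the log format in this thread.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def split_frames(log_text):
    """Return (reliable, unreliable) lists of function names from a trace."""
    reliable, unreliable = [], []
    for line in log_text.splitlines():
        m = FRAME_RE.match(line)
        if not m:
            continue  # not a call-trace frame (header, registers, ...)
        if m.group(1):
            unreliable.append(m.group(2))
        else:
            reliable.append(m.group(2))
    return reliable, unreliable


# A few lines copied from the Xorg trace above.
SAMPLE = """\
[   79.387032]  __schedule+0x46c/0xe40
[   79.387038]  schedule+0x32/0xb0
[   79.387040]  schedule_timeout+0x23d/0x2a0
[   79.387043]  ? pollwake+0x78/0xa0
[   79.387046]  wait_for_completion+0x8c/0x180
[   79.387048]  __flush_work+0x204/0x2d0
[   79.387054]  drm_mode_rmfb+0x1a0/0x200
[   79.387120]  ? vfs_read+0x220/0x2f0
"""

reliable, unreliable = split_frames(SAMPLE)
print("verified frames:", reliable)
print("'?' entries:    ", unreliable)
```

On this sample the verified chain runs from __schedule through drm_mode_rmfb, while pollwake and vfs_read appear only as "?" entries.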

>> From what I understand, it’s usually writes or synchronous operations that
>> do, but I’m curious if reads can also lead to D state under certain
>> conditions.
> Anything that sets the task state to uninterruptible.
>
> --D
>
>>> (just my clueless 2c)
>>>
>>> --D
>>>
>>>> Thanks for your input and suggestions.
>>>>
>>>> On 2025/8/11 18:58, Michal Hocko wrote:
>>>>> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>>>>>> On 2025/8/8 16:58, Michal Hocko wrote:
>>>>> [...]
>>>>>>> Also the interface seems to be really coarse grained and it can easily
>>>>>>> turn out insufficient for other usecases while it is not entirely clear
>>>>>>> to me how this could be extended for those.
>>>>>> We recognize that the current interface is relatively coarse-grained and
>>>>>> may not be sufficient for all scenarios. The present implementation is a
>>>>>> basic version.
>>>>>>
>>>>>> Our plan is to introduce a classification-based mechanism that assigns
>>>>>> different freeze priorities according to process categories. For example,
>>>>>> filesystem and graphics-related processes will be given higher default
>>>>>> freeze priority, as they are critical in the freezing workflow. This
>>>>>> classification approach helps target important processes more precisely.
>>>>>>
>>>>>> However, this requires further testing and refinement before full
>>>>>> deployment. We believe this incremental, category-based design will make the
>>>>>> mechanism more effective and adaptable over time while keeping it
>>>>>> manageable.
>>>>> Unless there is a clear path for a more extendable interface then
>>>>> introducing this one is a no-go. We do not want to grow different ways
>>>>> to establish freezing policies.
>>>>>
>>>>> But much more fundamentally. So far I haven't really seen any argument
>>>>> why different priorities help with the underlying problem other than the
>>>>> timing might be slightly different if you change the order of freezing.
>>>>> This to me sounds like the proposed scheme mostly works around the
>>>>> problem you are seeing and as such is not a really good candidate to be
>>>>> merged as a long term solution. Not to mention with a user API that
>>>>> needs to be maintained for ever.
>>>>>
>>>>> So NAK from me on the interface.
>>>>>
>>>> Thanks for the feedback. I understand your concern that changing the freezer
>>>> priority order looks like working around the symptom rather than solving the
>>>> root cause.
>>>>
>>>> Since the last discussion, we have analyzed the D-state processes further
>>>> and identified that the long wait time is caused by jbd2_log_wait_commit.
>>>> This wait happens because user tasks call into this function during
>>>> fsync/fdatasync and it can take tens of milliseconds to complete. When this
>>>> coincides with the freezer operation, the tasks are stuck in D state and
>>>> retried multiple times, increasing the total freeze time.
>>>>
>>>> Although we know that jbd2 is a freezable kernel thread, we are exploring
>>>> whether freezing it earlier — or freezing certain key processes first —
>>>> could reduce this contention and improve freeze completion time.
>>>>
>>>>
>>>>>>> I believe it would be more useful to find sources of those freezer
>>>>>>> blockers and try to address those. Making more blocked tasks
>>>>>>> __set_task_frozen compatible sounds like a general improvement in
>>>>>>> itself.
>>>>>> We have already identified some causes of D-state tasks, many of which are
>>>>>> related to the filesystem. On some systems, certain processes frequently
>>>>>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
>>>>> Please work with maintainers of those subsystems to find proper
>>>>> solutions.
>>>> We’ve pulled in the jbd2 maintainer to get feedback on whether changing the
>>>> freeze ordering for jbd2 is safe or if there’s a better approach to avoid
>>>> the repeated retries caused by this wait.
>>>>
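
For reference, the per-round lines in log2 quoted above can be summarized with a small script to see how many rounds each freeze pass took and how long it lasted. This is a rough sketch only, assuming the exact "freeze round: N, task to freeze: M" format shown in that log; summarize_passes and ROUND_RE are hypothetical names, and the sample contains only a few of the quoted lines.

```python
import re

# Assumed line format, taken from log2 in this thread:
#   "[  631.263839] freeze round: 1, task to freeze: 852"
# search() is used so that lines with a leading quote prefix still match.
ROUND_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] freeze round: (\d+), task to freeze: (\d+)"
)


def summarize_passes(log_text):
    """Group 'freeze round' lines into passes.

    A new pass starts whenever the round number resets to 1 (user space
    processes first, then the remaining freezable tasks).
    """
    passes = []
    for line in log_text.splitlines():
        m = ROUND_RE.search(line)
        if not m:
            continue
        ts, rnd, todo = float(m.group(1)), int(m.group(2)), int(m.group(3))
        if rnd == 1 or not passes:
            passes.append({"start": ts, "end": ts, "rounds": 0,
                           "initial_todo": todo})
        current = passes[-1]
        current["end"] = ts       # timestamp of the latest round seen
        current["rounds"] = rnd   # highest round number in this pass
    for p in passes:
        p["elapsed"] = round(p["end"] - p["start"], 6)
    return passes


# A subset of the lines from log2.
SAMPLE = """\
[  631.263839] freeze round: 1, task to freeze: 852
[  631.265128] freeze round: 2, task to freeze: 2
[  631.438295] freeze round: 25, task to freeze: 1
[  631.446387] freeze round: 26, task to freeze: 0
[  631.446656] freeze round: 1, task to freeze: 4
[  631.447976] freeze round: 2, task to freeze: 0
"""

passes = summarize_passes(SAMPLE)
for i, p in enumerate(passes, 1):
    print(f"pass {i}: {p['rounds']} rounds, "
          f"{p['initial_todo']} tasks initially, {p['elapsed']:.3f} s")
```

For the user space pass this gives 26 rounds over roughly 0.18 seconds, in line with the "elapsed 0.183 seconds" reported by the kernel.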
