Message-ID: <8c61ab95-9caa-4b57-adfd-31f941f0264d@kylinos.cn>
Date: Wed, 13 Aug 2025 13:48:37 +0800
X-Mailing-List: linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/9] freezer: Introduce freeze priority model to address process dependency issues
To: "Darrick J. Wong"
Cc: Michal Hocko, Theodore Ts'o, Jan Kara, "Rafael J . Wysocki", Peter Zijlstra, Oleg Nesterov, David Hildenbrand, Jonathan Corbet, Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, len brown, pavel machek, Kees Cook, Andrew Morton, Lorenzo Stoakes, "Liam R . Howlett", Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Catalin Marinas, Nico Pache, xu xin, wangfushuai, Andrii Nakryiko, Christian Brauner, Thomas Gleixner, Jeff Layton, Al Viro, Adrian Ratiu, linux-pm@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org
From: Zihuan Zhang
In-Reply-To: <20250812172655.GF7938@frogsfrogsfrogs>

Hi,

On 2025/8/13 01:26, Darrick J. Wong wrote:
> On Tue, Aug 12, 2025 at 01:57:49PM +0800, Zihuan Zhang wrote:
>> Hi all,
>>
>> We encountered an issue where the number of freeze retries increased due to
>> processes stuck in D state. The logs point to jbd2-related activity.
>>
>> log1:
>>
>> 6616.650482] task:ThreadPoolForeg state:D stack:0     pid:262026
>> tgid:4065  ppid:2490   task_flags:0x400040 flags:0x00004004
>> [ 6616.650485] Call Trace:
>> [ 6616.650486]
>> [ 6616.650489]  __schedule+0x532/0xea0
>> [ 6616.650494]  schedule+0x27/0x80
>> [ 6616.650496]  jbd2_log_wait_commit+0xa6/0x120
>> [ 6616.650499]  ? __pfx_autoremove_wake_function+0x10/0x10
>> [ 6616.650502]  ext4_sync_file+0x1ba/0x380
>> [ 6616.650505]  do_fsync+0x3b/0x80
>>
>> log2:
>>
>> [  631.206315] jdb2_log_wait_log_commit  completed (elapsed 0.002 seconds)
>> [  631.215325] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>> [  631.240704] jdb2_log_wait_log_commit  completed (elapsed 0.386 seconds)
>> [  631.262167] Filesystems sync: 0.424 seconds
>> [  631.262821] Freezing user space processes
>> [  631.263839] freeze round: 1, task to freeze: 852
>> [  631.265128] freeze round: 2, task to freeze: 2
>> [  631.267039] freeze round: 3, task to freeze: 2
>> [  631.271176] freeze round: 4, task to freeze: 2
>> [  631.279160] freeze round: 5, task to freeze: 2
>> [  631.287152] freeze round: 6, task to freeze: 2
>> [  631.295346] freeze round: 7, task to freeze: 2
>> [  631.301747] freeze round: 8, task to freeze: 2
>> [  631.309346] freeze round: 9, task to freeze: 2
>> [  631.317353] freeze round: 10, task to freeze: 2
>> [  631.325348] freeze round: 11, task to freeze: 2
>> [  631.333353] freeze round: 12, task to freeze: 2
>> [  631.341358] freeze round: 13, task to freeze: 2
>> [  631.349357] freeze round: 14, task to freeze: 2
>> [  631.357363] freeze round: 15, task to freeze: 2
>> [  631.365361] freeze round: 16, task to freeze: 2
>> [  631.373379] freeze round: 17, task to freeze: 2
>> [  631.381366] freeze round: 18, task to freeze: 2
>> [  631.389365] freeze round: 19, task to freeze: 2
>> [  631.397371] freeze round: 20, task to freeze: 2
>> [  631.405373] freeze round: 21, task to freeze: 2
>> [  631.413373] freeze round: 22, task to freeze: 2
>> [  631.421392] freeze round: 23, task to freeze: 1
>> [  631.429948] freeze round: 24, task to freeze: 1
>> [  631.438295] freeze round: 25, task to freeze: 1
>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>> [  631.446387] freeze round: 26, task to freeze: 0
>> [  631.446390] Freezing user space processes completed (elapsed 0.183 seconds)
>> [  631.446392] OOM killer disabled.
>> [  631.446393] Freezing remaining freezable tasks
>> [  631.446656] freeze round: 1, task to freeze: 4
>> [  631.447976] freeze round: 2, task to freeze: 0
>> [  631.447978] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> [  631.447980] PM: suspend debug: Waiting for 1 second(s).
>> [  632.450858] OOM killer enabled.
>> [  632.450859] Restarting tasks: Starting
>> [  632.453140] Restarting tasks: Done
>> [  632.453173] random: crng reseeded on system resumption
>> [  632.453370] PM: suspend exit
>> [  632.462799] jdb2_log_wait_log_commit  completed (elapsed 0.000 seconds)
>> [  632.466114] jdb2_log_wait_log_commit  completed (elapsed 0.001 seconds)
>>
>> This is the reason:
>>
>> [  631.444546] jdb2_log_wait_log_commit  completed (elapsed 0.249 seconds)
>>
>>
>> During freezing, user processes executing jbd2_log_wait_commit enter D state
>> because this function calls wait_event and can take tens of milliseconds to
>> complete. This long execution time, coupled with possible competition with
>> the freezer, causes repeated freeze retries.
>>
>> While we understand that jbd2 is a freezable kernel thread, we would like to
>> know if there is a way to freeze it earlier or freeze some critical
>> processes proactively to reduce this contention.
> Freeze the filesystem before you start freezing kthreads? That should
> quiesce the jbd2 workers and pause anyone trying to write to the fs.

Freezing the filesystem does work, but it is expensive: it increases the
total suspend time by about 3 to 4 seconds.
Because of this overhead, we are exploring lower-cost alternatives. We have
tested this approach; see:
https://lore.kernel.org/all/09df0911-9421-40af-8296-de1383be1c58@kylinos.cn/

> Maybe the missing piece here is the device model not knowing how to call
> bdev_freeze prior to a suspend?

Currently, the suspend flow does not appear to invoke bdev_freeze(). Do you
have any plans or insights on improving this, or on integrating it more
smoothly into the device model and the suspend sequence?

> That said, I think that doesn't 100% work for XFS because it has
> kworkers for metadata buffer read completions, and freezes don't affect
> read operations...

Can read activity also put processes into D (uninterruptible sleep) state?
My understanding is that this is usually caused by writes or synchronous
operations, but I'm curious whether reads can also lead to D state under
certain conditions.

> (just my clueless 2c)
>
> --D
>
>> Thanks for your input and suggestions.
>>
>> On 2025/8/11 18:58, Michal Hocko wrote:
>>> On Mon 11-08-25 17:13:43, Zihuan Zhang wrote:
>>>> On 2025/8/8 16:58, Michal Hocko wrote:
>>> [...]
>>>>> Also the interface seems to be really coarse grained and it can easily
>>>>> turn out insufficient for other usecases while it is not entirely clear
>>>>> to me how this could be extended for those.
>>>> We recognize that the current interface is relatively coarse-grained and
>>>> may not be sufficient for all scenarios. The present implementation is a
>>>> basic version.
>>>>
>>>> Our plan is to introduce a classification-based mechanism that assigns
>>>> different freeze priorities according to process categories. For example,
>>>> filesystem and graphics-related processes will be given higher default
>>>> freeze priority, as they are critical in the freezing workflow. This
>>>> classification approach helps target important processes more precisely.
>>>>
>>>> However, this requires further testing and refinement before full
>>>> deployment. We believe this incremental, category-based design will make the
>>>> mechanism more effective and adaptable over time while keeping it
>>>> manageable.
>>> Unless there is a clear path for a more extendable interface then
>>> introducing this one is a no-go. We do not want to grow different ways
>>> to establish freezing policies.
>>>
>>> But much more fundamentally. So far I haven't really seen any argument
>>> why different priorities help with the underlying problem other than the
>>> timing might be slightly different if you change the order of freezing.
>>> This to me sounds like the proposed scheme mostly works around the
>>> problem you are seeing and as such is not a really good candidate to be
>>> merged as a long term solution. Not to mention with a user API that
>>> needs to be maintained for ever.
>>>
>>> So NAK from me on the interface.
>>>
>> Thanks for the feedback. I understand your concern that changing the freezer
>> priority order looks like working around the symptom rather than solving the
>> root cause.
>>
>> Since the last discussion, we have analyzed the D-state processes further
>> and identified that the long wait time is caused by jbd2_log_wait_commit.
>> This wait happens because user tasks call into this function during
>> fsync/fdatasync and it can take tens of milliseconds to complete. When this
>> coincides with the freezer operation, the tasks are stuck in D state and
>> retried multiple times, increasing the total freeze time.
>>
>> Although we know that jbd2 is a freezable kernel thread, we are exploring
>> whether freezing it earlier, or freezing certain key processes first,
>> could reduce this contention and improve freeze completion time.
>>
>>
>>>>> I believe it would be more useful to find sources of those freezer
>>>>> blockers and try to address those.
>>>>> Making more blocked tasks __set_task_frozen compatible sounds like a
>>>>> general improvement in itself.
>>>> we have already identified some causes of D-state tasks, many of which are
>>>> related to the filesystem. On some systems, certain processes frequently
>>>> execute ext4_sync_file, and under contention this can lead to D-state tasks.
>>> Please work with maintainers of those subsystems to find proper
>>> solutions.
>> We've pulled in the jbd2 maintainer to get feedback on whether changing the
>> freeze ordering for jbd2 is safe or if there's a better approach to avoid
>> the repeated retries caused by this wait.
>>
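For reference, the retry pattern in log2 can be approximated with a rough
model of the backoff in try_to_freeze_tasks() (kernel/power/process.c): after
each failed scan it sleeps with usleep_range(sleep/2, sleep), starting at 1 ms
and doubling the sleep until it reaches 8 ms. The sketch below is only a
user-space approximation under stated assumptions: every sleep lasts its full
upper bound, scanning the task list takes no time, a task frozen-blocked in
D state is caught by the first scan after its wait ends, and the 20 s freeze
timeout and wakeup checks are left out. `rounds_needed` is a hypothetical
helper for illustration, not a kernel function.

```python
# Rough model of the retry loop in try_to_freeze_tasks() (kernel/power/process.c).
# Assumptions (a sketch, not exact kernel behaviour): each usleep_range(s/2, s)
# sleeps the full s, scanning takes no time, and a task blocked in D state is
# frozen at the first scan that happens after its wait has finished.

INITIAL_SLEEP_MS = 1  # first retry sleep: 1 ms
MAX_SLEEP_MS = 8      # backoff stops doubling at 8 ms


def rounds_needed(blocked_ms: float) -> tuple[int, int]:
    """Return (scan rounds, elapsed ms) until a task blocked for blocked_ms is frozen."""
    elapsed_ms = 0
    sleep_ms = INITIAL_SLEEP_MS
    rounds = 1  # round 1 scans at t = 0
    while elapsed_ms < blocked_ms:  # task still in D state: this round fails
        elapsed_ms += sleep_ms       # sleep before the next scan
        if sleep_ms < MAX_SLEEP_MS:  # same doubling rule as the kernel loop
            sleep_ms *= 2
        rounds += 1
    return rounds, elapsed_ms


if __name__ == "__main__":
    for blocked in (0, 10, 50, 183):
        rounds, elapsed = rounds_needed(blocked)
        print(f"blocked {blocked:>3} ms -> {rounds:>2} rounds, frozen at ~{elapsed} ms")
```

Under these assumptions, rounds_needed(183) returns (26, 183): a task that
stays in D state for about 183 ms keeps the freezer retrying for 26 rounds,
which lines up with the 26 rounds and 0.183 s reported in log2. After the
first few rounds, every additional 8 ms spent in jbd2_log_wait_commit costs
roughly one more round.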