* fscache/NFS re-export server lockup
@ 2025-12-19 15:59 Mike Owen
2025-12-19 17:01 ` Daire Byrne
0 siblings, 1 reply; 8+ messages in thread
From: Mike Owen @ 2025-12-19 15:59 UTC (permalink / raw)
To: linux-nfs, netfs
We are using NFS re-exporting with fscache but have had a number of
lockups with re-export servers.
The NFS re-export servers are running Ubuntu 24.04 with a
6.14.0-36-generic kernel mounting Isilon and NetApp filers over NFSv3.
The NFS clients are mounting the re-exported shares over NFSv4 and
running AlmaLinux 9.6.
When the re-export server locks up, it is not possible to access the
/var/cache/fscache directory (separate file system) on the server and
all clients are hanging on their mounts from the re-export.
When the server is rebooted, most of the contents of the
/var/cache/fscache directory is missing e.g. the file system contents
before the lockup may have been a few TBs, but after a reboot only a
few hundred GBs.
The kern.log (below) from the re-export server reports the following
when the lockup happens.
We have tried using a more recent 6.17.11 kernel on the re-export
server and changing the fscache file system from ext4 to xfs, but
still have the same lockups.
Any ideas on what is happening here - and how it may be fixed?
2025-12-03T15:54:24.189530+00:00 ip-172-23-113-43 kernel: netfs:
fscache_begin_operation: cookie state change wait timed out:
cookie->state=1 state=1
2025-12-03T15:54:24.189546+00:00 ip-172-23-113-43 kernel: netfs:
fscache_begin_operation: cookie state change wait timed out:
cookie->state=1 state=1
2025-12-03T15:54:24.189549+00:00 ip-172-23-113-43 kernel: netfs:
O-cookie c=00049c53 [fl=6124 na=1 nA=2 s=L]
2025-12-03T15:54:24.189550+00:00 ip-172-23-113-43 kernel: netfs:
O-cookie c=0004ddfe [fl=4124 na=1 nA=2 s=L]
2025-12-03T15:54:24.189551+00:00 ip-172-23-113-43 kernel: netfs:
O-cookie V=0000000a
[Enfs,4.2,2,108,887517ac,2,,,50,100000,100000,927c0,927c0,927c0,927c0,1]
2025-12-03T15:54:24.189552+00:00 ip-172-23-113-43 kernel: netfs:
O-cookie V=0000000f
[Enfs,4.2,2,108,887517ac,3,,,50,100000,100000,927c0,927c0,927c0,927c0,1]
2025-12-03T15:54:24.189553+00:00 ip-172-23-113-43 kernel: netfs:
O-key=[56] '0100010c0200000008000000e7ba9075008000002000f360d67b09000000e7ba907508000000ffffffff000000000b00700b010000000000'
2025-12-03T15:54:24.190876+00:00 ip-172-23-113-43 kernel: netfs:
O-key=[56] '0100010c03000000070000002fbb9934008000002000f360d67b060000002fbb993407000000ffffffff000000000200c109010000000000'
2025-12-03T15:54:24.702103+00:00 ip-172-23-113-43 kernel: netfs:
fscache_begin_operation: cookie state change wait timed out:
cookie->state=1 state=1
2025-12-03T15:54:24.702112+00:00 ip-172-23-113-43 kernel: netfs:
O-cookie c=000404b3 [fl=6124 na=1 nA=2 s=L]
2025-12-03T15:54:24.702113+00:00 ip-172-23-113-43 kernel: netfs:
O-cookie V=0000000f
[Enfs,4.2,2,108,887517ac,3,,,50,100000,100000,927c0,927c0,927c0,927c0,1]
2025-12-03T15:54:24.703856+00:00 ip-172-23-113-43 kernel: netfs:
O-key=[56] '0100010c03000000070000007ece2438008000002000f360d67b060000007ece243807000000ffffffff000000000200c109010000000000'
2025-12-03T15:57:25.438905+00:00 ip-172-23-113-43 kernel: INFO: task
kcompactd0:171 blocked for more than 122 seconds.
2025-12-03T15:57:25.438921+00:00 ip-172-23-113-43 kernel:
Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu
2025-12-03T15:57:25.438928+00:00 ip-172-23-113-43 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
2025-12-03T15:57:25.439995+00:00 ip-172-23-113-43 kernel:
task:kcompactd0 state:D stack:0 pid:171 tgid:171 ppid:2
task_flags:0x210040 flags:0x00004000
2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: Call Trace:
2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: <TASK>
2025-12-03T15:57:25.440003+00:00 ip-172-23-113-43 kernel:
__schedule+0x2cf/0x640
2025-12-03T15:57:25.441017+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0
2025-12-03T15:57:25.441022+00:00 ip-172-23-113-43 kernel: io_schedule+0x4c/0x80
2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel:
folio_wait_bit_common+0x138/0x310
2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: ?
__pfx_wake_page_function+0x10/0x10
2025-12-03T15:57:25.441024+00:00 ip-172-23-113-43 kernel:
folio_wait_private_2+0x2c/0x60
2025-12-03T15:57:25.441025+00:00 ip-172-23-113-43 kernel:
nfs_release_folio+0xa0/0x120 [nfs]
2025-12-03T15:57:25.441032+00:00 ip-172-23-113-43 kernel:
filemap_release_folio+0x68/0xa0
2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel:
split_huge_page_to_list_to_order+0x401/0x970
2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: ?
compaction_alloc_noprof+0x1c5/0x2f0
2025-12-03T15:57:25.441034+00:00 ip-172-23-113-43 kernel:
split_folio_to_list+0x22/0x70
2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel:
migrate_pages_batch+0x2f2/0xa70
2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: ?
__pfx_compaction_free+0x10/0x10
2025-12-03T15:57:25.441038+00:00 ip-172-23-113-43 kernel: ?
__pfx_compaction_alloc+0x10/0x10
2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ?
__mod_memcg_lruvec_state+0xf4/0x250
2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ?
migrate_pages_batch+0x5e8/0xa70
2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel:
migrate_pages_sync+0x84/0x1e0
2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: ?
__pfx_compaction_free+0x10/0x10
2025-12-03T15:57:25.441041+00:00 ip-172-23-113-43 kernel: ?
__pfx_compaction_alloc+0x10/0x10
2025-12-03T15:57:25.441044+00:00 ip-172-23-113-43 kernel:
migrate_pages+0x38f/0x4c0
2025-12-03T15:57:25.441047+00:00 ip-172-23-113-43 kernel: ?
__pfx_compaction_free+0x10/0x10
2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: ?
__pfx_compaction_alloc+0x10/0x10
2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel:
compact_zone+0x385/0x700
2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: ?
isolate_migratepages_range+0xc1/0xf0
2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel:
kcompactd_do_work+0xfc/0x240
2025-12-03T15:57:25.441050+00:00 ip-172-23-113-43 kernel: kcompactd+0x43f/0x4a0
2025-12-03T15:57:25.441052+00:00 ip-172-23-113-43 kernel: ?
__pfx_autoremove_wake_function+0x10/0x10
2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: ?
__pfx_kcompactd+0x10/0x10
2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230
2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: ?
__pfx_kthread+0x10/0x10
2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel:
ret_from_fork+0x47/0x70
2025-12-03T15:57:25.441055+00:00 ip-172-23-113-43 kernel: ?
__pfx_kthread+0x10/0x10
2025-12-03T15:57:25.441057+00:00 ip-172-23-113-43 kernel:
ret_from_fork_asm+0x1a/0x30
2025-12-03T15:57:25.441058+00:00 ip-172-23-113-43 kernel: </TASK>
2025-12-03T15:57:25.441059+00:00 ip-172-23-113-43 kernel: INFO: task
jbd2/md127-8:6401 blocked for more than 122 seconds.
2025-12-03T15:57:25.441059+00:00 ip-172-23-113-43 kernel:
Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu
2025-12-03T15:57:25.443097+00:00 ip-172-23-113-43 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
2025-12-03T15:57:25.445001+00:00 ip-172-23-113-43 kernel:
task:jbd2/md127-8 state:D stack:0 pid:6401 tgid:6401 ppid:2
task_flags:0x240040 flags:0x00004000
2025-12-03T15:57:25.445005+00:00 ip-172-23-113-43 kernel: Call Trace:
2025-12-03T15:57:25.445006+00:00 ip-172-23-113-43 kernel: <TASK>
2025-12-03T15:57:25.445006+00:00 ip-172-23-113-43 kernel:
__schedule+0x2cf/0x640
2025-12-03T15:57:25.445007+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0
2025-12-03T15:57:25.445007+00:00 ip-172-23-113-43 kernel:
jbd2_journal_wait_updates+0x71/0xf0
2025-12-03T15:57:25.445008+00:00 ip-172-23-113-43 kernel: ?
__pfx_autoremove_wake_function+0x10/0x10
2025-12-03T15:57:25.445008+00:00 ip-172-23-113-43 kernel:
jbd2_journal_commit_transaction+0x26e/0x1800
2025-12-03T15:57:25.445009+00:00 ip-172-23-113-43 kernel: ?
__update_idle_core+0x58/0x130
2025-12-03T15:57:25.445010+00:00 ip-172-23-113-43 kernel: ?
psi_group_change+0x201/0x4e0
2025-12-03T15:57:25.445010+00:00 ip-172-23-113-43 kernel: ?
psi_task_switch+0x130/0x3a0
2025-12-03T15:57:25.445011+00:00 ip-172-23-113-43 kernel: kjournald2+0xaa/0x250
2025-12-03T15:57:25.445011+00:00 ip-172-23-113-43 kernel: ?
__pfx_autoremove_wake_function+0x10/0x10
2025-12-03T15:57:25.445012+00:00 ip-172-23-113-43 kernel: ?
__pfx_kjournald2+0x10/0x10
2025-12-03T15:57:25.445012+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230
2025-12-03T15:57:25.445015+00:00 ip-172-23-113-43 kernel: ?
__pfx_kthread+0x10/0x10
2025-12-03T15:57:25.445016+00:00 ip-172-23-113-43 kernel:
ret_from_fork+0x47/0x70
2025-12-03T15:57:25.445017+00:00 ip-172-23-113-43 kernel: ?
__pfx_kthread+0x10/0x10
2025-12-03T15:57:25.445018+00:00 ip-172-23-113-43 kernel:
ret_from_fork_asm+0x1a/0x30
2025-12-03T15:57:25.445018+00:00 ip-172-23-113-43 kernel: </TASK>
2025-12-03T15:57:25.445019+00:00 ip-172-23-113-43 kernel: INFO: task
nfsd:6783 blocked for more than 122 seconds.
2025-12-03T15:57:25.445020+00:00 ip-172-23-113-43 kernel:
Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu
2025-12-03T15:57:25.445023+00:00 ip-172-23-113-43 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
2025-12-03T15:57:25.447708+00:00 ip-172-23-113-43 kernel: task:nfsd
state:D stack:0 pid:6783 tgid:6783 ppid:2
task_flags:0x240040 flags:0x00004000
2025-12-03T15:57:25.447713+00:00 ip-172-23-113-43 kernel: Call Trace:
2025-12-03T15:57:25.447714+00:00 ip-172-23-113-43 kernel: <TASK>
2025-12-03T15:57:25.447714+00:00 ip-172-23-113-43 kernel:
__schedule+0x2cf/0x640
2025-12-03T15:57:25.447715+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0
2025-12-03T15:57:25.447716+00:00 ip-172-23-113-43 kernel:
schedule_preempt_disabled+0x15/0x30
2025-12-03T15:57:25.447716+00:00 ip-172-23-113-43 kernel:
rwsem_down_read_slowpath+0x261/0x490
2025-12-03T15:57:25.447717+00:00 ip-172-23-113-43 kernel: ?
mempool_alloc_slab+0x15/0x20
2025-12-03T15:57:25.447717+00:00 ip-172-23-113-43 kernel: down_read+0x48/0xd0
2025-12-03T15:57:25.447718+00:00 ip-172-23-113-43 kernel:
ext4_llseek+0xfc/0x120
2025-12-03T15:57:25.447719+00:00 ip-172-23-113-43 kernel: vfs_llseek+0x1c/0x40
2025-12-03T15:57:25.447719+00:00 ip-172-23-113-43 kernel:
cachefiles_do_prepare_read+0xc1/0x3d0 [cachefiles]
2025-12-03T15:57:25.447720+00:00 ip-172-23-113-43 kernel:
cachefiles_prepare_read+0x32/0x50 [cachefiles]
2025-12-03T15:57:25.447720+00:00 ip-172-23-113-43 kernel:
netfs_read_to_pagecache+0xc1/0x530 [netfs]
2025-12-03T15:57:25.447721+00:00 ip-172-23-113-43 kernel:
netfs_readahead+0x169/0x270 [netfs]
2025-12-03T15:57:25.447735+00:00 ip-172-23-113-43 kernel:
nfs_netfs_readahead+0x1f/0x40 [nfs]
2025-12-03T15:57:25.447736+00:00 ip-172-23-113-43 kernel:
nfs_readahead+0x184/0x2b0 [nfs]
2025-12-03T15:57:25.447737+00:00 ip-172-23-113-43 kernel: read_pages+0x85/0x220
2025-12-03T15:57:25.447737+00:00 ip-172-23-113-43 kernel: ?
__pfx_workingset_update_node+0x10/0x10
2025-12-03T15:57:25.447738+00:00 ip-172-23-113-43 kernel:
page_cache_ra_unbounded+0x18e/0x240
2025-12-03T15:57:25.447738+00:00 ip-172-23-113-43 kernel:
page_cache_sync_ra+0x228/0x230
2025-12-03T15:57:25.447739+00:00 ip-172-23-113-43 kernel:
filemap_get_pages+0x133/0x4a0
2025-12-03T15:57:25.447740+00:00 ip-172-23-113-43 kernel:
filemap_splice_read+0x155/0x380
2025-12-03T15:57:25.447740+00:00 ip-172-23-113-43 kernel: ?
find_inode+0xd9/0x100
2025-12-03T15:57:25.447741+00:00 ip-172-23-113-43 kernel:
nfs_file_splice_read+0xba/0xf0 [nfs]
2025-12-03T15:57:25.447741+00:00 ip-172-23-113-43 kernel:
do_splice_read+0x63/0xe0
2025-12-03T15:57:25.447742+00:00 ip-172-23-113-43 kernel:
splice_direct_to_actor+0xb1/0x260
2025-12-03T15:57:25.447743+00:00 ip-172-23-113-43 kernel: ?
__pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd]
2025-12-03T15:57:25.447743+00:00 ip-172-23-113-43 kernel:
nfsd_splice_read+0x11b/0x130 [nfsd]
2025-12-03T15:57:25.447744+00:00 ip-172-23-113-43 kernel:
nfsd4_encode_splice_read+0x65/0x120 [nfsd]
2025-12-03T15:57:25.447744+00:00 ip-172-23-113-43 kernel:
nfsd4_encode_read+0x132/0x170 [nfsd]
2025-12-03T15:57:25.447745+00:00 ip-172-23-113-43 kernel:
nfsd4_encode_operation+0xd4/0x350 [nfsd]
2025-12-03T15:57:25.447745+00:00 ip-172-23-113-43 kernel:
nfsd4_proc_compound+0x1da/0x7d0 [nfsd]
2025-12-03T15:57:25.447746+00:00 ip-172-23-113-43 kernel:
nfsd_dispatch+0xd7/0x220 [nfsd]
2025-12-03T15:57:25.447746+00:00 ip-172-23-113-43 kernel:
svc_process_common+0x4c6/0x740 [sunrpc]
2025-12-03T15:57:25.447747+00:00 ip-172-23-113-43 kernel: ?
__pfx_nfsd_dispatch+0x10/0x10 [nfsd]
2025-12-03T15:57:25.447747+00:00 ip-172-23-113-43 kernel:
svc_process+0x136/0x1f0 [sunrpc]
2025-12-03T15:57:25.447748+00:00 ip-172-23-113-43 kernel:
svc_handle_xprt+0x46e/0x570 [sunrpc]
2025-12-03T15:57:25.447751+00:00 ip-172-23-113-43 kernel:
svc_recv+0x184/0x2e0 [sunrpc]
2025-12-03T15:57:25.447752+00:00 ip-172-23-113-43 kernel: ?
__pfx_nfsd+0x10/0x10 [nfsd]
2025-12-03T15:57:25.447752+00:00 ip-172-23-113-43 kernel: nfsd+0x90/0xf0 [nfsd]
2025-12-03T15:57:25.447753+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230
2025-12-03T15:57:25.447753+00:00 ip-172-23-113-43 kernel: ?
__pfx_kthread+0x10/0x10
2025-12-03T15:57:25.447754+00:00 ip-172-23-113-43 kernel:
ret_from_fork+0x47/0x70
2025-12-03T15:57:25.447756+00:00 ip-172-23-113-43 kernel: ?
__pfx_kthread+0x10/0x10
2025-12-03T15:57:25.447757+00:00 ip-172-23-113-43 kernel:
ret_from_fork_asm+0x1a/0x30
2025-12-03T15:57:25.447757+00:00 ip-172-23-113-43 kernel: </TASK>
^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: fscache/NFS re-export server lockup
  2025-12-19 15:59 fscache/NFS re-export server lockup Mike Owen
@ 2025-12-19 17:01 ` Daire Byrne
  2026-01-09 21:45   ` Mike Owen
  0 siblings, 1 reply; 8+ messages in thread

From: Daire Byrne @ 2025-12-19 17:01 UTC (permalink / raw)
To: mjnowen; +Cc: linux-nfs, netfs

On Fri, 19 Dec 2025 at 16:25, Mike Owen <mjnowen@gmail.com> wrote:
>
> We are using NFS re-exporting with fscache but have had a number of
> lockups with re-export servers.
>
> The NFS re-export servers are running Ubuntu 24.04 with a
> 6.14.0-36-generic kernel mounting Isilon and NetApp filers over NFSv3.
>
> The NFS clients are mounting the re-exported shares over NFSv4 and
> running AlmaLinux 9.6.
>
> When the re-export server locks up, it is not possible to access the
> /var/cache/fscache directory (separate file system) on the server and
> all clients are hanging on their mounts from the re-export.

That does kind of point to a filesystem problem with /var/cache/fscache itself?

Can you stop the nfs server and is access to /var/cache/fscache still blocked?

> When the server is rebooted, most of the contents of the
> /var/cache/fscache directory is missing e.g. the file system contents
> before the lockup may have been a few TBs, but after a reboot only a
> few hundred GBs.

I can't say I've ever seen that before... we use ext4 for the fscache
filesystem mount on /var/cache/fscache.

And I presume there is definitely nothing else that might be
interacting with that /var/cache/fscache filesystem outside of fscache
or cachefilesd?

Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%).

> The kern.log (below) from the re-export server reports the following
> when the lockup happens.
>
> We have tried using a more recent 6.17.11 kernel on the re-export
> server and changing the fscache file system from ext4 to xfs, but
> still have the same lockups.
We were running a heavily patched v6.4 kernel for almost 2 years and
have just updated to v6.18.1 with no patches - so far so good!

We are mostly re-exporting NFSv3 -> NFSv3 (Linux NFS storage servers)
with a small amount of NFSv3 -> NFSv4.2 (Netapps).

> Any ideas on what is happening here - and how it may be fixed?

Sorry I can't be more helpful.

Daire

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: fscache/NFS re-export server lockup
  2025-12-19 17:01 ` Daire Byrne
@ 2026-01-09 21:45   ` Mike Owen
  2026-01-10 9:48     ` Mike Snitzer
  0 siblings, 1 reply; 8+ messages in thread

From: Mike Owen @ 2026-01-09 21:45 UTC (permalink / raw)
To: Daire Byrne, dhowells; +Cc: linux-nfs, netfs

Hi Daire, thanks for the comments.

> Can you stop the nfs server and is access to /var/cache/fscache still blocked?

As the machine is deadlocked, after reboot (so the nfs server is
definitely stopped), the actual data is gone/corrupted.

> And I presume there is definitely nothing else that might be
> interacting with that /var/cache/fscache filesystem outside of fscache
> or cachefilesd?

Correct. Machine is dedicated to KNFSD caching duties.

> Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%).

Similar settings here:

brun 20%
bcull 7%
bstop 3%
frun 20%
fcull 7%
fstop 3%

Although I should note that the issue happens when only ~10-20% of the
NVMe capacity is used, so culling has never had to run at this point.

We did try running 6.17.0 but made no difference. I see another thread
of yours with Chuck: "refcount_t underflow (nfsd4_sequence_done?) with
v6.18 re-export" and suggested commits to investigate, incl:
cbfd91d22776 ("nfsd: never defer requests during idmap lookup"), as
well as try using 6.18.4, so it's possible there is a cascading issue
here and we are in need of some NFS patches.

I'm hoping @dhowells might have some suggestions on how to further
debug this issue, given the below stack we are seeing when it
deadlocks?
Thanks,
-Mike

2025-12-03T15:54:24.189530+00:00 ip-172-23-113-43 kernel: netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
2025-12-03T15:54:24.189546+00:00 ip-172-23-113-43 kernel: netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
2025-12-03T15:54:24.189549+00:00 ip-172-23-113-43 kernel: netfs: O-cookie c=00049c53 [fl=6124 na=1 nA=2 s=L]
2025-12-03T15:54:24.189550+00:00 ip-172-23-113-43 kernel: netfs: O-cookie c=0004ddfe [fl=4124 na=1 nA=2 s=L]
2025-12-03T15:54:24.189551+00:00 ip-172-23-113-43 kernel: netfs: O-cookie V=0000000a [Enfs,4.2,2,108,887517ac,2,,,50,100000,100000,927c0,927c0,927c0,927c0,1]
2025-12-03T15:54:24.189552+00:00 ip-172-23-113-43 kernel: netfs: O-cookie V=0000000f [Enfs,4.2,2,108,887517ac,3,,,50,100000,100000,927c0,927c0,927c0,927c0,1]
2025-12-03T15:54:24.189553+00:00 ip-172-23-113-43 kernel: netfs: O-key=[56] '0100010c0200000008000000e7ba9075008000002000f360d67b09000000e7ba907508000000ffffffff000000000b00700b010000000000'
2025-12-03T15:54:24.190876+00:00 ip-172-23-113-43 kernel: netfs: O-key=[56] '0100010c03000000070000002fbb9934008000002000f360d67b060000002fbb993407000000ffffffff000000000200c109010000000000'
2025-12-03T15:54:24.702103+00:00 ip-172-23-113-43 kernel: netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
2025-12-03T15:54:24.702112+00:00 ip-172-23-113-43 kernel: netfs: O-cookie c=000404b3 [fl=6124 na=1 nA=2 s=L]
2025-12-03T15:54:24.702113+00:00 ip-172-23-113-43 kernel: netfs: O-cookie V=0000000f [Enfs,4.2,2,108,887517ac,3,,,50,100000,100000,927c0,927c0,927c0,927c0,1]
2025-12-03T15:54:24.703856+00:00 ip-172-23-113-43 kernel: netfs: O-key=[56] '0100010c03000000070000007ece2438008000002000f360d67b060000007ece243807000000ffffffff000000000200c109010000000000'
2025-12-03T15:57:25.438905+00:00 ip-172-23-113-43 kernel: INFO: task kcompactd0:171 blocked for more than 122 seconds.
2025-12-03T15:57:25.438921+00:00 ip-172-23-113-43 kernel: Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu
2025-12-03T15:57:25.438928+00:00 ip-172-23-113-43 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2025-12-03T15:57:25.439995+00:00 ip-172-23-113-43 kernel: task:kcompactd0 state:D stack:0 pid:171 tgid:171 ppid:2 task_flags:0x210040 flags:0x00004000
2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: Call Trace:
2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: <TASK>
2025-12-03T15:57:25.440003+00:00 ip-172-23-113-43 kernel: __schedule+0x2cf/0x640
2025-12-03T15:57:25.441017+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0
2025-12-03T15:57:25.441022+00:00 ip-172-23-113-43 kernel: io_schedule+0x4c/0x80
2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: folio_wait_bit_common+0x138/0x310
2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: ? __pfx_wake_page_function+0x10/0x10
2025-12-03T15:57:25.441024+00:00 ip-172-23-113-43 kernel: folio_wait_private_2+0x2c/0x60
2025-12-03T15:57:25.441025+00:00 ip-172-23-113-43 kernel: nfs_release_folio+0xa0/0x120 [nfs]
2025-12-03T15:57:25.441032+00:00 ip-172-23-113-43 kernel: filemap_release_folio+0x68/0xa0
2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: split_huge_page_to_list_to_order+0x401/0x970
2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: ? compaction_alloc_noprof+0x1c5/0x2f0
2025-12-03T15:57:25.441034+00:00 ip-172-23-113-43 kernel: split_folio_to_list+0x22/0x70
2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: migrate_pages_batch+0x2f2/0xa70
2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: ? __pfx_compaction_free+0x10/0x10
2025-12-03T15:57:25.441038+00:00 ip-172-23-113-43 kernel: ? __pfx_compaction_alloc+0x10/0x10
2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ? __mod_memcg_lruvec_state+0xf4/0x250
2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ? migrate_pages_batch+0x5e8/0xa70
2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: migrate_pages_sync+0x84/0x1e0
2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: ? __pfx_compaction_free+0x10/0x10
2025-12-03T15:57:25.441041+00:00 ip-172-23-113-43 kernel: ? __pfx_compaction_alloc+0x10/0x10
2025-12-03T15:57:25.441044+00:00 ip-172-23-113-43 kernel: migrate_pages+0x38f/0x4c0
2025-12-03T15:57:25.441047+00:00 ip-172-23-113-43 kernel: ? __pfx_compaction_free+0x10/0x10
2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: ? __pfx_compaction_alloc+0x10/0x10
2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: compact_zone+0x385/0x700
2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: ? isolate_migratepages_range+0xc1/0xf0
2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: kcompactd_do_work+0xfc/0x240
2025-12-03T15:57:25.441050+00:00 ip-172-23-113-43 kernel: kcompactd+0x43f/0x4a0
2025-12-03T15:57:25.441052+00:00 ip-172-23-113-43 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: ? __pfx_kcompactd+0x10/0x10
2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230
2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: ? __pfx_kthread+0x10/0x10
2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: ret_from_fork+0x47/0x70
2025-12-03T15:57:25.441055+00:00 ip-172-23-113-43 kernel: ? __pfx_kthread+0x10/0x10
2025-12-03T15:57:25.441057+00:00 ip-172-23-113-43 kernel: ret_from_fork_asm+0x1a/0x30
2025-12-03T15:57:25.441058+00:00 ip-172-23-113-43 kernel: </TASK>
2025-12-03T15:57:25.441059+00:00 ip-172-23-113-43 kernel: INFO: task jbd2/md127-8:6401 blocked for more than 122 seconds.
2025-12-03T15:57:25.441059+00:00 ip-172-23-113-43 kernel: Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu
2025-12-03T15:57:25.443097+00:00 ip-172-23-113-43 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2025-12-03T15:57:25.445001+00:00 ip-172-23-113-43 kernel: task:jbd2/md127-8 state:D stack:0 pid:6401 tgid:6401 ppid:2 task_flags:0x240040 flags:0x00004000
2025-12-03T15:57:25.445005+00:00 ip-172-23-113-43 kernel: Call Trace:
2025-12-03T15:57:25.445006+00:00 ip-172-23-113-43 kernel: <TASK>
2025-12-03T15:57:25.445006+00:00 ip-172-23-113-43 kernel: __schedule+0x2cf/0x640
2025-12-03T15:57:25.445007+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0
2025-12-03T15:57:25.445007+00:00 ip-172-23-113-43 kernel: jbd2_journal_wait_updates+0x71/0xf0
2025-12-03T15:57:25.445008+00:00 ip-172-23-113-43 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
2025-12-03T15:57:25.445008+00:00 ip-172-23-113-43 kernel: jbd2_journal_commit_transaction+0x26e/0x1800
2025-12-03T15:57:25.445009+00:00 ip-172-23-113-43 kernel: ? __update_idle_core+0x58/0x130
2025-12-03T15:57:25.445010+00:00 ip-172-23-113-43 kernel: ? psi_group_change+0x201/0x4e0
2025-12-03T15:57:25.445010+00:00 ip-172-23-113-43 kernel: ? psi_task_switch+0x130/0x3a0
2025-12-03T15:57:25.445011+00:00 ip-172-23-113-43 kernel: kjournald2+0xaa/0x250
2025-12-03T15:57:25.445011+00:00 ip-172-23-113-43 kernel: ? __pfx_autoremove_wake_function+0x10/0x10
2025-12-03T15:57:25.445012+00:00 ip-172-23-113-43 kernel: ? __pfx_kjournald2+0x10/0x10
2025-12-03T15:57:25.445012+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230
2025-12-03T15:57:25.445015+00:00 ip-172-23-113-43 kernel: ? __pfx_kthread+0x10/0x10
2025-12-03T15:57:25.445016+00:00 ip-172-23-113-43 kernel: ret_from_fork+0x47/0x70
2025-12-03T15:57:25.445017+00:00 ip-172-23-113-43 kernel: ? __pfx_kthread+0x10/0x10
2025-12-03T15:57:25.445018+00:00 ip-172-23-113-43 kernel: ret_from_fork_asm+0x1a/0x30
2025-12-03T15:57:25.445018+00:00 ip-172-23-113-43 kernel: </TASK>
2025-12-03T15:57:25.445019+00:00 ip-172-23-113-43 kernel: INFO: task nfsd:6783 blocked for more than 122 seconds.
2025-12-03T15:57:25.445020+00:00 ip-172-23-113-43 kernel: Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu
2025-12-03T15:57:25.445023+00:00 ip-172-23-113-43 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2025-12-03T15:57:25.447708+00:00 ip-172-23-113-43 kernel: task:nfsd state:D stack:0 pid:6783 tgid:6783 ppid:2 task_flags:0x240040 flags:0x00004000
2025-12-03T15:57:25.447713+00:00 ip-172-23-113-43 kernel: Call Trace:
2025-12-03T15:57:25.447714+00:00 ip-172-23-113-43 kernel: <TASK>
2025-12-03T15:57:25.447714+00:00 ip-172-23-113-43 kernel: __schedule+0x2cf/0x640
2025-12-03T15:57:25.447715+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0
2025-12-03T15:57:25.447716+00:00 ip-172-23-113-43 kernel: schedule_preempt_disabled+0x15/0x30
2025-12-03T15:57:25.447716+00:00 ip-172-23-113-43 kernel: rwsem_down_read_slowpath+0x261/0x490
2025-12-03T15:57:25.447717+00:00 ip-172-23-113-43 kernel: ? mempool_alloc_slab+0x15/0x20
2025-12-03T15:57:25.447717+00:00 ip-172-23-113-43 kernel: down_read+0x48/0xd0
2025-12-03T15:57:25.447718+00:00 ip-172-23-113-43 kernel: ext4_llseek+0xfc/0x120
2025-12-03T15:57:25.447719+00:00 ip-172-23-113-43 kernel: vfs_llseek+0x1c/0x40
2025-12-03T15:57:25.447719+00:00 ip-172-23-113-43 kernel: cachefiles_do_prepare_read+0xc1/0x3d0 [cachefiles]
2025-12-03T15:57:25.447720+00:00 ip-172-23-113-43 kernel: cachefiles_prepare_read+0x32/0x50 [cachefiles]
2025-12-03T15:57:25.447720+00:00 ip-172-23-113-43 kernel: netfs_read_to_pagecache+0xc1/0x530 [netfs]
2025-12-03T15:57:25.447721+00:00 ip-172-23-113-43 kernel: netfs_readahead+0x169/0x270 [netfs]
2025-12-03T15:57:25.447735+00:00 ip-172-23-113-43 kernel: nfs_netfs_readahead+0x1f/0x40 [nfs]
2025-12-03T15:57:25.447736+00:00 ip-172-23-113-43 kernel: nfs_readahead+0x184/0x2b0 [nfs]
2025-12-03T15:57:25.447737+00:00 ip-172-23-113-43 kernel: read_pages+0x85/0x220
2025-12-03T15:57:25.447737+00:00 ip-172-23-113-43 kernel: ? __pfx_workingset_update_node+0x10/0x10
2025-12-03T15:57:25.447738+00:00 ip-172-23-113-43 kernel: page_cache_ra_unbounded+0x18e/0x240
2025-12-03T15:57:25.447738+00:00 ip-172-23-113-43 kernel: page_cache_sync_ra+0x228/0x230
2025-12-03T15:57:25.447739+00:00 ip-172-23-113-43 kernel: filemap_get_pages+0x133/0x4a0
2025-12-03T15:57:25.447740+00:00 ip-172-23-113-43 kernel: filemap_splice_read+0x155/0x380
2025-12-03T15:57:25.447740+00:00 ip-172-23-113-43 kernel: ? find_inode+0xd9/0x100
2025-12-03T15:57:25.447741+00:00 ip-172-23-113-43 kernel: nfs_file_splice_read+0xba/0xf0 [nfs]
2025-12-03T15:57:25.447741+00:00 ip-172-23-113-43 kernel: do_splice_read+0x63/0xe0
2025-12-03T15:57:25.447742+00:00 ip-172-23-113-43 kernel: splice_direct_to_actor+0xb1/0x260
2025-12-03T15:57:25.447743+00:00 ip-172-23-113-43 kernel: ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd]
2025-12-03T15:57:25.447743+00:00 ip-172-23-113-43 kernel: nfsd_splice_read+0x11b/0x130 [nfsd]
2025-12-03T15:57:25.447744+00:00 ip-172-23-113-43 kernel: nfsd4_encode_splice_read+0x65/0x120 [nfsd]
2025-12-03T15:57:25.447744+00:00 ip-172-23-113-43 kernel: nfsd4_encode_read+0x132/0x170 [nfsd]
2025-12-03T15:57:25.447745+00:00 ip-172-23-113-43 kernel: nfsd4_encode_operation+0xd4/0x350 [nfsd]
2025-12-03T15:57:25.447745+00:00 ip-172-23-113-43 kernel: nfsd4_proc_compound+0x1da/0x7d0 [nfsd]
2025-12-03T15:57:25.447746+00:00 ip-172-23-113-43 kernel: nfsd_dispatch+0xd7/0x220 [nfsd]
2025-12-03T15:57:25.447746+00:00 ip-172-23-113-43 kernel: svc_process_common+0x4c6/0x740 [sunrpc]
2025-12-03T15:57:25.447747+00:00 ip-172-23-113-43 kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
2025-12-03T15:57:25.447747+00:00 ip-172-23-113-43 kernel: svc_process+0x136/0x1f0 [sunrpc]
2025-12-03T15:57:25.447748+00:00 ip-172-23-113-43 kernel: svc_handle_xprt+0x46e/0x570 [sunrpc]
2025-12-03T15:57:25.447751+00:00 ip-172-23-113-43 kernel: svc_recv+0x184/0x2e0 [sunrpc]
2025-12-03T15:57:25.447752+00:00 ip-172-23-113-43 kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
2025-12-03T15:57:25.447752+00:00 ip-172-23-113-43 kernel: nfsd+0x90/0xf0 [nfsd]
2025-12-03T15:57:25.447753+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230
2025-12-03T15:57:25.447753+00:00 ip-172-23-113-43 kernel: ? __pfx_kthread+0x10/0x10
2025-12-03T15:57:25.447754+00:00 ip-172-23-113-43 kernel: ret_from_fork+0x47/0x70
2025-12-03T15:57:25.447756+00:00 ip-172-23-113-43 kernel: ? __pfx_kthread+0x10/0x10
2025-12-03T15:57:25.447757+00:00 ip-172-23-113-43 kernel: ret_from_fork_asm+0x1a/0x30
2025-12-03T15:57:25.447757+00:00 ip-172-23-113-43 kernel: </TASK>

On Fri, 19 Dec 2025 at 17:02, Daire Byrne <daire@dneg.com> wrote:
>
> On Fri, 19 Dec 2025 at 16:25, Mike Owen <mjnowen@gmail.com> wrote:
> >
> > We are using NFS re-exporting with fscache but have had a number of
> > lockups with re-export servers.
> >
> > The NFS re-export servers are running Ubuntu 24.04 with a
> > 6.14.0-36-generic kernel mounting Isilon and NetApp filers over NFSv3.
> >
> > The NFS clients are mounting the re-exported shares over NFSv4 and
> > running AlmaLinux 9.6.
> >
> > When the re-export server locks up, it is not possible to access the
> > /var/cache/fscache directory (separate file system) on the server and
> > all clients are hanging on their mounts from the re-export.
>
> That does kind of point to a filesystem problem with /var/cache/fscache itself?
>
> Can you stop the nfs server and is access to /var/cache/fscache still blocked?
>
> > When the server is rebooted, most of the contents of the
> > /var/cache/fscache directory is missing e.g. the file system contents
> > before the lockup may have been a few TBs, but after a reboot only a
> > few hundred GBs.
>
> I can't say I've ever seen that before... we use ext4 for the fscache
> filesystem mount on /var/cache/fscache.
>
> And I presume there is definitely nothing else that might be
> interacting with that /var/cache/fscache filesystem outside of fscache
> or cachefilesd?
>
> Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%).
>
> > The kern.log (below) from the re-export server reports the following
> > when the lockup happens.
> >
> > We have tried using a more recent 6.17.11 kernel on the re-export
> > server and changing the fscache file system from ext4 to xfs, but
> > still have the same lockups.
>
> We were running a heavily patched v6.4 kernel for almost 2 years and
> have just updated to v6.18.1 with no patches - so far so good!
>
> We are mostly re-exporting NFSv3 -> NFSv3 (Linux NFS storage servers)
> with a small amount of NFSv3 -> NFSv4.2 (Netapps).
>
> > Any ideas on what is happening here - and how it may be fixed?
>
> Sorry I can't be more helpful.
>
> Daire

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: fscache/NFS re-export server lockup
  2026-01-09 21:45 ` Mike Owen
@ 2026-01-10 9:48   ` Mike Snitzer
  2026-01-12 15:16     ` Mike Owen
  0 siblings, 1 reply; 8+ messages in thread

From: Mike Snitzer @ 2026-01-10 9:48 UTC (permalink / raw)
To: Mike Owen; +Cc: Daire Byrne, dhowells, linux-nfs, netfs, trondmy, okorniev

Hi Mike,

On Fri, Jan 09, 2026 at 09:45:47PM +0000, Mike Owen wrote:
> Hi Daire, thanks for the comments.
>
> > Can you stop the nfs server and is access to /var/cache/fscache still blocked?
>
> As the machine is deadlocked, after reboot (so the nfs server is
> definitely stopped), the actual data is gone/corrupted.
>
> > And I presume there is definitely nothing else that might be
> > interacting with that /var/cache/fscache filesystem outside of fscache
> > or cachefilesd?
>
> Correct. Machine is dedicated to KNFSD caching duties.
>
> > Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%).
>
> Similar settings here:
>
> brun 20%
> bcull 7%
> bstop 3%
> frun 20%
> fcull 7%
> fstop 3%
>
> Although I should note that the issue happens when only ~10-20% of the
> NVMe capacity is used, so culling has never had to run at this point.
>
> We did try running 6.17.0 but made no difference. I see another thread
> of yours with Chuck: "refcount_t underflow (nfsd4_sequence_done?) with
> v6.18 re-export" and suggested commits to investigate, incl:
> cbfd91d22776 ("nfsd: never defer requests during idmap lookup"), as
> well as try using 6.18.4, so it's possible there is a cascading issue
> here and we are in need of some NFS patches.
>
> I'm hoping @dhowells might have some suggestions on how to further
> debug this issue, given the below stack we are seeing when it
> deadlocks?
> > Thanks, > -Mike This commit from Trond, which he'll be sending to Linus soon as part of his 6.19-rc NFS client fixes pull request, should fix the NFS re-export induced nfs_release_folio deadlock reflected in your below stack trace: https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=cce0be6eb4971456b703aaeafd571650d314bcca Here is more context for why I know that to be likely, it fixed my nasty LOCALIO-based reexport deadlock situation too: https://lore.kernel.org/linux-nfs/20260107160858.6847-1-snitzer@kernel.org/ I'm doing my part to advocate that Red Hat (Olga cc'd) take this fix into RHEL 10.2 (and backport as needed). Good luck getting Ubuntu to include this fix in a timely manner (we'll all thank you for that if you can help shake the Canonical tree). BTW, you'd do well to fix your editor/email so that it doesn't line wrap when you share logs on Linux mailing lists: > 2025-12-03T15:57:25.438905+00:00 ip-172-23-113-43 kernel: INFO: task > kcompactd0:171 blocked for more than 122 seconds. > 2025-12-03T15:57:25.438921+00:00 ip-172-23-113-43 kernel: > Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu > 2025-12-03T15:57:25.438928+00:00 ip-172-23-113-43 kernel: "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > 2025-12-03T15:57:25.439995+00:00 ip-172-23-113-43 kernel: > task:kcompactd0 state:D stack:0 pid:171 tgid:171 ppid:2 > task_flags:0x210040 flags:0x00004000 > 2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: Call Trace: > 2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: <TASK> > 2025-12-03T15:57:25.440003+00:00 ip-172-23-113-43 kernel: > __schedule+0x2cf/0x640 > 2025-12-03T15:57:25.441017+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0 > 2025-12-03T15:57:25.441022+00:00 ip-172-23-113-43 kernel: io_schedule+0x4c/0x80 > 2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: > folio_wait_bit_common+0x138/0x310 > 2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: ? 
> __pfx_wake_page_function+0x10/0x10 > 2025-12-03T15:57:25.441024+00:00 ip-172-23-113-43 kernel: > folio_wait_private_2+0x2c/0x60 > 2025-12-03T15:57:25.441025+00:00 ip-172-23-113-43 kernel: > nfs_release_folio+0xa0/0x120 [nfs] > 2025-12-03T15:57:25.441032+00:00 ip-172-23-113-43 kernel: > filemap_release_folio+0x68/0xa0 > 2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: > split_huge_page_to_list_to_order+0x401/0x970 > 2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: ? > compaction_alloc_noprof+0x1c5/0x2f0 > 2025-12-03T15:57:25.441034+00:00 ip-172-23-113-43 kernel: > split_folio_to_list+0x22/0x70 > 2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: > migrate_pages_batch+0x2f2/0xa70 > 2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: ? > __pfx_compaction_free+0x10/0x10 > 2025-12-03T15:57:25.441038+00:00 ip-172-23-113-43 kernel: ? > __pfx_compaction_alloc+0x10/0x10 > 2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ? > __mod_memcg_lruvec_state+0xf4/0x250 > 2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ? > migrate_pages_batch+0x5e8/0xa70 > 2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: > migrate_pages_sync+0x84/0x1e0 > 2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: ? > __pfx_compaction_free+0x10/0x10 > 2025-12-03T15:57:25.441041+00:00 ip-172-23-113-43 kernel: ? > __pfx_compaction_alloc+0x10/0x10 > 2025-12-03T15:57:25.441044+00:00 ip-172-23-113-43 kernel: > migrate_pages+0x38f/0x4c0 > 2025-12-03T15:57:25.441047+00:00 ip-172-23-113-43 kernel: ? > __pfx_compaction_free+0x10/0x10 > 2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: ? > __pfx_compaction_alloc+0x10/0x10 > 2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: > compact_zone+0x385/0x700 > 2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: ? 
> isolate_migratepages_range+0xc1/0xf0 > 2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: > kcompactd_do_work+0xfc/0x240 > 2025-12-03T15:57:25.441050+00:00 ip-172-23-113-43 kernel: kcompactd+0x43f/0x4a0 > 2025-12-03T15:57:25.441052+00:00 ip-172-23-113-43 kernel: ? > __pfx_autoremove_wake_function+0x10/0x10 > 2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: ? > __pfx_kcompactd+0x10/0x10 > 2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230 > 2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: ? > __pfx_kthread+0x10/0x10 > 2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: > ret_from_fork+0x47/0x70 > 2025-12-03T15:57:25.441055+00:00 ip-172-23-113-43 kernel: ? > __pfx_kthread+0x10/0x10 > 2025-12-03T15:57:25.441057+00:00 ip-172-23-113-43 kernel: > ret_from_fork_asm+0x1a/0x30 > 2025-12-03T15:57:25.441058+00:00 ip-172-23-113-43 kernel: </TASK> ^ permalink raw reply [flat|nested] 8+ messages in thread
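[Editorial aside: on the line-wrapping point above, a wrapped kernel log can usually be rejoined mechanically before posting. A rough sketch, assuming every real log line starts with an ISO-8601 timestamp (the `unwrap_klog` name and that heuristic are this note's assumptions, not anything from the thread):]

```shell
# Rejoin mail-client-wrapped kernel log lines: any line that does not
# begin with a "20NN-" timestamp is glued back onto the previous line.
unwrap_klog() {
    awk '
        /^20[0-9][0-9]-/ { if (buf != "") print buf; buf = $0; next }
        { sub(/^[> ]*/, ""); buf = buf " " $0 }   # drop quote markers, rejoin
        END { if (buf != "") print buf }
    '
}
```

Usage would be along the lines of `unwrap_klog < wrapped.txt > dmesg.txt`, then attaching the result rather than pasting it inline.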
* Re: fscache/NFS re-export server lockup 2026-01-10 9:48 ` Mike Snitzer @ 2026-01-12 15:16 ` Mike Owen 2026-01-15 15:20 ` Mike Owen 0 siblings, 1 reply; 8+ messages in thread From: Mike Owen @ 2026-01-12 15:16 UTC (permalink / raw) To: Mike Snitzer; +Cc: Daire Byrne, dhowells, linux-nfs, netfs, trondmy, okorniev Ah, this looks promising. Thanks for the info, Mike! Whilst we wait for the necessary NFS client fixes PR, I'll look to add the patch to 6.19-rc5 and report back if this fixes the issue we are seeing. I'll see what I can do internally to advocate Canonical absorbing it as well. ACK on my log wrapping. My bad. Thanks again! -Mike On Sat, 10 Jan 2026 at 09:48, Mike Snitzer <snitzer@kernel.org> wrote: > > Hi Mike, > > On Fri, Jan 09, 2026 at 09:45:47PM +0000, Mike Owen wrote: > > Hi Daire, thanks for the comments. > > > > > Can you stop the nfs server and is access to /var/cache/fscache still blocked? > > As the machine is deadlocked, after reboot (so the nfs server is > > definitely stopped), the actual data is gone/corrupted. > > > > >And I presume there is definitely nothing else that might be > > interacting with that /var/cache/fscache filesystem outside of fscache > > or cachefilesd? > > Correct. Machine is dedicated to KNFSD caching duties. > > > > > Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%). > > Similar settings here: > > brun 20% > > bcull 7% > > bstop 3% > > frun 20% > > fcull 7% > > fstop 3% > > Although I should note that the issue happens when only ~10-20% of the > > NVMe capacity is used, so culling has never had to run at this point. > > > > We did try running 6.17.0 but made no difference. I see another thread > > of yours with Chuck: "refcount_t underflow (nfsd4_sequence_done?) 
with > > v6.18 re-export" > > and suggested commits to investigate, incl: cbfd91d22776 ("nfsd: never > > defer requests during idmap lookup") as well as try using 6.18.4, so > > it's possible there is a cascading issue here and we are in need of > > some NFS patches. > > > > I'm hoping @dhowells might have some suggestions on how to further > > debug this issue, given the below stack we are seeing when it > > deadlocks? > > > > Thanks, > > -Mike > > This commit from Trond, which he'll be sending to Linus soon as part > of his 6.19-rc NFS client fixes pull request, should fix the NFS > re-export induced nfs_release_folio deadlock reflected in your below > stack trace: > https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=cce0be6eb4971456b703aaeafd571650d314bcca > > Here is more context for why I know that to be likely, it fixed my > nasty LOCALIO-based reexport deadlock situation too: > https://lore.kernel.org/linux-nfs/20260107160858.6847-1-snitzer@kernel.org/ > > I'm doing my part to advocate that Red Hat (Olga cc'd) take this > fix into RHEL 10.2 (and backport as needed). > > Good luck getting Ubuntu to include this fix in a timely manner (we'll > all thank you for that if you can help shake the Canonical tree). > > BTW, you'd do well to fix your editor/email so that it doesn't line > wrap when you share logs on Linux mailing lists: > > > 2025-12-03T15:57:25.438905+00:00 ip-172-23-113-43 kernel: INFO: task > > kcompactd0:171 blocked for more than 122 seconds. > > 2025-12-03T15:57:25.438921+00:00 ip-172-23-113-43 kernel: > > Tainted: G OE 6.14.0-36-generic #36~24.04.1-Ubuntu > > 2025-12-03T15:57:25.438928+00:00 ip-172-23-113-43 kernel: "echo 0 > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> > 2025-12-03T15:57:25.439995+00:00 ip-172-23-113-43 kernel: > > task:kcompactd0 state:D stack:0 pid:171 tgid:171 ppid:2 > > task_flags:0x210040 flags:0x00004000 > > 2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: Call Trace: > > 2025-12-03T15:57:25.440000+00:00 ip-172-23-113-43 kernel: <TASK> > > 2025-12-03T15:57:25.440003+00:00 ip-172-23-113-43 kernel: > > __schedule+0x2cf/0x640 > > 2025-12-03T15:57:25.441017+00:00 ip-172-23-113-43 kernel: schedule+0x29/0xd0 > > 2025-12-03T15:57:25.441022+00:00 ip-172-23-113-43 kernel: io_schedule+0x4c/0x80 > > 2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: > > folio_wait_bit_common+0x138/0x310 > > 2025-12-03T15:57:25.441023+00:00 ip-172-23-113-43 kernel: ? > > __pfx_wake_page_function+0x10/0x10 > > 2025-12-03T15:57:25.441024+00:00 ip-172-23-113-43 kernel: > > folio_wait_private_2+0x2c/0x60 > > 2025-12-03T15:57:25.441025+00:00 ip-172-23-113-43 kernel: > > nfs_release_folio+0xa0/0x120 [nfs] > > 2025-12-03T15:57:25.441032+00:00 ip-172-23-113-43 kernel: > > filemap_release_folio+0x68/0xa0 > > 2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: > > split_huge_page_to_list_to_order+0x401/0x970 > > 2025-12-03T15:57:25.441033+00:00 ip-172-23-113-43 kernel: ? > > compaction_alloc_noprof+0x1c5/0x2f0 > > 2025-12-03T15:57:25.441034+00:00 ip-172-23-113-43 kernel: > > split_folio_to_list+0x22/0x70 > > 2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: > > migrate_pages_batch+0x2f2/0xa70 > > 2025-12-03T15:57:25.441035+00:00 ip-172-23-113-43 kernel: ? > > __pfx_compaction_free+0x10/0x10 > > 2025-12-03T15:57:25.441038+00:00 ip-172-23-113-43 kernel: ? > > __pfx_compaction_alloc+0x10/0x10 > > 2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ? > > __mod_memcg_lruvec_state+0xf4/0x250 > > 2025-12-03T15:57:25.441039+00:00 ip-172-23-113-43 kernel: ? 
> > migrate_pages_batch+0x5e8/0xa70 > > 2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: > > migrate_pages_sync+0x84/0x1e0 > > 2025-12-03T15:57:25.441040+00:00 ip-172-23-113-43 kernel: ? > > __pfx_compaction_free+0x10/0x10 > > 2025-12-03T15:57:25.441041+00:00 ip-172-23-113-43 kernel: ? > > __pfx_compaction_alloc+0x10/0x10 > > 2025-12-03T15:57:25.441044+00:00 ip-172-23-113-43 kernel: > > migrate_pages+0x38f/0x4c0 > > 2025-12-03T15:57:25.441047+00:00 ip-172-23-113-43 kernel: ? > > __pfx_compaction_free+0x10/0x10 > > 2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: ? > > __pfx_compaction_alloc+0x10/0x10 > > 2025-12-03T15:57:25.441048+00:00 ip-172-23-113-43 kernel: > > compact_zone+0x385/0x700 > > 2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: ? > > isolate_migratepages_range+0xc1/0xf0 > > 2025-12-03T15:57:25.441049+00:00 ip-172-23-113-43 kernel: > > kcompactd_do_work+0xfc/0x240 > > 2025-12-03T15:57:25.441050+00:00 ip-172-23-113-43 kernel: kcompactd+0x43f/0x4a0 > > 2025-12-03T15:57:25.441052+00:00 ip-172-23-113-43 kernel: ? > > __pfx_autoremove_wake_function+0x10/0x10 > > 2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: ? > > __pfx_kcompactd+0x10/0x10 > > 2025-12-03T15:57:25.441053+00:00 ip-172-23-113-43 kernel: kthread+0xfe/0x230 > > 2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: ? > > __pfx_kthread+0x10/0x10 > > 2025-12-03T15:57:25.441054+00:00 ip-172-23-113-43 kernel: > > ret_from_fork+0x47/0x70 > > 2025-12-03T15:57:25.441055+00:00 ip-172-23-113-43 kernel: ? > > __pfx_kthread+0x10/0x10 > > 2025-12-03T15:57:25.441057+00:00 ip-172-23-113-43 kernel: > > ret_from_fork_asm+0x1a/0x30 > > 2025-12-03T15:57:25.441058+00:00 ip-172-23-113-43 kernel: </TASK> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: fscache/NFS re-export server lockup 2026-01-12 15:16 ` Mike Owen @ 2026-01-15 15:20 ` Mike Owen 2026-01-15 21:53 ` Mike Snitzer 0 siblings, 1 reply; 8+ messages in thread From: Mike Owen @ 2026-01-15 15:20 UTC (permalink / raw) To: Mike Snitzer; +Cc: Daire Byrne, dhowells, linux-nfs, netfs, trondmy, okorniev [-- Attachment #1: Type: text/plain, Size: 3328 bytes --] Hi Mike S, On 12/01/2026 15:16, Mike Owen wrote: > Ah, this looks promising. Thanks for the info, Mike! > Whilst we wait for the necessary NFS client fixes PR, I'll look to add > the patch to 6.19-rc5 and report back if this fixes the issue we are > seeing. > I'll see what I can do internally to advocate Canonical absorbing it as well. > ACK on my log wrapping. My bad. > Thanks again! > -Mike > > On Sat, 10 Jan 2026 at 09:48, Mike Snitzer <snitzer@kernel.org> wrote: >> >> Hi Mike, >> >> On Fri, Jan 09, 2026 at 09:45:47PM +0000, Mike Owen wrote: >>> Hi Daire, thanks for the comments. >>> >>>> Can you stop the nfs server and is access to /var/cache/fscache still blocked? >>> As the machine is deadlocked, after reboot (so the nfs server is >>> definitely stopped), the actual data is gone/corrupted. >>> >>>> And I presume there is definitely nothing else that might be >>> interacting with that /var/cache/fscache filesystem outside of fscache >>> or cachefilesd? >>> Correct. Machine is dedicated to KNFSD caching duties. >>> >>>> Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%). >>> Similar settings here: >>> brun 20% >>> bcull 7% >>> bstop 3% >>> frun 20% >>> fcull 7% >>> fstop 3% >>> Although I should note that the issue happens when only ~10-20% of the >>> NVMe capacity is used, so culling has never had to run at this point. >>> >>> We did try running 6.17.0 but made no difference. I see another thread >>> of yours with Chuck: "refcount_t underflow (nfsd4_sequence_done?) 
with >>> v6.18 re-export" >>> and suggested commits to investigate, incl: cbfd91d22776 ("nfsd: never >>> defer requests during idmap lookup") as well as try using 6.18.4, so >>> it's possible there is a cascading issue here and we are in need of >>> some NFS patches. >>> >>> I'm hoping @dhowells might have some suggestions on how to further >>> debug this issue, given the below stack we are seeing when it >>> deadlocks? >>> >>> Thanks, >>> -Mike >> >> This commit from Trond, which he'll be sending to Linus soon as part >> of his 6.19-rc NFS client fixes pull request, should fix the NFS >> re-export induced nfs_release_folio deadlock reflected in your below >> stack trace: >> https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=cce0be6eb4971456b703aaeafd571650d314bcca >> >> Here is more context for why I know that to be likely, it fixed my >> nasty LOCALIO-based reexport deadlock situation too: >> https://lore.kernel.org/linux-nfs/20260107160858.6847-1-snitzer@kernel.org/ >> >> I'm doing my part to advocate that Red Hat (Olga cc'd) take this >> fix into RHEL 10.2 (and backport as needed). >> >> Good luck getting Ubuntu to include this fix in a timely manner (we'll >> all thank you for that if you can help shake the Canonical tree). >> >> BTW, you'd do well to fix your editor/email so that it doesn't line >> wrap when you share logs on Linux mailing lists: >> ... We deployed 6.19-rc5 + kernel patch: "NFS: Fix a deadlock involving nfs_release_folio()" but unfortunately this has not fixed the issue. The KNFSD server becomes wedged (as far as NFSD is concerned, can still login) and we get the attached dmesg log. I attempted an analysis/RCA of this circular dependency/deadlock issue to try and assist getting this resolved (see attached). Any other patches to try? 
-Mike [-- Attachment #2: dmesg.txt --] [-- Type: text/plain, Size: 36995 bytes --] [Thu Jan 15 10:37:38 2026] md127: array will not be assembled in old kernels that lack configurable LBS support (<= 6.18) [Thu Jan 15 10:37:38 2026] md127: detected capacity change from 0 to 29296345088 [Thu Jan 15 10:38:46 2026] EXT4-fs (md127): mounted filesystem cc12ea7f-34f9-445a-b8ad-a9f1a122c51d r/w with ordered data mode. Quota mode: none. [Thu Jan 15 10:38:46 2026] netfs: FS-Cache loaded [Thu Jan 15 10:38:46 2026] CacheFiles: Loaded [Thu Jan 15 10:38:46 2026] netfs: Cache "mycache" added (type cachefiles) [Thu Jan 15 10:38:46 2026] CacheFiles: File cache on mycache registered [Thu Jan 15 10:39:07 2026] NFSD: Using nfsdcld client tracking operations. [Thu Jan 15 10:39:07 2026] NFSD: no clients to reclaim, skipping NFSv4 grace period (net effffff9) [Thu Jan 15 10:39:28 2026] audit: type=1400 audit(1768473568.836:130): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="rsyslogd" pid=8112 comm="apparmor_parser" [Thu Jan 15 10:40:39 2026] hrtimer: interrupt took 10300 ns [Thu Jan 15 10:42:35 2026] loop4: detected capacity change from 0 to 98480 [Thu Jan 15 10:42:35 2026] audit: type=1400 audit(1768473755.481:131): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine" pid=9736 comm="apparmor_parser" [Thu Jan 15 10:42:35 2026] audit: type=1400 audit(1768473755.482:132): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=9736 comm="apparmor_parser" [Thu Jan 15 10:42:40 2026] loop5: detected capacity change from 0 to 8 [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.184:133): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine" pid=10068 comm="apparmor_parser" [Thu Jan 15 10:42:40 
2026] audit: type=1400 audit(1768473760.194:134): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=10068 comm="apparmor_parser" [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.201:135): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.amazon-ssm-agent.amazon-ssm-agent" pid=10071 comm="apparmor_parser" [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.201:136): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.amazon-ssm-agent.ssm-cli" pid=10072 comm="apparmor_parser" [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.207:137): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap-update-ns.amazon-ssm-agent" pid=10070 comm="apparmor_parser" [Thu Jan 15 11:59:09 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:09 2026] netfs: O-cookie c=0000df95 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:09 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:09 2026] netfs: O-key=[32] 'f360d67b060000003414386507000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e07f [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b0600000020301c6207000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 
11:59:10 2026] netfs: O-cookie c=0000e1a1 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e01c [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e14d [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b0600000071149ebc06000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000b381 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000bfd9ce3407000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000df8c [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b060000006b7a9b3a07000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000317d6c3207000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000dbbc [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 
[Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000b2c0 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b0600000038c7f06f06000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000e1483c3807000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000e811572e07000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:11 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:11 2026] netfs: O-cookie c=0000dc24 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:11 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:11 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:11 2026] netfs: O-cookie c=0000c542 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:11 2026] netfs: O-key=[32] 'f360d67b06000000af19d78706000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:11 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:11 2026] netfs: O-key=[32] 'f360d67b060000003d594c7508000000ffffffff000000000200c10901000000' [Thu Jan 15 11:59:11 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1 [Thu Jan 15 11:59:11 2026] netfs: O-cookie c=0000cea9 [fl=6124 na=1 nA=2 s=L] [Thu Jan 15 11:59:11 2026] netfs: O-cookie 
V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1] [Thu Jan 15 11:59:11 2026] netfs: O-key=[32] 'f360d67b0600000016b7ae5907000000ffffffff000000000200c10901000000' [Thu Jan 15 12:01:08 2026] INFO: task kcompactd0:170 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:kcompactd0 state:D stack:0 pid:170 tgid:170 ppid:2 task_flags:0x210040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? __pfx_radix_tree_node_ctor+0x10/0x10 [Thu Jan 15 12:01:08 2026] ? cgroup_writeback_by_id+0x4b/0x200 [Thu Jan 15 12:01:08 2026] ? xas_store+0x5b/0x7f0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] io_schedule+0x4c/0x80 [Thu Jan 15 12:01:08 2026] folio_wait_bit_common+0x133/0x310 [Thu Jan 15 12:01:08 2026] ? __pfx_wake_page_function+0x10/0x10 [Thu Jan 15 12:01:08 2026] folio_wait_private_2+0x2c/0x60 [Thu Jan 15 12:01:08 2026] nfs_release_folio+0x61/0x130 [nfs] [Thu Jan 15 12:01:08 2026] filemap_release_folio+0x68/0xa0 [Thu Jan 15 12:01:08 2026] __folio_split+0x178/0x8e0 [Thu Jan 15 12:01:08 2026] ? post_alloc_hook+0xc1/0x140 [Thu Jan 15 12:01:08 2026] __split_huge_page_to_list_to_order+0x2b/0xb0 [Thu Jan 15 12:01:08 2026] split_folio_to_list+0x10/0x20 [Thu Jan 15 12:01:08 2026] migrate_pages_batch+0x45d/0xea0 [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_alloc+0x10/0x10 [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_free+0x10/0x10 [Thu Jan 15 12:01:08 2026] ? asm_exc_xen_unknown_trap+0x1/0x20 [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_free+0x10/0x10 [Thu Jan 15 12:01:08 2026] migrate_pages+0xaef/0xd80 [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_free+0x10/0x10 [Thu Jan 15 12:01:08 2026] ? 
__pfx_compaction_alloc+0x10/0x10 [Thu Jan 15 12:01:08 2026] compact_zone+0xb3f/0x1200 [Thu Jan 15 12:01:08 2026] ? psi_group_change+0x1f8/0x4c0 [Thu Jan 15 12:01:08 2026] ? kvm_sched_clock_read+0x11/0x20 [Thu Jan 15 12:01:08 2026] compact_node+0xaf/0x130 [Thu Jan 15 12:01:08 2026] kcompactd+0x374/0x4d0 [Thu Jan 15 12:01:08 2026] ? __pfx_autoremove_wake_function+0x10/0x10 [Thu Jan 15 12:01:08 2026] ? __pfx_kcompactd+0x10/0x10 [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task jbd2/md127-8:6833 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:jbd2/md127-8 state:D stack:0 pid:6833 tgid:6833 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? ext4_process_freed_data+0x236/0x320 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] jbd2_journal_wait_updates+0x6e/0xe0 [Thu Jan 15 12:01:08 2026] ? __pfx_autoremove_wake_function+0x10/0x10 [Thu Jan 15 12:01:08 2026] jbd2_journal_commit_transaction+0x26e/0x1730 [Thu Jan 15 12:01:08 2026] ? psi_group_change+0x1f8/0x4c0 [Thu Jan 15 12:01:08 2026] ? kvm_sched_clock_read+0x11/0x20 [Thu Jan 15 12:01:08 2026] kjournald2+0xa5/0x240 [Thu Jan 15 12:01:08 2026] ? __pfx_autoremove_wake_function+0x10/0x10 [Thu Jan 15 12:01:08 2026] ? __pfx_kjournald2+0x10/0x10 [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? 
__pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7300 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7300 tgid:7300 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? __do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_unbounded+0x1ac/0x2a0 [Thu Jan 15 12:01:08 2026] page_cache_sync_ra+0x243/0x250 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x158/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] ? 
__d_obtain_alias+0x7e/0x230 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7300 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:8:37820 <writer> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7302 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7302 tgid:7302 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? 
__do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] ? __pfx_lru_add+0x10/0x10 [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_order+0x261/0x430 [Thu Jan 15 12:01:08 2026] page_cache_async_ra+0x193/0x210 [Thu Jan 15 12:01:08 2026] filemap_readahead.isra.0+0x75/0xb0 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x35b/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? 
__pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7302 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:2:28110 <writer> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7303 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7303 tgid:7303 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? blk_mq_flush_plug_list+0x62/0x1e0 [Thu Jan 15 12:01:08 2026] ? __do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] wait_transaction_locked+0x87/0xd0 [Thu Jan 15 12:01:08 2026] ? __pfx_autoremove_wake_function+0x10/0x10 [Thu Jan 15 12:01:08 2026] add_transaction_credits+0x1e0/0x360 [Thu Jan 15 12:01:08 2026] ? 
iomap_dio_bio_iter+0x23a/0x4f0 [Thu Jan 15 12:01:08 2026] start_this_handle+0xfe/0x520 [Thu Jan 15 12:01:08 2026] jbd2__journal_start+0x107/0x1e0 [Thu Jan 15 12:01:08 2026] __ext4_journal_start_sb+0x13b/0x1a0 [Thu Jan 15 12:01:08 2026] ext4_dirty_inode+0x39/0x90 [Thu Jan 15 12:01:08 2026] __mark_inode_dirty+0x60/0x350 [Thu Jan 15 12:01:08 2026] generic_update_time+0x50/0x60 [Thu Jan 15 12:01:08 2026] touch_atime+0xca/0xf0 [Thu Jan 15 12:01:08 2026] ext4_file_read_iter+0xdf/0x200 [Thu Jan 15 12:01:08 2026] vfs_iocb_iter_read+0xff/0x160 [Thu Jan 15 12:01:08 2026] cachefiles_read+0x19e/0x400 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0x2ab/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_unbounded+0x1ac/0x2a0 [Thu Jan 15 12:01:08 2026] page_cache_sync_ra+0x243/0x250 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x158/0x790 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] ? nfs_getattr+0x2b4/0x540 [nfs] [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? 
__pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7304 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7304 tgid:7304 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? __do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] ? 
__pfx_lru_add+0x10/0x10 [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_order+0x261/0x430 [Thu Jan 15 12:01:08 2026] page_cache_async_ra+0x193/0x210 [Thu Jan 15 12:01:08 2026] filemap_readahead.isra.0+0x75/0xb0 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x35b/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7304 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:3:37809 <writer> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7306 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7306 tgid:7306 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? __do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_order+0x2c7/0x430 [Thu Jan 15 12:01:08 2026] page_cache_sync_ra+0x176/0x250 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x158/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] ? __d_obtain_alias+0x7e/0x230 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? 
__pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7306 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:6:29817 <writer> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7307 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7307 tgid:7307 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? 
__do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_order+0x261/0x430 [Thu Jan 15 12:01:08 2026] page_cache_sync_ra+0x176/0x250 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x158/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] ? __d_obtain_alias+0x7e/0x230 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? 
__pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7307 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:80:41180 <writer> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7308 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7308 tgid:7308 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? kmalloc_reserve+0xa4/0x100 [Thu Jan 15 12:01:08 2026] ? 
__do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_unbounded+0x203/0x2a0 [Thu Jan 15 12:01:08 2026] page_cache_ra_order+0x3ef/0x430 [Thu Jan 15 12:01:08 2026] page_cache_async_ra+0x193/0x210 [Thu Jan 15 12:01:08 2026] filemap_readahead.isra.0+0x75/0xb0 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x35b/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] ? __d_obtain_alias+0x7e/0x230 [Thu Jan 15 12:01:08 2026] ? nfs_attribute_cache_expired+0x39/0xb0 [nfs] [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? 
__pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7308 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:118:41218 <writer> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7310 blocked for more than 122 seconds. [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1 [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Jan 15 12:01:08 2026] task:nfsd state:D stack:0 pid:7310 tgid:7310 ppid:2 task_flags:0x240040 flags:0x00080000 [Thu Jan 15 12:01:08 2026] Call Trace: [Thu Jan 15 12:01:08 2026] <TASK> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0 [Thu Jan 15 12:01:08 2026] ? 
__do_sys_getcwd+0x1ce/0x1e0 [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0 [Thu Jan 15 12:01:08 2026] schedule_preempt_disabled+0xe/0x20 [Thu Jan 15 12:01:08 2026] rwsem_down_read_slowpath+0x278/0x500 [Thu Jan 15 12:01:08 2026] down_read+0x41/0xb0 [Thu Jan 15 12:01:08 2026] ext4_llseek+0xfc/0x120 [Thu Jan 15 12:01:08 2026] vfs_llseek+0x1c/0x40 [Thu Jan 15 12:01:08 2026] cachefiles_do_prepare_read+0xb4/0x380 [cachefiles] [Thu Jan 15 12:01:08 2026] cachefiles_prepare_read+0x32/0x50 [cachefiles] [Thu Jan 15 12:01:08 2026] netfs_read_to_pagecache+0xc6/0x4d0 [netfs] [Thu Jan 15 12:01:08 2026] netfs_readahead+0x165/0x240 [netfs] [Thu Jan 15 12:01:08 2026] nfs_netfs_readahead+0x1f/0x40 [nfs] [Thu Jan 15 12:01:08 2026] nfs_readahead+0x14d/0x280 [nfs] [Thu Jan 15 12:01:08 2026] read_pages+0x84/0x240 [Thu Jan 15 12:01:08 2026] page_cache_ra_unbounded+0x1ac/0x2a0 [Thu Jan 15 12:01:08 2026] page_cache_sync_ra+0x243/0x250 [Thu Jan 15 12:01:08 2026] filemap_get_pages+0x158/0x790 [Thu Jan 15 12:01:08 2026] ? iput+0x16/0x290 [Thu Jan 15 12:01:08 2026] ? __d_obtain_alias+0x7e/0x230 [Thu Jan 15 12:01:08 2026] filemap_splice_read+0x15e/0x380 [Thu Jan 15 12:01:08 2026] nfs_file_splice_read+0xba/0xf0 [nfs] [Thu Jan 15 12:01:08 2026] do_splice_read+0x63/0xe0 [Thu Jan 15 12:01:08 2026] splice_direct_to_actor+0xb0/0x260 [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_direct_splice_actor+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_splice_read+0x105/0x110 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_read+0xbc/0x1b0 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd3_proc_read+0x101/0x230 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd_dispatch+0xd6/0x1f0 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process_common+0x4bf/0x6c0 [sunrpc] [Thu Jan 15 12:01:08 2026] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] svc_process+0x15b/0x260 [sunrpc] [Thu Jan 15 12:01:08 2026] svc_recv+0x7ae/0x960 [sunrpc] [Thu Jan 15 12:01:08 2026] ? 
__pfx_nfsd+0x10/0x10 [nfsd] [Thu Jan 15 12:01:08 2026] nfsd+0x85/0xe0 [nfsd] [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290 [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10 [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30 [Thu Jan 15 12:01:08 2026] </TASK> [Thu Jan 15 12:01:08 2026] INFO: task nfsd:7310 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:95:41195 <writer> [Thu Jan 15 12:01:08 2026] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings [-- Attachment #3: knfsd-cachefiles-deadlock-analysis.md --] [-- Type: text/plain, Size: 5230 bytes --] # Analysis of KNFSD/CacheFiles Deadlock This document provides an analysis of a deadlock issue observed on Ubuntu 24.04 KNFSD nodes running a custom Linux kernel 6.19-rc5. The deadlock involves the NFS server (nfsd), CacheFiles/netfs layer, ext4 journaling, and memory compaction. ## Summary The system enters a deadlock state where: 1. Multiple **nfsd threads** are blocked waiting for ext4 inode rw-semaphores 2. Those rw-semaphores are held by **kworker threads** (writers) 3. The **jbd2 journal thread** for the cache filesystem is blocked waiting for updates 4. **kcompactd** (memory compaction) is blocked waiting for an NFS folio to release fscache state 5. The **fscache cookie state machine** is timing out ## Environment - **Kernel**: Linux 6.19.0-rc5-knfsd (custom build) - **Platform**: Amazon EC2 (24 CPUs, 192GB RAM) - **Configuration**: 256 nfsd threads - **Cache Filesystem**: ext4 on md127 RAID array ## Detailed Breakdown ### 1. 
Initial Symptom: Cookie State Timeouts (11:59:09)

The first warning signs appear as fscache cookie state timeout errors:

```text
netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
```

These errors indicate that fscache operations are waiting for cookies to transition to the correct state (`FSCACHE_COOKIE_STATE_ACTIVE`), but the state transitions are stalled.

### 2. The Deadlock Chain (12:01:08)

#### kcompactd0 (memory compaction daemon)

```text
folio_wait_private_2+0x2c/0x60
nfs_release_folio+0x61/0x130 [nfs]
filemap_release_folio+0x68/0xa0
__folio_split+0x178/0x8e0
```

The compaction daemon is trying to migrate/split NFS folios but is blocked in `folio_wait_private_2()`. This waits for the `PG_fscache` flag to clear, which happens when fscache completes its I/O operations on the folio. However, fscache operations are stuck.

#### jbd2/md127-8 (ext4 journal for cache filesystem)

```text
jbd2_journal_wait_updates+0x6e/0xe0
jbd2_journal_commit_transaction+0x26e/0x1730
```

The journal commit is waiting for outstanding updates to complete, but those updates are blocked.

#### nfsd threads (multiple patterns)

**Pattern A** - Blocked on ext4 inode rwsem:

```text
rwsem_down_read_slowpath+0x278/0x500
down_read+0x41/0xb0
ext4_llseek+0xfc/0x120         <-- needs inode->i_rwsem for SEEK_DATA/SEEK_HOLE
vfs_llseek+0x1c/0x40
cachefiles_do_prepare_read     <-- checking what's cached
```

The kernel logs explicitly identify the blocking relationship:

- `nfsd:7300 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:8:37820 <writer>`

**Pattern B** - Blocked on jbd2 transaction:

```text
wait_transaction_locked+0x87/0xd0
add_transaction_credits+0x1e0/0x360
jbd2__journal_start
ext4_dirty_inode               <-- updating atime on cache file
cachefiles_read
```

### 3. The Circular Dependency

```text
nfsd read request
        │
        ▼
nfs_readahead → netfs_readahead → cachefiles_prepare_read
        │
        ▼
ext4_llseek (needs i_rwsem READ lock)
        │
        ▼
BLOCKED: kworkers hold i_rwsem as WRITER
        │
        ├─────────────────────────────────────────────┐
        │                                             │
        ▼                                             ▼
Those kworkers are likely doing               jbd2 waiting for
cachefiles write operations                   updates to complete
        │                                             │
        ▼                                             │
Waiting for journal                                   │
        │                                             │
        └──────────────────►◄─────────────────────────┘
        │
        ▼
Memory pressure triggers kcompactd
which needs to release NFS folios
        │
        ▼
nfs_release_folio waits for fscache (PG_fscache/private_2)
        │
        ▼
fscache operations are stuck waiting
for the blocked operations above
```

## Root Cause Analysis

1. **ext4_llseek holding i_rwsem**: When CacheFiles uses `SEEK_DATA`/`SEEK_HOLE` to check what's cached, it takes the inode's rw-semaphore. If writers (kworkers doing cache writes) hold this semaphore, readers (the prepare_read path) block.
2. **Journal contention**: The ext4 journal on the cache filesystem (md127) is under heavy load. With 256 nfsd threads all potentially doing cache operations, journal contention is severe.
3. **Memory pressure + folio release**: When memory compaction tries to free NFS folios that have active fscache state, it must wait for fscache to complete, but fscache is blocked on ext4.
4. **Cookie state machine stalls**: The fscache cookie state transitions are blocked because the underlying I/O can't complete, leading to the timeout messages.

^ permalink raw reply	[flat|nested] 8+ messages in thread
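The `SEEK_DATA`/`SEEK_HOLE` probe called out in Pattern A and root cause 1 can be reproduced from userspace. The sketch below is illustrative only (the helper name is made up, not a kernel or libc API); it mirrors in miniature what `cachefiles_do_prepare_read()` issues through `vfs_llseek()` to decide whether a region is already cached, minus the in-kernel locking:

```c
/*
 * Illustrative userspace sketch: walk a possibly-sparse file and report
 * its data extents using lseek(SEEK_DATA)/lseek(SEEK_HOLE). On ext4 the
 * in-kernel equivalent of each of these seeks (ext4_llseek) takes the
 * inode's i_rwsem for read, which is the lock the hung nfsd threads in
 * the traces above are queued on.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Print every [start, end) data extent in fd; return the extent count,
 * or -1 on error. */
int count_data_extents(int fd)
{
	off_t end = lseek(fd, 0, SEEK_END);
	off_t pos = 0;
	int n = 0;

	if (end < 0)
		return -1;
	while (pos < end) {
		/* Find the start of the next data run at or after pos. */
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0 || data >= end)
			break;	/* ENXIO: only a trailing hole remains */
		/* Find where that data run ends. */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;
		printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
		pos = hole;
		n++;
	}
	return n;
}
```

Filesystems that don't track holes simply report the whole file as a single data extent, so the loop terminates after one iteration in that case.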
* Re: fscache/NFS re-export server lockup
2026-01-15 15:20 ` Mike Owen
@ 2026-01-15 21:53 ` Mike Snitzer
2026-03-09 11:35 ` Mike Owen
0 siblings, 1 reply; 8+ messages in thread
From: Mike Snitzer @ 2026-01-15 21:53 UTC (permalink / raw)
To: Mike Owen; +Cc: Daire Byrne, dhowells, linux-nfs, netfs, trondmy, okorniev

On Thu, Jan 15, 2026 at 03:20:24PM +0000, Mike Owen wrote:
> Hi Mike S,
>
> On 12/01/2026 15:16, Mike Owen wrote:
> > Ah, this looks promising. Thanks for the info, Mike!
> > Whilst we wait for the necessary NFS client fixes PR, I'll look to add
> > the patch to 6.19-rc5 and report back if this fixes the issue we are
> > seeing.
> > I'll see what I can do internally to advocate Canonical absorbing it as well.
> > ACK on my log wrapping. My bad.
> > Thanks again!
> > -Mike
> >
> > On Sat, 10 Jan 2026 at 09:48, Mike Snitzer <snitzer@kernel.org> wrote:
> >>
> >> Hi Mike,
> >>
> >> On Fri, Jan 09, 2026 at 09:45:47PM +0000, Mike Owen wrote:
> >>> Hi Daire, thanks for the comments.
> >>>
> >>>> Can you stop the nfs server and is access to /var/cache/fscache still blocked?
> >>> As the machine is deadlocked, after reboot (so the nfs server is
> >>> definitely stopped), the actual data is gone/corrupted.
> >>>
> >>>> And I presume there is definitely nothing else that might be
> >>> interacting with that /var/cache/fscache filesystem outside of fscache
> >>> or cachefilesd?
> >>> Correct. Machine is dedicated to KNFSD caching duties.
> >>>
> >>>> Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%).
> >>> Similar settings here:
> >>> brun 20%
> >>> bcull 7%
> >>> bstop 3%
> >>> frun 20%
> >>> fcull 7%
> >>> fstop 3%
> >>> Although I should note that the issue happens when only ~10-20% of the
> >>> NVMe capacity is used, so culling has never had to run at this point.
> >>>
> >>> We did try running 6.17.0 but made no difference. I see another thread
> >>> of yours with Chuck: "refcount_t underflow (nfsd4_sequence_done?) with
> >>> v6.18 re-export"
> >>> and suggested commits to investigate, incl: cbfd91d22776 ("nfsd: never
> >>> defer requests during idmap lookup") as well as try using 6.18.4, so
> >>> it's possible there is a cascading issue here and we are in need of
> >>> some NFS patches.
> >>>
> >>> I'm hoping @dhowells might have some suggestions on how to further
> >>> debug this issue, given the below stack we are seeing when it
> >>> deadlocks?
> >>>
> >>> Thanks,
> >>> -Mike
> >>
> >> This commit from Trond, which he'll be sending to Linus soon as part
> >> of his 6.19-rc NFS client fixes pull request, should fix the NFS
> >> re-export induced nfs_release_folio deadlock reflected in your below
> >> stack trace:
> >> https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=cce0be6eb4971456b703aaeafd571650d314bcca
> >>
> >> Here is more context for why I know that to be likely, it fixed my
> >> nasty LOCALIO-based reexport deadlock situation too:
> >> https://lore.kernel.org/linux-nfs/20260107160858.6847-1-snitzer@kernel.org/
> >>
> >> I'm doing my part to advocate that Red Hat (Olga cc'd) take this
> >> fix into RHEL 10.2 (and backport as needed).
> >>
> >> Good luck getting Ubuntu to include this fix in a timely manner (we'll
> >> all thank you for that if you can help shake the Canonical tree).
> >>
> >> BTW, you'd do well to fix your editor/email so that it doesn't line
> >> wrap when you share logs on Linux mailing lists:
> >>
> ...
>
> We deployed 6.19-rc5 + kernel patch: "NFS: Fix a deadlock involving nfs_release_folio()" but unfortunately this has not fixed the issue. The KNFSD server becomes wedged (as far as NFSD is concerned, can still login) and we get the attached dmesg log. I attempted an analysis/RCA of this circular dependency/deadlock issue to try and assist getting this resolved (see attached). Any other patches to try?
> -Mike > [Thu Jan 15 10:37:38 2026] md127: array will not be assembled in old kernels that lack configurable LBS support (<= 6.18) > [Thu Jan 15 10:37:38 2026] md127: detected capacity change from 0 to 29296345088 > [Thu Jan 15 10:38:46 2026] EXT4-fs (md127): mounted filesystem cc12ea7f-34f9-445a-b8ad-a9f1a122c51d r/w with ordered data mode. Quota mode: none. > [Thu Jan 15 10:38:46 2026] netfs: FS-Cache loaded > [Thu Jan 15 10:38:46 2026] CacheFiles: Loaded > [Thu Jan 15 10:38:46 2026] netfs: Cache "mycache" added (type cachefiles) > [Thu Jan 15 10:38:46 2026] CacheFiles: File cache on mycache registered > [Thu Jan 15 10:39:07 2026] NFSD: Using nfsdcld client tracking operations. > [Thu Jan 15 10:39:07 2026] NFSD: no clients to reclaim, skipping NFSv4 grace period (net effffff9) > [Thu Jan 15 10:39:28 2026] audit: type=1400 audit(1768473568.836:130): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="rsyslogd" pid=8112 comm="apparmor_parser" > [Thu Jan 15 10:40:39 2026] hrtimer: interrupt took 10300 ns > [Thu Jan 15 10:42:35 2026] loop4: detected capacity change from 0 to 98480 > [Thu Jan 15 10:42:35 2026] audit: type=1400 audit(1768473755.481:131): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine" pid=9736 comm="apparmor_parser" > [Thu Jan 15 10:42:35 2026] audit: type=1400 audit(1768473755.482:132): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=9736 comm="apparmor_parser" > [Thu Jan 15 10:42:40 2026] loop5: detected capacity change from 0 to 8 > [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.184:133): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine" pid=10068 comm="apparmor_parser" > [Thu Jan 15 10:42:40 2026] audit: type=1400 
audit(1768473760.194:134): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=10068 comm="apparmor_parser"
> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.201:135): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.amazon-ssm-agent.amazon-ssm-agent" pid=10071 comm="apparmor_parser"
> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.201:136): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.amazon-ssm-agent.ssm-cli" pid=10072 comm="apparmor_parser"
> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.207:137): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap-update-ns.amazon-ssm-agent" pid=10070 comm="apparmor_parser"
> [Thu Jan 15 11:59:09 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:09 2026] netfs: O-cookie c=0000df95 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:09 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:09 2026] netfs: O-key=[32] 'f360d67b060000003414386507000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e07f [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b0600000020301c6207000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e1a1 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e01c [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000e14d [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b0600000071149ebc06000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000b381 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000bfd9ce3407000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000df8c [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b060000006b7a9b3a07000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000317d6c3207000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000dbbc [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie c=0000b2c0 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:10 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b0600000038c7f06f06000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000e1483c3807000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:10 2026] netfs: O-key=[32] 'f360d67b06000000e811572e07000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:11 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:11 2026] netfs: O-cookie c=0000dc24 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:11 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:11 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:11 2026] netfs: O-cookie c=0000c542 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:11 2026] netfs: O-key=[32] 'f360d67b06000000af19d78706000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:11 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:11 2026] netfs: O-key=[32] 'f360d67b060000003d594c7508000000ffffffff000000000200c10901000000'
> [Thu Jan 15 11:59:11 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> [Thu Jan 15 11:59:11 2026] netfs: O-cookie c=0000cea9 [fl=6124 na=1 nA=2 s=L]
> [Thu Jan 15 11:59:11 2026] netfs: O-cookie V=00000002 [Infs,3.0,2,,2109120a,7bd660f3,,,d0,100000,100000,927c0,927c0,927c0,927c0,1]
> [Thu Jan 15 11:59:11 2026] netfs: O-key=[32] 'f360d67b0600000016b7ae5907000000ffffffff000000000200c10901000000'
> [Thu Jan 15 12:01:08 2026] INFO: task kcompactd0:170 blocked for more than 122 seconds.
> [Thu Jan 15 12:01:08 2026] Tainted: G OE 6.19.0-rc5-knfsd #1
> [Thu Jan 15 12:01:08 2026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Jan 15 12:01:08 2026] task:kcompactd0 state:D stack:0 pid:170 tgid:170 ppid:2 task_flags:0x210040 flags:0x00080000
> [Thu Jan 15 12:01:08 2026] Call Trace:
> [Thu Jan 15 12:01:08 2026] <TASK>
> [Thu Jan 15 12:01:08 2026] __schedule+0x481/0x17d0
> [Thu Jan 15 12:01:08 2026] ? __pfx_radix_tree_node_ctor+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ? cgroup_writeback_by_id+0x4b/0x200
> [Thu Jan 15 12:01:08 2026] ? xas_store+0x5b/0x7f0
> [Thu Jan 15 12:01:08 2026] schedule+0x20/0xe0
> [Thu Jan 15 12:01:08 2026] io_schedule+0x4c/0x80
> [Thu Jan 15 12:01:08 2026] folio_wait_bit_common+0x133/0x310
> [Thu Jan 15 12:01:08 2026] ? __pfx_wake_page_function+0x10/0x10
> [Thu Jan 15 12:01:08 2026] folio_wait_private_2+0x2c/0x60
> [Thu Jan 15 12:01:08 2026] nfs_release_folio+0x61/0x130 [nfs]
> [Thu Jan 15 12:01:08 2026] filemap_release_folio+0x68/0xa0
> [Thu Jan 15 12:01:08 2026] __folio_split+0x178/0x8e0
> [Thu Jan 15 12:01:08 2026] ? post_alloc_hook+0xc1/0x140
> [Thu Jan 15 12:01:08 2026] __split_huge_page_to_list_to_order+0x2b/0xb0
> [Thu Jan 15 12:01:08 2026] split_folio_to_list+0x10/0x20
> [Thu Jan 15 12:01:08 2026] migrate_pages_batch+0x45d/0xea0
> [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_alloc+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_free+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ? asm_exc_xen_unknown_trap+0x1/0x20
> [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_free+0x10/0x10
> [Thu Jan 15 12:01:08 2026] migrate_pages+0xaef/0xd80
> [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_free+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ? __pfx_compaction_alloc+0x10/0x10
> [Thu Jan 15 12:01:08 2026] compact_zone+0xb3f/0x1200
> [Thu Jan 15 12:01:08 2026] ? psi_group_change+0x1f8/0x4c0
> [Thu Jan 15 12:01:08 2026] ? kvm_sched_clock_read+0x11/0x20
> [Thu Jan 15 12:01:08 2026] compact_node+0xaf/0x130
> [Thu Jan 15 12:01:08 2026] kcompactd+0x374/0x4d0
> [Thu Jan 15 12:01:08 2026] ? __pfx_autoremove_wake_function+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ? __pfx_kcompactd+0x10/0x10
> [Thu Jan 15 12:01:08 2026] kthread+0xf9/0x210
> [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ret_from_fork+0x25c/0x290
> [Thu Jan 15 12:01:08 2026] ? __pfx_kthread+0x10/0x10
> [Thu Jan 15 12:01:08 2026] ret_from_fork_asm+0x1a/0x30
> [Thu Jan 15 12:01:08 2026] </TASK>

OK, my deadlock was blocked in wait_on_commit whereas you've
consistently shown folio_wait_private_2 in your stack traces (I
originally missed that detail). So I'm not aware of what is different
in your setup... the devil is in the details, but maybe it's your use
of fscache?

> # Analysis of KNFSD/CacheFiles Deadlock
>
> This document provides an analysis of a deadlock issue observed on
> Ubuntu 24.04 KNFSD nodes running a custom Linux kernel 6.19-rc5. The
> deadlock involves the NFS server (nfsd), the CacheFiles/netfs layer,
> ext4 journaling, and memory compaction.
>
> ## Summary
>
> The system enters a deadlock state where:
>
> 1. Multiple **nfsd threads** are blocked waiting for ext4 inode rw-semaphores
> 2. Those rw-semaphores are held by **kworker threads** (writers)
> 3. The **jbd2 journal thread** for the cache filesystem is blocked waiting for updates
> 4. **kcompactd** (memory compaction) is blocked waiting for an NFS folio to release fscache state
> 5. The **fscache cookie state machine** is timing out

David Howells may be able to bring his fscache-specific vantage point
to bear by having a look at Trond's fix that I pointed to earlier?

Mike

ps. solid effort with your below analysis, but it's quite fscache
specific so I think it'll best help inform David:

>
> ## Environment
>
> - **Kernel**: Linux 6.19.0-rc5-knfsd (custom build)
> - **Platform**: Amazon EC2 (24 CPUs, 192GB RAM)
> - **Configuration**: 256 nfsd threads
> - **Cache Filesystem**: ext4 on md127 RAID array
>
> ## Detailed Breakdown
>
> ### 1. Initial Symptom: Cookie State Timeouts (11:59:09)
>
> The first warning signs appear as fscache cookie state timeout errors:
>
> ```text
> netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
> ```
>
> These errors indicate that fscache operations are waiting for cookies
> to transition to the correct state (`FSCACHE_COOKIE_STATE_ACTIVE`),
> but the state transitions are stalled.
>
> ### 2. The Deadlock Chain (12:01:08)
>
> #### kcompactd0 (memory compaction daemon)
>
> ```text
> folio_wait_private_2+0x2c/0x60
> nfs_release_folio+0x61/0x130 [nfs]
> filemap_release_folio+0x68/0xa0
> __folio_split+0x178/0x8e0
> ```
>
> The compaction daemon is trying to migrate/split NFS folios but is
> blocked in `folio_wait_private_2()`. This waits for the `PG_fscache`
> flag to clear, which happens when fscache completes its I/O
> operations on the folio. However, fscache operations are stuck.
>
> #### jbd2/md127-8 (ext4 journal for cache filesystem)
>
> ```text
> jbd2_journal_wait_updates+0x6e/0xe0
> jbd2_journal_commit_transaction+0x26e/0x1730
> ```
>
> The journal commit is waiting for outstanding updates to complete,
> but those updates are themselves blocked.
>
> #### nfsd threads (multiple patterns)
>
> **Pattern A** - Blocked on ext4 inode rwsem:
>
> ```text
> rwsem_down_read_slowpath+0x278/0x500
> down_read+0x41/0xb0
> ext4_llseek+0xfc/0x120        <-- needs inode->i_rwsem for SEEK_DATA/SEEK_HOLE
> vfs_llseek+0x1c/0x40
> cachefiles_do_prepare_read    <-- checking what's cached
> ```
>
> The kernel logs explicitly identify the blocking relationship:
>
> - `nfsd:7300 <reader> blocked on an rw-semaphore likely owned by task kworker/u96:8:37820 <writer>`
>
> **Pattern B** - Blocked on a jbd2 transaction:
>
> ```text
> wait_transaction_locked+0x87/0xd0
> add_transaction_credits+0x1e0/0x360
> jbd2__journal_start
> ext4_dirty_inode              <-- updating atime on cache file
> cachefiles_read
> ```
>
> ### 3. The Circular Dependency
>
> ```text
> nfsd read request
>     │
>     ▼
> nfs_readahead → netfs_readahead → cachefiles_prepare_read
>     │
>     ▼
> ext4_llseek (needs i_rwsem READ lock)
>     │
>     ▼
> BLOCKED: kworkers hold i_rwsem as WRITER
>     │
>     ├──────────────────────────────────┐
>     │                                  │
>     ▼                                  ▼
> Those kworkers are likely doing    jbd2 waiting for
> cachefiles write operations        updates to complete
>     │                                  │
>     ▼                                  │
> Waiting for journal                    │
>     │                                  │
>     └───────────────►◄─────────────────┘
>     │
>     ▼
> Memory pressure triggers kcompactd,
> which needs to release NFS folios
>     │
>     ▼
> nfs_release_folio waits for
> fscache (PG_fscache/private_2)
>     │
>     ▼
> fscache operations are stuck
> waiting for the blocked operations above
> ```
>
> ## Root Cause Analysis
>
> 1. **ext4_llseek holding i_rwsem**: When CacheFiles uses
>    `SEEK_DATA`/`SEEK_HOLE` to check what's cached, it takes the
>    inode's rw-semaphore. If writers (kworkers doing cache writes)
>    hold this semaphore, readers (the prepare_read path) block.
>
> 2. **Journal contention**: The ext4 journal on the cache filesystem
>    (md127) is under heavy load. With 256 nfsd threads all potentially
>    doing cache operations, journal contention is severe.
>
> 3. **Memory pressure + folio release**: When memory compaction tries
>    to free NFS folios that have active fscache state, it must wait
>    for fscache to complete, but fscache is blocked on ext4.
>
> 4. **Cookie state machine stalls**: The fscache cookie state
>    transitions are blocked because the underlying I/O can't complete,
>    leading to the timeout messages.
* Re: fscache/NFS re-export server lockup
  2026-01-15 21:53 ` Mike Snitzer
@ 2026-03-09 11:35 ` Mike Owen
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Owen @ 2026-03-09 11:35 UTC (permalink / raw)
To: linux-nfs, netfs; +Cc: Daire Byrne, dhowells, trondmy, okorniev, Mike Snitzer

For anyone following this thread, the solution was a broad set of
parameter tuning (mkfs/mount options, block device tuning, and VM
sysctls) combined with switching the cache filesystem from ext4 to
XFS. We also saw sizeable performance gains across the stack.

So no FS-Cache issue here, as previously implied. Thanks to those who
provided feedback.

-Mike

On 15/01/2026 21:53, Mike Snitzer wrote:
> On Thu, Jan 15, 2026 at 03:20:24PM +0000, Mike Owen wrote:
>> Hi Mike S,
>>
>> On 12/01/2026 15:16, Mike Owen wrote:
>>> Ah, this looks promising. Thanks for the info, Mike!
>>> Whilst we wait for the necessary NFS client fixes PR, I'll look to add
>>> the patch to 6.19-rc5 and report back if this fixes the issue we are
>>> seeing.
>>> I'll see what I can do internally to advocate Canonical absorbing it as well.
>>> ACK on my log wrapping. My bad.
>>> Thanks again!
>>> -Mike
>>>
>>> On Sat, 10 Jan 2026 at 09:48, Mike Snitzer <snitzer@kernel.org> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> On Fri, Jan 09, 2026 at 09:45:47PM +0000, Mike Owen wrote:
>>>>> Hi Daire, thanks for the comments.
>>>>>
>>>>>> Can you stop the nfs server and is access to /var/cache/fscache still blocked?
>>>>> As the machine is deadlocked, after reboot (so the nfs server is
>>>>> definitely stopped), the actual data is gone/corrupted.
>>>>>
>>>>>> And I presume there is definitely nothing else that might be
>>>>>> interacting with that /var/cache/fscache filesystem outside of fscache
>>>>>> or cachefilesd?
>>>>> Correct. Machine is dedicated to KNFSD caching duties.
>>>>>
>>>>>> Our /etc/cachefilesd.conf is pretty basic (brun 30%, bcull 10%, bstop 3%).
>>>>> Similar settings here:
>>>>> brun 20%
>>>>> bcull 7%
>>>>> bstop 3%
>>>>> frun 20%
>>>>> fcull 7%
>>>>> fstop 3%
>>>>> Although I should note that the issue happens when only ~10-20% of the
>>>>> NVMe capacity is used, so culling has never had to run at this point.
>>>>>
>>>>> We did try running 6.17.0 but it made no difference. I see another
>>>>> thread of yours with Chuck: "refcount_t underflow
>>>>> (nfsd4_sequence_done?) with v6.18 re-export"
>>>>> and suggested commits to investigate, incl: cbfd91d22776 ("nfsd: never
>>>>> defer requests during idmap lookup") as well as try using 6.18.4, so
>>>>> it's possible there is a cascading issue here and we are in need of
>>>>> some NFS patches.
>>>>>
>>>>> I'm hoping @dhowells might have some suggestions on how to further
>>>>> debug this issue, given the below stack we are seeing when it
>>>>> deadlocks?
>>>>>
>>>>> Thanks,
>>>>> -Mike
>>>>
>>>> This commit from Trond, which he'll be sending to Linus soon as part
>>>> of his 6.19-rc NFS client fixes pull request, should fix the NFS
>>>> re-export induced nfs_release_folio deadlock reflected in your below
>>>> stack trace:
>>>> https://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=cce0be6eb4971456b703aaeafd571650d314bcca
>>>>
>>>> Here is more context for why I know that to be likely, it fixed my
>>>> nasty LOCALIO-based reexport deadlock situation too:
>>>> https://lore.kernel.org/linux-nfs/20260107160858.6847-1-snitzer@kernel.org/
>>>>
>>>> I'm doing my part to advocate that Red Hat (Olga cc'd) take this
>>>> fix into RHEL 10.2 (and backport as needed).
>>>>
>>>> Good luck getting Ubuntu to include this fix in a timely manner (we'll
>>>> all thank you for that if you can help shake the Canonical tree).
>>>>
>>>> BTW, you'd do well to fix your editor/email so that it doesn't line
>>>> wrap when you share logs on Linux mailing lists:
>>>>
>> ...
>>
>> We deployed 6.19-rc5 + kernel patch: "NFS: Fix a deadlock involving
>> nfs_release_folio()" but unfortunately this has not fixed the issue.
>> The KNFSD server becomes wedged (as far as NFSD is concerned; we can
>> still log in) and we get the attached dmesg log. I attempted an
>> analysis/RCA of this circular dependency/deadlock to try and assist
>> in getting it resolved (see attached). Any other patches to try?
>> -Mike
>
>> [Thu Jan 15 10:37:38 2026] md127: array will not be assembled in old kernels that lack configurable LBS support (<= 6.18)
>> [Thu Jan 15 10:37:38 2026] md127: detected capacity change from 0 to 29296345088
>> [Thu Jan 15 10:38:46 2026] EXT4-fs (md127): mounted filesystem cc12ea7f-34f9-445a-b8ad-a9f1a122c51d r/w with ordered data mode. Quota mode: none.
>> [Thu Jan 15 10:38:46 2026] netfs: FS-Cache loaded
>> [Thu Jan 15 10:38:46 2026] CacheFiles: Loaded
>> [Thu Jan 15 10:38:46 2026] netfs: Cache "mycache" added (type cachefiles)
>> [Thu Jan 15 10:38:46 2026] CacheFiles: File cache on mycache registered
>> [Thu Jan 15 10:39:07 2026] NFSD: Using nfsdcld client tracking operations.
>> [Thu Jan 15 10:39:07 2026] NFSD: no clients to reclaim, skipping NFSv4 grace period (net effffff9)
>> [Thu Jan 15 10:39:28 2026] audit: type=1400 audit(1768473568.836:130): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="rsyslogd" pid=8112 comm="apparmor_parser"
>> [Thu Jan 15 10:40:39 2026] hrtimer: interrupt took 10300 ns
>> [Thu Jan 15 10:42:35 2026] loop4: detected capacity change from 0 to 98480
>> [Thu Jan 15 10:42:35 2026] audit: type=1400 audit(1768473755.481:131): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine" pid=9736 comm="apparmor_parser"
>> [Thu Jan 15 10:42:35 2026] audit: type=1400 audit(1768473755.482:132): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=9736 comm="apparmor_parser"
>> [Thu Jan 15 10:42:40 2026] loop5: detected capacity change from 0 to 8
>> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.184:133): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine" pid=10068 comm="apparmor_parser"
>> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.194:134): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/snapd/25935/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=10068 comm="apparmor_parser"
>> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.201:135): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.amazon-ssm-agent.amazon-ssm-agent" pid=10071 comm="apparmor_parser"
>> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.201:136): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.amazon-ssm-agent.ssm-cli" pid=10072 comm="apparmor_parser"
>> [Thu Jan 15 10:42:40 2026] audit: type=1400 audit(1768473760.207:137): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap-update-ns.amazon-ssm-agent" pid=10070 comm="apparmor_parser"
>> [Thu Jan 15 11:59:09 2026] netfs: fscache_begin_operation: cookie state change wait timed out: cookie->state=1 state=1
>> [... the remainder of the attached dmesg (the netfs cookie-timeout flood and the kcompactd hung-task trace) is identical to the log quoted in full in Mike Snitzer's reply above and is trimmed here ...]
>
> [... Mike Snitzer's reply and the attached "Analysis of KNFSD/CacheFiles Deadlock" document also appear in full in his message above and are trimmed here ...]
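For readers trying to follow the RCA quoted above: the blocked parties form a classic wait-for cycle, and the diagrammed dependencies can be sanity-checked mechanically by treating them as a directed graph and searching for a loop. A toy model, with node names transcribed from the "Circular Dependency" diagram; this is purely illustrative and does not inspect a live kernel:

```python
# Toy wait-for-graph model of the deadlock described in the analysis.
# An edge (A, B) means "A is blocked waiting on something B holds".

def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    def dfs(node, path, on_path):
        if node in on_path:
            # Found a back-edge; return the cycle portion of the path.
            return path[path.index(node):]
        for nxt in graph.get(node, ()):
            found = dfs(nxt, path + [node], on_path | {node})
            if found:
                return found
        return None

    for start in list(graph):
        cycle = dfs(start, [], set())
        if cycle:
            return cycle
    return None

# Edges transcribed from the "Circular Dependency" diagram:
waits_on = [
    ("nfsd (prepare_read)", "kworker (cache write)"),    # i_rwsem read vs write
    ("kworker (cache write)", "jbd2 (journal commit)"),  # waiting for journal
    ("jbd2 (journal commit)", "kworker (cache write)"),  # waiting for updates
    ("kcompactd", "fscache I/O"),                        # PG_private_2 on folio
    ("fscache I/O", "nfsd (prepare_read)"),              # stuck behind reads
]

cycle = find_cycle(waits_on)
print(" -> ".join(cycle + [cycle[0]]))
# -> kworker (cache write) -> jbd2 (journal commit) -> kworker (cache write)
```

The core loop it reports is the kworker/jbd2 mutual wait; the kcompactd and nfsd arrows in the diagram hang off that cycle rather than forming it.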
end of thread, other threads: [~2026-03-09 11:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2025-12-19 15:59 fscache/NFS re-export server lockup Mike Owen
2025-12-19 17:01 ` Daire Byrne
2026-01-09 21:45 ` Mike Owen
2026-01-10  9:48 ` Mike Snitzer
2026-01-12 15:16 ` Mike Owen
2026-01-15 15:20 ` Mike Owen
2026-01-15 21:53 ` Mike Snitzer
2026-03-09 11:35 ` Mike Owen