Hi,
We are seeing a reproducible NFSv4.1 client hang under a concurrent workload.
Environment:
- Two independent Linux clients (VMs)
- Both mount the same Windows NFS server (NFSv4.1)
- Kernel version: 6.1.78
- Mount options: vers=4.1,soft,proto=tcp,timeo=60,retrans=10
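For reference, timeo is expressed in tenths of a second (per nfs(5)), so timeo=60 means a 6-second initial RPC timeout. A minimal Python sketch that decodes the option string above; parse_nfs_opts is an illustrative helper name of ours, and the sketch does not attempt to model the kernel's backoff or major-timeout calculation.

```python
def parse_nfs_opts(opts: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flags without a value (e.g. "soft") map to True.
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.strip().partition("=")
        parsed[key] = value if sep else True
    return parsed


opts = parse_nfs_opts("vers=4.1,soft,proto=tcp,timeo=60,retrans=10")

# timeo is given in deciseconds (tenths of a second), per nfs(5).
initial_timeout_s = int(opts["timeo"]) / 10
retrans = int(opts["retrans"])

print(f"initial RPC timeout: {initial_timeout_s} s, retrans: {retrans}")
```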
Workload:
- Each client copies ~5GB files to the same NFS share
- Copy runs every 30 minutes
- After ~41 iterations (~20 hours), both clients hang simultaneously
Symptoms:
- All NFS operations (ls, df, rsync, cp) hang in D state
- No NFS RPC traffic is observed on the wire (tcpdump shows only TCP ACKs)
- nfsstat shows retrans=0
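In case it helps anyone reproduce the observation, the D-state tasks can be listed without extra tools by reading /proc. This is a rough sketch rather than the exact script we ran: parse_stat_line and d_state_tasks are illustrative names, it only looks at /proc/<pid> entries (not per-thread entries under task/), and /proc/<pid>/wchan only shows a symbol name when the kernel exposes one.

```python
import os


def parse_stat_line(line: str):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    left, _, right = line.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def d_state_tasks(proc_root: str = "/proc"):
    """Yield (pid, comm, wchan) for tasks in uninterruptible sleep (D)."""
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
            if state != "D":
                continue
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except (OSError, ValueError, IndexError):
            continue  # task exited mid-read or file not readable
        yield pid, comm, wchan


if __name__ == "__main__":
    for pid, comm, wchan in d_state_tasks():
        print(pid, comm, wchan)
```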
SysRq stack traces show:
NFS state manager thread:
  nfs4_run_state_manager
  nfs4_drain_slot_tbl
  wait_for_completion_interruptible
User processes:
  rpc_wait_bit_killable
  nfs4_proc_getattr
  nfs4_run_open_task
Both clients exhibit identical behavior at the same time.
This suggests that:
- The client enters NFSv4.1 state recovery
- nfs4_drain_slot_tbl waits for slots to drain
- At least one in-use slot is never released, so the drain never finishes
- All further RPCs are blocked behind the drain
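To make our reading of the stacks concrete, here is a toy Python model of the behavior we think we are seeing. It is not the kernel implementation: SlotTable, acquire, release and drain are illustrative names, new requests are simply refused while draining (the real client queues them), and the timeout on drain exists only so the demo terminates.

```python
import threading


class SlotTable:
    """Toy model of a session slot table; not the kernel's data structure."""

    def __init__(self, nslots: int):
        self.nslots = nslots
        self.free = set(range(nslots))
        self.draining = False
        self.cond = threading.Condition()

    def acquire(self):
        """Take a free slot, or return None if draining or none is free."""
        with self.cond:
            if self.draining or not self.free:
                return None
            return self.free.pop()

    def release(self, slot: int) -> None:
        """Return a slot to the table and wake any waiting drainer."""
        with self.cond:
            self.free.add(slot)
            self.cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop handing out slots and wait until every slot is free.

        Returns True once the table is fully drained, False on timeout.
        """
        with self.cond:
            self.draining = True
            return self.cond.wait_for(
                lambda: len(self.free) == self.nslots, timeout
            )


table = SlotTable(nslots=2)
first = table.acquire()
stuck = table.acquire()   # stands in for an RPC that never completes

table.release(first)      # one request finishes normally
drained = table.drain(timeout=0.1)
blocked = table.acquire() is None

print("drained:", drained, "new requests blocked:", blocked)
```

In this model a single unreleased slot is enough to keep drain waiting and every new request blocked, which matches what we see from the state manager and the user processes.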
Questions:
1. Is it expected that nfs4_drain_slot_tbl can block indefinitely?
2. What conditions can cause a slot to never be released?
3. Should the client force a session reset instead of waiting forever?
4. Is this a known interoperability issue with the Windows NFSv4.1 server?
We can provide additional logs if needed.
Thanks.