public inbox for linux-nfs@vger.kernel.org
From: Mike Owen <mjnowen@gmail.com>
To: linux-nfs@vger.kernel.org, netfs@lists.linux.dev
Subject: WARNING in nfs_free_request: PG_WB_END stale during NFS read completion with FS-Cache
Date: Wed, 1 Apr 2026 18:54:15 +0100	[thread overview]
Message-ID: <3820f743-ef30-4ace-bad0-fea2f97398fc@gmail.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 3098 bytes --]

I am running a fleet of KNFSD NFS re-export proxy instances with kernel 6.19.4 (vanilla stable, compiled from source with no custom patches). Under production
NFS read load, 13 of 16 instances hit a `WARN_ON_ONCE` in `nfs_free_request()` at `fs/nfs/pagelist.c:587`. The WARNING fires in the NFS read completion path and indicates that the `PG_WB_END` page-group synchronisation flag is unexpectedly still set when an `nfs_page` request is being freed.

The WARNING does not cause a crash or data corruption, but it reproduces consistently across the majority of instances under load.

A detailed write-up, including reproducer steps, is attached as "nfs-free-request-warning-report.md".

WARNING: fs/nfs/pagelist.c:587 at nfs_free_request+0x175/0x1e0 [nfs], CPU#10: kworker/u96:47/131414
Modules linked in: tls tcp_diag udp_diag inet_diag nfsd auth_rpcgss nfsv3 nfs_acl nfs lockd grace cachefiles netfs xfs raid0 ...
CPU: 10 UID: 0 PID: 131414 Comm: kworker/u96:47 Tainted: G           OE       6.19.4-knfsd #1 VOLUNTARY
Hardware name: Amazon EC2 i3en.6xlarge/, BIOS 1.0 10/16/2017
Workqueue: nfsiod rpc_async_release [sunrpc]
RIP: 0010:nfs_free_request+0x175/0x1e0 [nfs]
Code: 48 8b 43 38 84 c0 0f 89 ce fe ff ff 0f 0b 48 8b 43 38 f6 c4 01 0f 84 cc fe ff ff 0f 0b 48 8b 43 38 f6 c4 02 0f 84 ca fe ff ff <0f> 0b 48 8b 43 38 f6 c4 04 0f 84 c8 fe ff ff 0f 0b 48 8b 43 38 f6
RSP: 0018:ffffd1de8a523d18 EFLAGS: 00010202
RAX: 0000000000000204 RBX: ffff8aa811146180 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8aa811146180
RBP: ffffd1de8a523d28 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aa811146180
R13: ffff8aa811146180 R14: 0000000000000001 R15: ffff8ac45ab8a298
FS:  0000000000000000(0000) GS:ffff8ad5a4b3c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007339078ee200 CR3: 000000015b244005 CR4: 00000000007706f0
PKRU: 55555554
Call Trace:
 <TASK>
 nfs_page_group_destroy+0x65/0xc0 [nfs]
 nfs_page_group_destroy+0x9f/0xc0 [nfs]
 nfs_release_request+0x36/0x60 [nfs]
 nfs_readpage_release.isra.0+0x37/0x80 [nfs]
 nfs_read_completion+0x80/0x170 [nfs]
 nfs_pgio_release+0x16/0x20 [nfs]
 rpc_free_task+0x3f/0x80 [sunrpc]
 rpc_async_release+0x33/0x50 [sunrpc]
 process_one_work+0x182/0x350
 worker_thread+0x19a/0x320
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xf9/0x210
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x25c/0x290
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
---[ end trace 0000000000000000 ]---

Questions:

1. Is this a known issue in the NFS page-group synchronisation code
   when FS-Cache/cachefiles is active?

2. Could the netfs/cachefiles write-back path (writing read data to
   the cache) inadvertently set `PG_WB_END` on `nfs_page` structures
   during a read operation?

3. Is this specific to the re-export (nfsd + NFS client on same host)
   configuration, or could it affect any NFS client with FS-Cache
   enabled?

Any guidance or patches to address this would be greatly appreciated.

Thank you for reading.
-Mike

[-- Attachment #2: nfs-free-request-warning-report.md --]
[-- Type: text/markdown, Size: 10408 bytes --]

# WARNING in nfs_free_request: PG_WB_END stale in NFS read completion path

## Summary

I am running a fleet of KNFSD NFS re-export proxy instances with kernel 6.19.4 (vanilla
stable, compiled from source with no custom patches). Under production
NFS read load, 13 of 16 instances hit a `WARN_ON_ONCE` in
`nfs_free_request()` at `fs/nfs/pagelist.c:587`. The WARNING fires in
the NFS read completion path and indicates that the `PG_WB_END`
page-group synchronisation flag is unexpectedly still set when an
`nfs_page` request is being freed.

The WARNING does not cause a crash or data corruption, but it
reproduces consistently across the majority of instances under load.

## Environment

- **Kernel**: 6.19.4 (vanilla stable from kernel.org, no custom patches)
- **Hardware**: Amazon EC2 i3en.6xlarge (Intel Xeon 8259CL, 24 vCPUs, 192GB RAM, 2× 7.5TB NVMe instance store)
- **Architecture**: NFS re-export proxy (see diagram below)
- **FS-Cache**: Enabled via `cachefiles` on XFS-formatted RAID0 of local NVMe (mounted at `/var/cache/fscache`)

### NFS Re-Export Proxy Architecture

The KNFSD proxy sits between NFS clients and a source NFS filer. It re-exports NFS mounts with
FS-Cache enabled to accelerate repeated reads:

```text
NFS Clients ──NFSv3──> KNFSD Proxy (nfsd + NFS client + cachefiles) ──NFSv3──> Source Filer
                              │
                              ├── nfsd serves downstream clients
                              ├── NFS client mounts source filer with fsc (FS-Cache)
                              ├── cachefiles/netfs writes read data to local XFS cache
                              └── nfsd re-reads from cache on subsequent client requests
```

This means the proxy runs both nfsd (server) and the NFS client
simultaneously, with the cachefiles/netfs layer performing write-back
of read data to the local XFS-backed cache volume.
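For reference, the proxy-side plumbing looks roughly like the sketch below. All paths, hostnames, and export names here are illustrative placeholders, not the production values:

```bash
# 1. cachefiles backing store on the local XFS RAID0 volume
#    (/etc/cachefilesd.conf)
#      dir /var/cache/fscache
#      tag knfsd-proxy

# 2. NFS client mount of the source filer with FS-Cache enabled (fsc)
mount -t nfs -o vers=3,fsc source-filer:/export /mnt/data

# 3. re-export via /etc/exports; re-exports need an explicit fsid
#      /mnt/data *(rw,no_subtree_check,fsid=1)
```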

## Root Cause Analysis

What follows is an analysis, drafted with GenAI assistance, of the possible root cause of the issue.

## Stack Trace

From instance `i-0fd575219253e4388` (representative of all 13 affected
instances):

```text
WARNING: fs/nfs/pagelist.c:587 at nfs_free_request+0x175/0x1e0 [nfs], CPU#10: kworker/u96:47/131414
Modules linked in: tls tcp_diag udp_diag inet_diag nfsd auth_rpcgss nfsv3 nfs_acl nfs lockd grace cachefiles netfs xfs raid0 ...
CPU: 10 UID: 0 PID: 131414 Comm: kworker/u96:47 Tainted: G           OE       6.19.4-knfsd #1 VOLUNTARY
Hardware name: Amazon EC2 i3en.6xlarge/, BIOS 1.0 10/16/2017
Workqueue: nfsiod rpc_async_release [sunrpc]
RIP: 0010:nfs_free_request+0x175/0x1e0 [nfs]
Code: 48 8b 43 38 84 c0 0f 89 ce fe ff ff 0f 0b 48 8b 43 38 f6 c4 01 0f 84 cc fe ff ff 0f 0b 48 8b 43 38 f6 c4 02 0f 84 ca fe ff ff <0f> 0b 48 8b 43 38 f6 c4 04 0f 84 c8 fe ff ff 0f 0b 48 8b 43 38 f6
RSP: 0018:ffffd1de8a523d18 EFLAGS: 00010202
RAX: 0000000000000204 RBX: ffff8aa811146180 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8aa811146180
RBP: ffffd1de8a523d28 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aa811146180
R13: ffff8aa811146180 R14: 0000000000000001 R15: ffff8ac45ab8a298
FS:  0000000000000000(0000) GS:ffff8ad5a4b3c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007339078ee200 CR3: 000000015b244005 CR4: 00000000007706f0
PKRU: 55555554
Call Trace:
 <TASK>
 nfs_page_group_destroy+0x65/0xc0 [nfs]
 nfs_page_group_destroy+0x9f/0xc0 [nfs]
 nfs_release_request+0x36/0x60 [nfs]
 nfs_readpage_release.isra.0+0x37/0x80 [nfs]
 nfs_read_completion+0x80/0x170 [nfs]
 nfs_pgio_release+0x16/0x20 [nfs]
 rpc_free_task+0x3f/0x80 [sunrpc]
 rpc_async_release+0x33/0x50 [sunrpc]
 process_one_work+0x182/0x350
 worker_thread+0x19a/0x320
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xf9/0x210
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x25c/0x290
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
---[ end trace 0000000000000000 ]---
```

## Analysis

### Call path

The WARNING occurs in the NFS **read** completion path, running in the
`nfsiod` workqueue:

```text
rpc_async_release
  → rpc_free_task
    → nfs_pgio_release
      → nfs_read_completion
        → nfs_readpage_release
          → nfs_release_request
            → kref_put → nfs_page_group_destroy
              → nfs_free_request        ← WARNING fires here
```

The double `nfs_page_group_destroy` entry in the call trace indicates a
multi-member page group: a subrequest releases its reference on the head
request, triggering the head's group destruction, which then calls
`nfs_free_request()` for each member in the do..while loop.

### Which WARN fires

The `nfs_free_request()` function contains several `WARN_ON_ONCE` debug
checks that verify all page-group synchronisation flags are cleared
before freeing an `nfs_page`:

```c
void nfs_free_request(struct nfs_page *req)
{
    WARN_ON_ONCE(req->wb_this_page != req);
    WARN_ON_ONCE(test_bit(PG_TEARDOWN, &req->wb_flags));
    WARN_ON_ONCE(test_bit(PG_UNLOCKPAGE, &req->wb_flags));
    WARN_ON_ONCE(test_bit(PG_UPTODATE, &req->wb_flags));
    WARN_ON_ONCE(test_bit(PG_WB_END, &req->wb_flags));     /* ← this one */
    WARN_ON_ONCE(test_bit(PG_REMOVE, &req->wb_flags));
    ...
}
```

Disassembly of the `Code:` dump shows the faulting `<0f> 0b` (ud2) is
the WARN emitted after the `test ah, 0x02` instruction, which tests
bit 9 of `wb_flags`, i.e. `PG_WB_END`.

### The stale flag

RAX contains `wb_flags` at the point of the WARN. Across the 13
affected instances, two values are observed:

- `RAX=0x204` → `PG_CLEAN(2)` + `PG_WB_END(9)` — observed on 10 instances
- `RAX=0x205` → `PG_BUSY(0)` + `PG_CLEAN(2)` + `PG_WB_END(9)` — observed on 3 instances

Both values have `PG_WB_END` (bit 9) set. The difference is whether
`PG_BUSY` (the request lock bit) has been cleared yet — reflecting two
slightly different timing snapshots of the same race.

### Why PG_WB_END should not be set here

`PG_WB_END` is a page-group synchronisation flag used via
`nfs_page_group_sync_on_bit(req, PG_WB_END)` in `nfs_write_completion()`
(the write path). In the normal read path, `PG_WB_END` should never be
set. The read path uses `PG_UPTODATE` (via `nfs_page_group_set_uptodate()`)
and `PG_UNLOCKPAGE` (via `nfs_readpage_release()`), but not `PG_WB_END`.

I suspect the cachefiles/netfs integration may be involved. In the
re-export proxy configuration, when the NFS client reads from the source
filer with FS-Cache enabled, the netfs layer also writes the read data
to the cachefiles backing store. This creates a hybrid read+write
lifecycle for the pages involved. It is possible that the cache
write-back path sets `PG_WB_END` on page-group members during what the
NFS client considers a read operation, and that the flag is not cleared
before read completion tears down the page group.

## Pattern Across Instances

### Hit rate

13 of 16 instances (81%) hit the WARNING. All instances run the same
kernel, same configuration, and serve the same type of NFS re-export
workload.

### Timing

The 13 warnings are spread over approximately 5.5 hours on the same day,
with no two instances firing at the same time. This is consistent with
a race condition that triggers independently under load:

```text
12:08 — i-0d9db222c56b148d1
13:21 — i-0561e189a06c5a288
13:39 — i-0c7dcaae400c2392c
13:52 — i-04c428e40afd64ca3
15:13 — i-0f032747134603e7b
15:28 — i-0e9d7909689579332
15:32 — i-04d2e1572a03b471b
16:01 — i-0e3e93a3c8367c27d
16:29 — i-0fd575219253e4388
16:58 — i-01b1ed44abfd52cf5
17:12 — i-02ef878062d4f5f86
17:18 — i-0eccb264cc90eb737
17:36 — i-064c649eaaf196ae7
```

### Call trace variants

Two variants are observed:

**Variant A** (8 instances) — recursive page group destroy (multi-member
page group where subrequest releases the head):

```text
nfs_page_group_destroy+0x65/0xc0   ← nfs_free_request(tmp) in do..while
nfs_page_group_destroy+0x9f/0xc0   ← nfs_release_request(head) at end
nfs_release_request+0x36/0x60
nfs_readpage_release
```

**Variant B** (5 instances) — direct, non-recursive (single-member page
group or head is the only remaining member):

```text
nfs_page_group_destroy+0x65/0xc0   ← nfs_free_request(tmp)
nfs_release_request+0x36/0x60
nfs_readpage_release
```

Both variants end at the same `nfs_free_request` with `PG_WB_END` set.

## Reproduction

This is reproducible under sustained NFS read load through the proxy
with FS-Cache (cachefiles) enabled. The workload involves multiple NFS
clients performing concurrent reads through the KNFSD re-export proxy,
triggering cold-cache fills where the NFS client reads from the source
filer and cachefiles writes the data to the local XFS cache.

Since `WARN_ON_ONCE` only fires once per boot, the actual frequency of
the race condition is likely higher than what the logs show.
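To measure the real hit rate, the kernel's debugfs knob can re-arm all `WARN_ON_ONCE` sites without a reboot (requires root and a mounted debugfs):

```bash
# Reset the once-only state of every WARN_ON_ONCE in the running
# kernel, so subsequent hits are logged again:
echo 1 | sudo tee /sys/kernel/debug/clear_warn_once
```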

## Reproducer

The setup is somewhat excessive, but it reproduces the issue 100% of the time: two fio jobs run on 10 client machines, all mounting a single KNFSD proxy instance, which in turn mounts a single source NFS filer, while `stress-ng` simulates production memory pressure on the proxy. The first fio job creates the files; the second reads them back. The source data is 6.4TB of random data. The second fio job runs for 60 minutes, but the WARNING surfaces much sooner.

**FIO job 1:**

```text
[warning-log-message]
ioengine=sync
direct=0
rw=write
directory=/mnt/data
unique_filename=0
bs=1M
size=10G
numjobs=64
```

**FIO job 2:**

```text
[warning-log-message]
ioengine=posixaio
direct=0
time_based
allow_file_create=0
directory=/mnt/data
unique_filename=0
bs=1M
rw=randread
size=10G
numjobs=64
runtime=60m
norandommap
```

Additionally, run this command on a KNFSD proxy instance whilst the 2nd FIO job is running to strain the system memory:

```bash
sudo stress-ng --vm 8 --vm-bytes 95% --vm-method all --timeout 60m --oom-avoid
```

## Questions

1. Is this a known issue in the NFS page-group synchronisation code
   when FS-Cache/cachefiles is active?

2. Could the netfs/cachefiles write-back path (writing read data to
   the cache) inadvertently set `PG_WB_END` on `nfs_page` structures
   during a read operation?

3. Is this specific to the re-export (nfsd + NFS client on same host)
   configuration, or could it affect any NFS client with FS-Cache
   enabled?

Any guidance or patches to address this would be greatly appreciated.

Thank you for reading.
