* leaked pNFS DS nfs_client references
@ 2025-04-21 19:46 Jeff Layton
From: Jeff Layton @ 2025-04-21 19:46 UTC (permalink / raw)
To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs
Hi Trond/Anna:
We (at Meta) have been hunting a number of problems surrounding leaked
network namespaces with containerized workloads. We recently deployed a
v6.9-based kernel on the clients that has all the known containerized
NFS fixes from upstream.
Usually, when we've found problems with leaked netns's it has been
because there were still outstanding RPCs associated with the rpc_clnt.
Today, we found a host that seems to have some leaked nfs_client
structures, but there is no associated RPC activity.
In this case, we had 2 leaked net namespaces. We discovered them by
looking under /sys/kernel/debug/sunrpc/rpc_xprt for xprts associated with
netns's that no longer have any userland tasks attached.
Some drgn (pardon my terrible Python):
>>> for net in for_each_net():
... if (net.ns.inum == 4026558887 or net.ns.inum == 4026558805):
... print("netns:", net.ns.inum)
... nfs_net = cast("struct nfs_net *", net.gen.ptr[prog["nfs_net_id"]])
... print("Volume list empty:", list_empty(nfs_net.nfs_volume_list.address_of_()))
... for clnt in list_for_each_entry("struct nfs_client", nfs_net.nfs_client_list.address_of_(), "cl_share_link"):
... rpcclnt = clnt.cl_rpcclient
... print(clnt.cl_count.refs.counter, clnt.cl_hostname, rpcclnt.cl_vers, "tasks: ", list_count_nodes(rpcclnt.cl_tasks.address_of_()))
...
netns: (unsigned int)4026558805
Volume list empty: True
(int)1 (char *)0xffff8a12e988a500 = "f00::3117:a4f1:a940:94af" (u32)3 tasks: 0
(int)1 (char *)0xffff8881a0f694c0 = "f00::bfaa:cec2:8ee2:295" (u32)3 tasks: 0
(int)1 (char *)0xffff889e81a74e40 = "f00::8f23:f52d:9d79:a7b0" (u32)3 tasks: 0
(int)1 (char *)0xffff8a027d8e0780 = "f00::d209:97ba:1c6:3282" (u32)3 tasks: 0
netns: (unsigned int)4026558887
Volume list empty: True
(int)1 (char *)0xffff8a14d5b0e2c0 = "f00::3f52:fea6:4ccb:96dd" (u32)3 tasks: 0
(int)1 (char *)0xffff8881e6626cc0 = "f00::705:c924:ddc1:51e4" (u32)3 tasks: 0
(int)1 (char *)0xffff8a149cdb6680 = "f00::3117:a4f1:a940:94af" (u32)3 tasks: 0
(int)1 (char *)0xffff8896ada2f800 = "f00::d56c:cd93:1f0c:99c7" (u32)3 tasks: 0
(int)1 (char *)0xffff8a159251f240 = "f00::614d:87c1:a73f:1f09" (u32)3 tasks: 0
(int)1 (char *)0xffff888e699f4940 = "f00::1285:b785:f114:d38b" (u32)3 tasks: 0
(int)1 (char *)0xffff88812ae41500 = "f00::fb1c:bc4a:3d9a:c2a6" (u32)3 tasks: 0
(int)1 (char *)0xffff8a137dbc4e00 = "f00::bd2f:5851:b552:5bce" (u32)3 tasks: 0
There are 12 leaked nfs_clients across the 2 netns's. There are no longer
any struct nfs_servers associated with either netns. Each leaked client
has a single outstanding reference. They're all connections to different
DS's (except for one address that appears in both netns's, but I suspect
that's just coincidence). They're all NFSv3, which indicates that they
are pNFS DS clients. None of them have any running RPCs.
I took a look at the nfs_client refcount handling in the pNFS code but
didn't see any obvious bugs.
One thing we could consider is adding a refcount tracker for these
objects. That would tell us pretty quickly what took the leftover
references in the first place, assuming this is reproducible.
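To illustrate the idea (a userspace toy, not the kernel's actual ref_tracker API): each get records where the reference was taken, so whatever is still held at teardown identifies the culprit directly.

```python
import traceback

class TrackedRefcount:
    """Toy analogue of a kernel refcount tracker: each get() records a
    stack trace keyed by a cookie, and outstanding() reports the stacks
    of every reference that was never put()."""

    def __init__(self):
        self._next = 0
        self._holders = {}  # cookie -> captured stack at get() time

    def get(self) -> int:
        cookie = self._next
        self._next += 1
        self._holders[cookie] = traceback.format_stack()
        return cookie

    def put(self, cookie: int) -> None:
        del self._holders[cookie]

    @property
    def count(self) -> int:
        return len(self._holders)

    def outstanding(self):
        """Stacks for every reference taken but never released."""
        return list(self._holders.values())

rc = TrackedRefcount()
a = rc.get()
b = rc.get()
rc.put(a)
# One leaked reference remains; its recorded stack says who took it.
print(rc.count)  # 1
```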
This kernel is based on v6.9, so it's possible we missed a fix that we
need. I didn't see anything obvious in recent git fixes though.
Any thoughts?
--
Jeff Layton <jlayton@kernel.org>
* Re: leaked pNFS DS nfs_client references
From: Jeff Layton @ 2025-04-23 15:53 UTC (permalink / raw)
To: Trond Myklebust, Anna Schumaker; +Cc: linux-nfs, Omar Sandoval, Chris Mason
On Mon, 2025-04-21 at 15:46 -0400, Jeff Layton wrote:
> Hi Trond/Anna:
>
> We (at Meta) have been hunting a number of problems surrounding leaked
> network namespaces with containerized workloads. We recently deployed a
> v6.9 based kernel on the clients that has all the known containerized
> NFS fixes from upstream.
>
> Usually, when we've found problems with leaked netns's it has been
> because there were still outstanding RPCs associated with the rpc_clnt.
> Today, we found a host that seems to have some leaked nfs_client
> structures, but there is no associated RPC activity.
>
> In this case, we had 2 leaked net namespaces. We discovered them by
> looking under /sys/kernel/debug/sunrpc/rpc_xprt for xprts associated with
> netns's that no longer have any userland tasks attached.
>
> Some drgn (pardon my terrible Python):
>
> >>> for net in for_each_net():
> ... if (net.ns.inum == 4026558887 or net.ns.inum == 4026558805):
> ... print("netns:", net.ns.inum)
> ... nfs_net = cast("struct nfs_net *", net.gen.ptr[prog["nfs_net_id"]])
> ... print("Volume list empty:", list_empty(nfs_net.nfs_volume_list.address_of_()))
> ... for clnt in list_for_each_entry("struct nfs_client", nfs_net.nfs_client_list.address_of_(), "cl_share_link"):
> ... rpcclnt = clnt.cl_rpcclient
> ... print(clnt.cl_count.refs.counter, clnt.cl_hostname, rpcclnt.cl_vers, "tasks: ", list_count_nodes(rpcclnt.cl_tasks.address_of_()))
> ...
> netns: (unsigned int)4026558805
> Volume list empty: True
> (int)1 (char *)0xffff8a12e988a500 = "f00::3117:a4f1:a940:94af" (u32)3 tasks: 0
> (int)1 (char *)0xffff8881a0f694c0 = "f00::bfaa:cec2:8ee2:295" (u32)3 tasks: 0
> (int)1 (char *)0xffff889e81a74e40 = "f00::8f23:f52d:9d79:a7b0" (u32)3 tasks: 0
> (int)1 (char *)0xffff8a027d8e0780 = "f00::d209:97ba:1c6:3282" (u32)3 tasks: 0
> netns: (unsigned int)4026558887
> Volume list empty: True
> (int)1 (char *)0xffff8a14d5b0e2c0 = "f00::3f52:fea6:4ccb:96dd" (u32)3 tasks: 0
> (int)1 (char *)0xffff8881e6626cc0 = "f00::705:c924:ddc1:51e4" (u32)3 tasks: 0
> (int)1 (char *)0xffff8a149cdb6680 = "f00::3117:a4f1:a940:94af" (u32)3 tasks: 0
> (int)1 (char *)0xffff8896ada2f800 = "f00::d56c:cd93:1f0c:99c7" (u32)3 tasks: 0
> (int)1 (char *)0xffff8a159251f240 = "f00::614d:87c1:a73f:1f09" (u32)3 tasks: 0
> (int)1 (char *)0xffff888e699f4940 = "f00::1285:b785:f114:d38b" (u32)3 tasks: 0
> (int)1 (char *)0xffff88812ae41500 = "f00::fb1c:bc4a:3d9a:c2a6" (u32)3 tasks: 0
> (int)1 (char *)0xffff8a137dbc4e00 = "f00::bd2f:5851:b552:5bce" (u32)3 tasks: 0
>
> There are 12 leaked nfs_clients in 2 netns's. There are no longer any
> struct nfs_servers associated with either netns. Each leaked client has
> a single outstanding reference. They're all connections to different
> DS's (except for one between the two netns's, but I suspect that's just
> coincidence). They're all NFSv3, which indicates that they are pNFS DS
> clients. None of them have any running RPCs.
>
> I took a look at the nfs_client refcount handling in the pNFS code but
> didn't see any obvious bugs.
>
> One thing we could consider is adding a refcount tracker for these
> objects. That would tell us pretty quickly what took the leftover
> references in the first place, assuming this is reproducible.
>
> This kernel is based on v6.9, so it's possible we missed a fix that we
> need. I didn't see anything obvious in recent git fixes though.
>
> Any thoughts?
An update:
Omar and I worked together yesterday, and confirmed that this is the
same problem that he reported a week or two ago. This kernel has the
two patches you sent on April 6th:
[PATCH v2 1/2] NFSv4: Handle fatal ENETDOWN and ENETUNREACH errors
[PATCH v2 2/2] NFSv4/pnfs: Layoutreturn on close must handle fatal networking errors
...so that's evidently not enough to fix it.
Note that this is a potential memory corruptor. The leaked layout segs
were all sitting on the layout's plh_return_segs list. If they get freed
later, then that will likely scribble over the list_head in the freed
layout.
Looking over the code, it appears that when the inode is evicted, the
plh_return_segs list should get cleaned out via:
nfs4_evict_inode
  pnfs_destroy_layout_final
    __pnfs_destroy_layout
      pnfs_mark_layout_stateid_invalid
        pnfs_free_returned_lsegs
pnfs_free_returned_lsegs() should put them all on the tmp_list and then
__pnfs_destroy_layout() should free the contents of the list via
pnfs_free_lseg_list().
It's not clear to me why that didn't happen here.
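For what it's worth, here's a simplified model of what that eviction path is supposed to guarantee (hypothetical names, loosely mirroring the splice onto tmp_list; not the kernel code): after destroy, every seg has been moved off the layout exactly once, and nothing is left linked on plh_return_segs.

```python
class Layout:
    def __init__(self):
        self.plh_segs = []         # active layout segments
        self.plh_return_segs = []  # segments queued for layoutreturn

def mark_layout_stateid_invalid(lo: Layout, tmp_list: list) -> None:
    # Move every active segment onto the caller's tmp_list...
    tmp_list.extend(lo.plh_segs)
    lo.plh_segs.clear()
    # ...and the free_returned_lsegs step is supposed to do the same
    # for the layoutreturn queue, so nothing survives inode eviction.
    tmp_list.extend(lo.plh_return_segs)
    lo.plh_return_segs.clear()

def destroy_layout(lo: Layout) -> list:
    tmp_list: list = []
    mark_layout_stateid_invalid(lo, tmp_list)
    # pnfs_free_lseg_list() equivalent: the caller frees tmp_list, so
    # each seg is freed exactly once and none stays linked on the layout.
    return tmp_list

lo = Layout()
lo.plh_return_segs = ["lseg1", "lseg2"]
freed = destroy_layout(lo)
assert freed == ["lseg1", "lseg2"]
assert not lo.plh_return_segs  # if this were non-empty, freeing the segs
                               # later would scribble over a list_head in
                               # freed memory
```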
--
Jeff Layton <jlayton@kernel.org>