* Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307283
sar -r mem usage
My RHEL9 server with only NFS service often OOMed after a day or two, with no userspace memory usage. So I switched to elrepo kernel-lts and still the problem persists.
I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and 6.1.115-1.el9.elrepo.x86_64.
I'm not sure it is caused by NFS, but since NFS is the only service running on the server, it is my only suspect. The server has a Mellanox Technologies MT27500 Family [ConnectX-3] InfiniBand card and NFSoRDMA is enabled. No third-party drivers are used.
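The following data were gathered moments before the server OOMed and crashed.
sar showed a typical memory-leak pattern (columns as printed by `sar -r`: time, kbmemfree, kbavail, kbmemused, %memused, kbbuffers, kbcached, kbcommit, %commit, kbactive, kbinact, kbdirty):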
01:20:13 AM 390187300 388732764 3501864 0.89 4856 363952 390344 0.09 100680 358384 17148
01:30:13 AM 379492128 378312768 13642416 3.46 4856 909388 390344 0.09 108844 895740 16
01:40:13 AM 367687716 367062060 24851416 6.30 4856 1498272 390344 0.09 116736 1476672 16
01:50:50 AM 361704244 361471420 30437312 7.72 4856 1888780 390344 0.09 127888 1856036 29912
02:00:13 AM 355796296 355848120 36061648 9.15 4856 2173560 390344 0.09 131544 2137152 0
....
09:00:13 AM 1518392 18089616 373760196 94.79 4760 18648816 390344 0.09 470608 18273412 36
09:10:13 AM 1499980 17223900 374626172 95.01 4740 17801676 390344 0.09 471964 17424672 5292
09:20:13 AM 1561896 6784736 385059756 97.66 1712 7338540 423580 0.10 325452 7070372 0
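The growth implied by these samples can be quantified with a short script. This is only a minimal sketch, assuming the third numeric column is kbmemused in kB (as in `sar -r` output); the helper name `leak_rate_kb_per_min` is hypothetical, and only the first and last samples quoted above are used.

```python
from datetime import datetime

# (timestamp, kbmemused) pairs copied from the first and last sar samples above.
# Assumption: the third numeric column of `sar -r` is kbmemused, in kB.
SAMPLES = [
    ("01:20:13 AM", 3501864),
    ("09:20:13 AM", 385059756),
]

def leak_rate_kb_per_min(samples):
    """Average growth of used memory, in kB per minute, from first to last sample."""
    fmt = "%I:%M:%S %p"
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    minutes = elapsed.total_seconds() / 60
    return (used1 - used0) / minutes

rate = leak_rate_kb_per_min(SAMPLES)
print(f"{rate:.0f} kB/min, about {rate * 60 / 1024**2:.1f} GiB/hour")
```

A steady rate of this kind, sustained over eight hours while cache and active/inactive figures stay small, is what makes the trace look like a leak rather than ordinary page-cache growth.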
meminfo also showed nothing that accounts for the used RAM.
MemTotal: 394292660 kB
MemFree: 1551296 kB
MemAvailable: 6776108 kB
Buffers: 1712 kB
Cached: 7340144 kB
SwapCached: 4308 kB
Active: 325936 kB
Inactive: 7071836 kB
...
KReclaimable: 129816 kB
Slab: 331596 kB
SReclaimable: 129816 kB
SUnreclaim: 201780 kB
...
VmallocUsed: 319528 kB
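To make the gap concrete, the visible consumers can be subtracted from MemTotal. The following is a rough sketch, not a full accounting: it uses only fields quoted above (the elided lines are not guessed), so the result is an upper bound on the memory that is unaccounted for. The function names are hypothetical.

```python
import re

# Only the fields quoted in the report; elided lines ("...") are deliberately not guessed.
MEMINFO = """\
MemTotal: 394292660 kB
MemFree: 1551296 kB
Buffers: 1712 kB
Cached: 7340144 kB
Slab: 331596 kB
VmallocUsed: 319528 kB
"""

def parse_meminfo(text):
    """Map each 'Name: value kB' line to an integer number of kB."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def unaccounted_kb(fields):
    """MemTotal minus the consumers visible in the quoted excerpt."""
    visible = ("MemFree", "Buffers", "Cached", "Slab", "VmallocUsed")
    return fields["MemTotal"] - sum(fields[name] for name in visible)

gap = unaccounted_kb(parse_meminfo(MEMINFO))
print(f"unaccounted: {gap} kB ({gap / 1024**2:.0f} GiB)")
```

Anonymous memory, page tables and other fields from the full attachment would reduce this figure somewhat, but the order of magnitude is what matters here.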
Slab usage in slabinfo is low. Attached.
vmallocinfo doesn't show much either. Attached.
The dmesg log shows that the OOM killer had killed nearly every userspace program.
[29960.547403] Tasks state (memory values in pages):
[29960.547404] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[29960.547412] [ 1020] 0 1020 9498 640 94208 1000 -1000 systemd-udevd
[29960.547417] [ 1247] 0 1247 105208 6888 126976 0 -1000 multipathd
[29960.547421] [ 1342] 0 1342 23190 330 65536 764 -1000 auditd
[29960.547428] [ 1472] 0 1472 4185 806 73728 357 -1000 sshd
[29960.547438] Out of memory and no killable processes...
[29960.547439] Kernel panic - not syncing: System is deadlocked on memory
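The task table can be summed to confirm how little memory the surviving processes hold. This is a minimal sketch under two assumptions: the lines have the column layout shown in the header above, and the page size is 4 KiB. The regular expression and function name are illustrative only.

```python
import re

# Task lines copied from the OOM report above (header line included on purpose).
OOM_TASKS = """\
[29960.547404] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[29960.547412] [ 1020] 0 1020 9498 640 94208 1000 -1000 systemd-udevd
[29960.547417] [ 1247] 0 1247 105208 6888 126976 0 -1000 multipathd
[29960.547421] [ 1342] 0 1342 23190 330 65536 764 -1000 auditd
[29960.547428] [ 1472] 0 1472 4185 806 73728 357 -1000 sshd
"""

# One named group per column; the header does not match because "pid" is not numeric.
TASK_RE = re.compile(
    r"^\[\s*[\d.]+\]\s+\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<pgtables>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<adj>-?\d+)\s+(?P<name>\S+)"
)

def total_rss_pages(dmesg_text):
    """Sum the rss column (in pages) over all task lines."""
    total = 0
    for line in dmesg_text.splitlines():
        m = TASK_RE.match(line)
        if m:
            total += int(m.group("rss"))
    return total

PAGE_KB = 4  # assumption: 4 KiB pages
pages = total_rss_pages(OOM_TASKS)
print(f"{pages} pages resident, {pages * PAGE_KB} kB")
```

All four remaining tasks have oom_score_adj set to -1000, which is why the kernel reports no killable processes.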
systemctl status is attached. Nothing else is running.
I have a 224G vmcore dump but have no idea how to analyze it, and I think it is too big to upload anywhere.
I would appreciate any help in finding out what went wrong.
File: sar (text/plain)
Size: 6.95 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307283
---
sar -r mem usage
You can reply to this message to join the discussion.
--
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307284
lsmod
File: lsmod (text/plain)
Size: 4.96 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307284
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307285
/proc/meminfo
File: meminfo (text/plain)
Size: 1.53 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307285
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307286
/proc/slabinfo
File: slabinfo (text/plain)
Size: 30.92 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307286
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307287
systemctl status
File: systemctl_status (application/octet-stream)
Size: 4.06 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307287
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307288
/proc/vmallocinfo
File: vmallocinfo (text/plain)
Size: 170.08 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307288
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307289
/proc/vmstat
File: vmstat (text/plain)
Size: 3.65 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307289
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-11-27 2:35 UTC (permalink / raw)
To: linux-nfs, trondmy, jlayton, anna, cel
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307290
oom dmesg from kdump
File: vmcore-dmesg.txt (text/plain)
Size: 535.11 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307290
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-12-07 8:35 UTC (permalink / raw)
To: jlayton, cel, linux-nfs, anna, trondmy
Chen Chen added an attachment on Kernel.org Bugzilla:
Created attachment 307330
dmesg of another 3 crashes
Since reporting, I have had another 3 crashes. In all of them the current task was nfsd.
First one:
[136965.765431] Out of memory and no killable processes...
[136965.765433] Kernel panic - not syncing: System is deadlocked on memory
[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
[136965.766852] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[136965.767546] Call Trace:
[136965.768230] <TASK>
[136965.768903] dump_stack_lvl+0x45/0x5e
[136965.769571] panic+0x10c/0x2c2
[136965.770231] out_of_memory.cold+0x2f/0x7e
[136965.770874] __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[136965.771518] __alloc_pages+0x35d/0x370
[136965.772147] __alloc_pages_bulk+0x3e5/0x680
[136965.772766] svc_alloc_arg+0x81/0x1f0 [sunrpc]
[136965.773431] svc_recv+0x1f/0x190 [sunrpc]
[136965.774089] ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[136965.774726] nfsd+0x87/0xc0 [nfsd]
[136965.775347] kthread+0xe5/0x110
[136965.775926] ? kthread_complete_and_exit+0x20/0x20
[136965.776499] ret_from_fork+0x1f/0x30
[136965.777062] </TASK>
Second:
[167723.787640] WARNING: CPU: 3 PID: 1872 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[167723.787667] Modules linked in: <cut here>
[167723.787874] CPU: 3 PID: 1872 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[167723.787882] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[167723.787886] RIP: 0010:free_large_kmalloc+0x5a/0x80
Third:
[ 3883.748094] ------------[ cut here ]------------
[ 3883.748105] WARNING: CPU: 9 PID: 1886 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[ 3883.748131] Modules linked in: <cut here>
[ 3883.748339] CPU: 9 PID: 1886 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[ 3883.748342] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[ 3883.748344] RIP: 0010:free_large_kmalloc+0x5a/0x80
File: crash.log (text/plain)
Size: 31.77 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307330
* Re: Possible memory leak on nfsd
From: Chuck Lever via Bugspray Bot @ 2024-12-07 15:30 UTC (permalink / raw)
To: anna, jlayton, linux-nfs, cel, trondmy
Chuck Lever writes via Kernel.org Bugzilla:
Hi Chen -
After some review, these all appear to be Red Hat Enterprise kernels. Such kernels are extensively patched and maintained exclusively by Red Hat engineers. I kindly request that you report this issue to Red Hat first and have them troubleshoot it.
If they find there is a needed upstream fix, do feel free to re-open this bug.
[I am a fan of the old ConnectX-3 cards, btw]
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c9
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-12-10 5:20 UTC (permalink / raw)
To: linux-nfs, cel, anna, trondmy, jlayton
Chen Chen writes via Kernel.org Bugzilla:
Hi Mr. Lever,
I *clearly* stated I was using 6.1.119 which is the latest longterm kernel released on 2024-11-22, compiled by the ELRepo Project as-is from upstream tarball.
[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
I encountered the problem on both the shipped RHEL kernels and the latest and second-latest LTS kernels, so the bug must still exist upstream. That's why I filed this bug.
Anyway, I encountered another 2 crashes in the last two days and call stack insists nfsd caused it.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c10
* Re: Possible memory leak on nfsd
From: Chuck Lever via Bugspray Bot @ 2024-12-10 14:45 UTC (permalink / raw)
To: linux-nfs, cel, anna, trondmy, jlayton
Chuck Lever writes via Kernel.org Bugzilla:
This is what comment 0 says:
> My RHEL9 server with only NFS service often OOMed after a day or two,
> with no userspace memory usage. So I switched to elrepo kernel-lts and
> still the problem persists.
> I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on
> (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and
> 6.1.115-1.el9.elrepo.x86_64.
You mentioned RHEL, and RHEL 9 in particular, several times here. I have no prior knowledge of "the ELRepo Project" -- never heard of it. By "uname" these all look like distro-built kernels to me.
> Anyway, I encountered another 2 crashes in the last two days and
> call stack insists nfsd caused it.
I'm not saying this isn't an NFSD bug. But it might not be a problem in recent kernels. If I'm reading your reports correctly, you have not tested with 6.12 or newer. 6.1.anything is based on a two-year-old code base.
Any fix we create for this issue must be applied to the upstream Linus kernel first. Indeed, a fix might already exist somewhere in upstream. By upstream, I mean the "master" branch in this repo:
https://git.kernel.org./pub/scm/linux/kernel/git/torvalds/linux.git
Therefore the first task is for you to confirm by testing that this branch either still has this issue, in which case we have to troubleshoot further; or does not, in which case you can bisect to find the upstream fix that needs to be backported to the LTS kernels.
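The idea behind bisecting for a fix can be illustrated with a small binary search. This is a sketch only, assuming a linear, ordered history in which the oldest commit is known broken, the newest is known fixed, and every commit after the fix stays fixed; `first_fixed` and the commit names are hypothetical, and real kernel history is a graph that `git bisect` handles for you.

```python
from typing import Callable, Sequence

def first_fixed(commits: Sequence[str], is_fixed: Callable[[str], bool]) -> str:
    """Binary-search an ordered history for the first commit where the bug is gone.

    Assumes commits[0] is known broken, commits[-1] is known fixed, and every
    commit after the fix is also fixed.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] broken, commits[hi] fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # the fix is at mid or earlier
        else:
            lo = mid  # the fix is after mid
    return commits[hi]

# Hypothetical linear history; in practice each is_fixed() call means building
# and booting that kernel and running the workload long enough to judge it.
history = [f"commit-{i}" for i in range(1000)]
print(first_fixed(history, lambda c: int(c.split("-")[1]) >= 612))
```

Each step halves the remaining range, so the number of kernels to build and test grows only with the logarithm of the number of commits between the two endpoints.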
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c11
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2024-12-11 1:15 UTC (permalink / raw)
To: cel, anna, jlayton, trondmy, linux-nfs
Chen Chen writes via Kernel.org Bugzilla:
Hi Mr. Lever,
> You mentioned RHEL, and RHEL 9 in particular, several times here.
Because I wanted to indicate that, apart from the kernel, everything else (toolchain included) was the latest version from RHEL 9.
The ELRepo Project (https://elrepo.org/) is a group of people who take the latest kernel source and package it into RPMs for easy installation on the latest EL-like releases (RHEL, Oracle Linux, Rocky, Alma, etc.).
> By upstream, I mean the "master" branch in this repo
OK. I've just installed the latest stable kernel (6.12.4) and will see whether it helps.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c12
* Re: Possible memory leak on nfsd
From: Chuck Lever via Bugspray Bot @ 2024-12-12 16:00 UTC (permalink / raw)
To: jlayton, linux-nfs, trondmy, cel, anna
Chuck Lever writes via Kernel.org Bugzilla:
From attachment 307290:
[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880] <TASK>
[29925.425885] dump_stack_lvl+0x45/0x5e
[29925.425893] dump_header+0x4a/0x213
[29925.425897] oom_kill_process.cold+0xb/0x10
[29925.425901] out_of_memory+0xed/0x2e0
[29925.425906] __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916] __alloc_pages+0x35d/0x370
[29925.425921] __alloc_pages_bulk+0x3e5/0x680
[29925.425927] svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991] svc_recv+0x1f/0x190 [sunrpc]
[29925.426043] ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080] nfsd+0x87/0xc0 [nfsd]
[29925.426113] kthread+0xe5/0x110
[29925.426118] ? kthread_complete_and_exit+0x20/0x20
[29925.426122] ret_from_fork+0x1f/0x30
[29925.426129] </TASK>
NFSD is triggering the OOM killer because it frequently allocates up to 256 pages at a time to fill the send and receive buffers. It is not necessarily the source of a leak.
The bulk page allocator is on the slow path here, suggesting there weren't any free pages available on the lists it normally checks first. So it is doing one-at-a-time order-0 allocations, a sign that memory is short.
We see that Node 1 appears to be short on free memory, but the system has not pushed into swap at all. Kernel memory isn't swappable, so whatever is leaking is in the kernel proper.
The slab caches all look reasonably sized, so not likely a slab leak.
At this point we would want someone with some MM expertise to come in and help us nail down the leak.
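The allocation details referred to above come from the "invoked oom-killer" line of the log. The following sketch pulls them out of that line; the regular expression and the function name `parse_oom_invocation` are illustrative assumptions based only on the one line quoted in this message, not a general dmesg parser.

```python
import re

# The invocation line quoted above, split across two string literals for width.
LINE = ("[29925.425864] nfsd invoked oom-killer: "
        "gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0")

OOM_RE = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<mask>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\), "
    r"order=(?P<order>\d+), oom_score_adj=(?P<adj>-?\d+)"
)

def parse_oom_invocation(line):
    """Return the task name, GFP mask and order from an 'invoked oom-killer' line."""
    m = OOM_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "gfp_mask": int(m.group("mask"), 16),
        "flags": m.group("flags").split("|"),
        "order": order,
        "pages": 1 << order,  # an order-n request asks for 2**n contiguous pages
    }

info = parse_oom_invocation(LINE)
print(info)
```

An order of 0 means the failing request was for a single page, which matches the observation that the bulk allocator had fallen back to one-at-a-time allocations.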
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13
* Fwd: Possible memory leak on nfsd
From: Chuck Lever @ 2024-12-12 16:15 UTC (permalink / raw)
To: linux-mm, Linux NFS Mailing List
Hi -
An NFSD page allocation on v6.1.y is triggering OOM-killer. The reporter
has provided a lot of detail, and we need some help steering us towards
the possible leak culprit. Any takers?
(We've asked the reporter to reproduce on a more recent kernel if
possible).
-------- Forwarded Message --------
Subject: Re: Possible memory leak on nfsd
Date: Thu, 12 Dec 2024 16:00:17 +0000
From: Chuck Lever via Bugspray Bot <bugbot@kernel.org>
To: jlayton@kernel.org, linux-nfs@vger.kernel.org, trondmy@kernel.org,
cel@kernel.org, anna@kernel.org
Chuck Lever writes via Kernel.org Bugzilla:
From attachment 307290:
[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880] <TASK>
[29925.425885] dump_stack_lvl+0x45/0x5e
[29925.425893] dump_header+0x4a/0x213
[29925.425897] oom_kill_process.cold+0xb/0x10
[29925.425901] out_of_memory+0xed/0x2e0
[29925.425906] __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916] __alloc_pages+0x35d/0x370
[29925.425921] __alloc_pages_bulk+0x3e5/0x680
[29925.425927] svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991] svc_recv+0x1f/0x190 [sunrpc]
[29925.426043] ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080] nfsd+0x87/0xc0 [nfsd]
[29925.426113] kthread+0xe5/0x110
[29925.426118] ? kthread_complete_and_exit+0x20/0x20
[29925.426122] ret_from_fork+0x1f/0x30
[29925.426129] </TASK>
NFSD is triggering the OOM killer because it frequently allocates up to
256 pages at a time to fill the send and receive buffers. It is not
necessarily the source of a leak.
The bulk page allocator is on the slow path here, suggesting there
weren't any free pages available on the lists it normally checks first.
So it is doing one-at-a-time order-0 allocations, a sign that memory is
short.
We see that Node 1 appears to be short on free memory, but the system
has not pushed into swap at all. Kernel memory isn't swappable, so
whatever is leaking is in the kernel proper.
The slab caches all look reasonably sized, so not likely a slab leak.
At this point we would want someone with some MM expertise to come in
and help us nail down the leak.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13
* Re: Possible memory leak on nfsd
From: Chen Chen via Bugspray Bot @ 2025-01-10 16:50 UTC (permalink / raw)
To: anna, linux-nfs, linux-mm, chuck.lever, jlayton, cel, trondmy
Chen Chen writes via Kernel.org Bugzilla:
Sorry for my rudeness in my previous discussion.
After switching to 6.12.4, the server stayed stable for 30 days. So whatever caused the memleak should have been resolved between 6.1.119 and 6.12.
You might want to close this bug if backport is not worthwhile.
View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c15
* Re: Possible memory leak on nfsd
From: Chuck Lever @ 2025-01-10 20:35 UTC (permalink / raw)
To: Chen Chen via Bugspray Bot, anna, linux-nfs, linux-mm, jlayton,
cel, trondmy
On 1/10/25 11:50 AM, Chen Chen via Bugspray Bot wrote:
> Chen Chen writes via Kernel.org Bugzilla:
>
> Sorry for my rudeness in my previous discussion.
>
> After switching to 6.12.4, the server stayed stable for 30 days.
That's good news!
> So whatever caused the memleak should have been resolved between 6.1.119 and 6.12.
That's tens of thousands of commits over two years. Unfortunately that
doesn't really tell us what the problem is.
> You might want to close this bug if backport is not worthwhile.
We need to know the exact commit that contains the fix before it can
be determined whether a backport is feasible.
Are you able to bisect between v6.1 and v6.12? If not, do you have
a simple, narrow reproducer that we can use to explore this ourselves?
--
Chuck Lever
* Re: Possible memory leak on nfsd
From: JJ Jordan via Bugspray Bot @ 2025-01-22 20:45 UTC (permalink / raw)
To: anna, chuck.lever, cel, trondmy, jlayton, linux-nfs, linux-mm
JJ Jordan added an attachment on Kernel.org Bugzilla:
Created attachment 307525
Logs and traces from Jan-18 pt1
Here are the traces from two NFS crashes that occurred this past weekend.
Both occurred in the AM (US time) on Jan 18, a few hours apart from one
another.
I followed the instructions I found on the various threads.
There was no output to `rpcdebug -m rpc -c`, not sure what I did wrong
there. The syslog ought to contain the output of sysrq-trigger, however.
The output from trace-cmd captures several days' worth of logs in either
case, but not from system boot.
I have cut the syslogs from about one hour before the incident until the system finished
shutting down prior to reboot. I have removed the output of other services.
Both are VMs on GCE running the 6.1.119 kernel from Debian bookworm (6.1.0-28),
with ~60Gi memory and 16 CPUs.
File: nfs-traces-250118-pt1.tar.bz2 (application/octet-stream)
Size: 4.61 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307525
* Re: Possible memory leak on nfsd
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
To: trondmy, linux-mm, anna, jlayton, cel, linux-nfs, chuck.lever
JJ Jordan added an attachment on Kernel.org Bugzilla:
Created attachment 307526
Logs and traces from Jan-18 pt2
Part 2, see previous description
File: nfs-traces-250118-pt2.tar.bz2 (application/octet-stream)
Size: 601.99 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307526
* Re: Possible memory leak on nfsd
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
To: trondmy, linux-mm, anna, jlayton, cel, linux-nfs, chuck.lever
JJ Jordan added an attachment on Kernel.org Bugzilla:
Comment on attachment 307525
Logs and traces from Jan-18 pt1
This was submitted in error, apologies.
File: nfs-traces-250118-pt1.tar.bz2 (application/octet-stream)
Size: 4.61 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307525
* Re: Possible memory leak on nfsd
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
To: trondmy, linux-mm, anna, jlayton, cel, linux-nfs, chuck.lever
JJ Jordan added an attachment on Kernel.org Bugzilla:
Comment on attachment 307526
Logs and traces from Jan-18 pt2
Also submitted in error.
File: nfs-traces-250118-pt2.tar.bz2 (application/octet-stream)
Size: 601.99 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307526