public inbox for linux-nfs@vger.kernel.org
* Possible memory leak on nfsd
@ 2024-11-27  2:35 Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (17 more replies)
  0 siblings, 18 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307283
sar -r mem usage

My RHEL9 server with only NFS service often OOMed after a day or two, with no userspace memory usage. So I switched to elrepo kernel-lts and still the problem persists.

I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and 6.1.115-1.el9.elrepo.x86_64.

I'm not sure it is caused by NFS, but since NFS is the only service running on the server, it is my only suspect. The server has a Mellanox Technologies MT27500 Family [ConnectX-3] InfiniBand card and NFSoRDMA is enabled. No third-party drivers are used.

The following data were gathered moments before it OOMed and crashed

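sar shows the typical pattern of a memory leak: free memory falls steadily while used memory climbs (column names below follow the sar -r layout).
            kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
01:20:13 AM 390187300 388732764   3501864      0.89      4856    363952    390344      0.09    100680    358384     17148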
01:30:13 AM 379492128 378312768  13642416      3.46      4856    909388    390344      0.09    108844    895740        16
01:40:13 AM 367687716 367062060  24851416      6.30      4856   1498272    390344      0.09    116736   1476672        16
01:50:50 AM 361704244 361471420  30437312      7.72      4856   1888780    390344      0.09    127888   1856036     29912
02:00:13 AM 355796296 355848120  36061648      9.15      4856   2173560    390344      0.09    131544   2137152         0
....
09:00:13 AM   1518392  18089616 373760196     94.79      4760  18648816    390344      0.09    470608  18273412        36
09:10:13 AM   1499980  17223900 374626172     95.01      4740  17801676    390344      0.09    471964  17424672      5292
09:20:13 AM   1561896   6784736 385059756     97.66      1712   7338540    423580      0.10    325452   7070372         0
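
To put a number on the growth, here is a minimal sketch that takes the first and last sar samples above and computes the average rate at which used memory grows. It assumes the third numeric column is kbmemused, as in the usual sar -r layout; the names SAMPLES and growth_kb_per_min are made up for this illustration.

```python
from datetime import datetime

# Two samples copied from the sar output above: (time, kbmemused).
# Assumption: the third numeric column of the report is kbmemused.
SAMPLES = [
    ("01:20:13 AM", 3501864),
    ("09:20:13 AM", 385059756),
]


def growth_kb_per_min(samples):
    """Average growth of kbmemused between the first and last sample."""
    fmt = "%I:%M:%S %p"
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    minutes = elapsed.total_seconds() / 60
    return (used1 - used0) / minutes


rate = growth_kb_per_min(SAMPLES)
print(f"{rate:.0f} kB/min, about {rate * 60 / 1024**2:.1f} GiB/hour")
```

A roughly constant rate over eight hours is what one would expect from a leak tied to steady request traffic, though this alone does not say where the memory went.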

meminfo also doesn't show anything using the RAM.
MemTotal:       394292660 kB
MemFree:         1551296 kB
MemAvailable:    6776108 kB
Buffers:            1712 kB
Cached:          7340144 kB
SwapCached:         4308 kB
Active:           325936 kB
Inactive:        7071836 kB
...
KReclaimable:     129816 kB
Slab:             331596 kB
SReclaimable:     129816 kB
SUnreclaim:       201780 kB
...
VmallocUsed:      319528 kB

slabinfo is low. Attached.

vmallocinfo doesn't have much. Attached.
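
The gap in the accounting can be made explicit with a short sketch that subtracts the consumers listed in the meminfo excerpt from MemTotal. Fields that the excerpt elides with "..." are deliberately not guessed, so the result is only an upper bound on the unaccounted memory. The helper names parse_meminfo and unaccounted_kb are hypothetical.

```python
# Fields copied from the meminfo excerpt above (values in kB).
# Elided fields (AnonPages, PageTables, ...) are left out on purpose,
# so the result is an upper bound, not an exact figure.
MEMINFO = """\
MemTotal:       394292660 kB
MemFree:         1551296 kB
Buffers:            1712 kB
Cached:          7340144 kB
Slab:             331596 kB
VmallocUsed:      319528 kB
"""


def parse_meminfo(text):
    """Turn 'Name:   value kB' lines into a {name: value_in_kB} dict."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def unaccounted_kb(fields):
    """MemTotal minus the consumers that meminfo reports explicitly."""
    accounted = ("MemFree", "Buffers", "Cached", "Slab", "VmallocUsed")
    return fields["MemTotal"] - sum(fields.get(k, 0) for k in accounted)


gap = unaccounted_kb(parse_meminfo(MEMINFO))
print(f"unaccounted: {gap} kB (~{gap / 1024**2:.0f} GiB)")
```

Active, Inactive and SwapCached are not subtracted because they overlap with the page cache and anonymous memory already counted elsewhere.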

The dmesg log shows that the OOM killer had killed nearly every userspace program.
[29960.547403] Tasks state (memory values in pages):
[29960.547404] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[29960.547412] [   1020]     0  1020     9498      640    94208     1000         -1000 systemd-udevd
[29960.547417] [   1247]     0  1247   105208     6888   126976        0         -1000 multipathd
[29960.547421] [   1342]     0  1342    23190      330    65536      764         -1000 auditd
[29960.547428] [   1472]     0  1472     4185      806    73728      357         -1000 sshd
[29960.547438] Out of memory and no killable processes...
[29960.547439] Kernel panic - not syncing: System is deadlocked on memory
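
As a rough cross-check of how little memory the surviving tasks hold, the sketch below sums the rss column of the task dump above. It assumes 4 KiB pages (the x86_64 default); the regular expression and the name total_rss_kb are illustrative only.

```python
import re

# Task rows copied from the dmesg excerpt above; rss is reported in pages.
TASK_DUMP = """\
[29960.547412] [   1020]     0  1020     9498      640    94208     1000         -1000 systemd-udevd
[29960.547417] [   1247]     0  1247   105208     6888   126976        0         -1000 multipathd
[29960.547421] [   1342]     0  1342    23190      330    65536      764         -1000 auditd
[29960.547428] [   1472]     0  1472     4185      806    73728      357         -1000 sshd
"""

PAGE_KB = 4  # assumption: 4 KiB pages, the x86_64 default

# [pid] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
ROW = re.compile(
    r"\[\s*\d+\]\s+\d+\s+\d+\s+\d+\s+(\d+)\s+\d+\s+\d+\s+-?\d+\s+(\S+)"
)


def total_rss_kb(dump):
    """Sum the resident set size of every task row, converted to kB."""
    return sum(int(m.group(1)) * PAGE_KB for m in ROW.finditer(dump))


print(f"remaining userspace rss: {total_rss_kb(TASK_DUMP)} kB")
```

All four remaining tasks have oom_score_adj set to -1000, which exempts them from OOM killing; that is consistent with the "no killable processes" message.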

systemctl status attached. Nothing else is running.

I have a 224G vmcore dump but have no idea how to analyze it, and I think it is too big to upload anywhere.

I would appreciate any help in working out what went wrong.

File: sar (text/plain)
Size: 6.95 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307283
---
sar -r mem usage

You can reply to this message to join the discussion.
-- 
Deet-doot-dot, I am a bot.
Kernel.org Bugzilla (bugspray 0.1-dev)



* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307284
lsmod

File: lsmod (text/plain)
Size: 4.96 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307284
---
lsmod


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307285
/proc/meminfo

File: meminfo (text/plain)
Size: 1.53 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307285
---
/proc/meminfo


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307286
/proc/slabinfo

File: slabinfo (text/plain)
Size: 30.92 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307286
---
/proc/slabinfo


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (2 preceding siblings ...)
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307287
systemctl status

File: systemctl_status (application/octet-stream)
Size: 4.06 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307287
---
systemctl status


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (3 preceding siblings ...)
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307288
/proc/vmallocinfo

File: vmallocinfo (text/plain)
Size: 170.08 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307288
---
/proc/vmallocinfo


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (4 preceding siblings ...)
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307289
/proc/vmstat

File: vmstat (text/plain)
Size: 3.65 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307289
---
/proc/vmstat


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (5 preceding siblings ...)
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-11-27  2:35 ` Chen Chen via Bugspray Bot
  2024-12-07  8:35 ` Chen Chen via Bugspray Bot
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-11-27  2:35 UTC (permalink / raw)
  To: linux-nfs, trondmy, jlayton, anna, cel

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307290
oom dmesg from kdump

File: vmcore-dmesg.txt (text/plain)
Size: 535.11 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307290
---
oom dmesg from kdump


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (6 preceding siblings ...)
  2024-11-27  2:35 ` Chen Chen via Bugspray Bot
@ 2024-12-07  8:35 ` Chen Chen via Bugspray Bot
  2024-12-07 15:30 ` Chuck Lever via Bugspray Bot
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-12-07  8:35 UTC (permalink / raw)
  To: jlayton, cel, linux-nfs, anna, trondmy

Chen Chen added an attachment on Kernel.org Bugzilla:

Created attachment 307330
dmesg of another 3 crashes

Since reporting this I have had another 3 crashes, all triggered in nfsd threads.

First one:
[136965.765431] Out of memory and no killable processes...
[136965.765433] Kernel panic - not syncing: System is deadlocked on memory
[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1
[136965.766852] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[136965.767546] Call Trace:
[136965.768230]  <TASK>
[136965.768903]  dump_stack_lvl+0x45/0x5e
[136965.769571]  panic+0x10c/0x2c2
[136965.770231]  out_of_memory.cold+0x2f/0x7e
[136965.770874]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[136965.771518]  __alloc_pages+0x35d/0x370
[136965.772147]  __alloc_pages_bulk+0x3e5/0x680
[136965.772766]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[136965.773431]  svc_recv+0x1f/0x190 [sunrpc]
[136965.774089]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[136965.774726]  nfsd+0x87/0xc0 [nfsd]
[136965.775347]  kthread+0xe5/0x110
[136965.775926]  ? kthread_complete_and_exit+0x20/0x20
[136965.776499]  ret_from_fork+0x1f/0x30
[136965.777062]  </TASK>

Second:
[167723.787640] WARNING: CPU: 3 PID: 1872 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[167723.787667] Modules linked in: <cut here>
[167723.787874] CPU: 3 PID: 1872 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[167723.787882] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[167723.787886] RIP: 0010:free_large_kmalloc+0x5a/0x80

Third:
[ 3883.748094] ------------[ cut here ]------------
[ 3883.748105] WARNING: CPU: 9 PID: 1886 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[ 3883.748131] Modules linked in: <cut here>
[ 3883.748339] CPU: 9 PID: 1886 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[ 3883.748342] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[ 3883.748344] RIP: 0010:free_large_kmalloc+0x5a/0x80

File: crash.log (text/plain)
Size: 31.77 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307330
---
dmesg of another 3 crashes


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (7 preceding siblings ...)
  2024-12-07  8:35 ` Chen Chen via Bugspray Bot
@ 2024-12-07 15:30 ` Chuck Lever via Bugspray Bot
  2024-12-10  5:20 ` Chen Chen via Bugspray Bot
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2024-12-07 15:30 UTC (permalink / raw)
  To: anna, jlayton, linux-nfs, cel, trondmy

Chuck Lever writes via Kernel.org Bugzilla:

Hi Chen -

After some review, these all appear to be Red Hat Enterprise kernels. Such kernels are extensively patched and maintained exclusively by Red Hat engineers. I kindly request that you report this issue to Red Hat first and have them troubleshoot it.

If they find there is a needed upstream fix, do feel free to re-open this bug.

[I am a fan of the old ConnectX-3 cards, btw]

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c9

* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (8 preceding siblings ...)
  2024-12-07 15:30 ` Chuck Lever via Bugspray Bot
@ 2024-12-10  5:20 ` Chen Chen via Bugspray Bot
  2024-12-10 14:45 ` Chuck Lever via Bugspray Bot
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-12-10  5:20 UTC (permalink / raw)
  To: linux-nfs, cel, anna, trondmy, jlayton

Chen Chen writes via Kernel.org Bugzilla:

Hi Mr. Lever,

I *clearly* stated I was using 6.1.119 which is the latest longterm kernel released on 2024-11-22, compiled by the ELRepo Project as-is from upstream tarball.

[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1


I encountered the problem on both the shipped RHEL kernels and the two most recent 6.1 LTS releases, so the bug must still exist upstream. That's why I filed this bug.

Anyway, I encountered another 2 crashes in the last two days and call stack insists nfsd caused it.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c10

* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (9 preceding siblings ...)
  2024-12-10  5:20 ` Chen Chen via Bugspray Bot
@ 2024-12-10 14:45 ` Chuck Lever via Bugspray Bot
  2024-12-11  1:15 ` Chen Chen via Bugspray Bot
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2024-12-10 14:45 UTC (permalink / raw)
  To: linux-nfs, cel, anna, trondmy, jlayton

Chuck Lever writes via Kernel.org Bugzilla:

This is what comment 0 says:

> My RHEL9 server with only NFS service often OOMed after a day or two,
> with no userspace memory usage. So I switched to elrepo kernel-lts and
> still the problem persists.

> I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on
> (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and
> 6.1.115-1.el9.elrepo.x86_64.

You mentioned RHEL, and RHEL 9 in particular, several times here. I have no prior knowledge of "the ELRepo Project" -- never heard of it. By "uname" these all look like distro-built kernels to me.

> Anyway, I encountered another 2 crashes in the last two days and
> call stack insists nfsd caused it.

I'm not saying this isn't an NFSD bug. But it might not be a problem in recent kernels. If I'm reading your reports correctly, you have not tested with 6.12 or newer. 6.1.anything is based on a two-year-old code base.

Any fix we create for this issue must be applied to the upstream Linus kernel first. Indeed, a fix might already exist somewhere in upstream. By upstream, I mean the "master" branch in this repo:

https://git.kernel.org./pub/scm/linux/kernel/git/torvalds/linux.git

Therefore the first task is for you to confirm by testing that this branch either still has this issue, in which case we have to troubleshoot further; or does not, in which case you can bisect to find the upstream fix that needs to be backported to the LTS kernels.
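
The search for a fix, rather than for a regression, is the same binary search with the predicate inverted. The sketch below illustrates the idea on a made-up linear history; the commit numbers, the is_fixed predicate and the function name first_fixed are all hypothetical, and real kernel history is a graph rather than a list. With git itself, custom terms such as "git bisect start --term-old=broken --term-new=fixed" keep the labels from getting confusing.

```python
def first_fixed(commits, is_fixed):
    """Binary-search an ordered history for the first fixed commit.

    Assumes commits[0] is known broken, commits[-1] is known fixed,
    and that the history stays fixed once the fix has landed.
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1  # each step stands for one build-and-test cycle
        if is_fixed(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], steps


# Hypothetical linear history of 50,000 commits; number 31,337 has the fix.
history = list(range(50_000))
commit, steps = first_fixed(history, lambda c: c >= 31_337)
print(f"first fixed commit: {commit}, found in {steps} test cycles")
```

The number of test cycles grows only with the logarithm of the range, but each cycle here means running the server long enough to be confident the leak is or is not present.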

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c11

* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (10 preceding siblings ...)
  2024-12-10 14:45 ` Chuck Lever via Bugspray Bot
@ 2024-12-11  1:15 ` Chen Chen via Bugspray Bot
  2024-12-12 16:00 ` Chuck Lever via Bugspray Bot
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2024-12-11  1:15 UTC (permalink / raw)
  To: cel, anna, jlayton, trondmy, linux-nfs

Chen Chen writes via Kernel.org Bugzilla:

Hi Mr. Lever,

> You mentioned RHEL, and RHEL 9 in particular, several times here.

Because I wanted to indicate that, apart from the kernel, all other software is the latest version from RHEL9.

The ELRepo Project (https://elrepo.org/) is a group that takes the latest kernel source and packages it into RPMs for easy installation on recent EL-like releases (RHEL, Oracle Linux, Rocky, Alma, etc.).

> By upstream, I mean the "master" branch in this repo

OK. I've just installed the latest stable kernel (6.12.4) and will see whether it helps.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c12

* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (11 preceding siblings ...)
  2024-12-11  1:15 ` Chen Chen via Bugspray Bot
@ 2024-12-12 16:00 ` Chuck Lever via Bugspray Bot
  2024-12-12 16:15   ` Fwd: " Chuck Lever
  2025-01-10 16:50 ` Chen Chen via Bugspray Bot
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 21+ messages in thread
From: Chuck Lever via Bugspray Bot @ 2024-12-12 16:00 UTC (permalink / raw)
  To: jlayton, linux-nfs, trondmy, cel, anna

Chuck Lever writes via Kernel.org Bugzilla:

From attachment 307290:

[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880]  <TASK>
[29925.425885]  dump_stack_lvl+0x45/0x5e
[29925.425893]  dump_header+0x4a/0x213
[29925.425897]  oom_kill_process.cold+0xb/0x10
[29925.425901]  out_of_memory+0xed/0x2e0
[29925.425906]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916]  __alloc_pages+0x35d/0x370
[29925.425921]  __alloc_pages_bulk+0x3e5/0x680
[29925.425927]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991]  svc_recv+0x1f/0x190 [sunrpc]
[29925.426043]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080]  nfsd+0x87/0xc0 [nfsd]
[29925.426113]  kthread+0xe5/0x110
[29925.426118]  ? kthread_complete_and_exit+0x20/0x20
[29925.426122]  ret_from_fork+0x1f/0x30
[29925.426129]  </TASK>

NFSD is triggering the OOM killer because it frequently allocates up to 256 pages at a time to fill the send and receive buffers. It is not necessarily the source of a leak.

The bulk page allocator is on the slow path here, suggesting there weren't any free pages available on the lists it normally checks first. So it is doing one-at-a-time order-0 allocations, a sign that memory is short.

We see that Node 1 appears to be short on free memory, but the system has not pushed into swap at all. Kernel memory isn't swappable, so whatever is leaking is in the kernel proper.

The slab caches all look reasonably sized, so not likely a slab leak.

At this point we would want someone with some MM expertise to come in and help us nail down the leak.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13

* Fwd: Possible memory leak on nfsd
  2024-12-12 16:00 ` Chuck Lever via Bugspray Bot
@ 2024-12-12 16:15   ` Chuck Lever
  0 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2024-12-12 16:15 UTC (permalink / raw)
  To: linux-mm, Linux NFS Mailing List

Hi -

An NFSD page allocation on v6.1.y is triggering OOM-killer. The reporter
has provided a lot of detail, and we need some help steering us towards
the possible leak culprit. Any takers?

(We've asked the reporter to reproduce on a more recent kernel if
possible).

-------- Forwarded Message --------
Subject: Re: Possible memory leak on nfsd
Date: Thu, 12 Dec 2024 16:00:17 +0000
From: Chuck Lever via Bugspray Bot <bugbot@kernel.org>
To: jlayton@kernel.org, linux-nfs@vger.kernel.org, trondmy@kernel.org, 
cel@kernel.org, anna@kernel.org

Chuck Lever writes via Kernel.org Bugzilla:

 From attachment 307290:

[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880]  <TASK>
[29925.425885]  dump_stack_lvl+0x45/0x5e
[29925.425893]  dump_header+0x4a/0x213
[29925.425897]  oom_kill_process.cold+0xb/0x10
[29925.425901]  out_of_memory+0xed/0x2e0
[29925.425906]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916]  __alloc_pages+0x35d/0x370
[29925.425921]  __alloc_pages_bulk+0x3e5/0x680
[29925.425927]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991]  svc_recv+0x1f/0x190 [sunrpc]
[29925.426043]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080]  nfsd+0x87/0xc0 [nfsd]
[29925.426113]  kthread+0xe5/0x110
[29925.426118]  ? kthread_complete_and_exit+0x20/0x20
[29925.426122]  ret_from_fork+0x1f/0x30
[29925.426129]  </TASK>

NFSD is triggering the OOM killer because it frequently allocates up to 
256 pages at a time to fill the send and receive buffers. It is not 
necessarily the source of a leak.

The bulk page allocator is on the slow path here, suggesting there 
weren't any free pages available on the lists it normally checks first. 
So it is doing one-at-a-time order-0 allocations, a sign that memory is 
short.

We see that Node 1 appears to be short on free memory, but the system 
has not pushed into swap at all. Kernel memory isn't swappable, so 
whatever is leaking is in the kernel proper.

The slab caches all look reasonably sized, so not likely a slab leak.

At this point we would want someone with some MM expertise to come in 
and help us nail down the leak.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13

* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (12 preceding siblings ...)
  2024-12-12 16:00 ` Chuck Lever via Bugspray Bot
@ 2025-01-10 16:50 ` Chen Chen via Bugspray Bot
  2025-01-10 20:35   ` Chuck Lever
  2025-01-22 20:45 ` JJ Jordan via Bugspray Bot
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 21+ messages in thread
From: Chen Chen via Bugspray Bot @ 2025-01-10 16:50 UTC (permalink / raw)
  To: anna, linux-nfs, linux-mm, chuck.lever, jlayton, cel, trondmy

Chen Chen writes via Kernel.org Bugzilla:

Sorry for my rudeness in my previous discussion.

After switching to 6.12.4, the server stayed stable for 30 days. So whatever caused the memleak should have been resolved between 6.1.119 and 6.12.

You might want to close this bug if backport is not worthwhile.

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c15

* Re: Possible memory leak on nfsd
  2025-01-10 16:50 ` Chen Chen via Bugspray Bot
@ 2025-01-10 20:35   ` Chuck Lever
  0 siblings, 0 replies; 21+ messages in thread
From: Chuck Lever @ 2025-01-10 20:35 UTC (permalink / raw)
  To: Chen Chen via Bugspray Bot, anna, linux-nfs, linux-mm, jlayton,
	cel, trondmy

On 1/10/25 11:50 AM, Chen Chen via Bugspray Bot wrote:
> Chen Chen writes via Kernel.org Bugzilla:
> 
> Sorry for my rudeness in my previous discussion.
> 
> After switching to 6.12.4, the server stayed stable for 30 days.

That's good news!


> So whatever caused the memleak should have been resolved between 6.1.119 and 6.12.

That's tens of thousands of commits over two years. Unfortunately that
doesn't really tell us what the problem is.


> You might want to close this bug if backport is not worthwhile.

We need to know the exact commit that contains the fix before it can
be determined whether a backport is feasible.

Are you able to bisect between v6.1 and v6.12 ? If not, do you have
a simple, narrow reproducer that we can use to explore this ourselves?


-- 
Chuck Lever


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (13 preceding siblings ...)
  2025-01-10 16:50 ` Chen Chen via Bugspray Bot
@ 2025-01-22 20:45 ` JJ Jordan via Bugspray Bot
  2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 21+ messages in thread
From: JJ Jordan via Bugspray Bot @ 2025-01-22 20:45 UTC (permalink / raw)
  To: anna, chuck.lever, cel, trondmy, jlayton, linux-nfs, linux-mm

JJ Jordan added an attachment on Kernel.org Bugzilla:

Created attachment 307525
Logs and traces from Jan-18 pt1

Here are the traces from two NFS crashes that occurred this past weekend.
Both occurred in the AM (US time) on Jan 18, a few hours apart from one
another.

I followed the instructions I found on the various threads.
There was no output from `rpcdebug -m rpc -c`; I am not sure what I did wrong
there. The syslog ought to contain the output of sysrq-trigger, however.

The output from trace-cmd captures several days' worth of logs in either
case, but not from system boot.

I have trimmed the syslogs to run from about one hour before the incident until the
system finished shutting down prior to reboot. I have removed the output of other services.

Both are VMs on GCE running the 6.1.119 kernel from Debian bookworm (6.1.0-28)
~60Gi memory, 16 CPUs.

File: nfs-traces-250118-pt1.tar.bz2 (application/octet-stream)
Size: 4.61 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307525
---
Logs and traces from Jan-18 pt1


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (14 preceding siblings ...)
  2025-01-22 20:45 ` JJ Jordan via Bugspray Bot
@ 2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
  2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
  2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
  17 siblings, 0 replies; 21+ messages in thread
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
  To: trondmy, linux-mm, anna, jlayton, cel, linux-nfs, chuck.lever

JJ Jordan added an attachment on Kernel.org Bugzilla:

Created attachment 307526
Logs and traces from Jan-18 pt2

Part 2, see previous description

File: nfs-traces-250118-pt2.tar.bz2 (application/octet-stream)
Size: 601.99 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307526
---
Logs and traces from Jan-18 pt2


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (15 preceding siblings ...)
  2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
@ 2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
  2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
  17 siblings, 0 replies; 21+ messages in thread
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
  To: trondmy, linux-mm, anna, jlayton, cel, linux-nfs, chuck.lever

JJ Jordan added an attachment on Kernel.org Bugzilla:

Comment on attachment 307525
Logs and traces from Jan-18 pt1

This was submitted in error, apologies.

File: nfs-traces-250118-pt1.tar.bz2 (application/octet-stream)
Size: 4.61 MiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307525
---
Logs and traces from Jan-18 pt1


* Re: Possible memory leak on nfsd
  2024-11-27  2:35 Possible memory leak on nfsd Chen Chen via Bugspray Bot
                   ` (16 preceding siblings ...)
  2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
@ 2025-01-22 21:25 ` JJ Jordan via Bugspray Bot
  17 siblings, 0 replies; 21+ messages in thread
From: JJ Jordan via Bugspray Bot @ 2025-01-22 21:25 UTC (permalink / raw)
  To: trondmy, linux-mm, anna, jlayton, cel, linux-nfs, chuck.lever

JJ Jordan added an attachment on Kernel.org Bugzilla:

Comment on attachment 307526
Logs and traces from Jan-18 pt2

Also submitted in error.

File: nfs-traces-250118-pt2.tar.bz2 (application/octet-stream)
Size: 601.99 KiB
Link: https://bugzilla.kernel.org/attachment.cgi?id=307526
---
Logs and traces from Jan-18 pt2
