From: Dai Ngo <dai.ngo@oracle.com>
To: chuck.lever@oracle.com
Cc: linux-nfs@vger.kernel.org
Subject: [PATCH v2 0/2] NFSD: handling memory shortage problem with Courteous server
Date: Mon, 4 Jul 2022 12:05:41 -0700
Message-ID: <1656961543-25210-1-git-send-email-dai.ngo@oracle.com>
Currently the idle timeout for a courtesy client is fixed at 1 day. If
a large number of courtesy clients remain in the system, they can cause
a memory shortage that affects the operation of other modules
in the kernel. This problem can be observed by running the pynfs nfs4.0
CID5 test in a loop. Eventually the system runs out of memory and rpc.gssd
fails to add a new watch:
rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
No space left on device
and alloc_inode also fails with out of memory:
Call Trace:
<TASK>
dump_stack_lvl+0x33/0x42
dump_header+0x4a/0x1ed
oom_kill_process+0x80/0x10d
out_of_memory+0x237/0x25f
__alloc_pages_slowpath.constprop.0+0x617/0x7b6
__alloc_pages+0x132/0x1e3
alloc_slab_page+0x15/0x33
allocate_slab+0x78/0x1ab
? alloc_inode+0x38/0x8d
___slab_alloc+0x2af/0x373
? alloc_inode+0x38/0x8d
? slab_pre_alloc_hook.constprop.0+0x9f/0x158
? alloc_inode+0x38/0x8d
__slab_alloc.constprop.0+0x1c/0x24
kmem_cache_alloc_lru+0x8c/0x142
alloc_inode+0x38/0x8d
iget_locked+0x60/0x126
kernfs_get_inode+0x18/0x105
kernfs_iop_lookup+0x6d/0xbc
__lookup_slow+0xb7/0xf9
lookup_slow+0x3a/0x52
walk_component+0x90/0x100
? inode_permission+0x87/0x128
link_path_walk.part.0.constprop.0+0x266/0x2ea
? path_init+0x101/0x2f2
path_lookupat+0x4c/0xfa
filename_lookup+0x63/0xd7
? getname_flags+0x32/0x17a
? kmem_cache_alloc+0x11f/0x144
? getname_flags+0x16c/0x17a
user_path_at_empty+0x37/0x4b
do_readlinkat+0x61/0x102
__x64_sys_readlinkat+0x18/0x1b
do_syscall_64+0x57/0x72
entry_SYSCALL_64_after_hwframe+0x46/0xb0
This series addresses this problem by:
. removing the fixed 1-day idle time limit for courtesy clients.
A courtesy client is now allowed to remain valid as long as
available system memory is above 80%.
. when available system memory drops below 80%, the laundromat starts
trimming older courtesy clients. The number of courtesy clients
to trim is a percentage of the total number of courtesy clients
that exist in the system. This percentage is computed from
the current percentage of available system memory.
. the percentage of courtesy clients to be trimmed
is based on this table:
  ----------------------------------
  | % memory  | % courtesy clients |
  | available | to trim            |
  ----------------------------------
  |   > 80    |          0         |
  |   > 70    |         10         |
  |   > 60    |         20         |
  |   > 50    |         40         |
  |   > 40    |         60         |
  |   > 30    |         80         |
  |   < 30    |        100         |
  ----------------------------------
. due to the overhead associated with removing client records,
there is a limit of 128 clients to be trimmed in each
laundromat run. This prevents the laundromat from
spending too long destroying clients and failing to perform
its other tasks in a timely manner.
. the laundromat is scheduled to run sooner if more
courtesy clients need to be destroyed.
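
The trimming policy described in the list above can be sketched in
user-space C as follows. This is only an illustration of the table and
the 128-client cap, not the code from the patches: the names
courtesy_trim_percent, courtesy_trim_plan, struct trim_plan and
COURTESY_TRIM_MAX are hypothetical, and treating exactly 30% available
memory as the 100% row is an assumption, since the table does not say
which row covers it.

```c
#include <assert.h>

/* Hypothetical name for the 128-clients-per-run limit described above. */
#define COURTESY_TRIM_MAX 128U

/* Result of one planning step (hypothetical structure). */
struct trim_plan {
	unsigned int now;      /* clients to destroy in this laundromat run */
	unsigned int deferred; /* clients left over for later runs */
};

/*
 * Map the percentage of available system memory to the percentage of
 * courtesy clients to trim, following the table in the cover letter.
 */
static unsigned int courtesy_trim_percent(unsigned int avail_pct)
{
	assert(avail_pct <= 100);

	if (avail_pct > 80)
		return 0;
	if (avail_pct > 70)
		return 10;
	if (avail_pct > 60)
		return 20;
	if (avail_pct > 50)
		return 40;
	if (avail_pct > 40)
		return 60;
	if (avail_pct > 30)
		return 80;
	return 100; /* assumption: exactly 30% is handled like "< 30" */
}

/*
 * Work out how many courtesy clients the table asks to trim, then apply
 * the per-run cap. A non-zero "deferred" count is the condition under
 * which the laundromat would be scheduled to run sooner.
 */
static struct trim_plan courtesy_trim_plan(unsigned int nr_courtesy,
					   unsigned int avail_pct)
{
	unsigned long long want;
	struct trim_plan plan;

	want = (unsigned long long)nr_courtesy *
	       courtesy_trim_percent(avail_pct) / 100;
	plan.now = want > COURTESY_TRIM_MAX ? COURTESY_TRIM_MAX
					    : (unsigned int)want;
	plan.deferred = (unsigned int)want - plan.now;
	return plan;
}
```

For example, with 1000 courtesy clients and 45% of memory available,
the table asks for 60% (600 clients); courtesy_trim_plan(1000, 45)
limits this run to 128 and leaves 472 deferred.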
The shrinker method was evaluated and found unsuitable
for this problem for these reasons:
. destroying the NFSv4 client in the shrinker context can cause
a deadlock, since nfsd_file_put calls into the underlying FS
code and we have no control over what it will do, as seen in this
stack trace:
======================================================
WARNING: possible circular locking dependency detected
5.19.0-rc2_sk+ #1 Not tainted
------------------------------------------------------
lck/31847 is trying to acquire lock:
ffff88811d268850 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}, at: btrfs_inode_lock+0x38/0x70

but task is already holding lock:
ffffffffb41848c0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x506/0x1db0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0xc0/0x100
__kmalloc+0x51/0x320
btrfs_buffered_write+0x2eb/0xd90
btrfs_do_write_iter+0x6bf/0x11c0
do_iter_readv_writev+0x2bb/0x5a0
do_iter_write+0x131/0x630
nfsd_vfs_write+0x4da/0x1900 [nfsd]
nfsd4_write+0x2ac/0x760 [nfsd]
nfsd4_proc_compound+0xce8/0x23e0 [nfsd]
nfsd_dispatch+0x4ed/0xc10 [nfsd]
svc_process_common+0xd3f/0x1b00 [sunrpc]
svc_process+0x361/0x4f0 [sunrpc]
nfsd+0x2d6/0x570 [nfsd]
kthread+0x2a1/0x340
ret_from_fork+0x22/0x30

-> #0 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}:
__lock_acquire+0x318d/0x7830
lock_acquire+0x1bb/0x500
down_write+0x82/0x130
btrfs_inode_lock+0x38/0x70
btrfs_sync_file+0x280/0x1010
nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
nfsd_file_put+0xd4/0x110 [nfsd]
release_all_access+0x13a/0x220 [nfsd]
nfs4_free_ol_stateid+0x40/0x90 [nfsd]
free_ol_stateid_reaplist+0x131/0x210 [nfsd]
release_openowner+0xf7/0x160 [nfsd]
__destroy_client+0x3cc/0x740 [nfsd]
nfsd_cc_lru_scan+0x271/0x410 [nfsd]
shrink_slab.constprop.0+0x31e/0x7d0
shrink_node+0x54b/0xe50
try_to_free_pages+0x394/0xba0
__alloc_pages_slowpath.constprop.0+0x5d2/0x1db0
__alloc_pages+0x4d6/0x580
__handle_mm_fault+0xc25/0x2810
handle_mm_fault+0x136/0x480
do_user_addr_fault+0x3d8/0xec0
exc_page_fault+0x5d/0xc0
asm_exc_page_fault+0x27/0x30

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&sb->s_type->i_mutex_key#16);
                               lock(fs_reclaim);
  lock(&sb->s_type->i_mutex_key#16);

 *** DEADLOCK ***

. the shrinker kicks in only when available memory drops very low, roughly
below 5%. By this time, some other components in the system have already
run into problems caused by the memory shortage. For example, rpc.gssd starts
failing to add watches in /var/lib/nfs/rpc_pipefs/nfsd4_cb
once the memory consumed by these watches reaches about 1% of
available system memory.
. destroying the NFSv4 client has significant overhead due to
the upcall to user space to remove the client records, which
might access a storage device. There is a potential deadlock
if the storage subsystem needs to allocate memory.