From: Trond Myklebust <trondmy@kernel.org>
To: gaurav gangalwar <gaurav.gangalwar@gmail.com>
Cc: anna@kernel.org, tom@talpey.com, chuck.lever@oracle.com,
linux-nfs@vger.kernel.org
Subject: Re: [PATCH] nfs: Implement delayed data server destruction with hold cache
Date: Wed, 19 Nov 2025 13:24:29 -0500
Message-ID: <08ce85ac96d63f4ac9dd94bf444095359ffe4dbd.camel@kernel.org>
In-Reply-To: <CAJiE4O=62Yxeo=24p9k3H_dC6Z7LuVwLFo8ca98yJTvsSTMfhQ@mail.gmail.com>
On Wed, 2025-11-19 at 22:04 +0530, gaurav gangalwar wrote:
> Thanks Trond for review comments, reply inline.
>
> On Tue, Nov 18, 2025 at 9:46 PM Trond Myklebust <trondmy@kernel.org>
> wrote:
> >
> > On Tue, 2025-11-18 at 09:43 -0500, Trond Myklebust wrote:
> > > On Tue, 2025-11-18 at 05:57 -0500, Gaurav Gangalwar wrote:
> > > > Introduce a hold cache mechanism for NFS pNFS data servers to
> > > > avoid unnecessary connection churn when data servers are
> > > > temporarily idle.
> > > >
> > > > Key changes:
> > > >
> > > > 1. Hold Cache Implementation:
> > > > - Add nfs4_data_server_hold_cache to namespace structure
> > > > - Move data servers to hold cache when refcount reaches zero
> > > > - Always update ds_last_access timestamp on every reference
> > > >
> > > > 2. Configurable Module Parameters:
> > > > - nfs4_pnfs_ds_grace_period: Grace period before destroying
> > > >   idle data servers (default: 300 seconds)
> > > > - nfs4_pnfs_ds_cleanup_interval: Interval for periodic cleanup
> > > >   work (default: 300 seconds)
> > > >
> > > > 3. Periodic Cleanup Work:
> > > > - Schedule delayed work on first DS usage (lazy initialization)
> > > > - Check hold cache and destroy data servers that exceed the
> > > >   grace period
> > > > - Reschedule work automatically for continuous monitoring
> > > >
> > > > 4. Callback Mechanism:
> > > > - Use function pointer callback to avoid circular module
> > > >   dependencies
> > > > - nfsv4.ko registers cleanup callback during initialization
> > > > - nfs.ko calls callback during namespace cleanup (if registered)
> > > >
> > > > 5. Timestamp Tracking:
> > > > - Add ds_last_access field to nfs4_pnfs_ds structure
> > > > - Update timestamp on DS allocation, lookup, and reference
> > > >
> > > > Benefits:
> > > > - Reduces connection setup/teardown overhead for intermittently
> > > >   used DSs
> > > > - Allows DS reuse if accessed again within grace period
> > > > - Configurable behavior via module parameters
> > > >
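As an aside, the core behaviour proposed above — destroy a data server
only once its refcount is zero and it has been idle past the grace
period — reduces to a check like the following. This is a userspace
sketch; the struct and names are illustrative, not the patch's actual
code:

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative defaults matching the proposed module parameters */
#define DS_GRACE_PERIOD_SECS 300

struct pnfs_ds {
	time_t last_access;	/* updated on every reference, per the patch */
	int    refcount;
};

/*
 * A data server in the hold cache is destroyed only once it is
 * unreferenced AND has been idle for longer than the grace period.
 */
static bool ds_should_destroy(const struct pnfs_ds *ds, time_t now)
{
	return ds->refcount == 0 &&
	       (now - ds->last_access) >= DS_GRACE_PERIOD_SECS;
}
```

In the patch this check would run from the periodic delayed work rather
than on demand.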
> > >
> > > Please read RFC8881 Section 12.2.10
> > > (https://datatracker.ietf.org/doc/html/rfc8881#device_ids)
> > >
> > > Specifically, the following paragraph, which disallows what you
> > > are proposing:
> > >
> > > Device ID to device address mappings are not leased, and can be
> > > changed at any time. (Note that while device ID to device address
> > > mappings are likely to change after the metadata server restarts,
> > > the server is not required to change the mappings.) A server has
> > > two choices for changing mappings. It can recall all layouts
> > > referring to the device ID or it can use a notification mechanism.
> > >
> nfs4_data_server_cache is per network namespace and caches ds_addrs
> -> nfs_client mappings, so it should be independent of device IDs.
OK, but that dissociates the address cache from the deviceid cache, and
means that when we finally get round to implementing deviceid
notifications, then we'll have to manage 2 levels of caching. That's
not desirable either.
If you really need this extra caching of connections, then is there any
reason why you can't just implement it with deviceid notifications?
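To make that concrete: RFC 8881's notify_deviceid_type4 defines
NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE, and a client that
requests them only needs to drop or revalidate a cached connection when
one actually arrives. A rough userspace model of the eviction side (all
struct and function names here are illustrative, not kernel API):

```c
#include <stdbool.h>

/* Modeled on RFC 8881 notify_deviceid_type4 */
enum notify_deviceid_type {
	NOTIFY_DEVICEID_CHANGE,	/* mapping updated: refetch via GETDEVICEINFO */
	NOTIFY_DEVICEID_DELETE,	/* deviceid gone: drop cached connections */
};

#define MAX_DS 8

struct ds_cache_entry {
	unsigned long deviceid;
	bool          connected;
	bool          stale;	/* must GETDEVICEINFO before reuse */
};

struct ds_cache {
	struct ds_cache_entry entries[MAX_DS];
	int n;
};

/*
 * On a notification, invalidate or drop only the matching entries;
 * connections for untouched deviceids may stay cached indefinitely.
 */
static void ds_cache_notify(struct ds_cache *c, unsigned long deviceid,
			    enum notify_deviceid_type type)
{
	for (int i = 0; i < c->n; i++) {
		if (c->entries[i].deviceid != deviceid)
			continue;
		if (type == NOTIFY_DEVICEID_DELETE)
			c->entries[i].connected = false;
		else
			c->entries[i].stale = true;
	}
}
```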
> I am trying to understand how a change in the device ID to device
> address mapping can make a difference to nfs4_data_server_cache,
> since the cache lookup is done using the DS address. As long as the
> address and connections are valid, it should be fine.
> One scenario where the address is valid but the connection is not
> could be an IP address move, but in that case the connection should
> reset and the NFS client should reconnect.
Are you asking under what circumstances a notification might want to be
sent?
The following come to mind: rebalancing client load across multiple IP
addresses, managing RDMA vs plain TCP connections, network
failover/failback to a different IP and/or subnet, or just letting the
client know about temporary outages of some addresses. In some cases,
it could even just be that the data server is being decommissioned, and
so the deviceids are being deleted permanently.
The point is that notifications allow you to do caching of connections
indefinitely if you want to.
One thing to note, though, is that since hyperscalers have been known
to set up environments where the number of data servers reaches the
thousands, you will at the very least want to limit the maximum size of
the cache.
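To illustrate the sizing point: a capped hold cache would evict the
least-recently-used idle entry once full. A userspace sketch (the cap
value and names are made up for illustration):

```c
#include <time.h>

#define DS_CACHE_MAX 128	/* illustrative cap, not a proposed value */

struct held_ds {
	time_t last_access;
	int    refcount;
};

/*
 * When the hold cache is full, pick the least-recently-used
 * unreferenced entry to destroy; return -1 if every entry is still
 * referenced (nothing can be evicted).
 */
static int ds_pick_victim(const struct held_ds *held, int n)
{
	int victim = -1;

	for (int i = 0; i < n; i++) {
		if (held[i].refcount != 0)
			continue;
		if (victim < 0 ||
		    held[i].last_access < held[victim].last_access)
			victim = i;
	}
	return victim;
}
```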
> >
> > Note that you could circumvent the above restriction by adding a
> > revalidation step, i.e. in order to figure out whether the cached
> > addresses and connections are still valid and preferred, call
> > GETDEVICEINFO after receiving the first LAYOUTGET to re-reference
> > the cached device ID.
> I didn't get this; GETDEVICEINFO should already be happening after
> LAYOUTGET, so if there is a change in the device info the client will
> get it.
> >
> > However, given that we usually keep layouts around until the
> > delegation is returned (assuming the server handed us one), we
> > should be caching these connections for a minute or so already.
>
> We have enabled only read delegations, so this is unlikely to help.
Sure, but that's something you can fix on the server. The client
support is already fully implemented.
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com