Message-ID: <08ce85ac96d63f4ac9dd94bf444095359ffe4dbd.camel@kernel.org>
Subject: Re: [PATCH] nfs: Implement delayed data server destruction with hold cache
From: Trond Myklebust
To: gaurav gangalwar
Cc: anna@kernel.org, tom@talpey.com, chuck.lever@oracle.com, linux-nfs@vger.kernel.org
Date: Wed, 19 Nov 2025 13:24:29 -0500
References: <20251118105752.52098-1-gaurav.gangalwar@gmail.com>
 <1fd78dbccac873a277e71e55409acc5d1d3e6886.camel@kernel.org>
Content-Type: text/plain; charset="UTF-8"
User-Agent: Evolution 3.58.1 (3.58.1-1.fc43)
Precedence: bulk
X-Mailing-List: linux-nfs@vger.kernel.org
MIME-Version: 1.0

On Wed, 2025-11-19 at 22:04 +0530, gaurav gangalwar wrote:
> Thanks Trond for review comments, reply inline.
>
> On Tue, Nov 18, 2025 at 9:46 PM Trond Myklebust wrote:
> >
> > On Tue, 2025-11-18 at 09:43 -0500, Trond Myklebust wrote:
> > > On Tue, 2025-11-18 at 05:57 -0500, Gaurav Gangalwar wrote:
> > > > Introduce a hold cache mechanism for NFS pNFS data servers to
> > > > avoid unnecessary connection churn when data servers are
> > > > temporarily idle.
> > > >
> > > > Key changes:
> > > >
> > > > 1. Hold Cache Implementation:
> > > >    - Add nfs4_data_server_hold_cache to namespace structure
> > > >    - Move data servers to hold cache when refcount reaches zero
> > > >    - Always update ds_last_access timestamp on every reference
> > > >
> > > > 2. Configurable Module Parameters:
> > > >    - nfs4_pnfs_ds_grace_period: Grace period before destroying
> > > >      idle data servers (default: 300 seconds)
> > > >    - nfs4_pnfs_ds_cleanup_interval: Interval for periodic
> > > >      cleanup work (default: 300 seconds)
> > > >
> > > > 3. Periodic Cleanup Work:
> > > >    - Schedule delayed work on first DS usage (lazy
> > > >      initialization)
> > > >    - Check hold cache and destroy data servers that exceed
> > > >      grace period
> > > >    - Reschedule work automatically for continuous monitoring
> > > >
> > > > 4. Callback Mechanism:
> > > >    - Use function pointer callback to avoid circular module
> > > >      dependencies
> > > >    - nfsv4.ko registers cleanup callback during initialization
> > > >    - nfs.ko calls callback during namespace cleanup (if
> > > >      registered)
> > > >
> > > > 5. Timestamp Tracking:
> > > >    - Add ds_last_access field to nfs4_pnfs_ds structure
> > > >    - Update timestamp on DS allocation, lookup, and reference
> > > >
> > > > Benefits:
> > > > - Reduces connection setup/teardown overhead for intermittently
> > > >   used DSs
> > > > - Allows DS reuse if accessed again within grace period
> > > > - Configurable behavior via module parameters
> > > >
> > >
> > > Please read RFC8881 Section 12.2.10
> > > (https://datatracker.ietf.org/doc/html/rfc8881#device_ids)
> > >
> > > Specifically, the following paragraph, which disallows what you
> > > are proposing:
> > >
> > > Device ID to device address mappings are not leased, and can be
> > > changed at any time. (Note that while device ID to device address
> > > mappings are likely to change after the metadata server restarts,
> > > the server is not required to change the mappings.)
> > > A server has two choices for changing mappings. It can recall all
> > > layouts referring to the device ID or it can use a notification
> > > mechanism.
> > >
> nfs4_data_server_cache is per network namespace and caches ds_addrs ->
> nfs_client, so it should be independent of device id.

OK, but that dissociates the address cache from the deviceid cache, and
means that when we finally get round to implementing deviceid
notifications, then we'll have to manage 2 levels of caching. That's
not desirable either.

If you really need this extra caching of connections, then is there any
reason why you can't just implement it with deviceid notifications?

> I am trying to understand how a change in Device ID to device address
> mapping can make a difference to nfs4_data_server_cache, since this
> cache lookup is done using the ds address. As long as the address and
> connections are valid it should be fine.
> One scenario I can think of where the address is valid but the
> connection is not could be an IP address move, but in that case the
> connection should reset and the nfs client should reconnect.

Are you asking under what circumstances a notification might want to be
sent? The following come to mind: rebalancing client load across
multiple IP addresses, managing RDMA vs plain TCP connections, network
failover/failback to a different IP and/or subnet, or just letting the
client know about temporary outages of some addresses. In some cases,
it could even just be that the data server is being decommissioned, and
so the deviceids are being deleted permanently.

The point is that notifications allow you to do caching of connections
indefinitely if you want to.

One thing to note though, is that since hyperscalers have been known to
set up environments where the number of data servers reaches the 1000s,
you will at the very least want to limit the maximum size of the cache.

> >
> > Note that you could circumvent the above restriction by adding a
> > revalidating step.
> > i.e. in order to figure out if the cached addresses and connections
> > are still valid and preferred, call GETDEVICEINFO after receiving
> > the first LAYOUTGET to re-reference the cached device id.
> I didn't get this. GETDEVICEINFO should already be happening after
> LAYOUTGET, so if there is a change in device info the client will get
> it.
> >
> > However given that we usually keep layouts around until the
> > delegation is returned (assuming the server handed us one), we
> > should be caching these connections for a minute or so already.
>
> We have enabled only read delegations, so this is unlikely to help.

Sure, but that's something you can fix on the server. The client
support is already fully implemented.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trondmy@kernel.org, trond.myklebust@hammerspace.com
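
Illustration of the cache-size point raised in the reply: a minimal userspace sketch, in Python, of a hold cache that combines a grace period with a hard cap on the number of idle entries. It only models the policy under discussion and is not the patch's kernel code. The names HoldCache, grace_period and max_entries are hypothetical, and evicting the longest-idle entry when the cap is hit is an assumption, not something proposed in the thread.

```python
from collections import OrderedDict


class HoldCache:
    """Idle data-server connections keyed by address (hypothetical sketch)."""

    def __init__(self, grace_period=300.0, max_entries=1024):
        self.grace_period = grace_period  # seconds an idle entry may linger
        self.max_entries = max_entries    # hard cap on idle entries
        self._idle = OrderedDict()        # addr -> (conn, idle_since), oldest first

    def __len__(self):
        return len(self._idle)

    def hold(self, addr, conn, now):
        """Park a connection whose refcount dropped to zero.

        Returns the connections evicted to respect the size cap; the
        caller is expected to destroy them.
        """
        evicted = []
        previous = self._idle.pop(addr, None)
        if previous is not None and previous[0] is not conn:
            evicted.append(previous[0])   # replaced entry must be torn down
        self._idle[addr] = (conn, now)    # newest idle entry goes last
        while len(self._idle) > self.max_entries:
            _, (old_conn, _) = self._idle.popitem(last=False)
            evicted.append(old_conn)      # assumption: drop longest-idle first
        return evicted

    def reuse(self, addr):
        """Take a parked connection back into service, or return None."""
        entry = self._idle.pop(addr, None)
        return None if entry is None else entry[0]

    def reap(self, now):
        """Periodic cleanup: return connections idle for the full grace period."""
        expired = []
        while self._idle:
            addr, (conn, idle_since) = next(iter(self._idle.items()))
            if now - idle_since < self.grace_period:
                break                     # remaining entries are younger
            del self._idle[addr]
            expired.append(conn)
        return expired
```

Timestamps are passed in explicitly so the behaviour is deterministic. The sketch relies on `now` never going backwards, which keeps the dictionary ordered by idle time and lets `reap` stop at the first entry that is still inside the grace period.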