Re: [PATCH] NFS: Avoid flushing data while holding directory locks in nfs_rename()

Linux NFS development
 help / color / mirror / Atom feed

From: Chuck Lever <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>,
	"anna@kernel.org" <anna@kernel.org>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH] NFS: Avoid flushing data while holding directory locks in nfs_rename()
Date: Wed, 30 Apr 2025 09:28:21 -0400	[thread overview]
Message-ID: <0a8d2285-e5b5-4a75-98a0-1a94d6fbbae3@oracle.com> (raw)
In-Reply-To: <c4b94ca894764b5f323fbf92530389f8c8b85894.camel@hammerspace.com>

On 4/29/25 7:22 PM, Trond Myklebust wrote:
> On Tue, 2025-04-29 at 12:14 -0400, Chuck Lever wrote:
>> On 4/27/25 7:01 PM, trondmy@kernel.org wrote:
>>> From: Trond Myklebust <trond.myklebust@hammerspace.com>
>>>
>>> The Linux client assumes that all filehandles are non-volatile for
>>> renames within the same directory (otherwise sillyrename cannot
>>> work).
>>> However, the existence of the Linux 'subtree_check' export option
>>> has
>>> meant that nfs_rename() has always assumed it needs to flush writes
>>> before attempting to rename.
>>>
>>> Since NFSv4 does allow the client to query whether or not the
>>> server
>>> exhibits this behaviour, and since knfsd does actually set the
>>> appropriate flag when 'subtree_check' is enabled on an export, it
>>> should be OK to optimise away the write flushing behaviour in the
>>> cases
>>> where it is clearly not needed.
>>
>> No objection to the high level goal, seems like a sensible change.
>>
>> So NFSv3 still has the flushing requirement, but NFSv4 can disable
>> that
>> requirement as long as the server in question asserts
>> FH4_VOLATILE_ANY
>> and FH4_VOL_RENAME, correct?
>>
>> I'm wondering how confident we are that other server implementations
>> behave reasonably. Yes, they are broken if they don't behave, but
>> there
>> is still risk of introducing a regression.
> 
> I'd prefer to deal with that through other means if it turns out that
> we have to. The problem we have right now is that we are forcing a
> writeback of the entire file while holding several directory mutexes
> (as well as the filesystem-global s_vfs_rename_mutex). That's terrible
> for performance and scalability.
> 
>>
>>
>>> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
>>> ---
>>>  fs/nfs/client.c           |  2 ++
>>>  fs/nfs/dir.c              | 15 ++++++++++++++-
>>>  include/linux/nfs_fs_sb.h |  6 ++++++
>>>  3 files changed, 22 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/nfs/client.c b/fs/nfs/client.c
>>> index 2115c1189c2d..6d63b958c4bb 100644
>>> --- a/fs/nfs/client.c
>>> +++ b/fs/nfs/client.c
>>> @@ -1105,6 +1105,8 @@ struct nfs_server *nfs_create_server(struct
>>> fs_context *fc)
>>>  		if (server->namelen == 0 || server->namelen >
>>> NFS2_MAXNAMLEN)
>>>  			server->namelen = NFS2_MAXNAMLEN;
>>>  	}
>>> +	/* Linux 'subtree_check' borkenness mandates this setting
>>> */
>>
>> Assuming you are going to respin this patch to deal with the build
>> bot
>> warnings, can you make this comment more useful? "borkenness" sounds
>> like you are dealing with a server bug here, but that's not really
>> the case. subtree_check is working as designed: it requires the extra
>> flushing. The no_subtree_check case does not, IIUC.
> 
> subtree checking is working as designed, but that doesn't change the
> fact that it is a violation of the NFSv3 protocol. You can't to change
> the filehandle in mid flight in any stateless protocol, because that
> will break applications.

My point is that the comment you add in this patch, although whimsical,
is not terribly illuminating. A more expansive comment (with the detail
you provide above) would be helpful.

But if subtree_check is not compliant with RFC 1813, perhaps we should
consider finally deprecating and removing it. At least exports(5) should
mention the spec compliance issue, but my copy of exports(5) does not.


>> It would be better to explain this need strictly in terms of file
>> handle
>> volatility, no?
>>
>>
>>> +	server->fh_expire_type = NFS_FH_VOL_RENAME;
>>>  
>>>  	if (!(fattr->valid & NFS_ATTR_FATTR)) {
>>>  		error = ctx->nfs_mod->rpc_ops->getattr(server,
>>> ctx->mntfh,
>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>> index bd23fc736b39..d0e0b435a843 100644
>>> --- a/fs/nfs/dir.c
>>> +++ b/fs/nfs/dir.c
>>> @@ -2676,6 +2676,18 @@ nfs_unblock_rename(struct rpc_task *task,
>>> struct nfs_renamedata *data)
>>>  	unblock_revalidate(new_dentry);
>>>  }
>>>  
>>> +static bool nfs_rename_is_unsafe_cross_dir(struct dentry
>>> *old_dentry,
>>> +					   struct dentry
>>> *new_dentry)
>>> +{
>>> +	struct nfs_server *server = NFS_SB(old_dentry->d_sb);
>>> +
>>> +	if (old_dentry->d_parent != new_dentry->d_parent)
>>> +		return false;
>>> +	if (server->fh_expire_type & NFS_FH_RENAME_UNSAFE)
>>> +		return !(server->fh_expire_type &
>>> NFS_FH_NOEXPIRE_WITH_OPEN);
>>> +	return true;
>>> +}
>>> +
>>>  /*
>>>   * RENAME
>>>   * FIXME: Some nfsds, like the Linux user space nfsd, may generate
>>> a
>>> @@ -2763,7 +2775,8 @@ int nfs_rename(struct mnt_idmap *idmap,
>>> struct inode *old_dir,
>>>  
>>>  	}
>>>  
>>> -	if (S_ISREG(old_inode->i_mode))
>>> +	if (S_ISREG(old_inode->i_mode) &&
>>> +	    nfs_rename_is_unsafe_cross_dir(old_dentry,
>>> new_dentry))
>>>  		nfs_sync_inode(old_inode);
>>>  	task = nfs_async_rename(old_dir, new_dir, old_dentry,
>>> new_dentry,
>>>  				must_unblock ? nfs_unblock_rename
>>> : NULL);
>>> diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
>>> index 71319637a84e..6732c7e04d19 100644
>>> --- a/include/linux/nfs_fs_sb.h
>>> +++ b/include/linux/nfs_fs_sb.h
>>> @@ -236,6 +236,12 @@ struct nfs_server {
>>>  	u32			acl_bitmask;	/* V4 bitmask
>>> representing the ACEs
>>>  						   that are
>>> supported on this
>>>  						   filesystem */
>>> +	/* The following #defines numerically match the NFSv4
>>> equivalents */
>>> +#define NFS_FH_NOEXPIRE_WITH_OPEN (0x1)
>>> +#define NFS_FH_VOLATILE_ANY (0x2)
>>> +#define NFS_FH_VOL_MIGRATION (0x4)
>>> +#define NFS_FH_VOL_RENAME (0x8)
>>> +#define NFS_FH_RENAME_UNSAFE (NFS_FH_VOLATILE_ANY |
>>> NFS_FH_VOL_RENAME)
>>>  	u32			fh_expire_type;	/* V4
>>> bitmask representing file
>>>  						   handle
>>> volatility type for
>>>  						   this filesystem
>>> */
>>
> 


-- 
Chuck Lever

     prev parent reply	other threads:[~2025-04-30 13:28 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-27 23:01 [PATCH] NFS: Avoid flushing data while holding directory locks in nfs_rename() trondmy
2025-04-28  1:53 ` kernel test robot
2025-04-28  2:35 ` kernel test robot
2025-04-29 16:14 ` Chuck Lever
2025-04-29 16:54   ` Jeff Layton
2025-04-29 23:22   ` Trond Myklebust
2025-04-30 13:28     ` Chuck Lever [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0a8d2285-e5b5-4a75-98a0-1a94d6fbbae3@oracle.com \
    --to=chuck.lever@oracle.com \
    --cc=anna@kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=trondmy@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox