From: Jeff Layton <jlayton@kernel.org>
To: Aleksa Sarai <cyphar@cyphar.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	 Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Chuck Lever <chuck.lever@oracle.com>,
	Amir Goldstein <amir73il@gmail.com>,
	Alexander Aring <alex.aring@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Namhyung Kim <namhyung@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>
Cc: linux-fsdevel@vger.kernel.org, linux-nfs@vger.kernel.org,
	 linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
	 linux-perf-users@vger.kernel.org
Subject: Re: [PATCH RFC v3 1/2] uapi: explain how per-syscall AT_* flags should be allocated
Date: Thu, 01 Aug 2024 15:21:49 -0400	[thread overview]
Message-ID: <37ca45753755fcb57702b5f61c7f18aeb02a7d96.camel@kernel.org> (raw)
In-Reply-To: <20240801-exportfs-u64-mount-id-v3-1-be5d6283144a@cyphar.com>
On Thu, 2024-08-01 at 13:52 +1000, Aleksa Sarai wrote:
> Unfortunately, the way we have gone about adding new AT_* flags has
> been a little messy. In the beginning, all of the AT_* flags had generic
> meanings and so it made sense to share the flag bits indiscriminately.
> However, we inevitably ran into syscalls that needed their own
> syscall-specific flags. Due to the lack of a planned out policy, we
> ended up with the following situations:
> 
>  * Existing syscalls adding new features tended to use new AT_* bits,
>    with some effort taken to try to re-use bits for flags that were so
>    obviously syscall specific that they only make sense for a single
>    syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet).
> 
>    Given the constraints of bitflags, this works well in practice, but
>    ideally (to avoid future confusion) we would plan ahead and define a
>    set of "per-syscall bits" ahead of time so that when allocating new
>    bits we don't end up with a complete mish-mash of which bits are
>    supposed to be per-syscall and which aren't.
> 
>  * New syscalls dealt with this in several ways:
> 
>    - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and
>      fspick(2)) created their separate own flag spaces that have no
>      overlap with the AT_* flags. Most of these ended up allocating
>      their bits sequentually.
> 
>      In the case of move_mount(2) and fspick(2), several flags have
>      identical meanings to AT_* flags but were allocated in their own
>      flag space.
> 
>      This makes sense for syscalls that will never share AT_* flags, but
>      for some syscalls this leads to duplication with AT_* flags in a
>      way that could cause confusion (if renameat2(2) grew a
>      RENAME_EMPTY_PATH it seems likely that users could mistake it for
>      AT_EMPTY_PATH since it is an *at(2) syscall).
> 
>    - Some syscalls unfortunately ended up both creating their own flag
>      space while also using bits from other flag spaces. The most
>      obvious example is open_tree(2), where the standard usage ends up
>      using flags from *THREE* separate flag spaces:
> 
>        open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE);
> 
>      (Note that O_CLOEXEC is also platform-specific, so several future
>      OPEN_TREE_* bits are also made unusable in one fell swoop.)
> 
> It's not entirely clear to me what the "right" choice is for new
> syscalls. Just saying that all future VFS syscalls should use AT_* flags
> doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which
> don't make much sense to burn generic AT_* flags for) and move_mount(2)
> has separate AT_*-like flags for both the source and target so separate
> flags are needed anyway (though it seems possible that renameat2(2)
> could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame
> they can't be reused).
> 
> But at least for syscalls that _do_ choose to use AT_* flags, we should
> explicitly state the policy that 0x2ff is currently intended for
> per-syscall flags and that new flags should err on the side of
> overlapping with existing flag bits (so we can extend the scope of
> generic flags in the future if necessary).
> 
> And add AT_* aliases for the RENAME_* flags to further cement that
> renameat2(2) is an *at(2) flag, just with its own per-syscall flags.
> 
> Suggested-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  include/uapi/linux/fcntl.h                         | 80 ++++++++++++++-------
>  tools/perf/trace/beauty/include/uapi/linux/fcntl.h | 83 +++++++++++++++-------
>  2 files changed, 115 insertions(+), 48 deletions(-)
> 
This seems like a reasonable proposal. We haven't been as careful about
how these flags get allocated as we probably should have in the past.
Nothing we can do about the existing ones, but some guidelines going
forward are welcome.
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index e55a3314bcb0..38a6d66d9e88 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -90,37 +90,69 @@
>  #define DN_ATTRIB	0x00000020	/* File changed attibutes */
>  #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
>  
> +#define AT_FDCWD		-100    /* Special value for dirfd used to
> +					   indicate openat should use the
> +					   current working directory. */
> +
> +
> +/* Generic flags for the *at(2) family of syscalls. */
> +
> +/* Reserved for per-syscall flags	0xff. */
> +#define AT_SYMLINK_NOFOLLOW		0x100   /* Do not follow symbolic
> +						   links. */
> +/* Reserved for per-syscall flags	0x200 */
> +#define AT_SYMLINK_FOLLOW		0x400   /* Follow symbolic links. */
> +#define AT_NO_AUTOMOUNT			0x800	/* Suppress terminal automount
> +						   traversal. */
> +#define AT_EMPTY_PATH			0x1000	/* Allow empty relative
> +						   pathname to operate on dirfd
> +						   directly. */
>  /*
> - * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS is
> - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
> - * unlinkat.  The two functions do completely different things and therefore,
> - * the flags can be allowed to overlap.  For example, passing AT_REMOVEDIR to
> - * faccessat would be undefined behavior and thus treating it equivalent to
> - * AT_EACCESS is valid undefined behavior.
> + * These flags are currently statx(2)-specific, but they could be made generic
> + * in the future and so they should not be used for other per-syscall flags.
>   */
> -#define AT_FDCWD		-100    /* Special value used to indicate
> -                                           openat should use the current
> -                                           working directory. */
> -#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
> +#define AT_STATX_SYNC_TYPE		0x6000	/* Type of synchronisation required from statx() */
> +#define AT_STATX_SYNC_AS_STAT		0x0000	/* - Do whatever stat() does */
> +#define AT_STATX_FORCE_SYNC		0x2000	/* - Force the attributes to be sync'd with the server */
> +#define AT_STATX_DONT_SYNC		0x4000	/* - Don't sync attributes with the server */
> +
> +#define AT_RECURSIVE			0x8000	/* Apply to the entire subtree */
> +
> +/*
> + * Per-syscall flags for the *at(2) family of syscalls.
> + *
> + * These are flags that are so syscall-specific that a user passing these flags
> + * to the wrong syscall is so "clearly wrong" that we can safely call such
> + * usage "undefined behaviour".
> + *
> + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value.
> + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful
> + * only to unlinkat. The two functions do completely different things and
> + * therefore, the flags can be allowed to overlap. For example, passing
> + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it
> + * equivalent to AT_EACCESS is valid undefined behavior.
> + *
> + * Note for implementers: When picking a new per-syscall AT_* flag, try to
> + * reuse already existing flags first. This leaves us with as many unused bits
> + * as possible, so we can use them for generic bits in the future if necessary.
> + */
> +
> +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> +#define AT_RENAME_NOREPLACE	0x0001
> +#define AT_RENAME_EXCHANGE	0x0002
> +#define AT_RENAME_WHITEOUT	0x0004
> +
> +/* Flag for faccessat(2). */
>  #define AT_EACCESS		0x200	/* Test access permitted for
>                                             effective IDs, not real IDs.  */
> +/* Flag for unlinkat(2). */
>  #define AT_REMOVEDIR		0x200   /* Remove directory instead of
>                                             unlinking file.  */
> -#define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
> -#define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
> -#define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
> -
> -#define AT_STATX_SYNC_TYPE	0x6000	/* Type of synchronisation required from statx() */
> -#define AT_STATX_SYNC_AS_STAT	0x0000	/* - Do whatever stat() does */
> -#define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
> -#define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
> -
> -#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
> +/* Flags for name_to_handle_at(2). */
> +#define AT_HANDLE_FID		0x200	/* File handle is needed to compare
> +					   object identity and may not be
> +					   usable with open_by_handle_at(2). */
>  
> -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */
> -#define AT_HANDLE_FID		AT_REMOVEDIR	/* file handle is needed to
> -					compare object identity and may not
> -					be usable to open_by_handle_at(2) */
>  #if defined(__KERNEL__)
>  #define AT_GETATTR_NOSEC	0x80000000
>  #endif
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
> index c0bcc185fa48..38a6d66d9e88 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/fcntl.h
> @@ -16,6 +16,9 @@
>  
>  #define F_DUPFD_QUERY	(F_LINUX_SPECIFIC_BASE + 3)
>  
> +/* Was the file just created? */
> +#define F_CREATED_QUERY	(F_LINUX_SPECIFIC_BASE + 4)
> +
>  /*
>   * Cancel a blocking posix lock; internal use only until we expose an
>   * asynchronous lock api to userspace:
> @@ -87,37 +90,69 @@
>  #define DN_ATTRIB	0x00000020	/* File changed attibutes */
>  #define DN_MULTISHOT	0x80000000	/* Don't remove notifier */
>  
> +#define AT_FDCWD		-100    /* Special value for dirfd used to
> +					   indicate openat should use the
> +					   current working directory. */
> +
> +
> +/* Generic flags for the *at(2) family of syscalls. */
> +
> +/* Reserved for per-syscall flags	0xff. */
> +#define AT_SYMLINK_NOFOLLOW		0x100   /* Do not follow symbolic
> +						   links. */
> +/* Reserved for per-syscall flags	0x200 */
> +#define AT_SYMLINK_FOLLOW		0x400   /* Follow symbolic links. */
> +#define AT_NO_AUTOMOUNT			0x800	/* Suppress terminal automount
> +						   traversal. */
> +#define AT_EMPTY_PATH			0x1000	/* Allow empty relative
> +						   pathname to operate on dirfd
> +						   directly. */
> +/*
> + * These flags are currently statx(2)-specific, but they could be made generic
> + * in the future and so they should not be used for other per-syscall flags.
> + */
> +#define AT_STATX_SYNC_TYPE		0x6000	/* Type of synchronisation required from statx() */
> +#define AT_STATX_SYNC_AS_STAT		0x0000	/* - Do whatever stat() does */
> +#define AT_STATX_FORCE_SYNC		0x2000	/* - Force the attributes to be sync'd with the server */
> +#define AT_STATX_DONT_SYNC		0x4000	/* - Don't sync attributes with the server */
> +
> +#define AT_RECURSIVE			0x8000	/* Apply to the entire subtree */
> +
>  /*
> - * The constants AT_REMOVEDIR and AT_EACCESS have the same value.  AT_EACCESS is
> - * meaningful only to faccessat, while AT_REMOVEDIR is meaningful only to
> - * unlinkat.  The two functions do completely different things and therefore,
> - * the flags can be allowed to overlap.  For example, passing AT_REMOVEDIR to
> - * faccessat would be undefined behavior and thus treating it equivalent to
> - * AT_EACCESS is valid undefined behavior.
> + * Per-syscall flags for the *at(2) family of syscalls.
> + *
> + * These are flags that are so syscall-specific that a user passing these flags
> + * to the wrong syscall is so "clearly wrong" that we can safely call such
> + * usage "undefined behaviour".
> + *
> + * For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value.
> + * AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful
> + * only to unlinkat. The two functions do completely different things and
> + * therefore, the flags can be allowed to overlap. For example, passing
> + * AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it
> + * equivalent to AT_EACCESS is valid undefined behavior.
> + *
> + * Note for implementers: When picking a new per-syscall AT_* flag, try to
> + * reuse already existing flags first. This leaves us with as many unused bits
> + * as possible, so we can use them for generic bits in the future if necessary.
>   */
> -#define AT_FDCWD		-100    /* Special value used to indicate
> -                                           openat should use the current
> -                                           working directory. */
> -#define AT_SYMLINK_NOFOLLOW	0x100   /* Do not follow symbolic links.  */
> +
> +/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> +#define AT_RENAME_NOREPLACE	0x0001
> +#define AT_RENAME_EXCHANGE	0x0002
> +#define AT_RENAME_WHITEOUT	0x0004
> +
> +/* Flag for faccessat(2). */
>  #define AT_EACCESS		0x200	/* Test access permitted for
>                                             effective IDs, not real IDs.  */
> +/* Flag for unlinkat(2). */
>  #define AT_REMOVEDIR		0x200   /* Remove directory instead of
>                                             unlinking file.  */
> -#define AT_SYMLINK_FOLLOW	0x400   /* Follow symbolic links.  */
> -#define AT_NO_AUTOMOUNT		0x800	/* Suppress terminal automount traversal */
> -#define AT_EMPTY_PATH		0x1000	/* Allow empty relative pathname */
> -
> -#define AT_STATX_SYNC_TYPE	0x6000	/* Type of synchronisation required from statx() */
> -#define AT_STATX_SYNC_AS_STAT	0x0000	/* - Do whatever stat() does */
> -#define AT_STATX_FORCE_SYNC	0x2000	/* - Force the attributes to be sync'd with the server */
> -#define AT_STATX_DONT_SYNC	0x4000	/* - Don't sync attributes with the server */
> -
> -#define AT_RECURSIVE		0x8000	/* Apply to the entire subtree */
> +/* Flags for name_to_handle_at(2). */
> +#define AT_HANDLE_FID		0x200	/* File handle is needed to compare
> +					   object identity and may not be
> +					   usable with open_by_handle_at(2). */
>  
> -/* Flags for name_to_handle_at(2). We reuse AT_ flag space to save bits... */
> -#define AT_HANDLE_FID		AT_REMOVEDIR	/* file handle is needed to
> -					compare object identity and may not
> -					be usable to open_by_handle_at(2) */
>  #if defined(__KERNEL__)
>  #define AT_GETATTR_NOSEC	0x80000000
>  #endif
> 
Reviewed-by: Jeff Layton <jlayton@kernel.org>
next prev parent reply	other threads:[~2024-08-01 19:21 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-01  3:52 [PATCH RFC v3 0/2] fhandle: expose u64 mount id to name_to_handle_at(2) Aleksa Sarai
2024-08-01  3:52 ` [PATCH RFC v3 1/2] uapi: explain how per-syscall AT_* flags should be allocated Aleksa Sarai
2024-08-01 19:21   ` Jeff Layton [this message]
2024-08-01  3:52 ` [PATCH RFC v3 2/2] fhandle: expose u64 mount id to name_to_handle_at(2) Aleksa Sarai
2024-08-01 19:46   ` Jeff Layton
2024-08-01 14:28 ` [PATCH RFC v3 0/2] " Josef Bacik
2024-08-02  1:43   ` Aleksa Sarai
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox
  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):
  git send-email \
    --in-reply-to=37ca45753755fcb57702b5f61c7f18aeb02a7d96.camel@kernel.org \
    --to=jlayton@kernel.org \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=alex.aring@gmail.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=amir73il@gmail.com \
    --cc=brauner@kernel.org \
    --cc=chuck.lever@oracle.com \
    --cc=cyphar@cyphar.com \
    --cc=irogers@google.com \
    --cc=jack@suse.cz \
    --cc=jolsa@kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY
  https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
  Be sure your reply has a Subject: header at the top and a blank line
  before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).