* Re: [PATCH 0/2] mount: add OPEN_TREE_NAMESPACE
From: Askar Safin @ 2026-01-19 17:11 UTC
To: brauner
Cc: amir73il, cyphar, jack, jlayton, josef, linux-fsdevel, viro,
Lennart Poettering, David Howells, Zhang Yunkai, cgel.zte,
Menglong Dong, linux-kernel, initramfs, containers, linux-api,
news, lwn, Jonathan Corbet, Rob Landley, emily, Christoph Hellwig
In-Reply-To: <20251229-work-empty-namespace-v1-0-bfb24c7b061f@kernel.org>
Christian Brauner <brauner@kernel.org>:
> Extend open_tree() with a new OPEN_TREE_NAMESPACE flag. Similar to
> OPEN_TREE_CLONE only the indicated mount tree is copied. Instead of
> returning a file descriptor referring to that mount tree
> OPEN_TREE_NAMESPACE will cause open_tree() to return a file descriptor
> to a new mount namespace. In that new mount namespace the copied mount
> tree has been mounted on top of a copy of the real rootfs.
I want to point out the security benefits of this.
[[ TL;DR: [1] and [2] are very big changes to how mount namespaces work.
I like them, and I think they should get wider exposure. ]]
If this patchset ([1]) and [2] both land (both are in "next" now and will
likely be submitted to mainline soon) and "nullfs_rootfs" is passed on the
command line, then a mount namespace created by
open_tree(OPEN_TREE_NAMESPACE) will usually contain exactly 2 mounts: nullfs
and whatever was passed to open_tree(OPEN_TREE_NAMESPACE).
This means that even if an attacker somehow manages to unmount its root and
gain access to the underlying mounts, the only underlying thing they will
get is nullfs.
It also means that other mounts are not merely hidden in the new namespace;
they are entirely absent. This prevents the attacks discussed in [3] and [4].
Finally, it means that (assuming we have both [1] and [2] and "nullfs_rootfs"
is passed) there is no longer a hidden writable mount shared by all containers
and potentially available to attackers. This is the concern raised in [5]:
> You want rootfs to be a NULLFS instead of ramfs. You don't seem to want it to
> actually _be_ a filesystem. Even with your "fix", containers could communicate
> with each _other_ through it if it becomes accessible. If a container can get
> access to an empty initramfs and write into it, it can ask/answer the question
> "Are there any other containers on this machine running stux24" and then coordinate.
Note: as far as I understand, all actual security bugs are already fixed in the
kernel, runc and similar tools. Still, [1] and [2] reduce the chance of similar
bugs in the future, which is a very good thing.
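To check the "exactly 2 mounts" expectation from inside such a namespace, one
option is to count the entries in /proc/self/mountinfo, which lists one line
per mount visible to the caller. The following is only a sketch: it assumes
/proc is reachable from the new namespace, and count_mount_lines() is a name
made up for this illustration, not something from [1] or [2].

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count the lines readable from f. For a stream opened on
 * /proc/self/mountinfo this equals the number of mounts visible
 * in the caller's mount namespace (one line per mount).
 * Hypothetical helper, for illustration only.
 */
static int count_mount_lines(FILE *f)
{
	int c, n = 0, last = '\n';

	assert(f != NULL);
	while ((c = fgetc(f)) != EOF) {
		if (c == '\n')
			n++;
		last = c;
	}
	/* Count a final line that lacks a trailing newline. */
	if (last != '\n')
		n++;
	return n;
}
```

A caller would open /proc/self/mountinfo with fopen(), pass the stream to
count_mount_lines(), and compare the result against 2.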
Also: [1] and [2] are pretty big changes to how mount namespaces work, so
I added more people and lists to CC.
This mail is a reply to [1].
[1] https://lore.kernel.org/all/20251229-work-empty-namespace-v1-0-bfb24c7b061f@kernel.org/
[2] https://lore.kernel.org/all/20260112-work-immutable-rootfs-v2-0-88dd1c34a204@kernel.org/
[3] https://lore.kernel.org/all/rxh6knvencwjajhgvdgzmrkwmyxwotu3itqyreun3h2pmaujhr@snhuqoq44kkf/
[4] https://github.com/opencontainers/runc/pull/1962
[5] https://lore.kernel.org/all/cec90924-e7ec-377c-fb02-e0f25ab9db73@landley.net/
--
Askar Safin
* Re: [RFC v1] man/man2/close.2: CAVEATS: Document divergence from POSIX.1-2024
From: Alejandro Colomar @ 2026-01-18 22:23 UTC
To: Zack Weinberg
Cc: Rich Felker, Vincent Lefevre, Jan Kara, Alexander Viro,
Christian Brauner, linux-fsdevel, linux-api, GNU libc development
In-Reply-To: <8c47e10a-be82-4d5b-a45e-2526f6e95123@app.fastmail.com>
Hi Zack and others,
Just a gentle ping. It would be nice to have an agreement for some
patch.
Have a lovely night!
Alex
On Fri, May 23, 2025 at 02:10:57PM -0400, Zack Weinberg wrote:
> Taking everything said in this thread into account, I have attempted to
> wordsmith new language for the close(2) manpage. Please let me know
> what you think, and please help me with the bits marked in square
> brackets. I can make this into a proper patch for the manpages
> when everyone is happy with it.
>
> zw
>
> ---
>
> DESCRIPTION
> ... existing text ...
>
> close() always succeeds. That is, after it returns, _fd_ has
> always been disconnected from the open file it formerly referred
> to, and its number can be recycled to refer to some other file.
> Furthermore, if _fd_ was the last reference to the underlying
> open file description, the resources associated with the open file
> description will always have been scheduled to be released.
>
> However, close() may report _delayed errors_ from a previous I/O
> operation. Therefore, its return value should not be ignored.
>
> RETURN VALUE
> close() returns zero if there are no delayed errors to report,
> or -1 if there _might_ be delayed errors.
>
> When close() returns -1, check _errno_ to see what the situation
> actually is. Most, but not all, _errno_ codes indicate a delayed
> I/O error that should be reported to the user. See ERRORS and
> NOTES for more detail.
>
> [QUERY: Is it ever possible to get delayed errors on close() from
> a file that was opened with O_RDONLY? What about a file that was
> opened with O_RDWR but never actually written to? If people only
> have to worry about delayed errors if the file was actually
> written to, we should say so at this point.
>
> It would also be good to mention whether it is possible to get a
> delayed error on close() even if a previous call to fsync() or
> fdatasync() succeeded and there haven’t been any more writes to
> that file *description* (not necessarily via the fd being closed)
> since.]
>
> ERRORS
> EBADF _fd_ wasn’t open in the first place, or is outside the
> valid numeric range for file descriptors.
>
> EINPROGRESS
> EINTR
> There are no delayed errors to report, but the kernel is
> still doing some clean-up work in the background. This
> situation should be treated the same as if close() had
> returned zero. Do not retry the close(), and do not report
> an error to the user.
>
> EDQUOT
> EFBIG
> EIO
> ENOSPC
> These are the most common errno codes associated with
> delayed I/O errors. They should be treated as a hard
> failure to write to the file that was formerly associated
> with _fd_, the same as if an earlier write(2) had failed
> with one of these codes. The file has still been closed!
> Do not retry the close(). But do report an error to the user.
>
> Depending on the underlying file, close() may return other errno
> codes; these should generally also be treated as delayed I/O errors.
>
> NOTES
> Dealing with error returns from close()
>
> As discussed above, close() always closes the file. Except when
> errno is set to EBADF, EINPROGRESS, or EINTR, an error return from
> close() reports a _delayed I/O error_ from a previous write()
> operation.
>
> It is vital to report delayed I/O errors to the user; failing to
> check the return value of close() can cause _silent_ loss of data.
> The most common situations where this actually happens involve
> networked filesystems, where, in the name of throughput, write()
> often returns success before the server has actually confirmed a
> successful write.
>
> However, it is also vital to understand that _no matter what_
> close() returns, and _no matter what_ it sets errno to, when it
> returns, _the file descriptor passed to close() has been closed_,
> and its number is _immediately_ available for reuse by open(2),
> dup(2), etc. Therefore, one should never retry a close(), not
> even if it set errno to a value that normally indicates the
> operation needs to be retried (e.g. EINTR). Retrying a close()
> is a serious bug, particularly in a multithreaded program; if
> the file descriptor number has already been reused, _that file_
> will get closed out from under whatever other thread opened it.
>
> [Possibly something about fsync/fdatasync here?]
>
> BUGS
> Prior to POSIX.1-2024, there was no official guarantee that
> close() would always close the file descriptor, even on error.
> Linux has always closed the file descriptor, even on error,
> but other implementations might not have.
>
> The only such implementation we have heard of is HP-UX; at least
> some versions of HP-UX’s man page for close() said it should be
> retried if it returned -1 with errno set to EINTR. (If you know
> exactly which versions of HP-UX are affected, or of any other
> Unix where close() doesn’t always close the file descriptor,
> please contact us about it.)
>
> Portable code should nonetheless never retry a failed close(); the
> consequences of a file descriptor leak are far less dangerous than
> the consequences of closing a file out from under another thread.
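The error-handling rules in the draft above could be followed with a small
wrapper along these lines. This is only a sketch under stated assumptions:
close_once() is a hypothetical name, and returning EBADF to the caller (as a
programming error rather than a delayed I/O error) is a choice made here, not
something the draft prescribes.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Close fd exactly once and never retry.
 * Returns 0 when there is nothing to report (including EINTR and
 * EINPROGRESS, which the draft says to treat like success), or the
 * positive errno value otherwise. In every case the descriptor must
 * be considered closed when this function returns.
 */
static int close_once(int fd)
{
	if (close(fd) == 0)
		return 0;

	switch (errno) {
	case EINTR:
	case EINPROGRESS:
		/* Clean-up continues in the background; not an error. */
		return 0;
	default:
		/* EIO, ENOSPC, EDQUOT, EFBIG, ...: delayed I/O error.
		 * EBADF: fd was never open (caller bug). */
		return errno;
	}
}
```

A caller that gets a non-zero result other than EBADF would report it the same
way as a failed write(2), and would not call close() on that descriptor again.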
--
<https://www.alejandro-colomar.es>
* Re: [PATCH bpf-next v5 8/9] libbpf: Add common attr support for map_create
From: Andrii Nakryiko @ 2026-01-16 22:33 UTC
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <22e0de9a-8963-454b-8b35-f8c9be15dee3@linux.dev>
On Fri, Jan 16, 2026 at 6:17 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 2026/1/16 09:03, Andrii Nakryiko wrote:
> > On Mon, Jan 12, 2026 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> With the previous commit adding common attribute support for
> >> BPF_MAP_CREATE, users can now retrieve detailed error messages when map
> >> creation fails via the log_buf field.
> >>
> >> Introduce struct bpf_syscall_common_attr_opts with the following fields:
> >> log_buf, log_size, log_level, and log_true_size.
> >>
> >> Extend bpf_map_create_opts with a new field common_attr_opts, allowing
> >> users to capture and inspect log messages on map creation failures.
> >>
> >> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> >> ---
> >> tools/lib/bpf/bpf.c | 15 ++++++++++++++-
> >> tools/lib/bpf/bpf.h | 17 ++++++++++++++++-
> >> 2 files changed, 30 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> >> index d44e667aaf02..d65df1b7b2be 100644
> >> --- a/tools/lib/bpf/bpf.c
> >> +++ b/tools/lib/bpf/bpf.c
> >> @@ -207,6 +207,9 @@ int bpf_map_create(enum bpf_map_type map_type,
> >> const struct bpf_map_create_opts *opts)
> >> {
> >> const size_t attr_sz = offsetofend(union bpf_attr, excl_prog_hash_size);
> >> + const size_t common_attr_sz = sizeof(struct bpf_common_attr);
> >> + struct bpf_syscall_common_attr_opts *common_attr_opts;
> >> + struct bpf_common_attr common_attr;
> >> union bpf_attr attr;
> >> int fd;
> >>
> >> @@ -240,7 +243,17 @@ int bpf_map_create(enum bpf_map_type map_type,
> >> attr.excl_prog_hash = ptr_to_u64(OPTS_GET(opts, excl_prog_hash, NULL));
> >> attr.excl_prog_hash_size = OPTS_GET(opts, excl_prog_hash_size, 0);
> >>
> >> - fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
> >> + common_attr_opts = OPTS_GET(opts, common_attr_opts, NULL);
> >> + if (common_attr_opts && feat_supported(NULL, FEAT_EXTENDED_SYSCALL)) {
> >> + memset(&common_attr, 0, common_attr_sz);
> >> + common_attr.log_buf = ptr_to_u64(OPTS_GET(common_attr_opts, log_buf, NULL));
> >> + common_attr.log_size = OPTS_GET(common_attr_opts, log_size, 0);
> >> + common_attr.log_level = OPTS_GET(common_attr_opts, log_level, 0);
> >> + fd = sys_bpf_ext_fd(BPF_MAP_CREATE, &attr, attr_sz, &common_attr, common_attr_sz);
> >> + OPTS_SET(common_attr_opts, log_true_size, common_attr.log_true_size);
> >> + } else {
> >> + fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
> >
> > OPTS_SET(log_true_size) to zero here, maybe?
> >
>
> Unnecessary, but ok to do it.
>
> >> + }
> >> return libbpf_err_errno(fd);
> >> }
> >>
> >> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> >> index 2c8e88ddb674..c4a26e6b71ea 100644
> >> --- a/tools/lib/bpf/bpf.h
> >> +++ b/tools/lib/bpf/bpf.h
> >> @@ -37,6 +37,18 @@ extern "C" {
> >>
> >> LIBBPF_API int libbpf_set_memlock_rlim(size_t memlock_bytes);
> >>
> >> +struct bpf_syscall_common_attr_opts {
> >> + size_t sz; /* size of this struct for forward/backward compatibility */
> >> +
> >> + char *log_buf;
> >> + __u32 log_size;
> >> + __u32 log_level;
> >> + __u32 log_true_size;
> >> +
> >> + size_t :0;
> >> +};
> >> +#define bpf_syscall_common_attr_opts__last_field log_true_size
> >
> > see below, let's drop this struct and just add these 4 fields directly
> > to bpf_map_create_opts
> >
> >> +
> >> struct bpf_map_create_opts {
> >> size_t sz; /* size of this struct for forward/backward compatibility */
> >>
> >> @@ -57,9 +69,12 @@ struct bpf_map_create_opts {
> >>
> >> const void *excl_prog_hash;
> >> __u32 excl_prog_hash_size;
> >> +
> >> + struct bpf_syscall_common_attr_opts *common_attr_opts;
> >
> > maybe let's just add those log_xxx fields here directly? This whole
> > extra bpf_syscall_common_attr_opts pointer and struct seems like a
> > cumbersome API.
> >
>
> Oops... This struct was suggested in the v3 discussion [1].
>
> This struct was used to report 'log_true_size' without changing
> 'bpf_map_create()' API.
>
Ah, I already forgot. log_true_size being an output parameter here...
Sigh. I don't like the verboseness of bpf_syscall_common_attr_opts and
"common_attr_opts" and all that stuff...
What if we make it struct bpf_log_opts {} and keep it log-specific?
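To make the output-parameter point concrete, here is a rough sketch of a
log-specific opts struct with a size field and one output field. Everything in
it is a stand-in: struct demo_log_opts, DEMO_HAS() and demo_map_create() are
hypothetical names, the macro is a simplified imitation of the sz-based opts
checks and not libbpf's actual OPTS_* macros, and the function only pretends
that map creation failed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical log-only opts struct; not the API proposed in the patch. */
struct demo_log_opts {
	size_t sz;              /* size of this struct as seen by the caller */
	char *log_buf;
	uint32_t log_size;
	uint32_t log_level;
	uint32_t log_true_size; /* output: bytes the full log would need */
};

/* True if the caller's struct is large enough to contain 'field'. */
#define DEMO_HAS(opts, field) \
	((opts) && (opts)->sz >= offsetof(struct demo_log_opts, field) + \
				 sizeof((opts)->field))

/* Pretend map creation failed and report a (possibly truncated) log. */
static int demo_map_create(struct demo_log_opts *log)
{
	const char *msg = "demo: invalid map flags";

	if (DEMO_HAS(log, log_size) && log->log_buf && log->log_size)
		snprintf(log->log_buf, log->log_size, "%s", msg);

	/* Write the output field only if the caller's struct has it. */
	if (DEMO_HAS(log, log_true_size))
		log->log_true_size = (uint32_t)strlen(msg) + 1;

	return -EINVAL;
}
```

Because log_true_size is written back, the struct has to be passed through a
non-const pointer, which is the reason a separate struct was introduced
instead of plain fields in the const bpf_map_create_opts.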
> Links
> [1]
> https://lore.kernel.org/bpf/CAEf4Bzaw9cboFSf1OXmD84S7pKaeyj=bcQg_diUzGwAkFsjUgg@mail.gmail.com/
>
> Thanks,
> Leon
>
> >> +
> >> size_t :0;
> >> };
> >> -#define bpf_map_create_opts__last_field excl_prog_hash_size
> >> +#define bpf_map_create_opts__last_field common_attr_opts
> >>
> >> LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
> >> const char *map_name,
> >> --
> >> 2.52.0
> >>
>
* Re: [PATCH bpf-next v5 4/9] bpf: Add syscall common attributes support for prog_load
From: Andrii Nakryiko @ 2026-01-16 22:29 UTC
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <36cf80a8-a224-4191-b235-50c2b3dd73f6@linux.dev>
On Fri, Jan 16, 2026 at 6:10 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 2026/1/16 08:54, Andrii Nakryiko wrote:
> > On Mon, Jan 12, 2026 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> The log buffer of common attributes could be confused with the one in
> >> 'union bpf_attr' for BPF_PROG_LOAD.
> >>
> >> To clarify the usage of these two log buffers, both can be used for
> >> logging if:
> >>
> >> * They are the same, including 'log_buf', 'log_level' and 'log_size'.
> >> * One of them is missing, in which case the other one is used for logging.
> >>
> >> If both have 'log_buf' set but they are not identical, return -EUSERS.
> >
> > Why use this special error code, which we don't seem to use in the BPF
> > subsystem at all? What's wrong with -EINVAL? This shouldn't be an easy
> > mistake to make, tbh.
> >
>
> -EUSERS was suggested by Alexei.
>
> However, I agree with you that it is better to use -EINVAL here.
I don't know what the context was; if you can find it, that would be
great. Maybe a special error makes sense for what Alexei had in mind.
>
> >>
> >> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> >> ---
> >> include/linux/bpf_verifier.h | 4 +++-
> >> kernel/bpf/log.c | 29 ++++++++++++++++++++++++++---
> >> kernel/bpf/syscall.c | 9 ++++++---
> >> 3 files changed, 35 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> >> index 4c9632c40059..da2d37ca60e7 100644
> >> --- a/include/linux/bpf_verifier.h
> >> +++ b/include/linux/bpf_verifier.h
> >> @@ -637,9 +637,11 @@ struct bpf_log_attr {
> >> u32 log_level;
> >> struct bpf_attrs *attrs;
> >> u32 offsetof_log_true_size;
> >> + struct bpf_attrs *attrs_common;
> >> };
> >>
> >> -int bpf_prog_load_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs);
> >> +int bpf_prog_load_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs,
> >> + struct bpf_attrs *attrs_common);
> >> int bpf_log_attr_finalize(struct bpf_log_attr *log_attr, struct bpf_verifier_log *log);
> >>
> >> #define BPF_MAX_SUBPROGS 256
> >> diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
> >> index 457b724c4176..eba60a13e244 100644
> >> --- a/kernel/bpf/log.c
> >> +++ b/kernel/bpf/log.c
> >> @@ -865,23 +865,41 @@ void print_insn_state(struct bpf_verifier_env *env, const struct bpf_verifier_st
> >> }
> >>
> >> static int bpf_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs, u64 log_buf,
> >> - u32 log_size, u32 log_level, int offsetof_log_true_size)
> >> + u32 log_size, u32 log_level, int offsetof_log_true_size,
> >> + struct bpf_attrs *attrs_common)
> >> {
> >> + const struct bpf_common_attr *common_attr = attrs_common ? attrs_common->attr : NULL;
> >> +
> >
> > There is something to be said about naming choices here :) it's easy
> > to get lost in attrs_common being actually bpf_attrs, which contains
> > attr field, which is actually of bpf_common_attr type... It's a bit
> > disorienting. :)
> >
>
> I see your point about the naming being confusing.
>
> The original intent of 'struct bpf_attrs' was to provide a shared
> wrapper for both 'union bpf_attr' and 'struct bpf_common_attr'. However,
> I agree that using 'attrs_common' here makes the layering harder to follow.
>
> If that approach is undesirable, how about introducing a dedicated
> structure instead, e.g.:
>
>     struct bpf_common_attrs {
>             const struct bpf_common_attr *attr;
>             bpfptr_t uattr;
>             u32 size;
>     };
>
> This should make the ownership and intent clearer.
I don't know, and it's not that important, as it's pretty contained. But
I'd just try to shorten some names; maybe just "common" for internal
helpers would make sense. common->log_buf seems to work.
>
> Thanks,
> Leon
>
> [...]
>
* Re: [PATCH bpf-next v5 2/9] libbpf: Add support for extended bpf syscall
From: Andrii Nakryiko @ 2026-01-16 22:27 UTC
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <3b0fa14d-a11d-4ed7-8f28-2e99d74f6b46@linux.dev>
On Fri, Jan 16, 2026 at 5:58 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 2026/1/16 08:42, Andrii Nakryiko wrote:
> > On Mon, Jan 12, 2026 at 6:58 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> To support the extended BPF syscall introduced in the previous commit,
> >> introduce the following internal APIs:
> >>
> >> * 'sys_bpf_ext()'
> >> * 'sys_bpf_ext_fd()'
> >> They wrap the raw 'syscall()' interface to support passing extended
> >> attributes.
> >> * 'probe_sys_bpf_ext()'
> >> Check whether current kernel supports the extended attributes.
> >>
> >> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> >> ---
> >> tools/lib/bpf/bpf.c | 34 +++++++++++++++++++++++++++++++++
> >> tools/lib/bpf/features.c | 8 ++++++++
> >> tools/lib/bpf/libbpf_internal.h | 3 +++
> >> 3 files changed, 45 insertions(+)
> >>
> >> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> >> index 21b57a629916..d44e667aaf02 100644
> >> --- a/tools/lib/bpf/bpf.c
> >> +++ b/tools/lib/bpf/bpf.c
> >> @@ -69,6 +69,40 @@ static inline __u64 ptr_to_u64(const void *ptr)
> >> return (__u64) (unsigned long) ptr;
> >> }
> >>
> >> +static inline int sys_bpf_ext(enum bpf_cmd cmd, union bpf_attr *attr,
> >> + unsigned int size,
> >> + struct bpf_common_attr *common_attr,
> >
> > nit: kernel uses consistent attr_common/size_common pattern, but here
> > you are inverting attr_common -> common_attr, let's not?
> >
>
> Ack.
>
> I'll keep the same pattern.
>
> >> + unsigned int size_common)
> >> +{
> >> + cmd = common_attr ? (cmd | BPF_COMMON_ATTRS) : (cmd & ~BPF_COMMON_ATTRS);
> >> + return syscall(__NR_bpf, cmd, attr, size, common_attr, size_common);
> >> +}
> >> +
> >> +static inline int sys_bpf_ext_fd(enum bpf_cmd cmd, union bpf_attr *attr,
> >> + unsigned int size,
> >> + struct bpf_common_attr *common_attr,
> >> + unsigned int size_common)
> >> +{
> >> + int fd;
> >> +
> >> + fd = sys_bpf_ext(cmd, attr, size, common_attr, size_common);
> >> + return ensure_good_fd(fd);
> >> +}
> >> +
> >> +int probe_sys_bpf_ext(void)
> >> +{
> >> + const size_t attr_sz = offsetofend(union bpf_attr, prog_token_fd);
> >> + union bpf_attr attr;
> >> + int fd;
> >> +
> >> + memset(&attr, 0, attr_sz);
> >> + fd = syscall(__NR_bpf, BPF_PROG_LOAD | BPF_COMMON_ATTRS, &attr, attr_sz, NULL,
> >> + sizeof(struct bpf_common_attr));
> >> + if (fd >= 0)
> >> + close(fd);
> >
> > hm... close can change errno, this is fragile. If fd >= 0, something
> > is wrong with our detection, just return error right away?
> >
>
> How about capturing errno before closing?
>
>     err = errno;
>     if (fd >= 0)
>             close(fd);
>     return err == EFAULT;
not sure what this code is trying to do, but yes, preserving errno is
one way to fix an immediate problem.
But fd should really not be >= 0, and if it is -- it's some problem,
so I'd return an error in that case to keep us aware, which is why I'm
saying I'd just return inside if (fd >= 0) { }
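Roughly what I mean, as a standalone sketch. The demo_probe_*() functions are
made-up stand-ins for the real probing syscall so this compiles on its own,
and returning -EINVAL for the unexpected-fd case is just one possible choice:

```c
#include <errno.h>
#include <unistd.h>

/* Stand-in: kernel understood the extended call, faulted on NULL. */
static int demo_probe_efault(void)
{
	errno = EFAULT;
	return -1;
}

/* Stand-in: kernel rejected the extended command outright. */
static int demo_probe_einval(void)
{
	errno = EINVAL;
	return -1;
}

/* Stand-in: the probe unexpectedly produced a valid fd. */
static int demo_probe_unexpected_fd(void)
{
	int p[2];

	if (pipe(p) < 0)
		return -1;
	close(p[1]);
	return p[0];
}

/*
 * Returns 1 if supported, 0 if not, negative error if the probe
 * itself misbehaved. errno is only read on the failure path, before
 * any other call can overwrite it.
 */
static int demo_probe_supported(int (*probe)(void))
{
	int fd = probe();

	if (fd >= 0) {
		/* Should never happen; surface it instead of guessing. */
		close(fd);
		return -EINVAL;
	}
	return errno == EFAULT;
}
```

This way close() is only called on the path where errno is not consulted at
all, so there is nothing to save and restore.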
>
> Then, we can wrap all details in probe_sys_bpf_ext().
>
> >> + return errno == EFAULT;
> >> +}
> >> +
[...]
* Re: [PATCH bpf-next v5 8/9] libbpf: Add common attr support for map_create
From: Leon Hwang @ 2026-01-16 14:17 UTC
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <CAEf4BzarSrW1aTRcjrheLWqxFCh1FFd7vwJ4OQup1dbT13EapQ@mail.gmail.com>
On 2026/1/16 09:03, Andrii Nakryiko wrote:
> On Mon, Jan 12, 2026 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> With the previous commit adding common attribute support for
>> BPF_MAP_CREATE, users can now retrieve detailed error messages when map
>> creation fails via the log_buf field.
>>
>> Introduce struct bpf_syscall_common_attr_opts with the following fields:
>> log_buf, log_size, log_level, and log_true_size.
>>
>> Extend bpf_map_create_opts with a new field common_attr_opts, allowing
>> users to capture and inspect log messages on map creation failures.
>>
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> tools/lib/bpf/bpf.c | 15 ++++++++++++++-
>> tools/lib/bpf/bpf.h | 17 ++++++++++++++++-
>> 2 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index d44e667aaf02..d65df1b7b2be 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -207,6 +207,9 @@ int bpf_map_create(enum bpf_map_type map_type,
>> const struct bpf_map_create_opts *opts)
>> {
>> const size_t attr_sz = offsetofend(union bpf_attr, excl_prog_hash_size);
>> + const size_t common_attr_sz = sizeof(struct bpf_common_attr);
>> + struct bpf_syscall_common_attr_opts *common_attr_opts;
>> + struct bpf_common_attr common_attr;
>> union bpf_attr attr;
>> int fd;
>>
>> @@ -240,7 +243,17 @@ int bpf_map_create(enum bpf_map_type map_type,
>> attr.excl_prog_hash = ptr_to_u64(OPTS_GET(opts, excl_prog_hash, NULL));
>> attr.excl_prog_hash_size = OPTS_GET(opts, excl_prog_hash_size, 0);
>>
>> - fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
>> + common_attr_opts = OPTS_GET(opts, common_attr_opts, NULL);
>> + if (common_attr_opts && feat_supported(NULL, FEAT_EXTENDED_SYSCALL)) {
>> + memset(&common_attr, 0, common_attr_sz);
>> + common_attr.log_buf = ptr_to_u64(OPTS_GET(common_attr_opts, log_buf, NULL));
>> + common_attr.log_size = OPTS_GET(common_attr_opts, log_size, 0);
>> + common_attr.log_level = OPTS_GET(common_attr_opts, log_level, 0);
>> + fd = sys_bpf_ext_fd(BPF_MAP_CREATE, &attr, attr_sz, &common_attr, common_attr_sz);
>> + OPTS_SET(common_attr_opts, log_true_size, common_attr.log_true_size);
>> + } else {
>> + fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
>
> OPTS_SET(log_true_size) to zero here, maybe?
>
Unnecessary, but ok to do it.
>> + }
>> return libbpf_err_errno(fd);
>> }
>>
>> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
>> index 2c8e88ddb674..c4a26e6b71ea 100644
>> --- a/tools/lib/bpf/bpf.h
>> +++ b/tools/lib/bpf/bpf.h
>> @@ -37,6 +37,18 @@ extern "C" {
>>
>> LIBBPF_API int libbpf_set_memlock_rlim(size_t memlock_bytes);
>>
>> +struct bpf_syscall_common_attr_opts {
>> + size_t sz; /* size of this struct for forward/backward compatibility */
>> +
>> + char *log_buf;
>> + __u32 log_size;
>> + __u32 log_level;
>> + __u32 log_true_size;
>> +
>> + size_t :0;
>> +};
>> +#define bpf_syscall_common_attr_opts__last_field log_true_size
>
> see below, let's drop this struct and just add these 4 fields directly
> to bpf_map_create_opts
>
>> +
>> struct bpf_map_create_opts {
>> size_t sz; /* size of this struct for forward/backward compatibility */
>>
>> @@ -57,9 +69,12 @@ struct bpf_map_create_opts {
>>
>> const void *excl_prog_hash;
>> __u32 excl_prog_hash_size;
>> +
>> + struct bpf_syscall_common_attr_opts *common_attr_opts;
>
> maybe let's just add those log_xxx fields here directly? This whole
> extra bpf_syscall_common_attr_opts pointer and struct seems like a
> cumbersome API.
>
Oops... This struct was suggested in the v3 discussion [1].
This struct was used to report 'log_true_size' without changing
'bpf_map_create()' API.
Links
[1]
https://lore.kernel.org/bpf/CAEf4Bzaw9cboFSf1OXmD84S7pKaeyj=bcQg_diUzGwAkFsjUgg@mail.gmail.com/
Thanks,
Leon
>> +
>> size_t :0;
>> };
>> -#define bpf_map_create_opts__last_field excl_prog_hash_size
>> +#define bpf_map_create_opts__last_field common_attr_opts
>>
>> LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
>> const char *map_name,
>> --
>> 2.52.0
>>
* Re: [PATCH bpf-next v5 4/9] bpf: Add syscall common attributes support for prog_load
From: Leon Hwang @ 2026-01-16 14:10 UTC
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <CAEf4BzZbcA2T8+OR1_68sxq9Chukmh8beyz+018O22U=SsafrA@mail.gmail.com>
On 2026/1/16 08:54, Andrii Nakryiko wrote:
> On Mon, Jan 12, 2026 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> The log buffer of common attributes could be confused with the one in
>> 'union bpf_attr' for BPF_PROG_LOAD.
>>
>> To clarify the usage of these two log buffers, both can be used for
>> logging if:
>>
>> * They are the same, including 'log_buf', 'log_level' and 'log_size'.
>> * One of them is missing, in which case the other one is used for logging.
>>
>> If both have 'log_buf' set but they are not identical, return -EUSERS.
>
> Why use this special error code, which we don't seem to use in the BPF
> subsystem at all? What's wrong with -EINVAL? This shouldn't be an easy
> mistake to make, tbh.
>
-EUSERS was suggested by Alexei.
However, I agree with you that it is better to use -EINVAL here.
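For reference, the rule from the commit message with -EINVAL in place of
-EUSERS could be sketched in plain C as follows. This is an illustration, not
the kernel code: struct demo_log and demo_pick_log() are hypothetical names,
and "missing" is assumed to mean that log_buf is zero.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical triple describing one user-supplied log buffer. */
struct demo_log {
	uint64_t buf;
	uint32_t size;
	uint32_t level;
};

/*
 * Pick the log buffer to use from the one in 'union bpf_attr' (attr)
 * and the one in the common attributes (common). Either may be NULL.
 * Returns 0 and fills *out, or -EINVAL if both are set but differ.
 */
static int demo_pick_log(const struct demo_log *attr,
			 const struct demo_log *common,
			 struct demo_log *out)
{
	int have_attr = attr && attr->buf;
	int have_common = common && common->buf;

	if (have_attr && have_common) {
		if (attr->buf != common->buf ||
		    attr->size != common->size ||
		    attr->level != common->level)
			return -EINVAL;
		*out = *attr;
		return 0;
	}

	if (have_attr)
		*out = *attr;
	else if (have_common)
		*out = *common;
	else
		*out = (struct demo_log){ 0 };	/* no logging requested */
	return 0;
}
```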
>>
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> include/linux/bpf_verifier.h | 4 +++-
>> kernel/bpf/log.c | 29 ++++++++++++++++++++++++++---
>> kernel/bpf/syscall.c | 9 ++++++---
>> 3 files changed, 35 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 4c9632c40059..da2d37ca60e7 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -637,9 +637,11 @@ struct bpf_log_attr {
>> u32 log_level;
>> struct bpf_attrs *attrs;
>> u32 offsetof_log_true_size;
>> + struct bpf_attrs *attrs_common;
>> };
>>
>> -int bpf_prog_load_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs);
>> +int bpf_prog_load_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs,
>> + struct bpf_attrs *attrs_common);
>> int bpf_log_attr_finalize(struct bpf_log_attr *log_attr, struct bpf_verifier_log *log);
>>
>> #define BPF_MAX_SUBPROGS 256
>> diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
>> index 457b724c4176..eba60a13e244 100644
>> --- a/kernel/bpf/log.c
>> +++ b/kernel/bpf/log.c
>> @@ -865,23 +865,41 @@ void print_insn_state(struct bpf_verifier_env *env, const struct bpf_verifier_st
>> }
>>
>> static int bpf_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs, u64 log_buf,
>> - u32 log_size, u32 log_level, int offsetof_log_true_size)
>> + u32 log_size, u32 log_level, int offsetof_log_true_size,
>> + struct bpf_attrs *attrs_common)
>> {
>> + const struct bpf_common_attr *common_attr = attrs_common ? attrs_common->attr : NULL;
>> +
>
> There is something to be said about naming choices here :) it's easy
> to get lost in attrs_common being actually bpf_attrs, which contains
> attr field, which is actually of bpf_common_attr type... It's a bit
> disorienting. :)
>
I see your point about the naming being confusing.
The original intent of 'struct bpf_attrs' was to provide a shared
wrapper for both 'union bpf_attr' and 'struct bpf_common_attr'. However,
I agree that using 'attrs_common' here makes the layering harder to follow.
If that approach is undesirable, how about introducing a dedicated
structure instead, e.g.:
    struct bpf_common_attrs {
            const struct bpf_common_attr *attr;
            bpfptr_t uattr;
            u32 size;
    };
This should make the ownership and intent clearer.
Thanks,
Leon
[...]
* Re: [PATCH bpf-next v5 2/9] libbpf: Add support for extended bpf syscall
From: Leon Hwang @ 2026-01-16 13:57 UTC
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <CAEf4BzYRC+=J05C6QDwgzbJ7gO7gZD4xcEcj9ixCaJ=xaRuSsQ@mail.gmail.com>
On 2026/1/16 08:42, Andrii Nakryiko wrote:
> On Mon, Jan 12, 2026 at 6:58 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> To support the extended BPF syscall introduced in the previous commit,
>> introduce the following internal APIs:
>>
>> * 'sys_bpf_ext()'
>> * 'sys_bpf_ext_fd()'
>> They wrap the raw 'syscall()' interface to support passing extended
>> attributes.
>> * 'probe_sys_bpf_ext()'
>> Check whether current kernel supports the extended attributes.
>>
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> tools/lib/bpf/bpf.c | 34 +++++++++++++++++++++++++++++++++
>> tools/lib/bpf/features.c | 8 ++++++++
>> tools/lib/bpf/libbpf_internal.h | 3 +++
>> 3 files changed, 45 insertions(+)
>>
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 21b57a629916..d44e667aaf02 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
>> @@ -69,6 +69,40 @@ static inline __u64 ptr_to_u64(const void *ptr)
>> return (__u64) (unsigned long) ptr;
>> }
>>
>> +static inline int sys_bpf_ext(enum bpf_cmd cmd, union bpf_attr *attr,
>> + unsigned int size,
>> + struct bpf_common_attr *common_attr,
>
> nit: kernel uses consistent attr_common/size_common pattern, but here
> you are inverting attr_common -> common_attr, let's not?
>
Ack.
I'll keep the same pattern.
>> + unsigned int size_common)
>> +{
>> + cmd = common_attr ? (cmd | BPF_COMMON_ATTRS) : (cmd & ~BPF_COMMON_ATTRS);
>> + return syscall(__NR_bpf, cmd, attr, size, common_attr, size_common);
>> +}
>> +
>> +static inline int sys_bpf_ext_fd(enum bpf_cmd cmd, union bpf_attr *attr,
>> + unsigned int size,
>> + struct bpf_common_attr *common_attr,
>> + unsigned int size_common)
>> +{
>> + int fd;
>> +
>> + fd = sys_bpf_ext(cmd, attr, size, common_attr, size_common);
>> + return ensure_good_fd(fd);
>> +}
>> +
>> +int probe_sys_bpf_ext(void)
>> +{
>> + const size_t attr_sz = offsetofend(union bpf_attr, prog_token_fd);
>> + union bpf_attr attr;
>> + int fd;
>> +
>> + memset(&attr, 0, attr_sz);
>> + fd = syscall(__NR_bpf, BPF_PROG_LOAD | BPF_COMMON_ATTRS, &attr, attr_sz, NULL,
>> + sizeof(struct bpf_common_attr));
>> + if (fd >= 0)
>> + close(fd);
>
> hm... close can change errno, this is fragile. If fd >= 0, something
> is wrong with our detection, just return error right away?
>
How about capturing errno before closing?
	err = errno;
	if (fd >= 0)
		close(fd);
	return err == EFAULT;
Then, we can wrap all details in probe_sys_bpf_ext().
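The errno handling discussed above can be sketched in userspace as follows. This is a minimal illustration, not the libbpf code: open() stands in for the bpf() syscall, and the helper name open_fails_with() is hypothetical. It saves errno right after the call under test and, as suggested in the review, treats an unexpectedly valid fd as "did not fail as expected".

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: report whether opening `path` fails with `expected`.
 * errno is copied immediately after the call under test, because a later
 * close() is allowed to overwrite it.
 */
static bool open_fails_with(const char *path, int expected)
{
	int fd, err;

	assert(path != NULL);

	errno = 0;
	fd = open(path, O_RDONLY);
	err = errno;		/* snapshot before any other libc call */

	if (fd >= 0) {
		close(fd);	/* may clobber errno; `err` is unaffected */
		return false;	/* the call succeeded, so it did not fail */
	}
	return err == expected;
}
```

The same shape should carry over to a probe that expects a specific error code from a syscall: compare against the saved value, never against errno after cleanup.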
>> + return errno == EFAULT;
>> +}
>> +
>> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>> unsigned int size)
>> {
>> diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
>> index b842b83e2480..d786a815f1ae 100644
>> --- a/tools/lib/bpf/features.c
>> +++ b/tools/lib/bpf/features.c
>> @@ -506,6 +506,11 @@ static int probe_kern_arg_ctx_tag(int token_fd)
>> return probe_fd(prog_fd);
>> }
>>
>> +static int probe_kern_extended_syscall(int token_fd)
>> +{
>> + return probe_sys_bpf_ext();
>> +}
>> +
>> typedef int (*feature_probe_fn)(int /* token_fd */);
>>
>> static struct kern_feature_cache feature_cache;
>> @@ -581,6 +586,9 @@ static struct kern_feature_desc {
>> [FEAT_BTF_QMARK_DATASEC] = {
>> "BTF DATASEC names starting from '?'", probe_kern_btf_qmark_datasec,
>> },
>> + [FEAT_EXTENDED_SYSCALL] = {
>> + "Kernel supports extended syscall", probe_kern_extended_syscall,
>
> "extended syscall" is a bit vague... We specifically detect common
> attrs support, maybe say that?
>
Ack.
I'll update it to "BPF syscall common attributes support."
>> + },
>> };
>>
>> bool feat_supported(struct kern_feature_cache *cache, enum kern_feature_id feat_id)
>> diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
>> index fc59b21b51b5..e2a6ef4b45ae 100644
>> --- a/tools/lib/bpf/libbpf_internal.h
>> +++ b/tools/lib/bpf/libbpf_internal.h
>> @@ -392,6 +392,8 @@ enum kern_feature_id {
>> FEAT_ARG_CTX_TAG,
>> /* Kernel supports '?' at the front of datasec names */
>> FEAT_BTF_QMARK_DATASEC,
>> + /* Kernel supports extended syscall */
>> + FEAT_EXTENDED_SYSCALL,
>
> FEAT_BPF_COMMON_ATTRS ?
>
FEAT_BPF_SYSCALL_COMMON_ATTRS seems more accurate.
Thanks,
Leon
>> __FEAT_CNT,
>> };
>>
>> @@ -757,4 +759,5 @@ int probe_fd(int fd);
>> #define SHA256_DWORD_SIZE SHA256_DIGEST_LENGTH / sizeof(__u64)
>>
>> void libbpf_sha256(const void *data, size_t len, __u8 out[SHA256_DIGEST_LENGTH]);
>> +int probe_sys_bpf_ext(void);
>> #endif /* __LIBBPF_LIBBPF_INTERNAL_H */
>> --
>> 2.52.0
>>
* Re: O_CLOEXEC use for OPEN_TREE_CLOEXEC
From: Christian Brauner @ 2026-01-16 10:00 UTC (permalink / raw)
To: Florian Weimer
Cc: linux-fsdevel, linux-api, linux-kernel, Al Viro, David Howells,
DJ Delorie
In-Reply-To: <lhuwm1ji7bl.fsf@oldenburg.str.redhat.com>
On Thu, Jan 15, 2026 at 09:55:10AM +0100, Florian Weimer wrote:
> * Christian Brauner:
>
> > On Tue, Jan 13, 2026 at 11:40:55PM +0100, Florian Weimer wrote:
> >> In <linux/mount.h>, we have this:
> >>
> >> #define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
> >>
> >> This causes a few pain points for us on the glibc side when we mirror
> >> this into <linux/mount.h> because O_CLOEXEC is defined in <fcntl.h>,
> >> which is one of the headers that's completely incompatible with the UAPI
> >> headers.
> >>
> >> The reason why this is painful is because O_CLOEXEC has at least three
> >> different values across architectures: 0x80000, 0x200000, 0x400000
> >>
> >> Even for the UAPI this isn't ideal because it effectively burns three
> >> open_tree flags, unless the flags are made architecture-specific, too.
> >
> > I think that just got cargo-culted... A long time ago some API defined it as
> > O_CLOEXEC and now a lot of APIs have done the same.
>
> Yes, it looks like inotify is in the same boat.
It's unfortunately not just inotify...:
include/linux/net.h:#define SOCK_CLOEXEC O_CLOEXEC
include/uapi/drm/drm.h:#define DRM_CLOEXEC O_CLOEXEC
include/uapi/linux/eventfd.h:#define EFD_CLOEXEC O_CLOEXEC
include/uapi/linux/eventpoll.h:#define EPOLL_CLOEXEC O_CLOEXEC
include/uapi/linux/inotify.h:#define IN_CLOEXEC O_CLOEXEC
include/uapi/linux/signalfd.h:#define SFD_CLOEXEC O_CLOEXEC
include/uapi/linux/timerfd.h:#define TFD_CLOEXEC O_CLOEXEC
>
> > I'm pretty sure we can't change that now but we can document that this
> > shouldn't be ifdefed and instead be a separate per-syscall bit. But I
> > think that's the best we can do right now.
>
> Maybe add something like this as a safety measure, to ensure that the
> flags don't overlap?
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c58674a20cad..5bbfd379ec44 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -3069,6 +3069,9 @@ static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned
> bool detached = flags & OPEN_TREE_CLONE;
>
> BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
> + BUILD_BUG_ON(O_CLOEXEC & OPEN_TREE_CLONE);
> + BUILD_BUG_ON((AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE | AT_SYMLINK_NOFOLLOW) &
> + (O_CLOEXEC | OPEN_TREE_CLONE));
Yeah, we can do something like that!
* Re: [PATCH bpf-next v5 8/9] libbpf: Add common attr support for map_create
From: Andrii Nakryiko @ 2026-01-16 1:03 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <20260112145616.44195-9-leon.hwang@linux.dev>
On Mon, Jan 12, 2026 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> With the previous commit adding common attribute support for
> BPF_MAP_CREATE, users can now retrieve detailed error messages when map
> creation fails via the log_buf field.
>
> Introduce struct bpf_syscall_common_attr_opts with the following fields:
> log_buf, log_size, log_level, and log_true_size.
>
> Extend bpf_map_create_opts with a new field common_attr_opts, allowing
> users to capture and inspect log messages on map creation failures.
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> tools/lib/bpf/bpf.c | 15 ++++++++++++++-
> tools/lib/bpf/bpf.h | 17 ++++++++++++++++-
> 2 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index d44e667aaf02..d65df1b7b2be 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -207,6 +207,9 @@ int bpf_map_create(enum bpf_map_type map_type,
> const struct bpf_map_create_opts *opts)
> {
> const size_t attr_sz = offsetofend(union bpf_attr, excl_prog_hash_size);
> + const size_t common_attr_sz = sizeof(struct bpf_common_attr);
> + struct bpf_syscall_common_attr_opts *common_attr_opts;
> + struct bpf_common_attr common_attr;
> union bpf_attr attr;
> int fd;
>
> @@ -240,7 +243,17 @@ int bpf_map_create(enum bpf_map_type map_type,
> attr.excl_prog_hash = ptr_to_u64(OPTS_GET(opts, excl_prog_hash, NULL));
> attr.excl_prog_hash_size = OPTS_GET(opts, excl_prog_hash_size, 0);
>
> - fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
> + common_attr_opts = OPTS_GET(opts, common_attr_opts, NULL);
> + if (common_attr_opts && feat_supported(NULL, FEAT_EXTENDED_SYSCALL)) {
> + memset(&common_attr, 0, common_attr_sz);
> + common_attr.log_buf = ptr_to_u64(OPTS_GET(common_attr_opts, log_buf, NULL));
> + common_attr.log_size = OPTS_GET(common_attr_opts, log_size, 0);
> + common_attr.log_level = OPTS_GET(common_attr_opts, log_level, 0);
> + fd = sys_bpf_ext_fd(BPF_MAP_CREATE, &attr, attr_sz, &common_attr, common_attr_sz);
> + OPTS_SET(common_attr_opts, log_true_size, common_attr.log_true_size);
> + } else {
> + fd = sys_bpf_fd(BPF_MAP_CREATE, &attr, attr_sz);
OPTS_SET(log_true_size) to zero here, maybe?
> + }
> return libbpf_err_errno(fd);
> }
>
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 2c8e88ddb674..c4a26e6b71ea 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -37,6 +37,18 @@ extern "C" {
>
> LIBBPF_API int libbpf_set_memlock_rlim(size_t memlock_bytes);
>
> +struct bpf_syscall_common_attr_opts {
> + size_t sz; /* size of this struct for forward/backward compatibility */
> +
> + char *log_buf;
> + __u32 log_size;
> + __u32 log_level;
> + __u32 log_true_size;
> +
> + size_t :0;
> +};
> +#define bpf_syscall_common_attr_opts__last_field log_true_size
see below, let's drop this struct and just add these 4 fields directly
to bpf_map_create_opts
> +
> struct bpf_map_create_opts {
> size_t sz; /* size of this struct for forward/backward compatibility */
>
> @@ -57,9 +69,12 @@ struct bpf_map_create_opts {
>
> const void *excl_prog_hash;
> __u32 excl_prog_hash_size;
> +
> + struct bpf_syscall_common_attr_opts *common_attr_opts;
maybe let's just add those log_xxx fields here directly? This whole
extra bpf_syscall_common_attr_opts pointer and struct seems like a
cumbersome API.
> +
> size_t :0;
> };
> -#define bpf_map_create_opts__last_field excl_prog_hash_size
> +#define bpf_map_create_opts__last_field common_attr_opts
>
> LIBBPF_API int bpf_map_create(enum bpf_map_type map_type,
> const char *map_name,
> --
> 2.52.0
>
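The suggestion to put the log fields directly into the options struct relies on the "sz" convention noted in the struct comments: the callee reads a field only if the caller's struct is large enough to contain it. Below is a minimal sketch of that convention. All names (demo_create_opts, DEMO_OPTS_HAS, DEMO_OPTS_GET, demo_log_size) are hypothetical, and libbpf's real OPTS_* macros may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical options struct with the log fields added directly. */
struct demo_create_opts {
	size_t sz;		/* size of this struct as seen by the caller */
	uint32_t map_flags;
	char *log_buf;
	uint32_t log_size;
	uint32_t log_level;
	uint32_t log_true_size;	/* output: filled in by the callee */
};

/* Offset of the first byte past `field` inside `type`. */
#define DEMO_OFFSETOFEND(type, field) \
	(offsetof(type, field) + sizeof(((type *)0)->field))

/* The last field must lie entirely inside the struct. */
static_assert(DEMO_OFFSETOFEND(struct demo_create_opts, log_true_size) <=
	      sizeof(struct demo_create_opts), "last field out of bounds");

/* True if the caller passed a struct that is new enough to hold `field`. */
#define DEMO_OPTS_HAS(opts, field) \
	((opts) && (opts)->sz >= DEMO_OFFSETOFEND(struct demo_create_opts, field))

/* Read `field`, or use `fallback` when the caller's struct is too small. */
#define DEMO_OPTS_GET(opts, field, fallback) \
	(DEMO_OPTS_HAS(opts, field) ? (opts)->field : (fallback))

static uint32_t demo_log_size(const struct demo_create_opts *opts)
{
	return DEMO_OPTS_GET(opts, log_size, 0);
}
```

With this pattern, appending fields at the end keeps older callers working, since their smaller "sz" makes the new fields read as the fallback value.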
* Re: [PATCH bpf-next v5 4/9] bpf: Add syscall common attributes support for prog_load
From: Andrii Nakryiko @ 2026-01-16 0:54 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <20260112145616.44195-5-leon.hwang@linux.dev>
On Mon, Jan 12, 2026 at 6:59 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> The log buffer in the common attributes could be confused with the one in
> 'union bpf_attr' for BPF_PROG_LOAD.
>
> To clarify how these two log buffers interact, both can be used for
> logging if:
>
> * They are the same, including 'log_buf', 'log_level' and 'log_size'.
> * One of them is missing, in which case the other one is used for logging.
>
> If both have 'log_buf' but they are not identical, return -EUSERS.
why use this special error code that we don't seem to use in BPF
subsystem at all? What's wrong with -EINVAL? This shouldn't be an easy
mistake to make, tbh.
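The rule from the commit message can be sketched as a small userspace function. This is only an illustration under stated assumptions: struct log_params and pick_log_params() are hypothetical names, not kernel APIs, and the conflict case returns -EINVAL as proposed in this review rather than the -EUSERS used by the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one (log_buf, log_size, log_level) triple. */
struct log_params {
	uint64_t buf;	/* user pointer as an integer; 0 means "not provided" */
	uint32_t size;
	uint32_t level;
};

/*
 * Choose the effective log parameters from the per-command attributes
 * (`cmd`) and the optional common attributes (`common`).
 * Returns 0 on success, -EINVAL when both are set but not identical.
 */
static int pick_log_params(const struct log_params *cmd,
			   const struct log_params *common,
			   struct log_params *out)
{
	assert(cmd != NULL && out != NULL);

	*out = *cmd;
	if (!common || !common->buf)
		return 0;		/* only the per-command buffer, or none */

	if (!cmd->buf) {
		*out = *common;		/* only the common buffer was given */
		return 0;
	}

	if (cmd->buf != common->buf || cmd->size != common->size ||
	    cmd->level != common->level)
		return -EINVAL;		/* both given, but they disagree */

	return 0;			/* both given and identical */
}
```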
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> include/linux/bpf_verifier.h | 4 +++-
> kernel/bpf/log.c | 29 ++++++++++++++++++++++++++---
> kernel/bpf/syscall.c | 9 ++++++---
> 3 files changed, 35 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 4c9632c40059..da2d37ca60e7 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -637,9 +637,11 @@ struct bpf_log_attr {
> u32 log_level;
> struct bpf_attrs *attrs;
> u32 offsetof_log_true_size;
> + struct bpf_attrs *attrs_common;
> };
>
> -int bpf_prog_load_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs);
> +int bpf_prog_load_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs,
> + struct bpf_attrs *attrs_common);
> int bpf_log_attr_finalize(struct bpf_log_attr *log_attr, struct bpf_verifier_log *log);
>
> #define BPF_MAX_SUBPROGS 256
> diff --git a/kernel/bpf/log.c b/kernel/bpf/log.c
> index 457b724c4176..eba60a13e244 100644
> --- a/kernel/bpf/log.c
> +++ b/kernel/bpf/log.c
> @@ -865,23 +865,41 @@ void print_insn_state(struct bpf_verifier_env *env, const struct bpf_verifier_st
> }
>
> static int bpf_log_attr_init(struct bpf_log_attr *log_attr, struct bpf_attrs *attrs, u64 log_buf,
> - u32 log_size, u32 log_level, int offsetof_log_true_size)
> + u32 log_size, u32 log_level, int offsetof_log_true_size,
> + struct bpf_attrs *attrs_common)
> {
> + const struct bpf_common_attr *common_attr = attrs_common ? attrs_common->attr : NULL;
> +
There is something to be said about naming choices here :) it's easy
to get lost in attrs_common being actually bpf_attrs, which contains
attr field, which is actually of bpf_common_attr type... It's a bit
disorienting. :)
> memset(log_attr, 0, sizeof(*log_attr));
> log_attr->log_buf = log_buf;
> log_attr->log_size = log_size;
> log_attr->log_level = log_level;
> log_attr->attrs = attrs;
> log_attr->offsetof_log_true_size = offsetof_log_true_size;
> + log_attr->attrs_common = attrs_common;
> +
> + if (log_buf && common_attr && common_attr->log_buf &&
> + (log_buf != common_attr->log_buf ||
> + log_size != common_attr->log_size ||
> + log_level != common_attr->log_level))
> + return -EUSERS;
> +
> + if (!log_buf && common_attr && common_attr->log_buf) {
> + log_attr->log_buf = common_attr->log_buf;
> + log_attr->log_size = common_attr->log_size;
> + log_attr->log_level = common_attr->log_level;
> + }
> +
> return 0;
> }
>
[...]
* Re: [PATCH bpf-next v5 2/9] libbpf: Add support for extended bpf syscall
From: Andrii Nakryiko @ 2026-01-16 0:42 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Willem de Bruijn, Jason Xing, Tao Chen,
Mykyta Yatsenko, Kumar Kartikeya Dwivedi, Anton Protopopov,
Amery Hung, Rong Tao, linux-kernel, linux-api, linux-kselftest,
kernel-patches-bot
In-Reply-To: <20260112145616.44195-3-leon.hwang@linux.dev>
On Mon, Jan 12, 2026 at 6:58 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> To support the extended BPF syscall introduced in the previous commit,
> introduce the following internal APIs:
>
> * 'sys_bpf_ext()'
> * 'sys_bpf_ext_fd()'
> They wrap the raw 'syscall()' interface to support passing extended
> attributes.
> * 'probe_sys_bpf_ext()'
> Check whether current kernel supports the extended attributes.
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> tools/lib/bpf/bpf.c | 34 +++++++++++++++++++++++++++++++++
> tools/lib/bpf/features.c | 8 ++++++++
> tools/lib/bpf/libbpf_internal.h | 3 +++
> 3 files changed, 45 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 21b57a629916..d44e667aaf02 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -69,6 +69,40 @@ static inline __u64 ptr_to_u64(const void *ptr)
> return (__u64) (unsigned long) ptr;
> }
>
> +static inline int sys_bpf_ext(enum bpf_cmd cmd, union bpf_attr *attr,
> + unsigned int size,
> + struct bpf_common_attr *common_attr,
nit: kernel uses consistent attr_common/size_common pattern, but here
you are inverting attr_common -> common_attr, let's not?
> + unsigned int size_common)
> +{
> + cmd = common_attr ? (cmd | BPF_COMMON_ATTRS) : (cmd & ~BPF_COMMON_ATTRS);
> + return syscall(__NR_bpf, cmd, attr, size, common_attr, size_common);
> +}
> +
> +static inline int sys_bpf_ext_fd(enum bpf_cmd cmd, union bpf_attr *attr,
> + unsigned int size,
> + struct bpf_common_attr *common_attr,
> + unsigned int size_common)
> +{
> + int fd;
> +
> + fd = sys_bpf_ext(cmd, attr, size, common_attr, size_common);
> + return ensure_good_fd(fd);
> +}
> +
> +int probe_sys_bpf_ext(void)
> +{
> + const size_t attr_sz = offsetofend(union bpf_attr, prog_token_fd);
> + union bpf_attr attr;
> + int fd;
> +
> + memset(&attr, 0, attr_sz);
> + fd = syscall(__NR_bpf, BPF_PROG_LOAD | BPF_COMMON_ATTRS, &attr, attr_sz, NULL,
> + sizeof(struct bpf_common_attr));
> + if (fd >= 0)
> + close(fd);
hm... close can change errno, this is fragile. If fd >= 0, something
is wrong with our detection, just return error right away?
> + return errno == EFAULT;
> +}
> +
> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
> unsigned int size)
> {
> diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> index b842b83e2480..d786a815f1ae 100644
> --- a/tools/lib/bpf/features.c
> +++ b/tools/lib/bpf/features.c
> @@ -506,6 +506,11 @@ static int probe_kern_arg_ctx_tag(int token_fd)
> return probe_fd(prog_fd);
> }
>
> +static int probe_kern_extended_syscall(int token_fd)
> +{
> + return probe_sys_bpf_ext();
> +}
> +
> typedef int (*feature_probe_fn)(int /* token_fd */);
>
> static struct kern_feature_cache feature_cache;
> @@ -581,6 +586,9 @@ static struct kern_feature_desc {
> [FEAT_BTF_QMARK_DATASEC] = {
> "BTF DATASEC names starting from '?'", probe_kern_btf_qmark_datasec,
> },
> + [FEAT_EXTENDED_SYSCALL] = {
> + "Kernel supports extended syscall", probe_kern_extended_syscall,
"extended syscall" is a bit vague... We specifically detect common
attrs support, maybe say that?
> + },
> };
>
> bool feat_supported(struct kern_feature_cache *cache, enum kern_feature_id feat_id)
> diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
> index fc59b21b51b5..e2a6ef4b45ae 100644
> --- a/tools/lib/bpf/libbpf_internal.h
> +++ b/tools/lib/bpf/libbpf_internal.h
> @@ -392,6 +392,8 @@ enum kern_feature_id {
> FEAT_ARG_CTX_TAG,
> /* Kernel supports '?' at the front of datasec names */
> FEAT_BTF_QMARK_DATASEC,
> + /* Kernel supports extended syscall */
> + FEAT_EXTENDED_SYSCALL,
FEAT_BPF_COMMON_ATTRS ?
> __FEAT_CNT,
> };
>
> @@ -757,4 +759,5 @@ int probe_fd(int fd);
> #define SHA256_DWORD_SIZE SHA256_DIGEST_LENGTH / sizeof(__u64)
>
> void libbpf_sha256(const void *data, size_t len, __u8 out[SHA256_DIGEST_LENGTH]);
> +int probe_sys_bpf_ext(void);
> #endif /* __LIBBPF_LIBBPF_INTERNAL_H */
> --
> 2.52.0
>
* Re: O_CLOEXEC use for OPEN_TREE_CLOEXEC
From: Florian Weimer @ 2026-01-15 8:55 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, linux-api, linux-kernel, Al Viro, David Howells,
DJ Delorie
In-Reply-To: <20260114-alias-riefen-2cb8c09d0ded@brauner>
* Christian Brauner:
> On Tue, Jan 13, 2026 at 11:40:55PM +0100, Florian Weimer wrote:
>> In <linux/mount.h>, we have this:
>>
>> #define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
>>
>> This causes a few pain points for us on the glibc side when we mirror
>> this into <linux/mount.h> because O_CLOEXEC is defined in <fcntl.h>,
>> which is one of the headers that's completely incompatible with the UAPI
>> headers.
>>
>> The reason why this is painful is because O_CLOEXEC has at least three
>> different values across architectures: 0x80000, 0x200000, 0x400000
>>
>> Even for the UAPI this isn't ideal because it effectively burns three
>> open_tree flags, unless the flags are made architecture-specific, too.
>
> I think that just got cargo-culted... A long time ago some API defined it as
> O_CLOEXEC and now a lot of APIs have done the same.
Yes, it looks like inotify is in the same boat.
> I'm pretty sure we can't change that now but we can document that this
> shouldn't be ifdefed and instead be a separate per-syscall bit. But I
> think that's the best we can do right now.
Maybe add something like this as a safety measure, to ensure that the
flags don't overlap?
diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..5bbfd379ec44 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3069,6 +3069,9 @@ static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned
bool detached = flags & OPEN_TREE_CLONE;
BUILD_BUG_ON(OPEN_TREE_CLOEXEC != O_CLOEXEC);
+ BUILD_BUG_ON(O_CLOEXEC & OPEN_TREE_CLONE);
+ BUILD_BUG_ON((AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE | AT_SYMLINK_NOFOLLOW) &
+ (O_CLOEXEC | OPEN_TREE_CLONE));
if (flags & ~(AT_EMPTY_PATH | AT_NO_AUTOMOUNT | AT_RECURSIVE |
AT_SYMLINK_NOFOLLOW | OPEN_TREE_CLONE |
@@ -3100,7 +3103,7 @@ static struct file *vfs_open_tree(int dfd, const char __user *filename, unsigned
SYSCALL_DEFINE3(open_tree, int, dfd, const char __user *, filename, unsigned, flags)
{
- return FD_ADD(flags, vfs_open_tree(dfd, filename, flags));
+ return FD_ADD(flags & O_CLOEXEC, vfs_open_tree(dfd, filename, flags));
}
/*
(Completely untested.)
Passing the mix of flags to FD_ADD isn't really future-proof if FD_ADD
ever recognizes more than just O_CLOEXEC.
Thanks,
Florian
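The two ideas in this message, a compile-time check that the flag sets do not overlap and masking the flag word down to the close-on-exec bit before handing it on, can be sketched in plain C11. This is a userspace illustration only: the DEMO_* macros are local copies defined for the example (the real definitions live in the UAPI headers), static_assert stands in for the kernel's BUILD_BUG_ON, and demo_fd_flags() is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>	/* O_CLOEXEC, whose value differs across architectures */

/* Illustrative local copies of the flag values involved. */
#define DEMO_OPEN_TREE_CLONE	0x1u
#define DEMO_OPEN_TREE_CLOEXEC	((unsigned int)O_CLOEXEC)
#define DEMO_AT_SYMLINK_NOFOLLOW 0x100u
#define DEMO_AT_NO_AUTOMOUNT	0x800u
#define DEMO_AT_EMPTY_PATH	0x1000u
#define DEMO_AT_RECURSIVE	0x8000u

#define DEMO_AT_MASK (DEMO_AT_SYMLINK_NOFOLLOW | DEMO_AT_NO_AUTOMOUNT | \
		      DEMO_AT_EMPTY_PATH | DEMO_AT_RECURSIVE)

/* Fail the build if any two of the three flag groups share a bit. */
static_assert((DEMO_OPEN_TREE_CLOEXEC & DEMO_OPEN_TREE_CLONE) == 0,
	      "CLOEXEC overlaps OPEN_TREE_CLONE");
static_assert((DEMO_AT_MASK &
	       (DEMO_OPEN_TREE_CLOEXEC | DEMO_OPEN_TREE_CLONE)) == 0,
	      "AT_* flags overlap OPEN_TREE_* flags");

/* Keep only the bit that an fd-allocation helper is meant to see. */
static unsigned int demo_fd_flags(unsigned int flags)
{
	return flags & DEMO_OPEN_TREE_CLOEXEC;
}
```

Because the checks are evaluated per build, an architecture whose O_CLOEXEC value collides with another flag would fail to compile instead of misbehaving at run time.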
* Re: O_CLOEXEC use for OPEN_TREE_CLOEXEC
From: Aleksa Sarai @ 2026-01-14 21:18 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Christian Brauner, Florian Weimer, linux-fsdevel, linux-api,
linux-kernel, Al Viro, David Howells, DJ Delorie
In-Reply-To: <CALCETrWMWs3_G5JhJb7+h+JQjpqXxqOh2vNcQaG1HuXjaeCqQw@mail.gmail.com>
On 2026-01-14, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Jan 14, 2026 at 8:09 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Tue, Jan 13, 2026 at 11:40:55PM +0100, Florian Weimer wrote:
> > > In <linux/mount.h>, we have this:
> > >
> > > #define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
> > >
> > > This causes a few pain points for us on the glibc side when we mirror
> > > this into <linux/mount.h> because O_CLOEXEC is defined in <fcntl.h>,
> > > which is one of the headers that's completely incompatible with the UAPI
> > > headers.
> > >
> > > The reason why this is painful is because O_CLOEXEC has at least three
> > > different values across architectures: 0x80000, 0x200000, 0x400000
> > >
> > > Even for the UAPI this isn't ideal because it effectively burns three
> > > open_tree flags, unless the flags are made architecture-specific, too.
> >
> > I think that just got cargo-culted... A long time ago some API defined it as
> > O_CLOEXEC and now a lot of APIs have done the same. I'm pretty sure we
> > can't change that now but we can document that this shouldn't be ifdefed
> > and instead be a separate per-syscall bit. But I think that's the best
> > we can do right now.
> >
>
> How about, for future syscalls, we make CLOEXEC unconditional? If
> anyone wants an ofd to get inherited across exec, they can F_SETFD it
> themselves.
I believe newer interfaces have already started doing that (e.g., all of
the pidfd stuff is O_CLOEXEC by default) but we should definitely update
the documentation in Documentation/process/adding-syscalls.rst to stop
recommending the inclusion of the O_CLOEXEC flag.
The funniest thing about open_tree(2) is that it actually borrows flag
bits from three distinct namespaces! It has an OPEN_TREE_* namespace,
the AT_* namespace (which now has a concept of "per-syscall flags"), and
O_CLOEXEC. What a fun interface!
--
Aleksa Sarai
https://www.cyphar.com/
* Re: O_CLOEXEC use for OPEN_TREE_CLOEXEC
From: Andy Lutomirski @ 2026-01-14 19:42 UTC (permalink / raw)
To: Christian Brauner
Cc: Florian Weimer, linux-fsdevel, linux-api, linux-kernel, Al Viro,
David Howells, DJ Delorie
In-Reply-To: <20260114-alias-riefen-2cb8c09d0ded@brauner>
On Wed, Jan 14, 2026 at 8:09 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, Jan 13, 2026 at 11:40:55PM +0100, Florian Weimer wrote:
> > In <linux/mount.h>, we have this:
> >
> > #define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
> >
> > This causes a few pain points for us on the glibc side when we mirror
> > this into <linux/mount.h> because O_CLOEXEC is defined in <fcntl.h>,
> > which is one of the headers that's completely incompatible with the UAPI
> > headers.
> >
> > The reason why this is painful is because O_CLOEXEC has at least three
> > different values across architectures: 0x80000, 0x200000, 0x400000
> >
> > Even for the UAPI this isn't ideal because it effectively burns three
> > open_tree flags, unless the flags are made architecture-specific, too.
>
> I think that just got cargo-culted... A long time ago some API defined it as
> O_CLOEXEC and now a lot of APIs have done the same. I'm pretty sure we
> can't change that now but we can document that this shouldn't be ifdefed
> and instead be a separate per-syscall bit. But I think that's the best
> we can do right now.
>
How about, for future syscalls, we make CLOEXEC unconditional? If
anyone wants an ofd to get inherited across exec, they can F_SETFD it
themselves.
--Andy
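For reference, opting a descriptor back into inheritance across exec, as described above, takes two fcntl() calls. The sketch below assumes a descriptor that was created close-on-exec; the helper name make_fd_inheritable() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Clear FD_CLOEXEC on `fd` so that it stays open across execve().
 * Returns 0 on success, -1 on error (errno is set by fcntl()).
 */
static int make_fd_inheritable(int fd)
{
	int fdflags;

	assert(fd >= 0);

	fdflags = fcntl(fd, F_GETFD);
	if (fdflags == -1)
		return -1;

	/* Preserve any other descriptor flags; drop only close-on-exec. */
	if (fcntl(fd, F_SETFD, fdflags & ~FD_CLOEXEC) == -1)
		return -1;
	return 0;
}
```

The read-modify-write keeps the helper correct even if more descriptor flags besides FD_CLOEXEC are defined in the future.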
* Re: [PATCH v8 14/18] mm: memfd_luo: allow preserving memfd
From: Pratyush Yadav @ 2026-01-14 19:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pasha Tatashin, pratyush, jasonmiu, graf, rppt, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, parav, leonro, witu, hughd, skhawaja,
chrisl
In-Reply-To: <20260107185414.GG293394@nvidia.com>
On Wed, Jan 07 2026, Jason Gunthorpe wrote:
> On Tue, Nov 25, 2025 at 11:58:44AM -0500, Pasha Tatashin wrote:
>> From: Pratyush Yadav <ptyadav@amazon.de>
>>
>> The ability to preserve a memfd allows userspace to use KHO and LUO to
>> transfer its memory contents to the next kernel. This is useful in many
>> ways. For one, it can be used with IOMMUFD as the backing store for
>> IOMMU page tables. Preserving IOMMUFD is essential for performing a
>> hypervisor live update with passthrough devices. memfd support provides
>> the first building block for making that possible.
>
> I would lead with the use of memfd to back the guest memory pages for
> use with KVM :)
I would assume using 1G-page-backed memfd is the more common use case,
and this patch doesn't come with 1G page support.
Anyway, the patch is now already applied so we can't go back and fix
the commit message...
--
Regards,
Pratyush Yadav
* Re: [PATCH v8 14/18] mm: memfd_luo: allow preserving memfd
From: Pratyush Yadav @ 2026-01-14 18:59 UTC (permalink / raw)
To: Mike Rapoport
Cc: Chris Mason, Pratyush Yadav, Pasha Tatashin, jasonmiu, graf,
dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
roman.gushchin, chenridong, axboe, mark.rutland, jannh,
vincent.guittot, hannes, dan.j.williams, david, joel.granados,
rostedt, anna.schumaker, song, linux, linux-kernel, linux-doc,
linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael,
dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl
In-Reply-To: <aWfLS48tG7XInpNN@kernel.org>
On Wed, Jan 14 2026, Mike Rapoport wrote:
> On Tue, Jan 13, 2026 at 06:09:23AM -0800, Chris Mason wrote:
>> On Tue, 25 Nov 2025 11:58:44 -0500 Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>>
>> Hi everyone,
>>
>> I'm running the v6.19 mm commits through patch review automation, and this
>> commit was flagged. I don't know this code well, but it looks like it
>> might be a real bug.
>>
>> [AI review output below]
>>
>> > diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c
>> > new file mode 100644
>> > index 0000000000000..4f6ba63b43105
>> > --- /dev/null
>> > +++ b/mm/memfd_luo.c
>>
>> [ ... ]
>>
>> > +static int memfd_luo_retrieve_folios(struct file *file,
>> > + struct memfd_luo_folio_ser *folios_ser,
>> > + u64 nr_folios)
>> > +{
>> > + struct inode *inode = file_inode(file);
>> > + struct address_space *mapping = inode->i_mapping;
>> > + struct folio *folio;
>> > + int err = -EIO;
>> > + long i;
>> > +
>> > + for (i = 0; i < nr_folios; i++) {
>>
>> [ ... ]
>>
>> > + err = shmem_add_to_page_cache(folio, mapping, index, NULL,
>> > + mapping_gfp_mask(mapping));
>> > + if (err) {
>> > + pr_err("shmem: failed to add to page cache folio index %ld: %d\n",
>> > + i, err);
>> > + goto unlock_folio;
>> > + }
>> > +
>> > + if (flags & MEMFD_LUO_FOLIO_UPTODATE)
>> > + folio_mark_uptodate(folio);
>> > + if (flags & MEMFD_LUO_FOLIO_DIRTY)
>> > + folio_mark_dirty(folio);
>> > +
>> > + err = shmem_inode_acct_blocks(inode, 1);
>> > + if (err) {
>> > + pr_err("shmem: failed to account folio index %ld: %d\n",
>> > + i, err);
>> > + goto unlock_folio;
>> > + }
>>
>> When shmem_inode_acct_blocks() fails here, the folio has already been
>> added to the page cache by shmem_add_to_page_cache(). Should the folio be
>> removed from the page cache before going to unlock_folio?
>>
>> Looking at shmem_alloc_and_add_folio() in mm/shmem.c, when
>> shmem_inode_acct_blocks() fails after the folio has been added, it calls
>> filemap_remove_folio() to remove it:
>>
>> error = shmem_inode_acct_blocks(inode, pages);
>> if (error) {
>> ...
>> if (error) {
>> filemap_remove_folio(folio);
>> goto unlock;
>> }
>> }
>>
>> Without this, the folio remains in the page cache (counted in
>> mapping->nrpages) but info->alloced is not incremented (since
>> shmem_recalc_inode is not called). This could cause shmem accounting
>> inconsistency.
>
> My understanding is that if anything fails in memfd_luo_retrieve_folios() the
> file is destroyed anyway and the accounting wouldn't matter.
>
> But to be on the safe side we should fix the error handling here.
> @Pratyush, what do you say?
Yeah, I don't think the inode's alloced accounting is a real issue here
since the file will be destroyed immediately after. This is why I didn't
want to add the extra complexity of the error handling.
But now that I think of it, perhaps the lingering unaccounted folio
might cause an underflow in vm_committed_as. shmem_inode_acct_blocks()
cleans up the vm_acct_memory() call in case of failure. But perhaps the
iput() triggers an extra shmem_unacct_memory() because of the lingering
folio.
I am not 100% sure that can actually happen since the code is a bit
complex. Let me check and get back to you.
--
Regards,
Pratyush Yadav
^ permalink raw reply
* Re: [PATCH v8 14/18] mm: memfd_luo: allow preserving memfd
From: Mike Rapoport @ 2026-01-14 16:58 UTC (permalink / raw)
To: Chris Mason, Pratyush Yadav
Cc: Pasha Tatashin, jasonmiu, graf, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, linux,
linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <20260113140927.1074142-1-clm@meta.com>
On Tue, Jan 13, 2026 at 06:09:23AM -0800, Chris Mason wrote:
> On Tue, 25 Nov 2025 11:58:44 -0500 Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> Hi everyone,
>
> I'm running the v6.19 mm commits through patch review automation, and this
> commit was flagged. I don't know this code well, but it looks like it
> might be a real bug.
>
> [AI review output below]
>
> > diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c
> > new file mode 100644
> > index 0000000000000..4f6ba63b43105
> > --- /dev/null
> > +++ b/mm/memfd_luo.c
>
> [ ... ]
>
> > +static int memfd_luo_retrieve_folios(struct file *file,
> > + struct memfd_luo_folio_ser *folios_ser,
> > + u64 nr_folios)
> > +{
> > + struct inode *inode = file_inode(file);
> > + struct address_space *mapping = inode->i_mapping;
> > + struct folio *folio;
> > + int err = -EIO;
> > + long i;
> > +
> > + for (i = 0; i < nr_folios; i++) {
>
> [ ... ]
>
> > + err = shmem_add_to_page_cache(folio, mapping, index, NULL,
> > + mapping_gfp_mask(mapping));
> > + if (err) {
> > + pr_err("shmem: failed to add to page cache folio index %ld: %d\n",
> > + i, err);
> > + goto unlock_folio;
> > + }
> > +
> > + if (flags & MEMFD_LUO_FOLIO_UPTODATE)
> > + folio_mark_uptodate(folio);
> > + if (flags & MEMFD_LUO_FOLIO_DIRTY)
> > + folio_mark_dirty(folio);
> > +
> > + err = shmem_inode_acct_blocks(inode, 1);
> > + if (err) {
> > + pr_err("shmem: failed to account folio index %ld: %d\n",
> > + i, err);
> > + goto unlock_folio;
> > + }
>
> When shmem_inode_acct_blocks() fails here, the folio has already been
> added to the page cache by shmem_add_to_page_cache(). Should the folio be
> removed from the page cache before going to unlock_folio?
>
> Looking at shmem_alloc_and_add_folio() in mm/shmem.c, when
> shmem_inode_acct_blocks() fails after the folio has been added, it calls
> filemap_remove_folio() to remove it:
>
> error = shmem_inode_acct_blocks(inode, pages);
> if (error) {
> ...
> if (error) {
> filemap_remove_folio(folio);
> goto unlock;
> }
> }
>
> Without this, the folio remains in the page cache (counted in
> mapping->nrpages) but info->alloced is not incremented (since
> shmem_recalc_inode is not called). This could cause shmem accounting
> inconsistency.
My understanding is that if anything fails in memfd_luo_retrieve_folios() the
file is destroyed anyway and the accounting wouldn't matter.
But to be on the safe side we should fix the error handling here.
@Pratyush, what do you say?
--
Sincerely yours,
Mike.
^ permalink raw reply
* Re: O_CLOEXEC use for OPEN_TREE_CLOEXEC
From: Christian Brauner @ 2026-01-14 16:03 UTC (permalink / raw)
To: Florian Weimer
Cc: linux-fsdevel, linux-api, linux-kernel, Al Viro, David Howells,
DJ Delorie
In-Reply-To: <lhupl7dcf0o.fsf@oldenburg.str.redhat.com>
On Tue, Jan 13, 2026 at 11:40:55PM +0100, Florian Weimer wrote:
> In <linux/mount.h>, we have this:
>
> #define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
>
> This causes a few pain points for us on the glibc side when we mirror
> this into <linux/mount.h> because O_CLOEXEC is defined in <fcntl.h>,
> which is one of the headers that's completely incompatible with the UAPI
> headers.
>
> The reason why this is painful is because O_CLOEXEC has at least three
> different values across architectures: 0x80000, 0x200000, 0x400000
>
> Even for the UAPI this isn't ideal because it effectively burns three
> open_tree flags, unless the flags are made architecture-specific, too.
I think that just got cargo-culted... A long time ago some API defined its
flag as O_CLOEXEC and now a lot of APIs have done the same. I'm pretty sure we
can't change that now but we can document that this shouldn't be ifdefed
and instead be a separate per-syscall bit. But I think that's the best
we can do right now.
^ permalink raw reply
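
To make the point about the three per-architecture values concrete, here is
a minimal userspace sketch. It assumes a Linux libc that exposes O_CLOEXEC
from <fcntl.h>; the helper names is_known_o_cloexec() and
cloexec_bits_union() are hypothetical and exist only for illustration. The
three constants are the ones quoted in the thread, nothing more.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* make sure <fcntl.h> exposes O_CLOEXEC */
#endif
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true if v is one of the three O_CLOEXEC values
 * mentioned in the thread (0x80000, 0x200000, 0x400000).
 */
static bool is_known_o_cloexec(unsigned long v)
{
	return v == 0x80000UL || v == 0x200000UL || v == 0x400000UL;
}

/*
 * Hypothetical helper: the set of bits a flag defined as O_CLOEXEC
 * occupies once all three architecture variants are taken together.
 * A flag namespace that wants one meaning per bit on every
 * architecture has to treat all of these as taken.
 */
static unsigned long cloexec_bits_union(void)
{
	return 0x80000UL | 0x200000UL | 0x400000UL;
}
```

Calling is_known_o_cloexec((unsigned long)O_CLOEXEC) on the build
architecture shows which of the three values applies there, and
cloexec_bits_union() is the mask that "burns three open_tree flags" refers
to.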
* O_CLOEXEC use for OPEN_TREE_CLOEXEC
From: Florian Weimer @ 2026-01-13 22:40 UTC (permalink / raw)
To: linux-fsdevel; +Cc: linux-api, linux-kernel, Al Viro, David Howells, DJ Delorie
In <linux/mount.h>, we have this:
#define OPEN_TREE_CLOEXEC O_CLOEXEC /* Close the file on execve() */
This causes a few pain points for us on the glibc side when we mirror
this into <linux/mount.h> because O_CLOEXEC is defined in <fcntl.h>,
which is one of the headers that's completely incompatible with the UAPI
headers.
The reason why this is painful is because O_CLOEXEC has at least three
different values across architectures: 0x80000, 0x200000, 0x400000
Even for the UAPI this isn't ideal because it effectively burns three
open_tree flags, unless the flags are made architecture-specific, too.
Thanks,
Florian
^ permalink raw reply
* Re: [PATCH v8 14/18] mm: memfd_luo: allow preserving memfd
From: Chris Mason @ 2026-01-13 14:09 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Chris Mason, jasonmiu, graf, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, linux,
linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <20251125165850.3389713-15-pasha.tatashin@soleen.com>
On Tue, 25 Nov 2025 11:58:44 -0500 Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
Hi everyone,
I'm running the v6.19 mm commits through patch review automation, and this
commit was flagged. I don't know this code well, but it looks like it
might be a real bug.
[AI review output below]
> diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c
> new file mode 100644
> index 0000000000000..4f6ba63b43105
> --- /dev/null
> +++ b/mm/memfd_luo.c
[ ... ]
> +static int memfd_luo_retrieve_folios(struct file *file,
> + struct memfd_luo_folio_ser *folios_ser,
> + u64 nr_folios)
> +{
> + struct inode *inode = file_inode(file);
> + struct address_space *mapping = inode->i_mapping;
> + struct folio *folio;
> + int err = -EIO;
> + long i;
> +
> + for (i = 0; i < nr_folios; i++) {
[ ... ]
> + err = shmem_add_to_page_cache(folio, mapping, index, NULL,
> + mapping_gfp_mask(mapping));
> + if (err) {
> + pr_err("shmem: failed to add to page cache folio index %ld: %d\n",
> + i, err);
> + goto unlock_folio;
> + }
> +
> + if (flags & MEMFD_LUO_FOLIO_UPTODATE)
> + folio_mark_uptodate(folio);
> + if (flags & MEMFD_LUO_FOLIO_DIRTY)
> + folio_mark_dirty(folio);
> +
> + err = shmem_inode_acct_blocks(inode, 1);
> + if (err) {
> + pr_err("shmem: failed to account folio index %ld: %d\n",
> + i, err);
> + goto unlock_folio;
> + }
When shmem_inode_acct_blocks() fails here, the folio has already been
added to the page cache by shmem_add_to_page_cache(). Should the folio be
removed from the page cache before going to unlock_folio?
Looking at shmem_alloc_and_add_folio() in mm/shmem.c, when
shmem_inode_acct_blocks() fails after the folio has been added, it calls
filemap_remove_folio() to remove it:
error = shmem_inode_acct_blocks(inode, pages);
if (error) {
...
if (error) {
filemap_remove_folio(folio);
goto unlock;
}
}
Without this, the folio remains in the page cache (counted in
mapping->nrpages) but info->alloced is not incremented (since
shmem_recalc_inode is not called). This could cause shmem accounting
inconsistency.
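
The mismatch described above can be illustrated with a toy userspace model.
This is only a sketch under stated assumptions: struct toy_mapping and the
toy_*() functions are hypothetical stand-ins, not the shmem API, and the two
counters merely mimic mapping->nrpages and info->alloced. It shows how
skipping the removal step on the accounting-failure path leaves the two
counters out of step, and how unwinding keeps them equal.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two counters named in the review. */
struct toy_mapping {
	long nrpages;	/* folios present in the (toy) page cache */
	long alloced;	/* folios the (toy) inode has accounted */
};

/* Toy version of "add folio to page cache": always succeeds. */
static int toy_add_to_cache(struct toy_mapping *m)
{
	m->nrpages++;
	return 0;
}

/* Toy version of "remove folio from page cache". */
static void toy_remove_from_cache(struct toy_mapping *m)
{
	m->nrpages--;
}

/* Toy version of block accounting; fails on request. */
static int toy_acct_blocks(bool should_fail)
{
	return should_fail ? -ENOSPC : 0;
}

/*
 * Restore one folio. If accounting fails and unwind is true, the folio
 * is taken back out of the cache before returning, which is the step
 * the review asks about.
 */
static int toy_retrieve_one(struct toy_mapping *m, bool acct_fails,
			    bool unwind)
{
	int err = toy_add_to_cache(m);

	if (err)
		return err;

	err = toy_acct_blocks(acct_fails);
	if (err) {
		if (unwind)
			toy_remove_from_cache(m);
		return err;
	}

	m->alloced++;
	return 0;
}
```

With unwind set to false and a failing accounting call, the model ends with
nrpages == 1 and alloced == 0; with unwind set to true both stay at 0.
Whether the real imbalance matters once the file is destroyed is exactly
the open question in this thread.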
^ permalink raw reply
* Re: [PATCHSET v5] fs: generic file IO error reporting
From: Christian Brauner @ 2026-01-13 8:58 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christian Brauner, linux-api, jack, hch, linux-xfs, linux-ext4,
linux-fsdevel, gabriel, amir73il, Gao Xiang
In-Reply-To: <176826402528.3490369.2415315475116356277.stgit@frogsfrogsfrogs>
On Mon, 12 Jan 2026 16:31:03 -0800, Darrick J. Wong wrote:
> This patchset adds some generic helpers so that filesystems can report
> errors to fsnotify in a standard way. Then it adapts iomap to use the
> generic helpers so that any iomap-enabled filesystem can report I/O
> errors through this mechanism as well. Finally, it makes XFS report
> metadata errors through this mechanism in much the same way that ext4
> does now.
>
> [...]
Applied to the vfs-7.0.fserror branch of the vfs/vfs.git tree.
Patches in the vfs-7.0.fserror branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.0.fserror
[1/6] uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
https://git.kernel.org/vfs/vfs/c/602544773763
[2/6] fs: report filesystem and file I/O errors to fsnotify
https://git.kernel.org/vfs/vfs/c/21945e6cb516
[3/6] iomap: report file I/O errors to the VFS
https://git.kernel.org/vfs/vfs/c/a9d573ee88af
[4/6] xfs: report fs metadata errors via fsnotify
https://git.kernel.org/vfs/vfs/c/efd87a100729
[5/6] xfs: translate fsdax media errors into file "data lost" errors when convenient
https://git.kernel.org/vfs/vfs/c/94503211d2fd
[6/6] ext4: convert to new fserror helpers
https://git.kernel.org/vfs/vfs/c/81d2e13a57c9
^ permalink raw reply
* [PATCH 1/6] uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
From: Darrick J. Wong @ 2026-01-13 0:31 UTC (permalink / raw)
To: djwong, brauner
Cc: hch, hsiangkao, jack, linux-api, linux-xfs, jack, linux-ext4,
linux-fsdevel, gabriel, hch, amir73il
In-Reply-To: <176826402528.3490369.2415315475116356277.stgit@frogsfrogsfrogs>
From: Darrick J. Wong <djwong@kernel.org>
Stop defining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.
Cc: linux-api@vger.kernel.org
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
---
arch/alpha/include/uapi/asm/errno.h | 2 ++
arch/mips/include/uapi/asm/errno.h | 2 ++
arch/parisc/include/uapi/asm/errno.h | 2 ++
arch/sparc/include/uapi/asm/errno.h | 2 ++
fs/erofs/internal.h | 2 --
fs/ext2/ext2.h | 1 -
fs/ext4/ext4.h | 3 ---
fs/f2fs/f2fs.h | 3 ---
fs/minix/minix.h | 2 --
fs/udf/udf_sb.h | 2 --
fs/xfs/xfs_linux.h | 2 --
include/linux/jbd2.h | 3 ---
include/uapi/asm-generic/errno.h | 2 ++
tools/arch/alpha/include/uapi/asm/errno.h | 2 ++
tools/arch/mips/include/uapi/asm/errno.h | 2 ++
tools/arch/parisc/include/uapi/asm/errno.h | 2 ++
tools/arch/sparc/include/uapi/asm/errno.h | 2 ++
tools/include/uapi/asm-generic/errno.h | 2 ++
18 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/errno.h b/arch/alpha/include/uapi/asm/errno.h
index 3d265f6babaf0a..6791f6508632ee 100644
--- a/arch/alpha/include/uapi/asm/errno.h
+++ b/arch/alpha/include/uapi/asm/errno.h
@@ -55,6 +55,7 @@
#define ENOSR 82 /* Out of streams resources */
#define ETIME 83 /* Timer expired */
#define EBADMSG 84 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EPROTO 85 /* Protocol error */
#define ENODATA 86 /* No data available */
#define ENOSTR 87 /* Device not a stream */
@@ -96,6 +97,7 @@
#define EREMCHG 115 /* Remote address changed */
#define EUCLEAN 117 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 118 /* Not a XENIX named type file */
#define ENAVAIL 119 /* No XENIX semaphores available */
#define EISNAM 120 /* Is a named type file */
diff --git a/arch/mips/include/uapi/asm/errno.h b/arch/mips/include/uapi/asm/errno.h
index 2fb714e2d6d8fc..c01ed91b1ef44b 100644
--- a/arch/mips/include/uapi/asm/errno.h
+++ b/arch/mips/include/uapi/asm/errno.h
@@ -50,6 +50,7 @@
#define EDOTDOT 73 /* RFS specific error */
#define EMULTIHOP 74 /* Multihop attempted */
#define EBADMSG 77 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define ENAMETOOLONG 78 /* File name too long */
#define EOVERFLOW 79 /* Value too large for defined data type */
#define ENOTUNIQ 80 /* Name not unique on network */
@@ -88,6 +89,7 @@
#define EISCONN 133 /* Transport endpoint is already connected */
#define ENOTCONN 134 /* Transport endpoint is not connected */
#define EUCLEAN 135 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 137 /* Not a XENIX named type file */
#define ENAVAIL 138 /* No XENIX semaphores available */
#define EISNAM 139 /* Is a named type file */
diff --git a/arch/parisc/include/uapi/asm/errno.h b/arch/parisc/include/uapi/asm/errno.h
index 8d94739d75c67c..8cbc07c1903e4c 100644
--- a/arch/parisc/include/uapi/asm/errno.h
+++ b/arch/parisc/include/uapi/asm/errno.h
@@ -36,6 +36,7 @@
#define EDOTDOT 66 /* RFS specific error */
#define EBADMSG 67 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EUSERS 68 /* Too many users */
#define EDQUOT 69 /* Quota exceeded */
#define ESTALE 70 /* Stale file handle */
@@ -62,6 +63,7 @@
#define ERESTART 175 /* Interrupted system call should be restarted */
#define ESTRPIPE 176 /* Streams pipe error */
#define EUCLEAN 177 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 178 /* Not a XENIX named type file */
#define ENAVAIL 179 /* No XENIX semaphores available */
#define EISNAM 180 /* Is a named type file */
diff --git a/arch/sparc/include/uapi/asm/errno.h b/arch/sparc/include/uapi/asm/errno.h
index 81a732b902ee38..4a41e7835fd5b8 100644
--- a/arch/sparc/include/uapi/asm/errno.h
+++ b/arch/sparc/include/uapi/asm/errno.h
@@ -48,6 +48,7 @@
#define ENOSR 74 /* Out of streams resources */
#define ENOMSG 75 /* No message of desired type */
#define EBADMSG 76 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EIDRM 77 /* Identifier removed */
#define EDEADLK 78 /* Resource deadlock would occur */
#define ENOLCK 79 /* No record locks available */
@@ -91,6 +92,7 @@
#define ENOTUNIQ 115 /* Name not unique on network */
#define ERESTART 116 /* Interrupted syscall should be restarted */
#define EUCLEAN 117 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 118 /* Not a XENIX named type file */
#define ENAVAIL 119 /* No XENIX semaphores available */
#define EISNAM 120 /* Is a named type file */
diff --git a/fs/erofs/internal.h b/fs/erofs/internal.h
index f7f622836198da..d06e99baf5d5ae 100644
--- a/fs/erofs/internal.h
+++ b/fs/erofs/internal.h
@@ -541,6 +541,4 @@ long erofs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg);
long erofs_compat_ioctl(struct file *filp, unsigned int cmd,
unsigned long arg);
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
-
#endif /* __EROFS_INTERNAL_H */
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index cf97b76e9fd3e9..5e0c6c5fcb6cd6 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -357,7 +357,6 @@ struct ext2_inode {
*/
#define EXT2_VALID_FS 0x0001 /* Unmounted cleanly */
#define EXT2_ERROR_FS 0x0002 /* Errors detected */
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
/*
* Mount flags
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 56112f201cace7..62c091b52bacdf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3938,7 +3938,4 @@ extern int ext4_block_write_begin(handle_t *handle, struct folio *folio,
get_block_t *get_block);
#endif /* __KERNEL__ */
-#define EFSBADCRC EBADMSG /* Bad CRC detected */
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
-
#endif /* _EXT4_H */
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 20edbb99b814a7..9f3aa3c7f12613 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -5004,7 +5004,4 @@ static inline void f2fs_invalidate_internal_cache(struct f2fs_sb_info *sbi,
f2fs_invalidate_compress_pages_range(sbi, blkaddr, len);
}
-#define EFSBADCRC EBADMSG /* Bad CRC detected */
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
-
#endif /* _LINUX_F2FS_H */
diff --git a/fs/minix/minix.h b/fs/minix/minix.h
index 2bfaf377f2086c..7e1f652f16d311 100644
--- a/fs/minix/minix.h
+++ b/fs/minix/minix.h
@@ -175,6 +175,4 @@ static inline int minix_test_bit(int nr, const void *vaddr)
__minix_error_inode((inode), __func__, __LINE__, \
(fmt), ##__VA_ARGS__)
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
-
#endif /* FS_MINIX_H */
diff --git a/fs/udf/udf_sb.h b/fs/udf/udf_sb.h
index 08ec8756b9487b..8399accc788dea 100644
--- a/fs/udf/udf_sb.h
+++ b/fs/udf/udf_sb.h
@@ -55,8 +55,6 @@
#define MF_DUPLICATE_MD 0x01
#define MF_MIRROR_FE_LOADED 0x02
-#define EFSCORRUPTED EUCLEAN
-
struct udf_meta_data {
__u32 s_meta_file_loc;
__u32 s_mirror_file_loc;
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 4dd747bdbccab2..55064228c4d574 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -121,8 +121,6 @@ typedef __u32 xfs_nlink_t;
#define ENOATTR ENODATA /* Attribute not found */
#define EWRONGFS EINVAL /* Mount with wrong filesystem type */
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
-#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define __return_address __builtin_return_address(0)
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f5eaf76198f377..a53a00d36228ce 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1815,7 +1815,4 @@ static inline int jbd2_handle_buffer_credits(handle_t *handle)
#endif /* __KERNEL__ */
-#define EFSBADCRC EBADMSG /* Bad CRC detected */
-#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
-
#endif /* _LINUX_JBD2_H */
diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
index cf9c51ac49f97e..92e7ae493ee315 100644
--- a/include/uapi/asm-generic/errno.h
+++ b/include/uapi/asm-generic/errno.h
@@ -55,6 +55,7 @@
#define EMULTIHOP 72 /* Multihop attempted */
#define EDOTDOT 73 /* RFS specific error */
#define EBADMSG 74 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EOVERFLOW 75 /* Value too large for defined data type */
#define ENOTUNIQ 76 /* Name not unique on network */
#define EBADFD 77 /* File descriptor in bad state */
@@ -98,6 +99,7 @@
#define EINPROGRESS 115 /* Operation now in progress */
#define ESTALE 116 /* Stale file handle */
#define EUCLEAN 117 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 118 /* Not a XENIX named type file */
#define ENAVAIL 119 /* No XENIX semaphores available */
#define EISNAM 120 /* Is a named type file */
diff --git a/tools/arch/alpha/include/uapi/asm/errno.h b/tools/arch/alpha/include/uapi/asm/errno.h
index 3d265f6babaf0a..6791f6508632ee 100644
--- a/tools/arch/alpha/include/uapi/asm/errno.h
+++ b/tools/arch/alpha/include/uapi/asm/errno.h
@@ -55,6 +55,7 @@
#define ENOSR 82 /* Out of streams resources */
#define ETIME 83 /* Timer expired */
#define EBADMSG 84 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EPROTO 85 /* Protocol error */
#define ENODATA 86 /* No data available */
#define ENOSTR 87 /* Device not a stream */
@@ -96,6 +97,7 @@
#define EREMCHG 115 /* Remote address changed */
#define EUCLEAN 117 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 118 /* Not a XENIX named type file */
#define ENAVAIL 119 /* No XENIX semaphores available */
#define EISNAM 120 /* Is a named type file */
diff --git a/tools/arch/mips/include/uapi/asm/errno.h b/tools/arch/mips/include/uapi/asm/errno.h
index 2fb714e2d6d8fc..c01ed91b1ef44b 100644
--- a/tools/arch/mips/include/uapi/asm/errno.h
+++ b/tools/arch/mips/include/uapi/asm/errno.h
@@ -50,6 +50,7 @@
#define EDOTDOT 73 /* RFS specific error */
#define EMULTIHOP 74 /* Multihop attempted */
#define EBADMSG 77 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define ENAMETOOLONG 78 /* File name too long */
#define EOVERFLOW 79 /* Value too large for defined data type */
#define ENOTUNIQ 80 /* Name not unique on network */
@@ -88,6 +89,7 @@
#define EISCONN 133 /* Transport endpoint is already connected */
#define ENOTCONN 134 /* Transport endpoint is not connected */
#define EUCLEAN 135 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 137 /* Not a XENIX named type file */
#define ENAVAIL 138 /* No XENIX semaphores available */
#define EISNAM 139 /* Is a named type file */
diff --git a/tools/arch/parisc/include/uapi/asm/errno.h b/tools/arch/parisc/include/uapi/asm/errno.h
index 8d94739d75c67c..8cbc07c1903e4c 100644
--- a/tools/arch/parisc/include/uapi/asm/errno.h
+++ b/tools/arch/parisc/include/uapi/asm/errno.h
@@ -36,6 +36,7 @@
#define EDOTDOT 66 /* RFS specific error */
#define EBADMSG 67 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EUSERS 68 /* Too many users */
#define EDQUOT 69 /* Quota exceeded */
#define ESTALE 70 /* Stale file handle */
@@ -62,6 +63,7 @@
#define ERESTART 175 /* Interrupted system call should be restarted */
#define ESTRPIPE 176 /* Streams pipe error */
#define EUCLEAN 177 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 178 /* Not a XENIX named type file */
#define ENAVAIL 179 /* No XENIX semaphores available */
#define EISNAM 180 /* Is a named type file */
diff --git a/tools/arch/sparc/include/uapi/asm/errno.h b/tools/arch/sparc/include/uapi/asm/errno.h
index 81a732b902ee38..4a41e7835fd5b8 100644
--- a/tools/arch/sparc/include/uapi/asm/errno.h
+++ b/tools/arch/sparc/include/uapi/asm/errno.h
@@ -48,6 +48,7 @@
#define ENOSR 74 /* Out of streams resources */
#define ENOMSG 75 /* No message of desired type */
#define EBADMSG 76 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EIDRM 77 /* Identifier removed */
#define EDEADLK 78 /* Resource deadlock would occur */
#define ENOLCK 79 /* No record locks available */
@@ -91,6 +92,7 @@
#define ENOTUNIQ 115 /* Name not unique on network */
#define ERESTART 116 /* Interrupted syscall should be restarted */
#define EUCLEAN 117 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 118 /* Not a XENIX named type file */
#define ENAVAIL 119 /* No XENIX semaphores available */
#define EISNAM 120 /* Is a named type file */
diff --git a/tools/include/uapi/asm-generic/errno.h b/tools/include/uapi/asm-generic/errno.h
index cf9c51ac49f97e..92e7ae493ee315 100644
--- a/tools/include/uapi/asm-generic/errno.h
+++ b/tools/include/uapi/asm-generic/errno.h
@@ -55,6 +55,7 @@
#define EMULTIHOP 72 /* Multihop attempted */
#define EDOTDOT 73 /* RFS specific error */
#define EBADMSG 74 /* Not a data message */
+#define EFSBADCRC EBADMSG /* Bad CRC detected */
#define EOVERFLOW 75 /* Value too large for defined data type */
#define ENOTUNIQ 76 /* Name not unique on network */
#define EBADFD 77 /* File descriptor in bad state */
@@ -98,6 +99,7 @@
#define EINPROGRESS 115 /* Operation now in progress */
#define ESTALE 116 /* Stale file handle */
#define EUCLEAN 117 /* Structure needs cleaning */
+#define EFSCORRUPTED EUCLEAN /* Filesystem is corrupted */
#define ENOTNAM 118 /* Not a XENIX named type file */
#define ENAVAIL 119 /* No XENIX semaphores available */
#define EISNAM 120 /* Is a named type file */
^ permalink raw reply related
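
As a rough illustration of what the patch above gives userspace, the sketch
below classifies an errno value using the two aliases. It assumes a Linux
libc whose <errno.h> provides EUCLEAN and EBADMSG; the fallback #defines
mirror the ones in the patch for headers that predate it, and
fs_error_kind() is a hypothetical helper, not part of any existing API.

```c
#include <errno.h>
#include <string.h>

/* Fallbacks for headers that do not carry the aliases yet. */
#ifndef EFSCORRUPTED
#define EFSCORRUPTED	EUCLEAN		/* Filesystem is corrupted */
#endif
#ifndef EFSBADCRC
#define EFSBADCRC	EBADMSG		/* Bad CRC detected */
#endif

/*
 * Hypothetical helper: map an errno value (positive, or negative in
 * kernel style) to a short description of the filesystem error class.
 */
static const char *fs_error_kind(int err)
{
	if (err < 0)
		err = -err;

	switch (err) {
	case EFSCORRUPTED:
		return "corrupted";
	case EFSBADCRC:
		return "bad-crc";
	default:
		return "other";
	}
}
```

Because both names are plain aliases, fs_error_kind(EUCLEAN) and
fs_error_kind(EFSCORRUPTED) return the same string on every architecture,
even though the numeric value behind EUCLEAN differs between them.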
* [PATCHSET v5] fs: generic file IO error reporting
From: Darrick J. Wong @ 2026-01-13 0:31 UTC (permalink / raw)
To: djwong, brauner
Cc: linux-api, jack, hch, hsiangkao, linux-xfs, jack, linux-ext4,
linux-fsdevel, gabriel, hch, amir73il
Hi all,
This patchset adds some generic helpers so that filesystems can report
errors to fsnotify in a standard way. Then it adapts iomap to use the
generic helpers so that any iomap-enabled filesystem can report I/O
errors through this mechanism as well. Finally, it makes XFS report
metadata errors through this mechanism in much the same way that ext4
does now.
These are a prerequisite for the XFS self-healing series which will
come at a later time.
v5: tidy comments, un-inline the unmount function
v4: drag out of RFC status, finalize the sign of errnos that we accept
If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.
This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.
--D
kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=filesystem-error-reporting
fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=filesystem-error-reporting
---
Commits in this patchset:
* uapi: promote EFSCORRUPTED and EUCLEAN to errno.h
* fs: report filesystem and file I/O errors to fsnotify
* iomap: report file I/O errors to the VFS
* xfs: report fs metadata errors via fsnotify
* xfs: translate fsdax media errors into file "data lost" errors when convenient
* ext4: convert to new fserror helpers
---
arch/alpha/include/uapi/asm/errno.h | 2
arch/mips/include/uapi/asm/errno.h | 2
arch/parisc/include/uapi/asm/errno.h | 2
arch/sparc/include/uapi/asm/errno.h | 2
fs/erofs/internal.h | 2
fs/ext2/ext2.h | 1
fs/ext4/ext4.h | 3
fs/f2fs/f2fs.h | 3
fs/minix/minix.h | 2
fs/udf/udf_sb.h | 2
fs/xfs/xfs_linux.h | 2
include/linux/fs/super_types.h | 7 +
include/linux/fserror.h | 75 +++++++++++
include/linux/jbd2.h | 3
include/uapi/asm-generic/errno.h | 2
tools/arch/alpha/include/uapi/asm/errno.h | 2
tools/arch/mips/include/uapi/asm/errno.h | 2
tools/arch/parisc/include/uapi/asm/errno.h | 2
tools/arch/sparc/include/uapi/asm/errno.h | 2
tools/include/uapi/asm-generic/errno.h | 2
fs/Makefile | 2
fs/ext4/ioctl.c | 2
fs/ext4/super.c | 13 +-
fs/fserror.c | 194 ++++++++++++++++++++++++++++
fs/iomap/buffered-io.c | 23 +++
fs/iomap/direct-io.c | 12 ++
fs/iomap/ioend.c | 6 +
fs/super.c | 3
fs/xfs/xfs_fsops.c | 4 +
fs/xfs/xfs_health.c | 14 ++
fs/xfs/xfs_notify_failure.c | 4 +
31 files changed, 373 insertions(+), 24 deletions(-)
create mode 100644 include/linux/fserror.h
create mode 100644 fs/fserror.c
^ permalink raw reply
* [PATCH bpf-next v5 9/9] selftests/bpf: Add tests to verify map create failure log
From: Leon Hwang @ 2026-01-12 14:56 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Christian Brauner, Seth Forshee, Yuichiro Tsuji,
Andrey Albershteyn, Leon Hwang, Willem de Bruijn, Jason Xing,
Tao Chen, Mykyta Yatsenko, Kumar Kartikeya Dwivedi,
Anton Protopopov, Amery Hung, Rong Tao, linux-kernel, linux-api,
linux-kselftest, kernel-patches-bot
In-Reply-To: <20260112145616.44195-1-leon.hwang@linux.dev>
Add tests to verify that the kernel reports the expected error messages
when map creation fails.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
.../selftests/bpf/prog_tests/map_init.c | 168 ++++++++++++++++++
1 file changed, 168 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/map_init.c b/tools/testing/selftests/bpf/prog_tests/map_init.c
index 14a31109dd0e..824e2bea74bf 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_init.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_init.c
@@ -212,3 +212,171 @@ void test_map_init(void)
if (test__start_subtest("pcpu_lru_map_init"))
test_pcpu_lru_map_init();
}
+
+#define BPF_LOG_FIXED 8
+
+static void test_map_create(enum bpf_map_type map_type, const char *map_name,
+ struct bpf_map_create_opts *opts, const char *exp_msg)
+{
+ const int key_size = 4, value_size = 4, max_entries = 1;
+ char log_buf[128];
+ int fd;
+ LIBBPF_OPTS(bpf_syscall_common_attr_opts, copts);
+
+ log_buf[0] = '\0';
+ copts.log_buf = log_buf;
+ copts.log_size = sizeof(log_buf);
+ copts.log_level = BPF_LOG_FIXED;
+ opts->common_attr_opts = &copts;
+ fd = bpf_map_create(map_type, map_name, key_size, value_size, max_entries, opts);
+ if (!ASSERT_LT(fd, 0, "bpf_map_create")) {
+ close(fd);
+ return;
+ }
+
+ ASSERT_STREQ(log_buf, exp_msg, "log_buf");
+ ASSERT_EQ(copts.log_true_size, strlen(exp_msg) + 1, "log_true_size");
+}
+
+static void test_map_create_array(struct bpf_map_create_opts *opts, const char *exp_msg)
+{
+ test_map_create(BPF_MAP_TYPE_ARRAY, "test_map_create", opts, exp_msg);
+}
+
+static void test_invalid_vmlinux_value_type_id_struct_ops(void)
+{
+ const char *msg = "btf_vmlinux_value_type_id can only be used with struct_ops maps.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .btf_vmlinux_value_type_id = 1,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_invalid_vmlinux_value_type_id_kv_type_id(void)
+{
+ const char *msg = "btf_vmlinux_value_type_id is mutually exclusive with btf_key_type_id and btf_value_type_id.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .btf_vmlinux_value_type_id = 1,
+ .btf_key_type_id = 1,
+ );
+
+ test_map_create(BPF_MAP_TYPE_STRUCT_OPS, "test_map_create", &opts, msg);
+}
+
+static void test_invalid_value_type_id(void)
+{
+ const char *msg = "Invalid btf_value_type_id.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .btf_key_type_id = 1,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_invalid_map_extra(void)
+{
+ const char *msg = "Invalid map_extra.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .map_extra = 1,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_invalid_numa_node(void)
+{
+ const char *msg = "Invalid numa_node.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .map_flags = BPF_F_NUMA_NODE,
+ .numa_node = 0xFF,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_invalid_map_type(void)
+{
+ const char *msg = "Invalid map_type.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts);
+
+ test_map_create(__MAX_BPF_MAP_TYPE, "test_map_create", &opts, msg);
+}
+
+static void test_invalid_token_fd(void)
+{
+ const char *msg = "Invalid map_token_fd.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .map_flags = BPF_F_TOKEN_FD,
+ .token_fd = 0xFF,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_invalid_map_name(void)
+{
+ const char *msg = "Invalid map_name.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts);
+
+ test_map_create(BPF_MAP_TYPE_ARRAY, "test-!@#", &opts, msg);
+}
+
+static void test_invalid_btf_fd(void)
+{
+ const char *msg = "Invalid btf_fd.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .btf_fd = -1,
+ .btf_key_type_id = 1,
+ .btf_value_type_id = 1,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_excl_prog_hash_size_1(void)
+{
+ const char *msg = "Invalid excl_prog_hash_size.\n";
+ const char *hash = "DEADCODE";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .excl_prog_hash = hash,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+static void test_excl_prog_hash_size_2(void)
+{
+ const char *msg = "Invalid excl_prog_hash_size.\n";
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .excl_prog_hash_size = 1,
+ );
+
+ test_map_create_array(&opts, msg);
+}
+
+void test_map_create_failure(void)
+{
+ if (test__start_subtest("invalid_vmlinux_value_type_id_struct_ops"))
+ test_invalid_vmlinux_value_type_id_struct_ops();
+ if (test__start_subtest("invalid_vmlinux_value_type_id_kv_type_id"))
+ test_invalid_vmlinux_value_type_id_kv_type_id();
+ if (test__start_subtest("invalid_value_type_id"))
+ test_invalid_value_type_id();
+ if (test__start_subtest("invalid_map_extra"))
+ test_invalid_map_extra();
+ if (test__start_subtest("invalid_numa_node"))
+ test_invalid_numa_node();
+ if (test__start_subtest("invalid_map_type"))
+ test_invalid_map_type();
+ if (test__start_subtest("invalid_token_fd"))
+ test_invalid_token_fd();
+ if (test__start_subtest("invalid_map_name"))
+ test_invalid_map_name();
+ if (test__start_subtest("invalid_btf_fd"))
+ test_invalid_btf_fd();
+ if (test__start_subtest("invalid_excl_prog_hash_size_1"))
+ test_excl_prog_hash_size_1();
+ if (test__start_subtest("invalid_excl_prog_hash_size_2"))
+ test_excl_prog_hash_size_2();
+}
--
2.52.0
^ permalink raw reply related
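
The selftest above checks two things per failure: the exact message in
log_buf, and that log_true_size equals strlen(exp_msg) + 1, i.e. the message
length including the terminating NUL. The toy below sketches that size
convention in plain C. It is an assumption-laden model: toy_log_fixed() is
hypothetical, and the truncation behaviour for a too-small buffer is a guess
for illustration, not something the patch shows.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of a fixed-size log buffer: copy as much of msg
 * as fits into log_buf (always NUL-terminated when log_size > 0) and
 * return the "true size", meaning the bytes needed to hold the whole
 * message including its terminating NUL.
 */
static size_t toy_log_fixed(char *log_buf, size_t log_size, const char *msg)
{
	size_t true_size = strlen(msg) + 1;

	if (log_size > 0) {
		/* Bytes of message text that fit, leaving room for NUL. */
		size_t n = true_size <= log_size ? true_size - 1
						 : log_size - 1;

		memcpy(log_buf, msg, n);
		log_buf[n] = '\0';
	}

	return true_size;
}
```

With a 128-byte buffer, as in test_map_create(), every expected message in
the selftest fits, so the copied string and the returned size both match
the expectations the test asserts.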