Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v14 0/8] bpf: Extend BPF syscall with common attributes support
From: patchwork-bot+netdevbpf @ 2026-05-12 19:50 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
	song, yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, brauner,
	sforshee, yuichtsu, aalbersh, willemb, kerneljasonxing,
	chen.dylane, yatsenko, memxor, a.s.protopopov, ameryhung, rongtao,
	linux-kernel, linux-api, linux-kselftest, kernel-patches-bot
In-Reply-To: <20260512153157.28382-1-leon.hwang@linux.dev>

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 12 May 2026 23:31:49 +0800 you wrote:
> This patch series builds upon the discussion in
> "[PATCH bpf-next v4 0/4] bpf: Improve error reporting for freplace attachment failure" [1].
> 
> This patch series introduces support for *common attributes* in the BPF
> syscall, providing a unified mechanism for passing shared metadata across
> all BPF commands, initially used by BPF_PROG_LOAD, BPF_BTF_LOAD, and
> BPF_MAP_CREATE.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v14,1/8] bpf: Extend BPF syscall with common attributes support
    https://git.kernel.org/bpf/bpf-next/c/f28771c0691b
  - [bpf-next,v14,2/8] libbpf: Add support for extended BPF syscall
    https://git.kernel.org/bpf/bpf-next/c/b1bff4080942
  - [bpf-next,v14,3/8] bpf: Refactor reporting log_true_size for prog_load
    https://git.kernel.org/bpf/bpf-next/c/503c039ffeca
  - [bpf-next,v14,4/8] bpf: Add syscall common attributes support for prog_load
    https://git.kernel.org/bpf/bpf-next/c/ac89d33fdd81
  - [bpf-next,v14,5/8] bpf: Add syscall common attributes support for btf_load
    https://git.kernel.org/bpf/bpf-next/c/ceeb7eda94a3
  - [bpf-next,v14,6/8] bpf: Add syscall common attributes support for map_create
    https://git.kernel.org/bpf/bpf-next/c/49f9b2b2a18c
  - [bpf-next,v14,7/8] libbpf: Add syscall common attributes support for map_create
    https://git.kernel.org/bpf/bpf-next/c/702259006f93
  - [bpf-next,v14,8/8] selftests/bpf: Add tests to verify map create failure log
    https://git.kernel.org/bpf/bpf-next/c/f675483cac1d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Ignat Korchagin @ 2026-05-12 21:18 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Kamran Khan, Jeff Barnes, Andy Lutomirski,
	linux-crypto@vger.kernel.org, Herbert Xu,
	linux-doc@vger.kernel.org, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Linus Torvalds
In-Reply-To: <20260511213829.GA316710@google.com>

On Mon, May 11, 2026 at 10:38 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Mon, May 11, 2026 at 10:03:21PM +0100, Ignat Korchagin wrote:
> > I don't think fully discounting hardware offloading is beneficial here. HW
> > accelerators will be produced and without a common interface vendors would
> > start implementing their own "bespoke" drivers with bespoke userspace
> > interfaces (we already had such proposals), which in turn may introduce more
> > attack surface. Yes, AF_ALG needs substantial improvement, but at least it
> > can be a standardisation point.
>
> That isn't the best way to accelerate symmetric crypto anymore though,
> if it ever was.  This has been known for a long time.
>
> > > In any case, any hypothetical security benefit provided by AF_ALG would
> > > have to be *very high* to outweigh the continuous stream of
> > > vulnerabilities in it.  I understand that people using AF_ALG might not
> > > be familiar with that continuous stream of vulnerabilities, but it would
> >
> >
> > Is it actually that much compared to other features/subsystems, like eBPF or
> > user namespaces? But we don't rush to deprecate those - instead trying to
> > harden them and come up with better design.
>
> There are plenty of other kernel features with a large attack surface,
> of course.  But they tend to be much more useful than AF_ALG.  It's all
> about weighing benefits vs. risks.

If divide number of CVEs in such systems on imaginary units of
usefulness, I think the ratio is similar.

> When we get the point where a large number of Linux users *had* to
> disable AF_ALG as an emergency vulnerability response, and at the same
> time their systems weren't even using AF_ALG so nothing even broke and
> they could have just done that to begin with, I think we get a very

Well, there were: cryptsetup, RHEL fips check, so there are some...

> clear idea of which side is heavier for AF_ALG in the real world.

Same thing could be said for unprivileged user namespaces - distros
even put a custom sysctl to restrict it and no-one noticed.

> The main relevance of AF_ALG to the Linux community is that it allows
> their systems to be exploited.

To be clear I'm not arguing for the current AF_ALG implementation. I
agree, the splice zero-copy is... suboptimal (to be soft) and is
actually not-so-zero copy. But I think it was just added before we had
more modern approaches like io_uring (have their own can of worms, but
hey - people adopt it fast).

But I advocate for the usefulness of the concept itself - kernel/OS
providing crypto services to userspace. As mentioned in other threads,
other operating systems have it and Linux lags behind. There are use
cases: common interface for HW accelerators, embedded systems, which
don't have the space to bring a userspace lib etc. Even non-technical:
there are environments that just don't want to rely on third-party
userspace libraries like OpenSSL purely for licensing reasons. And I
agree, that it is hard to do it right, but we can piggy-back on other
subsystems (such as io_uring mentioned or other ideas).

> - Eric
>

Ignat

^ permalink raw reply

* Re: [PATCH bpf-next v14 2/8] libbpf: Add support for extended BPF syscall
From: Leon Hwang @ 2026-05-13  2:10 UTC (permalink / raw)
  To: bot+bpf-ci, bpf
  Cc: ast, daniel, john.fastabend, andrii, martin.lau, eddyz87, song,
	yonghong.song, kpsingh, sdf, haoluo, jolsa, shuah, brauner,
	sforshee, yuichtsu, aalbersh, willemb, kerneljasonxing,
	chen.dylane, yatsenko, memxor, a.s.protopopov, ameryhung, rongtao,
	linux-kernel, linux-api, linux-kselftest, kernel-patches-bot,
	martin.lau, clm, ihor.solodrai
In-Reply-To: <44819125e8a416353eb3e314e7804d2c762ea87331f98f1396eaa8b86421d950@mail.kernel.org>

On 13/5/26 00:23, bot+bpf-ci@kernel.org wrote:
>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>> index 5846de364209..9d8740761b7a 100644
>> --- a/tools/lib/bpf/bpf.c
>> +++ b/tools/lib/bpf/bpf.c
> 
> [ ... ]
> 
>> +int probe_sys_bpf_ext(void)
>> +{
>> +	const size_t attr_sz = offsetofend(union bpf_attr, prog_token_fd);
>> +	union bpf_attr attr;
>> +	int fd;
>> +
>> +	memset(&attr, 0, attr_sz);
>> +	fd = syscall(__NR_bpf, BPF_PROG_LOAD | BPF_COMMON_ATTRS, &attr, attr_sz, NULL,
>> +		     sizeof(struct bpf_common_attr));
>> +	if (fd >= 0) {
>> +		close(fd);
>> +		return -EINVAL;
>                         ^^^^^^
> 
> Should probe_sys_bpf_ext() return 1 instead of -EINVAL when fd >= 0?
> 
> The probe function detects kernel support for extended BPF syscall
> attributes by calling the BPF syscall with the BPF_COMMON_ATTRS flag
> and a NULL pointer for common attributes.
> 
> The expected outcomes are:
> - New kernel with support: errno=EFAULT (can't copy from NULL) -> return 1
> - Old kernel without support: errno!=EFAULT (invalid command) -> return 0
> 
> If the syscall succeeds (fd >= 0), that definitively proves the kernel
> accepts the extended syscall format and supports BPF_COMMON_ATTRS. The
> function should return 1 (supported) rather than -EINVAL (error).
> 
> Returning -EINVAL causes feat_supported() in features.c to print a
> warning "Detection of kernel BPF syscall common attributes support
> support failed: Invalid argument" and mark the feature as MISSING,
> incorrectly reporting that the kernel doesn't support the feature when
> the successful syscall proves it does.
> 

Should not return 1.

This return was discussed in v5 and v6.

Cite Andrii's review in v5:

But fd should really not be >= 0, and if it is -- it's some problem,
so I'd return an error in that case to keep us aware, which is why I'm
saying I'd just return inside if (fd >= 0) { }

Thanks,
Leon

>> +	}
>> +	return errno == EFAULT ? 1 : 0;
>> +}
> 
> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25745661172


^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Jeff Barnes @ 2026-05-13 14:29 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Eric Biggers, Kamran Khan, Andy Lutomirski,
	linux-crypto@vger.kernel.org, Herbert Xu,
	linux-doc@vger.kernel.org, linux-api@vger.kernel.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Linus Torvalds
In-Reply-To: <CAOs+rJUA+bz6Y2GKioHnFGFKX_uAP+4LaPRs=ZDgRQoUi4mWkg@mail.gmail.com>



On May 12 2026, at 5:18 pm, Ignat Korchagin <ignat@linux.win> wrote:

> On Mon, May 11, 2026 at 10:38 PM Eric Biggers <ebiggers@kernel.org> wrote:
>>  
>> On Mon, May 11, 2026 at 10:03:21PM +0100, Ignat Korchagin wrote:
>> > I don't think fully discounting hardware offloading is beneficial
>> here. HW
>> > accelerators will be produced and without a common interface
>> vendors would
>> > start implementing their own "bespoke" drivers with bespoke userspace
>> > interfaces (we already had such proposals), which in turn may
>> introduce more
>> > attack surface. Yes, AF_ALG needs substantial improvement, but at
>> least it
>> > can be a standardisation point.
>>  
>> That isn't the best way to accelerate symmetric crypto anymore though,
>> if it ever was.  This has been known for a long time.
>>  
>> > > In any case, any hypothetical security benefit provided by AF_ALG would
>> > > have to be *very high* to outweigh the continuous stream of
>> > > vulnerabilities in it.  I understand that people using AF_ALG
>> might not
>> > > be familiar with that continuous stream of vulnerabilities, but
>> it would
>> >
>> >
>> > Is it actually that much compared to other features/subsystems,
>> like eBPF or
>> > user namespaces? But we don't rush to deprecate those - instead
>> trying to
>> > harden them and come up with better design.
>>  
>> There are plenty of other kernel features with a large attack surface,
>> of course.  But they tend to be much more useful than AF_ALG.  It's all
>> about weighing benefits vs. risks.
>  
> If divide number of CVEs in such systems on imaginary units of
> usefulness, I think the ratio is similar.
>  
>> When we get the point where a large number of Linux users *had* to
>> disable AF_ALG as an emergency vulnerability response, and at the same
>> time their systems weren't even using AF_ALG so nothing even broke and
>> they could have just done that to begin with, I think we get a very
>  
> Well, there were: cryptsetup, RHEL fips check, so there are some...

cryptsetup does not have a hard dependency on AF_ALG.
It is a potential consumer via AF_ALG.

AF_ALG provides a broad, hard-to-control interface
cryptsetup (and similar tools) are not blockers

AF_ALG removal does not necessarily break cryptsetup usage. Removal does
improve FIPS boundary clarity.


>  
>> clear idea of which side is heavier for AF_ALG in the real world.
>  
> Same thing could be said for unprivileged user namespaces - distros
> even put a custom sysctl to restrict it and no-one noticed.
>  
>> The main relevance of AF_ALG to the Linux community is that it allows
>> their systems to be exploited.
>  
> To be clear I'm not arguing for the current AF_ALG implementation. I
> agree, the splice zero-copy is... suboptimal (to be soft) and is
> actually not-so-zero copy. But I think it was just added before we had
> more modern approaches like io_uring (have their own can of worms, but
> hey - people adopt it fast).
>  
> But I advocate for the usefulness of the concept itself - kernel/OS
> providing crypto services to userspace. As mentioned in other threads,
> other operating systems have it and Linux lags behind. There are use
> cases: common interface for HW accelerators, embedded systems, which
> don't have the space to bring a userspace lib etc. Even non-technical:
> there are environments that just don't want to rely on third-party
> userspace libraries like OpenSSL purely for licensing reasons. And I
> agree, that it is hard to do it right, but we can piggy-back on other
> subsystems (such as io_uring mentioned or other ideas).
>  
>> - Eric
>>  
>  
> Ignat
>  
Jeff

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Jori Koolstra @ 2026-05-15 10:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Aleksa Sarai, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Alexander Viro, Arnd Bergmann,
	H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, cmirabil, Greg Kroah-Hartman, Jeff Layton,
	linux-kernel, linux-fsdevel, linux-api, linux-arch
In-Reply-To: <20260511-hochdekoriert-neoliberale-f7a2922bc57c@brauner>


> Op 11-05-2026 14:00 CEST schreef Christian Brauner <brauner@kernel.org>:
> 
> mkdirat2() is objectively the worse api. It forces userspace to use a
> separate system call without any reason whatsoever. If you can to
> O_CREAT you should also be able to to O_DIRECTORY in the same system
> call. If we support O_DIRECTORY | O_CREAT we get all the lookup
> restriction niceties RESOLVE_* for free. Plus, it is supportable both in
> openat() and openat2() because I made that combo return an errno.

I don't disagree. I know that some of the UAPI feature requests are not fully
flashed out, but at least it gives a basis to get the discussion going.

In fact I already have a O_DIRECTORY | O_CREAT patch that at least passes
some initial tests. However, I need to sit on it a little bit to think whether
I am not leaving something out. Also, I understand why vfs_create() wasn't used
in the O_CREAT path, for instance because you cannot just make use of may_create_dentry()
there. But now that we are going

> 
> UAPI design often is a nasty mix of performance (context switches),
> separation of concerns and privileges, tastefulness, and compromises you
> never thought or wanted to make.
> 
> I think here it is pretty clear that O_DIRECTORY | O_CREAT is the right
> thing to do. Instead of restructuring a bunch of codepaths so it can be
> plumbed through to the filesystems we just reuse the existing codepaths
> that give us the right context for free.
> 
> And during LSFMM the VFS maintains all agreed to proceed with
> O_DIRECTORY | O_CREAT.

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Jori Koolstra @ 2026-05-15 10:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Aleksa Sarai, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Alexander Viro, Arnd Bergmann,
	H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, cmirabil, Greg Kroah-Hartman, Jeff Layton,
	linux-kernel, linux-fsdevel, linux-api, linux-arch
In-Reply-To: <20260511-hochdekoriert-neoliberale-f7a2922bc57c@brauner>

Sorry for the double email, this keyboard is so finicky, I really need to fix it.

> Op 11-05-2026 14:00 CEST schreef Christian Brauner <brauner@kernel.org>:
> 
> mkdirat2() is objectively the worse api. It forces userspace to use a
> separate system call without any reason whatsoever. If you can to
> O_CREAT you should also be able to to O_DIRECTORY in the same system
> call. If we support O_DIRECTORY | O_CREAT we get all the lookup
> restriction niceties RESOLVE_* for free. Plus, it is supportable both in
> openat() and openat2() because I made that combo return an errno.

I don't disagree. I know that some of the UAPI feature requests are not fully
flashed out, but at least it gives a basis to get the discussion going.

In fact I already have a O_DIRECTORY | O_CREAT patch that at least passes
the initial tests. However, I need to sit on it a little bit to think whether
I am not leaving something out. Also, I understand why vfs_create() wasn't used
in the O_CREAT path, for instance because you cannot just make use of may_create_dentry()
there. But now that we are going to string another path through lookup_open() it
would be great if we could reuse some of the logic from vfs_create() and vfs_mkdir().

Perhaps we could move may_create_dentry() out of the vfs_* calls and let the caller
take care of that. Then again, this is the pattern for all those calls. You could also
just accept some redundancies with may_o_create(), or have something like
static vfs_mkdir/create_common() functions.

There are also some minor things. If i_op->mkdir is missing this is an EPERM, but with
i_op->create it is EACCESS (and suggesting ENOSYS). Should this not be a consistent error
code? I also wonder whether there is a nicer way to handle error being returned from
vfs_mkdir et al. If I am reading

	if (!error) {
		dentry = vfs_mkdir(mnt_idmap(path.mnt), path.dentry->d_inode,
				   dentry, mode, &delegated_inode);
		if (IS_ERR(dentry))
			error = PTR_ERR(dentry);
	}
	end_creating_path(&path, dentry);

it feels like there is a missing return inside the if (IS_ERR(dentry)) block, and I
have to go several function deep to see that end_creating_path correctly deals with
error values being passed instead of a dentry. Then again, probably not worth the
churn...

> 
> UAPI design often is a nasty mix of performance (context switches),
> separation of concerns and privileges, tastefulness, and compromises you
> never thought or wanted to make.
> 

Yes, thanks for suggestion this back at FOSDEM. It is quite fun, and lots
to learn :)

> 
> And during LSFMM the VFS maintains all agreed to proceed with
> O_DIRECTORY | O_CREAT.

Ah good, I didn't know that. I haven't build up the street cred to attend,
maybe next time :)

Thanks,
Jori.

^ permalink raw reply

* Re: [RFC PATCH v2 1/2] vfs: syscalls: add mkdirat2() that returns an O_DIRECTORY fd
From: Christian Brauner @ 2026-05-15 13:49 UTC (permalink / raw)
  To: Jori Koolstra
  Cc: Aleksa Sarai, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, Alexander Viro, Arnd Bergmann,
	H . Peter Anvin, Jan Kara, Peter Zijlstra, Andrey Albershteyn,
	Masami Hiramatsu, Jiri Olsa, Thomas Weißschuh,
	Mathieu Desnoyers, cmirabil, Greg Kroah-Hartman, Jeff Layton,
	linux-kernel, linux-fsdevel, linux-api, linux-arch
In-Reply-To: <1805140959.1362290.1778842523425@kpc.webmail.kpnmail.nl>

On Fri, May 15, 2026 at 12:55:23PM +0200, Jori Koolstra wrote:
> Sorry for the double email, this keyboard is so finicky, I really need to fix it.
> 
> > Op 11-05-2026 14:00 CEST schreef Christian Brauner <brauner@kernel.org>:
> > 
> > mkdirat2() is objectively the worse api. It forces userspace to use a
> > separate system call without any reason whatsoever. If you can to
> > O_CREAT you should also be able to to O_DIRECTORY in the same system
> > call. If we support O_DIRECTORY | O_CREAT we get all the lookup
> > restriction niceties RESOLVE_* for free. Plus, it is supportable both in
> > openat() and openat2() because I made that combo return an errno.
> 
> I don't disagree. I know that some of the UAPI feature requests are not fully
> flashed out, but at least it gives a basis to get the discussion going.
> 
> In fact I already have a O_DIRECTORY | O_CREAT patch that at least passes
> the initial tests. However, I need to sit on it a little bit to think whether
> I am not leaving something out. Also, I understand why vfs_create() wasn't used
> in the O_CREAT path, for instance because you cannot just make use of may_create_dentry()
> there. But now that we are going to string another path through lookup_open() it
> would be great if we could reuse some of the logic from vfs_create() and vfs_mkdir().
> 
> Perhaps we could move may_create_dentry() out of the vfs_* calls and let the caller
> take care of that. Then again, this is the pattern for all those calls. You could also
> just accept some redundancies with may_o_create(), or have something like
> static vfs_mkdir/create_common() functions.
> 
> There are also some minor things. If i_op->mkdir is missing this is an EPERM, but with
> i_op->create it is EACCESS (and suggesting ENOSYS). Should this not be a consistent error

ENOSYS is worse because it would indicate to userspace that the whole
system call they're using isn't available. So we can't ever use that.

This is probably just historical baggage. EACCES is especially wrong
because userspace would expect that particular error code to stem from
an LSM.

In both cases the correct error code would be EOPNOTSUPP and we should
aim to be very consistent in such cases and only reserve this for
missing support for a specific functionality.

I would be very surprised if userspace depended on EACCES being returned
though. So I would say let's try and see whether we can correct this and
return EOPNOTSUPP. The amount of filesystems that have no ->create
should be dwarved by those that do.

> code? I also wonder whether there is a nicer way to handle error being returned from
> vfs_mkdir et al. If I am reading
> 
> 	if (!error) {
> 		dentry = vfs_mkdir(mnt_idmap(path.mnt), path.dentry->d_inode,
> 				   dentry, mode, &delegated_inode);
> 		if (IS_ERR(dentry))
> 			error = PTR_ERR(dentry);
> 	}
> 	end_creating_path(&path, dentry);
> 
> it feels like there is a missing return inside the if (IS_ERR(dentry)) block, and I
> have to go several function deep to see that end_creating_path correctly deals with
> error values being passed instead of a dentry. Then again, probably not worth the
> churn...

Not without it being a major clarity win at least.

> > UAPI design often is a nasty mix of performance (context switches),
> > separation of concerns and privileges, tastefulness, and compromises you
> > never thought or wanted to make.
> > 
> 
> Yes, thanks for suggestion this back at FOSDEM. It is quite fun, and lots
> to learn :)

Thanks for working on this. It's helpful.

^ permalink raw reply

* Re: [PATCH v14 00/15] Exposing case folding behavior
From: Cedric Blancher @ 2026-05-16  6:43 UTC (permalink / raw)
  To: Christian Brauner, Chuck Lever
  Cc: Chuck Lever, linux-fsdevel, linux-ext4, linux-xfs, linux-cifs,
	linux-nfs, linux-api, linux-f2fs-devel, hirofumi, linkinjeon,
	sj1557.seo, yuezhang.mo, almaz.alexandrovich, slava, glaubitz,
	frank.li, tytso, adilger.kernel, cem, sfrench, pc, ronniesahlberg,
	sprasad, trondmy, anna, jaegeuk, chao, hansg, senozhatsky,
	Darrick J. Wong, Roland Mainz, Steve French
In-Reply-To: <20260511-wertverlust-vorbringen-070f016f3bd4@brauner>

On Mon, 11 May 2026 at 16:11, Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, 07 May 2026 04:52:53 -0400, Chuck Lever wrote:
> > Christian, let's lock this one in. I will post subsequent changes
> > as delta patches.
> >
> > Following on from:
> >
> > https://lore.kernel.org/linux-nfs/20251021-zypressen-bazillus-545a44af57fd@brauner/T/#m0ba197d75b7921d994cf284f3cef3a62abb11aaa
> >
> > [...]
>
> Applied to the vfs-7.2.exportfs branch of the vfs/vfs.git tree.
> Patches in the vfs-7.2.exportfs branch should appear in linux-next soon.

@Chuck Lever Thank you!

Does that mean the support for case-insensitive filesystems will work
with Linux 7.2?

Ced
-- 
Cedric Blancher <cedric.blancher@gmail.com>
[https://plus.google.com/u/0/+CedricBlancher/]
Institute Pasteur

^ permalink raw reply

* Re: [PATCH v14 00/15] Exposing case folding behavior
From: Chuck Lever @ 2026-05-16 14:48 UTC (permalink / raw)
  To: Cedric Blancher, Christian Brauner
  Cc: Chuck Lever, linux-fsdevel, linux-ext4, linux-xfs, linux-cifs,
	linux-nfs, linux-api, linux-f2fs-devel, hirofumi, linkinjeon,
	sj1557.seo, yuezhang.mo, almaz.alexandrovich, slava, glaubitz,
	frank.li, tytso, adilger.kernel, cem, sfrench, pc, ronniesahlberg,
	sprasad, trondmy, anna, jaegeuk, chao, hansg, senozhatsky,
	Darrick J. Wong, Roland Mainz, Steve French
In-Reply-To: <CALXu0UdsurG-ayuYViqs0HXOfgyDw8gpNC+f=5y59cuuSPUbBA@mail.gmail.com>

On 5/16/26 2:43 AM, Cedric Blancher wrote:
> On Mon, 11 May 2026 at 16:11, Christian Brauner <brauner@kernel.org> wrote:
>>
>> On Thu, 07 May 2026 04:52:53 -0400, Chuck Lever wrote:
>>> Christian, let's lock this one in. I will post subsequent changes
>>> as delta patches.
>>>
>>> Following on from:
>>>
>>> https://lore.kernel.org/linux-nfs/20251021-zypressen-bazillus-545a44af57fd@brauner/T/#m0ba197d75b7921d994cf284f3cef3a62abb11aaa
>>>
>>> [...]
>>
>> Applied to the vfs-7.2.exportfs branch of the vfs/vfs.git tree.
>> Patches in the vfs-7.2.exportfs branch should appear in linux-next soon.
> 
> @Chuck Lever Thank you!
> 
> Does that mean the support for case-insensitive filesystems will work
> with Linux 7.2?

I don't want to make claims with 100% certainty, but we expect this
series to get merged into 7.2. It's early days, so there are likely to
be bugs -- there is so much subtle behavior under the covers.


-- 
Chuck Lever

^ permalink raw reply

* [PATCH 0/6] sched: Add support for long task name
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida

Hi folks,

This is a fresh re-spin of the work done by Bhupesh at 
https://lore.kernel.org/lkml/20250821102152.323367-1-bhupesh@igalia.com/

* Use case

When debugging and tracing complex programs with hundreds of threads, 16
long thread names are not enough anymore. cmd_line can show a lot of
characters, but it's not affected by pthread_setname_np() or
prctl(PR_SET_NAME), so let's give the same love kthreads got with commit
6b59808bfe48 ("workqueue: Show the latest workqueue name in 
/proc/PID/{comm,stat,status}"). This work creates a new
PR_{SET,GET}_EXT_NAME that supports 64 byte long names.

* Patchset

Patch 1 is just a minor comment update.

Patch 2 and 3 do some prep work in order to avoid buffer overflows around
the kernel, now that current->comm is bigger. It also make sure that if
the destination buffer is smaller than TASK_COMM_EXT_LEN, it will
be NUL-terminated.

Patch 4 sets current->comm length to TASK_COMM_EXT_LEN and take care of
making sure that current userspace APIs gets only TASK_COMM_LEN.

Patch 5 creates new prctl() to set and get all the TASK_COMM_EXT_LEN bytes.

Patch 6 adapts the existing selftest for this new interface.

* Testing

selftests/prctl/set-process-name.c survives this patchset, and it was extended
to the new interface. Care was taken to make sure the old interfaces still
return 16 bytes, to avoid buffer overflow.

This patchset also survived some basic trace-cmd tests, but any advise or
how to stress even more all those string copies is very welcomed.

* Changes

Since Bhupesh's v8:
 - Truncate userspace return to 16 bytes for old interfaces (PR_GET_NAME,
   /proc/PID/comm/)
 - Replace __cstr_array_copy() with new strtostr()
 - Add new interface prctl(PR_{SET,GET}_EXT_NAME)
 - Adapt selftest to this patchset

---
André Almeida (6):
      sched: Update get_task_comm() comment
      treewide: Get rid of get_task_comm()
      string: Introduce strtostr() for safe and performance string copies
      sched: Extend task command name to 64 bytes
      prctl: Add support for long user thread names
      selftests: prctl: Add test for long thread names

 drivers/connector/cn_proc.c                        |  2 +-
 drivers/dma-buf/sw_sync.c                          |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c            |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c             |  4 +--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c       |  2 +-
 drivers/gpu/drm/lima/lima_ctx.c                    |  2 +-
 drivers/gpu/drm/panfrost/panfrost_gem.c            |  2 +-
 drivers/gpu/drm/panthor/panthor_gem.c              |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c            |  2 +-
 drivers/gpu/drm/virtio/virtgpu_ioctl.c             |  2 +-
 drivers/hwtracing/stm/core.c                       |  2 +-
 drivers/tty/tty_audit.c                            |  2 +-
 fs/binfmt_elf.c                                    |  2 +-
 fs/binfmt_elf_fdpic.c                              |  2 +-
 fs/proc/array.c                                    |  2 +-
 include/linux/coredump.h                           |  2 +-
 include/linux/sched.h                              | 24 ++-------------
 include/linux/string.h                             | 28 +++++++++++++++++
 include/linux/tracepoint.h                         |  4 +--
 include/trace/events/block.h                       | 10 +++---
 include/trace/events/coredump.h                    |  2 +-
 include/trace/events/f2fs.h                        |  4 +--
 include/trace/events/oom.h                         |  2 +-
 include/trace/events/osnoise.h                     |  2 +-
 include/trace/events/sched.h                       | 10 +++---
 include/trace/events/signal.h                      |  2 +-
 include/trace/events/task.h                        |  4 +--
 include/uapi/linux/prctl.h                         |  3 ++
 kernel/audit.c                                     |  6 ++--
 kernel/auditsc.c                                   |  6 ++--
 kernel/printk/nbcon.c                              |  2 +-
 kernel/printk/printk.c                             |  4 +--
 kernel/sys.c                                       | 23 +++++++++++---
 net/bluetooth/hci_sock.c                           |  2 +-
 net/netfilter/nf_tables_api.c                      |  2 +-
 security/integrity/integrity_audit.c               |  3 +-
 security/ipe/audit.c                               |  2 +-
 security/landlock/domain.c                         |  2 +-
 security/lsm_audit.c                               |  7 +++--
 tools/bpf/bpftool/pids.c                           |  4 +--
 .../selftests/bpf/test_kmods/bpf_testmod-events.h  |  2 +-
 tools/testing/selftests/prctl/set-process-name.c   | 36 ++++++++++++++++++++++
 45 files changed, 152 insertions(+), 84 deletions(-)
---
base-commit: 5d6919055dec134de3c40167a490f33c74c12581
change-id: 20260516-tonyk-long_name-b9f345aeb041

Best regards,
--  
André Almeida <andrealmeid@igalia.com>


^ permalink raw reply

* [PATCH 1/6] sched: Update get_task_comm() comment
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Since commit 3a3f61ce5e0b ("exec: Make sure task->comm is always
NUL-terminated"), __set_task_comm() no longer uses strscpy_pad(). Update
the stale comment accordingly.

Co-developed-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..60d004a49a27 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2005,7 +2005,7 @@ extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec
  *   User space can randomly change their names anyway, so locking for readers
  *   doesn't make sense. For writers, locking is probably necessary, as a race
  *   condition could lead to long-term mixed results.
- *   The strscpy_pad() in __set_task_comm() can ensure that the task comm is
+ *   The logic inside __set_task_comm() ensures that the task comm is
  *   always NUL-terminated and zero-padded. Therefore the race condition between
  *   reader and writer is not an issue.
  *

-- 
2.54.0


^ permalink raw reply related

* [PATCH 4/6] sched: Extend task command name to 64 bytes
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Command name has been restrict to only 16 bytes, which is too limiting,
specially when debugging and tracing complex software with thousands of
threads and the need to differentiate them.

Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
long names for userspace threads as well.

To avoid buffer overflows, cap all existing userspace APIs to
TASK_COMM_LEN, and leave the full extended name for a new interface.

Co-developed-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 fs/proc/array.c       |  2 +-
 include/linux/sched.h |  3 ++-
 kernel/sys.c          | 10 +++++-----
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index c8c3fbd9bfa9..312371eddc7f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 	else if (p->flags & PF_KTHREAD)
 		get_kthread_comm(tcomm, sizeof(tcomm), p);
 	else
-		strscpy_pad(tcomm, p->comm);
+		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
 
 	if (escape)
 		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6de742b1155..f7fd2b7d131d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -323,6 +323,7 @@ struct user_event_mm;
  */
 enum {
 	TASK_COMM_LEN = 16,
+	TASK_COMM_EXT_LEN = 64,
 };
 
 extern void sched_tick(void);
@@ -1167,7 +1168,7 @@ struct task_struct {
 	 * - set it with set_task_comm() to ensure it is always
 	 *   NUL-terminated and zero-padded
 	 */
-	char				comm[TASK_COMM_LEN];
+	char				comm[TASK_COMM_EXT_LEN];
 
 	struct nameidata		*nameidata;
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1d5152d2395e..76d77218ab19 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
-	unsigned char comm[sizeof(me->comm)];
+	unsigned char comm[TASK_COMM_LEN];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			error = -EINVAL;
 		break;
 	case PR_SET_NAME:
-		comm[sizeof(me->comm) - 1] = 0;
+		comm[TASK_COMM_LEN - 1] = 0;
 		if (strncpy_from_user(comm, (char __user *)arg2,
-				      sizeof(me->comm) - 1) < 0)
+				      TASK_COMM_LEN - 1) < 0)
 			return -EFAULT;
 		set_task_comm(me, comm);
 		proc_comm_connector(me);
 		break;
 	case PR_GET_NAME:
-		strscpy_pad(comm, me->comm);
-		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
+		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
+		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))
 			return -EFAULT;
 		break;
 	case PR_GET_ENDIAN:

-- 
2.54.0


^ permalink raw reply related

* [PATCH 2/6] treewide: Get rid of get_task_comm()
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Since commit 4cc0473d7754 ("get rid of __get_task_comm()"),
get_task_comm() does just a redundant check for the buffer size and call
strscpy_pad(). Replace get_task_comm() calls with strscpy_pad(), that will
do the right thing if the buffers sizes doesn't match: zero-pad if it's
bigger, and truncate if it's smaller.

Link: https://lore.kernel.org/lkml/CAHk-=wi5c=_-FBGo_88CowJd_F-Gi6Ud9d=TALm65ReN7YjrMw@mail.gmail.com/
Co-developed-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/connector/cn_proc.c                        |  2 +-
 drivers/dma-buf/sw_sync.c                          |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c            |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c             |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c       |  2 +-
 drivers/gpu/drm/lima/lima_ctx.c                    |  2 +-
 drivers/gpu/drm/panfrost/panfrost_gem.c            |  2 +-
 drivers/gpu/drm/panthor/panthor_gem.c              |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c            |  2 +-
 drivers/gpu/drm/virtio/virtgpu_ioctl.c             |  2 +-
 drivers/hwtracing/stm/core.c                       |  2 +-
 drivers/tty/tty_audit.c                            |  2 +-
 fs/binfmt_elf.c                                    |  2 +-
 fs/binfmt_elf_fdpic.c                              |  2 +-
 fs/proc/array.c                                    |  2 +-
 include/linux/sched.h                              | 19 -------------------
 kernel/audit.c                                     |  6 ++++--
 kernel/auditsc.c                                   |  6 ++++--
 kernel/printk/printk.c                             |  2 +-
 kernel/sys.c                                       |  2 +-
 net/bluetooth/hci_sock.c                           |  2 +-
 net/netfilter/nf_tables_api.c                      |  2 +-
 security/integrity/integrity_audit.c               |  3 ++-
 security/ipe/audit.c                               |  2 +-
 security/landlock/domain.c                         |  2 +-
 security/lsm_audit.c                               |  7 ++++---
 29 files changed, 39 insertions(+), 52 deletions(-)

diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
index 0056ab81fbc3..c78243ed3c2a 100644
--- a/drivers/connector/cn_proc.c
+++ b/drivers/connector/cn_proc.c
@@ -278,7 +278,7 @@ void proc_comm_connector(struct task_struct *task)
 	ev->what = PROC_EVENT_COMM;
 	ev->event_data.comm.process_pid  = task->pid;
 	ev->event_data.comm.process_tgid = task->tgid;
-	get_task_comm(ev->event_data.comm.comm, task);
+	strscpy_pad(ev->event_data.comm.comm, task->comm);
 
 	memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
 	msg->ack = 0; /* not used */
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 8df20b0218a9..d501657ad801 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -312,7 +312,7 @@ static int sw_sync_debugfs_open(struct inode *inode, struct file *file)
 	struct sync_timeline *obj;
 	char task_comm[TASK_COMM_LEN];
 
-	get_task_comm(task_comm, current);
+	strscpy_pad(task_comm, current->comm);
 
 	obj = sync_timeline_create(task_comm);
 	if (!obj)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
index 6a364357522b..13c8857e4ffb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
@@ -74,7 +74,7 @@ struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
 	/* This reference gets released in amdkfd_fence_release */
 	mmgrab(mm);
 	fence->mm = mm;
-	get_task_comm(fence->timeline_name, current);
+	strscpy_pad(fence->timeline_name, current->comm);
 	spin_lock_init(&fence->lock);
 	fence->svm_bo = svm_bo;
 	fence->context_id = context_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
index 4c5e38dea4c2..faf0f36d8328 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
@@ -129,7 +129,7 @@ int amdgpu_evf_mgr_rearm(struct amdgpu_eviction_fence_mgr *evf_mgr,
 		return -ENOMEM;
 
 	ev_fence->evf_mgr = evf_mgr;
-	get_task_comm(ev_fence->timeline_name, current);
+	strscpy_pad(ev_fence->timeline_name, current->comm);
 	spin_lock_init(&ev_fence->lock);
 	dma_fence_init64(&ev_fence->base, &amdgpu_eviction_fence_ops,
 			 &ev_fence->lock, evf_mgr->ev_fence_ctx,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6c644cfe6695..c45630457155 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -4419,7 +4419,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
 	}
 
 	con->init_task_pid = task_pid_nr(current);
-	get_task_comm(con->init_task_comm, current);
+	strscpy_pad(con->init_task_comm, current->comm);
 
 	mutex_init(&con->critical_region_lock);
 	INIT_LIST_HEAD(&con->critical_region_head);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
index e2d5f04296e1..8fdc38d8d64d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -85,7 +85,7 @@ int amdgpu_userq_fence_driver_alloc(struct amdgpu_device *adev,
 
 	fence_drv->adev = adev;
 	fence_drv->context = dma_fence_context_alloc(1);
-	get_task_comm(fence_drv->timeline_name, current);
+	strscpy_pad(fence_drv->timeline_name, current->comm);
 
 	*fence_drv_req = fence_drv;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9ba9de16a27a..de80d0ace905 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2571,10 +2571,10 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm)
 		return;
 
 	vm->task_info->task.pid = current->pid;
-	get_task_comm(vm->task_info->task.comm, current);
+	strscpy_pad(vm->task_info->task.comm, current->comm);
 
 	vm->task_info->tgid = current->tgid;
-	get_task_comm(vm->task_info->process_name, current->group_leader);
+	strscpy_pad(vm->task_info->process_name, current->group_leader->comm);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2a241a5b12c4..f8ce59d8587a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -563,7 +563,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
 	}
 
 	vres->task.pid = task_pid_nr(current);
-	get_task_comm(vres->task.comm, current);
+	strscpy_pad(vres->task.comm, current->comm);
 	list_add_tail(&vres->vres_node, &mgr->allocated_vres_list);
 
 	if (bo->flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS && adjust_dcc_size) {
diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
index 68ede7a725e2..e8c5c3601bf1 100644
--- a/drivers/gpu/drm/lima/lima_ctx.c
+++ b/drivers/gpu/drm/lima/lima_ctx.c
@@ -29,7 +29,7 @@ int lima_ctx_create(struct lima_device *dev, struct lima_ctx_mgr *mgr, u32 *id)
 		goto err_out0;
 
 	ctx->pid = task_pid_nr(current);
-	get_task_comm(ctx->pname, current);
+	strscpy_pad(ctx->pname, current->comm);
 
 	return 0;
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 3a7fce428898..11936c4d3573 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -36,7 +36,7 @@ static void panfrost_gem_debugfs_bo_add(struct panfrost_device *pfdev,
 					struct panfrost_gem_object *bo)
 {
 	bo->debugfs.creator.tgid = current->tgid;
-	get_task_comm(bo->debugfs.creator.process_name, current->group_leader);
+	strscpy_pad(bo->debugfs.creator.process_name, current->group_leader->comm);
 
 	mutex_lock(&pfdev->debugfs.gems_lock);
 	list_add_tail(&bo->debugfs.node, &pfdev->debugfs.gems_list);
diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index cd49859da89b..b44fd715c17e 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -46,7 +46,7 @@ static void panthor_gem_debugfs_bo_add(struct panthor_gem_object *bo)
 						    struct panthor_device, base);
 
 	bo->debugfs.creator.tgid = current->tgid;
-	get_task_comm(bo->debugfs.creator.process_name, current->group_leader);
+	strscpy_pad(bo->debugfs.creator.process_name, current->group_leader->comm);
 
 	mutex_lock(&ptdev->gems.lock);
 	list_add_tail(&bo->debugfs.node, &ptdev->gems.node);
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 2fe04d0f0e3a..8ee9de96acf6 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3603,7 +3603,7 @@ static void group_init_task_info(struct panthor_group *group)
 	struct task_struct *task = current->group_leader;
 
 	group->task_info.pid = task->pid;
-	get_task_comm(group->task_info.comm, task);
+	strscpy_pad(group->task_info.comm, task->comm);
 }
 
 static void add_group_kbo_sizes(struct panthor_device *ptdev,
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index c33c057365f8..d2bf221e8f01 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -50,7 +50,7 @@ static void virtio_gpu_create_context_locked(struct virtio_gpu_device *vgdev,
 	} else {
 		char dbgname[TASK_COMM_LEN];
 
-		get_task_comm(dbgname, current);
+		strscpy_pad(dbgname, current->comm);
 		virtio_gpu_cmd_context_create(vgdev, vfpriv->ctx_id,
 					      vfpriv->context_init, strlen(dbgname),
 					      dbgname);
diff --git a/drivers/hwtracing/stm/core.c b/drivers/hwtracing/stm/core.c
index f48c6a8a0654..c7715439964e 100644
--- a/drivers/hwtracing/stm/core.c
+++ b/drivers/hwtracing/stm/core.c
@@ -634,7 +634,7 @@ static ssize_t stm_char_write(struct file *file, const char __user *buf,
 		char comm[sizeof(current->comm)];
 		char *ids[] = { comm, "default", NULL };
 
-		get_task_comm(comm, current);
+		strscpy_pad(comm, current->comm);
 
 		err = stm_assign_first_policy(stmf->stm, &stmf->output, ids, 1);
 		/*
diff --git a/drivers/tty/tty_audit.c b/drivers/tty/tty_audit.c
index d014af6ab060..d514a81d0a5c 100644
--- a/drivers/tty/tty_audit.c
+++ b/drivers/tty/tty_audit.c
@@ -77,7 +77,7 @@ static void tty_audit_log(const char *description, dev_t dev,
 	audit_log_format(ab, "%s pid=%u uid=%u auid=%u ses=%u major=%d minor=%d comm=",
 			 description, pid, uid, loginuid, sessionid,
 			 MAJOR(dev), MINOR(dev));
-	get_task_comm(name, current);
+	strscpy_pad(name, current->comm);
 	audit_log_untrustedstring(ab, name);
 	audit_log_format(ab, " data=");
 	audit_log_n_hex(ab, data, size);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6c..d25922460b63 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1557,7 +1557,7 @@ static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
 	SET_UID(psinfo->pr_uid, from_kuid_munged(cred->user_ns, cred->uid));
 	SET_GID(psinfo->pr_gid, from_kgid_munged(cred->user_ns, cred->gid));
 	rcu_read_unlock();
-	get_task_comm(psinfo->pr_fname, p);
+	strscpy_pad(psinfo->pr_fname, p->comm);
 
 	return 0;
 }
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 7e3108489c83..c4d4e59ff34d 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1371,7 +1371,7 @@ static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
 	SET_UID(psinfo->pr_uid, from_kuid_munged(cred->user_ns, cred->uid));
 	SET_GID(psinfo->pr_gid, from_kgid_munged(cred->user_ns, cred->gid));
 	rcu_read_unlock();
-	get_task_comm(psinfo->pr_fname, p);
+	strscpy_pad(psinfo->pr_fname, p->comm);
 
 	return 0;
 }
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 90fb0c6b5f99..c8c3fbd9bfa9 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 	else if (p->flags & PF_KTHREAD)
 		get_kthread_comm(tcomm, sizeof(tcomm), p);
 	else
-		get_task_comm(tcomm, p);
+		strscpy_pad(tcomm, p->comm);
 
 	if (escape)
 		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 60d004a49a27..b6de742b1155 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2000,25 +2000,6 @@ extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec
 	__set_task_comm(tsk, from, false);		\
 })
 
-/*
- * - Why not use task_lock()?
- *   User space can randomly change their names anyway, so locking for readers
- *   doesn't make sense. For writers, locking is probably necessary, as a race
- *   condition could lead to long-term mixed results.
- *   The logic inside __set_task_comm() ensures that the task comm is
- *   always NUL-terminated and zero-padded. Therefore the race condition between
- *   reader and writer is not an issue.
- *
- * - BUILD_BUG_ON() can help prevent the buf from being truncated.
- *   Since the callers don't perform any return value checks, this safeguard is
- *   necessary.
- */
-#define get_task_comm(buf, tsk) ({			\
-	BUILD_BUG_ON(sizeof(buf) < TASK_COMM_LEN);	\
-	strscpy_pad(buf, (tsk)->comm);			\
-	buf;						\
-})
-
 static __always_inline void scheduler_ipi(void)
 {
 	/*
diff --git a/kernel/audit.c b/kernel/audit.c
index e1d489bc2dff..6fc867adbf3d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1662,7 +1662,8 @@ static void audit_log_multicast(int group, const char *op, int err)
 	audit_put_tty(tty);
 	audit_log_task_context(ab); /* subj= */
 	audit_log_format(ab, " comm=");
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_d_path_exe(ab, current->mm); /* exe= */
 	audit_log_format(ab, " nl-mcgrp=%d op=%s res=%d", group, op, !err);
 	audit_log_end(ab);
@@ -2465,7 +2466,8 @@ void audit_log_task_info(struct audit_buffer *ab)
 			 audit_get_sessionid(current));
 	audit_put_tty(tty);
 	audit_log_format(ab, " comm=");
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_d_path_exe(ab, current->mm);
 	audit_log_task_context(ab);
 }
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index ab54fccba215..8e4f70105a13 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2877,7 +2877,8 @@ void __audit_log_nfcfg(const char *name, u8 af, unsigned int nentries,
 	audit_log_format(ab, " pid=%u", task_tgid_nr(current));
 	audit_log_task_context(ab); /* subj= */
 	audit_log_format(ab, " comm=");
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_end(ab);
 }
 EXPORT_SYMBOL_GPL(__audit_log_nfcfg);
@@ -2900,7 +2901,8 @@ static void audit_log_task(struct audit_buffer *ab)
 			 sessionid);
 	audit_log_task_context(ab);
 	audit_log_format(ab, " pid=%d comm=", task_tgid_nr(current));
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_d_path_exe(ab, current->mm);
 }
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0323149548f6..1f04e753ca02 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2247,7 +2247,7 @@ static u16 printk_sprint(char *text, u16 size, int facility,
 static void printk_store_execution_ctx(struct printk_info *info)
 {
 	info->caller_id2 = printk_caller_id2();
-	get_task_comm(info->comm, current);
+	strscpy_pad(info->comm, current->comm);
 }
 
 static void pmsg_load_execution_ctx(struct printk_message *pmsg,
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..1d5152d2395e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2609,7 +2609,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		proc_comm_connector(me);
 		break;
 	case PR_GET_NAME:
-		get_task_comm(comm, me);
+		strscpy_pad(comm, me->comm);
 		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
 			return -EFAULT;
 		break;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 0290dea081f6..38e16ba2de38 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -106,7 +106,7 @@ static bool hci_sock_gen_cookie(struct sock *sk)
 			id = 0xffffffff;
 
 		hci_pi(sk)->cookie = id;
-		get_task_comm(hci_pi(sk)->comm, current);
+		strscpy_pad(hci_pi(sk)->comm, current->comm);
 		return true;
 	}
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 87387adbca65..d859ffa2874c 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -9711,7 +9711,7 @@ static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net,
 
 	if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_base_seq(net))) ||
 	    nla_put_be32(skb, NFTA_GEN_PROC_PID, htonl(task_pid_nr(current))) ||
-	    nla_put_string(skb, NFTA_GEN_PROC_NAME, get_task_comm(buf, current)))
+	    nla_put_string(skb, NFTA_GEN_PROC_NAME, strscpy_pad(buf, current->comm)))
 		goto nla_put_failure;
 
 	nlmsg_end(skb, nlh);
diff --git a/security/integrity/integrity_audit.c b/security/integrity/integrity_audit.c
index d8d9e5ff1cd2..98060060929d 100644
--- a/security/integrity/integrity_audit.c
+++ b/security/integrity/integrity_audit.c
@@ -54,7 +54,8 @@ void integrity_audit_message(int audit_msgno, struct inode *inode,
 			 audit_get_sessionid(current));
 	audit_log_task_context(ab);
 	audit_log_format(ab, " op=%s cause=%s comm=", op, cause);
-	audit_log_untrustedstring(ab, get_task_comm(name, current));
+	strscpy_pad(name, current->comm);
+	audit_log_untrustedstring(ab, name);
 	if (fname) {
 		audit_log_format(ab, " name=");
 		audit_log_untrustedstring(ab, fname);
diff --git a/security/ipe/audit.c b/security/ipe/audit.c
index 93fb59fbddd6..c04901baed73 100644
--- a/security/ipe/audit.c
+++ b/security/ipe/audit.c
@@ -145,7 +145,7 @@ void ipe_audit_match(const struct ipe_eval_ctx *const ctx,
 	audit_log_format(ab, "ipe_op=%s ipe_hook=%s enforcing=%d pid=%d comm=",
 			 op, audit_hook_names[ctx->hook], READ_ONCE(enforce),
 			 task_tgid_nr(current));
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	audit_log_untrustedstring(ab, strscpy_pad(comm, current->comm));
 
 	if (ctx->file) {
 		audit_log_d_path(ab, " path=", &ctx->file->f_path);
diff --git a/security/landlock/domain.c b/security/landlock/domain.c
index 06b6bd845060..a35a27f523e6 100644
--- a/security/landlock/domain.c
+++ b/security/landlock/domain.c
@@ -101,7 +101,7 @@ static struct landlock_details *get_current_details(void)
 	memcpy(details->exe_path, path_str, path_size);
 	details->pid = get_pid(task_tgid(current));
 	details->uid = from_kuid(&init_user_ns, current_uid());
-	get_task_comm(details->comm, current);
+	strscpy_pad(details->comm, current->comm);
 	return details;
 }
 
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 737f5a263a8f..a587ffecd985 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -276,8 +276,8 @@ void audit_log_lsm_data(struct audit_buffer *ab,
 			if (pid) {
 				char tskcomm[sizeof(tsk->comm)];
 				audit_log_format(ab, " opid=%d ocomm=", pid);
-				audit_log_untrustedstring(ab,
-				    get_task_comm(tskcomm, tsk));
+				strscpy_pad(tskcomm, tsk->comm);
+				audit_log_untrustedstring(ab, tskcomm);
 			}
 		}
 		break;
@@ -417,7 +417,8 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 	char comm[sizeof(current->comm)];
 
 	audit_log_format(ab, " pid=%d comm=", task_tgid_nr(current));
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_lsm_data(ab, a);
 }
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Some parts of the kernel uses memcpy() instead of strscpy() because they
are performance sensitive and doesn't care about the return value of
strscpy(). One such common case is to copy current->comm to a different
buffer.

As the command name is guaranteed to be NUL-terminated in the range of
TASK_COMM_LEN, this is safe enough and doesn't create unterminated
strings. However, in order to expand the size of current->comm, this
expectation will be broken and those memcpy() could create such strings
without trailing NUL byte.

In order to support a fast and safe string copy, create strtostr(), to copy
a NUL-terminated string to a new string buffer. If the destination buffer
is bigger than the source, no pad is applied, but the string is
NUL-terminated. If the destination buffer is smaller, the string is
truncated. The last byte of the destination is always set to NUL for safety.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 include/linux/coredump.h                           |  2 +-
 include/linux/string.h                             | 28 ++++++++++++++++++++++
 include/linux/tracepoint.h                         |  4 ++--
 include/trace/events/block.h                       | 10 ++++----
 include/trace/events/coredump.h                    |  2 +-
 include/trace/events/f2fs.h                        |  4 ++--
 include/trace/events/oom.h                         |  2 +-
 include/trace/events/osnoise.h                     |  2 +-
 include/trace/events/sched.h                       | 10 ++++----
 include/trace/events/signal.h                      |  2 +-
 include/trace/events/task.h                        |  4 ++--
 kernel/printk/nbcon.c                              |  2 +-
 kernel/printk/printk.c                             |  2 +-
 tools/bpf/bpftool/pids.c                           |  4 ++--
 .../selftests/bpf/test_kmods/bpf_testmod-events.h  |  2 +-
 15 files changed, 54 insertions(+), 26 deletions(-)

diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 68861da4cf7c..b370ef69f673 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -54,7 +54,7 @@ extern void vfs_coredump(const kernel_siginfo_t *siginfo);
 	do {	\
 		char comm[TASK_COMM_LEN];	\
 		/* This will always be NUL terminated. */ \
-		memcpy(comm, current->comm, sizeof(comm)); \
+		strtostr(comm, current->comm); \
 		printk_ratelimited(Level "coredump: %d(%*pE): " Format "\n",	\
 			task_tgid_vnr(current), (int)strlen(comm), comm, ##__VA_ARGS__);	\
 	} while (0)	\
diff --git a/include/linux/string.h b/include/linux/string.h
index b850bd91b3d8..ff1f59f4139c 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -445,6 +445,34 @@ void memcpy_and_pad(void *dest, size_t dest_len, const void *src, size_t count,
 	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
 } while (0)
 
+/**
+ * strtostr - Copy NUL-terminanted string to NUL-terminate string
+ *
+ * @dest: Pointer of destination string
+ * @src: Pointer to NUL-terminates string
+ *
+ * This is a replacement for strcpy() where the caller doesn't care about the
+ * return value and if the string is going to be truncated, albeit it needs
+ * to mark sure that it will be NUL-terminated. Intended for performance
+ * sensitive cases, such as tracing.
+ *
+ * If the destination is bigger than the source, no padding happens. It it's
+ * smaller the strings gets truncated.
+ *
+ * Both arguments needs to be arrays with lengths discoverable by the compiler.
+ */
+#define strtostr(dest, src)	do {					\
+	const size_t _dest_len = __must_be_cstr(dest) +			\
+				 ARRAY_SIZE(dest);			\
+	const size_t _src_len = __must_be_cstr(src) +			\
+				__builtin_object_size(src, 1);		\
+									\
+	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
+		     _dest_len == (size_t)-1);				\
+	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
+	dest[_dest_len - 1] = '\0';						\
+} while (0)
+
 /**
  * memtostr - Copy a possibly non-NUL-term string to a NUL-term string
  * @dest: Pointer to destination NUL-terminates string
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 763eea4d80d8..19e3cb4ca487 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -615,10 +615,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  *	*
  *
  *	TP_fast_assign(
- *		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ *		strtostr(__entry->next_comm, next->comm);
  *		__entry->prev_pid	= prev->pid;
  *		__entry->prev_prio	= prev->prio;
- *		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+ *		strtostr(__entry->prev_comm, prev->comm);
  *		__entry->next_pid	= next->pid;
  *		__entry->next_prio	= next->prio;
  *	),
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..779622cadee3 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -213,7 +213,7 @@ DECLARE_EVENT_CLASS(block_rq,
 
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
 		__get_str(cmd)[0] = '\0';
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
@@ -351,7 +351,7 @@ DECLARE_EVENT_CLASS(block_bio,
 		__entry->sector		= bio->bi_iter.bi_sector;
 		__entry->nr_sector	= bio_sectors(bio);
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("%d,%d %s %llu + %u [%s]",
@@ -434,7 +434,7 @@ TRACE_EVENT(block_plug,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("[%s]", __entry->comm)
@@ -453,7 +453,7 @@ DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_fast_assign(
 		__entry->nr_rq = depth;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
@@ -504,7 +504,7 @@ TRACE_EVENT(block_split,
 		__entry->sector		= bio->bi_iter.bi_sector;
 		__entry->new_sector	= new_sector;
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("%d,%d %s %llu / %llu [%s]",
diff --git a/include/trace/events/coredump.h b/include/trace/events/coredump.h
index c7b9c53fc498..581768a122f8 100644
--- a/include/trace/events/coredump.h
+++ b/include/trace/events/coredump.h
@@ -32,7 +32,7 @@ TRACE_EVENT(coredump,
 
 	TP_fast_assign(
 		__entry->sig = sig;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("sig=%d comm=%s",
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index b5188d2671d7..cc1fd1e01541 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -2505,7 +2505,7 @@ TRACE_EVENT(f2fs_lock_elapsed_time,
 
 	TP_fast_assign(
 		__entry->dev		= sbi->sb->s_dev;
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio;
 		__entry->ioprio_class	= IOPRIO_PRIO_CLASS(ioprio);
@@ -2558,7 +2558,7 @@ DECLARE_EVENT_CLASS(f2fs_priority_update,
 
 	TP_fast_assign(
 		__entry->dev		= sbi->sb->s_dev;
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->lock_name	= lock_name;
 		__entry->is_write	= is_write;
diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 9f0a5d1482c4..61b66928de4d 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -23,7 +23,7 @@ TRACE_EVENT(oom_score_adj_update,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
 
diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
index 3f4273623801..26e42fd1a084 100644
--- a/include/trace/events/osnoise.h
+++ b/include/trace/events/osnoise.h
@@ -116,7 +116,7 @@ TRACE_EVENT(thread_noise,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, t->comm);
 		__entry->pid = t->pid;
 		__entry->start = start;
 		__entry->duration = duration;
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 535860581f15..91bc5931e2a3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -152,7 +152,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->target_cpu	= task_cpu(p);
@@ -237,11 +237,11 @@ TRACE_EVENT(sched_switch,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		strtostr(__entry->prev_comm, prev->comm);
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
 		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		strtostr(__entry->next_comm, next->comm);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
 		/* XXX SCHED_DEADLINE */
@@ -346,7 +346,7 @@ TRACE_EVENT(sched_process_exit,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->group_dead	= group_dead;
@@ -787,7 +787,7 @@ TRACE_EVENT(sched_skip_cpuset_numa,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, tsk->comm);
 		__entry->pid		 = task_pid_nr(tsk);
 		__entry->tgid		 = task_tgid_nr(tsk);
 		__entry->ngid		 = task_numa_group_id(tsk);
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 1db7e4b07c01..8759078b0da9 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -67,7 +67,7 @@ TRACE_EVENT(signal_generate,
 	TP_fast_assign(
 		__entry->sig	= sig;
 		TP_STORE_SIGINFO(__entry, info);
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->pid	= task->pid;
 		__entry->group	= group;
 		__entry->result	= result;
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index b9a129eb54d9..8636ead17cd8 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -21,7 +21,7 @@ TRACE_EVENT(task_newtask,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->clone_flags = clone_flags;
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
@@ -46,7 +46,7 @@ TRACE_EVENT(task_rename,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
+		strtostr(entry->oldcomm, task->comm);
 		strscpy(entry->newcomm, comm, TASK_COMM_LEN);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index d7044a7a214b..5b0c54082876 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -952,7 +952,7 @@ static void wctxt_load_execution_ctx(struct nbcon_write_context *wctxt,
 {
 	wctxt->cpu = pmsg->cpu;
 	wctxt->pid = pmsg->pid;
-	memcpy(wctxt->comm, pmsg->comm, sizeof(wctxt->comm));
+	strtostr(wctxt->comm, pmsg->comm);
 	static_assert(sizeof(wctxt->comm) == sizeof(pmsg->comm));
 }
 #else
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1f04e753ca02..231c4d7c3580 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2255,7 +2255,7 @@ static void pmsg_load_execution_ctx(struct printk_message *pmsg,
 {
 	pmsg->cpu = printk_info_get_cpu(info);
 	pmsg->pid = printk_info_get_pid(info);
-	memcpy(pmsg->comm, info->comm, sizeof(pmsg->comm));
+	strtostr(pmsg->comm, info->comm);
 	static_assert(sizeof(pmsg->comm) == sizeof(info->comm));
 }
 #else
diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c
index 23f488cf1740..46b62f65dc43 100644
--- a/tools/bpf/bpftool/pids.c
+++ b/tools/bpf/bpftool/pids.c
@@ -53,7 +53,7 @@ static void add_ref(struct hashmap *map, struct pid_iter_entry *e)
 		refs->refs = tmp;
 		ref = &refs->refs[refs->ref_cnt];
 		ref->pid = e->pid;
-		memcpy(ref->comm, e->comm, sizeof(ref->comm));
+		strtostr(ref->comm, e->comm);
 		ref->comm[sizeof(ref->comm) - 1] = '\0';
 		refs->ref_cnt++;
 
@@ -77,7 +77,7 @@ static void add_ref(struct hashmap *map, struct pid_iter_entry *e)
 	}
 	ref = &refs->refs[0];
 	ref->pid = e->pid;
-	memcpy(ref->comm, e->comm, sizeof(ref->comm));
+	strtostr(ref->comm, e->comm);
 	ref->comm[sizeof(ref->comm) - 1] = '\0';
 	refs->ref_cnt = 1;
 	refs->has_bpf_cookie = e->has_bpf_cookie;
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h b/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h
index 45a5e41f3a92..72c865ccf1b5 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h
@@ -20,7 +20,7 @@ TRACE_EVENT(bpf_testmod_test_read,
 	),
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->off = ctx->off;
 		__entry->len = ctx->len;
 	),

-- 
2.54.0


^ permalink raw reply related

* [PATCH 5/6] prctl: Add support for long user thread names
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Add support for getting and setting long user thread names with
PR_{SET,GET}_EXT_NAME.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 include/linux/sched.h      |  2 +-
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               | 15 ++++++++++++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f7fd2b7d131d..fd4256c8627b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1997,7 +1997,7 @@ extern void kick_process(struct task_struct *tsk);
 
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 #define set_task_comm(tsk, from) ({			\
-	BUILD_BUG_ON(sizeof(from) != TASK_COMM_LEN);	\
+	BUILD_BUG_ON(sizeof(from) < TASK_COMM_LEN);	\
 	__set_task_comm(tsk, from, false);		\
 })
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b6ec6f693719..a07f8edadd65 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -56,6 +56,9 @@
 #define PR_SET_NAME    15		/* Set process name */
 #define PR_GET_NAME    16		/* Get process name */
 
+#define PR_SET_EXT_NAME    17		/* Set extended process name */
+#define PR_GET_EXT_NAME    18		/* Get extended process name */
+
 /* Get/set process endian */
 #define PR_GET_ENDIAN	19
 #define PR_SET_ENDIAN	20
diff --git a/kernel/sys.c b/kernel/sys.c
index 76d77218ab19..1b70d53da998 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
-	unsigned char comm[TASK_COMM_LEN];
+	unsigned char comm[TASK_COMM_EXT_LEN];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -2613,6 +2613,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))
 			return -EFAULT;
 		break;
+	case PR_SET_EXT_NAME:
+		comm[TASK_COMM_EXT_LEN - 1] = 0;
+		if (strncpy_from_user(comm, (char __user *)arg2,
+				      TASK_COMM_EXT_LEN - 1) < 0)
+			return -EFAULT;
+		set_task_comm(me, comm);
+		proc_comm_connector(me);
+		break;
+	case PR_GET_EXT_NAME:
+		strscpy_pad(comm, me->comm, TASK_COMM_EXT_LEN);
+		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_EXT_LEN))
+			return -EFAULT;
+		break;
 	case PR_GET_ENDIAN:
 		error = GET_ENDIAN(me, arg2);
 		break;

-- 
2.54.0


^ permalink raw reply related

* [PATCH 6/6] selftests: prctl: Add test for long thread names
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Add tests for the new interface to set and get long thread names. The
kernel should accept the LONG_NAME and returning it accordingly. For the
old PR_GET_NAME interface, the kernel should truncate the name up to 16
chars. /proc/<task>/comm should return the same string ad PR_GET_NAME.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 tools/testing/selftests/prctl/set-process-name.c | 36 ++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/tools/testing/selftests/prctl/set-process-name.c b/tools/testing/selftests/prctl/set-process-name.c
index 3f7b146d36df..0f20f7deac67 100644
--- a/tools/testing/selftests/prctl/set-process-name.c
+++ b/tools/testing/selftests/prctl/set-process-name.c
@@ -9,9 +9,17 @@
 
 #include "kselftest_harness.h"
 
+#ifndef PR_SET_EXT_NAME
+# define PR_SET_EXT_NAME 17
+# define PR_GET_EXT_NAME 18
+#endif
+
 #define CHANGE_NAME "changename"
+#define LONG_NAME	"change_to_very_long_extended_name"
+#define LONG_NAME_CAP	"change_to_very_"
 #define EMPTY_NAME ""
 #define TASK_COMM_LEN 16
+#define TASK_COMM_EXT_LEN 64
 #define MAX_PATH_LEN 50
 
 int set_name(char *name)
@@ -25,6 +33,16 @@ int set_name(char *name)
 	return res;
 }
 
+int set_ext_name(char *name)
+{
+	int res;
+
+	res = prctl(PR_SET_EXT_NAME, name, NULL, NULL, NULL);
+
+	if (res < 0)
+		return -errno;
+}
+
 int check_is_name_correct(char *check_name)
 {
 	char name[TASK_COMM_LEN];
@@ -38,6 +56,19 @@ int check_is_name_correct(char *check_name)
 	return !strcmp(name, check_name);
 }
 
+int check_is_ext_name_correct(char *check_name)
+{
+	char name[TASK_COMM_EXT_LEN];
+	int res;
+
+	res = prctl(PR_GET_EXT_NAME, name, NULL, NULL, NULL);
+
+	if (res < 0)
+		return -errno;
+
+	return !strcmp(name, check_name);
+}
+
 int check_null_pointer(char *check_name)
 {
 	char *name = NULL;
@@ -82,6 +113,11 @@ TEST(rename_process) {
 	EXPECT_GE(set_name(CHANGE_NAME), 0);
 	EXPECT_TRUE(check_is_name_correct(CHANGE_NAME));
 
+	EXPECT_GE(set_ext_name(LONG_NAME), 0);
+	EXPECT_TRUE(check_is_ext_name_correct(LONG_NAME));
+	EXPECT_TRUE(check_is_name_correct(LONG_NAME_CAP));
+	EXPECT_TRUE(check_name());
+
 	EXPECT_GE(set_name(EMPTY_NAME), 0);
 	EXPECT_TRUE(check_is_name_correct(EMPTY_NAME));
 

-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: David Laight @ 2026-05-17 21:34 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <20260517-tonyk-long_name-v1-3-3c282eaa91e2@igalia.com>

On Sun, 17 May 2026 15:36:13 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Some parts of the kernel uses memcpy() instead of strscpy() because they
> are performance sensitive and doesn't care about the return value of
> strscpy(). One such common case is to copy current->comm to a different
> buffer.
> 
> As the command name is guaranteed to be NUL-terminated in the range of
> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
> strings. However, in order to expand the size of current->comm, this
> expectation will be broken and those memcpy() could create such strings
> without trailing NUL byte.
> 
> In order to support a fast and safe string copy, create strtostr(), to copy
> a NUL-terminated string to a new string buffer. If the destination buffer
> is bigger than the source, no pad is applied, but the string is
> NUL-terminated. If the destination buffer is smaller, the string is
> truncated. The last byte of the destination is always set to NUL for safety.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  include/linux/coredump.h                           |  2 +-
>  include/linux/string.h                             | 28 ++++++++++++++++++++++
>  include/linux/tracepoint.h                         |  4 ++--
>  include/trace/events/block.h                       | 10 ++++----
>  include/trace/events/coredump.h                    |  2 +-
>  include/trace/events/f2fs.h                        |  4 ++--
>  include/trace/events/oom.h                         |  2 +-
>  include/trace/events/osnoise.h                     |  2 +-
>  include/trace/events/sched.h                       | 10 ++++----
>  include/trace/events/signal.h                      |  2 +-
>  include/trace/events/task.h                        |  4 ++--
>  kernel/printk/nbcon.c                              |  2 +-
>  kernel/printk/printk.c                             |  2 +-
>  tools/bpf/bpftool/pids.c                           |  4 ++--
>  .../selftests/bpf/test_kmods/bpf_testmod-events.h  |  2 +-
>  15 files changed, 54 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/coredump.h b/include/linux/coredump.h
> index 68861da4cf7c..b370ef69f673 100644
> --- a/include/linux/coredump.h
> +++ b/include/linux/coredump.h
> @@ -54,7 +54,7 @@ extern void vfs_coredump(const kernel_siginfo_t *siginfo);
>  	do {	\
>  		char comm[TASK_COMM_LEN];	\
>  		/* This will always be NUL terminated. */ \
> -		memcpy(comm, current->comm, sizeof(comm)); \
> +		strtostr(comm, current->comm); \
>  		printk_ratelimited(Level "coredump: %d(%*pE): " Format "\n",	\
>  			task_tgid_vnr(current), (int)strlen(comm), comm, ##__VA_ARGS__);	\
>  	} while (0)	\
> diff --git a/include/linux/string.h b/include/linux/string.h
> index b850bd91b3d8..ff1f59f4139c 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -445,6 +445,34 @@ void memcpy_and_pad(void *dest, size_t dest_len, const void *src, size_t count,
>  	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
>  } while (0)
>  
> +/**
> + * strtostr - Copy NUL-terminanted string to NUL-terminate string
> + *
> + * @dest: Pointer of destination string
> + * @src: Pointer to NUL-terminates string
> + *
> + * This is a replacement for strcpy() where the caller doesn't care about the
> + * return value and if the string is going to be truncated, albeit it needs
> + * to mark sure that it will be NUL-terminated. Intended for performance
> + * sensitive cases, such as tracing.

If you care about performance, and the destination isn't smaller (especially
if the sizes are the same) then just use memcpy().
 
> + *
> + * If the destination is bigger than the source, no padding happens. It it's
> + * smaller the strings gets truncated.
> + *
> + * Both arguments needs to be arrays with lengths discoverable by the compiler.
> + */
> +#define strtostr(dest, src)	do {					\
> +	const size_t _dest_len = __must_be_cstr(dest) +			\
> +				 ARRAY_SIZE(dest);			\
> +	const size_t _src_len = __must_be_cstr(src) +			\
> +				__builtin_object_size(src, 1);		\
> +									\
> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
> +		     _dest_len == (size_t)-1);				\
> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
> +	dest[_dest_len - 1] = '\0';						\
> +} while (0)

That doesn't work (for all sorts of reasons).
_dest_len can be the size of a pointer - no array check.
You need to use __is_array() and sizeof () for both dest and src.
You might have meant to check that _src_len is constant, not _dest_len.
You must not leave the destination unterminated.

__builtin_object_size(x->y,1) is also entirely useless!
If you have a pointer to a structure that ends in an array then the
object size of that array is SIZE_MAX (as if the array continues past
the end of the structure).
See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).

__builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
You do get a sane answer for (x->y,3) on recent clang - but nowhere else.

-- David



^ permalink raw reply

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Christoph Hellwig @ 2026-05-18  5:08 UTC (permalink / raw)
  To: Cyber_black
  Cc: linux-fsdevel@vger.kernel.org, Mark Fasheh, Theodore Ts'o,
	linux-api
In-Reply-To: <_fcorj7Aa0YnzUmrPnqdEbTjLqS6S7t84HKrzsswvKm71LC0uVmTD2cthCwpgeI-296unEpzPZYBNdFFDXjsQvZRtGfTaQlKmcRkiSI4wiQ=@proton.me>

On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> This restores the intended restriction, at the cost of breaking
> unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> This option is a larger ABI impact and likely undesirable.
> 
> The preferred fix is Option A, since FIEMAP has been available
> unprivileged since 2008 with no reported security issues, and read
> access to physical block layout is already implicitly available
> through open() permission on the file.

No, FIEMAP really should not be available unprivileged.  So I think B is
the right thing.  Can you send a proper patch with a proper signoff?


^ permalink raw reply

* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: André Almeida @ 2026-05-18 14:36 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <20260517223419.3262de7c@pumpkin>

Hi David, thanks for the feedback!

Em 17/05/2026 18:34, David Laight escreveu:
> On Sun, 17 May 2026 15:36:13 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
>> Some parts of the kernel uses memcpy() instead of strscpy() because they
>> are performance sensitive and doesn't care about the return value of
>> strscpy(). One such common case is to copy current->comm to a different
>> buffer.
>>
>> As the command name is guaranteed to be NUL-terminated in the range of
>> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
>> strings. However, in order to expand the size of current->comm, this
>> expectation will be broken and those memcpy() could create such strings
>> without trailing NUL byte.
>>
>> In order to support a fast and safe string copy, create strtostr(), to copy
>> a NUL-terminated string to a new string buffer. If the destination buffer
>> is bigger than the source, no pad is applied, but the string is
>> NUL-terminated. If the destination buffer is smaller, the string is
>> truncated. The last byte of the destination is always set to NUL for safety.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
[...]>> +/**
>> + * strtostr - Copy NUL-terminanted string to NUL-terminate string
>> + *
>> + * @dest: Pointer of destination string
>> + * @src: Pointer to NUL-terminates string
>> + *
>> + * This is a replacement for strcpy() where the caller doesn't care about the
>> + * return value and if the string is going to be truncated, albeit it needs
>> + * to mark sure that it will be NUL-terminated. Intended for performance
>> + * sensitive cases, such as tracing.
> 
> If you care about performance, and the destination isn't smaller (especially
> if the sizes are the same) then just use memcpy().
>   

The problem is that as I'm expanding current->comm, the source buffer 
might be bigger than destination, and when we truncate the string, it 
won't have the termination NUL byte. So we need an extra dest[len-1] = 
\0 after the memcpy.

>> + *
>> + * If the destination is bigger than the source, no padding happens. It it's
>> + * smaller the strings gets truncated.
>> + *
>> + * Both arguments needs to be arrays with lengths discoverable by the compiler.
>> + */
>> +#define strtostr(dest, src)	do {					\
>> +	const size_t _dest_len = __must_be_cstr(dest) +			\
>> +				 ARRAY_SIZE(dest);			\
>> +	const size_t _src_len = __must_be_cstr(src) +			\
>> +				__builtin_object_size(src, 1);		\
>> +									\
>> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
>> +		     _dest_len == (size_t)-1);				\
>> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
>> +	dest[_dest_len - 1] = '\0';						\
>> +} while (0)
> 
> That doesn't work (for all sorts of reasons).
> _dest_len can be the size of a pointer - no array check.
> You need to use __is_array() and sizeof () for both dest and src.
> You might have meant to check that _src_len is constant, not _dest_len.
> You must not leave the destination unterminated.
> 
> __builtin_object_size(x->y,1) is also entirely useless!
> If you have a pointer to a structure that ends in an array then the
> object size of that array is SIZE_MAX (as if the array continues past
> the end of the structure).
> See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).
> 
> __builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
> You do get a sane answer for (x->y,3) on recent clang - but nowhere else.
> 

Oops, you are right, thanks for pointing that out. This is how it would 
look like checking that both args are arrays and using sizeof to get 
their length, if it sounds good I can apply for the v2:

#define strtostr(dest, src)	do {				\
	const size_t _dest_len = __must_be_array(dest) +	\
				 sizeof(dest);			\
	const size_t _src_len = __must_be_array(src) +		\
				sizeof(src);			\
								\
	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||	\
		     _dest_len == (size_t)-1);			\
	memcpy(dest, src, min(_src_len, _dest_len)));		\
	dest[_dest_len - 1] = '\0';				\
} while (0)


> -- David
> 
> 


^ permalink raw reply

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Darrick J. Wong @ 2026-05-18 16:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Cyber_black, linux-fsdevel@vger.kernel.org, Mark Fasheh,
	Theodore Ts'o, linux-api
In-Reply-To: <agqevS--YYBVW2Oz@infradead.org>

On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > This restores the intended restriction, at the cost of breaking
> > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > This option is a larger ABI impact and likely undesirable.
> > 
> > The preferred fix is Option A, since FIEMAP has been available
> > unprivileged since 2008 with no reported security issues, and read
> > access to physical block layout is already implicitly available
> > through open() permission on the file.
> 
> No, FIEMAP really should not be available unprivileged.  So I think B is
> the right thing.  Can you send a proper patch with a proper signoff?

For anyone who might be relying on FIEMAP output to find sparse regions
-- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
you about dirty pagecache backed by unwritten disk space.  cp was burned
by that a decade and a half ago.

--D

^ permalink raw reply

* [2]Yazışmada 2 ileti var[RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Cyber_black @ 2026-05-18 16:21 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: linux-fsdevel@vger.kernel.org, tytso@mit.edu,
	linux-api@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

On Fri, May 15, 2026 at 05:36:45PM +0000, Maintainer wrote:> No, FIEMAP really should not be available unprivileged. So I think B is
> the right thing. Can you send a proper patch with a proper signoff?

Absolutely, thanks for the guidance. You're right that Option B is the
correct approach for consistency and security.

I've prepared the patch below. It adds CAP_SYS_RAWIO check to
ioctl_fiemap() to match the protection already in place for FIBMAP.

The check is placed early in the function, before any filesystem-specific
operations, following the same pattern as ioctl_fibmap().

Best regards,

Eneshan Erdoğan Karaca

My github:https://github.com/Kisaca-Enes

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-fs-ioctl-add-CAP_SYS_RAWIO-check-to-FS_IOC_FIEMAP.patch --]
[-- Type: text/x-patch; name=0001-fs-ioctl-add-CAP_SYS_RAWIO-check-to-FS_IOC_FIEMAP.patch, Size: 1196 bytes --]

From: Cyber_black <Cyberblackk@proton.me>
Date: Fri, 16 May 2026 12:00:00 +0000
Subject: [PATCH] fs/ioctl: add CAP_SYS_RAWIO check to FS_IOC_FIEMAP

FS_IOC_FIEMAP exposes physical block addresses of files to unprivileged
users, which is the same privileged information that FIBMAP protects with
CAP_SYS_RAWIO capability check.

For consistency in the VFS privilege model and to prevent information
disclosure of physical disk layout, add the same capability check to
ioctl_fiemap() that already exists in ioctl_fibmap().

FIEMAP has been available unprivileged since 2008, but as noted by the
maintainers, this was an unintended exposure that should be corrected.

Signed-off-by: Cyber_black <Cyberblackk@proton.me>
---
 fs/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1234567890ab..abcdef1234567 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -199,6 +199,9 @@ static int ioctl_fiemap(struct file *filp, struct fiemap __user *ufiemap)
 	struct fiemap_extent_info fieinfo = { 0, };
 	struct inode *inode = file_inode(filp);
 	int error;
+
+	if (!capable(CAP_SYS_RAWIO))
+		return -EPERM;

 	if (!inode->i_op->fiemap)
 		return -EOPNOTSUPP;
-- 
2.40.0

^ permalink raw reply related

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Andy Lutomirski @ 2026-05-18 16:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Cyber_black, linux-fsdevel@vger.kernel.org,
	Mark Fasheh, Theodore Ts'o, linux-api
In-Reply-To: <20260518162048.GC9531@frogsfrogsfrogs>

On Mon, May 18, 2026 at 9:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> > On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > > This restores the intended restriction, at the cost of breaking
> > > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > > This option is a larger ABI impact and likely undesirable.
> > >
> > > The preferred fix is Option A, since FIEMAP has been available
> > > unprivileged since 2008 with no reported security issues, and read
> > > access to physical block layout is already implicitly available
> > > through open() permission on the file.
> >
> > No, FIEMAP really should not be available unprivileged.  So I think B is
> > the right thing.  Can you send a proper patch with a proper signoff?
>
> For anyone who might be relying on FIEMAP output to find sparse regions
> -- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
> you about dirty pagecache backed by unwritten disk space.  cp was burned
> by that a decade and a half ago.
>

The only way that I'm personally aware of to determine whether ranges
in two files are reflinked to each other (and the only efficient way
to find identical blocks to, say, archive a large directory without
reading all the contents) is FIEMAP.  I wrote some code to do this
awhile back (not in production use).  Yes, I realize that it might
have issues with dirty page cache.

Is there some other way to do this?  Could an API be added that
efficiently answers the actual question without revealing information
that shouldn't be revealed?

--Andy

^ permalink raw reply

* [5][RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Cyber_black @ 2026-05-18 17:22 UTC (permalink / raw)
  To: luto@amacapital.net
  Cc: hch@infradead.org, linux-fsdevel@vger.kernel.org, tytso@mit.edu,
	linux-api@vger.kernel.org, djwong@kernel.org, mark@fasheh.com,
	moybs027@gmail.com

Thank you for raising this important question, Andy. I've been following the discussion as a "listening guest" and I have a thought.

My idea is this: Instead of forcing FIEMAP to become a root-only interface (breaking existing tools), or leaving it as-is (with information disclosure), what if we design a new, restricted API that is not privileged but also not unprivileged in the traditional sense?

Concretely:

1.  The API would be callable by any user, but it would not expose physical block addresses.

2.  It would answer higher-level questions that tools actually need, such as:

    -   "Are these two file ranges reflinked (shared)?" (for deduplication)

    -   "Is this file range sparse (holes)?" (without leaking physical locations)

    -   "What is the allocation status (delayed, unwritten, etc.)?"

3.  The kernel would maintain a capability or permission that is not root-equivalent (e.g., a new `CAP_BLOCK_MAP_QUERY`), but the API would not require full `CAP_SYS_RAWIO`.

This way:

-   Tools like `filefrag`, `cp`, and deduplication utilities can work without root.

-   Physical block addresses remain hidden from unprivileged users, closing the information leak.

-   We avoid forcing users to run these tools as root, which would open up far more serious risks (e.g., kernel panic, accidental corruption).

In short: we don't need to choose between "unprivileged leak" and "root-only". We can design a purpose‑limited API that answers only the necessary questions, with the minimum privilege required.

Would this be acceptable? I'd be happy to help draft a more detailed proposal or prototype.

This idea was developed together with my friend playerofficial19 (moybs027@gmail.com) through discussion. We hope it's helpful.

^ permalink raw reply

* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: David Laight @ 2026-05-18 18:38 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <d4d6cf61-568e-478e-88d6-01b769d7eded@igalia.com>

On Mon, 18 May 2026 11:36:49 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Hi David, thanks for the feedback!
> 
> Em 17/05/2026 18:34, David Laight escreveu:
> > On Sun, 17 May 2026 15:36:13 -0300
> > André Almeida <andrealmeid@igalia.com> wrote:
> >   
> >> Some parts of the kernel uses memcpy() instead of strscpy() because they
> >> are performance sensitive and doesn't care about the return value of
> >> strscpy(). One such common case is to copy current->comm to a different
> >> buffer.
> >>
> >> As the command name is guaranteed to be NUL-terminated in the range of
> >> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
> >> strings. However, in order to expand the size of current->comm, this
> >> expectation will be broken and those memcpy() could create such strings
> >> without trailing NUL byte.
> >>
> >> In order to support a fast and safe string copy, create strtostr(), to copy
> >> a NUL-terminated string to a new string buffer. If the destination buffer
> >> is bigger than the source, no pad is applied, but the string is
> >> NUL-terminated. If the destination buffer is smaller, the string is
> >> truncated. The last byte of the destination is always set to NUL for safety.
> >>
> >> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> >> ---
> [...]>> +/**
> >> + * strtostr - Copy NUL-terminanted string to NUL-terminate string
> >> + *
> >> + * @dest: Pointer of destination string
> >> + * @src: Pointer to NUL-terminates string
> >> + *
> >> + * This is a replacement for strcpy() where the caller doesn't care about the
> >> + * return value and if the string is going to be truncated, albeit it needs
> >> + * to mark sure that it will be NUL-terminated. Intended for performance
> >> + * sensitive cases, such as tracing.  
> > 
> > If you care about performance, and the destination isn't smaller (especially
> > if the sizes are the same) then just use memcpy().
> >     
> 
> The problem is that as I'm expanding current->comm, the source buffer 
> might be bigger than destination, and when we truncate the string, it 
> won't have the termination NUL byte. So we need an extra dest[len-1] = 
> \0 after the memcpy.

It depends on other access to the destination.
If it might be being concurrently read it is vital that it is always
terminated.
So you can't even temporarily have a non-zero byte at the end.

> 
> >> + *
> >> + * If the destination is bigger than the source, no padding happens. It it's
> >> + * smaller the strings gets truncated.
> >> + *
> >> + * Both arguments needs to be arrays with lengths discoverable by the compiler.
> >> + */
> >> +#define strtostr(dest, src)	do {					\
> >> +	const size_t _dest_len = __must_be_cstr(dest) +			\
> >> +				 ARRAY_SIZE(dest);			\
> >> +	const size_t _src_len = __must_be_cstr(src) +			\
> >> +				__builtin_object_size(src, 1);		\
> >> +									\
> >> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
> >> +		     _dest_len == (size_t)-1);				\
> >> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
> >> +	dest[_dest_len - 1] = '\0';						\
> >> +} while (0)  
> > 
> > That doesn't work (for all sorts of reasons).
> > _dest_len can be the size of a pointer - no array check.
> > You need to use __is_array() and sizeof () for both dest and src.
> > You might have meant to check that _src_len is constant, not _dest_len.
> > You must not leave the destination unterminated.
> > 
> > __builtin_object_size(x->y,1) is also entirely useless!
> > If you have a pointer to a structure that ends in an array then the
> > object size of that array is SIZE_MAX (as if the array continues past
> > the end of the structure).
> > See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).
> > 
> > __builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
> > You do get a sane answer for (x->y,3) on recent clang - but nowhere else.
> >   
> 
> Oops, you are right, thanks for pointing that out. This is how it would 
> look like checking that both args are arrays and using sizeof to get 
> their length, if it sounds good I can apply for the v2:
> 
> #define strtostr(dest, src)	do {				\
> 	const size_t _dest_len = __must_be_array(dest) +	\
> 				 sizeof(dest);			\
> 	const size_t _src_len = __must_be_array(src) +		\
> 				sizeof(src);			\
> 								\
> 	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||	\
> 		     _dest_len == (size_t)-1);			\

That test can never fail.

> 	memcpy(dest, src, min(_src_len, _dest_len)));		\
> 	dest[_dest_len - 1] = '\0';				\

You are expending 'dest' twice.
Where it (p++)->array then the two values would be different and the final
value of 'p' incorrect.
Much better to assign both pointers to local variables.
Here you can use their required types to get type checking (I wouldn't bother
about the extra checks that _must_be_cstr() does).

I'd also create function that is explicitly for copying process names.
(Or replace the one that is already there - saves a lot of churn.)
then you know (and can check) the sizes are the expected ones.

It might even be worth making the #define (needed to get the array sizes)
call out to different functions for the different cases.

Thinks more...
On 64bit the 16 byte copy can be 'load; store; load; mask; store' provided
the buffer is aligned (copying u64 on 32bit will work the same).
But that requires that all the buffers be aligned.
So you'd need to check _Alignof(dest) >= _Alignof(u64) as well.
(Probably with a fallback to get things to compile.)

Whether that is best for the longer 64 byte copy is anybodies guess.

I also suspect it would be best to zero fill when copying a 16 byte
name into a 64 byte buffer.
(If you zero fill first then you can just copy 16 bytes over.)

-- David

> } while (0)
> 
> 
> > -- David
> > 
> >   
> 


^ permalink raw reply

* Re: [5][RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Andreas Dilger @ 2026-05-18 19:49 UTC (permalink / raw)
  To: Cyber_black
  Cc: luto@amacapital.net, hch@infradead.org,
	linux-fsdevel@vger.kernel.org, tytso@mit.edu,
	linux-api@vger.kernel.org, djwong@kernel.org, mark@fasheh.com,
	moybs027@gmail.com
In-Reply-To: <-nQmUF-iBsNFQ1Iz2j_cVui7DxnmpAO7z3X7qH8Xzpr7CYXE8j5x5YeFQ39U1wcMFNuVnuxu1pJf7ooiwJYK8ZFJDpjEtifFaBuWNJIi0ak=@proton.me>

On May 18, 2026, at 11:22, Cyber_black <Cyberblackk@proton.me> wrote:
> 
> Thank you for raising this important question, Andy. I've been following the discussion as a "listening guest" and I have a thought.
> 
> My idea is this: Instead of forcing FIEMAP to become a root-only interface (breaking existing tools), or leaving it as-is (with information disclosure), what if we design a new, restricted API that is not privileged but also not unprivileged in the traditional sense?

What is the *actual* security risk of showing block numbers to users for their own files?

If an attacker can access the underlying device/image, they could directly use debugfs
or other filesystem tools to get file->block mappings anyway, and could modify the image
arbitrarily.  Restricting FIEMAP to root or obscuring block numbers is security through
obscurity and provides no actual safety.

Cheers, Andreas

> 
> Concretely:
> 
> 1.  The API would be callable by any user, but it would not expose physical block addresses.
> 
> 2.  It would answer higher-level questions that tools actually need, such as:
> 
>    -   "Are these two file ranges reflinked (shared)?" (for deduplication)
> 
>    -   "Is this file range sparse (holes)?" (without leaking physical locations)
> 
>    -   "What is the allocation status (delayed, unwritten, etc.)?"
> 
> 3.  The kernel would maintain a capability or permission that is not root-equivalent (e.g., a new `CAP_BLOCK_MAP_QUERY`), but the API would not require full `CAP_SYS_RAWIO`.
> 
> 
> This way:
> 
> -   Tools like `filefrag`, `cp`, and deduplication utilities can work without root.
> 
> -   Physical block addresses remain hidden from unprivileged users, closing the information leak.
> 
> -   We avoid forcing users to run these tools as root, which would open up far more serious risks (e.g., kernel panic, accidental corruption).
> 
> 
> In short: we don't need to choose between "unprivileged leak" and "root-only". We can design a purpose‑limited API that answers only the necessary questions, with the minimum privilege required.
> 
> Would this be acceptable? I'd be happy to help draft a more detailed proposal or prototype.
> 
> This idea was developed together with my friend playerofficial19 (moybs027@gmail.com) through discussion. We hope it's helpful.
> 


Cheers, Andreas






^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox