Linux userland API discussions


* Re: [RFC] TID v2.0: kernel module for cache-line zeroization against Flush+Reload (CLFLUSHOPT + LFENCE + REP STOSQ)
From: Jann Horn @ 2026-05-19 21:41 UTC (permalink / raw)
  To: Ahmad Hasan
  Cc: linux-kernel, linux-security-module, linux-hardening,
	kernel-hardening, linux-crypto, linux-mm, linux-api,
	linux-kselftest
In-Reply-To: <CAAmtCfMHqdWbYh-Hc5sGbOhXSM-aCA9G0-s64G8FTM+rGEV5RA@mail.gmail.com>

On Tue, May 19, 2026 at 11:31 PM Ahmad Hasan
<ahmaaaaadbntaaaaa@gmail.com> wrote:
> Thank you for your questions. I'll address each one:
>
> == 1. Threat Model ==
>
> The target scenario is a same-machine attacker
> in multi-tenant/cloud environments where two
> processes share physical L3 cache.
>
> Example: a cryptographic service and a malicious
> process running on the same host. The attacker
> uses Flush+Reload to measure cache access timing
> after every encryption operation — no physical
> access required.
>
> This is documented with real measurements:
> - Without TID: 78 cycles (Cache HIT — key pattern visible)
> - With TID v2.0: 286 cycles (Cache MISS — attack defeated)

So you're assuming that the cryptographic code leaks secrets through a
cache-based side channel? That would be a vulnerability in the crypto
code.

> == 2. Why Kernel Module and not userspace? ==
>
> You are correct that CLFLUSHOPT does not require
> Ring 0. However, userspace execution can be
> interrupted by a Context Switch, which expands
> the timing window from 372ns to 36,640ns —
> making the attack significantly easier.

Why does it matter how many hundreds of nanoseconds it takes to wipe
the data from memory? You can also have a context switch directly
before you enter your cache-wiping syscall, or in the middle of a
crypto operation.

> == 3. Why not add this directly to libraries? ==
>
> No major security library implements CLFLUSHOPT
> after wiping — not OpenSSL, not libsodium, not
> glibc, not memzero_explicit. This gap has existed
> since Flush+Reload was published in 2014.

I don't think that's a gap, because the standard approach to
mitigating cache-based side channels such as FLUSH+RELOAD is to not
access memory at secret-dependent indices in the first place.
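
As an illustration of that standard approach, here is a minimal sketch of a
table lookup that reads every entry, so the addresses touched do not depend
on the secret index. The helper name ct_lookup() is hypothetical and not
taken from any library, and a sufficiently clever compiler could still
transform the loop, so real code should verify the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return table[secret_idx] while reading every entry of the table, so the
 * sequence of memory addresses accessed is independent of secret_idx.
 * Hypothetical helper, for illustration only.
 */
static uint8_t ct_lookup(const uint8_t *table, size_t n, size_t secret_idx)
{
	uint8_t result = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* diff is zero only when i == secret_idx */
		size_t diff = i ^ secret_idx;
		/* top bit of (diff | -diff) is set exactly when diff != 0 */
		size_t nonzero = (diff | ((size_t)0 - diff)) >>
				 (sizeof(size_t) * 8 - 1);
		/* mask is 0xff for the selected entry, 0x00 otherwise */
		uint8_t mask = (uint8_t)(nonzero - 1);

		result |= table[i] & mask;
	}
	return result;
}
```

An out-of-range index selects nothing and yields 0 rather than reading out
of bounds.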


* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Andy Lutomirski @ 2026-05-19 20:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Cyber_black, linux-fsdevel@vger.kernel.org,
	Mark Fasheh, Theodore Ts'o, linux-api
In-Reply-To: <20260519033126.GD9531@frogsfrogsfrogs>

On Mon, May 18, 2026 at 8:31 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Mon, May 18, 2026 at 09:22:42AM -0700, Andy Lutomirski wrote:
> > On Mon, May 18, 2026 at 9:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> > > > On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > > > > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > > > > This restores the intended restriction, at the cost of breaking
> > > > > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > > > > This option is a larger ABI impact and likely undesirable.
> > > > >
> > > > > The preferred fix is Option A, since FIEMAP has been available
> > > > > unprivileged since 2008 with no reported security issues, and read
> > > > > access to physical block layout is already implicitly available
> > > > > through open() permission on the file.
> > > >
> > > > No, FIEMAP really should not be available unprivileged.  So I think B is
> > > > the right thing.  Can you send a proper patch with a proper signoff?
> > >
> > > For anyone who might be relying on FIEMAP output to find sparse regions
> > > -- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
> > > you about dirty pagecache backed by unwritten disk space.  cp was burned
> > > by that a decade and a half ago.
> > >
> >
> > The only way that I'm personally aware of to determine whether ranges
> > in two files are reflinked to each other (and the only efficient way
> > to find identical blocks to, say, archive a large directory without
> > reading all the contents) is FIEMAP.  I wrote some code to do this
> > awhile back (not in production use).  Yes, I realize that it might
> > have issues with dirty page cache.
> >
> > Is there some other way to do this?  Could an API be added that
> > efficiently answers the actual question without revealing information
> > that shouldn't be revealed?
>
> Well, yes, we *could* make yet another ioctl, but we could also just run
> fe_physical through a one-way u64 hash function and set
> FIEMAP_EXTENT_UNKNOWN if (say) you don't have CAP_SYS_RAWIO or
> something.  Then your comparison function might still work... maybe?
>
> OTOH nobody really wants Linus roaring at them, so we might all just do
> absolutely nothing.
>
> Also note that FIEMAP still doesn't report devices, so you're still
> playing with fire on multi-device reflink-aware filesystems like XFS.
>

A hash would be fine for me.
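
A rough sketch of why an equality comparison could survive such hashing:
equal physical offsets map to equal obscured values, so whole-extent matches
still compare equal. Everything below is an assumption for illustration; the
names obscure_physical() and extents_look_shared() are hypothetical, and the
mixer used (the splitmix64 finalizer over value XOR key) is invertible and
therefore NOT one-way. A real implementation would need a keyed PRF.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in keyed mixer: the splitmix64 finalizer applied to value ^ key.
 * ASSUMPTION: illustration only. This function is a bijection, so it hides
 * nothing from someone who knows the key and is not a one-way hash.
 */
static uint64_t obscure_physical(uint64_t fe_physical, uint64_t key)
{
	uint64_t x = fe_physical ^ key;

	x ^= x >> 30;
	x *= 0xbf58476d1ce4e5b9ULL;
	x ^= x >> 27;
	x *= 0x94d049bb133111ebULL;
	x ^= x >> 31;
	return x;
}

/* Treat two extents as shared only if obscured start and length both match. */
static bool extents_look_shared(uint64_t id_a, uint64_t len_a,
				uint64_t id_b, uint64_t len_b)
{
	return id_a == id_b && len_a == len_b;
}
```

One limitation of this sketch: once the start offset is obscured, two extents
that overlap the same physical range but begin at different offsets no longer
compare equal, so only whole-extent sharing is detectable this way.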

But really a nicer interface would translate logical ranges in a file
to some range identifier, where:

- It would be consistent with page cache.  So holes are only reported
if the current logical contents is a hole.
- It would return reliably different identifiers for ranges that do
not have identical contents.
- It would usually return the same identifier for ranges that are
known to the FS to have identical contents.
- It would not return the same identifier for files on different
backing devices that just happen to be backed by the same offset
within their respective backing devices.
- It would not necessarily return values that are consistent across a
remount.  But maybe some kind of mount id would be around to at least
detect this happening.

Fun bonus points: if the range is dirty in page cache, tell me, and if
it's not dirty, then, on supporting filesystems, return a value that
will *change* if someone writes to the file and it gets undirtied
again.  IOW it would be nice to be able to use this to efficiently
scan through a file and see what extents may have been modified since
the last scan.  But this would be complex.

I couldn't care less about the actual location of a file.

Anyway, this is a bit of a pie-in-the-sky thought.


* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: André Almeida @ 2026-05-19 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Laight, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, akpm, Yafang Shao, andrii.nakryiko,
	arnaldo.melo, Petr Mladek, linux-kernel, kernel-dev, linux-mm,
	linux-api
In-Reply-To: <CAHk-=wgBdK5iRf1NdOuMT0-+sjxUc8QAU9vr66jBBzY6EFDtUA@mail.gmail.com>

Em 19/05/2026 17:37, Linus Torvalds escreveu:
> On Mon, 18 May 2026 at 09:37, André Almeida <andrealmeid@igalia.com> wrote:
>>
>> The problem is that as I'm expanding current->comm, the source buffer
>> might be bigger than destination, and when we truncate the string, it
>> won't have the termination NUL byte. So we need an extra dest[len-1] =
>> \0 after the memcpy.
> 
> What's wrong with just using strscpy() with 'len' being min(srcsize,dstsize)?
> 
Well, I thought that strscpy() was too expensive for the trace use case, 
but I'm happy to use it in the v2 if it's ok.
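
For reference, a userspace sketch of the bounded copy being discussed. The
function bounded_copy() below is a hypothetical stand-in that only mimics the
general behaviour of the kernel's strscpy() (always NUL-terminate, report
truncation); it returns -1 on truncation where the kernel helper returns
-E2BIG, and it is not the kernel implementation.

```c
#include <stddef.h>

/*
 * Copy at most count - 1 bytes from src to dst and always NUL-terminate.
 * Returns the number of bytes copied (excluding the NUL), or -1 if the
 * source did not fit. Hypothetical userspace stand-in for strscpy().
 */
static long bounded_copy(char *dst, const char *src, size_t count)
{
	size_t i;

	if (count == 0)
		return -1;

	for (i = 0; i < count - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* src[i] is still inside the first count bytes of src here */
	return src[i] == '\0' ? (long)i : -1;
}
```

Passing min(srcsize, dstsize) as count means the loop never reads past the
source buffer and never writes past the destination, whichever is smaller.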


* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: Linus Torvalds @ 2026-05-19 20:37 UTC (permalink / raw)
  To: André Almeida
  Cc: David Laight, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Steven Rostedt, Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, akpm, Yafang Shao, andrii.nakryiko,
	arnaldo.melo, Petr Mladek, linux-kernel, kernel-dev, linux-mm,
	linux-api
In-Reply-To: <d4d6cf61-568e-478e-88d6-01b769d7eded@igalia.com>

On Mon, 18 May 2026 at 09:37, André Almeida <andrealmeid@igalia.com> wrote:
>
> The problem is that as I'm expanding current->comm, the source buffer
> might be bigger than destination, and when we truncate the string, it
> won't have the termination NUL byte. So we need an extra dest[len-1] =
> \0 after the memcpy.

What's wrong with just using strscpy() with 'len' being min(srcsize,dstsize)?

           Linus


* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: André Almeida @ 2026-05-19 19:47 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <20260518193843.7bde8d53@pumpkin>

Em 18/05/2026 15:38, David Laight escreveu:
> On Mon, 18 May 2026 11:36:49 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
>> Hi David, thanks for the feedback!
>>
>> Em 17/05/2026 18:34, David Laight escreveu:
>>> On Sun, 17 May 2026 15:36:13 -0300
>>> André Almeida <andrealmeid@igalia.com> wrote:
>>>    
>>>> Some parts of the kernel use memcpy() instead of strscpy() because they
>>>> are performance sensitive and don't care about the return value of
>>>> strscpy(). One such common case is to copy current->comm to a different
>>>> buffer.
>>>>
>>>> As the command name is guaranteed to be NUL-terminated in the range of
>>>> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
>>>> strings. However, in order to expand the size of current->comm, this
>>>> expectation will be broken and those memcpy() could create such strings
>>>> without trailing NUL byte.
>>>>
>>>> In order to support a fast and safe string copy, create strtostr(), to copy
>>>> a NUL-terminated string to a new string buffer. If the destination buffer
>>>> is bigger than the source, no pad is applied, but the string is
>>>> NUL-terminated. If the destination buffer is smaller, the string is
>>>> truncated. The last byte of the destination is always set to NUL for safety.
>>>>
>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>> ---
>> [...]>> +/**
>>>> + * strtostr - Copy NUL-terminanted string to NUL-terminate string
>>>> + *
>>>> + * @dest: Pointer of destination string
>>>> + * @src: Pointer to NUL-terminates string
>>>> + *
>>>> + * This is a replacement for strcpy() where the caller doesn't care about the
>>>> + * return value and if the string is going to be truncated, albeit it needs
>>>> + * to mark sure that it will be NUL-terminated. Intended for performance
>>>> + * sensitive cases, such as tracing.
>>>
>>> If you care about performance, and the destination isn't smaller (especially
>>> if the sizes are the same) then just use memcpy().
>>>      
>>
>> The problem is that as I'm expanding current->comm, the source buffer
>> might be bigger than destination, and when we truncate the string, it
>> won't have the termination NUL byte. So we need an extra dest[len-1] =
>> \0 after the memcpy.
> 
> It depends on other access to the destination.
> If it might be being concurrently read it is vital that it is always
> terminated.
> So you can't even temporarily have a non-zero byte at the end.
> 

I don't think this is the case here; as far as I can tell, all the
callers of strtostr will wait for the end of the copy before using it.

>>
>>>> + *
>>>> + * If the destination is bigger than the source, no padding happens. If it's
>>>> + * smaller, the string gets truncated.
>>>> + *
>>>> + * Both arguments need to be arrays with lengths discoverable by the compiler.
>>>> + */
>>>> +#define strtostr(dest, src)	do {					\
>>>> +	const size_t _dest_len = __must_be_cstr(dest) +			\
>>>> +				 ARRAY_SIZE(dest);			\
>>>> +	const size_t _src_len = __must_be_cstr(src) +			\
>>>> +				__builtin_object_size(src, 1);		\
>>>> +									\
>>>> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
>>>> +		     _dest_len == (size_t)-1);				\
>>>> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
>>>> +	dest[_dest_len - 1] = '\0';						\
>>>> +} while (0)
>>>
>>> That doesn't work (for all sorts of reasons).
>>> _dest_len can be the size of a pointer - no array check.
>>> You need to use __is_array() and sizeof () for both dest and src.
>>> You might have meant to check that _src_len is constant, not _dest_len.
>>> You must not leave the destination unterminated.
>>>
>>> __builtin_object_size(x->y,1) is also entirely useless!
>>> If you have a pointer to a structure that ends in an array then the
>>> object size of that array is SIZE_MAX (as if the array continues past
>>> the end of the structure).
>>> See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).
>>>
>>> __builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
>>> You do get a sane answer for (x->y,3) on recent clang - but nowhere else.
>>>    
>>
>> Oops, you are right, thanks for pointing that out. This is how it would
>> look like checking that both args are arrays and using sizeof to get
>> their length, if it sounds good I can apply for the v2:
>>
>> #define strtostr(dest, src)	do {				\
>> 	const size_t _dest_len = __must_be_array(dest) +	\
>> 				 sizeof(dest);			\
>> 	const size_t _src_len = __must_be_array(src) +		\
>> 				sizeof(src);			\
>> 								\
>> 	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||	\
>> 		     _dest_len == (size_t)-1);			\
> 
> That test can never fail.
> 
>> 	memcpy(dest, src, min(_src_len, _dest_len));		\
>> 	dest[_dest_len - 1] = '\0';				\
> 
> You are expanding 'dest' twice.
> Were it (p++)->array then the two values would be different and the final
> value of 'p' incorrect.
> Much better to assign both pointers to local variables.
> Here you can use their required types to get type checking (I wouldn't bother
> about the extra checks that __must_be_cstr() does).
> 

Also, all those memcpy() calls that I replaced had the dest size given
explicitly. I think I could reuse it for strtostr() to simplify things a
bit, what do you think?
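
A rough userspace sketch of that direction, under stated assumptions: the
names name_copy() and NAME_COPY() are hypothetical, and the macro performs no
array check (the kernel's __must_be_array() or similar would still be needed,
since passing a pointer would silently yield the pointer's size). The sizes
are handed to a function, so each argument is evaluated only once because
sizeof does not evaluate its operand, and the last destination byte is never
written with a non-NUL value.

```c
#include <stddef.h>

/*
 * Copy a NUL-terminated name, truncating to fit. At most
 * min(dst_len, src_len) - 1 bytes are copied, so the final slot of dst
 * only ever receives a NUL. Hypothetical helper, not a kernel API.
 */
static inline void name_copy(char *dst, size_t dst_len,
			     const char *src, size_t src_len)
{
	size_t limit = (dst_len < src_len ? dst_len : src_len) - 1;
	size_t n = 0;

	while (n < limit && src[n] != '\0') {
		dst[n] = src[n];
		n++;
	}
	dst[n] = '\0';
}

/*
 * ASSUMPTION: both arguments are real arrays; nothing here enforces it.
 * dst and src each appear inside sizeof (unevaluated) and once as a
 * function argument, so side effects such as (p++)->name happen once.
 */
#define NAME_COPY(dst, src) \
	name_copy((dst), sizeof(dst), (src), sizeof(src))
```

With a 64-byte source and a 16-byte destination this keeps the first 15
bytes and terminates at index 15.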

> I'd also create function that is explicitly for copying process names.
> (Or replace the one that is already there - saves a lot of churn.)
> then you know (and can check) the sizes are the expected ones.
> 

I don't have a strong feeling about get_task_comm(), but Linus said that
"I'd rather aim to get rid of get_task_comm() entirely"[1] so for me
it's fine to add a new function for that.

[1] 
https://lore.kernel.org/all/CAHk-=wi5c=_-FBGo_88CowJd_F-Gi6Ud9d=TALm65ReN7YjrMw@mail.gmail.com/

> It might even be worth making the #define (needed to get the array sizes)
> call out to different functions for the different cases.
> 
> Thinks more...
> On 64bit the 16 byte copy can be 'load; store; load; mask; store' provided
> the buffer is aligned (copying u64 on 32bit will work the same).
> But that requires that all the buffers be aligned.
> So you'd need to check _Alignof(dest) >= _Alignof(u64) as well.
> (Probably with a fallback to get things to compile.)
> 
> Whether that is best for the longer 64 byte copy is anybody's guess.
> 
> I also suspect it would be best to zero fill when copying a 16 byte
> name into a 64 byte buffer.
> (If you zero fill first then you can just copy 16 bytes over.)
> 
> -- David
> 
>> } while (0)
>>
>>
>>> -- David
>>>
>>>    
>>
> 



* Re: [RFC] TID v2.0: kernel module for cache-line zeroization against Flush+Reload (CLFLUSHOPT + LFENCE + REP STOSQ)
From: Jann Horn @ 2026-05-19 16:47 UTC (permalink / raw)
  To: Ahmed Hassan
  Cc: linux-kernel, linux-security-module, linux-hardening,
	kernel-hardening, linux-crypto, linux-mm, linux-api,
	linux-kselftest
In-Reply-To: <F78521DA-08DC-424E-BBE1-231BC900CEE0@gmail.com>

On Mon, May 18, 2026 at 11:47 PM Ahmed Hassan
<ahmaaaaadbntaaaaa@gmail.com> wrote:
>
> Hi kernel developers,
>
> I am sharing TID (The Instant Destroyer) v2.0, a Linux kernel module
> written in C that addresses a specific gap in existing security
> libraries: none of them (libsodium, OpenSSL, glibc memzero_explicit)
> flush CPU cache lines after memory zeroization.
>
>
> == Problem ==
>
> Standard zeroization functions (explicit_bzero, sodium_memzero,
> OPENSSL_cleanse) prevent the compiler from eliding the wipe, but do
> not evict CPU cache lines (L1/L2/L3). This leaves residual key
> material measurable via Flush+Reload (Yarom & Falkner, 2014) after
> data use ends.

The thing you're talking about isn't really related to the
Flush+Reload side channel attack, right? You're just talking about
flushing cache lines.

In what threat model would this be an issue? Normally, the goal of
memory zeroing is to ensure that sensitive data is wiped before an
attacker has a chance to physically pull out the RAM from a machine
and plug it into another device that can reveal RAM contents, or
before an attacker gains physical control of a locked device and can
connect malicious peripherals to it, or such.

So for this to be an actual security problem, the device would have to
keep running in a sufficiently high power state that data caches are
not discarded, and at the same time not perform enough memory accesses
to cause this memory to be discarded...

Assuming that this is an actual problem, why are you using a kernel
module for this? At least on x86, CLFLUSH is unprivileged, so crypto
libraries should be able to just use that directly. (There is the
caveat of what happens when the kernel migrates pages or kills a
process, but that's a larger problem.)
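
To make the "unprivileged" point concrete, a minimal userspace sketch of
zeroing a buffer and then flushing its cache lines. The function name
wipe_and_flush() is hypothetical, the 64-byte line size is an assumption
(real code would query CPUID), and whether doing this helps under any
realistic threat model is exactly the open question above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0	/* non-x86 builds: zero only, no flush */
#endif

#define CACHE_LINE 64	/* ASSUMPTION: 64-byte cache lines */

/* Zero len bytes at buf, then flush every cache line the range touches. */
static void wipe_and_flush(void *buf, size_t len)
{
	volatile unsigned char *p = buf;
	size_t i;

	if (len == 0)
		return;

	/* volatile stores keep the compiler from eliding the wipe */
	for (i = 0; i < len; i++)
		p[i] = 0;
#if HAVE_CLFLUSH
	{
		uintptr_t line = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
		uintptr_t end = (uintptr_t)buf + len;

		for (; line < end; line += CACHE_LINE)
			_mm_clflush((const void *)line);
		_mm_mfence();	/* order the flushes before later accesses */
	}
#endif
}
```

No syscall or kernel module is involved; the flush instruction runs in ring 3.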


* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Christoph Hellwig @ 2026-05-19 11:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andy Lutomirski, Christoph Hellwig, Cyber_black,
	linux-fsdevel@vger.kernel.org, Mark Fasheh, Theodore Ts'o,
	linux-api
In-Reply-To: <20260519033126.GD9531@frogsfrogsfrogs>

On Mon, May 18, 2026 at 08:31:26PM -0700, Darrick J. Wong wrote:
> > The only way that I'm personally aware of to determine whether ranges
> > in two files are reflinked to each other (and the only efficient way
> > to find identical blocks to, say, archive a large directory without
> > reading all the contents) is FIEMAP.  I wrote some code to do this
> > awhile back (not in production use).  Yes, I realize that it might
> > have issues with dirty page cache.
> > 
> > Is there some other way to do this?  Could an API be added that
> > efficiently answers the actual question without revealing information
> > that shouldn't be revealed?
> 
> Well, yes, we *could* make yet another ioctl, but we could also just run
> fe_physical through a one-way u64 hash function and set
> FIEMAP_EXTENT_UNKNOWN if (say) you don't have CAP_SYS_RAWIO or
> something.  Then your comparison function might still work... maybe?

What is the actual use case for that dedup detection?  I.e. what is
considered duplicate?  Does the application already have candidate
ranges or does it scan the output for all files?

For xfs the rmap can directly tell you what is shared, but I can't think
of a good way to expose that, but part of that might be that I don't
understand what question is asked and why.

Note the FIEMAP output can give you the wrong answer, e.g. with XFS
and multiple devices, or for file systems that can do tail packing and
have small amounts of data for multiple files in the same block.

> Also note that FIEMAP still doesn't report devices, so you're still
> playing with fire on multi-device reflink-aware filesystems like XFS.

or even on f2fs despite the lack of reflink support if the caller is
dumb enough.  All that of course depends on what the caller is doing
based on the FIEMAP output.


* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Christoph Hellwig @ 2026-05-19 11:42 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Christoph Hellwig, Cyber_black, linux-fsdevel@vger.kernel.org,
	Mark Fasheh, linux-api
In-Reply-To: <20260519022327.GA11894@macsyma-wired.lan>

On Mon, May 18, 2026 at 10:23:27PM -0400, Theodore Tso wrote:
> I disagree.  As I recall, we discussed whether or not FIEMAP needed to
> be unprivileged many years ago, and it was a conscious choice not to
> require root privs.  I don't believe it is a security issue to allow
> users to see the logical -> physical block mappings for inodes.

Users have no business even knowing it.  It is a side channel that can
easily leak information for attackers that know allocation policies.
And as the reporter stated, it also is inconsistent with how FIBMAP has
behaved since the dawn of time.



* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Andreas Dilger @ 2026-05-19  7:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andy Lutomirski, Christoph Hellwig, Cyber_black,
	linux-fsdevel@vger.kernel.org, Mark Fasheh, Theodore Ts'o,
	linux-api
In-Reply-To: <20260519033126.GD9531@frogsfrogsfrogs>

On May 18, 2026, at 21:31, Darrick J. Wong <djwong@kernel.org> wrote:
> 
> On Mon, May 18, 2026 at 09:22:42AM -0700, Andy Lutomirski wrote:
>> On Mon, May 18, 2026 at 9:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>> 
>>> On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
>>>> On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
>>>>> Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
>>>>> This restores the intended restriction, at the cost of breaking
>>>>> unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
>>>>> This option is a larger ABI impact and likely undesirable.
>>>>> 
>>>>> The preferred fix is Option A, since FIEMAP has been available
>>>>> unprivileged since 2008 with no reported security issues, and read
>>>>> access to physical block layout is already implicitly available
>>>>> through open() permission on the file.
>>>> 
>>>> No, FIEMAP really should not be available unprivileged.  So I think B is
>>>> the right thing.  Can you send a proper patch with a proper signoff?
>>> 
>>> For anyone who might be relying on FIEMAP output to find sparse regions
>>> -- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
>>> you about dirty pagecache backed by unwritten disk space.  cp was burned
>>> by that a decade and a half ago.
>>> 
>> 
>> The only way that I'm personally aware of to determine whether ranges
>> in two files are reflinked to each other (and the only efficient way
>> to find identical blocks to, say, archive a large directory without
>> reading all the contents) is FIEMAP.  I wrote some code to do this
>> awhile back (not in production use).  Yes, I realize that it might
>> have issues with dirty page cache.
>> 
>> Is there some other way to do this?  Could an API be added that
>> efficiently answers the actual question without revealing information
>> that shouldn't be revealed?
> 
> Well, yes, we *could* make yet another ioctl, but we could also just run
> fe_physical through a one-way u64 hash function and set
> FIEMAP_EXTENT_UNKNOWN if (say) you don't have CAP_SYS_RAWIO or
> something.  Then your comparison function might still work... maybe?
> 
> OTOH nobody really wants Linus roaring at them, so we might all just do
> absolutely nothing.
> 
> Also note that FIEMAP still doesn't report devices, so you're still
> playing with fire on multi-device reflink-aware filesystems like XFS.

I've long had a patch to add device printing to FIEMAP/filefrag, but IIRC
the last time I tried to submit it upstream it was rejected.  Maybe times
have changed and there is a chance to get it included.

Cheers, Andreas







* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Darrick J. Wong @ 2026-05-19  3:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Cyber_black, linux-fsdevel@vger.kernel.org,
	Mark Fasheh, Theodore Ts'o, linux-api
In-Reply-To: <CALCETrUFMFNnJ6FLd9SkzS5E1q3x+cqGvOvo5PzU2V_+moSEJw@mail.gmail.com>

On Mon, May 18, 2026 at 09:22:42AM -0700, Andy Lutomirski wrote:
> On Mon, May 18, 2026 at 9:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> > > On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > > > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > > > This restores the intended restriction, at the cost of breaking
> > > > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > > > This option is a larger ABI impact and likely undesirable.
> > > >
> > > > The preferred fix is Option A, since FIEMAP has been available
> > > > unprivileged since 2008 with no reported security issues, and read
> > > > access to physical block layout is already implicitly available
> > > > through open() permission on the file.
> > >
> > > No, FIEMAP really should not be available unprivileged.  So I think B is
> > > the right thing.  Can you send a proper patch with a proper signoff?
> >
> > For anyone who might be relying on FIEMAP output to find sparse regions
> > -- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
> > you about dirty pagecache backed by unwritten disk space.  cp was burned
> > by that a decade and a half ago.
> >
> 
> The only way that I'm personally aware of to determine whether ranges
> in two files are reflinked to each other (and the only efficient way
> to find identical blocks to, say, archive a large directory without
> reading all the contents) is FIEMAP.  I wrote some code to do this
> awhile back (not in production use).  Yes, I realize that it might
> have issues with dirty page cache.
> 
> Is there some other way to do this?  Could an API be added that
> efficiently answers the actual question without revealing information
> that shouldn't be revealed?

Well, yes, we *could* make yet another ioctl, but we could also just run
fe_physical through a one-way u64 hash function and set
FIEMAP_EXTENT_UNKNOWN if (say) you don't have CAP_SYS_RAWIO or
something.  Then your comparison function might still work... maybe?

OTOH nobody really wants Linus roaring at them, so we might all just do
absolutely nothing.

Also note that FIEMAP still doesn't report devices, so you're still
playing with fire on multi-device reflink-aware filesystems like XFS.

--D


* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Theodore Tso @ 2026-05-19  2:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Cyber_black, linux-fsdevel@vger.kernel.org, Mark Fasheh,
	linux-api
In-Reply-To: <agqevS--YYBVW2Oz@infradead.org>

On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > This restores the intended restriction, at the cost of breaking
> > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > This option is a larger ABI impact and likely undesirable.
> > 
> > The preferred fix is Option A, since FIEMAP has been available
> > unprivileged since 2008 with no reported security issues, and read
> > access to physical block layout is already implicitly available
> > through open() permission on the file.
> 
> No, FIEMAP really should not be available unprivileged.  So I think B is
> the right thing.  Can you send a proper patch with a proper signoff?
> 

I disagree.  As I recall, we discussed whether or not FIEMAP needed to
be unprivileged many years ago, and it was a conscious choice not to
require root privs.  I don't believe it is a security issue to allow
users to see the logical -> physical block mappings for inodes.

Users might misuse it, and we did have that issue many years ago when
cp attempted to use FIEMAP in a way that it wasn't intended to be
used[1].  However, that was over 15 years ago.

[1] https://lwn.net/Articles/429345/

But just because an interface could be misused doesn't mean that we
should restrict it, IMHO.

					- Ted


* Re: [5][RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Andreas Dilger @ 2026-05-18 19:49 UTC (permalink / raw)
  To: Cyber_black
  Cc: luto@amacapital.net, hch@infradead.org,
	linux-fsdevel@vger.kernel.org, tytso@mit.edu,
	linux-api@vger.kernel.org, djwong@kernel.org, mark@fasheh.com,
	moybs027@gmail.com
In-Reply-To: <-nQmUF-iBsNFQ1Iz2j_cVui7DxnmpAO7z3X7qH8Xzpr7CYXE8j5x5YeFQ39U1wcMFNuVnuxu1pJf7ooiwJYK8ZFJDpjEtifFaBuWNJIi0ak=@proton.me>

On May 18, 2026, at 11:22, Cyber_black <Cyberblackk@proton.me> wrote:
> 
> Thank you for raising this important question, Andy. I've been following the discussion as a "listening guest" and I have a thought.
> 
> My idea is this: Instead of forcing FIEMAP to become a root-only interface (breaking existing tools), or leaving it as-is (with information disclosure), what if we design a new, restricted API that is not privileged but also not unprivileged in the traditional sense?

What is the *actual* security risk of showing block numbers to users for their own files?

If an attacker can access the underlying device/image, they could directly use debugfs
or other filesystem tools to get file->block mappings anyway, and could modify the image
arbitrarily.  Restricting FIEMAP to root or obscuring block numbers is security through
obscurity and provides no actual safety.

Cheers, Andreas

> 
> Concretely:
> 
> 1.  The API would be callable by any user, but it would not expose physical block addresses.
> 
> 2.  It would answer higher-level questions that tools actually need, such as:
> 
>    -   "Are these two file ranges reflinked (shared)?" (for deduplication)
> 
>    -   "Is this file range sparse (holes)?" (without leaking physical locations)
> 
>    -   "What is the allocation status (delayed, unwritten, etc.)?"
> 
> 3.  The kernel would maintain a capability or permission that is not root-equivalent (e.g., a new `CAP_BLOCK_MAP_QUERY`), but the API would not require full `CAP_SYS_RAWIO`.
> 
> 
> This way:
> 
> -   Tools like `filefrag`, `cp`, and deduplication utilities can work without root.
> 
> -   Physical block addresses remain hidden from unprivileged users, closing the information leak.
> 
> -   We avoid forcing users to run these tools as root, which would open up far more serious risks (e.g., kernel panic, accidental corruption).
> 
> 
> In short: we don't need to choose between "unprivileged leak" and "root-only". We can design a purpose‑limited API that answers only the necessary questions, with the minimum privilege required.
> 
> Would this be acceptable? I'd be happy to help draft a more detailed proposal or prototype.
> 
> This idea was developed together with my friend playerofficial19 (moybs027@gmail.com) through discussion. We hope it's helpful.
> 


Cheers, Andreas







* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: David Laight @ 2026-05-18 18:38 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <d4d6cf61-568e-478e-88d6-01b769d7eded@igalia.com>

On Mon, 18 May 2026 11:36:49 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Hi David, thanks for the feedback!
> 
> Em 17/05/2026 18:34, David Laight escreveu:
> > On Sun, 17 May 2026 15:36:13 -0300
> > André Almeida <andrealmeid@igalia.com> wrote:
> >   
> >> Some parts of the kernel use memcpy() instead of strscpy() because they
> >> are performance sensitive and don't care about the return value of
> >> strscpy(). One such common case is to copy current->comm to a different
> >> buffer.
> >>
> >> As the command name is guaranteed to be NUL-terminated in the range of
> >> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
> >> strings. However, in order to expand the size of current->comm, this
> >> expectation will be broken and those memcpy() could create such strings
> >> without trailing NUL byte.
> >>
> >> In order to support a fast and safe string copy, create strtostr(), to copy
> >> a NUL-terminated string to a new string buffer. If the destination buffer
> >> is bigger than the source, no pad is applied, but the string is
> >> NUL-terminated. If the destination buffer is smaller, the string is
> >> truncated. The last byte of the destination is always set to NUL for safety.
> >>
> >> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> >> ---
> [...]
> >> +/**
> >> + * strtostr - Copy NUL-terminated string to NUL-terminated string
> >> + *
> >> + * @dest: Pointer to destination string
> >> + * @src: Pointer to NUL-terminated string
> >> + *
> >> + * This is a replacement for strcpy() where the caller doesn't care about the
> >> + * return value or whether the string gets truncated, but needs to make sure
> >> + * that the result is NUL-terminated. Intended for performance-sensitive
> >> + * cases, such as tracing.
> > 
> > If you care about performance, and the destination isn't smaller (especially
> > if the sizes are the same) then just use memcpy().
> >     
> 
> The problem is that as I'm expanding current->comm, the source buffer 
> might be bigger than destination, and when we truncate the string, it 
> won't have the termination NUL byte. So we need an extra dest[len-1] = 
> \0 after the memcpy.

It depends on other access to the destination.
If it might be being concurrently read it is vital that it is always
terminated.
So you can't even temporarily have a non-zero byte at the end.

> 
> >> + *
> >> + * If the destination is bigger than the source, no padding happens. If it's
> >> + * smaller the string gets truncated.
> >> + *
> >> + * Both arguments need to be arrays with lengths discoverable by the compiler.
> >> + */
> >> +#define strtostr(dest, src)	do {					\
> >> +	const size_t _dest_len = __must_be_cstr(dest) +			\
> >> +				 ARRAY_SIZE(dest);			\
> >> +	const size_t _src_len = __must_be_cstr(src) +			\
> >> +				__builtin_object_size(src, 1);		\
> >> +									\
> >> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
> >> +		     _dest_len == (size_t)-1);				\
> >> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
> >> +	dest[_dest_len - 1] = '\0';						\
> >> +} while (0)  
> > 
> > That doesn't work (for all sorts of reasons).
> > _dest_len can be the size of a pointer - no array check.
> > You need to use __is_array() and sizeof () for both dest and src.
> > You might have meant to check that _src_len is constant, not _dest_len.
> > You must not leave the destination unterminated.
> > 
> > __builtin_object_size(x->y,1) is also entirely useless!
> > If you have a pointer to a structure that ends in an array then the
> > object size of that array is SIZE_MAX (as if the array continues past
> > the end of the structure).
> > See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).
> > 
> > __builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
> > You do get a sane answer for (x->y,3) on recent clang - but nowhere else.
> >   
> 
> Oops, you are right, thanks for pointing that out. This is how it would 
> look like checking that both args are arrays and using sizeof to get 
> their length, if it sounds good I can apply for the v2:
> 
> #define strtostr(dest, src)	do {				\
> 	const size_t _dest_len = __must_be_array(dest) +	\
> 				 sizeof(dest);			\
> 	const size_t _src_len = __must_be_array(src) +		\
> 				sizeof(src);			\
> 								\
> 	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||	\
> 		     _dest_len == (size_t)-1);			\

That test can never fail.

> 	memcpy(dest, src, min(_src_len, _dest_len));		\
> 	dest[_dest_len - 1] = '\0';				\

You are expanding 'dest' twice.
Were it (p++)->array then the two values would be different and the final
value of 'p' incorrect.
Much better to assign both pointers to local variables.
Here you can use their required types to get type checking (I wouldn't bother
about the extra checks that __must_be_cstr() does).

I'd also create a function that is explicitly for copying process names.
(Or replace the one that is already there - saves a lot of churn.)
Then you know (and can check) the sizes are the expected ones.

It might even be worth making the #define (needed to get the array sizes)
call out to different functions for the different cases.

Thinks more...
On 64bit the 16 byte copy can be 'load; store; load; mask; store' provided
the buffer is aligned (copying u64 on 32bit will work the same).
But that requires that all the buffers be aligned.
So you'd need to check _Alignof(dest) >= _Alignof(u64) as well.
(Probably with a fallback to get things to compile.)

Whether that is best for the longer 64 byte copy is anybody's guess.

I also suspect it would be best to zero fill when copying a 16 byte
name into a 64 byte buffer.
(If you zero fill first then you can just copy 16 bytes over.)

-- David

> } while (0)
> 
> 
> > -- David
> > 
> >   
> 
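
The review points above (array-only arguments, sizeof for both sizes, each
argument evaluated once, and a destination whose last byte never holds a
non-NUL value) can be combined in a small userspace sketch. This is not the
kernel macro and not a proposed v2: copy_name and IS_ARRAY are hypothetical
names, the array check relies on GCC/Clang builtins, and the alignment and
zero-fill ideas are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true for real arrays, false for pointers (GCC/Clang). */
#define IS_ARRAY(a) \
	(!__builtin_types_compatible_p(__typeof__(a), __typeof__(&(a)[0])))

/*
 * Hypothetical sketch: copy a NUL-terminated char array into another char
 * array. sizeof and __typeof__ do not evaluate their operands, so 'dest' and
 * 'src' are each evaluated exactly once, when the local pointers are set up.
 */
#define copy_name(dest, src) do {					\
	static_assert(IS_ARRAY(dest), "dest must be an array");	\
	static_assert(IS_ARRAY(src), "src must be an array");		\
	const size_t dlen_ = sizeof(dest);				\
	const size_t slen_ = sizeof(src);				\
	char *d_ = (dest);						\
	const char *s_ = (src);						\
	/* Copy at most dlen_ - 1 bytes: the last byte is only ever set to NUL. */ \
	memcpy(d_, s_, slen_ < dlen_ ? slen_ : dlen_ - 1);		\
	d_[dlen_ - 1] = '\0';						\
} while (0)
```

With char small[4] and char big[16] = "abcdefghijklmno", copy_name(small, big)
leaves "abc" in small. Passing a plain char pointer for either argument fails
at compile time instead of silently using the size of a pointer.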


^ permalink raw reply

* [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Cyber_black @ 2026-05-18 17:22 UTC (permalink / raw)
  To: luto@amacapital.net
  Cc: hch@infradead.org, linux-fsdevel@vger.kernel.org, tytso@mit.edu,
	linux-api@vger.kernel.org, djwong@kernel.org, mark@fasheh.com,
	moybs027@gmail.com

Thank you for raising this important question, Andy. I've been following the discussion as a "listening guest" and I have a thought.

My idea is this: Instead of forcing FIEMAP to become a root-only interface (breaking existing tools), or leaving it as-is (with information disclosure), what if we design a new, restricted API that is not privileged but also not unprivileged in the traditional sense?

Concretely:

1.  The API would be callable by any user, but it would not expose physical block addresses.

2.  It would answer higher-level questions that tools actually need, such as:

    -   "Are these two file ranges reflinked (shared)?" (for deduplication)

    -   "Is this file range sparse (holes)?" (without leaking physical locations)

    -   "What is the allocation status (delayed, unwritten, etc.)?"

3.  The kernel would maintain a capability or permission that is not root-equivalent (e.g., a new `CAP_BLOCK_MAP_QUERY`), but the API would not require full `CAP_SYS_RAWIO`.

This way:

-   Tools like `filefrag`, `cp`, and deduplication utilities can work without root.

-   Physical block addresses remain hidden from unprivileged users, closing the information leak.

-   We avoid forcing users to run these tools as root, which would open up far more serious risks (e.g., kernel panic, accidental corruption).

In short: we don't need to choose between "unprivileged leak" and "root-only". We can design a purpose‑limited API that answers only the necessary questions, with the minimum privilege required.

Would this be acceptable? I'd be happy to help draft a more detailed proposal or prototype.

This idea was developed together with my friend playerofficial19 (moybs027@gmail.com) through discussion. We hope it's helpful.

^ permalink raw reply

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Andy Lutomirski @ 2026-05-18 16:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Cyber_black, linux-fsdevel@vger.kernel.org,
	Mark Fasheh, Theodore Ts'o, linux-api
In-Reply-To: <20260518162048.GC9531@frogsfrogsfrogs>

On Mon, May 18, 2026 at 9:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> > On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > > This restores the intended restriction, at the cost of breaking
> > > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > > This option is a larger ABI impact and likely undesirable.
> > >
> > > The preferred fix is Option A, since FIEMAP has been available
> > > unprivileged since 2008 with no reported security issues, and read
> > > access to physical block layout is already implicitly available
> > > through open() permission on the file.
> >
> > No, FIEMAP really should not be available unprivileged.  So I think B is
> > the right thing.  Can you send a proper patch with a proper signoff?
>
> For anyone who might be relying on FIEMAP output to find sparse regions
> -- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
> you about dirty pagecache backed by unwritten disk space.  cp was burned
> by that a decade and a half ago.
>

The only way that I'm personally aware of to determine whether ranges
in two files are reflinked to each other (and the only efficient way
to find identical blocks to, say, archive a large directory without
reading all the contents) is FIEMAP.  I wrote some code to do this
awhile back (not in production use).  Yes, I realize that it might
have issues with dirty page cache.

Is there some other way to do this?  Could an API be added that
efficiently answers the actual question without revealing information
that shouldn't be revealed?

--Andy
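
For context, a minimal sketch of the kind of FIEMAP query discussed here,
assuming a filesystem that implements FIEMAP and sets FIEMAP_EXTENT_SHARED.
count_shared_extents is a hypothetical helper, not code from this thread. The
flag only says that an extent is shared with something; working out whether
two particular files share it means comparing fe_physical values, which is the
information the thread is debating.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, FIEMAP_* flags */

/*
 * Hypothetical helper: count extents flagged FIEMAP_EXTENT_SHARED among the
 * first max_extents extents of fd. Returns -1 if the ioctl or the allocation
 * fails (for example on a filesystem without FIEMAP support).
 */
static int count_shared_extents(int fd, unsigned int max_extents)
{
	struct fiemap *fm;
	int shared = 0;

	assert(max_extents > 0);

	fm = calloc(1, sizeof(*fm) + max_extents * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* sync the file before mapping */
	fm->fm_extent_count = max_extents;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		free(fm);
		return -1;
	}

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		if (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED)
			shared++;
	}

	free(fm);
	return shared;
}
```

A file with more than max_extents extents needs repeated calls that advance
fm_start until an extent carries FIEMAP_EXTENT_LAST; the sketch skips that loop.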

^ permalink raw reply

* [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Cyber_black @ 2026-05-18 16:21 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: linux-fsdevel@vger.kernel.org, tytso@mit.edu,
	linux-api@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

On Fri, May 15, 2026 at 05:36:45PM +0000, Maintainer wrote:
> No, FIEMAP really should not be available unprivileged. So I think B is
> the right thing. Can you send a proper patch with a proper signoff?

Absolutely, thanks for the guidance. You're right that Option B is the
correct approach for consistency and security.

I've prepared the patch below. It adds CAP_SYS_RAWIO check to
ioctl_fiemap() to match the protection already in place for FIBMAP.

The check is placed early in the function, before any filesystem-specific
operations, following the same pattern as ioctl_fibmap().

Best regards,

Eneshan Erdoğan Karaca

My GitHub: https://github.com/Kisaca-Enes

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-fs-ioctl-add-CAP_SYS_RAWIO-check-to-FS_IOC_FIEMAP.patch --]
[-- Type: text/x-patch; name=0001-fs-ioctl-add-CAP_SYS_RAWIO-check-to-FS_IOC_FIEMAP.patch, Size: 1196 bytes --]

From: Cyber_black <Cyberblackk@proton.me>
Date: Fri, 16 May 2026 12:00:00 +0000
Subject: [PATCH] fs/ioctl: add CAP_SYS_RAWIO check to FS_IOC_FIEMAP

FS_IOC_FIEMAP exposes physical block addresses of files to unprivileged
users, which is the same privileged information that FIBMAP protects with
CAP_SYS_RAWIO capability check.

For consistency in the VFS privilege model and to prevent information
disclosure of physical disk layout, add the same capability check to
ioctl_fiemap() that already exists in ioctl_fibmap().

FIEMAP has been available unprivileged since 2008, but as noted by the
maintainers, this was an unintended exposure that should be corrected.

Signed-off-by: Cyber_black <Cyberblackk@proton.me>
---
 fs/ioctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 1234567890ab..abcdef1234567 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -199,6 +199,9 @@ static int ioctl_fiemap(struct file *filp, struct fiemap __user *ufiemap)
 	struct fiemap_extent_info fieinfo = { 0, };
 	struct inode *inode = file_inode(filp);
 	int error;
+
+	if (!capable(CAP_SYS_RAWIO))
+		return -EPERM;

 	if (!inode->i_op->fiemap)
 		return -EOPNOTSUPP;
-- 
2.40.0

^ permalink raw reply related

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Darrick J. Wong @ 2026-05-18 16:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Cyber_black, linux-fsdevel@vger.kernel.org, Mark Fasheh,
	Theodore Ts'o, linux-api
In-Reply-To: <agqevS--YYBVW2Oz@infradead.org>

On Sun, May 17, 2026 at 10:08:13PM -0700, Christoph Hellwig wrote:
> On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> > Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> > This restores the intended restriction, at the cost of breaking
> > unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> > This option is a larger ABI impact and likely undesirable.
> > 
> > The preferred fix is Option A, since FIEMAP has been available
> > unprivileged since 2008 with no reported security issues, and read
> > access to physical block layout is already implicitly available
> > through open() permission on the file.
> 
> No, FIEMAP really should not be available unprivileged.  So I think B is
> the right thing.  Can you send a proper patch with a proper signoff?

For anyone who might be relying on FIEMAP output to find sparse regions
-- don't.  FIEMAP is a lowlevel fs debugging interface; it won't tell
you about dirty pagecache backed by unwritten disk space.  cp was burned
by that a decade and a half ago.

--D
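
As an illustration of a sparse-region query that does not involve FIEMAP or
physical addresses, here is a minimal sketch built on lseek() with SEEK_DATA
and SEEK_HOLE. It is not something proposed in this thread, and
list_data_segments is a hypothetical helper name. On filesystems without hole
support the whole file is reported as one data segment.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback when the libc hides these behind _GNU_SOURCE; Linux values. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: print every data segment of fd as [start, end) and
 * return the number of segments, or -1 on error. Only logical file offsets
 * are involved. The file offset of fd is changed by the calls.
 */
static int list_data_segments(int fd)
{
	off_t end = lseek(fd, 0, SEEK_END);
	off_t pos = 0;
	int count = 0;

	if (end < 0)
		return -1;

	while (pos < end) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)	/* ENXIO: nothing but a hole after pos */
			return errno == ENXIO ? count : -1;

		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return -1;
		assert(hole > data);	/* end of file counts as a hole */

		printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
		count++;
		pos = hole;
	}
	return count;
}
```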

^ permalink raw reply

* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: André Almeida @ 2026-05-18 14:36 UTC (permalink / raw)
  To: David Laight
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <20260517223419.3262de7c@pumpkin>

Hi David, thanks for the feedback!

On 17/05/2026 18:34, David Laight wrote:
> On Sun, 17 May 2026 15:36:13 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
>> Some parts of the kernel use memcpy() instead of strscpy() because they
>> are performance sensitive and don't care about the return value of
>> strscpy(). One such common case is to copy current->comm to a different
>> buffer.
>>
>> As the command name is guaranteed to be NUL-terminated in the range of
>> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
>> strings. However, in order to expand the size of current->comm, this
>> expectation will be broken and those memcpy() could create such strings
>> without trailing NUL byte.
>>
>> In order to support a fast and safe string copy, create strtostr(), to copy
>> a NUL-terminated string to a new string buffer. If the destination buffer
>> is bigger than the source, no pad is applied, but the string is
>> NUL-terminated. If the destination buffer is smaller, the string is
>> truncated. The last byte of the destination is always set to NUL for safety.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
[...]
>> +/**
>> + * strtostr - Copy NUL-terminated string to NUL-terminated string
>> + *
>> + * @dest: Pointer to destination string
>> + * @src: Pointer to NUL-terminated string
>> + *
>> + * This is a replacement for strcpy() where the caller doesn't care about the
>> + * return value or whether the string gets truncated, but needs to make sure
>> + * that the result is NUL-terminated. Intended for performance-sensitive
>> + * cases, such as tracing.
> 
> If you care about performance, and the destination isn't smaller (especially
> if the sizes are the same) then just use memcpy().
>   

The problem is that as I'm expanding current->comm, the source buffer 
might be bigger than destination, and when we truncate the string, it 
won't have the termination NUL byte. So we need an extra dest[len-1] = 
\0 after the memcpy.

>> + *
>> + * If the destination is bigger than the source, no padding happens. If it's
>> + * smaller the string gets truncated.
>> + *
>> + * Both arguments need to be arrays with lengths discoverable by the compiler.
>> + */
>> +#define strtostr(dest, src)	do {					\
>> +	const size_t _dest_len = __must_be_cstr(dest) +			\
>> +				 ARRAY_SIZE(dest);			\
>> +	const size_t _src_len = __must_be_cstr(src) +			\
>> +				__builtin_object_size(src, 1);		\
>> +									\
>> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
>> +		     _dest_len == (size_t)-1);				\
>> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
>> +	dest[_dest_len - 1] = '\0';						\
>> +} while (0)
> 
> That doesn't work (for all sorts of reasons).
> _dest_len can be the size of a pointer - no array check.
> You need to use __is_array() and sizeof () for both dest and src.
> You might have meant to check that _src_len is constant, not _dest_len.
> You must not leave the destination unterminated.
> 
> __builtin_object_size(x->y,1) is also entirely useless!
> If you have a pointer to a structure that ends in an array then the
> object size of that array is SIZE_MAX (as if the array continues past
> the end of the structure).
> See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).
> 
> __builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
> You do get a sane answer for (x->y,3) on recent clang - but nowhere else.
> 

Oops, you are right, thanks for pointing that out. This is how it would 
look like checking that both args are arrays and using sizeof to get 
their length, if it sounds good I can apply for the v2:

#define strtostr(dest, src)	do {				\
	const size_t _dest_len = __must_be_array(dest) +	\
				 sizeof(dest);			\
	const size_t _src_len = __must_be_array(src) +		\
				sizeof(src);			\
								\
	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||	\
		     _dest_len == (size_t)-1);			\
	memcpy(dest, src, min(_src_len, _dest_len));		\
	dest[_dest_len - 1] = '\0';				\
} while (0)


> -- David
> 
> 


^ permalink raw reply

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Christoph Hellwig @ 2026-05-18  5:08 UTC (permalink / raw)
  To: Cyber_black
  Cc: linux-fsdevel@vger.kernel.org, Mark Fasheh, Theodore Ts'o,
	linux-api
In-Reply-To: <_fcorj7Aa0YnzUmrPnqdEbTjLqS6S7t84HKrzsswvKm71LC0uVmTD2cthCwpgeI-296unEpzPZYBNdFFDXjsQvZRtGfTaQlKmcRkiSI4wiQ=@proton.me>

On Fri, May 15, 2026 at 05:36:45PM +0000, Cyber_black wrote:
> Option B) Add a capability check to ioctl_fiemap() to match FIBMAP.
> This restores the intended restriction, at the cost of breaking
> unprivileged use of FIEMAP (e.g. filefrag, btrfs tools, e2freefrag).
> This option is a larger ABI impact and likely undesirable.
> 
> The preferred fix is Option A, since FIEMAP has been available
> unprivileged since 2008 with no reported security issues, and read
> access to physical block layout is already implicitly available
> through open() permission on the file.

No, FIEMAP really should not be available unprivileged.  So I think B is
the right thing.  Can you send a proper patch with a proper signoff?


^ permalink raw reply

* Re: [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: David Laight @ 2026-05-17 21:34 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api
In-Reply-To: <20260517-tonyk-long_name-v1-3-3c282eaa91e2@igalia.com>

On Sun, 17 May 2026 15:36:13 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Some parts of the kernel use memcpy() instead of strscpy() because they
> are performance sensitive and don't care about the return value of
> strscpy(). One such common case is to copy current->comm to a different
> buffer.
> 
> As the command name is guaranteed to be NUL-terminated in the range of
> TASK_COMM_LEN, this is safe enough and doesn't create unterminated
> strings. However, in order to expand the size of current->comm, this
> expectation will be broken and those memcpy() could create such strings
> without trailing NUL byte.
> 
> In order to support a fast and safe string copy, create strtostr(), to copy
> a NUL-terminated string to a new string buffer. If the destination buffer
> is bigger than the source, no pad is applied, but the string is
> NUL-terminated. If the destination buffer is smaller, the string is
> truncated. The last byte of the destination is always set to NUL for safety.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  include/linux/coredump.h                           |  2 +-
>  include/linux/string.h                             | 28 ++++++++++++++++++++++
>  include/linux/tracepoint.h                         |  4 ++--
>  include/trace/events/block.h                       | 10 ++++----
>  include/trace/events/coredump.h                    |  2 +-
>  include/trace/events/f2fs.h                        |  4 ++--
>  include/trace/events/oom.h                         |  2 +-
>  include/trace/events/osnoise.h                     |  2 +-
>  include/trace/events/sched.h                       | 10 ++++----
>  include/trace/events/signal.h                      |  2 +-
>  include/trace/events/task.h                        |  4 ++--
>  kernel/printk/nbcon.c                              |  2 +-
>  kernel/printk/printk.c                             |  2 +-
>  tools/bpf/bpftool/pids.c                           |  4 ++--
>  .../selftests/bpf/test_kmods/bpf_testmod-events.h  |  2 +-
>  15 files changed, 54 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/coredump.h b/include/linux/coredump.h
> index 68861da4cf7c..b370ef69f673 100644
> --- a/include/linux/coredump.h
> +++ b/include/linux/coredump.h
> @@ -54,7 +54,7 @@ extern void vfs_coredump(const kernel_siginfo_t *siginfo);
>  	do {	\
>  		char comm[TASK_COMM_LEN];	\
>  		/* This will always be NUL terminated. */ \
> -		memcpy(comm, current->comm, sizeof(comm)); \
> +		strtostr(comm, current->comm); \
>  		printk_ratelimited(Level "coredump: %d(%*pE): " Format "\n",	\
>  			task_tgid_vnr(current), (int)strlen(comm), comm, ##__VA_ARGS__);	\
>  	} while (0)	\
> diff --git a/include/linux/string.h b/include/linux/string.h
> index b850bd91b3d8..ff1f59f4139c 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -445,6 +445,34 @@ void memcpy_and_pad(void *dest, size_t dest_len, const void *src, size_t count,
>  	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
>  } while (0)
>  
> +/**
> + * strtostr - Copy NUL-terminated string to NUL-terminated string
> + *
> + * @dest: Pointer to destination string
> + * @src: Pointer to NUL-terminated string
> + *
> + * This is a replacement for strcpy() where the caller doesn't care about the
> + * return value or whether the string gets truncated, but needs to make sure
> + * that the result is NUL-terminated. Intended for performance-sensitive
> + * cases, such as tracing.

If you care about performance, and the destination isn't smaller (especially
if the sizes are the same) then just use memcpy().
 
> + *
> + * If the destination is bigger than the source, no padding happens. If it's
> + * smaller the string gets truncated.
> + *
> + * Both arguments need to be arrays with lengths discoverable by the compiler.
> + */
> +#define strtostr(dest, src)	do {					\
> +	const size_t _dest_len = __must_be_cstr(dest) +			\
> +				 ARRAY_SIZE(dest);			\
> +	const size_t _src_len = __must_be_cstr(src) +			\
> +				__builtin_object_size(src, 1);		\
> +									\
> +	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
> +		     _dest_len == (size_t)-1);				\
> +	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
> +	dest[_dest_len - 1] = '\0';						\
> +} while (0)

That doesn't work (for all sorts of reasons).
_dest_len can be the size of a pointer - no array check.
You need to use __is_array() and sizeof () for both dest and src.
You might have meant to check that _src_len is constant, not _dest_len.
You must not leave the destination unterminated.

__builtin_object_size(x->y,1) is also entirely useless!
If you have a pointer to a structure that ends in an array then the
object size of that array is SIZE_MAX (as if the array continues past
the end of the structure).
See https://godbolt.org/z/csenjfvxe (which I happened to prepare earlier today).

__builtin_object_size(x->y,0) also seems to always return SIZE_MAX.
You do get a sane answer for (x->y,3) on recent clang - but nowhere else.

-- David



^ permalink raw reply

* [PATCH 6/6] selftests: prctl: Add test for long thread names
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Add tests for the new interface to set and get long thread names. The
kernel should accept LONG_NAME and return it unchanged. For the old
PR_GET_NAME interface, the kernel should truncate the name to 16 bytes
(15 characters plus the trailing NUL). /proc/<task>/comm should return
the same string as PR_GET_NAME.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 tools/testing/selftests/prctl/set-process-name.c | 36 ++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/tools/testing/selftests/prctl/set-process-name.c b/tools/testing/selftests/prctl/set-process-name.c
index 3f7b146d36df..0f20f7deac67 100644
--- a/tools/testing/selftests/prctl/set-process-name.c
+++ b/tools/testing/selftests/prctl/set-process-name.c
@@ -9,9 +9,17 @@
 
 #include "kselftest_harness.h"
 
+#ifndef PR_SET_EXT_NAME
+# define PR_SET_EXT_NAME 17
+# define PR_GET_EXT_NAME 18
+#endif
+
 #define CHANGE_NAME "changename"
+#define LONG_NAME	"change_to_very_long_extended_name"
+#define LONG_NAME_CAP	"change_to_very_"
 #define EMPTY_NAME ""
 #define TASK_COMM_LEN 16
+#define TASK_COMM_EXT_LEN 64
 #define MAX_PATH_LEN 50
 
 int set_name(char *name)
@@ -25,6 +33,16 @@ int set_name(char *name)
 	return res;
 }
 
+int set_ext_name(char *name)
+{
+	int res;
+
+	res = prctl(PR_SET_EXT_NAME, name, NULL, NULL, NULL);
+	if (res < 0)
+		return -errno;
+	return res;
+}
+
 int check_is_name_correct(char *check_name)
 {
 	char name[TASK_COMM_LEN];
@@ -38,6 +56,19 @@ int check_is_name_correct(char *check_name)
 	return !strcmp(name, check_name);
 }
 
+int check_is_ext_name_correct(char *check_name)
+{
+	char name[TASK_COMM_EXT_LEN];
+	int res;
+
+	res = prctl(PR_GET_EXT_NAME, name, NULL, NULL, NULL);
+
+	if (res < 0)
+		return -errno;
+
+	return !strcmp(name, check_name);
+}
+
 int check_null_pointer(char *check_name)
 {
 	char *name = NULL;
@@ -82,6 +113,11 @@ TEST(rename_process) {
 	EXPECT_GE(set_name(CHANGE_NAME), 0);
 	EXPECT_TRUE(check_is_name_correct(CHANGE_NAME));
 
+	EXPECT_GE(set_ext_name(LONG_NAME), 0);
+	EXPECT_TRUE(check_is_ext_name_correct(LONG_NAME));
+	EXPECT_TRUE(check_is_name_correct(LONG_NAME_CAP));
+	EXPECT_TRUE(check_name());
+
 	EXPECT_GE(set_name(EMPTY_NAME), 0);
 	EXPECT_TRUE(check_is_name_correct(EMPTY_NAME));
 

-- 
2.54.0
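
The truncation the test expects from the old interface can be reproduced on a
current kernel without this series. The sketch below uses only the existing
PR_SET_NAME and PR_GET_NAME calls; roundtrip_name and COMM_LEN are
hypothetical names for the illustration, and the proposed PR_{SET,GET}_EXT_NAME
options are deliberately not used.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

#define COMM_LEN 16	/* assumed to match TASK_COMM_LEN on current kernels */

/*
 * Hypothetical helper: set the calling thread's name and read it back into
 * out, which must hold at least COMM_LEN bytes. Returns 0 on success, -1 if
 * either prctl() call fails.
 */
static int roundtrip_name(const char *name, char out[COMM_LEN])
{
	assert(name != NULL && out != NULL);

	if (prctl(PR_SET_NAME, name, 0L, 0L, 0L) < 0)
		return -1;
	if (prctl(PR_GET_NAME, out, 0L, 0L, 0L) < 0)
		return -1;
	return 0;
}
```

Setting "change_to_very_long_extended_name" this way reads back as
"change_to_very_", the same value the selftest defines as LONG_NAME_CAP.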


^ permalink raw reply related

* [PATCH 5/6] prctl: Add support for long user thread names
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Add support for getting and setting long user thread names with
PR_{SET,GET}_EXT_NAME.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 include/linux/sched.h      |  2 +-
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               | 15 ++++++++++++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f7fd2b7d131d..fd4256c8627b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1997,7 +1997,7 @@ extern void kick_process(struct task_struct *tsk);
 
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 #define set_task_comm(tsk, from) ({			\
-	BUILD_BUG_ON(sizeof(from) != TASK_COMM_LEN);	\
+	BUILD_BUG_ON(sizeof(from) < TASK_COMM_LEN);	\
 	__set_task_comm(tsk, from, false);		\
 })
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b6ec6f693719..a07f8edadd65 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -56,6 +56,9 @@
 #define PR_SET_NAME    15		/* Set process name */
 #define PR_GET_NAME    16		/* Get process name */
 
+#define PR_SET_EXT_NAME    17		/* Set extended process name */
+#define PR_GET_EXT_NAME    18		/* Get extended process name */
+
 /* Get/set process endian */
 #define PR_GET_ENDIAN	19
 #define PR_SET_ENDIAN	20
diff --git a/kernel/sys.c b/kernel/sys.c
index 76d77218ab19..1b70d53da998 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
-	unsigned char comm[TASK_COMM_LEN];
+	unsigned char comm[TASK_COMM_EXT_LEN];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -2613,6 +2613,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))
 			return -EFAULT;
 		break;
+	case PR_SET_EXT_NAME:
+		comm[TASK_COMM_EXT_LEN - 1] = 0;
+		if (strncpy_from_user(comm, (char __user *)arg2,
+				      TASK_COMM_EXT_LEN - 1) < 0)
+			return -EFAULT;
+		set_task_comm(me, comm);
+		proc_comm_connector(me);
+		break;
+	case PR_GET_EXT_NAME:
+		strscpy_pad(comm, me->comm, TASK_COMM_EXT_LEN);
+		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_EXT_LEN))
+			return -EFAULT;
+		break;
 	case PR_GET_ENDIAN:
 		error = GET_ENDIAN(me, arg2);
 		break;

-- 
2.54.0


^ permalink raw reply related

* [PATCH 3/6] string: Introduce strtostr() for safe and performance string copies
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Some parts of the kernel use memcpy() instead of strscpy() because they
are performance sensitive and don't care about the return value of
strscpy(). One such common case is to copy current->comm to a different
buffer.

As the command name is guaranteed to be NUL-terminated in the range of
TASK_COMM_LEN, this is safe enough and doesn't create unterminated
strings. However, in order to expand the size of current->comm, this
expectation will be broken and those memcpy() could create such strings
without trailing NUL byte.

In order to support a fast and safe string copy, create strtostr(), which
copies a NUL-terminated string to a new string buffer. If the destination
buffer is bigger than the source, no padding is applied, but the string is
NUL-terminated. If the destination buffer is smaller, the string is
truncated. The last byte of the destination is always set to NUL for safety.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
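
Note (illustration only, not part of the patch): a minimal userspace
sketch of the copy semantics described above, assuming the terminating NUL
is written directly after the copied bytes. The helper name
copy_cstr_trunc() and its explicit size parameters are hypothetical; the
real strtostr() is a macro that derives both sizes from the array types.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: copy at most dst_size - 1 bytes of src, never read
 * past src_size bytes, do not pad, and always terminate the result.
 * Returns the number of bytes copied, excluding the terminator.
 */
static size_t copy_cstr_trunc(char *dst, size_t dst_size,
			      const char *src, size_t src_size)
{
	size_t limit, len;

	assert(dst_size > 0);

	limit = src_size < dst_size - 1 ? src_size : dst_size - 1;
	len = strnlen(src, limit);

	memcpy(dst, src, len);
	dst[len] = '\0';	/* terminate right after the copied bytes */

	return len;
}
```

With a 16-byte destination and the source "abc", this copies three bytes
plus the terminator and leaves the remaining destination bytes untouched.
With a 4-byte destination and a longer source, only the first three bytes
are kept.
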
 include/linux/coredump.h                           |  2 +-
 include/linux/string.h                             | 28 ++++++++++++++++++++++
 include/linux/tracepoint.h                         |  4 ++--
 include/trace/events/block.h                       | 10 ++++----
 include/trace/events/coredump.h                    |  2 +-
 include/trace/events/f2fs.h                        |  4 ++--
 include/trace/events/oom.h                         |  2 +-
 include/trace/events/osnoise.h                     |  2 +-
 include/trace/events/sched.h                       | 10 ++++----
 include/trace/events/signal.h                      |  2 +-
 include/trace/events/task.h                        |  4 ++--
 kernel/printk/nbcon.c                              |  2 +-
 kernel/printk/printk.c                             |  2 +-
 tools/bpf/bpftool/pids.c                           |  4 ++--
 .../selftests/bpf/test_kmods/bpf_testmod-events.h  |  2 +-
 15 files changed, 54 insertions(+), 26 deletions(-)

diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 68861da4cf7c..b370ef69f673 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -54,7 +54,7 @@ extern void vfs_coredump(const kernel_siginfo_t *siginfo);
 	do {	\
 		char comm[TASK_COMM_LEN];	\
 		/* This will always be NUL terminated. */ \
-		memcpy(comm, current->comm, sizeof(comm)); \
+		strtostr(comm, current->comm); \
 		printk_ratelimited(Level "coredump: %d(%*pE): " Format "\n",	\
 			task_tgid_vnr(current), (int)strlen(comm), comm, ##__VA_ARGS__);	\
 	} while (0)	\
diff --git a/include/linux/string.h b/include/linux/string.h
index b850bd91b3d8..ff1f59f4139c 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -445,6 +445,34 @@ void memcpy_and_pad(void *dest, size_t dest_len, const void *src, size_t count,
 	memcpy(dest, src, strnlen(src, min(_src_len, _dest_len)));	\
 } while (0)
 
+/**
+ * strtostr - Copy a NUL-terminated string to a NUL-terminated string
+ *
+ * @dest: Pointer to the destination string
+ * @src: Pointer to the NUL-terminated source string
+ *
+ * This is a replacement for strcpy() where the caller doesn't care about the
+ * return value or whether the string gets truncated, but needs to make sure
+ * that the result is NUL-terminated. Intended for performance-sensitive
+ * cases, such as tracing.
+ *
+ * If the destination is bigger than the source, no padding happens. If it's
+ * smaller, the string gets truncated.
+ *
+ * Both arguments need to be arrays with lengths discoverable by the compiler.
+ */
+#define strtostr(dest, src)	do {					\
+	const size_t _dest_len = __must_be_cstr(dest) +			\
+				 ARRAY_SIZE(dest);			\
+	const size_t _src_len = __must_be_cstr(src) +			\
+				__builtin_object_size(src, 1);		\
+	const size_t _len = strnlen(src, min(_src_len, _dest_len - 1));	\
+	BUILD_BUG_ON(!__builtin_constant_p(_dest_len) ||		\
+		     _dest_len == (size_t)-1);				\
+	memcpy(dest, src, _len);					\
+	dest[_len] = dest[_dest_len - 1] = '\0';			\
+} while (0)
+
 /**
  * memtostr - Copy a possibly non-NUL-term string to a NUL-term string
  * @dest: Pointer to destination NUL-terminates string
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 763eea4d80d8..19e3cb4ca487 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -615,10 +615,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  *	*
  *
  *	TP_fast_assign(
- *		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ *		strtostr(__entry->next_comm, next->comm);
  *		__entry->prev_pid	= prev->pid;
  *		__entry->prev_prio	= prev->prio;
- *		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+ *		strtostr(__entry->prev_comm, prev->comm);
  *		__entry->next_pid	= next->pid;
  *		__entry->next_prio	= next->prio;
  *	),
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..779622cadee3 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -213,7 +213,7 @@ DECLARE_EVENT_CLASS(block_rq,
 
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
 		__get_str(cmd)[0] = '\0';
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
@@ -351,7 +351,7 @@ DECLARE_EVENT_CLASS(block_bio,
 		__entry->sector		= bio->bi_iter.bi_sector;
 		__entry->nr_sector	= bio_sectors(bio);
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("%d,%d %s %llu + %u [%s]",
@@ -434,7 +434,7 @@ TRACE_EVENT(block_plug,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("[%s]", __entry->comm)
@@ -453,7 +453,7 @@ DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_fast_assign(
 		__entry->nr_rq = depth;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
@@ -504,7 +504,7 @@ TRACE_EVENT(block_split,
 		__entry->sector		= bio->bi_iter.bi_sector;
 		__entry->new_sector	= new_sector;
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("%d,%d %s %llu / %llu [%s]",
diff --git a/include/trace/events/coredump.h b/include/trace/events/coredump.h
index c7b9c53fc498..581768a122f8 100644
--- a/include/trace/events/coredump.h
+++ b/include/trace/events/coredump.h
@@ -32,7 +32,7 @@ TRACE_EVENT(coredump,
 
 	TP_fast_assign(
 		__entry->sig = sig;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, current->comm);
 	),
 
 	TP_printk("sig=%d comm=%s",
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index b5188d2671d7..cc1fd1e01541 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -2505,7 +2505,7 @@ TRACE_EVENT(f2fs_lock_elapsed_time,
 
 	TP_fast_assign(
 		__entry->dev		= sbi->sb->s_dev;
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio;
 		__entry->ioprio_class	= IOPRIO_PRIO_CLASS(ioprio);
@@ -2558,7 +2558,7 @@ DECLARE_EVENT_CLASS(f2fs_priority_update,
 
 	TP_fast_assign(
 		__entry->dev		= sbi->sb->s_dev;
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->lock_name	= lock_name;
 		__entry->is_write	= is_write;
diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 9f0a5d1482c4..61b66928de4d 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -23,7 +23,7 @@ TRACE_EVENT(oom_score_adj_update,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
 
diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
index 3f4273623801..26e42fd1a084 100644
--- a/include/trace/events/osnoise.h
+++ b/include/trace/events/osnoise.h
@@ -116,7 +116,7 @@ TRACE_EVENT(thread_noise,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, t->comm);
 		__entry->pid = t->pid;
 		__entry->start = start;
 		__entry->duration = duration;
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 535860581f15..91bc5931e2a3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -152,7 +152,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->target_cpu	= task_cpu(p);
@@ -237,11 +237,11 @@ TRACE_EVENT(sched_switch,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		strtostr(__entry->prev_comm, prev->comm);
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
 		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		strtostr(__entry->next_comm, next->comm);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
 		/* XXX SCHED_DEADLINE */
@@ -346,7 +346,7 @@ TRACE_EVENT(sched_process_exit,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, p->comm);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->group_dead	= group_dead;
@@ -787,7 +787,7 @@ TRACE_EVENT(sched_skip_cpuset_numa,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, tsk->comm);
 		__entry->pid		 = task_pid_nr(tsk);
 		__entry->tgid		 = task_tgid_nr(tsk);
 		__entry->ngid		 = task_numa_group_id(tsk);
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 1db7e4b07c01..8759078b0da9 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -67,7 +67,7 @@ TRACE_EVENT(signal_generate,
 	TP_fast_assign(
 		__entry->sig	= sig;
 		TP_STORE_SIGINFO(__entry, info);
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->pid	= task->pid;
 		__entry->group	= group;
 		__entry->result	= result;
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index b9a129eb54d9..8636ead17cd8 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -21,7 +21,7 @@ TRACE_EVENT(task_newtask,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->clone_flags = clone_flags;
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
@@ -46,7 +46,7 @@ TRACE_EVENT(task_rename,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
+		strtostr(entry->oldcomm, task->comm);
 		strscpy(entry->newcomm, comm, TASK_COMM_LEN);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index d7044a7a214b..5b0c54082876 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -952,7 +952,7 @@ static void wctxt_load_execution_ctx(struct nbcon_write_context *wctxt,
 {
 	wctxt->cpu = pmsg->cpu;
 	wctxt->pid = pmsg->pid;
-	memcpy(wctxt->comm, pmsg->comm, sizeof(wctxt->comm));
+	strtostr(wctxt->comm, pmsg->comm);
 	static_assert(sizeof(wctxt->comm) == sizeof(pmsg->comm));
 }
 #else
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1f04e753ca02..231c4d7c3580 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2255,7 +2255,7 @@ static void pmsg_load_execution_ctx(struct printk_message *pmsg,
 {
 	pmsg->cpu = printk_info_get_cpu(info);
 	pmsg->pid = printk_info_get_pid(info);
-	memcpy(pmsg->comm, info->comm, sizeof(pmsg->comm));
+	strtostr(pmsg->comm, info->comm);
 	static_assert(sizeof(pmsg->comm) == sizeof(info->comm));
 }
 #else
diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c
index 23f488cf1740..46b62f65dc43 100644
--- a/tools/bpf/bpftool/pids.c
+++ b/tools/bpf/bpftool/pids.c
@@ -53,7 +53,7 @@ static void add_ref(struct hashmap *map, struct pid_iter_entry *e)
 		refs->refs = tmp;
 		ref = &refs->refs[refs->ref_cnt];
 		ref->pid = e->pid;
-		memcpy(ref->comm, e->comm, sizeof(ref->comm));
+		strtostr(ref->comm, e->comm);
 		ref->comm[sizeof(ref->comm) - 1] = '\0';
 		refs->ref_cnt++;
 
@@ -77,7 +77,7 @@ static void add_ref(struct hashmap *map, struct pid_iter_entry *e)
 	}
 	ref = &refs->refs[0];
 	ref->pid = e->pid;
-	memcpy(ref->comm, e->comm, sizeof(ref->comm));
+	strtostr(ref->comm, e->comm);
 	ref->comm[sizeof(ref->comm) - 1] = '\0';
 	refs->ref_cnt = 1;
 	refs->has_bpf_cookie = e->has_bpf_cookie;
diff --git a/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h b/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h
index 45a5e41f3a92..72c865ccf1b5 100644
--- a/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h
+++ b/tools/testing/selftests/bpf/test_kmods/bpf_testmod-events.h
@@ -20,7 +20,7 @@ TRACE_EVENT(bpf_testmod_test_read,
 	),
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strtostr(__entry->comm, task->comm);
 		__entry->off = ctx->off;
 		__entry->len = ctx->len;
 	),

-- 
2.54.0



* [PATCH 2/6] treewide: Get rid of get_task_comm()
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

Since commit 4cc0473d7754 ("get rid of __get_task_comm()"),
get_task_comm() just does a redundant check of the buffer size and calls
strscpy_pad(). Replace get_task_comm() calls with strscpy_pad(), which
does the right thing if the buffer sizes don't match: zero-pad if the
destination is bigger, and truncate if it's smaller.

Link: https://lore.kernel.org/lkml/CAHk-=wi5c=_-FBGo_88CowJd_F-Gi6Ud9d=TALm65ReN7YjrMw@mail.gmail.com/
Co-developed-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
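
Note (illustration only, not part of the patch): a minimal userspace model
of the strscpy_pad() behaviour this change relies on, namely zero-padding
when the destination is bigger and truncation when it is smaller. The
function name model_strscpy_pad() and its explicit size parameter are
hypothetical; the kernel macro takes the size from the destination array.
Unlike get_task_comm(), which evaluated to the buffer, strscpy_pad()
returns a length or an error code.

```c
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical model of strscpy_pad(): returns the number of characters
 * copied (excluding the terminator), or -E2BIG if src had to be truncated.
 */
static ssize_t model_strscpy_pad(char *dst, size_t dst_size, const char *src)
{
	size_t len;

	assert(dst != NULL && src != NULL);

	if (dst_size == 0)
		return -E2BIG;

	len = strnlen(src, dst_size);
	if (len == dst_size) {
		/* Source does not fit: keep dst_size - 1 bytes and terminate. */
		memcpy(dst, src, dst_size - 1);
		dst[dst_size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len);
	/* Zero the rest of the buffer: terminator plus padding. */
	memset(dst + len, 0, dst_size - len);

	return (ssize_t)len;
}
```

Copying "abc" into an 8-byte buffer returns 3 and leaves five zero bytes
after the text, while copying "abcdef" into a 4-byte buffer returns -E2BIG
and stores "abc".
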
 drivers/connector/cn_proc.c                        |  2 +-
 drivers/dma-buf/sw_sync.c                          |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c            |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c             |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c       |  2 +-
 drivers/gpu/drm/lima/lima_ctx.c                    |  2 +-
 drivers/gpu/drm/panfrost/panfrost_gem.c            |  2 +-
 drivers/gpu/drm/panthor/panthor_gem.c              |  2 +-
 drivers/gpu/drm/panthor/panthor_sched.c            |  2 +-
 drivers/gpu/drm/virtio/virtgpu_ioctl.c             |  2 +-
 drivers/hwtracing/stm/core.c                       |  2 +-
 drivers/tty/tty_audit.c                            |  2 +-
 fs/binfmt_elf.c                                    |  2 +-
 fs/binfmt_elf_fdpic.c                              |  2 +-
 fs/proc/array.c                                    |  2 +-
 include/linux/sched.h                              | 19 -------------------
 kernel/audit.c                                     |  6 ++++--
 kernel/auditsc.c                                   |  6 ++++--
 kernel/printk/printk.c                             |  2 +-
 kernel/sys.c                                       |  2 +-
 net/bluetooth/hci_sock.c                           |  2 +-
 net/netfilter/nf_tables_api.c                      |  2 +-
 security/integrity/integrity_audit.c               |  3 ++-
 security/ipe/audit.c                               |  2 +-
 security/landlock/domain.c                         |  2 +-
 security/lsm_audit.c                               |  7 ++++---
 29 files changed, 39 insertions(+), 52 deletions(-)

diff --git a/drivers/connector/cn_proc.c b/drivers/connector/cn_proc.c
index 0056ab81fbc3..c78243ed3c2a 100644
--- a/drivers/connector/cn_proc.c
+++ b/drivers/connector/cn_proc.c
@@ -278,7 +278,7 @@ void proc_comm_connector(struct task_struct *task)
 	ev->what = PROC_EVENT_COMM;
 	ev->event_data.comm.process_pid  = task->pid;
 	ev->event_data.comm.process_tgid = task->tgid;
-	get_task_comm(ev->event_data.comm.comm, task);
+	strscpy_pad(ev->event_data.comm.comm, task->comm);
 
 	memcpy(&msg->id, &cn_proc_event_id, sizeof(msg->id));
 	msg->ack = 0; /* not used */
diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 8df20b0218a9..d501657ad801 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -312,7 +312,7 @@ static int sw_sync_debugfs_open(struct inode *inode, struct file *file)
 	struct sync_timeline *obj;
 	char task_comm[TASK_COMM_LEN];
 
-	get_task_comm(task_comm, current);
+	strscpy_pad(task_comm, current->comm);
 
 	obj = sync_timeline_create(task_comm);
 	if (!obj)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
index 6a364357522b..13c8857e4ffb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c
@@ -74,7 +74,7 @@ struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context,
 	/* This reference gets released in amdkfd_fence_release */
 	mmgrab(mm);
 	fence->mm = mm;
-	get_task_comm(fence->timeline_name, current);
+	strscpy_pad(fence->timeline_name, current->comm);
 	spin_lock_init(&fence->lock);
 	fence->svm_bo = svm_bo;
 	fence->context_id = context_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
index 4c5e38dea4c2..faf0f36d8328 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_eviction_fence.c
@@ -129,7 +129,7 @@ int amdgpu_evf_mgr_rearm(struct amdgpu_eviction_fence_mgr *evf_mgr,
 		return -ENOMEM;
 
 	ev_fence->evf_mgr = evf_mgr;
-	get_task_comm(ev_fence->timeline_name, current);
+	strscpy_pad(ev_fence->timeline_name, current->comm);
 	spin_lock_init(&ev_fence->lock);
 	dma_fence_init64(&ev_fence->base, &amdgpu_eviction_fence_ops,
 			 &ev_fence->lock, evf_mgr->ev_fence_ctx,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 6c644cfe6695..c45630457155 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -4419,7 +4419,7 @@ int amdgpu_ras_init(struct amdgpu_device *adev)
 	}
 
 	con->init_task_pid = task_pid_nr(current);
-	get_task_comm(con->init_task_comm, current);
+	strscpy_pad(con->init_task_comm, current->comm);
 
 	mutex_init(&con->critical_region_lock);
 	INIT_LIST_HEAD(&con->critical_region_head);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
index e2d5f04296e1..8fdc38d8d64d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_userq_fence.c
@@ -85,7 +85,7 @@ int amdgpu_userq_fence_driver_alloc(struct amdgpu_device *adev,
 
 	fence_drv->adev = adev;
 	fence_drv->context = dma_fence_context_alloc(1);
-	get_task_comm(fence_drv->timeline_name, current);
+	strscpy_pad(fence_drv->timeline_name, current->comm);
 
 	*fence_drv_req = fence_drv;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9ba9de16a27a..de80d0ace905 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2571,10 +2571,10 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm)
 		return;
 
 	vm->task_info->task.pid = current->pid;
-	get_task_comm(vm->task_info->task.comm, current);
+	strscpy_pad(vm->task_info->task.comm, current->comm);
 
 	vm->task_info->tgid = current->tgid;
-	get_task_comm(vm->task_info->process_name, current->group_leader);
+	strscpy_pad(vm->task_info->process_name, current->group_leader->comm);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
index 2a241a5b12c4..f8ce59d8587a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c
@@ -563,7 +563,7 @@ static int amdgpu_vram_mgr_new(struct ttm_resource_manager *man,
 	}
 
 	vres->task.pid = task_pid_nr(current);
-	get_task_comm(vres->task.comm, current);
+	strscpy_pad(vres->task.comm, current->comm);
 	list_add_tail(&vres->vres_node, &mgr->allocated_vres_list);
 
 	if (bo->flags & AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS && adjust_dcc_size) {
diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
index 68ede7a725e2..e8c5c3601bf1 100644
--- a/drivers/gpu/drm/lima/lima_ctx.c
+++ b/drivers/gpu/drm/lima/lima_ctx.c
@@ -29,7 +29,7 @@ int lima_ctx_create(struct lima_device *dev, struct lima_ctx_mgr *mgr, u32 *id)
 		goto err_out0;
 
 	ctx->pid = task_pid_nr(current);
-	get_task_comm(ctx->pname, current);
+	strscpy_pad(ctx->pname, current->comm);
 
 	return 0;
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 3a7fce428898..11936c4d3573 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -36,7 +36,7 @@ static void panfrost_gem_debugfs_bo_add(struct panfrost_device *pfdev,
 					struct panfrost_gem_object *bo)
 {
 	bo->debugfs.creator.tgid = current->tgid;
-	get_task_comm(bo->debugfs.creator.process_name, current->group_leader);
+	strscpy_pad(bo->debugfs.creator.process_name, current->group_leader->comm);
 
 	mutex_lock(&pfdev->debugfs.gems_lock);
 	list_add_tail(&bo->debugfs.node, &pfdev->debugfs.gems_list);
diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index cd49859da89b..b44fd715c17e 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -46,7 +46,7 @@ static void panthor_gem_debugfs_bo_add(struct panthor_gem_object *bo)
 						    struct panthor_device, base);
 
 	bo->debugfs.creator.tgid = current->tgid;
-	get_task_comm(bo->debugfs.creator.process_name, current->group_leader);
+	strscpy_pad(bo->debugfs.creator.process_name, current->group_leader->comm);
 
 	mutex_lock(&ptdev->gems.lock);
 	list_add_tail(&bo->debugfs.node, &ptdev->gems.node);
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 2fe04d0f0e3a..8ee9de96acf6 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3603,7 +3603,7 @@ static void group_init_task_info(struct panthor_group *group)
 	struct task_struct *task = current->group_leader;
 
 	group->task_info.pid = task->pid;
-	get_task_comm(group->task_info.comm, task);
+	strscpy_pad(group->task_info.comm, task->comm);
 }
 
 static void add_group_kbo_sizes(struct panthor_device *ptdev,
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index c33c057365f8..d2bf221e8f01 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -50,7 +50,7 @@ static void virtio_gpu_create_context_locked(struct virtio_gpu_device *vgdev,
 	} else {
 		char dbgname[TASK_COMM_LEN];
 
-		get_task_comm(dbgname, current);
+		strscpy_pad(dbgname, current->comm);
 		virtio_gpu_cmd_context_create(vgdev, vfpriv->ctx_id,
 					      vfpriv->context_init, strlen(dbgname),
 					      dbgname);
diff --git a/drivers/hwtracing/stm/core.c b/drivers/hwtracing/stm/core.c
index f48c6a8a0654..c7715439964e 100644
--- a/drivers/hwtracing/stm/core.c
+++ b/drivers/hwtracing/stm/core.c
@@ -634,7 +634,7 @@ static ssize_t stm_char_write(struct file *file, const char __user *buf,
 		char comm[sizeof(current->comm)];
 		char *ids[] = { comm, "default", NULL };
 
-		get_task_comm(comm, current);
+		strscpy_pad(comm, current->comm);
 
 		err = stm_assign_first_policy(stmf->stm, &stmf->output, ids, 1);
 		/*
diff --git a/drivers/tty/tty_audit.c b/drivers/tty/tty_audit.c
index d014af6ab060..d514a81d0a5c 100644
--- a/drivers/tty/tty_audit.c
+++ b/drivers/tty/tty_audit.c
@@ -77,7 +77,7 @@ static void tty_audit_log(const char *description, dev_t dev,
 	audit_log_format(ab, "%s pid=%u uid=%u auid=%u ses=%u major=%d minor=%d comm=",
 			 description, pid, uid, loginuid, sessionid,
 			 MAJOR(dev), MINOR(dev));
-	get_task_comm(name, current);
+	strscpy_pad(name, current->comm);
 	audit_log_untrustedstring(ab, name);
 	audit_log_format(ab, " data=");
 	audit_log_n_hex(ab, data, size);
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6c..d25922460b63 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1557,7 +1557,7 @@ static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
 	SET_UID(psinfo->pr_uid, from_kuid_munged(cred->user_ns, cred->uid));
 	SET_GID(psinfo->pr_gid, from_kgid_munged(cred->user_ns, cred->gid));
 	rcu_read_unlock();
-	get_task_comm(psinfo->pr_fname, p);
+	strscpy_pad(psinfo->pr_fname, p->comm);
 
 	return 0;
 }
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 7e3108489c83..c4d4e59ff34d 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1371,7 +1371,7 @@ static int fill_psinfo(struct elf_prpsinfo *psinfo, struct task_struct *p,
 	SET_UID(psinfo->pr_uid, from_kuid_munged(cred->user_ns, cred->uid));
 	SET_GID(psinfo->pr_gid, from_kgid_munged(cred->user_ns, cred->gid));
 	rcu_read_unlock();
-	get_task_comm(psinfo->pr_fname, p);
+	strscpy_pad(psinfo->pr_fname, p->comm);
 
 	return 0;
 }
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 90fb0c6b5f99..c8c3fbd9bfa9 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 	else if (p->flags & PF_KTHREAD)
 		get_kthread_comm(tcomm, sizeof(tcomm), p);
 	else
-		get_task_comm(tcomm, p);
+		strscpy_pad(tcomm, p->comm);
 
 	if (escape)
 		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 60d004a49a27..b6de742b1155 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2000,25 +2000,6 @@ extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec
 	__set_task_comm(tsk, from, false);		\
 })
 
-/*
- * - Why not use task_lock()?
- *   User space can randomly change their names anyway, so locking for readers
- *   doesn't make sense. For writers, locking is probably necessary, as a race
- *   condition could lead to long-term mixed results.
- *   The logic inside __set_task_comm() ensures that the task comm is
- *   always NUL-terminated and zero-padded. Therefore the race condition between
- *   reader and writer is not an issue.
- *
- * - BUILD_BUG_ON() can help prevent the buf from being truncated.
- *   Since the callers don't perform any return value checks, this safeguard is
- *   necessary.
- */
-#define get_task_comm(buf, tsk) ({			\
-	BUILD_BUG_ON(sizeof(buf) < TASK_COMM_LEN);	\
-	strscpy_pad(buf, (tsk)->comm);			\
-	buf;						\
-})
-
 static __always_inline void scheduler_ipi(void)
 {
 	/*
diff --git a/kernel/audit.c b/kernel/audit.c
index e1d489bc2dff..6fc867adbf3d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1662,7 +1662,8 @@ static void audit_log_multicast(int group, const char *op, int err)
 	audit_put_tty(tty);
 	audit_log_task_context(ab); /* subj= */
 	audit_log_format(ab, " comm=");
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_d_path_exe(ab, current->mm); /* exe= */
 	audit_log_format(ab, " nl-mcgrp=%d op=%s res=%d", group, op, !err);
 	audit_log_end(ab);
@@ -2465,7 +2466,8 @@ void audit_log_task_info(struct audit_buffer *ab)
 			 audit_get_sessionid(current));
 	audit_put_tty(tty);
 	audit_log_format(ab, " comm=");
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_d_path_exe(ab, current->mm);
 	audit_log_task_context(ab);
 }
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index ab54fccba215..8e4f70105a13 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -2877,7 +2877,8 @@ void __audit_log_nfcfg(const char *name, u8 af, unsigned int nentries,
 	audit_log_format(ab, " pid=%u", task_tgid_nr(current));
 	audit_log_task_context(ab); /* subj= */
 	audit_log_format(ab, " comm=");
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_end(ab);
 }
 EXPORT_SYMBOL_GPL(__audit_log_nfcfg);
@@ -2900,7 +2901,8 @@ static void audit_log_task(struct audit_buffer *ab)
 			 sessionid);
 	audit_log_task_context(ab);
 	audit_log_format(ab, " pid=%d comm=", task_tgid_nr(current));
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_d_path_exe(ab, current->mm);
 }
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 0323149548f6..1f04e753ca02 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2247,7 +2247,7 @@ static u16 printk_sprint(char *text, u16 size, int facility,
 static void printk_store_execution_ctx(struct printk_info *info)
 {
 	info->caller_id2 = printk_caller_id2();
-	get_task_comm(info->comm, current);
+	strscpy_pad(info->comm, current->comm);
 }
 
 static void pmsg_load_execution_ctx(struct printk_message *pmsg,
diff --git a/kernel/sys.c b/kernel/sys.c
index 62e842055cc9..1d5152d2395e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2609,7 +2609,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		proc_comm_connector(me);
 		break;
 	case PR_GET_NAME:
-		get_task_comm(comm, me);
+		strscpy_pad(comm, me->comm);
 		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
 			return -EFAULT;
 		break;
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c
index 0290dea081f6..38e16ba2de38 100644
--- a/net/bluetooth/hci_sock.c
+++ b/net/bluetooth/hci_sock.c
@@ -106,7 +106,7 @@ static bool hci_sock_gen_cookie(struct sock *sk)
 			id = 0xffffffff;
 
 		hci_pi(sk)->cookie = id;
-		get_task_comm(hci_pi(sk)->comm, current);
+		strscpy_pad(hci_pi(sk)->comm, current->comm);
 		return true;
 	}
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 87387adbca65..d859ffa2874c 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -9711,7 +9711,7 @@ static int nf_tables_fill_gen_info(struct sk_buff *skb, struct net *net,
 
 	if (nla_put_be32(skb, NFTA_GEN_ID, htonl(nft_base_seq(net))) ||
 	    nla_put_be32(skb, NFTA_GEN_PROC_PID, htonl(task_pid_nr(current))) ||
-	    nla_put_string(skb, NFTA_GEN_PROC_NAME, get_task_comm(buf, current)))
+	    nla_put_string(skb, NFTA_GEN_PROC_NAME, (strscpy_pad(buf, current->comm), buf)))
 		goto nla_put_failure;
 
 	nlmsg_end(skb, nlh);
diff --git a/security/integrity/integrity_audit.c b/security/integrity/integrity_audit.c
index d8d9e5ff1cd2..98060060929d 100644
--- a/security/integrity/integrity_audit.c
+++ b/security/integrity/integrity_audit.c
@@ -54,7 +54,8 @@ void integrity_audit_message(int audit_msgno, struct inode *inode,
 			 audit_get_sessionid(current));
 	audit_log_task_context(ab);
 	audit_log_format(ab, " op=%s cause=%s comm=", op, cause);
-	audit_log_untrustedstring(ab, get_task_comm(name, current));
+	strscpy_pad(name, current->comm);
+	audit_log_untrustedstring(ab, name);
 	if (fname) {
 		audit_log_format(ab, " name=");
 		audit_log_untrustedstring(ab, fname);
diff --git a/security/ipe/audit.c b/security/ipe/audit.c
index 93fb59fbddd6..c04901baed73 100644
--- a/security/ipe/audit.c
+++ b/security/ipe/audit.c
@@ -145,7 +145,7 @@ void ipe_audit_match(const struct ipe_eval_ctx *const ctx,
 	audit_log_format(ab, "ipe_op=%s ipe_hook=%s enforcing=%d pid=%d comm=",
 			 op, audit_hook_names[ctx->hook], READ_ONCE(enforce),
 			 task_tgid_nr(current));
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	audit_log_untrustedstring(ab, (strscpy_pad(comm, current->comm), comm));
 
 	if (ctx->file) {
 		audit_log_d_path(ab, " path=", &ctx->file->f_path);
diff --git a/security/landlock/domain.c b/security/landlock/domain.c
index 06b6bd845060..a35a27f523e6 100644
--- a/security/landlock/domain.c
+++ b/security/landlock/domain.c
@@ -101,7 +101,7 @@ static struct landlock_details *get_current_details(void)
 	memcpy(details->exe_path, path_str, path_size);
 	details->pid = get_pid(task_tgid(current));
 	details->uid = from_kuid(&init_user_ns, current_uid());
-	get_task_comm(details->comm, current);
+	strscpy_pad(details->comm, current->comm);
 	return details;
 }
 
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 737f5a263a8f..a587ffecd985 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -276,8 +276,8 @@ void audit_log_lsm_data(struct audit_buffer *ab,
 			if (pid) {
 				char tskcomm[sizeof(tsk->comm)];
 				audit_log_format(ab, " opid=%d ocomm=", pid);
-				audit_log_untrustedstring(ab,
-				    get_task_comm(tskcomm, tsk));
+				strscpy_pad(tskcomm, tsk->comm);
+				audit_log_untrustedstring(ab, tskcomm);
 			}
 		}
 		break;
@@ -417,7 +417,8 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 	char comm[sizeof(current->comm)];
 
 	audit_log_format(ab, " pid=%d comm=", task_tgid_nr(current));
-	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	strscpy_pad(comm, current->comm);
+	audit_log_untrustedstring(ab, comm);
 	audit_log_lsm_data(ab, a);
 }
 

-- 
2.54.0



* [PATCH 4/6] sched: Extend task command name to 64 bytes
From: André Almeida @ 2026-05-17 18:36 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida
In-Reply-To: <20260517-tonyk-long_name-v1-0-3c282eaa91e2@igalia.com>

The command name has been restricted to only 16 bytes, which is too
limiting, especially when debugging and tracing complex software that
runs thousands of threads which need to be told apart.

Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
long names for userspace threads as well.

To avoid buffer overflows, cap all existing userspace APIs to
TASK_COMM_LEN, and leave the full extended name for a new interface.

Co-developed-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
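
For illustration only, here is a minimal userspace sketch of the legacy
behaviour that this patch is meant to preserve: names set and read
through prctl() stay capped at 16 bytes including the trailing NUL, even
once the in-kernel buffer grows. The helper name comm_roundtrip() and
the constant LEGACY_COMM_LEN are hypothetical and exist only for this
demo; TASK_COMM_LEN itself is not exported to userspace headers.

```c
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical local copy of the kernel's TASK_COMM_LEN (16 bytes). */
#define LEGACY_COMM_LEN 16

/*
 * Set the calling thread's name to `wanted`, then read it back through
 * the legacy prctl() interface. `out` must hold LEGACY_COMM_LEN bytes.
 * Returns 0 on success, -1 if either prctl() call fails.
 */
int comm_roundtrip(const char *wanted, char out[LEGACY_COMM_LEN])
{
	/* Start from a zeroed buffer so the result is always terminated. */
	memset(out, 0, LEGACY_COMM_LEN);

	/* The kernel silently truncates names longer than 15 characters. */
	if (prctl(PR_SET_NAME, (unsigned long)wanted, 0, 0, 0) != 0)
		return -1;

	/* PR_GET_NAME copies at most 16 bytes back to the caller. */
	if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
		return -1;

	return 0;
}
```

Calling comm_roundtrip("render-worker-pool-thread-0042", out) should
leave only the first 15 characters of the requested name in out. The
full extended name would only be visible through the new interface
mentioned in the commit message, which this sketch does not touch.
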
 fs/proc/array.c       |  2 +-
 include/linux/sched.h |  3 ++-
 kernel/sys.c          | 10 +++++-----
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index c8c3fbd9bfa9..312371eddc7f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 	else if (p->flags & PF_KTHREAD)
 		get_kthread_comm(tcomm, sizeof(tcomm), p);
 	else
-		strscpy_pad(tcomm, p->comm);
+		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
 
 	if (escape)
 		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6de742b1155..f7fd2b7d131d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -323,6 +323,7 @@ struct user_event_mm;
  */
 enum {
 	TASK_COMM_LEN = 16,
+	TASK_COMM_EXT_LEN = 64,
 };
 
 extern void sched_tick(void);
@@ -1167,7 +1168,7 @@ struct task_struct {
 	 * - set it with set_task_comm() to ensure it is always
 	 *   NUL-terminated and zero-padded
 	 */
-	char				comm[TASK_COMM_LEN];
+	char				comm[TASK_COMM_EXT_LEN];
 
 	struct nameidata		*nameidata;
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1d5152d2395e..76d77218ab19 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
-	unsigned char comm[sizeof(me->comm)];
+	unsigned char comm[TASK_COMM_LEN];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			error = -EINVAL;
 		break;
 	case PR_SET_NAME:
-		comm[sizeof(me->comm) - 1] = 0;
+		comm[TASK_COMM_LEN - 1] = 0;
 		if (strncpy_from_user(comm, (char __user *)arg2,
-				      sizeof(me->comm) - 1) < 0)
+				      TASK_COMM_LEN - 1) < 0)
 			return -EFAULT;
 		set_task_comm(me, comm);
 		proc_comm_connector(me);
 		break;
 	case PR_GET_NAME:
-		strscpy_pad(comm, me->comm);
-		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
+		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
+		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))
 			return -EFAULT;
 		break;
 	case PR_GET_ENDIAN:

-- 
2.54.0

