Linux userland API discussions
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26 21:52 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>

On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Background
> > ----------
> > The primary use case is accelerating AI model loading, which demands
> > exceptionally high sequential read speeds. In our benchmarks on embedded
> > systems:
> >  - Using high-order page allocations allows the system to saturate the
> >    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> >    medium-to-low CPU frequencies.
> >  - In contrast, standard small folios cap performance at 2 GB/s.
> 
> So you're interested in optimizing the I/O speeds.  And apparently, on
> your hardware, the UFS controller has limits on scatter-gather entries
> --- UFS seems to call this Physical Region Description (PRD) table
> entries.  Per Gemini:
> 
>     1. PRD Segment & Length Limits
> 	
> 	Maximum PRD Entries: Hardware limits typically cap the number
> 	    of PRD entries (or segments) to 255 or 256 per transfer
> 	    request.
> 	
> 	Maximum Transfer Length: Each individual PRD entry typically
> 	    allows a maximum transfer size of (65,535 bytes) per segment.
> 
>     2. Host Controller Hardware Limits (UFSHCI)
>     
> 	Transfer Queue Depth: A UFS controller supports a predefined
> 	    number of outstanding task request entries. This is often
> 	    hard-capped at 32 concurrent transfer requests (slots) by the
> 	    doorbell register array.
> 	
> 	Descriptor Pre-fetch: Some UFS host controllers are
> 	   pre-configured to pre-fetch multiple PRD entries sequentially
> 	   before requiring main memory reads.
> 
> Is this an accurate description of the limits that you are trying to
> work with?  How much data are you trying to read?  Looking at Gemma 4
> models, E2B is about 10GB or 3GB for the 4-bit quantized version.  E4B
> is 15GB, or 5GB for the 4-bit quantized version.  Is that about right?
> 
> It seems... surprising that the additional I/O operations are actually
> throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> into why this is happening, and whether there is anything that can be
> optimized below the file system?

I can't give the exact size, but it's roughly between 1GB and 4GB. And,
after many tests with various tunings, it turned out that memory
allocation speed was the culprit. With 4KB pages, we couldn't get
the full bandwidth unless we ran the biggest core at the highest frequency.
Unfortunately, we can't use the core like that because of the performance
drop in other system services and the power drain.
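
To put rough numbers on the allocation cost described above, here is a
minimal sketch, not taken from the patch, that counts how many folio
allocations it takes to cache a file at a given folio order. The helper
name folio_allocs and the 4 GiB example size (the upper end of the range
above, read as binary units) are assumptions for illustration only.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE 4096ULL  /* 4KB base page, as in the message above */

/*
 * Number of folio allocations needed to cache `bytes` of file data when
 * every folio has the given order (order 0 = one 4KB page). Rounds up.
 * Hypothetical helper; it only does the arithmetic.
 */
static uint64_t folio_allocs(uint64_t bytes, unsigned int order)
{
	uint64_t folio_size = BASE_PAGE_SIZE << order;

	return (bytes + folio_size - 1) / folio_size;
}
```

Under these assumptions, folio_allocs(4ULL << 30, 0) is 1048576
allocations with 4KB pages, while order 4 (64KB folios) needs 65536 and
order 9 (2MB folios) needs 2048.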

> 
> > Problem Statement
> > -----------------
> > High-order pages become heavily fragmented and scarce shortly after
> > device boot.  We cannot afford to deplete these limited resources on
> > default filesystem operations using large folios. Instead, we need a
> > mechanism to strictly prioritize and reserve high-order allocations
> > for specific, critical payloads—specifically, large AI model files.
> 
> There's a fundamental assumption here, which is that the only use of
> high order pages is the page cache.  This doesn't take into account
> anonymous pages used by programs that aren't backed by files.  Nor does
> it take into account kernel memory allocations.
> 
> But that being said, you seem to be assuming that you can reduce the
> pressure on high order pages by only using large folios for these AI
> model files.
> 
> But the problem with using small folios is that if you want to
> actually *use* the memory, unless you want to segment out the memory
> so it can't be used for anything other than the AI models (e.g., by
> using something like hugetlbfs) it's just going to break up the memory
> into smaller folios.  So that's not actually going to *help* in actual
> real life use cases.  It might help for your artificial benchmarks /
> experiments, but in the real life case where Android applications are
> running and fragmenting all of the device memory, the large folios
> won't be available *anyway*.

Agreed, it's hard to get this done perfectly. As a best effort for this
particular AI model case, I focused on two points in time when models are loaded:
1) right after device boot, and 2) dynamic loading when required. To secure
high-order pages, for 1) I disabled the large folios consumed by EROFS, while for
2) I tried to call compact_memory before loading the model. In both cases,
I observed that we could get a fair amount of large folios. Yes, not 100% though.

> 
> > 
> > Q: Why is deregistering the inode number linked to inode deletion?
> > A: We need the high-order allocation hint to persist even if the inode is
> >  temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
> >  list of hinted inode numbers. When a file is permanently deleted, its hint
> >  becomes obsolete, requiring us to deregister it from the list to prevent memory
> >  leaks or identifier reuse conflicts.
> 
> Assuming that the high-order allocation hint is a good thing, why not
> just make it persistent?  e.g., just a *real* extended attribute
> (which is more wasteful of space), or grab a flag in the on-disk f2fs
> inode?  Then you don't need to have an in-memory list of hinted
> inodes; instead, you can just have the Android package manager set
> that flag indicating that you want that special treatment.  This is
> all assuming that we need an explicit hint, though....

I think that's doable, yes, if the explicit hint is acceptable.

> 
> > Massive AI model loading is a long-term architectural
> > paradigm. Providing a targeted VFS/filesystem hint to optimize read
> > bandwidth for specific large datasets is a highly practical,
> > repeatable pattern that addresses a systemic bottleneck in embedded
> > AI deployments.
> 
> It's really too bad you didn't propose this as a LSF/MM topic, and
> presented this at a session at Zagreb two weeks ago.  That would have
> been a much more upstream-friendly way of collaborating, and it might
> have allowed the mm experts to give you some more dynamic, real-time
> feedback.

Indeed, I have been away from LSF/MM for years due to various product issues,
not related to F2FS though. Let me make some effort to attend upcoming ones like
LPC, if I can get the budget from my company.

> 
> Cheers,
> 
> 					- Ted
> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: Steven Rostedt @ 2026-05-26 16:31 UTC (permalink / raw)
  To: David Laight
  Cc: André Almeida, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260525114241.4b6f3050@pumpkin>

On Mon, 25 May 2026 11:42:41 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> > >  	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> > > @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > >  			error = -EINVAL;
> > >  		break;
> > >  	case PR_SET_NAME:
> > > -		comm[sizeof(me->comm) - 1] = 0;
> > > +		comm[TASK_COMM_LEN - 1] = 0;
> > >  		if (strncpy_from_user(comm, (char __user *)arg2,
> > > -				      sizeof(me->comm) - 1) < 0)
> > > +				      TASK_COMM_LEN - 1) < 0)    
> > 
> > Nak - you can't do that.
> > You are reading data that the application doesn't expect you to read.  
> 
> Or have I got confused over the names...

You may have gotten confused by names, as sizeof(me->comm) is the same as
TASK_COMM_LEN. Basically, the above doesn't change anything.
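
As a small illustration of that equivalence, here is a userspace sketch.
The struct task_sketch type and the value 16 are stand-ins chosen for the
example, not the kernel's actual definitions in this series: when a field
is declared as a char array of TASK_COMM_LEN bytes, sizeof on that field
yields the same constant.

```c
#include <stddef.h>

#define TASK_COMM_LEN 16  /* assumed value, for illustration only */

/* Hypothetical stand-in for the one field of task_struct that matters here. */
struct task_sketch {
	char comm[TASK_COMM_LEN];
};

/* Returns 1 when both spellings of the buffer size agree. */
static int comm_bounds_agree(void)
{
	struct task_sketch task = { .comm = "" };
	const struct task_sketch *me = &task;

	/* sizeof on an array member is the full array size in bytes. */
	return sizeof(me->comm) == TASK_COMM_LEN;
}
```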

-- Steve


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Bart Van Assche @ 2026-05-26 16:14 UTC (permalink / raw)
  To: Theodore Tso, Jaegeuk Kim
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>

On 5/26/26 6:42 AM, Theodore Tso wrote:
> It seems... surprising that the additional I/O operations are actually
> throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> into why this is happening, and whether there is anything that can be
> optimized below the file system?
The layers below the filesystem (block, SCSI, UFS) are what I'm
responsible for in the Pixel team, and I can assure you that these are
highly optimized.

Since the transfer size used in Jaegeuk's tests is much larger than 4
KiB, the number of CPU cycles used per I/O by the layers below the
filesystem is not what limits the transfer bandwidth.

Bart.


* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Jens Axboe @ 2026-05-26 15:58 UTC (permalink / raw)
  To: Christoph Hellwig, demiobenour
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jakub Kicinski, Simon Horman,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Jonathan Corbet,
	Shuah Khan, Eric Biggers, Ard Biesheuvel, linux-crypto,
	linux-kernel, io-uring, netdev, linux-perf-users, linux-doc,
	Toke Høiland-Jørgensen, linux-api
In-Reply-To: <ahQCZQNoyO8GQt3H@infradead.org>

On 5/25/26 2:03 AM, Christoph Hellwig wrote:
> On Sat, May 23, 2026 at 03:43:02PM -0400, Demi Marie Obenour via B4 Relay wrote:
>> From: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
>> It can be removed entirely at the cost of only supporting synchronous
>> operations.  This doesn't break userspace, which will silently block
>> (for a bounded amount of time) in io_submit instead of operating
>> asynchronously.
>>
>> This also makes struct msghdr smaller, helping every other caller of
>> sendmsg().
> 
> So we just had a discussion at LLC about how networking needs to support
> AIO better for zero copy.
> 
> The current TCP zerocopy implementation provides completion notification
> through the socket error code, which is freaking weird and doesn't
> integrate well with either io_uring or in-kernel callers.

We already have that via io_uring, and without needing msg_kiocb or the
(very) weird socket error code retrieving.

-- 
Jens Axboe


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Theodore Tso @ 2026-05-26 13:42 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahTzHyHBL8t0iNBR@google.com>

On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Background
> ----------
> The primary use case is accelerating AI model loading, which demands
> exceptionally high sequential read speeds. In our benchmarks on embedded
> systems:
>  - Using high-order page allocations allows the system to saturate the
>    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
>    medium-to-low CPU frequencies.
>  - In contrast, standard small folios cap performance at 2 GB/s.

So you're interested in optimizing the I/O speeds.  And apparently, on
your hardware, the UFS controller has limits on scatter-gather entries
--- UFS seems to call this Physical Region Description (PRD) table
entries.  Per Gemini:

    1. PRD Segment & Length Limits
	
	Maximum PRD Entries: Hardware limits typically cap the number
	    of PRD entries (or segments) to 255 or 256 per transfer
	    request.
	
	Maximum Transfer Length: Each individual PRD entry typically
	    allows a maximum transfer size of (65,535 bytes) per segment.

    2. Host Controller Hardware Limits (UFSHCI)
    
	Transfer Queue Depth: A UFS controller supports a predefined
	    number of outstanding task request entries. This is often
	    hard-capped at 32 concurrent transfer requests (slots) by the
	    doorbell register array.
	
	Descriptor Pre-fetch: Some UFS host controllers are
	   pre-configured to pre-fetch multiple PRD entries sequentially
	   before requiring main memory reads.

Is this an accurate description of the limits that you are trying to
work with?  How much data are you trying to read?  Looking at Gemma 4
models, E2B is about 10GB or 3GB for the 4-bit quantized version.  E4B
is 15GB, or 5GB for the 4-bit quantized version.  Is that about right?

It seems... surprising that the additional I/O operations are actually
throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
into why this is happening, and whether there is anything that can be
optimized below the file system?

> Problem Statement
> -----------------
> High-order pages become heavily fragmented and scarce shortly after
> device boot.  We cannot afford to deplete these limited resources on
> default filesystem operations using large folios. Instead, we need a
> mechanism to strictly prioritize and reserve high-order allocations
> for specific, critical payloads—specifically, large AI model files.

There's a fundamental assumption here, which is that the only use of
high order pages is the page cache.  This doesn't take into account
anonymous pages used by programs that aren't backed by files.  Nor does
it take into account kernel memory allocations.

But that being said, you seem to be assuming that you can reduce the
pressure on high order pages by only using large folios for these AI
model files.

But the problem with using small folios is that if you want to
actually *use* the memory, unless you want to segment out the memory
so it can't be used for anything other than the AI models (e.g., by
using something like hugetlbfs) it's just going to break up the memory
into smaller folios.  So that's not actually going to *help* in actual
real life use cases.  It might help for your artificial benchmarks /
experiments, but in the real life case where Android applications are
running and fragmenting all of the device memory, the large folios
won't be available *anyway*.

> 
> Q: Why is deregistering the inode number linked to inode deletion?
> A: We need the high-order allocation hint to persist even if the inode is
>  temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
>  list of hinted inode numbers. When a file is permanently deleted, its hint
>  becomes obsolete, requiring us to deregister it from the list to prevent memory
>  leaks or identifier reuse conflicts.

Assuming that the high-order allocation hint is a good thing, why not
just make it persistent?  e.g., just a *real* extended attribute
(which is more wasteful of space), or grab a flag in the on-disk f2fs
inode?  Then you don't need to have an in-memory list of hinted
inodes; instead, you can just have the Android package manager set
that flag indicating that you want that special treatment.  This is
all assuming that we need an explicit hint, though....

> Massive AI model loading is a long-term architectural
> paradigm. Providing a targeted VFS/filesystem hint to optimize read
> bandwidth for specific large datasets is a highly practical,
> repeatable pattern that addresses a systemic bottleneck in embedded
> AI deployments.

It's really too bad you didn't propose this as a LSF/MM topic, and
presented this at a session at Zagreb two weeks ago.  That would have
been a much more upstream-friendly way of collaborating, and it might
have allowed the mm experts to give you some more dynamic, real-time
feedback.

Cheers,

					- Ted


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  4:12 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Theodore Tso, linux-api, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, Christoph Hellwig, linux-mm, linux-fsdevel,
	Akilesh Kailash, Christian Brauner
In-Reply-To: <8a42abed-8289-44ec-a144-dfe531a4af71@infradead.org>

On 05/25, Randy Dunlap wrote:
> 
> 
> On 5/25/26 6:10 PM, Jaegeuk Kim wrote:
> > On 05/22, Theodore Tso wrote:
> >> On Fri, May 22, 2026 at 05:08:41PM +0000, Jaegeuk Kim wrote:
> >>>
> >>> Thank you for the explanation. It seems I made a wrong assumption on the
> >>> usage of "user." prefix where each filesystem can support in different
> >>> ways.
> >>
> >> The "user." prefix is used by all userspace applications that wish to
> >> store extended attributes.  For example, user.mime_type,
> >> user.xdg.origin_url, user.charset, user.apache_handler, etc
> >>
> >> For more information, see:
> >>
> >>     https://www.freedesktop.org/wiki/CommonExtendedAttribute
> >>     https://wiki.archlinux.org/title/Extended_attributes
> >>
> >> I certainly assumed this was common knowledge across all file system
> >> maintainers, but this was apparently not true in your case.  I don't
> >> know how this could be the case given that f2fs implements extended
> >> attributes, and I would have thought you would have known that when
> >> testing that feature.
> >>
> >>> I shared some motivation when replying to Darrick's feedback [1], but yes,
> >>> it was not enough for all heads-up. The problem started that some specific
> >>> application needs as many high-order pages as possible mostly for reads. So,
> >>> I thought we can turn on large folio on the specific files per hints. One way
> >>> for the hints was using immutable bit, but it turned out it's very hard to
> >>> manage disabling the bit whenever deleting the files. Along with limited
> >>> ioctl() and requiring inode eviction to manage large folio activation, I had
> >>> to implement this path.
> >>>
> >>> [1] https://lore.kernel.org/lkml/aeA5C8byIpXWla7f@google.com/
> >>
> >> Actually, you still haven't explained your use case, at least, not
> >> well enough for me to understand what you are trying to do.
> >>
> >> So an application wants a particular file to use as many high-order
> >> pages as possible.  Why?  What sort of guarantees do you need to
> >> provide?  What happens if they can't be provided?  What happens if a
> >> possibly malicious, or at least greedy, application uses this
> >> interface to grab a lot of high-order pages?
> >>
> >> From your patch:
> >>
> >> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> >>  -> register the inode number for large folio
> >> 2. chmod(0400, file)
> >>  -> make Read-Only
> >> 3. open()
> >>  -> f2fs_iget() with large folio
> >> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
> >>  -> return error
> >> 5. iput() and open()
> >>  -> goto #3
> >> 6. unlink
> >>  -> deregister the inode number
> >>
> >> Why should making the file read-only matter?  And when you say
> >> "deregister the inode number", why should this be related to deleting
> >> the inode?
> >>
> >> This is an interface which seems to be very specific to your use case.
> >> What if those requirements change over time?  What if you want to pull in
> >> a file without making it be read-only?  And what if you want to
> >> release the large-order pages without deleting the file?
> > 
> > Let me try to write more details, helped with Gemini.
> 
> [as an interested reader:]
> 
> If this idea is so good, why shouldn't it be done in the VFS/MM so that
> other filesystems could do the same thing instead of just in f2fs?

Thanks for the feedback. I'm really open, but I'm just trying to understand
whether it's good or not. If it's really that bad, I'm ready to drop it, even the
ioctl approach, even though I have already prepared its implementation.

>
> 
> -- 
> ~Randy
> 


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  3:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahUF7HqSKFJ422bU@casper.infradead.org>

On 05/26, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 01:21:33AM +0000, Jaegeuk Kim wrote:
> > On 05/24, Christoph Hellwig wrote:
> > > On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> > > > This was a quick buddyinfo right after booting the device.
> > > > 
> > > > Before:
> > > > Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> > > > 
> > > > After disabling EROFS large folio:
> > > > Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856
> > > 
> > > > And what are you trying to tell us with that?
> > 
> > This means, high-order pages were used up by EROFS which sets large folio by
> > default. So, I wanted to say the concern was based on actual data which was what
> > Matthew asked.
> 
> This isn't that though.  What you actually need is to show that high order
> allocations are _failing_.  The MM is far more complicated than you seem
> to understand.  There isn't a fixed number of large folios available;
> when we try to allocate memory, we do reclaim.  And if there's large
> folios on the LRU list, you'll get them.
> 
> If what you want is large folios readily available, then what you want
> is large folios used _everywhere_ because then they're easy to get!
> If there's small folios in use, you need to reclaim a lot of memory in
> order to reassemble large folios (it's the birthday paradox, similar to
> the hash collision problem).

Thanks for the feedback. Actually, I tried running compact_memory before the
read() for AI model loading, but I got complaints that compact_memory took
hundreds of milliseconds to run. Is there a good way to secure high-order pages
before that read()? It is quite hard to predict when it will happen.
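
For reference, a minimal sketch of how such a compaction run can be
triggered and timed from userspace. It assumes the standard sysctl file
/proc/sys/vm/compact_memory (root only, kernel built with
CONFIG_COMPACTION); the helper name trigger_compaction_ms is made up for
this example, and the path is a parameter so the helper can be exercised
against any file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Write "1" to a compaction trigger file and return the time the write
 * took in milliseconds, or -1 on failure. On a real system the path
 * would be /proc/sys/vm/compact_memory. Hypothetical helper.
 */
static long trigger_compaction_ms(const char *path)
{
	struct timespec start, end;
	ssize_t written;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	written = write(fd, "1", 1);
	clock_gettime(CLOCK_MONOTONIC, &end);
	close(fd);

	if (written != 1)
		return -1;

	return (long)(end.tv_sec - start.tv_sec) * 1000L +
	       (end.tv_nsec - start.tv_nsec) / 1000000L;
}
```

Calling trigger_compaction_ms("/proc/sys/vm/compact_memory") right before
the model read() would give a per-device figure to compare against the
"hundreds of milliseconds" reported above.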



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Randy Dunlap @ 2026-05-26  3:35 UTC (permalink / raw)
  To: Jaegeuk Kim, Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahTzHyHBL8t0iNBR@google.com>



On 5/25/26 6:10 PM, Jaegeuk Kim wrote:
> On 05/22, Theodore Tso wrote:
>> On Fri, May 22, 2026 at 05:08:41PM +0000, Jaegeuk Kim wrote:
>>>
>>> Thank you for the explanation. It seems I made a wrong assumption on the
>>> usage of "user." prefix where each filesystem can support in different
>>> ways.
>>
>> The "user." prefix is used by all userspace applications that wish to
>> store extended attributes.  For example, user.mime_type,
>> user.xdg.origin_url, user.charset, user.apache_handler, etc
>>
>> For more information, see:
>>
>>     https://www.freedesktop.org/wiki/CommonExtendedAttribute
>>     https://wiki.archlinux.org/title/Extended_attributes
>>
>> I certainly assumed this was common knowledge across all file system
>> maintainers, but this was apparently not true in your case.  I don't
>> know how this could be the case given that f2fs implements extended
>> attributes, and I would have thought you would have known that when
>> testing that feature.
>>
>>> I shared some motivation when replying to Darrick's feedback [1], but yes,
>>> it was not enough for all heads-up. The problem started that some specific
>>> application needs as many high-order pages as possible mostly for reads. So,
>>> I thought we can turn on large folio on the specific files per hints. One way
>>> for the hints was using immutable bit, but it turned out it's very hard to
>>> manage disabling the bit whenever deleting the files. Along with limited
>>> ioctl() and requiring inode eviction to manage large folio activation, I had
>>> to implement this path.
>>>
>>> [1] https://lore.kernel.org/lkml/aeA5C8byIpXWla7f@google.com/
>>
>> Actually, you still haven't explained your use case, at least, not
>> well enough for me to understand what you are trying to do.
>>
>> So an application wants a particular file to use as many high-order
>> pages as possible.  Why?  What sort of guarantees do you need to
>> provide?  What happens if they can't be provided?  What happens if a
>> possibly malicious, or at least greedy, application uses this
>> interface to grab a lot of high-order pages?
>>
>> From your patch:
>>
>> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
>>  -> register the inode number for large folio
>> 2. chmod(0400, file)
>>  -> make Read-Only
>> 3. open()
>>  -> f2fs_iget() with large folio
>> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
>>  -> return error
>> 5. iput() and open()
>>  -> goto #3
>> 6. unlink
>>  -> deregister the inode number
>>
>> Why should making the file read-only matter?  And when you say
>> "deregister the inode number", why should this be related to deleting
>> the inode?
>>
>> This is an interface which seems to be very specific to your use case.
>> What if those requirements change over time?  What if you want to pull in
>> a file without making it be read-only?  And what if you want to
>> release the large-order pages without deleting the file?
> 
> Let me try to write more details, helped with Gemini.

[as an interested reader:]

If this idea is so good, why shouldn't it be done in the VFS/MM so that
other filesystems could do the same thing instead of just in f2fs?


-- 
~Randy



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  3:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahUG3ZCnc1RQ0EL_@casper.infradead.org>

On 05/26, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Let me try to write more details, helped with Gemini.
> 
> This is garbage, and frankly disrespectful.  I'm not going to argue with
> your AI bot.

I wrote it all down myself and Gemini rephrased it a bit. Which points make you
feel that way?



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-05-26  2:35 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahTzHyHBL8t0iNBR@google.com>

On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Let me try to write more details, helped with Gemini.

This is garbage, and frankly disrespectful.  I'm not going to argue with
your AI bot.


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-05-26  2:31 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Christoph Hellwig, Theodore Tso, linux-api, linux-kernel,
	linux-f2fs-devel, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahT1nT3xsMGkyJab@google.com>

On Tue, May 26, 2026 at 01:21:33AM +0000, Jaegeuk Kim wrote:
> On 05/24, Christoph Hellwig wrote:
> > On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> > > This was a quick buddyinfo right after booting the device.
> > > 
> > > Before:
> > > Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> > > 
> > > After disabling EROFS large folio:
> > > Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856
> > 
> > > And what are you trying to tell us with that?
> 
> This means, high-order pages were used up by EROFS which sets large folio by
> default. So, I wanted to say the concern was based on actual data which was what
> Matthew asked.

This isn't that though.  What you actually need is to show that high order
allocations are _failing_.  The MM is far more complicated than you seem
to understand.  There isn't a fixed number of large folios available;
when we try to allocate memory, we do reclaim.  And if there's large
folios on the LRU list, you'll get them.

If what you want is large folios readily available, then what you want
is large folios used _everywhere_ because then they're easy to get!
If there's small folios in use, you need to reclaim a lot of memory in
order to reassemble large folios (it's the birthday paradox, similar to
the hash collision problem).


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  1:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Tso, linux-api, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahPffhaOi2CBtWof@infradead.org>

On 05/24, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> > This was a quick buddyinfo right after booting the device.
> > 
> > Before:
> > Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> > 
> > After disabling EROFS large folio:
> > Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856
> 
> And what are you trying to tell us with that?

This means high-order pages were used up by EROFS, which enables large folios by
default. So, I wanted to say the concern was based on actual data, which was what
Matthew asked for.
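
As a rough aid for reading the buddyinfo lines quoted above, here is a
minimal sketch that sums the free base pages in one /proc/buddyinfo line.
It relies only on the documented format, where column i counts free
blocks of 2^i pages. The helper name buddy_free_pages and the min_order
cutoff are assumptions made for this example, and the result is only a
snapshot of free memory at one instant.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sum the free base pages held in blocks of at least min_order, given
 * one line in /proc/buddyinfo format. Column i counts free blocks of
 * 2^i pages. Returns 0 if the line does not parse. Hypothetical helper.
 */
static unsigned long buddy_free_pages(const char *line, int min_order)
{
	char zone[32];
	int node, pos = 0, order = 0;
	unsigned long pages = 0;
	const char *p;
	char *end;

	/* Skip the "Node N, zone NAME" prefix and remember where it ends. */
	if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &pos) != 2)
		return 0;

	p = line + pos;
	for (;;) {
		unsigned long blocks = strtoul(p, &end, 10);

		if (end == p)	/* no more numbers on the line */
			break;
		if (order >= min_order)
			pages += blocks << order;
		order++;
		p = end;
	}
	return pages;
}
```

Passing min_order = 0 gives the total number of free 4KB pages on the
line; a higher cutoff restricts the sum to blocks that could back a folio
of at least that order without compaction or reclaim.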



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  1:10 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <20260522224108.GA18663@macsyma-wired.lan>

On 05/22, Theodore Tso wrote:
> On Fri, May 22, 2026 at 05:08:41PM +0000, Jaegeuk Kim wrote:
> > 
> > Thank you for the explanation. It seems I made a wrong assumption on the
> > usage of "user." prefix where each filesystem can support in different
> > ways.
> 
> The "user." prefix is used by all userspace applications that wish to
> store extended attributes.  For example, user.mime_type,
> user.xdg.origin_url, user.charset, user.apache_handler, etc
> 
> For more information, see:
> 
>     https://www.freedesktop.org/wiki/CommonExtendedAttribute
>     https://wiki.archlinux.org/title/Extended_attributes
> 
> I certainly assumed this was common knowledge across all file system
> maintainers, but this was apparently not true in your case.  I don't
> know how this could be the case given that f2fs implements extended
> attributes, and I would have thought you would have known that when
> testing that feature.
> 
> > I shared some motivation when replying to Darrick's feedback [1], but yes,
> > it was not enough of a heads-up for everyone. The problem is that a specific
> > application needs as many high-order pages as possible, mostly for reads. So
> > I thought we could turn on large folios for specific files based on hints.
> > One way to pass the hint was the immutable bit, but it turned out to be very
> > hard to manage clearing the bit whenever the files are deleted. Given that,
> > plus the limited ioctl() access and the need for inode eviction to activate
> > large folios, I had to implement this path.
> > 
> > [1] https://lore.kernel.org/lkml/aeA5C8byIpXWla7f@google.com/
> 
> Actually, you still haven't explained your use case, at least, not
> well enough for me to understand what you are trying to do.
> 
> So an application wants a particular file to use as many high-order
> pages as possible.  Why?  What sort of guarantees do you need to
> provide?  What happens if they can't be provided?  What happens if a
> > possibly malicious, or at least greedy, application uses this
> interface to grab a lot of high-order pages?
> 
> > From your patch:
> 
> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
>  -> register the inode number for large folio
> 2. chmod(0400, file)
>  -> make Read-Only
> 3. open()
>  -> f2fs_iget() with large folio
> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
>  -> return error
> 5. iput() and open()
>  -> goto #3
> 6. unlink
>  -> deregister the inode number
> 
> Why should making the file read-only matter?  And when you say
> "deregister the inode number", why should this be related to deleting
> the inode?
> 
> This is an interface which seems to be very specific to your use case.
> What if those requirements change over time?  What if you want to pull in
> a file without making it read-only?  And what if you want to
> release the large-order pages without deleting the file?

Let me try to give more detail, with help from Gemini.

Background
----------
The primary use case is accelerating AI model loading, which demands
exceptionally high sequential read speeds. In our benchmarks on embedded
systems:
 - Using high-order page allocations allows the system to saturate the
   Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
   medium-to-low CPU frequencies.
 - In contrast, standard small folios cap performance at 2 GB/s.

The performance doubling stems directly from reducing CPU cycle overhead during
memory allocation.

Problem Statement
-----------------
High-order pages become heavily fragmented and scarce shortly after device boot.
We cannot afford to deplete these limited resources on default filesystem
operations using large folios. Instead, we need a mechanism to strictly
prioritize and reserve high-order allocations for specific, critical
payloads, namely large AI model files.

Design Principles
-----------------
 - Best-Effort Allocation: The system guarantees no fixed number of
 high-order pages. Allocation falls back gracefully from Order-10 down to
 Order-0 based on current memory availability.

 - Standard Page Cache Lifecycle: No custom or rigid memory management is
 introduced. These folios remain fully under the control of the Memory
 Management (MM) subsystem and can be reclaimed via the Least Recently
 Used (LRU) mechanism at any time.

 - Read-Only Optimization: To minimize code complexity (e.g., handling
 writeback, compression, and concurrency), this high-order allocation mechanism
 is strictly restricted to read-only files. The vast majority of performance
 gains are derived from read operations.

Questions
---------
Q: Why does an application require a specific file to utilize as many high-order
pages as possible?
A: It significantly boosts sequential read bandwidth in resource-constrained
 embedded systems by reducing the CPU overhead associated with page allocation
 during high-throughput I/O.

Q: What sort of guarantees does this mechanism need to provide?
A: No hard guarantees are provided. The filesystem provides a best-effort
 mechanism to attempt high-order page allocations for flagged inodes while the
 filesystem is mounted.

Q: What is the fallback behavior if high-order pages cannot be allocated?
A: The system treats the configuration as a performance hint. If high-order
 pages are unavailable, it seamlessly falls back to standard small folios.
 Functional behavior remains entirely unchanged.

Q: Why is restricting the implementation to read-only files necessary?
A: Limiting the scope to read-only files bypasses the architectural complexities
 of managing writes, dirtying pages, and compression in large folios, while
 still capturing the core performance benefits of high-speed sequential reads.

Q: What mitigations prevent a malicious or greedy application from abusing this
 interface to monopolize high-order pages?
A: The interface acts purely as a hint to the allocation path. Because it falls
 back to small folios when memory is tight, it poses no greater systemic risk
 than existing large-folio implementations used by other filesystems. Standard
 MM eviction and LRU paths remain fully active.

Q: Why is deregistering the inode number linked to inode deletion?
A: We need the high-order allocation hint to persist even if the inode is
 temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
 list of hinted inode numbers. When a file is permanently deleted, its hint
 becomes obsolete, requiring us to deregister it from the list to prevent memory
 leaks or identifier reuse conflicts.

Q: How can an application release these large-order pages without deleting the
 file?
A: Pages allocated via this mechanism receive no special status in the page
 cache. They are managed by standard LRU logic and can be explicitly released by
 the user at any time using existing system calls, such as
 posix_fadvise(..., POSIX_FADV_DONTNEED).

Q: This interface seems highly tailored to a specific use case. What happens if
 these requirements evolve over time?
A: Massive AI model loading is a long-term architectural paradigm. Providing a
 targeted VFS/filesystem hint to optimize read bandwidth for specific large
 datasets is a highly practical, repeatable pattern that addresses a systemic
 bottleneck in embedded AI deployments.

> 
> 						- Ted

^ permalink raw reply

* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: David Laight @ 2026-05-25 10:42 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260525114107.7fa5b4c1@pumpkin>

On Mon, 25 May 2026 11:41:07 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> On Sun, 24 May 2026 19:38:54 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
> > Command name has been restricted to only 16 bytes, which is too limiting,
> > especially when debugging and tracing complex software with thousands of
> > threads and the need to differentiate them.
> > 
> > Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
> > Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
> > long names for userspace threads as well.
> > 
> > To avoid buffer overflows, cap all existing userspace APIs to
> > TASK_COMM_LEN, and leave the full extended name for a new interface.
> > 
> > Co-developed-by: Bhupesh <bhupesh@igalia.com>
> > Signed-off-by: Bhupesh <bhupesh@igalia.com>
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >  fs/proc/array.c       |  2 +-
> >  include/linux/sched.h |  3 ++-
> >  kernel/sys.c          | 10 +++++-----
> >  3 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index c8c3fbd9bfa9..312371eddc7f 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
> >  	else if (p->flags & PF_KTHREAD)
> >  		get_kthread_comm(tcomm, sizeof(tcomm), p);
> >  	else
> > -		strscpy_pad(tcomm, p->comm);
> > +		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
> >  
> >  	if (escape)
> >  		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index b6de742b1155..f7fd2b7d131d 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -323,6 +323,7 @@ struct user_event_mm;
> >   */
> >  enum {
> >  	TASK_COMM_LEN = 16,
> > +	TASK_COMM_EXT_LEN = 64,
> >  };
> >  
> >  extern void sched_tick(void);
> > @@ -1167,7 +1168,7 @@ struct task_struct {
> >  	 * - set it with set_task_comm() to ensure it is always
> >  	 *   NUL-terminated and zero-padded
> >  	 */
> > -	char				comm[TASK_COMM_LEN];
> > +	char				comm[TASK_COMM_EXT_LEN];
> >  
> >  	struct nameidata		*nameidata;
> >  
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 1d5152d2395e..76d77218ab19 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >  		unsigned long, arg4, unsigned long, arg5)
> >  {
> >  	struct task_struct *me = current;
> > -	unsigned char comm[sizeof(me->comm)];
> > +	unsigned char comm[TASK_COMM_LEN];
> >  	long error;
> >  
> >  	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> > @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >  			error = -EINVAL;
> >  		break;
> >  	case PR_SET_NAME:
> > -		comm[sizeof(me->comm) - 1] = 0;
> > +		comm[TASK_COMM_LEN - 1] = 0;
> >  		if (strncpy_from_user(comm, (char __user *)arg2,
> > -				      sizeof(me->comm) - 1) < 0)
> > +				      TASK_COMM_LEN - 1) < 0)  
> 
> Nak - you can't do that.
> You are reading data that the application doesn't expect you to read.

Or have I got confused over the names...

-- David

> 
> >  			return -EFAULT;
> >  		set_task_comm(me, comm);
> >  		proc_comm_connector(me);
> >  		break;
> >  	case PR_GET_NAME:
> > -		strscpy_pad(comm, me->comm);
> > -		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
> > +		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
> > +		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))  
> 
> Double-nak - you are writing beyond the end of the applications buffer.
> 
> You can't change the user memory that the syscalls access.
> 
> You can support the longer name for read/write of /proc/self/comm.
> 
> -- David
> 
> >  			return -EFAULT;
> >  		break;
> >  	case PR_GET_ENDIAN:
> >   
> 


^ permalink raw reply

* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: David Laight @ 2026-05-25 10:41 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260524-tonyk-long_name-v2-4-332f6bd041c4@igalia.com>

On Sun, 24 May 2026 19:38:54 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Command name has been restricted to only 16 bytes, which is too limiting,
> especially when debugging and tracing complex software with thousands of
> threads and the need to differentiate them.
> 
> Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
> Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
> long names for userspace threads as well.
> 
> To avoid buffer overflows, cap all existing userspace APIs to
> TASK_COMM_LEN, and leave the full extended name for a new interface.
> 
> Co-developed-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  fs/proc/array.c       |  2 +-
>  include/linux/sched.h |  3 ++-
>  kernel/sys.c          | 10 +++++-----
>  3 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index c8c3fbd9bfa9..312371eddc7f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
>  	else if (p->flags & PF_KTHREAD)
>  		get_kthread_comm(tcomm, sizeof(tcomm), p);
>  	else
> -		strscpy_pad(tcomm, p->comm);
> +		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
>  
>  	if (escape)
>  		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b6de742b1155..f7fd2b7d131d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -323,6 +323,7 @@ struct user_event_mm;
>   */
>  enum {
>  	TASK_COMM_LEN = 16,
> +	TASK_COMM_EXT_LEN = 64,
>  };
>  
>  extern void sched_tick(void);
> @@ -1167,7 +1168,7 @@ struct task_struct {
>  	 * - set it with set_task_comm() to ensure it is always
>  	 *   NUL-terminated and zero-padded
>  	 */
> -	char				comm[TASK_COMM_LEN];
> +	char				comm[TASK_COMM_EXT_LEN];
>  
>  	struct nameidata		*nameidata;
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 1d5152d2395e..76d77218ab19 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
>  	struct task_struct *me = current;
> -	unsigned char comm[sizeof(me->comm)];
> +	unsigned char comm[TASK_COMM_LEN];
>  	long error;
>  
>  	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			error = -EINVAL;
>  		break;
>  	case PR_SET_NAME:
> -		comm[sizeof(me->comm) - 1] = 0;
> +		comm[TASK_COMM_LEN - 1] = 0;
>  		if (strncpy_from_user(comm, (char __user *)arg2,
> -				      sizeof(me->comm) - 1) < 0)
> +				      TASK_COMM_LEN - 1) < 0)

Nak - you can't do that.
You are reading data that the application doesn't expect you to read.

>  			return -EFAULT;
>  		set_task_comm(me, comm);
>  		proc_comm_connector(me);
>  		break;
>  	case PR_GET_NAME:
> -		strscpy_pad(comm, me->comm);
> -		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
> +		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
> +		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))

Double-nak - you are writing beyond the end of the applications buffer.

You can't change the user memory that the syscalls access.

You can support the longer name for read/write of /proc/self/comm.

-- David

>  			return -EFAULT;
>  		break;
>  	case PR_GET_ENDIAN:
> 
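
For reference, the existing userspace contract under discussion can be
sketched as follows. The sketch uses only the long-standing PR_SET_NAME and
PR_GET_NAME options, not the proposed extended ones. The COMM_BUF_LEN macro
and the function name are local to this example. The point is that existing
callers pass 16-byte buffers, so the kernel must never read or write more
than that through these two options.

```c
#include <string.h>
#include <sys/prctl.h>

#define COMM_BUF_LEN 16	/* same value as TASK_COMM_LEN in the quoted patch */

/*
 * Set the calling thread's name, then read it back through the legacy
 * interface. 'out' must have room for COMM_BUF_LEN bytes.
 * Returns 0 on success, -1 on failure (errno is set by prctl()).
 */
int set_and_get_name(const char *name, char out[COMM_BUF_LEN])
{
	/* Names longer than 15 bytes are silently truncated by the kernel. */
	if (prctl(PR_SET_NAME, name, 0, 0, 0) < 0)
		return -1;

	memset(out, 0, COMM_BUF_LEN);
	if (prctl(PR_GET_NAME, out, 0, 0, 0) < 0)
		return -1;
	return 0;
}
```

With a 33-character input such as the LONG_NAME string from the selftest in
this series, the name read back is the first 15 characters plus a NUL, which
is what the LONG_NAME_CAP constant in that selftest encodes.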


^ permalink raw reply

* Re: [PATCH v2 2/6] treewide: Get rid of get_task_comm()
From: David Laight @ 2026-05-25 10:34 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260524-tonyk-long_name-v2-2-332f6bd041c4@igalia.com>

On Sun, 24 May 2026 19:38:52 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Since commit 4cc0473d7754 ("get rid of __get_task_comm()"),
> get_task_comm() just does a redundant buffer size check and calls
> strscpy_pad(). Replace get_task_comm() calls with strscpy_pad(), which will
> do the right thing if the buffer sizes don't match: zero-pad if it's
> bigger, and truncate if it's smaller.
> 
> Link: https://lore.kernel.org/lkml/CAHk-=wi5c=_-FBGo_88CowJd_F-Gi6Ud9d=TALm65ReN7YjrMw@mail.gmail.com/
> Co-developed-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
... 
> -/*
> - * - Why not use task_lock()?
> - *   User space can randomly change their names anyway, so locking for readers
> - *   doesn't make sense. For writers, locking is probably necessary, as a race
> - *   condition could lead to long-term mixed results.
> - *   The logic inside __set_task_comm() ensures that the task comm is
> - *   always NUL-terminated and zero-padded. Therefore the race condition between
> - *   reader and writer is not an issue.
> - *
> - * - BUILD_BUG_ON() can help prevent the buf from being truncated.
> - *   Since the callers don't perform any return value checks, this safeguard is
> - *   necessary.
> - */
> -#define get_task_comm(buf, tsk) ({			\
> -	BUILD_BUG_ON(sizeof(buf) < TASK_COMM_LEN);	\
> -	strscpy_pad(buf, (tsk)->comm);			\
> -	buf;						\
> -})
> -

I don't think it is worth the churn of removing this wrapper.
The calls can be optimised based on the knowledge that tsk->comm
is always '\0' terminated and can be assumed to be padded.
(A read mid-update might give an unpadded result, but that doesn't
matter because it can only 'leak' part of an old name.)

-- David
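
To make the "zero-pad if it's bigger, truncate if it's smaller" behaviour in
the quoted commit message concrete, here is a userspace approximation. It is
not the kernel implementation, and copy_pad() is a name invented for this
sketch. It only mirrors the documented contract of strscpy_pad(): always
NUL-terminate, zero the tail, and return -E2BIG on truncation.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy 'src' into 'dst' (capacity 'dst_size'), always NUL-terminating
 * and zero-filling any unused tail of 'dst'.
 * Returns the number of bytes copied (excluding the NUL), or -E2BIG
 * if 'src' had to be truncated or 'dst_size' is 0.
 */
ssize_t copy_pad(char *dst, size_t dst_size, const char *src)
{
	size_t len;

	if (dst_size == 0)
		return -E2BIG;

	len = strnlen(src, dst_size);
	if (len == dst_size) {
		/* Source does not fit: keep what fits and terminate. */
		memcpy(dst, src, dst_size - 1);
		dst[dst_size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len);
	memset(dst + len, 0, dst_size - len);	/* pad, including the NUL */
	return (ssize_t)len;
}
```

The padding matters for buffers that are later copied out whole, for example
with copy_to_user(), because it ensures no stale bytes follow the name.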

^ permalink raw reply

* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Christoph Hellwig @ 2026-05-25  8:03 UTC (permalink / raw)
  To: demiobenour
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel, linux-crypto, linux-kernel, io-uring, netdev,
	linux-perf-users, linux-doc, Toke Høiland-Jørgensen,
	linux-api
In-Reply-To: <20260523-af-alg-harden-v1-1-c76755c3a5c5@gmail.com>

On Sat, May 23, 2026 at 03:43:02PM -0400, Demi Marie Obenour via B4 Relay wrote:
> From: Demi Marie Obenour <demiobenour@gmail.com>
> 
> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
> It can be removed entirely at the cost of only supporting synchronous
> operations.  This doesn't break userspace, which will silently block
> (for a bounded amount of time) in io_submit instead of operating
> asynchronously.
> 
> This also makes struct msghdr smaller, helping every other caller of
> sendmsg().

So we just had a discussion at LLC about how networking needs to support
AIO better for zero copy.

The current TCP zerocopy implementation provides completion notification
through the socket error code, which is freaking weird and doesn't
integrate well with either io_uring or in-kernel callers.

So we really want to pass the iocb down into networking and have it
call ki_complete on completion, with something higher up in the stack
adding that to the error queue for the legacy user interface.

Now I'm not sure if we wouldn't be better off passing that iocb
explicitly instead of in a weird hidden way, but this seemed like
a good place to bring this up.


^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-25  5:37 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Theodore Tso, Christoph Hellwig, linux-api, linux-kernel,
	Matthew Wilcox, linux-f2fs-devel, linux-mm, linux-fsdevel,
	Akilesh Kailash, Christian Brauner
In-Reply-To: <ag_OVwPF49LSZ7rz@google.com>

On Fri, May 22, 2026 at 03:32:39AM +0000, Jaegeuk Kim wrote:
> I went this route because Android heavily restricts ioctl() permissions
> and we needed broader access for this to work within the framework. It’s
> definitely a pragmatic choice just to get it running in production.

That is not a good reason.

> If ioctl() is the right way for upstream, I'm happy to change this patch. By
> the way, I really don't understand why all the messages are so offensive,
> without even trying to understand the problem or pointing in the right
> direction.

The right way is to:

 1) Talk to the relevant subsystems (MM and fsdevel), and if it affects
    userspace, the linux-api list, and actually explain your use case.
 2) And then actually listen to feedback.  f2fs just keeps piling these
    ABI hacks on without any review, and it is causing real problems.


^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-25  5:34 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Matthew Wilcox, Theodore Tso, linux-api, linux-kernel,
	linux-f2fs-devel, Christoph Hellwig, linux-mm, linux-fsdevel,
	Akilesh Kailash, Christian Brauner
In-Reply-To: <ahBii6bk0KbK_NHV@google.com>

On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> This was a quick buddyinfo right after booting the device.
> 
> Before:
> Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> 
> After disabling EROFS large folio:
> Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856

And what are you trying to tell us with that?


^ permalink raw reply

* Re: [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-25  5:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Tso, Christoph Hellwig, Jaegeuk Kim, linux-kernel,
	linux-f2fs-devel, Akilesh Kailash, linux-fsdevel, linux-mm,
	linux-api, Christian Brauner
In-Reply-To: <ag9D6_7dttbDGHZ6@casper.infradead.org>

On Thu, May 21, 2026 at 06:42:03PM +0100, Matthew Wilcox wrote:
> On Thu, May 21, 2026 at 11:57:48AM -0400, Theodore Tso wrote:
> > So let me get this straight.  This is a magic xattr interface which is
> > not even persisted in the file system, but instead sets a 32-bit
> > bitmask in the struct inode which disappears once the inode gets
> > flushed from the inode stack.  And it uses a generic xattr name,
> > "user.fadvise".
> > 
> > There's no way in *hell* any other file system is likely to adopt such
> > a broken interface, so why didn't you just use an ioctl to set this
> > magic f2fs-specific flag?
> 
> I mean, yes, this API is horrendous.  But it's just another example of
> f2fs thinking it's somehow special and not just enabling large folios
> like other filesystems do.  This hurts everyone, not just people who use
> f2fs.

Yes.  And assuming we had a legit reason to unconditionally use smaller
folios for given files, we'd really need to control it in the MM.  Even
if it ends up being an Android-only hack.

^ permalink raw reply

* Re: [RFC] fs/ioctl.c: FIBMAP requires CAP_SYS_RAWIO while FIEMAP exposes identical data unprivileged
From: Christoph Hellwig @ 2026-05-25  5:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Darrick J. Wong, Christoph Hellwig, Cyber_black,
	linux-fsdevel@vger.kernel.org, Mark Fasheh, Theodore Ts'o,
	linux-api
In-Reply-To: <CALCETrXWuMJstpkDhV4eKTwbRhQAQ0RZTkkFN=+oXrkiShgx1A@mail.gmail.com>

On Tue, May 19, 2026 at 01:51:53PM -0700, Andy Lutomirski wrote:
> >
> > Also note that FIEMAP still doesn't report devices, so you're still
> > playing with fire on multi-device reflink-aware filesystems like XFS.
> >
> 
> A hash would be fine for me.
> 
> But really a nicer interface would translate logical ranges in a file
> to some range identifier, where:

All this sounds really complicated and probably not doable.  But you
haven't answered the basic question, which is whether your use case already
has candidates and you just want to confirm them, or whether you are
iterating over all logical-to-physical file mappings in the file system.

Can you explain your high-level use case a bit?


^ permalink raw reply

* [PATCH v2 6/6] selftests: prctl: Add test for long thread names
From: André Almeida @ 2026-05-24 22:38 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260524-tonyk-long_name-v2-0-332f6bd041c4@igalia.com>

Add tests for the new interface to set and get long thread names. The
kernel should accept LONG_NAME and return it accordingly. For the
old PR_GET_NAME interface, the kernel should truncate the name to fit in 16
bytes. /proc/<task>/comm should return the same string as PR_GET_NAME.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 tools/testing/selftests/prctl/set-process-name.c | 36 ++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/tools/testing/selftests/prctl/set-process-name.c b/tools/testing/selftests/prctl/set-process-name.c
index 3f7b146d36df..0f20f7deac67 100644
--- a/tools/testing/selftests/prctl/set-process-name.c
+++ b/tools/testing/selftests/prctl/set-process-name.c
@@ -9,9 +9,17 @@
 
 #include "kselftest_harness.h"
 
+#ifndef PR_SET_EXT_NAME
+# define PR_SET_EXT_NAME 17
+# define PR_GET_EXT_NAME 18
+#endif
+
 #define CHANGE_NAME "changename"
+#define LONG_NAME	"change_to_very_long_extended_name"
+#define LONG_NAME_CAP	"change_to_very_"
 #define EMPTY_NAME ""
 #define TASK_COMM_LEN 16
+#define TASK_COMM_EXT_LEN 64
 #define MAX_PATH_LEN 50
 
 int set_name(char *name)
@@ -25,6 +33,16 @@ int set_name(char *name)
 	return res;
 }
 
+int set_ext_name(char *name)
+{
+	int res;
+
+	res = prctl(PR_SET_EXT_NAME, name, NULL, NULL, NULL);
+	if (res < 0)
+		return -errno;
+	return res;
+}
+
 int check_is_name_correct(char *check_name)
 {
 	char name[TASK_COMM_LEN];
@@ -38,6 +56,19 @@ int check_is_name_correct(char *check_name)
 	return !strcmp(name, check_name);
 }
 
+int check_is_ext_name_correct(char *check_name)
+{
+	char name[TASK_COMM_EXT_LEN];
+	int res;
+
+	res = prctl(PR_GET_EXT_NAME, name, NULL, NULL, NULL);
+
+	if (res < 0)
+		return -errno;
+
+	return !strcmp(name, check_name);
+}
+
 int check_null_pointer(char *check_name)
 {
 	char *name = NULL;
@@ -82,6 +113,11 @@ TEST(rename_process) {
 	EXPECT_GE(set_name(CHANGE_NAME), 0);
 	EXPECT_TRUE(check_is_name_correct(CHANGE_NAME));
 
+	EXPECT_GE(set_ext_name(LONG_NAME), 0);
+	EXPECT_TRUE(check_is_ext_name_correct(LONG_NAME));
+	EXPECT_TRUE(check_is_name_correct(LONG_NAME_CAP));
+	EXPECT_TRUE(check_name());
+
 	EXPECT_GE(set_name(EMPTY_NAME), 0);
 	EXPECT_TRUE(check_is_name_correct(EMPTY_NAME));
 

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 5/6] prctl: Add support for long user thread names
From: André Almeida @ 2026-05-24 22:38 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260524-tonyk-long_name-v2-0-332f6bd041c4@igalia.com>

Add support for getting and setting long user thread names with
PR_{SET,GET}_EXT_NAME.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 include/linux/sched.h      |  2 +-
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               | 15 ++++++++++++++-
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f7fd2b7d131d..fd4256c8627b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1997,7 +1997,7 @@ extern void kick_process(struct task_struct *tsk);
 
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 #define set_task_comm(tsk, from) ({			\
-	BUILD_BUG_ON(sizeof(from) != TASK_COMM_LEN);	\
+	BUILD_BUG_ON(sizeof(from) < TASK_COMM_LEN);	\
 	__set_task_comm(tsk, from, false);		\
 })
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b6ec6f693719..a07f8edadd65 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -56,6 +56,9 @@
 #define PR_SET_NAME    15		/* Set process name */
 #define PR_GET_NAME    16		/* Get process name */
 
+#define PR_SET_EXT_NAME    17		/* Set extended process name */
+#define PR_GET_EXT_NAME    18		/* Get extended process name */
+
 /* Get/set process endian */
 #define PR_GET_ENDIAN	19
 #define PR_SET_ENDIAN	20
diff --git a/kernel/sys.c b/kernel/sys.c
index 76d77218ab19..1b70d53da998 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
-	unsigned char comm[TASK_COMM_LEN];
+	unsigned char comm[TASK_COMM_EXT_LEN];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -2613,6 +2613,19 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))
 			return -EFAULT;
 		break;
+	case PR_SET_EXT_NAME:
+		comm[TASK_COMM_EXT_LEN - 1] = 0;
+		if (strncpy_from_user(comm, (char __user *)arg2,
+				      TASK_COMM_EXT_LEN - 1) < 0)
+			return -EFAULT;
+		set_task_comm(me, comm);
+		proc_comm_connector(me);
+		break;
+	case PR_GET_EXT_NAME:
+		strscpy_pad(comm, me->comm, TASK_COMM_EXT_LEN);
+		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_EXT_LEN))
+			return -EFAULT;
+		break;
 	case PR_GET_ENDIAN:
 		error = GET_ENDIAN(me, arg2);
 		break;

-- 
2.54.0


^ permalink raw reply related

* [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: André Almeida @ 2026-05-24 22:38 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, Bhupesh,
	André Almeida
In-Reply-To: <20260524-tonyk-long_name-v2-0-332f6bd041c4@igalia.com>

Command name has been restricted to only 16 bytes, which is too limiting,
especially when debugging and tracing complex software with thousands of
threads and the need to differentiate them.

Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
long names for userspace threads as well.

To avoid buffer overflows, cap all existing userspace APIs to
TASK_COMM_LEN, and leave the full extended name for a new interface.

Co-developed-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: Bhupesh <bhupesh@igalia.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 fs/proc/array.c       |  2 +-
 include/linux/sched.h |  3 ++-
 kernel/sys.c          | 10 +++++-----
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index c8c3fbd9bfa9..312371eddc7f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 	else if (p->flags & PF_KTHREAD)
 		get_kthread_comm(tcomm, sizeof(tcomm), p);
 	else
-		strscpy_pad(tcomm, p->comm);
+		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
 
 	if (escape)
 		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b6de742b1155..f7fd2b7d131d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -323,6 +323,7 @@ struct user_event_mm;
  */
 enum {
 	TASK_COMM_LEN = 16,
+	TASK_COMM_EXT_LEN = 64,
 };
 
 extern void sched_tick(void);
@@ -1167,7 +1168,7 @@ struct task_struct {
 	 * - set it with set_task_comm() to ensure it is always
 	 *   NUL-terminated and zero-padded
 	 */
-	char				comm[TASK_COMM_LEN];
+	char				comm[TASK_COMM_EXT_LEN];
 
 	struct nameidata		*nameidata;
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1d5152d2395e..76d77218ab19 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
 	struct task_struct *me = current;
-	unsigned char comm[sizeof(me->comm)];
+	unsigned char comm[TASK_COMM_LEN];
 	long error;
 
 	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 			error = -EINVAL;
 		break;
 	case PR_SET_NAME:
-		comm[sizeof(me->comm) - 1] = 0;
+		comm[TASK_COMM_LEN - 1] = 0;
 		if (strncpy_from_user(comm, (char __user *)arg2,
-				      sizeof(me->comm) - 1) < 0)
+				      TASK_COMM_LEN - 1) < 0)
 			return -EFAULT;
 		set_task_comm(me, comm);
 		proc_comm_connector(me);
 		break;
 	case PR_GET_NAME:
-		strscpy_pad(comm, me->comm);
-		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
+		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
+		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))
 			return -EFAULT;
 		break;
 	case PR_GET_ENDIAN:

-- 
2.54.0



* [PATCH v2 3/6] treewide: Replace memcpy(..., current->comm) with strscpy()
From: André Almeida @ 2026-05-24 22:38 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, David Laight, Linus Torvalds, akpm,
	Yafang Shao, andrii.nakryiko, arnaldo.melo, Petr Mladek
  Cc: linux-kernel, kernel-dev, linux-mm, linux-api, André Almeida
In-Reply-To: <20260524-tonyk-long_name-v2-0-332f6bd041c4@igalia.com>

In preparation for increasing the size of current->comm[], and to avoid
breaking existing code, replace memcpy() with strscpy(). The latter
guarantees that the copy is NUL-terminated. This is crucial because the
source buffer may become larger than the destination buffer, in which case
a fixed-size memcpy() could cut the terminating NUL off the copy.
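
The failure mode can be sketched in plain userspace C. This is an
illustration under stated assumptions, not kernel code: bounded_copy() is a
hypothetical stand-in that only approximates strscpy() (it returns -1 on
truncation where the kernel returns -E2BIG), and is_terminated() is a
hypothetical checker. The two length macros take their values from this
series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TASK_COMM_LEN     16	/* legacy size, value from this series */
#define TASK_COMM_EXT_LEN 64	/* extended size, value from this series */

/*
 * Hypothetical userspace stand-in for strscpy(): copy at most
 * dst_size - 1 bytes and always NUL-terminate dst. Returns the number of
 * bytes copied (excluding the NUL), or -1 if src had to be truncated.
 */
static long bounded_copy(char *dst, const char *src, size_t dst_size)
{
	const char *end;

	assert(dst_size > 0);

	/* Look for the terminator within the first dst_size bytes of src. */
	end = memchr(src, '\0', dst_size);
	if (!end) {
		/* src does not fit: keep a prefix and terminate it. */
		memcpy(dst, src, dst_size - 1);
		dst[dst_size - 1] = '\0';
		return -1;
	}

	/* src fits: copy it together with its NUL. */
	memcpy(dst, src, (size_t)(end - src) + 1);
	return (long)(end - src);
}

/* Hypothetical checker: 1 if buf has a NUL in its first size bytes. */
static int is_terminated(const char *buf, size_t size)
{
	return memchr(buf, '\0', size) != NULL;
}
```

With a TASK_COMM_EXT_LEN source holding a name longer than 15 characters,
memcpy(dst, src, TASK_COMM_LEN) leaves dst without a terminator, while
bounded_copy() leaves a terminated 15-character prefix.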

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
Changes in v2:
 - New patch, dropped strtostr() from last version
---
 include/linux/coredump.h        |  2 +-
 include/linux/tracepoint.h      |  4 ++--
 include/trace/events/block.h    | 10 +++++-----
 include/trace/events/coredump.h |  2 +-
 include/trace/events/f2fs.h     |  4 ++--
 include/trace/events/oom.h      |  2 +-
 include/trace/events/osnoise.h  |  2 +-
 include/trace/events/sched.h    | 10 +++++-----
 include/trace/events/signal.h   |  2 +-
 include/trace/events/task.h     |  4 ++--
 kernel/printk/nbcon.c           |  2 +-
 kernel/printk/printk.c          |  2 +-
 12 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 68861da4cf7c..45cd55114120 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -54,7 +54,7 @@ extern void vfs_coredump(const kernel_siginfo_t *siginfo);
 	do {	\
 		char comm[TASK_COMM_LEN];	\
 		/* This will always be NUL terminated. */ \
-		memcpy(comm, current->comm, sizeof(comm)); \
+		strscpy(comm, current->comm, sizeof(comm)); \
 		printk_ratelimited(Level "coredump: %d(%*pE): " Format "\n",	\
 			task_tgid_vnr(current), (int)strlen(comm), comm, ##__VA_ARGS__);	\
 	} while (0)	\
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 763eea4d80d8..90fd9109210c 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -615,10 +615,10 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
  *	*
  *
  *	TP_fast_assign(
- *		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+ *		strscpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
  *		__entry->prev_pid	= prev->pid;
  *		__entry->prev_prio	= prev->prio;
- *		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+ *		strscpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
  *		__entry->next_pid	= next->pid;
  *		__entry->next_prio	= next->prio;
  *	),
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 6aa79e2d799c..73db3713b967 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -213,7 +213,7 @@ DECLARE_EVENT_CLASS(block_rq,
 
 		blk_fill_rwbs(__entry->rwbs, rq->cmd_flags);
 		__get_str(cmd)[0] = '\0';
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
 	TP_printk("%d,%d %s %u (%s) %llu + %u %s,%u,%u [%s]",
@@ -351,7 +351,7 @@ DECLARE_EVENT_CLASS(block_bio,
 		__entry->sector		= bio->bi_iter.bi_sector;
 		__entry->nr_sector	= bio_sectors(bio);
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
 	TP_printk("%d,%d %s %llu + %u [%s]",
@@ -434,7 +434,7 @@ TRACE_EVENT(block_plug,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
 	TP_printk("[%s]", __entry->comm)
@@ -453,7 +453,7 @@ DECLARE_EVENT_CLASS(block_unplug,
 
 	TP_fast_assign(
 		__entry->nr_rq = depth;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
 	TP_printk("[%s] %d", __entry->comm, __entry->nr_rq)
@@ -504,7 +504,7 @@ TRACE_EVENT(block_split,
 		__entry->sector		= bio->bi_iter.bi_sector;
 		__entry->new_sector	= new_sector;
 		blk_fill_rwbs(__entry->rwbs, bio->bi_opf);
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
 	TP_printk("%d,%d %s %llu / %llu [%s]",
diff --git a/include/trace/events/coredump.h b/include/trace/events/coredump.h
index c7b9c53fc498..dc21ec89a4fb 100644
--- a/include/trace/events/coredump.h
+++ b/include/trace/events/coredump.h
@@ -32,7 +32,7 @@ TRACE_EVENT(coredump,
 
 	TP_fast_assign(
 		__entry->sig = sig;
-		memcpy(__entry->comm, current->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, current->comm, TASK_COMM_LEN);
 	),
 
 	TP_printk("sig=%d comm=%s",
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index b5188d2671d7..1e56e448268c 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -2505,7 +2505,7 @@ TRACE_EVENT(f2fs_lock_elapsed_time,
 
 	TP_fast_assign(
 		__entry->dev		= sbi->sb->s_dev;
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio;
 		__entry->ioprio_class	= IOPRIO_PRIO_CLASS(ioprio);
@@ -2558,7 +2558,7 @@ DECLARE_EVENT_CLASS(f2fs_priority_update,
 
 	TP_fast_assign(
 		__entry->dev		= sbi->sb->s_dev;
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->lock_name	= lock_name;
 		__entry->is_write	= is_write;
diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 9f0a5d1482c4..172278a7e20a 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -23,7 +23,7 @@ TRACE_EVENT(oom_score_adj_update,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, task->comm, TASK_COMM_LEN);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
 
diff --git a/include/trace/events/osnoise.h b/include/trace/events/osnoise.h
index 3f4273623801..4db90931e897 100644
--- a/include/trace/events/osnoise.h
+++ b/include/trace/events/osnoise.h
@@ -116,7 +116,7 @@ TRACE_EVENT(thread_noise,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, t->comm, TASK_COMM_LEN);
 		__entry->pid = t->pid;
 		__entry->start = start;
 		__entry->duration = duration;
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 535860581f15..a932f443f327 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -152,7 +152,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->target_cpu	= task_cpu(p);
@@ -237,11 +237,11 @@ TRACE_EVENT(sched_switch,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		strscpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
 		__entry->prev_pid	= prev->pid;
 		__entry->prev_prio	= prev->prio;
 		__entry->prev_state	= __trace_sched_switch_state(preempt, prev_state, prev);
-		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		strscpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
 		__entry->next_pid	= next->pid;
 		__entry->next_prio	= next->prio;
 		/* XXX SCHED_DEADLINE */
@@ -346,7 +346,7 @@ TRACE_EVENT(sched_process_exit,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, p->comm, TASK_COMM_LEN);
 		__entry->pid		= p->pid;
 		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
 		__entry->group_dead	= group_dead;
@@ -787,7 +787,7 @@ TRACE_EVENT(sched_skip_cpuset_numa,
 	),
 
 	TP_fast_assign(
-		memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
 		__entry->pid		 = task_pid_nr(tsk);
 		__entry->tgid		 = task_tgid_nr(tsk);
 		__entry->ngid		 = task_numa_group_id(tsk);
diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index 1db7e4b07c01..6aa7d1123f04 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -67,7 +67,7 @@ TRACE_EVENT(signal_generate,
 	TP_fast_assign(
 		__entry->sig	= sig;
 		TP_STORE_SIGINFO(__entry, info);
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, task->comm, TASK_COMM_LEN);
 		__entry->pid	= task->pid;
 		__entry->group	= group;
 		__entry->result	= result;
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index b9a129eb54d9..f75dbf20fe02 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -21,7 +21,7 @@ TRACE_EVENT(task_newtask,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(__entry->comm, task->comm, TASK_COMM_LEN);
+		strscpy(__entry->comm, task->comm, TASK_COMM_LEN);
 		__entry->clone_flags = clone_flags;
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
@@ -46,7 +46,7 @@ TRACE_EVENT(task_rename,
 
 	TP_fast_assign(
 		__entry->pid = task->pid;
-		memcpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
+		strscpy(entry->oldcomm, task->comm, TASK_COMM_LEN);
 		strscpy(entry->newcomm, comm, TASK_COMM_LEN);
 		__entry->oom_score_adj = task->signal->oom_score_adj;
 	),
diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
index d7044a7a214b..7625adc0a2e1 100644
--- a/kernel/printk/nbcon.c
+++ b/kernel/printk/nbcon.c
@@ -952,7 +952,7 @@ static void wctxt_load_execution_ctx(struct nbcon_write_context *wctxt,
 {
 	wctxt->cpu = pmsg->cpu;
 	wctxt->pid = pmsg->pid;
-	memcpy(wctxt->comm, pmsg->comm, sizeof(wctxt->comm));
+	strscpy(wctxt->comm, pmsg->comm, sizeof(wctxt->comm));
 	static_assert(sizeof(wctxt->comm) == sizeof(pmsg->comm));
 }
 #else
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 1f04e753ca02..eaf8b7b930df 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2255,7 +2255,7 @@ static void pmsg_load_execution_ctx(struct printk_message *pmsg,
 {
 	pmsg->cpu = printk_info_get_cpu(info);
 	pmsg->pid = printk_info_get_pid(info);
-	memcpy(pmsg->comm, info->comm, sizeof(pmsg->comm));
+	strscpy(pmsg->comm, info->comm, sizeof(pmsg->comm));
 	static_assert(sizeof(pmsg->comm) == sizeof(info->comm));
 }
 #else

-- 
2.54.0

