Linux userland API discussions


* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Christoph Hellwig @ 2026-05-25  8:03 UTC
  To: demiobenour
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jens Axboe, Jakub Kicinski,
	Simon Horman, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
	Ard Biesheuvel, linux-crypto, linux-kernel, io-uring, netdev,
	linux-perf-users, linux-doc, Toke Høiland-Jørgensen,
	linux-api
In-Reply-To: <20260523-af-alg-harden-v1-1-c76755c3a5c5@gmail.com>

On Sat, May 23, 2026 at 03:43:02PM -0400, Demi Marie Obenour via B4 Relay wrote:
> From: Demi Marie Obenour <demiobenour@gmail.com>
> 
> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
> It can be removed entirely at the cost of only supporting synchronous
> operations.  This doesn't break userspace, which will silently block
> (for a bounded amount of time) in io_submit instead of operating
> asynchronously.
> 
> This also makes struct msghdr smaller, helping every other caller of
> sendmsg().

So we just had a discussion at LLC about how networking needs to support
AIO better for zero copy.

The current TCP zerocopy implementation provides completion notification
through the socket error code, which is freaking weird and doesn't
integrate well with either io_uring or in-kernel callers.

So we really want to pass the iocb down into networking and have it
call ki_complete on completion, with something higher up in the stack
adding that to the error queue for the legacy user interface.

Now I'm not sure if we wouldn't be better off passing that iocb
explicitly instead of in a weird hidden way, but this seemed like
a good place to bring this up.


* Re: [PATCH v2 2/6] treewide: Get rid of get_task_comm()
From: David Laight @ 2026-05-25 10:34 UTC
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260524-tonyk-long_name-v2-2-332f6bd041c4@igalia.com>

On Sun, 24 May 2026 19:38:52 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Since commit 4cc0473d7754 ("get rid of __get_task_comm()"),
> get_task_comm() just does a redundant buffer-size check and calls
> strscpy_pad(). Replace get_task_comm() calls with strscpy_pad(), which
> does the right thing if the buffer sizes don't match: it zero-pads if the
> destination is bigger and truncates if it's smaller.
> 
> Link: https://lore.kernel.org/lkml/CAHk-=wi5c=_-FBGo_88CowJd_F-Gi6Ud9d=TALm65ReN7YjrMw@mail.gmail.com/
> Co-developed-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
... 
> -/*
> - * - Why not use task_lock()?
> - *   User space can randomly change their names anyway, so locking for readers
> - *   doesn't make sense. For writers, locking is probably necessary, as a race
> - *   condition could lead to long-term mixed results.
> - *   The logic inside __set_task_comm() ensures that the task comm is
> - *   always NUL-terminated and zero-padded. Therefore the race condition between
> - *   reader and writer is not an issue.
> - *
> - * - BUILD_BUG_ON() can help prevent the buf from being truncated.
> - *   Since the callers don't perform any return value checks, this safeguard is
> - *   necessary.
> - */
> -#define get_task_comm(buf, tsk) ({			\
> -	BUILD_BUG_ON(sizeof(buf) < TASK_COMM_LEN);	\
> -	strscpy_pad(buf, (tsk)->comm);			\
> -	buf;						\
> -})
> -

I don't think it is worth the churn of removing this wrapper.
The calls can be optimised based on the knowledge that tsk->comm
is always '\0' terminated and can be assumed to be padded.
(A read mid-update might give an unpadded result, but that doesn't
matter because it can only 'leak' part of an old name.)

-- David
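
For readers who want to see the pad/truncate behaviour under discussion,
here is a small userspace model (a sketch, not the kernel implementation)
of the strscpy_pad() semantics as described in the quoted commit message.
The function name model_strscpy_pad() is hypothetical, and the return
convention (copied length, or -E2BIG on truncation) is an assumption
modelled on the kernel helper.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TASK_COMM_LEN 16

/*
 * Copy src into dst[size]. The result is always NUL-terminated and the
 * unused tail of dst is zero-filled. Returns the number of characters
 * copied, or -E2BIG if src had to be truncated.
 */
long model_strscpy_pad(char *dst, const char *src, size_t size)
{
    size_t len = 0;

    assert(dst != NULL && src != NULL && size > 0);

    /* Bounded length scan: never look past what could fit in dst. */
    while (len < size && src[len] != '\0')
        len++;

    if (len == size) {
        /* Source does not fit: keep size - 1 characters and terminate. */
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
        return -E2BIG;
    }

    memcpy(dst, src, len);
    memset(dst + len, 0, size - len);   /* terminator plus zero padding */
    return (long)len;
}
```

With a destination larger than TASK_COMM_LEN the tail is zeroed, and with
a smaller one the name is cut short. The wrapper's BUILD_BUG_ON() turned
the second case into a compile-time error.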


* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: David Laight @ 2026-05-25 10:41 UTC
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260524-tonyk-long_name-v2-4-332f6bd041c4@igalia.com>

On Sun, 24 May 2026 19:38:54 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Command names have been restricted to only 16 bytes, which is too limiting,
> especially when debugging and tracing complex software with thousands of
> threads that need to be differentiated.
> 
> Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
> Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
> long names for userspace threads as well.
> 
> To avoid buffer overflows, cap all existing userspace APIs to
> TASK_COMM_LEN, and leave the full extended name for a new interface.
> 
> Co-developed-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: Bhupesh <bhupesh@igalia.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  fs/proc/array.c       |  2 +-
>  include/linux/sched.h |  3 ++-
>  kernel/sys.c          | 10 +++++-----
>  3 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index c8c3fbd9bfa9..312371eddc7f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
>  	else if (p->flags & PF_KTHREAD)
>  		get_kthread_comm(tcomm, sizeof(tcomm), p);
>  	else
> -		strscpy_pad(tcomm, p->comm);
> +		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
>  
>  	if (escape)
>  		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b6de742b1155..f7fd2b7d131d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -323,6 +323,7 @@ struct user_event_mm;
>   */
>  enum {
>  	TASK_COMM_LEN = 16,
> +	TASK_COMM_EXT_LEN = 64,
>  };
>  
>  extern void sched_tick(void);
> @@ -1167,7 +1168,7 @@ struct task_struct {
>  	 * - set it with set_task_comm() to ensure it is always
>  	 *   NUL-terminated and zero-padded
>  	 */
> -	char				comm[TASK_COMM_LEN];
> +	char				comm[TASK_COMM_EXT_LEN];
>  
>  	struct nameidata		*nameidata;
>  
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 1d5152d2395e..76d77218ab19 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		unsigned long, arg4, unsigned long, arg5)
>  {
>  	struct task_struct *me = current;
> -	unsigned char comm[sizeof(me->comm)];
> +	unsigned char comm[TASK_COMM_LEN];
>  	long error;
>  
>  	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			error = -EINVAL;
>  		break;
>  	case PR_SET_NAME:
> -		comm[sizeof(me->comm) - 1] = 0;
> +		comm[TASK_COMM_LEN - 1] = 0;
>  		if (strncpy_from_user(comm, (char __user *)arg2,
> -				      sizeof(me->comm) - 1) < 0)
> +				      TASK_COMM_LEN - 1) < 0)

Nak - you can't do that.
You are reading data that the application doesn't expect you to read.

>  			return -EFAULT;
>  		set_task_comm(me, comm);
>  		proc_comm_connector(me);
>  		break;
>  	case PR_GET_NAME:
> -		strscpy_pad(comm, me->comm);
> -		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
> +		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
> +		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))

Double-nak - you are writing beyond the end of the application's buffer.

You can't change the user memory that the syscalls access.

You can support the longer name for read/write of /proc/self/comm.

-- David

>  			return -EFAULT;
>  		break;
>  	case PR_GET_ENDIAN:
> 
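
To make the userspace side of this concern concrete, the sketch below (an
illustration, not taken from the patch) exercises the existing prctl()
contract: per prctl(2), PR_SET_NAME reads a name of up to 16 bytes
including the terminating NUL, and PR_GET_NAME writes into a buffer of up
to 16 bytes. Applications size their buffers accordingly, so the kernel
must not touch more than that. The helper name roundtrip_thread_name() and
the COMM_BUF_LEN macro are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

#define COMM_BUF_LEN 16   /* the size existing applications allocate */

/*
 * Set the calling thread's name, then read it back into a 16-byte
 * buffer. Returns the length of the name read back, or -1 on error.
 */
int roundtrip_thread_name(const char *name, char out[COMM_BUF_LEN])
{
    const char *nul;

    assert(name != NULL && out != NULL);

    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;

    /* Poison the buffer so we can tell what the kernel wrote. */
    memset(out, 0x7f, COMM_BUF_LEN);
    if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
        return -1;

    nul = memchr(out, '\0', COMM_BUF_LEN);
    return nul ? (int)(nul - out) : -1;
}
```

A name longer than 15 characters is silently truncated on the way in. Any
longer name therefore has to travel through a new interface rather than
through these two prctl() options.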



* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: David Laight @ 2026-05-25 10:42 UTC
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Steven Rostedt,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260525114107.7fa5b4c1@pumpkin>

On Mon, 25 May 2026 11:41:07 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> On Sun, 24 May 2026 19:38:54 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
> 
> > Command names have been restricted to only 16 bytes, which is too limiting,
> > especially when debugging and tracing complex software with thousands of
> > threads that need to be differentiated.
> > 
> > Just as it was done with kthreads in commit 6b59808bfe48 ("workqueue:
> > Show the latest workqueue name in /proc/PID/{comm,stat,status}"), support
> > long names for userspace threads as well.
> > 
> > To avoid buffer overflows, cap all existing userspace APIs to
> > TASK_COMM_LEN, and leave the full extended name for a new interface.
> > 
> > Co-developed-by: Bhupesh <bhupesh@igalia.com>
> > Signed-off-by: Bhupesh <bhupesh@igalia.com>
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >  fs/proc/array.c       |  2 +-
> >  include/linux/sched.h |  3 ++-
> >  kernel/sys.c          | 10 +++++-----
> >  3 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index c8c3fbd9bfa9..312371eddc7f 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -110,7 +110,7 @@ void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
> >  	else if (p->flags & PF_KTHREAD)
> >  		get_kthread_comm(tcomm, sizeof(tcomm), p);
> >  	else
> > -		strscpy_pad(tcomm, p->comm);
> > +		strscpy_pad(tcomm, p->comm, TASK_COMM_LEN);
> >  
> >  	if (escape)
> >  		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index b6de742b1155..f7fd2b7d131d 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -323,6 +323,7 @@ struct user_event_mm;
> >   */
> >  enum {
> >  	TASK_COMM_LEN = 16,
> > +	TASK_COMM_EXT_LEN = 64,
> >  };
> >  
> >  extern void sched_tick(void);
> > @@ -1167,7 +1168,7 @@ struct task_struct {
> >  	 * - set it with set_task_comm() to ensure it is always
> >  	 *   NUL-terminated and zero-padded
> >  	 */
> > -	char				comm[TASK_COMM_LEN];
> > +	char				comm[TASK_COMM_EXT_LEN];
> >  
> >  	struct nameidata		*nameidata;
> >  
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 1d5152d2395e..76d77218ab19 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2535,7 +2535,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >  		unsigned long, arg4, unsigned long, arg5)
> >  {
> >  	struct task_struct *me = current;
> > -	unsigned char comm[sizeof(me->comm)];
> > +	unsigned char comm[TASK_COMM_LEN];
> >  	long error;
> >  
> >  	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> > @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >  			error = -EINVAL;
> >  		break;
> >  	case PR_SET_NAME:
> > -		comm[sizeof(me->comm) - 1] = 0;
> > +		comm[TASK_COMM_LEN - 1] = 0;
> >  		if (strncpy_from_user(comm, (char __user *)arg2,
> > -				      sizeof(me->comm) - 1) < 0)
> > +				      TASK_COMM_LEN - 1) < 0)  
> 
> Nak - you can't do that.
> You are reading data that the application doesn't expect you to read.

Or have I got confused over the names...

-- David

> 
> >  			return -EFAULT;
> >  		set_task_comm(me, comm);
> >  		proc_comm_connector(me);
> >  		break;
> >  	case PR_GET_NAME:
> > -		strscpy_pad(comm, me->comm);
> > -		if (copy_to_user((char __user *)arg2, comm, sizeof(comm)))
> > +		strscpy_pad(comm, me->comm, TASK_COMM_LEN);
> > +		if (copy_to_user((char __user *)arg2, comm, TASK_COMM_LEN))  
> 
> > Double-nak - you are writing beyond the end of the application's buffer.
> 
> You can't change the user memory that the syscalls access.
> 
> You can support the longer name for read/write of /proc/self/comm.
> 
> -- David
> 
> >  			return -EFAULT;
> >  		break;
> >  	case PR_GET_ENDIAN:
> >   
> 



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  1:10 UTC
  To: Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <20260522224108.GA18663@macsyma-wired.lan>

On 05/22, Theodore Tso wrote:
> On Fri, May 22, 2026 at 05:08:41PM +0000, Jaegeuk Kim wrote:
> > 
> > Thank you for the explanation. It seems I made a wrong assumption on the
> > usage of "user." prefix where each filesystem can support in different
> > ways.
> 
> The "user." prefix is used by all userspace applications that wish to
> store extended attributes.  For example, user.mime_type,
> user.xdg.origin_url, user.charset, user.apache_handler, etc.
> 
> For more information, see:
> 
>     https://www.freedesktop.org/wiki/CommonExtendedAttribute
>     https://wiki.archlinux.org/title/Extended_attributes
> 
> I certainly assumed this was common knowledge across all file system
> maintainers, but this was apparently not true in your case.  I don't
> know how this could be the case given that f2fs implements extended
> attributes, and I would have thought you would have known that when
> testing that feature.
> 
> > I shared some motivation when replying to Darrick's feedback [1], but yes,
> > it was not enough of a heads-up for everyone. The problem started when some specific
> > application needs as many high-order pages as possible mostly for reads. So,
> > I thought we can turn on large folio on the specific files per hints. One way
> > for the hints was using immutable bit, but it turned out it's very hard to
> > manage disabling the bit whenever deleting the files. Along with limited
> > ioctl() and requiring inode eviction to manage large folio activation, I had
> > to implement this path.
> > 
> > [1] https://lore.kernel.org/lkml/aeA5C8byIpXWla7f@google.com/
> 
> Actually, you still haven't explained your use case, at least, not
> well enough for me to understand what you are trying to do.
> 
> So an application wants a particular file to use as many high-order
> pages as possible.  Why?  What sort of guarantees do you need to
> provide?  What happens if they can't be provided?  What happens if a
> possibly malicious, or at least greedy, application uses this
> interface to grab a lot of high-order pages?
> 
> From your patch:
> 
> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
>  -> register the inode number for large folio
> 2. chmod(0400, file)
>  -> make Read-Only
> 3. open()
>  -> f2fs_iget() with large folio
> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
>  -> return error
> 5. iput() and open()
>  -> goto #3
> 6. unlink
>  -> deregister the inode number
> 
> Why should making the file read-only matter?  And when you say
> "deregister the inode number", why should this be related to deleting
> the inode?
> 
> This is an interface which seems to be very specific to your use case.
> What if those requirements change over time?  What if you want to pull in
> a file without making it be read-only?  And what if you want to
> release the large-order pages without deleting the file?

Let me try to write more details, helped with Gemini.

Background
----------
The primary use case is accelerating AI model loading, which demands
exceptionally high sequential read speeds. In our benchmarks on embedded
systems:
 - Using high-order page allocations allows the system to saturate the
   Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
   medium-to-low CPU frequencies.
 - In contrast, standard small folios cap performance at 2 GB/s.

The performance doubling stems directly from reducing CPU cycle overhead during
memory allocation.

Problem Statement
-----------------
High-order pages become heavily fragmented and scarce shortly after device boot.
We cannot afford to deplete these limited resources on default filesystem
operations using large folios. Instead, we need a mechanism to strictly
prioritize and reserve high-order allocations for specific, critical
payloads—specifically, large AI model files.

Design Principles
-----------------
 - Best-Effort Allocation: The system guarantees no fixed number of
   high-order pages. Allocation falls back gracefully from Order-10 down to
   Order-0 based on current memory availability.

 - Standard Page Cache Lifecycle: No custom or rigid memory management is
   introduced. These folios remain fully under the control of the Memory
   Management (MM) subsystem and can be reclaimed via the Least Recently
   Used (LRU) mechanism at any time.

 - Read-Only Optimization: To minimize code complexity (e.g., handling
   writeback, compression, and concurrency), this high-order allocation
   mechanism is strictly restricted to read-only files. The vast majority
   of performance gains are derived from read operations.

Questions
---------
Q: Why does an application require a specific file to utilize as many high-order
pages as possible?
A: It significantly boosts sequential read bandwidth in resource-constrained
 embedded systems by reducing the CPU overhead associated with page allocation
 during high-throughput I/O.

Q: What sort of guarantees does this mechanism need to provide?
A: No hard guarantees are provided. The filesystem provides a best-effort
 mechanism to attempt high-order page allocations for flagged inodes while the
 filesystem is mounted.

Q: What is the fallback behavior if high-order pages cannot be allocated?
A: The system treats the configuration as a performance hint. If high-order
 pages are unavailable, it seamlessly falls back to standard small folios.
 Functional behavior remains entirely unchanged.

Q: Why is restricting the implementation to read-only files necessary?
A: Limiting the scope to read-only files bypasses the architectural complexities
 of managing writes, dirtying pages, and compression in large folios, while
 still capturing the core performance benefits of high-speed sequential reads.

Q: What mitigations prevent a malicious or greedy application from abusing this
 interface to monopolize high-order pages?
A: The interface acts purely as a hint to the allocation path. Because it falls
 back to small folios when memory is tight, it poses no greater systemic risk
 than existing large-folio implementations used by other filesystems. Standard
 MM eviction and LRU paths remain fully active.

Q: Why is deregistering the inode number linked to inode deletion?
A: We need the high-order allocation hint to persist even if the inode is
 temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
 list of hinted inode numbers. When a file is permanently deleted, its hint
 becomes obsolete, requiring us to deregister it from the list to prevent memory
 leaks or identifier reuse conflicts.

Q: How can an application release these large-order pages without deleting the
 file?
A: Pages allocated via this mechanism receive no special status in the page
 cache. They are managed by standard LRU logic and can be explicitly released by
 the user at any time using existing system calls, such as
 posix_fadvise(..., POSIX_FADV_DONTNEED).
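
 As a rough illustration of that last answer (a sketch only, not part of
 the patch), releasing a file's cached pages from userspace could look
 like the following. The helper name drop_file_cache() is hypothetical;
 posix_fadvise() and POSIX_FADV_DONTNEED are the standard interfaces, and
 the call is only advisory, so the kernel may keep some pages.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to drop the cached pages of an entire file.
 * Returns 0 on success or a positive errno value on failure.
 */
int drop_file_cache(const char *path)
{
    int fd, rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    /* offset 0, len 0 means "from the start to the end of the file". */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return rc;   /* posix_fadvise() returns the error number directly */
}
```

 Whether the dropped folios were large or small makes no difference to
 the caller, which matches the claim that they get no special status.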

Q: This interface seems highly tailored to a specific use case. What happens if
 these requirements evolve over time?
A: Massive AI model loading is a long-term architectural paradigm. Providing a
 targeted VFS/filesystem hint to optimize read bandwidth for specific large
 datasets is a highly practical, repeatable pattern that addresses a systemic
 bottleneck in embedded AI deployments.

> 
> 						- Ted


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  1:21 UTC
  To: Christoph Hellwig
  Cc: Theodore Tso, linux-api, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahPffhaOi2CBtWof@infradead.org>

On 05/24, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> > This was a quick buddyinfo right after booting the device.
> > 
> > Before:
> > Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> > 
> > After disabling EROFS large folio:
> > Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856
> 
> And what are you trying to tell us with that?

This means high-order pages were used up by EROFS, which enables large folios
by default. So I wanted to say the concern was based on actual data, which is
what Matthew asked for.
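
For anyone comparing the two buddyinfo lines by hand, the sketch below (an
illustration, not from the patch) parses one line and sums the free pages
held in blocks at or above a given order. Each column is the number of
free blocks of order 0, 1, 2, and so on, and a block of order n covers 2^n
pages. The function names and the MAX_ORDERS limit of 11 columns are
assumptions that match the lines quoted above.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_ORDERS 11   /* orders 0..10, as in the quoted lines */

/*
 * Parse one /proc/buddyinfo line into counts[]. Returns the number of
 * order columns read, or -1 if the line does not look like buddyinfo.
 */
int parse_buddyinfo(const char *line, unsigned long counts[MAX_ORDERS])
{
    char zone[32];
    int node, off = 0, n = 0;

    assert(line != NULL && counts != NULL);

    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &off) < 2)
        return -1;

    while (n < MAX_ORDERS) {
        int used = 0;

        if (sscanf(line + off, "%lu%n", &counts[n], &used) != 1)
            break;
        off += used;
        n++;
    }
    return n;
}

/* Free pages held in blocks of order >= min_order. */
unsigned long free_pages_from_order(const unsigned long counts[MAX_ORDERS],
                                    int n, int min_order)
{
    unsigned long pages = 0;

    for (int order = min_order; order < n; order++)
        pages += counts[order] << order;
    return pages;
}
```

Note that these numbers only describe what is free at one instant. They do
not show whether any high-order allocation actually failed.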

> 
> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-05-26  2:31 UTC
  To: Jaegeuk Kim
  Cc: Christoph Hellwig, Theodore Tso, linux-api, linux-kernel,
	linux-f2fs-devel, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahT1nT3xsMGkyJab@google.com>

On Tue, May 26, 2026 at 01:21:33AM +0000, Jaegeuk Kim wrote:
> On 05/24, Christoph Hellwig wrote:
> > On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> > > This was a quick buddyinfo right after booting the device.
> > > 
> > > Before:
> > > Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> > > 
> > > After disabling EROFS large folio:
> > > Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856
> > 
> > > And what are you trying to tell us with that?
> 
> This means high-order pages were used up by EROFS, which enables large folios
> by default. So I wanted to say the concern was based on actual data, which is
> what Matthew asked for.

This isn't that though.  What you actually need is to show that high order
allocations are _failing_.  The MM is far more complicated than you seem
to understand.  There isn't a fixed number of large folios available;
when we try to allocate memory, we do reclaim.  And if there's large
folios on the LRU list, you'll get them.

If what you want is large folios readily available, then what you want
is large folios used _everywhere_ because then they're easy to get!
If there's small folios in use, you need to reclaim a lot of memory in
order to reassemble large folios (it's the birthday paradox, similar to
the hash collision problem).


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-05-26  2:35 UTC
  To: Jaegeuk Kim
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahTzHyHBL8t0iNBR@google.com>

On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Let me try to write more details, helped with Gemini.

This is garbage, and frankly disrespectful.  I'm not going to argue with
your AI bot.


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  3:34 UTC
  To: Matthew Wilcox
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahUG3ZCnc1RQ0EL_@casper.infradead.org>

On 05/26, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Let me try to write more details, helped with Gemini.
> 
> This is garbage, and frankly disrespectful.  I'm not going to argue with
> your AI bot.

I wrote it all down and it rephrased things a bit. Which points make you feel
that way?



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Randy Dunlap @ 2026-05-26  3:35 UTC
  To: Jaegeuk Kim, Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahTzHyHBL8t0iNBR@google.com>



On 5/25/26 6:10 PM, Jaegeuk Kim wrote:
> On 05/22, Theodore Tso wrote:
>> On Fri, May 22, 2026 at 05:08:41PM +0000, Jaegeuk Kim wrote:
>>>
>>> Thank you for the explanation. It seems I made a wrong assumption on the
>>> usage of "user." prefix where each filesystem can support in different
>>> ways.
>>
>> The "user." prefix is used by all userspace applications that wish to
>> store extended attributes.  For example, user.mime_type,
>> user.xdg.origin_url, user.charset, user.apache_handler, etc.
>>
>> For more information, see:
>>
>>     https://www.freedesktop.org/wiki/CommonExtendedAttribute
>>     https://wiki.archlinux.org/title/Extended_attributes
>>
>> I certainly assumed this was common knowledge across all file system
>> maintainers, but this was apparently not true in your case.  I don't
>> know how this could be the case given that f2fs implements extended
>> attributes, and I would have thought you would have known that when
>> testing that feature.
>>
>>> I shared some motivation when replying to Darrick's feedback [1], but yes,
>>> it was not enough of a heads-up for everyone. The problem started when some specific
>>> application needs as many high-order pages as possible mostly for reads. So,
>>> I thought we can turn on large folio on the specific files per hints. One way
>>> for the hints was using immutable bit, but it turned out it's very hard to
>>> manage disabling the bit whenever deleting the files. Along with limited
>>> ioctl() and requiring inode eviction to manage large folio activation, I had
>>> to implement this path.
>>>
>>> [1] https://lore.kernel.org/lkml/aeA5C8byIpXWla7f@google.com/
>>
>> Actually, you still haven't explained your use case, at least, not
>> well enough for me to understand what you are trying to do.
>>
>> So an application wants a particular file to use as many high-order
>> pages as possible.  Why?  What sort of guarantees do you need to
>> provide?  What happens if they can't be provided?  What happens if a
>> possibly malicious, or at least greedy, application uses this
>> interface to grab a lot of high-order pages?
>>
>> From your patch:
>>
>> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
>>  -> register the inode number for large folio
>> 2. chmod(0400, file)
>>  -> make Read-Only
>> 3. open()
>>  -> f2fs_iget() with large folio
>> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
>>  -> return error
>> 5. iput() and open()
>>  -> goto #3
>> 6. unlink
>>  -> deregister the inode number
>>
>> Why should making the file read-only matter?  And when you say
>> "deregister the inode number", why should this be related to deleting
>> the inode?
>>
>> This is an interface which seems to be very specific to your use case.
>> What if those requirements change over time?  What if you want to pull in
>> a file without making it be read-only?  And what if you want to
>> release the large-order pages without deleting the file?
> 
> Let me try to write more details, helped with Gemini.

[as an interested reader:]

If this idea is so good, why shouldn't it be done in the VFS/MM so that
other filesystems could do the same thing instead of just in f2fs?


-- 
~Randy



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  3:47 UTC
  To: Matthew Wilcox
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahUF7HqSKFJ422bU@casper.infradead.org>

On 05/26, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 01:21:33AM +0000, Jaegeuk Kim wrote:
> > On 05/24, Christoph Hellwig wrote:
> > > On Fri, May 22, 2026 at 02:04:59PM +0000, Jaegeuk Kim wrote:
> > > > This was a quick buddyinfo right after booting the device.
> > > > 
> > > > Before:
> > > > Node 0, zone   Normal  22684  42284  28704  16901   9515   4566   1854    673    181     36    758
> > > > 
> > > > After disabling EROFS large folio:
> > > > Node 0, zone   Normal   8486   4732   2175   1161    697    272     82     19      3      1    856
> > > 
> > > > And what are you trying to tell us with that?
> > 
> > This means high-order pages were used up by EROFS, which enables large folios
> > by default. So I wanted to say the concern was based on actual data, which is
> > what Matthew asked for.
> 
> This isn't that though.  What you actually need is to show that high order
> allocations are _failing_.  The MM is far more complicated than you seem
> to understand.  There isn't a fixed number of large folios available;
> when we try to allocate memory, we do reclaim.  And if there's large
> folios on the LRU list, you'll get them.
> 
> If what you want is large folios readily available, then what you want
> is large folios used _everywhere_ because then they're easy to get!
> If there's small folios in use, you need to reclaim a lot of memory in
> order to reassemble large folios (it's the birthday paradox, similar to
> the hash collision problem).

Thanks for the feedback. Actually, I tried running compact_memory before the
read() for AI model loading, but I got complaints that compact_memory took
hundreds of milliseconds to run. Is there a good way to secure high-order pages
before that read()? It was quite hard to predict when it would happen.
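
For reference, a minimal sketch of how such a measurement could be taken
from userspace (an illustration only; the original test method is not
described in this message). It writes "1" to a sysctl-style file and
reports how long the write took. The helper name timed_trigger() is
hypothetical, and the sketch assumes the write to
/proc/sys/vm/compact_memory returns only after compaction has run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Write "1" to path and store the elapsed wall time in *ms.
 * Returns 0 on success or a negative errno value.
 */
int timed_trigger(const char *path, double *ms)
{
    struct timespec t0, t1;
    ssize_t n;
    int fd, err = 0;

    assert(path != NULL && ms != NULL);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    n = write(fd, "1", 1);
    if (n != 1)
        err = (n < 0) ? -errno : -EIO;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    *ms = (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
          (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
    return err;
}
```

Calling timed_trigger("/proc/sys/vm/compact_memory", &ms) needs root and a
kernel built with compaction support. The path is a parameter so the
helper can be tried against an ordinary file first.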



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26  4:12 UTC
  To: Randy Dunlap
  Cc: Theodore Tso, linux-api, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, Christoph Hellwig, linux-mm, linux-fsdevel,
	Akilesh Kailash, Christian Brauner
In-Reply-To: <8a42abed-8289-44ec-a144-dfe531a4af71@infradead.org>

On 05/25, Randy Dunlap wrote:
> 
> 
> On 5/25/26 6:10 PM, Jaegeuk Kim wrote:
> > On 05/22, Theodore Tso wrote:
> >> On Fri, May 22, 2026 at 05:08:41PM +0000, Jaegeuk Kim wrote:
> >>>
> >>> Thank you for the explanation. It seems I made a wrong assumption on the
> >>> usage of "user." prefix where each filesystem can support in different
> >>> ways.
> >>
> >> The "user." prefix is used by all userspace applications that wish to
> >> store extended attributes.  For example, user.mime_type,
> >> user.xdg.origin_url, user.charset, user.apache_handler, etc.
> >>
> >> For more information, see:
> >>
> >>     https://www.freedesktop.org/wiki/CommonExtendedAttribute
> >>     https://wiki.archlinux.org/title/Extended_attributes
> >>
> >> I certainly assumed this was common knowledge across all file system
> >> maintainers, but this was apparently not true in your case.  I don't
> >> know how this could be the case given that f2fs implements extended
> >> attributes, and I would have thought you would have known that when
> >> testing that feature.
> >>
> >>> I shared some motivation when replying to Darrick's feedback [1], but yes,
> >>> it was not enough for all heads-up. The problem started that some specific
> >>> application needs as many high-order pages as possible mostly for reads. So,
> >>> I thought we can turn on large folio on the specific files per hints. One way
> >>> for the hints was using immutable bit, but it turned out it's very hard to
> >>> manage disabling the bit whenever deleting the files. Along with limited
> >>> ioctl() and requiring inode eviction to manage large folio activation, I had
> >>> to implement this path.
> >>>
> >>> [1] https://lore.kernel.org/lkml/aeA5C8byIpXWla7f@google.com/
> >>
> >> Actually, you still haven't explained your use case, at least, not
> >> well enough for me to understand what you are trying to do.
> >>
> >> So an application wants a particular file to use as many high-order
> >> pages as possible.  Why?  What sort of guarantees do you need to
> >> provide?  What happens if they can't be provided?  What happens if a
> >> possibly malicious, or at least greedy, application uses this
> >> interface to grab a lot of high-order pages?
> >>
> >> From your patch:
> >>
> >> 1. setxattr(file, "user.fadvise", &value, sizeof(unsigned int), 0)
> >>  -> register the inode number for large folio
> >> 2. chmod(0400, file)
> >>  -> make Read-Only
> >> 3. open()
> >>  -> f2fs_iget() with large folio
> >> 4. open(WRITE), mkwrite on mmap, chmod(WRITE)
> >>  -> return error
> >> 5. iput() and open()
> >>  -> goto #3
> >> 6. unlink
> >>  -> deregister the inode number
> >>
> >> Why should making the file read-only matter?  And when you say
> >> "deregister the inode number", why should this be related to deleting
> >> the inode?
> >>
> >> This is an interface which seems to be very specific to your use case.
> >> What if those requirements change over time?  What if you want pull in
> >> a file without making it be read-only?  And what if you want to
> >> release the large-order pages without deleting the file?
> > 
> > Let me try to write more details, helped with Gemini.
> 
> [as an interested reader:]
> 
> If this idea is so good, why shouldn't it be done in the VFS/MM so that
> other filesystems could do the same thing instead of just in f2fs?

Thanks for the feedback. I'm really open to that, and I'm just trying to
understand whether the idea is good or not. If it's a bad idea, I'm ready to
drop it, including the ioctl approach, even though I've already prepared its
implementation.

>
> 
> -- 
> ~Randy
> 

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Theodore Tso @ 2026-05-26 13:42 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahTzHyHBL8t0iNBR@google.com>

On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Background
> ----------
> The primary use case is accelerating AI model loading, which demands
> exceptionally high sequential read speeds. In our benchmarks on embedded
> systems:
>  - Using high-order page allocations allows the system to saturate the
>    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
>    medium-to-low CPU frequencies.
>  - In contrast, standard small folios cap performance at 2 GB/s.

So you're interested in optimizing the I/O speeds.  And apparently, on
your hardware, the UFS controller has limits on scatter-gather entries
--- UFS seems to call this Physical Region Description (PRD) table
entries.  Per Gemini:

    1. PRD Segment & Length Limits

	Maximum PRD Entries: Hardware limits typically cap the number
	    of PRD entries (or segments) to 255 or 256 per transfer
	    request.

	Maximum Transfer Length: Each individual PRD entry typically
	    allows a maximum transfer size of (65,535 bytes) per segment.

    2. Host Controller Hardware Limits (UFSHCI)

	Transfer Queue Depth: A UFS controller supports a predefined
	    number of outstanding task request entries. This is often
	    hard-capped at 32 concurrent transfer requests (slots) by the
	    doorbell register array.

	Descriptor Pre-fetch: Some UFS host controllers are
	   pre-configured to pre-fetch multiple PRD entries sequentially
	   before requiring main memory reads.

Is this an accurate description of the limits that you are trying to
work with?  How much data are you trying to read?  Looking at Gemma 4
models, E2B is about 10GB or 3GB for the 4-bit quantized version.  E4B
is 15GB, or 5GB for the 4-bit quantized version.  Is that about right?

It seems... surprising that the additional I/O operations are actually
throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
into why this is happening, and whether there is anything that can be
optimized below the file system?

> Problem Statement
> -----------------
> High-order pages become heavily fragmented and scarce shortly after
> device boot.  We cannot afford to deplete these limited resources on
> default filesystem operations using large folios. Instead, we need a
> mechanism to strictly prioritize and reserve high-order allocations
> for specific, critical payloads—specifically, large AI model files.

There's a fundamental assumption here, which is that the only use of
high order pages is the page cache.  This doesn't take into account
anonymous pages used by programs that aren't backed by files.  Nor does
it take into account kernel memory allocations.

But that being said, you seem to be assuming that you can reduce the
pressure on high order pages by only using large folios for these AI
model files.

But the problem with using small folios is that if you want to
actually *use* the memory, unless you want to segment out the memory
so it can't be used for anything other than the AI models (e.g., by
using something like hugetlbfs) it's just going to break up the memory
into smaller folios.  So that's not actually going to *help* in actual
real life use cases.  It might help for your artificial benchmarks /
experiments, but in the real life case where Android applications are
running and fragmenting all of the device memory, the large folios
won't be available *anyway*.

> 
> Q: Why is deregistering the inode number linked to inode deletion?
> A: We need the high-order allocation hint to persist even if the inode is
>  temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
>  list of hinted inode numbers. When a file is permanently deleted, its hint
>  becomes obsolete, requiring us to deregister it from the list to prevent memory
>  leaks or identifier reuse conflicts.

Assuming that the high-order allocation hint is a good thing, why not
just make it persistent?  e.g., just a *real* extended attribute
(which is more wasteful of space), or grab a flag in the on-disk f2fs
inode?  Then you don't need to have an in-memory list of hinted
inodes; instead, you can just have the Android package manager set
that flag indicating that you want that special treatment.  This is
all assuming that we need an explicit hint, though....

> Massive AI model loading is a long-term architectural
> paradigm. Providing a targeted VFS/filesystem hint to optimize read
> bandwidth for specific large datasets is a highly practical,
> repeatable pattern that addresses a systemic bottleneck in embedded
> AI deployments.

It's really too bad you didn't propose this as an LSF/MM topic and
present it at a session in Zagreb two weeks ago.  That would have
been a much more upstream-friendly way of collaborating, and it might
have allowed the mm experts to give you some more dynamic, real-time
feedback.

Cheers,

					- Ted

^ permalink raw reply

* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Jens Axboe @ 2026-05-26 15:58 UTC (permalink / raw)
  To: Christoph Hellwig, demiobenour
  Cc: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jakub Kicinski, Simon Horman,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, James Clark, Jonathan Corbet,
	Shuah Khan, Eric Biggers, Ard Biesheuvel, linux-crypto,
	linux-kernel, io-uring, netdev, linux-perf-users, linux-doc,
	Toke Høiland-Jørgensen, linux-api
In-Reply-To: <ahQCZQNoyO8GQt3H@infradead.org>

On 5/25/26 2:03 AM, Christoph Hellwig wrote:
> On Sat, May 23, 2026 at 03:43:02PM -0400, Demi Marie Obenour via B4 Relay wrote:
>> From: Demi Marie Obenour <demiobenour@gmail.com>
>>
>> The only user of msg->msg_iocb was AF_ALG, but that's deprecated.
>> It can be removed entirely at the cost of only supporting synchronous
>> operations.  This doesn't break userspace, which will silently block
>> (for a bounded amount of time) in io_submit instead of operating
>> asynchronously.
>>
>> This also makes struct msghdr smaller, helping every other caller of
>> sendmsg().
> 
> So we just had a discussion at LLC about how networking needs to support
> AIO better for zero copy.
> 
> The current TCP zerocopy implementation provides completion notification
> through the socket error code, which is freaking weird and doesn't
> integrate well with either io_uring or in-kernel callers.

We already have that via io_uring, and without needing msg_iocb or the
(very) weird socket error code retrieval.

-- 
Jens Axboe

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Bart Van Assche @ 2026-05-26 16:14 UTC (permalink / raw)
  To: Theodore Tso, Jaegeuk Kim
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>

On 5/26/26 6:42 AM, Theodore Tso wrote:
> It seems... surprising that the additional I/O operations are actually
> throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> into why this is happening, and whether there is anything that can be
> optimized below the file system?
The layers below the filesystem (block, SCSI, UFS) are what I'm
responsible for in the Pixel team, and I can assure you that these are
highly optimized.

Since the transfer size used in Jaegeuk's tests is much larger than 4
KiB, the number of CPU cycles used per I/O by the layers below the
filesystem is not what limits the transfer bandwidth.

Bart.

^ permalink raw reply

* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: Steven Rostedt @ 2026-05-26 16:31 UTC (permalink / raw)
  To: David Laight
  Cc: André Almeida, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Christian Brauner, Kees Cook, Shuah Khan, willy,
	mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
	andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
	kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260525114241.4b6f3050@pumpkin>

On Mon, 25 May 2026 11:42:41 +0100
David Laight <david.laight.linux@gmail.com> wrote:

> > >  	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> > > @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > >  			error = -EINVAL;
> > >  		break;
> > >  	case PR_SET_NAME:
> > > -		comm[sizeof(me->comm) - 1] = 0;
> > > +		comm[TASK_COMM_LEN - 1] = 0;
> > >  		if (strncpy_from_user(comm, (char __user *)arg2,
> > > -				      sizeof(me->comm) - 1) < 0)
> > > +				      TASK_COMM_LEN - 1) < 0)    
> > 
> > Nak - you can't do that.
> > You are reading data that the application doesn't expect you to read.  
> 
> Or have I got confused over the names...

You may have gotten confused by names, as sizeof(me->comm) is the same as
TASK_COMM_LEN. Basically, the above doesn't change anything.

-- Steve

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26 21:52 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>

On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Background
> > ----------
> > The primary use case is accelerating AI model loading, which demands
> > exceptionally high sequential read speeds. In our benchmarks on embedded
> > systems:
> >  - Using high-order page allocations allows the system to saturate the
> >    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> >    medium-to-low CPU frequencies.
> >  - In contrast, standard small folios cap performance at 2 GB/s.
> 
> So you're interested in optimizing the I/O speeds.  And apparently, on
> your hardware, the UFS controller has limits on scatter-gather entries
> --- UFS seems to call this Physical Region Description (PRD) table
> entries.  Per Gemini:
> 
>     1. PRD Segment & Length Limits
> 	
> 	Maximum PRD Entries: Hardware limits typically cap the number
> 	    of PRD entries (or segments) to 255 or 256 per transfer
> 	    request.
> 	
> 	Maximum Transfer Length: Each individual PRD entry typically
> 	    allows a maximum transfer size of (65,535 bytes) per segment.
> 
>     2. Host Controller Hardware Limits (UFSHCI)
>     
> 	Transfer Queue Depth: A UFS controller supports a predefined
> 	    number of outstanding task request entries. This is often
> 	    hard-capped at 32 concurrent transfer requests (slots) by the
> 	    doorbell register array.
> 	
> 	Descriptor Pre-fetch: Some UFS host controllers are
> 	   pre-configured to pre-fetch multiple PRD entries sequentially
> 	   before requiring main memory reads.
> 
> Is this an accurate description of the limits that you are trying to
> work with?  How much data are you trying to read?  Looking at Gemma 4
> models, E2B is about 10GB or 3GB for the 4-bit quantized version.  E4B
> is 15GB, or 5GB for the 4-bit quantized version.  Is that about right?
> 
> It seems... surprising that the additional I/O operations are actually
> throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> into why this is happening, and whether there is anything that can be
> optimized below the file system?

I can't tell the exact size, but roughly it's between 1GB and 4GB. And,
based on lots of test results with various tunings, it turned out that memory
allocation speed was the culprit. With 4KB pages, we couldn't get the full
bandwidth unless we set the biggest core to run at the highest frequency.
Unfortunately, we can't use the core like that because it degrades the
performance of other system services and drains power.

> 
> > Problem Statement
> > -----------------
> > High-order pages become heavily fragmented and scarce shortly after
> > device boot.  We cannot afford to deplete these limited resources on
> > default filesystem operations using large folios. Instead, we need a
> > mechanism to strictly prioritize and reserve high-order allocations
> > for specific, critical payloads—specifically, large AI model files.
> 
> There's a fundamental assumption here, which is that the only use of
> high order pages is the page cache.  This doesn't take into account
> anonymous pages used by programs that aren't backed by files.  Nor does
> it take into account kernel memory allocations.
> 
> But that being said, you seem to be assuming that you can reduce the
> pressure on high order pages by only using large folios for these AI
> model files.
> 
> But the problem with using small folios is that if you want to
> actually *use* the memory, unless you want to segment out the memory
> so it can't be used for anything other than the AI models (e.g., by
> using something like hugetlbfs) it's just going to break up the memory
> into smaller folios.  So that's not actually going to *help* in actual
> real life use cases.  It might help for your artificial benchmarks /
> experiments, but in the real life case where Android applications are
> running and fragmenting all of the device memory, the large folios
> won't be available *anyway*.

Agreed, it's hard to get this done perfectly. As a best effort for this
particular AI model case, I focused on two points in time when the models are
loaded: 1) right after device boot, and 2) dynamic loading when required. To
secure high-order pages, for 1) I disabled the large folios consumed by EROFS,
while for 2) I tried calling compact_memory before loading the model. In both
cases, I observed that we could get a fair amount of large folios, though not
100%.

> 
> > 
> > Q: Why is deregistering the inode number linked to inode deletion?
> > A: We need the high-order allocation hint to persist even if the inode is
> >  temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
> >  list of hinted inode numbers. When a file is permanently deleted, its hint
> >  becomes obsolete, requiring us to deregister it from the list to prevent memory
> >  leaks or identifier reuse conflicts.
> 
> Assuming that the high-order allocation hint is a good thing, why not
> just make it persistent?  e.g., just a *real* extended attribute
> (which is more wasteful of space), or grab a flag in the on-disk f2fs
> inode?  Then you don't need to have an in-memory list of hinted
> inodes; instead, you can just have the Android package manager set
> that flag indicating that you want that special treatment.  This is
> all assuming that we need an explicit hint, though....

I think that's doable, yes, if the explicit hint is acceptable.

> 
> > Massive AI model loading is a long-term architectural
> > paradigm. Providing a targeted VFS/filesystem hint to optimize read
> > bandwidth for specific large datasets is a highly practical,
> > repeatable pattern that addresses a systemic bottleneck in embedded
> > AI deployments.
> 
> It's really too bad you didn't propose this as a LSF/MM topic, and
> presented this at a session at Zagreb two weeks ago.  That would have
> been a much more upstream-friendly way of collaborating, and it might
> have allowed the mm experts to give you some more dynamic, real-time
> feedback.

Indeed, I have been away from LSF/MM for years due to various product issues
not related to F2FS. Let me make some effort to attend upcoming events like LPC,
if I can get the budget from my company.

> 
> Cheers,
> 
> 					- Ted
> 
> 

^ permalink raw reply

* Re: [PATCH v2 3/6] treewide: Replace memcpy(..., current->comm) with strscpy()
From: Steven Rostedt @ 2026-05-26 23:06 UTC (permalink / raw)
  To: André Almeida
  Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Christian Brauner,
	Kees Cook, Shuah Khan, willy, mathieu.desnoyers, David Laight,
	Linus Torvalds, akpm, Yafang Shao, andrii.nakryiko, arnaldo.melo,
	Petr Mladek, linux-kernel, kernel-dev, linux-mm, linux-api
In-Reply-To: <20260524-tonyk-long_name-v2-3-332f6bd041c4@igalia.com>

On Sun, 24 May 2026 19:38:53 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> In order to increase the size of current->comm[] and to avoid breaking any
> existing code, replace memcpy() with strscpy(). The later function makes
> sure that the copy is NUL terminated. This is crucial given that the
> source buffer might be larger than the destination buffer and could
> truncate the NUL character out of it.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> Changes from v2:
>  - New patch, dropped strtostr() from last version
> ---
>  include/linux/coredump.h        |  2 +-
>  include/linux/tracepoint.h      |  4 ++--
>  include/trace/events/block.h    | 10 +++++-----
>  include/trace/events/coredump.h |  2 +-
>  include/trace/events/f2fs.h     |  4 ++--
>  include/trace/events/oom.h      |  2 +-
>  include/trace/events/osnoise.h  |  2 +-
>  include/trace/events/sched.h    | 10 +++++-----
>  include/trace/events/signal.h   |  2 +-
>  include/trace/events/task.h     |  4 ++--
>  kernel/printk/nbcon.c           |  2 +-
>  kernel/printk/printk.c          |  2 +-
>  12 files changed, 23 insertions(+), 23 deletions(-)
> 

So I was curious as to what impact this would have on tracing. I decided to
run the following:

    perf stat -r 100 ./hackbench 50

To see how it affects things. Hackbench is a bit of a microbenchmark but it
stresses the scheduler and thus, scheduler trace events.

I first ran the above and put the output into "stat.baseline", then I enabled
all scheduler trace events:

   trace-cmd start -e sched

and ran it again and put the output into "stat.before".

I applied the patch and ran it again before enabling tracing (just to see
the variance) and put that into "stat.baseline2". I then enabled tracing
and ran it again and put the output into "stat.after".

Here are the results:

stat.baseline:

 Performance counter stats for '/work/c/hackbench 50' (100 runs):

            53,165      context-switches                 #  11002.2 cs/sec  cs_per_second       ( +-  1.33% )
             8,010      cpu-migrations                   #   1657.6 migrations/sec  migrations_per_second  ( +-  0.90% )
            53,936      page-faults                      #  11161.7 faults/sec  page_faults_per_second  ( +-  0.50% )
          4,832.24 msec task-clock                       #      6.0 CPUs  CPUs_utilized         ( +-  0.12% )
        18,787,710      branch-misses                    #      1.2 %  branch_miss_rate         ( +-  0.17% )  (38.88%)
     1,452,653,496      branches                         #    300.6 M/sec  branch_frequency     ( +-  0.14% )  (61.55%)
    15,607,564,080      cpu-cycles                       #      3.2 GHz  cycles_frequency       ( +-  0.15% )  (56.21%)
     7,648,608,518      instructions                     #      0.5 instructions  insn_per_cycle  ( +-  0.11% )  (55.82%)
    12,025,223,911      stalled-cycles-frontend          #     0.77 frontend_cycles_idle        ( +-  0.14% )  (56.26%)

       0.808204663 +- 0.001059873 seconds time elapsed  ( +-  0.13% )

stat.before:

 Performance counter stats for '/work/c/hackbench 50' (100 runs):

            54,722      context-switches                 #  11041.0 cs/sec  cs_per_second       ( +-  1.35% )
             8,170      cpu-migrations                   #   1648.4 migrations/sec  migrations_per_second  ( +-  1.08% )
            54,295      page-faults                      #  10954.8 faults/sec  page_faults_per_second  ( +-  0.53% )
          4,956.27 msec task-clock                       #      6.0 CPUs  CPUs_utilized         ( +-  0.14% )
        19,304,657      branch-misses                    #      1.2 %  branch_miss_rate         ( +-  0.20% )  (37.27%)
     1,497,794,368      branches                         #    302.2 M/sec  branch_frequency     ( +-  0.17% )  (60.74%)
    16,037,658,236      cpu-cycles                       #      3.2 GHz  cycles_frequency       ( +-  0.16% )  (57.72%)
     7,875,024,533      instructions                     #      0.5 instructions  insn_per_cycle  ( +-  0.13% )  (57.83%)
    12,344,722,147      stalled-cycles-frontend          #     0.77 frontend_cycles_idle        ( +-  0.17% )  (55.77%)

       0.827636161 +- 0.001027531 seconds time elapsed  ( +-  0.12% )


stat.baseline2:

 Performance counter stats for '/work/c/hackbench 50' (100 runs):

            52,590      context-switches                 #  10837.7 cs/sec  cs_per_second       ( +-  1.18% )
             7,958      cpu-migrations                   #   1640.0 migrations/sec  migrations_per_second  ( +-  0.99% )
            53,819      page-faults                      #  11090.9 faults/sec  page_faults_per_second  ( +-  0.48% )
          4,852.52 msec task-clock                       #      6.0 CPUs  CPUs_utilized         ( +-  0.11% )
        18,933,395      branch-misses                    #      1.2 %  branch_miss_rate         ( +-  0.18% )  (37.13%)
     1,451,361,950      branches                         #    299.1 M/sec  branch_frequency     ( +-  0.13% )  (60.09%)
    15,683,586,735      cpu-cycles                       #      3.2 GHz  cycles_frequency       ( +-  0.13% )  (56.05%)
     7,628,894,710      instructions                     #      0.5 instructions  insn_per_cycle  ( +-  0.10% )  (57.22%)
    12,063,750,082      stalled-cycles-frontend          #     0.77 frontend_cycles_idle        ( +-  0.14% )  (57.11%)

       0.811536383 +- 0.001337259 seconds time elapsed  ( +-  0.16% )

stat.after:

 Performance counter stats for '/work/c/hackbench 50' (100 runs):

            53,799      context-switches                 #  10743.3 cs/sec  cs_per_second       ( +-  1.35% )
             8,095      cpu-migrations                   #   1616.5 migrations/sec  migrations_per_second  ( +-  0.86% )
            54,330      page-faults                      #  10849.4 faults/sec  page_faults_per_second  ( +-  0.55% )
          5,007.67 msec task-clock                       #      6.0 CPUs  CPUs_utilized         ( +-  0.13% )
        19,444,339      branch-misses                    #      1.2 %  branch_miss_rate         ( +-  0.21% )  (38.04%)
     1,504,382,421      branches                         #    300.4 M/sec  branch_frequency     ( +-  0.17% )  (60.42%)
    16,225,153,060      cpu-cycles                       #      3.2 GHz  cycles_frequency       ( +-  0.16% )  (56.19%)
     7,889,645,005      instructions                     #      0.5 instructions  insn_per_cycle  ( +-  0.16% )  (56.30%)
    12,488,115,947      stalled-cycles-frontend          #     0.77 frontend_cycles_idle        ( +-  0.16% )  (55.55%)

       0.835123855 +- 0.001015781 seconds time elapsed  ( +-  0.12% )


Looking at the difference between cpu-cycles of baseline and baseline2, we have:

  15,607,564,080 vs 15,683,586,735 where it went up by 0.4% (in the noise).

But with tracing enabled, between before and after we have:

  16,037,658,236 vs 16,225,153,060 which is 1.1%. That may be low, but it is
  not insignificant.

Enabling tracing slowed the code down by 2.7% before the patch (16,037,658,236
vs 15,607,564,080), so adding another 1% on top of that is quite an impact!

Tracing now slows it down by 3.9%, which is a significant increase from 2.7%.

I'd really rather keep memcpy() here.
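
For anyone who wants to re-derive the percentages above, here is a minimal
sketch in Python. It is not part of the original measurements: it only reuses
the cpu-cycles counts from the four perf stat outputs, and the helper name
pct_increase is made up for illustration. The computed values should come out
a little above the figures quoted above, which appear to be rounded down.

```python
# cpu-cycles counts copied from the four perf stat outputs above.
CYCLES = {
    "baseline": 15_607_564_080,   # unpatched, tracing off
    "before": 16_037_658_236,     # unpatched, sched trace events on
    "baseline2": 15_683_586_735,  # patched, tracing off
    "after": 16_225_153_060,      # patched, sched trace events on
}


def pct_increase(old, new):
    """Relative increase from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# Run-to-run difference with tracing off (unpatched vs patched).
noise = pct_increase(CYCLES["baseline"], CYCLES["baseline2"])
# Cost added by the patch when sched trace events are enabled.
patch_cost = pct_increase(CYCLES["before"], CYCLES["after"])
# Tracing overhead relative to the unpatched, untraced baseline.
trace_before = pct_increase(CYCLES["baseline"], CYCLES["before"])
trace_after = pct_increase(CYCLES["baseline"], CYCLES["after"])

for name, value in [
    ("baseline -> baseline2", noise),
    ("before -> after", patch_cost),
    ("baseline -> before", trace_before),
    ("baseline -> after", trace_after),
]:
    print(f"{name}: {value:.2f}%")
```

Note that the last figure compares the patched, traced run against the
unpatched baseline, as the text above does; comparing it against baseline2
instead would give a somewhat smaller number.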

-- Steve

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Bart Van Assche @ 2026-05-27  1:15 UTC (permalink / raw)
  To: Theodore Tso, Jaegeuk Kim
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>

On 5/26/26 6:42 AM, Theodore Tso wrote:
>      2. Host Controller Hardware Limits (UFSHCI)
>      
> 	Transfer Queue Depth: A UFS controller supports a predefined
> 	    number of outstanding task request entries. This is often
> 	    hard-capped at 32 concurrent transfer requests (slots) by the
> 	    doorbell register array.

The above information comes from the UFSHCI 3 standard. Jaegeuk's test
setup has a UFSHCI 4.0 controller that supports one submission queue
per CPU and also one completion queue per CPU. This is an architecture
that is very similar but not identical to NVMe. Jaegeuk, please correct
me if I got anything wrong.

Bart.

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Theodore Tso @ 2026-05-27  1:21 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahYWKH9-ybDlZuJd@google.com>

On Tue, May 26, 2026 at 09:52:40PM +0000, Jaegeuk Kim wrote:
> > It seems... surprising that the additional I/O operations are actually
> > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> > into why this is happening, and whether there is anything that can be
> > optimized below the file system?
> 
> I can't tell the exact size tho, roughly it's between 1GB and
> 4GB. And, per lots of test results with various tunings, it turned
> out memory allocation speed was the culprit. If we use 4KB page, we
> couldn't get the full bandwidth unless we set the biggest core
> running the highest frequency.

OK, if we assume that the model file that you want to load is 2GB
then the number of 4k pages that you need is a bit over half a million
(524288).  So if it takes 1 second with small folios (2 GB/s as you
stated above), and half a second with large folios (4 GB/s), then you're
basically saying that it was costing you half a second to allocate 524288
singleton pages.  And the whole point of this exercise is to save that
half second?

And I assume that these timings were taken using performance cores, and part
of the goal here is to be able to use an efficiency core instead.

Did I get that right?
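
The back-of-the-envelope numbers above can be checked with a short Python
sketch. It assumes a 2 GiB model file and 4 KiB pages, and it treats the
2 GB/s and 4 GB/s figures as GiB/s for simplicity; the variable names are
made up for illustration. The per-page figure is only the time difference
spread evenly over the pages, not a measured allocation cost.

```python
GIB = 1 << 30

model_bytes = 2 * GIB   # assumed model size
page_bytes = 4 * 1024   # 4 KiB base page

# Number of singleton pages needed to hold the whole file.
pages = model_bytes // page_bytes

# Throughput figures from the thread, treated here as GiB/s (assumption).
small_folio_rate = 2 * GIB  # bytes/s when reading into 4 KiB pages
large_folio_rate = 4 * GIB  # bytes/s when reading into large folios

t_small = model_bytes / small_folio_rate  # seconds with small folios
t_large = model_bytes / large_folio_rate  # seconds with large folios
extra = t_small - t_large                 # time lost to small folios

# Extra time attributed to each 4 KiB page, in nanoseconds.
ns_per_page = extra / pages * 1e9

print(f"pages needed:        {pages}")
print(f"small folios:        {t_small:.2f} s")
print(f"large folios:        {t_large:.2f} s")
print(f"extra time:          {extra:.2f} s")
print(f"extra cost per page: {ns_per_page:.0f} ns")
```

Under these assumptions the extra cost works out to a bit under a
microsecond per 4 KiB page.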

> > But the problem with using small folios is that if you want to
> > actually *use* the memory, unless you want to segment out the memory
> > so it can't be used for anything other than the AI models (e.g., by
> > using something like hugetlbfs) it's just going to break up the memory
> > into smaller folios.  So that's not actually going to *help* in actual
> > real life use cases.  It might help for your artificial benchmarks /
> > experiments, but in the real life case where Android applications are
> > running and fragmenting all of the device memory, the large folios
> > won't be available *anyway*.
> 
> Agreed it's hard to get this done perfectly tho, as the best effort on this
> particular AI model case, I focused on two timings when loading the models:
> 1) right after device boot, 2) dynamic loading when required. To secure high
> order pages, for 1), I disabled the large folio consumed by EROFS, while for
> 2), I tried to call compact_memory before loading the model. Both of cases,
> I could observe we could get fair amount of large folios. Yes, not 100% tho.

If (1) is a common case in real life, the thing to do would be to grab
2GB of large folios early in the startup sequence, and then let
erofs do its thing --- and then at the end of the startup, right before you
load the model, you can release the 2GB worth of large folios.

(That being said, I'm guessing #1 is actually not that interesting,
since as a percentage of the time that it takes for an Android device
to start up, is adding an extra half-second *really* going to be
noticeable by the user?)

But for case #2, that's the much more challenging case.  If you don't
call compact_memory() you're going to burn half a second to allocate
the 4k pages, since the large folios won't be available.  But if you
*do* call compact_memory() in a production ROM, depending on how fragmented
the memory is and how much memory you have, calling compact_memory() could
take **minutes**.  So what's the point?

The bottom line is if it's right after device boot, there are simple
techniques that don't require hacking up the f2fs.  But in the
demand-loaded case, calling compact_memory() is the last thing you'll
want to do.  You're better off either asking the mm to allocate the 4k
pages, or to do whatever compaction it can do to just free up 2GB worth
of folios.  (Calling compact_memory() is overkill, and only makes
sense in the context of benchmark / proof of concept demo.)

Either way, trying to get file systems to avoid using large folios in
the hopes that this will speed up large AI model loading.... doesn't
seem to make sense.

If the problem is fundamentally about making 2GB worth of large folios
available in a way that takes significantly less time than just
allocating the model using a half-million 4k pages, that's the question
that we should be asking Matthew and the mm folks.  Which is why it
was too bad we didn't raise this issue at LSF/MM earlier this month.
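
For reference, the back-of-the-envelope numbers behind that question
can be sketched as below.  The 2GB model size and the half-second of
allocation overhead are the working assumptions from this thread, not
measurements, and the helper names are made up for illustration.

```python
PAGE_SIZE = 4 * 1024        # 4k base page
FOLIO_2M = 2 * 1024 ** 2    # one order-9 folio on a 4k-page system
MODEL_SIZE = 2 * 1024 ** 3  # assumed 2GB model file


def units_needed(size, unit_size):
    """Number of pages (or folios) of unit_size needed to cover size bytes."""
    return -(-size // unit_size)  # ceiling division


def per_unit_cost_ns(extra_seconds, size, unit_size):
    """Average allocation cost per page implied by a total overhead."""
    return extra_seconds * 1e9 / units_needed(size, unit_size)


n_pages = units_needed(MODEL_SIZE, PAGE_SIZE)          # 4k pages
n_folios = units_needed(MODEL_SIZE, FOLIO_2M)          # 2MB folios
cost_ns = per_unit_cost_ns(0.5, MODEL_SIZE, PAGE_SIZE)  # ns per 4k page
```

So the budget that any compaction or reclaim scheme has to beat is on
the order of a microsecond per 4k page.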

> Indeed, I was off from LSF/MM for years due to various product issues, not
> related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
> if I can get the budget from the company.

Next time, as a suggestion, feel free to raise the issue when the
LSF/MM CFP goes out, even if you don't think it's likely you will get
an invite.  Indeed, with a sufficiently interesting topic, that's the
way to *get* an invitation.  It will require breaking down the
technical requirements as you and I have done for the last few messages
on this thread.

Even if you can't attend LSF/MM due to time or budget reasons, there
are a number of your colleagues who are attending, who could raise the
question on your behalf.  I've been known to do that once or twice on
behalf of other Google teams.  But it does require that you approach
the usual LSF/MM suspects a good 2-3 months before the conference so
we can help you craft an appropriate response to the CFP.

Cheers,

					- Ted

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-27  2:43 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <psj3kr2gcze2yll5xdbvyyzxwcwhds5gh55poobpkfxrkpbgr7@ljdindismzd4>

On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 09:52:40PM +0000, Jaegeuk Kim wrote:
> > > It seems... surprising that the additional I/O operations are actually
> > > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> > > into why this is happening, and whether there is anything that can be
> > > optimized below the file system?
> > 
> > I can't tell the exact size tho, roughly it's between 1GB and
> > 4GB. And, per lots of test results with various tunings, it turned
> > out memory allocation speed was the culprit. If we use 4KB pages, we
> > couldn't get the full bandwidth unless we kept the biggest core
> > running at the highest frequency.
> 
> OK, if we assume that the model file that you want to load is 2GB
> then the number of 4k pages that you need is a bit over half a million
> (524288).  So if it takes 1 second with large folios (2 GB/s as you
> stated above), and half-second without (4 GB/s), then you're basically
> saying that it was costing you half-second to allocate 524288
> singleton pages.  And the whole point of this exercise is to save that
> half second?
> 
> And I assume that these timings were using performance cores, and part
> of the goal here is to be able to use an efficiency core instead.
> 
> Did I get that right?

Yes, right.

> 
> > > But the problem with using small folios is that if you want to
> > > actually *use* the memory, unless you want to segment out the memory
> > > so it can't be used for anything other than the AI models (e.g., by
> > > using something like hugetlbfs) it's just going to break up the memory
> > > into smaller folios.  So that's not actually going to *help* in actual
> > > real life use cases.  It might help for your artificial benchmarks /
> > > experiments, but in the real life case where Android applications are
> > > running and fragmenting all of the device memory, the large folios
> > > won't be available *anyway*.
> > 
> > Agreed it's hard to get this done perfectly tho, as the best effort on this
> > particular AI model case, I focused on two timings when loading the models:
> > 1) right after device boot, 2) dynamic loading when required. To secure high
> > order pages, for 1), I disabled the large folio consumed by EROFS, while for
> > 2), I tried to call compact_memory before loading the model. In both cases,
> > I could observe that we got a fair amount of large folios. Yes, not 100% tho.
> 
> If (1) is a common case in real life, the thing to do would be to grab
> 2GB of large folios early in the startup sequence, and then let
> erofs do its thing --- and then at the end of the startup, right before you
> load the model, you can release the 2GB worth of large folios.
> 
> (That being said, I'm guessing #1 is actually not that interesting,
> since as a percentage of the time that it takes for an Android device
> to start up, is adding an extra half-second *really* going to be
> noticeable by the user?)
> 
> But case #2 is the much more challenging case.  If you don't
> call compact_memory() you're going to burn half a second to allocate
> the 4k pages, since the large folios won't be available.  But if you
> *do* call compact_memory() in a production ROM, depending on how
> fragmented the memory is and how much memory you have, calling
> compact_memory() could take **minutes**.  So what's the point?
> 
> The bottom line is if it's right after device boot, there are simple
> techniques that don't require hacking up f2fs.  But in the
> demand-loaded case, calling compact_memory() is the last thing you'll
> want to do.  You're better off either asking the mm to allocate the 4k
> pages, or having it do whatever compaction it can to just free up 2GB
> worth of folios.  (Calling compact_memory() is overkill, and only makes
> sense in the context of a benchmark / proof of concept demo.)
> 
> Either way, trying to get file systems to avoid using large folios in
> the hopes that this will speed up large AI model loading.... doesn't
> seem to make sense.
> 
> If the problem is fundamentally about making 2GB worth of large folios
> available in a way that takes significantly less time than just
> allocating the model using a half-million 4k pages, that's the question
> that we should be asking Matthew and the mm folks.  Which is why it
> was too bad we didn't raise this issue at LSF/MM earlier this month.

Thanks for the context. To clarify a piece I missed earlier: the model pages
are also utilized for inference. Our data shows that larger chunks yield
higher inference speeds. Consequently, I required high-order pages to optimize
both read throughput and inference latency. I will halt my current efforts
and wait for alternative suggestions.

> 
> > Indeed, I was off from LSF/MM for years due to various product issues, not
> > related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
> > if I can get the budget from the company.
> 
> Next time, as a suggestion, feel free to raise the issue when the
> LSF/MM CFP goes out, even if you don't think it's likely you will get
> an invite.  Indeed, with a sufficiently interesting topic, that's the
> way to *get* an invitation.  It will require breaking down the
> technical requirements as you and I have done for the last few messages
> on this thread.
> 
> Even if you can't attend LSF/MM due to time or budget reasons, there
> are a number of your colleagues who are attending, who could raise the
> question on your behalf.  I've been known to do that once or twice on
> behalf of other Google teams.  But it does require that you approach
> the usual LSF/MM suspects a good 2-3 months before the conference so
> we can help you craft an appropriate response to the CFP.

Thanks for the suggestion. Will definitely do.

> 
> Cheers,
> 
> 					- Ted
> 
> 

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-05-27  3:30 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahZaScMpx19ZLQi4@google.com>

On Wed, May 27, 2026 at 02:43:21AM +0000, Jaegeuk Kim wrote:
> Thanks for the context. To clarify a piece I missed earlier: the model pages
> are also utilized for inference. Our data shows that larger chunks yield
> higher inference speeds. Consequently, I required high-order pages to optimize
> both read throughput and inference latency. I will halt my current efforts
> and wait for alternative suggestions.

I think your efforts would be best directed towards general support for
large folios in f2fs.  There's still 40+ places in f2fs that use a
struct page, and converting them all to folios would be a great help.

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27  6:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jaegeuk Kim, Christoph Hellwig, Theodore Tso, linux-api,
	linux-kernel, linux-f2fs-devel, linux-mm, linux-fsdevel,
	Akilesh Kailash, Christian Brauner
In-Reply-To: <ahUF7HqSKFJ422bU@casper.infradead.org>

On Tue, May 26, 2026 at 03:31:08AM +0100, Matthew Wilcox wrote:
> > > And what are you trying to tell us with that?
> > 
> > This means, high-order pages were used up by EROFS which sets large folio by
> > default. So, I wanted to say the concern was based on actual data which was what
> > Matthew asked.
> 
> This isn't that though.  What you actually need is to show that high order
> allocations are _failing_.

Exactly.
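
For what it's worth, a rough sketch of how such evidence could be
collected: snapshot /proc/vmstat before and after the model load and
look at how much the compaction and fallback counters grew.  Which
counters best capture failing page cache allocations is an assumption
on my side; the names below exist in recent kernels but may not cover
every failing path, and the helper names are made up.

```python
# Counters assumed to be interesting; adjust for the kernel in use.
WATCHED = (
    "compact_stall",
    "compact_fail",
    "thp_fault_fallback",
    "thp_file_fallback",
)


def parse_vmstat(text):
    """Parse /proc/vmstat content ("name value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_vmstat(path="/proc/vmstat"):
    """Take one snapshot of the counters from the running kernel."""
    with open(path) as f:
        return parse_vmstat(f.read())


def counter_deltas(before, after, names=WATCHED):
    """Return how much each watched counter grew between two snapshots.

    Counters missing from a snapshot are treated as zero.
    """
    return {n: after.get(n, 0) - before.get(n, 0) for n in names}
```

Non-zero deltas for the fail and fallback counters across the load
would be the kind of data that shows allocations actually failing,
rather than large folios merely being in use elsewhere.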

> If what you want is large folios readily available, then what you want
> is large folios used _everywhere_ because then they're easy to get!

Yes.

> If there's small folios in use, you need to reclaim a lot of memory in
> order to reassemble large folios (it's the birthday paradox, similar to
> the hash collision problem).

Yeah.  It seems we have an issue with folios above the costly order
at the moment, but we should fix this.
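
To illustrate the reassembly point with a deliberately simplified
model (my assumption, not how the allocator works): if each 4k page is
independently held by a small folio with some probability, an aligned
block of 2^order pages is only usable when none of its pages is held.
The function names are made up for illustration.

```python
def p_block_free(order, held_fraction):
    """Chance that one aligned block of 2**order base pages has no held page.

    Assumes every base page is held independently with probability
    held_fraction, which ignores any clustering done by the allocator.
    """
    return (1.0 - held_fraction) ** (2 ** order)


def expected_free_blocks(total_pages, order, held_fraction):
    """Expected number of fully free aligned blocks in total_pages of memory."""
    blocks = total_pages // (2 ** order)
    return blocks * p_block_free(order, held_fraction)
```

Under that model, with just 1% of pages held by small folios, fewer
than 1% of order-9 (2MB) blocks come out fully free, which is why a
few scattered small folios cost so much reclaim to undo.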

And f2fs really needs to up its game and support large folios fully
so that we can run that kind of analysis there as well; without this,
all of this is just piling hacks on top of other hacks.

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27  6:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Theodore Tso, Jaegeuk Kim, linux-api, linux-kernel,
	Matthew Wilcox, linux-f2fs-devel, Christoph Hellwig, linux-mm,
	linux-fsdevel, Akilesh Kailash, Christian Brauner
In-Reply-To: <f4e521ac-2381-49ca-8dcc-3cb3cf3ffaea@acm.org>

On Tue, May 26, 2026 at 09:14:52AM -0700, Bart Van Assche wrote:
> On 5/26/26 6:42 AM, Theodore Tso wrote:
> > It seems... surprising that the additional I/O operations are actually
> > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s).  Have you dug
> > into why this is happening, and whether there is anything that can be
> > optimized below the file system?
> The layers below the filesystem (block, SCSI, UFS) are what I'm
> responsible for in the Pixel team and I can assure you that these are
> highly optimized.
> 
> Since the transfer size used in Jaegeuk's tests is much larger than 4
> KiB, how many CPU cycles are used per IO by the layers below the
> filesystem is not limiting the transfer bandwidth.

I'm honestly not sure what discussion we're having here.  Larger I/O is
pretty much always more efficient.  If you submit smaller I/O you
need more merging to build it back up into larger requests, and more I/Os.

Which is exactly why we need large folio support everywhere, as it
makes a huge difference in I/O performance.


^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27  6:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jaegeuk Kim, linux-api, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, Christoph Hellwig, linux-mm, linux-fsdevel,
	Akilesh Kailash, Christian Brauner
In-Reply-To: <psj3kr2gcze2yll5xdbvyyzxwcwhds5gh55poobpkfxrkpbgr7@ljdindismzd4>

On Tue, May 26, 2026 at 08:21:43PM -0500, Theodore Tso wrote:
> The bottom line is if it's right after device boot, there are simple
> techniques that don't require hacking up f2fs.  But in the
> demand-loaded case, calling compact_memory() is the last thing you'll
> want to do.  You're better off either asking the mm to allocate the 4k
> pages, or having it do whatever compaction it can to just free up 2GB
> worth of folios.  (Calling compact_memory() is overkill, and only makes
> sense in the context of a benchmark / proof of concept demo.)

Or have a lot of clean pagecache using higher order folios that
you can instantly reclaim?


^ permalink raw reply
