From: Jaegeuk Kim <jaegeuk@kernel.org>
To: Theodore Tso <tytso@mit.edu>
Cc: linux-api@vger.kernel.org, linux-kernel@vger.kernel.org,
Matthew Wilcox <willy@infradead.org>,
linux-f2fs-devel@lists.sourceforge.net,
Christoph Hellwig <hch@infradead.org>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
Akilesh Kailash <akailash@google.com>,
Christian Brauner <christian@brauner.io>
Subject: Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
Date: Tue, 26 May 2026 21:52:40 +0000
Message-ID: <ahYWKH9-ybDlZuJd@google.com>
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>
On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Background
> > ----------
> > The primary use case is accelerating AI model loading, which demands
> > exceptionally high sequential read speeds. In our benchmarks on embedded
> > systems:
> > - Using high-order page allocations allows the system to saturate the
> > Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> > medium-to-low CPU frequencies.
> > - In contrast, standard small folios cap performance at 2 GB/s.
>
> So you're interested in optimizing the I/O speeds. And apparently, on
> your hardware, the UFS controller has limits on scatter-gather entries
> --- UFS seems to call this Physical Region Description (PRD) table
> entries. Per Gemini:
>
> 1. PRD Segment & Length Limits
>
> Maximum PRD Entries: Hardware limits typically cap the number
> of PRD entries (or segments) to 255 or 256 per transfer
> request.
>
> Maximum Transfer Length: Each individual PRD entry typically
> allows a maximum transfer size of (65,535 bytes) per segment.
>
> 2. Host Controller Hardware Limits (UFSHCI)
>
> Transfer Queue Depth: A UFS controller supports a predefined
> number of outstanding task request entries. This is often
> hard-capped at 32 concurrent transfer requests (slots) by the
> doorbell register array.
>
> Descriptor Pre-fetch: Some UFS host controllers are
> pre-configured to pre-fetch multiple PRD entries sequentially
> before requiring main memory reads.
>
> Is this an accurate description of the limits that you are trying to
> work with? How much data are you trying to read? Looking at Gemma 4
> models, E2B is about 10GB or 3GB for the 4-bit quantized version. E4B
> is 15GB, or 5GB for the 4-bit quantized version. Is that about right?
>
> It seems... surprising that the additional I/O operations are actually
> throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
> into why this is happening, and whether there is anything that can be
> optimized below the file system?
I can't tell the exact size, but it's roughly between 1GB and 4GB. And,
after lots of tests with various tunings, it turned out that memory
allocation speed was the culprit. With 4KB pages, we couldn't get the full
bandwidth unless we ran the biggest core at its highest frequency.
Unfortunately, we can't use the core like that, since it degrades other
system services and drains power.
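
To put rough numbers on the allocation cost, here is a back-of-envelope
sketch. It assumes 4KB base pages and, purely for illustration, order-4
(64KB) folios; the helper names are hypothetical and this is arithmetic,
not measured data.

```c
#include <stdint.h>

/*
 * Number of folio allocations needed to cache `bytes` of file data when
 * each folio spans (4 KiB << order) bytes. Order 0 is a single 4 KiB page.
 */
static uint64_t folios_needed(uint64_t bytes, unsigned int order)
{
	uint64_t folio_size = 4096ULL << order;

	/* Round up so a partial tail still counts as one folio. */
	return (bytes + folio_size - 1) / folio_size;
}

/*
 * Allocations per second the page cache has to sustain in order to keep
 * up with a device delivering `bytes_per_sec` (rounded down).
 */
static uint64_t allocs_per_sec(uint64_t bytes_per_sec, unsigned int order)
{
	return bytes_per_sec / (4096ULL << order);
}
```

For a 4GiB file, folios_needed() gives 1,048,576 allocations at order 0
versus 65,536 at order 4. At 4 GB/s, allocs_per_sec() works out to roughly
976K allocations per second with 4KB pages versus roughly 61K with 64KB
folios.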
>
> > Problem Statement
> > -----------------
> > High-order pages become heavily fragmented and scarce shortly after
> > device boot. We cannot afford to deplete these limited resources on
> > default filesystem operations using large folios. Instead, we need a
> > mechanism to strictly prioritize and reserve high-order allocations
> > for specific, critical payloads—specifically, large AI model files.
>
> There's a fundamental assumption here, which is that the only use of
> high order pages is the page cache. This doesn't take into account
> anonymous pages used by programs that aren't backed by files. Nor does
> it take into account kernel memory allocations.
>
> But that being said, you seem to be assuming that you can reduce the
> pressure on high order pages by only using large folios for these AI
> model files.
>
> But the problem with using small folios is that if you want to
> actually *use* the memory, unless you want to segment out the memory
> so it can't be used for anything other than the AI models (e.g., by
> using something like hugetlbfs) it's just going to break up the memory
> into smaller folios. So that's not actually going to *help* in actual
> real life use cases. It might help for your artificial benchmarks /
> experiments, but in the real life case where Android applications are
> running and fragmenting all of the device memory, the large folios
> won't be available *anyway*.
Agreed, it's hard to get this done perfectly. As a best effort for this
particular AI model case, I focused on the two points at which models get
loaded: 1) right after device boot, and 2) dynamic loading when required. To
secure high order pages, for 1) I disabled the large folios consumed by EROFS,
while for 2) I tried calling compact_memory before loading the model. In both
cases, I could observe that we got a fair amount of large folios. Yes, not
100% though.
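
For reference, the compaction step in 2) can be sketched from userspace as
below. This is a minimal illustration, assuming a kernel built with
CONFIG_COMPACTION and a caller privileged enough to write
/proc/sys/vm/compact_memory; the helper name is hypothetical and this is not
the code used in the experiments.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" to a sysctl-style control file. Passing
 * "/proc/sys/vm/compact_memory" asks the kernel to compact all zones,
 * which may free up high-order pages before a large sequential read.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int trigger_compaction(const char *ctl_path)
{
	ssize_t n;
	int fd = open(ctl_path, O_WRONLY);

	if (fd < 0)
		return -1;

	n = write(fd, "1", 1);
	close(fd);

	return n == 1 ? 0 : -1;
}
```

A loader would call trigger_compaction("/proc/sys/vm/compact_memory") once,
right before opening the model file. The path is a parameter only so the
helper can be exercised against another file.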
>
> >
> > Q: Why is deregistering the inode number linked to inode deletion?
> > A: We need the high-order allocation hint to persist even if the inode is
> > temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
> > list of hinted inode numbers. When a file is permanently deleted, its hint
> > becomes obsolete, requiring us to deregister it from the list to prevent memory
> > leaks or identifier reuse conflicts.
>
> Assuming that the high-order allocation hint is a good thing, why not
> just make it persistent? e.g., just a *real* extended attribute
> (which is more wasteful of space), or grab a flag in the on-disk f2fs
> inode? Then you don't need to have an in-memory list of hinted
> inodes; instead, you can just have the Android package manager set
> that flag indicating that you want that special treatment. This is
> all assuming that we need an explicit hint, though....
I think that's doable, yes, if the explicit hint is acceptable.
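
As a rough illustration of the "real extended attribute" variant, the
userspace side could look like the sketch below. The attribute name
"user.large_folio_hint" and both helper names are made up for this example;
nothing here is an existing f2fs interface, and the kernel side that would
read the attribute when the inode is loaded is not shown.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name, chosen only for this sketch. */
#define LARGE_FOLIO_HINT_XATTR "user.large_folio_hint"

/*
 * Persistently tag a file as wanting large folios by storing a regular
 * extended attribute on it. Returns 0 on success, -1 with errno set.
 */
static int set_large_folio_hint(const char *path)
{
	static const char val[] = "1";

	return setxattr(path, LARGE_FOLIO_HINT_XATTR, val, strlen(val), 0);
}

/* Return 1 if the file carries the hint, 0 otherwise (including errors). */
static int has_large_folio_hint(const char *path)
{
	char buf[8];
	ssize_t n = getxattr(path, LARGE_FOLIO_HINT_XATTR, buf, sizeof(buf));

	return n == 1 && buf[0] == '1';
}
```

Because the attribute lives on disk with the inode, it survives eviction
from the VFS cache and goes away when the file is deleted, so no in-memory
list of inode numbers would be needed.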
>
> > Massive AI model loading is a long-term architectural
> > paradigm. Providing a targeted VFS/filesystem hint to optimize read
> > bandwidth for specific large datasets is a highly practical,
> > repeatable pattern that addresses a systemic bottleneck in embedded
> > AI deployments.
>
> It's really too bad you didn't propose this as an LSF/MM topic, and
> present this at a session in Zagreb two weeks ago. That would have
> been a much more upstream-friendly way of collaborating, and it might
> have allowed the mm experts to give you some more dynamic, real-time
> feedback.
Indeed, I've been away from LSF/MM for years due to various product issues,
not related to F2FS though. Let me make some effort to attend upcoming events
like LPC, if I can get the budget from my company.
>
> Cheers,
>
> - Ted