* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-26 21:52 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>
On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Background
> > ----------
> > The primary use case is accelerating AI model loading, which demands
> > exceptionally high sequential read speeds. In our benchmarks on embedded
> > systems:
> > - Using high-order page allocations allows the system to saturate the
> > Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> > medium-to-low CPU frequencies.
> > - In contrast, standard small folios cap performance at 2 GB/s.
>
> So you're interested in optimizing the I/O speeds. And apparently, on
> your hardware, the UFS controller has limits on scatter-gather entries
> --- UFS seems to call this Physical Region Description (PRD) table
> entries. Per Gemini:
>
> 1. PRD Segment & Length Limits
>
> Maximum PRD Entries: Hardware limits typically cap the number
> of PRD entries (or segments) to 255 or 256 per transfer
> request.
>
> Maximum Transfer Length: Each individual PRD entry typically
> allows a maximum transfer size of (65,535 bytes) per segment.
>
> 2. Host Controller Hardware Limits (UFSHCI)
>
> Transfer Queue Depth: A UFS controller supports a predefined
> number of outstanding task request entries. This is often
> hard-capped at 32 concurrent transfer requests (slots) by the
> doorbell register array.
>
> Descriptor Pre-fetch: Some UFS host controllers are
> pre-configured to pre-fetch multiple PRD entries sequentially
> before requiring main memory reads.
>
> Is this an accurate description of the limits that you are trying to
> work with? How much data are you trying to read? Looking at Gemma 4
> models, E2B is about 10GB or 3GB for the 4-bit quantized version. E4B
> is 15GB, or 5GB for the 4-bit quantized version. Is that about right?
>
> It seems... surprising that the additional I/O operations are actually
> throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
> into why this is happening, and whether there is anything that can be
> optimized below the file system?
I can't tell the exact size tho; roughly it's between 1GB and 4GB. And,
per lots of test results with various tunings, it turned out memory
allocation speed was the culprit. If we use 4KB pages, we can't get the
full bandwidth unless we run the biggest core at the highest frequency.
Unfortunately, however, we can't use the core like that due to the
performance drop in other system services and the power drain.
>
> > Problem Statement
> > -----------------
> > High-order pages become heavily fragmented and scarce shortly after
> > device boot. We cannot afford to deplete these limited resources on
> > default filesystem operations using large folios. Instead, we need a
> > mechanism to strictly prioritize and reserve high-order allocations
> > for specific, critical payloads—specifically, large AI model files.
>
> There's a fundamental assumption here, which is that the only use of
> high order pages is the page cache. This doesn't take into account
> anonymous pages used by programs that aren't backed by files. Nor does
> it take into account kernel memory allocations.
>
> But that being said, you seem to be assuming that you can reduce the
> pressure on high order pages by only using large folios for these AI
> model files.
>
> But the problem with using small folios is that if you want to
> actually *use* the memory, unless you want to segment out the memory
> so it can't be used for anything other than the AI models (e.g., by
> using something like hugetlbfs) it's just going to break up the memory
> into smaller folios. So that's not actually going to *help* in actual
> real life use cases. It might help for your artificial benchmarks /
> experiments, but in the real life case where Android applications are
> running and fragmenting all of the device memory, the large folios
> won't be available *anyway*.
Agreed it's hard to get this done perfectly tho, as the best effort on this
particular AI model case, I focused on two timings when loading the models:
1) right after device boot, 2) dynamic loading when required. To secure high
order pages, for 1), I disabled the large folio consumed by EROFS, while for
2), I tried to call compact_memory before loading the model. In both cases,
I could observe that we got a fair amount of large folios. Yes, not 100% tho.
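One way to observe how many high-order blocks are free before and after such an experiment is to read /proc/buddyinfo, which lists the number of free blocks per order for each zone. The following is a minimal sketch, assuming the usual "Node N, zone NAME count0 count1 ..." line layout; the function names and the sample line are made up for illustration and are not data from this thread.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_pages_at_or_above(zones, min_order):
    """Free base pages that sit in blocks of order >= min_order."""
    return sum(
        nblocks << order
        for counts in zones.values()
        for order, nblocks in enumerate(counts)
        if order >= min_order
    )


# Made-up sample line for illustration only (hypothetical values).
SAMPLE = "Node 0, zone   Normal    100     50     20     10      4      2      1      0      0      0      0\n"

if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back when /proc/buddyinfo is unavailable
    zones = parse_buddyinfo(text)
    print("free base pages in order >= 4 blocks:", free_pages_at_or_above(zones, 4))
```

Comparing the result right after boot, after normal use, and after a compaction pass would give a rough picture of how much high-order memory each approach actually secures.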
>
> >
> > Q: Why is deregistering the inode number linked to inode deletion?
> > A: We need the high-order allocation hint to persist even if the inode is
> > temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
> > list of hinted inode numbers. When a file is permanently deleted, its hint
> > becomes obsolete, requiring us to deregister it from the list to prevent memory
> > leaks or identifier reuse conflicts.
>
> Assuming that the high-order allocation hint is a good thing, why not
> just make it persistent? e.g., just a *real* extended attribute
> (which is more wasteful of space), or grab a flag in the on-disk f2fs
> inode? Then you don't need to have an in-memory list of hinted
> inodes; instead, you can just have the Android package manager set
> that flag indicating that you want that special treatment. This is
> all assuming that we need an explicit hint, though....
I think that's doable, yes, if the explicit hint is acceptable.
>
> > Massive AI model loading is a long-term architectural
> > paradigm. Providing a targeted VFS/filesystem hint to optimize read
> > bandwidth for specific large datasets is a highly practical,
> > repeatable pattern that addresses a systemic bottleneck in embedded
> > AI deployments.
>
> It's really too bad you didn't propose this as a LSF/MM topic, and
> presented this at a session at Zagreb two weeks ago. That would have
> been a much more upstream-friendly way of collaborating, and it might
> have allowed the mm experts to give you some more dynamic, real-time
> feedback.
Indeed, I was off from LSF/MM for years due to various product issues, not
related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
if I can get the budget from the company.
>
> Cheers,
>
> - Ted
>
>
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
^ permalink raw reply
* Re: [PATCH v2 3/6] treewide: Replace memcpy(..., current->comm) with strscpy()
From: Steven Rostedt @ 2026-05-26 23:06 UTC (permalink / raw)
To: André Almeida
Cc: Peter Zijlstra, Juri Lelli, Vincent Guittot, Christian Brauner,
Kees Cook, Shuah Khan, willy, mathieu.desnoyers, David Laight,
Linus Torvalds, akpm, Yafang Shao, andrii.nakryiko, arnaldo.melo,
Petr Mladek, linux-kernel, kernel-dev, linux-mm, linux-api
In-Reply-To: <20260524-tonyk-long_name-v2-3-332f6bd041c4@igalia.com>
On Sun, 24 May 2026 19:38:53 -0300
André Almeida <andrealmeid@igalia.com> wrote:
> In order to increase the size of current->comm[] and to avoid breaking any
> existing code, replace memcpy() with strscpy(). The latter function makes
> sure that the copy is NUL terminated. This is crucial given that the
> source buffer might be larger than the destination buffer and could
> truncate the NUL character out of it.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> Changes from v2:
> - New patch, dropped strtostr() from last version
> ---
> include/linux/coredump.h | 2 +-
> include/linux/tracepoint.h | 4 ++--
> include/trace/events/block.h | 10 +++++-----
> include/trace/events/coredump.h | 2 +-
> include/trace/events/f2fs.h | 4 ++--
> include/trace/events/oom.h | 2 +-
> include/trace/events/osnoise.h | 2 +-
> include/trace/events/sched.h | 10 +++++-----
> include/trace/events/signal.h | 2 +-
> include/trace/events/task.h | 4 ++--
> kernel/printk/nbcon.c | 2 +-
> kernel/printk/printk.c | 2 +-
> 12 files changed, 23 insertions(+), 23 deletions(-)
>
So I was curious as to what impact this would have on tracing. I decided to
run the following:
perf stat -r 100 ./hackbench 50
To see how it affects things. Hackbench is a bit of a microbenchmark but it
stresses the scheduler and thus, scheduler trace events.
I first ran the above and put the output into "stat.baseline", then I enabled
all scheduler trace events:
trace-cmd start -e sched
and ran it again and put the output into "stat.before".
I applied the patch and ran it again before enabling tracing (just to see
the variance) and put that into "stat.baseline2". I then enabled tracing
and ran it again and put the output into "stat.after".
Here's the results:
stat.baseline:
Performance counter stats for '/work/c/hackbench 50' (100 runs):
53,165 context-switches # 11002.2 cs/sec cs_per_second ( +- 1.33% )
8,010 cpu-migrations # 1657.6 migrations/sec migrations_per_second ( +- 0.90% )
53,936 page-faults # 11161.7 faults/sec page_faults_per_second ( +- 0.50% )
4,832.24 msec task-clock # 6.0 CPUs CPUs_utilized ( +- 0.12% )
18,787,710 branch-misses # 1.2 % branch_miss_rate ( +- 0.17% ) (38.88%)
1,452,653,496 branches # 300.6 M/sec branch_frequency ( +- 0.14% ) (61.55%)
15,607,564,080 cpu-cycles # 3.2 GHz cycles_frequency ( +- 0.15% ) (56.21%)
7,648,608,518 instructions # 0.5 instructions insn_per_cycle ( +- 0.11% ) (55.82%)
12,025,223,911 stalled-cycles-frontend # 0.77 frontend_cycles_idle ( +- 0.14% ) (56.26%)
0.808204663 +- 0.001059873 seconds time elapsed ( +- 0.13% )
stat.before:
Performance counter stats for '/work/c/hackbench 50' (100 runs):
54,722 context-switches # 11041.0 cs/sec cs_per_second ( +- 1.35% )
8,170 cpu-migrations # 1648.4 migrations/sec migrations_per_second ( +- 1.08% )
54,295 page-faults # 10954.8 faults/sec page_faults_per_second ( +- 0.53% )
4,956.27 msec task-clock # 6.0 CPUs CPUs_utilized ( +- 0.14% )
19,304,657 branch-misses # 1.2 % branch_miss_rate ( +- 0.20% ) (37.27%)
1,497,794,368 branches # 302.2 M/sec branch_frequency ( +- 0.17% ) (60.74%)
16,037,658,236 cpu-cycles # 3.2 GHz cycles_frequency ( +- 0.16% ) (57.72%)
7,875,024,533 instructions # 0.5 instructions insn_per_cycle ( +- 0.13% ) (57.83%)
12,344,722,147 stalled-cycles-frontend # 0.77 frontend_cycles_idle ( +- 0.17% ) (55.77%)
0.827636161 +- 0.001027531 seconds time elapsed ( +- 0.12% )
stat.baseline2:
Performance counter stats for '/work/c/hackbench 50' (100 runs):
52,590 context-switches # 10837.7 cs/sec cs_per_second ( +- 1.18% )
7,958 cpu-migrations # 1640.0 migrations/sec migrations_per_second ( +- 0.99% )
53,819 page-faults # 11090.9 faults/sec page_faults_per_second ( +- 0.48% )
4,852.52 msec task-clock # 6.0 CPUs CPUs_utilized ( +- 0.11% )
18,933,395 branch-misses # 1.2 % branch_miss_rate ( +- 0.18% ) (37.13%)
1,451,361,950 branches # 299.1 M/sec branch_frequency ( +- 0.13% ) (60.09%)
15,683,586,735 cpu-cycles # 3.2 GHz cycles_frequency ( +- 0.13% ) (56.05%)
7,628,894,710 instructions # 0.5 instructions insn_per_cycle ( +- 0.10% ) (57.22%)
12,063,750,082 stalled-cycles-frontend # 0.77 frontend_cycles_idle ( +- 0.14% ) (57.11%)
0.811536383 +- 0.001337259 seconds time elapsed ( +- 0.16% )
stat.after:
Performance counter stats for '/work/c/hackbench 50' (100 runs):
53,799 context-switches # 10743.3 cs/sec cs_per_second ( +- 1.35% )
8,095 cpu-migrations # 1616.5 migrations/sec migrations_per_second ( +- 0.86% )
54,330 page-faults # 10849.4 faults/sec page_faults_per_second ( +- 0.55% )
5,007.67 msec task-clock # 6.0 CPUs CPUs_utilized ( +- 0.13% )
19,444,339 branch-misses # 1.2 % branch_miss_rate ( +- 0.21% ) (38.04%)
1,504,382,421 branches # 300.4 M/sec branch_frequency ( +- 0.17% ) (60.42%)
16,225,153,060 cpu-cycles # 3.2 GHz cycles_frequency ( +- 0.16% ) (56.19%)
7,889,645,005 instructions # 0.5 instructions insn_per_cycle ( +- 0.16% ) (56.30%)
12,488,115,947 stalled-cycles-frontend # 0.77 frontend_cycles_idle ( +- 0.16% ) (55.55%)
0.835123855 +- 0.001015781 seconds time elapsed ( +- 0.12% )
Looking at the difference between cpu-cycles of baseline and baseline2, we have:
15,607,564,080 vs 15,683,586,735 where it went up by 0.4% (in the noise).
But when enabling tracing, we have between before and after:
16,037,658,236 vs 16,225,153,060 which is 1.1%. It may be low, but it's not insignificant.
Given that enabling tracing slowed the code down by 2.7% (16,037,658,236 vs 15,607,564,080),
adding another 1% is quite an impact!
Tracing now slows it down by 3.9%, which is a significant increase from 2.7%.
I'd really rather keep memcpy() here.
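For anyone who wants to re-derive the percentages, here is a small sketch that recomputes them from the cpu-cycles means in the four perf stat runs above. The dictionary keys and the helper name are illustrative only; the cycle counts are copied from the output. The quoted figures look truncated rather than rounded, so the values printed here come out slightly higher.

```python
# cpu-cycles means copied from the four perf stat runs above.
CYCLES = {
    "baseline": 15_607_564_080,   # unpatched, tracing off
    "before": 16_037_658_236,     # unpatched, sched events enabled
    "baseline2": 15_683_586_735,  # patched, tracing off
    "after": 16_225_153_060,      # patched, sched events enabled
}


def pct_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


COMPARISONS = [
    ("patch alone (baseline -> baseline2)", "baseline", "baseline2"),
    ("patch with tracing on (before -> after)", "before", "after"),
    ("tracing cost, unpatched (baseline -> before)", "baseline", "before"),
    ("tracing cost, patched (baseline -> after)", "baseline", "after"),
]

for label, old_key, new_key in COMPARISONS:
    delta = pct_increase(CYCLES[old_key], CYCLES[new_key])
    print(f"{label}: {delta:.2f}%")
```

Note that the last comparison is taken against the unpatched baseline, which is what yields the roughly 3.9% figure.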
-- Steve
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Bart Van Assche @ 2026-05-27 1:15 UTC (permalink / raw)
To: Theodore Tso, Jaegeuk Kim
Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
Christian Brauner
In-Reply-To: <ybmbjekuvzmaw4hmlxd7nxs546dqtwmxqxwyali74d6m3u7tat@b4q3japqnhrl>
On 5/26/26 6:42 AM, Theodore Tso wrote:
> 2. Host Controller Hardware Limits (UFSHCI)
>
> Transfer Queue Depth: A UFS controller supports a predefined
> number of outstanding task request entries. This is often
> hard-capped at 32 concurrent transfer requests (slots) by the
> doorbell register array.
The above information comes from the UFSHCI 3 standard. Jaegeuk's test
setup has a UFSHCI 4.0 controller that supports one submission queue
per CPU and also one completion queue per CPU. This is an architecture
that is very similar but not identical to NVMe. Jaegeuk, please correct
me if I got anything wrong.
Bart.
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Theodore Tso @ 2026-05-27 1:21 UTC (permalink / raw)
To: Jaegeuk Kim
Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
Christian Brauner
In-Reply-To: <ahYWKH9-ybDlZuJd@google.com>
On Tue, May 26, 2026 at 09:52:40PM +0000, Jaegeuk Kim wrote:
> > It seems... surprising that the additional I/O operations are actually
> > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
> > into why this is happening, and whether there is anything that can be
> > optimized below the file system?
>
> I can't tell the exact size tho, roughly it's between 1GB and
> 4GB. And, per lots of test results with various tunings, it turned
> out memory allocation speed was the culprit. If we use 4KB pages, we
> can't get the full bandwidth unless we run the biggest core at the
> highest frequency.
OK, if we assume that the model file that you want to load is 2GB,
then the number of 4k pages that you need is a bit over half a million
(524288). So if it takes 1 second with small folios (2 GB/s as you
stated above), and half a second with large folios (4 GB/s), then you're
basically saying that it was costing you half a second to allocate 524288
singleton pages. And the whole point of this exercise is to save that
half second?
And I assume that these timings were taken on a performance core, and part
of the goal here is to be able to use an efficiency core instead.
Did I get that right?
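The back-of-the-envelope numbers can be checked with a few lines. This is only a sketch under stated assumptions: a 2 GiB model, 4 KiB base pages, and the benchmark figures from earlier in the thread (4 GB/s with high-order allocations, 2 GB/s with small folios), with GB treated as GiB to keep the arithmetic round. The variable names are illustrative.

```python
GIB = 1024 ** 3

MODEL_BYTES = 2 * GIB   # assumed model size
PAGE_BYTES = 4 * 1024   # 4 KiB base page

pages = MODEL_BYTES // PAGE_BYTES


def load_seconds(size_bytes, bytes_per_second):
    """Time to read size_bytes at a sustained bytes_per_second."""
    return size_bytes / bytes_per_second


small_folio_s = load_seconds(MODEL_BYTES, 2 * GIB)  # 2 GB/s case
large_folio_s = load_seconds(MODEL_BYTES, 4 * GIB)  # 4 GB/s case
extra_s = small_folio_s - large_folio_s

# If the whole gap is attributed to per-page allocation work,
# this is the implied extra cost per 4 KiB page.
per_page_ns = extra_s / pages * 1e9

print(f"pages needed:        {pages}")
print(f"small folios:        {small_folio_s:.2f} s")
print(f"large folios:        {large_folio_s:.2f} s")
print(f"extra time:          {extra_s:.2f} s")
print(f"implied cost / page: {per_page_ns:.0f} ns")
```

Under these assumptions the gap works out to a bit under a microsecond per 4 KiB page, which is the budget any alternative approach would have to beat.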
> > But the problem with using small folios is that if you want to
> > actually *use* the memory, unless you want to segment out the memory
> > so it can't be used for anything other than the AI models (e.g., by
> > using something like hugetlbfs) it's just going to break up the memory
> > into smaller folios. So that's not actually going to *help* in actual
> > real life use cases. It might help for your artificial benchmarks /
> > experiments, but in the real life case where Android applications are
> > running and fragmenting all of the device memory, the large folios
> > won't be available *anyway*.
>
> Agreed it's hard to get this done perfectly tho, as the best effort on this
> particular AI model case, I focused on two timings when loading the models:
> 1) right after device boot, 2) dynamic loading when required. To secure high
> order pages, for 1), I disabled the large folio consumed by EROFS, while for
> 2), I tried to call compact_memory before loading the model. In both cases,
> I could observe that we got a fair amount of large folios. Yes, not 100% tho.
If (1) is a common case in real life, the thing to do would be to grab
2GB of large folios early in the startup sequence, and then let
erofs do its thing --- and then at the end of the startup, right before you
load the model, you can release the 2GB worth of large folios.
(That being said, I'm guessing #1 is actually not that interesting,
since as a percentage of the time that it takes for an Android device
to start up, is adding an extra half-second *really* going to be
noticeable by the user?)
But for case #2, that's the much more challenging case. If you don't
call compact_memory() you're going to burn half a second to allocate
the 4k pages, since the large folios won't be available. But if you
*do* call compact_memory() in a production ROM, depending on how fragmented
the memory is and how much memory you have, calling compact_memory() could take
**minutes**. So what's the point?
The bottom line is if it's right after device boot, there are simple
techniques that don't require hacking up the f2fs. But in the
demand-loaded case, calling compact_memory() is the last thing you'll
want to do. You're better off either asking the mm to allocate the 4k
pages, or do whatever compaction it can do to just free up 2GB worth
of folios. (Calling compact_memory() is overkill, and only makes
sense in the context of benchmark / proof of concept demo.)
Either way, trying to get file systems to avoid using large folios in
the hopes that this will speed up large AI model loading.... doesn't
seem to make sense.
If the problem is fundamentally about making 2GB worth of large folios
available in a way that takes significantly less time than just
allocating the model using half a million 4k pages, that's the question
that we should be asking Matthew and the mm folks. Which is why it
was too bad we didn't raise this issue at LSF/MM earlier this month.
> Indeed, I was off from LSF/MM for years due to various product issues, not
> related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
> if I can get the budget from the company.
Next time, as a suggestion, feel free to raise the issue when the
LSF/MM CFP goes out, even if you don't think it's likely you will get
an invite. Indeed, with a sufficiently interesting topic, that's the
way to *get* an invitation. It will require breaking down the
technical requirements as you and I have done for the last few messages on
this thread.
Even if you can't attend LSF/MM due to time or budget reasons, there
are a number of your colleagues who are attending, who could raise the
question on your behalf. I've been known to do that once or twice on
behalf of other Google teams. But it does require that you approach
the usual LSF/MM suspects a good 2-3 months before the conference so
we can help you craft an appropriate response to the CFP.
Cheers,
- Ted
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-27 2:43 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-api, linux-kernel, Matthew Wilcox, linux-f2fs-devel,
Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
Christian Brauner
In-Reply-To: <psj3kr2gcze2yll5xdbvyyzxwcwhds5gh55poobpkfxrkpbgr7@ljdindismzd4>
On 05/26, Theodore Tso wrote:
> On Tue, May 26, 2026 at 09:52:40PM +0000, Jaegeuk Kim wrote:
> > > It seems... surprising that the additional I/O operations are actually
> > > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
> > > into why this is happening, and whether there is anything that can be
> > > optimized below the file system?
> >
> > I can't tell the exact size tho, roughly it's between 1GB and
> > 4GB. And, per lots of test results with various tunings, it turned
> > out memory allocation speed was the culprit. If we use 4KB pages, we
> > can't get the full bandwidth unless we run the biggest core at the
> > highest frequency.
>
> OK, if we assume that the model file that you want to load is 2GB,
> then the number of 4k pages that you need is a bit over half a million
> (524288). So if it takes 1 second with small folios (2 GB/s as you
> stated above), and half a second with large folios (4 GB/s), then you're
> basically saying that it was costing you half a second to allocate 524288
> singleton pages. And the whole point of this exercise is to save that
> half second?
>
> And I assume that these timings were taken on a performance core, and part
> of the goal here is to be able to use an efficiency core instead.
>
> Did I get that right?
Yes, right.
>
> > > But the problem with using small folios is that if you want to
> > > actually *use* the memory, unless you want to segment out the memory
> > > so it can't be used for anything other than the AI models (e.g., by
> > > using something like hugetlbfs) it's just going to break up the memory
> > > into smaller folios. So that's not actually going to *help* in actual
> > > real life use cases. It might help for your artificial benchmarks /
> > > experiments, but in the real life case where Android applications are
> > > running and fragmenting all of the device memory, the large folios
> > > won't be available *anyway*.
> >
> > Agreed it's hard to get this done perfectly tho, as the best effort on this
> > particular AI model case, I focused on two timings when loading the models:
> > 1) right after device boot, 2) dynamic loading when required. To secure high
> > order pages, for 1), I disabled the large folio consumed by EROFS, while for
> > 2), I tried to call compact_memory before loading the model. In both cases,
> > I could observe that we got a fair amount of large folios. Yes, not 100% tho.
>
> If (1) is a common case in real life, the thing to do would be to grab
> 2GB of large folios early in the startup sequence, and then let
> erofs do its thing --- and then at the end of the startup, right before you
> load the model, you can release the 2GB worth of large folios.
>
> (That being said, I'm guessing #1 is actually not that interesting,
> since as a percentage of the time that it takes for an Android device
> to start up, is adding an extra half-second *really* going to be
> noticeable by the user?)
>
> But for case #2, that's the much more challenging case. If you don't
> call compact_memory() you're going to burn half a second to allocate
> the 4k pages, since the large folios won't be available. But if you
> *do* call compact_memory() in a production ROM, depending on how fragmented
> the memory is and how much memory you have, calling compact_memory() could take
> **minutes**. So what's the point?
>
> The bottom line is if it's right after device boot, there are simple
> techniques that don't require hacking up the f2fs. But in the
> demand-loaded case, calling compact_memory() is the last thing you'll
> want to do. You're better off either asking the mm to allocate the 4k
> pages, or do whatever compaction it can do to just free up 2GB worth
> of folios. (Calling compact_memory() is overkill, and only makes
> sense in the context of benchmark / proof of concept demo.)
>
> Either way, trying to get file systems to avoid using large folios in
> the hopes that this will speed up large AI model loading.... doesn't
> seem to make sense.
>
> If the problem is fundamentally about making 2GB worth of large folios
> available in a way that takes significantly less time than just
> allocating the model using half a million 4k pages, that's the question
> that we should be asking Matthew and the mm folks. Which is why it
> was too bad we didn't raise this issue at LSF/MM earlier this month.
Thanks for the context. To clarify a piece I missed earlier: the model pages
are also utilized for inference. Our data shows that larger chunks yield
higher inference speeds. Consequently, I required high-order pages to optimize
both read throughput and inference latency. I will halt my current efforts
and wait for alternative suggestions.
>
> > Indeed, I was off from LSF/MM for years due to various product issues, not
> > related to F2FS tho. Let me make some effort to attend upcoming ones like LPC,
> > if I can get the budget from the company.
>
> Next time, as a suggestion, feel free to raise the issue when the
> LSF/MM CFP goes out, even if you don't think it's likely you will get
> an invite. Indeed, with a sufficiently interesting topic, that's the
> way to *get* an invitation. It will require breaking down the
> technical requirements as you and I have done for the last few messages on
> this thread.
>
> Even if you can't attend LSF/MM due to time or budget reasons, there
> are a number of your colleagues who are attending, who could raise the
> question on your behalf. I've been known to do that once or twice on
> behalf of other Google teams. But it does require that you approach
> the usual LSF/MM suspects a good 2-3 months before the conference so
> we can help you craft an appropriate response to the CFP.
Thanks for the suggestion. Will definitely do.
>
> Cheers,
>
> - Ted
>
>
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Matthew Wilcox @ 2026-05-27 3:30 UTC (permalink / raw)
To: Jaegeuk Kim
Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
Christian Brauner
In-Reply-To: <ahZaScMpx19ZLQi4@google.com>
On Wed, May 27, 2026 at 02:43:21AM +0000, Jaegeuk Kim wrote:
> Thanks for the context. To clarify a piece I missed earlier: the model pages
> are also utilized for inference. Our data shows that larger chunks yield
> higher inference speeds. Consequently, I required high-order pages to optimize
> both read throughput and inference latency. I will halt my current efforts
> and wait for alternative suggestions.
I think your efforts would be best directed towards general support for
large folios in f2fs. There's still 40+ places in f2fs that use a
struct page, and converting them all to folios would be a great help.
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27 6:26 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jaegeuk Kim, Christoph Hellwig, Theodore Tso, linux-api,
linux-kernel, linux-f2fs-devel, linux-mm, linux-fsdevel,
Akilesh Kailash, Christian Brauner
In-Reply-To: <ahUF7HqSKFJ422bU@casper.infradead.org>
On Tue, May 26, 2026 at 03:31:08AM +0100, Matthew Wilcox wrote:
> > > And what are you trying to tell us with that?
> >
> > This means high-order pages were used up by EROFS, which sets large folios by
> > default. So, I wanted to say the concern was based on actual data, which was what
> > Matthew asked for.
>
> This isn't that though. What you actually need is to show that high order
> allocations are _failing_.
Exactly.
> If what you want is large folios readily available, then what you want
> is large folios used _everywhere_ because then they're easy to get!
Yes.
> If there's small folios in use, you need to reclaim a lot of memory in
> order to reassemble large folios (it's the birthday paradox, similar to
> the hash collision problem).
Yeah. Although it seems we have an issue with > order costly folios
at the moment, but we should fix this.
And f2fs really needs to up its game and support large folios fully
so that we can run that kind of analysis there as well; without this,
all this is just piling hacks on top of other hacks.
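The birthday-paradox remark quoted above can be illustrated with a deliberately simple toy model. It assumes each base page is independently occupied by an unreclaimable small allocation with some fixed probability, which is not how the page allocator actually places pages (it groups allocations by mobility), so the numbers are only meant to show the trend. The function name is hypothetical.

```python
def p_block_free(occupied_fraction, order):
    """Probability that every page in an aligned order-N block is free,
    assuming each page is occupied independently (toy assumption)."""
    pages_in_block = 1 << order
    return (1.0 - occupied_fraction) ** pages_in_block


# With only 1% of pages pinned by scattered small allocations,
# the chance of finding an intact block drops quickly with the order.
for order in (0, 2, 4, 6, 9):
    size_kib = 4 << order  # block size with 4 KiB base pages
    print(f"order {order:2d} ({size_kib:5d} KiB): "
          f"{p_block_free(0.01, order):.4f}")
```

In this model a small fraction of scattered small folios is enough to make intact high-order blocks rare, which matches the argument that large folios are easiest to get when they are used everywhere.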
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27 6:28 UTC (permalink / raw)
To: Bart Van Assche
Cc: Theodore Tso, Jaegeuk Kim, linux-api, linux-kernel,
Matthew Wilcox, linux-f2fs-devel, Christoph Hellwig, linux-mm,
linux-fsdevel, Akilesh Kailash, Christian Brauner
In-Reply-To: <f4e521ac-2381-49ca-8dcc-3cb3cf3ffaea@acm.org>
On Tue, May 26, 2026 at 09:14:52AM -0700, Bart Van Assche wrote:
> On 5/26/26 6:42 AM, Theodore Tso wrote:
> > It seems... surprising that the additional I/O operations are actually
> > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
> > into why this is happening, and whether there is anything that can be
> > optimized below the file system?
> The layers below the filesystem (block, SCSI, UFS) are what I'm
> responsible for in the Pixel team and I can assure you that these are
> highly optimized.
>
> Since the transfer size used in Jaegeuk's tests is much larger than 4
> KiB, how many CPU cycles are used per IO by the layers below the
> filesystem is not limiting the transfer bandwidth.
I'm honestly not sure what discussion we have here. Larger I/O is
pretty much always more efficient. If you submit smaller I/O you
need more merging to build it back up larger, and more I/Os.
Which is exactly why we need large folio support everywhere, as it
makes a huge difference in I/O performance.
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27 6:31 UTC (permalink / raw)
To: Theodore Tso
Cc: Jaegeuk Kim, linux-api, linux-kernel, Matthew Wilcox,
linux-f2fs-devel, Christoph Hellwig, linux-mm, linux-fsdevel,
Akilesh Kailash, Christian Brauner
In-Reply-To: <psj3kr2gcze2yll5xdbvyyzxwcwhds5gh55poobpkfxrkpbgr7@ljdindismzd4>
On Tue, May 26, 2026 at 08:21:43PM -0500, Theodore Tso wrote:
> The bottom line is if it's right after device boot, there are simple
> techniques that don't require hacking up the f2fs. But in the
> demand-loaded case, calling compact_memory() is the last thing you'll
> want to do. You're better off either asking the mm to allocate the 4k
> pages, or do whatever compaction it can do to just free up 2GB worth
> of folios. (Calling compact_memory() is overkill, and only makes
> sense in the context of benchmark / proof of concept demo.)
Or have a lot of clean pagecache using higher order folios that
you can instantly reclaim?
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Christoph Hellwig @ 2026-05-27 6:33 UTC (permalink / raw)
To: Jaegeuk Kim
Cc: Matthew Wilcox, Theodore Tso, linux-api, linux-kernel,
linux-f2fs-devel, Christoph Hellwig, linux-mm, linux-fsdevel,
Akilesh Kailash, Christian Brauner
In-Reply-To: <ahUX4tBeLykdQNxY@google.com>
On Tue, May 26, 2026 at 03:47:46AM +0000, Jaegeuk Kim wrote:
> Thanks for the feedback. Actually, I tried to do compact_memory before doing
> read() for AI loading, but I got complaints where it took hundreds of milliseconds
> to run that compact_memory. Is there a good way to secure high-order pages before
> that read()? It was quite hard to project when it will happen.
Make sure that all files use large folios as much as possible so that
you have a lot of clean pagecache to reclaim. Same for any other
easily reclaimable memory through shrinkers.
^ permalink raw reply
* Re: [PATCH 1/3] net: Remove support for AIO on sockets
From: Christoph Hellwig @ 2026-05-27 8:13 UTC (permalink / raw)
To: Jens Axboe
Cc: Christoph Hellwig, demiobenour, Herbert Xu, David S. Miller,
Eric Dumazet, Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn,
Jakub Kicinski, Simon Horman, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
James Clark, Jonathan Corbet, Shuah Khan, Eric Biggers,
Ard Biesheuvel, linux-crypto, linux-kernel, io-uring, netdev,
linux-perf-users, linux-doc, Toke Høiland-Jørgensen,
linux-api, David Howells
In-Reply-To: <92db3ff0-8f0b-4b61-a167-5004ffcf9025@kernel.dk>
On Tue, May 26, 2026 at 09:58:27AM -0600, Jens Axboe wrote:
> > The current TCP zerocopy implementation provides completion notification
> > through the socket error code, which is freaking weird and doesn't
> > integrate well with either io_uring or in-kernel callers.
>
> We already have that via io_uring
Where? And how do we make that available to in-kernel users like
storage protocols and network file systems, which really suffer from
the current MSG_SPLICE_PAGES semantics?
> , and without needing msg_kiocb or the
What do you think is the downside of using a kiocb here like for
everything else with async notifications?
^ permalink raw reply
* Re: [PATCH v2 4/6] sched: Extend task command name to 64 bytes
From: David Laight @ 2026-05-27 8:42 UTC (permalink / raw)
To: Steven Rostedt
Cc: André Almeida, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Christian Brauner, Kees Cook, Shuah Khan, willy,
mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
kernel-dev, linux-mm, linux-api, Bhupesh
In-Reply-To: <20260526123103.4facbaed@gandalf.local.home>
On Tue, 26 May 2026 12:31:03 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 25 May 2026 11:42:41 +0100
> David Laight <david.laight.linux@gmail.com> wrote:
>
> > > > error = security_task_prctl(option, arg2, arg3, arg4, arg5);
> > > > @@ -2601,16 +2601,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > > > error = -EINVAL;
> > > > break;
> > > > case PR_SET_NAME:
> > > > - comm[sizeof(me->comm) - 1] = 0;
> > > > + comm[TASK_COMM_LEN - 1] = 0;
> > > > if (strncpy_from_user(comm, (char __user *)arg2,
> > > > - sizeof(me->comm) - 1) < 0)
> > > > + TASK_COMM_LEN - 1) < 0)
> > >
> > > Nak - you can't do that.
> > > You are reading data that the application doesn't expect you to read.
> >
> > Or have I got confused over the names...
>
> You may have gotten confused by names, as sizeof(me->comm) is the same as
> TASK_COMM_LEN. Basically, the above doesn't change anything.
The name of the patch doesn't help:
sched: Extend task command name to 64 bytes
If you want to catch/check all the uses I suspect that all the
occurrences of TASK_COMM_LEN need changing.
For clarity this one should probably be TASK_COMM_LEN_OLD.
(that might be problematic for the uapi headers)
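For reference, the behaviour userspace sees today can be checked with a
minimal sketch like the one below. It assumes the current TASK_COMM_LEN
of 16, so a longer name passed to PR_SET_NAME comes back truncated to 15
bytes plus the NUL. set_and_get_comm() is a made-up helper name, not an
existing API.

```c
#include <string.h>
#include <sys/prctl.h>

/*
 * Set the calling thread's name to `name` and read it back into `out`,
 * which must hold at least 16 bytes (today's TASK_COMM_LEN).
 * Returns the length of the name the kernel stored, or -1 on error.
 * set_and_get_comm() is a hypothetical helper for illustration only.
 */
static int set_and_get_comm(const char *name, char out[16])
{
	if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
		return -1;
	if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
		return -1;
	return (int)strlen(out);
}
```

An application that passes a 16-byte buffer to PR_GET_NAME relies on
exactly this limit, which is why the constant used in the prctl path
matters for the uapi.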
-- David
>
> -- Steve
^ permalink raw reply
* Re: [PATCH v2 3/6] treewide: Replace memcpy(..., current->comm) with strscpy()
From: David Laight @ 2026-05-27 9:18 UTC (permalink / raw)
To: Steven Rostedt
Cc: André Almeida, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Christian Brauner, Kees Cook, Shuah Khan, willy,
mathieu.desnoyers, Linus Torvalds, akpm, Yafang Shao,
andrii.nakryiko, arnaldo.melo, Petr Mladek, linux-kernel,
kernel-dev, linux-mm, linux-api
In-Reply-To: <20260526190625.3f4aca0a@gandalf.local.home>
On Tue, 26 May 2026 19:06:25 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Sun, 24 May 2026 19:38:53 -0300
> André Almeida <andrealmeid@igalia.com> wrote:
>
> > In order to increase the size of current->comm[] and to avoid breaking any
> > existing code, replace memcpy() with strscpy(). The latter function makes
> > sure that the copy is NUL terminated. This is crucial given that the
> > source buffer might be larger than the destination buffer and could
> > truncate the NUL character out of it.
...
> As tracing now slows it down by 3.9% which is a significant increase from 2.7%
>
> I'd really rather keep memcpy() here.
I suspect the copies could/should be replaced by a copy_task_comm()
function that can perform optimisations that strscpy[_pad]() can't
do because it can (for example) assume that the source is terminated.
When the src and dst are the same size it can also degenerate to
memcpy() - which should get inlined.
The cost of copying 64 bytes might still be rather more than copying
just 16.
A compromise of 32 may be better.
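A rough userspace sketch of what such a helper might look like is below.
The name copy_task_comm() comes from the suggestion above, but the
signature and body are assumptions for illustration, not code from any
posted patch. It relies on the source being NUL-terminated somewhere
within src_len bytes.

```c
#include <string.h>

/*
 * Hypothetical copy_task_comm(): copy a comm that is known to be
 * NUL-terminated within src_len bytes into dst, always leaving dst
 * terminated. With constant sizes and inlining, the first branch is a
 * fixed-size memcpy().
 */
static inline void copy_task_comm(char *dst, size_t dst_len,
				  const char *src, size_t src_len)
{
	if (dst_len == 0)
		return;
	if (dst_len >= src_len) {
		/* Source fits, terminator included: plain memcpy(). */
		memcpy(dst, src, src_len);
		return;
	}
	/* Smaller destination: truncate and terminate by hand. */
	memcpy(dst, src, dst_len - 1);
	dst[dst_len - 1] = '\0';
}
```

Unlike strscpy(), this never scans for the terminator, which is where
the saving would come from if the assumption about the source holds.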
-- David
>
> -- Steve
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-27 15:39 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
Christian Brauner
In-Reply-To: <ahZlbQPzTUecMKGU@casper.infradead.org>
On 05/27, Matthew Wilcox wrote:
> On Wed, May 27, 2026 at 02:43:21AM +0000, Jaegeuk Kim wrote:
> > Thanks for the context. To clarify a piece I missed earlier: the model pages
> > are also utilized for inference. Our data shows that larger chunks yield
> > higher inference speeds. Consequently, I required high-order pages to optimize
> > both read throughput and inference latency. I will halt my current efforts
> > and wait for alternative suggestions.
>
> I think your efforts would be best directed towards general support for
> large folios in f2fs. There's still 40+ places in f2fs that use a
> struct page, and converting them all to folios would be a great help.
Ok, I'll make it a priority, but it'll take some time since I think we'd
better refactor some major data structures.
>
>
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-27 15:42 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Matthew Wilcox, Theodore Tso, linux-api, linux-kernel,
linux-f2fs-devel, linux-mm, Akilesh Kailash, linux-fsdevel,
Christian Brauner
In-Reply-To: <ahaOpdw7NgsWe8J4@infradead.org>
On 05/26, Christoph Hellwig wrote:
> On Tue, May 26, 2026 at 03:31:08AM +0100, Matthew Wilcox wrote:
> > > > And what are you trying to tell us with that?
> > >
> > > This means, high-order pages were used up by EROFS which sets large folio by
> > > default. So, I wanted to say the concern was based on actual data which was what
> > > Matthew asked for.
> >
> > This isn't that though. What you actually need is to show that high order
> > allocations are _failing_.
>
> Exactly.
>
> > If what you want is large folios readily available, then what you want
> > is large folios used _everywhere_ because then they're easy to get!
>
> Yes.
>
> > If there's small folios in use, you need to reclaim a lot of memory in
> > order to reassemble large folios (it's the birthday paradox, similar to
> > the hash collision problem).
>
> Yeah. Although it seems we have an issue with > order costly folios
> at the moment, but we should fix this.
>
> And f2fs really needs to up its game and support large folios fully
> so that we can run that kind of analysis there as well. Without this,
> all this is just piling hacks on top of other hacks.
Ok, I'll revisit the large folio support in f2fs, and try to revisit the
problem afterwards.
Thanks,
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-27 15:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Bart Van Assche, Theodore Tso, linux-api, linux-kernel,
Matthew Wilcox, linux-f2fs-devel, linux-mm, Akilesh Kailash,
linux-fsdevel, Christian Brauner
In-Reply-To: <ahaPDHiXcJoVShPv@infradead.org>
On 05/26, Christoph Hellwig wrote:
> On Tue, May 26, 2026 at 09:14:52AM -0700, Bart Van Assche wrote:
> > On 5/26/26 6:42 AM, Theodore Tso wrote:
> > > It seems... surprising that the additional I/O operations are actually
> > > throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
> > > into why this is happening, and whether there is anything that can be
> > > optimized below the file system?
> > The layers below the filesystem (block, SCSI, UFS) are what I'm
> > responsible for in the Pixel team and I can assure you that these are
> > highly optimized.
> >
> > Since the transfer size used in Jaegeuk's tests is much larger than 4
> > KiB, how many CPU cycles are used per IO by the layers below the
> > filesystem is not limiting the transfer bandwidth.
>
> I'm honestly not sure what discussion we have here. Larger I/O is
> pretty much always more efficient. If you submit smaller I/O you
> need more merging to build it back up larger, and more I/Os.
F2FS merges bios before submit_bio, regardless of small or large folios,
since the block addresses are consecutive. So, I think the IO subsystem was
working at full speed.
>
> Which is exactly why we need large folio support everywhere, as it
> makes a huge difference in I/O performance.
^ permalink raw reply
* [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
Hi,
This is an early RFC for an idea that is probably still rough in both the
UAPI and implementation details. Sorry for the rough edges; I am sending
it now to check whether this direction is worth pursuing and to get
feedback on the kernel/userspace boundary.
The series is based on linux-next version 20260518.
This RFC adds spawn_template, a userspace-controlled exec acceleration
mechanism for runtimes that repeatedly start the same executable with
different argv, envp, and per-spawn file descriptor setup.
The main target is agent runtimes. Modern coding agents repeatedly start
short-lived helper tools such as rg, git, sed, awk, python, node, and
shell wrappers while they inspect and edit a workspace. Those runtimes
already know which tools are hot, and they are also the right place to
decide policy. The kernel does not choose names such as rg, git, or sed.
Userspace opts in by creating a template fd for one executable, then uses
that fd for later spawns. Launchers, shells, and build systems have a
similar repeated-startup shape and could use the same primitive, but the
agent runtime case is the main motivation for this RFC.
The mechanism applies to the executable that userspace asks the kernel to
start. If an agent runtime directly starts /usr/bin/rg, the rg executable
is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
head", the shell is the template target unless the shell itself opts in
when it starts rg and head. The kernel does not parse the shell command
string or rewrite inner commands into template spawns. Userspace has to
call spawn_template for those inner commands explicitly:
    direct exec                   shell wrapper
    -----------                   -------------
    agent                         agent
    template("/usr/bin/rg")       template("/usr/bin/bash")
    spawn rg argv                 spawn bash -c "rg ... | head"
    kernel target: rg             kernel target: bash
    rg startup benefits           rg/head need shell opt-in
Several agent runtime discussions are moving toward direct argv-style
exec tools for both security and policy clarity. For example, opencode
issue #2206 proposes an exec tool as a safer alternative to a shell-only
bash tool:
https://github.com/anomalyco/opencode/issues/2206
spawn_template is meant to support both models. Direct exec users can
cache the actual hot tool. Shell-wrapper users can cache the shell and
still reduce shell startup cost. If a shell or an agent runtime later
uses the same API for commands started inside a shell command, those
inner tools can benefit too.
Each spawn still goes through the normal exec path. The template reuses
only metadata that can be revalidated before use. Credential preparation,
permission checks, binary handler checks, secure-exec handling, and LSM
hooks remain on the normal execve path.
The UAPI has two operations. spawn_template_create() creates an
anonymous-inode template fd from either an executable fd or an absolute
executable path. spawn_template_spawn() starts one child from that
template, applies per-spawn fd, cwd, and signal actions, and returns both
pid and pidfd.
fd inheritance is deliberately conservative. By default, after the
requested per-spawn actions have run, the child closes fds above stderr.
An agent runtime can still request traditional inheritance explicitly,
but helper tools do not inherit unrelated secret files or sockets by
accident. The create-time actions fields are reserved and rejected in
this RFC because fd numbers are per-process state, not stable reusable
objects. The caller supplies fd actions for each spawn instead.
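For comparison, the default policy described above (close everything
above stderr in the child) can be approximated today with close_range(2).
The sketch below is only an illustration of that policy in userspace, not
the RFC's implementation; the helper name is made up, and it uses
syscall() directly in case the libc in use lacks a close_range() wrapper.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that closes every fd above stderr, then reports through
 * its exit status whether `probe_fd` was still open afterwards.
 * Returns 0 if the fd was closed in the child, 1 if it was still open,
 * and -1 on error. Hypothetical helper for illustration only.
 */
static int child_sees_fd_after_close_range(int probe_fd)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* close_range(first, last, flags): drop fds 3..UINT_MAX. */
		if (syscall(SYS_close_range, 3, ~0U, 0) != 0)
			_exit(2);
		_exit(fcntl(probe_fd, F_GETFD) == -1 ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	if (WEXITSTATUS(status) == 2)
		return -1;
	return WEXITSTATUS(status);
}
```

The parent's own descriptor table is untouched; only the child loses the
fds, which is the property the conservative default is after.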
A typical agent runtime would keep one template per hot executable and
still build argv, envp, cwd, and pipe wiring for each tool call:
    rg_tmpl = spawn_template_create("/usr/bin/rg");

    for each search request:
        out_r, out_w = pipe_cloexec();
        err_r, err_w = pipe_cloexec();
        actions = [
            FCHDIR(worktree_fd),
            DUP2(out_w, STDOUT_FILENO),
            DUP2(err_w, STDERR_FILENO),
        ];
        child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
        close(out_w);
        close(err_w);
        read out_r and err_r;
        waitid(P_PIDFD, child.pidfd, ...);
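For contrast, a minimal sketch of roughly the same shape with interfaces
that exist today (posix_spawn(), pidfd_open(2), waitid(2)) is shown
below. It is not the RFC's API: spawn_and_capture() is a made-up helper,
only stdout is captured, and the cwd action is left out. The pidfd has to
be obtained in a second step here, whereas the RFC returns pid and pidfd
from one call.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <spawn.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef P_PIDFD
#define P_PIDFD 3	/* older libc headers may lack this waitid() id type */
#endif

extern char **environ;

/*
 * Run argv[0] with stdout redirected into a pipe, copy up to len - 1
 * bytes of its output into buf, and reap it through a pidfd.
 * Returns the child's exit status, or -1 on error.
 */
static int spawn_and_capture(char *const argv[], char *buf, size_t len)
{
	posix_spawn_file_actions_t fa;
	siginfo_t info = { 0 };
	int out[2], pidfd, rc;
	size_t used = 0;
	ssize_t n;
	pid_t pid;

	if (len == 0 || pipe(out) != 0)
		return -1;
	/* Keep the pipe out of unrelated children, like pipe_cloexec(). */
	fcntl(out[0], F_SETFD, FD_CLOEXEC);
	fcntl(out[1], F_SETFD, FD_CLOEXEC);

	posix_spawn_file_actions_init(&fa);
	/* Per-spawn fd action: the child's stdout becomes the pipe. */
	posix_spawn_file_actions_adddup2(&fa, out[1], STDOUT_FILENO);
	rc = posix_spawn(&pid, argv[0], &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	close(out[1]);
	if (rc != 0) {
		close(out[0]);
		return -1;
	}

	/* A separate syscall today. */
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);

	while (used < len - 1 &&
	       (n = read(out[0], buf + used, len - 1 - used)) > 0)
		used += (size_t)n;
	buf[used] = '\0';
	close(out[0]);

	if (pidfd < 0) {
		/* No pidfd support: fall back to a plain waitpid(). */
		int status;

		if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
			return -1;
		return WEXITSTATUS(status);
	}
	rc = waitid(P_PIDFD, (id_t)pidfd, &info, WEXITED);
	close(pidfd);
	if (rc != 0 || info.si_code != CLD_EXITED)
		return -1;
	return info.si_status;
}
```

Every call here repeats the path lookup and executable open for argv[0],
which is the per-launch work a template fd is meant to amortize.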
A shell-wrapper runtime would use the same shape with a template for
/usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
reduces shell startup cost, but it does not cache rg or head inside that
command unless the shell also opts into spawn_template for commands it
starts internally.
The template pins the executable and denies writes to that file while the
template fd is alive, so cached executable metadata cannot race with a
writer changing the same inode. This means direct in-place writes to the
executable can fail while a runtime keeps a template open. It does not
block the common package-manager update pattern where a new inode is
written and then atomically renamed over the old path. In that case the
old path-created template becomes stale, spawn_template_spawn() rejects
it with ESTALE, and the runtime should close and recreate the template
for the new executable.
    in-place write                package-manager update
    --------------                ----------------------
    template pins old inode       write new inode
    write(old inode) denied       rename(new, "/usr/bin/rg")
    cached metadata safe          old template sees path mismatch
                                  spawn_template_spawn() = -ESTALE
                                  recreate template for new inode
Each spawn revalidates executable identity before cached metadata is
used. Path-created templates only accept absolute paths: a relative path
such as ./tool depends on cwd, and the same string can name a different
file after chdir. For an absolute path template, each spawn reopens the
path and checks that it still resolves to the executable recorded when
the template was created. If the path now names a replaced file, the
template is stale and userspace should close and recreate it.
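The staleness check can be pictured with a small userspace analogue
built on stat(2). This is only a sketch: struct exe_identity and both
helpers are hypothetical names, the fields mirror the identity key this
cover letter lists (device, inode, size, mode, owner, ctime, mtime), and
treating "owner" as uid plus gid is an assumption. The in-kernel check
works on the opened file rather than on a second path lookup result
alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical userspace mirror of the executable identity key. */
struct exe_identity {
	dev_t dev;
	ino_t ino;
	off_t size;
	mode_t mode;
	uid_t uid;		/* "owner": uid and gid is an assumption */
	gid_t gid;
	struct timespec ctime;
	struct timespec mtime;
};

/* Record the identity of an absolute path; relative paths are refused. */
static int exe_identity_record(const char *abs_path, struct exe_identity *id)
{
	struct stat st;

	if (abs_path[0] != '/' || stat(abs_path, &st) != 0)
		return -1;
	id->dev = st.st_dev;
	id->ino = st.st_ino;
	id->size = st.st_size;
	id->mode = st.st_mode;
	id->uid = st.st_uid;
	id->gid = st.st_gid;
	id->ctime = st.st_ctim;
	id->mtime = st.st_mtim;
	return 0;
}

/* Return true if the path still names the file that was recorded. */
static bool exe_identity_matches(const char *abs_path,
				 const struct exe_identity *id)
{
	struct exe_identity now;

	if (exe_identity_record(abs_path, &now) != 0)
		return false;
	return now.dev == id->dev && now.ino == id->ino &&
	       now.size == id->size && now.mode == id->mode &&
	       now.uid == id->uid && now.gid == id->gid &&
	       now.ctime.tv_sec == id->ctime.tv_sec &&
	       now.ctime.tv_nsec == id->ctime.tv_nsec &&
	       now.mtime.tv_sec == id->mtime.tv_sec &&
	       now.mtime.tv_nsec == id->mtime.tv_nsec;
}
```

A rename of a new file over the recorded path changes the inode, so the
match fails, which corresponds to the ESTALE case described above.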
A template fd can be passed over SCM_RIGHTS like any other fd, but this
RFC does not treat that as delegation. spawn_template_spawn() only works
while the caller still has the same struct cred object that created the
template. If another task, or the same task after a credential change,
receives the fd, spawn fails instead of running the executable using the
creator's launch authority:
    ordinary fd                   spawn_template fd
    -----------                   -----------------
    A: open log                   A: create rg template
    A -> B: SCM_RIGHTS(fd)        A -> B: SCM_RIGHTS(tfd)
    B: read(fd) = ok              B: spawn(tfd) = -EACCES
                                  B: create own rg template
                                  B: spawn(own_tfd) = ok
    open-file use is delegated    spawn authority is not delegated
The cached state is intentionally small. The template fd keeps the opened
main executable file, an optional absolute path string, the creator
credential pointer, and the deny-write state. The executable identity key
records device, inode, size, mode, owner, ctime, and mtime, and is
rechecked before cached metadata is used. The ELF cache keeps only the
main executable's ELF header, program header table, and program header
count.
    cached in this RFC            not cached in this RFC
    ------------------            ----------------------
    opened main executable        PT_INTERP metadata
    executable identity key       shared-library graph
    main ELF header               VMA layout metadata
    main ELF program headers      cross-process metadata sharing
    creator cred pointer
    deny-write state
This RFC does not cache ELF interpreter metadata, shared-library
dependency state, or derived mapping-layout state. Shared-library
resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
state. It also does not share cached executable metadata between template
fds created by different processes. Each template owns its small cached
metadata object in this RFC.
Performance
===========
The numbers below come from my separate local autogen-bench project.
autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
fans out concurrent tool-call requests to worker agents. The workload
definitions, generated test files, and subprocess/spawn_template backends
are local to autogen-bench.
The agent-tools preset includes direct tool calls and shell-wrapper forms
for:
rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
python-small, node-small, sh-c, and bash-c.
The benchmark is launch-heavy but not no-op: it searches generated
Python-like source files, reads sample files, runs small Python and
Node.js programs, and runs git status and git diff in a small repository.
It does not include model inference or long-running tool work, so the
numbers mainly describe the short-tool regime.
The subprocess column starts each tool call through the existing
userspace launch path. The spawn_template column creates templates for
hot executables and uses spawn_template_spawn() for later calls.
Total in-flight tool calls stay at 16; only the worker-process split
changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
calls each. The two time_s values are subprocess/spawn_template wall
times.
    Workload    Calls   subprocess  spawn_template  time_s        Delta
    (workers)   calls   calls/s     calls/s         seconds
    1x16        6144    411.04      420.32          14.95/14.62   +2.26%
    2x8         6144    666.78      690.08          9.21/8.90     +3.49%
    4x4         6144    955.61      1003.25         6.43/6.12     +4.99%
    8x2         6144    1048.25     1069.18         5.86/5.75     +2.00%
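The Delta column is consistent with the relative change in calls/s
between the two backends, and the calls/s values are close to Calls
divided by the rounded wall time. A short sketch of that arithmetic,
with made-up helper names:

```c
/* Relative throughput change of `candidate` over `baseline`, in percent. */
static double throughput_delta_pct(double baseline_cps, double candidate_cps)
{
	return (candidate_cps - baseline_cps) / baseline_cps * 100.0;
}

/* Throughput in calls per second for a run of `calls` taking `wall_s`. */
static double calls_per_second(unsigned int calls, double wall_s)
{
	return (double)calls / wall_s;
}
```

For the 1x16 row, (420.32 - 411.04) / 411.04 is about 2.26%, and
6144 / 14.95 is about 411 calls/s.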
The table measures the whole mixed workload, including both process
startup and the short tool work done after exec. Since this workload is
launch-heavy, the possible launch-side savings include:
- the template fd keeps an opened executable, avoiding repeated ordinary
open/path setup for that executable;
- the kernel can reuse cached main-executable ELF header and program
header metadata after revalidation;
- the fork-and-exec-style launch is submitted as one
spawn_template_spawn() operation;
- fd, cwd, and signal actions run in the child kernel path instead of
being driven one syscall at a time by userspace child glue;
- pid and pidfd are returned by the same operation, reducing some
runtime-side bookkeeping.
In local experiments before this RFC, I also tried caching ELF
interpreter metadata and derived ELF mapping-layout metadata. A focused
repeated-exec benchmark did not show a stable standalone throughput gain
for those two optimizations, so this RFC leaves them out and keeps only
the main executable metadata cache.
I also tried sharing main-executable ELF metadata across template fds
created by different processes for the same executable identity. That can
reduce duplicated metadata memory when many agent worker processes create
their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
it did not show a stable throughput win in local multi-agent tests. It
also adds cache keying, lifetime, invalidation, credential, and namespace
questions to the RFC. This version therefore keeps per-template metadata
ownership and leaves cross-process sharing out.
Sorry again for the rough edges in this RFC. I would appreciate feedback
on whether this direction is useful and what the right API boundary
should be.
Thanks,
Li
[1]: https://github.com/microsoft/autogen
Li Chen (13):
exec: factor argument setup out of do_execveat_common()
exec: add an internal helper for opened executables
file: expose helpers for in-kernel fd actions
exec: add spawn template UAPI definitions
exec: add spawn template file descriptors
exec: add spawn_template_spawn()
exec: validate spawn template executable identity
binfmt_elf: cache ELF metadata for spawn templates
Documentation: describe spawn templates
exec: require absolute paths for path-created templates
exec: let close-range actions target the max fd
syscalls: add generic spawn template entries
selftests/exec: cover spawn template basics
Documentation/userspace-api/index.rst | 1 +
.../userspace-api/spawn_template.rst | 153 +++
MAINTAINERS | 6 +
arch/x86/entry/syscalls/syscall_64.tbl | 3 +-
fs/Makefile | 2 +-
fs/binfmt_elf.c | 104 +-
fs/exec.c | 162 ++-
fs/file.c | 11 +-
fs/spawn_template.c | 619 +++++++++++
include/linux/binfmts.h | 10 +
include/linux/fdtable.h | 2 +
include/linux/spawn_template.h | 72 ++
include/linux/syscalls.h | 7 +
include/uapi/asm-generic/unistd.h | 7 +-
include/uapi/linux/spawn_template.h | 62 ++
scripts/syscall.tbl | 2 +
tools/testing/selftests/exec/Makefile | 1 +
tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
18 files changed, 2179 insertions(+), 42 deletions(-)
create mode 100644 Documentation/userspace-api/spawn_template.rst
create mode 100644 fs/spawn_template.c
create mode 100644 include/linux/spawn_template.h
create mode 100644 include/uapi/linux/spawn_template.h
create mode 100644 tools/testing/selftests/exec/spawn_template.c
--
2.52.0
^ permalink raw reply
* [RFC PATCH v1 01/13] exec: factor argument setup out of do_execveat_common()
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Move the common userspace argv and envp counting and stack setup code
into do_execveat_common_bprm(). Keep do_execveat_common() responsible
for the existing RLIMIT_NPROC check, bprm allocation, and error path.
This is a mechanical refactor for later opened-file exec users. It
does not change execve or execveat behavior.
Signed-off-by: Li Chen <me@linux.beauty>
---
fs/exec.c | 53 +++++++++++++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 2889b7cf808d7..53f7b18d2b1ea 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,31 +1775,12 @@ static int bprm_execve(struct linux_binprm *bprm)
return retval;
}
-static int do_execveat_common(int fd, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp,
- int flags)
+static int do_execveat_common_bprm(struct linux_binprm *bprm,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp)
{
int retval;
- /*
- * We move the actual failure in case of RLIMIT_NPROC excess from
- * set*uid() to execve() because too many poorly written programs
- * don't check setuid() return code. Here we additionally recheck
- * whether NPROC limit is still exceeded.
- */
- if ((current->flags & PF_NPROC_EXCEEDED) &&
- is_rlimit_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)))
- return -EAGAIN;
-
- /* We're below the limit (still or again), so we don't want to make
- * further execve() calls fail. */
- current->flags &= ~PF_NPROC_EXCEEDED;
-
- CLASS(bprm, bprm)(fd, filename, flags);
- if (IS_ERR(bprm))
- return PTR_ERR(bprm);
-
retval = count(argv, MAX_ARG_STRINGS);
if (retval < 0)
return retval;
@@ -1846,6 +1827,34 @@ static int do_execveat_common(int fd, struct filename *filename,
return bprm_execve(bprm);
}
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
+{
+ /*
+ * We move the actual failure in case of RLIMIT_NPROC excess from
+ * set*uid() to execve() because too many poorly written programs
+ * don't check setuid() return code. Here we additionally recheck
+ * whether NPROC limit is still exceeded.
+ */
+ if ((current->flags & PF_NPROC_EXCEEDED) &&
+ is_rlimit_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)))
+ return -EAGAIN;
+
+ /*
+ * We're below the limit (still or again), so we don't want to make
+ * further execve() calls fail.
+ */
+ current->flags &= ~PF_NPROC_EXCEEDED;
+
+ CLASS(bprm, bprm)(fd, filename, flags);
+ if (IS_ERR(bprm))
+ return PTR_ERR(bprm);
+
+ return do_execveat_common_bprm(bprm, argv, envp);
+}
+
int kernel_execve(const char *kernel_filename,
const char *const *argv, const char *const *envp)
{
--
2.52.0
^ permalink raw reply related
* [RFC PATCH v1 02/13] exec: add an internal helper for opened executables
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Split alloc_bprm_file() from alloc_bprm() so internal callers can build
a linux_binprm from an executable file that they already opened.
Add kernel_execveat_file() for in-kernel users that need to execute an
opened file while still using the normal execve credential, LSM, and
binary-format path.
Signed-off-by: Li Chen <me@linux.beauty>
---
fs/exec.c | 78 +++++++++++++++++++++++++++++++++++------
include/linux/binfmts.h | 4 +++
2 files changed, 71 insertions(+), 11 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 53f7b18d2b1ea..5b91a9b208a77 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1392,16 +1392,13 @@ static void free_bprm(struct linux_binprm *bprm)
kfree(bprm);
}
-static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int flags)
+static struct linux_binprm *alloc_bprm_file(struct file *file,
+ struct filename *filename,
+ int fd, int flags)
{
struct linux_binprm *bprm;
- struct file *file;
int retval = -ENOMEM;
- file = do_open_execat(fd, filename, flags);
- if (IS_ERR(file))
- return ERR_CAST(file);
-
bprm = kzalloc_obj(*bprm);
if (!bprm) {
do_close_execat(file);
@@ -1463,6 +1460,17 @@ static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int fl
return ERR_PTR(retval);
}
+static struct linux_binprm *alloc_bprm(int fd, struct filename *filename, int flags)
+{
+ struct file *file;
+
+ file = do_open_execat(fd, filename, flags);
+ if (IS_ERR(file))
+ return ERR_CAST(file);
+
+ return alloc_bprm_file(file, filename, fd, flags);
+}
+
DEFINE_CLASS(bprm, struct linux_binprm *, if (!IS_ERR(_T)) free_bprm(_T),
alloc_bprm(fd, name, flags), int fd, struct filename *name, int flags)
@@ -1901,6 +1909,59 @@ int kernel_execve(const char *kernel_filename,
return bprm_execve(bprm);
}
+static inline struct user_arg_ptr native_arg(const char __user *const __user *p)
+{
+ return (struct user_arg_ptr){.ptr.native = p};
+}
+
+static int do_execveat_file_common(struct file *file, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp, int flags)
+{
+ struct linux_binprm *bprm;
+ struct file *exec_file;
+ int retval;
+
+ if (flags & ~AT_EMPTY_PATH)
+ return -EINVAL;
+
+ if ((current->flags & PF_NPROC_EXCEEDED) &&
+ is_rlimit_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC)))
+ return -EAGAIN;
+
+ current->flags &= ~PF_NPROC_EXCEEDED;
+
+ retval = exe_file_deny_write_access(file);
+ if (retval)
+ return retval;
+ exec_file = get_file(file);
+
+ bprm = alloc_bprm_file(exec_file, filename, AT_FDCWD, flags);
+ if (IS_ERR(bprm))
+ return PTR_ERR(bprm);
+
+ retval = do_execveat_common_bprm(bprm, argv, envp);
+ free_bprm(bprm);
+ return retval;
+}
+
+int kernel_execveat_file(struct file *file, const char *filename,
+ const void __user *argv,
+ const void __user *envp,
+ int flags)
+{
+ const char __user *const __user *user_argv;
+ const char __user *const __user *user_envp;
+
+ CLASS(filename_kernel, name)(filename);
+
+ user_argv = (const char __user *const __user *)argv;
+ user_envp = (const char __user *const __user *)envp;
+
+ return do_execveat_file_common(file, name, native_arg(user_argv),
+ native_arg(user_envp), flags);
+}
+
void set_binfmt(struct linux_binfmt *new)
{
struct mm_struct *mm = current->mm;
@@ -1925,11 +1986,6 @@ void set_dumpable(struct mm_struct *mm, int value)
__mm_flags_set_mask_dumpable(mm, value);
}
-static inline struct user_arg_ptr native_arg(const char __user *const __user *p)
-{
- return (struct user_arg_ptr){.ptr.native = p};
-}
-
SYSCALL_DEFINE3(execve,
const char __user *, filename,
const char __user *const __user *, argv,
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 65abd5ab8836c..c0715678c9a06 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -141,6 +141,10 @@ extern int transfer_args_to_stack(struct linux_binprm *bprm,
unsigned long *sp_location);
extern int bprm_change_interp(const char *interp, struct linux_binprm *bprm);
int copy_string_kernel(const char *arg, struct linux_binprm *bprm);
+int kernel_execveat_file(struct file *file, const char *filename,
+ const void __user *argv,
+ const void __user *envp,
+ int flags);
extern void set_binfmt(struct linux_binfmt *new);
extern ssize_t read_code(struct file *, unsigned long, loff_t, size_t);
--
2.52.0
^ permalink raw reply related
* [RFC PATCH v1 03/13] file: expose helpers for in-kernel fd actions
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Split do_close_range() from the close_range syscall wrapper and make
ksys_dup3() available to in-kernel callers. Later spawn-template fd
actions use these helpers instead of duplicating close and dup logic.
Signed-off-by: Li Chen <me@linux.beauty>
---
fs/file.c | 11 ++++++++---
include/linux/fdtable.h | 2 ++
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index e5c75b22e0c7c..a9f4b4e2dcd45 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -815,8 +815,7 @@ static inline void __range_close(struct files_struct *files, unsigned int fd,
* from @fd up to and including @max_fd are closed.
* Currently, errors to close a given file descriptor are ignored.
*/
-SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
- unsigned int, flags)
+int do_close_range(unsigned int fd, unsigned int max_fd, unsigned int flags)
{
struct task_struct *me = current;
struct files_struct *cur_fds = me->files, *fds = NULL;
@@ -867,6 +866,12 @@ SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
return 0;
}
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+ unsigned int, flags)
+{
+ return do_close_range(fd, max_fd, flags);
+}
+
/**
* file_close_fd - return file associated with fd
* @fd: file descriptor to retrieve file for
@@ -1421,7 +1426,7 @@ int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags)
return new_fd;
}
-static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
+int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
{
int err = -EBADF;
struct file *file;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index c45306a9f0072..7f852fcc082a4 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -112,6 +112,8 @@ int iterate_fd(struct files_struct *, unsigned,
extern int close_fd(unsigned int fd);
extern struct file *file_close_fd(unsigned int fd);
+int do_close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
+int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags);
extern struct kmem_cache *files_cachep;
--
2.52.0
* [RFC PATCH v1 04/13] exec: add spawn template UAPI definitions
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Add the userspace ABI structures and flags for creating a spawn
template and spawning a process from it. The ABI carries argv, envp,
and per-spawn fd actions while leaving policy decisions in userspace.
Signed-off-by: Li Chen <me@linux.beauty>
---
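Not part of the patch: a small layout check, written as a userspace
sketch, that works through the arithmetic behind the two *_SIZE_VER0
constants in the new header. The demo_* struct names are made up, and
uint64_t with alignas(8) stands in for __aligned_u64; the field order
is copied from the header in this patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct spawn_template_action: 4 + 4 + 4 + 4 + 8 = 24 bytes. */
struct demo_action {
	uint32_t type;
	uint32_t flags;
	int32_t fd;
	int32_t newfd;
	alignas(8) uint64_t arg;
};

/* Mirror of the create args: 8 + 4 + 4 + 8 + 8 + 8 + 4 * 8 = 72 bytes. */
struct demo_create_args {
	alignas(8) uint64_t flags;
	int32_t execfd;
	uint32_t exec_flags;
	alignas(8) uint64_t filename;
	alignas(8) uint64_t actions;
	alignas(8) uint64_t actions_len;
	alignas(8) uint64_t reserved[4];
};

/* Mirror of the spawn args: 6 * 8 + 4 * 8 = 80 bytes. */
struct demo_spawn_args {
	alignas(8) uint64_t flags;
	alignas(8) uint64_t pidfd;
	alignas(8) uint64_t argv;
	alignas(8) uint64_t envp;
	alignas(8) uint64_t actions;
	alignas(8) uint64_t actions_len;
	alignas(8) uint64_t reserved[4];
};

static_assert(sizeof(struct demo_action) == 24, "action layout");
static_assert(sizeof(struct demo_create_args) == 72, "create VER0 size");
static_assert(sizeof(struct demo_spawn_args) == 80, "spawn VER0 size");
static_assert(offsetof(struct demo_create_args, filename) == 16,
	      "execfd and exec_flags share one 8-byte slot");
```

The explicit 8-byte alignment matters on 32-bit ABIs where a plain
64-bit integer may only be 4-byte aligned; with it, the layout should
be the same for 32-bit and 64-bit callers.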
MAINTAINERS | 1 +
include/uapi/linux/spawn_template.h | 62 +++++++++++++++++++++++++++++
2 files changed, 63 insertions(+)
create mode 100644 include/uapi/linux/spawn_template.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 3dd58a16f06a9..d7b1191e33ca0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9739,6 +9739,7 @@ F: include/linux/elf.h
F: include/uapi/linux/auxvec.h
F: include/uapi/linux/binfmts.h
F: include/uapi/linux/elf.h
+F: include/uapi/linux/spawn_template.h
F: kernel/fork.c
F: mm/vma_exec.c
F: tools/testing/selftests/exec/
diff --git a/include/uapi/linux/spawn_template.h b/include/uapi/linux/spawn_template.h
new file mode 100644
index 0000000000000..84f026fdf9090
--- /dev/null
+++ b/include/uapi/linux/spawn_template.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_SPAWN_TEMPLATE_H
+#define _UAPI_LINUX_SPAWN_TEMPLATE_H
+
+#include <linux/openat2.h>
+#include <linux/types.h>
+
+#define SPAWN_TEMPLATE_CREATE_CLOEXEC (1ULL << 0)
+#define SPAWN_TEMPLATE_SPAWN_INHERIT_FDS (1ULL << 0)
+
+enum spawn_template_action_type {
+ SPAWN_TEMPLATE_ACTION_CLOSE = 0,
+ SPAWN_TEMPLATE_ACTION_DUP2 = 1,
+ SPAWN_TEMPLATE_ACTION_FCHDIR = 2,
+ SPAWN_TEMPLATE_ACTION_OPEN = 3,
+ SPAWN_TEMPLATE_ACTION_CLOSE_RANGE = 4,
+ SPAWN_TEMPLATE_ACTION_SIGMASK = 5,
+ SPAWN_TEMPLATE_ACTION_SIGDEFAULT = 6,
+};
+
+struct spawn_template_action {
+ __u32 type;
+ __u32 flags;
+ __s32 fd;
+ __s32 newfd;
+ __aligned_u64 arg;
+};
+
+struct spawn_template_open {
+ __aligned_u64 path;
+ struct open_how how;
+};
+
+struct spawn_template_sigset {
+ __aligned_u64 sigset;
+ __u64 sigsetsize;
+};
+
+struct spawn_template_create_args {
+ __aligned_u64 flags;
+ __s32 execfd;
+ __u32 exec_flags;
+ __aligned_u64 filename;
+ __aligned_u64 actions;
+ __aligned_u64 actions_len;
+ __aligned_u64 reserved[4];
+};
+
+struct spawn_template_spawn_args {
+ __aligned_u64 flags;
+ __aligned_u64 pidfd;
+ __aligned_u64 argv;
+ __aligned_u64 envp;
+ __aligned_u64 actions;
+ __aligned_u64 actions_len;
+ __aligned_u64 reserved[4];
+};
+
+#define SPAWN_TEMPLATE_CREATE_ARGS_SIZE_VER0 72
+#define SPAWN_TEMPLATE_SPAWN_ARGS_SIZE_VER0 80
+
+#endif /* _UAPI_LINUX_SPAWN_TEMPLATE_H */
--
2.52.0
* [RFC PATCH v1 05/13] exec: add spawn template file descriptors
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Add spawn_template_create() and back each template with an anon-inode fd.
Creation records the per-template state that later spawns reuse: the opened
executable file, optional absolute path, creator credential, and deny-write
state. Keep write access denied until the template fd is released so cached
state cannot race with writers.
This patch only creates and releases template fds.
Spawning and ELF metadata caching are added separately.
Signed-off-by: Li Chen <me@linux.beauty>
---
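Not part of the patch: a userspace sketch of the argument checks that
spawn_template_create() performs before it opens anything. The demo_*
names and the 4 KiB page size are assumptions for illustration. The
extra check that copy_struct_from_user() applies to trailing bytes of
a larger struct is not modeled here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CREATE_CLOEXEC	(1ULL << 0)
#define DEMO_ARGS_SIZE_VER0	72
#define DEMO_PAGE_SIZE		4096	/* assumption: 4 KiB pages */

/* Same field order as struct spawn_template_create_args. */
struct demo_create_args {
	uint64_t flags;
	int32_t execfd;
	uint32_t exec_flags;
	uint64_t filename;
	uint64_t actions;
	uint64_t actions_len;
	uint64_t reserved[4];
};

static_assert(sizeof(struct demo_create_args) == DEMO_ARGS_SIZE_VER0,
	      "layout must match the VER0 size");

/* Returns 0 if the arguments would pass the checks, else a negative errno. */
static int demo_validate_create(const struct demo_create_args *a, size_t usize)
{
	size_t i;

	if (usize < DEMO_ARGS_SIZE_VER0)
		return -EINVAL;
	if (usize > DEMO_PAGE_SIZE)
		return -E2BIG;
	if (a->flags & ~DEMO_CREATE_CLOEXEC)
		return -EINVAL;
	if (a->exec_flags)
		return -EINVAL;
	for (i = 0; i < 4; i++)
		if (a->reserved[i])
			return -EINVAL;
	/* Creation-time actions are rejected in this patch. */
	if (a->actions || a->actions_len)
		return -EINVAL;
	/* Exactly one of execfd (>= 0) and filename (non-zero) must be set. */
	if ((a->execfd < 0) == (a->filename == 0))
		return -EINVAL;
	return 0;
}
```

A path-based caller would therefore set execfd to -1 and filename to
the user pointer; an fd-based caller would set execfd and leave
filename at zero.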
MAINTAINERS | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 -
fs/Makefile | 2 +-
fs/spawn_template.c | 180 +++++++++++++++++++++++++
include/linux/syscalls.h | 3 +
5 files changed, 185 insertions(+), 2 deletions(-)
create mode 100644 fs/spawn_template.c
diff --git a/MAINTAINERS b/MAINTAINERS
index d7b1191e33ca0..d5441812825c3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9732,6 +9732,7 @@ F: Documentation/userspace-api/ELF.rst
F: fs/*binfmt_*.c
F: fs/Kconfig.binfmt
F: fs/exec.c
+F: fs/spawn_template.c
F: fs/tests/binfmt_*_kunit.c
F: fs/tests/exec_kunit.c
F: include/linux/binfmts.h
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da1..d6c1667e8f3b8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,7 +396,6 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
-
#
# Due to a historical design error, certain syscalls are numbered differently
# in x32 as compared to native x86_64. These syscalls have numbers 512-547.
diff --git a/fs/Makefile b/fs/Makefile
index ae1b07f9c6a0c..796eb4ae143e5 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -8,7 +8,7 @@
obj-y := open.o read_write.o file_table.o super.o \
- char_dev.o stat.o exec.o pipe.o namei.o fcntl.o \
+ char_dev.o stat.o exec.o spawn_template.o pipe.o namei.o fcntl.o \
ioctl.o readdir.o select.o dcache.o inode.o \
attr.o bad_inode.o file.o filesystems.o namespace.o \
seq_file.o xattr.o libfs.o fs-writeback.o \
diff --git a/fs/spawn_template.c b/fs/spawn_template.c
new file mode 100644
index 0000000000000..280a1038cc45e
--- /dev/null
+++ b/fs/spawn_template.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/anon_inodes.h>
+#include <linux/cred.h>
+#include <linux/err.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <uapi/linux/spawn_template.h>
+
+#include "internal.h"
+
+#define SPAWN_TEMPLATE_MAX_ACTIONS 256
+
+struct spawn_template {
+ struct file *exec_file;
+ const struct cred *creator_cred;
+ char *filename;
+ bool deny_write;
+};
+
+static const struct file_operations spawn_template_fops;
+
+static bool spawn_template_file_exec_allowed(struct file *file)
+{
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return false;
+ if (path_noexec(&file->f_path))
+ return false;
+ if (file_permission(file, MAY_EXEC))
+ return false;
+ return can_mmap_file(file);
+}
+
+static int spawn_template_release(struct inode *inode, struct file *file)
+{
+ struct spawn_template *tmpl = file->private_data;
+
+ if (tmpl->deny_write)
+ exe_file_allow_write_access(tmpl->exec_file);
+ fput(tmpl->exec_file);
+ put_cred(tmpl->creator_cred);
+ kfree(tmpl->filename);
+ kfree(tmpl);
+ return 0;
+}
+
+static const struct file_operations spawn_template_fops = {
+ .release = spawn_template_release,
+ .llseek = noop_llseek,
+};
+
+static int spawn_template_open_execfd(int execfd, struct file **file,
+ bool *deny_write)
+{
+ int ret;
+
+ if (execfd < 0)
+ return -EINVAL;
+
+ CLASS(fd, f)(execfd);
+ if (fd_empty(f))
+ return -EBADF;
+
+ if (!spawn_template_file_exec_allowed(fd_file(f)))
+ return -EACCES;
+
+ ret = exe_file_deny_write_access(fd_file(f));
+ if (ret)
+ return ret;
+
+ *file = get_file(fd_file(f));
+ *deny_write = true;
+ return 0;
+}
+
+static int spawn_template_open_filename(u64 filename, struct file **file,
+ char **path,
+ bool *deny_write)
+{
+ char *kfilename __free(kfree) = NULL;
+ struct file *exec __free(fput) = NULL;
+ struct file *tmp_file;
+ char *tmp;
+
+ if (!filename)
+ return -EINVAL;
+
+ tmp = strndup_user(u64_to_user_ptr(filename), PATH_MAX);
+ if (IS_ERR(tmp))
+ return PTR_ERR(tmp);
+ kfilename = tmp;
+
+ tmp_file = open_exec(kfilename);
+ if (IS_ERR(tmp_file))
+ return PTR_ERR(tmp_file);
+ exec = tmp_file;
+ if (!spawn_template_file_exec_allowed(exec)) {
+ exe_file_allow_write_access(exec);
+ return -EACCES;
+ }
+
+ *file = no_free_ptr(exec);
+ *path = no_free_ptr(kfilename);
+ *deny_write = true;
+ return 0;
+}
+
+SYSCALL_DEFINE2(spawn_template_create,
+ struct spawn_template_create_args __user *, uargs,
+ size_t, usize)
+{
+ struct spawn_template_create_args args;
+ struct spawn_template *tmpl;
+ int fd_flags = 0;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct spawn_template_create_args) !=
+ SPAWN_TEMPLATE_CREATE_ARGS_SIZE_VER0);
+
+ if (usize < SPAWN_TEMPLATE_CREATE_ARGS_SIZE_VER0)
+ return -EINVAL;
+ if (usize > PAGE_SIZE)
+ return -E2BIG;
+
+ ret = copy_struct_from_user(&args, sizeof(args), uargs, usize);
+ if (ret)
+ return ret;
+
+ if (args.flags & ~SPAWN_TEMPLATE_CREATE_CLOEXEC)
+ return -EINVAL;
+ if (args.exec_flags || args.reserved[0] || args.reserved[1] ||
+ args.reserved[2] || args.reserved[3])
+ return -EINVAL;
+ if (args.actions || args.actions_len)
+ return -EINVAL;
+ if ((args.execfd < 0 && !args.filename) ||
+ (args.execfd >= 0 && args.filename))
+ return -EINVAL;
+
+ tmpl = kzalloc_obj(*tmpl, GFP_KERNEL);
+ if (!tmpl)
+ return -ENOMEM;
+ tmpl->creator_cred = get_current_cred();
+
+ if (args.filename)
+ ret = spawn_template_open_filename(args.filename,
+ &tmpl->exec_file,
+ &tmpl->filename,
+ &tmpl->deny_write);
+ else
+ ret = spawn_template_open_execfd(args.execfd,
+ &tmpl->exec_file,
+ &tmpl->deny_write);
+ if (ret)
+ goto out_free_tmpl;
+
+ if (args.flags & SPAWN_TEMPLATE_CREATE_CLOEXEC)
+ fd_flags |= O_CLOEXEC;
+
+ ret = anon_inode_getfd("spawn_template", &spawn_template_fops, tmpl,
+ fd_flags);
+ if (ret < 0)
+ goto out_put_exec;
+
+ return ret;
+
+out_put_exec:
+ if (tmpl->deny_write)
+ exe_file_allow_write_access(tmpl->exec_file);
+ fput(tmpl->exec_file);
+out_free_tmpl:
+ put_cred(tmpl->creator_cred);
+ kfree(tmpl->filename);
+ kfree(tmpl);
+ return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f3dfc3269188a..4b41950488bd6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -67,6 +67,7 @@ struct rseq;
union bpf_attr;
struct io_uring_params;
struct clone_args;
+struct spawn_template_create_args;
struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
@@ -821,6 +822,8 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
#endif
asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
+asmlinkage long sys_spawn_template_create(struct spawn_template_create_args __user *uargs,
+ size_t size);
asmlinkage long sys_execve(const char __user *filename,
const char __user *const __user *argv,
--
2.52.0
* [RFC PATCH v1 06/13] exec: add spawn_template_spawn()
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Add spawn_template_spawn() to start a child from a template fd. The child
uses the template's pinned executable file, runs per-spawn fd, cwd, and
signal actions, closes non-stdio fds by default, and then executes through
the normal opened-file exec path.
Return a pidfd for the child so userspace can wait on or signal it without
racing against PID reuse. Keep fd inheritance opt-in via
SPAWN_TEMPLATE_SPAWN_INHERIT_FDS.
This patch consumes the template state recorded at creation but does not
add caching; executable identity validation and ELF metadata caching are
added in later patches.
Signed-off-by: Li Chen <me@linux.beauty>
---
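Not part of the patch: a userspace sketch of how a failed exec in the
child turns into a shell-style exit status, mirroring the mapping in
spawn_template_exit_status(). The demo_* names are made up, and plain
fork() plus waitpid() stand in for the CLONE_VM | CLONE_VFORK |
CLONE_PIDFD clone that the patch uses, so this only illustrates the
status convention, not the syscall itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Same mapping as the patch: 127 for "not found", 126 for "not runnable". */
static int demo_exit_status(int err)
{
	switch (err) {
	case -ENOENT:
		return 127;
	case -EACCES:
	case -ENOEXEC:
		return 126;
	default:
		return 1;
	}
}

/*
 * Fork a child that tries to exec @path with an empty environment.
 * Returns the child's exit status, or -1 if it could not be collected.
 */
static int demo_spawn_and_wait(const char *path)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };
	int status;
	pid_t pid;

	assert(path != NULL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execve(path, argv, envp);
		/* Only reached when exec failed. */
		_exit(demo_exit_status(-errno));
	}

	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	if (!WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

With this convention a parent can tell a missing executable (127) from
one that exists but cannot be run (126) by looking only at the wait
status.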
fs/spawn_template.c | 346 +++++++++++++++++++++++++++++++++++++++
include/linux/syscalls.h | 4 +
2 files changed, 350 insertions(+)
diff --git a/fs/spawn_template.c b/fs/spawn_template.c
index 280a1038cc45e..8c3711929cffb 100644
--- a/fs/spawn_template.c
+++ b/fs/spawn_template.c
@@ -1,14 +1,24 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/anon_inodes.h>
+#include <linux/binfmts.h>
+#include <linux/close_range.h>
#include <linux/cred.h>
#include <linux/err.h>
#include <linux/fcntl.h>
+#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/fs_struct.h>
#include <linux/kernel.h>
+#include <linux/namei.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/signal.h>
#include <linux/slab.h>
+#include <linux/string.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
+#include <uapi/linux/openat2.h>
#include <uapi/linux/spawn_template.h>
#include "internal.h"
@@ -22,8 +32,262 @@ struct spawn_template {
bool deny_write;
};
+struct spawn_template_spawn_context {
+ struct spawn_template *tmpl;
+ struct spawn_template_spawn_args args;
+ struct spawn_template_action *actions;
+};
+
static const struct file_operations spawn_template_fops;
+static int spawn_template_exit_status(int err)
+{
+ switch (err) {
+ case -ENOENT:
+ return 127;
+ case -EACCES:
+ case -ENOEXEC:
+ return 126;
+ default:
+ return 1;
+ }
+}
+
+static bool spawn_template_cred_matches(struct spawn_template *tmpl)
+{
+ return current_cred() == tmpl->creator_cred;
+}
+
+static int spawn_template_copy_signal_set(const struct spawn_template_action *action,
+ sigset_t *mask)
+{
+ struct spawn_template_sigset sigset;
+
+ if (!action->arg)
+ return -EINVAL;
+ if (copy_from_user(&sigset, u64_to_user_ptr(action->arg),
+ sizeof(sigset)))
+ return -EFAULT;
+ if (sigset.sigsetsize != sizeof(sigset_t))
+ return -EINVAL;
+ if (copy_from_user(mask, u64_to_user_ptr(sigset.sigset), sizeof(*mask)))
+ return -EFAULT;
+ sigdelsetmask(mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+
+ return 0;
+}
+
+static int spawn_template_apply_open(const struct spawn_template_action *action)
+{
+ struct spawn_template_open open;
+ struct file *file __free(fput) = NULL;
+ struct file *tmp;
+ struct open_flags op;
+ int ret;
+
+ if (action->fd < AT_FDCWD || action->newfd < 0 || action->flags ||
+ !action->arg)
+ return -EINVAL;
+
+ if (copy_from_user(&open, u64_to_user_ptr(action->arg), sizeof(open)))
+ return -EFAULT;
+
+ ret = build_open_flags(&open.how, &op);
+ if (ret)
+ return ret;
+
+ CLASS(filename_flags, name)(u64_to_user_ptr(open.path), op.lookup_flags);
+ tmp = do_file_open(action->fd, name, &op);
+ if (IS_ERR(tmp))
+ return PTR_ERR(tmp);
+ file = tmp;
+
+ return replace_fd(action->newfd, file, open.how.flags & O_CLOEXEC);
+}
+
+static int spawn_template_apply_sigmask(const struct spawn_template_action *action)
+{
+ sigset_t mask;
+ int ret;
+
+ if (action->fd || action->newfd || action->flags)
+ return -EINVAL;
+
+ ret = spawn_template_copy_signal_set(action, &mask);
+ if (ret)
+ return ret;
+
+ set_current_blocked(&mask);
+ return 0;
+}
+
+static int spawn_template_apply_sigdefault(const struct spawn_template_action *action)
+{
+ sigset_t mask;
+ struct k_sigaction sa = {};
+ int ret;
+ int sig;
+
+ if (action->fd || action->newfd || action->flags)
+ return -EINVAL;
+
+ ret = spawn_template_copy_signal_set(action, &mask);
+ if (ret)
+ return ret;
+
+ sa.sa.sa_handler = SIG_DFL;
+ sigemptyset(&sa.sa.sa_mask);
+
+ for (sig = 1; sig < _NSIG; sig++) {
+ if (!sigismember(&mask, sig))
+ continue;
+ ret = do_sigaction(sig, &sa, NULL);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int spawn_template_apply_action(const struct spawn_template_action *action)
+{
+ switch (action->type) {
+ case SPAWN_TEMPLATE_ACTION_CLOSE:
+ return close_fd(action->fd);
+ case SPAWN_TEMPLATE_ACTION_DUP2:
+ if (action->fd == action->newfd) {
+ if (action->flags)
+ return -EINVAL;
+ CLASS(fd, f)(action->fd);
+
+ if (fd_empty(f))
+ return -EBADF;
+ return 0;
+ }
+ return ksys_dup3(action->fd, action->newfd, action->flags);
+ case SPAWN_TEMPLATE_ACTION_FCHDIR: {
+ CLASS(fd, f)(action->fd);
+ int ret;
+
+ if (fd_empty(f))
+ return -EBADF;
+ if (!d_can_lookup(fd_file(f)->f_path.dentry))
+ return -ENOTDIR;
+
+ ret = file_permission(fd_file(f), MAY_EXEC | MAY_CHDIR);
+ if (!ret)
+ set_fs_pwd(current->fs, &fd_file(f)->f_path);
+ return ret;
+ }
+ case SPAWN_TEMPLATE_ACTION_OPEN:
+ return spawn_template_apply_open(action);
+ case SPAWN_TEMPLATE_ACTION_CLOSE_RANGE:
+ return do_close_range(action->fd, action->newfd, action->flags);
+ case SPAWN_TEMPLATE_ACTION_SIGMASK:
+ return spawn_template_apply_sigmask(action);
+ case SPAWN_TEMPLATE_ACTION_SIGDEFAULT:
+ return spawn_template_apply_sigdefault(action);
+ default:
+ return -EINVAL;
+ }
+}
+
+static int spawn_template_copy_actions(struct spawn_template_action **out_actions,
+ u64 count, u64 uaddr)
+{
+ struct spawn_template_action __user *uactions;
+ struct spawn_template_action *actions __free(kfree) = NULL;
+ struct spawn_template_action *tmp;
+ u64 i;
+
+ *out_actions = NULL;
+ if (!count)
+ return 0;
+ if (count > SPAWN_TEMPLATE_MAX_ACTIONS)
+ return -E2BIG;
+ if (!uaddr)
+ return -EINVAL;
+
+ uactions = u64_to_user_ptr(uaddr);
+ tmp = memdup_array_user(uactions, count, sizeof(*actions));
+ if (IS_ERR(tmp))
+ return PTR_ERR(tmp);
+ actions = tmp;
+
+ for (i = 0; i < count; i++) {
+ switch (actions[i].type) {
+ case SPAWN_TEMPLATE_ACTION_CLOSE:
+ if (actions[i].fd < 0 || actions[i].flags ||
+ actions[i].newfd || actions[i].arg)
+ return -EINVAL;
+ break;
+ case SPAWN_TEMPLATE_ACTION_DUP2:
+ if (actions[i].fd < 0 || actions[i].newfd < 0 ||
+ (actions[i].flags & ~O_CLOEXEC) || actions[i].arg)
+ return -EINVAL;
+ break;
+ case SPAWN_TEMPLATE_ACTION_FCHDIR:
+ if (actions[i].fd < 0 || actions[i].flags ||
+ actions[i].newfd || actions[i].arg)
+ return -EINVAL;
+ break;
+ case SPAWN_TEMPLATE_ACTION_OPEN:
+ if (actions[i].fd < AT_FDCWD || actions[i].newfd < 0 ||
+ actions[i].flags || !actions[i].arg)
+ return -EINVAL;
+ break;
+ case SPAWN_TEMPLATE_ACTION_CLOSE_RANGE:
+ if (actions[i].fd < 0 || actions[i].newfd < 0 ||
+ actions[i].fd > actions[i].newfd ||
+ (actions[i].flags &
+ ~(CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC)) ||
+ actions[i].arg)
+ return -EINVAL;
+ break;
+ case SPAWN_TEMPLATE_ACTION_SIGMASK:
+ case SPAWN_TEMPLATE_ACTION_SIGDEFAULT:
+ if (actions[i].fd || actions[i].newfd ||
+ actions[i].flags || !actions[i].arg)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ *out_actions = no_free_ptr(actions);
+ return 0;
+}
+
+static int spawn_template_child(void *data)
+{
+ struct spawn_template_spawn_context *ctx = data;
+ struct spawn_template *tmpl = ctx->tmpl;
+ int ret;
+ u64 i;
+
+ for (i = 0; i < ctx->args.actions_len; i++) {
+ ret = spawn_template_apply_action(&ctx->actions[i]);
+ if (ret < 0)
+ goto out_exec_error;
+ }
+
+ if (!(ctx->args.flags & SPAWN_TEMPLATE_SPAWN_INHERIT_FDS)) {
+ ret = do_close_range(3, ~0U, 0);
+ if (ret < 0)
+ goto out_exec_error;
+ }
+
+ ret = kernel_execveat_file(tmpl->exec_file, "",
+ u64_to_user_ptr(ctx->args.argv),
+ u64_to_user_ptr(ctx->args.envp),
+ AT_EMPTY_PATH);
+out_exec_error:
+ if (ret < 0)
+ do_exit(spawn_template_exit_status(ret));
+ return 0;
+}
+
static bool spawn_template_file_exec_allowed(struct file *file)
{
if (!S_ISREG(file_inode(file)->i_mode))
@@ -53,6 +317,18 @@ static const struct file_operations spawn_template_fops = {
.llseek = noop_llseek,
};
+static struct file *spawn_template_file_from_fd(int fd)
+{
+ CLASS(fd, f)(fd);
+
+ if (fd_empty(f))
+ return ERR_PTR(-EBADF);
+ if (fd_file(f)->f_op != &spawn_template_fops)
+ return ERR_PTR(-EINVAL);
+
+ return get_file(fd_file(f));
+}
+
static int spawn_template_open_execfd(int execfd, struct file **file,
bool *deny_write)
{
@@ -178,3 +454,73 @@ SYSCALL_DEFINE2(spawn_template_create,
kfree(tmpl);
return ret;
}
+
+SYSCALL_DEFINE3(spawn_template_spawn, int, template_fd,
+ struct spawn_template_spawn_args __user *, uargs,
+ size_t, usize)
+{
+ struct spawn_template_spawn_context *ctx;
+ struct kernel_clone_args kargs;
+ struct file *template_file;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct spawn_template_spawn_args) !=
+ SPAWN_TEMPLATE_SPAWN_ARGS_SIZE_VER0);
+
+ if (usize < SPAWN_TEMPLATE_SPAWN_ARGS_SIZE_VER0)
+ return -EINVAL;
+ if (usize > PAGE_SIZE)
+ return -E2BIG;
+
+ template_file = spawn_template_file_from_fd(template_fd);
+ if (IS_ERR(template_file))
+ return PTR_ERR(template_file);
+
+ if (!spawn_template_cred_matches(template_file->private_data)) {
+ ret = -EACCES;
+ goto out_put_template;
+ }
+
+ ctx = kzalloc_obj(*ctx, GFP_KERNEL);
+ if (!ctx) {
+ ret = -ENOMEM;
+ goto out_put_template;
+ }
+
+ ctx->tmpl = template_file->private_data;
+
+ ret = copy_struct_from_user(&ctx->args, sizeof(ctx->args), uargs,
+ usize);
+ if (ret)
+ goto out_free_ctx;
+
+ if ((ctx->args.flags & ~SPAWN_TEMPLATE_SPAWN_INHERIT_FDS) ||
+ !ctx->args.pidfd || ctx->args.reserved[0] ||
+ ctx->args.reserved[1] || ctx->args.reserved[2] ||
+ ctx->args.reserved[3]) {
+ ret = -EINVAL;
+ goto out_free_ctx;
+ }
+
+ ret = spawn_template_copy_actions(&ctx->actions, ctx->args.actions_len,
+ ctx->args.actions);
+ if (ret)
+ goto out_free_ctx;
+
+ kargs = (struct kernel_clone_args) {
+ .flags = CLONE_VM | CLONE_VFORK | CLONE_PIDFD,
+ .pidfd = u64_to_user_ptr(ctx->args.pidfd),
+ .exit_signal = SIGCHLD,
+ .fn = spawn_template_child,
+ .fn_arg = ctx,
+ };
+
+ ret = kernel_clone(&kargs);
+
+ kfree(ctx->actions);
+out_free_ctx:
+ kfree(ctx);
+out_put_template:
+ fput(template_file);
+ return ret;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4b41950488bd6..df7368edf6778 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -68,6 +68,7 @@ union bpf_attr;
struct io_uring_params;
struct clone_args;
struct spawn_template_create_args;
+struct spawn_template_spawn_args;
struct open_how;
struct mount_attr;
struct landlock_ruleset_attr;
@@ -824,6 +825,9 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
asmlinkage long sys_spawn_template_create(struct spawn_template_create_args __user *uargs,
size_t size);
+asmlinkage long sys_spawn_template_spawn(int template_fd,
+ struct spawn_template_spawn_args __user *uargs,
+ size_t size);
asmlinkage long sys_execve(const char __user *filename,
const char __user *const __user *argv,
--
2.52.0
* [RFC PATCH v1 07/13] exec: validate spawn template executable identity
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Record a conservative executable identity key when a template is created:
device, inode, size, mode, owner, ctime, and mtime. Recheck it before
each spawn. For path-created templates, also reopen the path so a replaced
executable cannot silently reuse the old template fd.
Reject stale templates with ESTALE. Keep the check conservative by also
rechecking that the file is still a regular, executable, mmap-capable file.
Signed-off-by: Li Chen <me@linux.beauty>
---
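Not part of the patch: a userspace sketch of the same identity-key idea
built on fstat(2). The demo_* names are made up; the fields follow the
key described above (device, inode, size, mode, owner, ctime, mtime).
The in-kernel version reads the inode directly rather than going
through stat.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

struct demo_file_key {
	dev_t dev;
	ino_t ino;
	off_t size;
	mode_t mode;
	uid_t uid;
	gid_t gid;
	struct timespec ctime;
	struct timespec mtime;
};

/* Snapshot the identity of the file behind @fd. Returns 0 or -1. */
static int demo_fill_key(int fd, struct demo_file_key *key)
{
	struct stat st;

	assert(key != NULL);
	if (fstat(fd, &st) != 0)
		return -1;
	key->dev = st.st_dev;
	key->ino = st.st_ino;
	key->size = st.st_size;
	key->mode = st.st_mode;
	key->uid = st.st_uid;
	key->gid = st.st_gid;
	key->ctime = st.st_ctim;
	key->mtime = st.st_mtim;
	return 0;
}

/* True only if every recorded field still matches the file behind @fd. */
static bool demo_key_matches(int fd, const struct demo_file_key *key)
{
	struct demo_file_key cur;

	if (demo_fill_key(fd, &cur) != 0)
		return false;
	return cur.dev == key->dev && cur.ino == key->ino &&
	       cur.size == key->size && cur.mode == key->mode &&
	       cur.uid == key->uid && cur.gid == key->gid &&
	       cur.ctime.tv_sec == key->ctime.tv_sec &&
	       cur.ctime.tv_nsec == key->ctime.tv_nsec &&
	       cur.mtime.tv_sec == key->mtime.tv_sec &&
	       cur.mtime.tv_nsec == key->mtime.tv_nsec;
}

/* Demo: returns 1 if appending one byte invalidates a key, -1 on error. */
static int demo_append_invalidates_key(void)
{
	char path[] = "/tmp/spawn-key-demo-XXXXXX";
	struct demo_file_key key;
	int result = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	if (demo_fill_key(fd, &key) == 0 && demo_key_matches(fd, &key) &&
	    write(fd, "x", 1) == 1)
		result = !demo_key_matches(fd, &key);
	close(fd);
	unlink(path);
	return result;
}
```

Comparing size as well as the timestamps keeps the check meaningful
even on filesystems with coarse timestamp granularity, which is one
reason a conservative key carries more than one field.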
MAINTAINERS | 1 +
fs/spawn_template.c | 75 ++++++++++++++++++++++++++++++++++
include/linux/spawn_template.h | 25 ++++++++++++
3 files changed, 101 insertions(+)
create mode 100644 include/linux/spawn_template.h
diff --git a/MAINTAINERS b/MAINTAINERS
index d5441812825c3..ea4134a188779 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9737,6 +9737,7 @@ F: fs/tests/binfmt_*_kunit.c
F: fs/tests/exec_kunit.c
F: include/linux/binfmts.h
F: include/linux/elf.h
+F: include/linux/spawn_template.h
F: include/uapi/linux/auxvec.h
F: include/uapi/linux/binfmts.h
F: include/uapi/linux/elf.h
diff --git a/fs/spawn_template.c b/fs/spawn_template.c
index 8c3711929cffb..268f804227987 100644
--- a/fs/spawn_template.c
+++ b/fs/spawn_template.c
@@ -15,6 +15,7 @@
#include <linux/sched/task.h>
#include <linux/signal.h>
#include <linux/slab.h>
+#include <linux/spawn_template.h>
#include <linux/string.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
@@ -27,6 +28,7 @@
struct spawn_template {
struct file *exec_file;
+ struct spawn_template_file_key exec_key;
const struct cred *creator_cred;
char *filename;
bool deny_write;
@@ -40,6 +42,46 @@ struct spawn_template_spawn_context {
static const struct file_operations spawn_template_fops;
+static bool spawn_template_file_exec_allowed(struct file *file);
+
+void spawn_template_fill_file_key(struct file *file,
+ struct spawn_template_file_key *key)
+{
+ struct inode *inode = file_inode(file);
+ struct timespec64 ctime = inode_get_ctime(inode);
+ struct timespec64 mtime = inode_get_mtime(inode);
+
+ key->dev = inode->i_sb->s_dev;
+ key->ino = inode->i_ino;
+ key->size = i_size_read(inode);
+ key->mode = READ_ONCE(inode->i_mode);
+ key->uid = inode->i_uid;
+ key->gid = inode->i_gid;
+ key->ctime_sec = ctime.tv_sec;
+ key->ctime_nsec = ctime.tv_nsec;
+ key->mtime_sec = mtime.tv_sec;
+ key->mtime_nsec = mtime.tv_nsec;
+}
+
+bool spawn_template_file_key_matches(struct file *file,
+ const struct spawn_template_file_key *key)
+{
+ struct spawn_template_file_key cur;
+
+ spawn_template_fill_file_key(file, &cur);
+
+ return cur.dev == key->dev &&
+ cur.ino == key->ino &&
+ cur.size == key->size &&
+ cur.mode == key->mode &&
+ uid_eq(cur.uid, key->uid) &&
+ gid_eq(cur.gid, key->gid) &&
+ cur.ctime_sec == key->ctime_sec &&
+ cur.ctime_nsec == key->ctime_nsec &&
+ cur.mtime_sec == key->mtime_sec &&
+ cur.mtime_nsec == key->mtime_nsec;
+}
+
static int spawn_template_exit_status(int err)
{
switch (err) {
@@ -58,6 +100,32 @@ static bool spawn_template_cred_matches(struct spawn_template *tmpl)
return current_cred() == tmpl->creator_cred;
}
+static bool spawn_template_key_matches(struct spawn_template *tmpl)
+{
+ bool matches;
+
+ if (tmpl->filename) {
+ struct file *file __free(fput) = NULL;
+ struct file *tmp;
+
+ tmp = open_exec(tmpl->filename);
+ if (IS_ERR(tmp))
+ return false;
+ file = tmp;
+
+ matches = spawn_template_file_key_matches(file,
+ &tmpl->exec_key);
+ matches = matches && spawn_template_file_exec_allowed(file);
+ exe_file_allow_write_access(file);
+ if (!matches)
+ return false;
+ }
+
+ return spawn_template_file_exec_allowed(tmpl->exec_file) &&
+ spawn_template_file_key_matches(tmpl->exec_file,
+ &tmpl->exec_key);
+}
+
static int spawn_template_copy_signal_set(const struct spawn_template_action *action,
sigset_t *mask)
{
@@ -433,6 +501,7 @@ SYSCALL_DEFINE2(spawn_template_create,
&tmpl->deny_write);
if (ret)
goto out_free_tmpl;
+ spawn_template_fill_file_key(tmpl->exec_file, &tmpl->exec_key);
if (args.flags & SPAWN_TEMPLATE_CREATE_CLOEXEC)
fd_flags |= O_CLOEXEC;
@@ -507,6 +576,11 @@ SYSCALL_DEFINE3(spawn_template_spawn, int, template_fd,
if (ret)
goto out_free_ctx;
+ if (!spawn_template_key_matches(ctx->tmpl)) {
+ ret = -ESTALE;
+ goto out_free_actions;
+ }
+
kargs = (struct kernel_clone_args) {
.flags = CLONE_VM | CLONE_VFORK | CLONE_PIDFD,
.pidfd = u64_to_user_ptr(ctx->args.pidfd),
@@ -517,6 +591,7 @@ SYSCALL_DEFINE3(spawn_template_spawn, int, template_fd,
ret = kernel_clone(&kargs);
+out_free_actions:
kfree(ctx->actions);
out_free_ctx:
kfree(ctx);
diff --git a/include/linux/spawn_template.h b/include/linux/spawn_template.h
new file mode 100644
index 0000000000000..f14a7749fe55b
--- /dev/null
+++ b/include/linux/spawn_template.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SPAWN_TEMPLATE_H
+#define _LINUX_SPAWN_TEMPLATE_H
+
+#include <linux/fs.h>
+
+struct spawn_template_file_key {
+ dev_t dev;
+ ino_t ino;
+ loff_t size;
+ umode_t mode;
+ kuid_t uid;
+ kgid_t gid;
+ u64 ctime_sec;
+ u64 ctime_nsec;
+ u64 mtime_sec;
+ u64 mtime_nsec;
+};
+
+void spawn_template_fill_file_key(struct file *file,
+ struct spawn_template_file_key *key);
+bool spawn_template_file_key_matches(struct file *file,
+ const struct spawn_template_file_key *key);
+
+#endif /* _LINUX_SPAWN_TEMPLATE_H */
--
2.52.0
* [RFC PATCH v1 08/13] binfmt_elf: cache ELF metadata for spawn templates
From: Li Chen @ 2026-05-28 9:52 UTC (permalink / raw)
To: Christian Brauner, Kees Cook, Alexander Viro
Cc: linux-fsdevel, linux-api, linux-kernel, linux-mm, linux-arch,
linux-doc, linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan, Li Chen
In-Reply-To: <20260528095235.2491226-1-me@linux.beauty>
Spawn templates keep an opened executable and revalidate its file identity
before every spawn. Add an ELF-side template object for the main
executable.
It caches the executable identity key, ELF header, program header table,
and program header count so repeated spawns can reuse validated metadata.
Do not cache interpreter metadata, shared-library dependency state, or
derived mapping-layout state in this RFC.
Keep the normal exec security path intact. The child still goes through
bprm_execve(), including credential handling, permission checks, and LSM
hooks. This only avoids rereading immutable main-executable metadata once
the template has been created and revalidated.
Signed-off-by: Li Chen <me@linux.beauty>
---
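Not part of the patch: a userspace sketch of the kind of header checks
done before ELF metadata is worth caching. The function name
demo_elf_header_ok() is made up, the sketch assumes a 64-bit ELF in
native byte order, and it leaves out the architecture and FDPIC checks
(elf_check_arch(), elf_check_fdpic()) that the kernel code performs.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Validate the start of an ELF image held in @buf and report how many
 * program headers it declares. Returns false if the header is too short,
 * has the wrong magic, is not an executable or PIE/shared object, or
 * declares an unexpected program header entry size.
 */
static bool demo_elf_header_ok(const void *buf, size_t len,
			       unsigned int *phnum)
{
	Elf64_Ehdr ehdr;

	assert(phnum != NULL);

	if (len < sizeof(ehdr))
		return false;
	memcpy(&ehdr, buf, sizeof(ehdr));	/* avoid unaligned access */

	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
		return false;
	if (ehdr.e_type != ET_EXEC && ehdr.e_type != ET_DYN)
		return false;
	if (ehdr.e_phentsize != sizeof(Elf64_Phdr))
		return false;

	*phnum = ehdr.e_phnum;
	return true;
}
```

Once a header passes checks like these, the header and the program
header table are the pieces that can be kept alongside the identity key
and reused as long as the key still matches.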
fs/binfmt_elf.c | 104 ++++++++++++++++++++++++++++++++-
fs/exec.c | 37 +++++++++++-
fs/spawn_template.c | 38 +++++++-----
include/linux/binfmts.h | 6 ++
include/linux/spawn_template.h | 47 +++++++++++++++
5 files changed, 213 insertions(+), 19 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 16a56b6b3f6ca..631dd029aeee7 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -48,6 +48,7 @@
#include <linux/uaccess.h>
#include <uapi/linux/rseq.h>
#include <linux/rseq.h>
+#include <linux/spawn_template.h>
#include <asm/param.h>
#include <asm/page.h>
@@ -552,6 +553,89 @@ static struct elf_phdr *load_elf_phdrs(const struct elfhdr *elf_ex,
return elf_phdata;
}
+#if !ELF_COMPAT
+void spawn_exec_template_put(struct spawn_exec_template *tmpl)
+{
+ if (!tmpl)
+ return;
+ if (!refcount_dec_and_test(&tmpl->refcount))
+ return;
+ kfree(tmpl->exec_phdrs);
+ kfree(tmpl);
+}
+
+struct spawn_exec_template *
+spawn_exec_template_get(struct spawn_exec_template *tmpl)
+{
+ refcount_inc(&tmpl->refcount);
+ return tmpl;
+}
+
+bool spawn_exec_template_matches(struct spawn_exec_template *tmpl,
+ struct file *file)
+{
+ if (!tmpl)
+ return false;
+ if (!spawn_template_file_key_matches(file, &tmpl->exec_key))
+ return false;
+ if (!can_mmap_file(file))
+ return false;
+ return true;
+}
+
+int spawn_exec_template_create(struct file *file,
+ struct spawn_exec_template **out)
+{
+ struct spawn_exec_template *tmpl;
+ loff_t pos = 0;
+ ssize_t nread;
+ int retval;
+
+ *out = NULL;
+
+ tmpl = kzalloc_obj(*tmpl, GFP_KERNEL);
+ if (!tmpl)
+ return -ENOMEM;
+ refcount_set(&tmpl->refcount, 1);
+
+ spawn_template_fill_file_key(file, &tmpl->exec_key);
+
+ nread = kernel_read(file, &tmpl->exec_ehdr, sizeof(tmpl->exec_ehdr),
+ &pos);
+ if (nread < 0) {
+ retval = nread;
+ goto out_put_template;
+ }
+
+ retval = -ENOEXEC;
+ if (nread != sizeof(tmpl->exec_ehdr))
+ goto out_put_template;
+ if (memcmp(tmpl->exec_ehdr.e_ident, ELFMAG, SELFMAG) != 0)
+ goto out_put_template;
+ if (tmpl->exec_ehdr.e_type != ET_EXEC &&
+ tmpl->exec_ehdr.e_type != ET_DYN)
+ goto out_put_template;
+ if (!elf_check_arch(&tmpl->exec_ehdr))
+ goto out_put_template;
+ if (elf_check_fdpic(&tmpl->exec_ehdr))
+ goto out_put_template;
+ if (!can_mmap_file(file))
+ goto out_put_template;
+
+ tmpl->exec_phdrs = load_elf_phdrs(&tmpl->exec_ehdr, file);
+ if (!tmpl->exec_phdrs)
+ goto out_put_template;
+ tmpl->exec_phnum = tmpl->exec_ehdr.e_phnum;
+
+ *out = tmpl;
+ return 0;
+
+out_put_template:
+ spawn_exec_template_put(tmpl);
+ return retval;
+}
+#endif
+
#ifndef CONFIG_ARCH_BINFMT_ELF_STATE
/**
@@ -832,6 +916,7 @@ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
static int load_elf_binary(struct linux_binprm *bprm)
{
struct file *interpreter = NULL; /* to shut gcc up */
+ struct spawn_exec_template *spawn_tmpl = bprm->spawn_template;
unsigned long load_bias = 0, phdr_addr = 0;
int first_pt_load = 1;
unsigned long error;
@@ -851,6 +936,12 @@ static int load_elf_binary(struct linux_binprm *bprm)
struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
struct mm_struct *mm;
struct pt_regs *regs;
+ bool use_spawn_tmpl = spawn_exec_template_matches(spawn_tmpl, bprm->file);
+ bool free_elf_phdata = true;
+
+ if (use_spawn_tmpl)
+ memcpy(bprm->buf, &spawn_tmpl->exec_ehdr,
+ sizeof(spawn_tmpl->exec_ehdr));
retval = -ENOEXEC;
/* First of all, some simple consistency checks */
@@ -866,7 +957,12 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (!can_mmap_file(bprm->file))
goto out;
- elf_phdata = load_elf_phdrs(elf_ex, bprm->file);
+ if (use_spawn_tmpl)
+ elf_phdata = spawn_tmpl->exec_phdrs;
+ else
+ elf_phdata = load_elf_phdrs(elf_ex, bprm->file);
+ if (use_spawn_tmpl)
+ free_elf_phdata = false;
if (!elf_phdata)
goto out;
@@ -1283,7 +1379,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
}
}
- kfree(elf_phdata);
+ if (free_elf_phdata)
+ kfree(elf_phdata);
set_binfmt(&elf_format);
@@ -1390,7 +1487,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (interpreter)
fput(interpreter);
out_free_ph:
- kfree(elf_phdata);
+ if (free_elf_phdata)
+ kfree(elf_phdata);
goto out;
}
diff --git a/fs/exec.c b/fs/exec.c
index 5b91a9b208a77..96b6f6274e0d3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1914,9 +1914,12 @@ static inline struct user_arg_ptr native_arg(const char __user *const __user *p)
return (struct user_arg_ptr){.ptr.native = p};
}
-static int do_execveat_file_common(struct file *file, struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp, int flags)
+static int do_execveat_file_template_common(struct file *file,
+ struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags,
+ struct spawn_exec_template *tmpl)
{
struct linux_binprm *bprm;
struct file *exec_file;
@@ -1940,11 +1943,20 @@ static int do_execveat_file_common(struct file *file, struct filename *filename,
if (IS_ERR(bprm))
return PTR_ERR(bprm);
+ bprm->spawn_template = tmpl;
retval = do_execveat_common_bprm(bprm, argv, envp);
free_bprm(bprm);
return retval;
}
+static int do_execveat_file_common(struct file *file, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp, int flags)
+{
+ return do_execveat_file_template_common(file, filename, argv, envp,
+ flags, NULL);
+}
+
int kernel_execveat_file(struct file *file, const char *filename,
const void __user *argv,
const void __user *envp,
@@ -1962,6 +1974,25 @@ int kernel_execveat_file(struct file *file, const char *filename,
native_arg(user_envp), flags);
}
+int kernel_execveat_file_template(struct file *file, const char *filename,
+ const void __user *argv,
+ const void __user *envp, int flags,
+ struct spawn_exec_template *tmpl)
+{
+ const char __user *const __user *user_argv;
+ const char __user *const __user *user_envp;
+
+ CLASS(filename_kernel, name)(filename);
+
+ user_argv = (const char __user *const __user *)argv;
+ user_envp = (const char __user *const __user *)envp;
+
+ return do_execveat_file_template_common(file, name,
+ native_arg(user_argv),
+ native_arg(user_envp),
+ flags, tmpl);
+}
+
void set_binfmt(struct linux_binfmt *new)
{
struct mm_struct *mm = current->mm;
diff --git a/fs/spawn_template.c b/fs/spawn_template.c
index 268f804227987..a11a7ed676416 100644
--- a/fs/spawn_template.c
+++ b/fs/spawn_template.c
@@ -28,7 +28,7 @@
struct spawn_template {
struct file *exec_file;
- struct spawn_template_file_key exec_key;
+ struct spawn_exec_template *exec_template;
const struct cred *creator_cred;
char *filename;
bool deny_write;
@@ -36,6 +36,7 @@ struct spawn_template {
struct spawn_template_spawn_context {
struct spawn_template *tmpl;
+ struct spawn_exec_template *exec_template;
struct spawn_template_spawn_args args;
struct spawn_template_action *actions;
};
@@ -114,16 +115,16 @@ static bool spawn_template_key_matches(struct spawn_template *tmpl)
file = tmp;
matches = spawn_template_file_key_matches(file,
- &tmpl->exec_key);
+ &tmpl->exec_template->exec_key);
matches = matches && spawn_template_file_exec_allowed(file);
exe_file_allow_write_access(file);
if (!matches)
return false;
}
- return spawn_template_file_exec_allowed(tmpl->exec_file) &&
- spawn_template_file_key_matches(tmpl->exec_file,
- &tmpl->exec_key);
+ if (!spawn_template_file_exec_allowed(tmpl->exec_file))
+ return false;
+ return spawn_exec_template_matches(tmpl->exec_template, tmpl->exec_file);
}
static int spawn_template_copy_signal_set(const struct spawn_template_action *action,
@@ -331,26 +332,29 @@ static int spawn_template_child(void *data)
{
struct spawn_template_spawn_context *ctx = data;
struct spawn_template *tmpl = ctx->tmpl;
+ struct spawn_exec_template *exec_template = ctx->exec_template;
int ret;
u64 i;
for (i = 0; i < ctx->args.actions_len; i++) {
ret = spawn_template_apply_action(&ctx->actions[i]);
if (ret < 0)
- goto out_exec_error;
+ goto out_put_exec_template;
}
if (!(ctx->args.flags & SPAWN_TEMPLATE_SPAWN_INHERIT_FDS)) {
ret = do_close_range(3, ~0U, 0);
if (ret < 0)
- goto out_exec_error;
+ goto out_put_exec_template;
}
- ret = kernel_execveat_file(tmpl->exec_file, "",
- u64_to_user_ptr(ctx->args.argv),
- u64_to_user_ptr(ctx->args.envp),
- AT_EMPTY_PATH);
-out_exec_error:
+ ret = kernel_execveat_file_template(tmpl->exec_file, "",
+ u64_to_user_ptr(ctx->args.argv),
+ u64_to_user_ptr(ctx->args.envp),
+ AT_EMPTY_PATH,
+ exec_template);
+out_put_exec_template:
+ spawn_exec_template_put(exec_template);
if (ret < 0)
do_exit(spawn_template_exit_status(ret));
return 0;
@@ -373,6 +377,7 @@ static int spawn_template_release(struct inode *inode, struct file *file)
if (tmpl->deny_write)
exe_file_allow_write_access(tmpl->exec_file);
+ spawn_exec_template_put(tmpl->exec_template);
fput(tmpl->exec_file);
put_cred(tmpl->creator_cred);
kfree(tmpl->filename);
@@ -501,7 +506,10 @@ SYSCALL_DEFINE2(spawn_template_create,
&tmpl->deny_write);
if (ret)
goto out_free_tmpl;
- spawn_template_fill_file_key(tmpl->exec_file, &tmpl->exec_key);
+
+ ret = spawn_exec_template_create(tmpl->exec_file, &tmpl->exec_template);
+ if (ret)
+ goto out_put_exec;
if (args.flags & SPAWN_TEMPLATE_CREATE_CLOEXEC)
fd_flags |= O_CLOEXEC;
@@ -514,6 +522,7 @@ SYSCALL_DEFINE2(spawn_template_create,
return ret;
out_put_exec:
+ spawn_exec_template_put(tmpl->exec_template);
if (tmpl->deny_write)
exe_file_allow_write_access(tmpl->exec_file);
fput(tmpl->exec_file);
@@ -580,6 +589,7 @@ SYSCALL_DEFINE3(spawn_template_spawn, int, template_fd,
ret = -ESTALE;
goto out_free_actions;
}
+ ctx->exec_template = spawn_exec_template_get(ctx->tmpl->exec_template);
kargs = (struct kernel_clone_args) {
.flags = CLONE_VM | CLONE_VFORK | CLONE_PIDFD,
@@ -590,6 +600,8 @@ SYSCALL_DEFINE3(spawn_template_spawn, int, template_fd,
};
ret = kernel_clone(&kargs);
+ if (ret < 0)
+ spawn_exec_template_put(ctx->exec_template);
out_free_actions:
kfree(ctx->actions);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index c0715678c9a06..4e76a94d331a8 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -9,6 +9,7 @@
struct filename;
struct coredump_params;
+struct spawn_exec_template;
#define CORENAME_MAX_SIZE 128
@@ -53,6 +54,7 @@ struct linux_binprm {
struct file *executable; /* Executable to pass to the interpreter */
struct file *interpreter;
struct file *file;
+ struct spawn_exec_template *spawn_template;
struct cred *cred; /* new credentials */
int unsafe; /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
unsigned int per_clear; /* bits to clear in current->personality */
@@ -145,6 +147,10 @@ int kernel_execveat_file(struct file *file, const char *filename,
const void __user *argv,
const void __user *envp,
int flags);
+int kernel_execveat_file_template(struct file *file, const char *filename,
+ const void __user *argv,
+ const void __user *envp, int flags,
+ struct spawn_exec_template *tmpl);
extern void set_binfmt(struct linux_binfmt *new);
extern ssize_t read_code(struct file *, unsigned long, loff_t, size_t);
diff --git a/include/linux/spawn_template.h b/include/linux/spawn_template.h
index f14a7749fe55b..426413bc11eea 100644
--- a/include/linux/spawn_template.h
+++ b/include/linux/spawn_template.h
@@ -2,7 +2,9 @@
#ifndef _LINUX_SPAWN_TEMPLATE_H
#define _LINUX_SPAWN_TEMPLATE_H
+#include <linux/elf.h>
#include <linux/fs.h>
+#include <linux/refcount.h>
struct spawn_template_file_key {
dev_t dev;
@@ -17,9 +19,54 @@ struct spawn_template_file_key {
u64 mtime_nsec;
};
+struct spawn_exec_template {
+ refcount_t refcount;
+ struct spawn_template_file_key exec_key;
+ struct elfhdr exec_ehdr;
+ struct elf_phdr *exec_phdrs;
+ unsigned int exec_phnum;
+};
+
void spawn_template_fill_file_key(struct file *file,
struct spawn_template_file_key *key);
bool spawn_template_file_key_matches(struct file *file,
const struct spawn_template_file_key *key);
+#ifdef CONFIG_BINFMT_ELF
+int spawn_exec_template_create(struct file *file,
+ struct spawn_exec_template **out);
+struct spawn_exec_template *
+spawn_exec_template_get(struct spawn_exec_template *tmpl);
+void spawn_exec_template_put(struct spawn_exec_template *tmpl);
+bool spawn_exec_template_matches(struct spawn_exec_template *tmpl,
+ struct file *file);
+#else
+static inline int spawn_exec_template_create(struct file *file,
+ struct spawn_exec_template **out)
+{
+ (void)file;
+ (void)out;
+ return -ENOEXEC;
+}
+
+static inline void spawn_exec_template_put(struct spawn_exec_template *tmpl)
+{
+ (void)tmpl;
+}
+
+static inline struct spawn_exec_template *
+spawn_exec_template_get(struct spawn_exec_template *tmpl)
+{
+ return tmpl;
+}
+
+static inline bool spawn_exec_template_matches(struct spawn_exec_template *tmpl,
+ struct file *file)
+{
+ (void)tmpl;
+ (void)file;
+ return false;
+}
+#endif
+
#endif /* _LINUX_SPAWN_TEMPLATE_H */
--
2.52.0