Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-01 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andy Lutomirski, Askar Safin, linux-fsdevel,
	Alexander Viro, Jan Kara, linux-kernel, linux-mm, linux-api,
	netdev, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <CAHk-=wiFuud0Nn3B9YpTWyQja08TeXVk2AB-aAkmVXyigOagbQ@mail.gmail.com>

On Mon, Jun 01, 2026 at 08:50:00AM -0700, Linus Torvalds wrote:
> On Mon, 1 Jun 2026 at 08:36, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Can we review this series properly?
> 
> Well, since it pretty much is what I suggested a few years ago, I
> certainly won't NAK it.
> 
> And the patches looked very straightforward to me. Just the final
> diffstat is worth quoting again because that certainly doesn't look
> problematic:
> 
>   7 files changed, 33 insertions(+), 204 deletions(-)
> 
> and it removes that GIFT flag that was truly disgusting.
> 
> So I'm certainly ok with it from a "looking at the patch" standpoint.
> I didn't _test_ it. I don't have any workload that might remotely
> care.
> 
> I did a quick scan on debian code search for vmsplice, and after ten
> pages of entries that weren't actually *using* it but had lists of
> system calls, I grew bored. So there are likely users, but I don't
> know what they are and how much they care. It *might* be a big
> performance issue somewhere. Unlikely, but...

As usual I would argue to accept it and revert in case we get actual
regression reports...

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-01 16:16 UTC (permalink / raw)
  To: Askar Safin
  Cc: Pedro Falcato, linux-fsdevel, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <CAPnZJGD5QR5jrNMAem7FjMRTtw+Ue+jm7Dc2Kp8Lcjj+9TDw_Q@mail.gmail.com>

On Mon, Jun 01, 2026 at 12:21:06AM +0300, Askar Safin wrote:
> On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> > So, you took an ongoing discussion with an ongoing RFC patchset, and you
> > decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Yes. But I propose an alternative solution to this problem.

So I think this is a case where no explicit rules have been broken. But
if you know that someone has been posting patches and is working on a
problem just racing them to get your own stuff merged is very likely to
unnecessarily ruffle feathers. So sync with the person next time.

The discussion wasn't at an impasse and Pedro is expected to follow-up.
It's not very nice to just have someone else's work be for naught.

> Brauner said in discussion for your patchset:
> "So I'm not very likely to pick this up as is".
> So, I decided to submit another solution.

This lacks quite some context... I said "in its current form" and the a
long discussion ensued.

> If my patchset is applied, then I will try to deal
> with splice-pagecache-to-pipe somehow,
> probably by removing it, too. :) I decided first

So ok, but this is literally what Pedro is working on. This just wastes
people's time.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-01 15:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Askar Safin, linux-fsdevel, Christian Brauner,
	Alexander Viro, Jan Kara, linux-kernel, linux-mm, linux-api,
	netdev, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <ah2nBAdsE5vVJ2PL@casper.infradead.org>

On Mon, 1 Jun 2026 at 08:36, Matthew Wilcox <willy@infradead.org> wrote:
>
> Can we review this series properly?

Well, since it pretty much is what I suggested a few years ago, I
certainly won't NAK it.

And the patches looked very straightforward to me. Just the final
diffstat is worth quoting again because that certainly doesn't look
problematic:

  7 files changed, 33 insertions(+), 204 deletions(-)

and it removes that GIFT flag that was truly disgusting.

So I'm certainly ok with it from a "looking at the patch" standpoint.
I didn't _test_ it. I don't have any workload that might remotely
care.

I did a quick scan on debian code search for vmsplice, and after ten
pages of entries that weren't actually *using* it but had lists of
system calls, I grew bored. So there are likely users, but I don't
know what they are and how much they care. It *might* be a big
performance issue somewhere. Unlikely, but...

         Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Matthew Wilcox @ 2026-06-01 15:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Linus Torvalds, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <CALCETrW__=8mSusayfXG7UFCfue5BGbx+vqESj1d9wqOfX4s8w@mail.gmail.com>

On Sun, May 31, 2026 at 08:11:34PM -0700, Andy Lutomirski wrote:
> On Sat, May 30, 2026 at 6:03 PM Askar Safin <safinaskar@gmail.com> wrote:
> >
> > See recent discussion here:
> > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> >
> > For all these reasons I propose to make vmsplice a simple wrapper for
> > preadv2/pwritev2.
> >
> 
> I have no comment on the code or the history.  But I'm 100% in favor
> of the solution.  vmsplice is a crappy API, and would be incredibly
> complex to get the implementation right,  and it should be removed.
> But it has users, and the approach of just mapping them straight to
> pread/pwrite makes perfect sense.

I agree with Andy.  I think it was appropriate to send this series, since
(as far as I can tell) it's a completely different approach from the others
taken.  I'm not really qualified to judge whether the implementation is
good (it's a bit outside my competency as a reviewer), but the described
approach is more convincing to me than the other approaches.

Can we review this series properly?

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-01 15:11 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <vealb52tv5suireenkke4lul2l3wbnaul2rp3ea545ly5wa5ty@yk3aksvp7skt>

Hi Mateusz,

 ---- On Thu, 28 May 2026 20:55:32 +0800  Mateusz Guzik <mjguzik@gmail.com> wrote --- 
 > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > This RFC adds spawn_template, a userspace-controlled exec acceleration
 > > mechanism for runtimes that repeatedly start the same executable with
 > > different argv, envp, and per-spawn file descriptor setup.
 > > 
 > > The main target is agent runtimes. Modern coding agents repeatedly start
 > > short-lived helper tools such as rg, git, sed, awk, python, node, and
 > > shell wrappers while they inspect and edit a workspace. Those runtimes
 > > already know which tools are hot, and they are also the right place to
 > > decide policy. The kernel does not choose names such as rg, git, or sed.
 > > Userspace opts in by creating a template fd for one executable, then uses
 > > that fd for later spawns. Launchers, shells, and build systems have a
 > > similar repeated-startup shape and could use the same primitive, but the
 > > agent runtime case is the main motivation for this RFC.
 > > 
 > [..]
 > > A typical agent runtime would keep one template per hot executable and
 > > still build argv, envp, cwd, and pipe wiring for each tool call:
 > > 
 > >     rg_tmpl = spawn_template_create("/usr/bin/rg");
 > > 
 > >     for each search request:
 > >         out_r, out_w = pipe_cloexec();
 > >         err_r, err_w = pipe_cloexec();
 > >         actions = [
 > >             FCHDIR(worktree_fd),
 > >             DUP2(out_w, STDOUT_FILENO),
 > >             DUP2(err_w, STDERR_FILENO),
 > >         ];
 > >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
 > >         close(out_w);
 > >         close(err_w);
 > >         read out_r and err_r;
 > >         waitid(P_PIDFD, child.pidfd, ...);
 > > 
 > > 
 > [..]
 > > The cached state is intentionally small. The template fd keeps the opened
 > > main executable file, an optional absolute path string, the creator
 > > credential pointer, and the deny-write state. The executable identity key
 > > records device, inode, size, mode, owner, ctime, and mtime, and is
 > > rechecked before cached metadata is used. The ELF cache keeps only the
 > > main executable's ELF header, program header table, and program header
 > > count.
 > > 
 > >     cached in this RFC          not cached in this RFC
 > >     ------------------          ----------------------
 > >     opened main executable      PT_INTERP metadata
 > >     executable identity key     shared-library graph
 > >     main ELF header             VMA layout metadata
 > >     main ELF program headers    cross-process metadata sharing
 > >     creator cred pointer
 > >     deny-write state
 > > 
 > > This RFC does not cache ELF interpreter metadata, shared-library
 > > dependency state, or derived mapping-layout state. Shared-library
 > > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
 > > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
 > > state. It also does not share cached executable metadata between template
 > > fds created by different processes. Each template owns its small cached
 > > metadata object in this RFC.
 > > 
 > > Performance
 > > ===========
 > > 
 > [..]
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    calls  calls/s     calls/s         seconds
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > > 
 > 
 > This problem is dear to my heart and I have been pondering it on and off
 > for some time now. The entire fork + exec idiom is terrible and needs tox
 > be retired.
 > 
 > Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
 > some time ago and it generated remarkably similar code. But that's a
 > tangent.

Partly, yes. The original idea came from using agents myself and noticing
that they spend a lot of time starting short-lived tools such as rg, sed,
git, bash, and python. I was wondering whether repeated tool calls could
be made cheaper.

After that I used an LLM to bounce around the smallest kernel prototype
for the idea. I did some review, patch split, test, benchmark, leak-check work,
and throw away some cache codes that not actually useful.

 > I'm rather confused by the angle in the patchset. Most of this shaves
 > off a tiny amount of work, while retaining the primary avoidable reason
 > for bad performance: the very fact that fork is part of the picture,
 > especially the part mucking with mm. Creating a pristine process is the
 > way to go.
 > 
 > Additionally there is a known problem where transiently copied file
 > descriptors on fork + exec cause a headache in multithreaded programs
 > doing something like this in parallel. I only did cursory reading, it
 > seems your patchset keeps the same problem in place.
 > 
 > There are numerous impactful ways to speed up execs both in terms of
 > single-threaded cost and their multicore scalability, most of which
 > would be immediately usable by all programs without an opt-in. imo these
 > needs to be exhausted before something like a "template" can be
 > considered.
 > 
 > Per the above, the primary win would stem from *NOT* messing with mm.
 > 
 > As in, whatever the interface, it needs to create an "empty" target
 > process (for lack of a better term).
 > 
 > In terms of userspace-visible APIs, a clean solution escapes me.
 > 
 > Some time ago I proposed returning a handle which is populated over time
 > by the parnet-to-be. One of the problems with it I failed to consider at
 > the time is NUMA locality -- what if the process to be created is going
 > to run on another domain? For example, opening and installing a file for
 > its later use will result in avoidable loss of locality for some of the
 > in-kernel data. That's on top of the fd vs fork problem.
 > 
 > From perf standpoint, the final goal of whatever mechanism should be a
 > state where the target process avoided copying any state it did not need
 > to and which allocated any memory it needed from local NUMA node
 > (whatever it may happen to be). Of course if no affinity is assigned it
 > may happen to move again and lose such locality, nothing can be done
 > about that. But pretend the process is to run in a specific node the
 > parent is NOT running in.
 > 
 > So I think the pragmatic way forward is to implement something close to
 > posix_spawn in the kernel. It may make sense for the thing to take the
 > PATH argument for repeated exec attempts. I understand this is of no use
 > in your particular case, but it very much IS of use for most of the
 > real-world. The initial implementation might even start with doing vfork
 > just to get it off the ground.
 > 
 > The next step would be to extend the interface with means to AVOID
 > copying any file descriptors. There could be a dedicated file action
 > which tells the kernel to avoid such copies or something like a
 > close_range file action (or close_from) -- with a range like <0, INT_MAX>
 > you know no fds are copied.
 > 
 > For the NUMA angle to be sorted out, any file action which opens a file
 > or dups from the parent needs to execute in the child. And frankly
 > something would be needed to ask the scheduler where does it think the
 > child is going to run, so that the task_struct itself can also be
 > allocated with the right backing.
 > 
 > I have not looked into what's needed to create a new process and NOT
 > mess with mm, but I don't think there are unsolvable problems there, at
 > worst some churn.
 > 
 > There are of course other parameters which need to be sorted out, that's
 > covered by the posix_spawn thing.
 > 
 > This e-mail is long enough, so I'm not going to go into issues
 > concerning exec itself right now.
 > 
 > tl;dr I would suggest redoing the patchset as posix_spawn and then doing
 > the actual optimization of not cloning mm itself.
 > 

Thanks a lot for writing this up. I clearly had too narrow a view of the
problem. I was mostly thinking about repeated executable startup, but your
reply and Christian's and Andy's made me see that the more useful target is probably
a pidfd/pidfs-backed process builder which can sit under posix_spawn, and
then grow into something that avoids the fork-shaped mm and fd costs. I
learned a lot from this thread.

At a high level, Windows CreateProcess/NtCreateUserProcess also looks
closer to this direction than fork+exec: create the target process
directly, pass explicit startup attributes and handle inheritance state,
and avoid starting from a copy of the parent address space. That seems
to be the same basic advantage here: build the child closer to its final
shape instead of copying parent state and then throwing much of it away.

I will study the process creation, exec, pidfd/pidfs, and posix_spawn
codes more carefully, then try the direction you suggested
and benchmark the mm/fd costs.

Regards,
Li


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-01  3:11 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

On Sat, May 30, 2026 at 6:03 PM Askar Safin <safinaskar@gmail.com> wrote:
>
> See recent discussion here:
> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
>
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.
>

I have no comment on the code or the history.  But I'm 100% in favor
of the solution.  vmsplice is a crappy API, and would be incredibly
complex to get the implementation right,  and it should be removed.
But it has users, and the approach of just mapping them straight to
pread/pwrite makes perfect sense.

(If anyone wants to contemplate how bad the API is, contemplate gift
mode.  Or contemplate that, if you want correct results, you need to
avoid modifying the memory until the recipient is done reading or you
need to avoid reading the memory until the writer is done writing, and
vmsplice *does not tell you when it's done*.  And there isn't even a
caller specification of whether they want to read or write.  It's ...
crap.)

--Andy

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-01  2:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Alexander Viro, linux-fsdevel, linux-api, linux-kernel,
	linux-mm, linux-arch, linux-doc, linux-kselftest, x86,
	Arnd Bergmann, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20260528-madig-fachrichtung-fehlinformation-61117ba640da@brauner>

Hi Christian,

Thanks a lot for your great review!

 ---- On Thu, 28 May 2026 19:02:53 +0800  Christian Brauner <brauner@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > Hi,
 > > 
 > > This is an early RFC for an idea that is probably still rough in both the
 > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > it now to check whether this direction is worth pursuing and to get
 > > feedback on the kernel/userspace boundary.
 > 
 > The idea of having a builder api for exec isn't all that crazy. But it
 > should simply be built on top of pidfds and thus pidfs itself instead.
 > It has all the basic infrastructure in place already.

Yes, that makes a lot more sense. I was staring too hard at the "hot
executable" part and made the cache/template the API, which was probably
the wrong thing to expose. Sorry about that.

 > Any implementation
 > should also allow userspace to implement posix_spawn() on top of it.

That's so cool, and this is a really useful point. I had not thought about this as
something that could sit under posix_spawn(), but that makes the target
much clearer. It should be a generic exec/spawn builder first, and the
agent use case should just be one user of it.

 > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
 > 
 > pidfd_config(fd, ...) // modeled similar to fsconfig()

Reusing pidfd_open() with an empty target is nice because it keeps the API close
to pidfds, but I wonder if a separate entry point such as
pidfd_spawn_open() or pidfd_create() would make the "new process
builder" case a bit more explicit? Either way, the configuration side
being fsconfig-like makes sense to me.

Thanks again for pointing me in this direction. It helps a lot.

Regards,
Li

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-06-01  1:52 UTC (permalink / raw)
  To: Barry Song
  Cc: Theodore Tso, Bart Van Assche, linux-api, linux-kernel,
	Matthew Wilcox, linux-f2fs-devel, Christoph Hellwig, linux-mm,
	linux-fsdevel, Akilesh Kailash, Christian Brauner
In-Reply-To: <CAGsJ_4yJihngSY0GNcc+MwPHJjpF1qCnS8-UE1GwYoNDtEm9mQ@mail.gmail.com>

On 05/31, Barry Song via Linux-f2fs-devel wrote:
> On Sun, May 31, 2026 at 8:12 AM Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> >
> > On 05/28, Christoph Hellwig wrote:
> > > On Wed, May 27, 2026 at 03:59:35PM +0000, Jaegeuk Kim wrote:
> > > > F2FS merges bios before submit_bio, regardless of small or large folios,
> > > > since the block addresses are consecutive. So, I think IO subsystem was
> > > > working in full speed.
> > >
> > > As does every other remotely modern file system.  But that merging is
> > > surprisingly expensive, which is why using folios gets really major
> > > performance improvements.
> > >
> > > For one doing these checks to merge touch quite a few cache lines.
> > > Second, devices are often a lot more efficient if they see fewer SGL
> > > entries.  I.e. having a 1MB bio a single SGL tends to work better than
> > > having 256 of them.
> > > The same is true in the kernel code itself, both in the submission path
> > > (dma mapping and co), and even more so in the page cache handling
> > > both before submitting and in the completion path.
> > >
> > > See Bart's patch about how long the walk of the bio_vecs in the f2fs
> > > completion path can take.  We had similar issues in XFS even in the
> > > workqueue completion path due to lack of rescheduling, and these simply
> > > go away when you do the folio manipulation in larger chunks (LAZY_PREEMPT
> > > would avoid the need to explicit rescheduling these days, but that just
> > > papers over the symptoms in this case).
> > >
> >
> > I see. That's also super helpful. Let me kick off the large folio support asap.
> > Thanks.
> 
> Hi Jaegeuk,

Hi Barry,

> 
> Nanzhe has put significant effort into this work at Xiaomi over
> the past several months. Large folios can now be supported on
> non-immutable files.
> 
> He has conducted extensive testing on the Pixel 6 and fixed a
> number of hangs discovered during development. He is still
> benchmarking performance, but the implementation appears to be
> reasonably stable at this point. We can run Android Monkey for
> many hours without observing any hangs.
> 
> If you would like to see an RFC, I can ask Nanzhe to send one
> as soon as possible after some cleanup and polishing.

Yeah, I was about to reach out to you. Let's do some discussion
offline.

Thanks,

> 
> Best regards,
> Barry
> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31 21:21 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.

Yes. But I propose an alternative solution to this problem.

Brauner said in discussion for your patchset:
"So I'm not very likely to pick this up as is".
So, I decided to submit another solution.

Pedro, I'm not trying to insult you.

Other kernel developers will decide which of these two solutions they like more.

Many people in discussion of your patchset said how they
dislike splice/vmsplice, and especially vmsplice.
Hellwig said "vmsplice is the worst".
Brauner, Hellwig, Horn said that they dislike vmsplice.
They said that vmsplice in its current form should not
be used, and that it is broken.

Despite all these problems nobody managed to fix
vmsplice in all these years.
So I propose just to effectively remove it.

You may think that I just saw a recent discussion and decided
to jump in. No. splice/vmsplice is my topic of interest for many
years. You can verify this by searching "f:Askar splice"
on lore.kernel.org . I simply decided that given
recent vulnerabilities now is the perfect time to solve
all these vmsplice problems once and for all.

I explained my position here:
https://lore.kernel.org/all/20260523204100.553125-1-safinaskar@gmail.com/ .
Nobody answered, so I just posted this patchset.

If my patchset is applied, then I will try to deal
with splice-pagecache-to-pipe somehow,
probably by removing it, too. :) I decided first
to deal with vmsplice, because it seems to be
easier problem.

> > vmsplice(fd, vec, vlen, vmsplice_flags) will
> > be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> > readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> > writable pipe.
>
> This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
> There are users.

Yes, they are. But my solution is compatible. vmsplice is simply performance
optimization. vmsplice will work just as before, but slower.
And, most importantly, vmsplice design problems will be gone
(nobody managed to fix them anyway for all these years).

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-05-31 19:01 UTC (permalink / raw)
  To: Pedro Falcato, Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, Miklos Szeredi, patches
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

On 5/31/26 10:54, Pedro Falcato wrote:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
>> This patchset is for VFS.
>>
>> Recently we got a lot of vulnerabilities in splice/vmsplice.
>>
>> Also vmsplice already was source of vulnerabilities in the past:
>> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
>>
>> Also vmsplice is problematic for other reasons. Here is what other
>> developers say:
>>
>> Linus Torvalds in 2023:
>>> So I'd personally be perfectly ok with just making vmsplice() be
>>> exactly the same as write, and turn all of vmsplice() into just "it's
>>> a read() if the pipe is open for read, and a write if it's open for
>>> writing".
>> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
>>
>> Christoph Hellwig in May 2026:
>>> vmsplice is the worst, as it is one of the few remaining places that
>>> can incorrectly dirty file backed pages without telling the file system
>>> and cause the other problems fixed by a FOLL_PIN conversion, but it is
>>> the only one where we do not have any idea yet how we could convert it
>>> to FOLL_PIN due to the unbounded pin time.
>> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
>>
>> See recent discussion here:
>> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> 
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Riiiiiight.... I don't think I have to NAK this, do I?

Jup. I'll just ignore this patch set here.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-05-31  8:54 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> This patchset is for VFS.
> 
> Recently we got a lot of vulnerabilities in splice/vmsplice.
> 
> Also vmsplice already was source of vulnerabilities in the past:
> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> 
> Also vmsplice is problematic for other reasons. Here is what other
> developers say:
> 
> Linus Torvalds in 2023:
> > So I'd personally be perfectly ok with just making vmsplice() be
> > exactly the same as write, and turn all of vmsplice() into just "it's
> > a read() if the pipe is open for read, and a write if it's open for
> > writing".
> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
> 
> Christoph Hellwig in May 2026:
> > vmsplice is the worst, as it is one of the few remaining places that
> > can incorrectly dirty file backed pages without telling the file system
> > and cause the other problems fixed by a FOLL_PIN conversion, but it is
> > the only one where we do not have any idea yet how we could convert it
> > to FOLL_PIN due to the unbounded pin time.
> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
> 
> See recent discussion here:
> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u

So, you took an ongoing discussion with an ongoing RFC patchset, and you
decided to reimplement part of the idea on your own, as a concurrent patchset.

Riiiiiight.... I don't think I have to NAK this, do I?

> 
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.
> 
> vmsplice(fd, vec, vlen, vmsplice_flags) will
> be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> writable pipe.

This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
There are users.

-- 
Pedro

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Barry Song @ 2026-05-31  5:28 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Christoph Hellwig, Bart Van Assche, Theodore Tso, linux-api,
	linux-kernel, Matthew Wilcox, linux-f2fs-devel, linux-mm,
	Akilesh Kailash, linux-fsdevel, Christian Brauner
In-Reply-To: <aht812OhSPFqIBPK@google.com>

On Sun, May 31, 2026 at 8:12 AM Jaegeuk Kim <jaegeuk@kernel.org> wrote:
>
> On 05/28, Christoph Hellwig wrote:
> > On Wed, May 27, 2026 at 03:59:35PM +0000, Jaegeuk Kim wrote:
> > > F2FS merges bios before submit_bio, regardless of small or large folios,
> > > since the block addresses are consecutive. So, I think IO subsystem was
> > > working in full speed.
> >
> > As does every other remotely modern file system.  But that merging is
> > surprisingly expensive, which is why using folios gets really major
> > performance improvements.
> >
> > For one doing these checks to merge touch quite a few cache lines.
> > Second, devices are often a lot more efficient if they see fewer SGL
> > entries.  I.e. having a 1MB bio a single SGL tends to work better than
> > having 256 of them.
> > The same is true in the kernel code itself, both in the submission path
> > (dma mapping and co), and even more so in the page cache handling
> > both before submitting and in the completion path.
> >
> > See Bart's patch about how long the walk of the bio_vecs in the f2fs
> > completion path can take.  We had similar issues in XFS even in the
> > workqueue completion path due to lack of rescheduling, and these simply
> > go away when you do the folio manipulation in larger chunks (LAZY_PREEMPT
> > would avoid the need to explicit rescheduling these days, but that just
> > papers over the symptoms in this case).
> >
>
> I see. That's also super helpful. Let me kick off the large folio support asap.
> Thanks.

Hi Jaegeuk,

Nanzhe has put significant effort into this work at Xiaomi over
the past several months. Large folios can now be supported on
non-immutable files.

He has conducted extensive testing on the Pixel 6 and fixed a
number of hangs discovered during development. He is still
benchmarking performance, but the implementation appears to be
reasonably stable at this point. We can run Android Monkey for
many hours without observing any hangs.

If you would like to see an RFC, I can ask Nanzhe to send one
as soon as possible after some cleanup and polishing.

Best regards,
Barry

^ permalink raw reply

* [PATCH 3/3] splice: remove PIPE_BUF_FLAG_GIFT
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

It is unused now.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/fuse/dev.c             | 1 -
 fs/splice.c               | 6 ++----
 include/linux/pipe_fs_i.h | 1 -
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5dda7080f4a9..fb8fe0c96692 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2352,7 +2352,6 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 				goto out_free;
 
 			*obuf = *ibuf;
-			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->len = rem;
 			ibuf->offset += obuf->len;
 			ibuf->len -= obuf->len;
diff --git a/fs/splice.c b/fs/splice.c
index b1a4e3713bd6..6ddf7dd72f7b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1622,10 +1622,9 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			*obuf = *ibuf;
 
 			/*
-			 * Don't inherit the gift and merge flags, we need to
+			 * Don't inherit the merge flag, we need to
 			 * prevent multiple steals of this page.
 			 */
-			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;
 
 			obuf->len = len;
@@ -1711,10 +1710,9 @@ static ssize_t link_pipe(struct pipe_inode_info *ipipe,
 		*obuf = *ibuf;
 
 		/*
-		 * Don't inherit the gift and merge flag, we need to prevent
+		 * Don't inherit the merge flag, we need to prevent
 		 * multiple steals of this page.
 		 */
-		obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 		obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;
 
 		if (obuf->len > len)
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7f6a92ac9704..a1eeed800669 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -6,7 +6,6 @@
 
 #define PIPE_BUF_FLAG_LRU	0x01	/* page is on the LRU */
 #define PIPE_BUF_FLAG_ATOMIC	0x02	/* was atomically mapped */
-#define PIPE_BUF_FLAG_GIFT	0x04	/* page is a gift */
 #define PIPE_BUF_FLAG_PACKET	0x08	/* read() as a packet */
 #define PIPE_BUF_FLAG_CAN_MERGE	0x10	/* can merge buffers */
 #define PIPE_BUF_FLAG_WHOLE	0x20	/* read() must return entire buffer or error */
-- 
2.47.3


^ permalink raw reply related

* [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

vmsplice behavior on writable pipe became equivalent to pwritev2.
vmsplice behavior on readable pipe already was nearly
equivalent to preadv2, but I made this explicit. I. e. I made it
obvious from code that vmsplice now is equivalent to preadv2/pwritev2.

Also I moved vmsplice to fs/read_write.c, because now it arguably
belongs there.

Note that SPLICE_F_NONBLOCK behavior slightly changed: previously
vmsplice ignored whether the pipe was opened with O_NONBLOCK, and mode
of operation depended on whether SPLICE_F_NONBLOCK was passed only.
Now the operation will be non-blocking if O_NONBLOCK was passed when
opening *or* SPLICE_F_NONBLOCK was passed to vmsplice. Previous
behavior was arguably buggy, and new behavior is arguably better.

Now SPLICE_F_GIFT is always ignored by all 3 syscalls: splice, tee
and vmsplice.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/read_write.c          |  23 +++++
 fs/splice.c              | 192 +--------------------------------------
 include/linux/skbuff.h   |   4 +-
 include/linux/splice.h   |   2 +-
 include/linux/syscalls.h |   4 +-
 5 files changed, 29 insertions(+), 196 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 50bff7edc91f..1e5444f4dab3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1213,6 +1213,29 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 	return do_pwritev(fd, vec, vlen, pos, flags);
 }
 
+/*
+ * Legacy preadv2/pwritev2 wrapper.
+ */
+SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned int, flags)
+{
+	if (unlikely(flags & ~SPLICE_F_ALL))
+		return -EINVAL;
+
+	CLASS(fd, f)(fd);
+	if (fd_empty(f))
+		return -EBADF;
+
+	/* We do do_writev/do_readv, so it is okay to pass "false" here */
+	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+		return -EBADF;
+
+	if (fd_file(f)->f_mode & FMODE_WRITE)
+		return do_writev(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+	else
+		return do_readv(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+}
+
 /*
  * Various compat syscalls.  Note that they all pretend to take a native
  * iovec - import_iovec will properly treat those as compat_iovecs based on
diff --git a/fs/splice.c b/fs/splice.c
index 59adbc2fa4d6..b1a4e3713bd6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -159,22 +159,6 @@ const struct pipe_buf_operations page_cache_pipe_buf_ops = {
 	.get		= generic_pipe_buf_get,
 };
 
-static bool user_page_pipe_buf_try_steal(struct pipe_inode_info *pipe,
-		struct pipe_buffer *buf)
-{
-	if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
-		return false;
-
-	buf->flags |= PIPE_BUF_FLAG_LRU;
-	return generic_pipe_buf_try_steal(pipe, buf);
-}
-
-static const struct pipe_buf_operations user_page_pipe_buf_ops = {
-	.release	= page_cache_pipe_buf_release,
-	.try_steal	= user_page_pipe_buf_try_steal,
-	.get		= generic_pipe_buf_get,
-};
-
 static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
 {
 	smp_mb();
@@ -589,8 +573,7 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
  * Description:
  *    This function does little more than loop over the pipe and call
  *    @actor to do the actual moving of a single struct pipe_buffer to
- *    the desired destination. See pipe_to_file, pipe_to_sendmsg, or
- *    pipe_to_user.
+ *    the desired destination. See pipe_to_file or pipe_to_sendmsg.
  *
  */
 ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
@@ -1440,179 +1423,6 @@ static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
 	return ret;
 }
 
-static ssize_t iter_to_pipe(struct iov_iter *from,
-			    struct pipe_inode_info *pipe,
-			    unsigned int flags)
-{
-	struct pipe_buffer buf = {
-		.ops = &user_page_pipe_buf_ops,
-		.flags = flags
-	};
-	size_t total = 0;
-	ssize_t ret = 0;
-
-	while (iov_iter_count(from)) {
-		struct page *pages[16];
-		ssize_t left;
-		size_t start;
-		int i, n;
-
-		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
-		if (left <= 0) {
-			ret = left;
-			break;
-		}
-
-		n = DIV_ROUND_UP(left + start, PAGE_SIZE);
-		for (i = 0; i < n; i++) {
-			int size = umin(left, PAGE_SIZE - start);
-
-			buf.page = pages[i];
-			buf.offset = start;
-			buf.len = size;
-			ret = add_to_pipe(pipe, &buf);
-			if (unlikely(ret < 0)) {
-				iov_iter_revert(from, left);
-				// this one got dropped by add_to_pipe()
-				while (++i < n)
-					put_page(pages[i]);
-				goto out;
-			}
-			total += ret;
-			left -= size;
-			start = 0;
-		}
-	}
-out:
-	return total ? total : ret;
-}
-
-static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
-			struct splice_desc *sd)
-{
-	int n = copy_page_to_iter(buf->page, buf->offset, sd->len, sd->u.data);
-	return n == sd->len ? n : -EFAULT;
-}
-
-/*
- * For lack of a better implementation, implement vmsplice() to userspace
- * as a simple copy of the pipe's pages to the user iov.
- */
-static ssize_t vmsplice_to_user(struct file *file, struct iov_iter *iter,
-				unsigned int flags)
-{
-	struct pipe_inode_info *pipe = get_pipe_info(file, true);
-	struct splice_desc sd = {
-		.total_len = iov_iter_count(iter),
-		.flags = flags,
-		.u.data = iter
-	};
-	ssize_t ret = 0;
-
-	if (!pipe)
-		return -EBADF;
-
-	pipe_clear_nowait(file);
-
-	if (sd.total_len) {
-		pipe_lock(pipe);
-		ret = __splice_from_pipe(pipe, &sd, pipe_to_user);
-		pipe_unlock(pipe);
-	}
-
-	if (ret > 0)
-		fsnotify_access(file);
-
-	return ret;
-}
-
-/*
- * vmsplice splices a user address range into a pipe. It can be thought of
- * as splice-from-memory, where the regular splice is splice-from-file (or
- * to file). In both cases the output is a pipe, naturally.
- */
-static ssize_t vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
-				unsigned int flags)
-{
-	struct pipe_inode_info *pipe;
-	ssize_t ret = 0;
-	unsigned buf_flag = 0;
-
-	if (flags & SPLICE_F_GIFT)
-		buf_flag = PIPE_BUF_FLAG_GIFT;
-
-	pipe = get_pipe_info(file, true);
-	if (!pipe)
-		return -EBADF;
-
-	pipe_clear_nowait(file);
-
-	pipe_lock(pipe);
-	ret = wait_for_space(pipe, flags);
-	if (!ret)
-		ret = iter_to_pipe(iter, pipe, buf_flag);
-	pipe_unlock(pipe);
-	if (ret > 0) {
-		wakeup_pipe_readers(pipe);
-		fsnotify_modify(file);
-	}
-	return ret;
-}
-
-/*
- * Note that vmsplice only really supports true splicing _from_ user memory
- * to a pipe, not the other way around. Splicing from user memory is a simple
- * operation that can be supported without any funky alignment restrictions
- * or nasty vm tricks. We simply map in the user memory and fill them into
- * a pipe. The reverse isn't quite as easy, though. There are two possible
- * solutions for that:
- *
- *	- memcpy() the data internally, at which point we might as well just
- *	  do a regular read() on the buffer anyway.
- *	- Lots of nasty vm tricks, that are neither fast nor flexible (it
- *	  has restriction limitations on both ends of the pipe).
- *
- * Currently we punt and implement it as a normal copy, see pipe_to_user().
- *
- */
-SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, uiov,
-		unsigned long, nr_segs, unsigned int, flags)
-{
-	struct iovec iovstack[UIO_FASTIOV];
-	struct iovec *iov = iovstack;
-	struct iov_iter iter;
-	ssize_t error;
-	int type;
-
-	if (unlikely(flags & ~SPLICE_F_ALL))
-		return -EINVAL;
-
-	CLASS(fd, f)(fd);
-	if (fd_empty(f))
-		return -EBADF;
-	if (fd_file(f)->f_mode & FMODE_WRITE)
-		type = ITER_SOURCE;
-	else if (fd_file(f)->f_mode & FMODE_READ)
-		type = ITER_DEST;
-	else
-		return -EBADF;
-
-	error = import_iovec(type, uiov, nr_segs,
-			     ARRAY_SIZE(iovstack), &iov, &iter);
-	if (error < 0)
-		return error;
-
-	if (!iov_iter_count(&iter))
-		error = 0;
-	else if (type == ITER_SOURCE)
-		error = vmsplice_to_pipe(fd_file(f), &iter, flags);
-	else
-		error = vmsplice_to_user(fd_file(f), &iter, flags);
-
-	kfree(iov);
-	return error;
-}
-
 SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
 		int, fd_out, loff_t __user *, off_out,
 		size_t, len, unsigned int, flags)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2bcf78a4de7b..2961fee3e5cc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -505,7 +505,7 @@ enum {
 	SKBFL_ZEROCOPY_ENABLE = BIT(0),
 
 	/* This indicates at least one fragment might be overwritten
-	 * (as in vmsplice(), sendfile() ...)
+	 * (as in sendfile(), ...)
 	 * If we need to compute a TX checksum, we'll need to copy
 	 * all frags to avoid possible bad checksum
 	 */
@@ -4017,7 +4017,7 @@ static inline int skb_linearize(struct sk_buff *skb)
  * @skb: buffer to test
  *
  * Return: true if the skb has at least one frag that might be modified
- * by an external entity (as in vmsplice()/sendfile())
+ * by an external entity (as in sendfile())
  */
 static inline bool skb_has_shared_frag(const struct sk_buff *skb)
 {
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 9dec4861d09f..fb4f035aae83 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -19,7 +19,7 @@
 				 /* we may still block on the fd we splice */
 				 /* from/to, of course */
 #define SPLICE_F_MORE	(0x04)	/* expect more data */
-#define SPLICE_F_GIFT	(0x08)	/* pages passed in are a gift */
+#define SPLICE_F_GIFT	(0x08)	/* ignored */
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..a86a88207956 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -514,8 +514,8 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
 			  struct old_timespec32 __user *, const sigset_t __user *,
 			  size_t);
 asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
-asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
-			     unsigned long nr_segs, unsigned int flags);
+asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
+			     unsigned long vlen, unsigned int flags);
 asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
 			   int fd_out, loff_t __user *off_out,
 			   size_t len, unsigned int flags);
-- 
2.47.3


^ permalink raw reply related

* [PATCH 1/3] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

Remove unused parameter "flags" from "link_pipe".

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/splice.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 9d8f63e2fd1a..59adbc2fa4d6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1849,7 +1849,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
  */
 static ssize_t link_pipe(struct pipe_inode_info *ipipe,
 			 struct pipe_inode_info *opipe,
-			 size_t len, unsigned int flags)
+			 size_t len)
 {
 	struct pipe_buffer *ibuf, *obuf;
 	unsigned int i_head, o_head;
@@ -1962,7 +1962,7 @@ ssize_t do_tee(struct file *in, struct file *out, size_t len,
 		if (!ret) {
 			ret = opipe_prep(opipe, flags);
 			if (!ret)
-				ret = link_pipe(ipipe, opipe, len, flags);
+				ret = link_pipe(ipipe, opipe, len);
 		}
 	}
 
-- 
2.47.3


^ permalink raw reply related

* [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches

This patchset is for VFS.

Recently we got a lot of vulnerabilities in splice/vmsplice.

Also vmsplice already was source of vulnerabilities in the past:
CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).

Also vmsplice is problematic for other reasons. Here is what other
developers say:

Linus Torvalds in 2023:
> So I'd personally be perfectly ok with just making vmsplice() be
> exactly the same as write, and turn all of vmsplice() into just "it's
> a read() if the pipe is open for read, and a write if it's open for
> writing".
https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/

Christoph Hellwig in May 2026:
> vmsplice is the worst, as it is one of the few remaining places that
> can incorrectly dirty file backed pages without telling the file system
> and cause the other problems fixed by a FOLL_PIN conversion, but it is
> the only one where we do not have any idea yet how we could convert it
> to FOLL_PIN due to the unbounded pin time.
https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/

See recent discussion here:
https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u

For all these reasons I propose to make vmsplice a simple wrapper for
preadv2/pwritev2.

vmsplice(fd, vec, vlen, vmsplice_flags) will
be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
writable pipe.

SPLICE_F_NONBLOCK is translated to RWF_NOWAIT, all other SPLICE_F_*
flags are ignored.

There is a small change to handling of NONBLOCK-related flags,
see commit messages for details.

I tested this patch in Qemu.

This patchset was written by me, not by LLMs.

Askar Safin (3):
  tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
  vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
  splice: remove PIPE_BUF_FLAG_GIFT

 fs/fuse/dev.c             |   1 -
 fs/read_write.c           |  23 +++++
 fs/splice.c               | 202 +-------------------------------------
 include/linux/pipe_fs_i.h |   1 -
 include/linux/skbuff.h    |   4 +-
 include/linux/splice.h    |   2 +-
 include/linux/syscalls.h  |   4 +-
 7 files changed, 33 insertions(+), 204 deletions(-)


base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d (7.1-rc5)
-- 
2.47.3


^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-31  0:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahiZRpE593n4blxn@casper.infradead.org>

On 05/28, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Background
> > ----------
> > The primary use case is accelerating AI model loading, which demands
> > exceptionally high sequential read speeds. In our benchmarks on embedded
> > systems:
> >  - Using high-order page allocations allows the system to saturate the
> >    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> >    medium-to-low CPU frequencies.
> >  - In contrast, standard small folios cap performance at 2 GB/s.
> > 
> > The performance doubling stems directly from reducing CPU cycle overhead during
> > memory allocation.
> 
> When you say "AI model loading", are you mmap()ing the file of weights,
> or are you calling read() to load the file into anonymous memory?
> 
> This matters because for the first operation, you need to allocate folios
> of PMD size in order to make best use of TLB entries.  For the second
> operation, it's more important to iterate through the file quickly,
> freeing folios behind you after you access them so they're available
> for the next batch.

We deal with multiple options tho, what I'm looking at is mostly a preloading
models by mmap(MAP_POPULATE) which takes the readahead path bumping up the order
by 2. Previously I also looked at fadvise(WILLNEED), but gave up due to the
broken interface. OTOH, we use RWF_DONTCACHE for read() case, but I don't
think it's ideal for the best loading performance.

> 
> 
> _______________________________________________
> Linux-f2fs-devel mailing list
> Linux-f2fs-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-31  0:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Theodore Tso, linux-api, linux-kernel,
	Matthew Wilcox, linux-f2fs-devel, linux-mm, Akilesh Kailash,
	linux-fsdevel, Christian Brauner
In-Reply-To: <ahkl52N3RDcusCNd@infradead.org>

On 05/28, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 03:59:35PM +0000, Jaegeuk Kim wrote:
> > F2FS merges bios before submit_bio, regardless of small or large folios,
> > since the block addresses are consecutive. So, I think IO subsystem was
> > working in full speed.
> 
> As does every other remotely modern file system.  But that merging is
> surprisingly expensive, which is why using folios gets really major
> performance improvements.
> 
> For one doing these checks to merge touch quite a few cache lines.
> Second, devices are often a lot more efficient if they see fewer SGL
> entries.  I.e. having a 1MB bio a single SGL tends to work better than
> having 256 of them.
> The same is true in the kernel code itself, both in the submission path
> (dma mapping and co), and even more so in the page cache handling
> both before submitting and in the completion path.
> 
> See Bart's patch about how long the walk of the bio_vecs in the f2fs
> completion path can take.  We had similar issues in XFS even in the
> workqueue completion path due to lack of rescheduling, and these simply
> go away when you do the folio manipulation in larger chunks (LAZY_PREEMPT
> would avoid the need to explicit rescheduling these days, but that just
> papers over the symptoms in this case).
> 

I see. That's also super helpful. Let me kick off the large folio support asap.
Thanks.

^ permalink raw reply

* [PATCH v4 11/11] kernel/api: add syscall enter/exit tracepoints
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add two tracepoints to the CONFIG_KAPI_RUNTIME_CHECKS syscall validation
path so the framework's behavior can be observed without the noise and
loss of pr_warn_ratelimited():

  kapi_syscall_enter - the spec name, the raw argument values, and a
                       rendered "name=value" list of the specified
                       parameters (pointer-like values in hex, integers
                       and file descriptors in decimal)
  kapi_syscall_exit  - the spec name, the return value, and whether it
                       matched the specification (spec_match)

Both fire only for syscalls that have a KAPI specification and live
inside the existing CONFIG_KAPI_RUNTIME_CHECKS region, so they exist
exactly when the runtime checks do; they compile to no-ops without
CONFIG_TRACEPOINTS and stay dormant until enabled. The parameter list
is rendered only when the enter tracepoint is enabled.

kapi_syscall_exit is also emitted on the parameter-validation rejection
path -- where the validator returns -EINVAL and the real handler is
skipped -- with spec_match=0, so every kapi_syscall_enter has a matching
exit.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 Documentation/dev-tools/kernel-api-spec.rst | 29 ++++++++
 MAINTAINERS                                 |  1 +
 include/trace/events/kapi.h                 | 74 ++++++++++++++++++++
 kernel/api/kernel_api_spec.c                | 77 ++++++++++++++++++---
 4 files changed, 173 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/kapi.h

diff --git a/Documentation/dev-tools/kernel-api-spec.rst b/Documentation/dev-tools/kernel-api-spec.rst
index 26598a98c0f69..561e7bff58379 100644
--- a/Documentation/dev-tools/kernel-api-spec.rst
+++ b/Documentation/dev-tools/kernel-api-spec.rst
@@ -285,6 +285,35 @@ custom validation functions via the ``validate`` field in the constraint spec:
     .type = KAPI_CONSTRAINT_CUSTOM,
     .validate = validate_buffer_size,
 
+Tracepoints
+-----------
+
+When ``CONFIG_KAPI_RUNTIME_CHECKS`` is enabled, the syscall validation path emits
+two ftrace tracepoints (in the ``kapi`` trace system) for every syscall that has a
+specification:
+
+- ``kapi_syscall_enter`` -- fired before parameter validation, recording the spec
+  name, the raw syscall argument values, and -- when the spec provides parameter
+  metadata -- a rendered ``name=value`` list: pointer-like values are shown in hex,
+  integers and file descriptors in decimal, and an unnamed parameter as ``arg``.
+- ``kapi_syscall_exit`` -- fired after the handler returns, or in place of the
+  handler when parameter validation rejects the call (the handler is skipped and
+  ``-EINVAL`` is returned). Records the spec name, the return value, and
+  ``spec_match``: 0 when the call did not conform to the spec -- the parameters were
+  rejected, or the return value was not one the spec allows -- and 1 otherwise.
+
+Unlike the ``pr_warn_ratelimited`` violation reports, the tracepoints capture every
+spec'd call rather than only violations, are lossless under load, and can be filtered
+with the usual ftrace facilities. They require ``CONFIG_TRACEPOINTS`` and stay dormant
+until enabled::
+
+    # echo 1 > /sys/kernel/tracing/events/kapi/enable
+    # cat /sys/kernel/tracing/trace
+     ...  kapi_syscall_enter: sys_read(fd=3, buf=0x7ffd46780b58, count=0x340)
+     ...  kapi_syscall_exit: sys_read = 832 spec_match=1
+     ...  kapi_syscall_enter: sys_open(filename=0x480300, flags=268435456, mode=0x0)
+     ...  kapi_syscall_exit: sys_open = -22 spec_match=0
+
 DebugFS Interface
 =================
 
diff --git a/MAINTAINERS b/MAINTAINERS
index ddfd9cad98916..48def631ad823 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13823,6 +13823,7 @@ L:	linux-api@vger.kernel.org
 S:	Maintained
 F:	Documentation/dev-tools/kernel-api-spec.rst
 F:	include/linux/kernel_api_spec.h
+F:	include/trace/events/kapi.h
 F:	kernel/api/
 F:	tools/kapi/
 F:	tools/lib/python/kdoc/kdoc_apispec.py
diff --git a/include/trace/events/kapi.h b/include/trace/events/kapi.h
new file mode 100644
index 0000000000000..47828f3338828
--- /dev/null
+++ b/include/trace/events/kapi.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kapi
+
+#if !defined(_TRACE_KAPI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KAPI_H
+
+#include <linux/tracepoint.h>
+
+/* Max length of the rendered "name=value, ..." parameter list. */
+#define KAPI_TP_PARAMS_LEN 256
+
+/*
+ * Emitted from the CONFIG_KAPI_RUNTIME_CHECKS syscall validation path for
+ * syscalls that have a KAPI specification: kapi_syscall_enter fires before
+ * parameter validation, kapi_syscall_exit after the handler returns.
+ * @name is the spec name, e.g. "sys_open".
+ *
+ * kapi_syscall_enter carries both the raw argument values (args[]) and, when
+ * the spec provides parameter metadata, a rendered "name=value" list (params,
+ * built by the caller): pointer-like values in hex, integers and fds in decimal.
+ */
+TRACE_EVENT(kapi_syscall_enter,
+
+	TP_PROTO(const char *name, int nargs, const s64 *args, const char *params),
+
+	TP_ARGS(name, nargs, args, params),
+
+	TP_STRUCT__entry(
+		__string(	name,	name	)
+		__field(	int,	nargs	)
+		__array(	u64,	args,	6	)
+		__string(	params,	params	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->nargs = nargs;
+		memset(__entry->args, 0, sizeof(__entry->args));
+		if (args && nargs > 0)
+			memcpy(__entry->args, args,
+			       min_t(int, nargs, 6) * sizeof(__entry->args[0]));
+		__assign_str(params);
+	),
+
+	TP_printk("%s(%s)", __get_str(name), __get_str(params))
+);
+
+TRACE_EVENT(kapi_syscall_exit,
+
+	TP_PROTO(const char *name, long ret, bool spec_match),
+
+	TP_ARGS(name, ret, spec_match),
+
+	TP_STRUCT__entry(
+		__string(	name,		name		)
+		__field(	long,		ret		)
+		__field(	bool,		spec_match	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->ret = ret;
+		__entry->spec_match = spec_match;
+	),
+
+	TP_printk("%s = %ld spec_match=%d",
+		  __get_str(name), __entry->ret, __entry->spec_match)
+);
+
+#endif /* _TRACE_KAPI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/api/kernel_api_spec.c b/kernel/api/kernel_api_spec.c
index 1a9041a7f21a4..2aa8c04a5851e 100644
--- a/kernel/api/kernel_api_spec.c
+++ b/kernel/api/kernel_api_spec.c
@@ -659,6 +659,45 @@ EXPORT_SYMBOL_GPL(kapi_print_spec);
 
 #ifdef CONFIG_KAPI_RUNTIME_CHECKS
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/kapi.h>
+
+/*
+ * Render a syscall's parameters as a "name=value, ..." string for the
+ * kapi_syscall_enter tracepoint.  Names come from the spec; pointer-like
+ * values are shown in hex, integers and file descriptors in decimal.
+ */
+static void kapi_trace_format_params(const struct kernel_api_spec *spec,
+				     const s64 *args, int nargs,
+				     char *buf, size_t size)
+{
+	int i, used = 0;
+
+	buf[0] = '\0';
+	/* Bound by the caller-supplied arg count; the spec arity may differ. */
+	for (i = 0; args && i < nargs && i < 6; i++) {
+		const char *name = "arg";
+		bool dec = false;
+
+		if (i < spec->param_count) {
+			const struct kapi_param_spec *ps = &spec->params[i];
+
+			if (ps->name)
+				name = ps->name;
+			dec = ps->type == KAPI_TYPE_INT || ps->type == KAPI_TYPE_FD;
+		}
+
+		used += scnprintf(buf + used, size - used, "%s%s=",
+				  i ? ", " : "", name);
+		if (dec)
+			used += scnprintf(buf + used, size - used, "%lld",
+					  (long long)args[i]);
+		else
+			used += scnprintf(buf + used, size - used, "0x%llx",
+					  (unsigned long long)args[i]);
+	}
+}
+
 /**
  * kapi_validate_fd - Validate that a file descriptor value is in valid range
  * @fd: File descriptor to validate
@@ -1154,16 +1193,24 @@ EXPORT_SYMBOL_GPL(kapi_validate_syscall_param);
 int kapi_validate_syscall_params(const struct kernel_api_spec *spec,
 				 const s64 *params, int param_count)
 {
-	int i;
+	int i, ret = 0;
 
 	if (!spec || !params)
 		return 0;
 
+	if (trace_kapi_syscall_enter_enabled()) {
+		char pbuf[KAPI_TP_PARAMS_LEN];
+
+		kapi_trace_format_params(spec, params, param_count, pbuf, sizeof(pbuf));
+		trace_kapi_syscall_enter(spec->name, param_count, params, pbuf);
+	}
+
 	/* Validate that we have the expected number of parameters */
 	if (param_count != spec->param_count) {
 		pr_warn_ratelimited("API %s: parameter count mismatch (expected %u, got %d)\n",
 			spec->name, spec->param_count, param_count);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 
 	/* Validate each parameter with context */
@@ -1173,12 +1220,22 @@ int kapi_validate_syscall_params(const struct kernel_api_spec *spec,
 		if (!kapi_validate_param_with_context(param_spec, params[i], params, param_count)) {
 			if (strncmp(spec->name, "sys_", 4) == 0) {
 				/* For syscalls, we can return EINVAL to userspace */
-				return -EINVAL;
+				ret = -EINVAL;
+				goto out;
 			}
 		}
 	}
 
-	return 0;
+out:
+	/*
+	 * Emit the exit event on the rejection path too (the wrapper
+	 * short-circuits the handler on a non-zero return), so every
+	 * kapi_syscall_enter has a matching kapi_syscall_exit.
+	 */
+	if (ret)
+		trace_kapi_syscall_exit(spec->name, ret, false);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kapi_validate_syscall_params);
 
@@ -1301,14 +1358,18 @@ EXPORT_SYMBOL_GPL(kapi_validate_return_value);
  */
 int kapi_validate_syscall_return(const struct kernel_api_spec *spec, s64 retval)
 {
+	bool valid = true;
+
 	if (!spec)
 		return 0;
 
-	/* Skip return validation if return spec was not defined */
-	if (spec->return_magic != KAPI_MAGIC_RETURN)
-		return 0;
+	/* Validate against the return spec when one was defined */
+	if (spec->return_magic == KAPI_MAGIC_RETURN)
+		valid = kapi_validate_return_value(spec, retval);
+
+	trace_kapi_syscall_exit(spec->name, retval, valid);
 
-	if (!kapi_validate_return_value(spec, retval)) {
+	if (!valid) {
 		/* Log the violation but don't change the return value */
 		pr_warn_ratelimited("KAPI: Syscall %s returned unspecified value %lld\n",
 				    spec->name, retval);
-- 
2.53.0


^ permalink raw reply related

* [PATCH v4 10/11] kernel/api: add API specification for sys_madvise
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add KAPI-annotated kerneldoc for the sys_madvise system call in
mm/madvise.c.

The specification documents parameter constraints (start, len_in,
behavior), per-behavior error conditions, lock acquisition (mmap_lock
read and write modes plus the per-VMA fast path, mmu_gather and
mmu_notifier brackets), signal handling, side effects, capability
requirements (CAP_SYS_ADMIN for MADV_HWPOISON and MADV_SOFT_OFFLINE),
mseal interaction, and the heterogeneous skip semantics across the
hint, immediate-action and destructive groups.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 mm/madvise.c | 575 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 575 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index dbb69400786d1..ed0a046e9e25b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -2032,6 +2032,581 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	return error;
 }
 
+/**
+ * sys_madvise - Give advice about use of memory
+ * @start: Starting virtual address of the range to advise on
+ * @len_in: Length of the range in bytes
+ * @behavior: Advice (a MADV_* constant) the kernel should apply to the range
+ *
+ * long-desc: Provides the kernel with advice or directions about the address
+ *   range starting at start and extending for len_in bytes. The advice is
+ *   selected by behavior, which is one of the MADV_* constants defined in
+ *   <sys/mman.h>. The semantics fall into three groups. The hint group
+ *   primarily updates VMA flags (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
+ *   MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP, MADV_WIPEONFORK,
+ *   MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE,
+ *   MADV_NOHUGEPAGE), and is itself heterogeneous: the fork-copy gates
+ *   (MADV_DONTFORK / MADV_DOFORK) and the KSM scan gates
+ *   (MADV_MERGEABLE / MADV_UNMERGEABLE) are strictly honored by their
+ *   consumers; MADV_HUGEPAGE / MADV_NOHUGEPAGE express THP eligibility
+ *   advice rather than allocation guarantees (MADV_NOHUGEPAGE blocks the
+ *   normal fault-time, MADV_COLLAPSE and khugepaged paths; MADV_HUGEPAGE
+ *   widens eligibility and increases defrag aggressiveness but does not
+ *   force allocation, which still depends on the global
+ *   transparent_hugepage= mode, VMA suitability, and allocation success);
+ *   MADV_WIPEONFORK / MADV_KEEPONFORK do not wipe at fork time but cause
+ *   the child's first access to fault in zero-filled pages;
+ *   MADV_DONTDUMP / MADV_DODUMP normally control coredump inclusion but
+ *   can be overridden by always_dump_vma() for gate, vm_ops-named or
+ *   arch-named VMAs; and MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL are
+ *   genuinely heuristic read-ahead hints. The non-destructive
+ *   immediate-action group performs work
+ *   synchronously while preserving page contents (MADV_WILLNEED, MADV_COLD,
+ *   MADV_PAGEOUT, MADV_POPULATE_READ, MADV_POPULATE_WRITE, MADV_COLLAPSE,
+ *   MADV_GUARD_REMOVE). The destructive group discards, replaces or
+ *   invalidates page contents (MADV_DONTNEED, MADV_DONTNEED_LOCKED,
+ *   MADV_FREE, MADV_REMOVE, MADV_GUARD_INSTALL, MADV_HWPOISON,
+ *   MADV_SOFT_OFFLINE). MADV_GUARD_INSTALL belongs to the destructive group
+ *   because it zaps any existing pages in the range before installing PTE
+ *   guard markers.
+ *
+ *   start must be page-aligned; len_in is rounded up to the next page
+ *   boundary internally. Once those validation checks pass, a zero-length
+ *   range succeeds without performing work. The kernel rejects ranges that
+ *   wrap (start + PAGE_ALIGN(len_in) < start) and ranges where len_in is
+ *   non-zero but rounds down to zero. Address tagging bits are stripped
+ *   from start before VMA lookup for every behavior except MADV_HWPOISON
+ *   and MADV_SOFT_OFFLINE, which receive the raw start value because they
+ *   bypass the VMA walk entirely.
+ *
+ *   The kernel return value reports whether any error condition was
+ *   encountered, not whether the requested work was performed. The
+ *   relationship between the return code and the work done varies by
+ *   handler:
+ *
+ *     - Hint behaviors update VMA flags. The flags fall into five
+ *       sub-classes by how their consumers honor them:
+ *       (a) Hard gates -- MADV_DONTFORK / MADV_DOFORK strictly gate VMA
+ *       copy in dup_mmap(); MADV_MERGEABLE / MADV_UNMERGEABLE strictly
+ *       gate whether KSM will scan the VMA at all. These take effect
+ *       immediately and cannot be overridden by other policy.
+ *       (b) THP eligibility advice -- MADV_NOHUGEPAGE blocks the normal
+ *       fault-time, MADV_COLLAPSE and khugepaged THP paths for the VMA
+ *       (driver-internal PMD insertion via insert_pmd() is the only
+ *       documented bypass). MADV_HUGEPAGE only widens THP eligibility
+ *       under the kernel's "madvise" / "except-advised" policy and
+ *       increases defrag aggressiveness; it does not force allocation,
+ *       which still depends on the global transparent_hugepage= mode
+ *       (always / madvise / never), VMA suitability, defrag GFP policy,
+ *       and allocation or memcg-charge success.
+ *       (c) Fault-on-access -- MADV_WIPEONFORK / MADV_KEEPONFORK do not
+ *       wipe pages at fork time; instead the child VMA's pages are not
+ *       copied and the child sees zero-filled pages only when it first
+ *       reads or writes them.
+ *       (d) Mostly-strict with override -- MADV_DONTDUMP / MADV_DODUMP
+ *       control coredump inclusion via VM_DONTDUMP, but always_dump_vma()
+ *       can still include gate, vm_ops-named or arch-named VMAs in the
+ *       core regardless.
+ *       (e) Heuristic -- MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL set
+ *       VM_RAND_READ / VM_SEQ_READ as read-ahead hints that the read-ahead
+ *       code weighs against other policy and may diverge from at runtime.
+ *       In all five sub-classes the requested flag bits on the VMA are set;
+ *       what differs is the strength of the resulting downstream effect.
+ *
+ *     - Walk-and-skip handlers (MADV_COLD, MADV_PAGEOUT, MADV_FREE,
+ *       MADV_GUARD_REMOVE) traverse the range and silently skip pages or
+ *       PMDs that fail per-page preconditions (absent, special, device,
+ *       shared, non-LRU, unsplittable, locked, etc.), returning 0 even
+ *       when most or all pages were skipped.
+ *
+ *     - Bulk-backend handlers delegate the requested range to a single
+ *       backend call: MADV_DONTNEED and MADV_DONTNEED_LOCKED to
+ *       zap_page_range_single_batched(), MADV_REMOVE to vfs_fallocate(),
+ *       MADV_WILLNEED on regular files to vfs_fadvise(). The backend's
+ *       return is propagated for MADV_REMOVE and discarded for
+ *       MADV_WILLNEED; DAX files short-circuit MADV_WILLNEED entirely.
+ *
+ *     - Stop-on-error handlers (MADV_POPULATE_READ, MADV_POPULATE_WRITE,
+ *       MADV_SOFT_OFFLINE) walk the range but surface the first per-page
+ *       failure as an errno (-EHWPOISON, -EFAULT, -ENOMEM, ...) rather
+ *       than skipping silently.
+ *
+ *     - Hybrid handlers combine modes: MADV_WILLNEED walks for anonymous
+ *       and shmem ranges but bulk-calls vfs_fadvise() for regular files;
+ *       MADV_COLLAPSE walks PMD-by-PMD and tracks the last scan failure
+ *       so transient skips coexist with terminal errors;
+ *       MADV_GUARD_INSTALL walks to install markers and re-walks after
+ *       zap_page_range_single() to clear pre-existing pages, retrying up
+ *       to MAX_MADVISE_GUARD_RETRIES; MADV_HWPOISON walks pages but folds
+ *       memory_failure()'s -EOPNOTSUPP back to 0.
+ *
+ *   Applications that need to know whether a specific page was acted on
+ *   must verify the result through other means (e.g. /proc/[pid]/smaps,
+ *   page faults, read-after-write).
+ *
+ *   On success, madvise() returns 0; unlike read(2) and write(2) it has no
+ *   notion of partial completion at the syscall boundary. When the range
+ *   spans multiple VMAs, the kernel applies the advice to each in turn; an
+ *   unmapped gap inside the range causes the call to return -ENOMEM after
+ *   processing the mapped portions, rather than aborting at the gap.
+ *
+ *   POSIX defines posix_madvise(3) for a portable subset (POSIX_MADV_NORMAL,
+ *   _RANDOM, _SEQUENTIAL, _WILLNEED, _DONTNEED). Linux MADV_DONTNEED is
+ *   destructive: it discards the contents of the affected anonymous pages and
+ *   subsequent reads return zero. POSIX permits but does not require
+ *   destruction, so portable code that needs the POSIX semantics should use
+ *   posix_madvise(3) instead.
+ *
+ * contexts: process, sleepable
+ *
+ * param: start
+ *   type: uint, input
+ *   constraint-type: page_aligned
+ *   cdesc: Starting virtual address of the range. Must be aligned to
+ *     PAGE_SIZE. An unaligned start always returns -EINVAL, even when
+ *     len_in is zero. Address tag bits, where supported by the architecture,
+ *     are cleared via untagged_addr() before the range is interpreted, with
+ *     the exception of MADV_HWPOISON and MADV_SOFT_OFFLINE, which receive
+ *     the raw start value because they bypass the VMA walk.
+ *
+ * param: len_in
+ *   type: uint, input
+ *   constraint-type: range(0, SIZE_MAX)
+ *   cdesc: Length of the range in bytes. Internally rounded up to a multiple
+ *     of PAGE_SIZE. A len_in of 0 is accepted and the call is a no-op that
+ *     returns 0. A non-zero len_in that rounds up to 0 (i.e. wraps around)
+ *     returns -EINVAL, as does a range whose end (start + PAGE_ALIGN(len_in))
+ *     would wrap below start.
+ *
+ * param: behavior
+ *   type: int, input
+ *   cdesc: One of the MADV_* constants from <sys/mman.h>. See the long
+ *     description above for the full list and the three semantic groups
+ *     (hint, immediate-action, destructive). Behaviors gated by Kconfig
+ *     (KSM, transparent hugepage, memory failure) return -EINVAL when the
+ *     underlying support is disabled. A few architectures (notably alpha)
+ *     renumber values; portable code should always use the symbolic names.
+ *
+ * return:
+ *   type: int
+ *   check-type: exact
+ *   success: 0
+ *   desc: On success, returns 0. On error, returns a negative error code.
+ *     There is no partial-success indication; either the entire processed
+ *     range succeeded, or an error is returned and an unspecified prefix of
+ *     the range may have been advised.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned for invalid input (unrecognised MADV_*, Kconfig-gated
+ *     behavior, unaligned start, range wrap, non-zero len_in rounding to
+ *     zero) and for per-behavior VMA-filter violations. The constraint:
+ *     blocks cover FREE, WIPEONFORK, REMOVE, COLD and PAGEOUT; inline
+ *     filters also reject DOFORK on VM_SPECIAL, KEEPONFORK on VM_DROPPABLE,
+ *     DODUMP on non-hugetlb VM_SPECIAL/VM_DROPPABLE, GUARD_* on VM_SPECIAL
+ *     or VM_HUGETLB, GUARD_INSTALL on VM_LOCKED. Also
+ *     returned by faultin_page_range() and madvise_collapse_errno().
+ *
+ * error: ENOMEM, Cannot allocate memory
+ *   desc: Some part of the requested range falls in a gap between mapped
+ *     VMAs; the kernel still applies the behavior to the mapped subranges
+ *     and only returns -ENOMEM after the walk completes. MADV_POPULATE_*
+ *     also returns -ENOMEM when the region has no VMA or when
+ *     faultin_page_range() exhausts memory. MADV_COLLAPSE returns -ENOMEM
+ *     when its struct collapse_control cannot be allocated up front, and
+ *     when madvise_collapse_errno() maps SCAN_ALLOC_HUGE_PAGE_FAIL (no
+ *     hugepage available) to -ENOMEM.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: For the VMA-flag-mutating behaviors, an internal -ENOMEM from VMA
+ *     splitting is translated to -EAGAIN before being returned to userspace,
+ *     advising the caller that a transient kernel resource shortage
+ *     prevented the update. Also returned by MADV_COLLAPSE via
+ *     madvise_collapse_errno() for transient scan failures (folio lock
+ *     contention, LRU isolation failure, dirty/writeback) where retrying
+ *     the call may succeed.
+ *
+ * error: EIO, Input/output error
+ *   desc: For MADV_REMOVE, an I/O error from the underlying filesystem's
+ *     FALLOC_FL_PUNCH_HOLE handler is propagated back as -EIO. MADV_WILLNEED
+ *     and MADV_PAGEOUT do not surface filesystem or device I/O errors:
+ *     vfs_fadvise() returns are discarded by madvise_willneed() and the
+ *     pageout walk is invoked through a void helper, so transient I/O
+ *     failures during read-ahead or page-out are silently dropped.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: Returned by MADV_WILLNEED when applied to a non-file-backed VMA
+ *     and the kernel was built without CONFIG_SWAP, so there is neither a
+ *     file to read-ahead from nor a swap device to fault from.
+ *
+ * error: EACCES, Permission denied
+ *   desc: Returned by MADV_REMOVE when the target VMA is not a writable
+ *     shared mapping (vma_is_shared_maywrite() is false). Punching a hole in
+ *     a private or read-only shared mapping is not permitted; the operation
+ *     would either be invisible to other mappers or violate file permissions.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned in two situations. First, MADV_HWPOISON and
+ *     MADV_SOFT_OFFLINE require CAP_SYS_ADMIN; the inject-error handler
+ *     refuses non-privileged callers. Second, on 64-bit kernels, a discard
+ *     operation (MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_REMOVE,
+ *     MADV_DONTFORK, MADV_WIPEONFORK, MADV_GUARD_INSTALL) is refused on a
+ *     read-only anonymous VMA that has been sealed with mseal(2), to prevent
+ *     bypassing the seal by discarding mapped data.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: Returned when a fatal signal is delivered while the call is
+ *     waiting to acquire the mmap write lock for a VMA-flag-mutating
+ *     behavior (mmap_write_lock_killable() returns -EINTR), or when
+ *     MADV_POPULATE_READ/MADV_POPULATE_WRITE is interrupted while faulting
+ *     in pages (faultin_page_range() returns -EINTR). The single-shot
+ *     madvise() syscall is not automatically restarted by the signal
+ *     framework on this path; the caller must reissue the request if
+ *     desired.
+ *
+ * error: EHWPOISON, Memory page has hardware error
+ *   desc: MADV_POPULATE_READ or MADV_POPULATE_WRITE encountered a page that
+ *     has been marked as containing a hardware-detected memory error and
+ *     could not be faulted in.
+ *
+ * error: EFAULT, Bad address
+ *   desc: MADV_POPULATE_READ or MADV_POPULATE_WRITE attempted to fault in a
+ *     page whose mapping raised VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV (for
+ *     example, a file-backed page beyond the end of the file).
+ *
+ * error: EBUSY, Device or resource busy
+ *   desc: Returned by MADV_COLLAPSE via madvise_collapse_errno() in two
+ *     specific scan-failure modes: SCAN_CGROUP_CHARGE_FAIL (the new
+ *     hugepage cannot be charged to the memory cgroup) and
+ *     SCAN_EXCEED_NONE_PTE (too many absent PTEs in the candidate range
+ *     for a synchronous collapse). Other transient collapse failures are
+ *     reported as -EAGAIN; non-transient ones as -EINVAL.
+ *
+ * lock: mm->mmap_lock (read mode)
+ *   type: rwlock
+ *   acquired: yes
+ *   released: yes
+ *   desc: Held on entry to the VMA walk for MADV_REMOVE, MADV_WILLNEED,
+ *     MADV_COLD, MADV_PAGEOUT and MADV_COLLAPSE, and as the fallback when
+ *     the per-VMA fast path declines. Several handlers drop and reacquire
+ *     this lock mid-operation: MADV_WILLNEED on a file-backed VMA and
+ *     MADV_REMOVE around their vfs_fadvise() / vfs_fallocate() callouts;
+ *     MADV_COLLAPSE on file-backed ranges around the page migration
+ *     pipeline; MADV_POPULATE_* (dispatched directly to madvise_populate())
+ *     around each faultin_page_range() call, which may itself drop the
+ *     lock internally before returning.
+ *
+ * lock: mm->mmap_lock (write mode; killable)
+ *   type: rwlock
+ *   acquired: yes
+ *   released: yes
+ *   desc: Acquired in killable write mode for behaviors that modify
+ *     vma->vm_flags or split/merge VMAs (MADV_NORMAL, MADV_RANDOM,
+ *     MADV_SEQUENTIAL, MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP,
+ *     MADV_WIPEONFORK, MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE,
+ *     MADV_HUGEPAGE, MADV_NOHUGEPAGE). If the acquisition is killed by a
+ *     fatal signal, the syscall returns -EINTR before any VMA is touched.
+ *
+ * lock: per-VMA read lock (vma->vm_lock)
+ *   type: custom
+ *   acquired: yes
+ *   released: yes
+ *   desc: Tried first for MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE,
+ *     MADV_GUARD_INSTALL and MADV_GUARD_REMOVE via lock_vma_under_rcu(). The
+ *     per-VMA path is taken only when the requested range fits within a
+ *     single VMA, the target mm is the caller's mm, the VMA is not armed
+ *     with userfaultfd, and (for behaviors that establish page tables) an
+ *     anon_vma is already attached. Otherwise the code falls back to the
+ *     mmap read lock above.
+ *
+ * lock: mmu_gather TLB batch
+ *   type: custom
+ *   acquired: yes
+ *   released: yes
+ *   desc: For MADV_DONTNEED, MADV_DONTNEED_LOCKED and MADV_FREE the syscall
+ *     wraps the per-VMA work in tlb_gather_mmu() / tlb_finish_mmu() so PTE
+ *     clearing and TLB invalidation are batched. MADV_COLD and MADV_PAGEOUT
+ *     build a short-lived gather inside the handler. MADV_GUARD_INSTALL
+ *     builds a transient gather via zap_page_range_single() each time the
+ *     retry loop has to clear pre-existing pages; if the range is already
+ *     empty no gather is built. MADV_GUARD_REMOVE never zaps and never
+ *     gathers.
+ *
+ * lock: mmu_notifier invalidate range
+ *   type: custom
+ *   acquired: yes
+ *   released: yes
+ *   desc: All zap-based paths -- MADV_DONTNEED, MADV_DONTNEED_LOCKED, the
+ *     zap branch of MADV_GUARD_INSTALL via zap_page_range_single(), and
+ *     MADV_FREE's own walk -- bracket their work with
+ *     mmu_notifier_invalidate_range_start()/_end() so secondary MMUs (KVM,
+ *     IOMMUv2, etc.) observe the page clearing.
+ *
+ * signal: Any fatal signal
+ *   direction: receive
+ *   action: return
+ *   condition: Acquiring the mmap write lock or faulting in pages for
+ *     MADV_POPULATE_*
+ *   desc: A pending fatal signal aborts mmap_write_lock_killable() (used by
+ *     the VMA-flag-mutating behaviors) and faultin_page_range() (used by
+ *     MADV_POPULATE_READ and MADV_POPULATE_WRITE), in both cases surfacing as
+ *     -EINTR to userspace. The single-shot madvise() syscall does not request
+ *     transparent restart on these paths; the caller is expected to reissue
+ *     the call if appropriate.
+ *   errno: -EINTR
+ *   timing: during
+ *   restartable: no
+ *
+ * side-effect: modify_state
+ *   target: vma->vm_flags
+ *   condition: Hint-group behaviors (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
+ *     MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP, MADV_WIPEONFORK,
+ *     MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE,
+ *     MADV_NOHUGEPAGE)
+ *   desc: Sets or clears VM_RAND_READ, VM_SEQ_READ, VM_DONTCOPY, VM_DONTDUMP,
+ *     VM_WIPEONFORK, VM_MERGEABLE or VM_HUGEPAGE on the affected VMAs and may
+ *     split or merge VMAs to apply the change to a sub-range. The change is
+ *     reversible by issuing madvise() with the inverse advice (e.g.
+ *     MADV_DOFORK undoes MADV_DONTFORK), with the caveat that the inverse
+ *     call's per-VMA filter still applies: MADV_DOFORK rejects VM_SPECIAL,
+ *     MADV_DODUMP rejects non-hugetlb VM_SPECIAL or VM_DROPPABLE, and
+ *     MADV_KEEPONFORK rejects VM_DROPPABLE; for those classes of VMA the
+ *     inverse cannot complete.
+ *   reversible: yes
+ *
+ * side-effect: free_memory | modify_state | irreversible
+ *   target: page tables and resident pages within the range
+ *   condition: MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE
+ *   desc: MADV_DONTNEED zaps PTEs, releasing the underlying pages or swap
+ *     slots so the next access faults in zero-filled anonymous pages or
+ *     re-reads the file. MADV_DONTNEED_LOCKED is identical but tolerates
+ *     VM_LOCKED. MADV_FREE marks anonymous pages lazy-freeable: clean pages
+ *     may be reclaimed under memory pressure, while writes before
+ *     reclamation cancel the lazy-free. Discarded data cannot be recovered.
+ *   reversible: no
+ *
+ * side-effect: filesystem | irreversible
+ *   target: backing file (FALLOC_FL_PUNCH_HOLE)
+ *   condition: MADV_REMOVE
+ *   desc: Calls vfs_fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) on
+ *     the backing file, deallocating the corresponding file blocks. The hole
+ *     is visible to all mappers of the file and to read(2)/write(2)
+ *     callers; subsequent reads return zero. Filesystem freeze protection,
+ *     i_rwsem and any quota/space accounting are taken by the underlying
+ *     fallocate path.
+ *   reversible: no
+ *
+ * side-effect: modify_state | schedule
+ *   target: LRU lists and page reclaim
+ *   condition: MADV_COLD, MADV_PAGEOUT
+ *   desc: MADV_COLD deactivates the affected pages, moving them to the
+ *     inactive LRU and clearing PG_referenced/PG_young so they are reclaimed
+ *     sooner under pressure. MADV_PAGEOUT additionally calls reclaim_pages()
+ *     to write dirty pages out and drop clean ones synchronously. Page data
+ *     is preserved (rereads will fault in the same content), but the I/O and
+ *     LRU bookkeeping cannot be undone.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: page tables (faultin)
+ *   condition: MADV_POPULATE_READ, MADV_POPULATE_WRITE
+ *   desc: Walks the requested range with faultin_page_range(), populating
+ *     PTEs by triggering read or write faults so subsequent accesses do not
+ *     fault. Equivalent to touching every page in the range while suppressing
+ *     SIGBUS/SIGSEGV through the syscall return value. Allocations made by
+ *     faultin are not undone on partial failure.
+ *   reversible: no
+ *
+ * side-effect: modify_state | schedule
+ *   target: transparent hugepage layout
+ *   condition: MADV_COLLAPSE
+ *   desc: Synchronously coalesces base pages in the range into a PMD-sized
+ *     transparent hugepage when the mapping permits. Performs the same page
+ *     migration and zeroing that khugepaged would do asynchronously; the
+ *     range's data is preserved across the collapse.
+ *   reversible: no
+ *
+ * side-effect: free_memory | modify_state | irreversible
+ *   target: PTE marker (PTE_MARKER_GUARD)
+ *   condition: MADV_GUARD_INSTALL, MADV_GUARD_REMOVE
+ *   desc: MADV_GUARD_INSTALL installs PTE_MARKER_GUARD entries that cause
+ *     subsequent accesses to deliver SIGSEGV without consuming physical
+ *     memory; existing pages already mapped in the range are zapped via
+ *     zap_page_range_single() before the markers are installed, so any
+ *     prior contents are lost. MADV_GUARD_REMOVE clears the markers but
+ *     does not (and cannot) restore zapped data.
+ *   reversible: no
+ *
+ * side-effect: hardware | irreversible
+ *   target: physical page (memory_failure / soft_offline_page)
+ *   condition: MADV_HWPOISON, MADV_SOFT_OFFLINE
+ *   desc: MADV_HWPOISON marks the affected pages as containing an
+ *     unrecoverable hardware error using the same machine-check path that
+ *     real ECC failures take; MADV_SOFT_OFFLINE migrates the contents off
+ *     the affected pages and removes them from the buddy allocator. Both
+ *     paths affect physical memory bookkeeping kernel-wide and cannot be
+ *     undone without a reboot. Intended for testing the memory-failure
+ *     pipeline; restricted to CAP_SYS_ADMIN.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: KSM merge state (vm_flags & VM_MERGEABLE)
+ *   condition: MADV_MERGEABLE, MADV_UNMERGEABLE
+ *   desc: Toggles the VMA's eligibility for the kernel same-page merger.
+ *     Enabling merging may later cause identical anonymous pages to be
+ *     replaced by shared, write-protected copies; disabling merging tears
+ *     any existing merges down lazily. The flag toggle itself is reversible
+ *     by issuing the inverse advice.
+ *   reversible: yes
+ *
+ * side-effect: modify_state
+ *   target: userfaultfd event queue
+ *   condition: MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE, MADV_REMOVE
+ *     on a userfaultfd-armed VMA
+ *   desc: Generates a UFFD_EVENT_REMOVE notification covering the discarded
+ *     range so userfaultfd monitors observing the mapping see the
+ *     invalidation. The event is queued before the discard takes effect; the
+ *     monitor cannot veto it.
+ *   reversible: no
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: perform_operation
+ *   allows: Inject memory errors via MADV_HWPOISON or MADV_SOFT_OFFLINE
+ *   without: Both behaviors return -EPERM
+ *   condition: Checked at entry to madvise_inject_error() before any pages
+ *     are looked up
+ *
+ * constraint: Page-aligned start
+ *   desc: start must lie on a page boundary; otherwise the call returns
+ *     -EINVAL before any VMA is consulted.
+ *   expr: (start & (PAGE_SIZE - 1)) == 0
+ *
+ * constraint: Length rounded up to PAGE_SIZE
+ *   desc: The effective range length is PAGE_ALIGN(len_in). A non-zero len_in
+ *     that overflows during rounding, or a (start, end) range that wraps,
+ *     is rejected with -EINVAL.
+ *   expr: end = start + PAGE_ALIGN(len_in); end >= start
+ *
+ * constraint: Behavior must be supported
+ *   desc: behavior must be one of the MADV_* values listed under the
+ *     behavior parameter. Behaviors gated by Kconfig (KSM, THP, memory
+ *     failure) are rejected with -EINVAL when the corresponding option is
+ *     disabled in the running kernel.
+ *
+ * constraint: mseal-protected discards
+ *   desc: On 64-bit kernels, a discard operation (FREE, DONTNEED,
+ *     DONTNEED_LOCKED, REMOVE, DONTFORK, WIPEONFORK, GUARD_INSTALL) against
+ *     a sealed anonymous VMA is rejected unless the mapping is currently
+ *     writable -- both VM_WRITE in vm_flags and arch_vma_access_permitted()
+ *     allowing write -- so that mseal(2) cannot be bypassed by instructing
+ *     the kernel to throw the data away. File-backed sealed VMAs and
+ *     writable sealed VMAs are not subject to this restriction.
+ *   expr: !is_discard(behavior) || !vma_is_sealed(vma) ||
+ *     !vma_is_anonymous(vma) || ((vma->vm_flags & VM_WRITE) &&
+ *     arch_vma_access_permitted(vma, true, false, false))
+ *
+ * constraint: MADV_FREE requires anonymous mappings
+ *   desc: MADV_FREE is defined only over anonymous mappings; the handler
+ *     rejects file-backed VMAs with -EINVAL.
+ *   expr: vma_is_anonymous(vma)
+ *
+ * constraint: MADV_WIPEONFORK requires private anonymous mappings
+ *   desc: MADV_WIPEONFORK rejects file-backed mappings and shared anonymous
+ *     mappings; only MAP_PRIVATE anonymous VMAs accept it. Both rejections
+ *     surface as -EINVAL.
+ *   expr: !vma->vm_file && !(vma->vm_flags & VM_SHARED)
+ *
+ * constraint: MADV_REMOVE requires a writable shared file mapping
+ *   desc: MADV_REMOVE rejects VM_LOCKED VMAs, VMAs without an associated
+ *     file/mapping/host inode, and non-shared-writable mappings. The first
+ *     two cases return -EINVAL; a private or read-only shared mapping
+ *     returns -EACCES.
+ *   expr: !(vma->vm_flags & VM_LOCKED) && vma->vm_file &&
+ *     vma->vm_file->f_mapping && vma->vm_file->f_mapping->host &&
+ *     vma_is_shared_maywrite(vma)
+ *
+ * constraint: MADV_COLD / MADV_PAGEOUT VMA filter
+ *   desc: Both behaviors require LRU-managed pages; they reject VMAs that
+ *     are mlocked, raw-PFN or hugetlb.
+ *   expr: !(vma->vm_flags & (VM_LOCKED | VM_PFNMAP | VM_HUGETLB))
+ *
+ * examples: madvise(p, len, MADV_SEQUENTIAL);  // set VM_SEQ_READ on the VMA
+ *   madvise(p, len, MADV_POPULATE_WRITE);  // prefault writable PTEs
+ *   madvise(p, len, MADV_DONTNEED);        // discard anonymous pages
+ *   madvise(p, len, MADV_GUARD_INSTALL);   // install SIGSEGV guard pages
+ *
+ * notes: madvise(2) reports only whether an error condition was
+ *   encountered, not whether the requested work was performed. The hint
+ *   group sets VMA flags whose downstream strictness varies:
+ *   MADV_DONTFORK / MADV_DOFORK and MADV_MERGEABLE / MADV_UNMERGEABLE are
+ *   hard gates honored by fork-copy and KSM scanning respectively;
+ *   MADV_NOHUGEPAGE is a hard gate against the normal user-visible THP
+ *   paths but MADV_HUGEPAGE is eligibility/advice that does not force
+ *   THP installation -- the global transparent_hugepage= mode, VMA
+ *   suitability and allocation success still apply; MADV_WIPEONFORK /
+ *   MADV_KEEPONFORK take effect at the child's first page access
+ *   (zero-on-fault), not at fork time; MADV_DONTDUMP / MADV_DODUMP gate
+ *   coredump inclusion but can be overridden by always_dump_vma();
+ *   MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL are heuristic read-ahead
+ *   hints that the read-ahead code may weigh against other policy. The
+ *   non-hint behaviors
+ *   are not uniform: walk-and-skip handlers (COLD, PAGEOUT, FREE,
+ *   GUARD_REMOVE) silently skip pages that fail per-page preconditions;
+ *   bulk-backend handlers (DONTNEED, DONTNEED_LOCKED, REMOVE, and
+ *   WILLNEED on regular files) delegate the range to a single backend
+ *   call whose return is propagated for REMOVE and discarded for
+ *   WILLNEED; stop-on-error handlers (POPULATE_READ, POPULATE_WRITE,
+ *   SOFT_OFFLINE) surface the first per-page failure rather than
+ *   skipping; and hybrid handlers (WILLNEED for anon/shmem, COLLAPSE,
+ *   GUARD_INSTALL, HWPOISON) mix walking with bulk backends, retry/zap
+ *   loops or selective error suppression. A successful return therefore
+ *   guarantees only that no error was raised in the handler that ran,
+ *   not that every page was processed.
+ *
+ *   Behavior introduction history (mainline): MADV_FREE in 4.5,
+ *   MADV_WIPEONFORK / MADV_KEEPONFORK in 4.14, MADV_COLD / MADV_PAGEOUT in
+ *   5.4, MADV_POPULATE_READ / MADV_POPULATE_WRITE in 5.14,
+ *   MADV_DONTNEED_LOCKED in 5.18, MADV_COLLAPSE in 6.1, MADV_GUARD_INSTALL /
+ *   MADV_GUARD_REMOVE in 6.13. Code that wants to remain portable to older
+ *   kernels must handle -EINVAL gracefully and fall back.
+ *
+ *   process_madvise(2) extends the same set of advices to another process
+ *   identified by a pidfd. When the target mm is the caller's own (the
+ *   pidfd refers to the caller), any locally-supported MADV_* value is
+ *   accepted. When the target is a different mm, the behavior must be in
+ *   the non-destructive remote subset (MADV_COLD, MADV_PAGEOUT,
+ *   MADV_WILLNEED, MADV_COLLAPSE) or the call returns -EINVAL, and the
+ *   caller must hold CAP_SYS_NICE.
+ *
+ *   The discard subset (MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED,
+ *   MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK, MADV_GUARD_INSTALL) is
+ *   refused on non-writable anonymous VMAs sealed with mseal(2) on 64-bit
+ *   kernels. mseal(2) does not provide an unseal operation, so applications
+ *   that need to retain the ability to discard such pages must keep the
+ *   mapping writable (and arch-accessible for write) or refrain from
+ *   sealing it.
+ *
+ *   On a userfaultfd-armed VMA, all four destructive discards (DONTNEED,
+ *   DONTNEED_LOCKED, FREE, REMOVE) emit a UFFD_EVENT_REMOVE event; the
+ *   per-VMA fast path is bypassed and the syscall falls back to the
+ *   heavier mmap_read_lock so the userfaultfd monitor is consulted before
+ *   the discard takes effect.
+ *
+ *   MADV_GUARD_INSTALL retries up to MAX_MADVISE_GUARD_RETRIES (3) times
+ *   when it loses races with concurrent faulting or khugepaged. If those
+ *   retries are exhausted the handler returns -ERESTARTNOINTR via
+ *   restart_syscall(); the kernel's syscall return path treats this as a
+ *   transparent restart of madvise() and re-enters the call with the same
+ *   arguments. The transparent restart is unconditional and is not driven
+ *   by signal delivery, so the caller never observes an errno from this
+ *   path and the call appears to make eventual forward progress.
+ *   anon_vma_prepare() failures inside MADV_GUARD_INSTALL bypass the
+ *   ENOMEM-to-EAGAIN translation that applies to the VMA-flag-mutating
+ *   behaviors and surface as -ENOMEM directly.
+ *
+ *   Architecture note: alpha defines MADV_DONTNEED as 6 (not 4) and reserves
+ *   MADV_SPACEAVAIL=5; portable code must use the symbolic names from
+ *   <sys/mman.h>.
+ */
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
 	return do_madvise(current->mm, start, len_in, behavior);
-- 
2.53.0


^ permalink raw reply related

* [PATCH v4 09/11] kernel/api: add runtime verification selftest
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add a selftest for CONFIG_KAPI_RUNTIME_CHECKS that exercises
sys_open/sys_read/sys_write/sys_close through raw syscall() and
verifies KAPI pre-validation catches invalid parameters while
allowing valid operations through.

Test cases (TAP output):
  1-4: Valid open/read/write/close succeed
  5-7: Invalid flags, mode bits, NULL path rejected with EINVAL
  8:   dmesg contains expected KAPI warning strings

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 MAINTAINERS                                   |    1 +
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/kapi/Makefile         |    7 +
 tools/testing/selftests/kapi/kapi_test_util.h |   33 +
 tools/testing/selftests/kapi/test_kapi.c      | 1096 +++++++++++++++++
 5 files changed, 1138 insertions(+)
 create mode 100644 tools/testing/selftests/kapi/Makefile
 create mode 100644 tools/testing/selftests/kapi/kapi_test_util.h
 create mode 100644 tools/testing/selftests/kapi/test_kapi.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0d14205077908..ddfd9cad98916 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13826,6 +13826,7 @@ F:	include/linux/kernel_api_spec.h
 F:	kernel/api/
 F:	tools/kapi/
 F:	tools/lib/python/kdoc/kdoc_apispec.py
+F:	tools/testing/selftests/kapi/
 
 KERNEL AUTOMOUNTER
 M:	Ian Kent <raven@themaw.net>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 450f13ba4cca9..7881bec5aafe1 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -48,6 +48,7 @@ TARGETS += intel_pstate
 TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
+TARGETS += kapi
 TARGETS += kcmp
 TARGETS += kexec
 TARGETS += kselftest_harness
diff --git a/tools/testing/selftests/kapi/Makefile b/tools/testing/selftests/kapi/Makefile
new file mode 100644
index 0000000000000..32a750901b111
--- /dev/null
+++ b/tools/testing/selftests/kapi/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_GEN_PROGS := test_kapi
+
+CFLAGS += -static -Wall -Wextra -Werror -O2 $(KHDR_INCLUDES)
+
+include ../lib.mk
diff --git a/tools/testing/selftests/kapi/kapi_test_util.h b/tools/testing/selftests/kapi/kapi_test_util.h
new file mode 100644
index 0000000000000..e097c370542ad
--- /dev/null
+++ b/tools/testing/selftests/kapi/kapi_test_util.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Sasha Levin <sashal@kernel.org>
+ *
+ * Compatibility helpers for KAPI selftests.
+ *
+ * __NR_open is not defined on aarch64 and riscv64 (only __NR_openat exists).
+ * Provide a wrapper that uses __NR_openat with AT_FDCWD to achieve the same
+ * behavior as __NR_open on architectures that lack it.
+ */
+#ifndef KAPI_TEST_UTIL_H
+#define KAPI_TEST_UTIL_H
+
+#include <fcntl.h>
+#include <sys/syscall.h>
+
+#ifndef __NR_open
+/*
+ * On architectures without __NR_open (e.g., aarch64, riscv64),
+ * use openat(AT_FDCWD, ...) which is equivalent.
+ */
+static inline long kapi_sys_open(const char *pathname, int flags, int mode)
+{
+	return syscall(__NR_openat, AT_FDCWD, pathname, flags, mode);
+}
+#else
+static inline long kapi_sys_open(const char *pathname, int flags, int mode)
+{
+	return syscall(__NR_open, pathname, flags, mode);
+}
+#endif
+
+#endif /* KAPI_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kapi/test_kapi.c b/tools/testing/selftests/kapi/test_kapi.c
new file mode 100644
index 0000000000000..a6b7576f95c3e
--- /dev/null
+++ b/tools/testing/selftests/kapi/test_kapi.c
@@ -0,0 +1,1096 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2026 Sasha Levin <sashal@kernel.org>
+ *
+ * Userspace selftest for KAPI runtime verification of syscall parameters.
+ *
+ * Exercises sys_open, sys_read, sys_write, and sys_close through raw
+ * syscall() to ensure KAPI pre-validation wrappers interact correctly
+ * with normal kernel error handling.
+ *
+ * Requires CONFIG_KAPI_RUNTIME_CHECKS=y for full coverage; many tests
+ * also pass without it.
+ *
+ * TAP output format.
+ */
+
+#define _GNU_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <linux/limits.h>
+#include "../kselftest.h"
+#include "kapi_test_util.h"
+
+#define NUM_TESTS 29
+
+/*
+ * Set from the SIGPIPE handler. `volatile sig_atomic_t` is the POSIX-
+ * mandated type for flags touched by async-signal-safe handlers;
+ * checkpatch's generic "volatile considered harmful" warning targets
+ * kernel code and does not apply here.
+ */
+static volatile sig_atomic_t got_sigpipe;
+
+/*
+ * The tap_* helpers are thin wrappers around ksft_test_result_* so the
+ * rest of this file reads like the original author wrote it, while the
+ * output goes through the shared kselftest harness.
+ */
+static void tap_ok(const char *desc)
+{
+	ksft_test_result_pass("%s\n", desc);
+}
+
+static void tap_fail(const char *desc, const char *reason)
+{
+	ksft_test_result_fail("%s: %s\n", desc, reason);
+}
+
+static void tap_skip(const char *desc, const char *reason)
+{
+	ksft_test_result_skip("%s: %s\n", desc, reason);
+}
+
+/*
+ * Return true when the kernel provides the kapi runtime-check surface.
+ * Tests that rely on KAPI rejecting bad parameters pre-call should be
+ * skipped on kernels without it, not reported as failures.
+ */
+static bool kapi_runtime_checks_active(void)
+{
+	struct stat st;
+
+	return stat("/sys/kernel/debug/kapi", &st) == 0 && S_ISDIR(st.st_mode);
+}
+
+static void sigpipe_handler(int sig)
+{
+	(void)sig;
+	got_sigpipe = 1;
+}
+
+/* ---- Valid operation tests ---- */
+
+/*
+ * Test 1: open a readable file
+ * Returns fd on success.
+ */
+static int test_open_valid(void)
+{
+	errno = 0;
+	long fd = kapi_sys_open("/etc/hostname", O_RDONLY, 0);
+
+	if (fd >= 0) {
+		tap_ok("open valid file");
+	} else {
+		/* /etc/hostname might not exist; try /etc/passwd */
+		errno = 0;
+		fd = kapi_sys_open("/etc/passwd", O_RDONLY, 0);
+		if (fd >= 0)
+			tap_ok("open valid file (fallback /etc/passwd)");
+		else
+			tap_fail("open valid file", strerror(errno));
+	}
+	return (int)fd;
+}
+
+/*
+ * Test 2: read from fd
+ */
+static void test_read_valid(int fd)
+{
+	char buf[256];
+
+	errno = 0;
+	long ret = syscall(__NR_read, fd, buf, sizeof(buf));
+
+	if (ret > 0)
+		tap_ok("read from valid fd");
+	else if (ret == 0)
+		tap_ok("read from valid fd (EOF)");
+	else
+		tap_fail("read from valid fd", strerror(errno));
+}
+
+/*
+ * Test 3: write to /dev/null
+ */
+static void test_write_valid(void)
+{
+	errno = 0;
+	long devnull = kapi_sys_open("/dev/null", O_WRONLY, 0);
+
+	if (devnull < 0) {
+		tap_fail("write to /dev/null (open failed)", strerror(errno));
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)devnull, "hello", 5);
+
+	if (ret == 5)
+		tap_ok("write to /dev/null");
+	else
+		tap_fail("write to /dev/null",
+			 ret < 0 ? strerror(errno) : "short write");
+
+	syscall(__NR_close, (int)devnull);
+}
+
+/*
+ * Test 4: close fd
+ */
+static void test_close_valid(int fd)
+{
+	errno = 0;
+	long ret = syscall(__NR_close, fd);
+
+	if (ret == 0)
+		tap_ok("close valid fd");
+	else
+		tap_fail("close valid fd", strerror(errno));
+}
+
+/* ---- KAPI parameter rejection tests ---- */
+
+/*
+ * Test 5: open with invalid flag bits
+ * 0x10000000 is outside the valid O_* mask, KAPI should reject.
+ */
+static void test_open_invalid_flags(void)
+{
+	long ret;
+
+	if (!kapi_runtime_checks_active()) {
+		tap_skip("open with invalid flags",
+			 "CONFIG_KAPI_RUNTIME_CHECKS not enabled");
+		return;
+	}
+
+	errno = 0;
+	/*
+	 * Use /dev/null (always present on any sane rootfs) so KAPI's flag
+	 * validation is reached before a path-lookup ENOENT can mask it.
+	 * 0x10000000 is outside the valid O_* mask.
+	 */
+	ret = kapi_sys_open("/dev/null", 0x10000000, 0);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with invalid flags returns EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with invalid flags", "expected EINVAL, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with invalid flags", msg);
+	}
+}
+
+/*
+ * Test 6: open with invalid mode bits
+ * 0xFFFF has bits outside S_IALLUGO (07777), KAPI should reject.
+ */
+static void test_open_invalid_mode(void)
+{
+	long ret;
+
+	if (!kapi_runtime_checks_active()) {
+		tap_skip("open with invalid mode",
+			 "CONFIG_KAPI_RUNTIME_CHECKS not enabled");
+		return;
+	}
+
+	errno = 0;
+	ret = kapi_sys_open("/tmp/kapi_test_mode",
+			    O_CREAT | O_WRONLY | O_EXCL, 0xFFFF);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with invalid mode returns EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with invalid mode", "expected EINVAL, got success");
+		syscall(__NR_close, (int)ret);
+		unlink("/tmp/kapi_test_mode");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with invalid mode", msg);
+	}
+}
+
+/*
+ * Test 7: open with NULL path
+ * KAPI USER_PATH constraint should reject NULL.
+ */
+static void test_open_null_path(void)
+{
+	errno = 0;
+	long ret = kapi_sys_open(NULL, O_RDONLY, 0);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with NULL path returns EINVAL");
+	} else if (ret == -1 && errno == EFAULT) {
+		/* Kernel may catch this as EFAULT before KAPI */
+		tap_ok("open with NULL path returns EFAULT (acceptable)");
+	} else if (ret >= 0) {
+		tap_fail("open with NULL path", "expected error, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "got %s", strerror(errno));
+		tap_fail("open with NULL path", msg);
+	}
+}
+
+/*
+ * Test 8: open with flag bit 30 set (0x40000000)
+ * This bit is outside the valid O_* mask, KAPI should reject with EINVAL.
+ */
+static void test_open_flag_bit30(void)
+{
+	long ret;
+
+	if (!kapi_runtime_checks_active()) {
+		tap_skip("open with flag bit 30 (0x40000000) returns EINVAL",
+			 "CONFIG_KAPI_RUNTIME_CHECKS not enabled");
+		return;
+	}
+
+	errno = 0;
+	ret = kapi_sys_open("/dev/null", 0x40000000, 0);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with flag bit 30 (0x40000000) returns EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with flag bit 30 (0x40000000) returns EINVAL",
+			 "expected EINVAL, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with flag bit 30 (0x40000000) returns EINVAL",
+			 msg);
+	}
+}
+
+/* ---- Boundary condition and error path tests ---- */
+
+/*
+ * Test 9: read with fd=-1 should return an error.
+ * With CONFIG_KAPI_RUNTIME_CHECKS=y, KAPI validates the fd first and
+ * rejects negative fds (other than AT_FDCWD) with EINVAL.  Without
+ * KAPI, the kernel returns EBADF.  Accept either.
+ */
+static void test_read_bad_fd(void)
+{
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, -1, buf, sizeof(buf));
+
+	if (ret == -1 && (errno == EBADF || errno == EINVAL)) {
+		tap_ok("read with fd=-1 returns error");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF/EINVAL, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read with fd=-1 returns error", msg);
+	}
+}
+
+/*
+ * Test 10: read with count=0 should return 0
+ */
+static void test_read_zero_count(void)
+{
+	char buf[1];
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("read with count=0 returns 0",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_read, (int)fd, buf, 0);
+
+	if (ret == 0) {
+		tap_ok("read with count=0 returns 0");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, strerror(errno));
+		tap_fail("read with count=0 returns 0", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 11: write with count=0 should return 0
+ */
+static void test_write_zero_count(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("write with count=0 returns 0",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, "x", 0);
+
+	if (ret == 0) {
+		tap_ok("write with count=0 returns 0");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, strerror(errno));
+		tap_fail("write with count=0 returns 0", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 12: open with a path longer than PATH_MAX should fail
+ * Expect ENAMETOOLONG or EINVAL.
+ */
+static void test_open_long_path(void)
+{
+	char *longpath;
+	size_t len = PATH_MAX + 256;
+
+	longpath = malloc(len);
+	if (!longpath) {
+		tap_fail("open with path > PATH_MAX", "malloc failed");
+		return;
+	}
+
+	memset(longpath, 'A', len - 1);
+	longpath[0] = '/';
+	longpath[len - 1] = '\0';
+
+	errno = 0;
+	long ret = kapi_sys_open(longpath, O_RDONLY, 0);
+
+	if (ret == -1 && (errno == ENAMETOOLONG || errno == EINVAL)) {
+		tap_ok("open with path > PATH_MAX returns ENAMETOOLONG/EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with path > PATH_MAX",
+			 "expected error, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg),
+			 "expected ENAMETOOLONG/EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with path > PATH_MAX", msg);
+	}
+
+	free(longpath);
+}
+
+/*
+ * Test 13: read with unmapped user pointer should return EFAULT or EINVAL.
+ * Use a pipe with data so the kernel actually tries to copy to the buffer.
+ */
+static void test_read_unmapped_buf(void)
+{
+	int pipefd[2];
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("read with unmapped buffer returns EFAULT/EINVAL",
+			 "pipe() failed");
+		return;
+	}
+
+	/* Write some data so read has something to copy */
+	(void)write(pipefd[1], "hello", 5);
+
+	errno = 0;
+	long ret = syscall(__NR_read, pipefd[0], (void *)0xDEAD0000, 16);
+
+	if (ret == -1 && (errno == EFAULT || errno == EINVAL)) {
+		tap_ok("read with unmapped buffer returns EFAULT/EINVAL");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg),
+			 "expected EFAULT/EINVAL, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read with unmapped buffer returns EFAULT/EINVAL",
+			 msg);
+	}
+
+	close(pipefd[0]);
+	close(pipefd[1]);
+}
+
+/*
+ * Test 14: write with unmapped user pointer should return EFAULT or EINVAL.
+ * Use a pipe so the kernel actually tries to copy from the buffer.
+ */
+static void test_write_unmapped_buf(void)
+{
+	int pipefd[2];
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("write with unmapped buffer returns EFAULT/EINVAL",
+			 "pipe() failed");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, pipefd[1], (void *)0xDEAD0000, 16);
+
+	if (ret == -1 && (errno == EFAULT || errno == EINVAL)) {
+		tap_ok("write with unmapped buffer returns EFAULT/EINVAL");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg),
+			 "expected EFAULT/EINVAL, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("write with unmapped buffer returns EFAULT/EINVAL",
+			 msg);
+	}
+
+	close(pipefd[0]);
+	close(pipefd[1]);
+}
+
+/*
+ * Test 15: close an already-closed fd should return EBADF
+ */
+static void test_close_already_closed(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("close already-closed fd returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	/* Close it once - should succeed */
+	syscall(__NR_close, (int)fd);
+
+	/* Close it again - should fail with EBADF */
+	errno = 0;
+	long ret = syscall(__NR_close, (int)fd);
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("close already-closed fd returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret == 0 ? "success" : strerror(errno));
+		tap_fail("close already-closed fd returns EBADF", msg);
+	}
+}
+
+/*
+ * Test 16: open /dev/null with O_RDONLY|O_CLOEXEC should succeed
+ */
+static void test_open_valid_cloexec(void)
+{
+	errno = 0;
+	long fd = kapi_sys_open("/dev/null", O_RDONLY | O_CLOEXEC, 0);
+
+	if (fd >= 0) {
+		tap_ok("open /dev/null with O_RDONLY|O_CLOEXEC succeeds");
+		syscall(__NR_close, (int)fd);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected success, got %s",
+			 strerror(errno));
+		tap_fail("open /dev/null with O_RDONLY|O_CLOEXEC succeeds",
+			 msg);
+	}
+}
+
+/*
+ * Test 17: write 0 bytes to /dev/null should return 0
+ */
+static void test_write_zero_devnull(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("write 0 bytes to /dev/null returns 0",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, "", 0);
+
+	if (ret == 0) {
+		tap_ok("write 0 bytes to /dev/null returns 0");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, strerror(errno));
+		tap_fail("write 0 bytes to /dev/null returns 0", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 18: read from a write-only fd should return EBADF
+ */
+static void test_read_writeonly_fd(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("read from write-only fd returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, (int)fd, buf, sizeof(buf));
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("read from write-only fd returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read from write-only fd returns EBADF", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 19: write to a read-only fd should return EBADF
+ */
+static void test_write_readonly_fd(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("write to read-only fd returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, "hello", 5);
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("write to read-only fd returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("write to read-only fd returns EBADF", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 20: close fd 9999 (likely invalid) should return EBADF
+ */
+static void test_close_fd_9999(void)
+{
+	errno = 0;
+	long ret = syscall(__NR_close, 9999);
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("close fd 9999 returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret == 0 ? "success" : strerror(errno));
+		tap_fail("close fd 9999 returns EBADF", msg);
+	}
+}
+
+/*
+ * Test 21: read from pipe after write end is closed returns 0 (EOF)
+ */
+static void test_read_closed_pipe(void)
+{
+	int pipefd[2];
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("read from closed pipe returns 0 (EOF)",
+			 "pipe() failed");
+		return;
+	}
+
+	/* Close write end */
+	close(pipefd[1]);
+
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, pipefd[0], buf, sizeof(buf));
+
+	if (ret == 0) {
+		tap_ok("read from closed pipe returns 0 (EOF)");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, ret < 0 ? strerror(errno) : "n/a");
+		tap_fail("read from closed pipe returns 0 (EOF)", msg);
+	}
+
+	close(pipefd[0]);
+}
+
+/*
+ * Test 22: write to pipe after read end is closed returns EPIPE + SIGPIPE
+ */
+static void test_write_closed_pipe(void)
+{
+	int pipefd[2];
+	struct sigaction sa, old_sa;
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("write to closed pipe returns EPIPE + SIGPIPE",
+			 "pipe() failed");
+		return;
+	}
+
+	/* Install SIGPIPE handler */
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = sigpipe_handler;
+	sigemptyset(&sa.sa_mask);
+	sigaction(SIGPIPE, &sa, &old_sa);
+
+	got_sigpipe = 0;
+
+	/* Close read end */
+	close(pipefd[0]);
+
+	errno = 0;
+	long ret = syscall(__NR_write, pipefd[1], "hello", 5);
+
+	if (ret == -1 && errno == EPIPE && got_sigpipe) {
+		tap_ok("write to closed pipe returns EPIPE + SIGPIPE");
+	} else if (ret == -1 && errno == EPIPE) {
+		tap_ok("write to closed pipe returns EPIPE (SIGPIPE not caught)");
+	} else {
+		char msg[128];
+
+		snprintf(msg, sizeof(msg),
+			 "expected EPIPE, got %s (sigpipe=%d)",
+			 ret >= 0 ? "success" : strerror(errno),
+			 (int)got_sigpipe);
+		tap_fail("write to closed pipe returns EPIPE + SIGPIPE", msg);
+	}
+
+	/* Restore SIGPIPE handler */
+	sigaction(SIGPIPE, &old_sa, NULL);
+	close(pipefd[1]);
+}
+
+/*
+ * Test 23: open with O_DIRECTORY on a regular file returns ENOTDIR
+ */
+static void test_open_directory_on_file(void)
+{
+	errno = 0;
+	long ret = kapi_sys_open("/dev/null", O_RDONLY | O_DIRECTORY, 0);
+
+	if (ret == -1 && errno == ENOTDIR) {
+		tap_ok("open O_DIRECTORY on regular file returns ENOTDIR");
+	} else if (ret >= 0) {
+		tap_fail("open O_DIRECTORY on regular file",
+			 "expected ENOTDIR, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected ENOTDIR, got %s",
+			 strerror(errno));
+		tap_fail("open O_DIRECTORY on regular file", msg);
+	}
+}
+
+/*
+ * Test 24: open nonexistent file without O_CREAT returns ENOENT
+ */
+static void test_open_nonexistent(void)
+{
+	errno = 0;
+	long ret = kapi_sys_open("/tmp/kapi_nonexistent_file_12345",
+				 O_RDONLY, 0);
+
+	if (ret == -1 && errno == ENOENT) {
+		tap_ok("open nonexistent file without O_CREAT returns ENOENT");
+	} else if (ret >= 0) {
+		tap_fail("open nonexistent file",
+			 "expected ENOENT, got success (file exists?)");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected ENOENT, got %s",
+			 strerror(errno));
+		tap_fail("open nonexistent file", msg);
+	}
+}
+
+/*
+ * Test 25: close stdin (fd 0) should succeed
+ * We dup it first so we can restore it.
+ */
+static void test_close_stdin(void)
+{
+	int saved_stdin = dup(0);
+
+	if (saved_stdin < 0) {
+		tap_fail("close stdin succeeds", "cannot dup stdin");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_close, 0);
+
+	if (ret == 0) {
+		tap_ok("close stdin (fd 0) succeeds");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected success, got %s",
+			 strerror(errno));
+		tap_fail("close stdin (fd 0) succeeds", msg);
+	}
+
+	/* Restore stdin */
+	dup2(saved_stdin, 0);
+	close(saved_stdin);
+}
+
+/*
+ * Test 26: read after close returns EBADF
+ */
+static void test_read_after_close(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("read after close returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	syscall(__NR_close, (int)fd);
+
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, (int)fd, buf, sizeof(buf));
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("read after close returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read after close returns EBADF", msg);
+	}
+}
+
+/*
+ * Test 27: write with large count
+ * Without KAPI: the kernel clamps count to MAX_RW_COUNT and succeeds.
+ * With KAPI: KAPI validates the buffer against the count and may
+ * return EFAULT/EINVAL since the buffer is smaller than count.
+ * Accept either success or EFAULT/EINVAL.
+ */
+static void test_write_large_count(void)
+{
+	long fd;
+	char buf[64] = "test data";
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("write with large count handled correctly",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, buf, (size_t)0x7ffff000UL);
+
+	if (ret > 0) {
+		tap_ok("write with large count succeeds (clamped, no KAPI)");
+	} else if (ret == -1 && (errno == EFAULT || errno == EINVAL)) {
+		tap_ok("write with large count returns EFAULT/EINVAL (KAPI validates buffer)");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected success or EFAULT, got %s",
+			 ret == 0 ? "zero" : strerror(errno));
+		tap_fail("write with large count handled correctly", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/* ---- Integration tests ---- */
+
+/*
+ * Test 28: full normal syscall path - open, read, write, close
+ * Verify KAPI does not interfere with normal operations.
+ */
+static void test_normal_path(void)
+{
+	long rd_fd, wr_fd;
+	char buf[128];
+	int ok = 1;
+	char reason[128] = "";
+
+	/* Open a readable file */
+	errno = 0;
+	rd_fd = kapi_sys_open("/etc/hostname", O_RDONLY, 0);
+	if (rd_fd < 0) {
+		errno = 0;
+		rd_fd = kapi_sys_open("/etc/passwd", O_RDONLY, 0);
+	}
+	if (rd_fd < 0) {
+		snprintf(reason, sizeof(reason), "open readable file: %s",
+			 strerror(errno));
+		ok = 0;
+	}
+
+	/* Read from it */
+	if (ok) {
+		errno = 0;
+		long n = syscall(__NR_read, (int)rd_fd, buf, sizeof(buf));
+
+		if (n < 0) {
+			snprintf(reason, sizeof(reason), "read: %s",
+				 strerror(errno));
+			ok = 0;
+		}
+	}
+
+	/* Open /dev/null for writing */
+	wr_fd = -1;
+	if (ok) {
+		errno = 0;
+		wr_fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+		if (wr_fd < 0) {
+			snprintf(reason, sizeof(reason),
+				 "open /dev/null: %s", strerror(errno));
+			ok = 0;
+		}
+	}
+
+	/* Write to /dev/null */
+	if (ok) {
+		errno = 0;
+		long n = syscall(__NR_write, (int)wr_fd, "test", 4);
+
+		if (n != 4) {
+			snprintf(reason, sizeof(reason), "write: %s",
+				 n < 0 ? strerror(errno) : "short write");
+			ok = 0;
+		}
+	}
+
+	/* Close both fds */
+	if (rd_fd >= 0) {
+		errno = 0;
+		if (syscall(__NR_close, (int)rd_fd) != 0 && ok) {
+			snprintf(reason, sizeof(reason), "close read fd: %s",
+				 strerror(errno));
+			ok = 0;
+		}
+	}
+
+	if (wr_fd >= 0) {
+		errno = 0;
+		if (syscall(__NR_close, (int)wr_fd) != 0 && ok) {
+			snprintf(reason, sizeof(reason), "close write fd: %s",
+				 strerror(errno));
+			ok = 0;
+		}
+	}
+
+	if (ok)
+		tap_ok("normal syscall path (open/read/write/close) works");
+	else
+		tap_fail("normal syscall path (open/read/write/close) works",
+			 reason);
+}
+
+/*
+ * Test 29: verify dmesg contains KAPI warnings for the invalid tests
+ */
+static void test_dmesg_warnings(void)
+{
+	int kmsg_fd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
+
+	if (kmsg_fd < 0) {
+		tap_skip("dmesg contains expected KAPI warnings",
+			 "cannot open /dev/kmsg");
+		return;
+	}
+
+	/*
+	 * Rewind to the start of kmsg. SEEK_DATA on /dev/kmsg is the
+	 * documented way to skip to the first entry still in the ring
+	 * buffer. Older kernels (or CONFIG_PRINTK=n builds) may reject
+	 * the seek with -EINVAL; in that case we can't reliably audit
+	 * past warnings, so skip the test rather than fail it.
+	 */
+	if (lseek(kmsg_fd, 0, SEEK_DATA) == (off_t)-1) {
+		tap_skip("dmesg contains expected KAPI warnings",
+			 "lseek(SEEK_DATA) not supported on /dev/kmsg");
+		close(kmsg_fd);
+		return;
+	}
+
+	char line[4096];
+	int found_invalid_bits = 0;
+	int found_null = 0;
+	ssize_t n;
+
+	for (;;) {
+		n = read(kmsg_fd, line, sizeof(line) - 1);
+		if (n > 0) {
+			line[n] = '\0';
+			if (strstr(line, "contains invalid bits"))
+				found_invalid_bits++;
+			if (strstr(line, "NULL") && strstr(line, "not allowed"))
+				found_null++;
+		} else if (n == -1 && errno == EPIPE) {
+			/* Ring buffer wrapped, continue reading */
+			continue;
+		} else {
+			/* EAGAIN (no more messages) or other error */
+			break;
+		}
+	}
+
+	close(kmsg_fd);
+
+	if (found_invalid_bits >= 2 && found_null >= 1) {
+		tap_ok("dmesg contains expected KAPI warnings");
+	} else if (found_invalid_bits >= 1 || found_null >= 1) {
+		char msg[128];
+
+		snprintf(msg, sizeof(msg),
+			 "partial: invalid_bits=%d null=%d",
+			 found_invalid_bits, found_null);
+		tap_ok(msg);
+	} else {
+		tap_fail("dmesg KAPI warnings",
+			 "no KAPI warnings found in dmesg");
+	}
+}
+
+int main(void)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	/* Valid operations (1-4) */
+	int fd = test_open_valid();
+
+	if (fd >= 0)
+		test_read_valid(fd);
+	else
+		tap_fail("read from valid fd", "no fd from open");
+
+	test_write_valid();
+
+	if (fd >= 0)
+		test_close_valid(fd);
+	else
+		tap_fail("close valid fd", "no fd from open");
+
+	/* KAPI parameter rejection (5-8) */
+	test_open_invalid_flags();
+	test_open_invalid_mode();
+	test_open_null_path();
+	test_open_flag_bit30();
+
+	/* Boundary conditions and error paths (9-20) */
+	test_read_bad_fd();
+	test_read_zero_count();
+	test_write_zero_count();
+	test_open_long_path();
+	test_read_unmapped_buf();
+	test_write_unmapped_buf();
+	test_close_already_closed();
+	test_open_valid_cloexec();
+	test_write_zero_devnull();
+	test_read_writeonly_fd();
+	test_write_readonly_fd();
+	test_close_fd_9999();
+
+	/* Pipe and lifecycle tests (21-27) */
+	test_read_closed_pipe();
+	test_write_closed_pipe();
+	test_open_directory_on_file();
+	test_open_nonexistent();
+	test_close_stdin();
+	test_read_after_close();
+	test_write_large_count();
+
+	/* Integration (28-29) */
+	test_normal_path();
+	test_dmesg_warnings();
+
+	ksft_finished();
+	return 0;
+}
-- 
2.53.0


^ permalink raw reply related

* [PATCH v4 08/11] kernel/api: add API specification for sys_write
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add KAPI-annotated kerneldoc for the sys_write system call in
fs/read_write.c.

The specification documents parameter constraints (fd, user buffer,
count), error conditions, locking requirements, signal handling
behavior, and short write semantics.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/read_write.c | 391 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 391 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 1a5dd11a0f3ed..7c00f8b19ec20 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1046,6 +1046,397 @@ ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
 	return ret;
 }
 
+/**
+ * sys_write - Write data to a file descriptor
+ * @fd: File descriptor to write to
+ * @buf: User-space buffer containing data to write
+ * @count: Maximum number of bytes to write
+ *
+ * long-desc: Attempts to write up to count bytes from the buffer starting at
+ *   buf to the file referred to by the file descriptor fd. For seekable files
+ *   (regular files, block devices), the write begins at the current file offset,
+ *   and the file offset is advanced by the number of bytes written. If the file
+ *   was opened with O_APPEND, the file offset is first set to the end of the
+ *   file before writing. For non-seekable files (pipes, FIFOs, sockets, character
+ *   devices), the file offset is not used and writing occurs at the current
+ *   position as defined by the device.
+ *
+ *   The number of bytes written may be less than count if, for example, there is
+ *   insufficient space on the underlying physical medium, or the RLIMIT_FSIZE
+ *   resource limit is encountered, or the call was interrupted by a signal
+ *   handler after having written less than count bytes. In the event of a
+ *   successful partial write, the caller should make another write() call to
+ *   transfer the remaining bytes. This behavior is called a "short write."
+ *
+ *   On Linux, write() transfers at most MAX_RW_COUNT (0x7ffff000, approximately
+ *   2GB minus one page) bytes per call, regardless of whether the file or
+ *   filesystem would allow more. This prevents signed arithmetic overflow.
+ *
+ *   For regular files, a successful write() does not guarantee that data has been
+ *   committed to disk. Use fsync(2) or fdatasync(2) if durability is required.
+ *   For O_SYNC or O_DSYNC files, the kernel automatically syncs data on write.
+ *
+ *   POSIX permits writes that are interrupted after partial writes to either
+ *   return -1 with errno=EINTR, or to return the count of bytes already written.
+ *   Linux implements the latter behavior: if some data has been written before
+ *   a signal arrives, write() returns the number of bytes written rather than
+ *   failing with EINTR.
+ *
+ * contexts: process, sleepable
+ *
+ * param: fd
+ *   type: fd, input
+ *   constraint-type: range(0, INT_MAX)
+ *   cdesc: Must be a valid, open file descriptor with write permission.
+ *     The file must have been opened with O_WRONLY or O_RDWR. File descriptors
+ *     opened with O_RDONLY, O_PATH, or that have been closed return EBADF.
+ *     Standard file descriptors 0 (stdin), 1 (stdout), 2 (stderr) are valid if
+ *     open and writable. AT_FDCWD and other special values are not valid.
+ *
+ * param: buf
+ *   type: user_ptr, input
+ *   constraint-type: buffer(2)
+ *   cdesc: Must point to a valid, readable user-space memory region of at
+ *     least count bytes. The buffer is validated via access_ok() before any
+ *     write operation. NULL is invalid and returns EFAULT. For O_DIRECT writes,
+ *     the buffer may need to be aligned to the filesystem's block size (varies
+ *     by filesystem; query with statx() using STATX_DIOALIGN on Linux 6.1+).
+ *
+ * param: count
+ *   type: uint, input
+ *   constraint-type: range(0, SIZE_MAX)
+ *   cdesc: Maximum number of bytes to write. Clamped internally to
+ *     MAX_RW_COUNT (INT_MAX & PAGE_MASK, approximately 0x7ffff000 bytes) to
+ *     prevent signed overflow. A count of 0 is passed through to the underlying
+ *     file operation and typically returns 0, but may trigger filesystem
+ *     or driver-specific side effects. Cast to ssize_t must not be negative.
+ *
+ * return:
+ *   type: int
+ *   check-type: range
+ *   success: >= 0
+ *   desc: On success, returns the number of bytes written (non-negative). Zero
+ *     indicates that nothing was written (count was 0, or no space available
+ *     for non-blocking writes). The return value may be less than count due to
+ *     resource limits, signal interruption, or device constraints (short write).
+ *     On error, returns a negative error code.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: fd is not a valid file descriptor, or fd was not opened for writing.
+ *     This includes file descriptors opened with O_RDONLY, O_PATH, or file
+ *     descriptors that have been closed. Also returned if the file structure
+ *     does not have FMODE_WRITE set.
+ *
+ * error: EFAULT, Bad address
+ *   desc: buf points outside the accessible address space. The buffer address
+ *     failed access_ok() validation. Can also occur if a fault happens during
+ *     copy_from_user() when reading data from user space.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned in several cases: (1) The file descriptor refers to an
+ *     object that is not suitable for writing (no write or write_iter method).
+ *     (2) The file was opened with O_DIRECT and the buffer alignment, offset,
+ *     or count does not meet the filesystem's alignment requirements. (3) The
+ *     count argument, when cast to ssize_t, is negative. Also returned if the
+ *     file lacks the FMODE_CAN_WRITE flag.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: fd refers to a file (pipe, socket, device) that is marked non-blocking
+ *     (O_NONBLOCK) and the write would block because the buffer is full.
+ *     Equivalent to EWOULDBLOCK. The application should retry later or use
+ *     select/poll/epoll to wait for writability.
+ *
+ * error: EWOULDBLOCK, Operation would block
+ *   desc: Alias of EAGAIN on Linux (identical errno value). POSIX permits
+ *     implementations to distinguish the two; Linux does not. Listed here
+ *     for completeness so tooling that consults the spec does not treat
+ *     EWOULDBLOCK-returning call sites as undocumented. See EAGAIN above
+ *     for the conditions that trigger it.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal before any data was written. This
+ *     only occurs if no data has been transferred; if some data was written
+ *     before the signal, the call returns the number of bytes written. The
+ *     caller should typically restart the write.
+ *
+ * error: EPIPE, Broken pipe
+ *   desc: fd refers to a pipe or socket whose reading end has been closed.
+ *     When this condition occurs, the calling process also receives a SIGPIPE
+ *     signal. If the signal is caught or ignored, EPIPE is still returned.
+ *     For sockets, MSG_NOSIGNAL (via send()) suppresses the signal. For
+ *     pwritev2(), the RWF_NOSIGNAL flag suppresses it.
+ *
+ * error: EFBIG, File too large
+ *   desc: An attempt was made to write a file that exceeds the implementation-
+ *     defined maximum file size or the file size limit (RLIMIT_FSIZE) of the
+ *     process. When RLIMIT_FSIZE is exceeded, the process also receives SIGXFSZ.
+ *     For files not opened with O_LARGEFILE on 32-bit systems, the limit is 2GB.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: The device containing the file has no room for the data. This can
+ *     occur mid-write resulting in a short write followed by ENOSPC on retry.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's quota of disk blocks on the filesystem has been exhausted.
+ *     Like ENOSPC, this can result in a short write.
+ *
+ * error: EIO, Input/output error
+ *   desc: A low-level I/O error occurred while modifying the inode or writing
+ *     data. This typically indicates hardware failure, filesystem corruption,
+ *     or network filesystem timeout. Some data may have been written.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: The operation was prevented: (1) by a file seal (F_SEAL_WRITE or
+ *     F_SEAL_FUTURE_WRITE on memfd/shmem), (2) writing to an immutable inode
+ *     (IS_IMMUTABLE), (3) by an LSM hook denying the operation, or (4) by a
+ *     fanotify permission event denying the write.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: The file position plus count would exceed LLONG_MAX. Also returned
+ *     when the offset would exceed filesystem limits after the write.
+ *
+ * error: EDESTADDRREQ, Destination address required
+ *   desc: fd is a datagram socket for which no peer address has been set using
+ *     connect(2). Use sendto(2) to specify the destination address.
+ *
+ * error: ETXTBSY, Text file busy
+ *   desc: The file is being used as a swap file (IS_SWAPFILE). Note: unlike
+ *     the traditional Unix meaning, Linux does not return ETXTBSY when writing
+ *     to an executing binary; that only blocks open() with O_WRONLY/O_RDWR.
+ *
+ * error: EXDEV, Cross-device link
+ *   desc: When writing to a pipe that has been configured as a watch queue
+ *     (CONFIG_WATCH_QUEUE), direct write() calls are not supported.
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: Insufficient kernel memory was available for the write operation.
+ *     For pipes, this occurs when allocating pages for the pipe buffer.
+ *
+ * error: ERESTARTSYS, Restart system call (internal)
+ *   desc: Internal error code indicating the syscall should be restarted. This
+ *     is converted to EINTR if SA_RESTART is not set on the signal handler, or
+ *     the syscall is transparently restarted if SA_RESTART is set. User space
+ *     should not see this error code directly.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The security subsystem (LSM such as SELinux or AppArmor) denied the
+ *     write operation via security_file_permission(). This can occur even if
+ *     the file was successfully opened.
+ *
+ * lock: file->f_pos_lock
+ *   type: mutex
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files that require atomic position updates (FMODE_ATOMIC_POS),
+ *     the f_pos_lock mutex is acquired by fdget_pos() at syscall entry and released
+ *     by fdput_pos() at syscall exit. This serializes concurrent writes sharing
+ *     the same file description. Not acquired for stream files (FMODE_STREAM like
+ *     pipes and sockets) or when the file is not shared.
+ *
+ * lock: sb->s_writers (freeze protection)
+ *   type: custom
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files, file_start_write() acquires freeze protection on
+ *     the superblock via sb_start_write() before the write, and file_end_write()
+ *     releases it after. This prevents writes during filesystem freeze. Not
+ *     acquired for non-regular files (pipes, sockets, devices).
+ *
+ * lock: inode->i_rwsem
+ *   type: rwlock
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files using generic_file_write_iter(), the inode's i_rwsem
+ *     is acquired in write mode before modifying file data. This is internal to
+ *     the filesystem and released before return. Not all filesystems use this
+ *     pattern.
+ *
+ * lock: pipe->mutex
+ *   type: mutex
+ *   acquired: conditional
+ *   released: true
+ *   desc: For pipes and FIFOs, the pipe's mutex is held while modifying pipe
+ *     buffers. Released temporarily while waiting for space, then reacquired.
+ *
+ * lock: RCU read-side
+ *   type: rcu
+ *   acquired: conditional
+ *   released: true
+ *   desc: Held transiently during file descriptor table lookup within fdget().
+ *     The RCU read lock is acquired and released internally by the fd lookup
+ *     path, not held across the entire syscall. fdput() releases the file
+ *     reference count, not the RCU lock.
+ *
+ * signal: SIGPIPE
+ *   direction: send
+ *   action: terminate
+ *   condition: Writing to a pipe or socket with no readers
+ *   desc: When writing to a pipe whose read end is closed, or a socket whose
+ *     peer has closed, SIGPIPE is sent to the calling process. The default
+ *     action terminates the process. Use signal(SIGPIPE, SIG_IGN) to suppress
+ *     for write(). EPIPE is returned regardless of signal disposition.
+ *   timing: during
+ *
+ * signal: SIGXFSZ
+ *   direction: send
+ *   action: coredump
+ *   condition: Writing exceeds RLIMIT_FSIZE
+ *   desc: When a write would exceed the soft file size limit (RLIMIT_FSIZE),
+ *     SIGXFSZ is sent. The default action terminates with a core dump. The
+ *     write returns EFBIG. If RLIMIT_FSIZE is RLIM_INFINITY, no signal is sent.
+ *   timing: during
+ *
+ * signal: Any signal
+ *   direction: receive
+ *   action: return
+ *   condition: While blocked waiting for space (pipes, sockets)
+ *   desc: The syscall may be interrupted by signals while waiting for buffer
+ *     space to become available. If interrupted before any data is written,
+ *     returns -EINTR or -ERESTARTSYS. If data was already written, returns the
+ *     byte count. Restartable if SA_RESTART is set and no data was written.
+ *   errno: -EINTR
+ *   timing: during
+ *   restartable: yes
+ *
+ * side-effect: file_position
+ *   target: file->f_pos
+ *   condition: For seekable files when write succeeds (returns > 0)
+ *   desc: The file offset (f_pos) is advanced by the number of bytes written.
+ *     For files opened with O_APPEND, f_pos is first set to file size. For
+ *     stream files (FMODE_STREAM such as pipes and sockets), the offset is not
+ *     used or modified. Position updates are protected by f_pos_lock when
+ *     shared.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: inode timestamps (mtime, ctime)
+ *   condition: When write succeeds (returns > 0)
+ *   desc: Updates the file's modification time (mtime) and change time (ctime)
+ *     via file_update_time(). The update precision depends on filesystem mount
+ *     options (fine-grained timestamps for multigrain inodes).
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: SUID/SGID bits (mode)
+ *   condition: When writing to a setuid/setgid file
+ *   desc: The SUID bit is cleared when a non-root user writes to a file with
+ *     the bit set. The SGID bit may also be cleared. This is a security feature
+ *     to prevent privilege escalation via modified setuid binaries. Done via
+ *     file_remove_privs().
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: file data
+ *   condition: When write succeeds (returns > 0)
+ *   desc: Modifies the file's data content. For regular files, data is written
+ *     to the page cache (buffered I/O) or directly to storage (O_DIRECT).
+ *     Data may not be persistent until fsync() is called or the file is closed.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: task I/O accounting
+ *   condition: Always
+ *   desc: Updates the current task's I/O accounting statistics. The wchar field
+ *     (write characters) is incremented by bytes written via add_wchar() only on
+ *     successful writes (ret > 0). The syscw field (syscall write count) is
+ *     incremented via inc_syscw() when the write operation is attempted
+ *     (after passing initial validation checks). These statistics are visible
+ *     in /proc/[pid]/io.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: fsnotify events
+ *   condition: When write returns > 0
+ *   desc: Generates an FS_MODIFY fsnotify event via fsnotify_modify(), allowing
+ *     inotify, fanotify, and dnotify watchers to be notified of the write.
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: bypass_check
+ *   allows: Bypass discretionary access control on write permission
+ *   without: Standard DAC checks are enforced
+ *   condition: Checked at open time via inode_permission(), not during read()
+ *
+ * capability: CAP_FSETID
+ *   type: bypass_check
+ *   allows: Bypass ownership checks for SUID/SGID clearing
+ *   without: SUID/SGID bits are cleared on write by non-owner
+ *   condition: Checked during file_remove_privs()
+ *
+ * constraint: MAX_RW_COUNT
+ *   desc: The count parameter is silently clamped to MAX_RW_COUNT (INT_MAX &
+ *     PAGE_MASK, approximately 2GB minus one page) to prevent integer overflow
+ *     in internal calculations. This is transparent to the caller.
+ *   expr: actual_count = min(count, MAX_RW_COUNT)
+ *
+ * constraint: File must be open for writing
+ *   desc: The file descriptor must have been opened with O_WRONLY or O_RDWR.
+ *     Files opened with O_RDONLY or O_PATH cannot be written and return EBADF.
+ *     The file must have both FMODE_WRITE and FMODE_CAN_WRITE flags set.
+ *   expr: (file->f_mode & FMODE_WRITE) && (file->f_mode & FMODE_CAN_WRITE)
+ *
+ * constraint: RLIMIT_FSIZE
+ *   desc: The size of data written is constrained by the RLIMIT_FSIZE resource
+ *     limit. If writing would exceed this limit, SIGXFSZ is sent and EFBIG is
+ *     returned. The limit does not apply to files beyond the limit - only to
+ *     writes that would cross it.
+ *   expr: pos + count <= rlimit(RLIMIT_FSIZE) || rlimit(RLIMIT_FSIZE) == RLIM_INFINITY
+ *
+ * constraint: File seals
+ *   desc: For memfd or shmem files with F_SEAL_WRITE or F_SEAL_FUTURE_WRITE
+ *     seals applied, all write operations fail with EPERM. With F_SEAL_GROW,
+ *     writes that would extend file size fail with EPERM.
+ *
+ * examples: n = write(fd, buf, sizeof(buf));  // Basic write
+ *   n = write(STDOUT_FILENO, msg, strlen(msg));  // Write to stdout
+ *   // Handle short writes:
+ *   while (total < len) {
+ *     n = write(fd, buf + total, len - total);
+ *     if (n < 0) break;
+ *     total += n;
+ *   }
+ *   // Pipe error handling:
+ *   if (write(pipefd[1], &byte, 1) < 0 && errno == EPIPE)
+ *     handle_broken_pipe();
+ *
+ * notes: The behavior of write() varies significantly depending on the type of
+ *   file descriptor:
+ *
+ *   - Regular files: Writes to the page cache (buffered) or directly to storage
+ *     (O_DIRECT). Short writes are rare except near RLIMIT_FSIZE or disk full.
+ *     O_APPEND is atomic for determining write position.
+ *
+ *   - Pipes and FIFOs: Blocking by default. Writes up to PIPE_BUF (4096 bytes
+ *     on Linux) are guaranteed atomic. Larger writes may be interleaved with
+ *     writes from other processes. Blocks if pipe is full; returns EAGAIN with
+ *     O_NONBLOCK. SIGPIPE/EPIPE if no readers.
+ *
+ *   - Sockets: Behavior depends on socket type and protocol. Stream sockets
+ *     (TCP) may return partial writes. Datagram sockets (UDP) typically write
+ *     complete messages or fail. SIGPIPE/EPIPE for broken connections (unless
+ *     MSG_NOSIGNAL). EDESTADDRREQ for unconnected datagram sockets.
+ *
+ *   - Terminals: May block on flow control. Canonical vs raw mode affects
+ *     behavior. Special characters may be interpreted.
+ *
+ *   - Device special files: Behavior is device-specific. Block devices behave
+ *     similarly to regular files. Character device behavior varies.
+ *
+ *   Race condition considerations: Concurrent writes from threads sharing a
+ *   file description race on the file position. Linux 3.14+ provides atomic
+ *   position updates via f_pos_lock for regular files (FMODE_ATOMIC_POS), but
+ *   for maximum safety, use pwrite() for concurrent positioned writes.
+ *
+ *   O_DIRECT writes bypass the page cache and typically require buffer and
+ *   offset alignment to filesystem block size. Query requirements via statx()
+ *   with STATX_DIOALIGN (Linux 6.1+). Unaligned O_DIRECT writes return EINVAL
+ *   on most filesystems.
+ *
+ *   For zero-copy writes, consider using splice(2), sendfile(2), or vmsplice(2)
+ *   instead of copying data through user-space buffers with write().
+ *
+ *   Partial writes (short writes) must be handled by application code.
+ *   Applications should loop until all data is written or an error occurs.
+ */
 SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
 		size_t, count)
 {
-- 
2.53.0


^ permalink raw reply related

* [PATCH v4 07/11] kernel/api: add API specification for sys_read
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add KAPI-annotated kerneldoc for the sys_read system call in
fs/read_write.c.

The specification documents parameter constraints (fd, user buffer,
count), error conditions, locking requirements, signal handling
behavior, and short read semantics.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/read_write.c | 301 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 301 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 50bff7edc91f3..1a5dd11a0f3ed 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -721,6 +721,307 @@ ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
 	return ret;
 }
 
+/**
+ * sys_read - Read data from a file descriptor
+ * @fd: File descriptor to read from
+ * @buf: User-space buffer to read data into
+ * @count: Maximum number of bytes to read
+ *
+ * long-desc: Attempts to read up to count bytes from file descriptor fd into
+ *   the buffer starting at buf. For seekable files (regular files, block
+ *   devices), the read begins at the current file offset, and the file offset
+ *   is advanced by the number of bytes read. For non-seekable files (pipes,
+ *   FIFOs, sockets, character devices), the file offset is not used.
+ *
+ *   If count is zero and fd refers to a regular file, read() may detect errors
+ *   as described below. In the absence of errors, or if read() does not check
+ *   for errors, a read() with a count of 0 returns zero and has no other effects.
+ *
+ *   On success, the number of bytes read is returned (zero indicates end of
+ *   file for regular files). It is not an error if this number is smaller than
+ *   the number of bytes requested; this may happen because fewer bytes are
+ *   actually available right now (maybe because we were close to end-of-file,
+ *   or because we are reading from a pipe, socket, or terminal), or because
+ *   read() was interrupted by a signal.
+ *
+ *   On Linux, read() transfers at most MAX_RW_COUNT (0x7ffff000, approximately
+ *   2GB) bytes per call, regardless of whether the filesystem would allow more.
+ *   This is to avoid issues with signed arithmetic overflow on 32-bit systems.
+ *
+ *   POSIX allows reads that are interrupted after reading some data to either
+ *   return -1 (with errno set to EINTR) or return the number of bytes already
+ *   read. Linux follows the latter behavior: if data has been read before a
+ *   signal arrives, the call returns the bytes read rather than failing.
+ *
+ * contexts: process, sleepable
+ *
+ * param: fd
+ *   type: fd, input
+ *   constraint-type: range(0, INT_MAX)
+ *   cdesc: Must be a valid, open file descriptor with read permission.
+ *     The file must have been opened with O_RDONLY or O_RDWR. Special values
+ *     like AT_FDCWD are not valid. File descriptors for directories return
+ *     EISDIR. Standard file descriptors 0 (stdin), 1 (stdout), 2 (stderr) are
+ *     valid if open and readable.
+ *
+ * param: buf
+ *   type: user_ptr, output
+ *   constraint-type: buffer(2)
+ *   cdesc: Must point to a valid, writable user-space memory region of at
+ *     least count bytes. The buffer is validated via access_ok() before any
+ *     read operation. NULL is invalid and will return EFAULT. The buffer may
+ *     be partially written if an error occurs mid-read. For O_DIRECT reads,
+ *     the buffer may need to be aligned to the filesystem's block size (varies
+ *     by filesystem, check via statx() with STATX_DIOALIGN).
+ *
+ * param: count
+ *   type: uint, input
+ *   constraint-type: range(0, SIZE_MAX)
+ *   cdesc: Maximum number of bytes to read. Clamped internally to
+ *     MAX_RW_COUNT (INT_MAX & PAGE_MASK, approximately 0x7ffff000 bytes) to
+ *     prevent signed overflow issues. A count of 0 returns immediately with 0
+ *     without accessing the file (but may still detect errors). Large values
+ *     are not errors but will be clamped. Cast to ssize_t must not be negative.
+ *
+ * return:
+ *   type: int
+ *   check-type: range
+ *   success: >= 0
+ *   desc: On success, returns the number of bytes read (non-negative). Zero
+ *     indicates end-of-file (EOF) for regular files, or no data available
+ *     from a device that does not block. The return value may be less than
+ *     count if fewer bytes were available (short read). Partial reads are
+ *     not errors. On error, returns a negative error code.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: fd is not a valid file descriptor, or fd was not opened for reading.
+ *     This includes file descriptors opened with O_WRONLY, O_PATH, or file
+ *     descriptors that have been closed. Also returned if the file structure
+ *     does not have FMODE_READ set.
+ *
+ * error: EFAULT, Bad address
+ *   desc: buf points outside the accessible address space. The buffer address
+ *     failed access_ok() validation. Can also occur if a fault happens during
+ *     copy_to_user() when transferring data to user space after the read
+ *     completes in kernel space.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned in several cases: (1) The file descriptor refers to an
+ *     object that is not suitable for reading (no read or read_iter method).
+ *     (2) The file was opened with O_DIRECT and the buffer alignment, offset,
+ *     or count does not meet the filesystem's alignment requirements. (3) For
+ *     timerfd file descriptors, the buffer is smaller than 8 bytes. (4) The
+ *     count argument, when cast to ssize_t, is negative.
+ *
+ * error: EISDIR, Is a directory
+ *   desc: fd refers to a directory. Directories cannot be read using read();
+ *     use getdents64() instead. This error is returned by the generic_read_dir()
+ *     handler installed for directory file operations.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: fd refers to a file (pipe, socket, device) that is marked non-blocking
+ *     (O_NONBLOCK) and the read would block. Also returned with IOCB_NOWAIT
+ *     when data is not immediately available. Equivalent to EWOULDBLOCK.
+ *     The application should retry the read later or use select/poll/epoll.
+ *
+ * error: EWOULDBLOCK, Operation would block
+ *   desc: Alias of EAGAIN on Linux (identical errno value). POSIX permits
+ *     implementations to distinguish the two; Linux does not. Listed here
+ *     for completeness so tooling that consults the spec does not treat
+ *     EWOULDBLOCK-returning call sites as undocumented. See EAGAIN above
+ *     for the conditions that trigger it.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal before any data was read. This
+ *     only occurs if no data has been transferred; if some data was read before
+ *     the signal, the call returns the number of bytes read. The caller should
+ *     typically restart the read.
+ *
+ * error: EIO, Input/output error
+ *   desc: A low-level I/O error occurred. For regular files, this typically
+ *     indicates a hardware error on the storage device, a filesystem error,
+ *     or a network filesystem timeout. For terminals, this may indicate the
+ *     controlling terminal has been closed for a background process.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: The file position plus count would exceed LLONG_MAX. Also returned
+ *     when reading from certain files (e.g., some /proc files) where the file
+ *     position would overflow. For files without FOP_UNSIGNED_OFFSET flag,
+ *     negative file positions are not allowed.
+ *
+ * error: ENOBUFS, No buffer space available
+ *   desc: Returned when reading from pipe-based watch queues (CONFIG_WATCH_QUEUE)
+ *     when the buffer is too small to hold a complete notification, or when
+ *     reading packets from pipes with PIPE_BUF_FLAG_WHOLE set.
+ *
+ * error: ERESTARTSYS, Restart system call (internal)
+ *   desc: Internal error code indicating the syscall should be restarted. This
+ *     is typically translated to EINTR if SA_RESTART is not set on the signal
+ *     handler, or the syscall is transparently restarted if SA_RESTART is set.
+ *     User space should not see this error code directly.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The security subsystem (LSM such as SELinux or AppArmor) denied
+ *     the read operation via security_file_permission(). This can occur even
+ *     if the file was successfully opened, as LSM policies may enforce per-
+ *     operation checks.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned by fanotify permission events (CONFIG_FANOTIFY_ACCESS_PERMISSIONS)
+ *     when a user-space fanotify listener denies the read operation via
+ *     fsnotify_file_area_perm().
+ *
+ * error: ENODATA, No data available
+ *   desc: Returned when reading from files backed by fscache/cachefiles
+ *     and the requested data range is not available in the cache
+ *     (e.g., beyond EOF or in an uncached region). Also returned by
+ *     some filesystem-specific read handlers (e.g., xattr reads).
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: Returned when the file descriptor does not support the read
+ *     operation, such as reading from certain special files or when the
+ *     filesystem does not implement read for this file type.
+ *
+ * lock: file->f_pos_lock
+ *   type: mutex
+ *   acquired: conditional
+ *   released: true
+ *   desc: For regular files that require atomic position updates (FMODE_ATOMIC_POS),
+ *     the f_pos_lock mutex is acquired by fdget_pos() at syscall entry and released
+ *     by fdput_pos() at syscall exit. This serializes concurrent reads that share
+ *     the same file description. Not acquired for files opened with FMODE_STREAM
+ *     (pipes, sockets) or when the file is not shared.
+ *
+ * lock: Filesystem-specific locks
+ *   type: custom
+ *   acquired: conditional
+ *   released: true
+ *   desc: The filesystem's read_iter or read method may acquire additional locks.
+ *     For regular files, this typically includes the inode's i_rwsem for certain
+ *     operations. For pipes, the pipe->mutex is acquired. For sockets, socket
+ *     lock is acquired. These are internal to the file operation and released
+ *     before return.
+ *
+ * lock: RCU read-side
+ *   type: rcu
+ *   acquired: conditional
+ *   released: true
+ *   desc: Held transiently during file descriptor table lookup within fdget().
+ *     The RCU read lock is acquired and released internally by the fd lookup
+ *     path, not held across the entire syscall. fdput() releases the file
+ *     reference count, not the RCU lock.
+ *
+ * signal: Any signal
+ *   direction: receive
+ *   action: return
+ *   condition: When blocked waiting for data on interruptible operations
+ *   desc: The syscall may be interrupted by signals while waiting for data to
+ *     become available (pipes, sockets, terminals) or waiting for locks. If
+ *     interrupted before any data is read, returns -EINTR or -ERESTARTSYS.
+ *     If data has already been read, returns the number of bytes read.
+ *   errno: -EINTR
+ *   timing: during
+ *   restartable: yes
+ *
+ * side-effect: file_position
+ *   target: file->f_pos
+ *   condition: For seekable files when read succeeds (returns > 0)
+ *   desc: The file offset (f_pos) is advanced by the number of bytes read.
+ *     For stream files (FMODE_STREAM such as pipes and sockets), the offset
+ *     is not used or modified. The offset update is protected by f_pos_lock
+ *     when the file is shared between threads/processes.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: inode access time (atime)
+ *   condition: When read succeeds and O_NOATIME is not set
+ *   desc: Updates the file's access time (atime) via touch_atime(). The update
+ *     may be suppressed by mount options (noatime, relatime), the O_NOATIME
+ *     flag, or if the filesystem does not support atime. Relatime only updates
+ *     atime if it is older than mtime or ctime, or more than a day old.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: task I/O accounting
+ *   condition: Always
+ *   desc: Updates the current task's I/O accounting statistics. The rchar field
+ *     (read characters) is incremented by bytes read via add_rchar() only on
+ *     successful reads (ret > 0). The syscr field (syscall read count) is
+ *     incremented unconditionally via inc_syscr(). These statistics are visible
+ *     in /proc/[pid]/io.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: fsnotify events
+ *   condition: When read returns > 0
+ *   desc: Generates an FS_ACCESS fsnotify event via fsnotify_access() allowing
+ *     inotify, fanotify, and dnotify watchers to be notified of the read. This
+ *     occurs after data transfer completes successfully.
+ *   reversible: no
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: bypass_check
+ *   allows: Bypass discretionary access control on read permission
+ *   without: Standard DAC checks are enforced
+ *   condition: Checked at open time via inode_permission(), not during read()
+ *
+ * capability: CAP_DAC_READ_SEARCH
+ *   type: bypass_check
+ *   allows: Bypass read permission checks on regular files
+ *   without: Must have read permission on file
+ *   condition: Checked at open time via inode_permission(), not during read()
+ *
+ * constraint: MAX_RW_COUNT
+ *   desc: The count parameter is silently clamped to MAX_RW_COUNT (INT_MAX &
+ *     PAGE_MASK, approximately 2GB minus one page) to prevent integer overflow
+ *     in internal calculations. This is transparent to the caller; the syscall
+ *     succeeds but reads at most MAX_RW_COUNT bytes.
+ *   expr: actual_count = min(count, MAX_RW_COUNT)
+ *
+ * constraint: File must be open for reading
+ *   desc: The file descriptor must have been opened with O_RDONLY or O_RDWR.
+ *     Files opened with O_WRONLY or O_PATH cannot be read and return EBADF.
+ *     The file must have both FMODE_READ and FMODE_CAN_READ flags set.
+ *   expr: (file->f_mode & FMODE_READ) && (file->f_mode & FMODE_CAN_READ)
+ *
+ * examples: n = read(fd, buf, sizeof(buf));  // Basic read
+ *   n = read(STDIN_FILENO, buf, 1024);  // Read from stdin
+ *   while ((n = read(fd, buf, 4096)) > 0) { process(buf, n); }  // Read loop
+ *   if (read(fd, buf, count) == 0) { handle_eof(); }  // Check for EOF
+ *
+ * notes: The behavior of read() varies significantly depending on the type of
+ *   file descriptor:
+ *
+ *   - Regular files: Reads from current position, advances position, returns 0
+ *     at EOF. Short reads are rare but possible near EOF or on signal.
+ *
+ *   - Pipes and FIFOs: Blocking by default. Returns available data (up to count)
+ *     or blocks until data is available. Returns 0 when all writers have closed.
+ *     O_NONBLOCK returns EAGAIN when empty instead of blocking.
+ *
+ *   - Sockets: Similar to pipes. Specific behavior depends on socket type and
+ *     protocol. MSG_* flags can be specified via recv() for more control.
+ *
+ *   - Terminals: Line-buffered in canonical mode; read returns when newline is
+ *     entered or buffer is full. Raw mode returns immediately when data available.
+ *     Special handling for signals (SIGINT on Ctrl+C, etc.).
+ *
+ *   - Device special files: Behavior is device-specific. Some devices support
+ *     seeking, others do not. Read size may be constrained by device.
+ *
+ *   Race condition: Concurrent reads from the same file description (not just
+ *   file descriptor) can race on the file position. Linux 3.14+ provides atomic
+ *   position updates for regular files via f_pos_lock, but applications should
+ *   use pread() for concurrent positioned reads.
+ *
+ *   O_DIRECT reads bypass the page cache and typically require aligned buffers
+ *   and positions. Alignment requirements are filesystem-specific; use statx()
+ *   with STATX_DIOALIGN (Linux 6.1+) to query. Unaligned O_DIRECT reads fail
+ *   with EINVAL on most filesystems.
+ *
+ *   For splice(2)-like zero-copy reads, consider using splice(), sendfile(),
+ *   or copy_file_range() instead of read() + write().
+ */
 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
 {
 	return ksys_read(fd, buf, count);
-- 
2.53.0


^ permalink raw reply related

* [PATCH v4 06/11] kernel/api: add API specification for sys_close
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add KAPI-annotated kerneldoc for the sys_close system call in fs/open.c.

The specification documents the file descriptor parameter, error
conditions, locking requirements, side effects on pending I/O, and
the close-on-exec relationship.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/open.c | 244 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 244 insertions(+)

diff --git a/fs/open.c b/fs/open.c
index 61573d5abda00..24f6bd5e50090 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1808,6 +1808,250 @@ EXPORT_SYMBOL(filp_close);
  * releasing the fd. This ensures that one clone task can't release
  * an fd while another clone is opening it.
  */
+/**
+ * sys_close - Close a file descriptor
+ * @fd: The file descriptor to close
+ *
+ * long-desc: Terminates access to an open file descriptor, releasing the file
+ *   descriptor for reuse by subsequent open(), dup(), or similar syscalls.
+ *
+ *   Traditional POSIX advisory record locks held by the process on the
+ *   associated file are released (note: POSIX locks are released when ANY fd
+ *   for that inode is closed by the process, not just the last one). OFD locks
+ *   and flock locks are associated with the open file description and are only
+ *   released when the last file descriptor referring to that open file
+ *   description is closed.
+ *   When this is the last file descriptor
+ *   referring to the underlying open file description, associated resources are
+ *   freed. If the file was previously unlinked, the file itself is deleted when
+ *   the last reference is closed.
+ *
+ *   CRITICAL: The file descriptor is ALWAYS closed, even when close() returns
+ *   an error. This differs from POSIX semantics where the state of the file
+ *   descriptor is unspecified after EINTR. On Linux, the fd is released early
+ *   in close() processing before flush operations that may fail. Therefore,
+ *   retrying close() after an error return is DANGEROUS and may close an
+ *   unrelated file descriptor that was assigned to another thread.
+ *
+ *   Errors returned from close() (EIO, ENOSPC, EDQUOT) indicate that the final
+ *   flush of buffered data failed. These errors commonly occur on network
+ *   filesystems like NFS when write errors are deferred to close time. A
+ *   successful return from close() does NOT guarantee that data has been
+ *   successfully written to disk; the kernel uses buffer cache to defer writes.
+ *   To ensure data persistence, call fsync() before close().
+ *
+ *   Note that close() does not affect the close-on-exec flag set via
+ *   fcntl(fd, F_SETFD, FD_CLOEXEC) or O_CLOEXEC on other file descriptors.
+ *   The close-on-exec flag only causes automatic close during exec(), not
+ *   during explicit close() calls.
+ *
+ *   On close, the following cleanup operations are performed: POSIX advisory
+ *   locks are removed, dnotify registrations are cleaned up, the file is
+ *   flushed to storage if applicable, and the file
+ *   reference is released. If this was the last reference, additional cleanup
+ *   includes: fsnotify close notification, epoll cleanup, flock and lease
+ *   removal, FASYNC cleanup, and the file structure deallocation.
+ *
+ * contexts: process, sleepable
+ *
+ * param: fd
+ *   type: fd, input
+ *   constraint-type: range(0, INT_MAX)
+ *   cdesc: Must be a valid, open file descriptor for the current process.
+ *     The value 0, 1, or 2 (stdin, stdout, stderr) may be closed like any other
+ *     fd, though this is unusual and may cause issues with libraries that assume
+ *     these descriptors are valid. The parameter is unsigned int to match kernel
+ *     file descriptor table indexing, but values exceeding INT_MAX are effectively
+ *     invalid due to internal checks.
+ *
+ * return:
+ *   type: int
+ *   check-type: exact
+ *   success: 0
+ *   desc: Returns 0 on success. On error, returns a negative error code.
+ *     IMPORTANT: Even when an error is returned, the file descriptor is still
+ *     closed and must not be used again. The error indicates a problem with
+ *     the final flush operation, not that the fd remains open.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: The file descriptor fd is not a valid open file descriptor, or was
+ *     already closed. This is the only error that indicates the fd was NOT
+ *     closed (because it was never open to begin with). Occurs when fd is out
+ *     of range, has no file assigned, or was already closed.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The flush operation was interrupted by a signal before completion.
+ *     This occurs when the close-time flush operation (e.g., on NFS) performs an
+ *     interruptible wait that receives a signal. IMPORTANT: Despite this error,
+ *     the file descriptor IS closed and must not be used again. This error
+ *     is generated by converting kernel-internal restart codes (ERESTARTSYS,
+ *     ERESTARTNOINTR, ERESTARTNOHAND, ERESTART_RESTARTBLOCK) to EINTR because
+ *     restarting the syscall would be incorrect once the fd is freed.
+ *
+ * error: EIO, I/O error
+ *   desc: An I/O error occurred during the flush of buffered data to the
+ *     underlying storage. This typically indicates a hardware error, network
+ *     failure on NFS, or other storage system error. The file descriptor is
+ *     still closed. Previously buffered write data may have been lost.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: There was insufficient space on the storage device to flush buffered
+ *     writes. This is common on NFS when the server runs out of space between
+ *     write() and close(). The file descriptor is still closed.
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: The user's disk quota was exceeded while attempting to flush buffered
+ *     writes. Common on NFS when quota is exceeded between write() and close().
+ *     The file descriptor is still closed.
+ *
+ * lock: files->file_lock
+ *   type: spinlock
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired via file_close_fd() to atomically lookup and remove the fd
+ *     from the file descriptor table. Held only during the table manipulation;
+ *     released before flush and final cleanup operations. This ensures that
+ *     another thread cannot allocate the same fd number while close is in
+ *     progress.
+ *
+ * lock: file->f_lock
+ *   type: spinlock
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired during epoll cleanup (eventpoll_release_file) and dnotify
+ *     cleanup to safely unlink the file from monitoring structures. May also
+ *     be acquired during lock context operations.
+ *
+ * lock: ep->mtx
+ *   type: mutex
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired during epoll cleanup if the file was monitored by epoll.
+ *     Used to safely remove the file from epoll interest lists.
+ *
+ * lock: flc_lock
+ *   type: spinlock
+ *   acquired: true
+ *   released: true
+ *   desc: File lock context spinlock, acquired during locks_remove_file() to
+ *     safely remove POSIX, flock, and lease locks associated with the file.
+ *
+ * signal: pending_signals
+ *   direction: receive
+ *   action: return
+ *   condition: When close-time flush performs interruptible wait
+ *   desc: If the close-time flush operation (e.g., on NFS) performs an
+ *     interruptible wait and a signal is pending, the wait is interrupted.
+ *     Any kernel restart codes are converted to EINTR since close cannot be
+ *     restarted after the fd is freed.
+ *   errno: -EINTR
+ *   timing: during
+ *   restartable: no
+ *
+ * side-effect: resource_destroy | irreversible
+ *   target: File descriptor table entry
+ *   desc: The file descriptor is removed from the process's file descriptor
+ *     table, making the fd number available for reuse by subsequent open(),
+ *     dup(), or similar calls. This occurs BEFORE any flush or cleanup that
+ *     might fail, making the operation irreversible regardless of return value.
+ *   condition: Always (when fd is valid)
+ *   reversible: no
+ *
+ * side-effect: lock_release
+ *   target: POSIX advisory locks, OFD locks, flock locks
+ *   desc: All advisory locks held on the file by this process are removed.
+ *     POSIX locks are removed via locks_remove_posix() during filp_flush().
+ *     All lock types (POSIX, OFD, flock) are removed via locks_remove_file()
+ *     during __fput() when this is the last reference.
+ *   condition: File has FMODE_OPENED and !(FMODE_PATH)
+ *   reversible: no
+ *
+ * side-effect: resource_destroy
+ *   target: File leases
+ *   desc: Any file leases held on the file are removed during locks_remove_file()
+ *     when this is the last reference to the open file description.
+ *   condition: File had leases and this is the last close
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: dnotify registrations
+ *   desc: Directory notification (dnotify) registrations associated with this
+ *     file are cleaned up via dnotify_flush(). This only applies to directories.
+ *   condition: File is a directory with dnotify registrations
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: epoll interest lists
+ *   desc: If the file was being monitored by epoll instances, it is removed
+ *     from those interest lists via eventpoll_release().
+ *   condition: File was added to epoll instances
+ *   reversible: no
+ *
+ * side-effect: filesystem
+ *   target: Buffered data
+ *   desc: Any buffered data is flushed if applicable (e.g., on NFS). This
+ *     attempts to write any buffered data to storage
+ *     and may return errors (EIO, ENOSPC, EDQUOT) if the flush fails. The
+ *     success of this flush is NOT guaranteed even with a 0 return; use
+ *     fsync() before close() to ensure data persistence.
+ *   condition: File was opened for writing and has buffered data
+ *   reversible: no
+ *
+ * side-effect: free_memory
+ *   target: struct file and related structures
+ *   desc: When this is the last reference to the file, the file structure is
+ *     freed and the dentry and mount references are released.
+ *   condition: This is the last reference to the file
+ *   reversible: no
+ *
+ * side-effect: filesystem
+ *   target: Unlinked file deletion
+ *   desc: If the file was previously unlinked (deleted) but kept open, closing
+ *     the last reference causes the actual file data to be removed from the
+ *     filesystem and the inode to be freed.
+ *   condition: File was unlinked and this is the last reference
+ *   reversible: no
+ *
+ * state-trans: file_descriptor
+ *   from: open
+ *   to: closed/free
+ *   condition: Valid fd passed to close
+ *   desc: The file descriptor transitions from open (usable) to closed (invalid).
+ *     The fd number becomes available for reuse. This transition occurs early
+ *     in close() processing, before any operations that might fail.
+ *
+ * state-trans: file_reference_count
+ *   from: n
+ *   to: n-1 (or freed if n was 1)
+ *   condition: Always on successful fd lookup
+ *   desc: The file's reference count is decremented. If this was the last
+ *     reference, the file is fully cleaned up and freed.
+ *
+ * constraint: File Descriptor Reuse Race
+ *   desc: Because the fd is freed early in close() processing, another thread
+ *     may receive the same fd number from a concurrent open() before close()
+ *     returns. Applications must not retry close() after an error return, as
+ *     this could close an unrelated file opened by another thread.
+ *   expr: After close(fd) returns (even with error), fd is invalid
+ *
+ * examples: close(fd);  // Basic usage - ignore errors (common but not ideal)
+ *   if (close(fd) == -1) perror("close");  // Log errors for debugging
+ *   fsync(fd); close(fd);  // Ensure data persistence before closing
+ *
+ * notes: The fd is always freed regardless of the return value. POSIX
+ *   specifies that on EINTR the state of the fd is unspecified, but Linux
+ *   always closes it. Retrying close() after an error may close an unrelated
+ *   fd that was reassigned by another thread, so callers should never retry.
+ *
+ *   Error codes like EIO, ENOSPC, and EDQUOT indicate that previously buffered
+ *   writes may have failed to reach storage. These errors are particularly
+ *   common on NFS where write errors are often deferred to close time.
+ *
+ *   Calling close() on a file descriptor while another thread is using it
+ *   (e.g., in a blocking read() or write()) does not interrupt the blocked
+ *   operation. The blocked operation continues on the underlying file and
+ *   may complete even after close() returns.
+ */
 SYSCALL_DEFINE1(close, unsigned int, fd)
 {
 	int retval;
-- 
2.53.0


^ permalink raw reply related

* [PATCH v4 05/11] kernel/api: add API specification for sys_open
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add KAPI-annotated kerneldoc for the sys_open system call in fs/open.c.

The specification documents parameter constraints (pathname, flags
bitmask, permission mode), 25 error conditions, locking requirements,
side effects, required capabilities, and usage examples.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/open.c | 317 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 317 insertions(+)

diff --git a/fs/open.c b/fs/open.c
index 91f1139591abe..61573d5abda00 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1373,6 +1373,323 @@ int do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
 }
 
 
+/**
+ * sys_open - Open or create a file
+ * @filename: Pathname of the file to open or create
+ * @flags: File access mode and behavior flags (O_RDONLY, O_WRONLY, O_RDWR, etc.)
+ * @mode: File permission bits for newly created files (only with O_CREAT/O_TMPFILE)
+ *
+ * long-desc: Opens the file specified by pathname. If O_CREAT or O_TMPFILE is
+ *   specified in flags, the file is created if it does not exist; its mode is
+ *   set according to the mode parameter modified by the process's umask.
+ *
+ *   The flags argument must include one of the following access modes: O_RDONLY
+ *   (read-only), O_WRONLY (write-only), or O_RDWR (read/write). These are the
+ *   low-order two bits of flags. In addition, zero or more file creation and
+ *   file status flags can be bitwise-ORed in flags.
+ *
+ *   File creation flags: O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC, O_DIRECTORY,
+ *   O_NOFOLLOW, O_CLOEXEC, O_TMPFILE. These flags affect open behavior.
+ *
+ *   File status flags: O_APPEND, FASYNC, O_DIRECT, O_DSYNC, O_LARGEFILE,
+ *   O_NOATIME, O_NONBLOCK (O_NDELAY), O_PATH, O_SYNC. These become part of the
+ *   file's open file description and can be retrieved/modified with fcntl().
+ *
+ *   The return value is a file descriptor, a small nonnegative integer used in
+ *   subsequent system calls (read, write, lseek, fcntl, etc.) to refer to the
+ *   open file. The file descriptor returned by a successful open is the lowest-
+ *   numbered file descriptor not currently open for the process.
+ *
+ *   On 64-bit systems, O_LARGEFILE is automatically added to the flags. On 32-bit
+ *   systems, files larger than 2GB require O_LARGEFILE to be explicitly set.
+ *
+ *   This syscall is a legacy interface. Modern code should prefer openat() for
+ *   relative path operations and openat2() for additional control via resolve
+ *   flags. The open() call is equivalent to openat(AT_FDCWD, pathname, flags).
+ *
+ * contexts: process, sleepable
+ *
+ * param: filename
+ *   type: path, input
+ *   constraint-type: user_path
+ *   cdesc: Must be a valid null-terminated path string in user memory.
+ *     Maximum path length is PATH_MAX (4096 bytes) including null terminator.
+ *     For relative paths, resolution starts from current working directory.
+ *     The path is followed (symlinks resolved) unless O_NOFOLLOW is specified.
+ *
+ * param: flags
+ *   type: int, input
+ *   constraint-type: mask(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY |
+ *                         O_TRUNC | O_APPEND | O_NONBLOCK | O_DSYNC | O_SYNC | FASYNC |
+ *                         O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | O_NOATIME |
+ *                         O_CLOEXEC | O_PATH | O_TMPFILE)
+ *   cdesc: Must include exactly one of O_RDONLY (0), O_WRONLY (1), or
+ *     O_RDWR (2) as the access mode. Additional flags may be ORed. Invalid flag
+ *     combinations (e.g., O_PATH with incompatible flags, O_TMPFILE without
+ *     O_DIRECTORY, O_TMPFILE with read-only mode) return EINVAL. Since Linux
+ *     6.7, O_CREAT is silently ignored when combined with O_DIRECTORY. Unknown
+ *     flags are silently ignored for backward compatibility (unlike openat2
+ *     which rejects them).
+ *
+ * param: mode
+ *   type: uint, input
+ *   constraint-type: mask(S_ISUID | S_ISGID | S_ISVTX | S_IRWXU | S_IRWXG | S_IRWXO)
+ *   cdesc: Only meaningful when O_CREAT or O_TMPFILE is specified in
+ *     flags. Specifies the file mode bits (permissions and setuid/setgid/sticky
+ *     bits) for a newly created file. The effective mode is (mode & ~umask).
+ *     When O_CREAT/O_TMPFILE is not set, mode is ignored. Mode values exceeding
+ *     S_IALLUGO (07777) are masked off.
+ *
+ * return:
+ *   type: int
+ *   check-type: fd
+ *   success: >= 0
+ *   desc: On success, returns a new file descriptor (non-negative integer).
+ *     The returned file descriptor is the lowest-numbered descriptor not
+ *     currently open for the process. On error, returns a negative error code.
+ *
+ * error: EACCES, Permission denied
+ *   desc: The requested access to the file is not allowed, or search permission
+ *     is denied for one of the directories in the path prefix of pathname, or
+ *     the file did not exist yet and write access to the parent directory is
+ *     not allowed, or O_TRUNC is specified but write permission is denied, or
+ *     the file is on a filesystem mounted with noexec and MAY_EXEC was implied.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: The file is a FIFO or regular file, O_NONBLOCK is specified, and the
+ *     operation would block. Also returned when RESOLVE_CACHED is used with
+ *     openat2() and the lookup cannot be satisfied from the dentry cache.
+ *
+ * error: EBUSY, Device or resource busy
+ *   desc: O_EXCL was specified in flags and pathname refers to a block device
+ *     that is in use by the system (e.g., it is mounted).
+ *
+ * error: EDQUOT, Disk quota exceeded
+ *   desc: O_CREAT is specified and the file does not exist, and the user's quota
+ *     of disk blocks or inodes on the filesystem has been exhausted.
+ *
+ * error: EEXIST, File exists
+ *   desc: O_CREAT and O_EXCL were specified in flags, but pathname already exists.
+ *     This error is atomic with respect to file creation - it prevents race
+ *     conditions (TOCTOU) when creating files.
+ *
+ * error: EFAULT, Bad address
+ *   desc: pathname points outside the process's accessible address space.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: The call was interrupted by a signal handler before completing file
+ *     open. This can occur during lock acquisition or when breaking leases.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned for several conditions: (1) Invalid O_* flag combinations
+ *     (O_TMPFILE without O_DIRECTORY, O_TMPFILE with read-only access, O_PATH
+ *     with flags other than O_DIRECTORY|O_NOFOLLOW|O_CLOEXEC).
+ *     (2) mode contains bits outside S_IALLUGO when O_CREAT/O_TMPFILE
+ *     is set (openat2 only). (3) O_DIRECT requested but filesystem doesn't
+ *     support it. (4) The filesystem does not support O_SYNC or O_DSYNC.
+ *
+ * error: EISDIR, Is a directory
+ *   desc: pathname refers to a directory and the access requested involved
+ *     writing (O_WRONLY, O_RDWR, or O_TRUNC). Also returned when O_TMPFILE is
+ *     used on a directory that doesn't support tmpfile operations.
+ *
+ * error: ELOOP, Too many symbolic links
+ *   desc: Too many symbolic links were encountered in resolving pathname, or
+ *     O_NOFOLLOW was specified but pathname refers to a symbolic link.
+ *
+ * error: EMFILE, Too many open files
+ *   desc: The per-process limit on the number of open file descriptors has been
+ *     reached. This limit is RLIMIT_NOFILE (default typically 1024, max set by
+ *     /proc/sys/fs/nr_open).
+ *
+ * error: ENAMETOOLONG, File name too long
+ *   desc: pathname was too long, exceeding PATH_MAX (4096) bytes, or a single
+ *     path component exceeded NAME_MAX (usually 255) bytes.
+ *
+ * error: ENFILE, Too many open files in system
+ *   desc: The system-wide limit on the total number of open files has been
+ *     reached (/proc/sys/fs/file-max). Processes with CAP_SYS_ADMIN can exceed
+ *     this limit.
+ *
+ * error: ENODEV, No such device
+ *   desc: pathname refers to a special file that has no corresponding device, or
+ *     the file's inode has no file operations assigned.
+ *
+ * error: ENOENT, No such file or directory
+ *   desc: A directory component in pathname does not exist or is a dangling
+ *     symbolic link, or O_CREAT is not set and the named file does not exist,
+ *     or pathname is an empty string (unless AT_EMPTY_PATH is used with openat2).
+ *
+ * error: ENOMEM, Out of memory
+ *   desc: The kernel could not allocate sufficient memory for the file structure,
+ *     path lookup structures, or the filename buffer.
+ *
+ * error: ENOSPC, No space left on device
+ *   desc: O_CREAT was specified and the file does not exist, and the directory
+ *     or filesystem containing the file has no room for a new file entry.
+ *
+ * error: ENOTDIR, Not a directory
+ *   desc: A component used as a directory in pathname is not actually a directory,
+ *     or O_DIRECTORY was specified and pathname was not a directory.
+ *
+ * error: ENXIO, No such device or address
+ *   desc: O_NONBLOCK | O_WRONLY is set and the named file is a FIFO and no
+ *     process has the FIFO open for reading. Also returned when opening a device
+ *     special file that does not exist.
+ *
+ * error: EOPNOTSUPP, Operation not supported
+ *   desc: The filesystem containing pathname does not support O_TMPFILE.
+ *
+ * error: EOVERFLOW, Value too large for defined data type
+ *   desc: pathname refers to a regular file that is too large to be opened.
+ *     This occurs on 32-bit systems without O_LARGEFILE when the file size
+ *     exceeds 2GB (2^31 - 1 bytes).
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: O_NOATIME flag was specified but the effective UID of the caller did
+ *     not match the owner of the file and the caller is not privileged, or the
+ *     file is append-only and O_TRUNC was specified or write mode without
+ *     O_APPEND, or the file is immutable, or a seal prevents the operation.
+ *
+ * error: EROFS, Read-only file system
+ *   desc: pathname refers to a file on a read-only filesystem and write access
+ *     was requested.
+ *
+ * error: ETXTBSY, Text file busy
+ *   desc: pathname refers to an executable image which is currently being
+ *     executed, or to a swap file, and write access or truncation was requested.
+ *
+ * error: EWOULDBLOCK, Resource temporarily unavailable
+ *   desc: O_NONBLOCK was specified and an incompatible lease is held on the file.
+ *
+ * lock: files->file_lock
+ *   type: spinlock
+ *   acquired: true
+ *   released: true
+ *   desc: Acquired when allocating a file descriptor slot. Held briefly during
+ *     fd allocation via alloc_fd() and released before the syscall returns.
+ *
+ * lock: inode->i_rwsem (parent directory)
+ *   type: rwlock
+ *   acquired: conditional
+ *   released: true
+ *   desc: Write lock acquired on parent directory inode when creating a new file
+ *     (O_CREAT). Acquired via inode_lock_nested() in lookup path. May use
+ *     killable variant which can return EINTR on fatal signal.
+ *
+ * lock: RCU read-side
+ *   type: rcu
+ *   acquired: true
+ *   released: true
+ *   desc: Path lookup uses RCU mode initially for performance. If RCU lookup
+ *     fails (returns -ECHILD), falls back to reference-based lookup.
+ *
+ * signal: Any signal
+ *   direction: receive
+ *   action: return
+ *   condition: When blocked on interruptible or killable operations
+ *   desc: The syscall may be interrupted during path lookup, lock acquisition,
+ *     or lease breaking. Fatal signals (SIGKILL, etc.) will interrupt killable
+ *     operations. Non-fatal signals may interrupt interruptible operations.
+ *   errno: -EINTR
+ *   timing: during
+ *   restartable: yes
+ *
+ * side-effect: resource_create | alloc_memory
+ *   target: file descriptor, file structure, dentry cache
+ *   desc: Allocates a new file descriptor in the process's fd table. Allocates
+ *     a struct file from the filp slab cache. May allocate dentries and inodes
+ *     during path lookup. System-wide file count (nr_files) is incremented.
+ *   reversible: yes
+ *
+ * side-effect: filesystem
+ *   target: filesystem, inode
+ *   condition: When O_CREAT is specified and file doesn't exist
+ *   desc: Creates a new file on the filesystem. Creates new inode, allocates
+ *     data blocks as needed, and creates directory entry. Updates parent
+ *     directory mtime and ctime.
+ *   reversible: no
+ *
+ * side-effect: filesystem
+ *   target: file content
+ *   condition: When O_TRUNC is specified for existing file
+ *   desc: Truncates the file to zero length, releasing data blocks. Updates
+ *     file mtime and ctime. May trigger notifications to lease holders.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: inode timestamps
+ *   condition: Unless O_NOATIME is specified
+ *   desc: Opens for reading may update inode access time (atime) unless mounted
+ *     with noatime/relatime or O_NOATIME is specified. Opens for writing that
+ *     truncate or create update mtime and ctime.
+ *
+ * capability: CAP_DAC_OVERRIDE
+ *   type: bypass_check
+ *   allows: Bypass file read, write, and execute permission checks
+ *   without: Standard DAC (discretionary access control) checks are applied
+ *   condition: Checked when file permission would otherwise deny access
+ *
+ * capability: CAP_DAC_READ_SEARCH
+ *   type: bypass_check
+ *   allows: Bypass read permission on files and search permission on directories
+ *   without: Must have read permission on file or search permission on directory
+ *   condition: Checked during path traversal and file open
+ *
+ * capability: CAP_FOWNER
+ *   type: bypass_check
+ *   allows: Use O_NOATIME on files not owned by caller
+ *   without: O_NOATIME returns EPERM if caller is not file owner
+ *   condition: Checked when O_NOATIME is specified and caller is not owner
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: increase_limit
+ *   allows: Exceed the system-wide file limit (file-max)
+ *   without: Returns ENFILE when system limit is reached
+ *   condition: Checked in alloc_empty_file() when nr_files >= max_files
+ *
+ * constraint: RLIMIT_NOFILE (per-process fd limit)
+ *   desc: The returned file descriptor must be less than the process's
+ *     RLIMIT_NOFILE limit. Default is typically 1024, maximum is controlled
+ *     by /proc/sys/fs/nr_open (default 1048576). Exceeding returns EMFILE.
+ *   expr: fd < rlimit(RLIMIT_NOFILE)
+ *
+ * constraint: file-max (system-wide limit)
+ *   desc: System-wide limit on open files in /proc/sys/fs/file-max. Processes
+ *     without CAP_SYS_ADMIN receive ENFILE when this limit is reached. The
+ *     limit is computed based on system memory at boot time.
+ *   expr: nr_files < files_stat.max_files || capable(CAP_SYS_ADMIN)
+ *
+ * constraint: PATH_MAX
+ *   desc: Maximum length of pathname including null terminator is PATH_MAX
+ *     (4096 bytes). Individual path components must not exceed NAME_MAX (255).
+ *
+ * examples: fd = open("/etc/passwd", O_RDONLY);  // Read existing file
+ *   fd = open("/tmp/newfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);  // Create/truncate
+ *   fd = open("/tmp/lockfile", O_WRONLY | O_CREAT | O_EXCL, 0600);  // Exclusive create
+ *   fd = open("/dev/null", O_RDWR);  // Open device
+ *   fd = open("/tmp", O_RDONLY | O_DIRECTORY);  // Open directory
+ *   fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);  // Anonymous temp file
+ *
+ * notes: The distinction between O_RDONLY, O_WRONLY, and O_RDWR is critical.
+ *   O_RDONLY is defined as 0, so (flags & O_RDONLY) always evaluates to zero.
+ *   Test access mode using (flags & O_ACCMODE) == O_RDONLY.
+ *
+ *   When O_CREAT is specified without O_EXCL, there is a race condition between
+ *   testing for file existence and creating it. Use O_CREAT | O_EXCL for atomic
+ *   exclusive file creation.
+ *
+ *   O_CLOEXEC should be used in multithreaded programs to prevent file descriptor
+ *   leaks to child processes between fork() and execve().
+ *
+ *   O_DIRECT has alignment requirements that vary by filesystem. Use statx()
+ *   with STATX_DIOALIGN (Linux 6.1+) to query requirements. Unaligned I/O may
+ *   fail with EINVAL or fall back to buffered I/O.
+ *
+ *   O_PATH opens a file descriptor that can be used only for certain operations
+ *   (fstat, dup, fcntl, close, fchdir on directories, as dirfd for *at() calls).
+ *   I/O operations will fail with EBADF.
+ */
 SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
 {
 	if (force_o_largefile())
-- 
2.53.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox