From: Sasha Levin <sashal@kernel.org>
To: linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kbuild@vger.kernel.org, linux-kselftest@vger.kernel.org,
workflows@vger.kernel.org, tools@kernel.org, x86@kernel.org,
Thomas Gleixner <tglx@kernel.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Jonathan Corbet <corbet@lwn.net>,
Dmitry Vyukov <dvyukov@google.com>,
Randy Dunlap <rdunlap@infradead.org>,
Cyril Hrubis <chrubis@suse.cz>, Kees Cook <kees@kernel.org>,
Jake Edge <jake@lwn.net>,
David Laight <david.laight.linux@gmail.com>,
Gabriele Paoloni <gpaoloni@redhat.com>,
Mauro Carvalho Chehab <mchehab@kernel.org>,
Christian Brauner <brauner@kernel.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
Masahiro Yamada <masahiroy@kernel.org>,
Shuah Khan <skhan@linuxfoundation.org>,
Arnd Bergmann <arnd@arndb.de>,
Nathan Chancellor <nathan@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Masami Hiramatsu <mhiramat@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Subject: [PATCH v4 10/11] kernel/api: add API specification for sys_madvise
Date: Fri, 29 May 2026 19:33:09 -0400
Message-ID: <20260529233311.1901670-11-sashal@kernel.org>
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>
Add KAPI-annotated kerneldoc for the sys_madvise system call in
mm/madvise.c.
The specification documents parameter constraints (start, len_in,
behavior), per-behavior error conditions, lock acquisition (mmap_lock
read and write modes plus the per-VMA fast path, mmu_gather and
mmu_notifier brackets), signal handling, side effects, capability
requirements (CAP_SYS_ADMIN for MADV_HWPOISON and MADV_SOFT_OFFLINE),
mseal interaction, and the heterogeneous skip semantics across the
hint, immediate-action and destructive groups.
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
mm/madvise.c | 575 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 575 insertions(+)
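Note for reviewers: the range checks described in the start and len_in
parameter blocks can be modeled in plain userspace C. The sketch below is
not the kernel code. model_page_align() and model_range_check() are
hypothetical names, and the page size is passed in by the caller as an
assumption; only the rules themselves (aligned start, len_in rounded up,
rejection of wrap) are taken from the specification text.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: round len up to a multiple of page_size, which is
 * assumed to be a power of two. Unsigned arithmetic wraps, so a very large
 * len can round "up" to 0.
 */
static uintptr_t model_page_align(uintptr_t len, uintptr_t page_size)
{
	return (len + page_size - 1) & ~(page_size - 1);
}

/*
 * Hypothetical model of the documented checks. Returns 0 and stores the
 * end of the range in *end when the range is acceptable, -EINVAL otherwise.
 * A zero len_in passes with *end == start, which the caller treats as a
 * no-op.
 */
static int model_range_check(uintptr_t start, size_t len_in,
			     uintptr_t page_size, uintptr_t *end)
{
	uintptr_t len;

	if (start & (page_size - 1))
		return -EINVAL;	/* unaligned start, even when len_in == 0 */

	len = model_page_align((uintptr_t)len_in, page_size);
	if (len_in != 0 && len == 0)
		return -EINVAL;	/* non-zero length rounded up to zero */

	if (start + len < start)
		return -EINVAL;	/* end wraps below start */

	*end = start + len;
	return 0;
}
```

With a 4096-byte page, model_range_check(0x1000, 1, 4096, &end) yields
end == 0x2000, while a len_in of SIZE_MAX is rejected because the rounded
length wraps to zero.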
diff --git a/mm/madvise.c b/mm/madvise.c
index dbb69400786d1..ed0a046e9e25b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -2032,6 +2032,581 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
return error;
}
+/**
+ * sys_madvise - Give advice about use of memory
+ * @start: Starting virtual address of the range to advise on
+ * @len_in: Length of the range in bytes
+ * @behavior: Advice (a MADV_* constant) the kernel should apply to the range
+ *
+ * long-desc: Provides the kernel with advice or directions about the address
+ * range starting at start and extending for len_in bytes. The advice is
+ * selected by behavior, which is one of the MADV_* constants defined in
+ * <sys/mman.h>. The semantics fall into three groups. The hint group
+ * primarily updates VMA flags (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
+ * MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP, MADV_WIPEONFORK,
+ * MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE,
+ * MADV_NOHUGEPAGE), and is itself heterogeneous: the fork-copy gates
+ * (MADV_DONTFORK / MADV_DOFORK) and the KSM scan gates
+ * (MADV_MERGEABLE / MADV_UNMERGEABLE) are strictly honored by their
+ * consumers; MADV_HUGEPAGE / MADV_NOHUGEPAGE express THP eligibility
+ * advice rather than allocation guarantees (MADV_NOHUGEPAGE blocks the
+ * normal fault-time, MADV_COLLAPSE and khugepaged paths; MADV_HUGEPAGE
+ * widens eligibility and increases defrag aggressiveness but does not
+ * force allocation, which still depends on the global
+ * transparent_hugepage= mode, VMA suitability, and allocation success);
+ * MADV_WIPEONFORK / MADV_KEEPONFORK do not wipe at fork time but cause
+ * the child's first access to fault in zero-filled pages;
+ * MADV_DONTDUMP / MADV_DODUMP normally control coredump inclusion but
+ * can be overridden by always_dump_vma() for gate, vm_ops-named or
+ * arch-named VMAs; and MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL are
+ * genuinely heuristic read-ahead hints. The non-destructive
+ * immediate-action group performs work
+ * synchronously while preserving page contents (MADV_WILLNEED, MADV_COLD,
+ * MADV_PAGEOUT, MADV_POPULATE_READ, MADV_POPULATE_WRITE, MADV_COLLAPSE,
+ * MADV_GUARD_REMOVE). The destructive group discards, replaces or
+ * invalidates page contents (MADV_DONTNEED, MADV_DONTNEED_LOCKED,
+ * MADV_FREE, MADV_REMOVE, MADV_GUARD_INSTALL, MADV_HWPOISON,
+ * MADV_SOFT_OFFLINE). MADV_GUARD_INSTALL belongs to the destructive group
+ * because it zaps any existing pages in the range before installing PTE
+ * guard markers.
+ *
+ * start must be page-aligned; len_in is rounded up to the next page
+ * boundary internally. Once those validation checks pass, a zero-length
+ * range succeeds without performing work. The kernel rejects ranges that
+ * wrap (start + PAGE_ALIGN(len_in) < start) and ranges where len_in is
+ * non-zero but rounds up to zero (overflow). Address tag bits are stripped
+ * from start before VMA lookup for every behavior except MADV_HWPOISON
+ * and MADV_SOFT_OFFLINE, which receive the raw start value because they
+ * bypass the VMA walk entirely.
+ *
+ * The kernel return value reports whether any error condition was
+ * encountered, not whether the requested work was performed. The
+ * relationship between the return code and the work done varies by
+ * handler:
+ *
+ * - Hint behaviors update VMA flags. The flags fall into five
+ * sub-classes by how their consumers honor them:
+ * (a) Hard gates -- MADV_DONTFORK / MADV_DOFORK strictly gate VMA
+ * copy in dup_mmap(); MADV_MERGEABLE / MADV_UNMERGEABLE strictly
+ * gate whether KSM will scan the VMA at all. These take effect
+ * immediately and cannot be overridden by other policy.
+ * (b) THP eligibility advice -- MADV_NOHUGEPAGE blocks the normal
+ * fault-time, MADV_COLLAPSE and khugepaged THP paths for the VMA
+ * (driver-internal PMD insertion via insert_pmd() is the only
+ * documented bypass). MADV_HUGEPAGE only widens THP eligibility
+ * under the kernel's "madvise" / "except-advised" policy and
+ * increases defrag aggressiveness; it does not force allocation,
+ * which still depends on the global transparent_hugepage= mode
+ * (always / madvise / never), VMA suitability, defrag GFP policy,
+ * and allocation or memcg-charge success.
+ * (c) Fault-on-access -- MADV_WIPEONFORK / MADV_KEEPONFORK do not
+ * wipe pages at fork time; instead the child VMA's pages are not
+ * copied and the child sees zero-filled pages only when it first
+ * reads or writes them.
+ * (d) Mostly-strict with override -- MADV_DONTDUMP / MADV_DODUMP
+ * control coredump inclusion via VM_DONTDUMP, but always_dump_vma()
+ * can still include gate, vm_ops-named or arch-named VMAs in the
+ * core regardless.
+ * (e) Heuristic -- MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL set
+ * VM_RAND_READ / VM_SEQ_READ as read-ahead hints that the read-ahead
+ * code weighs against other policy and may diverge from at runtime.
+ * In all five sub-classes the requested flag bits on the VMA are set;
+ * what differs is the strength of the resulting downstream effect.
+ *
+ * - Walk-and-skip handlers (MADV_COLD, MADV_PAGEOUT, MADV_FREE,
+ * MADV_GUARD_REMOVE) traverse the range and silently skip pages or
+ * PMDs that fail per-page preconditions (absent, special, device,
+ * shared, non-LRU, unsplittable, locked, etc.), returning 0 even
+ * when most or all pages were skipped.
+ *
+ * - Bulk-backend handlers delegate the requested range to a single
+ * backend call: MADV_DONTNEED and MADV_DONTNEED_LOCKED to
+ * zap_page_range_single_batched(), MADV_REMOVE to vfs_fallocate(),
+ * MADV_WILLNEED on regular files to vfs_fadvise(). The backend's
+ * return is propagated for MADV_REMOVE and discarded for
+ * MADV_WILLNEED; DAX files short-circuit MADV_WILLNEED entirely.
+ *
+ * - Stop-on-error handlers (MADV_POPULATE_READ, MADV_POPULATE_WRITE,
+ * MADV_SOFT_OFFLINE) walk the range but surface the first per-page
+ * failure as an errno (-EHWPOISON, -EFAULT, -ENOMEM, ...) rather
+ * than skipping silently.
+ *
+ * - Hybrid handlers combine modes: MADV_WILLNEED walks for anonymous
+ * and shmem ranges but bulk-calls vfs_fadvise() for regular files;
+ * MADV_COLLAPSE walks PMD-by-PMD and tracks the last scan failure
+ * so transient skips coexist with terminal errors;
+ * MADV_GUARD_INSTALL walks to install markers, zaps any pre-existing
+ * pages it finds with zap_page_range_single() and re-walks, retrying
+ * up to MAX_MADVISE_GUARD_RETRIES; MADV_HWPOISON walks pages but folds
+ * memory_failure()'s -EOPNOTSUPP back to 0.
+ *
+ * Applications that need to know whether a specific page was acted on
+ * must verify the result through other means (e.g. /proc/[pid]/smaps,
+ * page faults, read-after-write).
+ *
+ * On success, madvise() returns 0; unlike read(2) and write(2) it has no
+ * notion of partial completion at the syscall boundary. When the range
+ * spans multiple VMAs, the kernel applies the advice to each in turn; an
+ * unmapped gap inside the range causes the call to return -ENOMEM after
+ * processing the mapped portions, rather than aborting at the gap.
+ *
+ * POSIX defines posix_madvise(3) for a portable subset (POSIX_MADV_NORMAL,
+ * _RANDOM, _SEQUENTIAL, _WILLNEED, _DONTNEED). Linux MADV_DONTNEED is
+ * destructive: it discards the contents of the affected anonymous pages and
+ * subsequent reads return zero. POSIX permits but does not require
+ * destruction, so portable code that needs the POSIX semantics should use
+ * posix_madvise(3) instead.
+ *
+ * contexts: process, sleepable
+ *
+ * param: start
+ * type: uint, input
+ * constraint-type: page_aligned
+ * cdesc: Starting virtual address of the range. Must be aligned to
+ * PAGE_SIZE. An unaligned start always returns -EINVAL, even when
+ * len_in is zero. Address tag bits, where supported by the architecture,
+ * are cleared via untagged_addr() before the range is interpreted, with
+ * the exception of MADV_HWPOISON and MADV_SOFT_OFFLINE, which receive
+ * the raw start value because they bypass the VMA walk.
+ *
+ * param: len_in
+ * type: uint, input
+ * constraint-type: range(0, SIZE_MAX)
+ * cdesc: Length of the range in bytes. Internally rounded up to a multiple
+ * of PAGE_SIZE. A len_in of 0 is accepted and the call is a no-op that
+ * returns 0. A non-zero len_in that rounds up to 0 (i.e. wraps around)
+ * returns -EINVAL, as does a range whose end (start + PAGE_ALIGN(len_in))
+ * would wrap below start.
+ *
+ * param: behavior
+ * type: int, input
+ * cdesc: One of the MADV_* constants from <sys/mman.h>. See the long
+ * description above for the full list and the three semantic groups
+ * (hint, immediate-action, destructive). Behaviors gated by Kconfig
+ * (KSM, transparent hugepage, memory failure) return -EINVAL when the
+ * underlying support is disabled. A few architectures (notably alpha)
+ * renumber values; portable code should always use the symbolic names.
+ *
+ * return:
+ * type: int
+ * check-type: exact
+ * success: 0
+ * desc: On success, returns 0. On error, returns a negative error code.
+ * There is no partial-success indication; either the entire processed
+ * range succeeded, or an error is returned and an unspecified part of
+ * the range may have been advised.
+ *
+ * error: EINVAL, Invalid argument
+ * desc: Returned for invalid input (unrecognised MADV_*, Kconfig-gated
+ * behavior, unaligned start, range wrap, non-zero len_in rounding to
+ * zero) and for per-behavior VMA-filter violations. The constraint:
+ * blocks cover FREE, WIPEONFORK, REMOVE, COLD and PAGEOUT; inline
+ * filters also reject DOFORK on VM_SPECIAL, KEEPONFORK on VM_DROPPABLE,
+ * DODUMP on non-hugetlb VM_SPECIAL/VM_DROPPABLE, GUARD_* on VM_SPECIAL
+ * or VM_HUGETLB, GUARD_INSTALL on VM_LOCKED. Also
+ * returned by faultin_page_range() and madvise_collapse_errno().
+ *
+ * error: ENOMEM, Cannot allocate memory
+ * desc: Some part of the requested range falls in a gap between mapped
+ * VMAs; the kernel still applies the behavior to the mapped subranges
+ * and only returns -ENOMEM after the walk completes. MADV_POPULATE_*
+ * also returns -ENOMEM when the region has no VMA or when
+ * faultin_page_range() exhausts memory. MADV_COLLAPSE returns -ENOMEM
+ * when its struct collapse_control cannot be allocated up front, and
+ * when madvise_collapse_errno() maps SCAN_ALLOC_HUGE_PAGE_FAIL (no
+ * hugepage available) to -ENOMEM.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ * desc: For the VMA-flag-mutating behaviors, an internal -ENOMEM from VMA
+ * splitting is translated to -EAGAIN before being returned to userspace,
+ * advising the caller that a transient kernel resource shortage
+ * prevented the update. Also returned by MADV_COLLAPSE via
+ * madvise_collapse_errno() for transient scan failures (folio lock
+ * contention, LRU isolation failure, dirty/writeback) where retrying
+ * the call may succeed.
+ *
+ * error: EIO, Input/output error
+ * desc: For MADV_REMOVE, an I/O error from the underlying filesystem's
+ * FALLOC_FL_PUNCH_HOLE handler is propagated back as -EIO. MADV_WILLNEED
+ * and MADV_PAGEOUT do not surface filesystem or device I/O errors:
+ * vfs_fadvise() returns are discarded by madvise_willneed() and the
+ * pageout walk is invoked through a void helper, so transient I/O
+ * failures during read-ahead or page-out are silently dropped.
+ *
+ * error: EBADF, Bad file descriptor
+ * desc: Returned by MADV_WILLNEED when applied to a non-file-backed VMA
+ * and the kernel was built without CONFIG_SWAP, so there is neither a
+ * file to read-ahead from nor a swap device to fault from.
+ *
+ * error: EACCES, Permission denied
+ * desc: Returned by MADV_REMOVE when the target VMA is not a writable
+ * shared mapping (vma_is_shared_maywrite() is false). Punching a hole in
+ * a private or read-only shared mapping is not permitted; the operation
+ * would either be invisible to other mappers or violate file permissions.
+ *
+ * error: EPERM, Operation not permitted
+ * desc: Returned in two situations. First, MADV_HWPOISON and
+ * MADV_SOFT_OFFLINE require CAP_SYS_ADMIN; the inject-error handler
+ * refuses non-privileged callers. Second, on 64-bit kernels, a discard
+ * operation (MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_REMOVE,
+ * MADV_DONTFORK, MADV_WIPEONFORK, MADV_GUARD_INSTALL) is refused on a
+ * read-only anonymous VMA that has been sealed with mseal(2), to prevent
+ * bypassing the seal by discarding mapped data.
+ *
+ * error: EINTR, Interrupted system call
+ * desc: Returned when a fatal signal is delivered while the call is
+ * waiting to acquire the mmap write lock for a VMA-flag-mutating
+ * behavior (mmap_write_lock_killable() returns -EINTR), or when
+ * MADV_POPULATE_READ/MADV_POPULATE_WRITE is interrupted while faulting
+ * in pages (faultin_page_range() returns -EINTR). The single-shot
+ * madvise() syscall is not automatically restarted by the signal
+ * framework on this path; the caller must reissue the request if
+ * desired.
+ *
+ * error: EHWPOISON, Memory page has hardware error
+ * desc: MADV_POPULATE_READ or MADV_POPULATE_WRITE encountered a page that
+ * has been marked as containing a hardware-detected memory error and
+ * could not be faulted in.
+ *
+ * error: EFAULT, Bad address
+ * desc: MADV_POPULATE_READ or MADV_POPULATE_WRITE attempted to fault in a
+ * page whose mapping raised VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV (for
+ * example, a file-backed page beyond the end of the file).
+ *
+ * error: EBUSY, Device or resource busy
+ * desc: Returned by MADV_COLLAPSE via madvise_collapse_errno() in two
+ * specific scan-failure modes: SCAN_CGROUP_CHARGE_FAIL (the new
+ * hugepage cannot be charged to the memory cgroup) and
+ * SCAN_EXCEED_NONE_PTE (too many absent PTEs in the candidate range
+ * for a synchronous collapse). Other transient collapse failures are
+ * reported as -EAGAIN; non-transient ones as -EINVAL.
+ *
+ * lock: mm->mmap_lock (read mode)
+ * type: rwlock
+ * acquired: yes
+ * released: yes
+ * desc: Held on entry to the VMA walk for MADV_REMOVE, MADV_WILLNEED,
+ * MADV_COLD, MADV_PAGEOUT and MADV_COLLAPSE, and as the fallback when
+ * the per-VMA fast path declines. Several handlers drop and reacquire
+ * this lock mid-operation: MADV_WILLNEED on a file-backed VMA and
+ * MADV_REMOVE around their vfs_fadvise() / vfs_fallocate() callouts;
+ * MADV_COLLAPSE on file-backed ranges around the page migration
+ * pipeline; MADV_POPULATE_* (dispatched directly to madvise_populate())
+ * around each faultin_page_range() call, which may itself drop the
+ * lock internally before returning.
+ *
+ * lock: mm->mmap_lock (write mode; killable)
+ * type: rwlock
+ * acquired: yes
+ * released: yes
+ * desc: Acquired in killable write mode for behaviors that modify
+ * vma->vm_flags or split/merge VMAs (MADV_NORMAL, MADV_RANDOM,
+ * MADV_SEQUENTIAL, MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP,
+ * MADV_WIPEONFORK, MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE,
+ * MADV_HUGEPAGE, MADV_NOHUGEPAGE). If the acquisition is killed by a
+ * fatal signal, the syscall returns -EINTR before any VMA is touched.
+ *
+ * lock: per-VMA read lock (vma->vm_lock)
+ * type: custom
+ * acquired: yes
+ * released: yes
+ * desc: Tried first for MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE,
+ * MADV_GUARD_INSTALL and MADV_GUARD_REMOVE via lock_vma_under_rcu(). The
+ * per-VMA path is taken only when the requested range fits within a
+ * single VMA, the target mm is the caller's mm, the VMA is not armed
+ * with userfaultfd, and (for behaviors that establish page tables) an
+ * anon_vma is already attached. Otherwise the code falls back to the
+ * mmap read lock above.
+ *
+ * lock: mmu_gather TLB batch
+ * type: custom
+ * acquired: yes
+ * released: yes
+ * desc: For MADV_DONTNEED, MADV_DONTNEED_LOCKED and MADV_FREE the syscall
+ * wraps the per-VMA work in tlb_gather_mmu() / tlb_finish_mmu() so PTE
+ * clearing and TLB invalidation are batched. MADV_COLD and MADV_PAGEOUT
+ * build a short-lived gather inside the handler. MADV_GUARD_INSTALL
+ * builds a transient gather via zap_page_range_single() each time the
+ * retry loop has to clear pre-existing pages; if the range is already
+ * empty no gather is built. MADV_GUARD_REMOVE never zaps and never
+ * gathers.
+ *
+ * lock: mmu_notifier invalidate range
+ * type: custom
+ * acquired: yes
+ * released: yes
+ * desc: All zap-based paths -- MADV_DONTNEED, MADV_DONTNEED_LOCKED, the
+ * zap branch of MADV_GUARD_INSTALL via zap_page_range_single(), and
+ * MADV_FREE's own walk -- bracket their work with
+ * mmu_notifier_invalidate_range_start()/_end() so secondary MMUs (KVM,
+ * IOMMUv2, etc.) observe the page clearing.
+ *
+ * signal: Any fatal signal
+ * direction: receive
+ * action: return
+ * condition: Acquiring the mmap write lock or faulting in pages for
+ * MADV_POPULATE_*
+ * desc: A pending fatal signal aborts mmap_write_lock_killable() (used by
+ * the VMA-flag-mutating behaviors) and faultin_page_range() (used by
+ * MADV_POPULATE_READ and MADV_POPULATE_WRITE), in both cases surfacing as
+ * -EINTR to userspace. The single-shot madvise() syscall does not request
+ * transparent restart on these paths; the caller is expected to reissue
+ * the call if appropriate.
+ * errno: -EINTR
+ * timing: during
+ * restartable: no
+ *
+ * side-effect: modify_state
+ * target: vma->vm_flags
+ * condition: Hint-group behaviors (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
+ * MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP, MADV_WIPEONFORK,
+ * MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE,
+ * MADV_NOHUGEPAGE)
+ * desc: Sets or clears VM_RAND_READ, VM_SEQ_READ, VM_DONTCOPY, VM_DONTDUMP,
+ * VM_WIPEONFORK, VM_MERGEABLE, VM_HUGEPAGE or VM_NOHUGEPAGE on the affected
+ * VMAs and may split or merge VMAs to apply the change to a sub-range. The
+ * change is reversible by issuing madvise() with the inverse advice (e.g.
+ * MADV_DOFORK undoes MADV_DONTFORK), with the caveat that the inverse
+ * call's per-VMA filter still applies: MADV_DOFORK rejects VM_SPECIAL,
+ * MADV_DODUMP rejects non-hugetlb VM_SPECIAL or VM_DROPPABLE, and
+ * MADV_KEEPONFORK rejects VM_DROPPABLE; for those classes of VMA the
+ * inverse cannot complete.
+ * reversible: yes
+ *
+ * side-effect: free_memory | modify_state | irreversible
+ * target: page tables and resident pages within the range
+ * condition: MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE
+ * desc: MADV_DONTNEED zaps PTEs, releasing the underlying pages or swap
+ * slots so the next access faults in zero-filled anonymous pages or
+ * re-reads the file. MADV_DONTNEED_LOCKED is identical but tolerates
+ * VM_LOCKED. MADV_FREE marks anonymous pages lazy-freeable: clean pages
+ * may be reclaimed under memory pressure, while writes before
+ * reclamation cancel the lazy-free. Discarded data cannot be recovered.
+ * reversible: no
+ *
+ * side-effect: filesystem | irreversible
+ * target: backing file (FALLOC_FL_PUNCH_HOLE)
+ * condition: MADV_REMOVE
+ * desc: Calls vfs_fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) on
+ * the backing file, deallocating the corresponding file blocks. The hole
+ * is visible to all mappers of the file and to read(2)/write(2)
+ * callers; subsequent reads return zero. Filesystem freeze protection,
+ * i_rwsem and any quota/space accounting are taken by the underlying
+ * fallocate path.
+ * reversible: no
+ *
+ * side-effect: modify_state | schedule
+ * target: LRU lists and page reclaim
+ * condition: MADV_COLD, MADV_PAGEOUT
+ * desc: MADV_COLD deactivates the affected pages, moving them to the
+ * inactive LRU and clearing PG_referenced/PG_young so they are reclaimed
+ * sooner under pressure. MADV_PAGEOUT additionally calls reclaim_pages()
+ * to write dirty pages out and drop clean ones synchronously. Page data
+ * is preserved (rereads will fault in the same content), but the I/O and
+ * LRU bookkeeping cannot be undone.
+ * reversible: no
+ *
+ * side-effect: modify_state
+ * target: page tables (faultin)
+ * condition: MADV_POPULATE_READ, MADV_POPULATE_WRITE
+ * desc: Walks the requested range with faultin_page_range(), populating
+ * PTEs by triggering read or write faults so subsequent accesses do not
+ * fault. Equivalent to touching every page in the range while suppressing
+ * SIGBUS/SIGSEGV through the syscall return value. Allocations made by
+ * faultin are not undone on partial failure.
+ * reversible: no
+ *
+ * side-effect: modify_state | schedule
+ * target: transparent hugepage layout
+ * condition: MADV_COLLAPSE
+ * desc: Synchronously coalesces base pages in the range into a PMD-sized
+ * transparent hugepage when the mapping permits. Performs the same page
+ * migration and zeroing that khugepaged would do asynchronously; the
+ * range's data is preserved across the collapse.
+ * reversible: no
+ *
+ * side-effect: free_memory | modify_state | irreversible
+ * target: PTE marker (PTE_MARKER_GUARD)
+ * condition: MADV_GUARD_INSTALL, MADV_GUARD_REMOVE
+ * desc: MADV_GUARD_INSTALL installs PTE_MARKER_GUARD entries that cause
+ * subsequent accesses to deliver SIGSEGV without consuming physical
+ * memory; existing pages already mapped in the range are zapped via
+ * zap_page_range_single() before the markers are installed, so any
+ * prior contents are lost. MADV_GUARD_REMOVE clears the markers but
+ * does not (and cannot) restore zapped data.
+ * reversible: no
+ *
+ * side-effect: hardware | irreversible
+ * target: physical page (memory_failure / soft_offline_page)
+ * condition: MADV_HWPOISON, MADV_SOFT_OFFLINE
+ * desc: MADV_HWPOISON marks the affected pages as containing an
+ * unrecoverable hardware error using the same machine-check path that
+ * real ECC failures take; MADV_SOFT_OFFLINE migrates the contents off
+ * the affected pages and removes them from the buddy allocator. Both
+ * paths affect physical memory bookkeeping kernel-wide and cannot be
+ * undone without a reboot. Intended for testing the memory-failure
+ * pipeline; restricted to CAP_SYS_ADMIN.
+ * reversible: no
+ *
+ * side-effect: modify_state
+ * target: KSM merge state (vm_flags & VM_MERGEABLE)
+ * condition: MADV_MERGEABLE, MADV_UNMERGEABLE
+ * desc: Toggles the VMA's eligibility for the kernel same-page merger.
+ * Enabling merging may later cause identical anonymous pages to be
+ * replaced by shared, write-protected copies; disabling merging tears
+ * any existing merges down lazily. The flag toggle itself is reversible
+ * by issuing the inverse advice.
+ * reversible: yes
+ *
+ * side-effect: modify_state
+ * target: userfaultfd event queue
+ * condition: MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE, MADV_REMOVE
+ * on a userfaultfd-armed VMA
+ * desc: Generates a UFFD_EVENT_REMOVE notification covering the discarded
+ * range so userfaultfd monitors observing the mapping see the
+ * invalidation. The event is queued before the discard takes effect; the
+ * monitor cannot veto it.
+ * reversible: no
+ *
+ * capability: CAP_SYS_ADMIN
+ * type: perform_operation
+ * allows: Inject memory errors via MADV_HWPOISON or MADV_SOFT_OFFLINE
+ * without: Both behaviors return -EPERM
+ * condition: Checked at entry to madvise_inject_error() before any pages
+ * are looked up
+ *
+ * constraint: Page-aligned start
+ * desc: start must lie on a page boundary; otherwise the call returns
+ * -EINVAL before any VMA is consulted.
+ * expr: (start & (PAGE_SIZE - 1)) == 0
+ *
+ * constraint: Length rounded up to PAGE_SIZE
+ * desc: The effective range length is PAGE_ALIGN(len_in). A non-zero len_in
+ * that overflows during rounding, or a (start, end) range that wraps,
+ * is rejected with -EINVAL.
+ * expr: end = start + PAGE_ALIGN(len_in); end >= start
+ *
+ * constraint: Behavior must be supported
+ * desc: behavior must be one of the MADV_* values listed under the
+ * behavior parameter. Behaviors gated by Kconfig (KSM, THP, memory
+ * failure) are rejected with -EINVAL when the corresponding option is
+ * disabled in the running kernel.
+ *
+ * constraint: mseal-protected discards
+ * desc: On 64-bit kernels, a discard operation (FREE, DONTNEED,
+ * DONTNEED_LOCKED, REMOVE, DONTFORK, WIPEONFORK, GUARD_INSTALL) against
+ * a sealed anonymous VMA is rejected unless the mapping is currently
+ * writable -- both VM_WRITE in vm_flags and arch_vma_access_permitted()
+ * allowing write -- so that mseal(2) cannot be bypassed by instructing
+ * the kernel to throw the data away. File-backed sealed VMAs and
+ * writable sealed VMAs are not subject to this restriction.
+ * expr: !is_discard(behavior) || !vma_is_sealed(vma) ||
+ * !vma_is_anonymous(vma) || ((vma->vm_flags & VM_WRITE) &&
+ * arch_vma_access_permitted(vma, true, false, false))
+ *
+ * constraint: MADV_FREE requires anonymous mappings
+ * desc: MADV_FREE is defined only over anonymous mappings; the handler
+ * rejects file-backed VMAs with -EINVAL.
+ * expr: vma_is_anonymous(vma)
+ *
+ * constraint: MADV_WIPEONFORK requires private anonymous mappings
+ * desc: MADV_WIPEONFORK rejects file-backed mappings and shared anonymous
+ * mappings; only MAP_PRIVATE anonymous VMAs accept it. Both rejections
+ * surface as -EINVAL.
+ * expr: !vma->vm_file && !(vma->vm_flags & VM_SHARED)
+ *
+ * constraint: MADV_REMOVE requires a writable shared file mapping
+ * desc: MADV_REMOVE rejects VM_LOCKED VMAs, VMAs without an associated
+ * file/mapping/host inode, and non-shared-writable mappings. The first
+ * two cases return -EINVAL; a private or read-only shared mapping
+ * returns -EACCES.
+ * expr: !(vma->vm_flags & VM_LOCKED) && vma->vm_file &&
+ * vma->vm_file->f_mapping && vma->vm_file->f_mapping->host &&
+ * vma_is_shared_maywrite(vma)
+ *
+ * constraint: MADV_COLD / MADV_PAGEOUT VMA filter
+ * desc: Both behaviors require LRU-managed pages; they reject VMAs that
+ * are mlocked, raw-PFN or hugetlb.
+ * expr: !(vma->vm_flags & (VM_LOCKED | VM_PFNMAP | VM_HUGETLB))
+ *
+ * examples: madvise(p, len, MADV_SEQUENTIAL); // set VM_SEQ_READ on the VMA
+ * madvise(p, len, MADV_POPULATE_WRITE); // prefault writable PTEs
+ * madvise(p, len, MADV_DONTNEED); // discard anonymous pages
+ * madvise(p, len, MADV_GUARD_INSTALL); // install SIGSEGV guard pages
+ *
+ * notes: madvise(2) reports only whether an error condition was
+ * encountered, not whether the requested work was performed. The hint
+ * group sets VMA flags whose downstream strictness varies:
+ * MADV_DONTFORK / MADV_DOFORK and MADV_MERGEABLE / MADV_UNMERGEABLE are
+ * hard gates honored by fork-copy and KSM scanning respectively;
+ * MADV_NOHUGEPAGE is a hard gate against the normal user-visible THP
+ * paths but MADV_HUGEPAGE is eligibility/advice that does not force
+ * THP installation -- the global transparent_hugepage= mode, VMA
+ * suitability and allocation success still apply; MADV_WIPEONFORK /
+ * MADV_KEEPONFORK take effect at the child's first page access
+ * (zero-on-fault), not at fork time; MADV_DONTDUMP / MADV_DODUMP gate
+ * coredump inclusion but can be overridden by always_dump_vma();
+ * MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL are heuristic read-ahead
+ * hints that the read-ahead code may weigh against other policy. The
+ * non-hint behaviors
+ * are not uniform: walk-and-skip handlers (COLD, PAGEOUT, FREE,
+ * GUARD_REMOVE) silently skip pages that fail per-page preconditions;
+ * bulk-backend handlers (DONTNEED, DONTNEED_LOCKED, REMOVE, and
+ * WILLNEED on regular files) delegate the range to a single backend
+ * call whose return is propagated for REMOVE and discarded for
+ * WILLNEED; stop-on-error handlers (POPULATE_READ, POPULATE_WRITE,
+ * SOFT_OFFLINE) surface the first per-page failure rather than
+ * skipping; and hybrid handlers (WILLNEED for anon/shmem, COLLAPSE,
+ * GUARD_INSTALL, HWPOISON) mix walking with bulk backends, retry/zap
+ * loops or selective error suppression. A successful return therefore
+ * guarantees only that no error was raised in the handler that ran,
+ * not that every page was processed.
+ *
+ * Behavior introduction history (mainline): MADV_FREE in 4.5,
+ * MADV_WIPEONFORK / MADV_KEEPONFORK in 4.14, MADV_COLD / MADV_PAGEOUT in
+ * 5.4, MADV_POPULATE_READ / MADV_POPULATE_WRITE in 5.14,
+ * MADV_DONTNEED_LOCKED in 5.18, MADV_COLLAPSE in 6.1, MADV_GUARD_INSTALL /
+ * MADV_GUARD_REMOVE in 6.13. Code that wants to remain portable to older
+ * kernels must handle -EINVAL gracefully and fall back.
+ *
+ * process_madvise(2) extends the same set of advices to another process
+ * identified by a pidfd. When the target mm is the caller's own (the
+ * pidfd refers to the caller), any locally-supported MADV_* value is
+ * accepted. When the target is a different mm, the behavior must be in
+ * the non-destructive remote subset (MADV_COLD, MADV_PAGEOUT,
+ * MADV_WILLNEED, MADV_COLLAPSE) or the call returns -EINVAL, and the
+ * caller must hold CAP_SYS_NICE.
+ *
+ * The discard subset (MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED,
+ * MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK, MADV_GUARD_INSTALL) is
+ * refused on non-writable anonymous VMAs sealed with mseal(2) on 64-bit
+ * kernels. mseal(2) does not provide an unseal operation, so applications
+ * that need to retain the ability to discard such pages must keep the
+ * mapping writable (and arch-accessible for write) or refrain from
+ * sealing it.
+ *
+ * On a userfaultfd-armed VMA, all four destructive discards (DONTNEED,
+ * DONTNEED_LOCKED, FREE, REMOVE) emit a UFFD_EVENT_REMOVE event; the
+ * per-VMA fast path is bypassed and the syscall falls back to the
+ * heavier mmap_read_lock so the userfaultfd monitor is consulted before
+ * the discard takes effect.
+ *
+ * MADV_GUARD_INSTALL retries up to MAX_MADVISE_GUARD_RETRIES (3) times
+ * when it loses races with concurrent faulting or khugepaged. If those
+ * retries are exhausted the handler returns -ERESTARTNOINTR via
+ * restart_syscall(); the kernel's syscall return path treats this as a
+ * transparent restart of madvise() and re-enters the call with the same
+ * arguments. The transparent restart is unconditional and is not driven
+ * by signal delivery, so the caller never observes an errno from this
+ * path and the call appears to make eventual forward progress.
+ * anon_vma_prepare() failures inside MADV_GUARD_INSTALL bypass the
+ * ENOMEM-to-EAGAIN translation that applies to the VMA-flag-mutating
+ * behaviors and surface as -ENOMEM directly.
+ *
+ * Architecture note: alpha defines MADV_DONTNEED as 6 (not 4) and reserves
+ * MADV_SPACEAVAIL=5; portable code must use the symbolic names from
+ * <sys/mman.h>.
+ */
SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
return do_madvise(current->mm, start, len_in, behavior);
--
2.53.0
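For anyone who wants to poke at the documented behavior from userspace,
here is a minimal Linux-only sketch. It exercises three statements from
the specification: a zero-length call on an aligned start succeeds, an
unaligned start fails with EINVAL even when the length is zero, and
MADV_DONTNEED on a private anonymous mapping discards the contents so the
next read sees zero. The demo_* helper names are hypothetical and exist
only for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one private anonymous read/write page, or NULL. */
static unsigned char *demo_map_page(size_t *page_size)
{
	long ps = sysconf(_SC_PAGESIZE);
	void *p;

	if (ps <= 0)
		return NULL;
	p = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	*page_size = (size_t)ps;
	return p;
}

/*
 * Issue a zero-length madvise() at the given byte offset into a fresh page.
 * Returns 0 on success, the errno value on failure, or -1 if setup failed.
 */
static int demo_zero_len_advice(size_t offset)
{
	size_t ps;
	unsigned char *p = demo_map_page(&ps);
	int err;

	if (!p)
		return -1;
	err = madvise(p + offset, 0, MADV_NORMAL) == 0 ? 0 : errno;
	munmap(p, ps);
	return err;
}

/*
 * Dirty a page, discard it with MADV_DONTNEED, and return the first byte
 * read back afterwards (0..255), or -1 if setup or madvise() failed.
 */
static int demo_dontneed_first_byte(void)
{
	size_t ps;
	unsigned char *p = demo_map_page(&ps);
	int byte = -1;

	if (!p)
		return -1;
	memset(p, 0xAB, ps);
	if (madvise(p, ps, MADV_DONTNEED) == 0)
		byte = p[0];	/* the access refaults a zero-filled page */
	munmap(p, ps);
	return byte;
}
```

On Linux, demo_zero_len_advice(0) is expected to return 0,
demo_zero_len_advice(1) to return EINVAL, and demo_dontneed_first_byte()
to return 0 rather than 0xAB.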