From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Zizhi Wo <wozizhi@huawei.com>, Zizhi Wo <wozizhi@huaweicloud.com>,
	Jan Kara <jack@suse.cz>, Christian Brauner <brauner@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	bcrl@kvack.org, viro@zeniv.linux.org.uk, linux-aio@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 7.0] fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error
Date: Tue, 28 Apr 2026 06:40:34 -0400
Message-ID: <20260428104133.2858589-23-sashal@kernel.org>
In-Reply-To: <20260428104133.2858589-1-sashal@kernel.org>

From: Zizhi Wo <wozizhi@huawei.com>

[ Upstream commit c03ce4173c7bffe1e7477f905a09b015d4000d3c ]

[BUG]
Recently, our internal syzkaller testing uncovered a null pointer
dereference issue:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[   51.111664]  filemap_read_folio+0x25/0xe0
[   51.112410]  filemap_fault+0xad7/0x1250
[   51.113112]  __do_fault+0x4b/0x460
[   51.113699]  do_pte_missing+0x5bc/0x1db0
[   51.114250]  ? __pte_offset_map+0x23/0x170
[   51.114822]  __handle_mm_fault+0x9f8/0x1680
[   51.115408]  handle_mm_fault+0x24c/0x570
[   51.115958]  do_user_addr_fault+0x226/0xa50
...
Crash analysis showed the file involved was an AIO ring file.

[CAUSE]
	PARENT process		CHILD process
t=0	io_setup(1, &ctx)
	[access ctx addr]
	fork()
	io_destroy
	  vm_munmap // not affect child vma
	  percpu_ref_put
	  ...
	    put_aio_ring_file
t=1				[access ctx addr]	// pagefault
				...
				  __do_fault
				    filemap_fault
				      max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE)
t=2	      truncate_setsize
	        truncate_pagecache
t=3				      filemap_get_folio	// no folio, create folio
				      __filemap_get_folio(..., FGP_CREAT, ...)	// page_not_uptodate
				      filemap_read_folio(file, mapping->a_ops->read_folio, folio) // oops!

At t=0, the parent process calls io_setup and then fork. The child process
gets its own VMA but without any PTEs. The parent then calls io_destroy.
Before i_size is truncated to 0, at t=1 the child process accesses this AIO
ctx address and triggers a pagefault. After the max_idx check passes, at
t=2 the parent calls truncate_setsize and truncate_pagecache. At t=3 the
child fails to obtain the folio, falls into the "page_not_uptodate" path,
and hits this problem because AIO does not implement "read_folio".

[Fix]
Fix this by marking the AIO ring buffer VMA with VM_DONTCOPY so
that fork()'s dup_mmap() skips it entirely. This is the correct
semantic because:

1) The child's ioctx_table is already reset to NULL by mm_init_aio() during
fork(), so the child has no AIO context and no way to perform any AIO
operations on this mapping.
2) The AIO ring VMA is only meaningful in conjunction with its associated
kioctx, which is never inherited across fork(). So a child process with no
AIO context has no legitimate reason to access the ring buffer. Delivering
SIGSEGV on such an erroneous access is preferable to a kernel crash.

Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://patch.msgid.link/20260413010814.548568-1-wozizhi@huawei.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Based on my investigation, I now have enough information to provide a
comprehensive analysis.

## Complete Phase-by-Phase Analysis

### PHASE 1: COMMIT MESSAGE FORENSICS

**Step 1.1 - Subject Line:**
- Record: `[fs: aio]` `[set VMA_DONTCOPY_BIT]` - indicates fix; title
  explicitly says "to fix NULL-pointer-dereference error"

**Step 1.2 - Tags:**
- Record:
  - Signed-off-by: Zizhi Wo (author from Huawei)
  - Link:
    https://patch.msgid.link/20260413010814.548568-1-wozizhi@huawei.com
  - Reviewed-by: Jan Kara (SUSE, MM/filesystems reviewer)
  - Signed-off-by: Christian Brauner (VFS maintainer)
  - No explicit Cc: stable, no Fixes: tag (expected for this evaluation
    pipeline)

**Step 1.3 - Commit Body:**
- Record: The body describes an internal syzkaller-discovered NULL
  pointer deref reproducible by a fork()+io_destroy race. A detailed
  timing diagram shows 4 time steps (t=0..t=3) explaining the race
  between parent's io_destroy() teardown and child's page fault on the
  inherited AIO ring VMA. The kernel crash stack trace shows:
  `do_user_addr_fault -> handle_mm_fault -> __handle_mm_fault ->
  do_pte_missing -> __do_fault -> filemap_fault -> filemap_read_folio` -
  oops at `a_ops->read_folio` (NULL).
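
The interleaving in the timing diagram can be walked through with a small userspace model. This is only a sketch: every `model_*` name and the `FAULT_*` results are hypothetical stand-ins, not kernel types, and the model collapses the real `filemap_fault()` logic into three steps (bounds check at t=1, truncate at t=2, lookup and read at t=3).

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096L

/* Hypothetical stand-ins; none of these are the kernel's own types. */
struct model_aops {
	int (*read_folio)(void *folio);	/* left NULL, as for the AIO ring file */
};

struct model_file {
	long i_size;			/* file size in bytes */
	long cached_pages;		/* pages currently in the page cache */
	const struct model_aops *a_ops;
};

enum fault_result { FAULT_OK, FAULT_SIGBUS, FAULT_NULL_CALL };

/* t=1: the i_size bounds check that runs first in the fault path. */
int model_index_in_bounds(const struct model_file *f, long index)
{
	long max_idx = (f->i_size + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;

	return index < max_idx;
}

/* t=2: the parent's teardown shrinks the file and empties the page cache. */
void model_truncate(struct model_file *f)
{
	f->i_size = 0;
	f->cached_pages = 0;
}

/* t=3: on a cache miss the fault path falls back to ->read_folio. */
enum fault_result model_finish_fault(const struct model_file *f, long index)
{
	if (index < f->cached_pages)
		return FAULT_OK;
	if (f->a_ops->read_folio == NULL)
		return FAULT_NULL_CALL;	/* the kernel would call address 0 here */
	return f->a_ops->read_folio(NULL) == 0 ? FAULT_OK : FAULT_SIGBUS;
}

/* One page fault; truncate_races injects t=2 between t=1 and t=3. */
enum fault_result model_fault(struct model_file *f, long index,
			      int truncate_races)
{
	if (!model_index_in_bounds(f, index))
		return FAULT_SIGBUS;
	if (truncate_races)
		model_truncate(f);
	return model_finish_fault(f, index);
}

/* No race, the racing fault, and a fault after teardown has completed. */
void model_check_interleavings(void)
{
	const struct model_aops aio_like = { .read_folio = NULL };
	struct model_file ring = {
		.i_size = MODEL_PAGE_SIZE, .cached_pages = 1, .a_ops = &aio_like,
	};

	assert(model_fault(&ring, 0, 0) == FAULT_OK);
	assert(model_fault(&ring, 0, 1) == FAULT_NULL_CALL);
	assert(model_fault(&ring, 0, 0) == FAULT_SIGBUS);
}
```

In this model only the middle case is dangerous: the bounds check has already passed when the truncate lands, so the lookup misses and the NULL `read_folio` is reached. A fault that starts after the truncate fails the bounds check instead.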

**Step 1.4 - Hidden bug fixes:**
- Record: Not hidden - the subject explicitly says "to fix NULL-pointer-
  dereference error". This is a clear bug fix.

### PHASE 2: DIFF ANALYSIS

**Step 2.1 - Inventory:**
- Record: One file modified (`fs/aio.c`), 1 line changed (+1/-1), single
  function `aio_ring_mmap_prepare()`. Surgical, minimal scope.

**Step 2.2 - Code flow:**
- Record: Before: VMA created with `VMA_DONTEXPAND_BIT` only. After: VMA
  created with both `VMA_DONTEXPAND_BIT` and `VMA_DONTCOPY_BIT`. Affects
  fork()'s `dup_mmap()` behavior: child will not inherit this VMA.
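
The effect of `VM_DONTCOPY` on fork can be observed from userspace without touching AIO, because `madvise(2)` with `MADV_DONTFORK` sets the same flag on an ordinary mapping. The sketch below is an analogy rather than a reproducer for this bug: it uses an anonymous page instead of the AIO ring, and the helper name `child_inherits_mapping()` is made up for illustration. The child probes the address with `msync(2)`, which fails with `ENOMEM` when the range is not mapped.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous page, optionally mark it MADV_DONTFORK, then fork.
 * Returns 1 if the child still has the page mapped, 0 if it does not,
 * and -1 if any setup step failed.
 */
int child_inherits_mapping(int mark_dontfork)
{
	long page = sysconf(_SC_PAGESIZE);
	void *addr;
	pid_t pid;
	int status;

	if (page <= 0)
		return -1;

	addr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	if (mark_dontfork && madvise(addr, (size_t)page, MADV_DONTFORK) != 0) {
		munmap(addr, (size_t)page);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		munmap(addr, (size_t)page);
		return -1;
	}
	if (pid == 0) {
		/* Child: probe the range without dereferencing it. */
		int mapped = msync(addr, (size_t)page, MS_ASYNC) == 0;

		_exit(mapped ? 1 : 0);
	}

	/* Parent: collect the child's answer and clean up. */
	if (waitpid(pid, &status, 0) != pid) {
		munmap(addr, (size_t)page);
		return -1;
	}
	munmap(addr, (size_t)page);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Without the flag the child inherits the page; with it, it does not. */
void demo_dontfork(void)
{
	assert(child_inherits_mapping(0) == 1);
	assert(child_inherits_mapping(1) == 0);
}
```

With the flag set, a plain load from `addr` in the child would raise SIGSEGV, which is the behavior the commit message prefers over a kernel oops.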

**Step 2.3 - Bug mechanism:**
- Record: Category (h) Hardware-semantic fix / (d) Memory safety.
  Mechanism: Preventing fork()-time VMA duplication of the AIO ring
  buffer, eliminating the race window where child holds a VMA to a ring
  file while parent tears it down.

**Step 2.4 - Fix quality:**
- Record: Obviously correct, minimal, surgical. Risk of regression
  extremely low - the only behavioral change is that child processes
  cannot access the parent's AIO ring (which was never semantically
  valid - see `mm_init_aio()` which already zeros `ioctx_table` in
  child).

### PHASE 3: GIT HISTORY INVESTIGATION

**Step 3.1 - Blame the buggy code:**
- Record: The AIO ring mmap hook is ancient (pre-2.6.12). The `.fault =
  filemap_fault` vm_op was added in the mid-2010s. The fundamental bug
  (fork copies the VMA but the child has no AIO context) has existed
  essentially since the AIO ring was made mappable. Verified via
  `git log --follow fs/aio.c`, which shows that AIO predates the current
  git history (from Linux-2.6.12-rc2).

**Step 3.2 - Follow Fixes: tag:**
- Record: No Fixes: tag. The bug is essentially inherent to the AIO ring
  design from the start.

**Step 3.3 - Related changes:**
- Record: Previously, commit `81e9d6f864765` ("aio: fix mremap after
  fork null-deref", 2023, in v6.3) fixed an adjacent fork+AIO NULL-
  deref. That commit was `Cc: stable` tagged and backported. A follow-up
  commit `3adf7ae18bf42` ("fs: aio: reject partial mremap...") by the
  same author fixes yet another NULL-deref in the same family (also
  reviewed by Jan Kara). These demonstrate a pattern of fork+AIO race
  bugs.

**Step 3.4 - Author:**
- Record: Zizhi Wo is a regular Huawei kernel contributor, working on
  filesystem issues. Also authored the related `3adf7ae18bf42` mremap
  fix.

**Step 3.5 - Dependencies:**
- Record: None. The fix is self-contained. The `VM_DONTCOPY` flag has
  been part of `dup_mmap()` logic for many years (mm/mmap.c), checked
  via `mpnt->vm_flags & VM_DONTCOPY`.

### PHASE 4: MAILING LIST RESEARCH

**Step 4.1 - Original discussion:**
- Record: `b4 dig -c c03ce4173c7bf` found the original submission at
  https://lore.kernel.org/all/20260413010814.548568-1-wozizhi@huawei.com/ -
  v1 only (no later revisions needed). Jan Kara's review comment
  (retrieved via b4 dig -m): "*I agree it would have to be a rather
  contrived setup to rely on AIO ringbuffer being inherited by
  fork(2)... AIO ringbuffer is mostly a legacy thing these days... So
  I'm OK with trying this simple fix and seeing whether somebody
  complains.*" - No NAKs, no stable nomination but no objection to the
  approach.

**Step 4.2 - Reviewers:**
- Record: CC'd: viro (VFS), jack (Jan Kara - MM/FS), brauner (VFS
  maintainer), bcrl (AIO original maintainer), linux-fsdevel, linux-aio,
  yangerkun, chengzhihao1. Plus Jan Kara added Jens Axboe for awareness.
  Appropriate review coverage.

**Step 4.3 - Bug report:**
- Record: Found by Huawei internal syzkaller (fuzzer). Reproducible
  kernel NULL pointer dereference - not theoretical.

**Step 4.4 - Related patches:**
- Record: Follow-up `3adf7ae18bf42` ("fs: aio: reject partial
  mremap...") addresses a related but different NULL-deref in the same
  subsystem. Independent fix.

**Step 4.5 - Stable list history:**
- Record: No explicit stable mailing list discussion found. However, the
  precedent of a fork-related AIO fix being backported (81e9d6f864765)
  supports treating this as stable material.

### PHASE 5: CODE SEMANTIC ANALYSIS

**Step 5.1 - Key functions:**
- Record: `aio_ring_mmap_prepare()` is the only function modified.

**Step 5.2 - Callers:**
- Record: Called by VFS mmap logic via `f_op->mmap_prepare` during
  `mmap()` on the AIO ring file. Reachable from `io_setup(2)` syscall
  via `aio_setup_ring() -> do_mmap(aio_ring_file, ...)`. Reachable by
  any unprivileged process that can do io_setup().

**Step 5.3 - Callees:**
- Record: `vma_desc_set_flags()` - setting VMA flags during mmap
  preparation. No side effects other than flag setting.

**Step 5.4 - Call chain:**
- Record: Bug path reachable from userspace:
  1. User calls `io_setup(2)` -> mmap of AIO ring VMA
  2. User calls `fork(2)` -> child inherits VMA (before this fix)
  3. User (child) touches the VMA address -> triggers fault
  4. User (parent) calls `io_destroy(2)` concurrently -> race triggers
     NULL deref
  All reachable by unprivileged userspace.

**Step 5.5 - Similar patterns:**
- Record: Verified via Grep that `VM_DONTCOPY` is used in several kernel
  subsystems (android/binder.c, KFD, xen, infiniband, etc.) for VMAs
  that shouldn't be inherited by fork. The AIO ring is semantically the
  same class - it's associated with parent-specific kernel state.

### PHASE 6: CROSS-REFERENCING AND STABLE TREE ANALYSIS

**Step 6.1 - Buggy code in stable trees:**
- Record: Verified by examining `fs/aio.c` in each stable tree:
  - `stable/linux-5.10.y`: Uses `vma->vm_flags |= VM_DONTEXPAND;` (no
    VM_DONTCOPY)
  - `stable/linux-5.15.y`: Uses `vma->vm_flags |= VM_DONTEXPAND;`
  - `stable/linux-6.1.y`: Uses `vma->vm_flags |= VM_DONTEXPAND;`
  - `stable/linux-6.6.y`: Uses `vm_flags_set(vma, VM_DONTEXPAND);`
  - `stable/linux-6.12.y`: Uses `vm_flags_set(vma, VM_DONTEXPAND);`
  - `stable/linux-6.17.y`, `6.18.y`, `6.19.y`: Uses `desc->vm_flags |=
    VM_DONTEXPAND;`

  All stable trees are missing VM_DONTCOPY and vulnerable to the bug.

**Step 6.2 - Backport complications:**
- Record: The upstream patch uses `vma_desc_set_flags(desc,
  VMA_DONTEXPAND_BIT, VMA_DONTCOPY_BIT)` which was introduced in 7.0
  (master). For each stable tree, the fix needs adaptation:
  - 5.10-6.1: `vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;`
  - 6.6-6.12: `vm_flags_set(vma, VM_DONTEXPAND | VM_DONTCOPY);`
  - 6.17-6.19: `desc->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;`

  Minor textual adjustment needed but semantically identical.
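
The forms above differ only in how the flags are spelled: the older trees OR together mask constants, while the 7.0 helper takes bit numbers. A small userspace model of that equivalence follows. It is a sketch under assumptions: `model_mk_flags()` and the `MODEL_*` names are hypothetical, bit 17 for DONTCOPY is taken from the `DECLARE_VMA_BIT(DONTCOPY, 17)` line quoted in the verification notes, and bit 18 for DONTEXPAND is assumed from `VM_DONTEXPAND` being `0x00040000` in mainline.

```c
#include <assert.h>
#include <stdarg.h>

/* Bit numbers (see the assumptions stated in the text above). */
enum { MODEL_DONTCOPY_BIT = 17, MODEL_DONTEXPAND_BIT = 18 };

/* Mask-style constants, as used by the pre-7.0 forms. */
#define MODEL_VM_DONTCOPY	(1UL << MODEL_DONTCOPY_BIT)
#define MODEL_VM_DONTEXPAND	(1UL << MODEL_DONTEXPAND_BIT)

/* Hypothetical helper: build a mask from a list of bit numbers. */
unsigned long model_mk_flags(int count, ...)
{
	unsigned long mask = 0;
	va_list ap;
	int i;

	va_start(ap, count);
	for (i = 0; i < count; i++)
		mask |= 1UL << va_arg(ap, int);
	va_end(ap);
	return mask;
}

/* Both spellings should produce the same two-bit mask. */
void model_check_flag_forms(void)
{
	unsigned long by_bits = model_mk_flags(2, MODEL_DONTEXPAND_BIT,
					       MODEL_DONTCOPY_BIT);
	unsigned long by_mask = MODEL_VM_DONTEXPAND | MODEL_VM_DONTCOPY;

	assert(by_bits == by_mask);
}
```

Whichever spelling a given tree uses, the backport has to end up with both bits set on the ring VMA.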

**Step 6.3 - Related fixes in stable:**
- Record: Commit `81e9d6f864765` ("aio: fix mremap after fork null-
  deref") was backported to stable (verified present in
  stable/linux-5.10.y as `c261f798f7baa` and in stable/linux-6.6.y as
  `81e9d6f864765`). That confirms the AIO+fork class of bugs has been
  considered stable-worthy before.

### PHASE 7: SUBSYSTEM AND MAINTAINER CONTEXT

**Step 7.1 - Subsystem:**
- Record: `fs/aio.c` - AIO filesystem interface. IMPORTANT criticality -
  not in the hot path for most users (io_uring is newer), but AIO is
  widely used by legacy applications, databases (Oracle, MySQL), and
  libaio consumers. Still heavily supported.

**Step 7.2 - Activity:**
- Record: AIO is a mature, stable subsystem with low activity (mostly
  maintenance). The bug has likely been present for years without being
  hit because the trigger (fork after io_setup) is uncommon.

### PHASE 8: IMPACT AND RISK ASSESSMENT

**Step 8.1 - Affected users:**
- Record: Any system using AIO where a process calls io_setup() and then
  forks (e.g., databases, async I/O applications with forking). The
  fork+AIO combination is unusual but legitimate.

**Step 8.2 - Trigger conditions:**
- Record: Race between parent's io_destroy() and child's page fault on
  inherited ring. Triggerable by unprivileged userspace. Timing-
  dependent but demonstrated via syzkaller (reproducible). No special
  privileges needed.

**Step 8.3 - Failure mode severity:**
- Record: Kernel NULL pointer dereference - CRITICAL. System crash
  (oops), potential panic depending on config. Triggerable by
  unprivileged userspace - this borders on a security concern (DoS).

**Step 8.4 - Risk-benefit:**
- Record:
  - BENEFIT: Prevents kernel NULL deref crash triggerable by
    unprivileged userspace. HIGH benefit.
  - RISK: 1-line flag change. VM_DONTCOPY means child doesn't inherit
    parent's AIO ring - semantically correct since `mm_init_aio()`
    already clears the child's `ioctx_table`. Verified in kernel/fork.c:
    `mm_init_aio() { mm->ioctx_table = NULL; }`. Very low risk.

### PHASE 9: FINAL SYNTHESIS

**Step 9.1 - Evidence:**
- FOR: Fixes real kernel NULL deref; found by syzkaller (reproducible);
  reviewed by Jan Kara; 1-line surgical change; semantically correct
  (child has no AIO context); precedent of similar fork+AIO fix being
  backported (`81e9d6f864765`); triggerable by unprivileged userspace
  (crash = DoS).
- AGAINST: Requires backport adjustment for older stable trees (API
  evolved from `vma->vm_flags |=` to `vm_flags_set()` to
  `desc->vm_flags` to `vma_desc_set_flags()`). However this is a trivial
  textual adjustment.

**Step 9.2 - Stable rules checklist:**
1. Obviously correct and tested? YES - reviewed by Jan Kara, logic
   verified
2. Fixes a real bug? YES - syzkaller-reproducible NULL deref
3. Important issue? YES - CRITICAL: NULL pointer deref = kernel crash
4. Small and contained? YES - 1 line
5. No new features? YES - just adds a flag
6. Can apply to stable? YES with minor textual adjustment per kernel
   version

**Step 9.3 - Exception categories:**
- Not applicable - this is a direct bug fix for a kernel NULL deref.

**Step 9.4 - Decision:** The fix clearly meets all stable criteria:
obviously correct, fixes a real (syzkaller-reproducible) kernel crash,
very small (1 line), no new APIs. The bug is triggerable by unprivileged
userspace via fork+io_setup/io_destroy race, making it a serious issue
worthy of backport. The only complication is the 4 slightly different
forms needed for different stable trees, but each is a trivial
adaptation.

## Verification

- [Phase 1] Parsed tags: Reviewed-by Jan Kara, Signed-off-by Christian
  Brauner (VFS maintainer), Link to lore patch submission
- [Phase 2] Diff analysis: verified via `git show c03ce4173c7bf` -
  single 1-line change in `aio_ring_mmap_prepare()` adding
  VMA_DONTCOPY_BIT
- [Phase 3] Checked file history `git log --oneline -- fs/aio.c` - aio.c
  is pre-2.6.12 (ancient), the ring mmap code has existed for over a
  decade
- [Phase 3] Found related earlier fix `81e9d6f864765` "aio: fix mremap
  after fork null-deref" with explicit `Cc: <stable@vger.kernel.org>` -
  established precedent
- [Phase 4] `b4 dig -c c03ce4173c7bf -a`: single version (v1) - applied
  as-is, no revisions needed
- [Phase 4] `b4 dig -c c03ce4173c7bf -w`: verified maintainers CC'd
  (viro, jack, brauner, bcrl, linux-fsdevel, linux-aio)
- [Phase 4] `b4 dig -m /tmp/aio_patch.mbox`: Jan Kara's review approved
  the approach, called AIO ring "mostly a legacy thing", no NAKs
- [Phase 5] Verified `VMA_DONTCOPY_BIT` = 17 via
  `DECLARE_VMA_BIT(DONTCOPY, 17)` in include/linux/mm.h;
  `vma_desc_set_flags` expands to `vma_desc_set_flags_mask(desc,
  mk_vma_flags(__VA_ARGS__))`
- [Phase 5] Verified `mm_init_aio` in kernel/fork.c: `mm->ioctx_table =
  NULL;` - child has no AIO context, confirming semantic correctness
- [Phase 5] Verified `VM_DONTCOPY` handling in mm/mmap.c dup_mmap: `if
  (mpnt->vm_flags & VM_DONTCOPY) { ... continue; }` - VMA is skipped
  during fork
- [Phase 6] Read code from each stable tree's `fs/aio.c`:
  - 5.10.y line 369-373: `vma->vm_flags |= VM_DONTEXPAND`
  - 5.15.y line 368-373: `vma->vm_flags |= VM_DONTEXPAND`
  - 6.1.y line 395-400: `vma->vm_flags |= VM_DONTEXPAND`
  - 6.6.y line 395-400: `vm_flags_set(vma, VM_DONTEXPAND)`
  - 6.12.y line 395-400: `vm_flags_set(vma, VM_DONTEXPAND)`
  - 6.17.y/6.18.y/6.19.y line 395-400: `desc->vm_flags |= VM_DONTEXPAND`
  None have VM_DONTCOPY - all are vulnerable.
- [Phase 6] Verified related `81e9d6f864765` is in stable via `git log
  stable/linux-5.10.y` (as `c261f798f7baa`) and stable/linux-6.6.y
- [Phase 7] Subsystem identified as fs/aio (filesystem, async I/O) -
  IMPORTANT criticality level
- [Phase 8] Failure mode: NULL deref at `filemap_read_folio` when
  accessing `a_ops->read_folio` (not implemented by AIO) - confirmed
  from stack trace in commit message
- UNVERIFIED: Whether the original syzkaller reproducer is public
  (Huawei internal testing, report not public)
- UNVERIFIED: Exact date when the bug first became exploitable (depends
  on when filemap_fault path was used for this VMA, which has been
  present since aio ring was mappable - approximately since 2013)

## Conclusion

This is a small, surgical bug fix for a kernel NULL pointer dereference
that can be triggered by unprivileged userspace via a fork+AIO race. The
fix is semantically correct (child has no AIO context, so the VMA
shouldn't be inherited), was reviewed by Jan Kara, and has an
established precedent of related fork+AIO fixes being backported to
stable. The only caveat is that each stable tree needs a minor textual
adaptation due to API evolution (from `vma->vm_flags |=` to
`vm_flags_set()` to `desc->vm_flags |=` to `vma_desc_set_flags()`), but
the one-line semantic change is identical in every case.

**YES**

 fs/aio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index a07bdd1aaaa60..6d436f8b3f349 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -394,7 +394,7 @@ static const struct vm_operations_struct aio_ring_vm_ops = {
 
 static int aio_ring_mmap_prepare(struct vm_area_desc *desc)
 {
-	vma_desc_set_flags(desc, VMA_DONTEXPAND_BIT);
+	vma_desc_set_flags(desc, VMA_DONTEXPAND_BIT, VMA_DONTCOPY_BIT);
 	desc->vm_ops = &aio_ring_vm_ops;
 	return 0;
 }
-- 
2.53.0

