* Linux: DMA-after-unmap race in ZCRX via netif_rxq_cleanup_unlease() ordering inversion (netkit + page_pool)
@ 2026-05-27 22:53 Prénom? Ahmed
2026-05-27 23:33 ` Jakub Kicinski
0 siblings, 1 reply; 2+ messages in thread
From: Prénom? Ahmed @ 2026-05-27 22:53 UTC (permalink / raw)
To: oss-security, netdev, linux-kernel, security
[-- Attachment #1.1.1: Type: text/plain, Size: 1991 bytes --]
Hello,
I would like to report a source-proven teardown ordering bug in the Linux
kernel that can lead to a DMA-after-unmap race condition involving ZCRX
(io_uring zero-copy receive), page_pool, and netkit queue leasing.
**Reporter:** Ahmed Abdelmoemen
**Discovery Date:** 2026-05-26
**Kernel Version:** Linux 7.1.0-rc3
Executive Summary
*A logic error in `netif_rxq_cleanup_unlease()` causes DMA mappings for the
ZCRX memory provider to be revoked **before** the physical NIC RX queue is
stopped. This creates a race window during netkit queue lease teardown
where the physical device's NAPI can consume stale `net_iov` entries from
the page_pool alloc cache containing `dma_addr = 0`.*
The ordering inversion is fully proven at the source level. However, I have
**not** performed runtime verification, so actual memory corruption or
successful DMA to address 0 has **not** been proven — it remains hardware
and driver dependent.
The bug is reachable with `CAP_NET_ADMIN` (common in container
environments) when using netkit with ZCRX.
Root Cause
In `net/core/netdev_rx_queue.c:347-348`:
```c
__netif_mp_uninstall_rxq(virt_rxq, p);  // DMA unmap + dma_addr=0
__netif_mp_close_rxq(...);              // queue stop + NAPI disable (TOO LATE)
```
This inverts the correct ordering used in normal device unregistration and
io_uring close paths (stop first, then unmap).
Impact
- *Potential:* NIC DMA write to physical address 0 (or stale mappings
with lazy IOMMU) leading to memory corruption.
- *Requirements:* CAP_NET_ADMIN + netkit queue leasing + ZCRX installed
on the leased queue.
- *Current Status:* No runtime PoC or crash reproduction yet. The race
window exists in theory but its practical exploitability needs confirmation.
I am attaching the full detailed analysis.
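Proposed Fix
Swap the two calls so that the queue is stopped before the memory provider
is uninstalled (see the attached image.png).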
I am happy to provide more details or assist with testing.
Best regards, Ahmed Abdelmoemen ahmedabdelmoumen05@gmail.com
[-- Attachment #1.1.2: Type: text/html, Size: 3095 bytes --]
[-- Attachment #1.2: image.png --]
[-- Type: image/png, Size: 74833 bytes --]
[-- Attachment #2: zcrx_teardown_race_audit.md --]
[-- Type: text/markdown, Size: 47070 bytes --]
# ZCRX Teardown Race — Source-Level Exploitability Audit
**author**: bohmiiidd / ahmedabdelmoumen05@gmail.com
**Kernel tree**: Linux 7.1.0-rc3 `linux/`
**Date**: 2026-05-26
**Classification**: Security audit — for kernel maintainers, oss-security, CVE triage
---
## Executive Summary
A source-verified ordering inversion in [netif_rxq_cleanup_unlease()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L338-L349) causes the ZCRX memory provider's DMA mappings to be revoked **before** the physical NIC queue is stopped. This creates a race window during queue lease teardown where the NAPI poll loop on the physical device can pull stale `net_iov` entries from the `page_pool` alloc cache with `dma_addr = 0`, potentially causing the NIC to program RX descriptors pointing to physical address zero.
The bug is **source-proven**. The exploitation primitive (NIC DMA write to address zero) is **hardware-dependent and NOT proven by source alone** — it requires runtime verification of driver RX descriptor programming behavior and IOMMU configuration.
---
# Finding 1: Ordering Inversion in `netif_rxq_cleanup_unlease()`
## Claim
`netif_rxq_cleanup_unlease()` calls `__netif_mp_uninstall_rxq()` (DMA unmap) **before** `__netif_mp_close_rxq()` (queue stop), inverting the required teardown ordering.
## Source Verification
[netdev_rx_queue.c:338-349](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L338-L349):
```c
void netif_rxq_cleanup_unlease(struct netdev_rx_queue *phys_rxq,
                               struct netdev_rx_queue *virt_rxq)
{
    struct pp_memory_provider_params *p = &phys_rxq->mp_params;
    unsigned int rxq_idx = get_netdev_rx_queue_index(phys_rxq);

    if (!p->mp_ops)
        return;

    __netif_mp_uninstall_rxq(virt_rxq, p);            // LINE 347: DMA UNMAP
    __netif_mp_close_rxq(phys_rxq->dev, rxq_idx, p);  // LINE 348: QUEUE STOP
}
```
## Exact Execution Path
```
netkit_uninit()                                          [netkit.c:1058-1062]
  → netkit_queue_unlease(dev)                            [netkit.c:418-438]
    → netdev_lock(dev)                                   [netkit.c:427]
    → for each leased queue:
      → netdev_lock(dev_lease)                           [netkit.c:433]
      → netdev_rx_queue_unlease(rxq, rxq_lease)          [netkit.c:434]
        → netif_rxq_cleanup_unlease(rxq_src, rxq_dst)    [netdev_rx_queue.c:31]
          → __netif_mp_uninstall_rxq(virt_rxq, p)        [line 347]
            → io_pp_uninstall(mp_priv, rxq)              [zcrx.c:1171-1182]
              → io_zcrx_unmap_area(ifq, area)            [zcrx.c:282-303]
                → net_mp_niov_set_dma_addr(niov, 0)      [line 294] ← DMA=0
                → dma_unmap_sgtable()                    [line 300] ← mapping revoked

          ⚡ RACE WINDOW: physical NAPI still active ⚡

          → __netif_mp_close_rxq(phys_dev, idx, p)       [line 348]
            → netdev_rx_queue_reconfig()                 [line 302]
              → ndo_queue_stop()                         [line 133] ← NAPI finally stops
      → netdev_unlock(dev_lease)                         [netkit.c:435]
```
## Concurrency Analysis
**Locking state at the race window (between lines 347 and 348):**
| Lock | Held? | Protects what? |
|------|-------|----------------|
| `netdev_lock(netkit_dev)` | YES | Virtual device state |
| `netdev_lock(phys_dev)` | YES | Physical device config, queue mgmt |
| `ifq->pp_lock` (mutex) | NO (released after `io_zcrx_unmap_area` returns) | `area->is_mapped`, DMA mapping |
| NAPI scheduling | **NOT DISABLED** | Nothing — NAPI runs on other CPUs independently |
| `pool->alloc.cache` | **NO LOCK** | Nothing — designed for single-consumer (NAPI) access |
**Critical**: `netdev_lock` is a **management lock**. It serializes queue configuration operations. It does **NOT** serialize with the NAPI poll path. NAPI runs in softirq context on the CPU where the IRQ fires and does not acquire `netdev_lock`. This is confirmed by:
- [__page_pool_get_cached()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L436-L450) comment at line 440: `"Caller MUST guarantee safe non-concurrent access, e.g. softirq"` — relies on NAPI scheduling, not locks
- [page_pool types.h](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/include/net/page_pool/types.h#L210-L218) comment at line 210: `"require driver to protect allocation side"` via NAPI scheduling
**Therefore**: Between lines 347 and 348 of `netif_rxq_cleanup_unlease()`, the physical NIC's NAPI can execute concurrently on another CPU.
## Lifetime Analysis
**`pool->mp_ops` and `pool->mp_priv` lifetime:**
[page_pool_init()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L281-L282):
```c
pool->mp_priv = rxq->mp_params.mp_priv; // line 281
pool->mp_ops = rxq->mp_params.mp_ops; // line 282
```
[io_pp_uninstall()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1171-L1182):
```c
static void io_pp_uninstall(void *mp_priv, struct netdev_rx_queue *rxq)
{
    struct pp_memory_provider_params *p = &rxq->mp_params;
    // ...
    p->mp_ops = NULL;   // line 1180: clears rxq->mp_params.mp_ops
    p->mp_priv = NULL;  // line 1181: clears rxq->mp_params.mp_priv
}
```
`io_pp_uninstall()` clears `rxq->mp_params.mp_ops/mp_priv` but does **NOT** clear `pool->mp_ops/mp_priv`. These are independent copies. After `io_pp_uninstall()`, the page pool still has valid function pointers into `io_uring_pp_zc_ops` and can still call ZCRX allocation callbacks.
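The "independent copies" point can be illustrated with a small userspace model. This is only a sketch: `struct mp_params`, `struct pool_model`, and the helper names are hypothetical stand-ins, not the kernel definitions, and the placeholder objects merely play the role of `io_uring_pp_zc_ops` and the `ifq`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the two fields matter. */
struct mp_params {
    const void *mp_ops;
    void *mp_priv;
};

struct pool_model {
    const void *mp_ops;
    void *mp_priv;
};

/* Shaped like the copy in page_pool_init(): values are copied, not shared. */
static void pool_model_init(struct pool_model *pool, const struct mp_params *p)
{
    pool->mp_priv = p->mp_priv;
    pool->mp_ops = p->mp_ops;
}

/* Shaped like the tail of io_pp_uninstall(): only the rxq-side copy is cleared. */
static void rxq_params_clear(struct mp_params *p)
{
    p->mp_ops = NULL;
    p->mp_priv = NULL;
}

/* Returns 1 if the pool still holds both provider pointers after the clear. */
int pool_keeps_provider_after_uninstall(void)
{
    static const int fake_ops = 0;  /* placeholder for io_uring_pp_zc_ops */
    int fake_ifq = 0;               /* placeholder for the ifq */
    struct mp_params rxq_params = { &fake_ops, &fake_ifq };
    struct pool_model pool;

    pool_model_init(&pool, &rxq_params);
    rxq_params_clear(&rxq_params);

    assert(rxq_params.mp_ops == NULL && rxq_params.mp_priv == NULL);
    return pool.mp_ops == &fake_ops && pool.mp_priv == &fake_ifq;
}
```

Calling `pool_keeps_provider_after_uninstall()` returns 1 in this model: clearing the rxq-side parameters has no effect on what the pool copied earlier.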
**`pool->alloc.cache` lifetime:**
Cache entries are not flushed by `io_pp_uninstall()`. They persist until [page_pool_empty_alloc_cache_once()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1157-L1172) is called from `page_pool_scrub()` → `page_pool_release()` → `page_pool_destroy()`, which happens when the driver destroys the page pool (potentially much later).
**`area->nia.niovs` (the net_iov array) lifetime:**
The niovs array is freed in [io_zcrx_free_area()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L410-L424) at line 421. This happens during `io_zcrx_ifq_free()` → `io_zcrx_free_area()`, which may be deferred until the ifq refcount drops to zero. The alloc cache entries point to these niovs. If the niovs are freed while entries remain in the cache, that's a use-after-free on the niov structures themselves (separate from the DMA issue).
## What Is Actually Proven
1. **PROVEN**: `netif_rxq_cleanup_unlease()` calls DMA unmap before queue stop — source lines 347-348
2. **PROVEN**: `io_zcrx_unmap_area()` zeroes DMA addresses in all niovs (line 294) and revokes the IOMMU mapping (lines 300-301), all while holding `ifq->pp_lock`
3. **PROVEN**: `io_pp_uninstall()` does NOT flush `pool->alloc.cache` — no such code exists in [zcrx.c:1171-1182](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1171-L1182)
4. **PROVEN**: `__page_pool_get_cached()` is lockless and accesses `pool->alloc.cache` directly — [page_pool.c:436-450](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L436-L450)
5. **PROVEN**: NAPI on the physical device is NOT stopped until `__netif_mp_close_rxq()` calls `netdev_rx_queue_reconfig()` → `ndo_queue_stop()` — [netdev_rx_queue.c:302](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L302) → [line 133](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L133)
6. **PROVEN**: `netdev_lock` does NOT synchronize with NAPI softirq execution
7. **PROVEN**: Stale cache entries (netmem_refs to niovs with dma_addr=0) survive in `pool->alloc.cache` after DMA unmap
8. **PROVEN**: `page_pool_empty_alloc_cache_once()` comment explicitly states concurrent access is forbidden — [page_pool.c:1164-1167](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1164-L1167)
## What Is NOT Proven
1. **NOT PROVEN**: That any specific NIC driver will program DMA address 0 into RX descriptors without validation — this is driver-specific behavior
2. **NOT PROVEN**: That NAPI will actually pull from the alloc cache during the race window — this is a timing/scheduling question requiring runtime verification
3. **NOT PROVEN**: That physical address 0 is writable on the target platform — architecture-dependent
4. **NOT PROVEN**: The exact payload content that reaches physical address 0 — depends on NIC DMA engine behavior
5. **NOT PROVEN**: That IOMMU TLB invalidation timing creates an additional exploitable sub-race — hardware-dependent
6. **NOT PROVEN**: Privilege escalation from memory corruption at address 0 — would require further exploit development
## Exploitability Assessment
**STALE STATE / RACE → POTENTIAL MEMORY CORRUPTION (hardware-dependent)**
The source proves:
- A race window exists (ordering inversion)
- Stale entries with `dma_addr=0` can persist in the alloc cache
- NAPI can consume these entries concurrently
The source does NOT prove:
- That memory corruption actually occurs (driver-dependent)
- That the corruption is exploitable (IOMMU-dependent, platform-dependent)
## Required Runtime Verification
```
# 1. Verify NAPI pulls stale entries
echo 1 > /sys/kernel/debug/tracing/events/page_pool/page_pool_state_hold/enable
# Kprobe on __page_pool_get_cached to log DMA addresses of returned netmems
# 2. Verify DMA address 0 reaches NIC descriptors
# Driver-specific tracepoints or kprobes on RX ring refill functions
# 3. Verify NIC writes to address 0
# Enable CONFIG_DMA_API_DEBUG
# Monitor DMAR fault logs: dmesg | grep -i "DMAR\|dma.*fault"
# 4. Reproduce the race
# Set up netkit with queue lease + ZCRX
# Send sustained traffic to keep NAPI active
# Tear down the netkit interface
# Monitor for kernel crashes or DMA faults
```
## Final Verdict
The ordering inversion in `netif_rxq_cleanup_unlease()` is a **source-proven logic bug** that creates a race window where stale page_pool alloc cache entries with zeroed DMA addresses can be consumed by NAPI on the physical device. The bug is deterministic in the source — the inverted call order is unconditional. Whether this leads to memory corruption depends on driver behavior (RX descriptor programming) and hardware configuration (IOMMU). The fix (swapping lines 347-348) is trivially correct and has no side effects.
---
# Finding 2: `io_pp_uninstall()` Missing Alloc Cache Flush
## Claim
`io_pp_uninstall()` unmaps DMA without flushing the page pool's alloc cache, leaving stale entries.
## Source Verification
[io_pp_uninstall()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1171-L1182):
```c
static void io_pp_uninstall(void *mp_priv, struct netdev_rx_queue *rxq)
{
    struct pp_memory_provider_params *p = &rxq->mp_params;
    struct io_zcrx_ifq *ifq = mp_priv;

    io_zcrx_drop_netdev(ifq);                // drops netdev reference
    if (ifq->area)
        io_zcrx_unmap_area(ifq, ifq->area);  // zeros DMA, revokes mapping

    p->mp_ops = NULL;                        // clears rxq mp_ops
    p->mp_priv = NULL;                       // clears rxq mp_priv
}
```
**No call to `page_pool_empty_alloc_cache_once()` or any equivalent.** The function has no access to the `struct page_pool *` — it only receives `mp_priv` (the `ifq`) and `rxq`.
## Concurrency Analysis
[io_zcrx_unmap_area()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L282-L303) holds `ifq->pp_lock` mutex (line 287).
[__page_pool_get_cached()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L436-L450) does **NOT** acquire `ifq->pp_lock`. It accesses `pool->alloc.cache` directly.
[io_pp_zc_alloc_netmems()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1094-L1117) — the ZCRX allocation callback — also does **NOT** acquire `ifq->pp_lock`. It accesses `ifq` via `pp->mp_priv`.
**Therefore**: `ifq->pp_lock` provides **NO** protection against concurrent alloc cache consumption by NAPI.
## What Is Actually Proven
1. **PROVEN**: `io_pp_uninstall()` does not flush `pool->alloc.cache` — no such code exists
2. **PROVEN**: `io_pp_uninstall()` has no access to the page_pool object — it receives only `mp_priv` and `rxq`
3. **PROVEN**: `ifq->pp_lock` does not protect the alloc cache access path
4. **PROVEN**: After `io_zcrx_unmap_area()`, the niovs in the alloc cache have `dma_addr = 0`
## What Is NOT Proven
1. **NOT PROVEN**: Whether adding a cache flush to `io_pp_uninstall()` would be safe from a concurrency standpoint (the comment at [page_pool.c:1164-1167](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1164-L1167) says flush "cannot be called concurrently" with allocations)
## Exploitability Assessment
**LOGIC BUG — defense-in-depth gap**
This is a secondary finding. Even if the ordering in Finding 1 were correct, an `io_pp_uninstall()` implementation that doesn't flush the cache is fragile and violates the alloc cache's invariant that all entries have valid DMA mappings. However, with correct ordering (queue stopped before DMA unmap), the cache entries would not be consumed because NAPI is no longer running.
## Final Verdict
`io_pp_uninstall()` violates the invariant that all alloc cache entries have valid DMA mappings. This is a defense-in-depth gap that becomes exploitable only in combination with Finding 1 (the ordering inversion). Standalone, it is a latent bug.
---
# Finding 3: `__page_pool_get_cached()` Lockless Consumer
## Claim
`__page_pool_get_cached()` accesses the alloc cache without locks, relying solely on NAPI scheduling guarantees for single-consumer access.
## Source Verification
[page_pool.c:436-450](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L436-L450):
```c
static netmem_ref __page_pool_get_cached(struct page_pool *pool)
{
    netmem_ref netmem;

    /* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
    if (likely(pool->alloc.count)) {
        /* Fast-path */
        netmem = pool->alloc.cache[--pool->alloc.count];
        alloc_stat_inc(pool, fast);
    } else {
        netmem = page_pool_refill_alloc_cache(pool);
    }

    return netmem;
}
```
## Concurrency Analysis
The comment at line 440 explicitly states the caller MUST guarantee non-concurrent access. This guarantee is provided by NAPI scheduling — a single NAPI instance is scheduled on exactly one CPU at a time.
**However**, during the race window in Finding 1, this invariant is NOT violated from the page_pool's perspective — NAPI is still the sole consumer. The problem is that the **data** inside the cache entries (specifically, `niov->dma_addr`) has been mutated by the teardown path on another CPU, without the knowledge of the NAPI consumer.
This is not a concurrency violation on the cache data structure itself. It is a **stale data** problem: the teardown path modifies the DMA address field of objects that are referenced by the cache, without draining the cache first.
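The stale-data effect can be reproduced in a single-threaded userspace model. This is a sketch under simplifying assumptions: `struct niov_model`, `struct cache_model`, and all function names are hypothetical, and the two CPUs are replaced by running the teardown step and then the consumer step in sequence, which is one of the interleavings the race window allows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SIZE 4

struct niov_model {
    uint64_t dma_addr;
};

struct cache_model {
    size_t count;
    struct niov_model *cache[CACHE_SIZE];
};

/* LIFO pop, shaped like the fast path of __page_pool_get_cached(). */
static struct niov_model *cache_pop(struct cache_model *c)
{
    if (c->count == 0)
        return NULL;
    return c->cache[--c->count];
}

/* Teardown-side mutation: zero every niov without touching the cache. */
static void unmap_area(struct niov_model *niovs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        niovs[i].dma_addr = 0;
}

/* Returns the dma_addr the consumer observes when teardown ran first. */
uint64_t consumer_sees_after_unmap(void)
{
    struct niov_model niovs[CACHE_SIZE];
    struct cache_model c = { 0 };

    /* Fill the cache with pointers to niovs that carry non-zero addresses. */
    for (size_t i = 0; i < CACHE_SIZE; i++) {
        niovs[i].dma_addr = 0x1000u * (i + 1);
        c.cache[c.count++] = &niovs[i];
    }

    unmap_area(niovs, CACHE_SIZE);

    struct niov_model *n = cache_pop(&c);
    assert(n != NULL && c.count == CACHE_SIZE - 1);
    return n->dma_addr;
}
```

The cache bookkeeping stays consistent (the pop succeeds and the count drops by one), yet the entry handed to the consumer carries a zeroed address. That is the distinction drawn above between a data-structure race and stale data.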
## What Is Actually Proven
1. **PROVEN**: `__page_pool_get_cached()` has no locks — source lines 436-450
2. **PROVEN**: The function trusts that returned netmem entries have valid DMA addresses
3. **PROVEN**: No validation of `dma_addr` after retrieving from cache
## Exploitability Assessment
**NOT A BUG in isolation** — the lockless design is correct under its stated assumptions. The bug is that Finding 1 violates the assumption that DMA addresses remain valid while entries are in the cache.
---
# Finding 4: `page_pool_destroy()` Deferred Cache Drain
## Claim
`page_pool_destroy()` may defer `page_pool_empty_alloc_cache_once()` via `schedule_delayed_work()`, widening the window during which stale entries exist in the cache.
## Source Verification
[page_pool_destroy()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1309-L1329):
```c
void page_pool_destroy(struct page_pool *pool)
{
    if (!page_pool_put(pool))
        return;

    page_pool_disable_direct_recycling(pool);
    page_pool_free_frag(pool);

    if (!page_pool_release(pool))  // calls page_pool_scrub() → page_pool_empty_alloc_cache_once()
        return;

    // If inflight pages remain:
    page_pool_detached(pool);
    INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
    schedule_delayed_work(&pool->release_dw, DEFER_TIME);  // DEFER_TIME = 1000ms
}
```
[DEFER_TIME](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L37): `#define DEFER_TIME (msecs_to_jiffies(1000))`
## Concurrency Analysis
`page_pool_destroy()` is called by the NIC driver when it tears down the RX queue (inside `ndo_queue_stop` → driver teardown → `page_pool_destroy()`). This happens AFTER the queue is stopped (NAPI disabled).
By the time `page_pool_destroy()` runs, NAPI is no longer active, so the alloc cache drain in `page_pool_empty_alloc_cache_once()` is safe per its stated requirements.
**However**, in the race scenario of Finding 1, `page_pool_destroy()` is called during `__netif_mp_close_rxq()` → `netdev_rx_queue_reconfig()` → `ndo_queue_stop()` → driver destroys old page pool. This happens at line 348 of `netif_rxq_cleanup_unlease()` — AFTER the DMA has already been unmapped at line 347.
The deferred destruction (`schedule_delayed_work`) only applies if there are inflight pages that haven't been returned yet. The first call to `page_pool_release()` does drain the cache immediately via `page_pool_scrub()`.
## What Is Actually Proven
1. **PROVEN**: Final release of the pool (not the alloc cache drain) is deferred via delayed work with a 1000 ms delay if inflight pages exist, see [page_pool.c:1327-1328](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1327-L1328)
2. **PROVEN**: The initial `page_pool_scrub()` drains the cache in the first `page_pool_release()` call — [page_pool.c:1179](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1179)
3. **PROVEN**: By the time `page_pool_destroy()` runs (inside queue stop), NAPI is disabled
## What Is NOT Proven
The claim that deferral widens the attack window for Finding 1 is not supported by the source. The cache is drained in the first `page_pool_release()` call, and NAPI is disabled by then. The 1-second deferral only affects the return of inflight pages to the page allocator.
## Exploitability Assessment
**NOT A BUG** — The deferred destruction does not contribute to the race in Finding 1. The race window is the period between lines 347 and 348 of `netif_rxq_cleanup_unlease()`, not the page_pool destruction delay.
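The control flow behind this verdict can be sketched as a userspace model. The struct and function names below are hypothetical; the model keeps only three counters and assumes, as described above, that the scrub in the first release call drains the alloc cache regardless of whether inflight pages remain.

```c
#include <assert.h>

struct pool_model {
    int cached;          /* entries in the alloc cache */
    int inflight;        /* pages still held outside the pool */
    int retries_queued;  /* times the delayed work was scheduled */
};

/* Modeled on page_pool_release(): scrub first, then return the inflight count. */
static int pool_release(struct pool_model *p)
{
    assert(p->inflight >= 0);
    p->cached = 0;       /* the scrub drains the alloc cache unconditionally */
    return p->inflight;  /* 0 means the pool can be freed right away */
}

/* Modeled on page_pool_destroy(): defer only when inflight pages remain. */
static void pool_destroy(struct pool_model *p)
{
    if (!pool_release(p))
        return;
    p->retries_queued++; /* stands in for schedule_delayed_work() */
}

/* Alloc cache occupancy right after pool_destroy() returns. */
int cached_after_destroy(int cached, int inflight)
{
    struct pool_model p = { cached, inflight, 0 };

    pool_destroy(&p);
    return p.cached;
}

/* Number of deferred retries queued by pool_destroy(). */
int retries_after_destroy(int cached, int inflight)
{
    struct pool_model p = { cached, inflight, 0 };

    pool_destroy(&p);
    return p.retries_queued;
}
```

In this model `cached_after_destroy(8, 2)` is 0 even though a retry is queued, which matches the reasoning that the deferral concerns inflight pages and not the alloc cache.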
---
# Finding 5: Safe Teardown Paths (Comparison)
## Device Unregistration Path
[unregister_netdevice_many_notify()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/dev.c#L12345-L12473):
```
1. netif_close_many() [line 12385/12391] — stops all NAPI
2. synchronize_net() [line 12402] — RCU barrier
3. dev_shutdown() [line 12409] — driver cleanup
4. dev_memory_provider_uninstall() [line 12412] — calls io_pp_uninstall()
```
**Ordering**: NAPI stopped (step 1) → barrier (step 2) → DMA unmap (step 4). **SAFE**.
## io_uring Close Path
[io_close_queue()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L550-L574):
```
1. netif_mp_close_rxq() [line 568] — stops queue via ndo_queue_stop
2. io_zcrx_scrub() [line 659] — reclaims user buffers
3. io_zcrx_free_area() [line 581] — calls io_zcrx_unmap_area() → DMA unmap
```
**Ordering**: Queue stopped (step 1) → DMA unmap (step 3). **SAFE**.
## Queue Lease Teardown Path (VULNERABLE)
[netif_rxq_cleanup_unlease()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L338-L349):
```
1. __netif_mp_uninstall_rxq() [line 347] — DMA unmap
2. __netif_mp_close_rxq() [line 348] — queue stop
```
**Ordering**: DMA unmap (step 1) → queue stop (step 2). **INVERTED — VULNERABLE**.
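The three orderings can be compared mechanically with a small checker. This is an illustrative userspace sketch: the `enum step` values and the path arrays are hypothetical reductions of the step lists above to their stop and unmap events, and the rule encoded is only the one stated in this finding (a queue stop must precede any DMA unmap).

```c
#include <assert.h>
#include <stddef.h>

enum step { STEP_QUEUE_STOP, STEP_DMA_UNMAP, STEP_OTHER };

/* Returns 1 if every DMA unmap in the sequence is preceded by a queue stop. */
int teardown_order_is_safe(const enum step *steps, size_t n)
{
    int stopped = 0;

    assert(steps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (steps[i] == STEP_QUEUE_STOP)
            stopped = 1;
        else if (steps[i] == STEP_DMA_UNMAP && !stopped)
            return 0;
    }
    return 1;
}

/* Device unregistration: close, barrier, shutdown, provider uninstall. */
int unregister_path_is_safe(void)
{
    static const enum step path[] = {
        STEP_QUEUE_STOP, STEP_OTHER, STEP_OTHER, STEP_DMA_UNMAP,
    };
    return teardown_order_is_safe(path, sizeof(path) / sizeof(path[0]));
}

/* io_uring close: queue stop, scrub, free area (unmap). */
int io_uring_close_path_is_safe(void)
{
    static const enum step path[] = {
        STEP_QUEUE_STOP, STEP_OTHER, STEP_DMA_UNMAP,
    };
    return teardown_order_is_safe(path, sizeof(path) / sizeof(path[0]));
}

/* Queue lease teardown: uninstall (unmap) first, then queue stop. */
int unlease_path_is_safe(void)
{
    static const enum step path[] = { STEP_DMA_UNMAP, STEP_QUEUE_STOP };
    return teardown_order_is_safe(path, sizeof(path) / sizeof(path[0]));
}
```

Only the lease teardown sequence fails the check. Swapping its two elements makes it pass, which corresponds to the reordering proposed for `netif_rxq_cleanup_unlease()`.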
---
# Finding 6: Trigger Path — netkit Queue Lease Teardown
## Claim
The race is reachable via netkit virtual interface teardown.
## Source Verification
[netkit_uninit()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/drivers/net/netkit.c#L1058-L1062):
```c
static void netkit_uninit(struct net_device *dev)
{
    netkit_release_all(dev);
    netkit_queue_unlease(dev);  // ← triggers the vulnerable path
}
```
Called from [unregister_netdevice_many_notify()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/dev.c#L12441-L12442):
```c
if (dev->netdev_ops->ndo_uninit)
    dev->netdev_ops->ndo_uninit(dev);  // → netkit_uninit()
```
[netkit_queue_unlease()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/drivers/net/netkit.c#L418-L438):
```c
static void netkit_queue_unlease(struct net_device *dev)
{
    if (dev->real_num_rx_queues == 1)
        return;

    netdev_lock(dev);
    for (i = 1; i < dev->real_num_rx_queues; i++) {
        rxq = __netif_get_rx_queue(dev, i);
        rxq_lease = rxq->lease;
        dev_lease = rxq_lease->dev;

        netdev_lock(dev_lease);
        netdev_rx_queue_unlease(rxq, rxq_lease);  // → netif_rxq_cleanup_unlease()
        netdev_unlock(dev_lease);
    }
    netdev_unlock(dev);
}
```
## Exact Execution Ordering in `unregister_netdevice_many_notify()`
When a netkit virtual device is being unregistered:
| Step | Code Location | Action | Physical Device State |
|------|--------------|--------|----------------------|
| 1 | [dev.c:12375-12391](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/dev.c#L12375-L12391) | `netif_close_many()` — closes the **netkit** device | Physical device: **STILL RUNNING** |
| 2 | [dev.c:12402](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/dev.c#L12402) | `synchronize_net()` | Physical device: **STILL RUNNING** |
| 3 | [dev.c:12412](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/dev.c#L12412) | `dev_memory_provider_uninstall(dev)` on **netkit** device — its own queues have no mp_params (those are on the physical device via lease) | Physical device: **STILL RUNNING** |
| 4 | [dev.c:12441-12442](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/dev.c#L12441-L12442) | `ndo_uninit` → `netkit_uninit()` → `netkit_queue_unlease()` → `netif_rxq_cleanup_unlease()` | Physical device: **STILL RUNNING until `ndo_queue_stop` inside `__netif_mp_close_rxq`** |
> [!CAUTION]
> The physical NIC is NOT in the unregister list. Only the netkit virtual device is being unregistered. The physical NIC's NAPI continues running throughout steps 1-3, and into step 4 until `__netif_mp_close_rxq()` calls `ndo_queue_stop()`.
## Reachability from Containers
[io_register_zcrx()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L842-L843):
```c
if (!capable(CAP_NET_ADMIN))
    return -EPERM;
```
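Creating a netkit interface requires `CAP_NET_ADMIN` in the network namespace. In container environments, `CAP_NET_ADMIN` is often granted within the container's own network namespace, for example via an explicit `--cap-add=NET_ADMIN`; Docker's default capability set does not include it. Note also that `capable()`, unlike `ns_capable()`, checks the capability against the initial user namespace, so the ZCRX registration check above is not satisfied by a capability held only inside a non-initial user namespace.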
**Source-proven requirement**: `CAP_NET_ADMIN` is sufficient to create netkit interfaces and install ZCRX.
**NOT proven**: Whether a container with `CAP_NET_ADMIN` in its own netns can lease queues from a physical NIC — this depends on network namespace configuration and the physical device's namespace assignment.
## Exploitability Assessment
**STALE STATE / RACE — reachable from userspace with CAP_NET_ADMIN**
---
# Finding 7: DMA Address Zero Propagation
## Claim
After `io_zcrx_unmap_area()`, NAPI pulls cached niovs with `dma_addr = 0`, and drivers program NIC RX descriptors with this address.
## Source Verification
[net_mp_niov_set_dma_addr()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L1348-L1351):
```c
bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr)
{
    return page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), addr);
}
```
This sets the DMA address in the netmem structure. When NAPI later calls `page_pool_get_dma_addr_netmem(netmem)` to retrieve the DMA address for programming the NIC's RX ring, it reads the zeroed value.
## What Is Actually Proven
1. **PROVEN**: `io_zcrx_unmap_area()` zeroes DMA addresses in all niovs — [zcrx.c:294](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L294)
2. **PROVEN**: The zeroing is done on the same `net_iov` structures referenced by the alloc cache entries
3. **PROVEN**: No memory barrier or synchronization between the zeroing and NAPI's cache read
4. **PROVEN**: `page_pool_get_dma_addr_netmem()` reads the DMA address from the netmem structure directly
## What Is NOT Proven
1. **NOT PROVEN**: Whether any specific NIC driver validates DMA addresses before programming RX descriptors — this is a per-driver property. Most drivers are expected to trust page_pool-provided addresses.
2. **NOT PROVEN**: Whether `dma_addr = 0` would cause a DMA write to physical address 0 — on x86 without IOMMU, `dma_addr` IS the physical address; with IOMMU, address 0 in IOVA space may not be mapped.
3. **NOT PROVEN**: Whether the race window is wide enough for NAPI to actually consume a stale entry — this is a scheduling/timing question.
## Exploitability Assessment
**POTENTIAL MEMORY CORRUPTION — hardware and driver dependent**
The source proves stale entries can exist in the cache with `dma_addr = 0`. Whether these are consumed by NAPI and whether the result is memory corruption requires:
- Runtime verification (NAPI scheduling during the window)
- Driver-specific analysis (RX descriptor programming)
- Hardware configuration (IOMMU)
---
# Finding 8: DMA-API-DEBUG Applicability for ZCRX
## Claim
`CONFIG_DMA_API_DEBUG` catches the stale DMA issue.
## Source Verification
For ZCRX niovs (net_iov backed), the release path is [io_pp_zc_release_netmem()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1119-L1130):
```c
static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
{
    struct net_iov *niov;

    if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
        return false;

    niov = netmem_to_net_iov(netmem);
    net_mp_niov_clear_page_pool(niov);
    io_zcrx_return_niov_freelist(niov);
    return false;  // returns false → page_pool_return_netmem does NOT call put_page
}
```
This callback does **NOT** call `dma_unmap_page_attrs()`. ZCRX manages its own DMA mappings via `dma_map_sgtable()` / `dma_unmap_sgtable()` at the area level, not per-niov via page_pool's DMA infrastructure.
[page_pool_return_netmem()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/page_pool.c#L778-L803):
```c
static void page_pool_return_netmem(struct page_pool *pool, netmem_ref netmem)
{
    put = true;
    if (pool->mp_ops)
        put = pool->mp_ops->release_netmem(pool, netmem);  // → io_pp_zc_release_netmem
    else
        __page_pool_release_netmem_dma(pool, netmem);       // ← NOT called for ZCRX
    // ...
}
```
Since `pool->mp_ops` is set for ZCRX, `__page_pool_release_netmem_dma()` (which calls `dma_unmap_page_attrs()`) is **NOT** called. Therefore `CONFIG_DMA_API_DEBUG` will **NOT** catch double-unmap for ZCRX net_iov entries.
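The branch that decides which release path runs can be modeled in userspace. This is a sketch with hypothetical names (`struct pool_model`, `struct mp_ops_model`, and the helpers are not kernel definitions); it only counts which of the two paths is taken and performs no real DMA operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pool_model;

struct mp_ops_model {
    bool (*release_netmem)(struct pool_model *pool);
};

struct pool_model {
    const struct mp_ops_model *mp_ops;
    int provider_releases;  /* calls into the provider callback */
    int pp_dma_unmaps;      /* calls into the page_pool per-netmem unmap */
};

/* Plays the role of io_pp_zc_release_netmem(): no unmap, returns false. */
static bool zc_release_model(struct pool_model *pool)
{
    pool->provider_releases++;
    return false;
}

/* Same branch shape as page_pool_return_netmem(). */
static void return_netmem_model(struct pool_model *pool)
{
    if (pool->mp_ops)
        (void)pool->mp_ops->release_netmem(pool);
    else
        pool->pp_dma_unmaps++;
}

/* Returns how many page_pool-side unmaps one release triggers. */
int pp_dma_unmaps_for(int with_provider)
{
    static const struct mp_ops_model zc_ops = {
        .release_netmem = zc_release_model,
    };
    struct pool_model pool = { .mp_ops = with_provider ? &zc_ops : NULL };

    return_netmem_model(&pool);
    /* Exactly one of the two paths must have run. */
    assert(pool.provider_releases + pool.pp_dma_unmaps == 1);
    return pool.pp_dma_unmaps;
}
```

With a provider installed the page_pool-side unmap counter stays at zero, so any debug hook attached to that path has nothing to observe for ZCRX entries.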
## What Is Actually Proven
1. **PROVEN**: ZCRX niovs bypass `__page_pool_release_netmem_dma()` — the `mp_ops->release_netmem` callback is used instead
2. **PROVEN**: `io_pp_zc_release_netmem()` does not call `dma_unmap_page_attrs()`
3. **PROVEN**: `CONFIG_DMA_API_DEBUG` will NOT fire on the page_pool per-netmem unmap path for ZCRX entries, because that path is never taken
## Exploitability Assessment
**NOT A BUG** — this is a factual observation about debugging tool limitations.
---
# Finding 9: `io_zcrx_ring_refill()` Unconditional Head Advancement
## Claim
`zcrx_next_rqe()` unconditionally advances `rq->cached_head`, and the head is published via `smp_store_release()` regardless of whether individual entries were successfully processed — potentially consuming user-space entries that were never handled.
## Source Verification
[zcrx_next_rqe()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1002-L1007):
```c
static struct io_uring_zcrx_rqe *zcrx_next_rqe(struct zcrx_rq *rq, unsigned mask)
{
    unsigned int idx = rq->cached_head++ & mask;  // ← UNCONDITIONAL increment

    return &rq->rqes[idx];
}
```
[io_zcrx_ring_refill()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1032-L1073) — head published at line 1071:
```c
smp_store_release(&rq->ring->head, rq->cached_head); // ← publishes ALL advances
```
The `do { ... } while (--entries)` loop at lines 1048-1069 calls `zcrx_next_rqe()` for each entry (line 1049), which increments `cached_head` **before** any validation. If `io_parse_rqe()` fails (line 1053) or `io_zcrx_put_niov_uref()` fails (line 1055) or `page_pool_unref_and_test()` fails (line 1059), the loop `continue`s but the head is already advanced past that entry.
## Concurrency Analysis
All accesses to `rq->cached_head` are protected by `rq->lock` (spinlock_bh) — acquired at line 1041. This is correct for the data structure. The issue is semantic, not a concurrency bug.
## Lifetime Analysis
When `io_parse_rqe()` fails (invalid offset, bad area_idx, bad padding):
- The RQE slot is consumed from the ring (head advanced)
- The niov referenced by the bad RQE is **NOT** returned to any pool or freelist
- The user sees the head advance and believes the entry was processed
- If the offset pointed to a valid niov, that niov's user_ref is **NOT** decremented
When `io_zcrx_put_niov_uref()` fails (user_ref already 0 — double-return):
- Same as above: slot consumed, niov not recycled
- This is actually a **guard** against double-free — correct behavior
When `page_pool_unref_and_test()` returns false (other references held):
- The niov has other users, so not recycling it is correct
- Slot consumed — correct, the user has given up their reference
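The head-advancement behaviour described in these cases can be sketched as a userspace model. The names are hypothetical, the per-entry checks are collapsed into a single validity flag standing in for `io_parse_rqe()`, and the release-store is replaced by a plain assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_ENTRIES 4u

struct rq_model {
    unsigned int cached_head;
    unsigned int published_head;
    bool rqe_valid[RING_ENTRIES];  /* stand-in for io_parse_rqe() succeeding */
};

/* Same shape as zcrx_next_rqe(): the increment precedes any validation. */
static unsigned int next_rqe_idx(struct rq_model *rq, unsigned int mask)
{
    return rq->cached_head++ & mask;
}

/* Returns the number of entries recycled; publishes the head once at the end. */
static unsigned int ring_refill(struct rq_model *rq, unsigned int entries)
{
    unsigned int recycled = 0;

    while (entries--) {
        unsigned int idx = next_rqe_idx(rq, RING_ENTRIES - 1);

        assert(idx < RING_ENTRIES);
        if (!rq->rqe_valid[idx])
            continue;  /* slot consumed, nothing recycled */
        recycled++;
    }
    rq->published_head = rq->cached_head;  /* models smp_store_release() */
    return recycled;
}

/* Ring contents for the demo: GOOD, BAD, GOOD, GOOD. */
static struct rq_model demo_ring(void)
{
    struct rq_model rq = { 0, 0, { true, false, true, true } };
    return rq;
}

unsigned int refill_demo_recycled(void)
{
    struct rq_model rq = demo_ring();
    return ring_refill(&rq, RING_ENTRIES);
}

unsigned int refill_demo_head(void)
{
    struct rq_model rq = demo_ring();
    (void)ring_refill(&rq, RING_ENTRIES);
    return rq.published_head;
}
```

For a ring holding one malformed entry among four, the model recycles three entries while the published head moves past all four slots.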
## What Is Actually Proven
1. **PROVEN**: `zcrx_next_rqe()` increments `cached_head` unconditionally — [line 1004](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1004)
2. **PROVEN**: Head is published regardless of per-entry success — [line 1071](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1071)
3. **PROVEN**: Malformed RQEs are consumed (head advances past them) but the referenced niovs are not recycled
## Exploitability Assessment
**NOT A BUG** — this is intentional ring buffer design.
Consuming malformed entries prevents a hostile or buggy userspace from stalling the ring with permanently-unprocessable entries. If the head did NOT advance past bad entries, a single malformed RQE would permanently block the ring, causing a denial of service against the NAPI refill path.
The "lost" niov reference for malformed RQEs is a correctness concern (resource leak for invalid user input), but not a security vulnerability. The niov remains allocated but unreturnable until the area is torn down. This is bounded by the area size and only affects the misbehaving process's own resources.
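The stall argument can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: `model_ring`, `consume_advancing()` and `consume_blocking()` are hypothetical names, and a single `valid` flag stands in for everything `io_parse_rqe()` checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ENTRIES 4u /* hypothetical ring size; must be a power of two */
static_assert((MODEL_ENTRIES & (MODEL_ENTRIES - 1)) == 0,
              "index masking needs a power-of-two ring");

struct model_ring {
    bool valid[MODEL_ENTRIES]; /* stand-in for the io_parse_rqe() result */
    uint32_t head;             /* consumer position */
    uint32_t tail;             /* producer position, written by userspace */
};

/* Policy described above: a bad entry is consumed and skipped. */
static unsigned consume_advancing(struct model_ring *r)
{
    unsigned ok = 0;

    while (r->head != r->tail) {
        bool good = r->valid[r->head++ & (MODEL_ENTRIES - 1)];

        if (good)
            ok++;
    }
    return ok;
}

/* Hypothetical alternative: refuse to move the head past a bad entry. */
static unsigned consume_blocking(struct model_ring *r)
{
    unsigned ok = 0;

    while (r->head != r->tail) {
        if (!r->valid[r->head & (MODEL_ENTRIES - 1)])
            break; /* head stays put, so every later call stops here too */
        r->head++;
        ok++;
    }
    return ok;
}
```

With the entries [good, bad, good, good], the advancing consumer drains the ring in one pass. The blocking consumer stops at index 1 on every call, so the two trailing valid entries are never reached.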
## Final Verdict
Unconditional head advancement is the correct design for a producer-consumer ring with untrusted producer entries. No security impact.
---
# Finding 10: `zcrx_parse_rq()` break-vs-continue Inconsistency
## Claim
`io_zcrx_ring_refill()` uses `continue` on `io_parse_rqe()` failure (skips bad entry, processes rest), while `zcrx_parse_rq()` uses `break` (stops at first bad entry). This creates inconsistent head advancement semantics and potentially different resource leak behavior.
## Source Verification
[io_zcrx_ring_refill()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1053-L1054) — **NAPI refill path**:
```c
if (!io_parse_rqe(rqe, ifq, &niov))
continue; // ← skip bad entry, keep processing
```
[zcrx_parse_rq()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1204-L1205) — **user flush path**:
```c
if (!io_parse_rqe(rqe, zcrx, &niov))
break; // ← stop processing at first bad entry
```
## Exact Behavioral Difference
| Scenario: 4 entries [GOOD, BAD, GOOD, GOOD] | `io_zcrx_ring_refill()` | `zcrx_parse_rq()` |
|----------------------------------------------|------------------------|-------------------|
| Entries processed | 3 (skips BAD) | 1 (stops at BAD) |
| Head advancement | 4 (all consumed) | 2 (GOOD + BAD consumed) |
| Remaining in ring | 0 | 2 (the trailing GOODs) |
**Critical difference in head advancement for `zcrx_parse_rq()`**:
At [line 1201](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1201), `zcrx_next_rqe(rq, mask)` increments `cached_head` (via [line 1004](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1004)). Then at [line 1204-1205](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1204-L1205), `io_parse_rqe()` fails and `break` exits the loop. At [line 1209](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1209), `smp_store_release(&rq->ring->head, rq->cached_head)` publishes the head including the failed entry's slot.
So: **the bad entry IS consumed from the ring** (head advances past it), but the function returns `i` which does NOT include the bad entry. The caller (`zcrx_return_buffers`) only processes `i` valid entries. The niov referenced by the bad entry is NOT processed at all — not returned, not ref-decremented.
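The numbers in the table can be reproduced with a userspace model of the two loops. This is a hedged sketch: `model_rq`, `model_refill()` and `model_parse()` are hypothetical stand-ins that copy only the control flow quoted above (`continue` versus `break` after an unconditional head increment), and a `valid` flag replaces `io_parse_rqe()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ENTRIES 4u /* hypothetical ring size; must be a power of two */
static_assert((MODEL_ENTRIES & (MODEL_ENTRIES - 1)) == 0,
              "index masking needs a power-of-two ring");

struct model_rq {
    bool valid[MODEL_ENTRIES]; /* stand-in for the io_parse_rqe() result */
    uint32_t cached_head;      /* consumer-private head */
    uint32_t published_head;   /* stand-in for rq->ring->head */
};

/* Same shape as zcrx_next_rqe(): the head moves before validation. */
static bool model_next_valid(struct model_rq *rq)
{
    return rq->valid[rq->cached_head++ & (MODEL_ENTRIES - 1)];
}

/* Control flow of the NAPI refill path: skip a bad entry, keep going. */
static unsigned model_refill(struct model_rq *rq, unsigned entries)
{
    unsigned processed = 0;

    while (entries--) {
        if (!model_next_valid(rq))
            continue;
        processed++;
    }
    rq->published_head = rq->cached_head;
    return processed;
}

/* Control flow of the user flush path: stop at the first bad entry. */
static unsigned model_parse(struct model_rq *rq, unsigned entries)
{
    unsigned i;

    for (i = 0; i < entries; i++) {
        if (!model_next_valid(rq))
            break; /* the head has already moved past the bad slot */
    }
    rq->published_head = rq->cached_head;
    return i;
}
```

Running both on [GOOD, BAD, GOOD, GOOD] gives 3 processed entries and a published head of 4 for the refill model, and 1 processed entry and a published head of 2 for the parse model.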
## What Is Actually Proven
1. **PROVEN**: Different control flow on parse failure — `continue` vs `break`
2. **PROVEN**: Both paths advance the head past bad entries (via `zcrx_next_rqe`)
3. **PROVEN**: `zcrx_parse_rq()` stops processing at first bad entry, leaving subsequent valid entries unprocessed until the next batch
4. **PROVEN**: Bad entries in `zcrx_parse_rq()` are consumed without processing — the niov is not returned to the freelist
## What Is NOT Proven
1. **NOT PROVEN**: Whether this inconsistency is intentional (different design goals for NAPI vs flush paths)
2. **NOT PROVEN**: Whether the niov "leak" from bad entries causes any observable problem (area teardown reclaims all niovs anyway via `io_zcrx_scrub`)
## Exploitability Assessment
**LOGIC BUG ONLY — no security impact**
The `break` vs `continue` inconsistency is at most a logic bug, not a vulnerability. Each choice has a plausible rationale, although the intent is not confirmed from source:
- **`io_zcrx_ring_refill()`** (NAPI path): Uses `continue` to maximize buffer refill throughput. A single malformed entry should not stall NIC packet reception. This is the performance-critical path.
- **`zcrx_parse_rq()`** (user flush path): Uses `break` to stop processing at corrupt data. This is conservative — if the ring data is corrupted, stopping early is defensible.
The niov "leak" from bad entries in either path is bounded and recovered during area teardown via `io_zcrx_scrub()` at [zcrx.c:634-653](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L634-L653), which force-reclaims all user references.
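Why teardown bounds the leak can be sketched as follows. This is a simplified model, assuming that reclaim amounts to atomically taking whatever user references remain on each buffer. `model_niov` and `model_scrub()` are hypothetical names, and any page_pool reference accounting done by the real `io_zcrx_scrub()` is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_niov {
    atomic_uint user_refs; /* references handed out to userspace */
    bool on_freelist;      /* true once the buffer is reusable again */
};

/*
 * Teardown-time reclaim: take every outstanding user reference,
 * whether or not userspace ever returned it through the ring.
 */
static size_t model_scrub(struct model_niov *niovs, size_t count)
{
    size_t reclaimed = 0;

    assert(niovs != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        unsigned nr = atomic_exchange(&niovs[i].user_refs, 0);

        if (!nr)
            continue; /* nothing was outstanding for this buffer */
        niovs[i].on_freelist = true;
        reclaimed++;
    }
    return reclaimed;
}
```

Because the exchange clears the counter, a second pass over the same buffers reclaims nothing, which is the sense in which the leak is bounded by the area size.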
A potential improvement would be to make `zcrx_parse_rq()` use `continue` like `io_zcrx_ring_refill()`, or to explicitly return the unreferenced niov for bad entries. But this is a code quality concern, not a security issue.
## Final Verdict
Behavioral inconsistency between two consumers of the same ring. Not a security vulnerability. The `break` in `zcrx_parse_rq()` is a conservative choice that may leave valid entries unprocessed in a batch (they'll be picked up in the next iteration of `zcrx_flush_rq()`'s `do...while` loop at [line 1242-1255](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/io_uring/zcrx.c#L1242-L1255)).
---
# Impact Analysis by IOMMU Configuration
| IOMMU Config | Consequence of DMA to address 0 | Severity | Source Evidence |
|-------------|-------------------------------|----------|-----------------|
| None (`iommu=off` or `iommu=pt`) | NIC writes packet payload to physical address 0. On x86, this is real-mode IVT/BIOS data. Content is attacker-controlled (network packet). | **Critical** (if race is hit) | Architecture-dependent — NOT provable from kernel source alone |
| Lazy (`iommu.strict=0`) | IOMMU TLB may still contain old mapping after `dma_unmap_sgtable()`. If NAPI reads original DMA address before zeroing, NIC writes to the original physical page which may have been freed/reused. | **Critical** (DMA-UAF, if sub-race is hit) | Hardware-dependent — NOT provable from kernel source alone |
| Strict (`iommu.strict=1`) | IOMMU blocks DMA to unmapped address 0. Triggers DMAR fault → NIC reset or machine check. | **Medium** (DoS) | IOMMU specification behavior |
> [!WARNING]
> These severity ratings assume the race window is actually hit. The race is real in the source but its exploitability in practice requires runtime verification.
---
# Root Cause and Recommended Fixes
## Root Cause
The fundamental bug is the inverted call order in [netif_rxq_cleanup_unlease()](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L347-L348). DMA mappings are revoked while the physical NIC queue (and its NAPI) is still active. All other teardown paths in the kernel get this ordering correct.
## Fix A: Swap Call Order (Root Cause Fix)
```diff
 void netif_rxq_cleanup_unlease(struct netdev_rx_queue *phys_rxq,
                                struct netdev_rx_queue *virt_rxq)
 {
     struct pp_memory_provider_params *p = &phys_rxq->mp_params;
     unsigned int rxq_idx = get_netdev_rx_queue_index(phys_rxq);
     if (!p->mp_ops)
         return;
-    __netif_mp_uninstall_rxq(virt_rxq, p);
-    __netif_mp_close_rxq(phys_rxq->dev, rxq_idx, p);
+    __netif_mp_close_rxq(phys_rxq->dev, rxq_idx, p);
+    __netif_mp_uninstall_rxq(virt_rxq, p);
 }
```
**Correctness argument**: After `__netif_mp_close_rxq()`, the physical queue is stopped (`ndo_queue_stop()` called), NAPI is disabled, and `memset(&rxq->mp_params, 0, ...)` clears the memory provider. Then `__netif_mp_uninstall_rxq()` can safely unmap DMA because no concurrent consumer exists. This matches the ordering used in both `unregister_netdevice_many_notify()` (stop first, then uninstall) and `io_close_queue()` (close queue first, then free area).
**Potential concern**: `__netif_mp_close_rxq()` performs a WARN check at [line 294-296](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L294-L296) that `rxq->mp_params.mp_ops == old_p->mp_ops`, then clears `mp_params` with `memset` at [line 299](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L299). In the swapped version `p` still points at `phys_rxq->mp_params`, so once `__netif_mp_close_rxq()` returns, `p->mp_ops` reads as NULL. The `if (p->mp_ops && p->mp_ops->uninstall)` check at [line 326](file:///c:/Users/ahmed/Desktop/research/linux-master/linux-master/net/core/netdev_rx_queue.c#L326) in `__netif_mp_uninstall_rxq()` then fails, and the uninstall is skipped entirely.
> [!IMPORTANT]
> **Fix A requires adjustment**: The `pp_memory_provider_params` must be saved in a local copy before calling `__netif_mp_close_rxq()`, because `__netif_mp_close_rxq()` zeroes `phys_rxq->mp_params` via `memset`. The corrected fix:
```diff
 void netif_rxq_cleanup_unlease(struct netdev_rx_queue *phys_rxq,
                                struct netdev_rx_queue *virt_rxq)
 {
-    struct pp_memory_provider_params *p = &phys_rxq->mp_params;
+    struct pp_memory_provider_params p = phys_rxq->mp_params;
     unsigned int rxq_idx = get_netdev_rx_queue_index(phys_rxq);
-    if (!p->mp_ops)
+    if (!p.mp_ops)
         return;
-    __netif_mp_uninstall_rxq(virt_rxq, p);
-    __netif_mp_close_rxq(phys_rxq->dev, rxq_idx, p);
+    __netif_mp_close_rxq(phys_rxq->dev, rxq_idx, &p);
+    __netif_mp_uninstall_rxq(virt_rxq, &p);
 }
```
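The aliasing problem that motivates the local copy can be shown in isolation. The following is a minimal userspace sketch, not the kernel implementation: every `model_*` and `teardown_*` name is hypothetical, `model_close()` copies only the "check, stop, memset" behaviour described above, and `model_uninstall()` reduces the uninstall step to its `mp_ops` test.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct model_mp_ops { int placeholder; };

struct model_mp_params {
    const struct model_mp_ops *mp_ops;
};

struct model_rxq {
    struct model_mp_params mp_params;
    bool running;
};

static const struct model_mp_ops model_ops = { 0 };

/* Stop the queue, then clear its provider params, as described above. */
static void model_close(struct model_rxq *rxq, const struct model_mp_params *old_p)
{
    assert(rxq->mp_params.mp_ops == old_p->mp_ops); /* analogue of the WARN check */
    rxq->running = false;
    memset(&rxq->mp_params, 0, sizeof(rxq->mp_params));
}

/* Returns true if the provider's uninstall hook would actually run. */
static bool model_uninstall(const struct model_mp_params *p)
{
    return p->mp_ops != NULL;
}

/* Swapped order with a pointer into the queue: uninstall is skipped. */
static bool teardown_aliased(struct model_rxq *rxq)
{
    struct model_mp_params *p = &rxq->mp_params;

    model_close(rxq, p);
    return model_uninstall(p); /* p now points at zeroed memory */
}

/* Swapped order with a local copy: uninstall still sees mp_ops. */
static bool teardown_copied(struct model_rxq *rxq)
{
    struct model_mp_params p = rxq->mp_params;

    model_close(rxq, &p);
    return model_uninstall(&p);
}
```

Starting from a queue whose `mp_ops` points at `model_ops`, `teardown_aliased()` returns false and `teardown_copied()` returns true, while both leave the queue stopped with cleared params.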
## Fix B: Defense-in-Depth — Flush Cache in `io_pp_uninstall()`
This fix requires an API addition to allow `io_pp_uninstall()` to access the page pool's alloc cache. Currently, `io_pp_uninstall()` has no access to the `struct page_pool *`. A possible approach:
1. Add a `struct page_pool *` parameter to the `.uninstall` callback, or
2. Provide a `page_pool_get_from_rxq()` accessor
This is a defense-in-depth measure. It does NOT fix the root cause and has its own concurrency concerns (flushing the cache while NAPI may be running violates `page_pool_empty_alloc_cache_once()`'s stated requirements).
---
# Professional Vulnerability Writeup
## Title
Linux kernel: DMA-after-unmap race in ZCRX page_pool teardown via queue leasing (netif_rxq_cleanup_unlease ordering inversion)
## Affected Subsystems
- `net/core/netdev_rx_queue.c` — `netif_rxq_cleanup_unlease()`
- `io_uring/zcrx.c` — `io_pp_uninstall()`, `io_zcrx_unmap_area()`
- `net/core/page_pool.c` — `__page_pool_get_cached()`, alloc cache
- `drivers/net/netkit.c` — `netkit_queue_unlease()` (trigger)
## Technical Root Cause
`netif_rxq_cleanup_unlease()` (introduced with the queue leasing API) calls `__netif_mp_uninstall_rxq()` before `__netif_mp_close_rxq()`, inverting the required DMA-lifecycle ordering. This causes the ZCRX memory provider's DMA mappings to be revoked (via `io_zcrx_unmap_area()`) while the physical NIC's NAPI poll loop is still active and capable of consuming stale alloc cache entries with zeroed DMA addresses.
## Race Timeline
```
CPU 0 (teardown thread)                                           CPU 1 (NAPI softirq on physical NIC)
───────────────────────                                           ────────────────────────────────────
netkit_uninit()
  netkit_queue_unlease()
    netif_rxq_cleanup_unlease()
      __netif_mp_uninstall_rxq()
        io_pp_uninstall()
          io_zcrx_unmap_area()
            net_mp_niov_set_dma_addr(0)  ← DMA addresses zeroed
            dma_unmap_sgtable()          ← IOMMU mapping revoked
                                                                  napi_poll()
                                                                    driver_rx_refill()
                                                                      page_pool_alloc_netmems()
                                                                        __page_pool_get_cached()
                                                                          ← returns stale netmem (dma=0)
                                                                      page_pool_get_dma_addr_netmem()
                                                                        ← reads dma_addr = 0
                                                                      driver programs NIC RX desc with addr 0
                                                                  NIC receives packet → DMA write to addr 0
      __netif_mp_close_rxq()
        netdev_rx_queue_reconfig()
          ndo_queue_stop()               ← NAPI finally disabled (TOO LATE)
```
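The timeline can be replayed deterministically in a single-threaded model. This is only a sketch of the ordering argument, not a reproduction of the race: `model_queue`, `step_uninstall()`, `step_close()` and `model_napi_poll()` are hypothetical names, one cached entry stands in for the page_pool alloc cache, and the poll is injected at the point where the window would be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_queue {
    bool napi_enabled;        /* false once the queue has been stopped */
    uint64_t cached_dma_addr; /* one entry of the alloc cache */
};

enum model_order { UNINSTALL_THEN_CLOSE, CLOSE_THEN_UNINSTALL };

/* Provider uninstall: the cached entry's DMA address is zeroed. */
static void step_uninstall(struct model_queue *q)
{
    q->cached_dma_addr = 0;
}

/* Queue close: NAPI can no longer run for this queue. */
static void step_close(struct model_queue *q)
{
    q->napi_enabled = false;
}

/* Returns true if the poll programmed a descriptor with address 0. */
static bool model_napi_poll(const struct model_queue *q, uint64_t *desc)
{
    assert(desc != NULL);

    if (!q->napi_enabled)
        return false; /* queue stopped: nothing is refilled */
    *desc = q->cached_dma_addr;
    return *desc == 0;
}

/* Runs the two teardown steps with one poll injected between them. */
static bool teardown_hits_stale_entry(enum model_order order)
{
    struct model_queue q = { .napi_enabled = true, .cached_dma_addr = 0x1000 };
    uint64_t desc = 0;
    bool stale;

    if (order == UNINSTALL_THEN_CLOSE) {
        step_uninstall(&q);
        stale = model_napi_poll(&q, &desc); /* the race window */
        step_close(&q);
    } else {
        step_close(&q);
        stale = model_napi_poll(&q, &desc); /* no consumer is left */
        step_uninstall(&q);
    }
    return stale;
}
```

In this model only the `UNINSTALL_THEN_CLOSE` order lets the poll observe a zeroed address. Whether a real NAPI poll lands in that window is exactly what still needs runtime verification.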
## Source References
| Component | File | Lines | Verified |
|-----------|------|-------|----------|
| Ordering inversion | `net/core/netdev_rx_queue.c` | 347-348 | ✓ |
| DMA address zeroing | `io_uring/zcrx.c` | 282-303 | ✓ |
| Missing cache flush | `io_uring/zcrx.c` | 1171-1182 | ✓ |
| Lockless cache access | `net/core/page_pool.c` | 436-450 | ✓ |
| Cache flush requirements | `net/core/page_pool.c` | 1157-1172 | ✓ |
| pool->mp_ops independence | `net/core/page_pool.c` | 281-282 | ✓ |
| Trigger via netkit | `drivers/net/netkit.c` | 418-438, 1058-1062 | ✓ |
| Safe comparison (dev unreg) | `net/core/dev.c` | 12375-12412 | ✓ |
| Safe comparison (io_uring close) | `io_uring/zcrx.c` | 550-581 | ✓ |
## Exploitability Limitations
1. **Requires CAP_NET_ADMIN** — needed for both netkit interface creation and ZCRX registration
2. **Requires queue leasing setup** — netkit must lease a physical queue, and ZCRX must be installed on that leased queue
3. **Timing-dependent** — NAPI must consume from the alloc cache during the race window (the window is small, but plausibly reachable while traffic is flowing)
4. **Driver-dependent** — the NIC driver must not validate DMA addresses before programming RX descriptors
5. **IOMMU-dependent** — without IOMMU, DMA to address 0 corrupts memory; with strict IOMMU, results in DoS only
## Realistic Threat Model
The most realistic scenario involves high-performance container networking:
- Container has `CAP_NET_ADMIN` in its own network namespace
- netkit is used for container networking (its intended use case)
- ZCRX is used for high-performance zero-copy receive (its intended use case)
- Queue leasing connects virtual and physical interfaces
- Container teardown triggers the race
**Pre-conditions not verified from source**: Whether a container can actually establish a queue lease to a physical NIC from its own network namespace.
## Proof Requirements Still Missing
1. **Runtime PoC** demonstrating NAPI consuming a stale entry during the race window
2. **Driver-specific analysis** confirming no DMA address validation in RX ring refill
3. **Platform-specific analysis** confirming writability of physical address 0
4. **IOMMU TLB timing** measurements for the lazy IOMMU sub-race
5. **End-to-end exploit** demonstrating privilege escalation from the DMA corruption
## Recommended Actions
1. **Apply Fix A** (corrected version with local copy) — swaps the call order in `netif_rxq_cleanup_unlease()` to match all other teardown paths
2. **Consider Fix B** — add alloc cache flush to `io_pp_uninstall()` as defense-in-depth (requires API change)
3. **Audit all callers** of `netdev_rx_queue_unlease()` — currently only `drivers/net/netkit.c:434`
4. **Add lockdep assertion** — `__netif_mp_uninstall_rxq()` should assert that the NAPI for the affected queue is not scheduled
## CVSS Assessment
**Conservative**: CVSS 5.3 (Medium) — Local/High complexity/Low privilege (CAP_NET_ADMIN required)/DoS plausible from source, memory corruption unproven
**If runtime PoC demonstrates memory corruption without IOMMU**: CVSS 7.2 (High) — Local/High complexity/High privilege/High impact
The source proves the bug exists and the race window is real. The severity depends on runtime conditions that cannot be determined from source analysis alone.
* Re: Linux: DMA-after-unmap race in ZCRX via netif_rxq_cleanup_unlease() ordering inversion (netkit + page_pool)
2026-05-27 22:53 Linux: DMA-after-unmap race in ZCRX via netif_rxq_cleanup_unlease() ordering inversion (netkit + page_pool) Prénom? Ahmed
@ 2026-05-27 23:33 ` Jakub Kicinski
0 siblings, 0 replies; 2+ messages in thread
From: Jakub Kicinski @ 2026-05-27 23:33 UTC (permalink / raw)
To: Prénom? Ahmed; +Cc: netdev, linux-kernel, Daniel Borkmann, David Wei
Dropping security lists, security lists are for private discussions,
it's utterly pointless to CC both them and LKML. Not to mention
that this bug only exists in -rc kernels.
Adding relevant developers. Moving security@ to Bcc
On Wed, 27 May 2026 23:53:45 +0100 Prénom? Ahmed wrote:
> Hello,
>
> I would like to report a source-proven teardown ordering bug in the Linux
> kernel that can lead to a DMA-after-unmap race condition involving ZCRX
> (io_uring zero-copy receive), page_pool, and netkit queue leasing.
>
> ***Reporter:** Ahmed Abdelmoemen **Discovery Date:** 2026-05-26 **Kernel
> Version:** Linux 7.1.0-rc3*
>
> Executive Summary
>
> *A logic error in `netif_rxq_cleanup_unlease()` causes DMA mappings for the
> ZCRX memory provider to be revoked **before** the physical NIC RX queue is
> stopped. This creates a race window during netkit queue lease teardown
> where the physical device's NAPI can consume stale `net_iov` entries from
> the page_pool alloc cache containing `dma_addr = 0`.*
>
> The ordering inversion is fully proven at the source level. However, I have
> **not** performed runtime verification, so actual memory corruption or
> successful DMA to address 0 has **not** been proven — it remains hardware
> and driver dependent.
>
> The bug is reachable with `CAP_NET_ADMIN` (common in container
> environments) when using netkit with ZCRX.
>
> Root Cause
>
> In `net/core/netdev_rx_queue.c:347-348`:
>
> ```c __netif_mp_uninstall_rxq(virt_rxq, p); // DMA unmap + dma_addr=0
> __netif_mp_close_rxq(...); // queue stop + NAPI disable (TOO LATE)
>
> This inverts the correct ordering used in normal device unregistration and
> io_uring close paths (stop first, then unmap).
> Impact
>
> - *Potential:* NIC DMA write to physical address 0 (or stale mappings
> with lazy IOMMU) leading to memory corruption.
> - *Requirements:* CAP_NET_ADMIN + netkit queue leasing + ZCRX installed
> on the leased queue.
> - *Current Status:* No runtime PoC or crash reproduction yet. The race
> window exists in theory but its practical exploitability needs confirmation.
>
> I am attaching the full detailed analysis.
> Proposed Fix[image: image.png]
>
> I am happy to provide more details or assist with testing.