* [PATCH bpf v3 1/2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
2026-02-13 3:40 [PATCH bpf v3 0/2] bpf: cpumap/devmap: fix per-CPU bulk queue races on PREEMPT_RT Jiayuan Chen
@ 2026-02-13 3:40 ` Jiayuan Chen
2026-02-13 3:40 ` [PATCH bpf v3 2/2] bpf: devmap: fix race in bq_xmit_all " Jiayuan Chen
2026-02-17 7:43 ` [PATCH bpf v3 0/2] bpf: cpumap/devmap: fix per-CPU bulk queue races " Sebastian Andrzej Siewior
2 siblings, 0 replies; 5+ messages in thread
From: Jiayuan Chen @ 2026-02-13 3:40 UTC
To: xxx
Cc: jiayuan.chen, jiayuan.chen, syzbot+2b3391f44313b3983e91,
Alexei Starovoitov, Daniel Borkmann, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Sebastian Andrzej Siewior, Clark Williams,
Steven Rostedt, Thomas Gleixner, netdev, bpf, linux-kernel,
linux-rt-devel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. On PREEMPT_RT, however,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption.
The CFS scheduler can therefore preempt a task in the middle of
bq_flush_to_queue(), allowing another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
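For reference, this is roughly what the PREEMPT_RT path does, heavily
simplified (a sketch of the behaviour, not the literal kernel/softirq.c
source; the counter name is illustrative):

  /* PREEMPT_RT, PREEMPT_RT_NEEDS_BH_LOCK=n -- simplified sketch */
  void local_bh_disable(void)
  {
          migrate_disable();              /* pin task to this CPU...  */
          current->softirq_disable_cnt++; /* ...but stay preemptible: */
                                          /* no preempt_disable(), no */
                                          /* per-CPU BH lock taken    */
  }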
This leads to several races:
1. Double __list_del_clearprev(): after bq->count is reset in
   bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
   bq_flush_to_queue() on the same bq once bq->count reaches
   CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
   on the same bq->flush_node; the second call dereferences the
   prev pointer that the first call already set to NULL.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
   the packet queue while bq_flush_to_queue() is processing it (see
   the pre-fix bq_enqueue() sketch below).
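For context, the enqueue path before this fix looks roughly like the
following (simplified from kernel/bpf/cpumap.c; the race annotations
are mine):

  static void bq_enqueue(struct bpf_cpu_map_entry *rcpu,
                         struct xdp_frame *xdpf)
  {
          struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);

          if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
                  bq_flush_to_queue(bq);  /* may be preempted mid-flush */

          bq->q[bq->count++] = xdpf;      /* races with a preempted flusher */

          if (!bq->flush_node.prev) {     /* NULL prev == not on flush list */
                  struct list_head *flush_list =
                          bpf_net_ctx_get_cpu_map_flush_list();

                  list_add(&bq->flush_node, flush_list);
          }
  }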
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
  Task A (xdp_do_flush)              Task B (cpu_map_enqueue)
  ---------------------              ------------------------
  bq_flush_to_queue(bq)
    spin_lock(&q->producer_lock)
    /* flush bq->q[] to ptr_ring */
    bq->count = 0
    spin_unlock(&q->producer_lock)
  <-- CFS preempts Task A -->
                                     bq_enqueue(rcpu, xdpf)
                                       bq->q[bq->count++] = xdpf
                                       /* ... more enqueues until full ... */
                                       bq_flush_to_queue(bq)
                                         spin_lock(&q->producer_lock)
                                         /* flush to ptr_ring */
                                         spin_unlock(&q->producer_lock)
                                         __list_del_clearprev(flush_node)
                                         /* sets flush_node.prev = NULL */
  <-- Task A resumes -->
  __list_del_clearprev(flush_node)
    flush_node.prev->next = ...
    /* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh(), which on non-RT is
a pure lockdep annotation with no overhead and on PREEMPT_RT provides
a per-CPU sleeping lock that serializes access to the bq; its rough
expansion is sketched below.
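Rough expansion of the primitive, simplified from
include/linux/local_lock.h (a sketch of the semantics, not the literal
implementation):

  /* !PREEMPT_RT: lockdep annotation only, no code without lockdep */
  local_lock_nested_bh(&rcpu->bulkq->bq_lock);

  /* PREEMPT_RT: boils down to a per-CPU rtmutex-based spinlock, so a
   * second task scheduled on this CPU blocks here until the first
   * task leaves the critical section:
   */
  spin_lock(this_cpu_ptr(&rcpu->bulkq->bq_lock));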
To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run the reproducer
provided by syzkaller. The debug change is roughly (illustrative hunk,
not part of this series):
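  --- a/kernel/bpf/cpumap.c
  +++ b/kernel/bpf/cpumap.c
  @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
   	bq->count = 0;
   	spin_unlock(&q->producer_lock);
   
  +	/* debug only: widen the race window so CFS preempts us here */
  +	mdelay(100);
  +
   	__list_del_clearprev(&bq->flush_node);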
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
kernel/bpf/cpumap.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 04171fbc39cb..32b43cb9061b 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,6 +29,7 @@
#include <linux/sched.h>
#include <linux/workqueue.h>
#include <linux/kthread.h>
+#include <linux/local_lock.h>
#include <linux/completion.h>
#include <trace/events/xdp.h>
#include <linux/btf_ids.h>
@@ -52,6 +53,7 @@ struct xdp_bulk_queue {
struct list_head flush_node;
struct bpf_cpu_map_entry *obj;
unsigned int count;
+ local_lock_t bq_lock;
};
/* Struct for every remote "destination" CPU in map */
@@ -451,6 +453,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
for_each_possible_cpu(i) {
bq = per_cpu_ptr(rcpu->bulkq, i);
bq->obj = rcpu;
+ local_lock_init(&bq->bq_lock);
}
/* Alloc queue */
@@ -722,6 +725,8 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
struct ptr_ring *q;
int i;
+ lockdep_assert_held(&bq->bq_lock);
+
if (unlikely(!bq->count))
return;
@@ -749,11 +754,15 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
}
/* Runs under RCU-read-side, plus in softirq under NAPI protection.
- * Thus, safe percpu variable access.
+ * Thus, safe percpu variable access. PREEMPT_RT relies on
+ * local_lock_nested_bh() to serialise access to the per-CPU bq.
*/
static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
{
- struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
+ struct xdp_bulk_queue *bq;
+
+ local_lock_nested_bh(&rcpu->bulkq->bq_lock);
+ bq = this_cpu_ptr(rcpu->bulkq);
if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
bq_flush_to_queue(bq);
@@ -774,6 +783,8 @@ static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
list_add(&bq->flush_node, flush_list);
}
+
+ local_unlock_nested_bh(&rcpu->bulkq->bq_lock);
}
int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf,
@@ -810,7 +821,9 @@ void __cpu_map_flush(struct list_head *flush_list)
struct xdp_bulk_queue *bq, *tmp;
list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+ local_lock_nested_bh(&bq->obj->bulkq->bq_lock);
bq_flush_to_queue(bq);
+ local_unlock_nested_bh(&bq->obj->bulkq->bq_lock);
/* If already running, costs spin_lock_irqsave + smb_mb */
wake_up_process(bq->obj->kthread);
--
2.43.0
* [PATCH bpf v3 2/2] bpf: devmap: fix race in bq_xmit_all on PREEMPT_RT
2026-02-13 3:40 [PATCH bpf v3 0/2] bpf: cpumap/devmap: fix per-CPU bulk queue races on PREEMPT_RT Jiayuan Chen
2026-02-13 3:40 ` [PATCH bpf v3 1/2] bpf: cpumap: fix race in bq_flush_to_queue " Jiayuan Chen
@ 2026-02-13 3:40 ` Jiayuan Chen
2026-02-17 7:42 ` Sebastian Andrzej Siewior
2026-02-17 7:43 ` [PATCH bpf v3 0/2] bpf: cpumap/devmap: fix per-CPU bulk queue races " Sebastian Andrzej Siewior
2 siblings, 1 reply; 5+ messages in thread
From: Jiayuan Chen @ 2026-02-13 3:40 UTC
To: xxx
Cc: jiayuan.chen, jiayuan.chen, Sebastian Andrzej Siewior,
Alexei Starovoitov, Daniel Borkmann, David S. Miller,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
Jiri Olsa, Clark Williams, Steven Rostedt, Thomas Gleixner,
netdev, bpf, linux-kernel, linux-rt-devel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __dev_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. On PREEMPT_RT, however,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption.
The CFS scheduler can therefore preempt a task in the middle of
bq_xmit_all(), allowing another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
If preempted after the snapshot, a second task can call bq_enqueue()
-> bq_xmit_all() on the same bq, transmitting (and freeing) the
same frames. When the first task resumes, it operates on stale
pointers in bq->q[], causing use-after-free.
2. bq->count and bq->q[] corruption: concurrent bq_enqueue() can
   modify bq->count and bq->q[] while bq_xmit_all() is reading them.
3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
   bq->xdp_prog after bq_xmit_all() returns. If preempted between the
   bq_xmit_all() return and bq->dev_rx = NULL, a preempting
   bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
   the flush_list, and enqueues a frame. When __dev_flush() resumes,
   it clears dev_rx and removes bq from the flush_list, orphaning the
   newly enqueued frame (see the annotated __dev_flush() sketch after
   this list).
4. __list_del_clearprev() on flush_node: similar to the cpumap race,
   both tasks can call __list_del_clearprev() on the same flush_node;
   the second call dereferences the prev pointer already set to NULL
   by the first.
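For context, the flush path before this fix, with the windows for
races 3 and 4 annotated (simplified from kernel/bpf/devmap.c; the
annotations are mine):

  void __dev_flush(struct list_head *flush_list)
  {
          struct xdp_dev_bulk_queue *bq, *tmp;

          list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
                  bq_xmit_all(bq, XDP_XMIT_FLUSH);
                  /* race 3 window: bq->dev_rx is still non-NULL here,
                   * so a preempting bq_enqueue() skips list_add()...
                   */
                  bq->dev_rx = NULL;
                  bq->xdp_prog = NULL;
                  /* ...and its frame is orphaned once the node is
                   * deleted; race 4 if the preempting task also ran
                   * __list_del_clearprev() on this node:
                   */
                  __list_del_clearprev(&bq->flush_node);
          }
  }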
The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:
  Task A (xdp_do_flush)              Task B (ndo_xdp_xmit redirect)
  ---------------------              ------------------------------
  __dev_flush(flush_list)
    bq_xmit_all(bq)
      cnt = bq->count /* e.g. 16 */
      /* start iterating bq->q[] */
  <-- CFS preempts Task A -->
                                     bq_enqueue(dev, xdpf)
                                       bq->count == DEV_MAP_BULK_SIZE
                                       bq_xmit_all(bq, 0)
                                         cnt = bq->count /* same 16! */
                                         ndo_xdp_xmit(bq->q[])
                                         /* frames freed by driver */
                                         bq->count = 0
  <-- Task A resumes -->
      ndo_xdp_xmit(bq->q[])
      /* use-after-free: frames already freed! */
To reproduce, insert an mdelay(100) in bq_xmit_all() after
"cnt = bq->count" and before the actual transmit loop. Then pin two
threads to the same CPU, each running BPF_PROG_TEST_RUN with an XDP
program that redirects to a DEVMAP entry (e.g. a veth pair); CFS
timeslicing during the mdelay() window causes the interleaving. A
userspace sketch of the reproducer follows the splat below. Without
the fix, KASAN reports a null-ptr-deref from operating on freed
frames:
BUG: KASAN: null-ptr-deref in __build_skb_around+0x22d/0x340
Write of size 32 at addr 0000000000000d50 by task devmap_race_rep/449
CPU: 0 UID: 0 PID: 449 Comm: devmap_race_rep Not tainted 6.19.0+ #31 PREEMPT_RT
Call Trace:
<TASK>
__build_skb_around+0x22d/0x340
build_skb_around+0x25/0x260
__xdp_build_skb_from_frame+0x103/0x860
veth_xdp_rcv_bulk_skb.isra.0+0x162/0x320
veth_xdp_rcv.constprop.0+0x61e/0xbb0
veth_poll+0x280/0xb50
__napi_poll.constprop.0+0xa5/0x590
net_rx_action+0x4b0/0xea0
handle_softirqs.isra.0+0x1b3/0x780
__local_bh_enable_ip+0x12a/0x240
xdp_test_run_batch.constprop.0+0xedd/0x1f60
bpf_test_run_xdp_live+0x304/0x640
bpf_prog_test_run_xdp+0xd24/0x1b70
__sys_bpf+0x61c/0x3e00
</TASK>
Kernel panic - not syncing: Fatal exception in interrupt
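A minimal sketch of the userspace side (assumptions: prog_fd refers to
an already loaded XDP program that bpf_redirect_map()s into a DEVMAP
entry backed by a veth; program loading and error handling omitted):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <bpf/bpf.h>

  static void *hammer(void *arg)
  {
          int prog_fd = *(int *)arg;
          char pkt[64] = { 0 };   /* minimal frame, contents don't matter */
          LIBBPF_OPTS(bpf_test_run_opts, opts,
                  .data_in      = pkt,
                  .data_size_in = sizeof(pkt),
                  .repeat       = 1 << 20,
                  .flags        = BPF_F_TEST_XDP_LIVE_FRAMES,
          );
          cpu_set_t set;

          CPU_ZERO(&set);
          CPU_SET(0, &set);       /* both threads pinned to CPU 0 */
          sched_setaffinity(0, sizeof(set), &set);

          bpf_prog_test_run_opts(prog_fd, &opts);
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;
          int prog_fd = -1;       /* obtain via libbpf prog load (omitted) */

          pthread_create(&a, NULL, hammer, &prog_fd);
          pthread_create(&b, NULL, hammer, &prog_fd);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          return 0;
  }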
Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh(), which on non-RT is
a pure lockdep annotation with no overhead and on PREEMPT_RT provides
a per-CPU sleeping lock that serializes access to the bq.
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
kernel/bpf/devmap.c | 25 +++++++++++++++++++++----
1 file changed, 21 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 2625601de76e..10cf0731f91d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -45,6 +45,7 @@
* types of devmap; only the lookup and insertion is different.
*/
#include <linux/bpf.h>
+#include <linux/local_lock.h>
#include <net/xdp.h>
#include <linux/filter.h>
#include <trace/events/xdp.h>
@@ -60,6 +61,7 @@ struct xdp_dev_bulk_queue {
struct net_device *dev_rx;
struct bpf_prog *xdp_prog;
unsigned int count;
+ local_lock_t bq_lock;
};
struct bpf_dtab_netdev {
@@ -381,6 +383,8 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
int to_send = cnt;
int i;
+ lockdep_assert_held(&bq->bq_lock);
+
if (unlikely(!cnt))
return;
@@ -425,10 +429,12 @@ void __dev_flush(struct list_head *flush_list)
struct xdp_dev_bulk_queue *bq, *tmp;
list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+ local_lock_nested_bh(&bq->dev->xdp_bulkq->bq_lock);
bq_xmit_all(bq, XDP_XMIT_FLUSH);
bq->dev_rx = NULL;
bq->xdp_prog = NULL;
__list_del_clearprev(&bq->flush_node);
+ local_unlock_nested_bh(&bq->dev->xdp_bulkq->bq_lock);
}
}
@@ -451,12 +457,16 @@ static void *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
/* Runs in NAPI, i.e., softirq under local_bh_disable(). Thus, safe percpu
* variable access, and map elements stick around. See comment above
- * xdp_do_flush() in filter.c.
+ * xdp_do_flush() in filter.c. PREEMPT_RT relies on local_lock_nested_bh()
+ * to serialise access to the per-CPU bq.
*/
static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
struct net_device *dev_rx, struct bpf_prog *xdp_prog)
{
- struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
+ struct xdp_dev_bulk_queue *bq;
+
+ local_lock_nested_bh(&dev->xdp_bulkq->bq_lock);
+ bq = this_cpu_ptr(dev->xdp_bulkq);
if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
bq_xmit_all(bq, 0);
@@ -477,6 +487,8 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
}
bq->q[bq->count++] = xdpf;
+
+ local_unlock_nested_bh(&dev->xdp_bulkq->bq_lock);
}
static inline int __xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
@@ -1115,8 +1127,13 @@ static int dev_map_notification(struct notifier_block *notifier,
if (!netdev->xdp_bulkq)
return NOTIFY_BAD;
- for_each_possible_cpu(cpu)
- per_cpu_ptr(netdev->xdp_bulkq, cpu)->dev = netdev;
+ for_each_possible_cpu(cpu) {
+ struct xdp_dev_bulk_queue *bq;
+
+ bq = per_cpu_ptr(netdev->xdp_bulkq, cpu);
+ bq->dev = netdev;
+ local_lock_init(&bq->bq_lock);
+ }
break;
case NETDEV_UNREGISTER:
/* This rcu_read_lock/unlock pair is needed because
--
2.43.0
* Re: [PATCH bpf v3 2/2] bpf: devmap: fix race in bq_xmit_all on PREEMPT_RT
2026-02-13 3:40 ` [PATCH bpf v3 2/2] bpf: devmap: fix race in bq_xmit_all " Jiayuan Chen
@ 2026-02-17 7:42 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17 7:42 UTC
To: Jiayuan Chen
Cc: xxx, jiayuan.chen, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Clark Williams, Steven Rostedt,
Thomas Gleixner, netdev, bpf, linux-kernel, linux-rt-devel
On 2026-02-13 11:40:15 [+0800], Jiayuan Chen wrote:
…
> timeslicing during the mdelay window causes interleaving. Without the
> fix, KASAN reports null-ptr-deref due to operating on freed frames:
>
> BUG: KASAN: null-ptr-deref in __build_skb_around+0x22d/0x340
> Write of size 32 at addr 0000000000000d50 by task devmap_race_rep/449
>
> CPU: 0 UID: 0 PID: 449 Comm: devmap_race_rep Not tainted 6.19.0+ #31 PREEMPT_RT
> Call Trace:
> <TASK>
> __build_skb_around+0x22d/0x340
> build_skb_around+0x25/0x260
> __xdp_build_skb_from_frame+0x103/0x860
> veth_xdp_rcv_bulk_skb.isra.0+0x162/0x320
> veth_xdp_rcv.constprop.0+0x61e/0xbb0
> veth_poll+0x280/0xb50
> __napi_poll.constprop.0+0xa5/0x590
> net_rx_action+0x4b0/0xea0
> handle_softirqs.isra.0+0x1b3/0x780
> __local_bh_enable_ip+0x12a/0x240
> xdp_test_run_batch.constprop.0+0xedd/0x1f60
> bpf_test_run_xdp_live+0x304/0x640
> bpf_prog_test_run_xdp+0xd24/0x1b70
> __sys_bpf+0x61c/0x3e00
> </TASK>
>
> Kernel panic - not syncing: Fatal exception in interrupt
I would move this splat next to the diffstat (same as with the
previous patch), since it is obvious once the race has been described.
Sebastian
* Re: [PATCH bpf v3 0/2] bpf: cpumap/devmap: fix per-CPU bulk queue races on PREEMPT_RT
2026-02-13 3:40 [PATCH bpf v3 0/2] bpf: cpumap/devmap: fix per-CPU bulk queue races on PREEMPT_RT Jiayuan Chen
2026-02-13 3:40 ` [PATCH bpf v3 1/2] bpf: cpumap: fix race in bq_flush_to_queue " Jiayuan Chen
2026-02-13 3:40 ` [PATCH bpf v3 2/2] bpf: devmap: fix race in bq_xmit_all " Jiayuan Chen
@ 2026-02-17 7:43 ` Sebastian Andrzej Siewior
2 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-17 7:43 UTC
To: Jiayuan Chen
Cc: jiayuan.chen, Alexei Starovoitov, Daniel Borkmann,
David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
John Fastabend, Stanislav Fomichev, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
KP Singh, Hao Luo, Jiri Olsa, Clark Williams, Steven Rostedt,
Thomas Gleixner, netdev, bpf, linux-kernel, linux-rt-devel
- xxx@vger.kernel.org
On 2026-02-13 11:40:13 [+0800], Jiayuan Chen wrote:
> On PREEMPT_RT kernels, local_bh_disable() only calls migrate_disable()
> (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
> preemption. This means CFS scheduling can preempt a task inside the
> per-CPU bulk queue (bq) operations in cpumap and devmap, allowing
> another task on the same CPU to concurrently access the same bq,
> leading to use-after-free, list corruption, and kernel panics.
>
> Patch 1 fixes the cpumap race in bq_flush_to_queue(), originally
> reported by syzbot [1].
>
> Patch 2 fixes the same class of race in devmap's bq_xmit_all(),
> identified by code inspection after Sebastian Andrzej Siewior pointed
> out that devmap has the same per-CPU bulk queue pattern [2].
…
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Sebastian