public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
@ 2026-02-12  2:36 Jiayuan Chen
  2026-02-12 14:33 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 3+ messages in thread
From: Jiayuan Chen @ 2026-02-12  2:36 UTC (permalink / raw)
  To: xxx
  Cc: Jiayuan Chen, syzbot+2b3391f44313b3983e91, Jiayuan Chen,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Sebastian Andrzej Siewior, Clark Williams,
	Steven Rostedt, Thomas Gleixner, netdev, bpf, linux-kernel,
	linux-rt-devel

From: Jiayuan Chen <jiayuan.chen@shopee.com>

On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() and does not disable
preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
this allows CFS scheduling to preempt a task during bq_flush_to_queue(),
enabling another task on the same CPU to enter bq_enqueue() and operate
on the same per-CPU bq concurrently.

This leads to several races:

1. Double __list_del_clearprev(): after spin_unlock() in
   bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
   bq_flush_to_queue() on the same bq when bq->count reaches
   CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
   on the same bq->flush_node, the second call dereferences the
   prev pointer that was already set to NULL by the first.

2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
   the packet queue while bq_flush_to_queue() is processing it.

The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:

  Task A (xdp_do_flush)          Task B (cpu_map_enqueue)
  ----------------------         ------------------------
  bq_flush_to_queue(bq)
    spin_lock(&q->producer_lock)
    /* flush bq->q[] to ptr_ring */
    bq->count = 0
    spin_unlock(&q->producer_lock)
                                   bq_enqueue(rcpu, xdpf)
    <-- CFS preempts Task A -->      bq->q[bq->count++] = xdpf
                                     /* ... more enqueues until full ... */
                                     bq_flush_to_queue(bq)
                                       spin_lock(&q->producer_lock)
                                       /* flush to ptr_ring */
                                       spin_unlock(&q->producer_lock)
                                       __list_del_clearprev(flush_node)
                                         /* sets flush_node.prev = NULL */
    <-- Task A resumes -->
    __list_del_clearprev(flush_node)
      flush_node.prev->next = ...
      /* prev is NULL -> kernel oops */

Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
maps to preempt_disable/enable with zero additional overhead. On
PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
access to the bq. Use local_lock_nested_bh() since these paths already
run under local_bh_disable().

An alternative approach of snapshotting bq->count and bq->q[] before
releasing the producer_lock was considered, but it requires copying
the entire bq->q[] array on every flush, adding unnecessary overhead.

To reproduce, insert an mdelay(100) between spin_unlock() and
__list_del_clearprev() in bq_flush_to_queue(), then run the reproducer
provided by syzkaller.

Panic:
===
 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP PTI
 CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
 RIP: 0010:bq_flush_to_queue+0x145/0x200
 Call Trace:
  <TASK>
  __cpu_map_flush+0x2c/0x70
  xdp_do_flush+0x64/0x1b0
  xdp_test_run_batch.constprop.0+0x4d4/0x6d0
  bpf_test_run_xdp_live+0x24b/0x3e0
  bpf_prog_test_run_xdp+0x4a1/0x6e0
  __sys_bpf+0x44a/0x2760
  __x64_sys_bpf+0x1a/0x30
  x64_sys_call+0x146c/0x26e0
  do_syscall_64+0xd5/0x5a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  </TASK>

Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
- Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
  local_lock()/local_unlock(), since these paths already run under
  local_bh_disable(). (Sebastian Andrzej Siewior)
- Replace "Caller must hold bq->bq_lock" comment with
  lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
- Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
  softirq-BKL lock on PREEMPT_RT") which is the actual commit that
  makes the race possible. (Sebastian Andrzej Siewior)
---
 kernel/bpf/cpumap.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 04171fbc39cb..32b43cb9061b 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/workqueue.h>
 #include <linux/kthread.h>
+#include <linux/local_lock.h>
 #include <linux/completion.h>
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
@@ -52,6 +53,7 @@ struct xdp_bulk_queue {
 	struct list_head flush_node;
 	struct bpf_cpu_map_entry *obj;
 	unsigned int count;
+	local_lock_t bq_lock;
 };
 
 /* Struct for every remote "destination" CPU in map */
@@ -451,6 +453,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 	for_each_possible_cpu(i) {
 		bq = per_cpu_ptr(rcpu->bulkq, i);
 		bq->obj = rcpu;
+		local_lock_init(&bq->bq_lock);
 	}
 
 	/* Alloc queue */
@@ -722,6 +725,8 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
 	struct ptr_ring *q;
 	int i;
 
+	lockdep_assert_held(&bq->bq_lock);
+
 	if (unlikely(!bq->count))
 		return;
 
@@ -749,11 +754,15 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
 }
 
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
- * Thus, safe percpu variable access.
+ * Thus, safe percpu variable access. PREEMPT_RT relies on
+ * local_lock_nested_bh() to serialise access to the per-CPU bq.
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
+	struct xdp_bulk_queue *bq;
+
+	local_lock_nested_bh(&rcpu->bulkq->bq_lock);
+	bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
 		bq_flush_to_queue(bq);
@@ -774,6 +783,8 @@ static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 
 		list_add(&bq->flush_node, flush_list);
 	}
+
+	local_unlock_nested_bh(&rcpu->bulkq->bq_lock);
 }
 
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf,
@@ -810,7 +821,9 @@ void __cpu_map_flush(struct list_head *flush_list)
 	struct xdp_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+		local_lock_nested_bh(&bq->obj->bulkq->bq_lock);
 		bq_flush_to_queue(bq);
+		local_unlock_nested_bh(&bq->obj->bulkq->bq_lock);
 
 		/* If already running, costs spin_lock_irqsave + smb_mb */
 		wake_up_process(bq->obj->kthread);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
  2026-02-12  2:36 [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT Jiayuan Chen
@ 2026-02-12 14:33 ` Sebastian Andrzej Siewior
  2026-02-13  1:37   ` Jiayuan Chen
  0 siblings, 1 reply; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-12 14:33 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: xxx, Jiayuan Chen, syzbot+2b3391f44313b3983e91,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Clark Williams, Steven Rostedt, Thomas Gleixner,
	netdev, bpf, linux-kernel, linux-rt-devel

On 2026-02-12 10:36:33 [+0800], Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
> 
> local_bh_disable() only calls migrate_disable() and does not disable
> preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
There is no spin_lock() otherwise there would be no problem.

…
> Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
> in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
> maps to preempt_disable/enable with zero additional overhead. On
> PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
> access to the bq. Use local_lock_nested_bh() since these paths already
> run under local_bh_disable().

So you use local_lock_nested_bh() and not local_lock() but you mention
local_lock. The difference is that the former does not add any
preempt_disable() on !RT. At this point I am curious how much of this
was written by you and how much is auto generated. 

> An alternative approach of snapshotting bq->count and bq->q[] before
> releasing the producer_lock was considered, but it requires copying
> the entire bq->q[] array on every flush, adding unnecessary overhead.

But you still have list_head which is not protected.

> To reproduce, insert an mdelay(100) between spin_unlock() and
> __list_del_clearprev() in bq_flush_to_queue(), then run reproducer
> provided by syzkaller.
> 
> Panic:
> ===
>  BUG: kernel NULL pointer dereference, address: 0000000000000000
>  #PF: supervisor write access in kernel mode
>  #PF: error_code(0x0002) - not-present page
>  PGD 0 P4D 0
>  Oops: Oops: 0002 [#1] SMP PTI
>  CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
>  RIP: 0010:bq_flush_to_queue+0x145/0x200
>  Call Trace:
>   <TASK>
>   __cpu_map_flush+0x2c/0x70
>   xdp_do_flush+0x64/0x1b0
>   xdp_test_run_batch.constprop.0+0x4d4/0x6d0
>   bpf_test_run_xdp_live+0x24b/0x3e0
>   bpf_prog_test_run_xdp+0x4a1/0x6e0
>   __sys_bpf+0x44a/0x2760
>   __x64_sys_bpf+0x1a/0x30
>   x64_sys_call+0x146c/0x26e0
>   do_syscall_64+0xd5/0x5a0
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
>   </TASK>

This could be omitted. It is obvious once you see it. I somehow missed
this alloc_percpu instance while looking for this kind of bugs.
Another one is hiding in devmap.c. Mind to take a look? I think I
skip this entire folder…

> Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
> Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
> v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
> - Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
>   local_lock()/local_unlock(), since these paths already run under
>   local_bh_disable(). (Sebastian Andrzej Siewior)
> - Replace "Caller must hold bq->bq_lock" comment with
>   lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
> - Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
>   softirq-BKL lock on PREEMPT_RT") which is the actual commit that
>   makes the race possible. (Sebastian Andrzej Siewior)
> ---
>  kernel/bpf/cpumap.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)

The changes below look good.

Sebastian


* Re: [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
  2026-02-12 14:33 ` Sebastian Andrzej Siewior
@ 2026-02-13  1:37   ` Jiayuan Chen
  0 siblings, 0 replies; 3+ messages in thread
From: Jiayuan Chen @ 2026-02-13  1:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: xxx, Jiayuan Chen, syzbot+2b3391f44313b3983e91,
	Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Hao Luo,
	Jiri Olsa, Clark Williams, Steven Rostedt, Thomas Gleixner,
	netdev, bpf, linux-kernel, linux-rt-devel

On 2026-02-12 14:33, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
Thanks for the review, Sebastian.


> There is no spin_lock() otherwise there would be no problem.


Right, will fix the description. The per-CPU bq has no lock at all,
which is the root cause...
> …
>
> > Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
> > in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
> > maps to preempt_disable/enable with zero additional overhead. On
> > PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
> > access to the bq. Use local_lock_nested_bh() since these paths already
> > run under local_bh_disable().
>
> So you use local_lock_nested_bh() and not local_lock() but you mention
> local_lock. The difference is that the former does not add any
> preempt_disable() on !RT. At this point I am curious how much of this
> was written by you and how much is auto generated.


Sorry about the inconsistencies. The commit message became messy after
several rounds of editing across versions, and I failed to update all
the descriptions to match the final code.


> > An alternative approach of snapshotting bq->count and bq->q[] before
> > releasing the producer_lock was considered, but it requires copying
> > the entire bq->q[] array on every flush, adding unnecessary overhead.
>
> But you still have list_head which is not protected.

The snapshot alternative is incomplete as described. Will drop that paragraph.

> > To reproduce, insert an mdelay(100) between spin_unlock() and
> > __list_del_clearprev() in bq_flush_to_queue(), then run the reproducer
> > provided by syzkaller.
> >
> > Panic:
> > ===
> >  BUG: kernel NULL pointer dereference, address: 0000000000000000
> >  #PF: supervisor write access in kernel mode
> >  #PF: error_code(0x0002) - not-present page
> >  PGD 0 P4D 0
> >  Oops: Oops: 0002 [#1] SMP PTI
> >  CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
> >  RIP: 0010:bq_flush_to_queue+0x145/0x200
> >  Call Trace:
> >   <TASK>
> >   __cpu_map_flush+0x2c/0x70
> >   xdp_do_flush+0x64/0x1b0
> >   xdp_test_run_batch.constprop.0+0x4d4/0x6d0
> >   bpf_test_run_xdp_live+0x24b/0x3e0
> >   bpf_prog_test_run_xdp+0x4a1/0x6e0
> >   __sys_bpf+0x44a/0x2760
> >   __x64_sys_bpf+0x1a/0x30
> >   x64_sys_call+0x146c/0x26e0
> >   do_syscall_64+0xd5/0x5a0
> >   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >   </TASK>
> This could be omitted. It is obvious once you see it.

Will remove the panic trace.

> I somehow missed this alloc_percpu instance while looking for this kind
> of bugs. Another one is hiding in devmap.c. Mind to take a look? I think
> I skip this entire folder…


I see the same pattern there - xdp_dev_bulk_queue is per-CPU with no preemption
protection in bq_enqueue() and __dev_flush().

> > Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
> > Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> > ---
> > v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
> > - Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
> >   local_lock()/local_unlock(), since these paths already run under
> >   local_bh_disable(). (Sebastian Andrzej Siewior)
> > - Replace "Caller must hold bq->bq_lock" comment with
> >   lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
> > - Fix Fixes tag to 3253cb49cbad ("softirq: Allow to drop the
> >   softirq-BKL lock on PREEMPT_RT") which is the actual commit that
> >   makes the race possible. (Sebastian Andrzej Siewior)
> > ---
> >  kernel/bpf/cpumap.c | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
>
> The changes below look good.

Thanks!

> Sebastian


end of thread, other threads:[~2026-02-13  1:37 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-12  2:36 [PATCH bpf v2] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT Jiayuan Chen
2026-02-12 14:33 ` Sebastian Andrzej Siewior
2026-02-13  1:37   ` Jiayuan Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox