[PATCH bpf v1] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jiayuan Chen <jiayuan.chen@linux.dev>
To: bpf@vger.kernel.org
Cc: Jiayuan Chen <jiayuan.chen@shopee.com>,
	syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com,
	Jiayuan Chen <jiayuan.chen@linux.dev>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	KP Singh <kpsingh@kernel.org>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Kees Cook <kees@kernel.org>,
	"Gustavo A. R. Silva" <gustavoars@kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Clark Williams <clrkwllms@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Thomas Gleixner <tglx@kernel.org>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-hardening@vger.kernel.org, linux-rt-devel@lists.linux.dev
Subject: [PATCH bpf v1] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
Date: Wed, 11 Feb 2026 14:44:16 +0800	[thread overview]
Message-ID: <20260211064417.196401-1-jiayuan.chen@linux.dev> (raw)

From: Jiayuan Chen <jiayuan.chen@shopee.com>

On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() and does not disable
preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
this allows CFS scheduling to preempt a task during bq_flush_to_queue(),
enabling another task on the same CPU to enter bq_enqueue() and operate
on the same per-CPU bq concurrently.

This leads to several races:

1. Double __list_del_clearprev(): after spin_unlock() in
 bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
 bq_flush_to_queue() on the same bq when bq->count reaches
 CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
 on the same bq->flush_node, the second call dereferences the
 prev pointer that was already set to NULL by the first.

2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
 the packet queue while bq_flush_to_queue() is processing it.

The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:

Task A (xdp_do_flush)          Task B (cpu_map_enqueue)
----------------------         ------------------------
bq_flush_to_queue(bq)
  spin_lock(&q->producer_lock)
  /* flush bq->q[] to ptr_ring */
  bq->count = 0
  spin_unlock(&q->producer_lock)
                                 bq_enqueue(rcpu, xdpf)
  <-- CFS preempts Task A -->      bq->q[bq->count++] = xdpf
                                   /* ... more enqueues until full ... */
                                   bq_flush_to_queue(bq)
                                     spin_lock(&q->producer_lock)
                                     /* flush to ptr_ring */
                                     spin_unlock(&q->producer_lock)
                                     __list_del_clearprev(flush_node)
                                       /* sets flush_node.prev = NULL */
  <-- Task A resumes -->
  __list_del_clearprev(flush_node)
    flush_node.prev->next = ...
    /* prev is NULL -> kernel oops */

Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
maps to preempt_disable/enable with zero additional overhead. On
PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
access to the bq.

An alternative approach of snapshotting bq->count and bq->q[] before
releasing the producer_lock was considered, but it requires copying
the entire bq->q[] array on every flush, adding unnecessary overhead.

To reproduce, insert an mdelay(100) between spin_unlock() and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.

Panic:
===
BUG: kernel NULL pointer dereference, address: 0000000000000000
PF: supervisor write access in kernel mode
PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP PTI
CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
RIP: 0010:bq_flush_to_queue+0x145/0x200
RSP: 0018:ffffc90000eb78e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffc90000eb7a10 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000eb7928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffe8ffffc255c0 R14: 0000000000000000 R15: ffff88800c01b800
FS:  00007fae0762d6c0(0000) GS:ffff8880f9c75000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000d698005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 __cpu_map_flush+0x2c/0x70
 xdp_do_flush+0x64/0x1b0
 xdp_test_run_batch.constprop.0+0x4d4/0x6d0
 bpf_test_run_xdp_live+0x24b/0x3e0
 ? get_page_from_freelist+0x5ea/0x1cf0
 ? __pfx_xdp_test_run_init_page+0x10/0x10
 ? __check_object_size+0x25e/0x320
 ? bpf_dispatcher_change_prog+0x20d/0x4e0
 bpf_prog_test_run_xdp+0x4a1/0x6e0
 ? rt_spin_unlock+0x61/0xb0
 __sys_bpf+0x44a/0x2760
 ? set_ptes.isra.0+0x3b/0x90
 ? rt_spin_unlock+0x61/0xb0
 ? do_anonymous_page+0x106/0x430
 __x64_sys_bpf+0x1a/0x30
 x64_sys_call+0x146c/0x26e0
 do_syscall_64+0xd5/0x5a0
 ? do_user_addr_fault+0x1d0/0x870
 ? irqentry_exit+0x3e/0x670
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fae0777928d
RSP: 002b:00007fae0762ce68 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007fae0762dcdc RCX: 00007fae0777928d
RDX: 0000000000000050 RSI: 0000200000000000 RDI: 000000000000000a
RBP: 00007fae0762ce90 R08: 00007fae0762d6c0 R09: 203a6362696c6720
R10: 0000000000000000 R11: 0000000000000206 R12: 00007fae0762d6c0
R13: ffffffffffffff88 R14: 000000000000006e R15: 00007ffdd6e54110
 </TASK>
Modules linked in:
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:bq_flush_to_queue+0x145/0x200
RSP: 0018:ffffc90000eb78e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffc90000eb7a10 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000eb7928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffe8ffffc255c0 R14: 0000000000000000 R15: ffff88800c01b800
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000d698005 CR4: 0000000000770ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: d2d6422f8bd1 ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 kernel/bpf/cpumap.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 04171fbc39cb..7fda8421ec40 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,6 +29,7 @@
 #include <linux/sched.h>
 #include <linux/workqueue.h>
 #include <linux/kthread.h>
+#include <linux/local_lock.h>
 #include <linux/completion.h>
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
@@ -52,6 +53,7 @@ struct xdp_bulk_queue {
 	struct list_head flush_node;
 	struct bpf_cpu_map_entry *obj;
 	unsigned int count;
+	local_lock_t bq_lock;
 };
 
 /* Struct for every remote "destination" CPU in map */
@@ -451,6 +453,7 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 	for_each_possible_cpu(i) {
 		bq = per_cpu_ptr(rcpu->bulkq, i);
 		bq->obj = rcpu;
+		local_lock_init(&bq->bq_lock);
 	}
 
 	/* Alloc queue */
@@ -714,6 +717,7 @@ const struct bpf_map_ops cpu_map_ops = {
 	.map_redirect		= cpu_map_redirect,
 };
 
+/* Caller must hold bq->bq_lock */
 static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
 {
 	struct bpf_cpu_map_entry *rcpu = bq->obj;
@@ -750,10 +754,16 @@ static void bq_flush_to_queue(struct xdp_bulk_queue *bq)
 
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
+ *
+ * On PREEMPT_RT, local_bh_disable() does not disable preemption,
+ * so we use local_lock to serialize access to the per-CPU bq.
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
+	struct xdp_bulk_queue *bq;
+
+	local_lock(&rcpu->bulkq->bq_lock);
+	bq = this_cpu_ptr(rcpu->bulkq);
 
 	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
 		bq_flush_to_queue(bq);
@@ -774,6 +784,8 @@ static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 
 		list_add(&bq->flush_node, flush_list);
 	}
+
+	local_unlock(&rcpu->bulkq->bq_lock);
 }
 
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf,
@@ -810,7 +822,9 @@ void __cpu_map_flush(struct list_head *flush_list)
 	struct xdp_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+		local_lock(&bq->obj->bulkq->bq_lock);
 		bq_flush_to_queue(bq);
+		local_unlock(&bq->obj->bulkq->bq_lock);
 
 		/* If already running, costs spin_lock_irqsave + smb_mb */
 		wake_up_process(bq->obj->kthread);
-- 
2.43.0

next             reply	other threads:[~2026-02-11  6:44 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-11  6:44 Jiayuan Chen [this message]
2026-02-11 11:44 ` [PATCH bpf v1] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT Sebastian Andrzej Siewior
2026-02-11 12:26   ` Jiayuan Chen

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:04171fbc39c dfblob:7fda8421ec4 )
 OR (
bs:"[PATCH bpf v1] bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260211064417.196401-1-jiayuan.chen@linux.dev \
    --to=jiayuan.chen@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bigeasy@linutronix.de \
    --cc=bpf@vger.kernel.org \
    --cc=clrkwllms@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=eddyz87@gmail.com \
    --cc=gustavoars@kernel.org \
    --cc=haoluo@google.com \
    --cc=hawk@kernel.org \
    --cc=jiayuan.chen@shopee.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kees@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-hardening@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rt-devel@lists.linux.dev \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=sdf@fomichev.me \
    --cc=song@kernel.org \
    --cc=syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com \
    --cc=tglx@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.