From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mykyta Yatsenko
Date: Wed, 22 Apr 2026 12:41:08 -0700
Subject: [PATCH bpf-next v13 3/6] bpf: Add sleepable support for classic tracepoint programs
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <20260422-sleepable_tracepoints-v13-3-99005dff21ef@meta.com>
References: <20260422-sleepable_tracepoints-v13-0-99005dff21ef@meta.com>
In-Reply-To: <20260422-sleepable_tracepoints-v13-0-99005dff21ef@meta.com>
To: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
 daniel@iogearbox.net, kafai@meta.com, kernel-team@meta.com,
 eddyz87@gmail.com, memxor@gmail.com, peterz@infradead.org,
 rostedt@goodmis.org
Cc: Mykyta Yatsenko
X-Mailer: b4 0.16-dev

From: Mykyta Yatsenko

Add trace_call_bpf_faultable(), a variant of trace_call_bpf() for
faultable tracepoints that supports sleepable BPF programs. It uses
rcu_tasks_trace for lifetime protection and
bpf_prog_run_array_sleepable() for per-program RCU flavor selection,
following the uprobe_prog_run() pattern.

Restructure perf_syscall_enter() and perf_syscall_exit() to run BPF
programs before perf event processing. Previously, BPF ran after the
per-cpu perf trace buffer was allocated under preempt_disable,
requiring cleanup via perf_swevent_put_recursion_context() when a
program filtered out the event. Now BPF runs in faultable context
before preempt_disable, reading syscall arguments from local variables
instead of the per-cpu trace record, removing the dependency on buffer
allocation. This allows sleepable BPF programs to execute and avoids
unnecessary buffer allocation when BPF filters out the event. The perf
event submission path (buffer allocation, fill, submit) remains under
preempt_disable as before.

Since BPF no longer runs within the buffer allocation context, the
fake_regs output parameter to perf_trace_buf_alloc() is no longer
needed and is replaced with NULL.

Add an attach-time check in __perf_event_set_bpf_prog() to reject
sleepable BPF_PROG_TYPE_TRACEPOINT programs on non-syscall
tracepoints, since only syscall tracepoints run in faultable context.

This prepares the classic tracepoint runtime and attach paths for
sleepable programs. The verifier changes to allow loading sleepable
BPF_PROG_TYPE_TRACEPOINT programs are in a subsequent patch.
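Note for reviewers unfamiliar with the uprobe pattern: the per-program
RCU flavor selection done by bpf_prog_run_array_sleepable() reduces to
the shape sketched below. This is illustrative only (run_ctx, cookie
and migrate_disable() bookkeeping are omitted); the helper name and
the exact details are those used by this series, not a new definition:

static u32 sketch_run_array_sleepable(const struct bpf_prog_array *array,
				      const void *ctx, bpf_prog_run_fn run_prog)
{
	const struct bpf_prog_array_item *item;
	const struct bpf_prog *prog;
	u32 ret = 1;

	/* Caller holds an rcu_tasks_trace read section. */
	if (unlikely(!array))
		return ret;

	item = &array->items[0];
	while ((prog = READ_ONCE(item->prog))) {
		/*
		 * Sleepable programs rely on the caller's rcu_tasks_trace
		 * section for lifetime; non-sleepable programs get a
		 * regular RCU read-side section around each invocation.
		 */
		if (!prog->sleepable)
			rcu_read_lock();

		ret &= run_prog(prog, ctx);

		if (!prog->sleepable)
			rcu_read_unlock();
		item++;
	}
	return ret;
}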
To: Peter Zijlstra <peterz@infradead.org>
To: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # for BPF bits
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/trace_events.h  |   6 +++
 kernel/events/core.c          |   9 ++++
 kernel/trace/bpf_trace.c      |  28 +++++++++++
 kernel/trace/trace_syscalls.c | 110 ++++++++++++++++++++++--------------------
 4 files changed, 101 insertions(+), 52 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 40a43a4c7caf..d49338c44014 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -770,6 +770,7 @@ trace_trigger_soft_disabled(struct trace_event_file *file)
 
 #ifdef CONFIG_BPF_EVENTS
 unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
+unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx);
 int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
@@ -792,6 +793,11 @@ static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *c
 	return 1;
 }
 
+static inline unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx)
+{
+	return 1;
+}
+
 static inline int perf_event_attach_bpf_prog(struct perf_event *event,
 					     struct bpf_prog *prog, u64 bpf_cookie)
 {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d1f8bad7e1c..0f9cacfa7cb8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11643,6 +11643,15 @@ static int __perf_event_set_bpf_prog(struct perf_event *event,
 		/* only uprobe programs are allowed to be sleepable */
 		return -EINVAL;
 
+	if (prog->type == BPF_PROG_TYPE_TRACEPOINT && prog->sleepable) {
+		/*
+		 * Sleepable tracepoint programs can only attach to faultable
+		 * tracepoints. Currently only syscall tracepoints are faultable.
+		 */
+		if (!is_syscall_tp)
+			return -EINVAL;
+	}
+
 	/* Kprobe override only works for kprobes, not uprobes. */
 	if (prog->kprobe_override && !is_kprobe)
 		return -EINVAL;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7276c72c1d31..a822c589c9bd 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -152,6 +152,34 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 	return ret;
 }
 
+/**
+ * trace_call_bpf_faultable - invoke BPF program in faultable context
+ * @call: tracepoint event
+ * @ctx: opaque context pointer
+ *
+ * Variant of trace_call_bpf() for faultable tracepoints (syscall
+ * tracepoints). Supports sleepable BPF programs by using rcu_tasks_trace
+ * for lifetime protection and bpf_prog_run_array_sleepable() for per-program
+ * RCU flavor selection, following the uprobe pattern.
+ *
+ * Per-program recursion protection is provided by
+ * bpf_prog_run_array_sleepable(). Global bpf_prog_active is not
+ * needed because syscall tracepoints cannot self-recurse.
+ *
+ * Must be called from a faultable/preemptible context.
+ */
+unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx)
+{
+	struct bpf_prog_array *prog_array;
+
+	might_fault();
+	guard(rcu_tasks_trace)();
+
+	prog_array = rcu_dereference_check(call->prog_array,
+					   rcu_read_lock_trace_held());
+	return bpf_prog_run_array_sleepable(prog_array, ctx, bpf_prog_run);
+}
+
 #ifdef CONFIG_BPF_KPROBE_OVERRIDE
 BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc)
 {
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 8ad72e17d8eb..e98ee7e1e66f 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -1371,33 +1371,33 @@ static DECLARE_BITMAP(enabled_perf_exit_syscalls, NR_syscalls);
 static int sys_perf_refcount_enter;
 static int sys_perf_refcount_exit;
 
-static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *regs,
+static int perf_call_bpf_enter(struct trace_event_call *call,
 			       struct syscall_metadata *sys_data,
-			       struct syscall_trace_enter *rec)
+			       int syscall_nr, unsigned long *args)
 {
 	struct syscall_tp_t {
 		struct trace_entry ent;
 		int syscall_nr;
 		unsigned long args[SYSCALL_DEFINE_MAXARGS];
 	} __aligned(8) param;
+	struct pt_regs regs = {};
 	int i;
 
 	BUILD_BUG_ON(sizeof(param.ent) < sizeof(void *));
 
-	/* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
-	perf_fetch_caller_regs(regs);
-	*(struct pt_regs **)&param = regs;
-	param.syscall_nr = rec->nr;
+	/* bpf prog requires 'regs' to be the first member in the ctx */
+	perf_fetch_caller_regs(&regs);
+	*(struct pt_regs **)&param = &regs;
+	param.syscall_nr = syscall_nr;
 	for (i = 0; i < sys_data->nb_args; i++)
-		param.args[i] = rec->args[i];
-	return trace_call_bpf(call, &param);
+		param.args[i] = args[i];
+	return trace_call_bpf_faultable(call, &param);
 }
 
 static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 {
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_enter *rec;
-	struct pt_regs *fake_regs;
 	struct hlist_head *head;
 	unsigned long args[6];
 	bool valid_prog_array;
@@ -1410,12 +1410,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int size = 0;
 	int uargs = 0;
 
-	/*
-	 * Syscall probe called with preemption enabled, but the ring
-	 * buffer and per-cpu data require preemption to be disabled.
-	 */
 	might_fault();
-	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
@@ -1429,6 +1424,26 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 
 	syscall_get_arguments(current, regs, args);
 
+	/*
+	 * Run BPF program in faultable context before per-cpu buffer
+	 * allocation, allowing sleepable BPF programs to execute.
+	 */
+	valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
+	if (valid_prog_array &&
+	    !perf_call_bpf_enter(sys_data->enter_event, sys_data,
+				 syscall_nr, args))
+		return;
+
+	/*
+	 * Per-cpu ring buffer and perf event list operations require
+	 * preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
+	head = this_cpu_ptr(sys_data->enter_event->perf_events);
+	if (hlist_empty(head))
+		return;
+
 	/* Check if this syscall event faults in user space memory */
 	mayfault = sys_data->user_mask != 0;
 
@@ -1438,17 +1453,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 		return;
 	}
 
-	head = this_cpu_ptr(sys_data->enter_event->perf_events);
-	valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
-	if (!valid_prog_array && hlist_empty(head))
-		return;
-
 	/* get the size after alignment with the u32 buffer size field */
 	size += sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
 	size = ALIGN(size + sizeof(u32), sizeof(u64));
 	size -= sizeof(u32);
 
-	rec = perf_trace_buf_alloc(size, &fake_regs, &rctx);
+	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
 		return;
 
@@ -1458,13 +1468,6 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	if (mayfault)
 		syscall_put_data(sys_data, rec, user_ptr, size, user_sizes, uargs);
 
-	if ((valid_prog_array &&
-	     !perf_call_bpf_enter(sys_data->enter_event, fake_regs, sys_data, rec)) ||
-	    hlist_empty(head)) {
-		perf_swevent_put_recursion_context(rctx);
-		return;
-	}
-
 	perf_trace_buf_submit(rec, size, rctx,
 			      sys_data->enter_event->event.type, 1, regs,
 			      head, NULL);
@@ -1514,40 +1517,35 @@ static void perf_sysenter_disable(struct trace_event_call *call)
 	syscall_fault_buffer_disable();
 }
 
-static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *regs,
-			      struct syscall_trace_exit *rec)
+static int perf_call_bpf_exit(struct trace_event_call *call,
+			      int syscall_nr, long ret_val)
 {
 	struct syscall_tp_t {
 		struct trace_entry ent;
 		int syscall_nr;
 		unsigned long ret;
 	} __aligned(8) param;
-
-	/* bpf prog requires 'regs' to be the first member in the ctx (a.k.a. &param) */
-	perf_fetch_caller_regs(regs);
-	*(struct pt_regs **)&param = regs;
-	param.syscall_nr = rec->nr;
-	param.ret = rec->ret;
-	return trace_call_bpf(call, &param);
+	struct pt_regs regs = {};
+
+	/* bpf prog requires 'regs' to be the first member in the ctx */
+	perf_fetch_caller_regs(&regs);
+	*(struct pt_regs **)&param = &regs;
+	param.syscall_nr = syscall_nr;
+	param.ret = ret_val;
+	return trace_call_bpf_faultable(call, &param);
 }
 
 static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 {
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_exit *rec;
-	struct pt_regs *fake_regs;
 	struct hlist_head *head;
 	bool valid_prog_array;
 	int syscall_nr;
 	int rctx;
 	int size;
 
-	/*
-	 * Syscall probe called with preemption enabled, but the ring
-	 * buffer and per-cpu data require preemption to be disabled.
-	 */
 	might_fault();
-	guard(preempt_notrace)();
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
 	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
@@ -1559,29 +1557,37 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
-	head = this_cpu_ptr(sys_data->exit_event->perf_events);
+	/*
+	 * Run BPF program in faultable context before per-cpu buffer
+	 * allocation, allowing sleepable BPF programs to execute.
+	 */
 	valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
-	if (!valid_prog_array && hlist_empty(head))
+	if (valid_prog_array &&
+	    !perf_call_bpf_exit(sys_data->exit_event, syscall_nr,
+				syscall_get_return_value(current, regs)))
+		return;
+
+	/*
+	 * Per-cpu ring buffer and perf event list operations require
+	 * preemption to be disabled.
+	 */
+	guard(preempt_notrace)();
+
+	head = this_cpu_ptr(sys_data->exit_event->perf_events);
+	if (hlist_empty(head))
 		return;
 
 	/* We can probably do that at build time */
 	size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
 	size -= sizeof(u32);
 
-	rec = perf_trace_buf_alloc(size, &fake_regs, &rctx);
+	rec = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!rec)
 		return;
 
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	if ((valid_prog_array &&
-	     !perf_call_bpf_exit(sys_data->exit_event, fake_regs, rec)) ||
-	    hlist_empty(head)) {
-		perf_swevent_put_recursion_context(rctx);
-		return;
-	}
-
 	perf_trace_buf_submit(rec, size, rctx,
 			      sys_data->exit_event->event.type, 1, regs,
 			      head, NULL);
 }

-- 
2.52.0
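P.S. Illustrative usage, not part of this series: once the verifier
change in the next patch lands, a sleepable program on a syscall
tracepoint could call faultable helpers such as bpf_copy_from_user().
The SEC("tp.s/...") suffix below is a hypothetical libbpf convention
for marking a plain tracepoint program sleepable; the userspace-side
naming is out of scope for this patch.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* "tp.s" is a hypothetical sleepable section name, see note above. */
SEC("tp.s/syscalls/sys_enter_openat")
int sleepable_openat(struct trace_event_raw_sys_enter *ctx)
{
	const char *upath = (const char *)ctx->args[1];
	char buf[64];

	/* May fault and sleep; legal only because the program is sleepable. */
	if (bpf_copy_from_user(buf, sizeof(buf), upath) < 0)
		return 0;

	bpf_printk("openat: %s", buf);
	return 0;
}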