From: Mykyta Yatsenko
To: Kumar Kartikeya Dwivedi
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
 daniel@iogearbox.net, kafai@meta.com, kernel-team@meta.com,
 eddyz87@gmail.com, Mykyta Yatsenko
Subject: Re: [PATCH bpf-next v5 2/5] bpf: Add sleepable support for classic tracepoint programs
In-Reply-To:
References: <20260316-sleepable_tracepoints-v5-0-85525de71d25@meta.com>
 <20260316-sleepable_tracepoints-v5-2-85525de71d25@meta.com>
Date: Mon, 23 Mar 2026 20:57:40 +0000
Message-ID: <87a4vyjmbf.fsf@gmail.com>

Kumar Kartikeya Dwivedi writes:

> On Mon, 16 Mar 2026 at 22:47, Mykyta Yatsenko wrote:
>>
>> From: Mykyta Yatsenko
>>
>> Add trace_call_bpf_faultable(), a variant of trace_call_bpf() for
>> faultable tracepoints that supports sleepable BPF programs. It uses
>> rcu_tasks_trace for lifetime protection and bpf_prog_run_array_uprobe()
>> for per-program RCU flavor selection, following the uprobe_prog_run()
>> pattern. Use the preempt-safe this_cpu_inc_return()/this_cpu_dec()
>> pair for the bpf_prog_active recursion counter, since preemption is
>> enabled in this context.
>>
>> Restructure perf_syscall_enter() and perf_syscall_exit() to run the
>> BPF filter before perf event processing. Previously, BPF ran after
>> the per-cpu perf trace buffer was allocated under preempt_disable,
>> requiring cleanup via perf_swevent_put_recursion_context() when the
>> filter dropped the event. Now BPF runs in faultable context before
>> preempt_disable, reading syscall arguments from local variables
>> instead of the per-cpu trace record, which removes the dependency on
>> buffer allocation. This allows sleepable BPF programs to execute and
>> avoids unnecessary buffer allocation when BPF filters the event out.
>> The perf event submission path (buffer allocation, fill, submit)
>> remains under preempt_disable as before.
>>
>> Add an attach-time check in __perf_event_set_bpf_prog() to reject
>> sleepable BPF_PROG_TYPE_TRACEPOINT programs on non-syscall
>> tracepoints, since only syscall tracepoints run in faultable context.
>>
>> This prepares the classic tracepoint runtime and attach paths for
>> sleepable programs. The verifier changes to allow loading sleepable
>> BPF_PROG_TYPE_TRACEPOINT programs are in a subsequent patch.
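To make the restructure concrete (the trace_syscalls.c hunk is further
down and partly elided), the new shape of perf_syscall_enter() is
roughly the following. Simplified sketch only, not the literal hunk;
the syscall_bpf_filter() helper name is made up here, and declarations
are elided:

	static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
	{
		unsigned long args[6];
		...

		/*
		 * Run the BPF filter first, while still faultable: the
		 * arguments are read into locals, so no per-cpu trace
		 * buffer is needed and sleepable programs may fault.
		 */
		syscall_get_arguments(current, regs, args);
		if (!syscall_bpf_filter(call, regs, id, args))	/* trace_call_bpf_faultable() inside */
			return;

		/*
		 * Only the submission path stays under preempt_disable,
		 * as before: allocate the per-cpu buffer, fill the
		 * record, submit it.
		 */
		preempt_disable();
		rec = perf_trace_buf_alloc(size, NULL, &rctx);
		if (rec) {
			/* fill rec->nr and rec->args from args[] */
			perf_trace_buf_submit(rec, size, rctx,
					      sys_data->enter_event->event.type,
					      1, regs, head, NULL);
		}
		preempt_enable();
	}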
>> >> Signed-off-by: Mykyta Yatsenko >> --- >> include/linux/trace_events.h | 6 +++ >> kernel/events/core.c | 9 ++++ >> kernel/trace/bpf_trace.c | 39 ++++++++++++++++ >> kernel/trace/trace_syscalls.c | 104 +++++++++++++++++++++++------------------- >> 4 files changed, 110 insertions(+), 48 deletions(-) >> >> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h >> index 37eb2f0f3dd8..5fbbeb9ec4b9 100644 >> --- a/include/linux/trace_events.h >> +++ b/include/linux/trace_events.h >> @@ -767,6 +767,7 @@ trace_trigger_soft_disabled(struct trace_event_file *file) >> >> #ifdef CONFIG_BPF_EVENTS >> unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx); >> +unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx); >> int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie); >> void perf_event_detach_bpf_prog(struct perf_event *event); >> int perf_event_query_prog_array(struct perf_event *event, void __user *info); >> @@ -789,6 +790,11 @@ static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *c >> return 1; >> } >> >> +static inline unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx) >> +{ >> + return 1; >> +} >> + >> static inline int >> perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog, u64 bpf_cookie) >> { >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 1f5699b339ec..46b733d3dd41 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -11647,6 +11647,15 @@ static int __perf_event_set_bpf_prog(struct perf_event *event, >> /* only uprobe programs are allowed to be sleepable */ >> return -EINVAL; >> >> + if (prog->type == BPF_PROG_TYPE_TRACEPOINT && prog->sleepable) { >> + /* >> + * Sleepable tracepoint programs can only attach to faultable >> + * tracepoints. Currently only syscall tracepoints are faultable. >> + */ >> + if (!is_syscall_tp) >> + return -EINVAL; >> + } >> + >> /* Kprobe override only works for kprobes, not uprobes. */ >> if (prog->kprobe_override && !is_kprobe) >> return -EINVAL; >> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c >> index 35ed53807cfd..69c9a5539e65 100644 >> --- a/kernel/trace/bpf_trace.c >> +++ b/kernel/trace/bpf_trace.c >> @@ -152,6 +152,45 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) >> return ret; >> } >> >> +/** >> + * trace_call_bpf_faultable - invoke BPF program in faultable context >> + * @call: tracepoint event >> + * @ctx: opaque context pointer >> + * >> + * Variant of trace_call_bpf() for faultable tracepoints (e.g. syscall >> + * tracepoints). Supports sleepable BPF programs by using rcu_tasks_trace >> + * for lifetime protection and per-program rcu_read_lock for non-sleepable >> + * programs, following the uprobe_prog_run() pattern. >> + * >> + * Must be called from a faultable/preemptible context. >> + */ >> +unsigned int trace_call_bpf_faultable(struct trace_event_call *call, void *ctx) >> +{ >> + struct bpf_prog_array *prog_array; >> + unsigned int ret; >> + >> + might_fault(); >> + >> + guard(rcu_tasks_trace)(); >> + guard(migrate)(); >> + >> + if (unlikely(this_cpu_inc_return(bpf_prog_active) != 1)) { > > This seems a bit heavy handed, why not > bpf_prog_get_recursion_context()? Esp. since we can potentially sleep > and other tasks can preempt us on the same CPU. Agree, the same point was raised by sashiko review. 
>
>> +		scoped_guard(rcu) {
>> +			bpf_prog_inc_misses_counters(rcu_dereference(call->prog_array));
>> +		}
>> +		this_cpu_dec(bpf_prog_active);
>> +		return 0;
>> +	}
>> +
>> +	prog_array = rcu_dereference_check(call->prog_array,
>> +					   rcu_read_lock_trace_held());
>> +	ret = bpf_prog_run_array_uprobe(prog_array, ctx, bpf_prog_run);
>
> This confused me a bit, do we need to generalize it a bit? E.g. it
> would now set run_ctx.is_uprobe = 1 for this case too.
> Actually, except for that bit, the rest looks reusable, so maybe it
> should be renamed to a __ version with two wrappers exposed, one
> passing is_uprobe = false and one passing true?
> Then do bpf_prog_run_array_trace() here?

Thanks, I agree with your point.

>
>> +
>> [...]
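FWIW, the split I read out of your suggestion would look something
like the below, written from memory of the current
bpf_prog_run_array_uprobe() body; the __bpf_prog_run_array_sleepable
name is provisional, not an existing symbol:

	static __always_inline u32
	__bpf_prog_run_array_sleepable(const struct bpf_prog_array *array,
				       const void *ctx, bpf_prog_run_fn run_prog,
				       bool is_uprobe)
	{
		const struct bpf_prog_array_item *item;
		const struct bpf_prog *prog;
		struct bpf_run_ctx *old_run_ctx;
		struct bpf_trace_run_ctx run_ctx;
		u32 ret = 1;

		might_fault();
		RCU_LOCKDEP_WARN(!rcu_read_lock_trace_held(),
				 "no rcu lock held");

		if (unlikely(!array))
			return ret;

		migrate_disable();
		/* the only caller-dependent bit */
		run_ctx.is_uprobe = is_uprobe;
		old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
		item = &array->items[0];
		while ((prog = READ_ONCE(item->prog))) {
			/* non-sleepable programs still run under normal RCU */
			if (!prog->sleepable)
				rcu_read_lock();
			run_ctx.bpf_cookie = item->bpf_cookie;
			ret &= run_prog(prog, ctx);
			if (!prog->sleepable)
				rcu_read_unlock();
			item++;
		}
		bpf_reset_run_ctx(old_run_ctx);
		migrate_enable();
		return ret;
	}

	static __always_inline u32
	bpf_prog_run_array_uprobe(const struct bpf_prog_array *array,
				  const void *ctx, bpf_prog_run_fn run_prog)
	{
		return __bpf_prog_run_array_sleepable(array, ctx, run_prog, true);
	}

	static __always_inline u32
	bpf_prog_run_array_trace(const struct bpf_prog_array *array,
				 const void *ctx, bpf_prog_run_fn run_prog)
	{
		return __bpf_prog_run_array_sleepable(array, ctx, run_prog, false);
	}

trace_call_bpf_faultable() would then call bpf_prog_run_array_trace().
Will do that in v6.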