From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jiri Olsa
Date: Tue, 24 Jun 2025 14:01:12 +0200
To: Kumar Kartikeya Dwivedi
Cc: bpf@vger.kernel.org, Eduard Zingerman, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Emil Tsalapatis,
	Barret Rhoden, Matt Bobrowski, kkd@meta.com, kernel-team@meta.com
Subject: Re: [PATCH bpf-next v3 02/12] bpf: Introduce BPF standard streams
References: <20250624031252.2966759-1-memxor@gmail.com>
	<20250624031252.2966759-3-memxor@gmail.com>
In-Reply-To: <20250624031252.2966759-3-memxor@gmail.com>
X-Mailing-List: bpf@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Jun 23, 2025 at 08:12:42PM -0700, Kumar Kartikeya Dwivedi wrote:
> Add support for a stream API to the kernel and expose related kfuncs to
> BPF programs. Two streams are exposed, BPF_STDOUT and BPF_STDERR. These
> can be used for printing messages that can be consumed from user space,
> thus it's similar in spirit to existing trace_pipe interface.
>
> The kernel will use the BPF_STDERR stream to notify the program of any
> errors encountered at runtime. BPF programs themselves may use both
> streams for writing debug messages. BPF library-like code may use
> BPF_STDERR to print warnings or errors on misuse at runtime.

just curious, IIUC we can't mix the output of the streams when we dump them,
right? I wonder if it'd be handy to be able to get combined output and see
messages from bpf programs sorted together with messages from the kernel

thanks,
jirka

>
> The implementation of a stream is as follows.
Every time a message is
> emitted from the kernel (directly, or through a BPF program), a record
> is allocated by bump allocating from a per-cpu region backed by a page
> obtained using try_alloc_pages. This ensures that we can allocate memory
> from any context. The eventual plan is to discard this scheme in favor
> of Alexei's kmalloc_nolock() [0].
>
> This record is then locklessly inserted into a list (llist_add()) so
> that the printing side doesn't require holding any locks, and works in
> any context. Each stream has a maximum capacity of 4MB of text, and each
> printed message is accounted against this limit.
>
> Messages from a program are emitted using the bpf_stream_vprintk kfunc,
> which takes a stream_id argument and otherwise works like
> bpf_trace_vprintk.
>
> The bprintf buffer helpers are extracted out to be reused for printing
> the string into them before copying it into the stream, so that we can
> (with the defined max limit) format a string and know its true length
> before performing allocations of the stream element.
>
> For consuming elements from a stream, we expose a bpf(2) syscall command
> named BPF_PROG_STREAM_READ_BY_FD, which allows reading data from the
> stream of a given prog_fd into a user space buffer. The main logic is
> implemented in bpf_stream_read(). The log messages are queued in
> bpf_stream::log by the bpf_stream_vprintk kfunc, and then pulled and
> ordered correctly in the stream backlog.
>
> For this purpose, we hold a lock around bpf_stream_backlog_peek(), as
> llist_del_first() (if we maintained a second lockless list for the
> backlog) wouldn't be safe from multiple threads anyway. Then, if we
> fail to find something in the backlog log, we splice out everything from
> the lockless log, and place it in the backlog log, and then return the
> head of the backlog. Once the full length of the element is consumed, we
> will pop it and free it.
>
> The lockless list bpf_stream::log is a LIFO stack.
Elements obtained
> using an llist_del_all() operation are in LIFO order, thus would break
> the chronological ordering if printed directly. Hence, this batch of
> messages is first reversed. Then, it is stashed into a separate list in
> the stream, i.e. the backlog_log. The head of this list is the actual
> message that should always be returned to the caller. All of this is
> done in bpf_stream_backlog_fill().
>
> From the kernel side, the writing into the stream will be a bit more
> involved than the typical printk. First, the kernel typically may print
> a collection of messages into the stream, and parallel writers into the
> stream may suffer from interleaving of messages. To ensure each group of
> messages is visible atomically, we can take advantage of using a
> lockless list for pushing in messages.
>
> To enable this, we add a bpf_stream_stage() macro, and require kernel
> users to use bpf_stream_printk statements for the passed expression to
> write into the stream. Underneath the macro, we have a message staging
> API, where a bpf_stream_stage object on the stack accumulates the
> messages being printed into a local llist_head, and then a commit
> operation splices the whole batch into the stream's lockless log list.
>
> This is especially pertinent for rqspinlock deadlock messages printed to
> program streams. After this change, we see each deadlock invocation as a
> non-interleaving contiguous message without any confusion on the
> reader's part, improving their user experience in debugging the fault.
>
> While programs cannot benefit from this staged stream writing API, they
> could just as well hold an rqspinlock around their print statements to
> serialize messages, hence this is kept kernel-internal for now.
>
> Overall, this infrastructure provides NMI-safe, any-context printing of
> messages to two dedicated streams.
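
The LIFO-to-FIFO step described above (take everything off the lockless
log, reverse the batch, append it to the backlog) can be sketched in plain
user-space C. This is only a single-threaded model under my own
assumptions: struct msg, struct backlog and the helper names are
hypothetical stand-ins, not the kernel's llist API or the patch's structs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a stream element; not the kernel struct. */
struct msg {
	struct msg *next;
	const char *text;
};

/* Hypothetical FIFO backlog with explicit head and tail pointers. */
struct backlog {
	struct msg *head;
	struct msg *tail;
};

/* Models llist_add(): push at the head, so the newest message comes first. */
static void log_push(struct msg **log, struct msg *m)
{
	m->next = *log;
	*log = m;
}

/* Models llist_del_all(): detach the whole list and leave the log empty. */
static struct msg *log_take_all(struct msg **log)
{
	struct msg *batch = *log;

	*log = NULL;
	return batch;
}

/* Models llist_reverse_order(): reverse in place, return the oldest message. */
static struct msg *batch_reverse(struct msg *head)
{
	struct msg *prev = NULL;

	while (head) {
		struct msg *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

/* Move pending messages into the backlog in chronological order. */
static void backlog_fill(struct backlog *b, struct msg **log)
{
	/* The newest message heads the LIFO batch and ends up last. */
	struct msg *tail = log_take_all(log);
	struct msg *head;

	if (!tail)
		return;
	head = batch_reverse(tail);

	if (!b->head)
		b->head = head;
	else
		b->tail->next = head;
	b->tail = tail;
	assert(b->tail->next == NULL);
}
```

Pushing a, b, filling, then pushing c, d and filling again leaves the
backlog as a -> b -> c -> d, i.e. the order in which the messages were
written, even though each batch came off the log newest-first.
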
> > Later patches will add support for printing splats in case of BPF arena > page faults, rqspinlock deadlocks, and cond_break timeouts, and > integration of this facility into bpftool for dumping messages to user > space. > > Make sure that we don't end up spamming too many errors if the program > keeps failing repeatedly and filling up the stream, hence emit at most > 512 error messages from the kernel for a given stream. > > [0]: https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com > > Reviewed-by: Eduard Zingerman > Signed-off-by: Kumar Kartikeya Dwivedi > --- > include/linux/bpf.h | 59 ++++ > include/uapi/linux/bpf.h | 24 ++ > kernel/bpf/Makefile | 2 +- > kernel/bpf/core.c | 5 + > kernel/bpf/helpers.c | 1 + > kernel/bpf/stream.c | 485 +++++++++++++++++++++++++++++++++ > kernel/bpf/syscall.c | 27 +- > tools/include/uapi/linux/bpf.h | 24 ++ > 8 files changed, 625 insertions(+), 2 deletions(-) > create mode 100644 kernel/bpf/stream.c > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 4fff0cee8622..cdd726cfe622 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -1538,6 +1538,36 @@ struct btf_mod_pair { > > struct bpf_kfunc_desc_tab; > > +enum bpf_stream_id { > + BPF_STDOUT = 1, > + BPF_STDERR = 2, > +}; > + > +struct bpf_stream_elem { > + struct llist_node node; > + int total_len; > + int consumed_len; > + char str[]; > +}; > + > +enum { > + BPF_STREAM_MAX_CAPACITY = (4 * 1024U * 1024U), > +}; > + > +struct bpf_stream { > + atomic_t capacity; > + struct llist_head log; /* list of in-flight stream elements in LIFO order */ > + > + struct mutex lock; /* lock protecting backlog_{head,tail} */ > + struct llist_node *backlog_head; /* list of in-flight stream elements in FIFO order */ > + struct llist_node *backlog_tail; /* tail of the list above */ > +}; > + > +struct bpf_stream_stage { > + struct llist_head log; > + int len; > +}; > + > struct bpf_prog_aux { > atomic64_t refcnt; > u32 used_map_cnt; > @@ 
-1646,6 +1676,8 @@ struct bpf_prog_aux { > struct work_struct work; > struct rcu_head rcu; > }; > + struct bpf_stream stream[2]; > + atomic_t stream_error_cnt; > }; > > struct bpf_prog { > @@ -2408,6 +2440,8 @@ int generic_map_delete_batch(struct bpf_map *map, > struct bpf_map *bpf_map_get_curr_or_next(u32 *id); > struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id); > > + > +struct page *__bpf_alloc_page(int nid); > int bpf_map_alloc_pages(const struct bpf_map *map, int nid, > unsigned long nr_pages, struct page **page_array); > #ifdef CONFIG_MEMCG > @@ -3573,6 +3607,31 @@ void bpf_bprintf_cleanup(struct bpf_bprintf_data *data); > int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs); > void bpf_put_buffers(void); > > +#define BPF_PROG_STREAM_ERROR_CNT 512 > + > +void bpf_prog_stream_init(struct bpf_prog *prog); > +void bpf_prog_stream_free(struct bpf_prog *prog); > +int bpf_prog_stream_read(struct bpf_prog *prog, enum bpf_stream_id stream_id, void __user *buf, int len); > +void bpf_stream_stage_init(struct bpf_stream_stage *ss); > +void bpf_stream_stage_free(struct bpf_stream_stage *ss); > +__printf(2, 3) > +int bpf_stream_stage_printk(struct bpf_stream_stage *ss, const char *fmt, ...); > +int bpf_stream_stage_commit(struct bpf_stream_stage *ss, struct bpf_prog *prog, > + enum bpf_stream_id stream_id); > + > +bool bpf_prog_stream_error_limit(struct bpf_prog *prog); > + > +#define bpf_stream_printk(ss, ...) 
bpf_stream_stage_printk(&ss, __VA_ARGS__) > + > +#define bpf_stream_stage(ss, prog, stream_id, expr) \ > + ({ \ > + if (!bpf_prog_stream_error_limit(prog)) { \ > + bpf_stream_stage_init(&ss); \ > + (expr); \ > + bpf_stream_stage_commit(&ss, prog, stream_id); \ > + bpf_stream_stage_free(&ss); \ > + } \ > + }) > > #ifdef CONFIG_BPF_LSM > void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype); > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 39e7818cca80..f2fce6a94523 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -906,6 +906,17 @@ union bpf_iter_link_info { > * A new file descriptor (a nonnegative integer), or -1 if an > * error occurred (in which case, *errno* is set appropriately). > * > + * BPF_PROG_STREAM_READ_BY_FD > + * Description > + * Read data of a program's BPF stream. The program is identified > + * by *prog_fd*, and the stream is identified by the *stream_id*. > + * The data is copied to a buffer pointed to by *stream_buf*, and > + * filled less than or equal to *stream_buf_len* bytes. > + * > + * Return > + * Number of bytes read from the stream on success, or -1 if an > + * error occurred (in which case, *errno* is set appropriately). > + * > * NOTES > * eBPF objects (maps and programs) can be shared between processes. 
> * > @@ -961,6 +972,7 @@ enum bpf_cmd { > BPF_LINK_DETACH, > BPF_PROG_BIND_MAP, > BPF_TOKEN_CREATE, > + BPF_PROG_STREAM_READ_BY_FD, > __MAX_BPF_CMD, > }; > > @@ -1463,6 +1475,11 @@ struct bpf_stack_build_id { > > #define BPF_OBJ_NAME_LEN 16U > > +enum { > + BPF_STREAM_STDOUT = 1, > + BPF_STREAM_STDERR = 2, > +}; > + > union bpf_attr { > struct { /* anonymous struct used by BPF_MAP_CREATE command */ > __u32 map_type; /* one of enum bpf_map_type */ > @@ -1849,6 +1866,13 @@ union bpf_attr { > __u32 bpffs_fd; > } token_create; > > + struct { > + __aligned_u64 stream_buf; > + __u32 stream_buf_len; > + __u32 stream_id; > + __u32 prog_fd; > + } prog_stream_read; > + > } __attribute__((aligned(8))); > > /* The description below is an attempt at providing documentation to eBPF > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > index 3a335c50e6e3..269c04a24664 100644 > --- a/kernel/bpf/Makefile > +++ b/kernel/bpf/Makefile > @@ -14,7 +14,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o > obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o > obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o > obj-$(CONFIG_BPF_JIT) += trampoline.o > -obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o > +obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o stream.o > ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy) > obj-$(CONFIG_BPF_SYSCALL) += arena.o range_tree.o > endif > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index e536a34a32c8..f0def24573ae 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -134,6 +134,10 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag > mutex_init(&fp->aux->ext_mutex); > mutex_init(&fp->aux->dst_mutex); > > +#ifdef CONFIG_BPF_SYSCALL > + bpf_prog_stream_init(fp); > +#endif > + > return fp; > } > > @@ -2862,6 +2866,7 @@ static void bpf_prog_free_deferred(struct work_struct *work) > aux = container_of(work, struct bpf_prog_aux, work); > #ifdef CONFIG_BPF_SYSCALL > 
bpf_free_kfunc_btf_tab(aux->kfunc_btf_tab); > + bpf_prog_stream_free(aux->prog); > #endif > #ifdef CONFIG_CGROUP_BPF > if (aux->cgroup_atype != CGROUP_BPF_ATTACH_TYPE_INVALID) > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 67d48f9fb173..8fef7b3cbd80 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -3393,6 +3393,7 @@ BTF_ID_FLAGS(func, bpf_iter_dmabuf_next, KF_ITER_NEXT | KF_RET_NULL | KF_SLEEPAB > BTF_ID_FLAGS(func, bpf_iter_dmabuf_destroy, KF_ITER_DESTROY | KF_SLEEPABLE) > #endif > BTF_ID_FLAGS(func, __bpf_trap) > +BTF_ID_FLAGS(func, bpf_stream_vprintk, KF_TRUSTED_ARGS) > BTF_KFUNCS_END(common_btf_ids) > > static const struct btf_kfunc_id_set common_kfunc_set = { > diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c > new file mode 100644 > index 000000000000..75ceb6379368 > --- /dev/null > +++ b/kernel/bpf/stream.c > @@ -0,0 +1,485 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +/* > + * Simple per-CPU NMI-safe bump allocation mechanism, backed by the NMI-safe > + * try_alloc_pages()/free_pages_nolock() primitives. We allocate a page and > + * stash it in a local per-CPU variable, and bump allocate from the page > + * whenever items need to be printed to a stream. Each page holds a global > + * atomic refcount in its first 4 bytes, and then records of variable length > + * that describe the printed messages. Once the global refcount has dropped to > + * zero, it is a signal to free the page back to the kernel's page allocator, > + * given all the individual records in it have been consumed. > + * > + * It is possible the same page is used to serve allocations across different > + * programs, which may be consumed at different times individually, hence > + * maintaining a reference count per-page is critical for correct lifetime > + * tracking. 
> + * > + * The bpf_stream_page code will be replaced to use kmalloc_nolock() once it > + * lands. > + */ > +struct bpf_stream_page { > + refcount_t ref; > + u32 consumed; > + char buf[]; > +}; > + > +/* Available room to add data to a refcounted page. */ > +#define BPF_STREAM_PAGE_SZ (PAGE_SIZE - offsetofend(struct bpf_stream_page, consumed)) > + > +static DEFINE_PER_CPU(local_trylock_t, stream_local_lock) = INIT_LOCAL_TRYLOCK(stream_local_lock); > +static DEFINE_PER_CPU(struct bpf_stream_page *, stream_pcpu_page); > + > +static bool bpf_stream_page_local_lock(unsigned long *flags) > +{ > + return local_trylock_irqsave(&stream_local_lock, *flags); > +} > + > +static void bpf_stream_page_local_unlock(unsigned long *flags) > +{ > + local_unlock_irqrestore(&stream_local_lock, *flags); > +} > + > +static void bpf_stream_page_free(struct bpf_stream_page *stream_page) > +{ > + struct page *p; > + > + if (!stream_page) > + return; > + p = virt_to_page(stream_page); > + free_pages_nolock(p, 0); > +} > + > +static void bpf_stream_page_get(struct bpf_stream_page *stream_page) > +{ > + refcount_inc(&stream_page->ref); > +} > + > +static void bpf_stream_page_put(struct bpf_stream_page *stream_page) > +{ > + if (refcount_dec_and_test(&stream_page->ref)) > + bpf_stream_page_free(stream_page); > +} > + > +static void bpf_stream_page_init(struct bpf_stream_page *stream_page) > +{ > + refcount_set(&stream_page->ref, 1); > + stream_page->consumed = 0; > +} > + > +static struct bpf_stream_page *bpf_stream_page_replace(void) > +{ > + struct bpf_stream_page *stream_page, *old_stream_page; > + struct page *page; > + > + page = __bpf_alloc_page(NUMA_NO_NODE); > + if (!page) > + return NULL; > + stream_page = page_address(page); > + bpf_stream_page_init(stream_page); > + > + old_stream_page = this_cpu_read(stream_pcpu_page); > + if (old_stream_page) > + bpf_stream_page_put(old_stream_page); > + this_cpu_write(stream_pcpu_page, stream_page); > + return stream_page; > +} > + > +static int 
bpf_stream_page_check_room(struct bpf_stream_page *stream_page, int len) > +{ > + int min = offsetof(struct bpf_stream_elem, str[0]); > + int consumed = stream_page->consumed; > + int total = BPF_STREAM_PAGE_SZ; > + int rem = max(0, total - consumed - min); > + > + /* Let's give room of at least 8 bytes. */ > + WARN_ON_ONCE(rem % 8 != 0); > + rem = rem < 8 ? 0 : rem; > + return min(len, rem); > +} > + > +static void bpf_stream_elem_init(struct bpf_stream_elem *elem, int len) > +{ > + init_llist_node(&elem->node); > + elem->total_len = len; > + elem->consumed_len = 0; > +} > + > +static struct bpf_stream_page *bpf_stream_page_from_elem(struct bpf_stream_elem *elem) > +{ > + unsigned long addr = (unsigned long)elem; > + > + return (struct bpf_stream_page *)PAGE_ALIGN_DOWN(addr); > +} > + > +static struct bpf_stream_elem *bpf_stream_page_push_elem(struct bpf_stream_page *stream_page, int len) > +{ > + u32 consumed = stream_page->consumed; > + > + stream_page->consumed += round_up(offsetof(struct bpf_stream_elem, str[len]), 8); > + return (struct bpf_stream_elem *)&stream_page->buf[consumed]; > +} > + > +static noinline struct bpf_stream_elem *bpf_stream_page_reserve_elem(int len) > +{ > + struct bpf_stream_elem *elem = NULL; > + struct bpf_stream_page *page; > + int room = 0; > + > + page = this_cpu_read(stream_pcpu_page); > + if (!page) > + page = bpf_stream_page_replace(); > + if (!page) > + return NULL; > + > + room = bpf_stream_page_check_room(page, len); > + if (room != len) > + page = bpf_stream_page_replace(); > + if (!page) > + return NULL; > + bpf_stream_page_get(page); > + room = bpf_stream_page_check_room(page, len); > + WARN_ON_ONCE(room != len); > + > + elem = bpf_stream_page_push_elem(page, room); > + bpf_stream_elem_init(elem, room); > + return elem; > +} > + > +static struct bpf_stream_elem *bpf_stream_elem_alloc(int len) > +{ > + const int max_len = ARRAY_SIZE((struct bpf_bprintf_buffers){}.buf); > + struct bpf_stream_elem *elem; > + unsigned long 
flags;
> +
> +	BUILD_BUG_ON(max_len > BPF_STREAM_PAGE_SZ);
> +	/*
> +	 * Length denotes the amount of data to be written as part of stream element,
> +	 * thus includes '\0' byte. We're capped by how much bpf_bprintf_buffers can
> +	 * accommodate, therefore deny allocations that won't fit into them.
> +	 */
> +	if (len < 0 || len > max_len)
> +		return NULL;
> +
> +	if (!bpf_stream_page_local_lock(&flags))
> +		return NULL;
> +	elem = bpf_stream_page_reserve_elem(len);
> +	bpf_stream_page_local_unlock(&flags);
> +	return elem;
> +}
> +
> +static int __bpf_stream_push_str(struct llist_head *log, const char *str, int len)
> +{
> +	struct bpf_stream_elem *elem = NULL;
> +
> +	/*
> +	 * Allocate a bpf_prog_stream_elem and push it to the bpf_prog_stream
> +	 * log, elements will be popped at once and reversed to print the log.
> +	 */
> +	elem = bpf_stream_elem_alloc(len);
> +	if (!elem)
> +		return -ENOMEM;
> +
> +	memcpy(elem->str, str, len);
> +	llist_add(&elem->node, log);
> +
> +	return 0;
> +}
> +
> +static int bpf_stream_consume_capacity(struct bpf_stream *stream, int len)
> +{
> +	if (atomic_read(&stream->capacity) >= BPF_STREAM_MAX_CAPACITY)
> +		return -ENOSPC;
> +	if (atomic_add_return(len, &stream->capacity) >= BPF_STREAM_MAX_CAPACITY) {
> +		atomic_sub(len, &stream->capacity);
> +		return -ENOSPC;
> +	}
> +	return 0;
> +}
> +
> +static void bpf_stream_release_capacity(struct bpf_stream *stream, struct bpf_stream_elem *elem)
> +{
> +	int len = elem->total_len;
> +
> +	atomic_sub(len, &stream->capacity);
> +}
> +
> +static int bpf_stream_push_str(struct bpf_stream *stream, const char *str, int len)
> +{
> +	int ret = bpf_stream_consume_capacity(stream, len);
> +
> +	return ret ?: __bpf_stream_push_str(&stream->log, str, len);
> +}
> +
> +static struct bpf_stream *bpf_stream_get(enum bpf_stream_id stream_id, struct bpf_prog_aux *aux)
> +{
> +	if (stream_id != BPF_STDOUT && stream_id != BPF_STDERR)
> +		return NULL;
> +	return &aux->stream[stream_id - 1];
> +}
> +
> +static
void bpf_stream_free_elem(struct bpf_stream_elem *elem) > +{ > + struct bpf_stream_page *p; > + > + p = bpf_stream_page_from_elem(elem); > + bpf_stream_page_put(p); > +} > + > +static void bpf_stream_free_list(struct llist_node *list) > +{ > + struct bpf_stream_elem *elem, *tmp; > + > + llist_for_each_entry_safe(elem, tmp, list, node) > + bpf_stream_free_elem(elem); > +} > + > +static struct llist_node *bpf_stream_backlog_peek(struct bpf_stream *stream) > +{ > + return stream->backlog_head; > +} > + > +static struct llist_node *bpf_stream_backlog_pop(struct bpf_stream *stream) > +{ > + struct llist_node *node; > + > + node = stream->backlog_head; > + if (stream->backlog_head == stream->backlog_tail) > + stream->backlog_head = stream->backlog_tail = NULL; > + else > + stream->backlog_head = node->next; > + return node; > +} > + > +static void bpf_stream_backlog_fill(struct bpf_stream *stream) > +{ > + struct llist_node *head, *tail; > + > + if (llist_empty(&stream->log)) > + return; > + tail = llist_del_all(&stream->log); > + if (!tail) > + return; > + head = llist_reverse_order(tail); > + > + if (!stream->backlog_head) { > + stream->backlog_head = head; > + stream->backlog_tail = tail; > + } else { > + stream->backlog_tail->next = head; > + stream->backlog_tail = tail; > + } > + > + return; > +} > + > +static bool bpf_stream_consume_elem(struct bpf_stream_elem *elem, int *len) > +{ > + int rem = elem->total_len - elem->consumed_len; > + int used = min(rem, *len); > + > + elem->consumed_len += used; > + *len -= used; > + > + return elem->consumed_len == elem->total_len; > +} > + > +static int bpf_stream_read(struct bpf_stream *stream, void __user *buf, int len) > +{ > + int rem_len = len, cons_len, ret = 0; > + struct bpf_stream_elem *elem = NULL; > + struct llist_node *node; > + > + mutex_lock(&stream->lock); > + > + while (rem_len) { > + int pos = len - rem_len; > + bool cont; > + > + node = bpf_stream_backlog_peek(stream); > + if (!node) { > + 
bpf_stream_backlog_fill(stream); > + node = bpf_stream_backlog_peek(stream); > + } > + if (!node) > + break; > + elem = container_of(node, typeof(*elem), node); > + > + cons_len = elem->consumed_len; > + cont = bpf_stream_consume_elem(elem, &rem_len) == false; > + > + ret = copy_to_user(buf + pos, elem->str + cons_len, > + elem->consumed_len - cons_len); > + /* Restore in case of error. */ > + if (ret) { > + ret = -EFAULT; > + elem->consumed_len = cons_len; > + break; > + } > + > + if (cont) > + continue; > + bpf_stream_backlog_pop(stream); > + bpf_stream_release_capacity(stream, elem); > + bpf_stream_free_elem(elem); > + } > + > + mutex_unlock(&stream->lock); > + return ret ? ret : len - rem_len; > +} > + > +int bpf_prog_stream_read(struct bpf_prog *prog, enum bpf_stream_id stream_id, void __user *buf, int len) > +{ > + struct bpf_stream *stream; > + > + stream = bpf_stream_get(stream_id, prog->aux); > + if (!stream) > + return -ENOENT; > + return bpf_stream_read(stream, buf, len); > +} > + > +__bpf_kfunc_start_defs(); > + > +/* > + * Avoid using enum bpf_stream_id so that kfunc users don't have to pull in the > + * enum in headers. 
> + */ > +__bpf_kfunc int bpf_stream_vprintk(int stream_id, const char *fmt__str, const void *args, u32 len__sz, void *aux__prog) > +{ > + struct bpf_bprintf_data data = { > + .get_bin_args = true, > + .get_buf = true, > + }; > + struct bpf_prog_aux *aux = aux__prog; > + u32 fmt_size = strlen(fmt__str) + 1; > + struct bpf_stream *stream; > + u32 data_len = len__sz; > + int ret, num_args; > + > + stream = bpf_stream_get(stream_id, aux); > + if (!stream) > + return -ENOENT; > + > + if (data_len & 7 || data_len > MAX_BPRINTF_VARARGS * 8 || > + (data_len && !args)) > + return -EINVAL; > + num_args = data_len / 8; > + > + ret = bpf_bprintf_prepare(fmt__str, fmt_size, args, num_args, &data); > + if (ret < 0) > + return ret; > + > + ret = bstr_printf(data.buf, MAX_BPRINTF_BUF, fmt__str, data.bin_args); > + /* If the string was truncated, we only wrote until the size of buffer. */ > + ret = min_t(u32, ret + 1, MAX_BPRINTF_BUF); > + ret = bpf_stream_push_str(stream, data.buf, ret); > + bpf_bprintf_cleanup(&data); > + > + return ret; > +} > + > +__bpf_kfunc_end_defs(); > + > +/* Added kfunc to common_btf_ids */ > + > +void bpf_prog_stream_init(struct bpf_prog *prog) > +{ > + int i; > + > + for (i = 0; i < ARRAY_SIZE(prog->aux->stream); i++) { > + atomic_set(&prog->aux->stream[i].capacity, 0); > + init_llist_head(&prog->aux->stream[i].log); > + mutex_init(&prog->aux->stream[i].lock); > + prog->aux->stream[i].backlog_head = NULL; > + prog->aux->stream[i].backlog_tail = NULL; > + } > +} > + > +void bpf_prog_stream_free(struct bpf_prog *prog) > +{ > + struct llist_node *list; > + int i; > + > + for (i = 0; i < ARRAY_SIZE(prog->aux->stream); i++) { > + list = llist_del_all(&prog->aux->stream[i].log); > + bpf_stream_free_list(list); > + bpf_stream_free_list(prog->aux->stream[i].backlog_head); > + } > +} > + > +void bpf_stream_stage_init(struct bpf_stream_stage *ss) > +{ > + init_llist_head(&ss->log); > + ss->len = 0; > +} > + > +void bpf_stream_stage_free(struct bpf_stream_stage 
*ss) > +{ > + struct llist_node *node; > + > + node = llist_del_all(&ss->log); > + bpf_stream_free_list(node); > +} > + > +int bpf_stream_stage_printk(struct bpf_stream_stage *ss, const char *fmt, ...) > +{ > + struct bpf_bprintf_buffers *buf; > + va_list args; > + int ret; > + > + if (bpf_try_get_buffers(&buf)) > + return -EBUSY; > + > + va_start(args, fmt); > + ret = vsnprintf(buf->buf, ARRAY_SIZE(buf->buf), fmt, args); > + va_end(args); > + /* If the string was truncated, we only wrote until the size of buffer. */ > + ret = min_t(u32, ret + 1, ARRAY_SIZE(buf->buf)); > + ss->len += ret; > + ret = __bpf_stream_push_str(&ss->log, buf->buf, ret); > + bpf_put_buffers(); > + return ret; > +} > + > +int bpf_stream_stage_commit(struct bpf_stream_stage *ss, struct bpf_prog *prog, > + enum bpf_stream_id stream_id) > +{ > + struct llist_node *list, *head, *tail; > + struct bpf_stream *stream; > + int ret; > + > + stream = bpf_stream_get(stream_id, prog->aux); > + if (!stream) > + return -EINVAL; > + > + ret = bpf_stream_consume_capacity(stream, ss->len); > + if (ret) > + return ret; > + > + list = llist_del_all(&ss->log); > + head = tail = list; > + > + if (!list) > + return 0; > + while (llist_next(list)) { > + tail = llist_next(list); > + list = tail; > + } > + llist_add_batch(head, tail, &stream->log); > + return 0; > +} > + > +bool bpf_prog_stream_error_limit(struct bpf_prog *prog) > +{ > + return atomic_fetch_add(1, &prog->aux->stream_error_cnt) >= BPF_PROG_STREAM_ERROR_CNT; > +} > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 56500381c28a..ac1010b9d11b 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -576,7 +576,7 @@ static bool can_alloc_pages(void) > !IS_ENABLED(CONFIG_PREEMPT_RT); > } > > -static struct page *__bpf_alloc_page(int nid) > +struct page *__bpf_alloc_page(int nid) > { > if (!can_alloc_pages()) > return alloc_pages_nolock(nid, 0); > @@ -5936,6 +5936,28 @@ static int token_create(union bpf_attr *attr) > return 
bpf_token_create(attr); > } > > +#define BPF_PROG_STREAM_READ_BY_FD_LAST_FIELD prog_stream_read.prog_fd > + > +static int prog_stream_read(union bpf_attr *attr) > +{ > + char __user *buf = u64_to_user_ptr(attr->prog_stream_read.stream_buf); > + u32 len = attr->prog_stream_read.stream_buf_len; > + struct bpf_prog *prog; > + int ret; > + > + if (CHECK_ATTR(BPF_PROG_STREAM_READ_BY_FD)) > + return -EINVAL; > + > + prog = bpf_prog_get(attr->prog_stream_read.prog_fd); > + if (IS_ERR(prog)) > + return PTR_ERR(prog); > + > + ret = bpf_prog_stream_read(prog, attr->prog_stream_read.stream_id, buf, len); > + bpf_prog_put(prog); > + > + return ret; > +} > + > static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size) > { > union bpf_attr attr; > @@ -6072,6 +6094,9 @@ static int __sys_bpf(enum bpf_cmd cmd, bpfptr_t uattr, unsigned int size) > case BPF_TOKEN_CREATE: > err = token_create(&attr); > break; > + case BPF_PROG_STREAM_READ_BY_FD: > + err = prog_stream_read(&attr); > + break; > default: > err = -EINVAL; > break; > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 39e7818cca80..f2fce6a94523 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -906,6 +906,17 @@ union bpf_iter_link_info { > * A new file descriptor (a nonnegative integer), or -1 if an > * error occurred (in which case, *errno* is set appropriately). > * > + * BPF_PROG_STREAM_READ_BY_FD > + * Description > + * Read data of a program's BPF stream. The program is identified > + * by *prog_fd*, and the stream is identified by the *stream_id*. > + * The data is copied to a buffer pointed to by *stream_buf*, and > + * filled less than or equal to *stream_buf_len* bytes. > + * > + * Return > + * Number of bytes read from the stream on success, or -1 if an > + * error occurred (in which case, *errno* is set appropriately). > + * > * NOTES > * eBPF objects (maps and programs) can be shared between processes. 
> * > @@ -961,6 +972,7 @@ enum bpf_cmd { > BPF_LINK_DETACH, > BPF_PROG_BIND_MAP, > BPF_TOKEN_CREATE, > + BPF_PROG_STREAM_READ_BY_FD, > __MAX_BPF_CMD, > }; > > @@ -1463,6 +1475,11 @@ struct bpf_stack_build_id { > > #define BPF_OBJ_NAME_LEN 16U > > +enum { > + BPF_STREAM_STDOUT = 1, > + BPF_STREAM_STDERR = 2, > +}; > + > union bpf_attr { > struct { /* anonymous struct used by BPF_MAP_CREATE command */ > __u32 map_type; /* one of enum bpf_map_type */ > @@ -1849,6 +1866,13 @@ union bpf_attr { > __u32 bpffs_fd; > } token_create; > > + struct { > + __aligned_u64 stream_buf; > + __u32 stream_buf_len; > + __u32 stream_id; > + __u32 prog_fd; > + } prog_stream_read; > + > } __attribute__((aligned(8))); > > /* The description below is an attempt at providing documentation to eBPF > -- > 2.47.1 > >
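
For anyone following the capacity accounting in
bpf_stream_consume_capacity() / bpf_stream_release_capacity(), here is a
minimal user-space sketch of the same reserve-then-roll-back pattern using
C11 atomics. The names struct stream_cap, cap_consume() and cap_release()
are hypothetical and only for illustration. The kernel code uses atomic_t
and atomic_add_return(), which this approximates with atomic_fetch_add()
plus the added length.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Same 4MB limit as BPF_STREAM_MAX_CAPACITY in the patch. */
#define STREAM_MAX_CAPACITY (4 * 1024 * 1024)

/* Hypothetical stand-in for the capacity counter kept in a stream. */
struct stream_cap {
	atomic_int used;
};

/* Try to account len bytes against the limit; 0 on success, -ENOSPC if full. */
static int cap_consume(struct stream_cap *s, int len)
{
	assert(len >= 0);

	/* Cheap early exit once the stream is already full. */
	if (atomic_load(&s->used) >= STREAM_MAX_CAPACITY)
		return -ENOSPC;

	/* atomic_fetch_add() returns the old value, so add len for the new total. */
	if (atomic_fetch_add(&s->used, len) + len >= STREAM_MAX_CAPACITY) {
		/* Over the limit: undo our reservation and fail. */
		atomic_fetch_sub(&s->used, len);
		return -ENOSPC;
	}
	return 0;
}

/* Give back len bytes once the reader has consumed the message. */
static void cap_release(struct stream_cap *s, int len)
{
	atomic_fetch_sub(&s->used, len);
}
```

As in the patch, concurrent writers may briefly push the counter past the
limit before rolling back, so the check is best-effort rather than exact;
a writer that loses the race simply gets -ENOSPC.
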