* [PATCH bpf-next v1 0/2] veristat: memory accounting for bpf programs
@ 2025-06-05 23:06 Eduard Zingerman
2025-06-05 23:06 ` [PATCH bpf-next v1 1/2] bpf: include verifier memory allocations in memcg statistics Eduard Zingerman
2025-06-05 23:06 ` [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs Eduard Zingerman
0 siblings, 2 replies; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-05 23:06 UTC (permalink / raw)
To: bpf, ast
Cc: andrii, daniel, martin.lau, kernel-team, yonghong.song,
Eduard Zingerman
When working on the verifier, it is sometimes interesting to know how a
particular change affects memory consumption. This patch-set modifies
veristat to provide such information. As a prerequisite, the kernel
needs an update to make allocations reachable from BPF program load
accountable in memcg statistics.
Here is a sample output:
File Program Peak states Peak memory (KiB)
--------- -------------------- ----------- -----------------
bpf.bpf.o lavd_select_cpu 1311 26256
bpf.bpf.o lavd_enqueue 1140 22720
bpf.bpf.o layered_enqueue 777 11504
...
Technically, this is implemented by creating and entering a new cgroup
before verifying each program. The difference in memory.peak values
before and after bpf_object__load() is reported as the metric.
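In sketch form (illustrative only, names are made up and error handling
is elided; the actual implementation is in patch 2):

    #include <stdio.h>
    #include <stdlib.h>

    /* current value of the cgroup's memory.peak file, in bytes */
    static long read_peak(const char *memory_peak_path)
    {
            FILE *f = fopen(memory_peak_path, "r");
            char buf[32];
            long v = -1;

            if (!f)
                    return -1;
            if (fgets(buf, sizeof(buf), f))
                    v = strtol(buf, NULL, 10);
            fclose(f);
            return v;
    }

    /* ... create and enter a fresh cgroup ... */
    before = read_peak(peak_path);
    err = bpf_object__load(obj);
    after = read_peak(peak_path);
    mem_peak_kib = (after - before) / 1024;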
This incurs some overhead in veristat runtime. For example:
- increase from 82s to 102s on test_progs BPF binaries;
- increase from 42s to 47s on sched_ext BPF binaries.
Measurements are not completely stable and might change from run to
run by +-256KiB or so. For sched_ext binaries I observe about 3 changes
per run out of a total of 188 programs. This mostly affects very small
programs.
I tried a different scheme, where a new cgroup was allocated only once,
and then a combination of "echo 32G > memory.reclaim" and
"echo reset > memory.peak" (via fd) was executed before each program
load. For reasons that remain unclear, this approach did not produce
stable measurements.
Eduard Zingerman (2):
bpf: include verifier memory allocations in memcg statistics
veristat: memory accounting for bpf programs
kernel/bpf/btf.c | 15 +-
kernel/bpf/verifier.c | 49 ++---
tools/testing/selftests/bpf/veristat.c | 249 ++++++++++++++++++++++++-
3 files changed, 275 insertions(+), 38 deletions(-)
--
2.48.1
* [PATCH bpf-next v1 1/2] bpf: include verifier memory allocations in memcg statistics
2025-06-05 23:06 [PATCH bpf-next v1 0/2] veristat: memory accounting for bpf programs Eduard Zingerman
@ 2025-06-05 23:06 ` Eduard Zingerman
2025-06-05 23:06 ` [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs Eduard Zingerman
1 sibling, 0 replies; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-05 23:06 UTC (permalink / raw)
To: bpf, ast
Cc: andrii, daniel, martin.lau, kernel-team, yonghong.song,
Eduard Zingerman
This commit adds the __GFP_ACCOUNT flag to verifier-induced memory
allocations. The intent is to account for all allocations reachable
from the BPF_PROG_LOAD command, which is needed to track verifier memory
consumption in veristat. This covers allocations done in verifier.c
and some allocations in btf.c; functions in log.c do not allocate.
There is also a utility function bpf_memcg_flags(), which selectively
adds the __GFP_ACCOUNT flag depending on the `cgroup.memory=nobpf`
option. As far as I understand [1], the idea there is to exclude
bpf_prog instances and maps from memcg accounting, as these objects do
not strictly belong to a cgroup; hence it should not apply here.
(btf_parse_fields() is reachable from both program load and map
creation, but the allocated record is not persistent and is freed as
soon as map_check_btf() exits.)
[1] https://lore.kernel.org/all/20230210154734.4416-1-laoar.shao@gmail.com/
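For reference, bpf_memcg_flags() looks roughly like this (sketch from
memory, exact config guards may differ):

    static inline gfp_t bpf_memcg_flags(gfp_t flags)
    {
            if (memcg_bpf_enabled())
                    return flags | __GFP_ACCOUNT;
            return flags;
    }

With `cgroup.memory=nobpf` the static key behind memcg_bpf_enabled()
is disabled, so no accounting flag is added.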
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
---
kernel/bpf/btf.c | 15 ++++++-------
kernel/bpf/verifier.c | 49 ++++++++++++++++++++++---------------------
2 files changed, 33 insertions(+), 31 deletions(-)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 1d2cf898e21e..682acb1ed234 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3443,7 +3443,8 @@ btf_find_graph_root(const struct btf *btf, const struct btf_type *pt,
node_field_name = strstr(value_type, ":");
if (!node_field_name)
return -EINVAL;
- value_type = kstrndup(value_type, node_field_name - value_type, GFP_KERNEL | __GFP_NOWARN);
+ value_type = kstrndup(value_type, node_field_name - value_type,
+ GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!value_type)
return -ENOMEM;
id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
@@ -3958,7 +3959,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
/* This needs to be kzalloc to zero out padding and unused fields, see
* comment in btf_record_equal.
*/
- rec = kzalloc(struct_size(rec, fields, cnt), GFP_KERNEL | __GFP_NOWARN);
+ rec = kzalloc(struct_size(rec, fields, cnt), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!rec)
return ERR_PTR(-ENOMEM);
@@ -9019,7 +9020,7 @@ static struct bpf_cand_cache *populate_cand_cache(struct bpf_cand_cache *cands,
bpf_free_cands_from_cache(*cc);
*cc = NULL;
}
- new_cands = kmemdup(cands, sizeof_cands(cands->cnt), GFP_KERNEL);
+ new_cands = kmemdup(cands, sizeof_cands(cands->cnt), GFP_KERNEL_ACCOUNT);
if (!new_cands) {
bpf_free_cands(cands);
return ERR_PTR(-ENOMEM);
@@ -9027,7 +9028,7 @@ static struct bpf_cand_cache *populate_cand_cache(struct bpf_cand_cache *cands,
/* strdup the name, since it will stay in cache.
* the cands->name points to strings in prog's BTF and the prog can be unloaded.
*/
- new_cands->name = kmemdup_nul(cands->name, cands->name_len, GFP_KERNEL);
+ new_cands->name = kmemdup_nul(cands->name, cands->name_len, GFP_KERNEL_ACCOUNT);
bpf_free_cands(cands);
if (!new_cands->name) {
kfree(new_cands);
@@ -9111,7 +9112,7 @@ bpf_core_add_cands(struct bpf_cand_cache *cands, const struct btf *targ_btf,
continue;
/* most of the time there is only one candidate for a given kind+name pair */
- new_cands = kmalloc(sizeof_cands(cands->cnt + 1), GFP_KERNEL);
+ new_cands = kmalloc(sizeof_cands(cands->cnt + 1), GFP_KERNEL_ACCOUNT);
if (!new_cands) {
bpf_free_cands(cands);
return ERR_PTR(-ENOMEM);
@@ -9228,7 +9229,7 @@ int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo,
/* ~4k of temp memory necessary to convert LLVM spec like "0:1:0:5"
* into arrays of btf_ids of struct fields and array indices.
*/
- specs = kcalloc(3, sizeof(*specs), GFP_KERNEL);
+ specs = kcalloc(3, sizeof(*specs), GFP_KERNEL_ACCOUNT);
if (!specs)
return -ENOMEM;
@@ -9253,7 +9254,7 @@ int bpf_core_apply(struct bpf_core_ctx *ctx, const struct bpf_core_relo *relo,
goto out;
}
if (cc->cnt) {
- cands.cands = kcalloc(cc->cnt, sizeof(*cands.cands), GFP_KERNEL);
+ cands.cands = kcalloc(cc->cnt, sizeof(*cands.cands), GFP_KERNEL_ACCOUNT);
if (!cands.cands) {
err = -ENOMEM;
goto out;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a7d6e0c5928b..dbf22877c0d8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1403,7 +1403,7 @@ static void *realloc_array(void *arr, size_t old_n, size_t new_n, size_t size)
goto out;
alloc_size = kmalloc_size_roundup(size_mul(new_n, size));
- new_arr = krealloc(arr, alloc_size, GFP_KERNEL);
+ new_arr = krealloc(arr, alloc_size, GFP_KERNEL_ACCOUNT);
if (!new_arr) {
kfree(arr);
return NULL;
@@ -1420,7 +1420,7 @@ static void *realloc_array(void *arr, size_t old_n, size_t new_n, size_t size)
static int copy_reference_state(struct bpf_verifier_state *dst, const struct bpf_verifier_state *src)
{
dst->refs = copy_array(dst->refs, src->refs, src->acquired_refs,
- sizeof(struct bpf_reference_state), GFP_KERNEL);
+ sizeof(struct bpf_reference_state), GFP_KERNEL_ACCOUNT);
if (!dst->refs)
return -ENOMEM;
@@ -1439,7 +1439,7 @@ static int copy_stack_state(struct bpf_func_state *dst, const struct bpf_func_st
size_t n = src->allocated_stack / BPF_REG_SIZE;
dst->stack = copy_array(dst->stack, src->stack, n, sizeof(struct bpf_stack_state),
- GFP_KERNEL);
+ GFP_KERNEL_ACCOUNT);
if (!dst->stack)
return -ENOMEM;
@@ -1760,7 +1760,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
for (i = 0; i <= src->curframe; i++) {
dst = dst_state->frame[i];
if (!dst) {
- dst = kzalloc(sizeof(*dst), GFP_KERNEL);
+ dst = kzalloc(sizeof(*dst), GFP_KERNEL_ACCOUNT);
if (!dst)
return -ENOMEM;
dst_state->frame[i] = dst;
@@ -2020,7 +2020,7 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env,
struct bpf_verifier_stack_elem *elem;
int err;
- elem = kzalloc(sizeof(struct bpf_verifier_stack_elem), GFP_KERNEL);
+ elem = kzalloc(sizeof(struct bpf_verifier_stack_elem), GFP_KERNEL_ACCOUNT);
if (!elem)
goto err;
@@ -2787,7 +2787,7 @@ static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
struct bpf_verifier_stack_elem *elem;
struct bpf_func_state *frame;
- elem = kzalloc(sizeof(struct bpf_verifier_stack_elem), GFP_KERNEL);
+ elem = kzalloc(sizeof(struct bpf_verifier_stack_elem), GFP_KERNEL_ACCOUNT);
if (!elem)
goto err;
@@ -2815,7 +2815,7 @@ static struct bpf_verifier_state *push_async_cb(struct bpf_verifier_env *env,
elem->st.in_sleepable = is_sleepable;
elem->st.insn_hist_start = env->cur_state->insn_hist_end;
elem->st.insn_hist_end = elem->st.insn_hist_start;
- frame = kzalloc(sizeof(*frame), GFP_KERNEL);
+ frame = kzalloc(sizeof(*frame), GFP_KERNEL_ACCOUNT);
if (!frame)
goto err;
init_func_state(env, frame,
@@ -3167,7 +3167,7 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
return -EINVAL;
}
- tab = kzalloc(sizeof(*tab), GFP_KERNEL);
+ tab = kzalloc(sizeof(*tab), GFP_KERNEL_ACCOUNT);
if (!tab)
return -ENOMEM;
prog_aux->kfunc_tab = tab;
@@ -3183,7 +3183,7 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
return 0;
if (!btf_tab && offset) {
- btf_tab = kzalloc(sizeof(*btf_tab), GFP_KERNEL);
+ btf_tab = kzalloc(sizeof(*btf_tab), GFP_KERNEL_ACCOUNT);
if (!btf_tab)
return -ENOMEM;
prog_aux->kfunc_btf_tab = btf_tab;
@@ -10279,7 +10279,7 @@ static int setup_func_entry(struct bpf_verifier_env *env, int subprog, int calls
}
caller = state->frame[state->curframe];
- callee = kzalloc(sizeof(*callee), GFP_KERNEL);
+ callee = kzalloc(sizeof(*callee), GFP_KERNEL_ACCOUNT);
if (!callee)
return -ENOMEM;
state->frame[state->curframe + 1] = callee;
@@ -17606,17 +17606,18 @@ static int check_cfg(struct bpf_verifier_env *env)
int *insn_stack, *insn_state, *insn_postorder;
int ex_insn_beg, i, ret = 0;
- insn_state = env->cfg.insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
+ insn_state = env->cfg.insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL_ACCOUNT);
if (!insn_state)
return -ENOMEM;
- insn_stack = env->cfg.insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
+ insn_stack = env->cfg.insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL_ACCOUNT);
if (!insn_stack) {
kvfree(insn_state);
return -ENOMEM;
}
- insn_postorder = env->cfg.insn_postorder = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL);
+ insn_postorder = env->cfg.insn_postorder =
+ kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL_ACCOUNT);
if (!insn_postorder) {
kvfree(insn_state);
kvfree(insn_stack);
@@ -17750,7 +17751,7 @@ static int check_btf_func_early(struct bpf_verifier_env *env,
urecord = make_bpfptr(attr->func_info, uattr.is_kernel);
min_size = min_t(u32, krec_size, urec_size);
- krecord = kvcalloc(nfuncs, krec_size, GFP_KERNEL | __GFP_NOWARN);
+ krecord = kvcalloc(nfuncs, krec_size, GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!krecord)
return -ENOMEM;
@@ -17850,7 +17851,7 @@ static int check_btf_func(struct bpf_verifier_env *env,
urecord = make_bpfptr(attr->func_info, uattr.is_kernel);
krecord = prog->aux->func_info;
- info_aux = kcalloc(nfuncs, sizeof(*info_aux), GFP_KERNEL | __GFP_NOWARN);
+ info_aux = kcalloc(nfuncs, sizeof(*info_aux), GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!info_aux)
return -ENOMEM;
@@ -17936,7 +17937,7 @@ static int check_btf_line(struct bpf_verifier_env *env,
* pass in a smaller bpf_line_info object.
*/
linfo = kvcalloc(nr_linfo, sizeof(struct bpf_line_info),
- GFP_KERNEL | __GFP_NOWARN);
+ GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
if (!linfo)
return -ENOMEM;
@@ -19283,7 +19284,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
* When looping the sl->state.branches will be > 0 and this state
* will not be considered for equivalence until branches == 0.
*/
- new_sl = kzalloc(sizeof(struct bpf_verifier_state_list), GFP_KERNEL);
+ new_sl = kzalloc(sizeof(struct bpf_verifier_state_list), GFP_KERNEL_ACCOUNT);
if (!new_sl)
return -ENOMEM;
env->total_states++;
@@ -22698,13 +22699,13 @@ static int do_check_common(struct bpf_verifier_env *env, int subprog)
env->prev_linfo = NULL;
env->pass_cnt++;
- state = kzalloc(sizeof(struct bpf_verifier_state), GFP_KERNEL);
+ state = kzalloc(sizeof(struct bpf_verifier_state), GFP_KERNEL_ACCOUNT);
if (!state)
return -ENOMEM;
state->curframe = 0;
state->speculative = false;
state->branches = 1;
- state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL);
+ state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL_ACCOUNT);
if (!state->frame[0]) {
kfree(state);
return -ENOMEM;
@@ -22930,7 +22931,7 @@ static void print_verification_stats(struct bpf_verifier_env *env)
int bpf_prog_ctx_arg_info_init(struct bpf_prog *prog,
const struct bpf_ctx_arg_aux *info, u32 cnt)
{
- prog->aux->ctx_arg_info = kmemdup_array(info, cnt, sizeof(*info), GFP_KERNEL);
+ prog->aux->ctx_arg_info = kmemdup_array(info, cnt, sizeof(*info), GFP_KERNEL_ACCOUNT);
prog->aux->ctx_arg_info_size = cnt;
return prog->aux->ctx_arg_info ? 0 : -ENOMEM;
@@ -23874,7 +23875,7 @@ static int compute_live_registers(struct bpf_verifier_env *env)
* - repeat the computation while {in,out} fields changes for
* any instruction.
*/
- state = kvcalloc(insn_cnt, sizeof(*state), GFP_KERNEL);
+ state = kvcalloc(insn_cnt, sizeof(*state), GFP_KERNEL_ACCOUNT);
if (!state) {
err = -ENOMEM;
goto out;
@@ -23948,7 +23949,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
/* 'struct bpf_verifier_env' can be global, but since it's not small,
* allocate/free it every time bpf_check() is called
*/
- env = kvzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL);
+ env = kvzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL_ACCOUNT);
if (!env)
return -ENOMEM;
@@ -24138,7 +24139,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
/* if program passed verifier, update used_maps in bpf_prog_info */
env->prog->aux->used_maps = kmalloc_array(env->used_map_cnt,
sizeof(env->used_maps[0]),
- GFP_KERNEL);
+ GFP_KERNEL_ACCOUNT);
if (!env->prog->aux->used_maps) {
ret = -ENOMEM;
@@ -24153,7 +24154,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr, __u3
/* if program passed verifier, update used_btfs in bpf_prog_aux */
env->prog->aux->used_btfs = kmalloc_array(env->used_btf_cnt,
sizeof(env->used_btfs[0]),
- GFP_KERNEL);
+ GFP_KERNEL_ACCOUNT);
if (!env->prog->aux->used_btfs) {
ret = -ENOMEM;
goto err_release_maps;
--
2.48.1
* [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-05 23:06 [PATCH bpf-next v1 0/2] veristat: memory accounting for bpf programs Eduard Zingerman
2025-06-05 23:06 ` [PATCH bpf-next v1 1/2] bpf: include verifier memory allocations in memcg statistics Eduard Zingerman
@ 2025-06-05 23:06 ` Eduard Zingerman
2025-06-06 1:03 ` Eduard Zingerman
2025-06-06 16:53 ` Mykyta Yatsenko
1 sibling, 2 replies; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-05 23:06 UTC (permalink / raw)
To: bpf, ast
Cc: andrii, daniel, martin.lau, kernel-team, yonghong.song,
Eduard Zingerman
This commit adds a new mem_peak / "Peak memory" field to the set of
gathered statistics. The field is intended as an estimate of peak
verifier memory consumption for processing a given program.
Mechanically, the stat is collected as follows:
- At the beginning of handle_verif_mode() a new cgroup namespace is
  created, cgroup fs is mounted in this namespace, and the memory
  controller is enabled for the root cgroup.
- At each program load:
  - bpf_object__load() is split into bpf_object__prepare() and
    bpf_object__load() to avoid accounting for memory allocated for
    maps;
  - before bpf_object__load() a new cgroup is created and the veristat
    process enters this cgroup; "memory.peak" of the new cgroup is
    stashed;
  - after bpf_object__load() the difference between the current
    "memory.peak" and the stashed "memory.peak" is used as the metric;
    veristat exits the cgroup and the cgroup is discarded.
If any of the above steps fails, veristat proceeds without collecting
mem_peak information for the program.
The change has an impact on veristat running time, e.g. for all
test_progs object files there is an increase from 82s to 102s.
I take the correlation between the "Peak states" and "Peak memory"
fields as a sanity check for the gathered statistics; e.g., here is a
sample of data for sched_ext programs:
File Program Peak states Peak memory (KiB)
--------- -------------------- ----------- -----------------
bpf.bpf.o lavd_select_cpu 1311 26256
bpf.bpf.o lavd_enqueue 1140 22720
bpf.bpf.o layered_enqueue 777 11504
bpf.bpf.o layered_dispatch 578 7976
bpf.bpf.o lavd_dispatch 634 6204
bpf.bpf.o rusty_init 343 5352
bpf.bpf.o lavd_init 361 5092
...
bpf.bpf.o rusty_exit_task 36 256
bpf.bpf.o rusty_running 19 256
bpf.bpf.o bpfland_dispatch 3 0
bpf.bpf.o bpfland_enable 1 0
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
---
tools/testing/selftests/bpf/veristat.c | 249 ++++++++++++++++++++++++-
1 file changed, 242 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/bpf/veristat.c b/tools/testing/selftests/bpf/veristat.c
index b2bb20b00952..e68f5dda5278 100644
--- a/tools/testing/selftests/bpf/veristat.c
+++ b/tools/testing/selftests/bpf/veristat.c
@@ -12,6 +12,7 @@
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
+#include <sys/mount.h>
#include <sys/time.h>
#include <sys/sysinfo.h>
#include <sys/stat.h>
@@ -49,6 +50,7 @@ enum stat_id {
STACK,
PROG_TYPE,
ATTACH_TYPE,
+ MEMORY_PEAK,
FILE_NAME,
PROG_NAME,
@@ -208,6 +210,9 @@ static struct env {
int top_src_lines;
struct var_preset *presets;
int npresets;
+ char cgroup_fs_mount[PATH_MAX + 1];
+ char stat_cgroup[PATH_MAX + 1];
+ int memory_peak_fd;
} env;
static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
@@ -219,6 +224,22 @@ static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va
return vfprintf(stderr, format, args);
}
+#define log_errno(fmt, ...) log_errno_aux(__FILE__, __LINE__, fmt, ##__VA_ARGS__)
+
+__printf(3, 4)
+static int log_errno_aux(const char *file, int line, const char *fmt, ...)
+{
+ int err = -errno;
+ va_list ap;
+
+ va_start(ap, fmt);
+ fprintf(stderr, "%s:%d: ", file, line);
+ vfprintf(stderr, fmt, ap);
+ fprintf(stderr, " failed with error '%s'\n", strerror(errno));
+ va_end(ap);
+ return err;
+}
+
#ifndef VERISTAT_VERSION
#define VERISTAT_VERSION "<kernel>"
#endif
@@ -734,13 +755,13 @@ static int append_file_from_file(const char *path)
}
static const struct stat_specs default_csv_output_spec = {
- .spec_cnt = 14,
+ .spec_cnt = 15,
.ids = {
FILE_NAME, PROG_NAME, VERDICT, DURATION,
TOTAL_INSNS, TOTAL_STATES, PEAK_STATES,
MAX_STATES_PER_INSN, MARK_READ_MAX_LEN,
SIZE, JITED_SIZE, PROG_TYPE, ATTACH_TYPE,
- STACK,
+ STACK, MEMORY_PEAK,
},
};
@@ -781,6 +802,7 @@ static struct stat_def {
[STACK] = {"Stack depth", {"stack_depth", "stack"}, },
[PROG_TYPE] = { "Program type", {"prog_type"}, },
[ATTACH_TYPE] = { "Attach type", {"attach_type", }, },
+ [MEMORY_PEAK] = { "Peak memory (KiB)", {"mem_peak", }, },
};
static bool parse_stat_id_var(const char *name, size_t len, int *id,
@@ -1278,16 +1300,213 @@ static int max_verifier_log_size(void)
return log_size;
}
+__printf(2, 3)
+static int write_one_line(const char *file, const char *fmt, ...)
+{
+ int err, saved_errno;
+ va_list ap;
+ FILE *f;
+
+ f = fopen(file, "w");
+ if (!f)
+ return -1;
+
+ va_start(ap, fmt);
+ errno = 0;
+ err = vfprintf(f, fmt, ap);
+ saved_errno = errno;
+ va_end(ap);
+ fclose(f);
+ errno = saved_errno;
+ return err < 0 ? -1 : 0;
+}
+
+/*
+ * This works around GCC warning about snprintf truncating strings like:
+ *
+ * char a[PATH_MAX], b[PATH_MAX];
+ * snprintf(a, "%s/foo", b); // triggers -Wformat-truncation
+ */
+__printf(3, 4)
+static int snprintf_trunc(char *str, volatile size_t size, const char *fmt, ...)
+{
+ va_list ap;
+ int ret;
+
+ va_start(ap, fmt);
+ ret = vsnprintf(str, size, fmt, ap);
+ va_end(ap);
+ return ret;
+}
+
+static void destroy_stat_cgroup(void);
+static void umount_cgroupfs(void);
+
+/*
+ * Enters new cgroup namespace and mounts cgroupfs at /tmp/veristat-cgroup-mount-XXXXXX,
+ * enables "memory" controller for the root cgroup.
+ */
+static int mount_cgroupfs(void)
+{
+ char buf[PATH_MAX + 1];
+ int err;
+
+ env.memory_peak_fd = -1;
+
+ err = unshare(CLONE_NEWCGROUP);
+ if (err < 0) {
+ err = log_errno("unshare(CLONE_NEWCGROUP)");
+ goto err_out;
+ }
+
+ snprintf_trunc(buf, sizeof(buf), "%s/veristat-cgroup-mount-XXXXXX", P_tmpdir);
+ if (mkdtemp(buf) == NULL) {
+ err = log_errno("mkdtemp(%s)", buf);
+ goto err_out;
+ }
+ strcpy(env.cgroup_fs_mount, buf);
+
+ err = mount("none", env.cgroup_fs_mount, "cgroup2", 0, NULL);
+ if (err < 0) {
+ err = log_errno("mount none %s -t cgroup2", env.cgroup_fs_mount);
+ goto err_out;
+ }
+
+ snprintf_trunc(buf, sizeof(buf), "%s/cgroup.subtree_control", env.cgroup_fs_mount);
+ err = write_one_line(buf, "+memory\n");
+ if (err < 0) {
+ err = log_errno("echo '+memory' > %s", buf);
+ goto err_out;
+ }
+
+ return 0;
+
+err_out:
+ umount_cgroupfs();
+ return err;
+}
+
+static void umount_cgroupfs(void)
+{
+ int err;
+
+ if (!env.cgroup_fs_mount[0])
+ return;
+
+ err = umount(env.cgroup_fs_mount);
+ if (err < 0)
+ log_errno("umount %s", env.cgroup_fs_mount);
+
+ err = rmdir(env.cgroup_fs_mount);
+ if (err < 0)
+ log_errno("rmdir %s", env.cgroup_fs_mount);
+
+ env.cgroup_fs_mount[0] = 0;
+}
+
+/*
+ * Creates a cgroup at /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>,
+ * moves current process to this cgroup.
+ */
+static int create_stat_cgroup(void)
+{
+ char buf[PATH_MAX + 1];
+ int err;
+
+ if (!env.cgroup_fs_mount[0])
+ return -1;
+
+ env.memory_peak_fd = -1;
+
+ snprintf_trunc(buf, sizeof(buf), "%s/accounting-%d", env.cgroup_fs_mount, getpid());
+ err = mkdir(buf, 0777);
+ if (err < 0) {
+ err = log_errno("mkdir(%s)", buf);
+ goto err_out;
+ }
+ strcpy(env.stat_cgroup, buf);
+
+ snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.stat_cgroup);
+ err = write_one_line(buf, "%d\n", getpid());
+ if (err < 0) {
+ err = log_errno("echo %d > %s", getpid(), buf);
+ goto err_out;
+ }
+
+ snprintf_trunc(buf, sizeof(buf), "%s/memory.peak", env.stat_cgroup);
+ env.memory_peak_fd = open(buf, O_RDWR | O_APPEND);
+ if (env.memory_peak_fd < 0) {
+ err = log_errno("open(%s)", buf);
+ goto err_out;
+ }
+
+ return 0;
+
+err_out:
+ destroy_stat_cgroup();
+ return err;
+}
+
+static void destroy_stat_cgroup(void)
+{
+ char buf[PATH_MAX];
+ int err;
+
+ close(env.memory_peak_fd);
+
+ if (env.cgroup_fs_mount[0]) {
+ snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.cgroup_fs_mount);
+ err = write_one_line(buf, "%d\n", getpid());
+ if (err < 0)
+ log_errno("echo %d > %s", getpid(), buf);
+ }
+
+ if (env.stat_cgroup[0]) {
+ err = rmdir(env.stat_cgroup);
+ if (err < 0)
+ log_errno("rmdir %s", env.stat_cgroup);
+ }
+
+ env.stat_cgroup[0] = 0;
+}
+
+/* Current value of /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>/memory.peak */
+static long cgroup_memory_peak(void)
+{
+ long err, memory_peak;
+ char buf[32];
+
+ if (env.memory_peak_fd < 0)
+ return -1;
+
+ err = pread(env.memory_peak_fd, buf, sizeof(buf) - 1, 0);
+ if (err <= 0) {
+ log_errno("read(%s/memory.peak)", env.stat_cgroup);
+ return -1;
+ }
+
+ buf[err] = 0;
+ errno = 0;
+ memory_peak = strtoll(buf, NULL, 10);
+ if (errno) {
+ log_errno("unrecognized %s/memory.peak format: %s", env.stat_cgroup, buf);
+ return -1;
+ }
+
+ return memory_peak;
+}
+
static int process_prog(const char *filename, struct bpf_object *obj, struct bpf_program *prog)
{
const char *base_filename = basename(strdupa(filename));
const char *prog_name = bpf_program__name(prog);
+ long mem_peak_a, mem_peak_b, mem_peak = -1;
char *buf;
int buf_sz, log_level;
struct verif_stats *stats;
struct bpf_prog_info info;
__u32 info_len = sizeof(info);
- int err = 0;
+ int err = 0, cgroup_err;
void *tmp;
int fd;
@@ -1332,7 +1551,16 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
if (env.force_reg_invariants)
bpf_program__set_flags(prog, bpf_program__flags(prog) | BPF_F_TEST_REG_INVARIANTS);
- err = bpf_object__load(obj);
+ err = bpf_object__prepare(obj);
+ if (!err) {
+ cgroup_err = create_stat_cgroup();
+ mem_peak_a = cgroup_memory_peak();
+ err = bpf_object__load(obj);
+ mem_peak_b = cgroup_memory_peak();
+ destroy_stat_cgroup();
+ if (!cgroup_err && mem_peak_a >= 0 && mem_peak_b >= 0)
+ mem_peak = mem_peak_b - mem_peak_a;
+ }
env.progs_processed++;
stats->file_name = strdup(base_filename);
@@ -1341,6 +1569,7 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
stats->stats[SIZE] = bpf_program__insn_cnt(prog);
stats->stats[PROG_TYPE] = bpf_program__type(prog);
stats->stats[ATTACH_TYPE] = bpf_program__expected_attach_type(prog);
+ stats->stats[MEMORY_PEAK] = mem_peak < 0 ? -1 : mem_peak / 1024;
memset(&info, 0, info_len);
fd = bpf_program__fd(prog);
@@ -1824,6 +2053,7 @@ static int cmp_stat(const struct verif_stats *s1, const struct verif_stats *s2,
case TOTAL_STATES:
case PEAK_STATES:
case MAX_STATES_PER_INSN:
+ case MEMORY_PEAK:
case MARK_READ_MAX_LEN: {
long v1 = s1->stats[id];
long v2 = s2->stats[id];
@@ -2053,6 +2283,7 @@ static void prepare_value(const struct verif_stats *s, enum stat_id id,
case STACK:
case SIZE:
case JITED_SIZE:
+ case MEMORY_PEAK:
*val = s ? s->stats[id] : 0;
break;
default:
@@ -2139,6 +2370,7 @@ static int parse_stat_value(const char *str, enum stat_id id, struct verif_stats
case MARK_READ_MAX_LEN:
case SIZE:
case JITED_SIZE:
+ case MEMORY_PEAK:
case STACK: {
long val;
int err, n;
@@ -2776,7 +3008,7 @@ static void output_prog_stats(void)
static int handle_verif_mode(void)
{
- int i, err;
+ int i, err = 0;
if (env.filename_cnt == 0) {
fprintf(stderr, "Please provide path to BPF object file!\n\n");
@@ -2784,11 +3016,12 @@ static int handle_verif_mode(void)
return -EINVAL;
}
+ mount_cgroupfs();
for (i = 0; i < env.filename_cnt; i++) {
err = process_obj(env.filenames[i]);
if (err) {
fprintf(stderr, "Failed to process '%s': %d\n", env.filenames[i], err);
- return err;
+ goto out;
}
}
@@ -2796,7 +3029,9 @@ static int handle_verif_mode(void)
output_prog_stats();
- return 0;
+out:
+ umount_cgroupfs();
+ return err;
}
static int handle_replay_mode(void)
--
2.48.1
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-05 23:06 ` [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs Eduard Zingerman
@ 2025-06-06 1:03 ` Eduard Zingerman
2025-06-06 2:17 ` Alexei Starovoitov
2025-06-06 16:53 ` Mykyta Yatsenko
1 sibling, 1 reply; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-06 1:03 UTC (permalink / raw)
To: bpf, ast; +Cc: andrii, daniel, martin.lau, kernel-team, yonghong.song
On Thu, 2025-06-05 at 16:06 -0700, Eduard Zingerman wrote:
[...]
> +/*
> + * Enters new cgroup namespace and mounts cgroupfs at /tmp/veristat-cgroup-mount-XXXXXX,
> + * enables "memory" controller for the root cgroup.
> + */
> +static int mount_cgroupfs(void)
> +{
> + char buf[PATH_MAX + 1];
> + int err;
> +
> + env.memory_peak_fd = -1;
> +
> + err = unshare(CLONE_NEWCGROUP);
> + if (err < 0) {
> + err = log_errno("unshare(CLONE_NEWCGROUP)");
> + goto err_out;
> + }
The `unshare` call is useless. I thought it would grant me a new
hierarchy with separate cgroup.subtree_control in the root.
But that's not how things work; the hierarchy is shared across namespaces.
I'll drop this call and just complain if "memory" controller is not enabled.
The "mount" part can remain, I use it to avoid searching for cgroupfs
mount point. Alternatively I can inspect /proc/self/mountinfo and
complain if cgroupfs is not found. Please let me know which way is
preferred.
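For reference, the mountinfo-scanning variant would look roughly like
this (hedged sketch, find_cgroup2_mount() is a made-up name):

    #include <stdio.h>
    #include <string.h>

    /* scan /proc/self/mountinfo for the first cgroup2 mount point */
    static int find_cgroup2_mount(char *path, size_t size)
    {
            FILE *f = fopen("/proc/self/mountinfo", "r");
            char line[4096], mnt[4096], fstype[64], *dash;
            int found = -1;

            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f)) {
                    /* field 5 is the mount point, fs type follows " - " */
                    if (sscanf(line, "%*d %*d %*d:%*d %*s %4095s", mnt) != 1)
                            continue;
                    dash = strstr(line, " - ");
                    if (!dash || sscanf(dash + 3, "%63s", fstype) != 1)
                            continue;
                    if (strcmp(fstype, "cgroup2") == 0) {
                            snprintf(path, size, "%s", mnt);
                            found = 0;
                            break;
                    }
            }
            fclose(f);
            return found;
    }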
[...]
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 1:03 ` Eduard Zingerman
@ 2025-06-06 2:17 ` Alexei Starovoitov
2025-06-06 2:33 ` Eduard Zingerman
0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2025-06-06 2:17 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Kernel Team, Yonghong Song
On Thu, Jun 5, 2025 at 6:04 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2025-06-05 at 16:06 -0700, Eduard Zingerman wrote:
>
> [...]
>
> > +/*
> > + * Enters new cgroup namespace and mounts cgroupfs at /tmp/veristat-cgroup-mount-XXXXXX,
> > + * enables "memory" controller for the root cgroup.
> > + */
> > +static int mount_cgroupfs(void)
> > +{
> > + char buf[PATH_MAX + 1];
> > + int err;
> > +
> > + env.memory_peak_fd = -1;
> > +
> > + err = unshare(CLONE_NEWCGROUP);
> > + if (err < 0) {
> > + err = log_errno("unshare(CLONE_NEWCGROUP)");
> > + goto err_out;
> > + }
>
> The `unshare` call is useless. I thought it would grant me a new
> hierarchy with separate cgroup.subtree_control in the root.
> But that's not how things work; the hierarchy is shared across namespaces.
> I'll drop this call and just complain if "memory" controller is not enabled.
>
> The "mount" part can remain, I use it to avoid searching for cgroupfs
> mount point. Alternatively I can inspect /proc/self/mountinfo and
> complain if cgroupfs is not found. Please let me know which way is
> preferred.
I would keep unshare and mount,
and if possible share the code with setup_cgroup_environment()
from cgroup_helpers.c
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 2:17 ` Alexei Starovoitov
@ 2025-06-06 2:33 ` Eduard Zingerman
2025-06-06 2:35 ` Alexei Starovoitov
0 siblings, 1 reply; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-06 2:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Kernel Team, Yonghong Song
On Thu, 2025-06-05 at 19:17 -0700, Alexei Starovoitov wrote:
> On Thu, Jun 5, 2025 at 6:04 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Thu, 2025-06-05 at 16:06 -0700, Eduard Zingerman wrote:
> >
> > [...]
> >
> > > +/*
> > > + * Enters new cgroup namespace and mounts cgroupfs at /tmp/veristat-cgroup-mount-XXXXXX,
> > > + * enables "memory" controller for the root cgroup.
> > > + */
> > > +static int mount_cgroupfs(void)
> > > +{
> > > + char buf[PATH_MAX + 1];
> > > + int err;
> > > +
> > > + env.memory_peak_fd = -1;
> > > +
> > > + err = unshare(CLONE_NEWCGROUP);
> > > + if (err < 0) {
> > > + err = log_errno("unshare(CLONE_NEWCGROUP)");
> > > + goto err_out;
> > > + }
> >
> > The `unshare` call is useless. I thought it would grant me a new
> > hierarchy with separate cgroup.subtree_control in the root.
> > But that's not how things work; the hierarchy is shared across namespaces.
> > I'll drop this call and just complain if "memory" controller is not enabled.
> >
> > The "mount" part can remain, I use it to avoid searching for cgroupfs
> > mount point. Alternatively I can inspect /proc/self/mountinfo and
> > complain if cgroupfs is not found. Please let me know which way is
> > preferred.
>
> I would keep unshare and mount,
But what would be the purpose of the unshare?
I thought that it provides a new, isolated, independent
cgroup.subtree_control, but that is not the case.
Apart from that, the feature gains nothing from entering a new namespace.
> and if possible share the code with setup_cgroup_environment()
> from cgroup_helpers.c
setup_cgroup_environment() does the following:
a. creates a directory for cgroupfs mount
b. creates a new mount namespace
c. mounts a new empty root
d. mounts cgroupfs inside new root
e. creates a cgroup
f. enables controllers
Of these, only (a) and (d) are needed for veristat (and (f) is delegated to init).
Also my current version does unshare for cgroup namespace, while
setup_cgroup_environment() does not do that.
Things that might be reusable:
- get_root_cgroup
- remove_cgroup
- join_root_cgroup
- join_cgroup
These rely on CGROUP_MOUNT_PATH and CGROUP_WORK_DIR being constants;
I'll need to modify these helpers to parameterize this.
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 2:33 ` Eduard Zingerman
@ 2025-06-06 2:35 ` Alexei Starovoitov
2025-06-06 2:46 ` Eduard Zingerman
0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2025-06-06 2:35 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Kernel Team, Yonghong Song
On Thu, Jun 5, 2025 at 7:33 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2025-06-05 at 19:17 -0700, Alexei Starovoitov wrote:
> > On Thu, Jun 5, 2025 at 6:04 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > >
> > > On Thu, 2025-06-05 at 16:06 -0700, Eduard Zingerman wrote:
> > >
> > > [...]
> > >
> > > > +/*
> > > > + * Enters new cgroup namespace and mounts cgroupfs at /tmp/veristat-cgroup-mount-XXXXXX,
> > > > + * enables "memory" controller for the root cgroup.
> > > > + */
> > > > +static int mount_cgroupfs(void)
> > > > +{
> > > > + char buf[PATH_MAX + 1];
> > > > + int err;
> > > > +
> > > > + env.memory_peak_fd = -1;
> > > > +
> > > > + err = unshare(CLONE_NEWCGROUP);
> > > > + if (err < 0) {
> > > > + err = log_errno("unshare(CLONE_NEWCGROUP)");
> > > > + goto err_out;
> > > > + }
> > >
> > > The `unshare` call is useless. I thought it would grant me a new
> > > hierarchy with separate cgroup.subtree_control in the root.
> > > But that's not how things work; the hierarchy is shared across namespaces.
> > > I'll drop this call and just complain if "memory" controller is not enabled.
> > >
> > > The "mount" part can remain, I use it to avoid searching for cgroupfs
> > > mount point. Alternatively I can inspect /proc/self/mountinfo and
> > > complain if cgroupfs is not found. Please let me know which way is
> > > preferred.
> >
> > I would keep unshare and mount,
>
> But what would be the purpose of the unshare?
> I thought that it provides a new, isolated, independent
> cgroup.subtree_control, but that is not the case.
> Apart from that, the feature gains nothing from entering a new namespace.
>
> > and if possible share the code with setup_cgroup_environment()
> > from cgroup_helpers.c
>
> setup_cgroup_environment() does the following:
> a. creates a directory for cgroupfs mount
> b. creates a new mount namespace
> c. mounts a new empty root
> d. mounts cgroupfs inside new root
> e. creates a cgroup
> f. enables controllers
>
> Of these, only (a) and (d) are needed for veristat (and (f) is delegated to init).
> Also my current version does unshare for cgroup namespace, while
> setup_cgroup_environment() does not do that.
>
> Things that might be reusable:
> - get_root_cgroup
> - remove_cgroup
> - join_root_cgroup
> - join_cgroup
>
> These rely on CGROUP_MOUNT_PATH and CGROUP_WORK_DIR being constants;
> I'll need to modify these helpers to parameterize this.
ok. ignore that suggestion then.
do what is simplest.
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 2:35 ` Alexei Starovoitov
@ 2025-06-06 2:46 ` Eduard Zingerman
0 siblings, 0 replies; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-06 2:46 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Kernel Team, Yonghong Song
On Thu, 2025-06-05 at 19:35 -0700, Alexei Starovoitov wrote:
[...]
> > Things that might be reusable:
> > - get_root_cgroup
> > - remove_cgroup
> > - join_root_cgroup
> > - join_cgroup
> >
> > These rely on CGROUP_MOUNT_PATH and CGROUP_WORK_DIR being constants,
> > I'll need to modify these helpers to parameterize this.
>
> ok. ignore that suggestion then.
> do what is simplest.
I think I'll add a struct to denote the cgroup context and adapt the
above utility functions. Should be useful.
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-05 23:06 ` [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs Eduard Zingerman
2025-06-06 1:03 ` Eduard Zingerman
@ 2025-06-06 16:53 ` Mykyta Yatsenko
2025-06-06 17:03 ` Eduard Zingerman
1 sibling, 1 reply; 14+ messages in thread
From: Mykyta Yatsenko @ 2025-06-06 16:53 UTC (permalink / raw)
To: Eduard Zingerman, bpf, ast
Cc: andrii, daniel, martin.lau, kernel-team, yonghong.song
On 6/6/25 00:06, Eduard Zingerman wrote:
> This commit adds a new mem_peak / "Peak memory" field to the set of
> gathered statistics. The field is intended as an estimate of peak
> verifier memory consumption for processing a given program.
>
> Mechanically, the stat is collected as follows:
> - At the beginning of handle_verif_mode() a new cgroup namespace is
>   created, cgroup fs is mounted in this namespace, and the memory
>   controller is enabled for the root cgroup.
> - At each program load:
>   - bpf_object__load() is split into bpf_object__prepare() and
>     bpf_object__load() to avoid accounting for memory allocated for
>     maps;
>   - before bpf_object__load() a new cgroup is created and the veristat
>     process enters this cgroup; "memory.peak" of the new cgroup is
>     stashed;
>   - after bpf_object__load() the difference between the current
>     "memory.peak" and the stashed "memory.peak" is used as the metric;
>     veristat exits the cgroup and the cgroup is discarded.
>
> If any of the above steps fails, veristat proceeds without collecting
> mem_peak information for the program.
>
> The change has an impact on veristat running time, e.g. for all
> test_progs object files there is an increase from 82s to 102s.
>
> I take the correlation between the "Peak states" and "Peak memory"
> fields as a sanity check for the gathered statistics; e.g., here is a
> sample of data for sched_ext programs:
>
> File Program Peak states Peak memory (KiB)
> --------- -------------------- ----------- -----------------
> bpf.bpf.o lavd_select_cpu 1311 26256
> bpf.bpf.o lavd_enqueue 1140 22720
> bpf.bpf.o layered_enqueue 777 11504
> bpf.bpf.o layered_dispatch 578 7976
> bpf.bpf.o lavd_dispatch 634 6204
> bpf.bpf.o rusty_init 343 5352
> bpf.bpf.o lavd_init 361 5092
> ...
> bpf.bpf.o rusty_exit_task 36 256
> bpf.bpf.o rusty_running 19 256
> bpf.bpf.o bpfland_dispatch 3 0
> bpf.bpf.o bpfland_enable 1 0
>
> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
> ---
> tools/testing/selftests/bpf/veristat.c | 249 ++++++++++++++++++++++++-
> 1 file changed, 242 insertions(+), 7 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/veristat.c b/tools/testing/selftests/bpf/veristat.c
> index b2bb20b00952..e68f5dda5278 100644
> --- a/tools/testing/selftests/bpf/veristat.c
> +++ b/tools/testing/selftests/bpf/veristat.c
> @@ -12,6 +12,7 @@
> #include <signal.h>
> #include <fcntl.h>
> #include <unistd.h>
> +#include <sys/mount.h>
> #include <sys/time.h>
> #include <sys/sysinfo.h>
> #include <sys/stat.h>
> @@ -49,6 +50,7 @@ enum stat_id {
> STACK,
> PROG_TYPE,
> ATTACH_TYPE,
> + MEMORY_PEAK,
>
> FILE_NAME,
> PROG_NAME,
> @@ -208,6 +210,9 @@ static struct env {
> int top_src_lines;
> struct var_preset *presets;
> int npresets;
> + char cgroup_fs_mount[PATH_MAX + 1];
> + char stat_cgroup[PATH_MAX + 1];
> + int memory_peak_fd;
> } env;
>
> static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> @@ -219,6 +224,22 @@ static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va
> return vfprintf(stderr, format, args);
> }
>
> +#define log_errno(fmt, ...) log_errno_aux(__FILE__, __LINE__, fmt, ##__VA_ARGS__)
> +
> +__printf(3, 4)
> +static int log_errno_aux(const char *file, int line, const char *fmt, ...)
> +{
> + int err = -errno;
> + va_list ap;
> +
> + va_start(ap, fmt);
> + fprintf(stderr, "%s:%d: ", file, line);
> + vfprintf(stderr, fmt, ap);
> + fprintf(stderr, " failed with error '%s'\n", strerror(errno));
> + va_end(ap);
> + return err;
> +}
> +
> #ifndef VERISTAT_VERSION
> #define VERISTAT_VERSION "<kernel>"
> #endif
> @@ -734,13 +755,13 @@ static int append_file_from_file(const char *path)
> }
>
> static const struct stat_specs default_csv_output_spec = {
> - .spec_cnt = 14,
> + .spec_cnt = 15,
> .ids = {
> FILE_NAME, PROG_NAME, VERDICT, DURATION,
> TOTAL_INSNS, TOTAL_STATES, PEAK_STATES,
> MAX_STATES_PER_INSN, MARK_READ_MAX_LEN,
> SIZE, JITED_SIZE, PROG_TYPE, ATTACH_TYPE,
> - STACK,
> + STACK, MEMORY_PEAK,
> },
> };
>
> @@ -781,6 +802,7 @@ static struct stat_def {
> [STACK] = {"Stack depth", {"stack_depth", "stack"}, },
> [PROG_TYPE] = { "Program type", {"prog_type"}, },
> [ATTACH_TYPE] = { "Attach type", {"attach_type", }, },
> + [MEMORY_PEAK] = { "Peak memory (KiB)", {"mem_peak", }, },
> };
>
> static bool parse_stat_id_var(const char *name, size_t len, int *id,
> @@ -1278,16 +1300,213 @@ static int max_verifier_log_size(void)
> return log_size;
> }
>
> +__printf(2, 3)
> +static int write_one_line(const char *file, const char *fmt, ...)
> +{
> + int err, saved_errno;
> + va_list ap;
> + FILE *f;
> +
> + f = fopen(file, "w");
> + if (!f)
> + return -1;
> +
> + va_start(ap, fmt);
> + errno = 0;
> + err = vfprintf(f, fmt, ap);
> + saved_errno = errno;
> + va_end(ap);
> + fclose(f);
> + errno = saved_errno;
> + return err < 0 ? -1 : 0;
> +}
> +
> +/*
> + * This works around GCC warning about snprintf truncating strings like:
> + *
> + * char a[PATH_MAX], b[PATH_MAX];
> + * snprintf(a, "%s/foo", b); // triggers -Wformat-truncation
> + */
> +__printf(3, 4)
> +static int snprintf_trunc(char *str, volatile size_t size, const char *fmt, ...)
> +{
> + va_list ap;
> + int ret;
> +
> + va_start(ap, fmt);
> + ret = vsnprintf(str, size, fmt, ap);
> + va_end(ap);
> + return ret;
> +}
> +
> +static void destroy_stat_cgroup(void);
> +static void umount_cgroupfs(void);
> +
> +/*
> + * Enters new cgroup namespace and mounts cgroupfs at /tmp/veristat-cgroup-mount-XXXXXX,
> + * enables "memory" controller for the root cgroup.
> + */
> +static int mount_cgroupfs(void)
> +{
> + char buf[PATH_MAX + 1];
> + int err;
> +
> + env.memory_peak_fd = -1;
> +
> + err = unshare(CLONE_NEWCGROUP);
> + if (err < 0) {
> + err = log_errno("unshare(CLONE_NEWCGROUP)");
> + goto err_out;
> + }
> +
> + snprintf_trunc(buf, sizeof(buf), "%s/veristat-cgroup-mount-XXXXXX", P_tmpdir);
> + if (mkdtemp(buf) == NULL) {
> + err = log_errno("mkdtemp(%s)", buf);
> + goto err_out;
> + }
> + strcpy(env.cgroup_fs_mount, buf);
> +
> + err = mount("none", env.cgroup_fs_mount, "cgroup2", 0, NULL);
> + if (err < 0) {
> + err = log_errno("mount none %s -t cgroup2", env.cgroup_fs_mount);
> + goto err_out;
> + }
> +
> + snprintf_trunc(buf, sizeof(buf), "%s/cgroup.subtree_control", env.cgroup_fs_mount);
> + err = write_one_line(buf, "+memory\n");
> + if (err < 0) {
> + err = log_errno("echo '+memory' > %s", buf);
> + goto err_out;
> + }
> +
> + return 0;
> +
> +err_out:
> + umount_cgroupfs();
> + return err;
> +}
> +
> +static void umount_cgroupfs(void)
> +{
> + int err;
> +
> + if (!env.cgroup_fs_mount[0])
> + return;
> +
> + err = umount(env.cgroup_fs_mount);
> + if (err < 0)
> + log_errno("umount %s", env.cgroup_fs_mount);
> +
> + err = rmdir(env.cgroup_fs_mount);
> + if (err < 0)
> + log_errno("rmdir %s", env.cgroup_fs_mount);
> +
> + env.cgroup_fs_mount[0] = 0;
> +}
> +
> +/*
> + * Creates a cgroup at /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>,
> + * moves current process to this cgroup.
> + */
> +static int create_stat_cgroup(void)
> +{
> + char buf[PATH_MAX + 1];
> + int err;
> +
> + if (!env.cgroup_fs_mount[0])
> + return -1;
> +
> + env.memory_peak_fd = -1;
> +
> + snprintf_trunc(buf, sizeof(buf), "%s/accounting-%d", env.cgroup_fs_mount, getpid());
> + err = mkdir(buf, 0777);
> + if (err < 0) {
> + err = log_errno("mkdir(%s)", buf);
> + goto err_out;
> + }
> + strcpy(env.stat_cgroup, buf);
> +
> + snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.stat_cgroup);
> + err = write_one_line(buf, "%d\n", getpid());
> + if (err < 0) {
> + err = log_errno("echo %d > %s", getpid(), buf);
> + goto err_out;
> + }
> +
> + snprintf_trunc(buf, sizeof(buf), "%s/memory.peak", env.stat_cgroup);
> + env.memory_peak_fd = open(buf, O_RDWR | O_APPEND);
Why is it necessary to open in RW|APPEND mode? Won't O_RDONLY cut it?
> + if (env.memory_peak_fd < 0) {
> + err = log_errno("open(%s)", buf);
> + goto err_out;
> + }
> +
> + return 0;
> +
> +err_out:
> + destroy_stat_cgroup();
> + return err;
> +}
> +
> +static void destroy_stat_cgroup(void)
> +{
> + char buf[PATH_MAX];
> + int err;
> +
> + close(env.memory_peak_fd);
> +
> + if (env.cgroup_fs_mount[0]) {
> + snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.cgroup_fs_mount);
> + err = write_one_line(buf, "%d\n", getpid());
> + if (err < 0)
> + log_errno("echo %d > %s", getpid(), buf);
> + }
> +
> + if (env.stat_cgroup[0]) {
> + err = rmdir(env.stat_cgroup);
> + if (err < 0)
> + log_errno("rmdir %s", env.stat_cgroup);
> + }
> +
> + env.stat_cgroup[0] = 0;
> +}
> +
> +/* Current value of /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>/memory.peak */
> +static long cgroup_memory_peak(void)
> +{
> + long err, memory_peak;
> + char buf[32];
> +
> + if (env.memory_peak_fd < 0)
> + return -1;
> +
> + err = pread(env.memory_peak_fd, buf, sizeof(buf) - 1, 0);
> + if (err <= 0) {
> + log_errno("read(%s/memory.peak)", env.stat_cgroup);
> + return -1;
> + }
> +
> + buf[err] = 0;
nit: maybe rename err to len here?
> + errno = 0;
> + memory_peak = strtoll(buf, NULL, 10);
> + if (errno) {
> + log_errno("unrecognized %s/memory.peak format: %s", env.stat_cgroup, buf);
> + return -1;
> + }
> +
> + return memory_peak;
> +}
> +
> static int process_prog(const char *filename, struct bpf_object *obj, struct bpf_program *prog)
> {
> const char *base_filename = basename(strdupa(filename));
> const char *prog_name = bpf_program__name(prog);
> + long mem_peak_a, mem_peak_b, mem_peak = -1;
> char *buf;
> int buf_sz, log_level;
> struct verif_stats *stats;
> struct bpf_prog_info info;
> __u32 info_len = sizeof(info);
> - int err = 0;
> + int err = 0, cgroup_err;
> void *tmp;
> int fd;
>
> @@ -1332,7 +1551,16 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
> if (env.force_reg_invariants)
> bpf_program__set_flags(prog, bpf_program__flags(prog) | BPF_F_TEST_REG_INVARIANTS);
>
> - err = bpf_object__load(obj);
> + err = bpf_object__prepare(obj);
> + if (!err) {
> + cgroup_err = create_stat_cgroup();
> + mem_peak_a = cgroup_memory_peak();
> + err = bpf_object__load(obj);
> + mem_peak_b = cgroup_memory_peak();
> + destroy_stat_cgroup();
What if we do create_stat_cgroup/destroy_stat_cgroup in
handle_verif_mode along with mount/umount_cgroupfs?
It may speed things up a little bit here, and moving all cgroup
preparations into a single place seems reasonable.
Will the memory counter behave differently? We are checking the
difference around bpf_object__load(); from a layman's perspective it
should be the same.
> + if (!cgroup_err && mem_peak_a >= 0 && mem_peak_b >= 0)
> + mem_peak = mem_peak_b - mem_peak_a;
> + }
> env.progs_processed++;
>
> stats->file_name = strdup(base_filename);
> @@ -1341,6 +1569,7 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
> stats->stats[SIZE] = bpf_program__insn_cnt(prog);
> stats->stats[PROG_TYPE] = bpf_program__type(prog);
> stats->stats[ATTACH_TYPE] = bpf_program__expected_attach_type(prog);
> + stats->stats[MEMORY_PEAK] = mem_peak < 0 ? -1 : mem_peak / 1024;
>
> memset(&info, 0, info_len);
> fd = bpf_program__fd(prog);
> @@ -1824,6 +2053,7 @@ static int cmp_stat(const struct verif_stats *s1, const struct verif_stats *s2,
> case TOTAL_STATES:
> case PEAK_STATES:
> case MAX_STATES_PER_INSN:
> + case MEMORY_PEAK:
> case MARK_READ_MAX_LEN: {
> long v1 = s1->stats[id];
> long v2 = s2->stats[id];
> @@ -2053,6 +2283,7 @@ static void prepare_value(const struct verif_stats *s, enum stat_id id,
> case STACK:
> case SIZE:
> case JITED_SIZE:
> + case MEMORY_PEAK:
> *val = s ? s->stats[id] : 0;
> break;
> default:
> @@ -2139,6 +2370,7 @@ static int parse_stat_value(const char *str, enum stat_id id, struct verif_stats
> case MARK_READ_MAX_LEN:
> case SIZE:
> case JITED_SIZE:
> + case MEMORY_PEAK:
> case STACK: {
> long val;
> int err, n;
> @@ -2776,7 +3008,7 @@ static void output_prog_stats(void)
>
> static int handle_verif_mode(void)
> {
> - int i, err;
> + int i, err = 0;
>
> if (env.filename_cnt == 0) {
> fprintf(stderr, "Please provide path to BPF object file!\n\n");
> @@ -2784,11 +3016,12 @@ static int handle_verif_mode(void)
> return -EINVAL;
> }
>
> + mount_cgroupfs();
> for (i = 0; i < env.filename_cnt; i++) {
> err = process_obj(env.filenames[i]);
> if (err) {
> fprintf(stderr, "Failed to process '%s': %d\n", env.filenames[i], err);
> - return err;
> + goto out;
> }
> }
>
> @@ -2796,7 +3029,9 @@ static int handle_verif_mode(void)
>
> output_prog_stats();
>
> - return 0;
> +out:
> + umount_cgroupfs();
> + return err;
> }
>
> static int handle_replay_mode(void)
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 16:53 ` Mykyta Yatsenko
@ 2025-06-06 17:03 ` Eduard Zingerman
2025-06-06 18:19 ` Andrii Nakryiko
0 siblings, 1 reply; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-06 17:03 UTC (permalink / raw)
To: Mykyta Yatsenko, bpf, ast
Cc: andrii, daniel, martin.lau, kernel-team, yonghong.song
On Fri, 2025-06-06 at 17:53 +0100, Mykyta Yatsenko wrote:
[...]
> > +/*
> > + * Creates a cgroup at /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>,
> > + * moves current process to this cgroup.
> > + */
> > +static int create_stat_cgroup(void)
> > +{
> > + char buf[PATH_MAX + 1];
> > + int err;
> > +
> > + if (!env.cgroup_fs_mount[0])
> > + return -1;
> > +
> > + env.memory_peak_fd = -1;
> > +
> > + snprintf_trunc(buf, sizeof(buf), "%s/accounting-%d", env.cgroup_fs_mount, getpid());
> > + err = mkdir(buf, 0777);
> > + if (err < 0) {
> > + err = log_errno("mkdir(%s)", buf);
> > + goto err_out;
> > + }
> > + strcpy(env.stat_cgroup, buf);
> > +
> > + snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.stat_cgroup);
> > + err = write_one_line(buf, "%d\n", getpid());
> > + if (err < 0) {
> > + err = log_errno("echo %d > %s", getpid(), buf);
> > + goto err_out;
> > + }
> > +
> > + snprintf_trunc(buf, sizeof(buf), "%s/memory.peak", env.stat_cgroup);
> > + env.memory_peak_fd = open(buf, O_RDWR | O_APPEND);
>
> Why is it necessary to open in RW|APPEND mode? Won't O_RDONLY cut it?
With the current implementation -- not necessary; O_RDONLY should be sufficient.
> > + if (env.memory_peak_fd < 0) {
> > + err = log_errno("open(%s)", buf);
> > + goto err_out;
> > + }
> > +
> > + return 0;
> > +
> > +err_out:
> > + destroy_stat_cgroup();
> > + return err;
> > +}
[...]
> > +/* Current value of /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>/memory.peak */
> > +static long cgroup_memory_peak(void)
> > +{
> > + long err, memory_peak;
> > + char buf[32];
> > +
> > + if (env.memory_peak_fd < 0)
> > + return -1;
> > +
> > + err = pread(env.memory_peak_fd, buf, sizeof(buf) - 1, 0);
> > + if (err <= 0) {
> > + log_errno("read(%s/memory.peak)", env.stat_cgroup);
> > + return -1;
> > + }
> > +
> > + buf[err] = 0;
>
> nit: maybe rename err to len here?
Sure, will rename.
> > + errno = 0;
> > + memory_peak = strtoll(buf, NULL, 10);
> > + if (errno) {
> > + log_errno("unrecognized %s/memory.peak format: %s", env.stat_cgroup, buf);
> > + return -1;
> > + }
> > +
> > + return memory_peak;
> > +}
> > +
[...]
> > @@ -1332,7 +1551,16 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
> > if (env.force_reg_invariants)
> > bpf_program__set_flags(prog, bpf_program__flags(prog) | BPF_F_TEST_REG_INVARIANTS);
> >
> > - err = bpf_object__load(obj);
> > + err = bpf_object__prepare(obj);
> > + if (!err) {
> > + cgroup_err = create_stat_cgroup();
> > + mem_peak_a = cgroup_memory_peak();
> > + err = bpf_object__load(obj);
> > + mem_peak_b = cgroup_memory_peak();
> > + destroy_stat_cgroup();
>
> What if we do create_stat_cgroup/destroy_stat_cgroup in
> handle_verif_mode along with mount/umount_cgroupfs?
> It may speed things up a little bit here, and moving all cgroup
> preparations into a single place seems reasonable.
> Will the memory counter behave differently? We are checking the
> difference around bpf_object__load(); from a layman's perspective it
> should be the same.
The memory.peak file accounts for peak memory consumption, so one
would need to reset this counter between program verifications.
The doc [1] describes such a mechanism:
memory.peak
A read-write single value file which exists on non-root cgroups.
The max memory usage recorded for the cgroup and its descendants
since either the creation of the cgroup or the most recent reset
for that FD.
A write of any non-empty string to this file resets it to the
current memory usage for subsequent reads through the same file
descriptor.
memory.reclaim
A write-only nested-keyed file which exists for all cgroups.
This is a simple interface to trigger memory reclaim in the target
cgroup.
Example:
echo "1G" > memory.reclaim
Please note that the kernel can over or under reclaim from the
target cgroup. If less bytes are reclaimed than the specified
amount, -EAGAIN is returned.
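Concretely, the combination of the two looks roughly like this (hedged
sketch; the fds are assumed to be open on the long-lived cgroup's
memory.reclaim and memory.peak files):

    #include <unistd.h>
    #include <errno.h>

    static int reset_peak_counter(int reclaim_fd, int peak_fd)
    {
            /* reclaim as much as possible; the kernel returns -EAGAIN
             * when less than the requested amount was reclaimed
             */
            if (write(reclaim_fd, "32G", 3) < 0 && errno != EAGAIN)
                    return -1;
            /* any non-empty string resets the watermark for reads
             * through this same fd
             */
            if (write(peak_fd, "reset", 5) < 0)
                    return -1;
            return 0;
    }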
As mentioned in the cover letter, I tried using a combination of the above
while creating the cgroup only once. For reasons I don't understand, this
did not produce stable measurements. E.g., depending on whether a program is
verified in isolation or as part of a batch, results vary from 5MB to 2MB.
[1] https://docs.kernel.org/admin-guide/cgroup-v2.html
> > + if (!cgroup_err && mem_peak_a >= 0 && mem_peak_b >= 0)
> > + mem_peak = mem_peak_b - mem_peak_a;
> > + }
> > env.progs_processed++;
> >
> > stats->file_name = strdup(base_filename);
> > @@ -1341,6 +1569,7 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
[...]
> Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Thank you. I will have to send a v2 that avoids the mount operations
and the controllers file modification.
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 17:03 ` Eduard Zingerman
@ 2025-06-06 18:19 ` Andrii Nakryiko
2025-06-07 8:13 ` Eduard Zingerman
0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2025-06-06 18:19 UTC (permalink / raw)
To: Eduard Zingerman
Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, martin.lau,
kernel-team, yonghong.song
On Fri, Jun 6, 2025 at 10:03 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2025-06-06 at 17:53 +0100, Mykyta Yatsenko wrote:
>
> [...]
>
> > > +/*
> > > + * Creates a cgroup at /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>,
> > > + * moves current process to this cgroup.
> > > + */
> > > +static int create_stat_cgroup(void)
> > > +{
> > > + char buf[PATH_MAX + 1];
> > > + int err;
> > > +
> > > + if (!env.cgroup_fs_mount[0])
> > > + return -1;
> > > +
> > > + env.memory_peak_fd = -1;
> > > +
> > > + snprintf_trunc(buf, sizeof(buf), "%s/accounting-%d", env.cgroup_fs_mount, getpid());
> > > + err = mkdir(buf, 0777);
> > > + if (err < 0) {
> > > + err = log_errno("mkdir(%s)", buf);
> > > + goto err_out;
> > > + }
> > > + strcpy(env.stat_cgroup, buf);
> > > +
> > > + snprintf_trunc(buf, sizeof(buf), "%s/cgroup.procs", env.stat_cgroup);
> > > + err = write_one_line(buf, "%d\n", getpid());
> > > + if (err < 0) {
> > > + err = log_errno("echo %d > %s", getpid(), buf);
> > > + goto err_out;
> > > + }
> > > +
> > > + snprintf_trunc(buf, sizeof(buf), "%s/memory.peak", env.stat_cgroup);
> > > + env.memory_peak_fd = open(buf, O_RDWR | O_APPEND);
> >
> > Why is it necessary to open in RW|APPEND mode? Won't O_RDONLY cut it?
>
> With the current implementation -- not necessary; O_RDONLY should be sufficient.
>
> > > +	if (env.memory_peak_fd < 0) {
> > > +		err = log_errno("open(%s)", buf);
> > > +		goto err_out;
> > > +	}
> > > +
> > > +	return 0;
> > > +
> > > +err_out:
> > > +	destroy_stat_cgroup();
> > > +	return err;
> > > +}
>
> [...]
>
> > > +/* Current value of /tmp/veristat-cgroup-mount-XXXXXX/accounting-<pid>/memory.peak */
> > > +static long cgroup_memory_peak(void)
> > > +{
> > > +	long err, memory_peak;
> > > +	char buf[32];
> > > +
> > > +	if (env.memory_peak_fd < 0)
> > > +		return -1;
> > > +
> > > +	err = pread(env.memory_peak_fd, buf, sizeof(buf) - 1, 0);
> > > +	if (err <= 0) {
> > > +		log_errno("read(%s/memory.peak)", env.stat_cgroup);
> > > +		return -1;
> > > +	}
> > > +
> > > +	buf[err] = 0;
> >
> > nit: maybe rename err to len here?
>
> Sure, will rename.
>
> > > +	errno = 0;
> > > +	memory_peak = strtoll(buf, NULL, 10);
> > > +	if (errno) {
> > > +		log_errno("unrecognized %s/memory.peak format: %s", env.stat_cgroup, buf);
> > > +		return -1;
> > > +	}
> > > +
> > > +	return memory_peak;
> > > +}
> > > +
>
> [...]
>
> > > @@ -1332,7 +1551,16 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
> > >  	if (env.force_reg_invariants)
> > >  		bpf_program__set_flags(prog, bpf_program__flags(prog) | BPF_F_TEST_REG_INVARIANTS);
> > >
> > > -	err = bpf_object__load(obj);
> > > +	err = bpf_object__prepare(obj);
> > > +	if (!err) {
> > > +		cgroup_err = create_stat_cgroup();
> > > +		mem_peak_a = cgroup_memory_peak();
> > > +		err = bpf_object__load(obj);
> > > +		mem_peak_b = cgroup_memory_peak();
> > > +		destroy_stat_cgroup();
> >
> > What if we do create_stat_cgroup/destroy_stat_cgroup in
> > handle_verif_mode along with mount/umount_cgroupfs?
> > It may speed things up a little bit here, and moving all cgroup
> > preparation into a single place seems reasonable.
> > Will the memory counter behave differently? We are checking the
> > difference around bpf_object__load(); from a layman's perspective
> > it should be the same.
>
> The memory.peak file accounts for peak memory consumption, so one
> would need to reset this counter between program verifications.
> Doc [1] describes such a mechanism:
>
> memory.peak
>
> A read-write single value file which exists on non-root cgroups.
>
> The max memory usage recorded for the cgroup and its descendants
> since either the creation of the cgroup or the most recent reset
> for that FD.
>
> A write of any non-empty string to this file resets it to the
> current memory usage for subsequent reads through the same file
> descriptor.
>
> memory.reclaim
>
> A write-only nested-keyed file which exists for all cgroups.
>
> This is a simple interface to trigger memory reclaim in the target
> cgroup.
>
> Example:
>
> echo "1G" > memory.reclaim
>
> Please note that the kernel can over or under reclaim from the
> target cgroup. If less bytes are reclaimed than the specified
> amount, -EAGAIN is returned.
>
> As mentioned in the cover letter, I tried using a combination of the above
> while creating a cgroup only once. For reasons I don't understand, this
> did not produce stable measurements. E.g. depending on whether a program is
Looking at memory_peak_write() in mm/memcontrol.c, it looks reasonable
and should have worked (we do reset pc->local_watermark). But note the
if (usage > peer_ctx->value) logic and the /* initial write, register
watcher */ comment. I'm totally guessing and speculating, but maybe you
didn't close and re-open the file in between, and so you had a stale
"watcher" with an already recorded high watermark?..
I'd try again, but be very careful about which cgroup is being reset
and at what point...
> verified in isolation or as part of a batch, results vary between 2MB and 5MB.
>
> [1] https://docs.kernel.org/admin-guide/cgroup-v2.html
>
> > > +		if (!cgroup_err && mem_peak_a >= 0 && mem_peak_b >= 0)
> > > +			mem_peak = mem_peak_b - mem_peak_a;
> > > +	}
> > >  	env.progs_processed++;
> > >
> > >  	stats->file_name = strdup(base_filename);
> > > @@ -1341,6 +1569,7 @@ static int process_prog(const char *filename, struct bpf_object *obj, struct bpf
>
> [...]
>
> > Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
>
> Thank you. I will have to send a v2 that avoids mount operations and
> modification of the controllers file.
>
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-06 18:19 ` Andrii Nakryiko
@ 2025-06-07 8:13 ` Eduard Zingerman
2025-06-09 20:57 ` Andrii Nakryiko
0 siblings, 1 reply; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-07 8:13 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, martin.lau,
kernel-team, yonghong.song
On Fri, 2025-06-06 at 11:19 -0700, Andrii Nakryiko wrote:
[...]
> Looking at memory_peak_write() in mm/memcontrol.c, it looks reasonable
> and should have worked (we do reset pc->local_watermark). But note the
> if (usage > peer_ctx->value) logic and the /* initial write, register
> watcher */ comment. I'm totally guessing and speculating, but maybe you
> didn't close and re-open the file in between, and so you had a stale
> "watcher" with an already recorded high watermark?..
>
> I'd try again, but be very careful about which cgroup is being reset
> and at what point...
The way I read memcontrol.c:memory_peak_write(), it always transfers
current memcg->memory (aka memory.current) to the ofp->value of the
currently open file (aka memory.peak). So this should work as the
documentation suggests: one needs to keep a single fd for memory.peak
and periodically write something to it to reset the value.
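In code, one measurement cycle would look roughly like this (a sketch
only; measure_peak() is a hypothetical helper and peak_fd is a single
fd kept open on the cgroup's memory.peak for the whole run):

/* sketch: reset the per-fd watermark, verify one program, read peak */
static long measure_peak(int peak_fd)
{
	char buf[32];
	ssize_t n;

	/* any non-empty write resets the watermark for this fd */
	if (pwrite(peak_fd, "reset", 5, 0) < 0)
		return -1;
	/* ... bpf_object__load() for one program goes here ... */
	n = pread(peak_fd, buf, sizeof(buf) - 1, 0);
	if (n <= 0)
		return -1;
	buf[n] = '\0';
	return strtol(buf, NULL, 10); /* bytes recorded since the reset */
}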
---
I tried several versions with selftests and scx BPF binaries:
- version as in this patch-set, aka "many cg";
- version with a single control group that writes to memory.reclaim
and then to memory.peak between program verifications (while holding the
same FDs for these files), aka "reclaim+reset", implementation is in [1];
- version with a single control group, same as "reclaim+reset" but
without the "reclaim" part, aka "reset only"; its implementation can be
trivially derived from [1].
Here are stats for each of the versions, where I try to gauge the
stability of the results. Each version was run twice and the generated
results were compared.
|                                    |         | one cg        | one cg     |        |
|                                    | many cg | reclaim+reset | reset only | master |
|------------------------------------+---------+---------------+------------+--------|
| SCX                                |         |               |            |        |
|------------------------------------+---------+---------------+------------+--------|
| running time (sec)                 | 48      | 50            | 46         | 43     |
| jitter mem_peak_diff!=0 (of 172)   | 3       | 93            | 80         |        |
| jitter mem_peak_diff>256 (of 172)  | 0       | 5             | 7          |        |
|------------------------------------+---------+---------------+------------+--------|
| selftests                          |         |               |            |        |
|------------------------------------+---------+---------------+------------+--------|
| running time (sec)                 | 108     | 140           | 90         | 86     |
| jitter mem_peak_diff!=0 (of 3601)  | 195     | 1751          | 1181       |        |
| jitter mem_peak_diff>256 (of 3601) | 1       | 22            | 14         |        |
- "jitter mem_peak_diff!=0" means that veristat was run two times and
results were compared to produce a number of differences:
`veristat -C -f "mem_peak_diff!=0" first-run.csv second-run.csv| wc -l`
- "jitter mem_peak_diff>256" is the same, but the filter expression
was "mem_peak_diff>256", meaning difference is greater than 256KiB.
The big jitter comes from `0->256KiB` and `256KiB->0` transitions
occurring for very small programs. There are a lot of such programs in
selftests.
Comparison of result quality between "many cg" and the other variants
(same metrics as above, but different veristat versions were used to
produce the CSVs for comparison):
|                                    | many cg          | many cg       |
|                                    | vs reclaim+reset | vs reset only |
|------------------------------------+------------------+---------------|
| SCX                                |                  |               |
|------------------------------------+------------------+---------------|
| jitter mem_peak_diff!=0 (of 172)   | 108              | 70            |
| jitter mem_peak_diff>256 (of 172)  | 6                | 2             |
|------------------------------------+------------------+---------------|
| selftests                          |                  |               |
|------------------------------------+------------------+---------------|
| jitter mem_peak_diff!=0 (of 3601)  | 1885             | 942           |
| jitter mem_peak_diff>256 (of 3601) | 27               | 11            |
As can be seen, most of the differences in the collected stats are not
bigger than 256KiB.
---
Given the above, I'm inclined to stick with the "many cg" approach, as
it has less jitter and is reasonably performant. I need to wrap up the
parallel veristat version anyway (and "many cg" should be easier to
manage for a parallel run).
---
[1] https://github.com/eddyz87/bpf/tree/veristat-memory-accounting.one-cg
P.S.
The only difference between [1] and my initial experiments is that I
used dprintf instead of pwrite to access memory.{peak,reclaim},
¯\_(ツ)_/¯.
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-07 8:13 ` Eduard Zingerman
@ 2025-06-09 20:57 ` Andrii Nakryiko
2025-06-09 22:45 ` Eduard Zingerman
0 siblings, 1 reply; 14+ messages in thread
From: Andrii Nakryiko @ 2025-06-09 20:57 UTC (permalink / raw)
To: Eduard Zingerman
Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, martin.lau,
kernel-team, yonghong.song
On Sat, Jun 7, 2025 at 1:13 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Fri, 2025-06-06 at 11:19 -0700, Andrii Nakryiko wrote:
>
> [...]
>
> > Looking at memory_peak_write() in mm/memcontrol.c, it looks reasonable
> > and should have worked (we do reset pc->local_watermark). But note the
> > if (usage > peer_ctx->value) logic and the /* initial write, register
> > watcher */ comment. I'm totally guessing and speculating, but maybe you
> > didn't close and re-open the file in between, and so you had a stale
> > "watcher" with an already recorded high watermark?..
> >
> > I'd try again, but be very careful about which cgroup is being reset
> > and at what point...
>
> The way I read memcontrol.c:memory_peak_write(), it always transfers
> current memcg->memory (aka memory.current) to the ofp->value of the
> currently open file (aka memory.peak). So this should work as the
> documentation suggests: one needs to keep a single fd for memory.peak
> and periodically write something to it to reset the value.
>
> ---
>
> I tried several versions with selftests and scx BPF binaries:
> - version as in this patch-set, aka "many cg";
> - version with a single control group that writes to memory.reclaim
> and then to memory.peak between program verifications (while holding the
> same FDs for these files), aka "reclaim+reset", implementation is in [1];
> - version with a single control group, same as "reclaim+reset" but
> without the "reclaim" part, aka "reset only"; its implementation can be
> trivially derived from [1].
>
> Here are stats for each of the versions, where I try to gauge the
> stability of the results. Each version was run twice and the generated
> results were compared.
>
> |                                    |         | one cg        | one cg     |        |
> |                                    | many cg | reclaim+reset | reset only | master |
> |------------------------------------+---------+---------------+------------+--------|
> | SCX                                |         |               |            |        |
> |------------------------------------+---------+---------------+------------+--------|
> | running time (sec)                 | 48      | 50            | 46         | 43     |
> | jitter mem_peak_diff!=0 (of 172)   | 3       | 93            | 80         |        |
> | jitter mem_peak_diff>256 (of 172)  | 0       | 5             | 7          |        |
> |------------------------------------+---------+---------------+------------+--------|
> | selftests                          |         |               |            |        |
> |------------------------------------+---------+---------------+------------+--------|
> | running time (sec)                 | 108     | 140           | 90         | 86     |
> | jitter mem_peak_diff!=0 (of 3601)  | 195     | 1751          | 1181       |        |
> | jitter mem_peak_diff>256 (of 3601) | 1       | 22            | 14         |        |
>
> - "jitter mem_peak_diff!=0" means that veristat was run two times and
> results were compared to produce a number of differences:
> `veristat -C -f "mem_peak_diff!=0" first-run.csv second-run.csv| wc -l`
> - "jitter mem_peak_diff>256" is the same, but the filter expression
> was "mem_peak_diff>256", meaning difference is greater than 256KiB.
>
> The big jitter comes from `0->256KiB` and `256KiB->0` transitions
> occurring for very small programs. There are a lot of such programs in
> selftests.
>
> Comparison of result quality between "many cg" and the other variants
> (same metrics as above, but different veristat versions were used to
> produce the CSVs for comparison):
>
> |                                    | many cg          | many cg       |
> |                                    | vs reclaim+reset | vs reset only |
> |------------------------------------+------------------+---------------|
> | SCX                                |                  |               |
> |------------------------------------+------------------+---------------|
> | jitter mem_peak_diff!=0 (of 172)   | 108              | 70            |
> | jitter mem_peak_diff>256 (of 172)  | 6                | 2             |
> |------------------------------------+------------------+---------------|
> | selftests                          |                  |               |
> |------------------------------------+------------------+---------------|
> | jitter mem_peak_diff!=0 (of 3601)  | 1885             | 942           |
> | jitter mem_peak_diff>256 (of 3601) | 27               | 11            |
>
>
> As can be seen, most of the differences in the collected stats are not
> bigger than 256KiB.
>
> ---
>
> Given the above, I'm inclined to stick with the "many cg" approach, as
> it has less jitter and is reasonably performant. I need to wrap up the
> parallel veristat version anyway (and "many cg" should be easier to
> manage for a parallel run).
As I mentioned in an offline discussion, this 256KB jitter seems minor
and is not on the order of "interesting memory usage", IMO. So I'd go
with the single cgroup approach with reset and keep the runtime
significantly faster. Mykyta is working on a series to further speed
up veristat's mass verification by relying on bpf_object__prepare()
and bpf_prog_load() for each program, instead of re-opening and
re-parsing the same object file over and over again. So it would be
good not to add an extra 18 seconds of runtime just for the creation
of cgroups.
FWIW, we can count mem_peak in megabytes and the jitter will be gone
(and we'll probably still get all the useful signal we wanted with
this measurement), if that's the concern.
>
> ---
>
> [1] https://github.com/eddyz87/bpf/tree/veristat-memory-accounting.one-cg
>
> P.S.
>
> The only difference between [1] and my initial experiments is that I
> used dprintf instead of pwrite to access memory.{peak,reclaim},
> ¯\_(ツ)_/¯.
>
>
* Re: [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs
2025-06-09 20:57 ` Andrii Nakryiko
@ 2025-06-09 22:45 ` Eduard Zingerman
0 siblings, 0 replies; 14+ messages in thread
From: Eduard Zingerman @ 2025-06-09 22:45 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Mykyta Yatsenko, bpf, ast, andrii, daniel, martin.lau,
kernel-team, yonghong.song
On Mon, 2025-06-09 at 13:57 -0700, Andrii Nakryiko wrote:
> On Sat, Jun 7, 2025 at 1:13 AM Eduard Zingerman <eddyz87@gmail.com>
> wrote:
[...]
> > Given the above, I'm inclined to stick with the "many cg" approach,
> > as it has less jitter and is reasonably performant. I need to wrap up
> > the parallel veristat version anyway (and "many cg" should be easier
> > to manage for a parallel run).
>
> As I mentioned in an offline discussion, this 256KB jitter seems minor
> and is not on the order of "interesting memory usage", IMO. So I'd go
> with the single cgroup approach with reset and keep the runtime
> significantly faster. Mykyta is working on a series to further speed
> up veristat's mass verification by relying on bpf_object__prepare()
> and bpf_prog_load() for each program, instead of re-opening and
> re-parsing the same object file over and over again. So it would be
> good not to add an extra 18 seconds of runtime just for the creation
> of cgroups.
>
> FWIW, we can count mem_peak in megabytes and the jitter will be gone
> (and we'll probably still get all the useful signal we wanted with
> this measurement), if that's the concern.
Ok, makes sense, I'll rename the field to mem_peak_mb and go with a
single cgroup. The metric would be zero for small programs, but that
should be fine.
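For illustration (a hypothetical conversion; mem_peak_a/mem_peak_b as
in the patch, assuming memory.peak deltas are in bytes), integer
division floors everything below 1MB to zero:

	long mem_peak_mb = (mem_peak_b - mem_peak_a) / (1024 * 1024);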
[...]
end of thread
Thread overview: 14+ messages
2025-06-05 23:06 [PATCH bpf-next v1 0/2] veristat: memory accounting for bpf programs Eduard Zingerman
2025-06-05 23:06 ` [PATCH bpf-next v1 1/2] bpf: include verifier memory allocations in memcg statistics Eduard Zingerman
2025-06-05 23:06 ` [PATCH bpf-next v1 2/2] veristat: memory accounting for bpf programs Eduard Zingerman
2025-06-06 1:03 ` Eduard Zingerman
2025-06-06 2:17 ` Alexei Starovoitov
2025-06-06 2:33 ` Eduard Zingerman
2025-06-06 2:35 ` Alexei Starovoitov
2025-06-06 2:46 ` Eduard Zingerman
2025-06-06 16:53 ` Mykyta Yatsenko
2025-06-06 17:03 ` Eduard Zingerman
2025-06-06 18:19 ` Andrii Nakryiko
2025-06-07 8:13 ` Eduard Zingerman
2025-06-09 20:57 ` Andrii Nakryiko
2025-06-09 22:45 ` Eduard Zingerman