public inbox for dwarves@vger.kernel.org
* [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding
@ 2024-12-13 22:36 Ihor Solodrai
  2024-12-13 22:36 ` [PATCH dwarves v2 01/10] btf_encoder: simplify function encoding Ihor Solodrai
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:36 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

This is v2 of the patchset aiming to speed up parallel BTF encoding
when the reproducible_build flag is set (see link [1]).

In comparison to v1:
  * patch #2 adding section-relative addresses to elf_functions is
    removed as unrelated [2]
  * patch #9 [3] is replaced with patches #8, #9 and #10 (the biggest
    and most important in this series)

Patch #10 rewrites multithreading implementation to job/worker
model. See the details in the commit message.
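
As a rough illustration (not the actual dwarf_loader code; all names
here are made up), the job/worker model boils down to a pool of
threads pulling work items from a shared queue until it is drained:

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical sketch of a job/worker pool: a fixed set of workers
 * claims job indices from a shared counter until all jobs are taken.
 * The real dwarf_loader queue also bounds memory for decoded CUs
 * (max_decoded_cus); that part is omitted here.
 */
struct pool {
	atomic_int next_job;	/* next job index to hand out */
	int nr_jobs;
	atomic_int done;	/* number of completed jobs */
};

static void *worker(void *arg)
{
	struct pool *p = arg;

	for (;;) {
		int job = atomic_fetch_add(&p->next_job, 1);

		if (job >= p->nr_jobs)
			break;	/* queue drained, worker exits */
		/* ... decode and encode one CU here ... */
		atomic_fetch_add(&p->done, 1);
	}
	return NULL;
}

/* Run nr_jobs jobs on nr_workers threads; returns jobs completed. */
int run_jobs(int nr_jobs, int nr_workers)
{
	struct pool p = { .next_job = 0, .nr_jobs = nr_jobs, .done = 0 };
	pthread_t tid[64];
	int i;

	if (nr_workers > 64)
		nr_workers = 64;
	for (i = 0; i < nr_workers; i++)
		pthread_create(&tid[i], NULL, worker, &p);
	for (i = 0; i < nr_workers; i++)
		pthread_join(tid[i], NULL);
	return atomic_load(&p.done);
}
```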

The ./tests/tests pass with a vmlinux build on bpf-next.

I also confirmed that the reproducible bpftool dump of BTF produced
for vmlinux is identical between this patch series and pahole/next.

With this patch series, the performance of parallel BTF encoding is
comparable to non-reproducible runs on pahole/next. Depending on the
number of threads and allowed memory usage (indirectly controlled by
the max_decoded_cus parameter of the queue in dwarf_loader.c), it may
be a little slower or a little faster.

Note that the number of CPU cycles is significantly lower, although
the wall-clock time for -j24 is somewhat greater, as reported by perf.

See sample measurements below (host nproc=24).

This patch series (always reproducible)

    -j1 mem 842020 Kb, time 6.31 sec
    -j3 mem 864604 Kb, time 2.90 sec
    -j6 mem 927760 Kb, time 2.21 sec
    -j12 mem 1026616 Kb, time 2.29 sec
    -j24 mem 1188448 Kb, time 2.36 sec
    -j48 mem 1462656 Kb, time 2.48 sec

     Performance counter stats for '/home/theihor/dev/dwarves/build/pahole -J -j24 --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build --btf_encode_detached=/dev/null --lang_exclude=rust /home/theihor/git/kernel.org/bpf-next/kbuild-output/.tmp_vmlinux1' (13 runs):

        46,771,092,586      cycles:u                                                                ( +-  0.17% )

               2.36785 +- 0.00503 seconds time elapsed  ( +-  0.21% )

pahole/next (1cb4202) non-reproducible

    -j1 mem 834004 Kb, time 6.25 sec
    -j3 mem 976480 Kb, time 3.21 sec
    -j6 mem 1081432 Kb, time 2.36 sec
    -j12 mem 1161252 Kb, time 2.07 sec
    -j24 mem 1303060 Kb, time 2.13 sec
    -j48 mem 1537800 Kb, time 2.39 sec

     Performance counter stats for '/home/theihor/dev/dwarves/build/pahole -J -j24 --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs --btf_encode_detached=/dev/null --lang_exclude=rust /home/theihor/git/kernel.org/bpf-next/kbuild-output/.tmp_vmlinux1' (13 runs):

        60,436,382,442      cycles:u                                                                ( +-  0.22% )

                2.2024 +- 0.0151 seconds time elapsed  ( +-  0.68% )

pahole/next (1cb4202) reproducible

    -j1 mem 4745764 Kb, time 7.64 sec
    -j3 mem 4744556 Kb, time 3.95 sec
    -j6 mem 4744592 Kb, time 2.98 sec
    -j12 mem 4744680 Kb, time 2.99 sec
    -j24 mem 4745252 Kb, time 2.99 sec
    -j48 mem 4744520 Kb, time 2.98 sec

     Performance counter stats for '/home/theihor/dev/dwarves/build/pahole -J -j24 --btf_features=encode_force,var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs,reproducible_build --btf_encode_detached=/dev/null --lang_exclude=rust /home/theihor/git/kernel.org/bpf-next/kbuild-output/.tmp_vmlinux1' (13 runs):

        38,155,725,721      cycles:u                                                                ( +-  0.29% )

               3.00290 +- 0.00501 seconds time elapsed  ( +-  0.17% )

[1] https://lore.kernel.org/dwarves/20241128012341.4081072-1-ihor.solodrai@pm.me/
[2] https://lore.kernel.org/dwarves/20241128012341.4081072-3-ihor.solodrai@pm.me/
[3] https://lore.kernel.org/dwarves/20241128012341.4081072-10-ihor.solodrai@pm.me/

Alan Maguire (2):
  btf_encoder: simplify function encoding
  btf_encoder: separate elf function, saved function representations

Ihor Solodrai (8):
  dwarf_loader: introduce pre_load_module hook to conf_load
  btf_encoder: introduce elf_functions struct type
  btf_encoder: collect elf_functions in btf_encoder__pre_load_module
  btf_encoder: switch to shared elf_functions table
  btf_encoder: introduce btf_encoding_context
  btf_encoder: remove skip_encoding_inconsistent_proto
  dwarf_loader: introduce cu->id
  dwarf_loader: multithreading with a job/worker model

 btf_encoder.c               | 639 +++++++++++++++++++++---------------
 btf_encoder.h               |   8 +-
 btf_loader.c                |   2 +-
 ctf_loader.c                |   2 +-
 dwarf_loader.c              | 352 ++++++++++++++------
 dwarves.c                   |  44 ---
 dwarves.h                   |  21 +-
 pahole.c                    | 237 +++----------
 pdwtags.c                   |   3 +-
 pfunct.c                    |   3 +-
 tests/reproducible_build.sh |   5 +-
 11 files changed, 685 insertions(+), 631 deletions(-)

-- 
2.47.1



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 01/10] btf_encoder: simplify function encoding
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
@ 2024-12-13 22:36 ` Ihor Solodrai
  2024-12-13 22:36 ` [PATCH dwarves v2 02/10] btf_encoder: separate elf function, saved function representations Ihor Solodrai
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:36 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

From: Alan Maguire <alan.maguire@oracle.com>

Currently we have two modes of function encoding; one adds functions
based upon the first instance found and ignores inconsistent
representations.  The second saves function representations and later
finds inconsistencies.  The mode chosen is determined by
conf_load->skip_encoding_btf_inconsistent_proto.

The knock-on effect is that we need to support two modes in
btf_encoder__add_func(), one for each case.  Simplify by using
the "save function" approach for both cases; the only difference is
that we allow inconsistent representations if
skip_encoding_btf_inconsistent_proto is not set (it is set by default
for upstream kernels and has been for a while).
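
The resulting gating logic distills into a single predicate (a
simplified sketch; the struct is trimmed down to the two flags that
matter here):

```c
#include <stdbool.h>

/* Sketch of the emission decision after this change: every function
 * goes through the "save" path, and one predicate decides whether an
 * inconsistent representation is still encoded.  Field names follow
 * the patch; the full state struct carries more than this.
 */
struct func_state {
	bool unexpected_reg;
	bool inconsistent_proto;
};

bool should_encode(const struct func_state *s, bool skip_inconsistent)
{
	/* If skipping is disabled, encode everything.  Otherwise
	 * exclude only functions with unexpected register use or an
	 * inconsistent prototype.
	 */
	return !skip_inconsistent ||
	       (!s->unexpected_reg && !s->inconsistent_proto);
}
```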

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>

Link: https://lore.kernel.org/dwarves/20241128012341.4081072-2-ihor.solodrai@pm.me/
---
 btf_encoder.c | 79 +++++++++++++++++----------------------------------
 1 file changed, 26 insertions(+), 53 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index b715bc2..b9a98f5 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -88,7 +88,6 @@ struct btf_encoder_func_state {
 struct elf_function {
 	const char	*name;
 	char		*alias;
-	bool		 generated;
 	size_t		prefixlen;
 	struct btf_encoder_func_state state;
 };
@@ -120,6 +119,7 @@ struct btf_encoder {
 			  force,
 			  gen_floats,
 			  skip_encoding_decl_tag,
+			  skip_encoding_inconsistent_proto,
 			  tag_kfuncs,
 			  gen_distilled_base;
 	uint32_t	  array_index_id;
@@ -1165,18 +1165,18 @@ out:
 	return err;
 }
 
-static int32_t btf_encoder__add_func(struct btf_encoder *encoder, struct function *fn,
+static int32_t btf_encoder__add_func(struct btf_encoder *encoder,
 				     struct elf_function *func)
 {
+	struct btf_encoder_func_state *state = &func->state;
 	int btf_fnproto_id, btf_fn_id, tag_type_id = 0;
 	int16_t component_idx = -1;
 	const char *name;
 	const char *value;
 	char tmp_value[KSYM_NAME_LEN];
+	uint16_t idx;
 
-	assert(fn != NULL || func != NULL);
-
-	btf_fnproto_id = btf_encoder__add_func_proto(encoder, fn ? &fn->proto : NULL, func);
+	btf_fnproto_id = btf_encoder__add_func_proto(encoder, NULL, func);
 	name = func->alias ?: func->name;
 	if (btf_fnproto_id >= 0)
 		btf_fn_id = btf_encoder__add_ref_type(encoder, BTF_KIND_FUNC, btf_fnproto_id,
@@ -1186,40 +1186,23 @@ static int32_t btf_encoder__add_func(struct btf_encoder *encoder, struct functio
 		       name, btf_fnproto_id < 0 ? "proto" : "func");
 		return -1;
 	}
-	if (!fn) {
-		struct btf_encoder_func_state *state = &func->state;
-		uint16_t idx;
-
-		if (state->nr_annots == 0)
-			return 0;
+	if (state->nr_annots == 0)
+		return 0;
 
-		for (idx = 0; idx < state->nr_annots; idx++) {
-			struct btf_encoder_func_annot *a = &state->annots[idx];
+	for (idx = 0; idx < state->nr_annots; idx++) {
+		struct btf_encoder_func_annot *a = &state->annots[idx];
 
-			value = btf__str_by_offset(encoder->btf, a->value);
-			/* adding BTF data may result in a mode of the
-			 * value string memory, so make a temporary copy.
-			 */
-			strncpy(tmp_value, value, sizeof(tmp_value) - 1);
-			component_idx = a->component_idx;
-
-			tag_type_id = btf_encoder__add_decl_tag(encoder, tmp_value,
-								btf_fn_id, component_idx);
-			if (tag_type_id < 0)
-				break;
-		}
-	} else {
-		struct llvm_annotation *annot;
-
-		list_for_each_entry(annot, &fn->annots, node) {
-			value = annot->value;
-			component_idx = annot->component_idx;
+		value = btf__str_by_offset(encoder->btf, a->value);
+		/* adding BTF data may result in a mode of the
+		 * value string memory, so make a temporary copy.
+		 */
+		strncpy(tmp_value, value, sizeof(tmp_value) - 1);
+		component_idx = a->component_idx;
 
-			tag_type_id = btf_encoder__add_decl_tag(encoder, value, btf_fn_id,
-								component_idx);
-			if (tag_type_id < 0)
-				break;
-		}
+		tag_type_id = btf_encoder__add_decl_tag(encoder, tmp_value,
+							btf_fn_id, component_idx);
+		if (tag_type_id < 0)
+			break;
 	}
 	if (tag_type_id < 0) {
 		fprintf(stderr,
@@ -1277,8 +1260,9 @@ static int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
 		 * just do not _use_ them.  Only exclude functions with
 		 * unexpected register use or multiple inconsistent prototypes.
 		 */
-		if (!state->unexpected_reg && !state->inconsistent_proto) {
-			if (btf_encoder__add_func(encoder, NULL, func))
+		if (!encoder->skip_encoding_inconsistent_proto ||
+		    (!state->unexpected_reg && !state->inconsistent_proto)) {
+			if (btf_encoder__add_func(encoder, func))
 				return -1;
 		}
 		state->processed = 1;
@@ -2361,6 +2345,7 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 		encoder->force		 = conf_load->btf_encode_force;
 		encoder->gen_floats	 = conf_load->btf_gen_floats;
 		encoder->skip_encoding_decl_tag	 = conf_load->skip_encoding_btf_decl_tag;
+		encoder->skip_encoding_inconsistent_proto = conf_load->skip_encoding_btf_inconsistent_proto;
 		encoder->tag_kfuncs	 = conf_load->btf_decl_tag_kfuncs;
 		encoder->gen_distilled_base = conf_load->btf_gen_distilled_base;
 		encoder->verbose	 = verbose;
@@ -2566,7 +2551,6 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 
 	cu__for_each_function(cu, core_id, fn) {
 		struct elf_function *func = NULL;
-		bool save = false;
 
 		/*
 		 * Skip functions that:
@@ -2588,15 +2572,8 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 
 			/* prefer exact function name match... */
 			func = btf_encoder__find_function(encoder, name, 0);
-			if (func) {
-				if (func->generated)
-					continue;
-				if (conf_load->skip_encoding_btf_inconsistent_proto)
-					save = true;
-				else
-					func->generated = true;
-			} else if (encoder->functions.suffix_cnt &&
-				   conf_load->btf_gen_optimized) {
+			if (!func && encoder->functions.suffix_cnt &&
+			    conf_load->btf_gen_optimized) {
 				/* falling back to name.isra.0 match if no exact
 				 * match is found; only bother if we found any
 				 * .suffix function names.  The function
@@ -2607,7 +2584,6 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 				func = btf_encoder__find_function(encoder, name,
 								  strlen(name));
 				if (func) {
-					save = true;
 					if (encoder->verbose)
 						printf("matched function '%s' with '%s'%s\n",
 						       name, func->name,
@@ -2625,10 +2601,7 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 		if (!func)
 			continue;
 
-		if (save)
-			err = btf_encoder__save_func(encoder, fn, func);
-		else
-			err = btf_encoder__add_func(encoder, fn, func);
+		err = btf_encoder__save_func(encoder, fn, func);
 		if (err)
 			goto out;
 	}
-- 
2.47.1




* [PATCH dwarves v2 02/10] btf_encoder: separate elf function, saved function representations
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
  2024-12-13 22:36 ` [PATCH dwarves v2 01/10] btf_encoder: simplify function encoding Ihor Solodrai
@ 2024-12-13 22:36 ` Ihor Solodrai
  2024-12-19 14:59   ` Jiri Olsa
  2024-12-13 22:36 ` [PATCH dwarves v2 03/10] dwarf_loader: introduce pre_load_module hook to conf_load Ihor Solodrai
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:36 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

From: Alan Maguire <alan.maguire@oracle.com>

Have the saved function representation point back at the immutable
ELF function table.  This will make sharing the ELF function table
across encoders easier.  Simply accumulate saved functions for each
encoder, and on completion combine them into a name-sorted list.  Then
carry out comparisons to check for inconsistent representations,
skipping functions whose representation is inconsistent.
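
The sort-and-combine pass can be sketched like this (simplified: the
real entries compare by ELF name prefix so "name.isra.0"-style
suffixes also match, and they merge more flags than shown):

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Rough sketch of the combine pass: saved function states from all
 * encoders are sorted by name, flags are merged across each run of
 * same-name entries, and a prototype mismatch within a run marks the
 * whole run inconsistent.  proto_hash stands in for the full
 * prototype comparison done by funcs__match().
 */
struct saved_fn {
	const char *name;
	int proto_hash;		/* stand-in for full prototype compare */
	bool inconsistent;
};

static int fn_cmp(const void *a, const void *b)
{
	return strcmp(((const struct saved_fn *)a)->name,
		      ((const struct saved_fn *)b)->name);
}

void combine_saved_fns(struct saved_fn *fns, int n)
{
	int i, j;

	qsort(fns, n, sizeof(*fns), fn_cmp);
	for (i = 0; i < n; i = j) {
		bool inconsistent = fns[i].inconsistent;

		for (j = i + 1; j < n && !strcmp(fns[j].name, fns[i].name); j++) {
			inconsistent |= fns[j].inconsistent;
			if (fns[j].proto_hash != fns[i].proto_hash)
				inconsistent = true;
		}
		while (i < j)	/* propagate merged state over the run */
			fns[i++].inconsistent = inconsistent;
	}
}
```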

/usr/bin/time samples with this change:
  jobs 1, mem 837844 Kb, time 6.40 sec
  jobs 2, mem 936204 Kb, time 3.88 sec
  jobs 4, mem 1023120 Kb, time 2.75 sec
  jobs 8, mem 1163824 Kb, time 2.31 sec
  jobs 16, mem 1190588 Kb, time 2.08 sec
  jobs 32, mem 1341180 Kb, time 2.36 sec

/usr/bin/time samples on next (3ddadc1):
  jobs 1, mem 834100 Kb, time 6.20 sec
  jobs 2, mem 925048 Kb, time 3.81 sec
  jobs 4, mem 1025424 Kb, time 2.88 sec
  jobs 8, mem 1178480 Kb, time 2.21 sec
  jobs 16, mem 1241780 Kb, time 2.07 sec
  jobs 32, mem 1442316 Kb, time 2.33 sec

Link: https://lore.kernel.org/dwarves/20241128012341.4081072-4-ihor.solodrai@pm.me/

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Co-developed-by: Ihor Solodrai <ihor.solodrai@pm.me>
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
 btf_encoder.c | 293 +++++++++++++++++++++++++++-----------------------
 btf_encoder.h |   1 +
 pahole.c      |   2 +
 3 files changed, 160 insertions(+), 136 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index b9a98f5..523901c 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -72,14 +72,15 @@ struct btf_encoder_func_annot {
 
 /* state used to do later encoding of saved functions */
 struct btf_encoder_func_state {
+	struct list_head node;
+	struct btf_encoder *encoder;
+	struct elf_function *elf;
 	uint32_t type_id_off;
 	uint16_t nr_parms;
 	uint16_t nr_annots;
-	uint8_t initialized:1;
 	uint8_t optimized_parms:1;
 	uint8_t unexpected_reg:1;
 	uint8_t inconsistent_proto:1;
-	uint8_t processed:1;
 	int ret_type_id;
 	struct btf_encoder_func_parm *parms;
 	struct btf_encoder_func_annot *annots;
@@ -89,7 +90,6 @@ struct elf_function {
 	const char	*name;
 	char		*alias;
 	size_t		prefixlen;
-	struct btf_encoder_func_state state;
 };
 
 struct elf_secinfo {
@@ -126,6 +126,7 @@ struct btf_encoder {
 	struct elf_secinfo *secinfo;
 	size_t             seccnt;
 	int                encode_vars;
+	struct list_head   func_states;
 	struct {
 		struct elf_function *entries;
 		int		    allocated;
@@ -148,8 +149,6 @@ struct btf_kfunc_set_range {
 static LIST_HEAD(encoders);
 static pthread_mutex_t encoders__lock = PTHREAD_MUTEX_INITIALIZER;
 
-static int btf_encoder__add_saved_funcs(struct btf_encoder *encoder);
-
 /* mutex only needed for add/delete, as this can happen in multiple encoding
  * threads.  Traversal of the list is currently confined to thread collection.
  */
@@ -693,25 +692,26 @@ static int32_t btf_encoder__tag_type(struct btf_encoder *encoder, uint32_t tag_t
 }
 
 static int32_t btf_encoder__add_func_proto(struct btf_encoder *encoder, struct ftype *ftype,
-					   struct elf_function *func)
+					   struct btf_encoder_func_state *state)
 {
-	struct btf *btf = encoder->btf;
 	const struct btf_type *t;
+	struct btf *btf;
 	struct parameter *param;
 	uint16_t nr_params, param_idx;
 	int32_t id, type_id;
 	char tmp_name[KSYM_NAME_LEN];
 	const char *name;
-	struct btf_encoder_func_state *state;
 
-	assert(ftype != NULL || func != NULL);
+	assert(ftype != NULL || state != NULL);
 
 	/* add btf_type for func_proto */
 	if (ftype) {
+		btf = encoder->btf;
 		nr_params = ftype->nr_parms + (ftype->unspec_parms ? 1 : 0);
 		type_id = btf_encoder__tag_type(encoder, ftype->tag.type);
-	} else if (func) {
-		state = &func->state;
+	} else if (state) {
+		encoder = state->encoder;
+		btf = state->encoder->btf;
 		nr_params = state->nr_parms;
 		type_id = state->ret_type_id;
 	} else {
@@ -801,8 +801,6 @@ int32_t btf_encoder__add_encoder(struct btf_encoder *encoder, struct btf_encoder
 	if (encoder == other)
 		return 0;
 
-	btf_encoder__add_saved_funcs(other);
-
 	for (shndx = 1; shndx < other->seccnt; shndx++) {
 		struct gobuffer *var_secinfo_buf = &other->secinfo[shndx].secinfo;
 		size_t sz = gobuffer__size(var_secinfo_buf);
@@ -1031,10 +1029,13 @@ static bool types__match(struct btf_encoder *encoder,
 	return false;
 }
 
-static bool funcs__match(struct btf_encoder *encoder, struct elf_function *func,
-			 struct btf *btf1, struct btf_encoder_func_state *s1,
-			 struct btf *btf2, struct btf_encoder_func_state *s2)
+static bool funcs__match(struct btf_encoder_func_state *s1,
+			 struct btf_encoder_func_state *s2)
 {
+	struct btf_encoder *encoder = s1->encoder;
+	struct elf_function *func = s1->elf;
+	struct btf *btf1 = s1->encoder->btf;
+	struct btf *btf2 = s2->encoder->btf;
 	uint8_t i;
 
 	if (s1->nr_parms != s2->nr_parms) {
@@ -1072,8 +1073,7 @@ static bool funcs__match(struct btf_encoder *encoder, struct elf_function *func,
 
 static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct function *fn, struct elf_function *func)
 {
-	struct btf_encoder_func_state *existing = &func->state;
-	struct btf_encoder_func_state state = { 0 };
+	struct btf_encoder_func_state *state = zalloc(sizeof(*state));
 	struct ftype *ftype = &fn->proto;
 	struct btf *btf = encoder->btf;
 	struct llvm_annotation *annot;
@@ -1081,22 +1081,23 @@ static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct functi
 	uint8_t param_idx = 0;
 	int str_off, err = 0;
 
-	/* if already skipping this function, no need to proceed. */
-	if (existing->unexpected_reg || existing->inconsistent_proto)
-		return 0;
+	if (!state)
+		return -ENOMEM;
 
-	state.nr_parms = ftype->nr_parms + (ftype->unspec_parms ? 1 : 0);
-	state.ret_type_id = ftype->tag.type == 0 ? 0 : encoder->type_id_off + ftype->tag.type;
-	if (state.nr_parms > 0) {
-		state.parms = zalloc(state.nr_parms * sizeof(*state.parms));
-		if (!state.parms) {
+	state->encoder = encoder;
+	state->elf = func;
+	state->nr_parms = ftype->nr_parms + (ftype->unspec_parms ? 1 : 0);
+	state->ret_type_id = ftype->tag.type == 0 ? 0 : encoder->type_id_off + ftype->tag.type;
+	if (state->nr_parms > 0) {
+		state->parms = zalloc(state->nr_parms * sizeof(*state->parms));
+		if (!state->parms) {
 			err = -ENOMEM;
 			goto out;
 		}
 	}
-	state.inconsistent_proto = ftype->inconsistent_proto;
-	state.unexpected_reg = ftype->unexpected_reg;
-	state.optimized_parms = ftype->optimized_parms;
+	state->inconsistent_proto = ftype->inconsistent_proto;
+	state->unexpected_reg = ftype->unexpected_reg;
+	state->optimized_parms = ftype->optimized_parms;
 	ftype__for_each_parameter(ftype, param) {
 		const char *name = parameter__name(param) ?: "";
 
@@ -1105,21 +1106,21 @@ static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct functi
 			err = str_off;
 			goto out;
 		}
-		state.parms[param_idx].name_off = str_off;
-		state.parms[param_idx].type_id = param->tag.type == 0 ? 0 :
-						encoder->type_id_off + param->tag.type;
+		state->parms[param_idx].name_off = str_off;
+		state->parms[param_idx].type_id = param->tag.type == 0 ? 0 :
+						  encoder->type_id_off + param->tag.type;
 		param_idx++;
 	}
 	if (ftype->unspec_parms)
-		state.parms[param_idx].type_id = 0;
+		state->parms[param_idx].type_id = 0;
 
 	list_for_each_entry(annot, &fn->annots, node)
-		state.nr_annots++;
-	if (state.nr_annots) {
+		state->nr_annots++;
+	if (state->nr_annots) {
 		uint8_t idx = 0;
 
-		state.annots = zalloc(state.nr_annots * sizeof(*state.annots));
-		if (!state.annots) {
+		state->annots = zalloc(state->nr_annots * sizeof(*state->annots));
+		if (!state->annots) {
 			err = -ENOMEM;
 			goto out;
 		}
@@ -1129,46 +1130,24 @@ static int32_t btf_encoder__save_func(struct btf_encoder *encoder, struct functi
 				err = str_off;
 				goto out;
 			}
-			state.annots[idx].value = str_off;
-			state.annots[idx].component_idx = annot->component_idx;
+			state->annots[idx].value = str_off;
+			state->annots[idx].component_idx = annot->component_idx;
 			idx++;
 		}
 	}
-	state.initialized = 1;
-
-	if (state.unexpected_reg)
-		btf_encoder__log_func_skip(encoder, func,
-					   "unexpected register used for parameter\n");
-	if (!existing->initialized) {
-		memcpy(existing, &state, sizeof(*existing));
-		return 0;
-	}
-
-	/* If saving and we find an existing entry, we want to merge
-	 * observations across both functions, checking that the
-	 * "seen optimized parameters", "inconsistent prototype"
-	 * and "unexpected register" status is reflected in the
-	 * func entry.
-	 * If the entry is new, record encoder state required
-	 * to add the local function later (encoder + type_id_off)
-	 * such that we can add the function later.
-	 */
-	existing->optimized_parms |= state.optimized_parms;
-	existing->unexpected_reg |= state.unexpected_reg;
-	if (!existing->unexpected_reg &&
-	    !funcs__match(encoder, func, encoder->btf, &state,
-			   encoder->btf, existing))
-		existing->inconsistent_proto = 1;
+	list_add_tail(&state->node, &encoder->func_states);
+	return 0;
 out:
-	zfree(&state.annots);
-	zfree(&state.parms);
+	zfree(&state->annots);
+	zfree(&state->parms);
+	free(state);
 	return err;
 }
 
 static int32_t btf_encoder__add_func(struct btf_encoder *encoder,
-				     struct elf_function *func)
+				     struct btf_encoder_func_state *state)
 {
-	struct btf_encoder_func_state *state = &func->state;
+	struct elf_function *func = state->elf;
 	int btf_fnproto_id, btf_fn_id, tag_type_id = 0;
 	int16_t component_idx = -1;
 	const char *name;
@@ -1176,7 +1155,7 @@ static int32_t btf_encoder__add_func(struct btf_encoder *encoder,
 	char tmp_value[KSYM_NAME_LEN];
 	uint16_t idx;
 
-	btf_fnproto_id = btf_encoder__add_func_proto(encoder, NULL, func);
+	btf_fnproto_id = btf_encoder__add_func_proto(encoder, NULL, state);
 	name = func->alias ?: func->name;
 	if (btf_fnproto_id >= 0)
 		btf_fn_id = btf_encoder__add_ref_type(encoder, BTF_KIND_FUNC, btf_fnproto_id,
@@ -1214,62 +1193,6 @@ static int32_t btf_encoder__add_func(struct btf_encoder *encoder,
 	return 0;
 }
 
-static int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
-{
-	int i;
-
-	for (i = 0; i < encoder->functions.cnt; i++) {
-		struct elf_function *func = &encoder->functions.entries[i];
-		struct btf_encoder_func_state *state = &func->state;
-		struct btf_encoder *other_encoder = NULL;
-
-		if (!state->initialized || state->processed)
-			continue;
-		/* merge optimized-out status across encoders; since each
-		 * encoder has the same elf symbol table we can use the
-		 * same index to access the same elf symbol.
-		 */
-		btf_encoders__for_each_encoder(other_encoder) {
-			struct elf_function *other_func;
-			struct btf_encoder_func_state *other_state;
-			uint8_t optimized, unexpected, inconsistent;
-
-			if (other_encoder == encoder)
-				continue;
-
-			other_func = &other_encoder->functions.entries[i];
-			other_state = &other_func->state;
-			if (!other_state->initialized)
-				continue;
-			optimized = state->optimized_parms | other_state->optimized_parms;
-			unexpected = state->unexpected_reg | other_state->unexpected_reg;
-			inconsistent = state->inconsistent_proto | other_state->inconsistent_proto;
-			if (!unexpected && !inconsistent &&
-			    !funcs__match(encoder, func,
-					  encoder->btf, state,
-					  other_encoder->btf, other_state))
-				inconsistent = 1;
-			state->optimized_parms = other_state->optimized_parms = optimized;
-			state->unexpected_reg = other_state->unexpected_reg = unexpected;
-			state->inconsistent_proto = other_state->inconsistent_proto = inconsistent;
-
-			other_state->processed = 1;
-		}
-		/* do not exclude functions with optimized-out parameters; they
-		 * may still be _called_ with the right parameter values, they
-		 * just do not _use_ them.  Only exclude functions with
-		 * unexpected register use or multiple inconsistent prototypes.
-		 */
-		if (!encoder->skip_encoding_inconsistent_proto ||
-		    (!state->unexpected_reg && !state->inconsistent_proto)) {
-			if (btf_encoder__add_func(encoder, func))
-				return -1;
-		}
-		state->processed = 1;
-	}
-	return 0;
-}
-
 static int functions_cmp(const void *_a, const void *_b)
 {
 	const struct elf_function *a = _a;
@@ -1297,6 +1220,107 @@ static void *reallocarray_grow(void *ptr, int *nmemb, size_t size)
 	return new;
 }
 
+static int saved_functions_cmp(const void *_a, const void *_b)
+{
+	struct btf_encoder_func_state * const *a = _a;
+	struct btf_encoder_func_state * const *b = _b;
+
+	return functions_cmp((*a)->elf, (*b)->elf);
+}
+
+static int saved_functions_combine(void *_a, void *_b)
+{
+	uint8_t optimized, unexpected, inconsistent;
+	struct btf_encoder_func_state *a = _a;
+	struct btf_encoder_func_state *b = _b;
+	int ret;
+
+	ret = strncmp(a->elf->name, b->elf->name,
+		      max(a->elf->prefixlen, b->elf->prefixlen));
+	if (ret != 0)
+		return ret;
+	optimized = a->optimized_parms | b->optimized_parms;
+	unexpected = a->unexpected_reg | b->unexpected_reg;
+	inconsistent = a->inconsistent_proto | b->inconsistent_proto;
+	if (!unexpected && !inconsistent && !funcs__match(a, b))
+		inconsistent = 1;
+	a->optimized_parms = b->optimized_parms = optimized;
+	a->unexpected_reg = b->unexpected_reg = unexpected;
+	a->inconsistent_proto = b->inconsistent_proto = inconsistent;
+
+	return 0;
+}
+
+static void btf_encoder__delete_saved_funcs(struct btf_encoder *encoder)
+{
+	struct btf_encoder_func_state *pos, *s;
+
+	list_for_each_entry_safe(pos, s, &encoder->func_states, node) {
+		list_del(&pos->node);
+		free(pos->parms);
+		free(pos->annots);
+		free(pos);
+	}
+
+	for (int i = 0; i < encoder->functions.cnt; i++)
+		free(encoder->functions.entries[i].alias);
+}
+
+int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
+{
+	struct btf_encoder_func_state **saved_fns, *s;
+	struct btf_encoder *e = NULL;
+	int i = 0, j, nr_saved_fns = 0;
+
+	/* Retrieve function states from each encoder, combine them
+	 * and sort by name, addr.
+	 */
+	btf_encoders__for_each_encoder(e) {
+		list_for_each_entry(s, &e->func_states, node)
+			nr_saved_fns++;
+	}
+
+	if (nr_saved_fns == 0)
+		return 0;
+
+	saved_fns = calloc(nr_saved_fns, sizeof(*saved_fns));
+	btf_encoders__for_each_encoder(e) {
+		list_for_each_entry(s, &e->func_states, node)
+			saved_fns[i++] = s;
+	}
+	qsort(saved_fns, nr_saved_fns, sizeof(*saved_fns), saved_functions_cmp);
+
+	for (i = 0; i < nr_saved_fns; i = j) {
+		struct btf_encoder_func_state *state = saved_fns[i];
+
+		/* Compare across sorted functions that match by name/prefix;
+		 * share inconsistent/unexpected reg state between them.
+		 */
+		j = i + 1;
+
+		while (j < nr_saved_fns && saved_functions_combine(saved_fns[i], saved_fns[j]) == 0)
+			j++;
+
+		/* do not exclude functions with optimized-out parameters; they
+		 * may still be _called_ with the right parameter values, they
+		 * just do not _use_ them.  Only exclude functions with
+		 * unexpected register use or multiple inconsistent prototypes.
+		 */
+		if (!encoder->skip_encoding_inconsistent_proto ||
+		    (!state->unexpected_reg && !state->inconsistent_proto)) {
+			if (btf_encoder__add_func(state->encoder, state)) {
+				free(saved_fns);
+				return -1;
+			}
+		}
+	}
+	/* Now that we are done with function states, free them. */
+	free(saved_fns);
+	btf_encoders__for_each_encoder(e)
+		btf_encoder__delete_saved_funcs(e);
+	return 0;
+}
+
 static int btf_encoder__collect_function(struct btf_encoder *encoder, GElf_Sym *sym)
 {
 	struct elf_function *new;
@@ -1330,6 +1354,8 @@ static int btf_encoder__collect_function(struct btf_encoder *encoder, GElf_Sym *
 
 		encoder->functions.suffix_cnt++;
 		encoder->functions.entries[encoder->functions.cnt].prefixlen = suffix - name;
+	} else {
+		encoder->functions.entries[encoder->functions.cnt].prefixlen = strlen(name);
 	}
 	encoder->functions.cnt++;
 	return 0;
@@ -2353,6 +2379,7 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 		encoder->need_index_type = false;
 		encoder->array_index_id  = 0;
 		encoder->encode_vars = 0;
+		INIT_LIST_HEAD(&encoder->func_states);
 		if (!conf_load->skip_encoding_btf_vars)
 			encoder->encode_vars |= BTF_VAR_PERCPU;
 		if (conf_load->encode_btf_global_vars)
@@ -2437,16 +2464,8 @@ out_delete:
 	return NULL;
 }
 
-void btf_encoder__delete_func(struct elf_function *func)
-{
-	free(func->alias);
-	zfree(&func->state.annots);
-	zfree(&func->state.parms);
-}
-
 void btf_encoder__delete(struct btf_encoder *encoder)
 {
-	int i;
 	size_t shndx;
 
 	if (encoder == NULL)
@@ -2455,18 +2474,19 @@ void btf_encoder__delete(struct btf_encoder *encoder)
 	btf_encoders__delete(encoder);
 	for (shndx = 0; shndx < encoder->seccnt; shndx++)
 		__gobuffer__delete(&encoder->secinfo[shndx].secinfo);
+	free(encoder->secinfo);
 	zfree(&encoder->filename);
 	zfree(&encoder->source_filename);
 	btf__free(encoder->btf);
 	encoder->btf = NULL;
 	elf_symtab__delete(encoder->symtab);
 
-	for (i = 0; i < encoder->functions.cnt; i++)
-		btf_encoder__delete_func(&encoder->functions.entries[i]);
 	encoder->functions.allocated = encoder->functions.cnt = 0;
 	free(encoder->functions.entries);
 	encoder->functions.entries = NULL;
 
+	btf_encoder__delete_saved_funcs(encoder);
+
 	free(encoder);
 }
 
@@ -2591,7 +2611,8 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 						       ", has optimized-out parameters" :
 						       fn->proto.unexpected_reg ? ", has unexpected register use by params" :
 						       "");
-					func->alias = strdup(name);
+					if (!func->alias)
+						func->alias = strdup(name);
 				}
 			}
 		} else {
diff --git a/btf_encoder.h b/btf_encoder.h
index 824963b..9b26162 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -33,5 +33,6 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 struct btf *btf_encoder__btf(struct btf_encoder *encoder);
 
 int btf_encoder__add_encoder(struct btf_encoder *encoder, struct btf_encoder *other);
+int btf_encoder__add_saved_funcs(struct btf_encoder *encoder);
 
 #endif /* _BTF_ENCODER_H_ */
diff --git a/pahole.c b/pahole.c
index fa5d8c7..a36b732 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3185,6 +3185,7 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
 	if (error)
 		goto out;
 
+	btf_encoder__add_saved_funcs(btf_encoder);
 	for (i = 0; i < nr_threads; i++) {
 		/*
 		 * Merge content of the btf instances of worker threads to the btf
@@ -3843,6 +3844,7 @@ try_sole_arg_as_class_names:
 		}
 
 		err = btf_encoder__encode(btf_encoder);
+		btf_encoder__delete(btf_encoder);
 		if (err) {
 			fputs("Failed to encode BTF\n", stderr);
 			goto out_cus_delete;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 03/10] dwarf_loader: introduce pre_load_module hook to conf_load
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
  2024-12-13 22:36 ` [PATCH dwarves v2 01/10] btf_encoder: simplify function encoding Ihor Solodrai
  2024-12-13 22:36 ` [PATCH dwarves v2 02/10] btf_encoder: separate elf function, saved function representations Ihor Solodrai
@ 2024-12-13 22:36 ` Ihor Solodrai
  2024-12-13 22:37 ` [PATCH dwarves v2 04/10] btf_encoder: introduce elf_functions struct type Ihor Solodrai
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:36 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

Add a function pointer to conf_load that is called immediately after
the Elf handle is extracted from the Dwfl_Module in cus__process_dwflmod.

This is preparation for making the elf_functions table shared between
encoders. The shared table can be built as soon as the relevant Elf is
available.

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>

Link: https://lore.kernel.org/dwarves/20241128012341.4081072-5-ihor.solodrai@pm.me/
---
 dwarf_loader.c | 6 ++++++
 dwarves.h      | 1 +
 2 files changed, 7 insertions(+)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index 598fde4..bd65c56 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -3826,6 +3826,12 @@ static int cus__process_dwflmod(Dwfl_Module *dwflmod,
 	Dwarf *dw = dwfl_module_getdwarf(dwflmod, &dwbias);
 
 	int err = DWARF_CB_OK;
+	if (parms->conf->pre_load_module) {
+		err = parms->conf->pre_load_module(dwflmod, elf);
+		if (err)
+			return DWARF_CB_ABORT;
+	}
+
 	if (dw != NULL) {
 		++parms->nr_dwarf_sections_found;
 		err = cus__load_module(cus, parms->conf, dwflmod, dw, elf,
diff --git a/dwarves.h b/dwarves.h
index 1cb0d62..d516d52 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -107,6 +107,7 @@ struct conf_load {
 	struct conf_fprintf	*conf_fprintf;
 	int			(*threads_prepare)(struct conf_load *conf, int nr_threads, void **thr_data);
 	int			(*threads_collect)(struct conf_load *conf, int nr_threads, void **thr_data, int error);
+	int			(*pre_load_module)(Dwfl_Module *mod, Elf *elf);
 };
 
 /** struct conf_fprintf - hints to the __fprintf routines
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 04/10] btf_encoder: introduce elf_functions struct type
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (2 preceding siblings ...)
  2024-12-13 22:36 ` [PATCH dwarves v2 03/10] dwarf_loader: introduce pre_load_module hook to conf_load Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-13 22:37 ` [PATCH dwarves v2 05/10] btf_encoder: collect elf_functions in btf_encoder__pre_load_module Ihor Solodrai
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

Extract elf_functions struct type from btf_encoder.

Replace the routines operating on the functions table in btf_encoder
with routines operating on elf_functions:

- btf_encoder__collect_function -> elf_functions__collect_function
- btf_encoder__collect_symbols -> elf_functions__collect

These functions no longer depend on a btf_encoder being passed to them
as a parameter.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>

Link: https://lore.kernel.org/dwarves/20241128012341.4081072-6-ihor.solodrai@pm.me/
---
 btf_encoder.c | 120 +++++++++++++++++++++++++++-----------------------
 1 file changed, 64 insertions(+), 56 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index 523901c..5c1e6e0 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -101,6 +101,13 @@ struct elf_secinfo {
 	struct gobuffer secinfo;
 };
 
+struct elf_functions {
+	struct elf_symtab *symtab;
+	struct elf_function *entries;
+	int cnt;
+	int suffix_cnt; /* number of .isra, .part etc */
+};
+
 /*
  * cu: cu being processed.
  */
@@ -127,12 +134,7 @@ struct btf_encoder {
 	size_t             seccnt;
 	int                encode_vars;
 	struct list_head   func_states;
-	struct {
-		struct elf_function *entries;
-		int		    allocated;
-		int		    cnt;
-		int		    suffix_cnt; /* number of .isra, .part etc */
-	} functions;
+	struct elf_functions functions;
 };
 
 struct btf_func {
@@ -1210,16 +1212,6 @@ static int functions_cmp(const void *_a, const void *_b)
 #define max(x, y) ((x) < (y) ? (y) : (x))
 #endif
 
-static void *reallocarray_grow(void *ptr, int *nmemb, size_t size)
-{
-	int new_nmemb = max(1000, *nmemb * 3 / 2);
-	void *new = realloc(ptr, new_nmemb * size);
-
-	if (new)
-		*nmemb = new_nmemb;
-	return new;
-}
-
 static int saved_functions_cmp(const void *_a, const void *_b)
 {
 	struct btf_encoder_func_state * const *a = _a;
@@ -1321,44 +1313,29 @@ int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
 	return 0;
 }
 
-static int btf_encoder__collect_function(struct btf_encoder *encoder, GElf_Sym *sym)
+static void elf_functions__collect_function(struct elf_functions *functions, GElf_Sym *sym)
 {
-	struct elf_function *new;
+	struct elf_function *func;
 	const char *name;
 
 	if (elf_sym__type(sym) != STT_FUNC)
-		return 0;
-	name = elf_sym__name(sym, encoder->symtab);
-	if (!name)
-		return 0;
+		return;
 
-	if (encoder->functions.cnt == encoder->functions.allocated) {
-		new = reallocarray_grow(encoder->functions.entries,
-					&encoder->functions.allocated,
-					sizeof(*encoder->functions.entries));
-		if (!new) {
-			/*
-			 * The cleanup - delete_functions is called
-			 * in btf_encoder__encode_cu error path.
-			 */
-			return -1;
-		}
-		encoder->functions.entries = new;
-	}
+	name = elf_sym__name(sym, functions->symtab);
+	if (!name)
+		return;
 
-	memset(&encoder->functions.entries[encoder->functions.cnt], 0,
-	       sizeof(*new));
-	encoder->functions.entries[encoder->functions.cnt].name = name;
+	func = &functions->entries[functions->cnt];
+	func->name = name;
 	if (strchr(name, '.')) {
 		const char *suffix = strchr(name, '.');
-
-		encoder->functions.suffix_cnt++;
-		encoder->functions.entries[encoder->functions.cnt].prefixlen = suffix - name;
+		functions->suffix_cnt++;
+		func->prefixlen = suffix - name;
 	} else {
-		encoder->functions.entries[encoder->functions.cnt].prefixlen = strlen(name);
+		func->prefixlen = strlen(name);
 	}
-	encoder->functions.cnt++;
-	return 0;
+
+	functions->cnt++;
 }
 
 static struct elf_function *btf_encoder__find_function(const struct btf_encoder *encoder,
@@ -2131,26 +2108,56 @@ int btf_encoder__encode(struct btf_encoder *encoder)
 	return err;
 }
 
-
-static int btf_encoder__collect_symbols(struct btf_encoder *encoder)
+static int elf_functions__collect(struct elf_functions *functions)
 {
-	uint32_t sym_sec_idx;
+	uint32_t nr_symbols = elf_symtab__nr_symbols(functions->symtab);
+	struct elf_function *tmp;
+	Elf32_Word sym_sec_idx;
 	uint32_t core_id;
 	GElf_Sym sym;
+	int err;
 
-	elf_symtab__for_each_symbol_index(encoder->symtab, core_id, sym, sym_sec_idx) {
-		if (btf_encoder__collect_function(encoder, &sym))
-			return -1;
+	/* We know that number of functions is less than number of symbols,
+	 * so we can overallocate temporarily.
+	 */
+	functions->entries = calloc(nr_symbols, sizeof(*functions->entries));
+	if (!functions->entries) {
+		err = -ENOMEM;
+		goto out_free;
+	}
+
+	functions->cnt = 0;
+	elf_symtab__for_each_symbol_index(functions->symtab, core_id, sym, sym_sec_idx) {
+		elf_functions__collect_function(functions, &sym);
 	}
 
-	if (encoder->functions.cnt) {
-		qsort(encoder->functions.entries, encoder->functions.cnt, sizeof(encoder->functions.entries[0]),
+	if (functions->cnt) {
+		qsort(functions->entries,
+		      functions->cnt,
+		      sizeof(*functions->entries),
 		      functions_cmp);
-		if (encoder->verbose)
-			printf("Found %d functions!\n", encoder->functions.cnt);
+	} else {
+		err = 0;
+		goto out_free;
+	}
+
+	/* Reallocate to the exact size */
+	tmp = realloc(functions->entries, functions->cnt * sizeof(struct elf_function));
+	if (tmp) {
+		functions->entries = tmp;
+	} else {
+		fprintf(stderr, "could not reallocate memory for elf_functions table\n");
+		err = -ENOMEM;
+		goto out_free;
 	}
 
 	return 0;
+
+out_free:
+	free(functions->entries);
+	functions->entries = NULL;
+	functions->cnt = 0;
+	return err;
 }
 
 static bool ftype__has_arg_names(const struct ftype *ftype)
@@ -2411,6 +2418,7 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 				printf("%s: '%s' doesn't have symtab.\n", __func__, cu->filename);
 			goto out;
 		}
+		encoder->functions.symtab = encoder->symtab;
 
 		/* index the ELF sections for later lookup */
 
@@ -2449,7 +2457,7 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 		if (!found_percpu && encoder->verbose)
 			printf("%s: '%s' doesn't have '%s' section\n", __func__, cu->filename, PERCPU_SECTION);
 
-		if (btf_encoder__collect_symbols(encoder))
+		if (elf_functions__collect(&encoder->functions))
 			goto out_delete;
 
 		if (encoder->verbose)
@@ -2481,7 +2489,7 @@ void btf_encoder__delete(struct btf_encoder *encoder)
 	encoder->btf = NULL;
 	elf_symtab__delete(encoder->symtab);
 
-	encoder->functions.allocated = encoder->functions.cnt = 0;
+	encoder->functions.cnt = 0;
 	free(encoder->functions.entries);
 	encoder->functions.entries = NULL;
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 05/10] btf_encoder: collect elf_functions in btf_encoder__pre_load_module
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (3 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 04/10] btf_encoder: introduce elf_functions struct type Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-13 22:37 ` [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table Ihor Solodrai
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

Introduce a global elf_functions_list variable in btf_encoder.c that
holds one elf_functions table per ELF.

An elf_functions structure is allocated and filled out by the
btf_encoder__pre_load_module() hook, and the list is cleared after
btf_encoder__encode() is done.

At this point btf_encoders don't use the shared elf_functions yet
(each maintains its own copy as before), but the table is now built
before the encoders are initialized.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>

Link: https://lore.kernel.org/dwarves/20241128012341.4081072-7-ihor.solodrai@pm.me/
---
 btf_encoder.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
 btf_encoder.h |  2 ++
 pahole.c      |  3 +++
 3 files changed, 71 insertions(+)

diff --git a/btf_encoder.c b/btf_encoder.c
index 5c1e6e0..88d2872 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -102,6 +102,8 @@ struct elf_secinfo {
 };
 
 struct elf_functions {
+	struct list_head node; /* for elf_functions_list */
+	Elf *elf; /* source ELF */
 	struct elf_symtab *symtab;
 	struct elf_function *entries;
 	int cnt;
@@ -148,6 +150,67 @@ struct btf_kfunc_set_range {
 	uint64_t end;
 };
 
+
+/* In principle, multiple ELFs can be processed in one pahole run,
+ * so we have to store elf_functions table per ELF.
+ * An element is added to the list on btf_encoder__pre_load_module,
+ * and removed after btf_encoder__encode is done.
+ */
+static LIST_HEAD(elf_functions_list);
+
+static inline void elf_functions__delete(struct elf_functions *funcs)
+{
+	free(funcs->entries);
+	elf_symtab__delete(funcs->symtab);
+	list_del(&funcs->node);
+	free(funcs);
+}
+
+static inline void elf_functions__delete_all(void)
+{
+	struct list_head *pos, *tmp;
+
+	list_for_each_safe(pos, tmp, &elf_functions_list) {
+		struct elf_functions *funcs = list_entry(pos, struct elf_functions, node);
+
+		elf_functions__delete(funcs);
+	}
+}
+
+static int elf_functions__collect(struct elf_functions *functions);
+
+int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf)
+{
+	struct elf_functions *funcs;
+	int err;
+
+	funcs = calloc(1, sizeof(*funcs));
+	if (!funcs) {
+		err = -ENOMEM;
+		goto out_delete;
+	}
+
+	funcs->symtab = elf_symtab__new(NULL, elf);
+	if (!funcs->symtab) {
+		err = -1;
+		goto out_delete;
+	}
+
+	funcs->elf = elf;
+	err = elf_functions__collect(funcs);
+	if (err)
+		goto out_delete;
+
+	list_add_tail(&funcs->node, &elf_functions_list);
+
+	return 0;
+
+out_delete:
+	elf_functions__delete(funcs);
+	return err;
+}
+
+
 static LIST_HEAD(encoders);
 static pthread_mutex_t encoders__lock = PTHREAD_MUTEX_INITIALIZER;
 
@@ -2105,6 +2168,8 @@ int btf_encoder__encode(struct btf_encoder *encoder)
 #endif
 		err = btf_encoder__write_elf(encoder, encoder->btf, BTF_ELF_SEC);
 	}
+
+	elf_functions__delete_all();
 	return err;
 }
 
@@ -2419,6 +2484,7 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 			goto out;
 		}
 		encoder->functions.symtab = encoder->symtab;
+		encoder->functions.elf = cu->elf;
 
 		/* index the ELF sections for later lookup */
 
diff --git a/btf_encoder.h b/btf_encoder.h
index 9b26162..7fa0390 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -35,4 +35,6 @@ struct btf *btf_encoder__btf(struct btf_encoder *encoder);
 int btf_encoder__add_encoder(struct btf_encoder *encoder, struct btf_encoder *other);
 int btf_encoder__add_saved_funcs(struct btf_encoder *encoder);
 
+int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf);
+
 #endif /* _BTF_ENCODER_H_ */
diff --git a/pahole.c b/pahole.c
index a36b732..17af0b4 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3738,6 +3738,9 @@ int main(int argc, char *argv[])
 		conf_load.threads_collect = pahole_threads_collect;
 	}
 
+	if (btf_encode)
+		conf_load.pre_load_module = btf_encoder__pre_load_module;
+
 	// Make 'pahole --header type < file' a shorter form of 'pahole -C type --count 1 < file'
 	if (conf.header_type && !class_name && prettify_input) {
 		conf.count = 1;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (4 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 05/10] btf_encoder: collect elf_functions in btf_encoder__pre_load_module Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-19 14:58   ` Jiri Olsa
  2024-12-13 22:37 ` [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context Ihor Solodrai
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

Do not collect functions from ELF for each new btf_encoder; instead,
set a pointer to a shared elf_functions table, built beforehand by
btf_encoder__pre_load_module().

Set btf_encoder's ELF symbol table to one associated with the
corresponding elf_functions, instead of maintaining a copy.

In the single-threaded case, call btf_encoder__add_saved_funcs() right
before btf_encoder__encode(), rather than within it.

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>

Link: https://lore.kernel.org/bpf/20241128012341.4081072-8-ihor.solodrai@pm.me/
---
 btf_encoder.c | 49 +++++++++++++++++++++++++------------------------
 pahole.c      | 10 ++++++++--
 2 files changed, 33 insertions(+), 26 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index 88d2872..4a4f6c8 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -136,7 +136,7 @@ struct btf_encoder {
 	size_t             seccnt;
 	int                encode_vars;
 	struct list_head   func_states;
-	struct elf_functions functions;
+	struct elf_functions *functions;
 };
 
 struct btf_func {
@@ -158,8 +158,23 @@ struct btf_kfunc_set_range {
  */
 static LIST_HEAD(elf_functions_list);
 
+static struct elf_functions *elf_functions__get(Elf *elf)
+{
+	struct list_head *pos;
+
+	list_for_each(pos, &elf_functions_list) {
+		struct elf_functions *funcs = list_entry(pos, struct elf_functions, node);
+
+		if (funcs->elf == elf)
+			return funcs;
+	}
+	return NULL;
+}
+
 static inline void elf_functions__delete(struct elf_functions *funcs)
 {
+	for (int i = 0; i < funcs->cnt; i++)
+		free(funcs->entries[i].alias);
 	free(funcs->entries);
 	elf_symtab__delete(funcs->symtab);
 	list_del(&funcs->node);
@@ -168,11 +183,11 @@ static inline void elf_functions__delete(struct elf_functions *funcs)
 
 static inline void elf_functions__delete_all(void)
 {
+	struct elf_functions *funcs;
 	struct list_head *pos, *tmp;
 
 	list_for_each_safe(pos, tmp, &elf_functions_list) {
-		struct elf_functions *funcs = list_entry(pos, struct elf_functions, node);
-
+		funcs = list_entry(pos, struct elf_functions, node);
 		elf_functions__delete(funcs);
 	}
 }
@@ -1316,9 +1331,6 @@ static void btf_encoder__delete_saved_funcs(struct btf_encoder *encoder)
 		free(pos->annots);
 		free(pos);
 	}
-
-	for (int i = 0; i < encoder->functions.cnt; i++)
-		free(encoder->functions.entries[i].alias);
 }
 
 int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
@@ -1406,7 +1418,7 @@ static struct elf_function *btf_encoder__find_function(const struct btf_encoder
 {
 	struct elf_function key = { .name = name, .prefixlen = prefixlen };
 
-	return bsearch(&key, encoder->functions.entries, encoder->functions.cnt, sizeof(key), functions_cmp);
+	return bsearch(&key, encoder->functions->entries, encoder->functions->cnt, sizeof(key), functions_cmp);
 }
 
 static bool btf_name_char_ok(char c, bool first)
@@ -2116,9 +2128,6 @@ int btf_encoder__encode(struct btf_encoder *encoder)
 	int err;
 	size_t shndx;
 
-	/* for single-threaded case, saved funcs are added here */
-	btf_encoder__add_saved_funcs(encoder);
-
 	for (shndx = 1; shndx < encoder->seccnt; shndx++)
 		if (gobuffer__size(&encoder->secinfo[shndx].secinfo))
 			btf_encoder__add_datasec(encoder, shndx);
@@ -2477,14 +2486,13 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 			goto out_delete;
 		}
 
-		encoder->symtab = elf_symtab__new(NULL, cu->elf);
+		encoder->functions = elf_functions__get(cu->elf);
+		encoder->symtab = encoder->functions->symtab;
 		if (!encoder->symtab) {
 			if (encoder->verbose)
 				printf("%s: '%s' doesn't have symtab.\n", __func__, cu->filename);
 			goto out;
 		}
-		encoder->functions.symtab = encoder->symtab;
-		encoder->functions.elf = cu->elf;
 
 		/* index the ELF sections for later lookup */
 
@@ -2523,9 +2531,6 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 		if (!found_percpu && encoder->verbose)
 			printf("%s: '%s' doesn't have '%s' section\n", __func__, cu->filename, PERCPU_SECTION);
 
-		if (elf_functions__collect(&encoder->functions))
-			goto out_delete;
-
 		if (encoder->verbose)
 			printf("File %s:\n", cu->filename);
 		btf_encoders__add(encoder);
@@ -2553,12 +2558,8 @@ void btf_encoder__delete(struct btf_encoder *encoder)
 	zfree(&encoder->source_filename);
 	btf__free(encoder->btf);
 	encoder->btf = NULL;
-	elf_symtab__delete(encoder->symtab);
-
-	encoder->functions.cnt = 0;
-	free(encoder->functions.entries);
-	encoder->functions.entries = NULL;
-
+	encoder->symtab = NULL;
+	encoder->functions = NULL;
 	btf_encoder__delete_saved_funcs(encoder);
 
 	free(encoder);
@@ -2657,7 +2658,7 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 			continue;
 		if (!ftype__has_arg_names(&fn->proto))
 			continue;
-		if (encoder->functions.cnt) {
+		if (encoder->functions->cnt) {
 			const char *name;
 
 			name = function__name(fn);
@@ -2666,7 +2667,7 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 
 			/* prefer exact function name match... */
 			func = btf_encoder__find_function(encoder, name, 0);
-			if (!func && encoder->functions.suffix_cnt &&
+			if (!func && encoder->functions->suffix_cnt &&
 			    conf_load->btf_gen_optimized) {
 				/* falling back to name.isra.0 match if no exact
 				 * match is found; only bother if we found any
diff --git a/pahole.c b/pahole.c
index 17af0b4..eb2e71a 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3185,13 +3185,16 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
 	if (error)
 		goto out;
 
-	btf_encoder__add_saved_funcs(btf_encoder);
+	err = btf_encoder__add_saved_funcs(btf_encoder);
+	if (err < 0)
+		goto out;
+
 	for (i = 0; i < nr_threads; i++) {
 		/*
 		 * Merge content of the btf instances of worker threads to the btf
 		 * instance of the primary btf_encoder.
                 */
-		if (!threads[i]->btf)
+		if (!threads[i]->encoder || !threads[i]->btf)
 			continue;
 		err = btf_encoder__add_encoder(btf_encoder, threads[i]->encoder);
 		if (err < 0)
@@ -3846,6 +3849,9 @@ try_sole_arg_as_class_names:
 			exit(1);
 		}
 
+		if (conf_load.nr_jobs <= 1 || conf_load.reproducible_build)
+			btf_encoder__add_saved_funcs(btf_encoder);
+
 		err = btf_encoder__encode(btf_encoder);
 		btf_encoder__delete(btf_encoder);
 		if (err) {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (5 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-17  2:39   ` Eduard Zingerman
  2024-12-13 22:37 ` [PATCH dwarves v2 08/10] btf_encoder: remove skip_encoding_inconsistent_proto Ihor Solodrai
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

Introduce a static struct holding global data necessary for BTF
encoding: elf_functions tables and btf_encoder structs.

The context has an init()/exit() interface that should be used to
indicate when BTF encoding work starts and ends.

I considered freeing everything contained in the context exclusively
on exit(); however, it turns out this unnecessarily increases max RSS,
probably because the work done in btf_encoder__encode() requires
relatively more memory, and if the encoders and tables are freed
earlier, that space can be reused.

Compare:
    -j4: 	Maximum resident set size (kbytes): 868484
    -j8: 	Maximum resident set size (kbytes): 1003040
    -j16: 	Maximum resident set size (kbytes): 1039416
    -j32: 	Maximum resident set size (kbytes): 1145312
vs
    -j4: 	Maximum resident set size (kbytes): 972692
    -j8: 	Maximum resident set size (kbytes): 1043184
    -j16: 	Maximum resident set size (kbytes): 1081156
    -j32: 	Maximum resident set size (kbytes): 1218184

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
 btf_encoder.c | 108 +++++++++++++++++++++++++++++++++++++++-----------
 btf_encoder.h |   3 ++
 pahole.c      |  10 ++++-
 3 files changed, 96 insertions(+), 25 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index 4a4f6c8..a362fb2 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -150,19 +150,73 @@ struct btf_kfunc_set_range {
 	uint64_t end;
 };
 
+static struct {
+	bool initialized;
+
+	/* In principle, multiple ELFs can be processed in one pahole
+	 * run, so we have to store elf_functions table per ELF.
+	 * An elf_functions struct is added to the list in
+	 * btf_encoder__pre_load_module().
+	 * The list is cleared at the end of btf_encoder__add_saved_funcs().
+	 */
+	struct list_head elf_functions_list;
+
+	/* The mutex only needed for add/delete, as this can happen in
+	 * multiple encoding threads.  A btf_encoder is added to this
+	 * list in btf_encoder__new(), and removed in btf_encoder__delete().
+	 * All encoders except the main one (`btf_encoder` in pahole.c)
+	 * are deleted in pahole_threads_collect().
+	 */
+	pthread_mutex_t btf_encoder_list_lock;
+	struct list_head btf_encoder_list;
+
+} btf_encoding_context;
+
+int btf_encoding_context__init(void)
+{
+	int err = 0;
+
+	if (btf_encoding_context.initialized) {
+		fprintf(stderr, "%s was called while context is already initialized\n", __func__);
+		err = -1;
+		goto out;
+	}
+
+	INIT_LIST_HEAD(&btf_encoding_context.elf_functions_list);
+	INIT_LIST_HEAD(&btf_encoding_context.btf_encoder_list);
+	pthread_mutex_init(&btf_encoding_context.btf_encoder_list_lock, NULL);
+	btf_encoding_context.initialized = true;
+
+out:
+	return err;
+}
+
+static inline void btf_encoder__delete_all(void);
+static inline void elf_functions__delete_all(void);
+
+void btf_encoding_context__exit(void)
+{
+	if (!btf_encoding_context.initialized) {
+		fprintf(stderr, "%s was called while context is not initialized\n", __func__);
+		return;
+	}
+
+	if (!list_empty(&btf_encoding_context.elf_functions_list))
+		elf_functions__delete_all();
+
+	if (!list_empty(&btf_encoding_context.btf_encoder_list))
+		btf_encoder__delete_all();
+
+	pthread_mutex_destroy(&btf_encoding_context.btf_encoder_list_lock);
+	btf_encoding_context.initialized = false;
+}
 
-/* In principle, multiple ELFs can be processed in one pahole run,
- * so we have to store elf_functions table per ELF.
- * An element is added to the list on btf_encoder__pre_load_module,
- * and removed after btf_encoder__encode is done.
- */
-static LIST_HEAD(elf_functions_list);
 
 static struct elf_functions *elf_functions__get(Elf *elf)
 {
 	struct list_head *pos;
 
-	list_for_each(pos, &elf_functions_list) {
+	list_for_each(pos, &btf_encoding_context.elf_functions_list) {
 		struct elf_functions *funcs = list_entry(pos, struct elf_functions, node);
 
 		if (funcs->elf == elf)
@@ -186,7 +240,7 @@ static inline void elf_functions__delete_all(void)
 	struct elf_functions *funcs;
 	struct list_head *pos, *tmp;
 
-	list_for_each_safe(pos, tmp, &elf_functions_list) {
+	list_for_each_safe(pos, tmp, &btf_encoding_context.elf_functions_list) {
 		funcs = list_entry(pos, struct elf_functions, node);
 		elf_functions__delete(funcs);
 	}
@@ -216,7 +270,7 @@ int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf)
 	if (err)
 		goto out_delete;
 
-	list_add_tail(&funcs->node, &elf_functions_list);
+	list_add_tail(&funcs->node, &btf_encoding_context.elf_functions_list);
 
 	return 0;
 
@@ -225,29 +279,21 @@ out_delete:
 	return err;
 }
 
-
-static LIST_HEAD(encoders);
-static pthread_mutex_t encoders__lock = PTHREAD_MUTEX_INITIALIZER;
-
-/* mutex only needed for add/delete, as this can happen in multiple encoding
- * threads.  Traversal of the list is currently confined to thread collection.
- */
-
 #define btf_encoders__for_each_encoder(encoder)		\
-	list_for_each_entry(encoder, &encoders, node)
+	list_for_each_entry(encoder, &btf_encoding_context.btf_encoder_list, node)
 
 static void btf_encoders__add(struct btf_encoder *encoder)
 {
-	pthread_mutex_lock(&encoders__lock);
-	list_add_tail(&encoder->node, &encoders);
-	pthread_mutex_unlock(&encoders__lock);
+	pthread_mutex_lock(&btf_encoding_context.btf_encoder_list_lock);
+	list_add_tail(&encoder->node, &btf_encoding_context.btf_encoder_list);
+	pthread_mutex_unlock(&btf_encoding_context.btf_encoder_list_lock);
 }
 
 static void btf_encoders__delete(struct btf_encoder *encoder)
 {
 	struct btf_encoder *existing = NULL;
 
-	pthread_mutex_lock(&encoders__lock);
+	pthread_mutex_lock(&btf_encoding_context.btf_encoder_list_lock);
 	/* encoder may not have been added to list yet; check. */
 	btf_encoders__for_each_encoder(existing) {
 		if (encoder == existing)
@@ -255,7 +301,7 @@ static void btf_encoders__delete(struct btf_encoder *encoder)
 	}
 	if (encoder == existing)
 		list_del(&encoder->node);
-	pthread_mutex_unlock(&encoders__lock);
+	pthread_mutex_unlock(&btf_encoding_context.btf_encoder_list_lock);
 }
 
 #define PERCPU_SECTION ".data..percpu"
@@ -1385,6 +1431,9 @@ int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
 	free(saved_fns);
 	btf_encoders__for_each_encoder(e)
 		btf_encoder__delete_saved_funcs(e);
+
+	elf_functions__delete_all();
+
 	return 0;
 }
 
@@ -2178,7 +2227,6 @@ int btf_encoder__encode(struct btf_encoder *encoder)
 		err = btf_encoder__write_elf(encoder, encoder->btf, BTF_ELF_SEC);
 	}
 
-	elf_functions__delete_all();
 	return err;
 }
 
@@ -2565,6 +2613,18 @@ void btf_encoder__delete(struct btf_encoder *encoder)
 	free(encoder);
 }
 
+static inline void btf_encoder__delete_all(void)
+{
+	struct btf_encoder *encoder;
+	struct list_head *pos, *tmp;
+
+	list_for_each_safe(pos, tmp, &btf_encoding_context.btf_encoder_list) {
+		encoder = list_entry(pos, struct btf_encoder, node);
+		btf_encoder__delete(encoder);
+	}
+}
+
+
 int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct conf_load *conf_load)
 {
 	struct llvm_annotation *annot;
diff --git a/btf_encoder.h b/btf_encoder.h
index 7fa0390..f14edc1 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -23,6 +23,9 @@ enum btf_var_option {
 	BTF_VAR_GLOBAL = 2,
 };
 
+int btf_encoding_context__init(void);
+void btf_encoding_context__exit(void);
+
 struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filename, struct btf *base_btf, bool verbose, struct conf_load *conf_load);
 void btf_encoder__delete(struct btf_encoder *encoder);
 
diff --git a/pahole.c b/pahole.c
index eb2e71a..6bbc9e4 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3741,8 +3741,12 @@ int main(int argc, char *argv[])
 		conf_load.threads_collect = pahole_threads_collect;
 	}
 
-	if (btf_encode)
+	if (btf_encode) {
 		conf_load.pre_load_module = btf_encoder__pre_load_module;
+		err = btf_encoding_context__init();
+		if (err < 0)
+			goto out;
+	}
 
 	// Make 'pahole --header type < file' a shorter form of 'pahole -C type --count 1 < file'
 	if (conf.header_type && !class_name && prettify_input) {
@@ -3859,7 +3863,11 @@ try_sole_arg_as_class_names:
 			goto out_cus_delete;
 		}
 	}
+
 out_ok:
+	if (btf_encode)
+		btf_encoding_context__exit();
+
 	if (stats_formatter != NULL)
 		print_stats();
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 08/10] btf_encoder: remove skip_encoding_inconsistent_proto
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (6 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-13 22:37 ` [PATCH dwarves v2 09/10] dwarf_loader: introduce cu->id Ihor Solodrai
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

This flag is needed only for btf_encoder__add_saved_funcs(), so there
is no reason to keep it in each btf_encoder.

Link: https://lore.kernel.org/dwarves/e1df45360963d265ea5e0b3634f0a3dae0c9c343.camel@gmail.com/
Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
 btf_encoder.c | 6 ++----
 btf_encoder.h | 2 +-
 pahole.c      | 4 ++--
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index a362fb2..0b71498 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -128,7 +128,6 @@ struct btf_encoder {
 			  force,
 			  gen_floats,
 			  skip_encoding_decl_tag,
-			  skip_encoding_inconsistent_proto,
 			  tag_kfuncs,
 			  gen_distilled_base;
 	uint32_t	  array_index_id;
@@ -1379,7 +1378,7 @@ static void btf_encoder__delete_saved_funcs(struct btf_encoder *encoder)
 	}
 }
 
-int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
+int btf_encoder__add_saved_funcs(bool skip_encoding_inconsistent_proto)
 {
 	struct btf_encoder_func_state **saved_fns, *s;
 	struct btf_encoder *e = NULL;
@@ -1419,7 +1418,7 @@ int btf_encoder__add_saved_funcs(struct btf_encoder *encoder)
 		 * just do not _use_ them.  Only exclude functions with
 		 * unexpected register use or multiple inconsistent prototypes.
 		 */
-		if (!encoder->skip_encoding_inconsistent_proto ||
+		if (!skip_encoding_inconsistent_proto ||
 		    (!state->unexpected_reg && !state->inconsistent_proto)) {
 			if (btf_encoder__add_func(state->encoder, state)) {
 				free(saved_fns);
@@ -2500,7 +2499,6 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
 		encoder->force		 = conf_load->btf_encode_force;
 		encoder->gen_floats	 = conf_load->btf_gen_floats;
 		encoder->skip_encoding_decl_tag	 = conf_load->skip_encoding_btf_decl_tag;
-		encoder->skip_encoding_inconsistent_proto = conf_load->skip_encoding_btf_inconsistent_proto;
 		encoder->tag_kfuncs	 = conf_load->btf_decl_tag_kfuncs;
 		encoder->gen_distilled_base = conf_load->btf_gen_distilled_base;
 		encoder->verbose	 = verbose;
diff --git a/btf_encoder.h b/btf_encoder.h
index f14edc1..421cde1 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -36,7 +36,7 @@ int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct co
 struct btf *btf_encoder__btf(struct btf_encoder *encoder);
 
 int btf_encoder__add_encoder(struct btf_encoder *encoder, struct btf_encoder *other);
-int btf_encoder__add_saved_funcs(struct btf_encoder *encoder);
+int btf_encoder__add_saved_funcs(bool skip_encoding_inconsistent_proto);
 
 int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf);
 
diff --git a/pahole.c b/pahole.c
index 6bbc9e4..7964a03 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3185,7 +3185,7 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
 	if (error)
 		goto out;
 
-	err = btf_encoder__add_saved_funcs(btf_encoder);
+	err = btf_encoder__add_saved_funcs(conf_load.skip_encoding_btf_inconsistent_proto);
 	if (err < 0)
 		goto out;
 
@@ -3854,7 +3854,7 @@ try_sole_arg_as_class_names:
 		}
 
 		if (conf_load.nr_jobs <= 1 || conf_load.reproducible_build)
-			btf_encoder__add_saved_funcs(btf_encoder);
+			btf_encoder__add_saved_funcs(conf_load.skip_encoding_btf_inconsistent_proto);
 
 		err = btf_encoder__encode(btf_encoder);
 		btf_encoder__delete(btf_encoder);
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 09/10] dwarf_loader: introduce cu->id
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (7 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 08/10] btf_encoder: remove skip_encoding_inconsistent_proto Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-13 22:37 ` [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model Ihor Solodrai
  2024-12-17  7:00 ` [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Eduard Zingerman
  10 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

Add an id member to struct cu.

The id is the index of a CU, in the order CUs are created in
dwarf_loader.c. This makes it easy to identify a CU, particularly when
CUs need to be processed in order.

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
 dwarf_loader.c | 4 ++++
 dwarves.h      | 1 +
 2 files changed, 5 insertions(+)

diff --git a/dwarf_loader.c b/dwarf_loader.c
index bd65c56..58b165d 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -3440,6 +3440,7 @@ struct dwarf_cus {
 	int		    build_id_len;
 	int		    error;
 	struct dwarf_cu	    *type_dcu;
+	uint32_t	nr_cus_created;
 };
 
 struct dwarf_thread {
@@ -3472,6 +3473,9 @@ static struct dwarf_cu *dwarf_cus__create_cu(struct dwarf_cus *dcus, Dwarf_Die *
 	cu->priv = dcu;
 	cu->dfops = &dwarf__ops;
 
+	cu->id = dcus->nr_cus_created;
+	dcus->nr_cus_created++;
+
 	return dcu;
 }
 
diff --git a/dwarves.h b/dwarves.h
index d516d52..7c80b18 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -291,6 +291,7 @@ struct cu {
 	struct ptr_table functions_table;
 	struct ptr_table tags_table;
 	struct rb_root	 functions;
+	uint32_t	 id;
 	const char	 *name;
 	char		 *filename;
 	void 		 *priv;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (8 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 09/10] dwarf_loader: introduce cu->id Ihor Solodrai
@ 2024-12-13 22:37 ` Ihor Solodrai
  2024-12-17  0:57   ` Eduard Zingerman
                     ` (2 more replies)
  2024-12-17  7:00 ` [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Eduard Zingerman
  10 siblings, 3 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-13 22:37 UTC (permalink / raw)
  To: dwarves; +Cc: acme, alan.maguire, eddyz87, andrii, mykolal, bpf

This is a re-implementation of an idea described in an RFC [1] and
tried in a patch there [2].

The gist of this patch is that multithreading is now contained in
dwarf_loader.c, and is implemented using a jobs queue and a pool of
worker threads. As a consequence, multithreading-related code is
removed from pahole.c.

A single-thread special case is removed: queueing setup works fine
with a single worker, which will switch between jobs as appropriate.

Code supporting the previous version of multithreading, such as
cu_state, thread_data and related functions, is also removed.

The reproducible_build flag is now moot: with these changes, BTF
encoding is always reproducible.

The goal outlined in the RFC [1] - making parallel reproducible BTF
generation as fast as non-reproducible - is achieved by implementing
the requirement of ordered CU encoding (stealing) directly in the job
queue in dwarf_loader.c.

The synchronization in the queue is implemented by a mutex (which
ensures consistency of the queue state) and two condition variables:
job_added and job_taken. The motivation for using condition variables
is the classic one: we want to avoid threads polling the state of the
queue in a busy loop while competing for a single mutex.

[1] https://lore.kernel.org/dwarves/20241128012341.4081072-1-ihor.solodrai@pm.me/
[2] https://lore.kernel.org/dwarves/20241128012341.4081072-10-ihor.solodrai@pm.me/

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
---
 btf_encoder.c               |   8 +-
 btf_encoder.h               |   6 +-
 btf_loader.c                |   2 +-
 ctf_loader.c                |   2 +-
 dwarf_loader.c              | 342 +++++++++++++++++++++++++-----------
 dwarves.c                   |  44 -----
 dwarves.h                   |  21 +--
 pahole.c                    | 236 +++----------------------
 pdwtags.c                   |   3 +-
 pfunct.c                    |   3 +-
 tests/reproducible_build.sh |   5 +-
 11 files changed, 281 insertions(+), 391 deletions(-)

diff --git a/btf_encoder.c b/btf_encoder.c
index 0b71498..20befd6 100644
--- a/btf_encoder.c
+++ b/btf_encoder.c
@@ -163,8 +163,6 @@ static struct {
 	/* The mutex only needed for add/delete, as this can happen in
 	 * multiple encoding threads.  A btf_encoder is added to this
 	 * list in btf_encoder__new(), and removed in btf_encoder__delete().
-	 * All encoders except the main one (`btf_encoder` in pahole.c)
-	 * are deleted in pahole_threads_collect().
 	 */
 	pthread_mutex_t btf_encoder_list_lock;
 	struct list_head btf_encoder_list;
@@ -1378,7 +1376,7 @@ static void btf_encoder__delete_saved_funcs(struct btf_encoder *encoder)
 	}
 }
 
-int btf_encoder__add_saved_funcs(bool skip_encoding_inconsistent_proto)
+static int btf_encoder__add_saved_funcs(bool skip_encoding_inconsistent_proto)
 {
 	struct btf_encoder_func_state **saved_fns, *s;
 	struct btf_encoder *e = NULL;
@@ -2170,12 +2168,14 @@ out:
 	return err;
 }
 
-int btf_encoder__encode(struct btf_encoder *encoder)
+int btf_encoder__encode(struct btf_encoder *encoder, struct conf_load *conf)
 {
 	bool should_tag_kfuncs;
 	int err;
 	size_t shndx;
 
+	btf_encoder__add_saved_funcs(conf->skip_encoding_btf_inconsistent_proto);
+
 	for (shndx = 1; shndx < encoder->seccnt; shndx++)
 		if (gobuffer__size(&encoder->secinfo[shndx].secinfo))
 			btf_encoder__add_datasec(encoder, shndx);
diff --git a/btf_encoder.h b/btf_encoder.h
index 421cde1..1575b61 100644
--- a/btf_encoder.h
+++ b/btf_encoder.h
@@ -29,15 +29,11 @@ void btf_encoding_context__exit(void);
 struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filename, struct btf *base_btf, bool verbose, struct conf_load *conf_load);
 void btf_encoder__delete(struct btf_encoder *encoder);
 
-int btf_encoder__encode(struct btf_encoder *encoder);
-
+int btf_encoder__encode(struct btf_encoder *encoder, struct conf_load *conf);
 int btf_encoder__encode_cu(struct btf_encoder *encoder, struct cu *cu, struct conf_load *conf_load);
 
 struct btf *btf_encoder__btf(struct btf_encoder *encoder);
 
-int btf_encoder__add_encoder(struct btf_encoder *encoder, struct btf_encoder *other);
-int btf_encoder__add_saved_funcs(bool skip_encoding_inconsistent_proto);
-
 int btf_encoder__pre_load_module(Dwfl_Module *mod, Elf *elf);
 
 #endif /* _BTF_ENCODER_H_ */
diff --git a/btf_loader.c b/btf_loader.c
index 4814f29..cff0011 100644
--- a/btf_loader.c
+++ b/btf_loader.c
@@ -730,7 +730,7 @@ static int cus__load_btf(struct cus *cus, struct conf_load *conf, const char *fi
 	 * The app stole this cu, possibly deleting it,
 	 * so forget about it
 	 */
-	if (conf && conf->steal && conf->steal(cu, conf, NULL))
+	if (conf && conf->steal && conf->steal(cu, conf))
 		return 0;
 
 	cus__add(cus, cu);
diff --git a/ctf_loader.c b/ctf_loader.c
index 944bf6e..501c4ab 100644
--- a/ctf_loader.c
+++ b/ctf_loader.c
@@ -728,7 +728,7 @@ int ctf__load_file(struct cus *cus, struct conf_load *conf,
 	 * The app stole this cu, possibly deleting it,
 	 * so forget about it
 	 */
-	if (conf && conf->steal && conf->steal(cu, conf, NULL))
+	if (conf && conf->steal && conf->steal(cu, conf))
 		return 0;
 
 	cus__add(cus, cu);
diff --git a/dwarf_loader.c b/dwarf_loader.c
index 58b165d..6d22648 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -3250,24 +3250,20 @@ static void cu__sort_types_by_offset(struct cu *cu, struct conf_load *conf)
 	cu__for_all_tags(cu, type__sort_by_offset, conf);
 }
 
-static int cu__finalize(struct cu *cu, struct cus *cus, struct conf_load *conf, void *thr_data)
+static void cu__finalize(struct cu *cu, struct cus *cus, struct conf_load *conf)
 {
 	cu__for_all_tags(cu, class_member__cache_byte_size, conf);
 
 	if (cu__language_reorders_offsets(cu))
 		cu__sort_types_by_offset(cu, conf);
-
-	cus__set_cu_state(cus, cu, CU__LOADED);
-
-	if (conf && conf->steal) {
-		return conf->steal(cu, conf, thr_data);
-	}
-	return LSK__KEEPIT;
 }
 
-static int cus__finalize(struct cus *cus, struct cu *cu, struct conf_load *conf, void *thr_data)
+static int cus__steal_now(struct cus *cus, struct cu *cu, struct conf_load *conf)
 {
-	int lsk = cu__finalize(cu, cus, conf, thr_data);
+	if (!conf || !conf->steal)
+		return 0;
+
+	int lsk = conf->steal(cu, conf);
 	switch (lsk) {
 	case LSK__DELETE:
 		cus__remove(cus, cu);
@@ -3443,11 +3439,145 @@ struct dwarf_cus {
 	uint32_t	nr_cus_created;
 };
 
-struct dwarf_thread {
-	struct dwarf_cus	*dcus;
-	void			*data;
+/* Multithreading is implemented using a job/worker model.
+ * cus_processing_queue represents a collection of jobs to be
+ * completed by workers.
+ * dwarf_loader__worker_thread is the worker loop, taking jobs from
+ * the queue and executing them.
+ * Implementation of this queue ensures two constraints:
+ *   * JOB_STEAL jobs for a CU are executed in the order of cu->id, as
+ *     a consequence JOB_STEAL jobs always run one at a time.
+ *   * Workers are limited by max_decoded_cus: a worker will not pick
+ *     up a new JOB_DECODE if this limit is exceeded. It'll wait.
+ */
+static struct {
+	pthread_mutex_t mutex;
+	pthread_cond_t job_added;
+	pthread_cond_t job_taken;
+	/* next_cu_id determines the next CU ready to be stolen.
+	 * This enforces the order of CU stealing.
+	 */
+	uint32_t next_cu_id;
+	/* max_decoded_cus is a soft limit on the number of JOB_STEAL
+	 * jobs currently in the queue (this number is equal to the
+	 * number of decoded CUs held in memory). It's soft, because a
+	 * worker thread may finish decoding its current CU after
+	 * this limit has already been reached. In such a situation,
+	 * JOB_STEAL with this CU is still added to the queue,
+	 * although a worker will not pick up a new JOB_DECODE.
+	 * So the real hard limit is max_decoded_cus + nr_workers.
+	 * This variable indirectly limits the memory usage.
+	 */
+	uint16_t max_decoded_cus;
+	uint16_t nr_decoded_cus;
+	struct list_head jobs;
+} cus_processing_queue;
+
+enum job_type {
+	JOB_NONE = 0,
+	JOB_DECODE = 1,
+	JOB_STEAL = 2,
 };
 
+struct cu_processing_job {
+	struct list_head node;
+	enum job_type type;
+	struct cu *cu; /* for stealing jobs */
+};
+
+static void cus_queue__init(uint16_t max_decoded_cus)
+{
+	pthread_mutex_init(&cus_processing_queue.mutex, NULL);
+	pthread_cond_init(&cus_processing_queue.job_added, NULL);
+	pthread_cond_init(&cus_processing_queue.job_taken, NULL);
+	INIT_LIST_HEAD(&cus_processing_queue.jobs);
+	cus_processing_queue.max_decoded_cus = max_decoded_cus;
+	cus_processing_queue.nr_decoded_cus = 0;
+	cus_processing_queue.next_cu_id = 0;
+}
+
+static void cus_queue__destroy(void)
+{
+	pthread_mutex_destroy(&cus_processing_queue.mutex);
+	pthread_cond_destroy(&cus_processing_queue.job_added);
+	pthread_cond_destroy(&cus_processing_queue.job_taken);
+}
+
+static inline void cus_queue__inc_next_cu_id(void)
+{
+	pthread_mutex_lock(&cus_processing_queue.mutex);
+	cus_processing_queue.next_cu_id++;
+	pthread_mutex_unlock(&cus_processing_queue.mutex);
+}
+
+static void cus_queue__enqueue_job(struct cu_processing_job *job)
+{
+	if (job == NULL)
+		return;
+
+	pthread_mutex_lock(&cus_processing_queue.mutex);
+
+	/* JOB_STEAL jobs have higher priority: add them to the head
+	 * so they can be found faster.
+	 */
+	if (job->type == JOB_STEAL) {
+		list_add(&job->node, &cus_processing_queue.jobs);
+		cus_processing_queue.nr_decoded_cus++;
+	} else {
+		list_add_tail(&job->node, &cus_processing_queue.jobs);
+	}
+
+	pthread_cond_signal(&cus_processing_queue.job_added);
+	pthread_mutex_unlock(&cus_processing_queue.mutex);
+}
+
+static struct cu_processing_job *cus_queue__dequeue_job(void)
+{
+	struct cu_processing_job *job, *dequeued_job = NULL;
+	struct list_head *pos, *tmp;
+
+	pthread_mutex_lock(&cus_processing_queue.mutex);
+	while (list_empty(&cus_processing_queue.jobs))
+		pthread_cond_wait(&cus_processing_queue.job_added, &cus_processing_queue.mutex);
+
+	/* First, try to find JOB_STEAL for the next CU */
+	list_for_each_safe(pos, tmp, &cus_processing_queue.jobs) {
+		job = list_entry(pos, struct cu_processing_job, node);
+		if (job->type == JOB_STEAL && job->cu->id == cus_processing_queue.next_cu_id) {
+			list_del(&job->node);
+			cus_processing_queue.nr_decoded_cus--;
+			dequeued_job = job;
+			break;
+		}
+	}
+
+	/* If no JOB_STEAL is found, check if we are allowed to decode
+	 * more CUs.  If not, it means that the CU with next_cu_id is
+	 * still being decoded while the queue is "full". Wait.
+	 * job_taken will signal that another thread was able to pick
+	 * up a JOB_STEAL, so we might be able to proceed with JOB_DECODE.
+	 */
+	if (dequeued_job == NULL) {
+		while (cus_processing_queue.nr_decoded_cus >= cus_processing_queue.max_decoded_cus)
+			pthread_cond_wait(&cus_processing_queue.job_taken, &cus_processing_queue.mutex);
+
+		/* We can decode now. */
+		list_for_each_safe(pos, tmp, &cus_processing_queue.jobs) {
+			job = list_entry(pos, struct cu_processing_job, node);
+			if (job->type == JOB_DECODE) {
+				list_del(&job->node);
+				dequeued_job = job;
+				break;
+			}
+		}
+	}
+
+	pthread_cond_signal(&cus_processing_queue.job_taken);
+	pthread_mutex_unlock(&cus_processing_queue.mutex);
+
+	return dequeued_job;
+}
+
 static struct dwarf_cu *dwarf_cus__create_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die, uint8_t pointer_size)
 {
 	/*
@@ -3479,28 +3609,6 @@ static struct dwarf_cu *dwarf_cus__create_cu(struct dwarf_cus *dcus, Dwarf_Die *
 	return dcu;
 }
 
-static int dwarf_cus__process_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die,
-				 struct cu *cu, void *thr_data)
-{
-	if (die__process_and_recode(cu_die, cu, dcus->conf) != 0 ||
-	    cus__finalize(dcus->cus, cu, dcus->conf, thr_data) == LSK__STOP_LOADING)
-		return DWARF_CB_ABORT;
-
-       return DWARF_CB_OK;
-}
-
-static int dwarf_cus__create_and_process_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die, uint8_t pointer_size)
-{
-	struct dwarf_cu *dcu = dwarf_cus__create_cu(dcus, cu_die, pointer_size);
-
-	if (dcu == NULL)
-		return DWARF_CB_ABORT;
-
-	cus__add(dcus->cus, dcu->cu);
-
-	return dwarf_cus__process_cu(dcus, cu_die, dcu->cu, NULL);
-}
-
 static int dwarf_cus__nextcu(struct dwarf_cus *dcus, struct dwarf_cu **dcu,
 			     Dwarf_Die *die_mem, Dwarf_Die **cu_die,
 			     uint8_t *pointer_size, uint8_t *offset_size)
@@ -3541,24 +3649,86 @@ out_unlock:
 	return ret;
 }
 
-static void *dwarf_cus__process_cu_thread(void *arg)
+static struct cu *dwarf_loader__decode_next_cu(struct dwarf_cus *dcus)
 {
-	struct dwarf_thread *dthr = arg;
-	struct dwarf_cus *dcus = dthr->dcus;
 	uint8_t pointer_size, offset_size;
+	struct dwarf_cu *dcu = NULL;
 	Dwarf_Die die_mem, *cu_die;
-	struct dwarf_cu *dcu;
+	int err;
 
-	while (dwarf_cus__nextcu(dcus, &dcu, &die_mem, &cu_die, &pointer_size, &offset_size) == 0) {
-		if (cu_die == NULL)
+	err = dwarf_cus__nextcu(dcus, &dcu, &die_mem, &cu_die, &pointer_size, &offset_size);
+
+	if (err < 0)
+		goto out_error;
+	else if (err == 1) /* no more CUs */
+		return NULL;
+
+	err = die__process_and_recode(cu_die, dcu->cu, dcus->conf);
+	if (err)
+		goto out_error;
+	if (cu_die == NULL)
+		return NULL;
+
+	cu__finalize(dcu->cu, dcus->cus, dcus->conf);
+
+	return dcu->cu;
+
+out_error:
+	dcus->error = err;
+	fprintf(stderr, "error decoding cu %s\n", dcu && dcu->cu ? dcu->cu->name : "");
+	return NULL;
+}
+
+static void *dwarf_loader__worker_thread(void *arg)
+{
+	struct cu_processing_job *job;
+	struct dwarf_cus *dcus = arg;
+	bool stop = false;
+	struct cu *cu;
+
+	while (!stop) {
+		job = cus_queue__dequeue_job();
+
+		switch (job->type) {
+
+		case JOB_DECODE:
+			cu = dwarf_loader__decode_next_cu(dcus);
+
+			if (cu == NULL) {
+				free(job);
+				stop = true;
+				break;
+			}
+
+			/* Create and enqueue a new JOB_STEAL for this decoded CU */
+			struct cu_processing_job *steal_job = calloc(1, sizeof(*steal_job));
+
+			steal_job->type = JOB_STEAL;
+			steal_job->cu = cu;
+			cus_queue__enqueue_job(steal_job);
+
+			/* re-enqueue JOB_DECODE so that next CU is decoded from DWARF */
+			cus_queue__enqueue_job(job);
+			break;
+
+		case JOB_STEAL:
+			if (cus__steal_now(dcus->cus, job->cu, dcus->conf) == LSK__STOP_LOADING)
+				goto out_abort;
+			cus_queue__inc_next_cu_id();
+			/* Free the job struct as it's no longer
+			 * needed after CU has been stolen.
+			 * dwarf_loader work for this CU is done.
+			 */
+			free(job);
 			break;
 
-		if (dwarf_cus__process_cu(dcus, cu_die, dcu->cu, dthr->data) == DWARF_CB_ABORT)
+		default:
+			fprintf(stderr, "Unknown dwarf_loader job type %d\n", job->type);
 			goto out_abort;
+		}
 	}
 
-	if (dcus->conf->thread_exit &&
-	    dcus->conf->thread_exit(dcus->conf, dthr->data) != 0)
+	if (dcus->error)
 		goto out_abort;
 
 	return (void *)DWARF_CB_OK;
@@ -3566,29 +3736,29 @@ out_abort:
 	return (void *)DWARF_CB_ABORT;
 }
 
-static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
+static int dwarf_cus__process_cus(struct dwarf_cus *dcus)
 {
-	pthread_t threads[dcus->conf->nr_jobs];
-	struct dwarf_thread dthr[dcus->conf->nr_jobs];
-	void *thread_data[dcus->conf->nr_jobs];
-	int res;
-	int i;
+	int nr_workers = dcus->conf->nr_jobs > 0 ? dcus->conf->nr_jobs : 1;
+	pthread_t workers[nr_workers];
+	struct cu_processing_job *job;
 
-	if (dcus->conf->threads_prepare) {
-		res = dcus->conf->threads_prepare(dcus->conf, dcus->conf->nr_jobs, thread_data);
-		if (res != 0)
-			return res;
-	} else {
-		memset(thread_data, 0, sizeof(void *) * dcus->conf->nr_jobs);
+	cus_queue__init(nr_workers * 4);
+
+	/* fill up the queue with nr_workers JOB_DECODE jobs */
+	for (int i = 0; i < nr_workers; i++) {
+		job = calloc(1, sizeof(*job));
+		job->type = JOB_DECODE;
+		/* no need for locks, workers were not started yet */
+		list_add(&job->node, &cus_processing_queue.jobs);
 	}
 
-	for (i = 0; i < dcus->conf->nr_jobs; ++i) {
-		dthr[i].dcus = dcus;
-		dthr[i].data = thread_data[i];
+	if (dcus->error)
+		return dcus->error;
 
-		dcus->error = pthread_create(&threads[i], NULL,
-					     dwarf_cus__process_cu_thread,
-					     &dthr[i]);
+	for (int i = 0; i < nr_workers; ++i) {
+		dcus->error = pthread_create(&workers[i], NULL,
+					     dwarf_loader__worker_thread,
+					     dcus);
 		if (dcus->error)
 			goto out_join;
 	}
@@ -3596,54 +3766,19 @@ static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
 	dcus->error = 0;
 
 out_join:
-	while (--i >= 0) {
+	for (int i = 0; i < nr_workers; ++i) {
 		void *res;
-		int err = pthread_join(threads[i], &res);
+		int err = pthread_join(workers[i], &res);
 
 		if (err == 0 && res != NULL)
 			dcus->error = (long)res;
 	}
 
-	if (dcus->conf->threads_collect) {
-		res = dcus->conf->threads_collect(dcus->conf, dcus->conf->nr_jobs,
-						  thread_data, dcus->error);
-		if (dcus->error == 0)
-			dcus->error = res;
-	}
+	cus_queue__destroy();
 
 	return dcus->error;
 }
 
-static int __dwarf_cus__process_cus(struct dwarf_cus *dcus)
-{
-	uint8_t pointer_size, offset_size;
-	Dwarf_Off noff;
-	size_t cuhl;
-
-	while (dwarf_nextcu(dcus->dw, dcus->off, &noff, &cuhl, NULL, &pointer_size, &offset_size) == 0) {
-		Dwarf_Die die_mem;
-		Dwarf_Die *cu_die = dwarf_offdie(dcus->dw, dcus->off + cuhl, &die_mem);
-
-		if (cu_die == NULL)
-			break;
-
-		if (dwarf_cus__create_and_process_cu(dcus, cu_die, pointer_size) == DWARF_CB_ABORT)
-			return DWARF_CB_ABORT;
-
-		dcus->off = noff;
-	}
-
-	return 0;
-}
-
-static int dwarf_cus__process_cus(struct dwarf_cus *dcus)
-{
-	if (dcus->conf->nr_jobs > 1)
-		return dwarf_cus__threaded_process_cus(dcus);
-
-	return __dwarf_cus__process_cus(dcus);
-}
-
 static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
 				     Dwfl_Module *mod, Dwarf *dw, Elf *elf,
 				     const char *filename,
@@ -3733,7 +3868,8 @@ static int cus__merge_and_process_cu(struct cus *cus, struct conf_load *conf,
 	if (cu__resolve_func_ret_types_optimized(cu) != LSK__KEEPIT)
 		goto out_abort;
 
-	if (cus__finalize(cus, cu, conf, NULL) == LSK__STOP_LOADING)
+	cu__finalize(cu, cus, conf);
+	if (cus__steal_now(cus, cu, conf) == LSK__STOP_LOADING)
 		goto out_abort;
 
 	return 0;
@@ -3765,7 +3901,8 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 	}
 
 	if (type_cu != NULL) {
-		type_lsk = cu__finalize(type_cu, cus, conf, NULL);
+		cu__finalize(type_cu, cus, conf);
+		type_lsk = cus__steal_now(cus, type_cu, conf);
 		if (type_lsk == LSK__DELETE) {
 			cus__remove(cus, type_cu);
 		}
@@ -3787,6 +3924,7 @@ static int cus__load_module(struct cus *cus, struct conf_load *conf,
 			.type_dcu = type_cu ? &type_dcu : NULL,
 			.build_id = build_id,
 			.build_id_len = build_id_len,
+			.nr_cus_created = 0,
 		};
 		res = dwarf_cus__process_cus(&dcus);
 	}
diff --git a/dwarves.c b/dwarves.c
index ae512b9..7c3e878 100644
--- a/dwarves.c
+++ b/dwarves.c
@@ -530,48 +530,6 @@ void cus__unlock(struct cus *cus)
 	pthread_mutex_unlock(&cus->mutex);
 }
 
-void cus__set_cu_state(struct cus *cus, struct cu *cu, enum cu_state state)
-{
-	cus__lock(cus);
-	cu->state = state;
-	cus__unlock(cus);
-}
-
-// Used only when reproducible builds are desired
-struct cu *cus__get_next_processable_cu(struct cus *cus)
-{
-	struct cu *cu;
-
-	cus__lock(cus);
-
-	list_for_each_entry(cu, &cus->cus, node) {
-		switch (cu->state) {
-		case CU__LOADED:
-			cu->state = CU__PROCESSING;
-			goto found;
-		case CU__PROCESSING:
-			// This will happen when we get to parallel
-			// reproducible BTF encoding, libbpf dedup work needed
-			// here. The other possibility is when we're flushing
-			// the DWARF processed CUs when the parallel DWARF
-			// loading stoped and we still have CUs to encode to
-			// BTF because of ordering requirements.
-			continue;
-		case CU__UNPROCESSED:
-			// The first entry isn't loaded, signal the
-			// caller to return and try another day, as we
-			// need to respect the original DWARF CU ordering.
-			goto out;
-		}
-	}
-out:
-	cu = NULL;
-found:
-	cus__unlock(cus);
-
-	return cu;
-}
-
 bool cus__empty(const struct cus *cus)
 {
 	return list_empty(&cus->cus);
@@ -808,8 +766,6 @@ struct cu *cu__new(const char *name, uint8_t addr_size,
 		cu->addr_size = addr_size;
 		cu->extra_dbg_info = 0;
 
-		cu->state = CU__UNPROCESSED;
-
 		cu->nr_inline_expansions   = 0;
 		cu->size_inline_expansions = 0;
 		cu->nr_structures_changed  = 0;
diff --git a/dwarves.h b/dwarves.h
index 7c80b18..70b4ddf 100644
--- a/dwarves.h
+++ b/dwarves.h
@@ -44,12 +44,6 @@ enum load_steal_kind {
 	LSK__STOP_LOADING,
 };
 
-enum cu_state {
-	CU__UNPROCESSED,
-	CU__LOADED,
-	CU__PROCESSING,
-};
-
 /*
  * BTF combines all the types into one big CU using btf_dedup(), so for something
  * like a allyesconfig vmlinux kernel we can get over 65535 types.
@@ -60,7 +54,6 @@ struct btf;
 struct conf_fprintf;
 
 /** struct conf_load - load configuration
- * @thread_exit - called at the end of a thread, 1st user: BTF encoder dedup
  * @extra_dbg_info - keep original debugging format extra info
  *		     (e.g. DWARF's decl_{line,file}, id, etc)
  * @fixup_silly_bitfields - Fixup silly things such as "int foo:32;"
@@ -70,11 +63,9 @@ struct conf_fprintf;
  * @skip_missing - skip missing types rather than bailing out.
  */
 struct conf_load {
-	enum load_steal_kind	(*steal)(struct cu *cu,
-					 struct conf_load *conf,
-					 void *thr_data);
+	enum load_steal_kind	(*steal)(struct cu *cu, struct conf_load *conf);
 	struct cu *		(*early_cu_filter)(struct cu *cu);
-	int			(*thread_exit)(struct conf_load *conf, void *thr_data);
+	int			(*pre_load_module)(Dwfl_Module *mod, Elf *elf);
 	void			*cookie;
 	char			*format_path;
 	int			nr_jobs;
@@ -105,9 +96,6 @@ struct conf_load {
 	const char		*kabi_prefix;
 	struct btf		*base_btf;
 	struct conf_fprintf	*conf_fprintf;
-	int			(*threads_prepare)(struct conf_load *conf, int nr_threads, void **thr_data);
-	int			(*threads_collect)(struct conf_load *conf, int nr_threads, void **thr_data, int error);
-	int			(*pre_load_module)(Dwfl_Module *mod, Elf *elf);
 };
 
 /** struct conf_fprintf - hints to the __fprintf routines
@@ -189,10 +177,6 @@ void cus__add(struct cus *cus, struct cu *cu);
 void __cus__remove(struct cus *cus, struct cu *cu);
 void cus__remove(struct cus *cus, struct cu *cu);
 
-struct cu *cus__get_next_processable_cu(struct cus *cus);
-
-void cus__set_cu_state(struct cus *cus, struct cu *cu, enum cu_state state);
-
 void cus__print_error_msg(const char *progname, const struct cus *cus,
 			  const char *filename, const int err);
 struct cu *cus__find_pair(struct cus *cus, const char *name);
@@ -309,7 +293,6 @@ struct cu {
 	uint8_t		 nr_register_params;
 	int		 register_params[ARCH_MAX_REGISTER_PARAMS];
 	int		 functions_saved;
-	enum cu_state	 state;
 	uint16_t	 language;
 	unsigned long	 nr_inline_expansions;
 	size_t		 size_inline_expansions;
diff --git a/pahole.c b/pahole.c
index 7964a03..4148d7a 100644
--- a/pahole.c
+++ b/pahole.c
@@ -3118,6 +3118,32 @@ out:
 	return ret;
 }
 
+static enum load_steal_kind pahole_stealer__btf_encode(struct cu *cu, struct conf_load *conf_load)
+{
+	int err;
+
+	if (!btf_encoder)
+		btf_encoder = btf_encoder__new(cu,
+				       detached_btf_filename,
+				       conf_load->base_btf,
+				       global_verbose,
+				       conf_load);
+
+	if (!btf_encoder) {
+		fprintf(stderr, "Error creating BTF encoder.\n");
+		return LSK__STOP_LOADING;
+	}
+
+	err = btf_encoder__encode_cu(btf_encoder, cu, conf_load);
+	if (err < 0) {
+		fprintf(stderr, "Error while encoding BTF.\n");
+		return LSK__STOP_LOADING;
+	}
+
+	return LSK__DELETE;
+}
+
+
 static struct type_instance *header;
 
 static bool print_enumeration_with_enumerator(struct cu *cu, const char *name)
@@ -3136,87 +3162,7 @@ static bool print_enumeration_with_enumerator(struct cu *cu, const char *name)
 	return false;
 }
 
-struct thread_data {
-	struct btf *btf;
-	struct btf_encoder *encoder;
-};
-
-static int pahole_threads_prepare_reproducible_build(struct conf_load *conf, int nr_threads, void **thr_data)
-{
-	for (int i = 0; i < nr_threads; i++)
-		thr_data[i] = NULL;
-
-	return 0;
-}
-
-static int pahole_threads_prepare(struct conf_load *conf, int nr_threads, void **thr_data)
-{
-	int i;
-	struct thread_data *threads = calloc(sizeof(struct thread_data), nr_threads);
-
-	for (i = 0; i < nr_threads; i++)
-		thr_data[i] = threads + i;
-
-	return 0;
-}
-
-static int pahole_thread_exit(struct conf_load *conf, void *thr_data)
-{
-	struct thread_data *thread = thr_data;
-
-	if (thread == NULL)
-		return 0;
-
-	/*
-	 * Here we will call btf__dedup() here once we extend
-	 * btf__dedup().
-	 */
-
-	return 0;
-}
-
-static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void **thr_data,
-				  int error)
-{
-	struct thread_data **threads = (struct thread_data **)thr_data;
-	int i;
-	int err = 0;
-
-	if (error)
-		goto out;
-
-	err = btf_encoder__add_saved_funcs(conf_load.skip_encoding_btf_inconsistent_proto);
-	if (err < 0)
-		goto out;
-
-	for (i = 0; i < nr_threads; i++) {
-		/*
-		 * Merge content of the btf instances of worker threads to the btf
-		 * instance of the primary btf_encoder.
-                */
-		if (!threads[i]->encoder || !threads[i]->btf)
-			continue;
-		err = btf_encoder__add_encoder(btf_encoder, threads[i]->encoder);
-		if (err < 0)
-			goto out;
-	}
-	err = 0;
-
-out:
-	for (i = 0; i < nr_threads; i++) {
-		if (threads[i]->encoder && threads[i]->encoder != btf_encoder) {
-			btf_encoder__delete(threads[i]->encoder);
-			threads[i]->encoder = NULL;
-		}
-	}
-	free(threads[0]);
-
-	return err;
-}
-
-static enum load_steal_kind pahole_stealer(struct cu *cu,
-					   struct conf_load *conf_load,
-					   void *thr_data)
+static enum load_steal_kind pahole_stealer(struct cu *cu, struct conf_load *conf_load)
 {
 	int ret = LSK__DELETE;
 
@@ -3238,94 +3184,7 @@ static enum load_steal_kind pahole_stealer(struct cu *cu,
 		return LSK__DELETE; // Maybe we can find this in several CUs, so don't stop it
 
 	if (btf_encode) {
-		static pthread_mutex_t btf_lock = PTHREAD_MUTEX_INITIALIZER;
-		struct btf_encoder *encoder;
-
-		pthread_mutex_lock(&btf_lock);
-		/*
-		 * FIXME:
-		 *
-		 * This should be really done at main(), but since in the current codebase only at this
-		 * point we'll have cu->elf setup...
-		 */
-		if (!btf_encoder) {
-			/*
-			 * btf_encoder is the primary encoder.
-			 * And, it is used by the thread
-			 * create it.
-			 */
-			btf_encoder = btf_encoder__new(cu, detached_btf_filename, conf_load->base_btf,
-						       global_verbose, conf_load);
-			if (btf_encoder && thr_data) {
-				struct thread_data *thread = thr_data;
-
-				thread->encoder = btf_encoder;
-				thread->btf = btf_encoder__btf(btf_encoder);
-			}
-		}
-
-		// Reproducible builds don't have multiple btf_encoders, so we need to keep the lock until we encode BTF for this CU.
-		if (thr_data)
-			pthread_mutex_unlock(&btf_lock);
-
-		if (!btf_encoder) {
-			ret = LSK__STOP_LOADING;
-			goto out_btf;
-		}
-
-		/*
-		 * thr_data keeps per-thread data for worker threads.  Each worker thread
-		 * has an encoder.  The main thread will merge the data collected by all
-		 * these encoders to btf_encoder.  However, the first thread reaching this
-		 * function creates btf_encoder and reuses it as its local encoder.  It
-		 * avoids copying the data collected by the first thread.
-		 */
-		if (thr_data) {
-			struct thread_data *thread = thr_data;
-
-			if (thread->encoder == NULL) {
-				thread->encoder =
-					btf_encoder__new(cu, detached_btf_filename,
-							 NULL,
-							 global_verbose,
-							 conf_load);
-				thread->btf = btf_encoder__btf(thread->encoder);
-			}
-			encoder = thread->encoder;
-		} else {
-			encoder = btf_encoder;
-		}
-
-		// Since we don't have yet a way to parallelize the BTF encoding, we
-		// need to ask the loader for the next CU that we can process, one
-		// that is loaded and is in order, if the next one isn't yet loaded,
-		// then return to let the DWARF loader thread to load the next one,
-		// eventually all will get processed, even if when all DWARF loading
-		// threads finish.
-		if (conf_load->reproducible_build) {
-			ret = LSK__KEEPIT; // we're not processing the cu passed to this
-					  // function, so keep it.
-			cu = cus__get_next_processable_cu(cus);
-			if (cu == NULL)
-				goto out_btf;
-		}
-
-		ret = btf_encoder__encode_cu(encoder, cu, conf_load);
-		if (ret < 0) {
-			fprintf(stderr, "Encountered error while encoding BTF.\n");
-			exit(1);
-		}
-
-		if (conf_load->reproducible_build) {
-			ret = LSK__KEEPIT; // we're not processing the cu passed to this function, so keep it.
-			// Kinda equivalent to LSK__DELETE since we processed this, but we can't delete it
-			// as we stash references to entries in CUs for 'struct function' in btf_encoder__add_saved_funcs()
-			// and btf_encoder__save_func(), so we can't delete them here. - Alan Maguire
-		}
-out_btf:
-		if (!thr_data) // See comment about reproducibe_build above
-			pthread_mutex_unlock(&btf_lock);
-		return ret;
+		return pahole_stealer__btf_encode(cu, conf_load);
 	}
 #if 0
 	if (ctf_encode) {
@@ -3625,24 +3484,6 @@ out_free:
 	return ret;
 }
 
-static int cus__flush_reproducible_build(struct cus *cus, struct btf_encoder *encoder, struct conf_load *conf_load)
-{
-	int err = 0;
-
-	while (true) {
-		struct cu *cu = cus__get_next_processable_cu(cus);
-
-		if (cu == NULL)
-			break;
-
-		err = btf_encoder__encode_cu(encoder, cu, conf_load);
-		if (err < 0)
-			break;
-	}
-
-	return err;
-}
-
 int main(int argc, char *argv[])
 {
 	int err, remaining, rc = EXIT_FAILURE;
@@ -3731,16 +3572,6 @@ int main(int argc, char *argv[])
 	if (languages.exclude)
 		conf_load.early_cu_filter = cu__filter;
 
-	conf_load.thread_exit = pahole_thread_exit;
-
-	if (conf_load.reproducible_build) {
-		conf_load.threads_prepare = pahole_threads_prepare_reproducible_build;
-		conf_load.threads_collect = NULL;
-	} else {
-		conf_load.threads_prepare = pahole_threads_prepare;
-		conf_load.threads_collect = pahole_threads_collect;
-	}
-
 	if (btf_encode) {
 		conf_load.pre_load_module = btf_encoder__pre_load_module;
 		err = btf_encoding_context__init();
@@ -3847,16 +3678,7 @@ try_sole_arg_as_class_names:
 	header = NULL;
 
 	if (btf_encode && btf_encoder) { // maybe all CUs were filtered out and thus we don't have an encoder?
-		if (conf_load.reproducible_build &&
-		    cus__flush_reproducible_build(cus, btf_encoder, &conf_load) < 0) {
-			fprintf(stderr, "Encountered error while encoding BTF.\n");
-			exit(1);
-		}
-
-		if (conf_load.nr_jobs <= 1 || conf_load.reproducible_build)
-			btf_encoder__add_saved_funcs(conf_load.skip_encoding_btf_inconsistent_proto);
-
-		err = btf_encoder__encode(btf_encoder);
+		err = btf_encoder__encode(btf_encoder, &conf_load);
 		btf_encoder__delete(btf_encoder);
 		if (err) {
 			fputs("Failed to encode BTF\n", stderr);
diff --git a/pdwtags.c b/pdwtags.c
index 67982af..962883d 100644
--- a/pdwtags.c
+++ b/pdwtags.c
@@ -91,8 +91,7 @@ static int cu__emit_tags(struct cu *cu)
 }
 
 static enum load_steal_kind pdwtags_stealer(struct cu *cu,
-					    struct conf_load *conf_load __maybe_unused,
-					    void *thr_data __maybe_unused)
+					    struct conf_load *conf_load __maybe_unused)
 {
 	cu__emit_tags(cu);
 	return LSK__DELETE;
diff --git a/pfunct.c b/pfunct.c
index 1d74ece..55eafe8 100644
--- a/pfunct.c
+++ b/pfunct.c
@@ -510,8 +510,7 @@ int elf_symtabs__show(char *filenames[])
 }
 
 static enum load_steal_kind pfunct_stealer(struct cu *cu,
-					   struct conf_load *conf_load __maybe_unused,
-					   void *thr_data __maybe_unused)
+					   struct conf_load *conf_load __maybe_unused)
 {
 
 	if (function_name) {
diff --git a/tests/reproducible_build.sh b/tests/reproducible_build.sh
index f10f834..a940d93 100755
--- a/tests/reproducible_build.sh
+++ b/tests/reproducible_build.sh
@@ -37,10 +37,7 @@ for threads in $(seq $nr_proc) ; do
 	sleep 0.3s
 	# PID part to remove ps output headers
 	nr_threads_started=$(ps -L -C pahole | grep -v PID | wc -l)
-
-	if [ $threads -gt 1 ] ; then
-		((nr_threads_started -= 1))
-	fi
+        ((nr_threads_started -= 1)) # main thread doesn't count, it waits to join
 
 	if [ $threads != $nr_threads_started ] ; then
 		echo "ERROR: pahole asked to start $threads encoding threads, started $nr_threads_started"
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-13 22:37 ` [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model Ihor Solodrai
@ 2024-12-17  0:57   ` Eduard Zingerman
  2024-12-17 18:12     ` Ihor Solodrai
  2024-12-19 14:59     ` Jiri Olsa
  2024-12-17  2:14   ` Eduard Zingerman
  2024-12-19 14:59   ` Jiri Olsa
  2 siblings, 2 replies; 30+ messages in thread
From: Eduard Zingerman @ 2024-12-17  0:57 UTC (permalink / raw)
  To: Ihor Solodrai, dwarves; +Cc: acme, alan.maguire, andrii, mykolal, bpf

On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:

[...]

> +static void *dwarf_loader__worker_thread(void *arg)
> +{
> +	struct cu_processing_job *job;
> +	struct dwarf_cus *dcus = arg;
> +	bool stop = false;
> +	struct cu *cu;
> +
> +	while (!stop) {
> +		job = cus_queue__dequeue_job();
> +
> +		switch (job->type) {
> +
> +		case JOB_DECODE:
> +			cu = dwarf_loader__decode_next_cu(dcus);
> +
> +			if (cu == NULL) {
> +				free(job);
> +				stop = true;
> +				break;
> +			}
> +
> +			/* Create and enqueue a new JOB_STEAL for this decoded CU */
> +			struct cu_processing_job *steal_job = calloc(1, sizeof(*steal_job));
> +
> +			steal_job->type = JOB_STEAL;
> +			steal_job->cu = cu;
> +			cus_queue__enqueue_job(steal_job);
> +
> +			/* re-enqueue JOB_DECODE so that next CU is decoded from DWARF */
> +			cus_queue__enqueue_job(job);
> +			break;
> +
> +		case JOB_STEAL:
> +			if (cus__steal_now(dcus->cus, job->cu, dcus->conf) == LSK__STOP_LOADING)
> +				goto out_abort;
> +			cus_queue__inc_next_cu_id();
> +			/* Free the job struct as it's no longer
> +			 * needed after CU has been stolen.
> +			 * dwarf_loader work for this CU is done.
> +			 */
> +			free(job);
>  			break;
>  
> -		if (dwarf_cus__process_cu(dcus, cu_die, dcu->cu, dthr->data) == DWARF_CB_ABORT)
> +		default:
> +			fprintf(stderr, "Unknown dwarf_loader job type %d\n", job->type);
>  			goto out_abort;
> +		}
>  	}
>  
> -	if (dcus->conf->thread_exit &&
> -	    dcus->conf->thread_exit(dcus->conf, dthr->data) != 0)
> +	if (dcus->error)
>  		goto out_abort;
>  
>  	return (void *)DWARF_CB_OK;
> @@ -3566,29 +3736,29 @@ out_abort:
>  	return (void *)DWARF_CB_ABORT;
>  }

There is no real need to use two condition variables to achieve what is done here.
The "JOB_DECODE" item is already used as a "ticket" to do the decoding.
So it is possible to "emit" a fixed number of tickets and alternate their state
between "decode"/"steal", w/o allocating new tickets.
This would allow removing the "job_taken" condition variable and the decode counters.
E.g. as in the patch below, applied on top of this patch-set.

---

diff --git a/dwarf_loader.c b/dwarf_loader.c
index 6d22648..40ad27d 100644
--- a/dwarf_loader.c
+++ b/dwarf_loader.c
@@ -3453,23 +3453,10 @@ struct dwarf_cus {
 static struct {
 	pthread_mutex_t mutex;
 	pthread_cond_t job_added;
-	pthread_cond_t job_taken;
 	/* next_cu_id determines the next CU ready to be stealed
 	 * This enforces the order of CU stealing.
 	 */
 	uint32_t next_cu_id;
-	/* max_decoded_cus is a soft limit on the number of JOB_STEAL
-	 * jobs currently in the queue (this number is equal to the
-	 * number of decoded CUs held in memory). It's soft, because a
-	 * worker thread may finish decoding it's current CU after
-	 * this limit has already been reached. In such situation,
-	 * JOB_STEAL with this CU is still added to the queue,
-	 * although a worker will not pick up a new JOB_DECODE.
-	 * So the real hard limit is max_decoded_cus + nr_workers.
-	 * This variable indirectly limits the memory usage.
-	 */
-	uint16_t max_decoded_cus;
-	uint16_t nr_decoded_cus;
 	struct list_head jobs;
 } cus_processing_queue;
 
@@ -3489,10 +3476,7 @@ static void cus_queue__init(uint16_t max_decoded_cus)
 {
 	pthread_mutex_init(&cus_processing_queue.mutex, NULL);
 	pthread_cond_init(&cus_processing_queue.job_added, NULL);
-	pthread_cond_init(&cus_processing_queue.job_taken, NULL);
 	INIT_LIST_HEAD(&cus_processing_queue.jobs);
-	cus_processing_queue.max_decoded_cus = max_decoded_cus;
-	cus_processing_queue.nr_decoded_cus = 0;
 	cus_processing_queue.next_cu_id = 0;
 }
 
@@ -3500,7 +3484,6 @@ static void cus_queue__destroy(void)
 {
 	pthread_mutex_destroy(&cus_processing_queue.mutex);
 	pthread_cond_destroy(&cus_processing_queue.job_added);
-	pthread_cond_destroy(&cus_processing_queue.job_taken);
 }
 
 static inline void cus_queue__inc_next_cu_id(void)
@@ -3520,12 +3503,10 @@ static void cus_queue__enqueue_job(struct cu_processing_job *job)
 	/* JOB_STEAL have higher priority, add them to the head so
 	 * they can be found faster
 	 */
-	if (job->type == JOB_STEAL) {
+	if (job->type == JOB_STEAL)
 		list_add(&job->node, &cus_processing_queue.jobs);
-		cus_processing_queue.nr_decoded_cus++;
-	} else {
+	else
 		list_add_tail(&job->node, &cus_processing_queue.jobs);
-	}
 
 	pthread_cond_signal(&cus_processing_queue.job_added);
 	pthread_mutex_unlock(&cus_processing_queue.mutex);
@@ -3537,45 +3518,28 @@ static struct cu_processing_job *cus_queue__dequeue_job(void)
 	struct list_head *pos, *tmp;
 
 	pthread_mutex_lock(&cus_processing_queue.mutex);
-	while (list_empty(&cus_processing_queue.jobs))
-		pthread_cond_wait(&cus_processing_queue.job_added, &cus_processing_queue.mutex);
-
-	/* First, try to find JOB_STEAL for the next CU */
+retry:
 	list_for_each_safe(pos, tmp, &cus_processing_queue.jobs) {
 		job = list_entry(pos, struct cu_processing_job, node);
 		if (job->type == JOB_STEAL && job->cu->id == cus_processing_queue.next_cu_id) {
-			list_del(&job->node);
-			cus_processing_queue.nr_decoded_cus--;
 			dequeued_job = job;
 			break;
 		}
-	}
-
-	/* If no JOB_STEAL is found, check if we are allowed to decode
-	 * more CUs.  If not, it means that the CU with next_cu_id is
-	 * still being decoded while the queue is "full". Wait.
-	 * job_taken will signal that another thread was able to pick
-	 * up a JOB_STEAL, so we might be able to proceed with JOB_DECODE.
-	 */
-	if (dequeued_job == NULL) {
-		while (cus_processing_queue.nr_decoded_cus >= cus_processing_queue.max_decoded_cus)
-			pthread_cond_wait(&cus_processing_queue.job_taken, &cus_processing_queue.mutex);
-
-		/* We can decode now. */
-		list_for_each_safe(pos, tmp, &cus_processing_queue.jobs) {
-			job = list_entry(pos, struct cu_processing_job, node);
-			if (job->type == JOB_DECODE) {
-				list_del(&job->node);
-				dequeued_job = job;
-				break;
-			}
+		if (job->type == JOB_DECODE) {
+			/* all JOB_STEALs are added to the head, so no viable JOB_STEAL available */
+			dequeued_job = job;
+			break;
 		}
 	}
-
-	pthread_cond_signal(&cus_processing_queue.job_taken);
+	/* No jobs or only steals out of order */
+	if (!dequeued_job) {
+		pthread_cond_wait(&cus_processing_queue.job_added, &cus_processing_queue.mutex);
+		goto retry;
+	}
+	list_del(&dequeued_job->node);
 	pthread_mutex_unlock(&cus_processing_queue.mutex);
 
-	return dequeued_job;
+	return job;
 }
 
 static struct dwarf_cu *dwarf_cus__create_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die, uint8_t pointer_size)
@@ -3700,14 +3664,8 @@ static void *dwarf_loader__worker_thread(void *arg)
 				break;
 			}
 
-			/* Create and enqueue a new JOB_STEAL for this decoded CU */
-			struct cu_processing_job *steal_job = calloc(1, sizeof(*steal_job));
-
-			steal_job->type = JOB_STEAL;
-			steal_job->cu = cu;
-			cus_queue__enqueue_job(steal_job);
-
-			/* re-enqueue JOB_DECODE so that next CU is decoded from DWARF */
+			job->type = JOB_STEAL;
+			job->cu = cu;
 			cus_queue__enqueue_job(job);
 			break;
 
@@ -3715,11 +3673,10 @@ static void *dwarf_loader__worker_thread(void *arg)
 			if (cus__steal_now(dcus->cus, job->cu, dcus->conf) == LSK__STOP_LOADING)
 				goto out_abort;
 			cus_queue__inc_next_cu_id();
-			/* Free the job struct as it's no longer
-			 * needed after CU has been stolen.
-			 * dwarf_loader work for this CU is done.
-			 */
-			free(job);
+			/* re-enqueue JOB_DECODE so that next CU is decoded from DWARF */
+			job->type = JOB_DECODE;
+			job->cu = NULL;
+			cus_queue__enqueue_job(job);
 			break;
 
 		default:
@@ -3742,10 +3699,10 @@ static int dwarf_cus__process_cus(struct dwarf_cus *dcus)
 	pthread_t workers[nr_workers];
 	struct cu_processing_job *job;
 
-	cus_queue__init(nr_workers * 4);
+	cus_queue__init(nr_workers);
 
 	/* fill up the queue with nr_workers JOB_DECODE jobs */
-	for (int i = 0; i < nr_workers; i++) {
+	for (int i = 0; i < nr_workers * 4; i++) {
 		job = calloc(1, sizeof(*job));
 		job->type = JOB_DECODE;
 		/* no need for locks, workers were not started yet */


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-13 22:37 ` [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model Ihor Solodrai
  2024-12-17  0:57   ` Eduard Zingerman
@ 2024-12-17  2:14   ` Eduard Zingerman
  2024-12-19 14:59   ` Jiri Olsa
  2 siblings, 0 replies; 30+ messages in thread
From: Eduard Zingerman @ 2024-12-17  2:14 UTC (permalink / raw)
  To: Ihor Solodrai, dwarves; +Cc: acme, alan.maguire, andrii, mykolal, bpf

On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:

Also a small nit.
Aside from my previous email, I think this is a good simplification.

[...]

> @@ -3250,24 +3250,20 @@ static void cu__sort_types_by_offset(struct cu *cu, struct conf_load *conf)
>  	cu__for_all_tags(cu, type__sort_by_offset, conf);
>  }
>  
> -static int cu__finalize(struct cu *cu, struct cus *cus, struct conf_load *conf, void *thr_data)
> +static void cu__finalize(struct cu *cu, struct cus *cus, struct conf_load *conf)
>  {
>  	cu__for_all_tags(cu, class_member__cache_byte_size, conf);
>  
>  	if (cu__language_reorders_offsets(cu))
>  		cu__sort_types_by_offset(cu, conf);
> -
> -	cus__set_cu_state(cus, cu, CU__LOADED);
> -
> -	if (conf && conf->steal) {
> -		return conf->steal(cu, conf, thr_data);
> -	}
> -	return LSK__KEEPIT;
>  }
>  
> -static int cus__finalize(struct cus *cus, struct cu *cu, struct conf_load *conf, void *thr_data)
> +static int cus__steal_now(struct cus *cus, struct cu *cu, struct conf_load *conf)
>  {
> -	int lsk = cu__finalize(cu, cus, conf, thr_data);
> +	if (!conf || !conf->steal)
> +		return 0;

Nit: the function returns either 0 or an enum literal,
     but 0 is a valid literal value for that enum.
     This is a bit confusing.

> +
> +	int lsk = conf->steal(cu, conf);
>  	switch (lsk) {
>  	case LSK__DELETE:
>  		cus__remove(cus, cu);

[...]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-13 22:37 ` [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context Ihor Solodrai
@ 2024-12-17  2:39   ` Eduard Zingerman
  2024-12-17  3:15     ` Eduard Zingerman
  0 siblings, 1 reply; 30+ messages in thread
From: Eduard Zingerman @ 2024-12-17  2:39 UTC (permalink / raw)
  To: Ihor Solodrai, dwarves; +Cc: acme, alan.maguire, andrii, mykolal, bpf

On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> Introduce a static struct holding global data necessary for BTF
> encoding: elf_functions tables and btf_encoder structs.
> 
> The context has init()/exit() interface that should be used to
> indicate when BTF encoding work has started and ended.
> 
> I considered freeing everything contained in the context exclusively
> on exit(), however it turns out this unnecessarily increases max
> RSS. Probably because the work done in btf_encoder__encode() requires
> relatively more memory, and if encoders and tables are freed earlier,
> that space is reused.
> 
> Compare:
>     -j4: 	Maximum resident set size (kbytes): 868484
>     -j8: 	Maximum resident set size (kbytes): 1003040
>     -j16: 	Maximum resident set size (kbytes): 1039416
>     -j32: 	Maximum resident set size (kbytes): 1145312
> vs
>     -j4: 	Maximum resident set size (kbytes): 972692
>     -j8: 	Maximum resident set size (kbytes): 1043184
>     -j16: 	Maximum resident set size (kbytes): 1081156
>     -j32: 	Maximum resident set size (kbytes): 1218184
> 
> Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
> ---

After patch #10 "dwarf_loader: multithreading with a job/worker model"
from this series I do not understand why this patch is necessary.
After patch #10 there is only one BTF encoder, thus:
- there is no need to track btf_encoder_list;
- elf_functions_list can now be a part of the encoder;
- it should be possible to forgo global variable for encoder
  and pass it as a parameter for each btf_encoder__* func.

So it seems that this patch should be dropped and replaced by one that
follows patch #10 and applies the above simplifications.
Wdyt?

[...]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-17  2:39   ` Eduard Zingerman
@ 2024-12-17  3:15     ` Eduard Zingerman
  2024-12-17 18:06       ` Ihor Solodrai
  0 siblings, 1 reply; 30+ messages in thread
From: Eduard Zingerman @ 2024-12-17  3:15 UTC (permalink / raw)
  To: Ihor Solodrai, dwarves; +Cc: acme, alan.maguire, andrii, mykolal, bpf

On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote:
> On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> > Introduce a static struct holding global data necessary for BTF
> > encoding: elf_functions tables and btf_encoder structs.
> > 
> > The context has init()/exit() interface that should be used to
> > indicate when BTF encoding work has started and ended.
> > 
> > I considered freeing everything contained in the context exclusively
> > on exit(), however it turns out this unnecessarily increases max
> > RSS. Probably because the work done in btf_encoder__encode() requires
> > relatively more memory, and if encoders and tables are freed earlier,
> > that space is reused.
> > 
> > Compare:
> >     -j4: 	Maximum resident set size (kbytes): 868484
> >     -j8: 	Maximum resident set size (kbytes): 1003040
> >     -j16: 	Maximum resident set size (kbytes): 1039416
> >     -j32: 	Maximum resident set size (kbytes): 1145312
> > vs
> >     -j4: 	Maximum resident set size (kbytes): 972692
> >     -j8: 	Maximum resident set size (kbytes): 1043184
> >     -j16: 	Maximum resident set size (kbytes): 1081156
> >     -j32: 	Maximum resident set size (kbytes): 1218184
> > 
> > Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
> > ---
> 
> After patch #10 "dwarf_loader: multithreading with a job/worker model"
> from this series I do not understand why this patch is necessary.
> After patch #10 there is only one BTF encoder, thus:
> - there is no need to track btf_encoder_list;
> - elf_functions_list can now be a part of the encoder;
> - it should be possible to forgo global variable for encoder
>   and pass it as a parameter for each btf_encoder__* func.
> 
> So it seems that this patch should be dropped and replaced by one that
> follows patch #10 and applies the above simplifications.
> Wdyt?

Meaning that patch #6 "btf_encoder: switch to shared elf_functions table"
is not necessary. Strictly speaking, patches 1,2,4 might not be necessary
as well, but could be viewed as a refactoring.
Switch to single-threaded BTF encoder significantly changes this patch-set.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding
  2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
                   ` (9 preceding siblings ...)
  2024-12-13 22:37 ` [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model Ihor Solodrai
@ 2024-12-17  7:00 ` Eduard Zingerman
  2024-12-20 12:31   ` Jiri Olsa
  10 siblings, 1 reply; 30+ messages in thread
From: Eduard Zingerman @ 2024-12-17  7:00 UTC (permalink / raw)
  To: Ihor Solodrai, dwarves; +Cc: acme, alan.maguire, andrii, mykolal, bpf

On Fri, 2024-12-13 at 22:36 +0000, Ihor Solodrai wrote:

for allyesconfig, with your patch-set I get the following stats:

jobs 1, mem 7080048 Kb, time 97.24 sec
jobs 3, mem 7091360 Kb, time 60.10 sec
jobs 6, mem 7153848 Kb, time 49.73 sec
jobs 12, mem 7264036 Kb, time 54.67 sec

w/o your patch-set, using current pahole 'next', I run out of memory both
with and without the reproducible_build flag.

The vmlinux size is 7.6G.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-17  3:15     ` Eduard Zingerman
@ 2024-12-17 18:06       ` Ihor Solodrai
  2024-12-18  0:03         ` Andrii Nakryiko
  0 siblings, 1 reply; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-17 18:06 UTC (permalink / raw)
  To: Eduard Zingerman; +Cc: dwarves, acme, alan.maguire, andrii, mykolal, bpf

On Monday, December 16th, 2024 at 7:15 PM, Eduard Zingerman <eddyz87@gmail.com> wrote:

> 
> On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote:
> 
> > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> > 
> > > Introduce a static struct holding global data necessary for BTF
> > > encoding: elf_functions tables and btf_encoder structs.
> [...]
> > 
> > After patch #10 "dwarf_loader: multithreading with a job/worker model"
> > from this series I do not understand why this patch is necessary.
> > After patch #10 there is only one BTF encoder, thus:
> > - there is no need to track btf_encoder_list;
> > - elf_functions_list can now be a part of the encoder;
> > - it should be possible to forgo global variable for encoder
> > and pass it as a parameter for each btf_encoder__* func.
> > 
> > So it seems that this patch should be dropped and replaced by one that
> > follows patch #10 and applies the above simplifications.
> > Wdyt?
> 
> 
> Meaning that patch #6 "btf_encoder: switch to shared elf_functions table"
> is not necessary. Strictly speaking, patches 1,2,4 might not be necessary
> as well, but could be viewed as a refactoring.
> Switch to single-threaded BTF encoder significantly changes this patch-set.

Eduard, thanks for the review again.

You are correct: if we focus on the multithreading changes in
dwarf_loader.c and decide that there is always a single
btf_encoder, then much of this series can be discarded.

At the same time I think most of the patches are useful. At the very
least they enabled experiments that in the end led me to the
dwarf_loader changes.

The changes making the ELF functions table shared were beneficial in
isolation, because they eliminated unnecessary duplication of
information between encoders, leading to reduced memory usage.

The changes splitting ELF and BTF function information in
btf_encoder.c and simplifying function processing are also good in
isolation.

In my opinion, it's not wise to discard all of that, because it turned
out that a single btf_encoder works better in the use-case we care
about now. Later we might want to revisit parallel BTF encoding. Then
some version of the refactoring changes here will have to be re-done.

So I think it makes sense to land most of this series without
significant re-work. But of course I am biased here, as I wrote most
of the patches, and it's always painful to "throw away" effort.

Let's see what others think.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-17  0:57   ` Eduard Zingerman
@ 2024-12-17 18:12     ` Ihor Solodrai
  2024-12-19 14:59     ` Jiri Olsa
  1 sibling, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-17 18:12 UTC (permalink / raw)
  To: Eduard Zingerman; +Cc: dwarves, acme, alan.maguire, andrii, mykolal, bpf

On Monday, December 16th, 2024 at 4:57 PM, Eduard Zingerman <eddyz87@gmail.com> wrote:

> 
> 
> On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> 
> [...]
> 
> There is no real need to use two condition variables to achieve what is done here.
> The "JOB_DECODE" item is already used as a "ticket" to do the decoding.
> So it is possible to "emit" a fixed number of tickets and alternate their state
> between "decode"/"steal", w/o allocating new tickets.
> This would allow removing the "job_taken" condition variable and the decode counters.
> E.g. as in the patch below, applied on top of this patch-set.

Your suggestion makes sense; I hadn't thought about utilizing jobs as
"tickets". This simplifies synchronization.

I'll incorporate this in the next version.

Thank you!

> [...]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-17 18:06       ` Ihor Solodrai
@ 2024-12-18  0:03         ` Andrii Nakryiko
  2024-12-18  0:40           ` Eduard Zingerman
  0 siblings, 1 reply; 30+ messages in thread
From: Andrii Nakryiko @ 2024-12-18  0:03 UTC (permalink / raw)
  To: Ihor Solodrai
  Cc: Eduard Zingerman, dwarves, acme, alan.maguire, andrii, mykolal,
	bpf

On Tue, Dec 17, 2024 at 10:06 AM Ihor Solodrai <ihor.solodrai@pm.me> wrote:
>
> On Monday, December 16th, 2024 at 7:15 PM, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> >
> > On Mon, 2024-12-16 at 18:39 -0800, Eduard Zingerman wrote:
> >
> > > On Fri, 2024-12-13 at 22:37 +0000, Ihor Solodrai wrote:
> > >
> > > > Introduce a static struct holding global data necessary for BTF
> > > > encoding: elf_functions tables and btf_encoder structs.
> > [...]
> > >
> > > After patch #10 "dwarf_loader: multithreading with a job/worker model"
> > > from this series I do not understand why this patch is necessary.
> > > After patch #10 there is only one BTF encoder, thus:
> > > - there is no need to track btf_encoder_list;
> > > - elf_functions_list can now be a part of the encoder;
> > > - it should be possible to forgo global variable for encoder
> > > and pass it as a parameter for each btf_encoder__* func.
> > >
> > > So it seems that this patch should be dropped and replaced by one that
> > > follows patch #10 and applies the above simplifications.
> > > Wdyt?
> >
> >
> > Meaning that patch #6 "btf_encoder: switch to shared elf_functions table"
> > is not necessary. Strictly speaking, patches 1,2,4 might not be necessary
> > as well, but could be viewed as a refactoring.
> > Switch to single-threaded BTF encoder significantly changes this patch-set.
>
> Eduard, thanks for the review again.
>
> You are correct: if we focus on the multithreading changes in
> dwarf_loader.c and make a decision that there is always a single
> btf_encoder, then much of this series can be discarded.
>
> At the same time I think most of the patches are useful. At the very
> least they enabled experiments that in the end led me to the
> dwarf_loader changes.
>
> The changes making ELF functions table shared were beneficial in
> isolation, because they eliminated unnecessary duplication of
> information between encoders, leading to reduced memory usage.
>
> The changes splitting ELF and BTF function information in
> btf_encoder.c and simplifying function processing are also good in
> isolation.
>
> In my opinion, it's not wise to discard all of that, because it turned
> out that a single btf_encoder works better in the use-case we care
> about now. Later we might want to revisit parallel BTF encoding. Then
> some version of the refactoring changes here will have to be re-done.
>
> So I think it makes sense to land most of this series without
> significant re-work. But of course I am biased here, as I wrote most
> of the patches, and it's always painful to "throw away" effort.
>
> Let's see what others think.

I agree with Ihor. I think he invested a lot of time into these
improvements, and asking him to re-do the series just to shuffle a few
patches around is just an unnecessary overhead (which also delays the
ultimate outcome: faster BTF generation with pahole). And as Ihor
mentioned, we might improve upon this series by parallelizing encoders
to gain some further improvements, so I think all the internal
refactoring and preparations are setting up a good base for further
work.

Let's try to finalize patch #10's implementation and land this. It's a
nice improvement performance-wise. It's also good that we don't need
to care whether the build is reproducible or not and can effectively
deprecate that short-lived feature we added not so long ago.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-18  0:03         ` Andrii Nakryiko
@ 2024-12-18  0:40           ` Eduard Zingerman
  2024-12-18 20:07             ` Ihor Solodrai
  2024-12-19 14:59             ` Jiri Olsa
  0 siblings, 2 replies; 30+ messages in thread
From: Eduard Zingerman @ 2024-12-18  0:40 UTC (permalink / raw)
  To: Andrii Nakryiko, Ihor Solodrai
  Cc: dwarves, acme, alan.maguire, andrii, mykolal, bpf

On Tue, 2024-12-17 at 16:03 -0800, Andrii Nakryiko wrote:

[...]

> I agree with Ihor. I think he invested a lot of time into these
> improvements, and asking him to re-do the series just to shuffle a few
> patches around is just an unnecessary overhead (which also delays the
> ultimate outcome: faster BTF generation with pahole).

Patches #1-4 are good refactorings.
Patches #5-7 introduce a global state and complication where this
could be avoided. (These were necessary before Ihor figured out the
trick with patch #10).
E.g. 'elf_functions_list':
- there is no reason for it to be global, it could be a part of the
  encoder, as it is now;
- there is no reason for it to be a list, encoding works with a single
  function table and keeping it as a list just confuses reader.

Same for btf_encoder_list_lock / btf_encoder_list.

> And as Ihor mentioned, we might improve upon this series by
> parallelizing encoders to gain some further improvements, so I think
> all the internal refactoring and preparations are setting up a good
> base for further work.

Not really, the main insight found by Ihor is that parallel BTF
encoding doesn't make much sense: constructing BTF is as cheap as
copying it. I don't see much room for improvement here.

[...]


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-18  0:40           ` Eduard Zingerman
@ 2024-12-18 20:07             ` Ihor Solodrai
  2024-12-19 14:59             ` Jiri Olsa
  1 sibling, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-18 20:07 UTC (permalink / raw)
  To: dwarves
  Cc: Andrii Nakryiko, Eduard Zingerman, acme, alan.maguire, andrii,
	mykolal, bpf

Hi everyone.

I had a discussion off-list with Eduard and Andrii and we came to an
agreement on the next iteration of this series.

Patches #3 and #7 will be changed: since there is only one encoder
now, there is no reason to maintain global state, such as list of
btf_encoders and list of elf_functions structs and corresponding
locks. Refactoring about separating elf_functions from btf_encoder
will largely remain, but the object itself will live in the
btf_encoder.

Patch #10 (dwarf_loader workers) will be modified to use only one
condition variable as Eduard suggested [1].

Andrii and I also ran a bunch of experiments to understand why after a
certain point adding more threads noticeably slows down the
processing. It turns out that with the number of jobs growing, the
dwarf loading threads start competing for memory allocation in
cu__zalloc. This can be partially mitigated by setting a larger
obstack chunk size. I will add a relevant patch in the next version of
the series.

[1] https://lore.kernel.org/dwarves/58dc053c9d47a18124d8711604b08acbc6400340.camel@gmail.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table
  2024-12-13 22:37 ` [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table Ihor Solodrai
@ 2024-12-19 14:58   ` Jiri Olsa
  2024-12-19 19:06     ` Ihor Solodrai
  0 siblings, 1 reply; 30+ messages in thread
From: Jiri Olsa @ 2024-12-19 14:58 UTC (permalink / raw)
  To: Ihor Solodrai; +Cc: dwarves, acme, alan.maguire, eddyz87, andrii, mykolal, bpf

On Fri, Dec 13, 2024 at 10:37:13PM +0000, Ihor Solodrai wrote:

SNIP

> @@ -2116,9 +2128,6 @@ int btf_encoder__encode(struct btf_encoder *encoder)
>  	int err;
>  	size_t shndx;
>  
> -	/* for single-threaded case, saved funcs are added here */
> -	btf_encoder__add_saved_funcs(encoder);
> -
>  	for (shndx = 1; shndx < encoder->seccnt; shndx++)
>  		if (gobuffer__size(&encoder->secinfo[shndx].secinfo))
>  			btf_encoder__add_datasec(encoder, shndx);
> @@ -2477,14 +2486,13 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
>  			goto out_delete;
>  		}
>  
> -		encoder->symtab = elf_symtab__new(NULL, cu->elf);
> +		encoder->functions = elf_functions__get(cu->elf);

elf_functions__get should always return != NULL right? should we add assert call for that?

SNIP

> diff --git a/pahole.c b/pahole.c
> index 17af0b4..eb2e71a 100644
> --- a/pahole.c
> +++ b/pahole.c
> @@ -3185,13 +3185,16 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
>  	if (error)
>  		goto out;
>  
> -	btf_encoder__add_saved_funcs(btf_encoder);
> +	err = btf_encoder__add_saved_funcs(btf_encoder);
> +	if (err < 0)
> +		goto out;
> +
>  	for (i = 0; i < nr_threads; i++) {
>  		/*
>  		 * Merge content of the btf instances of worker threads to the btf
>  		 * instance of the primary btf_encoder.
>                  */
> -		if (!threads[i]->btf)
> +		if (!threads[i]->encoder || !threads[i]->btf)

is this related to this change? seems like separate fix

>  			continue;
>  		err = btf_encoder__add_encoder(btf_encoder, threads[i]->encoder);
>  		if (err < 0)
> @@ -3846,6 +3849,9 @@ try_sole_arg_as_class_names:
>  			exit(1);
>  		}
>  
> +		if (conf_load.nr_jobs <= 1 || conf_load.reproducible_build)
> +			btf_encoder__add_saved_funcs(btf_encoder);

should we check the return value here as well?

thanks,
jirka

> +
>  		err = btf_encoder__encode(btf_encoder);
>  		btf_encoder__delete(btf_encoder);
>  		if (err) {
> -- 
> 2.47.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-13 22:37 ` [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model Ihor Solodrai
  2024-12-17  0:57   ` Eduard Zingerman
  2024-12-17  2:14   ` Eduard Zingerman
@ 2024-12-19 14:59   ` Jiri Olsa
  2024-12-19 19:31     ` Ihor Solodrai
  2 siblings, 1 reply; 30+ messages in thread
From: Jiri Olsa @ 2024-12-19 14:59 UTC (permalink / raw)
  To: Ihor Solodrai; +Cc: dwarves, acme, alan.maguire, eddyz87, andrii, mykolal, bpf

On Fri, Dec 13, 2024 at 10:37:34PM +0000, Ihor Solodrai wrote:

SNIP

> +static void *dwarf_loader__worker_thread(void *arg)
> +{
> +	struct cu_processing_job *job;
> +	struct dwarf_cus *dcus = arg;
> +	bool stop = false;
> +	struct cu *cu;
> +
> +	while (!stop) {
> +		job = cus_queue__dequeue_job();
> +
> +		switch (job->type) {
> +
> +		case JOB_DECODE:
> +			cu = dwarf_loader__decode_next_cu(dcus);
> +
> +			if (cu == NULL) {
> +				free(job);
> +				stop = true;
> +				break;
> +			}
> +
> +			/* Create and enqueue a new JOB_STEAL for this decoded CU */
> +			struct cu_processing_job *steal_job = calloc(1, sizeof(*steal_job));

missing steal_job != NULL check

SNIP

> -static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
> +static int dwarf_cus__process_cus(struct dwarf_cus *dcus)
>  {
> -	pthread_t threads[dcus->conf->nr_jobs];
> -	struct dwarf_thread dthr[dcus->conf->nr_jobs];
> -	void *thread_data[dcus->conf->nr_jobs];
> -	int res;
> -	int i;
> +	int nr_workers = dcus->conf->nr_jobs > 0 ? dcus->conf->nr_jobs : 1;
> +	pthread_t workers[nr_workers];
> +	struct cu_processing_job *job;
>  
> -	if (dcus->conf->threads_prepare) {
> -		res = dcus->conf->threads_prepare(dcus->conf, dcus->conf->nr_jobs, thread_data);
> -		if (res != 0)
> -			return res;
> -	} else {
> -		memset(thread_data, 0, sizeof(void *) * dcus->conf->nr_jobs);
> +	cus_queue__init(nr_workers * 4);

why '* 4' ?

> +
> +	/* fill up the queue with nr_workers JOB_DECODE jobs */
> +	for (int i = 0; i < nr_workers; i++) {
> +		job = calloc(1, sizeof(*job));

missing job != NULL check

> +		job->type = JOB_DECODE;
> +		/* no need for locks, workers were not started yet */
> +		list_add(&job->node, &cus_processing_queue.jobs);
>  	}
>  
> -	for (i = 0; i < dcus->conf->nr_jobs; ++i) {
> -		dthr[i].dcus = dcus;
> -		dthr[i].data = thread_data[i];
> +	if (dcus->error)
> +		return dcus->error;
>  
> -		dcus->error = pthread_create(&threads[i], NULL,
> -					     dwarf_cus__process_cu_thread,
> -					     &dthr[i]);
> +	for (int i = 0; i < nr_workers; ++i) {
> +		dcus->error = pthread_create(&workers[i], NULL,
> +					     dwarf_loader__worker_thread,
> +					     dcus);
>  		if (dcus->error)
>  			goto out_join;
>  	}
> @@ -3596,54 +3766,19 @@ static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
>  	dcus->error = 0;
>  
>  out_join:
> -	while (--i >= 0) {
> +	for (int i = 0; i < nr_workers; ++i) {

I think you should keep the original while loop to cleanup/wait only for
threads that we actually created

>  		void *res;
> -		int err = pthread_join(threads[i], &res);
> +		int err = pthread_join(workers[i], &res);
>  
>  		if (err == 0 && res != NULL)
>  			dcus->error = (long)res;
>  	}
>  
> -	if (dcus->conf->threads_collect) {
> -		res = dcus->conf->threads_collect(dcus->conf, dcus->conf->nr_jobs,
> -						  thread_data, dcus->error);
> -		if (dcus->error == 0)
> -			dcus->error = res;
> -	}
> +	cus_queue__destroy();
>  
>  	return dcus->error;
>  }
>  

SNIP

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-17  0:57   ` Eduard Zingerman
  2024-12-17 18:12     ` Ihor Solodrai
@ 2024-12-19 14:59     ` Jiri Olsa
  1 sibling, 0 replies; 30+ messages in thread
From: Jiri Olsa @ 2024-12-19 14:59 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Ihor Solodrai, dwarves, acme, alan.maguire, andrii, mykolal, bpf

On Mon, Dec 16, 2024 at 04:57:33PM -0800, Eduard Zingerman wrote:

SNIP

> >  	}
> >  
> > -	if (dcus->conf->thread_exit &&
> > -	    dcus->conf->thread_exit(dcus->conf, dthr->data) != 0)
> > +	if (dcus->error)
> >  		goto out_abort;
> >  
> >  	return (void *)DWARF_CB_OK;
> > @@ -3566,29 +3736,29 @@ out_abort:
> >  	return (void *)DWARF_CB_ABORT;
> >  }
> 
> There is no real need to use two condition variables to achieve what is done here.
> The "JOB_DECODE" item is already used as a "ticket" to do the decoding.
> So it is possible to "emit" a fixed amount of tickets and alternate their state
> between "decode"/"steal", w/o allocating new tickets.
> This would allow removing the "job_taken" condition variable and the decode counters.
> E.g. as in the patch below applied on top of this patch-set.

+1 , looks much easier

jirka

> 
> ---
> 
> diff --git a/dwarf_loader.c b/dwarf_loader.c
> index 6d22648..40ad27d 100644
> --- a/dwarf_loader.c
> +++ b/dwarf_loader.c
> @@ -3453,23 +3453,10 @@ struct dwarf_cus {
>  static struct {
>  	pthread_mutex_t mutex;
>  	pthread_cond_t job_added;
> -	pthread_cond_t job_taken;
>  	/* next_cu_id determines the next CU ready to be stealed
>  	 * This enforces the order of CU stealing.
>  	 */
>  	uint32_t next_cu_id;
> -	/* max_decoded_cus is a soft limit on the number of JOB_STEAL
> -	 * jobs currently in the queue (this number is equal to the
> -	 * number of decoded CUs held in memory). It's soft, because a
> -	 * worker thread may finish decoding it's current CU after
> -	 * this limit has already been reached. In such situation,
> -	 * JOB_STEAL with this CU is still added to the queue,
> -	 * although a worker will not pick up a new JOB_DECODE.
> -	 * So the real hard limit is max_decoded_cus + nr_workers.
> -	 * This variable indirectly limits the memory usage.
> -	 */
> -	uint16_t max_decoded_cus;
> -	uint16_t nr_decoded_cus;
>  	struct list_head jobs;
>  } cus_processing_queue;
>  
> @@ -3489,10 +3476,7 @@ static void cus_queue__init(uint16_t max_decoded_cus)
>  {
>  	pthread_mutex_init(&cus_processing_queue.mutex, NULL);
>  	pthread_cond_init(&cus_processing_queue.job_added, NULL);
> -	pthread_cond_init(&cus_processing_queue.job_taken, NULL);
>  	INIT_LIST_HEAD(&cus_processing_queue.jobs);
> -	cus_processing_queue.max_decoded_cus = max_decoded_cus;
> -	cus_processing_queue.nr_decoded_cus = 0;
>  	cus_processing_queue.next_cu_id = 0;
>  }
>  
> @@ -3500,7 +3484,6 @@ static void cus_queue__destroy(void)
>  {
>  	pthread_mutex_destroy(&cus_processing_queue.mutex);
>  	pthread_cond_destroy(&cus_processing_queue.job_added);
> -	pthread_cond_destroy(&cus_processing_queue.job_taken);
>  }
>  
>  static inline void cus_queue__inc_next_cu_id(void)
> @@ -3520,12 +3503,10 @@ static void cus_queue__enqueue_job(struct cu_processing_job *job)
>  	/* JOB_STEAL have higher priority, add them to the head so
>  	 * they can be found faster
>  	 */
> -	if (job->type == JOB_STEAL) {
> +	if (job->type == JOB_STEAL)
>  		list_add(&job->node, &cus_processing_queue.jobs);
> -		cus_processing_queue.nr_decoded_cus++;
> -	} else {
> +	else
>  		list_add_tail(&job->node, &cus_processing_queue.jobs);
> -	}
>  
>  	pthread_cond_signal(&cus_processing_queue.job_added);
>  	pthread_mutex_unlock(&cus_processing_queue.mutex);
> @@ -3537,45 +3518,28 @@ static struct cu_processing_job *cus_queue__dequeue_job(void)
>  	struct list_head *pos, *tmp;
>  
>  	pthread_mutex_lock(&cus_processing_queue.mutex);
> -	while (list_empty(&cus_processing_queue.jobs))
> -		pthread_cond_wait(&cus_processing_queue.job_added, &cus_processing_queue.mutex);
> -
> -	/* First, try to find JOB_STEAL for the next CU */
> +retry:
>  	list_for_each_safe(pos, tmp, &cus_processing_queue.jobs) {
>  		job = list_entry(pos, struct cu_processing_job, node);
>  		if (job->type == JOB_STEAL && job->cu->id == cus_processing_queue.next_cu_id) {
> -			list_del(&job->node);
> -			cus_processing_queue.nr_decoded_cus--;
>  			dequeued_job = job;
>  			break;
>  		}
> -	}
> -
> -	/* If no JOB_STEAL is found, check if we are allowed to decode
> -	 * more CUs.  If not, it means that the CU with next_cu_id is
> -	 * still being decoded while the queue is "full". Wait.
> -	 * job_taken will signal that another thread was able to pick
> -	 * up a JOB_STEAL, so we might be able to proceed with JOB_DECODE.
> -	 */
> -	if (dequeued_job == NULL) {
> -		while (cus_processing_queue.nr_decoded_cus >= cus_processing_queue.max_decoded_cus)
> -			pthread_cond_wait(&cus_processing_queue.job_taken, &cus_processing_queue.mutex);
> -
> -		/* We can decode now. */
> -		list_for_each_safe(pos, tmp, &cus_processing_queue.jobs) {
> -			job = list_entry(pos, struct cu_processing_job, node);
> -			if (job->type == JOB_DECODE) {
> -				list_del(&job->node);
> -				dequeued_job = job;
> -				break;
> -			}
> +		if (job->type == JOB_DECODE) {
> +			/* all JOB_STEALs are added to the head, so no viable JOB_STEAL available */
> +			dequeued_job = job;
> +			break;
>  		}
>  	}
> -
> -	pthread_cond_signal(&cus_processing_queue.job_taken);
> +	/* No jobs or only steals out of order */
> +	if (!dequeued_job) {
> +		pthread_cond_wait(&cus_processing_queue.job_added, &cus_processing_queue.mutex);
> +		goto retry;
> +	}
> +	list_del(&dequeued_job->node);
>  	pthread_mutex_unlock(&cus_processing_queue.mutex);
>  
> -	return dequeued_job;
> +	return job;
>  }
>  
>  static struct dwarf_cu *dwarf_cus__create_cu(struct dwarf_cus *dcus, Dwarf_Die *cu_die, uint8_t pointer_size)
> @@ -3700,14 +3664,8 @@ static void *dwarf_loader__worker_thread(void *arg)
>  				break;
>  			}
>  
> -			/* Create and enqueue a new JOB_STEAL for this decoded CU */
> -			struct cu_processing_job *steal_job = calloc(1, sizeof(*steal_job));
> -
> -			steal_job->type = JOB_STEAL;
> -			steal_job->cu = cu;
> -			cus_queue__enqueue_job(steal_job);
> -
> -			/* re-enqueue JOB_DECODE so that next CU is decoded from DWARF */
> +			job->type = JOB_STEAL;
> +			job->cu = cu;
>  			cus_queue__enqueue_job(job);
>  			break;
>  
> @@ -3715,11 +3673,10 @@ static void *dwarf_loader__worker_thread(void *arg)
>  			if (cus__steal_now(dcus->cus, job->cu, dcus->conf) == LSK__STOP_LOADING)
>  				goto out_abort;
>  			cus_queue__inc_next_cu_id();
> -			/* Free the job struct as it's no longer
> -			 * needed after CU has been stolen.
> -			 * dwarf_loader work for this CU is done.
> -			 */
> -			free(job);
> +			/* re-enqueue JOB_DECODE so that next CU is decoded from DWARF */
> +			job->type = JOB_DECODE;
> +			job->cu = NULL;
> +			cus_queue__enqueue_job(job);
>  			break;
>  
>  		default:
> @@ -3742,10 +3699,10 @@ static int dwarf_cus__process_cus(struct dwarf_cus *dcus)
>  	pthread_t workers[nr_workers];
>  	struct cu_processing_job *job;
>  
> -	cus_queue__init(nr_workers * 4);
> +	cus_queue__init(nr_workers);
>  
>  	/* fill up the queue with nr_workers JOB_DECODE jobs */
> -	for (int i = 0; i < nr_workers; i++) {
> +	for (int i = 0; i < nr_workers * 4; i++) {
>  		job = calloc(1, sizeof(*job));
>  		job->type = JOB_DECODE;
>  		/* no need for locks, workers were not started yet */
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context
  2024-12-18  0:40           ` Eduard Zingerman
  2024-12-18 20:07             ` Ihor Solodrai
@ 2024-12-19 14:59             ` Jiri Olsa
  1 sibling, 0 replies; 30+ messages in thread
From: Jiri Olsa @ 2024-12-19 14:59 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Andrii Nakryiko, Ihor Solodrai, dwarves, acme, alan.maguire,
	andrii, mykolal, bpf

On Tue, Dec 17, 2024 at 04:40:40PM -0800, Eduard Zingerman wrote:
> On Tue, 2024-12-17 at 16:03 -0800, Andrii Nakryiko wrote:
> 
> [...]
> 
> > I agree with Ihor. I think he invested a lot of time into these
> > improvements, and asking him to re-do the series just to shuffle a few
> > patches around is just an unnecessary overhead (which also delays the
> > ultimate outcome: faster BTF generation with pahole).
> 
> Patches #1-4 are good refactorings.
> Patches #5-7 introduce a global state and complication where this
> could be avoided. (These were necessary before Ihor figured out the
> trick with patch #10).
> E.g. 'elf_functions_list':
> - there is no reason for it to be global, it could be a part of the
>   encoder, as it is now;
> - there is no reason for it to be a list, encoding works with a single
>   function table and keeping it as a list just confuses reader.

agree, with just single encoder object I don't see a reason for the
btf_encoding_context, keeping functions in encoder will be simpler

thanks,
jirka

> 
> Same for btf_encoder_list_lock / btf_encoder_list.
> 
> > And as Ihor mentioned, we might improve upon this series by
> > parallelizing encoders to gain some further improvements, so I think
> > all the internal refactoring and preparations are setting up a good
> > base for further work.
> 
> Not really, the main insight found by Ihor is that parallel BTF
> encoding doesn't make much sense: constructing BTF is as cheap as
> copying it. I don't see much room for improvement here.
> 
> [...]
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 02/10] btf_encoder: separate elf function, saved function representations
  2024-12-13 22:36 ` [PATCH dwarves v2 02/10] btf_encoder: separate elf function, saved function representations Ihor Solodrai
@ 2024-12-19 14:59   ` Jiri Olsa
  0 siblings, 0 replies; 30+ messages in thread
From: Jiri Olsa @ 2024-12-19 14:59 UTC (permalink / raw)
  To: Ihor Solodrai; +Cc: dwarves, acme, alan.maguire, eddyz87, andrii, mykolal, bpf

On Fri, Dec 13, 2024 at 10:36:53PM +0000, Ihor Solodrai wrote:

SNIP

> +	btf_encoders__for_each_encoder(e) {
> +		list_for_each_entry(s, &e->func_states, node)
> +			nr_saved_fns++;
> +	}
> +
> +	if (nr_saved_fns == 0)
> +		return 0;
> +
> +	saved_fns = calloc(nr_saved_fns, sizeof(*saved_fns));
> +	btf_encoders__for_each_encoder(e) {
> +		list_for_each_entry(s, &e->func_states, node)
> +			saved_fns[i++] = s;
> +	}
> +	qsort(saved_fns, nr_saved_fns, sizeof(*saved_fns), saved_functions_cmp);
> +
> +	for (i = 0; i < nr_saved_fns; i = j) {
> +		struct btf_encoder_func_state *state = saved_fns[i];
> +
> +		/* Compare across sorted functions that match by name/prefix;
> +		 * share inconsistent/unexpected reg state between them.
> +		 */
> +		j = i + 1;
> +
> +		while (j < nr_saved_fns && saved_functions_combine(saved_fns[i], saved_fns[j]) == 0)
> +			j++;
> +
> +		/* do not exclude functions with optimized-out parameters; they
> +		 * may still be _called_ with the right parameter values, they
> +		 * just do not _use_ them.  Only exclude functions with
> +		 * unexpected register use or multiple inconsistent prototypes.
> +		 */
> +		if (!encoder->skip_encoding_inconsistent_proto ||
> +		    (!state->unexpected_reg && !state->inconsistent_proto)) {
> +			if (btf_encoder__add_func(state->encoder, state)) {
> +				free(saved_fns);
> +				return -1;
> +			}
> +		}
> +	}
> +	/* Now that we are done with function states, free them. */
> +	free(saved_fns);
> +	btf_encoders__for_each_encoder(e)
> +		btf_encoder__delete_saved_funcs(e);

nit, wrong indent

jirka

> +	return 0;
> +}
> +
>  static int btf_encoder__collect_function(struct btf_encoder *encoder, GElf_Sym *sym)

SNIP

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table
  2024-12-19 14:58   ` Jiri Olsa
@ 2024-12-19 19:06     ` Ihor Solodrai
  0 siblings, 0 replies; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-19 19:06 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: dwarves, acme, alan.maguire, eddyz87, andrii, mykolal, bpf

On Thursday, December 19th, 2024 at 6:58 AM, Jiri Olsa <olsajiri@gmail.com> wrote:

> 
> 
> On Fri, Dec 13, 2024 at 10:37:13PM +0000, Ihor Solodrai wrote:
> 
> SNIP
> 
> > @@ -2116,9 +2128,6 @@ int btf_encoder__encode(struct btf_encoder *encoder)
> > int err;
> > size_t shndx;
> > 
> > - /* for single-threaded case, saved funcs are added here */
> > - btf_encoder__add_saved_funcs(encoder);
> > -
> > for (shndx = 1; shndx < encoder->seccnt; shndx++)
> > if (gobuffer__size(&encoder->secinfo[shndx].secinfo))
> > btf_encoder__add_datasec(encoder, shndx);
> > @@ -2477,14 +2486,13 @@ struct btf_encoder *btf_encoder__new(struct cu *cu, const char *detached_filenam
> > goto out_delete;
> > }
> > 
> > - encoder->symtab = elf_symtab__new(NULL, cu->elf);
> > + encoder->functions = elf_functions__get(cu->elf);
> 
> 
> elf_functions__get should always return != NULL right? should we add assert call for that?
> 
> SNIP
> 
> > diff --git a/pahole.c b/pahole.c
> > index 17af0b4..eb2e71a 100644
> > --- a/pahole.c
> > +++ b/pahole.c
> > @@ -3185,13 +3185,16 @@ static int pahole_threads_collect(struct conf_load *conf, int nr_threads, void *
> > if (error)
> > goto out;
> > 
> > - btf_encoder__add_saved_funcs(btf_encoder);
> > + err = btf_encoder__add_saved_funcs(btf_encoder);
> > + if (err < 0)
> > + goto out;
> > +
> > for (i = 0; i < nr_threads; i++) {
> > /*
> > * Merge content of the btf instances of worker threads to the btf
> > * instance of the primary btf_encoder.
> > */
> > - if (!threads[i]->btf)
> > + if (!threads[i]->encoder || !threads[i]->btf)
> 
> 
> is this related to this change? seems like separate fix
> 
> > continue;
> > err = btf_encoder__add_encoder(btf_encoder, threads[i]->encoder);
> > if (err < 0)
> > @@ -3846,6 +3849,9 @@ try_sole_arg_as_class_names:
> > exit(1);
> > }
> > 
> > + if (conf_load.nr_jobs <= 1 || conf_load.reproducible_build)
> > + btf_encoder__add_saved_funcs(btf_encoder);
> 
> 
> should we check the return value here as well?
> 
> thanks,
> jirka

Jiri, thanks for review.

This patch is going to be removed in the next version of the series,
following a discussion with Eduard and Andrii [1].

If you're interested you can inspect WIP v3 branch on github [2].

I am still testing it, and plan to add a patch removing global
btf_encoders list. Other than that it's close to what I am going to
submit.

[1] https://lore.kernel.org/dwarves/C82bYTvJaV4bfT15o25EsBiUvFsj5eTlm17933Hvva76CXjIcu3gvpaOCWPgeZ8g3cZ-RMa8Vp0y1o_QMR2LhPB-LEUYfZCGuCfR_HvkIP8=@pm.me/
[2] https://github.com/theihor/dwarves/tree/v3.workers

> 
> > +
> > err = btf_encoder__encode(btf_encoder);
> > btf_encoder__delete(btf_encoder);
> > if (err) {
> > --
> > 2.47.1

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-19 14:59   ` Jiri Olsa
@ 2024-12-19 19:31     ` Ihor Solodrai
  2024-12-20  9:25       ` Jiri Olsa
  0 siblings, 1 reply; 30+ messages in thread
From: Ihor Solodrai @ 2024-12-19 19:31 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: dwarves, acme, alan.maguire, eddyz87, andrii, mykolal, bpf

On Thursday, December 19th, 2024 at 6:59 AM, Jiri Olsa <olsajiri@gmail.com> wrote:

> 
> 
> On Fri, Dec 13, 2024 at 10:37:34PM +0000, Ihor Solodrai wrote:
> 
> SNIP
> 
> > +static void *dwarf_loader__worker_thread(void *arg)
> > +{
> > + struct cu_processing_job *job;
> > + struct dwarf_cus *dcus = arg;
> > + bool stop = false;
> > + struct cu *cu;
> > +
> > + while (!stop) {
> > + job = cus_queue__dequeue_job();
> > +
> > + switch (job->type) {
> > +
> > + case JOB_DECODE:
> > + cu = dwarf_loader__decode_next_cu(dcus);
> > +
> > + if (cu == NULL) {
> > + free(job);
> > + stop = true;
> > + break;
> > + }
> > +
> > + /* Create and enqueue a new JOB_STEAL for this decoded CU */
> > + struct cu_processing_job *steal_job = calloc(1, sizeof(*steal_job));
> 
> 
> missing steal_job != NULL check

In the next version, job objects are allocated only by the main thread
and are reused when enqueued [1].

> 
> SNIP
> 
> > -static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
> > +static int dwarf_cus__process_cus(struct dwarf_cus *dcus)
> > {
> > - pthread_t threads[dcus->conf->nr_jobs];
> > - struct dwarf_thread dthr[dcus->conf->nr_jobs];
> > - void *thread_data[dcus->conf->nr_jobs];
> > - int res;
> > - int i;
> > + int nr_workers = dcus->conf->nr_jobs > 0 ? dcus->conf->nr_jobs : 1;
> > + pthread_t workers[nr_workers];
> > + struct cu_processing_job *job;
> > 
> > - if (dcus->conf->threads_prepare) {
> > - res = dcus->conf->threads_prepare(dcus->conf, dcus->conf->nr_jobs, thread_data);
> > - if (res != 0)
> > - return res;
> > - } else {
> > - memset(thread_data, 0, sizeof(void *) * dcus->conf->nr_jobs);
> > + cus_queue__init(nr_workers * 4);
> 
> 
> why '* 4' ?

This is an arbitrary limit, described in comments.

If we allow the workers to pick up next cu for decoding as soon as
it's ready, then the memory usage may greatly increase, if the stealer
can't keep up with incoming work.

If we want to avoid this there needs to be a limit on how many
decoded, but not yet stolen, CUs we allow to hold in memory. When
this limit is reached the workers will wait for more CUs to get
stolen.

N x 4 is a number I picked after trying various values and looking at
the resulting memory usage.

We could make it configurable, but this value doesn't look to me as a
reasonable user-facing option. Maybe we could add "I don't care about
memory usage" flag to pahole? wdyt?

> 
> > +
> > + /* fill up the queue with nr_workers JOB_DECODE jobs */
> > + for (int i = 0; i < nr_workers; i++) {
> > + job = calloc(1, sizeof(*job));
> 
> 
> missing job != NULL check
> 
> > + job->type = JOB_DECODE;
> > + /* no need for locks, workers were not started yet */
> > + list_add(&job->node, &cus_processing_queue.jobs);
> > }
> > 
> > - for (i = 0; i < dcus->conf->nr_jobs; ++i) {
> > - dthr[i].dcus = dcus;
> > - dthr[i].data = thread_data[i];
> > + if (dcus->error)
> > + return dcus->error;
> > 
> > - dcus->error = pthread_create(&threads[i], NULL,
> > - dwarf_cus__process_cu_thread,
> > - &dthr[i]);
> > + for (int i = 0; i < nr_workers; ++i) {
> > + dcus->error = pthread_create(&workers[i], NULL,
> > + dwarf_loader__worker_thread,
> > + dcus);
> > if (dcus->error)
> > goto out_join;
> > }
> > @@ -3596,54 +3766,19 @@ static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
> > dcus->error = 0;
> > 
> > out_join:
> > - while (--i >= 0) {
> > + for (int i = 0; i < nr_workers; ++i) {
> 
> 
> I think you should keep the original while loop to cleanup/wait only for
> threads that we actually created

Do you mean in case of an error from pthread_create? Ok.
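The pattern Jiri is asking for, joining only the threads that were
actually created, looks roughly like this (a simplified sketch, not the
exact pahole code):

```c
#include <pthread.h>

/* Sketch: if pthread_create() fails at index i, threads [0, i)
 * exist but workers[i] does not, so walking backwards from i
 * joins exactly the threads that were created. */
static int start_workers(pthread_t *workers, int nr_workers,
			 void *(*fn)(void *), void *arg)
{
	int err = 0, i;

	for (i = 0; i < nr_workers; ++i) {
		err = pthread_create(&workers[i], NULL, fn, arg);
		if (err)
			break;	/* workers[i] was NOT created */
	}

	/* Join only the successfully created threads; on full
	 * success this joins all nr_workers of them. */
	while (--i >= 0)
		pthread_join(workers[i], NULL);

	return err;
}
```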

> 
> > void *res;
> > - int err = pthread_join(threads[i], &res);
> > + int err = pthread_join(workers[i], &res);
> > 
> > if (err == 0 && res != NULL)
> > dcus->error = (long)res;
> > }
> > 
> > - if (dcus->conf->threads_collect) {
> > - res = dcus->conf->threads_collect(dcus->conf, dcus->conf->nr_jobs,
> > - thread_data, dcus->error);
> > - if (dcus->error == 0)
> > - dcus->error = res;
> > - }
> > + cus_queue__destroy();
> > 
> > return dcus->error;
> > }
> 
> 
> SNIP

[1] https://github.com/acmel/dwarves/commit/5278adbe5cb796c7baafb110d8c5cda107ec9d68#diff-77fe7eedce76594a6c71363b22832723fc086a8e72debd0370763e4193704b1eR3706-R3708


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model
  2024-12-19 19:31     ` Ihor Solodrai
@ 2024-12-20  9:25       ` Jiri Olsa
  0 siblings, 0 replies; 30+ messages in thread
From: Jiri Olsa @ 2024-12-20  9:25 UTC (permalink / raw)
  To: Ihor Solodrai
  Cc: Jiri Olsa, dwarves, acme, alan.maguire, eddyz87, andrii, mykolal,
	bpf

On Thu, Dec 19, 2024 at 07:31:35PM +0000, Ihor Solodrai wrote:

SNIP

> > 
> > 
> > why '* 4' ?
> 
> This is an arbitrary limit, described in comments.
> 
> If we allow the workers to pick up the next CU for decoding as soon as
> it's ready, memory usage may grow significantly if the stealer can't
> keep up with the incoming work.
> 
> To avoid this, there needs to be a limit on how many decoded, but not
> yet stolen, CUs we hold in memory. When this limit is reached, the
> workers wait for more CUs to get stolen.
> 
> N x 4 is a number I picked after trying various values and looking at
> the resulting memory usage.

I think we can pick some number and add reasoning to the comment

> 
> We could make it configurable, but this value doesn't look to me like a
> reasonable user-facing option. Maybe we could add an "I don't care
> about memory usage" flag to pahole? wdyt?

--I-don-t-care-about-memory-usage sounds great :-) but I think a constant
with a comment will be enough for now, and we'll see if we need it in
the future


> 
> > 
> > > +
> > > + /* fill up the queue with nr_workers JOB_DECODE jobs */
> > > + for (int i = 0; i < nr_workers; i++) {
> > > + job = calloc(1, sizeof(*job));
> > 
> > 
> > missing job != NULL check
> > 
> > > + job->type = JOB_DECODE;
> > > + /* no need for locks, workers were not started yet */
> > > + list_add(&job->node, &cus_processing_queue.jobs);
> > > }
> > > 
> > > - for (i = 0; i < dcus->conf->nr_jobs; ++i) {
> > > - dthr[i].dcus = dcus;
> > > - dthr[i].data = thread_data[i];
> > > + if (dcus->error)
> > > + return dcus->error;
> > > 
> > > - dcus->error = pthread_create(&threads[i], NULL,
> > > - dwarf_cus__process_cu_thread,
> > > - &dthr[i]);
> > > + for (int i = 0; i < nr_workers; ++i) {
> > > + dcus->error = pthread_create(&workers[i], NULL,
> > > + dwarf_loader__worker_thread,
> > > + dcus);
> > > if (dcus->error)
> > > goto out_join;
> > > }
> > > @@ -3596,54 +3766,19 @@ static int dwarf_cus__threaded_process_cus(struct dwarf_cus *dcus)
> > > dcus->error = 0;
> > > 
> > > out_join:
> > > - while (--i >= 0) {
> > > + for (int i = 0; i < nr_workers; ++i) {
> > 
> > 
> > I think you should keep the original while loop to cleanup/wait only for
> > threads that we actually created
> 
> Do you mean in case of an error from pthread_create? Ok.

yes, thanks

jirka

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding
  2024-12-17  7:00 ` [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Eduard Zingerman
@ 2024-12-20 12:31   ` Jiri Olsa
  0 siblings, 0 replies; 30+ messages in thread
From: Jiri Olsa @ 2024-12-20 12:31 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Ihor Solodrai, dwarves, acme, alan.maguire, andrii, mykolal, bpf

On Mon, Dec 16, 2024 at 11:00:08PM -0800, Eduard Zingerman wrote:
> On Fri, 2024-12-13 at 22:36 +0000, Ihor Solodrai wrote:
> 
> for allyesconfig, with your patch-set I get the following stats:
> 
> jobs 1, mem 7080048 Kb, time 97.24 sec
> jobs 3, mem 7091360 Kb, time 60.10 sec
> jobs 6, mem 7153848 Kb, time 49.73 sec
> jobs 12, mem 7264036 Kb, time 54.67 sec
> 
> w/o your patch-set, using current pahole 'next', I run out of memory
> both with and without the reproducible_build flag.
> 
> The vmlinux size is 7.6G.
> 
> 

I did a quick test with a vmlinux of size 11.8G and it looks great

current pahole (v1.28 tag):

	 Performance counter stats for '/home/jolsa/kernel/pahole/build/pahole -J -j17 --btf_features=encode_force,
	 var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs --lang_exclude=rust vmlinux':

	     178.081877484 seconds time elapsed

	     479.529725000 seconds user
	      26.583607000 seconds sys


new pahole:

	 Performance counter stats for '/home/jolsa/kernel/pahole/build/pahole -J -j17 --btf_features=encode_force,
	 var,float,enum64,decl_tag,type_tag,optimized_func,consistent_func,decl_tag_kfuncs --lang_exclude=rust vmlinux':

	     111.976182062 seconds time elapsed

	     242.342891000 seconds user
	      24.254289000 seconds sys

thanks,
jirka

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-12-20 12:31 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-13 22:36 [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Ihor Solodrai
2024-12-13 22:36 ` [PATCH dwarves v2 01/10] btf_encoder: simplify function encoding Ihor Solodrai
2024-12-13 22:36 ` [PATCH dwarves v2 02/10] btf_encoder: separate elf function, saved function representations Ihor Solodrai
2024-12-19 14:59   ` Jiri Olsa
2024-12-13 22:36 ` [PATCH dwarves v2 03/10] dwarf_loader: introduce pre_load_module hook to conf_load Ihor Solodrai
2024-12-13 22:37 ` [PATCH dwarves v2 04/10] btf_encoder: introduce elf_functions struct type Ihor Solodrai
2024-12-13 22:37 ` [PATCH dwarves v2 05/10] btf_encoder: collect elf_functions in btf_encoder__pre_load_module Ihor Solodrai
2024-12-13 22:37 ` [PATCH dwarves v2 06/10] btf_encoder: switch to shared elf_functions table Ihor Solodrai
2024-12-19 14:58   ` Jiri Olsa
2024-12-19 19:06     ` Ihor Solodrai
2024-12-13 22:37 ` [PATCH dwarves v2 07/10] btf_encoder: introduce btf_encoding_context Ihor Solodrai
2024-12-17  2:39   ` Eduard Zingerman
2024-12-17  3:15     ` Eduard Zingerman
2024-12-17 18:06       ` Ihor Solodrai
2024-12-18  0:03         ` Andrii Nakryiko
2024-12-18  0:40           ` Eduard Zingerman
2024-12-18 20:07             ` Ihor Solodrai
2024-12-19 14:59             ` Jiri Olsa
2024-12-13 22:37 ` [PATCH dwarves v2 08/10] btf_encoder: remove skip_encoding_inconsistent_proto Ihor Solodrai
2024-12-13 22:37 ` [PATCH dwarves v2 09/10] dwarf_loader: introduce cu->id Ihor Solodrai
2024-12-13 22:37 ` [PATCH dwarves v2 10/10] dwarf_loader: multithreading with a job/worker model Ihor Solodrai
2024-12-17  0:57   ` Eduard Zingerman
2024-12-17 18:12     ` Ihor Solodrai
2024-12-19 14:59     ` Jiri Olsa
2024-12-17  2:14   ` Eduard Zingerman
2024-12-19 14:59   ` Jiri Olsa
2024-12-19 19:31     ` Ihor Solodrai
2024-12-20  9:25       ` Jiri Olsa
2024-12-17  7:00 ` [PATCH dwarves v2 00/10] pahole: shared ELF and faster reproducible BTF encoding Eduard Zingerman
2024-12-20 12:31   ` Jiri Olsa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox