* Linking BTF
@ 2025-07-16 15:15 Nick Alcock
2025-07-16 21:52 ` Eduard Zingerman
0 siblings, 1 reply; 4+ messages in thread
From: Nick Alcock @ 2025-07-16 15:15 UTC (permalink / raw)
To: dwarves, bpf, Arnaldo Carvalho de Melo, alexei.starovoitov,
andrii.nakryiko, alan.maguire, stephen.brennan, jose.marchesi,
david.faust, elena.zannoni, bruce.mcculloch
So I'm working on a scheme where the compiler generates BTF for object files
it generates, for later consumption by pahole after deduplication. (And I
have a proof of concept[1], already posted[2]). But doing this sort of thing
rather than post-facto generation from DWARF raises one obvious question: if
we're emitting BTF into multiple object files, what do we do with it when we
link them together?
There seem to me to be three options: an ugly but simple one, a maximally
ELFish one and a third option that is a sort of middle route. Based on some
investigations we did, I think I know which one makes the most sense, but I
could well be wrong: please critique.
- The simple route is just to let the linker do its thing like it does for
every other section it doesn't know about, and blindly concatenate the
contents. I hope it's obvious why this is less than ideal: the result
would soon get enormous, and every BTF-reading tool would need adjusting
to allow for the fact that rather than one pile of BTF in a .BTF section,
you might have many concatenated together, with no way to tell which
input file each one referred to! You can't even save much space by
compressing the result compared to compressing the inputs' .BTF
individually (I tried it, the individual input files' chunks are usually
too far apart to be caught by the same compression dictionary).
- The maximally ELFish one is to put each .BTF into its own per-input-file
section, named after the input file. Since this requires at least a bit
of special linker handling anyway you can make it do more, and
deduplicate at the same time, turning the result into split BTF and
putting shared things into a .BTF section.
This... feels like it would be really nice, and I tried to implement it,
but at least in GNU ld falls foul of a deeply embedded architectural
limitation, and as far as I can see from a brief look at lld, it shares
this. You don't get an output section to deduplicate anything into until
after the linker has figured out what output sections exist, and it only
does this in one place: you can't go back and add more after the
fact. This means that we could only deduplicate and add sections freely
before the output sections are laid out -- when we have nowhere to put
it, and honestly we might not even have acquired all the input sections
at this point, since that is partly interleaved with output section
layout. Essentially, it is incredibly hard to have output sections whose
names depend on the contents of more than one input section, and that's
what any plausible deduplicator is going to want to do.
So this is really only useful if we're doing what ELF usually does, which
is to copy input sections into output sections without modification, or
with at most small changes (like relocs) that don't change sizes.
I note that DWARF emission is special-cased in ld in part for this
reason, and even *that* only emits a fixed set of sections rather than a
potentially unbounded one. We should probably try to emit a fixed set
too.
- So... a third option, which is probably the most BTFish because it's
something BTF already does, in a sense: put everything in one section,
call it .BTF or .BTFA or whatever, and make that section an archive of
named BTF members, and then stuff however many BTF outputs the
deduplication generates (or none, if we're just stuffing inputs into
outputs without dedupping) into archive members.
So, here's a possibility which seems to provide the latter option while
still letting existing tools read the first member (likely vmlinux):
The idea is that we add a *next member link field* in the BTF header, and a
name (a strtab offset). The next member link field is an end-of-header-
relative offset just like most of the other header fields, which chains BTF
members together in a linked list:
parent BTF
|
v
children BTF -> BTF -> BTF -> ... -> BTF
The parent is always first in the list.
This has the notable advantage that existing BTF tools understand it without
change: they see only the parent (but since the parent is vmlinux, this is
probably enough for most existing tools that don't have to deal with
modules, and of course that's enough for existing tools working over actual
modules, which won't need archives at all). We give members a name because
with the exception of the parent we do want to be able to distinguish the
members from each other: we need to know which module, or translation unit,
or whatever each individual member relates to. The parent probably doesn't
need a name (it's always "vmlinux" or "shared types" or something), so it
can just use 0.
The proof of concept I posted earlier does not understand this format yet,
but only an earlier version of the archive format used in the
proof-of-concept; the scheme above was invented later. Of course I plan to
teach the proof of concept (and upstream binutils) about the new format too,
once we agree.
There is one big change caused by using any format that involves more than
one BTF dictionary like this: external references to types become harder. If
all you have is a straight .BTF section, you can refer to a type with its ID
and it is unambiguous. If you have a bunch of them, suddenly you need a pair
of (member, type ID)! The ability to refer to types via a small fixed-size
native-type token of some kind is extremely desirable and I do not want to
lose it. But if we consider the linked list above to be an array (and
looking members by integer is something libctf will make easy in the near
future), we can just make 64-bit "archive BTF IDs" where the top 32 bits is
the index of the individual BTF: types in the parent, with an index of 0,
just get a bigger type with no change in value. The kernel apparently does
something like this internally already.
Doing this means we keep the ability to refer to BTF types *in any module*
from other sections, even if they're stored together as an archive like this,
which is how nearly all extensions to BTF are structured anyway and which
seems like obviously the right way to do things: I'm thinking up ways to
turn most remaining CTF extensions to BTF into that sort of external table
already.
[1] The proof-of-concept is here:
(binutils) https://sourceware.org/git/?p=binutils-gdb.git;a=log;h=refs/heads/users/nalcock/archive-v2/road-to-ctfv4
(kernel) https://github.com/nickalcock/linux/tree/nix/btfa
[2] mail about it: https://lore.kernel.org/dwarves/87ldqf1i19.fsf@esperi.org.uk/T/#u
(note that the branch name in this mail is out of date)
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Linking BTF
2025-07-16 15:15 Linking BTF Nick Alcock
@ 2025-07-16 21:52 ` Eduard Zingerman
2025-07-17 7:40 ` Jose E. Marchesi
0 siblings, 1 reply; 4+ messages in thread
From: Eduard Zingerman @ 2025-07-16 21:52 UTC (permalink / raw)
To: Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo,
alexei.starovoitov, andrii.nakryiko, alan.maguire,
stephen.brennan, jose.marchesi, david.faust, elena.zannoni,
bruce.mcculloch
On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:
[...]
> - So... a third option, which is probably the most BTFish because it's
> something BTF already does, in a sense: put everything in one section,
> call it .BTF or .BTFA or whatever, and make that section an archive of
> named BTF members, and then stuff however many BTF outputs the
> deduplication generates (or none, if we're just stuffing inputs into
> outputs without dedupping) into archive members.
>
> So, here's a possibility which seems to provide the latter option while
> still letting existing tools read the first member (likely vmlinux):
>
> The idea is that we add a *next member link field* in the BTF header, and a
> name (a strtab offset). The next member link field is an end-of-header-
> relative offset just like most of the other header fields, which chains BTF
> members together in a linked list:
>
> parent BTF
> |
> v
> children BTF -> BTF -> BTF -> ... -> BTF
>
> The parent is always first in the list.
Hi Nick,
You are talking about BTF section embedded in a final vmlinux binary, right?
Could you please elaborate a bit on why do you need multiple members
within this section (in the context of your third option)?
I re-read the email but don't get it :(
[...]
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Linking BTF
2025-07-16 21:52 ` Eduard Zingerman
@ 2025-07-17 7:40 ` Jose E. Marchesi
2025-07-17 11:52 ` Nick Alcock
0 siblings, 1 reply; 4+ messages in thread
From: Jose E. Marchesi @ 2025-07-17 7:40 UTC (permalink / raw)
To: Eduard Zingerman
Cc: Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo,
alexei.starovoitov, andrii.nakryiko, alan.maguire,
stephen.brennan, david.faust, elena.zannoni, bruce.mcculloch
> On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:
>
> [...]
>
>> - So... a third option, which is probably the most BTFish because it's
>> something BTF already does, in a sense: put everything in one section,
>> call it .BTF or .BTFA or whatever, and make that section an archive of
>> named BTF members, and then stuff however many BTF outputs the
>> deduplication generates (or none, if we're just stuffing inputs into
>> outputs without dedupping) into archive members.
>>
>> So, here's a possibility which seems to provide the latter option while
>> still letting existing tools read the first member (likely vmlinux):
>>
>> The idea is that we add a *next member link field* in the BTF header, and a
>> name (a strtab offset). The next member link field is an end-of-header-
>> relative offset just like most of the other header fields, which chains BTF
>> members together in a linked list:
>>
>> parent BTF
>> |
>> v
>> children BTF -> BTF -> BTF -> ... -> BTF
>>
>> The parent is always first in the list.
>
> Hi Nick,
>
> You are talking about BTF section embedded in a final vmlinux binary, right?
More generally, a section embedded in any object which is the result of
linking two or more objects having .BTF sections:
ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF)
This covers the particular vmlinux case I think.
> Could you please elaborate a bit on why do you need multiple members
> within this section (in the context of your third option)?
> I re-read the email but don't get it :(
As I understand it:
The linker deduplicates types in the set of input .BTF sections. This
means that when linking foo.o and bar.o, if both compilation units refer
to a type 'quux', there are two possibilities:
a) The type 'quux' is the same (using C type equivalence rules) in both
compilation units. Then the type is "shared" and the linker puts it
only once in the first output BTF member in baz.o .BTF, the "parent".
b) The type 'quux' is different in both compilation units. These are
then conflicting types. Then two versions the type, foo.quux and
bar.quux, are placed by the linker in the corresponding "children"
member in baz.o.
Graphically, the .BTF section in a linked binary would contain a
one-level tree of members, with as many children as input compilation
units :
parent (common types)
|
+--- child1 (types only in child1)
+--- child2 (types only in child2)
.
+--- childN (types only in childN)
Hope this makes sense. Nick should be able to explain it better than I
do.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Linking BTF
2025-07-17 7:40 ` Jose E. Marchesi
@ 2025-07-17 11:52 ` Nick Alcock
0 siblings, 0 replies; 4+ messages in thread
From: Nick Alcock @ 2025-07-17 11:52 UTC (permalink / raw)
To: Jose E. Marchesi
Cc: Eduard Zingerman, Nick Alcock, dwarves, bpf,
Arnaldo Carvalho de Melo, alexei.starovoitov, andrii.nakryiko,
alan.maguire, stephen.brennan, david.faust, elena.zannoni,
bruce.mcculloch
On 17 Jul 2025, Jose E. Marchesi outgrape:
>
>> On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:
>>
>> [...]
>>
>>> - So... a third option, which is probably the most BTFish because it's
>>> something BTF already does, in a sense: put everything in one section,
>>> call it .BTF or .BTFA or whatever, and make that section an archive of
>>> named BTF members, and then stuff however many BTF outputs the
>>> deduplication generates (or none, if we're just stuffing inputs into
>>> outputs without dedupping) into archive members.
>>>
>>> So, here's a possibility which seems to provide the latter option while
>>> still letting existing tools read the first member (likely vmlinux):
>>>
>>> The idea is that we add a *next member link field* in the BTF header, and a
>>> name (a strtab offset). The next member link field is an end-of-header-
>>> relative offset just like most of the other header fields, which chains BTF
>>> members together in a linked list:
>>>
>>> parent BTF
>>> |
>>> v
>>> children BTF -> BTF -> BTF -> ... -> BTF
>>>
>>> The parent is always first in the list.
>>
>> Hi Nick,
>>
>> You are talking about BTF section embedded in a final vmlinux binary, right?
>
> More generally, a section embedded in any object which is the result of
> linking two or more objects having .BTF sections:
>
> ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF)
>
> This covers the particular vmlinux case I think.
Yes, though I wasn't expecting to see this in vmlinux yet! It might
happen in the end. What this is used for is *communicating with pahole*:
the .btfa file pahole receives is one of these, containing deduplicated
BTF for the entire kernel plus all modules, and it's then up to pahole
what to do with it.
In userspace links (and in intermediate links of multifile kernel
modules, used only as input to the btfarchive deduplicator), we do see
this sort of thing heavily.
>> Could you please elaborate a bit on why do you need multiple members
>> within this section (in the context of your third option)?
>> I re-read the email but don't get it :(
>
> As I understand it:
>
> The linker deduplicates types in the set of input .BTF sections. This
> means that when linking foo.o and bar.o, if both compilation units refer
> to a type 'quux', there are two possibilities:
>
> a) The type 'quux' is the same (using C type equivalence rules) in both
> compilation units. Then the type is "shared" and the linker puts it
> only once in the first output BTF member in baz.o .BTF, the "parent".
>
> b) The type 'quux' is different in both compilation units. These are
> then conflicting types. Then two versions the type, foo.quux and
> bar.quux, are placed by the linker in the corresponding "children"
> member in baz.o.
Yes. (We don't really quite use C type equivalence rules -- we're
pickier, since types can be assignment-compatible but still different,
and we want to preserve that difference. But that's nitpicking.)
This happens really quite a lot in the kernel (I was surprised how
often). It happens even more in userspace, sometimes to an almost
pathological degree (hello, Ghostscript). LTO may make its prevalence
lower in the future, but I doubt this sort of thing will ever go away:
it's still with us in C++ programs, and there it's outright undefined
behaviour!
> Graphically, the .BTF section in a linked binary would contain a
> one-level tree of members, with as many children as input compilation
> units :
>
> parent (common types)
> |
> +--- child1 (types only in child1)
> +--- child2 (types only in child2)
> .
> +--- childN (types only in childN)
>
> Hope this makes sense. Nick should be able to explain it better than I
> do.
There are really two cases, because the purpose of "being a child" is
sort of overloaded. The kernel is, as ever, different...
- Kernel-style builds (the traditional BTF case):
vmlinux (parent) (common types, any types shared by more than one module)
+--- child1.ko (types only in child1)
+--- child2.ko (types only in child2)
.
+--- childN.ko (types only in childN)
Notably, if a type differs (conflicts) across translation units, and
all those translation units are in the core kernel, we can't put them
in children because none of them are in modules, and children are
reserved for modules: so we actually emit them as "hidden types" (a
concept BTF doesn't have and that I am not currently proposing, which
lets us say "this type is not visible in any namespaces, here's the
name of the translation unit it was found in"). The same applies if a
type differs within one module.
If a type has conflicting definitions in two distinct modules, we can
indeed just emit them into each module in turn. Also, if a type has
one definition in a lot of modules and then a different one in one or
two, we realise that the first definition is "most popular" and emit
it into the parent, then emit the conflicting one into the few
per-module children it is found in.
Types that are used only by one module are placed in that per-module
child, both because that's what pahole has always done and because it
makes sense for a loosely-coupled project like the kernel not to
clutter vmlinux up with thousands of types for huge modules like
amdgpu that might never even be loaded.
I am not expecting pahole to preserve hidden types, at least not yet
(BTF has no way to encode them and no consumer understands them), but
it can see them on its input, so it might use hiddenness as a flag
that "hey, this type is conflicting, take care with everything with
the same name" or something. The concept is not useless even if pahole
largely ignores it: it does at least preserve the type graph and
ensure that any type that refers to a conflicting type still refers to
it after deduplication: it doesn't end up pointing at some other type
with the same name.
e.g. if we have these two TUs in the core kernel:
a.c:struct foo { int a; };
struct bar { struct foo baz; };
b.c:struct foo { long a; }; /* Different! */
struct bar { struct foo baz; };
one struct foo (the least-referenced one) will wind up hidden, but the
struct bar in that same TU will *still point at the hidden type*. Both
types are *still there* and we don't end up pointing at the same
struct foo from both struct bars.
- For normal ELF links outside the kernel, the model above doesn't
really make sense. Most programs don't have a concept like kernel
modules, and most programs are more tightly coupled, so you want to
see as many types as possible. So for those, the distribution is like
this:
parent (all types that are not conflicting)
+--- child1.c (conflicting types defined in child1.c)
+--- child2.c (conflicting types defined in child2.c)
.
+--- childN.c (conflicting types defined in child3.c)
i.e., conflicting types are placed into children that are named after
the translation units they come from. Within those dictionaries, there
are no hidden types and there is no possibility of conflict; the
shared parent corresponds to "all TUs together" and there can be no
conflicts there either.
In many ways this is a simpler model, but it just won't cut it for the
kernel.
We could in the end combine the two schemes, producing a multilevel
tree, so that each module, and the core kernel, could contain an archive
like userspace links do, with each conflicting type hived off into its
own translation unit. This is *definitely* more work, and would probably
require consumer changes too. I am not proposing it, at least not yet.
But it shows where we could end up:
vmlinux (parent) (common types, any types shared by more than one module)
+--- core1a.c (conflicting types defined in core1a.c)...
...
+--- child1.ko (types found only in child1)
+-- child1a.c (conflicting types defined in child1a.c)
+-- child1b.b (conflicting types defined in child1a.c)
+--- child2.ko (types only in child2)
.
+--- childN.ko (types only in childN)
The distinction between the two link types above is largely controlled
via this linker option in GNU ld:
--ctf-share-types=<method> How to share CTF types between translation units.
<method> is: share-unconflicted (default),
share-duplicated
The final stage of kernel deduplication (the btfarchive tool) uses
share-duplicated mode (and extra stuff to smush multiple translation
units together into modules).
(that's from current upstream master: obviously I'll have to find some
way to say --ctf-or-btf without making it too verbose :) maybe I could
just add a --btf-share-types as a synonym?)
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-07-17 11:52 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-16 15:15 Linking BTF Nick Alcock
2025-07-16 21:52 ` Eduard Zingerman
2025-07-17 7:40 ` Jose E. Marchesi
2025-07-17 11:52 ` Nick Alcock
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox