* Linking BTF
@ 2025-07-16 15:15 Nick Alcock
2025-07-16 21:52 ` Eduard Zingerman
0 siblings, 1 reply; 4+ messages in thread
From: Nick Alcock @ 2025-07-16 15:15 UTC (permalink / raw)
To: dwarves, bpf, Arnaldo Carvalho de Melo, alexei.starovoitov,
andrii.nakryiko, alan.maguire, stephen.brennan, jose.marchesi,
david.faust, elena.zannoni, bruce.mcculloch
So I'm working on a scheme where the compiler generates BTF for object files
it generates, for later consumption by pahole after deduplication. (And I
have a proof of concept[1], already posted[2]). But doing this sort of thing
rather than post-facto generation from DWARF raises one obvious question: if
we're emitting BTF into multiple object files, what do we do with it when we
link them together?
There seem to me to be three options: an ugly but simple one, a maximally
ELFish one and a third option that is a sort of middle route. Based on some
investigations we did, I think I know which one makes the most sense, but I
could well be wrong: please critique.
- The simple route is just to let the linker do its thing like it does for
every other section it doesn't know about, and blindly concatenate the
contents. I hope it's obvious why this is less than ideal: the result
would soon get enormous, and every BTF-reading tool would need adjusting
to allow for the fact that rather than one pile of BTF in a .BTF section,
you might have many concatenated together, with no way to tell which
input file each one referred to! You can't even save much space by
compressing the result compared to compressing the inputs' .BTF
individually (I tried it, the individual input files' chunks are usually
too far apart to be caught by the same compression dictionary).
- The maximally ELFish one is to put each .BTF into its own per-input-file
section, named after the input file. Since this requires at least a bit
of special linker handling anyway you can make it do more, and
deduplicate at the same time, turning the result into split BTF and
putting shared things into a .BTF section.
This... feels like it would be really nice, and I tried to implement it,
but at least in GNU ld falls foul of a deeply embedded architectural
limitation, and as far as I can see from a brief look at lld, it shares
this. You don't get an output section to deduplicate anything into until
after the linker has figured out what output sections exist, and it only
does this in one place: you can't go back and add more after the
fact. This means that we could only deduplicate and add sections freely
before the output sections are laid out -- when we have nowhere to put
it, and honestly we might not even have acquired all the input sections
at this point, since that is partly interleaved with output section
layout. Essentially, it is incredibly hard to have output sections whose
names depend on the contents of more than one input section, and that's
what any plausible deduplicator is going to want to do.
So this is really only useful if we're doing what ELF usually does, which
is to copy input sections into output sections without modification, or
with at most small changes (like relocs) that don't change sizes.
I note that DWARF emission is special-cased in ld in part for this
reason, and even *that* only emits a fixed set of sections rather than a
potentially unbounded one. We should probably try to emit a fixed set
too.
- So... a third option, which is probably the most BTFish because it's
something BTF already does, in a sense: put everything in one section,
call it .BTF or .BTFA or whatever, and make that section an archive of
named BTF members, and then stuff however many BTF outputs the
deduplication generates (or none, if we're just stuffing inputs into
outputs without dedupping) into archive members.
So, here's a possibility which seems to provide the latter option while
still letting existing tools read the first member (likely vmlinux):
The idea is that we add a *next member link field* in the BTF header, and a
name (a strtab offset). The next member link field is an end-of-header-
relative offset just like most of the other header fields, which chains BTF
members together in a linked list:
parent BTF
|
v
children BTF -> BTF -> BTF -> ... -> BTF
The parent is always first in the list.
This has the notable advantage that existing BTF tools understand it without
change: they see only the parent (but since the parent is vmlinux, this is
probably enough for most existing tools that don't have to deal with
modules, and of course that's enough for existing tools working over actual
modules, which won't need archives at all). We give members a name because
with the exception of the parent we do want to be able to distinguish the
members from each other: we need to know which module, or translation unit,
or whatever each individual member relates to. The parent probably doesn't
need a name (it's always "vmlinux" or "shared types" or something), so it
can just use 0.
The proof of concept I posted earlier does not understand this format yet,
but only an earlier version of the archive format used in the
proof-of-concept; the scheme above was invented later. Of course I plan to
teach the proof of concept (and upstream binutils) about the new format too,
once we agree.
There is one big change caused by using any format that involves more than
one BTF dictionary like this: external references to types become harder. If
all you have is a straight .BTF section, you can refer to a type with its ID
and it is unambiguous. If you have a bunch of them, suddenly you need a pair
of (member, type ID)! The ability to refer to types via a small fixed-size
native-type token of some kind is extremely desirable and I do not want to
lose it. But if we consider the linked list above to be an array (and
looking members by integer is something libctf will make easy in the near
future), we can just make 64-bit "archive BTF IDs" where the top 32 bits is
the index of the individual BTF: types in the parent, with an index of 0,
just get a bigger type with no change in value. The kernel apparently does
something like this internally already.
Doing this means we keep the ability to refer to BTF types *in any module*
from other sections, even if they're stored together as an archive like this,
which is how nearly all extensions to BTF are structured anyway and which
seems like obviously the right way to do things: I'm thinking up ways to
turn most remaining CTF extensions to BTF into that sort of external table
already.
[1] The proof-of-concept is here:
(binutils) https://sourceware.org/git/?p=binutils-gdb.git;a=log;h=refs/heads/users/nalcock/archive-v2/road-to-ctfv4
(kernel) https://github.com/nickalcock/linux/tree/nix/btfa
[2] mail about it: https://lore.kernel.org/dwarves/87ldqf1i19.fsf@esperi.org.uk/T/#u
(note that the branch name in this mail is out of date)
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: Linking BTF 2025-07-16 15:15 Linking BTF Nick Alcock @ 2025-07-16 21:52 ` Eduard Zingerman 2025-07-17 7:40 ` Jose E. Marchesi 0 siblings, 1 reply; 4+ messages in thread From: Eduard Zingerman @ 2025-07-16 21:52 UTC (permalink / raw) To: Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo, alexei.starovoitov, andrii.nakryiko, alan.maguire, stephen.brennan, jose.marchesi, david.faust, elena.zannoni, bruce.mcculloch On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote: [...] > - So... a third option, which is probably the most BTFish because it's > something BTF already does, in a sense: put everything in one section, > call it .BTF or .BTFA or whatever, and make that section an archive of > named BTF members, and then stuff however many BTF outputs the > deduplication generates (or none, if we're just stuffing inputs into > outputs without dedupping) into archive members. > > So, here's a possibility which seems to provide the latter option while > still letting existing tools read the first member (likely vmlinux): > > The idea is that we add a *next member link field* in the BTF header, and a > name (a strtab offset). The next member link field is an end-of-header- > relative offset just like most of the other header fields, which chains BTF > members together in a linked list: > > parent BTF > | > v > children BTF -> BTF -> BTF -> ... -> BTF > > The parent is always first in the list. Hi Nick, You are talking about BTF section embedded in a final vmlinux binary, right? Could you please elaborate a bit on why do you need multiple members within this section (in the context of your third option)? I re-read the email but don't get it :( [...] ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Linking BTF 2025-07-16 21:52 ` Eduard Zingerman @ 2025-07-17 7:40 ` Jose E. Marchesi 2025-07-17 11:52 ` Nick Alcock 0 siblings, 1 reply; 4+ messages in thread From: Jose E. Marchesi @ 2025-07-17 7:40 UTC (permalink / raw) To: Eduard Zingerman Cc: Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo, alexei.starovoitov, andrii.nakryiko, alan.maguire, stephen.brennan, david.faust, elena.zannoni, bruce.mcculloch > On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote: > > [...] > >> - So... a third option, which is probably the most BTFish because it's >> something BTF already does, in a sense: put everything in one section, >> call it .BTF or .BTFA or whatever, and make that section an archive of >> named BTF members, and then stuff however many BTF outputs the >> deduplication generates (or none, if we're just stuffing inputs into >> outputs without dedupping) into archive members. >> >> So, here's a possibility which seems to provide the latter option while >> still letting existing tools read the first member (likely vmlinux): >> >> The idea is that we add a *next member link field* in the BTF header, and a >> name (a strtab offset). The next member link field is an end-of-header- >> relative offset just like most of the other header fields, which chains BTF >> members together in a linked list: >> >> parent BTF >> | >> v >> children BTF -> BTF -> BTF -> ... -> BTF >> >> The parent is always first in the list. > > Hi Nick, > > You are talking about BTF section embedded in a final vmlinux binary, right? More generally, a section embedded in any object which is the result of linking two or more objects having .BTF sections: ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF) This covers the particular vmlinux case I think. > Could you please elaborate a bit on why do you need multiple members > within this section (in the context of your third option)? > I re-read the email but don't get it :( As I understand it: The linker deduplicates types in the set of input .BTF sections. This means that when linking foo.o and bar.o, if both compilation units refer to a type 'quux', there are two possibilities: a) The type 'quux' is the same (using C type equivalence rules) in both compilation units. Then the type is "shared" and the linker puts it only once in the first output BTF member in baz.o .BTF, the "parent". b) The type 'quux' is different in both compilation units. These are then conflicting types. Then two versions the type, foo.quux and bar.quux, are placed by the linker in the corresponding "children" member in baz.o. Graphically, the .BTF section in a linked binary would contain a one-level tree of members, with as many children as input compilation units : parent (common types) | +--- child1 (types only in child1) +--- child2 (types only in child2) . +--- childN (types only in childN) Hope this makes sense. Nick should be able to explain it better than I do. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Linking BTF 2025-07-17 7:40 ` Jose E. Marchesi @ 2025-07-17 11:52 ` Nick Alcock 0 siblings, 0 replies; 4+ messages in thread From: Nick Alcock @ 2025-07-17 11:52 UTC (permalink / raw) To: Jose E. Marchesi Cc: Eduard Zingerman, Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo, alexei.starovoitov, andrii.nakryiko, alan.maguire, stephen.brennan, david.faust, elena.zannoni, bruce.mcculloch On 17 Jul 2025, Jose E. Marchesi outgrape: > >> On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote: >> >> [...] >> >>> - So... a third option, which is probably the most BTFish because it's >>> something BTF already does, in a sense: put everything in one section, >>> call it .BTF or .BTFA or whatever, and make that section an archive of >>> named BTF members, and then stuff however many BTF outputs the >>> deduplication generates (or none, if we're just stuffing inputs into >>> outputs without dedupping) into archive members. >>> >>> So, here's a possibility which seems to provide the latter option while >>> still letting existing tools read the first member (likely vmlinux): >>> >>> The idea is that we add a *next member link field* in the BTF header, and a >>> name (a strtab offset). The next member link field is an end-of-header- >>> relative offset just like most of the other header fields, which chains BTF >>> members together in a linked list: >>> >>> parent BTF >>> | >>> v >>> children BTF -> BTF -> BTF -> ... -> BTF >>> >>> The parent is always first in the list. >> >> Hi Nick, >> >> You are talking about BTF section embedded in a final vmlinux binary, right? > > More generally, a section embedded in any object which is the result of > linking two or more objects having .BTF sections: > > ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF) > > This covers the particular vmlinux case I think. Yes, though I wasn't expecting to see this in vmlinux yet! It might happen in the end. What this is used for is *communicating with pahole*: the .btfa file pahole receives is one of these, containing deduplicated BTF for the entire kernel plus all modules, and it's then up to pahole what to do with it. In userspace links (and in intermediate links of multifile kernel modules, used only as input to the btfarchive deduplicator), we do see this sort of thing heavily. >> Could you please elaborate a bit on why do you need multiple members >> within this section (in the context of your third option)? >> I re-read the email but don't get it :( > > As I understand it: > > The linker deduplicates types in the set of input .BTF sections. This > means that when linking foo.o and bar.o, if both compilation units refer > to a type 'quux', there are two possibilities: > > a) The type 'quux' is the same (using C type equivalence rules) in both > compilation units. Then the type is "shared" and the linker puts it > only once in the first output BTF member in baz.o .BTF, the "parent". > > b) The type 'quux' is different in both compilation units. These are > then conflicting types. Then two versions the type, foo.quux and > bar.quux, are placed by the linker in the corresponding "children" > member in baz.o. Yes. (We don't really quite use C type equivalence rules -- we're pickier, since types can be assignment-compatible but still different, and we want to preserve that difference. But that's nitpicking.) This happens really quite a lot in the kernel (I was surprised how often). It happens even more in userspace, sometimes to an almost pathological degree (hello, Ghostscript). LTO may make its prevalence lower in the future, but I doubt this sort of thing will ever go away: it's still with us in C++ programs, and there it's outright undefined behaviour! > Graphically, the .BTF section in a linked binary would contain a > one-level tree of members, with as many children as input compilation > units : > > parent (common types) > | > +--- child1 (types only in child1) > +--- child2 (types only in child2) > . > +--- childN (types only in childN) > > Hope this makes sense. Nick should be able to explain it better than I > do. There are really two cases, because the purpose of "being a child" is sort of overloaded. The kernel is, as ever, different... - Kernel-style builds (the traditional BTF case): vmlinux (parent) (common types, any types shared by more than one module) +--- child1.ko (types only in child1) +--- child2.ko (types only in child2) . +--- childN.ko (types only in childN) Notably, if a type differs (conflicts) across translation units, and all those translation units are in the core kernel, we can't put them in children because none of them are in modules, and children are reserved for modules: so we actually emit them as "hidden types" (a concept BTF doesn't have and that I am not currently proposing, which lets us say "this type is not visible in any namespaces, here's the name of the translation unit it was found in"). The same applies if a type differs within one module. If a type has conflicting definitions in two distinct modules, we can indeed just emit them into each module in turn. Also, if a type has one definition in a lot of modules and then a different one in one or two, we realise that the first definition is "most popular" and emit it into the parent, then emit the conflicting one into the few per-module children it is found in. Types that are used only by one module are placed in that per-module child, both because that's what pahole has always done and because it makes sense for a loosely-coupled project like the kernel not to clutter vmlinux up with thousands of types for huge modules like amdgpu that might never even be loaded. I am not expecting pahole to preserve hidden types, at least not yet (BTF has no way to encode them and no consumer understands them), but it can see them on its input, so it might use hiddenness as a flag that "hey, this type is conflicting, take care with everything with the same name" or something. The concept is not useless even if pahole largely ignores it: it does at least preserve the type graph and ensure that any type that refers to a conflicting type still refers to it after deduplication: it doesn't end up pointing at some other type with the same name. e.g. if we have these two TUs in the core kernel: a.c:struct foo { int a; }; struct bar { struct foo baz; }; b.c:struct foo { long a; }; /* Different! */ struct bar { struct foo baz; }; one struct foo (the least-referenced one) will wind up hidden, but the struct bar in that same TU will *still point at the hidden type*. Both types are *still there* and we don't end up pointing at the same struct foo from both struct bars. - For normal ELF links outside the kernel, the model above doesn't really make sense. Most programs don't have a concept like kernel modules, and most programs are more tightly coupled, so you want to see as many types as possible. So for those, the distribution is like this: parent (all types that are not conflicting) +--- child1.c (conflicting types defined in child1.c) +--- child2.c (conflicting types defined in child2.c) . +--- childN.c (conflicting types defined in child3.c) i.e., conflicting types are placed into children that are named after the translation units they come from. Within those dictionaries, there are no hidden types and there is no possibility of conflict; the shared parent corresponds to "all TUs together" and there can be no conflicts there either. In many ways this is a simpler model, but it just won't cut it for the kernel. We could in the end combine the two schemes, producing a multilevel tree, so that each module, and the core kernel, could contain an archive like userspace links do, with each conflicting type hived off into its own translation unit. This is *definitely* more work, and would probably require consumer changes too. I am not proposing it, at least not yet. But it shows where we could end up: vmlinux (parent) (common types, any types shared by more than one module) +--- core1a.c (conflicting types defined in core1a.c)... ... +--- child1.ko (types found only in child1) +-- child1a.c (conflicting types defined in child1a.c) +-- child1b.b (conflicting types defined in child1a.c) +--- child2.ko (types only in child2) . +--- childN.ko (types only in childN) The distinction between the two link types above is largely controlled via this linker option in GNU ld: --ctf-share-types=<method> How to share CTF types between translation units. <method> is: share-unconflicted (default), share-duplicated The final stage of kernel deduplication (the btfarchive tool) uses share-duplicated mode (and extra stuff to smush multiple translation units together into modules). (that's from current upstream master: obviously I'll have to find some way to say --ctf-or-btf without making it too verbose :) maybe I could just add a --btf-share-types as a synonym?) ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-07-17 11:52 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-07-16 15:15 Linking BTF Nick Alcock 2025-07-16 21:52 ` Eduard Zingerman 2025-07-17 7:40 ` Jose E. Marchesi 2025-07-17 11:52 ` Nick Alcock
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox