public inbox for dwarves@vger.kernel.org
 help / color / mirror / Atom feed
* Linking BTF
@ 2025-07-16 15:15 Nick Alcock
  2025-07-16 21:52 ` Eduard Zingerman
  0 siblings, 1 reply; 4+ messages in thread
From: Nick Alcock @ 2025-07-16 15:15 UTC (permalink / raw)
  To: dwarves, bpf, Arnaldo Carvalho de Melo, alexei.starovoitov,
	andrii.nakryiko, alan.maguire, stephen.brennan, jose.marchesi,
	david.faust, elena.zannoni, bruce.mcculloch

So I'm working on a scheme where the compiler generates BTF for object files
it generates, for later consumption by pahole after deduplication. (And I
have a proof of concept[1], already posted[2]). But doing this sort of thing
rather than post-facto generation from DWARF raises one obvious question: if
we're emitting BTF into multiple object files, what do we do with it when we
link them together?

There seem to me to be three options: an ugly but simple one, a maximally
ELFish one and a third option that is a sort of middle route. Based on some
investigations we did, I think I know which one makes the most sense, but I
could well be wrong: please critique.

 - The simple route is just to let the linker do its thing like it does for
   every other section it doesn't know about, and blindly concatenate the
   contents. I hope it's obvious why this is less than ideal: the result
   would soon get enormous, and every BTF-reading tool would need adjusting
   to allow for the fact that rather than one pile of BTF in a .BTF section,
   you might have many concatenated together, with no way to tell which
   input file each one referred to! You can't even save much space by
   compressing the result compared to compressing the inputs' .BTF
   individually (I tried it, the individual input files' chunks are usually
   too far apart to be caught by the same compression dictionary).

 - The maximally ELFish one is to put each .BTF into its own per-input-file
   section, named after the input file. Since this requires at least a bit
   of special linker handling anyway you can make it do more, and
   deduplicate at the same time, turning the result into split BTF and
   putting shared things into a .BTF section.

   This... feels like it would be really nice, and I tried to implement it,
   but at least in GNU ld falls foul of a deeply embedded architectural
   limitation, and as far as I can see from a brief look at lld, it shares
   this. You don't get an output section to deduplicate anything into until
   after the linker has figured out what output sections exist, and it only
   does this in one place: you can't go back and add more after the
   fact. This means that we could only deduplicate and add sections freely
   before the output sections are laid out -- when we have nowhere to put
   it, and honestly we might not even have acquired all the input sections
   at this point, since that is partly interleaved with output section
   layout. Essentially, it is incredibly hard to have output sections whose
   names depend on the contents of more than one input section, and that's
   what any plausible deduplicator is going to want to do.

   So this is really only useful if we're doing what ELF usually does, which
   is to copy input sections into output sections without modification, or
   with at most small changes (like relocs) that don't change sizes.

   I note that DWARF emission is special-cased in ld in part for this
   reason, and even *that* only emits a fixed set of sections rather than a
   potentially unbounded one. We should probably try to emit a fixed set
   too.

 - So... a third option, which is probably the most BTFish because it's
   something BTF already does, in a sense: put everything in one section,
   call it .BTF or .BTFA or whatever, and make that section an archive of
   named BTF members, and then stuff however many BTF outputs the
   deduplication generates (or none, if we're just stuffing inputs into
   outputs without dedupping) into archive members.

So, here's a possibility which seems to provide the latter option while
still letting existing tools read the first member (likely vmlinux):

The idea is that we add a *next member link field* in the BTF header, and a
name (a strtab offset).  The next member link field is an end-of-header-
relative offset just like most of the other header fields, which chains BTF
members together in a linked list:

parent     BTF
            |
            v
children   BTF -> BTF -> BTF -> ... -> BTF

The parent is always first in the list.

This has the notable advantage that existing BTF tools understand it without
change: they see only the parent (but since the parent is vmlinux, this is
probably enough for most existing tools that don't have to deal with
modules, and of course that's enough for existing tools working over actual
modules, which won't need archives at all).  We give members a name because
with the exception of the parent we do want to be able to distinguish the
members from each other: we need to know which module, or translation unit,
or whatever each individual member relates to.  The parent probably doesn't
need a name (it's always "vmlinux" or "shared types" or something), so it
can just use 0.

The proof of concept I posted earlier does not understand this format yet,
but only an earlier version of the archive format used in the
proof-of-concept; the scheme above was invented later.  Of course I plan to
teach the proof of concept (and upstream binutils) about the new format too,
once we agree.


There is one big change caused by using any format that involves more than
one BTF dictionary like this: external references to types become harder. If
all you have is a straight .BTF section, you can refer to a type with its ID
and it is unambiguous. If you have a bunch of them, suddenly you need a pair
of (member, type ID)!  The ability to refer to types via a small fixed-size
native-type token of some kind is extremely desirable and I do not want to
lose it. But if we consider the linked list above to be an array (and
looking members by integer is something libctf will make easy in the near
future), we can just make 64-bit "archive BTF IDs" where the top 32 bits is
the index of the individual BTF: types in the parent, with an index of 0,
just get a bigger type with no change in value. The kernel apparently does
something like this internally already.

Doing this means we keep the ability to refer to BTF types *in any module*
from other sections, even if they're stored together as an archive like this,
which is how nearly all extensions to BTF are structured anyway and which
seems like obviously the right way to do things: I'm thinking up ways to
turn most remaining CTF extensions to BTF into that sort of external table
already.


[1] The proof-of-concept is here:
    (binutils) https://sourceware.org/git/?p=binutils-gdb.git;a=log;h=refs/heads/users/nalcock/archive-v2/road-to-ctfv4

    (kernel) https://github.com/nickalcock/linux/tree/nix/btfa

[2] mail about it: https://lore.kernel.org/dwarves/87ldqf1i19.fsf@esperi.org.uk/T/#u
    (note that the branch name in this mail is out of date)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Linking BTF
  2025-07-16 15:15 Linking BTF Nick Alcock
@ 2025-07-16 21:52 ` Eduard Zingerman
  2025-07-17  7:40   ` Jose E. Marchesi
  0 siblings, 1 reply; 4+ messages in thread
From: Eduard Zingerman @ 2025-07-16 21:52 UTC (permalink / raw)
  To: Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo,
	alexei.starovoitov, andrii.nakryiko, alan.maguire,
	stephen.brennan, jose.marchesi, david.faust, elena.zannoni,
	bruce.mcculloch

On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:

[...]

>  - So... a third option, which is probably the most BTFish because it's
>    something BTF already does, in a sense: put everything in one section,
>    call it .BTF or .BTFA or whatever, and make that section an archive of
>    named BTF members, and then stuff however many BTF outputs the
>    deduplication generates (or none, if we're just stuffing inputs into
>    outputs without dedupping) into archive members.
> 
> So, here's a possibility which seems to provide the latter option while
> still letting existing tools read the first member (likely vmlinux):
> 
> The idea is that we add a *next member link field* in the BTF header, and a
> name (a strtab offset).  The next member link field is an end-of-header-
> relative offset just like most of the other header fields, which chains BTF
> members together in a linked list:
> 
> parent     BTF
>             |
>             v
> children   BTF -> BTF -> BTF -> ... -> BTF
> 
> The parent is always first in the list.

Hi Nick,

You are talking about BTF section embedded in a final vmlinux binary, right?
Could you please elaborate a bit on why do you need multiple members
within this section (in the context of your third option)?
I re-read the email but don't get it :(

[...]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Linking BTF
  2025-07-16 21:52 ` Eduard Zingerman
@ 2025-07-17  7:40   ` Jose E. Marchesi
  2025-07-17 11:52     ` Nick Alcock
  0 siblings, 1 reply; 4+ messages in thread
From: Jose E. Marchesi @ 2025-07-17  7:40 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Nick Alcock, dwarves, bpf, Arnaldo Carvalho de Melo,
	alexei.starovoitov, andrii.nakryiko, alan.maguire,
	stephen.brennan, david.faust, elena.zannoni, bruce.mcculloch


> On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:
>
> [...]
>
>>  - So... a third option, which is probably the most BTFish because it's
>>    something BTF already does, in a sense: put everything in one section,
>>    call it .BTF or .BTFA or whatever, and make that section an archive of
>>    named BTF members, and then stuff however many BTF outputs the
>>    deduplication generates (or none, if we're just stuffing inputs into
>>    outputs without dedupping) into archive members.
>> 
>> So, here's a possibility which seems to provide the latter option while
>> still letting existing tools read the first member (likely vmlinux):
>> 
>> The idea is that we add a *next member link field* in the BTF header, and a
>> name (a strtab offset).  The next member link field is an end-of-header-
>> relative offset just like most of the other header fields, which chains BTF
>> members together in a linked list:
>> 
>> parent     BTF
>>             |
>>             v
>> children   BTF -> BTF -> BTF -> ... -> BTF
>> 
>> The parent is always first in the list.
>
> Hi Nick,
>
> You are talking about BTF section embedded in a final vmlinux binary, right?

More generally, a section embedded in any object which is the result of
linking two or more objects having .BTF sections:

  ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF)

This covers the particular vmlinux case I think.

> Could you please elaborate a bit on why do you need multiple members
> within this section (in the context of your third option)?
> I re-read the email but don't get it :(

As I understand it:

The linker deduplicates types in the set of input .BTF sections.  This
means that when linking foo.o and bar.o, if both compilation units refer
to a type 'quux', there are two possibilities:

a) The type 'quux' is the same (using C type equivalence rules) in both
   compilation units.  Then the type is "shared" and the linker puts it
   only once in the first output BTF member in baz.o .BTF, the "parent".

b) The type 'quux' is different in both compilation units.  These are
   then conflicting types.  Then two versions the type, foo.quux and
   bar.quux, are placed by the linker in the corresponding "children"
   member in baz.o.

Graphically, the .BTF section in a linked binary would contain a
one-level tree of members, with as many children as input compilation
units :


    parent (common types)
      |
      +---  child1 (types only in child1)
      +---  child2 (types only in child2)
      .
      +---  childN (types only in childN)

Hope this makes sense.  Nick should be able to explain it better than I
do.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Linking BTF
  2025-07-17  7:40   ` Jose E. Marchesi
@ 2025-07-17 11:52     ` Nick Alcock
  0 siblings, 0 replies; 4+ messages in thread
From: Nick Alcock @ 2025-07-17 11:52 UTC (permalink / raw)
  To: Jose E. Marchesi
  Cc: Eduard Zingerman, Nick Alcock, dwarves, bpf,
	Arnaldo Carvalho de Melo, alexei.starovoitov, andrii.nakryiko,
	alan.maguire, stephen.brennan, david.faust, elena.zannoni,
	bruce.mcculloch

On 17 Jul 2025, Jose E. Marchesi outgrape:

>
>> On Wed, 2025-07-16 at 16:15 +0100, Nick Alcock wrote:
>>
>> [...]
>>
>>>  - So... a third option, which is probably the most BTFish because it's
>>>    something BTF already does, in a sense: put everything in one section,
>>>    call it .BTF or .BTFA or whatever, and make that section an archive of
>>>    named BTF members, and then stuff however many BTF outputs the
>>>    deduplication generates (or none, if we're just stuffing inputs into
>>>    outputs without dedupping) into archive members.
>>> 
>>> So, here's a possibility which seems to provide the latter option while
>>> still letting existing tools read the first member (likely vmlinux):
>>> 
>>> The idea is that we add a *next member link field* in the BTF header, and a
>>> name (a strtab offset).  The next member link field is an end-of-header-
>>> relative offset just like most of the other header fields, which chains BTF
>>> members together in a linked list:
>>> 
>>> parent     BTF
>>>             |
>>>             v
>>> children   BTF -> BTF -> BTF -> ... -> BTF
>>> 
>>> The parent is always first in the list.
>>
>> Hi Nick,
>>
>> You are talking about BTF section embedded in a final vmlinux binary, right?
>
> More generally, a section embedded in any object which is the result of
> linking two or more objects having .BTF sections:
>
>   ld foo.o (.BTF) bar.o (.BTF) -> baz.o (.BTF)
>
> This covers the particular vmlinux case I think.

Yes, though I wasn't expecting to see this in vmlinux yet! It might
happen in the end. What this is used for is *communicating with pahole*:
the .btfa file pahole receives is one of these, containing deduplicated
BTF for the entire kernel plus all modules, and it's then up to pahole
what to do with it.

In userspace links (and in intermediate links of multifile kernel
modules, used only as input to the btfarchive deduplicator), we do see
this sort of thing heavily.

>> Could you please elaborate a bit on why do you need multiple members
>> within this section (in the context of your third option)?
>> I re-read the email but don't get it :(
>
> As I understand it:
>
> The linker deduplicates types in the set of input .BTF sections.  This
> means that when linking foo.o and bar.o, if both compilation units refer
> to a type 'quux', there are two possibilities:
>
> a) The type 'quux' is the same (using C type equivalence rules) in both
>    compilation units.  Then the type is "shared" and the linker puts it
>    only once in the first output BTF member in baz.o .BTF, the "parent".
>
> b) The type 'quux' is different in both compilation units.  These are
>    then conflicting types.  Then two versions the type, foo.quux and
>    bar.quux, are placed by the linker in the corresponding "children"
>    member in baz.o.

Yes. (We don't really quite use C type equivalence rules -- we're
pickier, since types can be assignment-compatible but still different,
and we want to preserve that difference. But that's nitpicking.)

This happens really quite a lot in the kernel (I was surprised how
often). It happens even more in userspace, sometimes to an almost
pathological degree (hello, Ghostscript). LTO may make its prevalence
lower in the future, but I doubt this sort of thing will ever go away:
it's still with us in C++ programs, and there it's outright undefined
behaviour!

> Graphically, the .BTF section in a linked binary would contain a
> one-level tree of members, with as many children as input compilation
> units :
>
>     parent (common types)
>       |
>       +---  child1 (types only in child1)
>       +---  child2 (types only in child2)
>       .
>       +---  childN (types only in childN)
>
> Hope this makes sense.  Nick should be able to explain it better than I
> do.

There are really two cases, because the purpose of "being a child" is
sort of overloaded. The kernel is, as ever, different...

- Kernel-style builds (the traditional BTF case):

  vmlinux (parent) (common types, any types shared by more than one module)
       +---  child1.ko (types only in child1)
       +---  child2.ko (types only in child2)
       .
       +---  childN.ko (types only in childN)

  Notably, if a type differs (conflicts) across translation units, and
  all those translation units are in the core kernel, we can't put them
  in children because none of them are in modules, and children are
  reserved for modules: so we actually emit them as "hidden types" (a
  concept BTF doesn't have and that I am not currently proposing, which
  lets us say "this type is not visible in any namespaces, here's the
  name of the translation unit it was found in"). The same applies if a
  type differs within one module.

  If a type has conflicting definitions in two distinct modules, we can
  indeed just emit them into each module in turn. Also, if a type has
  one definition in a lot of modules and then a different one in one or
  two, we realise that the first definition is "most popular" and emit
  it into the parent, then emit the conflicting one into the few
  per-module children it is found in.

  Types that are used only by one module are placed in that per-module
  child, both because that's what pahole has always done and because it
  makes sense for a loosely-coupled project like the kernel not to
  clutter vmlinux up with thousands of types for huge modules like
  amdgpu that might never even be loaded.


  I am not expecting pahole to preserve hidden types, at least not yet
  (BTF has no way to encode them and no consumer understands them), but
  it can see them on its input, so it might use hiddenness as a flag
  that "hey, this type is conflicting, take care with everything with
  the same name" or something. The concept is not useless even if pahole
  largely ignores it: it does at least preserve the type graph and
  ensure that any type that refers to a conflicting type still refers to
  it after deduplication: it doesn't end up pointing at some other type
  with the same name.

  e.g. if we have these two TUs in the core kernel:

  a.c:struct foo { int a; };
      struct bar { struct foo baz; };

  b.c:struct foo { long a; }; /* Different! */
      struct bar { struct foo baz; };

  one struct foo (the least-referenced one) will wind up hidden, but the
  struct bar in that same TU will *still point at the hidden type*. Both
  types are *still there* and we don't end up pointing at the same
  struct foo from both struct bars.


- For normal ELF links outside the kernel, the model above doesn't
  really make sense. Most programs don't have a concept like kernel
  modules, and most programs are more tightly coupled, so you want to
  see as many types as possible. So for those, the distribution is like
  this:

  parent (all types that are not conflicting)
       +---  child1.c (conflicting types defined in child1.c)
       +---  child2.c (conflicting types defined in child2.c)
       .
       +---  childN.c (conflicting types defined in child3.c)

  i.e., conflicting types are placed into children that are named after
  the translation units they come from. Within those dictionaries, there
  are no hidden types and there is no possibility of conflict; the
  shared parent corresponds to "all TUs together" and there can be no
  conflicts there either.

  In many ways this is a simpler model, but it just won't cut it for the
  kernel.

We could in the end combine the two schemes, producing a multilevel
tree, so that each module, and the core kernel, could contain an archive
like userspace links do, with each conflicting type hived off into its
own translation unit. This is *definitely* more work, and would probably
require consumer changes too. I am not proposing it, at least not yet.
But it shows where we could end up:

  vmlinux (parent) (common types, any types shared by more than one module)
    +--- core1a.c (conflicting types defined in core1a.c)...
    ...
       +---  child1.ko (types found only in child1)
         +-- child1a.c (conflicting types defined in child1a.c)
         +-- child1b.b (conflicting types defined in child1a.c)
       +---  child2.ko (types only in child2)
       .
       +---  childN.ko (types only in childN)



The distinction between the two link types above is largely controlled
via this linker option in GNU ld:

  --ctf-share-types=<method>  How to share CTF types between translation units.
                                <method> is: share-unconflicted (default),
                                             share-duplicated

The final stage of kernel deduplication (the btfarchive tool) uses
share-duplicated mode (and extra stuff to smush multiple translation
units together into modules).

(that's from current upstream master: obviously I'll have to find some
way to say --ctf-or-btf without making it too verbose :) maybe I could
just add a --btf-share-types as a synonym?)

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-07-17 11:52 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-16 15:15 Linking BTF Nick Alcock
2025-07-16 21:52 ` Eduard Zingerman
2025-07-17  7:40   ` Jose E. Marchesi
2025-07-17 11:52     ` Nick Alcock

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox