Concerns about SFrame viability for userspace stack walking

linux-perf-users.vger.kernel.org archive mirror

* Concerns about SFrame viability for userspace stack walking
@ 2025-10-30  6:53 Fangrui Song
  2025-10-30  7:30 ` Jakub Jelinek
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-30  6:53 UTC (permalink / raw)
  To: linux-toolchains, linux-perf-users, linux-kernel

I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.

**Size overhead concerns**

Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.

This means adopting SFrame would result in carrying both formats, with a large net size increase.
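
The arithmetic behind that claim can be checked with a minimal sketch. It uses only the two sizes quoted above; the variable names are illustrative, and "growth" here means growth of the unwind metadata, not of the whole binary.

```python
# Sizes in MiB, as reported in the measurements above.
sframe = 8.87
eh_total = 8.06  # .eh_frame + .eh_frame_hdr combined

# How much larger .sframe is than the existing tables.
ratio = sframe / eh_total

# Cost of shipping both formats, since .eh_frame cannot be dropped.
both = sframe + eh_total
growth = both / eh_total - 1.0

print(f".sframe vs .eh_frame(+hdr): {100 * (ratio - 1):.1f}% larger")
print(f"carrying both: {both:.2f} MiB, {100 * growth:.0f}% more unwind metadata")
```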

**Learning from existing compact unwind implementations**

It's worth noting that LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6, which transitioned to using CFI directives in 2013 [1]. The efficiency gains are dramatic:

   __text section: 0x4a55470 bytes
   __unwind_info section: 0x79060 bytes (0.6% of __text)
   __eh_frame section: 0x58 bytes

   (On macOS you can check the section size with objdump --arch x86_64 -h clang and dump the unwind info with  objdump --arch x86_64 --unwind-info clang)
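
As a rough check of the 0.6% figure, here is a small sketch that takes the three section sizes listed above and expresses each as a share of __text. The helper name is made up for illustration.

```python
# Section sizes copied from the listing above.
text_size = 0x4A55470        # __text
unwind_info_size = 0x79060   # __unwind_info
eh_frame_size = 0x58         # __eh_frame


def percent_of_text(size: int) -> float:
    """Return a section size as a percentage of __text."""
    return 100.0 * size / text_size


print(f"__unwind_info: {percent_of_text(unwind_info_size):.2f}% of __text")
print(f"__eh_frame:    {percent_of_text(eh_frame_size):.5f}% of __text")
```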

OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their 2018 post: https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282

The compact unwind format achieves this efficiency through a two-level page table structure. It describes common frame layouts compactly and falls back to DWARF only when necessary, allowing most DWARF CFI entries to be eliminated while maintaining full functionality. For more details, see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO implementation https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
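
To make the two-level idea concrete, here is a deliberately simplified sketch of such a lookup. It is not the on-disk __unwind_info layout: the table contents, the encoding names, and the `lookup` function are all hypothetical, and only the shape of the search (pick a page, then pick an entry within it) is meant to carry over.

```python
from bisect import bisect_right

# Hypothetical model, NOT the real __unwind_info layout.
# First level: pages sorted by the offset of their first function.
# Second level: entries within a page, sorted by function offset.
FIRST_LEVEL = [
    (0x0000, [(0x0000, "FP_FRAME"), (0x0040, "FRAMELESS_16")]),
    (0x1000, [(0x1000, "FP_FRAME"), (0x1200, "DWARF_FALLBACK")]),
]


def lookup(pc_offset: int):
    """Return the (made-up) encoding covering pc_offset, or None."""
    # Level 1: last page whose first function starts at or before pc_offset.
    page_starts = [start for start, _ in FIRST_LEVEL]
    i = bisect_right(page_starts, pc_offset) - 1
    if i < 0:
        return None
    page = FIRST_LEVEL[i][1]
    # Level 2: last function in that page starting at or before pc_offset.
    func_starts = [start for start, _ in page]
    j = bisect_right(func_starts, pc_offset) - 1
    return page[j][1] if j >= 0 else None


print(lookup(0x0050))  # falls in the function starting at 0x0040
print(lookup(0x1234))  # falls in the function that needs the DWARF fallback
```

This sketch has no end-of-text bound, so any offset past the last function maps to the last entry; a real table would record an upper limit.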

**The AArch64 case: size matters even more**

The size consideration becomes even more critical for AArch64, which is heavily deployed on mobile phones.
There's an active feature request for compact unwind support in the AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
This underscores the broader industry need for efficient unwind information that doesn't duplicate data or significantly increase binary size.

There are at least two formats the ELF one can learn from: LLVM's compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.

**Path forward**

Unless SFrame can actually replace .eh_frame (rather than supplementing it as an accelerator for linux-perf) and demonstrate sizes smaller than .eh_frame, matching the efficiency of existing compact unwind approaches, I question its practical viability for userspace.
The current design appears to add overhead rather than reduce it.
This isn't to suggest we should simply adopt the existing compact unwind format wholesale.
The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.

For additional context, I've documented my detailed analysis at:

- https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering mandatory index building problems, section group compliance and garbage collection issues, and version compatibility challenges)
- https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs (size analysis)

Best regards,
Fangrui

[1]: https://github.com/llvm/llvm-project/commit/58e2d3d856b7dc7b97a18cfa2aeeb927bc7e6bd5 ("Generate compact unwind encoding from CFI directives.")


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
@ 2025-10-30  7:30 ` Jakub Jelinek
  2025-10-30  7:50   ` Fangrui Song
  2025-10-30 10:26 ` Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2025-10-30  7:30 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
> 
> **Size overhead concerns**
> 
> Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
> This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.

I believe .sframe only provides a subset of the .eh_frame information, so
can't be used for exception throwing, and you don't want to lose
.eh_frame_hdr either because then dlopen becomes very costly and it will
even slow down exception throwing.

If .eh_frame is considered too large, rather than inventing a new format I'd
suggest to work in the DWARF committee and provide further size
optimizations for .dwarf_frame which can then be used in .eh_frame, or agree
on .eh_frame extensions to make it smaller.

	Jakub



* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  7:30 ` Jakub Jelinek
@ 2025-10-30  7:50   ` Fangrui Song
  2025-10-30  8:05     ` Jakub Jelinek
  0 siblings, 1 reply; 30+ messages in thread
From: Fangrui Song @ 2025-10-30  7:50 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 12:30 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> > I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
> >
> > **Size overhead concerns**
> >
> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
> > This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.
>
> I believe .sframe only provides a subset of the .eh_frame information, so
> can't be used for exception throwing, and you don't want to lose
> .eh_frame_hdr either because then dlopen becomes very costly and it will
> even slow down exception throwing.

Right.

> If .eh_frame is considered too large, rather than inventing a new format I'd
> suggest to work in the DWARF committee and provide further size
> optimizations for .dwarf_frame which can then be used in .eh_frame, or agree
> on .eh_frame extensions to make it smaller.
>
>         Jakub

Thanks for the suggestion.
An effective compact unwinding scheme needs to leverage ISA-specific properties.
This architecture-specific nature makes it likely to fall outside
DWARF's scope.
That said, input from the DWARF committee would certainly be valuable.


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  7:50   ` Fangrui Song
@ 2025-10-30  8:05     ` Jakub Jelinek
  2025-10-31  2:51       ` Fangrui Song
  0 siblings, 1 reply; 30+ messages in thread
From: Jakub Jelinek @ 2025-10-30  8:05 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 12:50:42AM -0700, Fangrui Song wrote:
> An effective compact unwinding scheme needs to leverage ISA-specific properties.

Having 40-50 completely different unwinding schemes, one for each
architecture or even ISA subset, would be a complete nightmare.  Plus the
important property of DWARF is that it is easily extensible.  So, I think it
would be better to invent new DWARF DW_CFA_* arch specific opcodes which
would be a shorthand for the most common sequences of unwind info, or allow
the CIEs to define a library of DW_CFA_* sets perhaps with parameters which
would then be usable in the FDEs.  There are already some arch specific
opcodes, DW_CFA_GNU_window_save for SPARC and
DW_CFA_AARCH64_negate_ra_state_with_pc/DW_CFA_AARCH64_negate_ra_state for
AArch64, but if somebody took time to look through .eh_frame of many
binaries/libraries on several different distributions for particular arch
(so that there is no bias in what exact options those distros use etc.) and
found something that keeps repeating there commonly that could be shortened,
perhaps the assembler or linker could rewrite sequences of specific .cfi_*
directives into something equivalent but shorter once the extension opcodes
are added.  Though, there are only very few opcodes left, so taking them
should be done with great care and at least one should be left as a
multiplexer (single byte opcode followed by uleb128 code for further
operation + arguments).
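
For readers unfamiliar with the encoding, the multiplexer idea (one byte opcode, then a uleb128 sub-operation code) could be decoded roughly as follows. This is only a sketch: `MUX_OPCODE` is a placeholder value and `decode_extended` is a hypothetical helper, not an existing or proposed DW_CFA_* assignment. The ULEB128 decoding itself follows the standard DWARF definition.

```python
def decode_uleb128(data: bytes, pos: int = 0):
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry payload
        if byte & 0x80 == 0:               # high bit clear: last byte
            return result, pos
        shift += 7


MUX_OPCODE = 0x30  # placeholder value for illustration only


def decode_extended(data: bytes, pos: int = 0):
    """Decode 'multiplexer opcode + uleb128 sub-operation' (hypothetical)."""
    if data[pos] != MUX_OPCODE:
        raise ValueError("not an extended opcode")
    return decode_uleb128(data, pos + 1)


# 0xE5 0x8E 0x26 is the ULEB128 encoding of 624485.
print(decode_uleb128(bytes([0xE5, 0x8E, 0x26])))
print(decode_extended(bytes([MUX_OPCODE, 0x81, 0x01])))
```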

	Jakub


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
  2025-10-30  7:30 ` Jakub Jelinek
@ 2025-10-30 10:26 ` Peter Zijlstra
  2025-10-30 16:48   ` Fangrui Song
  2025-10-30 17:53   ` Andi Kleen
  2025-10-30 14:47 ` Jose E. Marchesi
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-10-30 10:26 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> I've been following the SFrame discussion and wanted to share some
> concerns about its viability for userspace adoption, based on concrete
> measurements and comparison with existing compact unwind
> implementations in LLVM.
> 
> **Size overhead concerns**
> 
> Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
> approximately 10% larger than the combined size of .eh_frame and
> .eh_frame_hdr (8.06 MiB total).  This is problematic because .eh_frame
> cannot be eliminated - it contains essential information for restoring
> callee-saved registers, LSDA, and personality information needed for
> debugging (e.g. reading local variables in a coredump) and C++
> exception handling.
> 
> This means adopting SFrame would result in carrying both formats, with
> a large net size increase.

So the SFrame unwinder is fairly simple code, but what does an .eh_frame
unwinder look like? Having read most of the links in your email, there
seem to be references to DWARF byte code interpreters and stuff like
that.

So while the format compactness is one aspect, the thing I find no
mention of, is the unwinder complexity.

There have been a number of attempts to do DWARF unwinding in
kernel space and while I think some architectures do it, x86_64 has had
very bad experiences with it. At some point I think Linus just said no
more, no DWARF, not ever.

So from a situation where compilers were generating bad CFI unwind
information, a horribly complex unwinder that could crash the kernel
harder than the thing it was reporting on and manual CFI annotations in
assembly that were never quite right, objtool and ORC were born.

The win was many:

 - simple robust unwinder
 - no manual CFI annotations that could be wrong
 - no reliance on compilers that would get it wrong

and I think this is where SFrame came from. I don't think the x86_64
Linux kernel will ever natively adopt SFrame, ORC works really well for
us.

However, we do need something to unwind userspace. And yes, personally
I'm in the frame-pointer camp, that's always worked well for me.
Distros however don't seem to like it much, which means that every time
I do have to profile something userspace, I get to rebuild all the
relevant code with framepointers on (which is not hard, but tedious).

Barring that, we need something for which the unwind code is simple and
robust -- and I *think* this has disqualified .eh_frame and full on
DWARF.

And this is again where SFrame comes in -- its unwinder is simple,
something we can run in kernel space.

I really don't much care for the particulars, and frame pointers work
for me -- but I do care about the kernel unwinder code. It had better be
simple and robust.

So if you want us to use .eh_frame, great, show us a simple and robust
unwinder.


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
  2025-10-30  7:30 ` Jakub Jelinek
  2025-10-30 10:26 ` Peter Zijlstra
@ 2025-10-30 14:47 ` Jose E. Marchesi
  2025-11-04  9:21 ` Indu
  2025-12-01  9:04 ` Fangrui Song
  4 siblings, 0 replies; 30+ messages in thread
From: Jose E. Marchesi @ 2025-10-30 14:47 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

Hi Fangrui.

> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> mandatory index building problems, section group compliance and
> garbage collection issues, and version compatibility challenges)

After reading your blog it seems to me that your main concern is (was?)
that SFrame "violates ELF rules" because it is not amenable to
concatenation (the result of concatenating two "sframes" is not a valid
sframe) and thus it requires specific linker support to merge these
sections.  This support, once in place, will have to be maintained
moving forward, and evolved along with the format.

First, SFrame is not concatenable because its main design goal is to be
simple to _use_ (not necessarily trivial to link) so it provides little
luxuries like a fixed-size header (uoh), it is self-contained, it does
not require an explicit index to be searched efficiently (instead you
just binary search on the data in place), it has no run-time
relocations, etc.  You don't need to allocate memory dynamically to
decode and stack-walk using SFrame, and it has even been proved (by
the parca project people) that it is possible to write actual verifiable
BPF to walk a userspace stack using an internal format that is in
essence identical to SFrame. You may not care about any of that, but the
people wanting to use the format certainly do, and that's the reason
SFrame (and ORC for that matter) is the way it is.  Sure, nobody would
object to making SFrame concatenable to make your (and my, incidentally)
life easier, but not at the cost of burdening users with extra
complexity that they don't need and are not willing to assume: why would
they?  We are not even quite sure if such a thing is achievable in this
case: you either put the complexity in the linker, or on the users, but
you cannot make it magically disappear.  If you have some _concrete_
suggestion on improving the format, please by all means let the SFrame
people know, or consider following-up in threads like [1] where these
details are being discussed.  That would be helpful indeed.
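
To illustrate the "binary search on the data in place" point, here is a minimal sketch of one stack-walking step over a sorted table. The row layout is a simplification invented for this example and is not the SFrame encoding; `ROWS`, `find_row`, `step`, and the toy `memory` dict are all hypothetical. It assumes an x86-64-like convention where the return address sits just below the CFA.

```python
# Hypothetical rows sorted by function start address:
# (function_start, cfa_offset_from_sp, ra_offset_from_cfa)
ROWS = [
    (0x1000, 16, -8),
    (0x1040, 32, -8),
    (0x10A0, 8, -8),
]


def find_row(pc: int):
    """Binary search directly over ROWS; no auxiliary index is built."""
    lo, hi = 0, len(ROWS)
    while lo < hi:
        mid = (lo + hi) // 2
        if ROWS[mid][0] <= pc:
            lo = mid + 1
        else:
            hi = mid
    return ROWS[lo - 1] if lo > 0 else None


def step(pc: int, sp: int, memory: dict):
    """Return (caller_pc, caller_sp), or None if the walk must stop."""
    row = find_row(pc)
    if row is None:
        return None
    _, cfa_off, ra_off = row
    cfa = sp + cfa_off                 # canonical frame address
    ret = memory.get(cfa + ra_off)     # return address slot
    if ret is None:
        return None
    return ret, cfa                    # caller's SP equals the CFA here


# Toy stack: one return-address slot for a frame at sp=0x7000.
memory = {0x7000 + 32 - 8: 0x1010}
print(step(0x1050, 0x7000, memory))
```

Each step is a bounded search plus two additions and one load, which is the kind of simplicity the paragraph above is arguing for.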

Second, as bizarre as it may be, having non-concatenable data in an ELF
section only "violates ELF rules" if the linker _doesn't know_ about the
type of the section containing it.  So the solution is obvious: make
your linker aware of SFrame sections and, voilà, the ELF violation goes
away.  You can either merge the SFrame data, or just discard the input
sections, or call the Linker Police.  Just do _something_ about it,
because doing nothing leads to emitting nonsense, and that pisses off
everyone.

Third, this "problem" is not exclusive to SFrame.  Other formats like EH
Frame also require some degree of linker awareness, in the form of the
generation of an explicit index, or merging, or whatever...  apparently
to nobody's scandal, lucky them.  Pushing the burden of dealing with
this to users or to post-processing tools, like you suggest in your
blog, is IMO hardly a satisfactory solution: it is rather a no-solution
and an attempt of making your problem everyone's problem. Now, if you
then move the goalpost and claim the problem is the _degree_ of linker
involvement, as you seem to suggest in your blog, then again your
feedback is very welcome to make SFrame more linker-friendly, as long as
it is in the form of concrete constructive suggestions _and_ not at the
cost of the user's requirements for the format.

Fourth, some people think that it is unreasonable to expect all the ELF
linkers in existence to be aware of SFrame sections (not me; you can
count them all linkers with fingers and no toes).  ELF already supports
a standard section flag SHF_OS_NONCONFORMING that tries to deal with
cases like this... and fails miserably: unknown sections marked with
that flag are not required to be amenable to concatenation, but the
problem is that upon encountering them the linker is expected to abort
the link with an error.  This is hardly convenient for anyone, so the
SFrame people are currently proposing to the gABI [2] the addition of a
new flag SHF_OS_NONCONFORMANT_DISCARD that would make the linker just
discard the unknown input section rather than aborting the link.

Fifth, the problems related to GC and section grouping were discussed
during Cauldron [3] and I believe a solution has been already found,
proposed independently by Roland in the gabi discussion thread.  I think
that solution is being written down and will have to be reviewed before
being used in SFrame V3.  Your help on that review would be also very
much appreciated, considering your vast experience on these matters.  I
suppose it will happen in the binutils list.  I am sure they will CC you
in the relevant thread so you won't miss it.

Sixth, several people have repeatedly pointed out that it is not
reasonable to extrapolate the big gap between SFrame V2 and V3 to future
revisions of the format, and that not implementing V2 in lld is
perfectly ok, because the kernel will start directly with V3.  The
SFrame maintainer has assured, also repeatedly, that she is well aware
that any change in the spec will have to be very carefully considered to
avoid or minimize any impact in the linkers. I would say the fact she
also maintains the SFrame support in ld may also serve as bail to
guarantee her good behavior, at least to some extent ;)

Finally, it is not clear to me at all why supporting SFrame would result
in such an unbearable burden to lld's maintenance, as you seem to
expect, given everything is fine on the GNU side.  Please don't
misunderstand me: you know your linker better than anyone else and I am
sure there are good reasons that justify such apprehension.  What is it?
Is it lack of contributors?  Is the lld codebase particularly difficult
to maintain or extend?  Looking at [4] the people that are contributing
the SFrame support to LLVM are also volunteering to maintain it moving
forward.. isn't that enough?  Perhaps you need a co-maintainer?  Can we,
or anyone else, help somehow?  If so, how?

Salud!

[1] https://sourceware.org/pipermail/binutils/2025-October/145086.html
[2] https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8
[3] https://www.youtube.com/watch?v=L2UmAp39xqk
[4] https://discourse.llvm.org/t/rfc-adding-sframe-support-to-llvm/86900


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 10:26 ` Peter Zijlstra
@ 2025-10-30 16:48   ` Fangrui Song
  2025-10-30 17:03     ` Jose E. Marchesi
                       ` (2 more replies)
  2025-10-30 17:53   ` Andi Kleen
  1 sibling, 3 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-30 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> > I've been following the SFrame discussion and wanted to share some
> > concerns about its viability for userspace adoption, based on concrete
> > measurements and comparison with existing compact unwind
> > implementations in LLVM.
> >
> > **Size overhead concerns**
> >
> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
> > approximately 10% larger than the combined size of .eh_frame and
> > .eh_frame_hdr (8.06 MiB total).  This is problematic because .eh_frame
> > cannot be eliminated - it contains essential information for restoring
> > callee-saved registers, LSDA, and personality information needed for
> > debugging (e.g. reading local variables in a coredump) and C++
> > exception handling.
> >
> > This means adopting SFrame would result in carrying both formats, with
> > a large net size increase.
>
> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
> unwinder look like? Having read most of the links in your email, there
> seem to be references to DWARF byte code interpreters and stuff like
> that.
>
> So while the format compactness is one aspect, the thing I find no
> mention of, is the unwinder complexity.
>
> There have been a number of attempts to do DWARF unwinding in
> kernel space and while I think some architectures do it, x86_64 has had
> very bad experiences with it. At some point I think Linus just said no
> more, no DWARF, not ever.
>
> So from a situation where compilers were generating bad CFI unwind
> information, a horribly complex unwinder that could crash the kernel
> harder than the thing it was reporting on and manual CFI annotations in
> assembly that were never quite right, objtool and ORC were born.
>
> The win was many:
>
>  - simple robust unwinder
>  - no manual CFI annotations that could be wrong
>  - no reliance on compilers that would get it wrong
>
> and I think this is where SFrame came from. I don't think the x86_64
> Linux kernel will ever natively adopt SFrame, ORC works really well for
> us.
>
> However, we do need something to unwind userspace. And yes, personally
> I'm in the frame-pointer camp, that's always worked well for me.
> Distros however don't seem to like it much, which means that every time
> I do have to profile something userspace, I get to rebuild all the
> relevant code with framepointers on (which is not hard, but tedious).
>
> Barring that, we need something for which the unwind code is simple and
> robust -- and I *think* this has disqualified .eh_frame and full on
> DWARF.
>
> And this is again where SFrame comes in -- its unwinder is simple,
> something we can run in kernel space.
>
> I really don't much care for the particulars, and frame pointers work
> for me -- but I do care about the kernel unwinder code. It had better be
> simple and robust.
>
> So if you want us to use .eh_frame, great, show us a simple and robust
> unwinder.

Hi Peter,

Thanks for this perspective—the unwinder complexity concern is
absolutely valid and critical for kernel use.
To clarify my motivation: I've seen attempts to use SFrame for
userspace adoption
(https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
believe it's not viable for that purpose given the size overhead I
documented. My concerns are primarily about userspace adoption, not
the kernel's internal unwinding.

If SFrame is exclusively a kernel-space feature, it could be
implemented entirely within objtool – similar to how objtool --link
--orc generates ORC info for vmlinux.o. This approach would eliminate
the need for any modifications to assemblers and linkers, while
allowing SFrame to evolve in any incompatible way.

For userspace, we could instead modify assemblers and linkers to
support a more compact format or an extension to .eh_frame, but it
won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s
exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception
handling, while SFrame can't, leading to a huge missed opportunity).


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 16:48   ` Fangrui Song
@ 2025-10-30 17:03     ` Jose E. Marchesi
  2025-10-31  4:22       ` Fangrui Song
  2025-10-30 17:33     ` Steven Rostedt
  2025-10-30 18:22     ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Jose E. Marchesi @ 2025-10-30 17:03 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel


> On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
>> > I've been following the SFrame discussion and wanted to share some
>> > concerns about its viability for userspace adoption, based on concrete
>> > measurements and comparison with existing compact unwind
>> > implementations in LLVM.
>> >
>> > **Size overhead concerns**
>> >
>> > Measurements on an x86-64 clang binary show that .sframe (8.87 MiB) is
>> > approximately 10% larger than the combined size of .eh_frame and
>> > .eh_frame_hdr (8.06 MiB total).  This is problematic because .eh_frame
>> > cannot be eliminated - it contains essential information for restoring
>> > callee-saved registers, LSDA, and personality information needed for
>> > debugging (e.g. reading local variables in a coredump) and C++
>> > exception handling.
>> >
>> > This means adopting SFrame would result in carrying both formats, with
>> > a large net size increase.
>>
>> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
>> unwinder look like? Having read most of the links in your email, there
>> seem to be references to DWARF byte code interpreters and stuff like
>> that.
>>
>> So while the format compactness is one aspect, the thing I find no
>> mention of, is the unwinder complexity.
>>
>> There have been a number of attempts to do DWARF unwinding in
>> kernel space and while I think some architectures do it, x86_64 has had
>> very bad experiences with it. At some point I think Linus just said no
>> more, no DWARF, not ever.
>>
>> So from a situation where compilers were generating bad CFI unwind
>> information, a horribly complex unwinder that could crash the kernel
>> harder than the thing it was reporting on and manual CFI annotations in
>> assembly that were never quite right, objtool and ORC were born.
>>
>> The win was many:
>>
>>  - simple robust unwinder
>>  - no manual CFI annotations that could be wrong
>>  - no reliance on compilers that would get it wrong
>>
>> and I think this is where SFrame came from. I don't think the x86_64
>> Linux kernel will ever natively adopt SFrame, ORC works really well for
>> us.
>>
>> However, we do need something to unwind userspace. And yes, personally
>> I'm in the frame-pointer camp, that's always worked well for me.
>> Distros however don't seem to like it much, which means that every time
>> I do have to profile something userspace, I get to rebuild all the
>> relevant code with framepointers on (which is not hard, but tedious).
>>
>> Barring that, we need something for which the unwind code is simple and
>> robust -- and I *think* this has disqualified .eh_frame and full on
>> DWARF.
>>
>> And this is again where SFrame comes in -- its unwinder is simple,
>> something we can run in kernel space.
>>
>> I really don't much care for the particulars, and frame pointers work
>> for me -- but I do care about the kernel unwinder code. It had better be
>> simple and robust.
>>
>> So if you want us to use .eh_frame, great, show us a simple and robust
>> unwinder.
>
> Hi Peter,
>
> Thanks for this perspective—the unwinder complexity concern is
> absolutely valid and critical for kernel use.
> To clarify my motivation: I've seen attempts to use SFrame for
> userspace adoption
> (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
> believe it's not viable for that purpose given the size overhead I
> documented. My concerns are primarily about userspace adoption, not
> the kernel's internal unwinding.
>
> If SFrame is exclusively a kernel-space feature, it could be
> implemented entirely within objtool – similar to how objtool --link
> --orc generates ORC info for vmlinux.o. This approach would eliminate
> the need for any modifications to assemblers and linkers, while
> allowing SFrame to evolve in any incompatible way.
>
> For userspace, we could instead modify assemblers and linkers to
> support a more compact format or an extension to .eh_frame , but it
> won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s
> exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception
> handling , while SFrame can't, leading to a huge missed opportunity.)

The purpose of SFrame is not to be a more compact replacement for
.eh_frame.  It is intended to be used to walk stacks, not to unwind
them.


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 16:48   ` Fangrui Song
  2025-10-30 17:03     ` Jose E. Marchesi
@ 2025-10-30 17:33     ` Steven Rostedt
  2025-10-31  5:28       ` Fangrui Song
  2025-10-30 18:22     ` Peter Zijlstra
  2 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2025-10-30 17:33 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel

On Thu, 30 Oct 2025 09:48:50 -0700
Fangrui Song <maskray@sourceware.org> wrote:

> If SFrame is exclusively a kernel-space feature, it could be
> implemented entirely within objtool – similar to how objtool --link
> --orc generates ORC info for vmlinux.o. This approach would eliminate
> the need for any modifications to assemblers and linkers, while
> allowing SFrame to evolve in any incompatible way.

I'm not sure what you mean here. Yes, it is implemented in the kernel, but
it is reading user space applications to get the sframes from them.

Every running application would need this information for its executable.
The kernel is dependent on user space having this.

The only thing the kernel is doing is reading the sframe tables associated
with the running applications to be able to walk their stacks at runtime to
do profiling. As Peter said, the kernel cares a great deal about that walking
being simple. If something goes wrong, you compromise the entire machine.

-- Steve


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 10:26 ` Peter Zijlstra
  2025-10-30 16:48   ` Fangrui Song
@ 2025-10-30 17:53   ` Andi Kleen
  2025-10-30 18:07     ` Mark Brown
  1 sibling, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2025-10-30 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

Peter Zijlstra <peterz@infradead.org> writes:
>
> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
> unwinder look like? Having read most of the links in your email, there
> seem to be references to DWARF byte code interpreters and stuff like
> that.

Here's Jan Beulich's Linux implementation. The x86 version was
removed, but it lives on for ARC:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arc/kernel/unwind.c

SH also has another one from Matt Fleming:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/sh/kernel/dwarf.c

IMNSHO the whole sframe effort is misguided because all the major ISAs do have
shadow stack hardware support now which is generally a better option. 
It would be better to invest effort in deploying that widely.

-Andi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 17:53   ` Andi Kleen
@ 2025-10-30 18:07     ` Mark Brown
  2025-10-30 18:31       ` Andi Kleen
  0 siblings, 1 reply; 30+ messages in thread
From: Mark Brown @ 2025-10-30 18:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users,
	linux-kernel

On Thu, Oct 30, 2025 at 10:53:13AM -0700, Andi Kleen wrote:

> IMNSHO the whole sframe effort is misguided because all the major ISAs do have
> shadow stack hardware support now which is generally a better option. 
> It would be better to invest effort in deploying that widely.

It's going to take a *considerable* time for the hardware support to
become standard.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 16:48   ` Fangrui Song
  2025-10-30 17:03     ` Jose E. Marchesi
  2025-10-30 17:33     ` Steven Rostedt
@ 2025-10-30 18:22     ` Peter Zijlstra
  2 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2025-10-30 18:22 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 09:48:50AM -0700, Fangrui Song wrote:

> Thanks for this perspective -- the unwinder complexity concern is
> absolutely valid and critical for kernel use.
> To clarify my motivation: I've seen attempts to use SFrame for
> userspace adoption
> (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
> believe it's not viable for that purpose given the size overhead I
> documented. My concerns are primarily about userspace adoption, not
> the kernel's internal unwinding.
> 
> If SFrame is exclusively a kernel-space feature, it could be
> implemented entirely within objtool – similar to how objtool --link
> --orc generates ORC info for vmlinux.o. This approach would eliminate
> the need for any modifications to assemblers and linkers, while
> allowing SFrame to evolve in any incompatible way.
> 
> For userspace, we could instead modify assemblers and linkers to
> support a more compact format or an extension to .eh_frame , but it
> won't be SFrame (all of Apple's compact unwind, ARM EHABI's
> exidx/extab, and Microsoft's pdata/xdata can implement C++ exception
> handling , while SFrame can't, leading to a huge missed opportunity.)

No, you misunderstand. The x86_64 Linux kernel is using ORC internally
and we're happy with that. However, the kernel also needs to be able to
unwind/walk user stack frames.

We need simple robust means of walking user space stacks from the kernel.

It is here that SFrame is proposed on x86_64. The kernel consumes user
space SFrame data to unwind user space stacks. This is also why the
SFrame sections are SHF_ALLOC, such that the kernel can simply fault
them in on-demand without having to otherwise initiate IO.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 18:07     ` Mark Brown
@ 2025-10-30 18:31       ` Andi Kleen
  2025-10-30 18:45         ` Mark Brown
  2025-10-30 18:57         ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Andi Kleen @ 2025-10-30 18:31 UTC (permalink / raw)
  To: Mark Brown
  Cc: Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users,
	linux-kernel

On Thu, Oct 30, 2025 at 06:07:49PM +0000, Mark Brown wrote:
> On Thu, Oct 30, 2025 at 10:53:13AM -0700, Andi Kleen wrote:
> 
> > IMNSHO the whole sframe effort is misguided because all the major ISAs do have
> > shadow stack hardware support now which is generally a better option. 
> > It would be better to invest effort in deploying that widely.
> 
> It's going to take a *considerable* time for the hardware support to
> become standard.

Optimizing for the past instead of the future?

Not on x86 at least. All my x86 systems have it, except for a few old
skylakes.

-Andi

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 18:31       ` Andi Kleen
@ 2025-10-30 18:45         ` Mark Brown
  2025-10-31  8:24           ` Fangrui Song
  2025-10-30 18:57         ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Mark Brown @ 2025-10-30 18:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Fangrui Song, linux-toolchains, linux-perf-users,
	linux-kernel

On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:
> On Thu, Oct 30, 2025 at 06:07:49PM +0000, Mark Brown wrote:

> > It's going to take a *considerable* time for the hardware support to
> > become standard.

> Optimizing for the past instead of the future?

On arm64 no currently available hardware has shadow stack support, and
once systems start becoming available it'll take a very long time for
that to filter down to even being all newly shipping systems, let alone
all systems that people care about running new software on.

> Not on x86 at least. All my x86 systems have it, except for a few old
> skylakes.

My experience trying to find a system to test changes on was somewhat
different :(  I did eventually get something.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 18:31       ` Andi Kleen
  2025-10-30 18:45         ` Mark Brown
@ 2025-10-30 18:57         ` Peter Zijlstra
  2025-10-31 11:46           ` Mark Brown
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2025-10-30 18:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mark Brown, Fangrui Song, linux-toolchains, linux-perf-users,
	linux-kernel

On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:

> Not on x86 at least. All my x86 systems have it, except for a few old
> skylakes.

About half of my systems have CET, but I just checked, none of them
seem to actually use userspace shadow stacks.

AFAICT Debian hasn't built their packages with this stuff on.

But yeah, thanks for reminding me, we should definitely build a shstk
unwinder.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  8:05     ` Jakub Jelinek
@ 2025-10-31  2:51       ` Fangrui Song
  0 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-31  2:51 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 1:06 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Oct 30, 2025 at 12:50:42AM -0700, Fangrui Song wrote:
> > An effective compact unwinding scheme needs to leverage ISA-specific properties.
>
> Having 40-50 completely different unwinding schemes, one for each
> architecture or even ISA subset, would be a complete nightmare.  Plus the
> important property of DWARF is that it is easily extensible.  So, I think it
> would be better to invent new DWARF DW_CFA_* arch specific opcodes which
> would be a shorthand for the most common sequences of unwind info, or allow
> the CIEs to define a library of DW_CFA_* sets perhaps with parameters which
> would then be usable in the FDEs.  There are already some arch specific
> opcodes, DW_CFA_GNU_window_save for SPARC and
> DW_CFA_AARCH64_negate_ra_state_with_pc/DW_CFA_AARCH64_negate_ra_state for
> AArch64, but if somebody took time to look through .eh_frame of many
> binaries/libraries on several different distributions for particular arch
> (so that there is no bias in what exact options those distros use etc.) and
> found something that keeps repeating there commonly that could be shortened,
> perhaps the assembler or linker could rewrite sequences of specific .cfi_*
> directives into something equivalent but shorter once the extension opcodes
> are added.  Though, there are only very few opcodes left, so taking them
> should be done with great care and at least one should be left as a
> multiplexer (single byte opcode followed by uleb128 code for further
> operation + arguments).
>
>         Jakub

That's a good point about being careful with new unwind formats.
The LLVM compact unwind format, used by Mach-O, utilizes an
architecture-agnostic page table structure but has
architecture-specific opcode formats (i386, x86-64, and aarch64).
I.e. it does not introduce an entirely different format for each arch.

I believe the size issue with .eh_frame is primarily driven by the
CIE/FDE overhead, not the CFI instructions.
The inherently large size of a single FDE (around 20 bytes
https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471/10?u=maskray
) is a significant contributor to overall size.

The performance issue of .eh_frame seems largely related to the
bytecode nature of the CFI instructions.
Encoding locations with different CFI states explicitly as
different frame entries makes lookup faster.
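
To make the bytecode-versus-explicit-rows point concrete, here is a minimal sketch. It is a toy model, not real DWARF CFI: the two opcodes, the `CFI_PROGRAM` contents, and all function names are assumptions made up for illustration. It shows that a bytecode description has to be replayed from the start of the function for every lookup, while a table with one explicit row per CFI state can be searched directly.

```python
from bisect import bisect_right

# Hypothetical, heavily simplified CFI program for one function:
# ("advance", n) moves the current location forward by n bytes,
# ("cfa_offset", n) sets CFA = SP + n from the current location on.
CFI_PROGRAM = [
    ("cfa_offset", 8),    # on entry only the return address is on the stack
    ("advance", 1),       # e.g. after a one-byte push
    ("cfa_offset", 16),
    ("advance", 12),      # e.g. after the stack pointer is adjusted
    ("cfa_offset", 48),
]


def cfa_offset_by_replay(program, pc_offset):
    """Interpret the program from the start until it would pass pc_offset."""
    loc, offset = 0, None
    for op, arg in program:
        if op == "advance":
            if loc + arg > pc_offset:
                break
            loc += arg
        elif op == "cfa_offset":
            offset = arg
    return offset


def flatten(program):
    """Precompute one explicit (start_offset, cfa_offset) row per state."""
    rows, loc = [], 0
    for op, arg in program:
        if op == "advance":
            loc += arg
        elif op == "cfa_offset":
            if rows and rows[-1][0] == loc:
                rows[-1] = (loc, arg)   # same location: later rule wins
            else:
                rows.append((loc, arg))
    return rows


def cfa_offset_by_rows(rows, pc_offset):
    """Binary-search the precomputed rows; no interpretation needed."""
    starts = [start for start, _ in rows]
    return rows[bisect_right(starts, pc_offset) - 1][1]
```

The trade-off is visible in `flatten`: the explicit rows cost space (one entry per state change) in exchange for a lookup that does not execute any bytecode.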

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 17:03     ` Jose E. Marchesi
@ 2025-10-31  4:22       ` Fangrui Song
  2025-10-31 14:37         ` Jose E. Marchesi
  0 siblings, 1 reply; 30+ messages in thread
From: Fangrui Song @ 2025-10-31  4:22 UTC (permalink / raw)
  To: Jose E. Marchesi
  Cc: Fangrui Song, Peter Zijlstra, linux-toolchains, linux-perf-users,
	linux-kernel

On Thu, Oct 30, 2025 at 10:04 AM Jose E. Marchesi
<jose.marchesi@oracle.com> wrote:
>
>
> > On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
> >> > I've been following the SFrame discussion and wanted to share some
> >> > concerns about its viability for userspace adoption, based on concrete
> >> > measurements and comparison with existing compact unwind
> >> > implementations in LLVM.
> >> >
> >> > **Size overhead concerns**
> >> >
> >> > Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
> >> > approximately 10% larger than the combined size of .eh_frame and
> >> > .eh_frame_hdr (8.06 MiB total).  This is problematic because .eh_frame
> >> > cannot be eliminated - it contains essential information for restoring
> >> > callee-saved registers, LSDA, and personality information needed for
> >> > debugging (e.g. reading local variables in a coredump) and C++
> >> > exception handling.
> >> >
> >> > This means adopting SFrame would result in carrying both formats, with
> >> > a large net size increase.
> >>
> >> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
> >> unwinder look like? Having read most of the links in your email, there
> >> seem to be references to DWARF byte code interpreters and stuff like
> >> that.
> >>
> >> So while the format compactness is one aspect, the thing I find no
> >> mention of, is the unwinder complexity.
> >>
> >> There have been a number of attempts to do DWARF unwinding in
> >> kernel space and while I think some architectures do it, x86_64 has had
> >> very bad experiences with it. At some point I think Linus just said no
> >> more, no DWARF, not ever.
> >>
> >> So from a situation where compilers were generating bad CFI unwind
> >> information, a horribly complex unwinder that could crash the kernel
> >> harder than the thing it was reporting on and manual CFI annotations in
> >> assembly that were never quite right, objtool and ORC were born.
> >>
> >> The win was many:
> >>
> >>  - simple robust unwinder
> >>  - no manual CFI annotations that could be wrong
> >>  - no reliance on compilers that would get it wrong
> >>
> >> and I think this is where SFrame came from. I don't think the x86_64
> >> Linux kernel will ever natively adopt SFrame, ORC works really well for
> >> us.
> >>
> >> However, we do need something to unwind userspace. And yes, personally
> >> I'm in the frame-pointer camp, that's always worked well for me.
> >> Distros, however, don't seem to like it much, which means that every time
> >> I do have to profile something userspace, I get to rebuild all the
> >> relevant code with framepointers on (which is not hard, but tedious).
> >>
> >> Barring that, we need something for which the unwind code is simple and
> >> robust -- and I *think* this has disqualified .eh_frame and full on
> >> DWARF.
> >>
> >> And this is again where SFrame comes in -- its unwinder is simple,
> >> something we can run in kernel space.
> >>
> >> I really don't much care for the particulars, and frame pointers work
> >> for me -- but I do care about the kernel unwinder code. It had better be
> >> simple and robust.
> >>
> >> So if you want us to use .eh_frame, great, show us a simple and robust
> >> unwinder.
> >
> > Hi Peter,
> >
> > Thanks for this perspective—the unwinder complexity concern is
> > absolutely valid and critical for kernel use.
> > To clarify my motivation: I've seen attempts to use SFrame for
> > userspace adoption
> > (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
> > believe it's not viable for that purpose given the size overhead I
> > documented. My concerns are primarily about userspace adoption, not
> > the kernel's internal unwinding.
> >
> > If SFrame is exclusively a kernel-space feature, it could be
> > implemented entirely within objtool – similar to how objtool --link
> > --orc generates ORC info for vmlinux.o. This approach would eliminate
> > the need for any modifications to assemblers and linkers, while
> > allowing SFrame to evolve in any incompatible way.
> >
> > For userspace, we could instead modify assemblers and linkers to
> > support a more compact format or an extension to .eh_frame , but it
> > won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s
> > exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception
> > handling , while SFrame can't, leading to a huge missed opportunity.)
>
> The purpose of SFrame is not to be a more compact replacement for
> .eh_frame.  It is intended to be used to walk stacks, not to unwind
> them.

Hi Jose,

Let me clarify my concerns, as I think we may be talking past each other a bit.

**The primary concern: size overhead for userspace**

The fundamental issue is that SFrame, as currently designed, results
in a significant net size increase for userspace binaries because it
is large and cannot replace .eh_frame (which would mean losing
debugging and C++ exception handling support). The median .eh_frame
size across executables and shared libraries on a Linux system is 5+%
of total VM size:

https://gist.github.com/MaskRay/5995d10b65e1e18b82931c5a8d97f55e

Increasing this to 10% by adding SFrame on top is simply not viable.
As my reply to Peter mentioned, "If SFrame is exclusively a
kernel-space feature, it could be implemented entirely within
objtool—similar to how objtool --link --orc generates ORC info for
vmlinux.o."
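
As a back-of-the-envelope check of the figures above, the arithmetic can be sketched as follows. The section sizes and the 5% median share are the numbers quoted in this thread; applying the clang binary's .sframe/.eh_frame ratio to the median binary, and expressing the result relative to the original (pre-SFrame) size, are simplifying assumptions of this sketch.

```python
# Sizes from the x86-64 clang measurement quoted in this thread (MiB).
EH_FRAME_MIB = 8.06   # .eh_frame + .eh_frame_hdr
SFRAME_MIB = 8.87     # .sframe


def relative_increase(new, old):
    """How much larger `new` is than `old`, as a fraction of `old`."""
    return (new - old) / old


def combined_share(eh_share, sframe_to_eh_ratio):
    """Share of stack-walking data once .sframe is carried next to .eh_frame.

    eh_share is .eh_frame's share of total VM size; the result is expressed
    relative to that same original size (assumption: the denominator is not
    grown by the added section).
    """
    return eh_share * (1 + sframe_to_eh_ratio)


sframe_vs_eh = relative_increase(SFRAME_MIB, EH_FRAME_MIB)
both = combined_share(0.05, SFRAME_MIB / EH_FRAME_MIB)
print(f".sframe is {sframe_vs_eh:.1%} larger than .eh_frame + .eh_frame_hdr")
print(f"carrying both costs about {both:.1%} of the original VM size")
```

Under these assumptions the two results land at roughly 10% each, which is where the "5% becomes 10%" estimate comes from.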

**What about kernel use?**

As I mentioned in my reply to Peter, if SFrame is exclusively a
kernel-space feature, it could be implemented entirely within
objtool—similar to how objtool --link --orc generates ORC info for
vmlinux.o.
I believe SFrame has a size advantage over ORC, which could make it
attractive for this use case.
However, if SFrame will not replace the existing in-kernel ORC
unwinder (as Peter suggested), then I'm afraid SFrame doesn't have a
clear position—neither for vmlinux nor for userspace programs.

**On the ELF format issues**

https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8

The current Binutils implementation disregards ELF and linker
conventions, which is a serious concern for all linker maintainers.
The proposed SHF_OS_NONCONFORMING_DISCARD flag has faced strong
objections in the generic ABI discussion:
https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8

There are also unresolved garbage collection issues. I had to disable
-Wl,--gc-sections entirely when testing for
https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs

I want to emphasize: custom merging rules do not inherently conflict
with using proper multi-section structure with section group and
SHF_LINK_ORDER.
The format could be designed to work within established ELF
conventions rather than requiring special cases throughout the linker.
The concern about maintenance burden isn't about the initial
implementation—it's about committing to long-term support for a format
that requires custom handling in every linker while providing
questionable benefit for its stated use case.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 17:33     ` Steven Rostedt
@ 2025-10-31  5:28       ` Fangrui Song
  0 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-31  5:28 UTC (permalink / raw)
  To: Steven Rostedt, Peter Zijlstra
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 10:32 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 30 Oct 2025 09:48:50 -0700
> Fangrui Song <maskray@sourceware.org> wrote:
>
> > If SFrame is exclusively a kernel-space feature, it could be
> > implemented entirely within objtool – similar to how objtool --link
> > --orc generates ORC info for vmlinux.o. This approach would eliminate
> > the need for any modifications to assemblers and linkers, while
> > allowing SFrame to evolve in any incompatible way.
>
> I'm not sure what you mean here. Yes, it is implemented in the kernel, but
> it is reading user space applications to get the sframes from them.
>
> Every running application would need this information for its executable.
> The kernel is dependent on user space having this.
>
> The only thing the kernel is doing is reading the sframe tables associated
> with the running applications to be able to walk their stacks at runtime to
> do profiling. As Peter asked, the kernel cares a great deal about that walking
> being simple. If something goes wrong, you compromise the entire machine.
>
> -- Steve

I suspect your concern is primarily with DWARF expressions (the
DW_OP_* opcodes), which are needed for complex, unusual frames.
If the perf subsystem ignores those and focuses only on standard frame
layouts, ensuring safety becomes much more straightforward.

Compact unwinding formats are designed with this principle in
mind—they don't use bytecode CFI instructions or DWARF expressions at
all. Instead, they use a binary search table (similar to
.eh_frame_hdr) to locate the frame descriptor, then decode it using a
straightforward nested switch statement based on the compact encoding.

For reference, this is llvm-project/libunwind's implementation for
i386, x86-64, and aarch64:
https://github.com/llvm/llvm-project/blob/main/libunwind/src/CompactUnwinder.hpp
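
The lookup-then-decode structure described above can be sketched roughly as follows. This is not the libunwind code and not the real compact unwind layout: the 32-bit encoding, the mode numbers, the `INDEX` table, and the function names are all hypothetical, chosen only to show that the walker is a binary search followed by a small fixed set of cases, with no bytecode interpreter.

```python
from bisect import bisect_right

# Toy 32-bit encoding (hypothetical, NOT the real compact unwind layout):
# bits 24..27 hold the mode, bits 16..23 the stack size in 8-byte units.
MODE_FRAME_POINTER = 1   # CFA = frame pointer + 16
MODE_FRAMELESS = 2       # CFA = stack pointer + stack size
MODE_FALLBACK = 4        # too complex for this format; look elsewhere

# Sorted (function_start, encoding) pairs standing in for the index table.
# The toy table has no end addresses, so the last entry covers everything
# above it.
INDEX = [
    (0x1000, MODE_FRAME_POINTER << 24),
    (0x1040, (MODE_FRAMELESS << 24) | (5 << 16)),
    (0x10A0, MODE_FALLBACK << 24),
]
STARTS = [start for start, _ in INDEX]


def lookup(pc):
    """Binary search for the entry covering pc, or None if pc is too low."""
    i = bisect_right(STARTS, pc) - 1
    return INDEX[i][1] if i >= 0 else None


def caller_cfa(pc, sp, fp):
    """Return the CFA for the frame at pc, or None if it cannot be decoded."""
    enc = lookup(pc)
    if enc is None:
        return None
    mode = (enc >> 24) & 0xF
    # The "switch": each mode is a fixed formula over SP/FP.
    if mode == MODE_FRAME_POINTER:
        return fp + 16
    if mode == MODE_FRAMELESS:
        return sp + ((enc >> 16) & 0xFF) * 8
    return None  # MODE_FALLBACK or unknown mode: give up safely
```

A walker built this way can refuse any entry it does not recognize, which is the property that matters for the safety concern raised earlier in the thread.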

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 18:45         ` Mark Brown
@ 2025-10-31  8:24           ` Fangrui Song
  0 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-10-31  8:24 UTC (permalink / raw)
  To: Mark Brown
  Cc: Andi Kleen, Peter Zijlstra, Fangrui Song, linux-toolchains,
	linux-perf-users, linux-kernel

On Thu, Oct 30, 2025 at 11:45 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:
> > On Thu, Oct 30, 2025 at 06:07:49PM +0000, Mark Brown wrote:
>
> > > It's going to take a *considerable* time for the hardware support to
> > > become standard.
>
> > Optimizing for the past instead of the future?
>
> On arm64 no currently available hardware has shadow stack support, and
> once systems start becoming available it'll take a very long time for
> that to filter down to even being all newly shipping systems, let alone
> all systems that people care about running new software on.
>
> > Not on x86 at least. All my x86 systems have it, except for a few old
> > skylakes.
>
> My experience trying to find a system to test changes on was somewhat
> different :(  I did eventually get something.

I’ve chatted with mobile toolchain developers at the LLVM Developers' Meeting, who
emphasized that size concerns are especially critical for AArch64,
which is heavily deployed on mobile phones.
I think Arm ABI makers are unlikely to want a format known not to work
with mobile Linux to coexist with a future, more widely adopted
compact format with callee-saved register, LSDA, and personality
support. I chatted with Peter Smith, who seems to think so, but I
don't want to put words in his mouth :)

---

Intel’s 11th Gen and AMD Zen 3 support hardware shadow stacks.
A software-only stack walking approach that doesn’t replace .eh_frame
(and that remains unvetted for AArch64; see above) would quickly become
obsolete.
Shadow stack can be enabled per process, providing flexibility to
balance performance overhead / memory consumption with profiling
needs, even for users who don’t prioritize the security hardening
aspect.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30 18:57         ` Peter Zijlstra
@ 2025-10-31 11:46           ` Mark Brown
  0 siblings, 0 replies; 30+ messages in thread
From: Mark Brown @ 2025-10-31 11:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Fangrui Song, linux-toolchains, linux-perf-users,
	linux-kernel

On Thu, Oct 30, 2025 at 07:57:25PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2025 at 11:31:38AM -0700, Andi Kleen wrote:

> > Not on x86 at least. All my x86 systems have it, except for a few old
> > skylakes.

> About half of my systems have CET, but I just checked, none of them
> seem to actually use userspace shadow stacks.

> AFAICT Debian hasn't built their packages with this stuff on.

It hasn't - they're starting to roll it out now in the development
distro, an actual release is a couple of years away at this point.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-31  4:22       ` Fangrui Song
@ 2025-10-31 14:37         ` Jose E. Marchesi
  0 siblings, 0 replies; 30+ messages in thread
From: Jose E. Marchesi @ 2025-10-31 14:37 UTC (permalink / raw)
  To: Fangrui Song
  Cc: Peter Zijlstra, linux-toolchains, linux-perf-users, linux-kernel


> On Thu, Oct 30, 2025 at 10:04 AM Jose E. Marchesi
> <jose.marchesi@oracle.com> wrote:
>>
>>
>> > On Thu, Oct 30, 2025 at 3:26 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> >>
>> >> On Wed, Oct 29, 2025 at 11:53:32PM -0700, Fangrui Song wrote:
>> >> > I've been following the SFrame discussion and wanted to share some
>> >> > concerns about its viability for userspace adoption, based on concrete
>> >> > measurements and comparison with existing compact unwind
>> >> > implementations in LLVM.
>> >> >
>> >> > **Size overhead concerns**
>> >> >
>> >> > Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
>> >> > approximately 10% larger than the combined size of .eh_frame and
>> >> > .eh_frame_hdr (8.06 MiB total).  This is problematic because .eh_frame
>> >> > cannot be eliminated - it contains essential information for restoring
>> >> > callee-saved registers, LSDA, and personality information needed for
>> >> > debugging (e.g. reading local variables in a coredump) and C++
>> >> > exception handling.
>> >> >
>> >> > This means adopting SFrame would result in carrying both formats, with
>> >> > a large net size increase.
>> >>
>> >> So the SFrame unwinder is fairly simple code, but what does an .eh_frame
>> >> unwinder look like? Having read most of the links in your email, there
>> >> seem to be references to DWARF byte code interpreters and stuff like
>> >> that.
>> >>
>> >> So while the format compactness is one aspect, the thing I find no
>> >> mention of, is the unwinder complexity.
>> >>
>> >> There have been a number of attempts to do DWARF unwinding in
>> >> kernel space and while I think some architectures do it, x86_64 has had
>> >> very bad experiences with it. At some point I think Linus just said no
>> >> more, no DWARF, not ever.
>> >>
>> >> So from a situation where compilers were generating bad CFI unwind
>> >> information, a horribly complex unwinder that could crash the kernel
>> >> harder than the thing it was reporting on and manual CFI annotations in
>> >> assembly that were never quite right, objtool and ORC were born.
>> >>
>> >> The win was many:
>> >>
>> >>  - simple robust unwinder
>> >>  - no manual CFI annotations that could be wrong
>> >>  - no reliance on compilers that would get it wrong
>> >>
>> >> and I think this is where SFrame came from. I don't think the x86_64
>> >> Linux kernel will ever natively adopt SFrame, ORC works really well for
>> >> us.
>> >>
>> >> However, we do need something to unwind userspace. And yes, personally
>> >> I'm in the frame-pointer camp, that's always worked well for me.
>> >> Distros, however, don't seem to like it much, which means that every time
>> >> I do have to profile something userspace, I get to rebuild all the
>> >> relevant code with framepointers on (which is not hard, but tedious).
>> >>
>> >> Barring that, we need something for which the unwind code is simple and
>> >> robust -- and I *think* this has disqualified .eh_frame and full on
>> >> DWARF.
>> >>
>> >> And this is again where SFrame comes in -- its unwinder is simple,
>> >> something we can run in kernel space.
>> >>
>> >> I really don't much care for the particulars, and frame pointers work
>> >> for me -- but I do care about the kernel unwinder code. It had better be
>> >> simple and robust.
>> >>
>> >> So if you want us to use .eh_frame, great, show us a simple and robust
>> >> unwinder.
>> >
>> > Hi Peter,
>> >
>> > Thanks for this perspective—the unwinder complexity concern is
>> > absolutely valid and critical for kernel use.
>> > To clarify my motivation: I've seen attempts to use SFrame for
>> > userspace adoption
>> > (https://fedoraproject.org/wiki/Changes/SFrameInBinaries ), and I
>> > believe it's not viable for that purpose given the size overhead I
>> > documented. My concerns are primarily about userspace adoption, not
>> > the kernel's internal unwinding.
>> >
>> > If SFrame is exclusively a kernel-space feature, it could be
>> > implemented entirely within objtool – similar to how objtool --link
>> > --orc generates ORC info for vmlinux.o. This approach would eliminate
>> > the need for any modifications to assemblers and linkers, while
>> > allowing SFrame to evolve in any incompatible way.
>> >
>> > For userspace, we could instead modify assemblers and linkers to
>> > support a more compact format or an extension to .eh_frame , but it
>> > won't be SFrame (all of Apple’s compact unwind, ARM EHABI’s
>> > exidx/extab, and Microsoft’s pdata/xdata can implement C++ exception
>> > handling , while SFrame can't, leading to a huge missed opportunity.)
>>
>> The purpose of SFrame is not to be a more compact replacement for
>> .eh_frame.  It is intended to be used to walk stacks, not to unwind
>> them.
>
> Hi Jose,
>
> Let me clarify my concerns, as I think we may be talking past each
> other a bit.

Indeed, and thanks for following up :)

> **The primary concern: size overhead for userspace**
>
> The fundamental issue is that SFrame, as currently designed, results
> in a significant net size increase for userspace binaries because it
> is large and cannot replace .eh_frame (which would mean losing
> debugging and C++ exception handling support). The median .eh_frame
> size across executables and shared libraries on a Linux system is 5+%
> of total VM size:
>
> https://gist.github.com/MaskRay/5995d10b65e1e18b82931c5a8d97f55e
>
> Increasing this to 10% by adding SFrame on top is simply not viable.
> As my reply to Peter mentioned, "If SFrame is exclusively a
> kernel-space feature, it could be implemented entirely within
> objtool—similar to how objtool --link --orc generates ORC info for
> vmlinux.o."

I understand your concern, but whether the size overhead introduced by
SFrame is "viable" or not, I would say that is up to the users to
decide, not us tools engineers.  If someone wants to trade a 5% increase
in size (or whatever amount, really) for improved traceability and/or
performance, we are not going to convince them otherwise, especially if
we cannot provide a working alternative that would give them a better
tradeoff.

> **What about kernel use?**
>
> As I mentioned in my reply to Peter, if SFrame is exclusively a
> kernel-space feature, it could be implemented entirely within
> objtool—similar to how objtool --link --orc generates ORC info for
> vmlinux.o.
> I believe SFrame has a size advantage over ORC, which could make it
> attractive for this use case.
>
> However, if SFrame will not replace the existing in-kernel ORC
> unwinder (as Peter suggested), then I'm afraid SFrame doesn't have a
> clear position—neither for vmlinux nor for userspace programs.
>
> **On the ELF format issues**
>
> https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8
>
> The current Binutils implementation disregards ELF and linker
> conventions, which is a serious concern for all linker maintainers.

Sorry, but binutils doesn't disregard anything it wasn't disregarding
before implementing SFrame, and certainly nothing that lld doesn't
currently disregard as well, in the sense both linkers support other
formats that require linker awareness for meaningful merging.

> The proposed SHF_OS_NONCONFORMING_DISCARD flag has faced strong
> objections in the generic ABI discussion:
> https://groups.google.com/g/generic-abi/c/3ZMVJDF79g8

I would be disappointed otherwise: it is their job to resist change, as
it is the job of everyone else to push for it whenever they feel it is
necessary.  Don't get disheartened: it is just a single flag that is being
proposed, one that doesn't involve any sort of elaborate semantics and that
is a clear logical complement to an existing flag.

So I remain optimistic, but given there are only a bunch of ELF linkers
around, IMO this flag proposal is more hygienic in nature than anything
else and its absence is hardly a showstopper.  It is clearly better to
have "if (section_is_unknown && nonconforming_discard) {
discard_input_section }" than "if (sframe && i_dont_support_sframe) {
discard_input_section }" in a few places, but if it turns out we can't
have it, well.. it isn't the end of the world.
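
The difference between the two conditions can be sketched as below. This only models the decision logic: the flag's bit value, the `Section` type, and the string section types are placeholders invented for illustration and do not reflect any linker's actual data structures or the proposal's real encoding.

```python
from dataclasses import dataclass

# Placeholder bit for the proposed flag; the real value is not given here.
NONCONFORMING_DISCARD = 0x1


@dataclass
class Section:
    name: str
    sh_type: str       # simplified: a string instead of a numeric SHT_* value
    sh_flags: int = 0


def discard_generic(sec, known_types):
    """Flag-based rule: drop any section type this linker does not know,
    provided the producer marked it as safe to discard."""
    unknown = sec.sh_type not in known_types
    return unknown and bool(sec.sh_flags & NONCONFORMING_DISCARD)


def discard_special_cased(sec, supports_sframe):
    """Format-specific rule: every new format needs its own check."""
    return sec.sh_type == "SFRAME" and not supports_sframe


old_linker_types = {"PROGBITS", "NOBITS", "SYMTAB"}
sframe = Section(".sframe", "SFRAME", NONCONFORMING_DISCARD)
future = Section(".future_fmt", "FUTURE_FMT", NONCONFORMING_DISCARD)

print(discard_generic(sframe, old_linker_types))      # handled by the flag
print(discard_generic(future, old_linker_types))      # also handled by the flag
print(discard_special_cased(future, supports_sframe=False))  # not handled
```

The generic rule keeps working for formats that do not exist yet, while the special-cased rule has to be repeated per format in each linker.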

> There are also unresolved garbage collection issues. I had to disable
> -Wl,--gc-sections entirely when testing for
> https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> I want to emphasize: custom merging rules do not inherently conflict
> with using proper multi-section structure with section group and
> SHF_LINK_ORDER.

As I already pointed out in my previous reply, I think a solution was
found for that and it is being worked out.

> The format could be designed to work within established ELF
> conventions rather than requiring special cases throughout the linker.

Would you _please_ consider helping them to do so?  I believe there is
still time to get changes into V3, so if you have suggestions for
improving SFrame in that regard, other than offloading complexity to
clients or post-processing tools, or throwing the whole thing out with
the bath water, by all means please reach out to them.

> The concern about maintenance burden isn't about the initial
> implementation—it's about committing to long-term support for a format
> that requires custom handling in every linker while providing
> questionable benefit for its stated use case.

Thanks for explaining.

So you are saying that the question of whether to include SFrame support
in lld boils down to the questionable benefit of SFrame for its stated
use case, the primary concern there being the size overhead.  Yes?


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
                   ` (2 preceding siblings ...)
  2025-10-30 14:47 ` Jose E. Marchesi
@ 2025-11-04  9:21 ` Indu
  2025-11-05  8:21   ` Fangrui Song
  2025-12-01  9:04 ` Fangrui Song
  4 siblings, 1 reply; 30+ messages in thread
From: Indu @ 2025-11-04  9:21 UTC (permalink / raw)
  To: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On 2025-10-29 11:53 p.m., Fangrui Song wrote:
> I've been following the SFrame discussion and wanted to share some 
> concerns about its viability for userspace adoption, based on concrete 
> measurements and comparison with existing compact unwind implementations 
> in LLVM.
> 
> **Size overhead concerns**
> 
> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is 
> approximately 10% larger than the combined size of .eh_frame 
> and .eh_frame_hdr (8.06 MiB total).
> This is problematic because .eh_frame cannot be eliminated - it contains 
> essential information for restoring callee-saved registers, LSDA, and 
> personality information needed for debugging (e.g. reading local 
> variables in a coredump) and C++ exception handling.
> 
> This means adopting SFrame would result in carrying both formats, with a 
> large net size increase.
> 
> **Learning from existing compact unwind implementations**
> 
> It's worth noting that LLVM has had a battle-tested compact unwind 
> format in production use since 2009 with OS X 10.6, which transitioned 
> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
> 
>    __text section: 0x4a55470 bytes
>    __unwind_info section: 0x79060 bytes (0.6% of __text)
>    __eh_frame section: 0x58 bytes
> 

I believe this is only synchronous? If yes, do you think this is a fair 
measurement to compare against?

Does the compact unwind info scheme work well for cases of 
shrink-wrapping? How about the case of AArch64, where the ABI does not 
mandate if and where a frame record is created?

For the numbers above, does it ensure precise stack traces?

From the Apple Compact Unwinding Format document 
(https://faultlore.com/blah/compact-unwinding/):
"One consequence of only having one opcode for a whole function is that 
functions will generally have incorrect instructions for the function’s 
prologue (where callee-saved registers are individually PUSHed onto the 
stack before the rest of the stack space is allocated)."

"Presumably this isn’t a very big deal, since there’s very few 
situations where unwinding would involve a function still executing its 
prologue/epilogue."

Well, getting precise stack traces is a big deal and the users want them.

>    (On macOS you can check the section size with
> objdump --arch x86_64 -h clang and dump the unwind info with
> objdump --arch x86_64 --unwind-info clang)
> 
> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as 
> documented in their "VSI OpenVMS Calling Standard" and their 2018 post: 
> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> 
> The compact unwind format achieves this efficiency through a two-level 
> page table structure. It describes common frame layouts compactly and 
> falls back to DWARF only when necessary, allowing most DWARF CFI entries 
> to be eliminated while maintaining full functionality. For more details, 
> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO 
> implementation
> https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
> 

How does your vision of "linker-friendly" stack tracing/stack unwinding 
format reconcile with these suggested approaches ? As far as I can tell, 
these formats also require linker created indexes and are 
non-concatenable (custom handling in every linker).  Something you've 
had "significant concerns" about.

From
https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
"The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch 
Table'') is created by the linker using information in the unwind 
descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section 
B.3.3, ''Compact Unwind Description'') provided by compilers. The linker 
may use the provided unwind descriptors directly or replace them with 
equivalent optimized forms based on its optimization strategies."

Above all, do users want a solution which requires falling back on 
DWARF-based processing for precise stack tracing ?

> **The AArch64 case: size matters even more**
> 
> The size consideration becomes even more critical for AArch64, which is 
> heavily deployed on mobile phones.
> There's an active feature request for compact unwind support in the 
> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
> This underscores the broader industry need for efficient unwind 
> information that doesn't duplicate data or significantly increase binary 
> size.
> 

Our measurements with a dataset of about 1400 userspace artifacts 
(binaries and shared libraries) show that the SFrame/(EH Frame + EH 
Frame HDR) ratio is:
   - Average of 0.70 on AArch64.
   - Average of 1.00 on x86_64.

Projecting what you observe for the clang binary on x86_64 to conclude 
the size ratio on AArch64 is not very wise.

Whether the size impact is worth the benefit is a choice for users to 
make.  SFrame offers users fast, precise stack traces with simple 
stack tracers.

> There are at least two formats the ELF one can learn from: LLVM's 
> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
> 

Please, if you have any concrete suggestions (keeping the above goals in 
mind), you already know how/where to engage.

> **Path forward**
> 
> Unless SFrame can actually replace .eh_frame (rather than supplementing 
> it as an accelerator for linux-perf) and demonstrate sizes smaller 
> than .eh_frame - matching the efficiency of existing compact unwind 
> approaches — I question its practical viability for userspace.
> The current design appears to add overhead rather than reduce it.
> This isn't to suggest we should simply adopt the existing compact unwind 
> format wholesale.
> The x86-64 design dates back to 2009 or earlier, and there are likely 
> improvements we can make. However, we should aim for similar or better 
> efficiency gains.
> 
> For additional context, I've documented my detailed analysis at:
> 
> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering 
> mandatory index building problems, section group compliance and garbage 
> collection issues, and version compatibility challenges)

The GC issue is a bug that is currently tracked, with a target milestone of 2.46.

> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> (size analysis)
> 


* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-04  9:21 ` Indu
@ 2025-11-05  8:21   ` Fangrui Song
  2025-11-06  0:44     ` Indu Bhagat
  0 siblings, 1 reply; 30+ messages in thread
From: Fangrui Song @ 2025-11-05  8:21 UTC (permalink / raw)
  To: Indu; +Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote:
> On 2025-10-29 11:53 p.m., Fangrui Song wrote:
> > I've been following the SFrame discussion and wanted to share some
> > concerns about its viability for userspace adoption, based on concrete
> > measurements and comparison with existing compact unwind implementations
> > in LLVM.
> >
> > **Size overhead concerns**
> >
> > Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
> > approximately 10% larger than the combined size of .eh_frame
> > and .eh_frame_hdr (8.06 MiB total).
> > This is problematic because .eh_frame cannot be eliminated - it contains
> > essential information for restoring callee-saved registers, LSDA, and
> > personality information needed for debugging (e.g. reading local
> > variables in a coredump) and C++ exception handling.
> >
> > This means adopting SFrame would result in carrying both formats, with a
> > large net size increase.
> >
> > **Learning from existing compact unwind implementations**
> >
> > It's worth noting that LLVM has had a battle-tested compact unwind
> > format in production use since 2009 with OS X 10.6, which transitioned
> > to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
> >
> >    __text section: 0x4a55470 bytes
> >    __unwind_info section: 0x79060 bytes (0.6% of __text)
> >    __eh_frame section: 0x58 bytes
> >
>
> I believe this is only synchronous? If yes, do you think this is a fair
> measurement to compare against ?
>
> Does the compact unwind info scheme work well for cases of
> shrink-wrapping ? How about the case of AArch64, where the ABI does not
> mandate if and where frame record is created ?
>
> For the numbers above, does it ensure precise stack traces ?
>
>  From the The Apple Compact Unwinding Format document
> (https://faultlore.com/blah/compact-unwinding/),
> "One consequence of only having one opcode for a whole function is that
> functions will generally have incorrect instructions for the function’s
> prologue (where callee-saved registers are individually PUSHed onto the
> stack before the rest of the stack space is allocated)."
>
> "Presumably this isn’t a very big deal, since there’s very few
> situations where unwinding would involve a function still executing its
> prologue/epilogue."
>
> Well, getting precise stack traces is a big deal and the users want them.

**Shrink-wrapping and precise stack traces**: Yes, compact unwind
handles these through an extension proposed by OpenVMS (not yet
upstreamed to LLVM):
https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html

Technical details of the extension:

- A single unwind group describes a (prologue_part1, prologue_part2,
body, epilogue) tuple.
- The prologue is conceptually split into two parts: the first part
extends up to and including the instruction that decreases RSP; the
second part extends to a point after the last preserved register is
saved but before any preserved register is modified (this location is
not unique, providing flexibility).
  + When unwinding in the prologue, the RSP register value can be
inferred from the PC and the set of saved registers.
- Since register restoration is idempotent (restoring preserved
registers multiple times during unwinding causes no harm), there is no
need to describe `pop $reg` sequences. The unwind group needs just one
bit to describe whether the 1-byte `ret` instruction is present.
- The `length` field in the compact unwind group descriptor is
repurposed to describe the prologue's two parts.
- By composing multiple unwind groups, potentially with zero-sized
prologues or omitting `ret` instructions in epilogues, it can describe
functions with shrink wrapping or tail duplication optimization.
- Null frame groups (with no prologue or epilogue) are the default and
can describe trampolines and PLT stubs.
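
As a rough illustration of the tuple described in the list above, the
following Python sketch classifies a PC into one of the four regions of
a group.  The field names and the byte-count layout are invented for
readability; the actual proposal packs this information into descriptor
bits, and this sketch does not attempt to reproduce that encoding.

```python
from dataclasses import dataclass


@dataclass
class UnwindGroup:
    start: int       # address of the first byte covered by the group
    size: int        # total number of bytes covered
    prologue1: int   # bytes up to and including the RSP adjustment
    prologue2: int   # further bytes until all preserved registers are saved
    has_ret: bool    # the "one bit": does the group end in a 1-byte ret?


def region(group: UnwindGroup, pc: int) -> str:
    """Return which part of the group a PC falls into."""
    off = pc - group.start
    if not 0 <= off < group.size:
        raise ValueError("pc is outside this group")
    if off < group.prologue1:
        return "prologue_part1"
    if off < group.prologue1 + group.prologue2:
        return "prologue_part2"
    if group.has_ret and off == group.size - 1:
        return "epilogue"
    return "body"


func = UnwindGroup(start=0x1000, size=0x40, prologue1=8, prologue2=6, has_ret=True)
null = UnwindGroup(start=0x2000, size=16, prologue1=0, prologue2=0, has_ret=False)
print(region(func, 0x1003), region(func, 0x103F), region(null, 0x2008))
```

A null frame group is simply the case where both prologue parts are
empty and no `ret` is recorded, so every PC in it is treated as body.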

Now, to compare this against SFrame's space efficiency for synchronous
unwinding, I've built llvm-mc, opt, and clang with
-fno-asynchronous-unwind-tables -funwind-tables across multiple build
configurations (clang vs gcc, frame pointer vs sframe). The resulting
.sframe section sizes are significant:

% cat ~/tmp/test-unwind.sh
#!/bin/zsh
conf() {
  configure-llvm $@ \
    -DCMAKE_EXE_LINKER_FLAGS='-pie -Wl,-z,pack-relative-relocs' \
    -DLLVM_ENABLE_UNWIND_TABLES=on \
    -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off
}

clang=-fno-integrated-as
gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc"
     "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")

fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer \
-B$HOME/opt/binutils/bin -Wa,--gsframe=no \
-fno-asynchronous-unwind-tables -funwind-tables"
sframe="-fomit-frame-pointer -momit-leaf-frame-pointer \
-B$HOME/opt/binutils/bin -Wa,--gsframe \
-fno-asynchronous-unwind-tables -funwind-tables"

conf custom-fp-sync -DCMAKE_{C,CXX}_FLAGS="$clang $fp"
conf custom-sframe-sync -DCMAKE_{C,CXX}_FLAGS="$clang $sframe"
conf custom-fp-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
conf custom-sframe-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}

for i in fp-sync sframe-sync fp-gcc-sync sframe-gcc-sync; do
  ninja -C /tmp/out/custom-$i llvm-mc opt clang
done

% ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe}-{,gcc-}sync/bin/{llvm-mc,opt,clang}
Filename                                    |       .text size |        EH size |   .sframe size |   VM size | VM increase
--------------------------------------------+------------------+----------------+----------------+-----------+------------
/tmp/out/custom-fp-sync/bin/llvm-mc         |  2124031 (23.5%) |  301136 (3.3%) |       0 (0.0%) |   9050149 |           -
/tmp/out/custom-sframe-sync/bin/llvm-mc     |  2114383 (22.3%) |  367452 (3.9%) |  348235 (3.7%) |   9483621 |       +4.8%
/tmp/out/custom-fp-gcc-sync/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |   9389677 |       +3.8%
/tmp/out/custom-sframe-gcc-sync/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |   9780985 |       +8.1%
/tmp/out/custom-fp-sync/bin/opt             | 38873081 (69.9%) | 3538408 (6.4%) |       0 (0.0%) |  55598521 |           -
/tmp/out/custom-sframe-sync/bin/opt         | 39011423 (62.4%) | 4557012 (7.3%) | 4452908 (7.1%) |  62494765 |      +12.4%
/tmp/out/custom-fp-gcc-sync/bin/opt         | 54654535 (78.1%) | 3631076 (5.2%) |       0 (0.0%) |  70001573 |      +25.9%
/tmp/out/custom-sframe-gcc-sync/bin/opt     | 53644831 (70.4%) | 4857220 (6.4%) | 5263530 (6.9%) |  76205733 |      +37.1%
/tmp/out/custom-fp-sync/bin/clang           | 68345753 (73.8%) | 6643384 (7.2%) |       0 (0.0%) |  92638305 |           -
/tmp/out/custom-sframe-sync/bin/clang       | 68500319 (64.9%) | 8684540 (8.2%) | 8521760 (8.1%) | 105572021 |      +14.0%
/tmp/out/custom-fp-gcc-sync/bin/clang       | 96515079 (82.8%) | 6556756 (5.6%) |       0 (0.0%) | 116524565 |      +25.8%
/tmp/out/custom-sframe-gcc-sync/bin/clang   | 94583903 (74.0%) | 8817628 (6.9%) | 9696263 (7.6%) | 127839309 |      +38.0%
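
For readers who want the .sframe-to-EH ratio spelled out, the short
Python snippet below derives it from the byte counts in the table.  It
assumes that the "EH size" column is the sum of .eh_frame and
.eh_frame_hdr; the dictionary labels are only descriptive names for the
six SFrame builds.

```python
# (EH size, .sframe size) in bytes for the SFrame builds, copied from the table.
SFRAME_BUILDS = {
    "clang-built llvm-mc": (367452, 348235),
    "gcc-built llvm-mc": (354292, 356073),
    "clang-built opt": (4557012, 4452908),
    "gcc-built opt": (4857220, 5263530),
    "clang-built clang": (8684540, 8521760),
    "gcc-built clang": (8817628, 9696263),
}

# Ratio of .sframe size to EH size for each binary.
ratios = {name: sframe / eh for name, (eh, sframe) in SFRAME_BUILDS.items()}

for name, ratio in ratios.items():
    print(f"{name:20s} .sframe/EH = {ratio:.2f}")
```

The ratios range from roughly 0.95 to 1.10, so in these sync builds
.sframe is about the same size as the EH data it would sit next to.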

Note: in GCC FP builds, .text is larger due to missing optimization
for RBP-based frames (e.g.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108386). Once this
optimization is implemented, GCC FP builds should actually have
smaller .text than RSP-based builds, because RBP-relative addressing
produces more compact encodings than RSP-relative addressing (which
requires an extra SIB byte).

.sframe for sync is not noticeably smaller than that for async. This
is probably because
there are still many DW_CFA_advance_loc ops even in
-fno-asynchronous-unwind-tables -funwind-tables builds.


```
% ~/Dev/bloaty/out/release/bloaty /tmp/out/custom-sframe-gcc-sync/bin/clang
    FILE SIZE        VM SIZE
 --------------  --------------
  64.0%  90.2Mi  74.0%  90.2Mi    .text
  10.9%  15.4Mi   0.0%       0    .strtab
   7.0%  9.92Mi   8.1%  9.92Mi    .rodata
   6.6%  9.25Mi   7.6%  9.25Mi    .sframe
   5.2%  7.38Mi   6.1%  7.38Mi    .eh_frame
   2.9%  4.14Mi   0.0%       0    .symtab
   1.4%  1.94Mi   1.6%  1.94Mi    .data.rel.ro
   0.9%  1.23Mi   1.0%  1.23Mi    [LOAD #4 [R]]
   0.7%  1.03Mi   0.8%  1.03Mi    .eh_frame_hdr
   0.0%       0   0.5%   636Ki    .bss
   0.2%   298Ki   0.2%   298Ki    .data
   0.0%  23.1Ki   0.0%  23.1Ki    .rela.dyn
   0.0%  10.5Ki   0.0%       0    [Unmapped]
   0.0%  9.04Ki   0.0%  9.04Ki    .dynstr
   0.0%  8.79Ki   0.0%  8.79Ki    .dynsym
   0.0%  7.31Ki   0.0%  7.31Ki    .rela.plt
   0.0%  6.42Ki   0.0%  3.98Ki    [20 Others]
   0.0%  4.89Ki   0.0%  4.89Ki    .plt
   0.0%  3.55Ki   0.0%  3.50Ki    .init_array
   0.0%  2.50Ki   0.0%  2.50Ki    .hash
   0.0%  2.46Ki   0.0%  2.46Ki    .got.plt
 100.0%   140Mi 100.0%   121Mi    TOTAL
```

Here is an aarch64 build:

 cmake -GNinja -Sllvm -B/tmp/out/a64-sframe -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
   -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ \
   -DLLVM_HOST_TRIPLE=aarch64-linux-gnu -DLLVM_TARGETS_TO_BUILD=AArch64 \
   -DLLVM_ENABLE_PLUGINS=off \
   -DCMAKE_EXE_LINKER_FLAGS='-no-pie -B$HOME/opt/binutils-aarch64/bin' \
   -DCMAKE_{C,CXX}_FLAGS="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils-aarch64/bin -Wa,--gsframe" \
   -DLLVM_NATIVE_TOOL_DIR=/tmp/out/custom-fp-gcc-sync/bin \
   -DLLVM_ENABLE_PROJECTS=clang

% ~/Dev/bloaty/out/release/bloaty /tmp/out/a64-sframe/bin/clang
    FILE SIZE        VM SIZE
 --------------  --------------
  60.0%  71.8Mi  73.2%  71.8Mi    .text
  12.3%  14.8Mi   0.0%       0    .strtab
   8.0%  9.53Mi   9.7%  9.53Mi    .rodata
   6.2%  7.39Mi   0.0%       0    .symtab
   5.8%  6.93Mi   7.1%  6.93Mi    .eh_frame
   4.2%  5.01Mi   5.1%  5.01Mi    .sframe
   1.7%  2.00Mi   2.0%  2.00Mi    .data.rel.ro
   0.8%  1.01Mi   1.0%  1.01Mi    [LOAD #2 [RX]]
   0.8%   932Ki   0.9%   932Ki    .eh_frame_hdr
   0.0%       0   0.6%   599Ki    .bss
   0.2%   294Ki   0.3%   294Ki    .data
   0.0%  40.2Ki   0.0%  40.2Ki    .got
   0.0%  20.6Ki   0.0%       0    [Unmapped]
   0.0%  9.19Ki   0.0%  9.19Ki    .dynstr
   0.0%  8.51Ki   0.0%  8.51Ki    .dynsym
   0.0%  7.41Ki   0.0%  7.41Ki    .rela.plt
   0.0%  4.97Ki   0.0%  4.97Ki    .plt
   0.0%  4.37Ki   0.0%  4.07Ki    [17 Others]
   0.0%  3.35Ki   0.0%  3.30Ki    .init_array
   0.0%  2.49Ki   0.0%  2.49Ki    .got.plt
   0.0%  2.06Ki   0.0%       0    [ELF Section Headers]
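
To put the two bloaty dumps on the same footing as the
SFrame/(EH Frame + EH Frame HDR) ratio discussed in this thread, here is
a small Python calculation using the rounded sizes printed above.  The
function name is made up for the example, and because bloaty rounds to
three significant digits the results are approximate.

```python
MIB_PER_KIB = 1 / 1024


def sframe_to_eh_ratio(sframe_mib, eh_frame_mib, eh_frame_hdr_mib):
    """Size of .sframe relative to .eh_frame plus .eh_frame_hdr."""
    return sframe_mib / (eh_frame_mib + eh_frame_hdr_mib)


# x86-64, custom-sframe-gcc-sync/bin/clang
x86_64 = sframe_to_eh_ratio(9.25, 7.38, 1.03)
# AArch64, a64-sframe/bin/clang (.eh_frame_hdr is reported in KiB)
aarch64 = sframe_to_eh_ratio(5.01, 6.93, 932 * MIB_PER_KIB)

print(f"x86-64: {x86_64:.2f}, AArch64: {aarch64:.2f}")
```

For these two clang binaries this gives about 1.10 on x86-64 and about
0.64 on AArch64.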

> >    (On macOS you can check the section size with
> > objdump --arch x86_64 -h clang and dump the unwind info with
> > objdump --arch x86_64 --unwind-info clang)
> >
> > OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
> > documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
> > https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> >
> > The compact unwind format achieves this efficiency through a two-level
> > page table structure. It describes common frame layouts compactly and
> > falls back to DWARF only when necessary, allowing most DWARF CFI entries
> > to be eliminated while maintaining full functionality. For more details,
> > see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
> > implementation
> > https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
> >
>
> How does your vision of "linker-friendly" stack tracing/stack unwinding
> format reconcile with these suggested approaches ? As far as I can tell,
> these formats also require linker created indexes and are
> non-concatenable (custom handling in every linker).  Something you've
> had "significant concerns" about.
>

We can distinguish between linking-time and execution-time
representations by using different section names.
The OpenVMS specification says:

    It is useful to note that the run-time representation of unwind
information can vary from little more than a simple concatenation of
the compile-time information to a substantial rewriting of unwind
information by the linker. The proposal favors simple concatenation
while maintaining the same ordering of groups as their associated
code.

The runtime library can build this index at runtime and cache it to disk.
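
A minimal sketch of what such a runtime-built index could look like, in
Python for brevity.  The Entry fields and the UnwindIndex class are
hypothetical and not part of any existing format or library; the point
is only that simply concatenated entries can be sorted once and then
searched by PC.  Caching the sorted result to disk is left out.

```python
import bisect
from typing import Iterable, List, NamedTuple, Optional


class Entry(NamedTuple):
    start: int        # first address covered
    length: int       # number of bytes covered
    descriptor: str   # stand-in for the real unwind descriptor


class UnwindIndex:
    """Sorted lookup table built from concatenated, unordered entries."""

    def __init__(self, entries: Iterable[Entry]) -> None:
        self.entries: List[Entry] = sorted(entries, key=lambda e: e.start)
        self.starts = [e.start for e in self.entries]

    def lookup(self, pc: int) -> Optional[Entry]:
        # Find the last entry whose start address is <= pc.
        i = bisect.bisect_right(self.starts, pc) - 1
        if i < 0:
            return None
        entry = self.entries[i]
        return entry if pc < entry.start + entry.length else None


index = UnwindIndex([
    Entry(0x3000, 0x20, "c"),
    Entry(0x1000, 0x40, "a"),
    Entry(0x2000, 0x10, "b"),
])
print(index.lookup(0x1010), index.lookup(0x2800))
```

The build step costs one sort per loaded module, after which each
lookup is a binary search.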

Once the design becomes sufficiently stable, we can introduce an
opt-in linker option --xxxxframe-index that builds an index from
recognized format versions while reporting warnings for unrecognized
ones.
We need to carefully design this mechanism to be stable and robust,
avoiding frequent format updates.

>  From
> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
> Table'') is created by the linker using information in the unwind
> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
> may use the provided unwind descriptors directly or replace them with
> equivalent optimized forms based on its optimization strategies."
>
> Above all, do users want a solution which requires falling back on
> DWARF-based processing for precise stack tracing ?

The key distinction is that compact unwind handles the vast majority
of functions without DWARF—the macOS measurements show __unwind_info
at 0.6% of __text size with __eh_frame reduced to negligible size
(0x58 bytes). While SFrame also cannot handle all frames, compact
unwind achieves dramatic size reductions by making DWARF the exception
rather than requiring it alongside a supplementary format.

The DWARF fallback provides flexibility for additional coverage when
needed, but nothing would be lost (at least for the clang binary on
macOS) if the DWARF fallback were disabled in a hypothetical future
linux-perf implementation.

> > **The AArch64 case: size matters even more**
> >
> > The size consideration becomes even more critical for AArch64, which is
> > heavily deployed on mobile phones.
> > There's an active feature request for compact unwind support in the
> > AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
> > This underscores the broader industry need for efficient unwind
> > information that doesn't duplicate data or significantly increase binary
> > size.
> >
>
> Our measurements with a dataset of about 1400 userspace artifacts
> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
> Frame HDR) ratio is:
>    - Average of 0.70 on AArch64.
>    - Average of 1.00 on x86_64.
>
> Projecting the size of what you observe for clang binary on x86_64 to
> conclude the size ratio on AArch64 is not very wise to do.
>
> Whether the size impact is worth the benefit: its a choice for users to
> make.  SFrame offers the users fast, precise stack traces with simple
> stack tracers.

Thank you for providing the AArch64 measurements. Even with a 0.70x
ratio on AArch64, this represents substantial memory overhead:
.eh_frame is already large and a frequent subject of complaints.
Since it cannot be eliminated (it is needed for debugging and C++
exceptions), adding another 0.70x on top of it means significant
additional overhead for users.

> > There are at least two formats the ELF one can learn from: LLVM's
> > compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
> >
>
> Please, if you have any concrete suggestions (keeping the above goals in
> mind), you already know how/where to engage.

I've provided concrete suggestions throughout this discussion.

> > **Path forward**
> >
> > Unless SFrame can actually replace .eh_frame (rather than supplementing
> > it as an accelerator for linux-perf) and demonstrate sizes smaller
> > than .eh_frame - matching the efficiency of existing compact unwind
> > approaches — I question its practical viability for userspace.
> > The current design appears to add overhead rather than reduce it.
> > This isn't to suggest we should simply adopt the existing compact unwind
> > format wholesale.
> > The x86-64 design dates back to 2009 or earlier, and there are likely
> > improvements we can make. However, we should aim for similar or better
> > efficiency gains.
> >
> > For additional context, I've documented my detailed analysis at:
> >
> > - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> > mandatory index building problems, section group compliance and garbage
> > collection issues, and version compatibility challenges)
>
> GC issue is a bug currently tracked and with a target milestone of 2.46.
>
> > - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> > (size analysis)
> >


* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-05  8:21   ` Fangrui Song
@ 2025-11-06  0:44     ` Indu Bhagat
  2025-11-06  7:51       ` Florian Weimer
  2025-11-06  9:20       ` Fangrui Song
  0 siblings, 2 replies; 30+ messages in thread
From: Indu Bhagat @ 2025-11-06  0:44 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On 11/5/25 12:21 AM, Fangrui Song wrote:
>> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote:
>> On 2025-10-29 11:53 p.m., Fangrui Song wrote:
>>> I've been following the SFrame discussion and wanted to share some
>>> concerns about its viability for userspace adoption, based on concrete
>>> measurements and comparison with existing compact unwind implementations
>>> in LLVM.
>>>
>>> **Size overhead concerns**
>>>
>>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
>>> approximately 10% larger than the combined size of .eh_frame
>>> and .eh_frame_hdr (8.06 MiB total).
>>> This is problematic because .eh_frame cannot be eliminated - it contains
>>> essential information for restoring callee-saved registers, LSDA, and
>>> personality information needed for debugging (e.g. reading local
>>> variables in a coredump) and C++ exception handling.
>>>
>>> This means adopting SFrame would result in carrying both formats, with a
>>> large net size increase.
>>>
>>> **Learning from existing compact unwind implementations**
>>>
>>> It's worth noting that LLVM has had a battle-tested compact unwind
>>> format in production use since 2009 with OS X 10.6, which transitioned
>>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
>>>
>>>     __text section: 0x4a55470 bytes
>>>     __unwind_info section: 0x79060 bytes (0.6% of __text)
>>>     __eh_frame section: 0x58 bytes
>>>
>>
>> I believe this is only synchronous? If yes, do you think this is a fair
>> measurement to compare against ?
>>
>> Does the compact unwind info scheme work well for cases of
>> shrink-wrapping ? How about the case of AArch64, where the ABI does not
>> mandate if and where frame record is created ?
>>
>> For the numbers above, does it ensure precise stack traces ?
>>
>>   From the The Apple Compact Unwinding Format document
>> (https://faultlore.com/blah/compact-unwinding/),
>> "One consequence of only having one opcode for a whole function is that
>> functions will generally have incorrect instructions for the function’s
>> prologue (where callee-saved registers are individually PUSHed onto the
>> stack before the rest of the stack space is allocated)."
>>
>> "Presumably this isn’t a very big deal, since there’s very few
>> situations where unwinding would involve a function still executing its
>> prologue/epilogue."
>>
>> Well, getting precise stack traces is a big deal and the users want them.
> 
> **Shrink-wrapping and precise stack traces**: Yes, compact unwind
> handles these through an extension proposed by OpenVMS (not yet
> upstreamed to LLVM):
> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html
> 

Thanks for the link.

The above questions were strictly in the context of the battle-tested 
"The Apple Compact Unwinding Format" in production in the lld/MachO 
implementation, not for the proposed OpenVMS extensions.

Is it possible to get answers to those questions with that context in place?

If shrink-wrapping and precise stack traces aren't supported without the 
OpenVMS extension (which is not yet implemented), aren't we comparing 
apples and pears here?

> Technical details of the extension:
> 
> - A single unwind group describes a (prologue_part1, prologue_part2,
> body, epilogue) tuple.
> - The prologue is conceptually split into two parts: the first part
> extends up to and including the instruction that decreases RSP; the
> second part extends to a point after the last preserved register is
> saved but before any preserved register is modified (this location is
> not unique, providing flexibility).
>    + When unwinding in the prologue, the RSP register value can be
> inferred from the PC and the set of saved registers.
> - Since register restoration is idempotent (restoring preserved
> registers multiple times during unwinding causes no harm), there is no
> need to describe `pop $reg` sequences. The unwind group needs just one
> bit to describe whether the 1-byte `ret` instruction is present.

Is this true for the case of asynchronous stack tracing too ?

> - The `length` field in the compact unwind group descriptor is
> repurposed to describe the prologue's two parts.
> - By composing multiple unwind groups, potentially with zero-sized
> prologues or omitting `ret` instructions in epilogues, it can describe
> functions with shrink wrapping or tail duplication optimization.
> - Null frame groups (with no prologue or epilogue) are the default and
> can describe trampolines and PLT stubs.

PLT stubs may use the stack (push to stack). As per the document, "A null 
frame (MODE = 8) is the simplest possible frame, with no allocated stack 
of either kind (hence no saved registers)".  So a null frame can be used 
for a PLT stub only if the functions invoking it use an 
RBP-based frame.  Isn't that so?
BTW, both EH Frame and SFrame have a specific, condensed 
representation for PLT entry metadata.

> 

Anyway, thanks for the summary.

I see that the OpenVMS extension for asynchronous compact unwind 
descriptors is in an RFC state ATM.  A few important observations and 
questions:

  - As noted in the recently revived discussion, 
https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471, 
there is going to be a *non-negligible* size overhead as soon as you 
move towards a specification for asynchronous unwinding (vs. the current 
specification, which caters to synchronous only).  Now add to that the 
quirks of each architecture/ABI :). Any comments?

  - From the document: "Use of any preserved register must be delayed 
until all of the preserved registers have been saved."
    Q: Does this work well with optimizing compilers ? Is this an ABI 
change being asked for multiple architectures ?

  - From the document: "It appears technically feasible for a null frame 
function to have a personality routine. However, the utility of such a 
capability seems too meager to justify allowing this. We propose to not
support this." and "If the first attempt to lookup an unwind group for 
an exception address fails, then it is (tentatively) assumed to have 
occurred within a null frame function or in a part of a function
that is adequately described by a null frame. The presumed return 
address is (virtually or actually) popped from the top of stack and 
looked up. This second attempted lookup must succeed, in which case 
processing continues normally. A failure is a fatal error."
   Q: Is this a problem, especially because the goal is to evolve the 
OpenVMS RFC proposal to subsume .eh_frame?
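
For reference, the two-attempt lookup quoted in the last item can be 
sketched as follows in Python.  The function name, the list-based stack 
model (top of stack at the end), and the `lookup` callback are 
assumptions made for this example; they are not taken from the proposal.

```python
from typing import Callable, List, Optional, Tuple


def find_group(pc: int, stack: List[int],
               lookup: Callable[[int], Optional[str]]) -> Tuple[int, str]:
    """Return (address, group) using the null-frame fallback rule."""
    group = lookup(pc)
    if group is not None:
        return pc, group
    # First lookup failed: tentatively assume a null frame, so the
    # return address is at the top of the stack ("virtually" popped).
    if not stack:
        raise RuntimeError("fatal: no return address to fall back on")
    return_address = stack[-1]
    group = lookup(return_address)
    if group is None:
        # Per the quoted text, a failed second lookup is a fatal error.
        raise RuntimeError("fatal: second lookup failed")
    return return_address, group


groups = {0x1000: "caller_group"}          # hypothetical lookup table
print(find_group(0x5000, [0x1000], groups.get))
```

Note that the fallback silently treats any unknown PC as a null frame, 
which is what the question above is probing.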

Are there people actively working towards bringing this to fruition?

> Now, to compare this against SFrame's space efficiency for synchronous
> unwinding, I've built llvm-mc, opt, and clang with
> -fno-asynchronous-unwind-tables -funwind-tables across multiple build
> configurations (clang vs gcc, frame pointer vs sframe). The resulting
> .sframe section sizes are significant:
> 
> % cat ~/tmp/test-unwind.sh
> #!/bin/zsh
> conf() {
>    configure-llvm $@ -DCMAKE_EXE_LINKER_FLAGS='-pie -Wl,-z,pack-relative-relocs' -DLLVM_ENABLE_UNWIND_TABLES=on \
>      -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off
> }
> 
> clang=-fno-integrated-as
> gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")
> 
> fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no -fno-asynchronous-unwind-tables -funwind-tables"
> sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe -fno-asynchronous-unwind-tables -funwind-tables"
> 
> conf custom-fp-sync -DCMAKE_{C,CXX}_FLAGS="$clang $fp"
> conf custom-sframe-sync -DCMAKE_{C,CXX}_FLAGS="$clang $sframe"
> conf custom-fp-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
> conf custom-sframe-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}
> 
> for i in fp-sync sframe-sync fp-gcc-sync sframe-gcc-sync; do ninja -C /tmp/out/custom-$i llvm-mc opt clang; done
> 
> % ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe}-{,gcc-}sync/bin/{llvm-mc,opt,clang}
> Filename                                    |       .text size |        EH size |   .sframe size |   VM size | VM increase
> --------------------------------------------+------------------+----------------+----------------+-----------+------------
> /tmp/out/custom-fp-sync/bin/llvm-mc         |  2124031 (23.5%) |  301136 (3.3%) |       0 (0.0%) |   9050149 |           -
> /tmp/out/custom-sframe-sync/bin/llvm-mc     |  2114383 (22.3%) |  367452 (3.9%) |  348235 (3.7%) |   9483621 |       +4.8%
> /tmp/out/custom-fp-gcc-sync/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |   9389677 |       +3.8%
> /tmp/out/custom-sframe-gcc-sync/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |   9780985 |       +8.1%
> /tmp/out/custom-fp-sync/bin/opt             | 38873081 (69.9%) | 3538408 (6.4%) |       0 (0.0%) |  55598521 |           -
> /tmp/out/custom-sframe-sync/bin/opt         | 39011423 (62.4%) | 4557012 (7.3%) | 4452908 (7.1%) |  62494765 |      +12.4%
> /tmp/out/custom-fp-gcc-sync/bin/opt         | 54654535 (78.1%) | 3631076 (5.2%) |       0 (0.0%) |  70001573 |      +25.9%
> /tmp/out/custom-sframe-gcc-sync/bin/opt     | 53644831 (70.4%) | 4857220 (6.4%) | 5263530 (6.9%) |  76205733 |      +37.1%
> /tmp/out/custom-fp-sync/bin/clang           | 68345753 (73.8%) | 6643384 (7.2%) |       0 (0.0%) |  92638305 |           -
> /tmp/out/custom-sframe-sync/bin/clang       | 68500319 (64.9%) | 8684540 (8.2%) | 8521760 (8.1%) | 105572021 |      +14.0%
> /tmp/out/custom-fp-gcc-sync/bin/clang       | 96515079 (82.8%) | 6556756 (5.6%) |       0 (0.0%) | 116524565 |      +25.8%
> /tmp/out/custom-sframe-gcc-sync/bin/clang   | 94583903 (74.0%) | 8817628 (6.9%) | 9696263 (7.6%) | 127839309 |      +38.0%
> 
> Note: in GCC FP builds, .text is larger due to missing optimization
> for RBP-based frames (e.g.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108386). Once this
> optimization is implemented, GCC FP builds should actually have
> smaller .text than RSP-based builds, because RBP-relative addressing
> produces more compact encodings than RSP-relative addressing (which
> requires an extra SIB byte).
> 
> .sframe for sync is not noticeably smaller than that for async. This
> is probably because
> there are still many DW_CFA_advance_loc ops even in
> -fno-asynchronous-unwind-tables -funwind-tables builds.
> 

It is possible that this is because, in the Apple Compact Unwind Format,
the linker optimizes compact unwind descriptors into the three-level
paged structure, effectively de-duplicating some content.

>>>     (On macOS you can check the section size with
>>> objdump --arch x86_64 -h clang and dump the unwind info with
>>> objdump --arch x86_64 --unwind-info clang)
>>>
>>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
>>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
>>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
>>>
>>> The compact unwind format achieves this efficiency through a two-level
>>> page table structure. It describes common frame layouts compactly and
>>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
>>> to be eliminated while maintaining full functionality. For more details,
>>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
>>> implementation
>>> https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
>>>
>>
>> How does your vision of "linker-friendly" stack tracing/stack unwinding
>> format reconcile with these suggested approaches ? As far as I can tell,
>> these formats also require linker created indexes and are
>> non-concatenable (custom handling in every linker).  Something you've
>> had "significant concerns" about.
>>

This question is unanswered: What do you think about the
"linker-friendliness" of the current lld/MachO implementation of the
compact unwind format in LLVM?

> 
> We can distinguish between linking-time and execution-time
> representations by using different section names.
> The OpenVMS specification says:
> 
>      It is useful to note that the run-time representation of unwind
> information can vary from little more than a simple concatenation of
> the compile-time information to a substantial rewriting of unwind
> information by the linker. The proposal favors simple concatenation
> while maintaining the same ordering of groups as their associated
> code.
> 
> The runtime library can build this index at runtime and cache it to disk.
> 

This will include the dynamic linker and the stack tracer in the Linux 
kernel (the latter when stack tracing user space stacks).  Do you think 
this is feasible ?

> Once the design becomes sufficiently stable, we can introduce an
> opt-in linker option --xxxxframe-index that builds an index from
> recognized format versions while reporting warnings for unrecognized
> ones.
> We need to carefully design this mechanism to be stable and robust,
> avoiding frequent format updates.
>
>>   From
>> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
>> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
>> Table'') is created by the linker using information in the unwind
>> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
>> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
>> may use the provided unwind descriptors directly or replace them with
>> equivalent optimized forms based on its optimization strategies."
>>
>> Above all, do users want a solution which requires falling back on
>> DWARF-based processing for precise stack tracing ?
> 
> The key distinction is that compact unwind handles the vast majority
> of functions without DWARF—the macOS measurements show __unwind_info
> at 0.6% of __text size with __eh_frame reduced to negligible size
> (0x58 bytes). While SFrame also cannot handle all frames, compact
> unwind achieves dramatic size reductions by making DWARF the exception
> rather than requiring it alongside a supplementary format.
> 

As we have tried to reason, this is a misleading comparison. The compact 
unwind tables format:
   - needs to be extended for asynchronous stack unwinding
   - needs to be extended for other ABI/architectures
   - Making it concatenable / linker-friendly will also likely impose 
some negative effects on size.

> The DWARF fallback provides flexibility for additional coverage when
> needed, but nothing is lost (at least for the clang binary on macOS)
> if DWARF fallback were disabled in a hypothetical future linux-perf
> implementation.
> 

Fair enough, that's something for linux-perf/kernel to decide.  Once the
OpenVMS RFC is sufficiently shaped to become a viable replacement for 
.eh_frame, this question will be for the stakeholders to decide.

>>> **The AArch64 case: size matters even more**
>>>
>>> The size consideration becomes even more critical for AArch64, which is
>>> heavily deployed on mobile phones.
>>> There's an active feature request for compact unwind support in the
>>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
>>> This underscores the broader industry need for efficient unwind
>>> information that doesn't duplicate data or significantly increase binary
>>> size.
>>>
>>
>> Our measurements with a dataset of about 1400 userspace artifacts
>> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
>> Frame HDR) ratio is:
>>     - Average of 0.70 on AArch64.
>>     - Average of 1.00 on x86_64.
>>
>> Projecting the size of what you observe for clang binary on x86_64 to
>> conclude the size ratio on AArch64 is not very wise to do.
>>
>> Whether the size impact is worth the benefit: its a choice for users to
>> make.  SFrame offers the users fast, precise stack traces with simple
>> stack tracers.
> 
> Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on
> AArch64, this represents substantial memory overhead when considering:
> 
> .eh_frame is already large and being complained about.
> Being unable to eliminate it (needed for debugging and C++ exceptions)
> and adding 0.70x more means significant additional overhead for users.
> 
>>> There are at least two formats the ELF one can learn from: LLVM's
>>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
>>>
>>
>> Please, if you have any concrete suggestions (keeping the above goals in
>> mind), you already know how/where to engage.
> 
> I've provided concrete suggestions throughout this discussion.
> 

Apologies, I should have been more precise.  And I ask because you know 
the details about both SFrame and the variants of Compact Unwind 
Descriptor formats at this point :). If you have concrete suggestions to 
improve the SFrame format for size, please let us know.

>>> **Path forward**
>>>
>>> Unless SFrame can actually replace .eh_frame (rather than supplementing
>>> it as an accelerator for linux-perf) and demonstrate sizes smaller
>>> than .eh_frame - matching the efficiency of existing compact unwind
>>> approaches — I question its practical viability for userspace.
>>> The current design appears to add overhead rather than reduce it.
>>> This isn't to suggest we should simply adopt the existing compact unwind
>>> format wholesale.
>>> The x86-64 design dates back to 2009 or earlier, and there are likely
>>> improvements we can make. However, we should aim for similar or better
>>> efficiency gains.
>>>
>>> For additional context, I've documented my detailed analysis at:
>>>
>>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
>>> mandatory index building problems, section group compliance and garbage
>>> collection issues, and version compatibility challenges)
>>
>> GC issue is a bug currently tracked and with a target milestone of 2.46.
>>
>>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
>>> (size analysis)
>>>



* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-06  0:44     ` Indu Bhagat
@ 2025-11-06  7:51       ` Florian Weimer
  2025-11-06 21:02         ` Indu Bhagat
  2025-11-06  9:20       ` Fangrui Song
  1 sibling, 1 reply; 30+ messages in thread
From: Florian Weimer @ 2025-11-06  7:51 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

* Indu Bhagat:

> PLT stubs may use stack (push to stack). As per the document "A null
> frame (MODE = 8) is the simplest possible frame, with no allocated
> stack of either kind (hence no saved registers)".  So null frame can
> be used for PLT only if the functions invoking the PLT stub were using
> an RBP-based frame.  Isnt it ?

I think I said this before, but I don't think new toolchain features
need to support lazy binding.  Without lazy bindings, the PLT stubs do
not change the stack pointer or frame pointer and just make a tail call.

Do you see a need for continued support of lazy binding?

Thanks,
Florian



* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-06  0:44     ` Indu Bhagat
  2025-11-06  7:51       ` Florian Weimer
@ 2025-11-06  9:20       ` Fangrui Song
  2025-11-06 20:42         ` Indu Bhagat
  1 sibling, 1 reply; 30+ messages in thread
From: Fangrui Song @ 2025-11-06  9:20 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> On 11/5/25 12:21 AM, Fangrui Song wrote:
> >> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote:
> >> On 2025-10-29 11:53 p.m., Fangrui Song wrote:
> >>> I've been following the SFrame discussion and wanted to share some
> >>> concerns about its viability for userspace adoption, based on concrete
> >>> measurements and comparison with existing compact unwind implementations
> >>> in LLVM.
> >>>
> >>> **Size overhead concerns**
> >>>
> >>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
> >>> approximately 10% larger than the combined size of .eh_frame
> >>> and .eh_frame_hdr (8.06 MiB total).
> >>> This is problematic because .eh_frame cannot be eliminated - it contains
> >>> essential information for restoring callee-saved registers, LSDA, and
> >>> personality information needed for debugging (e.g. reading local
> >>> variables in a coredump) and C++ exception handling.
> >>>
> >>> This means adopting SFrame would result in carrying both formats, with a
> >>> large net size increase.
> >>>
> >>> **Learning from existing compact unwind implementations**
> >>>
> >>> It's worth noting that LLVM has had a battle-tested compact unwind
> >>> format in production use since 2009 with OS X 10.6, which transitioned
> >>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
> >>>
> >>>     __text section: 0x4a55470 bytes
> >>>     __unwind_info section: 0x79060 bytes (0.6% of __text)
> >>>     __eh_frame section: 0x58 bytes
> >>>
> >>
> >> I believe this is only synchronous? If yes, do you think this is a fair
> >> measurement to compare against ?
> >>
> >> Does the compact unwind info scheme work well for cases of
> >> shrink-wrapping ? How about the case of AArch64, where the ABI does not
> >> mandate if and where frame record is created ?
> >>
> >> For the numbers above, does it ensure precise stack traces ?
> >>
> >>   From the The Apple Compact Unwinding Format document
> >> (https://faultlore.com/blah/compact-unwinding/),
> >> "One consequence of only having one opcode for a whole function is that
> >> functions will generally have incorrect instructions for the function’s
> >> prologue (where callee-saved registers are individually PUSHed onto the
> >> stack before the rest of the stack space is allocated)."
> >>
> >> "Presumably this isn’t a very big deal, since there’s very few
> >> situations where unwinding would involve a function still executing its
> >> prologue/epilogue."
> >>
> >> Well, getting precise stack traces is a big deal and the users want them.
> >
> > **Shrink-wrapping and precise stack traces**: Yes, compact unwind
> > handles these through an extension proposed by OpenVMS (not yet
> > upstreamed to LLVM):
> > https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html
> >
>
> Thanks for the link.
>
> The above questions were strictly in the context of the battle-tested
> "The Apple Compact Unwinding Format" in production in the lld/MachO
> implementation, not for the proposed OpenVMS extensions.
>
> Is it possible to get answers to those questions with that context in place?
>
> If shrink-wrapping and precise stack traces isnt supported without the
> OpenVMS extension (that is not yet implemented), arent we comparing
> apples vs pears here ?

You're right to ask for clarification.
The extended compact unwind information works with shrink wrapping.

For context, an FDE in .eh_frame costs at least 20 bytes (often 30+),
and its associated .eh_frame_hdr entry costs another 8 bytes.
Even a larger compact unwind descriptor at 8 bytes yields significant
savings compared to .eh_frame. Tripling that to 24 bytes is still a
substantial win.
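
To put rough numbers on the per-function cost, here is a minimal Python
sketch. The byte counts are the ones quoted above; the helper names
(eh_cost, saving) are made up for illustration, and real FDE sizes vary
with the CFI each function needs, so treat this as arithmetic rather
than a measurement.

```python
# Per-function unwind metadata cost, using the byte counts quoted above.
FDE_MIN = 20             # smallest .eh_frame FDE
FDE_TYPICAL = 30         # "often 30+"
HDR_ENTRY = 8            # one .eh_frame_hdr search-table entry
COMPACT = 8              # a larger compact unwind descriptor
RECORDS_SHRINK_WRAP = 3  # records typically needed with shrink wrapping


def eh_cost(fde_bytes):
    """Bytes spent on one function in .eh_frame plus .eh_frame_hdr."""
    return fde_bytes + HDR_ENTRY


def saving(fde_bytes, compact_bytes):
    """Fraction of the EH cost saved by compact descriptors of this size."""
    return 1 - compact_bytes / eh_cost(fde_bytes)


for label, fde in (("minimum FDE", FDE_MIN), ("typical FDE", FDE_TYPICAL)):
    one = saving(fde, COMPACT)
    three = saving(fde, COMPACT * RECORDS_SHRINK_WRAP)
    print(f"{label}: EH {eh_cost(fde)} B, "
          f"1 record saves {one:.0%}, 3 records save {three:.0%}")
```

The three-record case only applies to the few functions that are
shrink-wrapped; everything else pays for a single record.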

Additionally, very few functions benefit from shrink wrapping
optimization. When needed, we require multiple unwind description
records (typically 3).

> > Technical details of the extension:
> >
> > - A single unwind group describes a (prologue_part1, prologue_part2,
> > body, epilogue) tuple.
> > - The prologue is conceptually split into two parts: the first part
> > extends up to and including the instruction that decreases RSP; the
> > second part extends to a point after the last preserved register is
> > saved but before any preserved register is modified (this location is
> > not unique, providing flexibility).
> >    + When unwinding in the prologue, the RSP register value can be
> > inferred from the PC and the set of saved registers.
> > - Since register restoration is idempotent (restoring preserved
> > registers multiple times during unwinding causes no harm), there is no
> > need to describe `pop $reg` sequences. The unwind group needs just one
> > bit to describe whether the 1-byte `ret` instruction is present.
>
> Is this true for the case of asynchronous stack tracing too ?

Yes. I believe it means the epilogue mirrors the prologue. Since we
know which registers were saved in the prologue, we can infer the pop
instructions in the epilogue and compute the SP offset when unwinding
in the middle of an epilogue.
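
A minimal sketch of that inference, in Python. It assumes a simplified
x86-64 function whose prologue pushes the preserved registers and then
subtracts a fixed frame size from RSP, and whose epilogue is the exact
mirror image (add rsp, then pops in reverse order, then ret). This is
not the OpenVMS encoding; the function names are hypothetical and the PC
is assumed to sit on an instruction boundary.

```python
# Registers whose pop needs a REX prefix (2-byte encoding instead of 1).
REX_REGS = {"r12", "r13", "r14", "r15"}


def epilogue_steps(saved_regs, frame_size):
    """Return (instruction_length, bytes_released) for each epilogue insn."""
    steps = []
    if frame_size:
        # add rsp, imm8 is 4 bytes; add rsp, imm32 is 7 bytes.
        steps.append((4 if frame_size <= 127 else 7, frame_size))
    for reg in reversed(saved_regs):  # pops mirror the pushes
        steps.append((2 if reg in REX_REGS else 1, 8))
    return steps


def ra_offset_in_epilogue(saved_regs, frame_size, pc_off):
    """Bytes between RSP and the return address, pc_off bytes into the epilogue."""
    remaining = frame_size + 8 * len(saved_regs)
    pos = 0
    for length, released in epilogue_steps(saved_regs, frame_size):
        if pos >= pc_off:
            break  # this instruction has not executed yet
        remaining -= released
        pos += length
    return remaining


regs = ["rbp", "rbx", "r12"]  # push order in the prologue
for off in (0, 4, 6, 7, 8):   # instruction boundaries; 8 is the ret
    print(off, ra_offset_in_epilogue(regs, 40, off))
```

Only the saved-register set and the frame size are needed as input,
which is why no per-instruction description of the epilogue has to be
stored in this model.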

> > - The `length` field in the compact unwind group descriptor is
> > repurposed to describe the prologue's two parts.
> > - By composing multiple unwind groups, potentially with zero-sized
> > prologues or omitting `ret` instructions in epilogues, it can describe
> > functions with shrink wrapping or tail duplication optimization.
> > - Null frame groups (with no prologue or epilogue) are the default and
> > can describe trampolines and PLT stubs.
>
> PLT stubs may use stack (push to stack). As per the document "A null
> frame (MODE = 8) is the simplest possible frame, with no allocated stack
> of either kind (hence no saved registers)".  So null frame can be used
> for PLT only if the functions invoking the PLT stub were using an
> RBP-based frame.  Isn't that so?
> BTW, both EH Frame and SFrame have a specific, condensed
> representation for metadata for PLT entries.

A profiler can trivially retrieve the return address using the default
rule: if a code region is not covered by metadata, assume the return
address is available at *rsp (x86-64) or in the link register (most
other architectures).

The ld-generated unwind info for PLT entries is largely obsolete
nowadays due to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW).
Such PLT entries behave as functions without a prologue, so the same
default unwinding rule recovers the return address.
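
The default rule can be sketched as follows in Python. The table layout
(start, end, offset from RSP to the return address), the addresses, and
the dict standing in for stack memory are all assumptions made for
illustration; a real profiler would read user memory and would usually
look up the return address minus one for caller frames.

```python
import bisect

# Hypothetical metadata: (start, end, bytes from RSP to the return address).
ranges = [(0x1000, 0x1080, 24)]
starts = [r[0] for r in ranges]

# Hypothetical stack contents: address -> 8-byte value.
stack = {0x7000: 0x1010, 0x7020: 0x9999}


def find_entry(pc):
    """Return the metadata entry covering pc, or None if pc is uncovered."""
    i = bisect.bisect_right(starts, pc) - 1
    if i >= 0 and pc < ranges[i][1]:
        return ranges[i]
    return None


def unwind_step(pc, sp):
    """Return (return_address, caller_sp) for one frame."""
    entry = find_entry(pc)
    # Default rule: no metadata means the return address is at *rsp.
    ra_off = entry[2] if entry else 0
    return stack[sp + ra_off], sp + ra_off + 8


# 0x2000 stands for a PLT stub with no metadata at all.
pc, sp = unwind_step(0x2000, 0x7000)
print(hex(pc), hex(sp))
pc, sp = unwind_step(pc, sp)
print(hex(pc), hex(sp))
```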

> >
>
> Anyway, thanks for the summary.
>
> I see that the OpenVMS extension for asynchronous compact unwind
> descriptors is in an RFC state at the moment.  A few important
> observations and questions:
>
>   - As noted in the recently revived discussion,
> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471,
> there is going to be a *non-negligible* size overhead as soon as you
> move towards a specification for asynchronous (vs the current
> specification that caters to synchronous only).  Now add to it, the
> quirks of each architecture/ABI :). Any comments ?

As mentioned, even a larger compact unwind descriptor at 8 bytes
yields significant savings compared to .eh_frame, and is also
substantially smaller than SFrame.

>   - From the document: "Use of any preserved register must be delayed
> until all of the preserved registers have been saved."
>     Q: Does this work well with optimizing compilers ? Is this an ABI
> change being asked for multiple architectures ?

I think this is about support for callee-saved registers, a feature
SFrame doesn't have.

I need to think about the details, but this thread is probably not the
best place to discuss them.

>   - From the document: "It appears technically feasible for a null frame
> function to have a personality routine. However, the utility of such a
> capability seems too meager to justify allowing this. We propose to not
> support this." and "If the first attempt to lookup an unwind group for
> an exception address fails, then it is (tentatively) assumed to have
> occurred within a null frame function or in a part of a function
> that is adequately described by a null frame. The presumed return
> address is (virtually or actually) popped from the top of stack and
> looked up. This second attempted lookup must succeed, in which case
> processing continues normally. A failure is a fatal error."
>    Q: Is this a problem, especially because the goal is to evolve the
> OpenVMS RFC proposal to subsume .eh_frame?

I think this just hard-encodes the default rule, similar to what
SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed
offset from the CFA when entering a new function."

While I haven't given this much thought yet, I don't think this
introduces problems that SFrame doesn't have.

> Are there people actively working towards bringing this to fruition?
>
> > Now, to compare this against SFrame's space efficiency for synchronous
> > unwinding, I've built llvm-mc, opt, and clang with
> > -fno-asynchronous-unwind-tables -funwind-tables across multiple build
> > configurations (clang vs gcc, frame pointer vs sframe). The resulting
> > .sframe section sizes are significant:
> >
> > % cat ~/tmp/test-unwind.sh
> > #!/bin/zsh
> > conf() {
> >    configure-llvm $@ -DCMAKE_EXE_LINKER_FLAGS='-pie -Wl,-z,pack-relative-relocs' -DLLVM_ENABLE_UNWIND_TABLES=on \
> >      -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_LLD=off
> > }
> >
> > clang=-fno-integrated-as
> > gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")
> >
> > fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no -fno-asynchronous-unwind-tables -funwind-tables"
> > sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe -fno-asynchronous-unwind-tables -funwind-tables"
> >
> > conf custom-fp-sync -DCMAKE_{C,CXX}_FLAGS="$clang $fp"
> > conf custom-sframe-sync -DCMAKE_{C,CXX}_FLAGS="$clang $sframe"
> > conf custom-fp-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
> > conf custom-sframe-gcc-sync -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}
> >
> > for i in fp-sync sframe-sync fp-gcc-sync sframe-gcc-sync; do ninja -C /tmp/out/custom-$i llvm-mc opt clang; done
> >
> > % ~/Dev/unwind-info-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe}-{,gcc-}sync/bin/{llvm-mc,opt,clang}
> > Filename                                    |       .text size |        EH size |   .sframe size |   VM size | VM increase
> > --------------------------------------------+------------------+----------------+----------------+-----------+------------
> > /tmp/out/custom-fp-sync/bin/llvm-mc         |  2124031 (23.5%) |  301136 (3.3%) |       0 (0.0%) |   9050149 |           -
> > /tmp/out/custom-sframe-sync/bin/llvm-mc     |  2114383 (22.3%) |  367452 (3.9%) |  348235 (3.7%) |   9483621 |       +4.8%
> > /tmp/out/custom-fp-gcc-sync/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |   9389677 |       +3.8%
> > /tmp/out/custom-sframe-gcc-sync/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |   9780985 |       +8.1%
> > /tmp/out/custom-fp-sync/bin/opt             | 38873081 (69.9%) | 3538408 (6.4%) |       0 (0.0%) |  55598521 |           -
> > /tmp/out/custom-sframe-sync/bin/opt         | 39011423 (62.4%) | 4557012 (7.3%) | 4452908 (7.1%) |  62494765 |      +12.4%
> > /tmp/out/custom-fp-gcc-sync/bin/opt         | 54654535 (78.1%) | 3631076 (5.2%) |       0 (0.0%) |  70001573 |      +25.9%
> > /tmp/out/custom-sframe-gcc-sync/bin/opt     | 53644831 (70.4%) | 4857220 (6.4%) | 5263530 (6.9%) |  76205733 |      +37.1%
> > /tmp/out/custom-fp-sync/bin/clang           | 68345753 (73.8%) | 6643384 (7.2%) |       0 (0.0%) |  92638305 |           -
> > /tmp/out/custom-sframe-sync/bin/clang       | 68500319 (64.9%) | 8684540 (8.2%) | 8521760 (8.1%) | 105572021 |      +14.0%
> > /tmp/out/custom-fp-gcc-sync/bin/clang       | 96515079 (82.8%) | 6556756 (5.6%) |       0 (0.0%) | 116524565 |      +25.8%
> > /tmp/out/custom-sframe-gcc-sync/bin/clang   | 94583903 (74.0%) | 8817628 (6.9%) | 9696263 (7.6%) | 127839309 |      +38.0%
> >
> > Note: in GCC FP builds, .text is larger due to missing optimization
> > for RBP-based frames (e.g.
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108386). Once this
> > optimization is implemented, GCC FP builds should actually have
> > smaller .text than RSP-based builds, because RBP-relative addressing
> > produces more compact encodings than RSP-relative addressing (which
> > requires an extra SIB byte).
> >
> > .sframe for sync is not noticeably smaller than that for async. This
> > is probably because
> > there are still many DW_CFA_advance_loc ops even in
> > -fno-asynchronous-unwind-tables -funwind-tables builds.
> >
>
> It is possible that this is because, in the Apple Compact Unwind Format,
> the linker optimizes compact unwind descriptors into the three-level
> paged structure, effectively de-duplicating some content.

Yes, the linker does perform deduplication and builds the paged index
structure. However, the fundamental compactness comes from the
encoding itself: each regular function is described with just 4 bytes
in the common encoding, compared to .sframe's much larger per-FDE
overhead.
The two-level lookup table optimization amplifies this advantage.
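
A simplified Python model of such a two-level lookup follows. It is
only loosely modeled on the compressed second-level pages of
__unwind_info, where each 4-byte entry packs a 24-bit function offset
with an 8-bit index into a table of common encodings; the addresses,
page contents, and encoding names here are invented for illustration.

```python
import bisect


def pack_entry(func_offset, enc_index):
    """Pack a 24-bit function offset and an 8-bit encoding index into 4 bytes."""
    assert 0 <= func_offset < (1 << 24) and 0 <= enc_index < (1 << 8)
    return (enc_index << 24) | func_offset


# Shared table: many functions point at the same encoding (deduplication).
encodings = ["rbp-frame", "frameless-16"]

# First level: base address of each page, sorted.
first_level = [0x1000, 0x5000]

# Second level: per page, entries sorted by function offset from the base.
pages = [
    [pack_entry(0x00, 0), pack_entry(0x40, 1), pack_entry(0x90, 0)],
    [pack_entry(0x00, 1)],
]


def lookup(pc):
    """Return the encoding for the function containing pc, or None."""
    p = bisect.bisect_right(first_level, pc) - 1
    if p < 0:
        return None
    offsets = [e & 0xFFFFFF for e in pages[p]]
    i = bisect.bisect_right(offsets, pc - first_level[p]) - 1
    if i < 0:
        return None
    return encodings[pages[p][i] >> 24]


for pc in (0x1050, 0x10A0, 0x5004, 0x0500):
    print(hex(pc), lookup(pc))
```

Each function costs one 4-byte entry in this model, while the encodings
themselves are stored once in the shared table.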

> >>>     (On macOS you can check the section size with
> >>> objdump --arch x86_64 -h clang and dump the unwind info with
> >>> objdump --arch x86_64 --unwind-info clang)
> >>>
> >>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
> >>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
> >>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> >>>
> >>> The compact unwind format achieves this efficiency through a two-level
> >>> page table structure. It describes common frame layouts compactly and
> >>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
> >>> to be eliminated while maintaining full functionality. For more details,
> >>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
> >>> implementation
> >>> https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
> >>>
> >>
> >> How does your vision of "linker-friendly" stack tracing/stack unwinding
> >> format reconcile with these suggested approaches ? As far as I can tell,
> >> these formats also require linker created indexes and are
> >> non-concatenable (custom handling in every linker).  Something you've
> >> had "significant concerns" about.
> >>
>
> This question is unanswered: What do you think about the
> "linker-friendliness" of the current lld/MachO implementation of the
> compact unwind format in LLVM?

The linker input and output use different section names, so a dumb
linker would work as long as the runtime accepts the concatenated
sections.

My vision for an ELF compact unwind format uses separate section names
for link-time vs. runtime representations. The compiler output format
should be concatenable, with linker index-building as an optional
optimization that improves performance but isn't mandatory for
correctness.

I'm going to add more details to
https://maskray.me/blog/2025-09-28-remarks-on-sframe


> >
> > We can distinguish between linking-time and execution-time
> > representations by using different section names.
> > The OpenVMS specification says:
> >
> >      It is useful to note that the run-time representation of unwind
> > information can vary from little more than a simple concatenation of
> > the compile-time information to a substantial rewriting of unwind
> > information by the linker. The proposal favors simple concatenation
> > while maintaining the same ordering of groups as their associated
> > code.
> >
> > The runtime library can build this index at runtime and cache it to disk.
> >
>
> This will include the dynamic linker and the stack tracer in the Linux
> kernel (the latter when stack tracing user space stacks).  Do you think
> this is feasible ?
>
> > Once the design becomes sufficiently stable, we can introduce an
> > opt-in linker option --xxxxframe-index that builds an index from
> > recognized format versions while reporting warnings for unrecognized
> > ones.
> > We need to carefully design this mechanism to be stable and robust,
> > avoiding frequent format updates.
> >
> >>   From
> >> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
> >> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
> >> Table'') is created by the linker using information in the unwind
> >> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
> >> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
> >> may use the provided unwind descriptors directly or replace them with
> >> equivalent optimized forms based on its optimization strategies."
> >>
> >> Above all, do users want a solution which requires falling back on
> >> DWARF-based processing for precise stack tracing ?
> >
> > The key distinction is that compact unwind handles the vast majority
> > of functions without DWARF—the macOS measurements show __unwind_info
> > at 0.6% of __text size with __eh_frame reduced to negligible size
> > (0x58 bytes). While SFrame also cannot handle all frames, compact
> > unwind achieves dramatic size reductions by making DWARF the exception
> > rather than requiring it alongside a supplementary format.
> >
>
> As we have tried to reason, this is a misleading comparison. The compact
> unwind tables format:
>    - needs to be extended for asynchronous stack unwinding
>    - needs to be extended for other ABI/architectures
>    - Making it concatenable / linker-friendly will also likely impose
> some negative effects on size.

The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS
proposal demonstrates that supporting asynchronous unwinding is
straightforward.

Making it linker-friendly does not impose negative effects on the
output section size.

> > The DWARF fallback provides flexibility for additional coverage when
> > needed, but nothing is lost (at least for the clang binary on macOS)
> > if DWARF fallback were disabled in a hypothetical future linux-perf
> > implementation.
> >
>
> Fair enough, that's something for linux-perf/kernel to decide.  Once the
> OpenVMS RFC is sufficiently shaped to become a viable replacement for
> .eh_frame, this question will be for the stakeholders to decide.

Agreed. My concern is that .sframe is being deployed before we've
fully explored whether a more compact and efficient alternative is
achievable.


> >>> **The AArch64 case: size matters even more**
> >>>
> >>> The size consideration becomes even more critical for AArch64, which is
> >>> heavily deployed on mobile phones.
> >>> There's an active feature request for compact unwind support in the
> >>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
> >>> This underscores the broader industry need for efficient unwind
> >>> information that doesn't duplicate data or significantly increase binary
> >>> size.
> >>>
> >>
> >> Our measurements with a dataset of about 1400 userspace artifacts
> >> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
> >> Frame HDR) ratio is:
> >>     - Average of 0.70 on AArch64.
> >>     - Average of 1.00 on x86_64.
> >>
> >> Projecting the size of what you observe for clang binary on x86_64 to
> >> conclude the size ratio on AArch64 is not very wise to do.
> >>
> >> Whether the size impact is worth the benefit: its a choice for users to
> >> make.  SFrame offers the users fast, precise stack traces with simple
> >> stack tracers.
> >
> > Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on
> > AArch64, this represents substantial memory overhead when considering:
> >
> > .eh_frame is already large and being complained about.
> > Being unable to eliminate it (needed for debugging and C++ exceptions)
> > and adding 0.70x more means significant additional overhead for users.
> >
> >>> There are at least two formats the ELF one can learn from: LLVM's
> >>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
> >>>
> >>
> >> Please, if you have any concrete suggestions (keeping the above goals in
> >> mind), you already know how/where to engage.
> >
> > I've provided concrete suggestions throughout this discussion.
> >
>
> Apologies, I should have been more precise.  And I ask because you know
> the details about both SFrame and the variants of Compact Unwind
> Descriptor formats at this point :). If you have concrete suggestions to
> improve the SFrame format for size, please let us know.

At this point, I'm not certain about specific modifications to .sframe
itself. I think we should start from scratch, drawing ideas from
compact unwind information and Windows ARM64.

The existing compact unwind information uses the following 4-byte descriptor:

  uint32_t mode_specific_encoding : 24; // vary with different modes

  uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK

  uint32_t has_lsda : 1;
  uint32_t personality_index : 2;
  uint32_t is_not_function_start : 1;
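
As a rough illustration (a sketch, not code from this thread), the fields can be extracted from a 32-bit encoding with the masks defined in LLVM's compact_unwind_encoding.h. In that header, personality_index occupies bits 28-29 and has_lsda is bit 30. The function name `decode` and the dictionary keys are made up for this example.

```python
# Architecture-independent masks, as defined in LLVM's compact_unwind_encoding.h.
UNWIND_IS_NOT_FUNCTION_START = 0x80000000
UNWIND_HAS_LSDA = 0x40000000
UNWIND_PERSONALITY_MASK = 0x30000000
# Mode field: UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK.
UNWIND_MODE_MASK = 0x0F000000
# Remaining low 24 bits; the name of this mask is hypothetical.
MODE_SPECIFIC_MASK = 0x00FFFFFF

# x86-64 mode values, shifted down to small integers for readability.
X86_64_MODES = {1: "RBP_FRAME", 2: "STACK_IMMD", 3: "STACK_IND", 4: "DWARF"}


def decode(encoding):
    """Split a 4-byte compact unwind encoding into its fields."""
    return {
        "mode": (encoding & UNWIND_MODE_MASK) >> 24,
        "mode_specific_encoding": encoding & MODE_SPECIFIC_MASK,
        "personality_index": (encoding & UNWIND_PERSONALITY_MASK) >> 28,
        "has_lsda": bool(encoding & UNWIND_HAS_LSDA),
        "is_not_function_start": bool(encoding & UNWIND_IS_NOT_FUNCTION_START),
    }


if __name__ == "__main__":
    fields = decode(0x41000000)  # RBP-based frame that has an LSDA
    print(X86_64_MODES.get(fields["mode"], "unknown"), fields)
```

The meaning of the low 24 bits depends on the mode, so a real decoder would dispatch on `mode` before interpreting them.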

We probably need a less-restricted version that accounts for different
architecture needs. The result would still be significantly smaller
than SFrame v2 and the future v3 (unless it's completely rewritten).

We should probably design an optional two-level lookup table mechanism
for additional savings (at the cost of linker friendliness).
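
To make the two-level idea concrete, here is a minimal sketch under my own assumptions: a first-level table of (first function offset, page) pairs sorted by address, and second-level pages of (function offset, encoding) pairs. The layout and the names `lookup` and `first_level` are hypothetical and do not mirror the actual __unwind_info layout.

```python
from bisect import bisect_right


def lookup(first_level, pc):
    """Return the encoding of the entry covering pc, or None if pc is below all entries.

    first_level: list of (page_start, page) sorted by page_start, where each
    page is a list of (function_start, encoding) sorted by function_start.
    An entry is assumed to cover addresses up to the next entry's start.
    """
    # Level 1: find the last page whose start is <= pc.
    page_starts = [start for start, _ in first_level]
    i = bisect_right(page_starts, pc) - 1
    if i < 0:
        return None
    page = first_level[i][1]
    # Level 2: find the last function in that page whose start is <= pc.
    func_starts = [start for start, _ in page]
    j = bisect_right(func_starts, pc) - 1
    if j < 0:
        return None
    return page[j][1]


if __name__ == "__main__":
    table = [
        (0x1000, [(0x1000, 0x01000000), (0x1040, 0x02020000)]),
        (0x2000, [(0x2000, 0x04000000)]),
    ]
    for pc in (0x0FFF, 0x1004, 0x1044, 0x2010):
        print(hex(pc), lookup(table, pc))
```

A real implementation would binary-search the on-disk arrays in place instead of building temporary lists, and second-level pages could store offsets relative to the page start to shrink each entry.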

> >>> **Path forward**
> >>>
> >>> Unless SFrame can actually replace .eh_frame (rather than supplementing
> >>> it as an accelerator for linux-perf) and demonstrate sizes smaller
> >>> than .eh_frame - matching the efficiency of existing compact unwind
> >>> approaches — I question its practical viability for userspace.
> >>> The current design appears to add overhead rather than reduce it.
> >>> This isn't to suggest we should simply adopt the existing compact unwind
> >>> format wholesale.
> >>> The x86-64 design dates back to 2009 or earlier, and there are likely
> >>> improvements we can make. However, we should aim for similar or better
> >>> efficiency gains.
> >>>
> >>> For additional context, I've documented my detailed analysis at:
> >>>
> >>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> >>> mandatory index building problems, section group compliance and garbage
> >>> collection issues, and version compatibility challenges)
> >>
> >> GC issue is a bug currently tracked and with a target milestone of 2.46.
> >>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-
> >>> offs (size analysis)
> >>>

The GC issue would not have happened at all if we had used multiple
sections and thought about ELF and linker convention :)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-06  9:20       ` Fangrui Song
@ 2025-11-06 20:42         ` Indu Bhagat
  2025-11-09  0:23           ` Fangrui Song
  0 siblings, 1 reply; 30+ messages in thread
From: Indu Bhagat @ 2025-11-06 20:42 UTC (permalink / raw)
  To: Fangrui Song; +Cc: linux-toolchains, linux-perf-users, linux-kernel

On 11/6/25 1:20 AM, Fangrui Song wrote:
> On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>>
>> On 11/5/25 12:21 AM, Fangrui Song wrote:
>>>> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote:
>>>> On 2025-10-29 11:53 p.m., Fangrui Song wrote:
>>>>> I've been following the SFrame discussion and wanted to share some
>>>>> concerns about its viability for userspace adoption, based on concrete
>>>>> measurements and comparison with existing compact unwind implementations
>>>>> in LLVM.
>>>>>
>>>>> **Size overhead concerns**
>>>>>
>>>>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
>>>>> approximately 10% larger than the combined size of .eh_frame
>>>>> and .eh_frame_hdr (8.06 MiB total).
>>>>> This is problematic because .eh_frame cannot be eliminated - it contains
>>>>> essential information for restoring callee-saved registers, LSDA, and
>>>>> personality information needed for debugging (e.g. reading local
>>>>> variables in a coredump) and C++ exception handling.
>>>>>
>>>>> This means adopting SFrame would result in carrying both formats, with a
>>>>> large net size increase.
>>>>>
>>>>> **Learning from existing compact unwind implementations**
>>>>>
>>>>> It's worth noting that LLVM has had a battle-tested compact unwind
>>>>> format in production use since 2009 with OS X 10.6, which transitioned
>>>>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
>>>>>
>>>>>      __text section: 0x4a55470 bytes
>>>>>      __unwind_info section: 0x79060 bytes (0.6% of __text)
>>>>>      __eh_frame section: 0x58 bytes
>>>>>
>>>>
>>>> I believe this is only synchronous? If yes, do you think this is a fair
>>>> measurement to compare against ?
>>>>
>>>> Does the compact unwind info scheme work well for cases of
>>>> shrink-wrapping ? How about the case of AArch64, where the ABI does not
>>>> mandate if and where frame record is created ?
>>>>
>>>> For the numbers above, does it ensure precise stack traces ?
>>>>
>>>>    From the The Apple Compact Unwinding Format document
>>>> (https://faultlore.com/blah/compact-unwinding/),
>>>> "One consequence of only having one opcode for a whole function is that
>>>> functions will generally have incorrect instructions for the function’s
>>>> prologue (where callee-saved registers are individually PUSHed onto the
>>>> stack before the rest of the stack space is allocated)."
>>>>
>>>> "Presumably this isn’t a very big deal, since there’s very few
>>>> situations where unwinding would involve a function still executing its
>>>> prologue/epilogue."
>>>>
>>>> Well, getting precise stack traces is a big deal and the users want them.
>>>
>>> **Shrink-wrapping and precise stack traces**: Yes, compact unwind
>>> handles these through an extension proposed by OpenVMS (not yet
>>> upstreamed to LLVM):
>>> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html
>>>
>>
>> Thanks for the link.
>>
>> The above questions were strictly in the context of the battle-tested
>> "The Apple Compact Unwinding Format" in production in the lld/MachO
>> implementation, not for the proposed OpenVMS extensions.
>>
>> Is it possible to get answers to those questions with that context in place?
>>
>> If shrink-wrapping and precise stack traces isnt supported without the
>> OpenVMS extension (that is not yet implemented), arent we comparing
>> apples vs pears here ?
> 
> You're right to ask for clarification.
> The extended compact unwind information works with shrink wrapping.
> 

Sorry, again, not asking about the "extended".

If I may: So, this is a convoluted way of saying the current 
implementation of the Apple Compact Unwind Info (lld/MachO, which was 
used to get the data) does not support shrink wrapping.  The 
documentation of the format I am referring to is 
https://faultlore.com/blah/compact-unwinding/.

That said, the point I have been driving at:

The Apple Compact Unwind format 
(https://faultlore.com/blah/compact-unwinding/) does not support shrink 
wrapping, nor is it meant for asynchronous stack walking.  Comparing that 
data to what SFrame gives is comparing apples to pears, which is misleading.

(The reason I asked the question to begin with is because I wasn't sure 
if the documentation is out of date).

> For context, a FDE in .eh_frame costs at least 20 bytes (often 30+),
> plus its associated .eh_frame_hdr entry costs 8 bytes.
> Even a larger compact unwind descriptor at 8 bytes yields significant
> savings compared to .eh_frame. Tripling that to 24 bytes is still a
> substantial win.
> 
> Additionally, very few functions benefit from shrink wrapping
> optimization. When needed, we require multiple unwind description
> records (typically 3).
> 
>>> Technical details of the extension:
>>>
>>> - A single unwind group describes a (prologue_part1, prologue_part2,
>>> body, epilogue) tuple.
>>> - The prologue is conceptually split into two parts: the first part
>>> extends up to and including the instruction that decreases RSP; the
>>> second part extends to a point after the last preserved register is
>>> saved but before any preserved register is modified (this location is
>>> not unique, providing flexibility).
>>>     + When unwinding in the prologue, the RSP register value can be
>>> inferred from the PC and the set of saved registers.
>>> - Since register restoration is idempotent (restoring preserved
>>> registers multiple times during unwinding causes no harm), there is no
>>> need to describe `pop $reg` sequences. The unwind group needs just one
>>> bit to describe whether the 1-byte `ret` instruction is present.
>>
>> Is this true for the case of asynchronous stack tracing too ?
> 
> Yes. I believe it means the epilogue mirrors the prologue. Since we
> know which registers were saved in the prologue, we can infer the pop
> instructions in the epilogue and compute the SP offset when unwinding
> in the middle of an epilogue.
> 

This is not asynchronous then.
This meddles with the core business of an optimizing compiler which may 
want to organize epilogue/prologue differently.

>>> - The `length` field in the compact unwind group descriptor is
>>> repurposed to describe the prologue's two parts.
>>> - By composing multiple unwind groups, potentially with zero-sized
>>> prologues or omitting `ret` instructions in epilogues, it can describe
>>> functions with shrink wrapping or tail duplication optimization.
>>> - Null frame groups (with no prologue or epilogue) are the default and
>>> can describe trampolines and PLT stubs.
>>
>> PLT stubs may use stack (push to stack). As per the document "A null
>> frame (MODE = 8) is the simplest possible frame, with no allocated stack
>> of either kind (hence no saved registers)".  So null frame can be used
>> for PLT only if the functions invoking the PLT stub were using an
>> RBP-based frame.  Isnt it ?
>> BTW, but both EH Frame and SFrame have specific, condensed
>> representation for metadata for PLT entries.
> 
> A profiler can trivially retrieve the return address using the default
> rule: if a code region is not covered by metadata, assume the return
> address is available at *rsp (x86-64) or in the link register (most
> other architectures).
> 
> This ld-generated unwind info feature is largely obsolete nowadays due
> to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW). PLT entries
> behave as functions without a prologue, so a profiler can trivially
> retrieve the return address using the default unwinding rule.
> 
>>>
>>
>> Anyway, thanks for the summary.
>>
>> I see that OpenVMS extension for asynchronous compact unwind descriptors
>> is an RFC state ATM.  But few important observations and questions:
>>
>>    - As noted in the recently revived discussion,
>> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471,
>> there is going to be a *non-negligible* size overhead as soon as you
>> move towards a specification for asynchronous (vs the current
>> specification that caters to synchronous only).  Now add to it, the
>> quirks of each architecture/ABI :). Any comments ?
> 
> As mentioned, even a larger compact unwind descriptor at 8 bytes
> yields significant savings compared to .eh_frame, and is also
> substantially smaller than SFrame.
> 
>>    - From the document: "Use of any preserved register must be delayed
>> until all of the preserved registers have been saved."
>>      Q: Does this work well with optimizing compilers ? Is this an ABI
>> change being asked for multiple architectures ?
> 
> I think this is about support for callee-saved registers, a feature
> SFrame doesn't have.
> 

SFrame doesn't have it, because it doesn't need to carry this information 
for stack tracing. The OpenVMS RFC effort, OTOH, is about subsuming 
.eh_frame and being _the_ stack tracing/stack unwinding format.  The latter 
*has to* work this out.

> I need to think about the details, but this thread is probably not the
> best place to discuss them.
> 

Absolutely, I agree, this is not the best place or time to pin down the 
details of an RFC at all.  But I cannot let an unfair argument just fly by.

The point I am driving at with these questions around the OpenVMS 
asynchronous info RFC:
- 'OpenVMS extensions for asynchronous stack unwinding' is an RFC which 
still needs work.
- It remains to be seen how this proposal manages the fine line of 
space-efficiency while trying to be the go-to format for asynchronous 
stack unwinding together with fast, precise and low-overhead stack tracing.
- SFrame is for stack tracing only.  Subsuming .eh_frame is not in the 
plans.

>>    - From the document: "It appears technically feasible for a null frame
>> function to have a personality routine. However, the utility of such a
>> capability seems too meager to justify allowing this. We propose to not
>> support this." and "If the first attempt to lookup an unwind group for
>> an exception address fails, then it is (tentatively) assumed to have
>> occurred within a null frame function or in a part of a function
>> that is adequately described by a null frame. The presumed return
>> address is (virtually or actually) popped from the top of stack and
>> looked up. This second attempted lookup must succeed, in which case
>> processing continues normally. A failure is a fatal error."
>>     Q: Is this a problem, especially because the goal is to evolve the
>> OpenVMS RFC proposal is subsume .eh_frame ?
> 
> I think this just hard-encodes the default rule, similar to what
> SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed
> offset from the CFA when entering a new function."
> 
> While I haven't given this much thought yet, I don't think this
> introduces problems that SFrame doesn't have.
> 

Correction: Not true. This is configurable in SFrame. s390x needs RA 
tracking (not fixed offset) and is supported in SFrame.

>> Are there people actively working towards bringing this to fruition?
>>
>>> Now, to compare this against SFrame's space efficiency for synchronous
>>> unwinding, I've built llvm-mc, opt, and clang with
>>> -fno-asynchronous-unwind-tables -funwind-tables across multiple build
>>> configurations (clang vs gcc, frame pointer vs sframe).
>>> [snip]
>>>
>>> .sframe for sync is not noticeably smaller than that for async. This
>>> is probably because
>>> there are still many DW_CFA_advance_loc ops even in
>>> -fno-asynchronous-unwind-tables -funwind-tables builds.
>>>
>>
>> Possible that its because in the Apple Compact Unwind Format, the linker
>> optimizes compact unwind descriptors into the three-level paged
>> structure, effectively de-duplicating some content.
> 
> Yes, the linker does perform deduplication and builds the paged index
> structure. However, the fundamental compactness comes from the
> encoding itself: each regular function is described with just 4 bytes
> in the common encoding, compared to .sframe's much larger per-FDE
> overhead.
> The two-level lookup table optimization amplifies this advantage.
> 
>>>>>      (On macOS you can check the section size with objdump --arch x86_64 -
>>>>> h clang and dump the unwind info with  objdump --arch x86_64 --unwind-
>>>>> info clang)
>>>>>
>>>>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
>>>>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
>>>>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
>>>>>
>>>>> The compact unwind format achieves this efficiency through a two-level
>>>>> page table structure. It describes common frame layouts compactly and
>>>>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
>>>>> to be eliminated while maintaining full functionality. For more details,
>>>>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
>>>>> implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/
>>>>> UnwindInfoSection.cpp
>>>>>
>>>>
>>>> How does your vision of "linker-friendly" stack tracing/stack unwinding
>>>> format reconcile with these suggested approaches ? As far as I can tell,
>>>> these formats also require linker created indexes and are
>>>> non-concatenable (custom handling in every linker).  Something you've
>>>> had "significant concerns" about.
>>>>
>>
>> This question is unanswered: What do you think about
>> "linker-friendliness" of the current implementation of the lld/MachO
>> implementation of the compact unwind format in LLVM ?
> 
> The linker input and output use different section names, so a dumb
> linker would work as long as the runtime accepts the concatenated
> sections.
> 
> My vision for an ELF compact unwind format uses separate section names
> for link-time vs. runtime representations. The compiler output format
> should be concatenable, with linker index-building as an optional
> optimization that improves performance but isn't mandatory for
> correctness.
> 
> I'll going to add more details
> https://maskray.me/blog/2025-09-28-remarks-on-sframe
> 
> 
>>>
>>> We can distinguish between linking-time and execution-time
>>> representations by using different section names.
>>> The OpenVMS specification says:
>>>
>>>       It is useful to note that the run-time representation of unwind
>>> information can vary from little more than a simple concatenation of
>>> the compile-time information to a substantial rewriting of unwind
>>> information by the linker. The proposal favors simple concatenation
>>> while maintaining the same ordering of groups as their associated
>>> code.
>>>
>>> The runtime library can build this index at runtime and cache it to disk.
>>>
>>
>> This will include the dynamic linker and the stack tracer in the Linux
>> kernel (the latter when stack tracing user space stacks).  Do you think
>> this is feasible ?
>>
>>> Once the design becomes sufficiently stable, we can introduce an
>>> opt-in linker option --xxxxframe-index that builds an index from
>>> recognized format versions while reporting warnings for unrecognized
>>> ones. We need to carefully design this mechanism to be stable and robust,
>>> avoiding frequent format updates.
>>>
>>>>    From
>>>> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
>>>> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
>>>> Table'') is created by the linker using information in the unwind
>>>> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
>>>> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
>>>> may use the provided unwind descriptors directly or replace them with
>>>> equivalent optimized forms based on its optimization strategies."
>>>>
>>>> Above all, do users want a solution which requires falling back on
>>>> DWARF-based processing for precise stack tracing ?
>>>
>>> The key distinction is that compact unwind handles the vast majority
>>> of functions without DWARF—the macOS measurements show __unwind_info
>>> at 0.6% of __text size with __eh_frame reduced to negligible size
>>> (0x58 bytes). While SFrame also cannot handle all frames, compact
>>> unwind achieves dramatic size reductions by making DWARF the exception
>>> rather than requiring it alongside a supplementary format.
>>>
>>
>> As we have tried to reason, this is a misleading comparison. The compact
>> unwind tables format:
>>     - needs to be extended for asynchronous stack unwinding
>>     - needs to be extended for other ABI/architectures
>>     - Making it concatenable / linker-friendly will also likely impose
>> some negative effects on size.
> 
> The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS
> proposal demonstrates that supporting asynchronous unwinding is
> straightforward.
> 
> Making it linker-friendly does not impose negative effects on the
> output section size.
> 

OK, well, I agree to disagree :)

Looking forward to some movement on the OpenVMS asynchronous unwind RFC 
to see resolution to some of the issues, and some data to back that claim.

>>> The DWARF fallback provides flexibility for additional coverage when
>>> needed, but nothing is lost (at least for the clang binary on macOS)
>>> if DWARF fallback were disabled in a hypothetical future linux-perf
>>> implementation.
>>>
>>
>> Fair enough, thats something for linux-perf/kernel to decide.  Once the
>> OpenVMS RFC is sufficiently shaped to become a viable replacement for
>> .eh_frame, this question will be for the stakeholders to decide.
> 
> Agreed. My concern is that .sframe is being deployed before we've
> fully explored whether a more compact and efficient alternative is
> achievable.
> 
> 
>>>>> **The AArch64 case: size matters even more**
>>>>>
>>>>> The size consideration becomes even more critical for AArch64, which is
>>>>> heavily deployed on mobile phones.
>>>>> There's an active feature request for compact unwind support in the
>>>>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
>>>>> This underscores the broader industry need for efficient unwind
>>>>> information that doesn't duplicate data or significantly increase binary
>>>>> size.
>>>>>
>>>>
>>>> Our measurements with a dataset of about 1400 userspace artifacts
>>>> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
>>>> Frame HDR) ratio is:
>>>>      - Average of 0.70 on AArch64.
>>>>      - Average of 1.00 on x86_64.
>>>>
>>>> Projecting the size of what you observe for clang binary on x86_64 to
>>>> conclude the size ratio on AArch64 is not very wise to do.
>>>>
>>>> Whether the size impact is worth the benefit: its a choice for users to
>>>> make.  SFrame offers the users fast, precise stack traces with simple
>>>> stack tracers.
>>>
>>> Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on
>>> AArch64, this represents substantial memory overhead when considering:
>>>
>>> .eh_frame is already large and being complained about.
>>> Being unable to eliminate it (needed for debugging and C++ exceptions)
>>> and adding 0.70x more means significant additional overhead for users.
>>>
>>>>> There are at least two formats the ELF one can learn from: LLVM's
>>>>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
>>>>>
>>>>
>>>> Please, if you have any concrete suggestions (keeping the above goals in
>>>> mind), you already know how/where to engage.
>>>
>>> I've provided concrete suggestions throughout this discussion.
>>>
>>
>> Apologies, I should have been more precise.  And I ask because you know
>> the details about both SFrame and the variants of Compact Unwind
>> Descriptor formats at this point :). If you have concrete suggestions to
>> improve the SFrame format for size, please let us know.
> 
> At this point, I'm not certain about specific modifications to .sframe
> itself. I think we should start from scratch, drawing ideas from
> compact unwind information and Windows ARM64.
> 
> The existing compact unwind information uses the following 4-byte descriptor:
> 
>    uint32_t mode_specific_encoding : 24; // vary with different modes
> 
>    uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK
> 
>    uint32_t has_lsda : 1;
>    uint32_t personality_index : 2;
>    uint32_t is_not_function_start : 1;
> 

Thanks.

SFrame is not for stack unwinding.  Subsuming .eh_frame is a topic for 
another day.  SFrame does not intend to go that route.

> We probably need a less-restricted version and account for different
> architecture needs. The result would still be significantly smaller
> than SFrame v2 and the future v3 (unless it's completely rewritten).
> 
> We should probably design an optional two-level lookup table mechanism
> for additional savings (at the cost of linker friendliness).
> 
>>>>> **Path forward**
>>>>>
>>>>> Unless SFrame can actually replace .eh_frame (rather than supplementing
>>>>> it as an accelerator for linux-perf) and demonstrate sizes smaller
>>>>> than .eh_frame - matching the efficiency of existing compact unwind
>>>>> approaches — I question its practical viability for userspace.
>>>>> The current design appears to add overhead rather than reduce it.
>>>>> This isn't to suggest we should simply adopt the existing compact unwind
>>>>> format wholesale.
>>>>> The x86-64 design dates back to 2009 or earlier, and there are likely
>>>>> improvements we can make. However, we should aim for similar or better
>>>>> efficiency gains.
>>>>>
>>>>> For additional context, I've documented my detailed analysis at:
>>>>>
>>>>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
>>>>> mandatory index building problems, section group compliance and garbage
>>>>> collection issues, and version compatibility challenges)
>>>>
>>>> GC issue is a bug currently tracked and with a target milestone of 2.46.
>>>>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-
>>>>> offs (size analysis)
>>>>>
> 
> The GC issue would not have happened at all if we had used multiple
> sections and thought about ELF and linker convention :)

Thanks for engaging.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-06  7:51       ` Florian Weimer
@ 2025-11-06 21:02         ` Indu Bhagat
  0 siblings, 0 replies; 30+ messages in thread
From: Indu Bhagat @ 2025-11-06 21:02 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On 11/5/25 11:51 PM, Florian Weimer wrote:
> * Indu Bhagat:
> 
>> PLT stubs may use stack (push to stack). As per the document "A null
>> frame (MODE = 8) is the simplest possible frame, with no allocated
>> stack of either kind (hence no saved registers)".  So null frame can
>> be used for PLT only if the functions invoking the PLT stub were using
>> an RBP-based frame.  Isnt it ?
> 
> I think I said this before, but I don't think new toolchain features
> need to support lazy binding.  Without lazy bindings, the PLT stubs do
> not change the stack pointer or frame pointer and just make a tail call.
> 
> Do you see a need for continued support of lazy binding?
> 

(Yes, you did mention this before in another thread on Binutils.)

My thinking has been: some linker emulations default to lazy (I guess 
the reason is that changing the default is difficult).  So, users may end 
up continuing with lazy bindings unknowingly ?

But I guess not designing new toolchain features to support lazy binding 
seems reasonable.

Thanks





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Concerns about SFrame viability for userspace stack walking
  2025-11-06 20:42         ` Indu Bhagat
@ 2025-11-09  0:23           ` Fangrui Song
  0 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-11-09  0:23 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Fangrui Song, linux-toolchains, linux-perf-users, linux-kernel

On Thu, Nov 6, 2025 at 12:42 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
>
> On 11/6/25 1:20 AM, Fangrui Song wrote:
> > On Wed, Nov 5, 2025 at 4:45 PM Indu Bhagat <indu.bhagat@oracle.com> wrote:
> >>
> >> On 11/5/25 12:21 AM, Fangrui Song wrote:
> >>>> On Tue, Nov 4, 2025 at 1:21 AM Indu <indu.bhagat@oracle.com> wrote:
> >>>> On 2025-10-29 11:53 p.m., Fangrui Song wrote:
> >>>>> I've been following the SFrame discussion and wanted to share some
> >>>>> concerns about its viability for userspace adoption, based on concrete
> >>>>> measurements and comparison with existing compact unwind implementations
> >>>>> in LLVM.
> >>>>>
> >>>>> **Size overhead concerns**
> >>>>>
> >>>>> Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is
> >>>>> approximately 10% larger than the combined size of .eh_frame
> >>>>> and .eh_frame_hdr (8.06 MiB total).
> >>>>> This is problematic because .eh_frame cannot be eliminated - it contains
> >>>>> essential information for restoring callee-saved registers, LSDA, and
> >>>>> personality information needed for debugging (e.g. reading local
> >>>>> variables in a coredump) and C++ exception handling.
> >>>>>
> >>>>> This means adopting SFrame would result in carrying both formats, with a
> >>>>> large net size increase.
> >>>>>
> >>>>> **Learning from existing compact unwind implementations**
> >>>>>
> >>>>> It's worth noting that LLVM has had a battle-tested compact unwind
> >>>>> format in production use since 2009 with OS X 10.6, which transitioned
> >>>>> to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
> >>>>>
> >>>>>      __text section: 0x4a55470 bytes
> >>>>>      __unwind_info section: 0x79060 bytes (0.6% of __text)
> >>>>>      __eh_frame section: 0x58 bytes
> >>>>>
> >>>>
> >>>> I believe this is only synchronous? If yes, do you think this is a fair
> >>>> measurement to compare against ?
> >>>>
> >>>> Does the compact unwind info scheme work well for cases of
> >>>> shrink-wrapping ? How about the case of AArch64, where the ABI does not
> >>>> mandate if and where frame record is created ?
> >>>>
> >>>> For the numbers above, does it ensure precise stack traces ?
> >>>>
> >>>>    From the The Apple Compact Unwinding Format document
> >>>> (https://faultlore.com/blah/compact-unwinding/),
> >>>> "One consequence of only having one opcode for a whole function is that
> >>>> functions will generally have incorrect instructions for the function’s
> >>>> prologue (where callee-saved registers are individually PUSHed onto the
> >>>> stack before the rest of the stack space is allocated)."
> >>>>
> >>>> "Presumably this isn’t a very big deal, since there’s very few
> >>>> situations where unwinding would involve a function still executing its
> >>>> prologue/epilogue."
> >>>>
> >>>> Well, getting precise stack traces is a big deal and the users want them.
> >>>
> >>> **Shrink-wrapping and precise stack traces**: Yes, compact unwind
> >>> handles these through an extension proposed by OpenVMS (not yet
> >>> upstreamed to LLVM):
> >>> https://lists.llvm.org/pipermail/llvm-dev/2018-January/120741.html
> >>>
> >>
> >> Thanks for the link.
> >>
> >> The above questions were strictly in the context of the battle-tested
> >> "The Apple Compact Unwinding Format" in production in the lld/MachO
> >> implementation, not for the proposed OpenVMS extensions.
> >>
> >> Is it possible to get answers to those questions with that context in place?
> >>
> >> If shrink-wrapping and precise stack traces isnt supported without the
> >> OpenVMS extension (that is not yet implemented), arent we comparing
> >> apples vs pears here ?
> >
> > You're right to ask for clarification.
> > The extended compact unwind information works with shrink wrapping.
> >
>
> Sorry, again, not asking about the "extended".
>
> If I may: So, this is a convoluted way of saying the current
> implementation of the Apple Compact Unwind Info (lld/MachO, which was
> used to get the data) does not support shrink wrapping.  The
> documentation of the format I am referring to
> (https://faultlore.com/blah/compact-unwinding/).
>
> That said, the point I have been driving to:
>
> The Apple Compact Unwind format
> (https://faultlore.com/blah/compact-unwinding/) does not support shrink
> wrapping and neither is for asynchronous stack walking.  Comparing that
> data to what SFrame gives is comparing apples to pears.  Misleading.
>
> (The reason I asked the question to begin with is because I wasn't sure
> if the documentation is out of date).

The original compact unwind information implementation was designed in
2009, before shrink wrapping was implemented in LLVM in 2015. It is
definitely not fully asynchronous as it lacks information about the
epilogue. When unwinding in the middle of the prologue, one can recover
partial information leveraging the prologue codegen pattern, probably
good enough to recover SP in the absence of shrink wrapping.

While there are limitations, that does not mean we cannot obtain useful
data from it.

In an x86-64 build of clang-21, there is a single CIE and 141845 FDEs.
The average size of an FDE is: (0x733348 - 0x18) / 141845 ~= 53.225
(0x18 is the first FDE offset in llvm-dwarfdump -eh-frame output).

Including the .eh_frame_hdr entry, the per-function size is around 53.225+8 = 61.225.

The .sframe V2 per-function size is 0x820820 / 141845 ~= 60.078.
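
The per-function figures follow directly from the section sizes and the FDE count. As a quick sanity check, here is a minimal Python sketch that evaluates the expressions as written in this message. The variable names are mine; only the numbers come from the measurements above, so substitute your own section sizes to repeat the comparison on another binary.

```python
# Per-function unwind metadata cost, using the section sizes quoted above.
# Variable names are illustrative; only the numbers come from the message.
NUM_FDES = 141845          # FDEs in the x86-64 clang-21 build
EH_FRAME_END = 0x733348    # end offset of the .eh_frame contents
FIRST_FDE_OFFSET = 0x18    # first FDE offset (skips the single CIE)
EH_FRAME_HDR_ENTRY = 8     # bytes per .eh_frame_hdr table entry
SFRAME_SIZE = 0x820820     # .sframe V2 section size

# Average FDE size, then add the matching .eh_frame_hdr entry.
fde_avg = (EH_FRAME_END - FIRST_FDE_OFFSET) / NUM_FDES
eh_per_function = fde_avg + EH_FRAME_HDR_ENTRY

# SFrame cost per function, using the same function count.
sframe_per_function = SFRAME_SIZE / NUM_FDES

print(f"average FDE size:        {fde_avg:.3f} bytes")
print(f"EH per function:         {eh_per_function:.3f} bytes")
print(f".sframe V2 per function: {sframe_per_function:.3f} bytes")
```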

On LLVM Discourse we are discussing the next generation of compact
unwind information, which will support at least asynchronous stack
tracing (the SFrame feature subset) and synchronous C++ exceptions.
We aim to provide a per-entry size of 12 bytes.
The average number of entries per function is likely between 1 and 2,
making the scheme very size-efficient even without utilizing page
table deduplication.

> > For context, a FDE in .eh_frame costs at least 20 bytes (often 30+),
> > plus its associated .eh_frame_hdr entry costs 8 bytes.
> > Even a larger compact unwind descriptor at 8 bytes yields significant
> > savings compared to .eh_frame. Tripling that to 24 bytes is still a
> > substantial win.
> >
> > Additionally, very few functions benefit from shrink wrapping
> > optimization. When needed, we require multiple unwind description
> > records (typically 3).
> >
> >>> Technical details of the extension:
> >>>
> >>> - A single unwind group describes a (prologue_part1, prologue_part2,
> >>> body, epilogue) tuple.
> >>> - The prologue is conceptually split into two parts: the first part
> >>> extends up to and including the instruction that decreases RSP; the
> >>> second part extends to a point after the last preserved register is
> >>> saved but before any preserved register is modified (this location is
> >>> not unique, providing flexibility).
> >>>     + When unwinding in the prologue, the RSP register value can be
> >>> inferred from the PC and the set of saved registers.
> >>> - Since register restoration is idempotent (restoring preserved
> >>> registers multiple times during unwinding causes no harm), there is no
> >>> need to describe `pop $reg` sequences. The unwind group needs just one
> >>> bit to describe whether the 1-byte `ret` instruction is present.
> >>
> >> Is this true for the case of asynchronous stack tracing too ?
> >
> > Yes. I believe it means the epilogue mirrors the prologue. Since we
> > know which registers were saved in the prologue, we can infer the pop
> > instructions in the epilogue and compute the SP offset when unwinding
> > in the middle of an epilogue.
> >
>
> This is not asynchronous then.
> This meddles with the core business of an optimizing compiler which may
> want to organize epilogue/prologue differently.

Asynchronous as far as the compiler-generated patterns are concerned.
Compilers do exhibit the patterns and we should utilize them, aiming
for a compact format.
We are trying to lift the restriction as much as possible when
designing the new format.

> >>> - The `length` field in the compact unwind group descriptor is
> >>> repurposed to describe the prologue's two parts.
> >>> - By composing multiple unwind groups, potentially with zero-sized
> >>> prologues or omitting `ret` instructions in epilogues, it can describe
> >>> functions with shrink wrapping or tail duplication optimization.
> >>> - Null frame groups (with no prologue or epilogue) are the default and
> >>> can describe trampolines and PLT stubs.
> >>
> >> PLT stubs may use stack (push to stack). As per the document "A null
> >> frame (MODE = 8) is the simplest possible frame, with no allocated stack
> >> of either kind (hence no saved registers)".  So null frame can be used
> >> for PLT only if the functions invoking the PLT stub were using an
> >> RBP-based frame.  Isnt it ?
> >> BTW, but both EH Frame and SFrame have specific, condensed
> >> representation for metadata for PLT entries.
> >
> > A profiler can trivially retrieve the return address using the default
> > rule: if a code region is not covered by metadata, assume the return
> > address is available at *rsp (x86-64) or in the link register (most
> > other architectures).
> >
> > This ld-generated unwind info feature is largely obsolete nowadays due
> > to the prevailing use of -Wl,-z,relro,-z,now (BIND_NOW). PLT entries
> > behave as functions without a prologue, so a profiler can trivially
> > retrieve the return address using the default unwinding rule.
> >
> >>>
> >>
> >> Anyway, thanks for the summary.
> >>
> >> I see that OpenVMS extension for asynchronous compact unwind descriptors
> >> is an RFC state ATM.  But few important observations and questions:
> >>
> >>    - As noted in the recently revived discussion,
> >> https://discourse.llvm.org/t/rfc-improving-compact-x86-64-compact-unwind-descriptors/47471,
> >> there is going to be a *non-negligible* size overhead as soon as you
> >> move towards a specification for asynchronous (vs the current
> >> specification that caters to synchronous only).  Now add to it, the
> >> quirks of each architecture/ABI :). Any comments ?
> >
> > As mentioned, even a larger compact unwind descriptor at 8 bytes
> > yields significant savings compared to .eh_frame, and is also
> > substantially smaller than SFrame.
> >
> >>    - From the document: "Use of any preserved register must be delayed
> >> until all of the preserved registers have been saved."
> >>      Q: Does this work well with optimizing compilers ? Is this an ABI
> >> change being asked for multiple architectures ?
> >
> > I think this is about support for callee-saved registers, a feature
> > SFrame doesn't have.
> >
>
> SFrame doesn't have it, because it doesnt need to carry this information
> for stack tracing. OpenVMS RFC effort, OTOH, is about subsuming
> .eh_frame and be _the_ stack tracing/stack unwinding format.  The latter
> *has to* work this out.

This stance puts SFrame in a very narrow niche.
Per-function unwind info of 120 bytes (EH+SFrame: 60+60) far exceeds the size
the next-generation compact unwind information aims to achieve (likely
<24 bytes even without using a page table).

I believe the potential of the next-generation compact unwind
information is clear. For this reason, I urge performance maintainers
not to rush the integration of sframe v3 support.
If these architectural design issues of SFrame aren't resolved
beforehand, we risk launching a format that very few people will
actually use.


> > I need to think about the details, but this thread is probably not the
> > best place to discuss them.
> >
>
> Absolutely, I agree, not the best place or time to pin down the details
> of an RFC at all.  But cannot let an unfair argument just fly by.
>
> The point I am driving to with these questions around the OpenVMS
> asynchronous info RFC:
> - 'OpenVMS extensions for asynchronous stack unwinding' is an RFC which
> still needs work.
> - It remains to be seen how this proposal manages the fine line of
> space-efficiency while trying to be the goto format for asynchronous
> stack unwinding together with fast, precise and low-overhead stack tracing.
> - SFrame is for stack tracing only.  Subsuming .eh_frame is not in the
> plans.
>
> >>    - From the document: "It appears technically feasible for a null frame
> >> function to have a personality routine. However, the utility of such a
> >> capability seems too meager to justify allowing this. We propose to not
> >> support this." and "If the first attempt to lookup an unwind group for
> >> an exception address fails, then it is (tentatively) assumed to have
> >> occurred within a null frame function or in a part of a function
> >> that is adequately described by a null frame. The presumed return
> >> address is (virtually or actually) popped from the top of stack and
> >> looked up. This second attempted lookup must succeed, in which case
> >> processing continues normally. A failure is a fatal error."
> >>     Q: Is this a problem, especially because the goal is to evolve the
> >> OpenVMS RFC proposal to subsume .eh_frame ?
> >
> > I think this just hard-encodes the default rule, similar to what
> > SFrame does: "AMD64 ABI mandates that the RA be saved at a fixed
> > offset from the CFA when entering a new function."
> >
> > While I haven't given this much thought yet, I don't think this
> > introduces problems that SFrame doesn't have.
> >
>
> Correction: Not true. This is configurable in SFrame. s390x needs RA
> tracking (not fixed offset) and is supported in SFrame.

A hypothetical s390x implementation of the compact unwind information
can reserve 1 bit (in the mode-specific-encoding, or "opcodes" in
https://faultlore.com/blah/compact-unwinding/ ) to indicate whether
the RA is saved in a stack slot or a register.
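
To make the idea concrete, here is a minimal Python sketch of how a 32-bit compact unwind encoding splits into fields, plus one extra flag for the return-address location. The masks for the existing fields are the values I recall from LLVM's compact_unwind_encoding.h. The s390x flag is purely hypothetical: no such encoding exists, and both the bit position and the name `HYPOTHETICAL_S390X_RA_IN_REGISTER` are made up for illustration.

```python
# Field masks of the existing 32-bit Mach-O compact unwind encoding.
UNWIND_IS_NOT_FUNCTION_START = 0x80000000
UNWIND_HAS_LSDA = 0x40000000
UNWIND_PERSONALITY_MASK = 0x30000000
UNWIND_MODE_MASK = 0x0F000000      # same mask on x86-64 and arm64
MODE_SPECIFIC_MASK = 0x00FFFFFF    # 24 bits interpreted per mode

# HYPOTHETICAL: an s390x port could reserve one mode-specific bit to say
# whether the RA lives in a register or a stack slot. This bit is invented.
HYPOTHETICAL_S390X_RA_IN_REGISTER = 0x00800000


def decode(encoding: int) -> dict:
    """Split a 32-bit compact unwind encoding into its common fields."""
    return {
        "is_not_function_start": bool(encoding & UNWIND_IS_NOT_FUNCTION_START),
        "has_lsda": bool(encoding & UNWIND_HAS_LSDA),
        "personality_index": (encoding & UNWIND_PERSONALITY_MASK) >> 28,
        "mode": (encoding & UNWIND_MODE_MASK) >> 24,
        "mode_specific": encoding & MODE_SPECIFIC_MASK,
    }


def ra_location(encoding: int) -> str:
    """Read the hypothetical RA-location bit from the mode-specific bits."""
    if encoding & HYPOTHETICAL_S390X_RA_IN_REGISTER:
        return "register"
    return "stack slot"


print(decode(0x41000123))
print(ra_location(0x01800000))
```

The point of the sketch is only that RA tracking costs one bit of the 24 mode-specific bits; it says nothing about how a real s390x encoding would lay out the rest.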

> >> Are there people actively working towards bringing this to fruition?
> >>
> >>> Now, to compare this against SFrame's space efficiency for synchronous
> >>> unwinding, I've built llvm-mc, opt, and clang with
> >>> -fno-asynchronous-unwind-tables -funwind-tables across multiple build
> >>> configurations (clang vs gcc, frame pointer vs sframe).
> >>> [snip]
> >>>
> >>> .sframe for sync is not noticeably smaller than that for async. This
> >>> is probably because
> >>> there are still many DW_CFA_advance_loc ops even in
> >>> -fno-asynchronous-unwind-tables -funwind-tables builds.
> >>>
> >>
> >> Possible that its because in the Apple Compact Unwind Format, the linker
> >> optimizes compact unwind descriptors into the three-level paged
> >> structure, effectively de-duplicating some content.
> >
> > Yes, the linker does perform deduplication and builds the paged index
> > structure. However, the fundamental compactness comes from the
> > encoding itself: each regular function is described with just 4 bytes
> > in the common encoding, compared to .sframe's much larger per-FDE
> > overhead.
> > The two-level lookup table optimization amplifies this advantage.
> >
> >>>>>      (On macOS you can check the section size with
> >>>>> objdump --arch x86_64 -h clang and dump the unwind info with
> >>>>> objdump --arch x86_64 --unwind-info clang)
> >>>>>
> >>>>> OpenVMS's x86-64 port, which is ELF-based, also adopted this format as
> >>>>> documented in their "VSI OpenVMS Calling Standard" and their 2018 post:
> >>>>> https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
> >>>>>
> >>>>> The compact unwind format achieves this efficiency through a two-level
> >>>>> page table structure. It describes common frame layouts compactly and
> >>>>> falls back to DWARF only when necessary, allowing most DWARF CFI entries
> >>>>> to be eliminated while maintaining full functionality. For more details,
> >>>>> see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO
> >>>>> implementation
> >>>>> https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
> >>>>>
> >>>>
> >>>> How does your vision of "linker-friendly" stack tracing/stack unwinding
> >>>> format reconcile with these suggested approaches ? As far as I can tell,
> >>>> these formats also require linker created indexes and are
> >>>> non-concatenable (custom handling in every linker).  Something you've
> >>>> had "significant concerns" about.
> >>>>
> >>
> >> This question is unanswered: What do you think about
> >> "linker-friendliness" of the current implementation of the lld/MachO
> >> implementation of the compact unwind format in LLVM ?
> >
> > The linker input and output use different section names, so a dumb
> > linker would work as long as the runtime accepts the concatenated
> > sections.
> >
> > My vision for an ELF compact unwind format uses separate section names
> > for link-time vs. runtime representations. The compiler output format
> > should be concatenable, with linker index-building as an optional
> > optimization that improves performance but isn't mandatory for
> > correctness.
> >
> > I'm going to add more details at
> > https://maskray.me/blog/2025-09-28-remarks-on-sframe
> >
> >
> >>>
> >>> We can distinguish between linking-time and execution-time
> >>> representations by using different section names.
> >>> The OpenVMS specification says:
> >>>
> >>>       It is useful to note that the run-time representation of unwind
> >>> information can vary from little more than a simple concatenation of
> >>> the compile-time information to a substantial rewriting of unwind
> >>> information by the linker. The proposal favors simple concatenation
> >>> while maintaining the same ordering of groups as their associated
> >>> code.
> >>>
> >>> The runtime library can build this index at runtime and cache it to disk.
> >>>
> >>
> >> This will include the dynamic linker and the stack tracer in the Linux
> >> kernel (the latter when stack tracing user space stacks).  Do you think
> >> this is feasible ?
> >>
> >>> Once the design becomes sufficiently stable, we can introduce an
> >>> opt-in linker option --xxxxframe-index that builds an index from
> >>> recognized format versions while reporting warnings for unrecognized
> >>> ones. We need to carefully design this mechanism to be stable and robust,
> >>> avoiding frequent format updates.
> >>>
> >>>>    From
> >>>> https://docs.vmssoftware.com/vsi-openvms-calling-standard/#STACK_UNWIND_EXCEPTION_X86_64:
> >>>> "The unwind dispatch table (see Section B.3.1, ''Unwind Dispatch
> >>>> Table'') is created by the linker using information in the unwind
> >>>> descriptors (see Section B.3.2, ''DWARF Unwind Descriptors'' and Section
> >>>> B.3.3, ''Compact Unwind Description'') provided by compilers. The linker
> >>>> may use the provided unwind descriptors directly or replace them with
> >>>> equivalent optimized forms based on its optimization strategies."
> >>>>
> >>>> Above all, do users want a solution which requires falling back on
> >>>> DWARF-based processing for precise stack tracing ?
> >>>
> >>> The key distinction is that compact unwind handles the vast majority
> >>> of functions without DWARF—the macOS measurements show __unwind_info
> >>> at 0.6% of __text size with __eh_frame reduced to negligible size
> >>> (0x58 bytes). While SFrame also cannot handle all frames, compact
> >>> unwind achieves dramatic size reductions by making DWARF the exception
> >>> rather than requiring it alongside a supplementary format.
> >>>
> >>
> >> As we have tried to reason, this is a misleading comparison. The compact
> >> unwind tables format:
> >>     - needs to be extended for asynchronous stack unwinding
> >>     - needs to be extended for other ABI/architectures
> >>     - Making it concatenable / linker-friendly will also likely impose
> >> some negative effects on size.
> >
> > The format supports i386, x86-64, aarch32, and aarch64. The OpenVMS
> > proposal demonstrates that supporting asynchronous unwinding is
> > straightforward.
> >
> > Making it linker-friendly does not impose negative effects on the
> > output section size.
> >
>
> OK, well, I agree to disagree :)
>
> Looking forward to some movement on the OpenVMS asynchronous unwind RFC
> to see resolution to some of the issues, and some data to back that claim.
>
> >>> The DWARF fallback provides flexibility for additional coverage when
> >>> needed, but nothing is lost (at least for the clang binary on macOS)
> >>> if DWARF fallback were disabled in a hypothetical future linux-perf
> >>> implementation.
> >>>
> >>
> >> Fair enough, thats something for linux-perf/kernel to decide.  Once the
> >> OpenVMS RFC is sufficiently shaped to become a viable replacement for
> >> .eh_frame, this question will be for the stakeholders to decide.
> >
> > Agreed. My concern is that .sframe is being deployed before we've
> > fully explored whether a more compact and efficient alternative is
> > achievable.
> >
> >
> >>>>> **The AArch64 case: size matters even more**
> >>>>>
> >>>>> The size consideration becomes even more critical for AArch64, which is
> >>>>> heavily deployed on mobile phones.
> >>>>> There's an active feature request for compact unwind support in the
> >>>>> AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
> >>>>> This underscores the broader industry need for efficient unwind
> >>>>> information that doesn't duplicate data or significantly increase binary
> >>>>> size.
> >>>>>
> >>>>
> >>>> Our measurements with a dataset of about 1400 userspace artifacts
> >>>> (binaries and shared libraries) show that the SFrame/(EH Frame + EH
> >>>> Frame HDR) ratio is:
> >>>>      - Average of 0.70 on AArch64.
> >>>>      - Average of 1.00 on x86_64.
> >>>>
> >>>> Projecting the size of what you observe for clang binary on x86_64 to
> >>>> conclude the size ratio on AArch64 is not very wise to do.
> >>>>
> >>>> Whether the size impact is worth the benefit: its a choice for users to
> >>>> make.  SFrame offers the users fast, precise stack traces with simple
> >>>> stack tracers.
> >>>
> >>> Thank you for providing the AArch64 measurements. Even with a 0.70x ratio on
> >>> AArch64, this represents substantial memory overhead when considering:
> >>>
> >>> .eh_frame is already large and being complained about.
> >>> Being unable to eliminate it (needed for debugging and C++ exceptions)
> >>> and adding 0.70x more means significant additional overhead for users.
> >>>
> >>>>> There are at least two formats the ELF one can learn from: LLVM's
> >>>>> compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
> >>>>>
> >>>>
> >>>> Please, if you have any concrete suggestions (keeping the above goals in
> >>>> mind), you already know how/where to engage.
> >>>
> >>> I've provided concrete suggestions throughout this discussion.
> >>>
> >>
> >> Apologies, I should have been more precise.  And I ask because you know
> >> the details about both SFrame and the variants of Compact Unwind
> >> Descriptor formats at this point :). If you have concrete suggestions to
> >> improve the SFrame format for size, please let us know.
> >
> > At this point, I'm not certain about specific modifications to .sframe
> > itself. I think we should start from scratch, drawing ideas from
> > compact unwind information and Windows ARM64.
> >
> > The existing compact unwind information uses the following 4-byte descriptor:
> >
> >    uint32_t mode_specific_encoding : 24; // vary with different modes
> >
> >    uint32_t mode : 4; // UNWIND_X86_64_MODE_MASK == UNWIND_ARM64_MODE_MASK
> >
> >    uint32_t has_lsda : 1;
> >    uint32_t personality_index : 2;
> >    uint32_t is_not_function_start : 1;
> >
>
> Thanks.
>
> SFrame is not for stack unwinding.  Subsuming .eh_frame is topic for
> another day.  SFrame does not intend to go that route.
>
> > We probably need a less-restricted version and account for different
> > architecture needs. The result would still be significantly smaller
> > than SFrame v2 and the future v3 (unless it's completely rewritten).
> >
> > We should probably design an optional two-level lookup table mechanism
> > for additional savings (at the cost of linker friendliness).
> >
> >>>>> **Path forward**
> >>>>>
> >>>>> Unless SFrame can actually replace .eh_frame (rather than supplementing
> >>>>> it as an accelerator for linux-perf) and demonstrate sizes smaller
> >>>>> than .eh_frame - matching the efficiency of existing compact unwind
> >>>>> approaches — I question its practical viability for userspace.
> >>>>> The current design appears to add overhead rather than reduce it.
> >>>>> This isn't to suggest we should simply adopt the existing compact unwind
> >>>>> format wholesale.
> >>>>> The x86-64 design dates back to 2009 or earlier, and there are likely
> >>>>> improvements we can make. However, we should aim for similar or better
> >>>>> efficiency gains.
> >>>>>
> >>>>> For additional context, I've documented my detailed analysis at:
> >>>>>
> >>>>> - https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering
> >>>>> mandatory index building problems, section group compliance and garbage
> >>>>> collection issues, and version compatibility challenges)
> >>>>
> >>>> GC issue is a bug currently tracked and with a target milestone of 2.46.
> >>>>> - https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs
> >>>>> (size analysis)
> >>>>>
> >
> > The GC issue would not have happened at all if we had used multiple
> > sections and thought about ELF and linker convention :)
>
> Thanks for engaging.


* Re: Concerns about SFrame viability for userspace stack walking
  2025-10-30  6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
                   ` (3 preceding siblings ...)
  2025-11-04  9:21 ` Indu
@ 2025-12-01  9:04 ` Fangrui Song
  4 siblings, 0 replies; 30+ messages in thread
From: Fangrui Song @ 2025-12-01  9:04 UTC (permalink / raw)
  To: linux-toolchains, linux-perf-users, linux-kernel


On 2025-10-29, Fangrui Song wrote:
>I've been following the SFrame discussion and wanted to share some concerns about its viability for userspace adoption, based on concrete measurements and comparison with existing compact unwind implementations in LLVM.
>
>**Size overhead concerns**
>
>Measurements on a x86-64 clang binary show that .sframe (8.87 MiB) is approximately 10% larger than the combined size of .eh_frame and .eh_frame_hdr (8.06 MiB total).
>This is problematic because .eh_frame cannot be eliminated - it contains essential information for restoring callee-saved registers, LSDA, and personality information needed for debugging (e.g. reading local variables in a coredump) and C++ exception handling.
>
>This means adopting SFrame would result in carrying both formats, with a large net size increase.
>
>**Learning from existing compact unwind implementations**
>
>It's worth noting that LLVM has had a battle-tested compact unwind format in production use since 2009 with OS X 10.6, which transitioned to using CFI directives in 2013 [1]. The efficiency gains are dramatic:
>
>  __text section: 0x4a55470 bytes
>  __unwind_info section: 0x79060 bytes (0.6% of __text)
>  __eh_frame section: 0x58 bytes
>
>  (On macOS you can check the section size with objdump --arch x86_64 -h clang and dump the unwind info with  objdump --arch x86_64 --unwind-info clang)
>
>OpenVMS's x86-64 port, which is ELF-based, also adopted this format as documented in their "VSI OpenVMS Calling Standard" and their 2018 post: https://discourse.llvm.org/t/rfc-asynchronous-unwind-tables-attribute/59282
>
>The compact unwind format achieves this efficiency through a two-level page table structure. It describes common frame layouts compactly and falls back to DWARF only when necessary, allowing most DWARF CFI entries to be eliminated while maintaining full functionality. For more details, see: https://faultlore.com/blah/compact-unwinding/ and the lld/MachO implemention https://github.com/llvm/llvm-project/blob/main/lld/MachO/UnwindInfoSection.cpp
>
>**The AArch64 case: size matters even more**
>
>The size consideration becomes even more critical for AArch64, which is heavily deployed on mobile phones.
>There's an active feature request for compact unwind support in the AArch64 ABI: https://github.com/ARM-software/abi-aa/issues/344
>This underscores the broader industry need for efficient unwind information that doesn't duplicate data or significantly increase binary size.
>
>There are at least two formats the ELF one can learn from: LLVM's compact unwind format (aarch64) and Windows ARM64 Frame Unwind Code.
>
>**Path forward**
>
>Unless SFrame can actually replace .eh_frame (rather than supplementing it as an accelerator for linux-perf) and demonstrate sizes smaller than .eh_frame - matching the efficiency of existing compact unwind approaches — I question its practical viability for userspace.
>The current design appears to add overhead rather than reduce it.
>This isn't to suggest we should simply adopt the existing compact unwind format wholesale.
>The x86-64 design dates back to 2009 or earlier, and there are likely improvements we can make. However, we should aim for similar or better efficiency gains.
>
>For additional context, I've documented my detailed analysis at:
>
>- https://maskray.me/blog/2025-09-28-remarks-on-sframe (covering mandatory index building problems, section group compliance and garbage collection issues, and version compatibility challenges)
>- https://maskray.me/blog/2025-10-26-stack-walking-space-and-time-trade-offs (size analysis)
>
>Best regards,
>Fangrui
>
>[1]: https://github.com/llvm/llvm-project/commit/58e2d3d856b7dc7b97a18cfa2aeeb927bc7e6bd5 ("Generate compact unwind encoding from CFI directives.")
>

tl;dr I believe a compact unwind scheme demonstrates significant promise over SFrame.
The MIPS compact exception tables as implemented in Binutils are also
worth considering (the structure can be shared among all architectures
while the unwind code has to be arch-specific).

I've ported the Mach-O compact unwind format to ELF in a branch, establishing a baseline for improvements to the compact unwind format.

```
% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{fp,sframe,compact,fp-gcc,sframe-gcc}/bin/{llvm-mc,opt}
Filename                               |       .text size |        EH size |   .sframe size |  VM size | VM increase
---------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp/bin/llvm-mc         |  2120895 (23.5%) |  301528 (3.3%) |       0 (0.0%) |  9043221 |           -
/tmp/out/custom-sframe/bin/llvm-mc     |  2109231 (22.3%) |  367424 (3.9%) |  348041 (3.7%) |  9474085 |       +4.8%
/tmp/out/custom-compact/bin/llvm-mc    |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 |       -4.5%

/tmp/out/custom-fp-gcc/bin/llvm-mc     |  2744214 (29.2%) |  301836 (3.2%) |       0 (0.0%) |  9389677 |       +3.8%
/tmp/out/custom-sframe-gcc/bin/llvm-mc |  2705860 (27.7%) |  354292 (3.6%) |  356073 (3.6%) |  9780985 |       +8.2%

/tmp/out/custom-fp/bin/opt             | 38769545 (69.9%) | 3547688 (6.4%) |       0 (0.0%) | 55425217 |           -
/tmp/out/custom-sframe/bin/opt         | 38891295 (62.4%) | 4559644 (7.3%) | 4448874 (7.1%) | 62292133 |      +12.4%
/tmp/out/custom-compact/bin/opt        | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 |       -6.1%
/tmp/out/custom-fp-gcc/bin/opt         | 54654215 (78.1%) | 3631196 (5.2%) |       0 (0.0%) | 70001373 |      +26.3%
/tmp/out/custom-sframe-gcc/bin/opt     | 53644895 (70.4%) | 4857364 (6.4%) | 5263676 (6.9%) | 76206149 |      +37.5%
```

**Evaluation results**

With the current implementation, 4937 out of 77648 FDEs (6.36%) require a DWARF escape, while the remaining FDEs can be replaced with unwind descriptors, yielding a huge size saving.

.eh_frame_hdr will become significantly smaller if we implement a two-level page table structure similar to Mach-O __unwind_info to deduplicate entries.
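
For readers who want to re-derive the percentages, here is a minimal Python sketch that recomputes the DWARF escape fraction and the "VM increase" column for the llvm-mc rows of the table above. The helper name `vm_increase` and the dictionary layout are mine; the numbers are copied from this message.

```python
# Fraction of FDEs that still need a DWARF escape in the compact build.
DWARF_ESCAPES = 4937
TOTAL_FDES = 77648
escape_fraction = DWARF_ESCAPES / TOTAL_FDES


def vm_increase(vm_size: int, baseline: int) -> float:
    """Percentage change in VM size relative to the baseline build."""
    return (vm_size - baseline) / baseline * 100


# VM sizes of llvm-mc from the table; custom-fp is the baseline.
LLVM_MC_VM = {
    "custom-fp": 9043221,
    "custom-sframe": 9474085,
    "custom-compact": 8639637,
}

print(f"DWARF escapes: {escape_fraction:.2%} of FDEs")
baseline = LLVM_MC_VM["custom-fp"]
for name, size in LLVM_MC_VM.items():
    print(f"{name:15s} {vm_increase(size, baseline):+.1f}%")
```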

**Build configurations**

```
#!/bin/zsh
conf() {
   configure-llvm $1 -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=bfd -pie -Wl,-z,pack-relative-relocs' \
     -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=bfd -DLLVM_ENABLE_UNWIND_TABLES=on -DLLVM_ENABLE_LLD=off ${@:2}
}

clang=(-DCMAKE_CXX_COMPILER=/tmp/Rel/bin/clang++ -DCMAKE_C_COMPILER=/tmp/Rel/bin/clang)
gcc=("-DCMAKE_C_COMPILER=$HOME/opt/gcc-15/bin/gcc" "-DCMAKE_CXX_COMPILER=$HOME/opt/gcc-15/bin/g++")

compact="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -mllvm -elf-compact-unwind -mllvm -x86-epilog-cfi=0"
fp="-fno-omit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe=no"
sframe="-fomit-frame-pointer -momit-leaf-frame-pointer -B$HOME/opt/binutils/bin -Wa,--gsframe"

conf custom-compact -DCMAKE_{C,CXX}_FLAGS="$compact" ${clang[@]} \
   -DCMAKE_EXE_LINKER_FLAGS='-fuse-ld=lld -pie -Wl,-z,pack-relative-relocs' \
   -DCMAKE_SHARED_LINKER_FLAGS=-fuse-ld=lld

conf custom-fp -DCMAKE_{C,CXX}_FLAGS="-fno-integrated-as $fp" ${clang[@]}
conf custom-sframe -DCMAKE_{C,CXX}_FLAGS="-fno-integrated-as $sframe" ${clang[@]}

conf custom-fp-gcc -DCMAKE_{C,CXX}_FLAGS="$fp" ${gcc[@]}
conf custom-sframe-gcc -DCMAKE_{C,CXX}_FLAGS="$sframe" ${gcc[@]}

for i in compact fp sframe  fp-gcc sframe-gcc; do ninja -C /tmp/out/custom-$i llvm-mc opt; done
```

The `/tmp/out/custom-compact` build uses my llvm-project branch
(<http://github.com/MaskRay/llvm-project/tree/demo-unwind>) that ports
Mach-O compact unwind to ELF, allowing the majority of `.eh_frame` FDEs
to replace CFI instructions with unwind descriptors.

-mllvm -x86-epilog-cfi=0: Disables epilogue CFI for x86 (primarily
implemented by D42848 in 2018, notably disabled for Darwin and Windows).
Without this option, most frames will not utilize unwind descriptors
because the current Mach-O compact unwind implementation does not
support popq %rbp; .cfi_def_cfa %rsp, 8; ret. I believe this is still
fair, as we expect to use an 8-byte descriptor, which is sufficient to
describe epilogue CFI.

If you still think custom-compact using -x86-epilog-cfi=0 is not entirely
fair to the other builds, this is the table using
-fno-asynchronous-unwind-tables -funwind-tables -mllvm -x86-epilog-cfi=0
for all builds:

% ~/Dev/object-file-size-analyzer/section_size.rb /tmp/out/custom-{fp-sync,sframe-sync,compact-sync}/bin/{llvm-mc,opt}
Filename                                 |       .text size |        EH size |   .sframe size |  VM size | VM increase
-----------------------------------------+------------------+----------------+----------------+----------+------------
/tmp/out/custom-fp-sync/bin/llvm-mc      |  2120895 (24.1%) |  263396 (3.0%) |       0 (0.0%) |  8802093 |           -
/tmp/out/custom-sframe-sync/bin/llvm-mc  |  2109231 (23.2%) |  291084 (3.2%) |  248654 (2.7%) |  9090325 |       +3.3%
/tmp/out/custom-compact-sync/bin/llvm-mc |  2109519 (24.4%) |  106288 (1.2%) |       0 (0.0%) |  8639637 |       -1.8%
/tmp/out/custom-fp-sync/bin/opt          | 38769545 (72.2%) | 2997572 (5.6%) |       0 (0.0%) | 53706041 |           -
/tmp/out/custom-sframe-sync/bin/opt      | 38891295 (66.9%) | 3425116 (5.9%) | 2951292 (5.1%) | 58091421 |       +8.2%
/tmp/out/custom-compact-sync/bin/opt     | 38898415 (74.8%) | 1200764 (2.3%) |       0 (0.0%) | 52020449 |       -3.1%

---

After implementing this, I investigated the MIPS compact exception
tables. I can now finalize the ‘in construction’ chapter of my
blog post,
https://maskray.me/blog/2020-11-08-stack-unwinding#mips-compact-exception-tables
Designed around 2015, it is actually a very good format.


Compiler output. The directive .cfi_sections .eh_frame_entry instructs
the assembler to emit index table entries to the .eh_frame_entry
section. .cfi_fde_data opcode1, ... between a pair of .cfi_startproc
and .cfi_endproc describes the frame unwind opcodes, where each opcode
takes one byte. The frame unwind opcodes describe the semantics of
prologue instructions, similar to Windows ARM64 Frame Unwind Codes.

Assembler processing. The assembler generates a .eh_frame_entry.* section for each section with compact unwind information.
Each .eh_frame_entry entry is a pair of 4-byte words, where the first word is like the first word in a .eh_frame_hdr entry.
An .eh_frame_entry entry takes one of three forms:

- Inline compact: (even pc, unwind_data). This form can be used when there are at most 3 opcodes (3 bytes) and no personality routine.
- Out-of-line compact: (odd pc, even unwind_ptr) where unwind_ptr points to unwind data in the .gnu_extab section.
- Legacy: (odd pc, odd legacy_unwind_ptr) where legacy_unwind_ptr points to the legacy .eh_frame section.
TODO: Describe .cfi_inline_lsda, which appears related to __gnu_compact_pr[1-3].
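
The three forms can be told apart from the low bits of the two words alone. Here is a minimal sketch of that dispatch; the names EntryForm, classify_entry and decode_entry are hypothetical, the byte order is assumed to be little-endian, and the sketch ignores how the words are actually encoded (for example as relative offsets), so it shows the even/odd test only.

```python
import struct
from enum import Enum


class EntryForm(Enum):
    INLINE_COMPACT = "inline compact"
    OUT_OF_LINE_COMPACT = "out-of-line compact"
    LEGACY = "legacy"


def decode_entry(raw: bytes) -> tuple[int, int]:
    """Split one 8-byte index entry into its two 4-byte words.

    Little-endian is an assumption made for this illustration.
    """
    first_word, second_word = struct.unpack("<II", raw)
    return first_word, second_word


def classify_entry(first_word: int, second_word: int) -> EntryForm:
    """Classify an entry by the low bit of each word."""
    if first_word & 1 == 0:
        # Even pc: the second word carries the unwind data inline.
        return EntryForm.INLINE_COMPACT
    if second_word & 1 == 0:
        # Odd pc, even pointer: unwind data lives in .gnu_extab.
        return EntryForm.OUT_OF_LINE_COMPACT
    # Odd pc, odd pointer: defer to a legacy .eh_frame FDE.
    return EntryForm.LEGACY
```

For example, classify_entry(*decode_entry(raw)) on an entry whose two words are both odd selects the legacy path, which is what lets a compact table coexist with ordinary .eh_frame FDEs.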

Linker processing. GNU ld concatenates .eh_frame_entry and .eh_frame_entry.* sections, sorting them by address.
The following internal linker script fragment adds a header before the entries:

    .eh_frame_hdr   : { *(.eh_frame_hdr) *(.eh_frame_entry .eh_frame_entry.*) }

Although the section name remains the traditional .eh_frame_hdr, the version is set to 2.
The linker also defines the symbol __GNU_EH_FRAME_HDR to hold the .eh_frame_hdr address.
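
To illustrate why sorting by address matters, the sketch below merges entries from several input sections and then finds the entry covering a given pc with a binary search. This is not GNU ld's code: merge_entries and find_entry are hypothetical names, the first word is treated as an absolute address with the form flag in its low bit (a simplification), and the lookup is an assumption modeled on how the traditional .eh_frame_hdr search table is used.

```python
from bisect import bisect_right


def merge_entries(*sections):
    """Concatenate per-section entry lists and sort them by start address.

    Each entry is a (first_word, second_word) pair; the low bit of
    first_word is a form flag, so it is masked off for ordering.
    """
    merged = [entry for section in sections for entry in section]
    merged.sort(key=lambda entry: entry[0] & ~1)
    return merged


def find_entry(table, pc):
    """Return the last entry whose start address is <= pc, or None."""
    starts = [first_word & ~1 for first_word, _ in table]
    index = bisect_right(starts, pc) - 1
    return table[index] if index >= 0 else None


# Two hypothetical input sections with made-up addresses.
section_a = [(0x3000, 0xAA), (0x1001, 0x40)]
section_b = [(0x2001, 0x51)]
table = merge_entries(section_a, section_b)
print(find_entry(table, 0x2004))
```

A real unwinder would also need to know where each function ends; the sketch only returns the nearest preceding entry.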

---

I've studied numerous stack unwinding/walking formats. DWARF CFI is
essential to achieve near 100% coverage. Other formats, such as compact
unwind formats and SFrame, have limitations. The ideal future solution,
as an alternative to frame pointer chains, will be a stack unwinding
format that supports C++ exceptions and can use DWARF CFI as a fallback.


Thread overview: 30+ messages
2025-10-30  6:53 Concerns about SFrame viability for userspace stack walking Fangrui Song
2025-10-30  7:30 ` Jakub Jelinek
2025-10-30  7:50   ` Fangrui Song
2025-10-30  8:05     ` Jakub Jelinek
2025-10-31  2:51       ` Fangrui Song
2025-10-30 10:26 ` Peter Zijlstra
2025-10-30 16:48   ` Fangrui Song
2025-10-30 17:03     ` Jose E. Marchesi
2025-10-31  4:22       ` Fangrui Song
2025-10-31 14:37         ` Jose E. Marchesi
2025-10-30 17:33     ` Steven Rostedt
2025-10-31  5:28       ` Fangrui Song
2025-10-30 18:22     ` Peter Zijlstra
2025-10-30 17:53   ` Andi Kleen
2025-10-30 18:07     ` Mark Brown
2025-10-30 18:31       ` Andi Kleen
2025-10-30 18:45         ` Mark Brown
2025-10-31  8:24           ` Fangrui Song
2025-10-30 18:57         ` Peter Zijlstra
2025-10-31 11:46           ` Mark Brown
2025-10-30 14:47 ` Jose E. Marchesi
2025-11-04  9:21 ` Indu
2025-11-05  8:21   ` Fangrui Song
2025-11-06  0:44     ` Indu Bhagat
2025-11-06  7:51       ` Florian Weimer
2025-11-06 21:02         ` Indu Bhagat
2025-11-06  9:20       ` Fangrui Song
2025-11-06 20:42         ` Indu Bhagat
2025-11-09  0:23           ` Fangrui Song
2025-12-01  9:04 ` Fangrui Song
